Prompt techniques
Prompts are important because they describe what you want a diffusion model to generate. The best prompts are detailed, specific, and well-structured to help the model realize your vision. But crafting a great prompt takes time and effort, and sometimes it may not be enough because language and words can be imprecise. This is where you need to boost your prompt with other techniques, such as prompt enhancing and prompt weighting, to get the results you want.
This guide will show you how you can use these prompt techniques to generate high-quality images with less effort and adjust the weight of certain keywords in a prompt.
Prompt engineering
This is not an exhaustive guide on prompt engineering, but it will help you understand the necessary parts of a good prompt. We encourage you to continue experimenting with different prompts and combine them in new ways to see what works best. As you write more prompts, you’ll develop an intuition for what works and what doesn’t!
New diffusion models do a pretty good job of generating high-quality images from a basic prompt, but it is still important to create a well-written prompt to get the best results. Here are a few tips for writing a good prompt (a short example of combining them follows the list):
- What is the image medium? Is it a photo, a painting, a 3D illustration, or something else?
- What is the image subject? Is it a person, animal, object, or scene?
- What details would you like to see in the image? This is where you can get really creative and have a lot of fun experimenting with different words to bring your image to life. For example, what is the lighting like? What is the vibe and aesthetic? What kind of art or illustration style are you looking for? The more specific and precise words you use, the better the model will understand what you want to generate.
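For example, here is how those pieces could be assembled into a single prompt. This is a hypothetical sketch (the medium, subject, and detail values are illustrative, borrowing the cinematic phrasing used later in this guide):

medium = "cinematic film still"
subject = "a cat basking in the sun on a roof in Turkey"
details = "golden hour lighting, moody, highly detailed, film grain"
prompt = f"{medium} of {subject}, {details}"
# "cinematic film still of a cat basking in the sun on a roof in Turkey, golden hour lighting, moody, highly detailed, film grain"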


Prompt enhancing with GPT2
Prompt enhancing is a technique for quickly improving prompt quality without spending too much effort constructing one. It uses a model like GPT2 pretrained on Stable Diffusion text prompts to automatically enrich a prompt with additional important keywords to generate high-quality images.
The technique works by curating a list of specific keywords and forcing the model to generate those words to enhance the original prompt. This way, your prompt can be “a cat” and GPT2 can enhance the prompt to “cinematic film still of a cat basking in the sun on a roof in Turkey, highly detailed, high budget hollywood movie, cinemascope, moody, epic, gorgeous, film grain quality sharp focus beautiful detailed intricate stunning amazing epic”.
You should also use an offset noise LoRA to improve the contrast in bright and dark images and create better lighting overall. This LoRA is available from stabilityai/stable-diffusion-xl-base-1.0.
Start by defining certain styles and a list of words (you can check out a more comprehensive list of words and styles used by Fooocus) to enhance a prompt with.
import torch
from transformers import GenerationConfig, GPT2LMHeadModel, GPT2Tokenizer, LogitsProcessor, LogitsProcessorList
from diffusers import StableDiffusionXLPipeline

styles = {
    "cinematic": "cinematic film still of {prompt}, highly detailed, high budget hollywood movie, cinemascope, moody, epic, gorgeous, film grain",
    "anime": "anime artwork of {prompt}, anime style, key visual, vibrant, studio anime, highly detailed",
    "photographic": "cinematic photo of {prompt}, 35mm photograph, film, professional, 4k, highly detailed",
    "comic": "comic of {prompt}, graphic illustration, comic art, graphic novel art, vibrant, highly detailed",
    "lineart": "line art drawing {prompt}, professional, sleek, modern, minimalist, graphic, line art, vector graphics",
    "pixelart": "pixel-art {prompt}, low-res, blocky, pixel art style, 8-bit graphics",
}

words = [
    "aesthetic", "astonishing", "beautiful", "breathtaking", "composition", "contrasted", "epic", "moody", "enhanced",
    "exceptional", "fascinating", "flawless", "glamorous", "glorious", "illumination", "impressive", "improved",
    "inspirational", "magnificent", "majestic", "hyperrealistic", "smooth", "sharp", "focus", "stunning", "detailed",
    "intricate", "dramatic", "high", "quality", "perfect", "light", "ultra", "highly", "radiant", "satisfying",
    "soothing", "sophisticated", "stylish", "sublime", "terrific", "touching", "timeless", "wonderful", "unbelievable",
    "elegant", "awesome", "amazing", "dynamic", "trendy",
]
You may have noticed in the words list, there are certain words that can be paired together to create something more meaningful. For example, the words “high” and “quality” can be combined to create “high quality”. Let’s pair these words together and remove the words that can’t be paired.
word_pairs = ["highly detailed", "high quality", "enhanced quality", "perfect composition", "dynamic light"]

def find_and_order_pairs(s, pairs):
    words = s.split()
    found_pairs = []
    for pair in pairs:
        pair_words = pair.split()
        if pair_words[0] in words and pair_words[1] in words:
            found_pairs.append(pair)
            words.remove(pair_words[0])
            words.remove(pair_words[1])
    for word in words[:]:
        for pair in pairs:
            if word in pair.split():
                words.remove(word)
                break
    ordered_pairs = ", ".join(found_pairs)
    remaining_s = ", ".join(words)
    return ordered_pairs, remaining_s
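As a quick sanity check, here is what the function returns on a hypothetical input string:

# hypothetical input, just to illustrate the two return values
pairs, remaining = find_and_order_pairs("beautiful highly detailed high quality", word_pairs)
pairs      # "highly detailed, high quality"
remaining  # "beautiful"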
Next, implement a custom LogitsProcessor class that assigns tokens in the words list a value of 0 and assigns tokens not in the words list a negative value so they aren’t picked during generation. This way, generation is biased towards words in the words list. After a word from the list is used, it is also assigned a negative value so it isn’t picked again.
class CustomLogitsProcessor(LogitsProcessor):
    def __init__(self, bias):
        super().__init__()
        self.bias = bias

    def __call__(self, input_ids, scores):
        if len(input_ids.shape) == 2:
            last_token_id = input_ids[0, -1]
            # penalize the token that was just generated so it isn't picked again
            self.bias[last_token_id] = -1e10
        return scores + self.bias

# tokenizer is the GPT2 tokenizer loaded a few steps below
word_ids = [tokenizer.encode(word, add_prefix_space=True)[0] for word in words]
bias = torch.full((tokenizer.vocab_size,), -float("Inf")).to("cuda")
bias[word_ids] = 0
processor = CustomLogitsProcessor(bias)
processor_list = LogitsProcessorList([processor])
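To see why the additive bias works, note that a score of negative infinity becomes a probability of exactly 0 after softmax, so only tokens with a bias of 0 can ever be sampled. A minimal standalone illustration (not part of the pipeline above):

import torch

scores = torch.tensor([2.0, 1.0, 0.5])
bias = torch.tensor([0.0, -float("inf"), 0.0])  # disallow the second token
probs = torch.softmax(scores + bias, dim=0)
# probs[1] is 0.0; the probability mass is redistributed over the allowed tokens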
Combine the prompt and the cinematic style prompt defined in the styles dictionary earlier.
prompt = "a cat basking in the sun on a roof in Turkey"
style = "cinematic"
prompt = styles[style].format(prompt=prompt)
prompt
"cinematic film still of a cat basking in the sun on a roof in Turkey, highly detailed, high budget hollywood movie, cinemascope, moody, epic, gorgeous, film grain"
Load a GPT2 tokenizer and model from the Gustavosta/MagicPrompt-Stable-Diffusion checkpoint (this specific checkpoint is trained to generate prompts) to enhance the prompt.
tokenizer = GPT2Tokenizer.from_pretrained("Gustavosta/MagicPrompt-Stable-Diffusion")
model = GPT2LMHeadModel.from_pretrained("Gustavosta/MagicPrompt-Stable-Diffusion", torch_dtype=torch.float16).to(
    "cuda"
)
model.eval()
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
token_count = inputs["input_ids"].shape[1]
max_new_tokens = 50 - token_count  # cap the enhanced prompt at ~50 tokens total
generation_config = GenerationConfig(
    penalty_alpha=0.7,
    top_k=50,
    eos_token_id=model.config.eos_token_id,
    pad_token_id=model.config.eos_token_id,
    do_sample=True,
)
with torch.no_grad():
    generated_ids = model.generate(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        max_new_tokens=max_new_tokens,
        generation_config=generation_config,
        logits_processor=processor_list,
    )
Then you can combine the input prompt and the generated prompt. Feel free to take a look at what the generated prompt (generated_part) is, the word pairs that were found (pairs), and the remaining words (words). This is all packed together in the enhanced_prompt.
output_tokens = [tokenizer.decode(generated_id, skip_special_tokens=True) for generated_id in generated_ids]
input_part, generated_part = output_tokens[0][: len(prompt)], output_tokens[0][len(prompt) :]
pairs, words = find_and_order_pairs(generated_part, word_pairs)
formatted_generated_part = pairs + ", " + words
enhanced_prompt = input_part + ", " + formatted_generated_part
enhanced_prompt
["cinematic film still of a cat basking in the sun on a roof in Turkey, highly detailed, high budget hollywood movie, cinemascope, moody, epic, gorgeous, film grain quality sharp focus beautiful detailed intricate stunning amazing epic"]
Finally, load a pipeline and the offset noise LoRA with a low weight to generate an image with the enhanced prompt.
pipeline = StableDiffusionXLPipeline.from_pretrained(
    "RunDiffusion/Juggernaut-XL-v9", torch_dtype=torch.float16, variant="fp16"
).to("cuda")

pipeline.load_lora_weights(
    "stabilityai/stable-diffusion-xl-base-1.0",
    weight_name="sd_xl_offset_example-lora_1.0.safetensors",
    adapter_name="offset",
)
pipeline.set_adapters(["offset"], adapter_weights=[0.2])

image = pipeline(
    enhanced_prompt,
    width=1152,
    height=896,
    guidance_scale=7.5,
    num_inference_steps=25,
).images[0]
image


Prompt weighting
Prompt weighting provides a way to emphasize or de-emphasize certain parts of a prompt, allowing for more control over the generated image. A prompt can include several concepts, which get turned into contextualized text embeddings. The embeddings are used by the model to condition its cross-attention layers to generate an image (read the Stable Diffusion blog post to learn more about how it works).
Prompt weighting works by increasing or decreasing the scale of the text embedding vector that corresponds to its concept in the prompt because you may not necessarily want the model to focus on all concepts equally. The easiest way to prepare the prompt-weighted embeddings is to use Compel, a text prompt-weighting and blending library. Once you have the prompt-weighted embeddings, you can pass them to any pipeline that has a prompt_embeds (and optionally negative_prompt_embeds) parameter, such as StableDiffusionPipeline, StableDiffusionControlNetPipeline, and StableDiffusionXLPipeline.
If your favorite pipeline doesn’t have a prompt_embeds parameter, please open an issue so we can add it!
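To build some intuition for what “scaling the text embedding” means, here is a rough conceptual sketch of manual prompt weighting without Compel. This is not how Compel is implemented; it simply scales the embedding at the position of one token by a hypothetical factor of 1.5:

import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4", use_safetensors=True).to("cuda")
text_inputs = pipe.tokenizer(
    "a red cat playing with a ball",
    padding="max_length",
    max_length=pipe.tokenizer.model_max_length,
    return_tensors="pt",
).to("cuda")
with torch.no_grad():
    prompt_embeds = pipe.text_encoder(text_inputs.input_ids)[0]
# upweight the embedding at the position of the "ball" token
ball_id = pipe.tokenizer.encode("ball", add_special_tokens=False)[0]
prompt_embeds[:, text_inputs.input_ids[0] == ball_id] *= 1.5
image = pipe(prompt_embeds=prompt_embeds).images[0]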
This guide will show you how to weight and blend your prompts with Compel in 🤗 Diffusers.
Before you begin, make sure you have the latest version of Compel installed:
# uncomment to install in Colab
#!pip install compel --upgrade
For this guide, let’s generate an image with the prompt "a red cat playing with a ball" using the StableDiffusionPipeline:
from diffusers import StableDiffusionPipeline, UniPCMultistepScheduler
import torch
pipe = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4", use_safetensors=True)
pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config)
pipe.to("cuda")
prompt = "a red cat playing with a ball"
generator = torch.Generator(device="cpu").manual_seed(33)
image = pipe(prompt, generator=generator, num_inference_steps=20).images[0]
image

Weighting
You’ll notice there is no “ball” in the image! Let’s use compel to upweight the concept of “ball” in the prompt. Create a Compel object, and pass it a tokenizer and text encoder:
from compel import Compel
compel_proc = Compel(tokenizer=pipe.tokenizer, text_encoder=pipe.text_encoder)
compel uses + or - to increase or decrease the weight of a word in the prompt. To increase the weight of “ball”: + corresponds to the value 1.1, ++ corresponds to 1.1^2 (about 1.21), and so on. Similarly, - corresponds to 0.9 and -- corresponds to 0.9^2. Feel free to experiment with adding more + or - in your prompt!
prompt = "a red cat playing with a ball++"
Pass the prompt to compel_proc to create the new prompt embeddings which are passed to the pipeline:
prompt_embeds = compel_proc(prompt)
generator = torch.manual_seed(33)
image = pipe(prompt_embeds=prompt_embeds, generator=generator, num_inference_steps=20).images[0]
image
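Compel also supports explicit numeric weights in parentheses instead of stacking + or - characters; the same syntax appears in the Stable Diffusion XL example later in this guide:

prompt = "a red cat playing with a (ball)1.5"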

To downweight parts of the prompt, use the - suffix:
prompt = "a red------- cat playing with a ball"
prompt_embeds = compel_proc(prompt)
generator = torch.manual_seed(33)
image = pipe(prompt_embeds=prompt_embeds, generator=generator, num_inference_steps=20).images[0]
image

You can even up or downweight multiple concepts in the same prompt:
prompt = "a red cat++ playing with a ball----"
prompt_embeds = compel_proc(prompt)
generator = torch.manual_seed(33)
image = pipe(prompt_embeds=prompt_embeds, generator=generator, num_inference_steps=20).images[0]
image

Blending
You can also create a weighted blend of prompts by adding .blend() to a list of prompts and passing them some weights. Your blend may not always produce the result you expect because it breaks some assumptions about how the text encoder functions, so just have fun and experiment with it!
prompt_embeds = compel_proc('("a red cat playing with a ball", "jungle").blend(0.7, 0.8)')
generator = torch.Generator(device="cuda").manual_seed(33)
image = pipe(prompt_embeds=prompt_embeds, generator=generator, num_inference_steps=20).images[0]
image

Conjunction
A conjunction diffuses each prompt independently and concatenates their results by their weighted sum. Add .and() to the end of a list of prompts to create a conjunction:
prompt_embeds = compel_proc('["a red cat", "playing with a", "ball"].and()')
generator = torch.Generator(device="cuda").manual_seed(55)
image = pipe(prompt_embeds=prompt_embeds, generator=generator, num_inference_steps=20).images[0]
image

Textual inversion
Textual inversion is a technique for learning a specific concept from some images, which you can use to generate new images conditioned on that concept.
Create a pipeline and use the load_textual_inversion() function to load the textual inversion embeddings (feel free to browse the Stable Diffusion Conceptualizer for 100+ trained concepts):
import torch
from diffusers import StableDiffusionPipeline
from compel import Compel, DiffusersTextualInversionManager
pipe = StableDiffusionPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5", torch_dtype=torch.float16,
    use_safetensors=True, variant="fp16").to("cuda")
pipe.load_textual_inversion("sd-concepts-library/midjourney-style")
Compel provides a DiffusersTextualInversionManager class to simplify prompt weighting with textual inversion. Instantiate DiffusersTextualInversionManager and pass it to the Compel class:
textual_inversion_manager = DiffusersTextualInversionManager(pipe)
compel_proc = Compel(
    tokenizer=pipe.tokenizer,
    text_encoder=pipe.text_encoder,
    textual_inversion_manager=textual_inversion_manager)
Incorporate the concept into the prompt conditioning with the <concept> syntax:
prompt_embeds = compel_proc('("A red cat++ playing with a ball <midjourney-style>")')
image = pipe(prompt_embeds=prompt_embeds).images[0]
image

DreamBooth
DreamBooth is a technique for generating contextualized images of a subject given just a few images of the subject to train on. It is similar to textual inversion, but DreamBooth trains the full model whereas textual inversion only fine-tunes the text embeddings. This means you should use from_pretrained() to load the DreamBooth model (feel free to browse the Stable Diffusion Dreambooth Concepts Library for 100+ trained models):
import torch
from diffusers import DiffusionPipeline, UniPCMultistepScheduler
from compel import Compel
pipe = DiffusionPipeline.from_pretrained("sd-dreambooth-library/dndcoverart-v1", torch_dtype=torch.float16).to("cuda")
pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config)
Create a Compel class with a tokenizer and text encoder, and pass your prompt to it. Depending on the model you use, you’ll need to incorporate the model’s unique identifier into your prompt. For example, the dndcoverart-v1 model uses the identifier dndcoverart:
compel_proc = Compel(tokenizer=pipe.tokenizer, text_encoder=pipe.text_encoder)
prompt_embeds = compel_proc('("magazine cover of a dndcoverart dragon, high quality, intricate details, larry elmore art style").and()')
image = pipe(prompt_embeds=prompt_embeds).images[0]
image

Stable Diffusion XL
Stable Diffusion XL (SDXL) has two tokenizers and text encoders so its usage is a bit different. To address this, you should pass both tokenizers and encoders to the Compel class:
from compel import Compel, ReturnedEmbeddingsType
from diffusers import DiffusionPipeline
from diffusers.utils import make_image_grid
import torch
pipeline = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    variant="fp16",
    use_safetensors=True,
    torch_dtype=torch.float16
).to("cuda")

compel = Compel(
    tokenizer=[pipeline.tokenizer, pipeline.tokenizer_2],
    text_encoder=[pipeline.text_encoder, pipeline.text_encoder_2],
    returned_embeddings_type=ReturnedEmbeddingsType.PENULTIMATE_HIDDEN_STATES_NON_NORMALIZED,
    requires_pooled=[False, True]
)
This time, let’s upweight “ball” by a factor of 1.5 for the first prompt and downweight “ball” by 0.6 for the second prompt. The StableDiffusionXLPipeline also requires pooled_prompt_embeds (and optionally negative_pooled_prompt_embeds), so you should pass those to the pipeline along with the conditioning tensors:
# apply weights
prompt = ["a red cat playing with a (ball)1.5", "a red cat playing with a (ball)0.6"]
conditioning, pooled = compel(prompt)
# generate image
generator = [torch.Generator().manual_seed(33) for _ in range(len(prompt))]
images = pipeline(prompt_embeds=conditioning, pooled_prompt_embeds=pooled, generator=generator, num_inference_steps=30).images
make_image_grid(images, rows=1, cols=2)

