Prompt techniques
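Prompts matter because they describe what you want a diffusion model to generate. The best prompts are detailed, specific, and well structured, which helps the model realize your vision. But crafting a good prompt takes time and effort, and sometimes even that is not enough because language and words can be imprecise. This is where you need to boost your prompt with other techniques, such as prompt enhancing and prompt weighting, to get the results you want.
This guide will show you how to use these prompt techniques to generate high-quality images with less effort and to adjust the weight of certain keywords in a prompt.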
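Prompt engineering
This is not an exhaustive guide to prompt engineering, but it will help you understand the necessary parts of a good prompt. We encourage you to keep experimenting with different prompts and combining them in new ways to see what works best. As you write more prompts, you will develop an intuition for what works and what doesn't!
New diffusion models do a pretty good job of generating high-quality images from a basic prompt, but a well-written prompt is still important for getting the best results. Here are a few tips for writing a good prompt:
- What is the image _medium_? Is it a photo, a painting, a 3D illustration, or something else?
- What is the image _subject_? Is it a person, an animal, an object, or a scene?
- What _details_ would you like to see in the image? This is where you can get creative and try different words to bring your image to life. For example, what is the lighting like? What are the mood and aesthetic? What kind of art or illustration style are you looking for? The more specific and precise your words are, the better the model can understand what you want to generate.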
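The three questions above can be turned into a small prompt template. The following is a minimal sketch, not part of Diffusers: `build_prompt` is a hypothetical helper, and the example keywords are made up for illustration.

```python
def build_prompt(medium: str, subject: str, details: list[str]) -> str:
    """Join medium, subject, and detail keywords into one comma-separated prompt."""
    head = f"{medium} of {subject}"
    # Skip empty or whitespace-only detail entries.
    parts = [head] + [d.strip() for d in details if d.strip()]
    return ", ".join(parts)


prompt = build_prompt(
    medium="cinematic film still",
    subject="a cat basking in the sun on a roof in Turkey",
    details=["golden hour lighting", "moody", "film grain"],
)
print(prompt)
```

Keeping the medium, subject, and details as separate arguments makes it easy to swap one of them while holding the others fixed when you compare results.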


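Prompt enhancing with GPT2
Prompt enhancing is a technique for quickly improving prompt quality without spending much effort on constructing one. It uses a model such as GPT2, pretrained on Stable Diffusion text prompts, to automatically enrich a prompt with additional important keywords so that it generates high-quality images.
The technique works by curating a list of specific keywords and forcing the model to generate those words to enhance the original prompt. This way, your prompt can be "a cat" and GPT2 can enhance it to "cinematic film still of a cat basking in the sun on a roof in Turkey, highly detailed, high budget hollywood movie, cinemascope, moody, epic, gorgeous, film grain quality sharp focus beautiful detailed intricate stunning amazing epic".
You should also use an offset noise LoRA to improve the contrast in bright and dark images and create better lighting overall. This LoRA is available from stabilityai/stable-diffusion-xl-base-1.0.
Start by defining certain styles and a list of words (you can check out the more comprehensive list of words and styles used by Fooocus) to enhance a prompt with.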
import torch
from transformers import GenerationConfig, GPT2LMHeadModel, GPT2Tokenizer, LogitsProcessor, LogitsProcessorList
from diffusers import StableDiffusionXLPipeline
styles = {
    "cinematic": "cinematic film still of {prompt}, highly detailed, high budget hollywood movie, cinemascope, moody, epic, gorgeous, film grain",
    "anime": "anime artwork of {prompt}, anime style, key visual, vibrant, studio anime, highly detailed",
    "photographic": "cinematic photo of {prompt}, 35mm photograph, film, professional, 4k, highly detailed",
    "comic": "comic of {prompt}, graphic illustration, comic art, graphic novel art, vibrant, highly detailed",
    "lineart": "line art drawing {prompt}, professional, sleek, modern, minimalist, graphic, line art, vector graphics",
    "pixelart": " pixel-art {prompt}, low-res, blocky, pixel art style, 8-bit graphics",
}
words = [
    "aesthetic", "astonishing", "beautiful", "breathtaking", "composition", "contrasted", "epic", "moody", "enhanced",
    "exceptional", "fascinating", "flawless", "glamorous", "glorious", "illumination", "impressive", "improved",
    "inspirational", "magnificent", "majestic", "hyperrealistic", "smooth", "sharp", "focus", "stunning", "detailed",
    "intricate", "dramatic", "high", "quality", "perfect", "light", "ultra", "highly", "radiant", "satisfying",
    "soothing", "sophisticated", "stylish", "sublime", "terrific", "touching", "timeless", "wonderful", "unbelievable",
    "elegant", "awesome", "amazing", "dynamic", "trendy",
]
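You may have noticed that some words in the `words` list can be paired together to create something more meaningful. For example, "high" and "quality" can be combined into "high quality". Let's pair these words together and remove the words that can't be paired.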
word_pairs = ["highly detailed", "high quality", "enhanced quality", "perfect composition", "dynamic light"]

def find_and_order_pairs(s, pairs):
    words = s.split()
    found_pairs = []
    # Collect every pair whose two words both appear, and consume those words.
    for pair in pairs:
        pair_words = pair.split()
        if pair_words[0] in words and pair_words[1] in words:
            found_pairs.append(pair)
            words.remove(pair_words[0])
            words.remove(pair_words[1])

    # Drop leftover words that belong to a pair but whose partner was not found.
    for word in words[:]:
        for pair in pairs:
            if word in pair.split():
                words.remove(word)
                break
    ordered_pairs = ", ".join(found_pairs)
    remaining_s = ", ".join(words)
    return ordered_pairs, remaining_s
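To see what the pairing step returns, here is a self-contained sketch that applies a condensed copy of the function to a made-up string. The sample input is an assumption for illustration; real input would be the text generated by GPT2.

```python
WORD_PAIRS = ["highly detailed", "high quality", "enhanced quality", "perfect composition", "dynamic light"]


def find_and_order_pairs(s, pairs):
    words = s.split()
    found_pairs = []
    for pair in pairs:
        first, second = pair.split()
        if first in words and second in words:
            found_pairs.append(pair)
            words.remove(first)
            words.remove(second)
    # Drop leftover halves of pairs whose partner never appeared.
    pair_vocab = {w for pair in pairs for w in pair.split()}
    words = [w for w in words if w not in pair_vocab]
    return ", ".join(found_pairs), ", ".join(words)


# Made-up sample: "light" has no "dynamic" partner, so it is dropped.
sample = "quality sharp focus beautiful detailed highly stunning high light"
pairs, rest = find_and_order_pairs(sample, WORD_PAIRS)
print(pairs)  # highly detailed, high quality
print(rest)   # sharp, focus, beautiful, stunning
```

Pairs come back in the order they are listed in `WORD_PAIRS`, not in the order the words appeared in the input.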
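Next, implement a custom LogitsProcessor class that assigns tokens in the `words` list a value of 0 and assigns tokens not in the `words` list a negative value so they aren't picked during generation. This way, generation is biased towards words in the `words` list. After a word from the list is used, it is also assigned a negative value so it isn't picked again.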
class CustomLogitsProcessor(LogitsProcessor):
    def __init__(self, bias):
        super().__init__()
        self.bias = bias

    def __call__(self, input_ids, scores):
        if len(input_ids.shape) == 2:
            # Suppress the most recently generated token so it isn't picked again.
            last_token_id = input_ids[0, -1]
            self.bias[last_token_id] = -1e10
        return scores + self.bias

# `tokenizer` is the GPT2 tokenizer from the model-loading step; load it before running these lines.
word_ids = [tokenizer.encode(word, add_prefix_space=True)[0] for word in words]
# Tokens outside `words` get -inf so they are never selected; tokens in `words` get 0.
bias = torch.full((tokenizer.vocab_size,), -float("Inf")).to("cuda")
bias[word_ids] = 0
processor = CustomLogitsProcessor(bias)
processor_list = LogitsProcessorList([processor])
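The biasing idea can be followed without a GPU or a model. The sketch below is a dependency-free illustration that uses plain Python lists instead of tensors and greedy selection instead of sampling. The toy vocabulary, scores, and the `apply_bias` helper are all made up for this example.

```python
import math


def apply_bias(scores, bias):
    """Add a per-token bias to raw scores, element by element."""
    return [s + b for s, b in zip(scores, bias)]


vocab = ["the", "epic", "cat", "moody", "sharp"]  # toy vocabulary (hypothetical)
allowed = {"epic", "moody", "sharp"}              # plays the role of `words`

# 0 for allowed tokens, -inf for everything else.
bias = [0.0 if tok in allowed else -math.inf for tok in vocab]
scores = [5.0, 1.0, 4.0, 2.0, 3.0]  # made-up model scores, kept fixed for simplicity

picked = []
for _ in range(3):
    biased = apply_bias(scores, bias)
    best = max(range(len(vocab)), key=lambda i: biased[i])
    picked.append(vocab[best])
    bias[best] = -1e10  # suppress the token that was just used

print(picked)  # ['sharp', 'moody', 'epic']
```

Even though "the" and "cat" have the highest raw scores, they are never chosen, and each allowed word is chosen at most once.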
Combine the prompt with the `cinematic` style prompt defined in the `styles` dictionary earlier.
prompt = "a cat basking in the sun on a roof in Turkey"
style = "cinematic"
prompt = styles[style].format(prompt=prompt)
prompt
"cinematic film still of a cat basking in the sun on a roof in Turkey, highly detailed, high budget hollywood movie, cinemascope, moody, epic, gorgeous, film grain"
Load a GPT2 tokenizer and model from the Gustavosta/MagicPrompt-Stable-Diffusion checkpoint (this specific checkpoint is trained to generate prompts) to enhance the prompt.
tokenizer = GPT2Tokenizer.from_pretrained("Gustavosta/MagicPrompt-Stable-Diffusion")
model = GPT2LMHeadModel.from_pretrained("Gustavosta/MagicPrompt-Stable-Diffusion", torch_dtype=torch.float16).to(
    "cuda"
)
model.eval()

inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
token_count = inputs["input_ids"].shape[1]
# Cap the combined prompt (input plus generated tokens) at 50 tokens.
max_new_tokens = 50 - token_count

generation_config = GenerationConfig(
    penalty_alpha=0.7,
    top_k=50,
    eos_token_id=model.config.eos_token_id,
    pad_token_id=model.config.eos_token_id,
    do_sample=True,
)

with torch.no_grad():
    generated_ids = model.generate(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        max_new_tokens=max_new_tokens,
        generation_config=generation_config,
        logits_processor=processor_list,
    )
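The budget for new tokens above is 50 minus the number of prompt tokens, which can reach zero or go negative when the styled prompt is long. A minimal sketch of a guard follows; `new_token_budget` is a hypothetical helper and the default of 50 simply mirrors the number used above.

```python
def new_token_budget(prompt_token_count: int, total_budget: int = 50) -> int:
    """Return how many tokens may still be generated, never less than zero."""
    return max(0, total_budget - prompt_token_count)


print(new_token_budget(37))  # 13
print(new_token_budget(60))  # 0
```

When the budget is 0, one option is to skip the enhancement step and use the styled prompt as-is.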
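You can then combine the input prompt and the generated prompt. Feel free to take a look at the generated prompt (`generated_part`), the word pairs that were found (`pairs`), and the remaining words (`words`). All of these are packed together in `enhanced_prompt`.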
output_tokens = [tokenizer.decode(generated_id, skip_special_tokens=True) for generated_id in generated_ids]
input_part, generated_part = output_tokens[0][: len(prompt)], output_tokens[0][len(prompt) :]
pairs, words = find_and_order_pairs(generated_part, word_pairs)
formatted_generated_part = pairs + ", " + words
enhanced_prompt = input_part + ", " + formatted_generated_part
enhanced_prompt
["cinematic film still of a cat basking in the sun on a roof in Turkey, highly detailed, high budget hollywood movie, cinemascope, moody, epic, gorgeous, film grain quality sharp focus beautiful detailed intricate stunning amazing epic"]
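When no word pairs are found, `pairs` is an empty string, and joining the parts with a fixed `", "` leaves an empty segment in the result. A minimal sketch of a more forgiving join is shown below; `join_nonempty` is a hypothetical helper and the strings are made up.

```python
def join_nonempty(*parts: str, sep: str = ", ") -> str:
    """Join the given strings with sep, skipping empty or whitespace-only parts."""
    cleaned = [p.strip() for p in parts if p and p.strip()]
    return sep.join(cleaned)


# No pairs were found in this made-up case, so the middle part is empty.
enhanced = join_nonempty("a cat, film grain", "", "sharp, focus")
print(enhanced)  # a cat, film grain, sharp, focus
```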
Finally, load a pipeline and the offset noise LoRA with a _low weight_ to generate an image with the enhanced prompt.
pipeline = StableDiffusionXLPipeline.from_pretrained(
    "RunDiffusion/Juggernaut-XL-v9", torch_dtype=torch.float16, variant="fp16"
).to("cuda")
pipeline.load_lora_weights(
    "stabilityai/stable-diffusion-xl-base-1.0",
    weight_name="sd_xl_offset_example-lora_1.0.safetensors",
    adapter_name="offset",
)
pipeline.set_adapters(["offset"], adapter_weights=[0.2])
image = pipeline(
    enhanced_prompt,
    width=1152,
    height=896,
    guidance_scale=7.5,
    num_inference_steps=25,
).images[0]
image


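Prompt weighting
Prompt weighting provides a way to emphasize or de-emphasize certain parts of a prompt, giving you more control over the generated image. A prompt can include several concepts, which get turned into contextualized text embeddings. The model uses these embeddings to condition its cross-attention layers to generate an image (read the Stable Diffusion blog post to learn more about how it works).
Prompt weighting works by increasing or decreasing the scale of the text embedding vector that corresponds to a concept in the prompt, because you may not necessarily want the model to focus on all concepts equally. The easiest way to prepare the prompt embeddings is to use Stable Diffusion Long Prompt Weighted Embedding (sd_embed). Once you have the prompt-weighted embeddings, you can pass them to any pipeline that has a prompt_embeds (and optionally negative_prompt_embeds) parameter, such as StableDiffusionPipeline, StableDiffusionControlNetPipeline, and StableDiffusionXLPipeline.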
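The scaling idea can be illustrated with plain lists. This is only a conceptual sketch under simplifying assumptions: the two-dimensional embeddings are made-up numbers, `scale_embeddings` is a hypothetical helper, and real libraries such as sd_embed operate on tensors and may apply additional steps.

```python
def scale_embeddings(token_embeddings, weights):
    """Multiply each token's embedding vector by that token's weight."""
    if len(token_embeddings) != len(weights):
        raise ValueError("need exactly one weight per token")
    return [[w * x for x in vec] for vec, w in zip(token_embeddings, weights)]


# Toy 2-dimensional embeddings for three tokens (made-up numbers).
tokens = ["a", "hippo", "waffle"]
embeddings = [[1.0, 2.0], [0.5, -1.0], [2.0, 4.0]]
weights = [1.0, 1.5, 0.5]  # emphasize "hippo", de-emphasize "waffle"

weighted = scale_embeddings(embeddings, weights)
print(weighted)  # [[1.0, 2.0], [0.75, -1.5], [1.0, 2.0]]
```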
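If your favorite pipeline doesn't have a `prompt_embeds` parameter, please open an issue so we can add it!
This guide will show you how to weight your prompts with sd_embed.
Before you begin, make sure you have the latest version of sd_embed installed: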
pip install git+https://github.com/xhinker/sd_embed.git@main
For this example, let's use StableDiffusionXLPipeline.
from diffusers import StableDiffusionXLPipeline, UniPCMultistepScheduler
import torch
pipe = StableDiffusionXLPipeline.from_pretrained("Lykon/dreamshaper-xl-1-0", torch_dtype=torch.float16)
pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config)
pipe.to("cuda")
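To upweight or downweight a concept, surround the text with parentheses. More parentheses apply a heavier weight to the text. You can also append a numerical multiplier to the text to indicate how much you want to increase or decrease its weight by.
Format | Multiplier |
---|---|
(hippo) | increase by 1.1x |
((hippo)) | increase by 1.21x |
(hippo:1.5) | increase by 1.5x |
(hippo:0.5) | decrease by 2x (weight 0.5) |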
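The arithmetic behind the table can be sketched in a few lines: each level of parentheses multiplies the weight by 1.1, and an explicit `:number` sets the weight directly. The function below is an illustration of that convention for a single expression, not sd_embed's parser, and `token_weight` is a hypothetical name.

```python
import re


def token_weight(expr: str, base: float = 1.1) -> float:
    """Return the weight implied by one parenthesized expression."""
    depth = 0
    # Peel matching outer parentheses and count the nesting depth.
    while expr.startswith("(") and expr.endswith(")"):
        expr = expr[1:-1]
        depth += 1
    # An explicit ":number" suffix sets the weight directly.
    match = re.fullmatch(r"(.+):([0-9]*\.?[0-9]+)", expr)
    if depth > 0 and match:
        return float(match.group(2))
    return base ** depth


for example in ["hippo", "(hippo)", "((hippo))", "(hippo:1.5)", "(hippo:0.5)"]:
    print(example, round(token_weight(example), 2))
```

The sketch handles only the forms listed in the table. Combinations such as nested parentheses around an explicit multiplier are left out on purpose.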
Create a prompt and use a combination of parentheses and numerical multipliers to upweight various text.
from sd_embed.embedding_funcs import get_weighted_text_embeddings_sdxl
prompt = """A whimsical and creative image depicting a hybrid creature that is a mix of a waffle and a hippopotamus.
This imaginative creature features the distinctive, bulky body of a hippo,
but with a texture and appearance resembling a golden-brown, crispy waffle.
The creature might have elements like waffle squares across its skin and a syrup-like sheen.
It's set in a surreal environment that playfully combines a natural water habitat of a hippo with elements of a breakfast table setting,
possibly including oversized utensils or plates in the background.
The image should evoke a sense of playful absurdity and culinary fantasy.
"""
neg_prompt = """\
skin spots,acnes,skin blemishes,age spot,(ugly:1.2),(duplicate:1.2),(morbid:1.21),(mutilated:1.2),\
(tranny:1.2),mutated hands,(poorly drawn hands:1.5),blurry,(bad anatomy:1.2),(bad proportions:1.3),\
extra limbs,(disfigured:1.2),(missing arms:1.2),(extra legs:1.2),(fused fingers:1.5),\
(too many fingers:1.5),(unclear eyes:1.2),lowers,bad hands,missing fingers,extra digit,\
bad hands,missing fingers,(extra arms and legs),(worst quality:2),(low quality:2),\
(normal quality:2),lowres,((monochrome)),((grayscale))
"""
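Use the `get_weighted_text_embeddings_sdxl` function to generate the prompt embeddings and the negative prompt embeddings. Because you're using an SDXL model, it also generates the pooled and negative pooled prompt embeddings.
You can safely ignore the message below about the token index length exceeding the model's maximum sequence length. All of your tokens will be used in the embedding process.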
Token indices sequence length is longer than the specified maximum sequence length for this model
(
    prompt_embeds,
    prompt_neg_embeds,
    pooled_prompt_embeds,
    negative_pooled_prompt_embeds
) = get_weighted_text_embeddings_sdxl(
    pipe,
    prompt=prompt,
    neg_prompt=neg_prompt
)
image = pipe(
    prompt_embeds=prompt_embeds,
    negative_prompt_embeds=prompt_neg_embeds,
    pooled_prompt_embeds=pooled_prompt_embeds,
    negative_pooled_prompt_embeds=negative_pooled_prompt_embeds,
    num_inference_steps=30,
    height=1024,
    width=1024 + 512,
    guidance_scale=4.0,
    generator=torch.Generator("cuda").manual_seed(2)
).images[0]
image
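The sequence-length message appears because the CLIP text encoders used by Stable Diffusion accept 77 tokens per pass, two of which are the start and end markers, leaving 75 for content. A common approach in long-prompt tools is to split the token ids into chunks of 75 and encode each chunk separately. The sketch below shows only that splitting step. Whether sd_embed does exactly this is an assumption, so check the repository for the actual implementation. `chunk_token_ids` is a hypothetical helper.

```python
def chunk_token_ids(token_ids, chunk_size=75):
    """Split a flat list of token ids into consecutive chunks of at most chunk_size."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [token_ids[i : i + chunk_size] for i in range(0, len(token_ids), chunk_size)]


ids = list(range(180))  # stand-in for the token ids of a long prompt
chunks = chunk_token_ids(ids)
print([len(c) for c in chunks])  # [75, 75, 30]
```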

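Refer to the sd_embed repository for additional details about long prompt weighting for FLUX.1, Stable Cascade, and Stable Diffusion 1.5.
Textual inversion
Textual inversion is a technique for learning a specific concept from some images, which you can then use to generate new images conditioned on that concept.
Create a pipeline and use the load_textual_inversion() function to load the textual inversion embeddings (feel free to browse the Stable Diffusion Conceptualizer for 100+ trained concepts):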
import torch
from diffusers import StableDiffusionPipeline
pipe = StableDiffusionPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5",
    torch_dtype=torch.float16,
).to("cuda")
pipe.load_textual_inversion("sd-concepts-library/midjourney-style")
Add the `<midjourney-style>` text to the prompt to trigger the textual inversion.
from sd_embed.embedding_funcs import get_weighted_text_embeddings_sd15
prompt = """<midjourney-style> A whimsical and creative image depicting a hybrid creature that is a mix of a waffle and a hippopotamus.
This imaginative creature features the distinctive, bulky body of a hippo,
but with a texture and appearance resembling a golden-brown, crispy waffle.
The creature might have elements like waffle squares across its skin and a syrup-like sheen.
It's set in a surreal environment that playfully combines a natural water habitat of a hippo with elements of a breakfast table setting,
possibly including oversized utensils or plates in the background.
The image should evoke a sense of playful absurdity and culinary fantasy.
"""
neg_prompt = """\
skin spots,acnes,skin blemishes,age spot,(ugly:1.2),(duplicate:1.2),(morbid:1.21),(mutilated:1.2),\
(tranny:1.2),mutated hands,(poorly drawn hands:1.5),blurry,(bad anatomy:1.2),(bad proportions:1.3),\
extra limbs,(disfigured:1.2),(missing arms:1.2),(extra legs:1.2),(fused fingers:1.5),\
(too many fingers:1.5),(unclear eyes:1.2),lowers,bad hands,missing fingers,extra digit,\
bad hands,missing fingers,(extra arms and legs),(worst quality:2),(low quality:2),\
(normal quality:2),lowres,((monochrome)),((grayscale))
"""
Use the `get_weighted_text_embeddings_sd15` function to generate the prompt embeddings and the negative prompt embeddings.
(
    prompt_embeds,
    prompt_neg_embeds,
) = get_weighted_text_embeddings_sd15(
    pipe,
    prompt=prompt,
    neg_prompt=neg_prompt
)
image = pipe(
    prompt_embeds=prompt_embeds,
    negative_prompt_embeds=prompt_neg_embeds,
    height=768,
    width=896,
    guidance_scale=4.0,
    generator=torch.Generator("cuda").manual_seed(2)
).images[0]
image
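A learned concept only takes effect when its trigger text is present in the prompt, so it can help to check for it before computing embeddings. The helper below is a hypothetical convenience, not part of Diffusers or sd_embed, and the example prompts are made up.

```python
def ensure_trigger(prompt: str, trigger: str) -> str:
    """Prepend the trigger text unless the prompt already contains it."""
    if trigger in prompt:
        return prompt
    return f"{trigger} {prompt}"


print(ensure_trigger("a waffle hippo", "<midjourney-style>"))
# <midjourney-style> a waffle hippo
print(ensure_trigger("<midjourney-style> a waffle hippo", "<midjourney-style>"))
# <midjourney-style> a waffle hippo
```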

DreamBooth
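DreamBooth is a technique for training on a subject from just a few images and then generating contextualized images of that subject. It is similar to textual inversion, but DreamBooth trains the full model whereas textual inversion only fine-tunes the text embeddings. This means you should use from_pretrained() to load the DreamBooth model (feel free to browse the Stable Diffusion Dreambooth Concepts Library for 100+ trained models):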
import torch
from diffusers import DiffusionPipeline, UniPCMultistepScheduler
pipe = DiffusionPipeline.from_pretrained("sd-dreambooth-library/dndcoverart-v1", torch_dtype=torch.float16).to("cuda")
pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config)
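Depending on the model you use, you'll need to include the model's unique identifier in your prompt. For example, the `dndcoverart-v1` model uses the identifier `dndcoverart`: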
from sd_embed.embedding_funcs import get_weighted_text_embeddings_sd15
prompt = """dndcoverart of A whimsical and creative image depicting a hybrid creature that is a mix of a waffle and a hippopotamus.
This imaginative creature features the distinctive, bulky body of a hippo,
but with a texture and appearance resembling a golden-brown, crispy waffle.
The creature might have elements like waffle squares across its skin and a syrup-like sheen.
It's set in a surreal environment that playfully combines a natural water habitat of a hippo with elements of a breakfast table setting,
possibly including oversized utensils or plates in the background.
The image should evoke a sense of playful absurdity and culinary fantasy.
"""
neg_prompt = """\
skin spots,acnes,skin blemishes,age spot,(ugly:1.2),(duplicate:1.2),(morbid:1.21),(mutilated:1.2),\
(tranny:1.2),mutated hands,(poorly drawn hands:1.5),blurry,(bad anatomy:1.2),(bad proportions:1.3),\
extra limbs,(disfigured:1.2),(missing arms:1.2),(extra legs:1.2),(fused fingers:1.5),\
(too many fingers:1.5),(unclear eyes:1.2),lowers,bad hands,missing fingers,extra digit,\
bad hands,missing fingers,(extra arms and legs),(worst quality:2),(low quality:2),\
(normal quality:2),lowres,((monochrome)),((grayscale))
"""
(
    prompt_embeds,
    prompt_neg_embeds,
) = get_weighted_text_embeddings_sd15(
    pipe,
    prompt=prompt,
    neg_prompt=neg_prompt
)
