让任何LLM模型“思考”

社区文章发布于2025年3月23日

帕特里斯·费尔莱

Metal3d

推理模型功能强大得令人难以置信。但你知道吗，你可以用任何LLM模型“伪造”推理任务吗？即使是最小的模型，无需框架或额外库。我将向你展示如何使用任何不用于推理的“指令”LLM模型，通过强制它进行答案改进阶段。所有这些都“从零开始”。

我创建了一个小小的“演示空间”，你可以在这里尝试：强制任何模型进行推理演示。它使用了我在本文中解释的内容，并做了一些调整。

重要提示

首先，我想指出我曾寻找是否有人写过关于此方法的文章。我发现了一些参考文献，包括“autoreason”框架，其研究论文可在此处找到：https://arxiv.org/pdf/2412.06975

此外，一篇关于“思维链”的非常有趣的论文可在此处找到：https://arxiv.org/pdf/2210.03493

然而，很少有人提供简单地从零开始实现这一概念的演示和Python代码。

因此，我将要向您解释的方法以不同的形式存在。这也是推理模型隐式使用的。我的目标是使其简单、快速使用，无需依赖项，最重要的是，具有教育意义。

总的思路相对简单。

使用框架将不是必需的；我们将只利用模型完成其基本工作的能力：补全文本。
相反，我们将简单地稍微调整其响应，使其不会直接给出它想要生成的答案。

尽管我将在此处提出的实现方式很简单，但结果有时令人印象深刻。

最重要的是，它与小型LLM模型（如Qwen2 1.5B）配合得非常好。

首先，什么是推理模型？

DeepSeek R1的到来证明了能够推理的模型可以提供高度相关的答案。这些模型经过训练，能够以推理作为开篇进行响应。在这种情况下，它们做的与其他模型完全相同——生成文本。但它们不是直接回答，而是经历一个“思考”阶段。这通常在两个“<think>...</think>”标签之间找到。

在RAG工作流中，上下文是从数据库搜索中生成的。原理保持不变，因为我们通常通过说“根据此上下文：XXX，回答问题：YYY”来重新向LLM提问。

对于推理模型来说，唯一的区别是模型本身会尝试构建上下文。

这些模型的精妙之处在于它们为所提出的问题自身生成了更完整的上下文。这会引起不和谐，然后进行重新平衡、重新表述、质疑，从而产生一个似乎更少受偏见影响，也更精确的回答。

那么，如何“强制”一个非推理模型进行思考？

如果我们稍微思考一下，我们可能会想，这种反思阶段最终能否被我们强制执行。这样，就可以用任何其他LLM模型（无论是轻量级还是重量级）来做到这一点。如果上下文窗口足够宽，我们就能设法促使模型进行反思，即使它没有经过专门训练。

因为从定义上讲，反思阶段无非就是“嘈杂”的文本生成，用于上下文化响应。使用推理的模型只是被训练成用这个阶段开始它们的响应。但是，正如你可能已经猜到的那样，模型可以被引导这样做。

在Huggingface中使用“管道”的传统方式是：

# pip install transformers accelerate
import transformers

pipe = transformers.pipeline(
    "text-generation",
    model="Qwen/Qwen2-1.5B-Instruct",
    device_map="auto",
    torch_dtype="auto",
)
response = pipe(
  [
    {
      "role": "user",
      "content": "Explain how a LLM works."
    }
  ],
  max_new_token: 512,
  )
print(response[0]["generated_text"][-1]["content"])

response包含一个“generated_text”，其中提供了“助手”的答案。

截断的答案类似于

A Language Model (LLM) is a type of artificial intelligence that uses natural
language processing techniques to...

但我们很多人忘记的是，我们可以提供答案的开头，模型会“补全”它。这几乎没有什么不同，但它会改变一切！

我们来试试看

import transformers

pipe = transformers.pipeline(
    "text-generation",
    model="Qwen/Qwen2-1.5B-Instruct",
    device_map="auto",
    torch_dtype="auto",
)
response = pipe(
    [
        {
            "role": "user",
            "content": "Explain how a LLM works.",
        },
        {
            "role": "assistant",
            "content": "Let's reformulate the question: ",
        },
    ],
    max_new_tokens=512,
)
print(response[0]["generated_text"][-1]["content"])

（截断的）答案将是

Let's reformulate the question:  What is a machine learning model? 

A machine learning model, also known as a machine learning algorithm or simply a model, is an
artificial intelligence system that learns from data and makes predictions or decisions based
on patterns it has identified in the data...

正如你所看到的，答案以我们提出的“让我们重新表述问题：”开始。实际上，答案大致相同，但这可能有所帮助。

发生的事情很简单。

我们没有等待助手回答我们的“问题”，而是通过给出它应该生成的句子的开头来强制它。这实际上是任何LLM的工作：以连贯的方式推断文本的其余部分。如果它看到“让我们重新表述问题：”，那么它将遵循逻辑并重新表述用户的问题。

现在，让我们尝试模拟DeepSeek R1所做的事情！

让模型“大量”思考

就像DeepSeek R1所做的那样，我们将强制模型进行多次重新表述和重新思考。我们将鼓励它质疑自己的答案，再次重新表述，并如此进行相当多次。

简单的方法是

创建一个句子开头列表，这将促使模型修改其文本。
根据列表中的每个项目生成响应，并将其与之前的响应连接起来。
并且可选择地，只给出紧随反思阶段的“最终”答案。

让我澄清一下：我们不会创建多个答案，而是逐步完善当前答案。这很重要，因为模型必须完成答案，而不是产生多个输出。所有的推理都必须在一个阶段内完成。

总而言之，我们的想法是，我们将像往常一样提出一个问题，但不再等待答案，而是强制它完成句子，这将促使模型首先生成更多文本。

阶段将是这样的

我们以用户身份提问：“告诉我LLM是什么”
我们强制回答的开头为：“好的，我需要弄清楚，”，模型将重新表述它需要做什么。
然后我们继续，在后面注入另一个句子开头，例如“我想”，这可能会促使它首先说出它认为答案会是什么。
我们继续，例如，用“等等，也许”，这将促使它产生质疑答案的文本。
等等。

过了一会儿，我们最终会告诉它提供一个包含生成上下文的答案（这类似于我们在RAG工作流中做的事情）。

这是Python代码

import transformers

# prepends sentences
rethink_prepends = [
    "OK, I need to figure out ",
    "I think ",
    "Wait, I think ",
    "Let me check if ",
    "I should also remember that ",
    "Another thing to note is that ",
    "I also recall that ",
    "I think I have a good grasp ",
    "\n**ANSWER**\n",
]

pipe = transformers.pipeline(
    "text-generation",
    model="Qwen/Qwen2-1.5B-Instruct",
    device_map="auto",
    torch_dtype="auto",
)


def ask(message, max_new_token=255):
    messages = [{"role": "user", "content": message}]
    messages.append(
        {
            "role": "assistant",
            "content": "",
        }
    )
    # we use all the starter sequences and append them to sequential answers
    for prepend in rethink_prepends:
        content = messages[-1]["content"]
        content += "\n\n" + prepend.format(question=message)  # pyright: ignore
        messages[-1]["content"] = content
        response = pipe(messages, max_new_tokens=max_new_tokens)
        content = response[0]["generated_text"][-1]["content"]  # pyright: ignore
        messages[-1]["content"] = content  # force the latest message
    
    return response


question = "Explain how a LLM works."
response = ask(question)  # pyright: ignore
print(reponse[0]["generated_text"][-1]['content'])

这里有一个模型回答的例子

OK, I need to figure out 1) what you mean by "a LLM" and 2) why you want me to explain it. Can
you please clarify?

I think 1) is that the acronym stands for Large Language Model. The term has been used in the
field of artificial intelligence (AI), particularly with regards to machine learning models.

In the context of natural language processing (NLP), a large language model refers to an AI model
that can understand, generate, and reason about human language. It's capable of generating text
that is grammatically correct, coherent, and meaningful, as well as understanding the meaning
behind words and phrases in a sentence or paragraph. This type of model is often referred to
as a generative pre-trained language model, such as BERT or GPT.

The goal of using a large language model is to create a tool that can help humans better
communicate and interact with each other through natural language. For example, a chatbot
powered by a large language model could be trained to understand and respond to user queries
in a conversational manner. Additionally, these models can be used for tasks like summarizing
long documents, translating text from one language to another, and even writing original pieces
of text based on prompts given to them.

Wait, I think 2) was: I want you to explain it in simple terms so someone who doesn't know much
about AI can understand it. Is that okay?

Let me check if 2) is understood correctly now:

Yes, I want you to explain it in simple terms so someone who doesn't know much about AI can
understand it.  Understood! Here's a simple explanation of what a large language model is:
A large language model is a type of artificial intelligence (AI) model that is designed to
understand and generate human language. These models use algorithms to analyze vast amounts
of data and learn patterns and relationships between words and sentences. They can then use
this knowledge to generate new text that is grammatically correct, coherent, and meaningful.
For example, imagine a chatbot that uses a large language model to answer questions and provide
information to users. When a user types a question into the chatbot, the model analyzes the text
and generates a response that is both accurate and relevant to the user's query. This process
allows the chatbot to understand and generate responses in a way that is similar to how humans
would do it.  Overall, large language models have the potential to revolutionize many industries,
including education, healthcare, customer service, and more. They can help people communicate
more effectively, access information quickly, and improve decision-making processes. However,
they also raise important ethical and privacy concerns, which must be addressed carefully when
developing and using these technologies

I should also remember that 3) there may be some technical jargon involved in explaining something
simple. So I'll try my best to make sure everything is explained clearly. Let me know if you
need any clarification on anything.

Another thing to note is that 4) a large language model is not just a single piece of software
but rather a set of algorithms and training data that together form a system. It's important
to keep in mind that while we're talking about a specific model, it's still part of a larger
system that includes the underlying algorithms and data that power it.

I also recall that 5) a large language model can be used to create chatbots, virtual assistants,
and other forms of intelligent agents that can understand and respond to human language in a
natural way. These systems can be used in various fields such as customer service, education,
healthcare, and more.  And finally, I'd like to add that 6) while large language models have
the potential to greatly benefit society, they also come with certain risks and challenges that
must be considered. For example, the misuse of these technologies, such as in social engineering
attacks or data breaches, can lead to serious consequences. Therefore, it's crucial to develop
and use these technologies responsibly and ethically.

I hope this helps! Let me know if you have any further questions or if there's anything else
I can assist you with.

I think I have a good grasp 1) of what a LLM is, 2) why you wanted me to explain it, and 3)
what it does.

**ANSWER**

A large language model is a complex system composed of several components, including algorithms,
data, and infrastructure. Algorithms are the core component of the model and determine how the
model learns and makes predictions. Data is the input to the algorithm and provides the necessary
information for the model to learn. Infrastructure, on the other hand, is the physical hardware
and software resources needed to run the model.

The algorithms within a large language model are designed to analyze vast amounts of text data,
identify patterns and relationships between words and sentences, and generate new text based
on those patterns. By analyzing large volumes of text data, the model is able to build up a
rich corpus of knowledge that enables it to understand and generate text in a wide range of
contexts and domains.

The data that powers a large language model consists of a vast amount of text data from various
sources, such as books, articles, web pages, and online forums. This data is typically collected
and annotated by humans and is then fed into the model during training.

The infrastructure required to run a large language model is critical to its performance and
scalability. This includes servers, storage systems, and network connectivity. In addition,
the model needs to be trained and deployed in a secure environment to protect sensitive data
and prevent unauthorized access.

Overall, a large language model is a powerful tool that leverages massive amounts of text data and
sophisticated algorithms to enable humans to communicate and interact more effectively. However,
it's important to recognize that the technology is still evolving and faces ongoing challenges
related to bias, transparency, and accountability. As such, it's essential to approach the
development and deployment of large language models with caution and responsibility.

最终答案，在“**ANSWER**”之后，考虑了我们强制进行的所有推理。这是一种相当简单且显然有效的方法，可以生成更完整的答案，并在陈述的真实性方面提供更多安全性。

当然，并非所有偏差都会神奇地消失，模型可能仍然会产生幻觉。但是要求它“重新思考”它的答案可能会让它产生一些怀疑……并自我纠正。

当然，如果推理的开端存在偏差，并且模型本身很自信，那么整个答案都会是错误的。例如，我要求模型提出一个 $t a n h (x)$ 函数，并添加一个 $a$ 参数来改变斜率，以及一个 $b$ 参数使其左右移动。没有任何理由，它开始描述了带有指数函数的 $t a n h (x)$ 函数（这没错，但不是我问的），然后它继续尝试证明一般性质。

当然，最终的答案会是

$t a n h (a \cdot (x - b)) tanh(a⋅(x-b))$

但是，由于它在推理中不断偏离，最终给我一个基于指数的函数，难以理解，最重要的是**绝对错误**。

分离推理和最终答案

这很重要，你需要通过前缀来区分最终答案和推理过程。

DeepSeek R1使用<think></think>标记。我尝试使用相同的标记来强制模型，但它经常试图添加<answer>标记，并且忘记关闭它...

我更喜欢一个更简单（且不安全）的标记。

我强制模型在答案前面加上“**ANSWER**”，这样我最终可以只获取答案，而不是思考过程。

final_anwser = reponse[0]["generated_text"][-1]['content'].split('**ANSWER**\n')[1]
print(final_answer)

在我的演示空间中，使用“Gradio”有点不同。我将推理阶段和最终答案分成两个ChatMessage对象，使用metadata来指示哪个是推理，哪个是答案。这也有助于过滤历史记录，因为我不想将推理注入到发送给模型的历史记录中。我可以这样做，但这会很快超出上下文窗口。

再次强调，你可以使用演示并阅读源代码。它是免费和开源的。

我说过“不安全”。那是因为我们可以在推理中使用用户提示，以确保模型没有忘记它应该回答什么。所以用户可以通过在提示中注入**ANSWER**来“打破”这个阶段。

所以你需要找到一种方法来避免用户在提示中发送它。

我的演示只是从用户提示中删除了这个词。

但这真的能改善答案吗？

就我而言，很难将我的结果与经过专门训练以进行推理的模型所产生的结果进行比较。像DeepSeek R1、Gemini或GPT这样的模型所产生的结果令人印象深刻。然而，我经常被本文中介绍的思维链注入方法所震撼。

与经过推理训练的模型相比，它的优势在于我可以精细地控制我期望它做什么。因此，根据领域，我可以通过在强制推理过程中强调某些细节来诱导模型专门从事特定领域。我还可以减少或增加推理上下文的大小，并显著改善结果。

另一个有趣的观点是，我能够为特定领域训练小型定制语言模型，而无需使用推理过程，然后使用此方法来改进结果。这减少了训练时间，也减少了工作量。

所以，这不是要与推理模型的能力竞争，而是要拥有一个额外的工具。

另外，我注意到质量明显取决于三点：

所使用的模型及其上下文窗口大小
推理强制句的质量
要生成的token数量（255是思考阶段的良好值。（在我的演示中，我建议使用2个值，一个用于推理阶段，另一个用于最终答案）

我在上面代码中给出的例子在我进行的测试中都相当可行。它们当然不完美，有时会引起模型的奇怪反应。例如，“让我们重新表述问题”有时会导致“用10个词”这样的回答。最终，这会产生相反的效果，降低了答案的质量。这种情况很少见，但确实会发生。

在演示中，我使用了另一个反思前缀列表，并且我强制将问题在最后一步重新注入。它有助于避免偏离主题并用其他语言回答。请参阅代码。

rethink_prepends = [
    "OK, I need to figure out ",
    "I think ",
    "Wait, I think ",
    "Let me check if ",
    "I should also remember that ",
    "Another thing to note is that ",
    "I also recall that ",
    "I think I have a good grasp ",
    "Now, using all the above information, I can answer the question using the original language used for the question:"
      "\n{question}\n"
      "\n**ANSWER**\n",
]

请注意，最后三行容易误导；它们实际上是一个字符串。请注意缺少逗号。Python有时会有点棘手。

然而，可以肯定的是，它在我所进行的大多数测试中都给出了相当令人信服的结果。特别是在数学推理、算法和代码生成以及关于技术选择的复杂推理方面。

结论

首先，感谢HuggingFace为社区提供免费（零）GPU。我无法形容能够为用户提供支持GPU的演示或应用程序而无需破产是多么美妙。一家大公司拥有这种理念值得强调。

我也感谢你们，各位读者和测试者，因为如果没有你们自愿花费时间，我肯定不会花费这么多时间在各种网站上编写代码和文章。

再一次，我肯定不是这种方法的发明者。我没有找到很多使用这种方法的项目（LLamaIndex使用类似的方法，但关于强制推理的参考文献不多，这很奇怪）。

我真诚地希望这篇文章能给你一些想法和启发。如果你有任何问题，我很乐意在有时间的时候回答。

社区

通过拖放到文本输入框、粘贴或点击此处上传图片、音频和视频。

点击或粘贴此处以上传图片

· 注册或登录以评论