从对话中提炼：在大型语言模型互动中寻找意义

社区文章发布于 2025 年 2 月 25 日

在大型语言模型 (LLM) 时代，与 AI 进行对话变得越来越普遍。然而，随着这些交互变得越来越长和复杂，跟踪关键信息和洞察可能会成为一个挑战。这就是为什么需要一个个性化的对话摘要工具，帮助用户提炼 LLM 对话的精髓并保留有价值的知识。

这个名为自适应/渐进式摘要的项目，旨在通过提供一种独特的方法来解决 LLM 对话摘要的挑战。与通用摘要方法不同，我们的项目专注于创建反映个人用户互动和需求的个性化摘要。通过为每个用户量身定制摘要过程，我们确保生成的摘要不仅准确，而且与个人相关且有意义。

我们项目的一个主要特点是渐进式摘要技术。我们不是为每次交互生成一个全新的摘要，而是根据最新的对话来完善和增强现有摘要。这种方法确保了一致性，避免了摘要不连贯的问题。通过在之前的摘要基础上逐步构建，我们创建了一个全面且连贯的 LLM 交互记录。

该项目是与 Sayak Paul 在 2025 年初举行的 Vertex AI Sprint 期间合作完成的。

案例研究

该项目的核心是提示工程的关键作用。通过精心设计提示，引导 LLM 生成准确和相关的摘要，我们可以显著提高最终输出的质量。这涉及到理解 LLM 行为的细微差别，并采用有效的提示设计技术来引发所需的响应。

## System Prompt

Based on the given summary and the last conversation between you (assistant) and me (user), 
update the summary following the below summary guide.

Summary Guide:
* Do not rewrite the entire summary.
* It is allowed to modify the current summary, but do not eliminate.
* Update only the specific portions necessary to reflect new information or changes.
* Only include information we have explicitly discussed. 
* Do not introduce any new information or topics, even if you have prior knowledge.
* Ensure the summary is factually accurate and reflects the nuances of our discussion.
* While being detailed, also aim for conciseness and clarity in the summary.
* Use markdown formatting to make the summary more readable (using bullet points, numbers are preferred).
* Do not seprate sections for previous and updated summaries.

为了说明我们摘要技术的渐进性，让我们看一系列屏幕截图，展示摘要如何随着对话的进行而演变。这些屏幕截图是我们自己在实验本项目时在论文 s1: Simple test-time scaling 中截取的。此外，我们使用了 Gemini 2.0 Flash 进行测试。

第一次摘要：这篇论文是关于什么的？

我想要了解这篇论文的表面内容，所以我简单地问了“这篇论文是关于什么的？”，然后模型给出了回应，下面是第一次交互的摘要。在阅读完整篇论文后，我发现这份摘要写得很好，而且足够详细。

The paper "s1: Simple test-time scaling" introduces a straightforward approach to improve language model reasoning performance by using additional computation during the testing phase.

Here's a breakdown:

*   **Problem:** The paper tackles the challenge of enhancing reasoning capabilities in language models (LMs) with a focus on simplicity and replicability. It notes that while recent models like OpenAI's o1 demonstrate improved performance through test-time scaling (increasing computation at test time), the methodology isn't public.
*   **Approach:** The authors propose a simple method based on:

    1.  **Data Curation:** They create a small dataset (s1K) of 1,000 reasoning questions with associated traces, meticulously selected based on difficulty, diversity, and quality.
    2.  **Budget Forcing:** A technique to control the computational resources used by the LM during testing. It forcefully terminates or extends the model's reasoning process. The 'Budget Forcing' approach consists of interrupting a model that generates too many thinking tokens or prolonging its thinking duration by appending "Wait" to the model's generation when it wants to stop.

*   **Methodology:**

    *   Supervised fine-tuning of the Qwen2.5-32B-Instruct model on the s1K dataset.
    *   Applying "budget forcing" to the fine-tuned model during testing to control computation.

*   **Results:**
    *   The resulting model (s1-32B) exceeds the o1-preview model's performance on competition-level math questions (MATH and AIME24).
    *   Budget forcing allows for performance extrapolation beyond the model's baseline, enhancing its accuracy on AIME24.
    *   The authors emphasize that using just a thousand samples is enough to generate a strong model that can compete with large-scale closed source models when a 'Budget Forcing' strategy is applied.

*   **Contributions:**

    *   A simple method for creating a sample-efficient reasoning dataset.
    *   A straightforward test-time scaling technique.
    *   A competitive language model called s1-32B.
    *   Ablation studies of dataset creation and test-time scaling methods.

*   **Key Ideas:**

    *   Test-time scaling can significantly improve reasoning in LMs.
    *   Simple methods can be effective and efficient.
    *   Careful data curation is crucial for good performance.
    *   Budget forcing provides a controllable way to manage test-time compute.

In essence, the paper argues that strong reasoning performance and test-time scaling can be achieved through a combination of curated data and controlled test-time computation without complex reinforcement learning techniques.

第二次摘要：作者是如何收集“推理轨迹”的？

收集“推理轨迹”是一项琐碎的任务，因为没有主要的 LLM 服务提供 API 来收集它们，所以我很想知道他们从哪里获得了推理轨迹。下面是基于第二次交互更新的摘要。

值得注意的是，更新后的摘要是准确的，因为它提到了“Google Gemini Flash Thinking API”，与论文相比，并且只有处理推理轨迹的部分得到了更新，其余摘要完全没有受到影响。

.....

    1.  **Data Curation:** .... The reasoning traces were collected using the Google Gemini Flash Thinking API (Google, 2024). They generated a reasoning trace and solution for each question using this API.

.....

第三次摘要：s1-32B 模型是否经过微调以理解“wait”令牌？

在与LLM交互时，我能理解“Wait”令牌的作用，但我很好奇如何应用它。具体来说，我不知道是否必须微调我的本地模型才能理解Wait令牌的作用，或者它是否可以在不微调的情况下应用于所有模型。同样，只有相关部分得到了适当更新。

.....

    2.  **Budget Forcing:** ... The s1-32B model is fine-tuned to understand the "Wait" token as a signal to continue reasoning when it would normally stop, refining its reasoning and potentially correcting errors.

.....

第四次摘要：这项工作的局限性是什么？

我理解了这篇论文的核心概念和机制，我想知道它的局限性。以下是在摘要末尾添加的局限性，没有影响其他部分，这使得这份摘要完美地闭合。

......

*   **Limitations:**
    1.  **Performance Plateau:** The effectiveness of budget forcing diminishes after a certain threshold of increased thinking time.
    2.  **Context Window Constraints:** Exceeding the language model's context window due to prolonged reasoning can lead to errors.
    3.  **Limited Extrapolation:** The current algorithm has limitations in achieving substantial extrapolation of test-time computation.
    4.  **Repetitive Loops:** Forcing the model to think longer can cause it to enter unproductive repetitive loops.
    5.  **Generalizability/Task Specificity:** The approach's applicability to different reasoning tasks or domains beyond the curated dataset (s1K) is uncertain.
    6.  **Dataset Size Still Needed:** The approach relies on a curated dataset of 1K samples.
    7.  **Dependency on Pre-trained Model:** The approach depends on fine-tuning an existing pre-trained and instruction-tuned model.
    8.  **Limited Exploration of Parallel Scaling Methods:** The paper primarily focuses on sequential test-time scaling and does not explore parallel methods in depth.

...... The authors also acknowledge limitations in scalability, generalizability and the need for more robust reasoning strategies.