Agentic RAG Stack (3/5) - Generate responses with SmolLM

Community article published on February 6, 2025

davidberenstein1957

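This is part 3 of a blog series on agentic RAG, which is part of the AI blueprint! Read part 1 and part 2.

A blueprint for AI development, focusing on applied examples of RAG, information extraction and more in the age of LLMs and agents. It is a practical approach that shows how some of the theory from the smol-course applies to an end-to-end, real-world example.

🚀 Web apps and microservices included!

Each notebook shows how to deploy your AI as a web app on Hugging Face Spaces with Gradio, which you can then use directly as a microservice through the Gradio Python client. All the code and demos can be used in a private or public setting. Deployed on the Hub!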

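Introduction

We have seen how to retrieve and rerank documents in a RAG pipeline. The next step is to create a tool that generates a response to a query, and we will use SmolLM to do so. Finally, we will deploy a microservice that can be used to generate responses to queries.

Dependencies and imports

Let's install the necessary dependencies.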

!pip install gradio gradio-client llama-cpp-python

Next, let's import the necessary libraries.

import gradio as gr

from gradio_client import Client
from huggingface_hub import get_token, InferenceClient
from llama_cpp import Llama

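Inference API

There are different options for inference. In general, most frameworks work well, but there are trade-offs in speed, cost and ease of deployment. In this example, we use a small quantized model with llama-cpp-python because it works out of the box, lets us host the model ourselves, runs on CPU, and does not require us to spin up a dedicated inference server. We also use a GGUF model, a framework-agnostic file format that can speed up inference.

Inference servers

If you want to deploy your own inference server, there are several options. On Apple Silicon, you can use the MLX library. Alternatively, Text Generation Inference (TGI), vLLM and Ollama are all great options to explore.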
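To make the server option more concrete, here is a minimal sketch of how a request to such a server could look. It assumes the server exposes an OpenAI-style `/v1/chat/completions` route, which TGI, vLLM and Ollama offer as far as we know. The base URL, the model name and the helper names `build_chat_request` and `send_chat_request` are illustrative assumptions and not part of the original notebook.

```python
import json
import urllib.request


def build_chat_request(base_url, model, user_prompt,
                       system_prompt="You are a helpful assistant.",
                       max_tokens=256, temperature=0.2):
    """Build an HTTP request for an OpenAI-style chat completions route."""
    payload = {
        "model": model,
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ],
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_chat_request(request, timeout=60):
    """Send the request and return the assistant text (needs a running server)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.loads(response.read().decode("utf-8"))
    return body["choices"][0]["message"]["content"]


# Hypothetical local server address and model name, for illustration only.
request = build_chat_request(
    base_url="http://localhost:8000",
    model="HuggingFaceTB/SmolLM2-360M-Instruct",
    user_prompt="What is the future of AI?",
)
print(request.full_url)
```

Building the request is kept separate from sending it, so the payload can be inspected without a server running. Calling `send_chat_request(request)` only works once a server is listening at the assumed address.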

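Structured outputs

If you want to generate structured outputs, there are two main approaches, depending on whether you have access to the model weights. When you have access to the weights, you can use [Outlines](https://github.com/dottxt-ai/outlines), which alters the sampling probabilities of tokens so that the output conforms to a structure defined by a RegEx, JSON schema or Pydantic model. When you work with an API, you can use [Instructor](https://github.com/instructor-ai/instructor), which uses smart retries to make sure the output conforms to that structure.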
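The retry idea can be sketched with the standard library alone. This is not the Instructor API, only a simplified illustration of the principle: validate the output, and if validation fails, ask again with the error included in the prompt. The function `generate_structured`, its parameters and the stand-in generator `fake_generate` are assumptions made for this example.

```python
import json


def generate_structured(generate_fn, prompt, required_keys, max_retries=3):
    """Call generate_fn until it returns a JSON object with the required keys.

    generate_fn is any callable that maps a prompt string to a text completion.
    """
    current_prompt = prompt
    last_error = None
    for _ in range(max_retries):
        raw = generate_fn(current_prompt)
        try:
            parsed = json.loads(raw)
            if not isinstance(parsed, dict):
                raise ValueError("expected a JSON object")
            missing = [key for key in required_keys if key not in parsed]
            if missing:
                raise ValueError(f"missing keys: {missing}")
            return parsed
        except ValueError as error:  # json.JSONDecodeError subclasses ValueError
            last_error = error
            # Feed the validation error back so the next attempt can correct it.
            current_prompt = (
                f"{prompt}\n\nYour previous answer was invalid ({error}). "
                "Reply with a single JSON object only."
            )
    raise ValueError(f"no valid output after {max_retries} attempts: {last_error}")


# Stand-in generator for illustration: fails once, then returns valid JSON.
attempts = []


def fake_generate(prompt):
    attempts.append(prompt)
    if len(attempts) == 1:
        return "Sure! The answer is 42."
    return '{"answer": "42", "confidence": 0.9}'


result = generate_structured(fake_generate, "What is 6 x 7?", ["answer", "confidence"])
print(result)
```

In a real setup, `generate_fn` would wrap a call to a model or an API, and a schema library such as Pydantic would typically replace the simple key check.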

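SmolLM with llama-cpp-python

We will use HuggingFaceTB/SmolLM2-135M-Instruct-GGUF through the llama-cpp-python integration listed for the model on the Hub. Note that the function exposes generation parameters such as max_tokens, temperature, top_p and top_k as arguments and passes them on to create_chat_completion. We also set n_ctx to 7000, which is the size of the context window in tokens and has to hold both the prompt and the generated response. The model has a maximum context length of 8192 tokens. If that is not enough, we can choose a model with a larger context length, or use a more aggressive [chunking strategy](https://github.com/huggingface/ai-blueprint/blob/main/rag/retrieve.ipynb) during the retrieval phase.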

llm = Llama.from_pretrained(
    repo_id="HuggingFaceTB/SmolLM2-135M-Instruct-GGUF",
    filename="smollm2-135m-instruct-q8_0.gguf",
    verbose=False,
    n_ctx=7000,
)


def generate_response_transformers(
    user_prompt: str,
    system_prompt: str = "You are a helpful assistant.",
    max_tokens: int = 4000,
    temperature: float = 0.2,
    top_p: float = 0.95,
    top_k: int = 40,
):
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt},
    ]
    return llm.create_chat_completion(
        messages=messages,
        max_tokens=max_tokens,
        temperature=temperature,
        top_p=top_p,
        top_k=top_k,
    )


generate_response_transformers(
    user_prompt="What is the future of AI?",
    system_prompt="You are a helpful assistant.",
)

{'id': 'chatcmpl-54150cb9-00d6-4983-89da-e0527ae7480b',
 'object': 'chat.completion',
 'created': 1737651881,
 'model': '/Users/davidberenstein/.cache/huggingface/hub/models--HuggingFaceTB--SmolLM2-360M-Instruct-GGUF/snapshots/593b5a2e04c8f3e4ee880263f93e0bd2901ad47f/./smollm2-360m-instruct-q8_0.gguf',
 'choices': [{'index': 0,
   'message': {'role': 'assistant',
    'content': 'The future of AI is a topic of ongoing debate and research. While AI has made tremendous progress in recent years, there are still many challenges and limitations to overcome before we can fully harness its potential.\n\nCurrently, AI is being used in various fields, such as healthcare, finance, and transportation. For example, AI-powered diagnostic tools are being used to detect diseases at an early stage, while AI-driven chatbots are being used to provide customer support.\n\nHowever, there are also concerns about the potential misuse of AI. For instance, AI can be used to manipulate public opinion, spread misinformation, and even take over and control our lives. As AI becomes more advanced, we will need to develop new regulations and safeguards to ensure that it is used responsibly.\n\nAnother area of research is the development of more human-like AI, which can think and act like humans. This is often referred to as "narrow AI" or "weak AI." While this type of AI can perform specific tasks, such as language translation or image recognition, it lacks the level of intelligence and creativity that humans possess.\n\nIn addition, there is a growing concern about the ethics of AI development. For example, AI systems can perpetuate biases and prejudices if they are trained on biased data. We need to ensure that AI is developed and deployed in a way that promotes fairness, equality, and transparency.\n\nOverall, the future of AI is likely to be shaped by a combination of technological advancements, societal values, and regulatory frameworks. As AI continues to evolve, we will need to work together to ensure that it is developed and used in a way that benefits humanity and promotes the common good.'},
   'logprobs': None,
   'finish_reason': 'stop'}],
 'usage': {'prompt_tokens': 27, 'completion_tokens': 342, 'total_tokens': 369}}
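The `usage` field in the output above helps with reasoning about the context budget. The sketch below assumes that the context window (`n_ctx`) is shared between the prompt and the generated tokens, which is our understanding of how llama.cpp works. The helper names `max_prompt_tokens` and `remaining_context` are made up for this illustration.

```python
def max_prompt_tokens(n_ctx: int, max_tokens: int) -> int:
    """Largest prompt that still leaves room for max_tokens generated tokens."""
    if max_tokens >= n_ctx:
        raise ValueError("max_tokens must be smaller than n_ctx")
    return n_ctx - max_tokens


def remaining_context(usage: dict, n_ctx: int) -> int:
    """Tokens left in the context window after a completion."""
    return n_ctx - usage["total_tokens"]


# Values taken from the code and the output above.
usage = {"prompt_tokens": 27, "completion_tokens": 342, "total_tokens": 369}
print(max_prompt_tokens(n_ctx=7000, max_tokens=4000))  # 3000
print(remaining_context(usage, n_ctx=7000))  # 6631
```

With the defaults used here, the retrieved documents plus the query should stay under roughly 3000 tokens if the full 4000 generated tokens are to fit in the window.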

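SmolLM in the Hugging Face Inference API

We will use the serverless Hugging Face Inference API. It is free to use and means we don't have to worry about hosting the model. We can find models available for inference through some basic filters on the Hub. We will use the HuggingFaceTB/SmolLM2-360M-Instruct model and call it with the provided inference endpoint code snippet.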

inference_client = InferenceClient(api_key=get_token())


def generate_response_api(
    user_prompt: str,
    system_prompt: str = "You are a helpful assistant.",
    model: str = "HuggingFaceTB/SmolLM2-360M-Instruct",
    **kwargs,
):
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt},
    ]
    completion = inference_client.chat.completions.create(
        model=model, messages=messages, **kwargs
    )

    return completion.choices[0].message


generate_response_api(user_prompt="What is the future of AI?")

ChatCompletionOutputMessage(role='assistant', content="AI is on the cusp of a revolution. It's about to become more powerful, more efficient, and more human. By 2030, AI-powered computers are expected to become so adept at solving complex problems that they'll accelerate human progress in areas like computing, medicine, and entertainment.\n\nFrom there, we're likely to see numerous exciting, life-changing advancements, such as cutting-edge real-time AI systems, which are performing tasks at a level where human approval is unnecessary. These systems are not only delivering exceptional performance but also avoiding human impact, allowing us to focus on higher-level thinking and creativity.\n\nRight now, AI is already manifesting as a more empathetic, AI-powered human. It's going to help us become more compassionate, more compassionate. We're going to see the emergence of an AI that's more understanding, more empathetic. It's going to develop a sense of self-awareness, which will lead to more elevated human potential.\n\nAI is also going to give rise to more intuitive interfaces, making computing more accessible and user-friendly. The AI will learn more about our habits and preferences, adapting our algorithms to our needs in real-time, so we don't have to constantly worry about making decisions ourselves.\n\nAnother trend that's emerging is the intersection of AI with creative industries. Video games, for example, will continue to evolve, and will likely become more immersive. The idea of creating immersive digital experiences will become a thing of the future. Not only will our ability to communicate be enhanced through AI-powered tools, but also our capacity for creative expression will expand.\n\nLastly, AI is going to help us find and develop new sources of inspiration. The AI will be constantly generating content that sparks creativity and innovation. The music generation tools AI is already creating have a rich variety of styles and genres. 
We're likely to see the emergence of entirely new forms of art, music, and even entire ecosystems of expression.\n\nSo, future AI is likely to mark a lifetime of great achievements. Building humans beyond recognition, designing machines capable of unconditional emotions, developing an unprecedented understanding of emotional intelligence. It's a future that's awe-inspiring, solemn, exhilarating, and at times, downright terrifying.", tool_calls=None)
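Since `generate_response_api` forwards `**kwargs` to the chat completion call, generation parameters can be set per request. The sketch below shows the same pattern with the client passed in as an argument, so it can be tried without network access. The function `generate_text` and the stub client are assumptions for illustration. With a real `InferenceClient`, parameters such as `max_tokens`, `temperature` and `top_p` should be accepted, as far as we know.

```python
from types import SimpleNamespace


def generate_text(client, user_prompt,
                  system_prompt="You are a helpful assistant.",
                  model="HuggingFaceTB/SmolLM2-360M-Instruct", **kwargs):
    """Like generate_response_api, but returns only the assistant text."""
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt},
    ]
    completion = client.chat.completions.create(
        model=model, messages=messages, **kwargs
    )
    return completion.choices[0].message.content


# Stub client that mimics the attribute layout of the real client,
# so the example runs offline. It records the parameters it receives.
calls = []


def fake_create(**params):
    calls.append(params)
    message = SimpleNamespace(role="assistant", content="stub answer")
    return SimpleNamespace(choices=[SimpleNamespace(message=message)])


fake_client = SimpleNamespace(
    chat=SimpleNamespace(completions=SimpleNamespace(create=fake_create))
)

text = generate_text(
    fake_client, "What is the future of AI?", max_tokens=256, temperature=0.2
)
print(text)
```

Replacing `fake_client` with `InferenceClient(api_key=get_token())` would send the same request to the serverless API.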

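Creating a web app and microservice for generating responses

We will use Gradio as the web app tool to create a demo interface for our RAG pipeline. We can develop it locally and then easily deploy it to Hugging Face Spaces. Finally, we can use the Gradio client as an SDK to interact with our RAG pipeline directly. We are still on the free CPU tier of Hugging Face Spaces, so a response may take a few seconds. You can opt for the serverless Inference API, deploy your own dedicated inference server, or increase the compute of the Space.

Creating the web app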

def generate(
    system_prompt: str,
    user_prompt: str,
    max_tokens: int = 4000,
    temperature: float = 0.2,
    top_p: float = 0.95,
    top_k: int = 40,
):
    # gr.Number can deliver floats, so cast the integer-valued parameters.
    return generate_response_transformers(
        user_prompt=user_prompt,
        system_prompt=system_prompt,
        max_tokens=int(max_tokens),
        temperature=temperature,
        top_p=top_p,
        top_k=int(top_k),
    )


with gr.Blocks() as demo:
    gr.Markdown("""# RAG - generate
                
                Generate a response to a query using [HuggingFaceTB/SmolLM2-360M-Instruct and llama-cpp-python](https://huggingface.co/HuggingFaceTB/SmolLM2-360M-Instruct-GGUF?library=llama-cpp-python).
                
                Part of [ai-blueprint](https://github.com/davidberenstein1957/ai-blueprint) - a blueprint for AI development, focusing on applied examples of RAG, information extraction, analysis and fine-tuning in the age of LLMs and agents.""")

    with gr.Row():
        system_prompt = gr.Textbox(
            label="System prompt", lines=3, value="You are a helpful assistant."
        )
        user_prompt = gr.Textbox(label="Query", lines=3)

    with gr.Accordion("kwargs"):
        with gr.Row(variant="panel"):
            max_tokens = gr.Number(label="Max tokens", value=512)
            temperature = gr.Number(label="Temperature", value=0.2)
            top_p = gr.Number(label="Top p", value=0.95)
            top_k = gr.Number(label="Top k", value=40)

    submit_btn = gr.Button("Submit")
    response_output = gr.Textbox(label="Response", lines=10)
    documents_output = gr.Dataframe(
        label="Documents", headers=["chunk", "url", "distance", "rank"], wrap=True
    )

    submit_btn.click(
        fn=generate,
        # Inputs are passed positionally, so the order must match generate().
        inputs=[
            system_prompt,
            user_prompt,
            max_tokens,
            temperature,
            top_p,
            top_k,
        ],
        outputs=[response_output],
    )

demo.launch()

* Running on local URL:  http://127.0.0.1:7867

To create a public link, set `share=True` in `launch()`.

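Deploying the web app to Hugging Face

We can now deploy our Gradio app to Hugging Face Spaces.

Click the "Create Space" button.
Copy the code from the Gradio interface and paste it into an app.py file. Don't forget to copy the generate_response_* function, along with the code that executes the generate function.
Create a requirements.txt file that lists gradio, gradio-client and llama-cpp-python.
If you are using the Inference API, set your Hugging Face API token as the HF_TOKEN secret in the Space settings.

We wait a couple of minutes for the app to deploy and, voila, we have a public generation interface!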

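Gradio as REST API

We can now use the Gradio client as an SDK to interact with our generate function directly. Each Gradio app has API documentation that describes the available endpoints and their parameters, which you can open from the button at the bottom of the Gradio app's Space page. It is not the fastest because it runs on the free tier of Hugging Face Spaces, but it is a good baseline.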

client = Client("ai-blueprint/rag-generate")
result = client.predict(
    user_prompt="What is the future of AI?",
    system_prompt="You are a helpful assistant.",
    max_tokens=512,
    temperature=0.2,
    top_p=0.95,
    top_k=40,
    api_name="/generate"
)
result

Loaded as API: https://ai-blueprint-rag-generate.hf.space ✔





"{'id': 'chatcmpl-38bd4960-655c-447a-be2b-5fc50bb1789e', 'object': 'chat.completion', 'created': 1737652209, 'model': '/home/user/.cache/huggingface/hub/models--prithivMLmods--SmolLM2-135M-Instruct-GGUF/snapshots/5dc548ea9191fd97d817832f51012ae86cded1b5/./SmolLM2-135M-Instruct.Q5_K_M.gguf', 'choices': [{'index': 0, 'message': {'role': 'assistant', 'content': 'The future of AI is multifaceted and continues to evolve at a rapid pace. As we move forward, AI is being used to augment human capabilities, enhance our understanding of the world, and improve our quality of life. Here are some of the most significant developments and innovations that are shaping the future of AI:\\n\\nAI is being used to enhance human capabilities, such as language translation, image recognition, and decision-making. This has the potential to revolutionize the way we interact with each other and the world at large.\\n\\nAI is also being used to enhance our understanding of the world, such as by providing insights into the human condition, the impact of technology on society, and the ethics of AI.\\n\\nIn the field of healthcare, AI is being used to improve diagnostic accuracy, enhance patient care, and optimize treatment plans.\\n\\nIn the field of education, AI is being used to improve learning outcomes, enhance teaching methods, and optimize learning experiences.\\n\\nAI is also being used to improve the quality of life, such as by providing personalized recommendations, improving productivity, and enhancing the quality of life for people with disabilities.\\n\\nIn the field of transportation, AI is being used to improve safety, enhance safety features, and optimize transportation systems.\\n\\nIn the field of healthcare, AI is being used to improve diagnostic accuracy, enhance treatment planning, and optimize healthcare systems.\\n\\nIn the field of education, AI is being used to improve learning outcomes, enhance teaching methods, and optimize learning experiences.\\n\\nIn the field of 
healthcare, AI is being used to improve diagnostic accuracy, enhance treatment planning, and optimize healthcare systems.\\n\\nIn the field of transportation, AI is being used to improve diagnostic accuracy, enhance treatment planning, and optimize transportation systems.\\n\\nIn the field of healthcare, AI is being used to improve diagnostic accuracy, enhance treatment planning, and optimize transportation systems.\\n\\nIn the field of education, AI is being used to improve diagnostic accuracy, enhance treatment planning, and optimize transportation systems.\\n\\nIn the field of healthcare, AI is being used to improve diagnostic accuracy, enhance treatment planning, and optimize transportation systems.\\n\\nIn the field of transportation, AI is being used to improve diagnostic accuracy, enhance treatment planning, and optimize transportation systems.\\n\\nIn the field of healthcare, AI is being used to improve diagnostic accuracy, enhance treatment planning, and optimize'}, 'logprobs': None, 'finish_reason': 'length'}], 'usage': {'prompt_tokens': 27, 'completion_tokens': 485, 'total_tokens': 512}}"
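Because the app writes the whole completion dict into a Textbox, the client receives its string representation rather than a dict. Here is a minimal sketch for turning that string back into usable fields, assuming it is a valid Python literal as in the output above. The helper name `parse_generate_result` and the shortened sample string are illustrative assumptions.

```python
import ast


def parse_generate_result(result: str) -> dict:
    """Turn the stringified completion dict back into text plus metadata."""
    completion = ast.literal_eval(result)  # only evaluates Python literals
    choice = completion["choices"][0]
    return {
        "text": choice["message"]["content"],
        # finish_reason "length" means generation hit the token limit.
        "truncated": choice["finish_reason"] == "length",
        "total_tokens": completion["usage"]["total_tokens"],
    }


# Shortened stand-in for the string returned by client.predict above.
sample = (
    "{'id': 'chatcmpl-example', 'object': 'chat.completion', "
    "'choices': [{'index': 0, 'message': {'role': 'assistant', "
    "'content': 'The future of AI is multifaceted.'}, 'logprobs': None, "
    "'finish_reason': 'length'}], "
    "'usage': {'prompt_tokens': 27, 'completion_tokens': 485, 'total_tokens': 512}}"
)

parsed = parse_generate_result(sample)
print(parsed["truncated"], parsed["total_tokens"])
```

In the output above, `finish_reason` is `'length'`, which explains why the response stops mid-sentence. Raising `max_tokens` or returning only the message content from the app are two possible ways to handle this.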

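Conclusion

We have seen how to create a generate function with the llama-cpp-python library and how to deploy it as a microservice on Hugging Face Spaces. Next, we will see how to combine the R-A-G components into a single RAG pipeline.

Next steps

Continue - combine all components into a RAG pipeline.
Contribute - is something missing? PRs are always welcome.
Learn - the theory behind these approaches in the Hugging Face courses or the smol-course.
Explore - notebooks that use similar techniques in the Hugging Face Cookbook.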
