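Deploy SmolLM3 on Azure AI

This example shows how to deploy SmolLM3 from the Hugging Face Collection as an Azure ML Managed Online Endpoint on an Azure AI Foundry Hub, powered by Transformers with an OpenAI-compatible route. It also shows how to run inference with the OpenAI Python SDK across different scenarios and use cases.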

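TL;DR: Transformers is the model-definition framework for state-of-the-art machine learning models in text, computer vision, audio, video, and multimodal domains, for both inference and training. Azure AI Foundry provides a unified platform for enterprise AI operations, model builders, and application development. Azure Machine Learning is a cloud service for accelerating and managing the machine learning (ML) project lifecycle.

This example specifically deploys HuggingFaceTB/SmolLM3-3B from the Hugging Face Hub (also viewable on AzureML or on Azure AI Foundry) as an Azure ML Managed Online Endpoint on an Azure AI Foundry Hub.

SmolLM3 is a 3B-parameter language model designed to push the boundaries of small models. It supports dual-mode reasoning, 6 languages, and long context. SmolLM3 is a fully open model that offers strong performance at the 3B-4B parameter scale.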

Figure: Small LLM win-rate on benchmarks per model size

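The model is a decoder-only transformer using GQA and NoPE (with a 3:1 ratio). It was pretrained on 11.2T tokens with a staged curriculum of web, code, math, and reasoning data. Post-training included mid-training on 140B reasoning tokens, followed by supervised fine-tuning and alignment via Anchored Preference Optimization (APO).

Instruct model optimized for **hybrid reasoning**.
**Fully open model**: open weights + full training details, including the public data mixture and training configs
**Long context:** trained on 64k context, with support for up to **128k tokens** using YaRN extrapolation
**Multilingual**: 6 natively supported languages (English, French, Spanish, German, Italian, and Portuguese)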

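For more information, make sure to check our model card on the Hugging Face Hub.

Prerequisites

To run the following example, you will need to meet the prerequisites below; alternatively, you can read more about them in the Azure Machine Learning Tutorial: Create resources you need to get started.

An Azure account with an active subscription.
The Azure CLI installed and logged in.
The Azure Machine Learning extension for the Azure CLI.
An Azure Resource Group.
A project based on an Azure AI Foundry Hub.

For more information, follow the steps in Configure Microsoft Azure for Azure AI.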

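Setup and installation

This example uses the Azure Machine Learning SDK for Python to create the endpoint and the deployment, and to invoke the deployed API. You also need to install azure-identity to authenticate with your Azure credentials from Python.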

%pip install azure-ai-ml azure-identity --upgrade --quiet

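For more information, see the Azure Machine Learning SDK for Python.

Then, for convenience, set the following environment variables, as the Azure ML client uses them throughout the example. Make sure to update the values to match your Microsoft Azure account and resources.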

%env LOCATION eastus
%env SUBSCRIPTION_ID <YOUR_SUBSCRIPTION_ID>
%env RESOURCE_GROUP <YOUR_RESOURCE_GROUP>
%env AI_FOUNDRY_HUB_PROJECT <YOUR_AI_FOUNDRY_HUB_PROJECT>
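
As a quick sanity check before creating any client, a small helper can flag settings that are unset or still hold a `<YOUR_...>` placeholder. This is only a sketch: the helper name `missing_settings` is hypothetical and not part of any Azure SDK; the variable names are the ones set above.

```python
import os

# Environment variables used by the Azure ML client in this example.
REQUIRED_VARS = ("LOCATION", "SUBSCRIPTION_ID", "RESOURCE_GROUP", "AI_FOUNDRY_HUB_PROJECT")


def missing_settings(env=None):
    """Return the names of required variables that are unset or still placeholders.

    Hypothetical helper for illustration; pass a dict to check values other
    than the real process environment.
    """
    env = os.environ if env is None else env
    missing = []
    for name in REQUIRED_VARS:
        value = env.get(name, "").strip()
        # Treat "<YOUR_...>"-style values as not yet filled in.
        if not value or (value.startswith("<") and value.endswith(">")):
            missing.append(name)
    return missing


example = {"LOCATION": "eastus", "SUBSCRIPTION_ID": "<YOUR_SUBSCRIPTION_ID>"}
print(missing_settings(example))
# ['SUBSCRIPTION_ID', 'RESOURCE_GROUP', 'AI_FOUNDRY_HUB_PROJECT']
```

Calling `missing_settings()` with no argument checks the current process environment instead.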

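Finally, you also need to define the endpoint and deployment names, as they are used throughout the example too.

Note that endpoint names must be unique within each region: even if no endpoint with a given name is running under your subscription, you cannot use that name if another Azure customer has already reserved it. Adding a timestamp or a custom identifier is recommended, to avoid HTTP 400 validation issues when trying to deploy an endpoint with a name that is already locked or reserved. In addition, the endpoint name must be between 3 and 32 characters long.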

import os
from uuid import uuid4

os.environ["ENDPOINT_NAME"] = f"smollm3-endpoint-{str(uuid4())[:8]}"
os.environ["DEPLOYMENT_NAME"] = f"smollm3-deployment-{str(uuid4())[:8]}"
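
The length rule above can be enforced at the point where the name is generated. The following is a minimal sketch, assuming only the 3-to-32-character constraint stated above; `make_endpoint_name` is a hypothetical helper, and it does not check any other naming rule Azure may apply.

```python
from uuid import uuid4


def make_endpoint_name(prefix="smollm3-endpoint", suffix_len=8):
    """Build a name with a random suffix and enforce the 3-32 character limit.

    Hypothetical helper for illustration only.
    """
    suffix = uuid4().hex[:suffix_len]
    name = f"{prefix}-{suffix}"
    if not 3 <= len(name) <= 32:
        raise ValueError(f"endpoint name {name!r} has {len(name)} characters; expected 3 to 32")
    return name


name = make_endpoint_name()
print(len(name))  # 25: 16-character prefix, 1 hyphen, 8-character suffix
```

A longer prefix fails early with a `ValueError` rather than at deployment time.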

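Authenticate to Azure ML

First, you need to authenticate to the Azure AI Foundry Hub via Azure ML with the Azure ML Python SDK. The same SDK is used afterwards to deploy HuggingFaceTB/SmolLM3-3B as an Azure ML Managed Online Endpoint in your Azure AI Foundry Hub.

In a standard Azure ML deployment, you create the MLClient with the Azure ML Workspace as workspace_name. For Azure AI Foundry, you instead provide the Azure AI Foundry Hub name as workspace_name (read here from the AI_FOUNDRY_HUB_PROJECT environment variable), which also deploys the endpoint under Azure AI Foundry.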

import os
from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential

client = MLClient(
    credential=DefaultAzureCredential(),
    subscription_id=os.getenv("SUBSCRIPTION_ID"),
    resource_group_name=os.getenv("RESOURCE_GROUP"),
    workspace_name=os.getenv("AI_FOUNDRY_HUB_PROJECT"),
)

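Create and deploy the Azure AI endpoint

Before creating the Managed Online Endpoint, you need to build the model URI, formatted as azureml://registries/HuggingFace/models/<MODEL_ID>/labels/latest, where MODEL_ID is not the Hugging Face Hub ID but the model's name on Azure, derived as follows: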

model_id = "HuggingFaceTB/SmolLM3-3B"

model_uri = (
    f"azureml://registries/HuggingFace/models/{model_id.replace('/', '-').replace('_', '-').lower()}/labels/latest"
)
model_uri
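
To make the mapping concrete, the same transformation can be wrapped in a function and applied to the model ID used here. This is only a sketch; `to_model_uri` is a hypothetical name, and the conversion rule is simply the one shown in the snippet above.

```python
def to_model_uri(hub_id):
    """Convert a Hugging Face Hub ID to the Azure registry URI used in this example.

    Hypothetical helper: applies the same rule as the snippet above
    ("/" and "_" become "-", and everything is lowercased).
    """
    azure_name = hub_id.replace("/", "-").replace("_", "-").lower()
    return f"azureml://registries/HuggingFace/models/{azure_name}/labels/latest"


print(to_model_uri("HuggingFaceTB/SmolLM3-3B"))
# azureml://registries/HuggingFace/models/huggingfacetb-smollm3-3b/labels/latest
```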

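To check whether a model from the Hugging Face Hub is available on Azure, see Supported Models. If it is not available, you can always request that the model be added to the Hugging Face collection on Azure.

Then you need to create the ManagedOnlineEndpoint via the Azure ML Python SDK, as follows.

Every model in the Hugging Face Collection is powered by an efficient inference backend, and each can run on a range of instance types (as listed in Supported Hardware). Since the models and inference engines require GPU-accelerated instances, you might need to request a quota increase as described in Manage and increase quotas and limits for resources with Azure Machine Learning.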

from azure.ai.ml.entities import ManagedOnlineEndpoint, ManagedOnlineDeployment

endpoint = ManagedOnlineEndpoint(name=os.getenv("ENDPOINT_NAME"))

deployment = ManagedOnlineDeployment(
    name=os.getenv("DEPLOYMENT_NAME"),
    endpoint_name=os.getenv("ENDPOINT_NAME"),
    model=model_uri,
    instance_type="Standard_NC40ads_H100_v5",
    instance_count=1,
)

client.begin_create_or_update(endpoint).wait()

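In Azure AI Foundry, the endpoint is only listed in the "My assets -> Models + endpoints" tab once the deployment has been created, unlike Azure ML, where the endpoint is shown even if it contains no active or in-progress deployments.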

client.online_deployments.begin_create_or_update(deployment).wait()

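Note that although creating the Azure AI endpoint is relatively fast, the deployment takes longer because resources must be allocated on Azure. Expect it to take around 10-15 minutes, and possibly longer depending on instance provisioning and availability.

Once deployed, you can use Azure AI Foundry or Azure ML Studio to inspect the endpoint details, the real-time logs, and how to consume the endpoint, and even use the monitoring feature (still in preview). For more information, see Azure ML Managed Online Endpoints.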
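
The same details can also be read from Python. The sketch below assumes the `online_deployments.get` and `online_deployments.get_logs` operations of the `azure-ai-ml` client created earlier; `deployment_status` is a hypothetical wrapper name, and the exact content of the returned logs depends on the inference container.

```python
def deployment_status(ml_client, endpoint_name, deployment_name, log_lines=50):
    """Return (provisioning_state, recent_logs) for a deployment.

    Hypothetical wrapper; `ml_client` is expected to be an
    azure.ai.ml.MLClient such as the `client` created earlier.
    """
    deployment = ml_client.online_deployments.get(
        name=deployment_name,
        endpoint_name=endpoint_name,
    )
    logs = ml_client.online_deployments.get_logs(
        name=deployment_name,
        endpoint_name=endpoint_name,
        lines=log_lines,
    )
    return deployment.provisioning_state, logs


# Usage with the client and names defined earlier in this example:
# state, logs = deployment_status(
#     client, os.getenv("ENDPOINT_NAME"), os.getenv("DEPLOYMENT_NAME")
# )
# print(state)
# print(logs)
```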

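Send requests to the Azure AI endpoint

Finally, once the Azure AI endpoint is deployed, you can send requests to it. In this case, since the model's task is `text-generation` (also known as `chat-completion`), you can use the OpenAI SDK to send requests to the scoring URI through the OpenAI-compatible route, i.e., `/v1/chat/completions`.

Note that only some of the options are listed below. You can send requests to the deployed endpoint in any way you like, as long as each HTTP request sets the `azureml-model-deployment` header (set to the name of the Azure AI deployment, not the endpoint) and carries the authentication token or key required for the given endpoint. You can then send HTTP requests to every route that the backend engine exposes, not only the scoring route.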
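
To illustrate the point about plain HTTP, the sketch below builds such a request with the standard library only. It assumes key-based authentication passed as a Bearer token and a base URL that already ends in `/v1`; `build_chat_request` is a hypothetical helper, and the example URL and key are placeholders.

```python
import json
import urllib.request


def build_chat_request(api_url, api_key, deployment_name, messages, max_tokens=128):
    """Build (but do not send) a POST request to the OpenAI-compatible chat route.

    Hypothetical helper; `api_url` is assumed to be the endpoint base URL
    ending in /v1, and `api_key` the endpoint key.
    """
    payload = {
        "model": "HuggingFaceTB/SmolLM3-3B",
        "messages": messages,
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        url=f"{api_url.rstrip('/')}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
            # Name of the deployment, not of the endpoint.
            "azureml-model-deployment": deployment_name,
        },
        method="POST",
    )


request = build_chat_request(
    "https://example-endpoint.eastus.inference.ml.azure.com/v1",  # placeholder URL
    "<API_KEY>",  # placeholder key
    "example-deployment",  # placeholder deployment name
    [{"role": "user", "content": "Hello!"}],
)
print(request.full_url)
# https://example-endpoint.eastus.inference.ml.azure.com/v1/chat/completions

# To actually send it against a live endpoint:
# with urllib.request.urlopen(request) as response:
#     print(json.load(response))
```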

%pip install openai --upgrade --quiet

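To use the OpenAI Python SDK with Azure ML Managed Online Endpoints, you first need to retrieve:

api_url, with the /v1 route (which contains the v1/chat/completions endpoint that the OpenAI Python SDK sends requests to)
api_key, which is the API key in Azure AI or the primary key in Azure ML (unless a dedicated Azure ML token is used instead)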

from urllib.parse import urlsplit

api_key = client.online_endpoints.get_keys(os.getenv("ENDPOINT_NAME")).primary_key

url_parts = urlsplit(client.online_endpoints.get(os.getenv("ENDPOINT_NAME")).scoring_uri)
api_url = f"{url_parts.scheme}://{url_parts.netloc}/v1"

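Alternatively, you can build the API URL manually as follows, since endpoint URIs are unique per region, meaning that only one endpoint with a given name exists within the same region.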

api_url = f"https://{os.getenv('ENDPOINT_NAME')}.{os.getenv('LOCATION')}.inference.ml.azure.com/v1"

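Or just retrieve it from either Azure AI Foundry or Azure ML Studio.

Then you can use the OpenAI Python SDK normally, making sure to include the extra header azureml-model-deployment with the Azure AI / ML deployment name.

With the OpenAI Python SDK, the header can be set either via the `extra_headers` argument on each call to `chat.completions.create`, or via the `default_headers` argument when instantiating the `OpenAI` client. The latter is the recommended approach: the header must be present on every request, so setting it once is enough.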

import os
from openai import OpenAI

openai_client = OpenAI(
    base_url=api_url,
    api_key=api_key,
    default_headers={"azureml-model-deployment": os.getenv("DEPLOYMENT_NAME")},
)
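
For comparison, the per-call alternative passes the header through `extra_headers` on each request. This is a minimal sketch; `chat_once` is a hypothetical wrapper, and `client` is expected to be an `OpenAI` instance created with the same `base_url` and `api_key` as above.

```python
def chat_once(client, deployment_name, messages, **kwargs):
    """Send one chat completion, attaching the deployment header per call.

    Hypothetical wrapper; `client` is expected to be an openai.OpenAI
    instance pointing at the endpoint's /v1 base URL.
    """
    return client.chat.completions.create(
        model="HuggingFaceTB/SmolLM3-3B",
        messages=messages,
        # Same header as `default_headers`, but scoped to this request only.
        extra_headers={"azureml-model-deployment": deployment_name},
        **kwargs,
    )


# Usage with the client and deployment name defined earlier in this example:
# completion = chat_once(
#     openai_client,
#     os.getenv("DEPLOYMENT_NAME"),
#     [{"role": "user", "content": "Hello!"}],
#     max_tokens=64,
# )
```

Per-call headers are mainly useful when one client talks to several deployments behind the same endpoint.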

Chat completion

completion = openai_client.chat.completions.create(
    model="HuggingFaceTB/SmolLM3-3B",
    messages=[
        {
            "role": "system",
            "content": "You are an assistant that responds like a pirate.",
        },
        {
            "role": "user",
            "content": "Give me a brief explanation of gravity in simple terms.",
        },
    ],
    max_tokens=128,
)
print(completion)
# ChatCompletion(id='chatcmpl-74f6852e28', choices=[Choice(finish_reason='length', index=0, logprobs=None, message=ChatCompletionMessage(content="<think>\nOkay, the user wants a simple explanation of gravity. Let me start by recalling what I know. Gravity is the force that pulls objects towards each other. But how to explain that simply?\n\nMaybe start with a common example, like how you fall when you jump. That's gravity pulling you down. But wait, I should mention that it's not just on Earth. The moon orbits the Earth because of gravity too. But how to make that easy to understand?\n\nI need to avoid technical terms. Maybe use metaphors. Like comparing gravity to a magnet, but not exactly. Or think of it as a stretchy rope pulling", refusal=None, role='assistant', annotations=[], audio=None, function_call=None, tool_calls=None))], created=1753178803, model='HuggingFaceTB/SmolLM3-3B', object='chat.completion', service_tier='default', system_fingerprint='1a28be5c-df18-4e97-822f-118bf57374c8', usage=CompletionUsage(completion_tokens=128, prompt_tokens=66, total_tokens=194, completion_tokens_details=CompletionTokensDetails(accepted_prediction_tokens=0, audio_tokens=0, reasoning_tokens=0, rejected_prediction_tokens=0), prompt_tokens_details=PromptTokensDetails(audio_tokens=0, cached_tokens=0), reasoning_tokens=0))

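Extended thinking mode

By default, `SmolLM3-3B` runs with extended thinking enabled, which is why the example above produces output that includes a reasoning trace.

To enable or disable it explicitly, include `/think` or `/no_think`, respectively, in the system prompt.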

completion = openai_client.chat.completions.create(
    model="HuggingFaceTB/SmolLM3-3B",
    messages=[
        {
            "role": "system",
            "content": "/no_think You are an assistant that responds like a pirate.",
        },
        {
            "role": "user",
            "content": "Give me a brief explanation of gravity in simple terms.",
        },
    ],
    max_tokens=128,
)
print(completion)
# ChatCompletion(id='chatcmpl-776e84a272', choices=[Choice(finish_reason='length', index=0, logprobs=None, message=ChatCompletionMessage(content="Arr matey! Ye be askin' about gravity, the mighty force that keeps us swabbin' the decks and not floatin' off into the vast blue yonder. Gravity be the pull o' the Earth, a mighty force that keeps us grounded and keeps the stars in their place. It's like a giant invisible hand that pulls us towards the center o' the Earth, makin' sure we don't float off into space. It's what makes the apples fall from the tree and the moon orbit 'round the Earth. So, gravity be the force that keeps us all tied to this fine planet we call home.", refusal=None, role='assistant', annotations=[], audio=None, function_call=None, tool_calls=None))], created=1753178805, model='HuggingFaceTB/SmolLM3-3B', object='chat.completion', service_tier='default', system_fingerprint='d644cb1c-84d6-49ae-b790-ac6011851042', usage=CompletionUsage(completion_tokens=128, prompt_tokens=72, total_tokens=200, completion_tokens_details=CompletionTokensDetails(accepted_prediction_tokens=0, audio_tokens=0, reasoning_tokens=0, rejected_prediction_tokens=0), prompt_tokens_details=PromptTokensDetails(audio_tokens=0, cached_tokens=0), reasoning_tokens=0))
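
When thinking is enabled, the sample output shows the reasoning trace inline in the message content, wrapped in `<think>` tags. A small parser can separate the trace from the final answer. This is a sketch under that assumption; `split_reasoning` is a hypothetical helper, and it treats a missing closing tag (for example, when generation stops at `max_tokens`) as a trace with no answer yet.

```python
def split_reasoning(content):
    """Split message content into (reasoning, answer).

    Hypothetical helper; assumes the trace, if present, is wrapped in
    <think>...</think> at the start of the content.
    """
    open_tag, close_tag = "<think>", "</think>"
    text = content.lstrip()
    if not text.startswith(open_tag):
        # No trace at all, e.g. when /no_think is used.
        return "", content.strip()
    body = text[len(open_tag):]
    reasoning, found, answer = body.partition(close_tag)
    if not found:
        # Generation was cut off before the trace was closed.
        return reasoning.strip(), ""
    return reasoning.strip(), answer.strip()


print(split_reasoning("<think>\nKeep it simple.\n</think>\nArr, gravity pulls ye down!"))
# ('Keep it simple.', 'Arr, gravity pulls ye down!')
```

With the first chat completion example above, which ended with `finish_reason='length'`, the answer part would be empty because the trace was never closed.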

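Multilingual capabilities

As mentioned earlier, `SmolLM3-3B` was trained to natively support 6 languages: English, French, Spanish, German, Italian, and Portuguese. You can therefore take advantage of its multilingual capabilities by sending requests in any of those languages.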

completion = openai_client.chat.completions.create(
    model="HuggingFaceTB/SmolLM3-3B",
    messages=[
        {
            "role": "system",
            "content": "/no_think You are an expert translator.",
        },
        {
            "role": "user",
            "content": "Translate the following English sentence into both Spanish and German: 'The brown cat sat on the mat.'",
        },
    ],
    max_tokens=128,
)
print(completion)
# ChatCompletion(id='chatcmpl-da6188629f', choices=[Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content="The translation of the English sentence 'The brown cat sat on the mat.' into Spanish is: 'El gato marrón se sentó en el tapete.'\n\nThe translation of the English sentence 'The brown cat sat on the mat.' into German is: 'Der braune Katze saß auf dem Teppich.'", refusal=None, role='assistant', annotations=[], audio=None, function_call=None, tool_calls=None))], created=1753178807, model='HuggingFaceTB/SmolLM3-3B', object='chat.completion', service_tier='default', system_fingerprint='054f8a76-4e8c-4a2f-90eb-31f0e802916c', usage=CompletionUsage(completion_tokens=68, prompt_tokens=77, total_tokens=145, completion_tokens_details=CompletionTokensDetails(accepted_prediction_tokens=0, audio_tokens=0, reasoning_tokens=0, rejected_prediction_tokens=0), prompt_tokens_details=PromptTokensDetails(audio_tokens=0, cached_tokens=0), reasoning_tokens=0))

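Agentic use cases and tool calling

SmolLM3-3B has tool-calling capabilities, meaning that you can provide one or more tools for the LLM to use.

To prevent an incomplete `tool_call`, you may want to unset `max_completion_tokens` (formerly `max_tokens`) or set it to a value large enough that the model does not stop generating tokens because of the length limit before the `tool_call` is complete.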

completion = openai_client.chat.completions.create(
    model="HuggingFaceTB/SmolLM3-3B",
    messages=[{"role": "user", "content": "What is the weather like in New York?"}],
    tools=[
        {
            "type": "function",
            "function": {
                "name": "get_weather",
                "description": "Get the current weather in a given location",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "location": {
                            "type": "string",
                            "description": "The city and state, e.g. San Francisco, CA",
                        },
                        "unit": {
                            "type": "string",
                            "enum": ["celsius", "fahrenheit"],
                            "description": "The unit of temperature",
                        },
                    },
                    "required": ["location"],
                },
            },
        }
    ],
    tool_choice="auto",
    max_completion_tokens=256,
)
print(completion)
# ChatCompletion(id='chatcmpl-c36090e6b5', choices=[Choice(finish_reason='tool_calls', index=0, logprobs=None, message=ChatCompletionMessage(content='<think>I need to retrieve the current weather information for New York, so I\'ll use the get_weather function with the location set to \'New York\' and the unit set to \'fahrenheit\'.</think>\n<tool_call>{"name": "get_weather", "arguments": {"location": "New York", "unit": "fahrenheit"}}</tool_call>', refusal=None, role='assistant', annotations=[], audio=None, function_call=None, tool_calls=[ChatCompletionMessageToolCall(id='call-5d5eb71a', function=Function(arguments='{"location": "New York", "unit": "fahrenheit"}', name='get_weather'), type='function')]))], created=1753178808, model='HuggingFaceTB/SmolLM3-3B', object='chat.completion', service_tier='default', system_fingerprint='5e58b305-773c-40b6-900b-fe5b177aeab9', usage=CompletionUsage(completion_tokens=68, prompt_tokens=442, total_tokens=510, completion_tokens_details=CompletionTokensDetails(accepted_prediction_tokens=0, audio_tokens=0, reasoning_tokens=0, rejected_prediction_tokens=0), prompt_tokens_details=PromptTokensDetails(audio_tokens=0, cached_tokens=0), reasoning_tokens=0))
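
The response above only requests a tool call; running the tool is up to the caller. The sketch below shows one way to dispatch the parsed `tool_calls` to local functions and build the `tool` messages for a follow-up request. Everything here is illustrative: `get_weather` is a stub that returns placeholder values rather than real weather data, and `TOOLS` and `run_tool_calls` are hypothetical names.

```python
import json


def get_weather(location, unit="celsius"):
    """Stub tool: returns placeholder values, not real weather data."""
    return {"location": location, "unit": unit, "temperature": "unknown"}


# Maps the tool names declared in `tools` to local Python callables.
TOOLS = {"get_weather": get_weather}


def run_tool_calls(message):
    """Execute each requested tool call and return OpenAI-style tool messages."""
    results = []
    for call in message.tool_calls or []:
        func = TOOLS[call.function.name]
        # Arguments arrive as a JSON-encoded string.
        arguments = json.loads(call.function.arguments)
        results.append(
            {
                "role": "tool",
                "tool_call_id": call.id,
                "content": json.dumps(func(**arguments)),
            }
        )
    return results


# Usage with the completion returned above:
# tool_messages = run_tool_calls(completion.choices[0].message)
```

The resulting messages would typically be appended to the conversation, after the assistant message that requested the calls, and sent in a second `chat.completions.create` request; whether the deployed backend handles that follow-up as expected is an assumption worth verifying.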

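Release resources

Once you are done using the Azure AI endpoint/deployment, you can delete the resources as follows. This stops billing for the instance the model runs on, along with all associated costs.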

client.online_endpoints.begin_delete(name=os.getenv("ENDPOINT_NAME")).result()

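Conclusion

In this example, you learned how to create and configure your Azure account for Azure ML and Azure AI Foundry, how to create a Managed Online Endpoint running an open model from the Hugging Face Collection in the Azure ML / Azure AI Foundry model catalog, how to send inference requests for a range of use cases with the OpenAI SDK, and finally how to stop and release the resources.

If you have any doubts or issues with this example, feel free to open an issue and we will do our best to help!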

📍 Find the complete example on GitHub here!