Neuron Model Cache

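Why use the cache?

Problem: Neuron compilation of a large model takes 30-60 minutes
Solution: Download a pre-compiled model in seconds

The cache system stores compiled Neuron models on the HuggingFace Hub, so your team does not have to recompile them. When you train or load a model, the system automatically checks for a cached version before starting the expensive compilation process.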

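Key benefits

Time savings: download compiled models in seconds instead of spending hours compiling
Team collaboration: share compiled models across team members and instances
Cost reduction: avoid paying for repeated compilation on cloud instances
Automatic operation: works transparently with your existing code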

Quick start

Training

from optimum.neuron import NeuronTrainer

# Cache works automatically - no configuration needed
trainer = NeuronTrainer(model=model, args=training_args)
trainer.train()  # Downloads cached models if available

Inference

from optimum.neuron import NeuronModelForCausalLM

# Cache works automatically 
model = NeuronModelForCausalLM.from_pretrained("model_id")

That's it! For supported model classes, caching works automatically.

Supported models

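Model class	Cache support	Use case	Notes
`NeuronTrainer`	✅ Full	Training	Automatic download and upload during training
`NeuronModelForCausalLM`	✅ Full	Inference	Automatic download for inference
Other `NeuronModelForXXX`	❌ None	Inference	Use a different export mechanism with no cache integration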

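Important limitation: models such as NeuronModelForSequenceClassification and NeuronModelForQuestionAnswering use a different compilation path that is not integrated with the cache system. Only NeuronModelForCausalLM and training workflows support caching.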

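How it works

The cache system operates at two levels to minimize compilation time.

Cache priority (fastest to slowest)

Local cache → instant access from /var/tmp/neuron-compile-cache
Hub cache → download from the HuggingFace Hub in seconds
Compile from scratch → 30-60 minutes for large models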
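The priority order above can be sketched as a simple fallback chain. This is a minimal illustration, not the actual Optimum Neuron implementation: `resolve_artifact`, the `fetch_from_hub` and `compile_fn` callbacks, and the `.neff` file naming are all hypothetical names introduced for this example.

```python
import tempfile
from pathlib import Path
from typing import Callable, Optional, Tuple


def resolve_artifact(
    key: str,
    local_dir: Path,
    fetch_from_hub: Callable[[str], Optional[bytes]],
    compile_fn: Callable[[], bytes],
) -> Tuple[bytes, str]:
    """Return (artifact, source), trying the cheapest source first."""
    local_path = local_dir / f"{key}.neff"  # hypothetical naming scheme

    # 1. Local cache: no network access and no compilation.
    if local_path.exists():
        return local_path.read_bytes(), "local"

    # 2. Hub cache: hypothetical callback that returns None on a miss.
    artifact = fetch_from_hub(key)
    source = "hub"

    # 3. Compile from scratch only when both caches miss.
    if artifact is None:
        artifact = compile_fn()
        source = "compiled"

    # Store the result locally so the next call takes the fast path.
    local_dir.mkdir(parents=True, exist_ok=True)
    local_path.write_bytes(artifact)
    return artifact, source


# Demo with stand-in callbacks and a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    cache_dir = Path(tmp) / "neuron-compile-cache"
    first = resolve_artifact("abc123", cache_dir, lambda key: None, lambda: b"neff-bytes")
    second = resolve_artifact("abc123", cache_dir, lambda key: None, lambda: b"unused")
    third = resolve_artifact("def456", cache_dir, lambda key: b"from-hub", lambda: b"unused")
```

In this sketch the first call misses both caches and compiles, the second call for the same key is served from the local directory, and the third is served by the stand-in Hub callback.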

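Cache contents: the system caches NEFF files (Neuron Executable File Format), which are the compiled binary artifacts that run on Neuron cores, not the original model files.

Cache identification: each cached compilation gets a unique hash based on the following factors:

Model factors: architecture, precision (fp16/bf16), input shapes, task type
Compilation factors: NeuronX compiler version, number of cores, optimization flags
Environment factors: model checkpoint revision, Optimum Neuron version

This means that even a small change to your setup may require recompilation, while an identical configuration will always hit the cache.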
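To see why a small change invalidates a cache entry, here is a minimal sketch of a configuration hash. It is an assumption-laden illustration only: the real hashing scheme used by the cache is not described in this document, and `config_hash` and the field names in `base` are hypothetical.

```python
import hashlib
import json


def config_hash(config: dict) -> str:
    """Hash a configuration dict in a key-order-independent way."""
    # Sorting keys gives one canonical string per configuration.
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Hypothetical configuration covering the factor groups listed above.
base = {
    "architecture": "llama",
    "precision": "bf16",
    "batch_size": 1,
    "sequence_length": 2048,
    "num_cores": 2,
    "compiler_version": "2.12.54.0",
}
reordered = dict(reversed(list(base.items())))  # same values, different key order
changed = {**base, "batch_size": 2}  # one field differs

same_hash = config_hash(base) == config_hash(reordered)
different_hash = config_hash(base) != config_hash(changed)
```

Identical settings map to the same key regardless of how they are written down, while changing a single field such as the batch size produces a different key and therefore a cache miss.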

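Private cache setup

The default public cache (aws-neuron/optimum-neuron-cache) is read-only for users: you can download cached models, but you cannot upload your own compilations. This public cache contains only models compiled by the Optimum team for common configurations.

For most use cases, you will want to create a private cache repository to store your own compiled models.

Why use a private cache?

Upload your compilations: store the models you compile so your team can reuse them
Private models: keep the compilations of proprietary models secure
Team collaboration: share compiled artifacts across team members and CI/CD
Custom configurations: cache models with your specific batch size, sequence length, and so on

Method 1: CLI setup (recommended)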

# Create private cache repository
optimum-cli neuron cache create

# Set as default cache
optimum-cli neuron cache set your-org/your-cache-name

Method 2: Environment variable

# Use for single training run
CUSTOM_CACHE_REPO="your-org/your-cache" python train.py

# Or export for session
export CUSTOM_CACHE_REPO="your-org/your-cache"

Prerequisites

Login: huggingface-cli login
Write access to the cache repository

CLI commands

# Create new cache repository
optimum-cli neuron cache create [-n NAME] [--public]

# Set default cache repository  
optimum-cli neuron cache set REPO_NAME

# Search for cached models
optimum-cli neuron cache lookup MODEL_ID

# Sync local cache with Hub
optimum-cli neuron cache synchronize

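Advanced usage

Using the cache in a training loop

If you do not use the NeuronTrainer class, you can still take advantage of the cache system in a custom training loop. This is useful when you need more control over the training process, or when you are integrating with a custom training framework, and still want to benefit from cached compilations.

When to use this approach

Custom training loops that do not fit the NeuronTrainer pattern
Advanced optimization scenarios that need fine-grained control

Note: for most use cases, NeuronTrainer handles caching automatically and is the recommended approach.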

from optimum.neuron.cache import hub_neuronx_cache, synchronize_hub_cache
from optimum.neuron.cache.entries import SingleModelCacheEntry
from optimum.neuron.cache.training import patch_neuron_cc_wrapper

# Create cache entry
cache_entry = SingleModelCacheEntry(model_id, task, config, neuron_config)

# The NeuronX compiler will use the Hugging Face Hub cache system
with patch_neuron_cc_wrapper():
    # The compiler will check the specified remote cache for pre-compiled NEFF files
    with hub_neuronx_cache(entry=cache_entry, cache_repo_id="my-org/cache"):
        model = training_loop()  # Will use specified cache

# Synchronize local cache with Hub
synchronize_hub_cache(cache_repo_id="my-org/cache")

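Cache lookup

The inference cache includes a registry that lets you search for compatible pre-compiled models before attempting compilation. This is especially useful for inference scenarios where you want to avoid compilation entirely.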

optimum-cli neuron cache lookup meta-llama/Llama-2-7b-chat-hf

Example output

*** 1 entries found in cache ***
task: text-generation
batch_size: 1, sequence_length: 2048
num_cores: 24, precision: fp16
compiler_version: 2.12.54.0
checkpoint_revision: c1b0db933684edbfe29a06fa47eb19cc48025e93

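Important: finding an entry does not guarantee a cache hit. Your exact configuration must match the cached parameters, including the compiler version and the model revision.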
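One way to reason about whether a lookup entry will actually be hit is to compare it field by field with the configuration you plan to use. The sketch below copies the entry from the example output by hand. `mismatched_fields` is a hypothetical helper, not part of optimum-cli, and the "2.13.0.0" compiler version in `wanted` is an invented value used only to show a mismatch.

```python
from typing import Any, Dict, List

# Entry transcribed from the example lookup output above.
cached_entry: Dict[str, Any] = {
    "task": "text-generation",
    "batch_size": 1,
    "sequence_length": 2048,
    "num_cores": 24,
    "precision": "fp16",
    "compiler_version": "2.12.54.0",
    "checkpoint_revision": "c1b0db933684edbfe29a06fa47eb19cc48025e93",
}


def mismatched_fields(entry: Dict[str, Any], wanted: Dict[str, Any]) -> List[str]:
    """Return the sorted names of fields where the wanted config differs from the entry."""
    return sorted(key for key in entry if wanted.get(key) != entry[key])


# Hypothetical target configuration: larger batch and a different compiler.
wanted = {**cached_entry, "batch_size": 4, "compiler_version": "2.13.0.0"}
differences = mismatched_fields(cached_entry, wanted)
```

An empty list suggests the entry matches on every listed parameter. Any name in the list is a parameter that would likely force a recompilation.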

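CI/CD integration

The cache system works seamlessly in automated environments.

Environment variable: use CUSTOM_CACHE_REPO in CI workflows to specify the cache repository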

# In your CI configuration
CUSTOM_CACHE_REPO="your-org/your-cache" python train.py

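Authentication: make sure your CI environment can access your private cache repository

Set the HF_TOKEN environment variable with the appropriate read/write permissions
For GitHub Actions, store it as a repository secret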
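A CI job can fail fast when these variables are missing instead of discovering the problem after a long compilation. The following preflight check is a sketch under stated assumptions: `cache_preflight` is a hypothetical helper, and the repository-name pattern is a loose approximation of the org/repo-name form rather than the Hub's exact naming rules.

```python
import re
from typing import List, Mapping

# Loose approximation of "org/repo-name"; not the Hub's official validation.
REPO_ID_PATTERN = re.compile(r"^[A-Za-z0-9][A-Za-z0-9_.-]*/[A-Za-z0-9][A-Za-z0-9_.-]*$")


def cache_preflight(env: Mapping[str, str]) -> List[str]:
    """Return a list of human-readable problems found in the environment."""
    problems: List[str] = []

    repo = env.get("CUSTOM_CACHE_REPO", "")
    if not repo:
        problems.append("CUSTOM_CACHE_REPO is not set")
    elif not REPO_ID_PATTERN.match(repo):
        problems.append(f"CUSTOM_CACHE_REPO {repo!r} is not in org/repo-name form")

    # An empty token is treated the same as a missing one.
    if not env.get("HF_TOKEN"):
        problems.append("HF_TOKEN is not set")

    return problems


# Example environments; in a real job you would pass os.environ instead.
ok_env = {"CUSTOM_CACHE_REPO": "your-org/your-cache", "HF_TOKEN": "hf_example"}
bad_env = {"CUSTOM_CACHE_REPO": "not a repo id"}
```

In a pipeline, calling `cache_preflight(os.environ)` before `python train.py` and exiting with a non-zero status when the list is non-empty keeps misconfigured jobs from running. The check only covers presence and format, so it cannot confirm that the token has write access.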

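Best practices:

Use separate cache repositories for different environments (dev/staging/prod)
Consider cache repository permissions when setting up automated workflows
Monitor the size of the cache repository in long-running CI workflows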

Troubleshooting

"Cache repository does not exist"

Fix: Check repository name and login status
→ huggingface-cli login
→ Verify repo format: org/repo-name

"Graph will be recompiled"

Cause: No cached model matches your exact configuration
Fix: Use lookup to find compatible configurations
→ optimum-cli neuron cache lookup MODEL_ID

Cache not uploaded during training

Cause: No write permissions to cache repository  
Fix: Verify access and authentication
→ huggingface-cli whoami
→ Check cache repo permissions

Slow downloads

Cause: Large compiled models (GBs) downloading
Fix: Ensure good internet connection
→ Monitor logs for download progress

Clearing a corrupted local cache

rm -rf /var/tmp/neuron-compile-cache/*
