Evaluating Custom Models

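Lighteval lets you evaluate a custom model implementation by creating a model class that inherits from LightevalModel. This is useful when the model you want to evaluate is not directly supported by the standard backends (transformers, vllm, etc.).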

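Creating a Custom Model

Create a Python file containing your custom model implementation. The model must inherit from LightevalModel and implement all required methods.

Here is a basic example: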

from lighteval.models.abstract_model import LightevalModel

class MyCustomModel(LightevalModel):
    def __init__(self, config):
        super().__init__(config)
        # Initialize your model here...

    def greedy_until(self, requests, max_tokens=None, stop_sequences=None):
        # Implement generation logic
        pass

    def loglikelihood(self, requests, log=True):
        # Implement loglikelihood computation
        pass

    def loglikelihood_rolling(self, requests):
        # Implement rolling loglikelihood computation
        pass

    def loglikelihood_single_token(self, requests):
        # Implement single token loglikelihood computation
        pass

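The custom model file should contain exactly one class that inherits from LightevalModel. That class is detected and instantiated automatically when the model is loaded.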
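The one-class rule can be pictured with a small sketch. This is not Lighteval's actual loader: `find_single_subclass`, `Base`, and the `demo` module are hypothetical names used only for illustration, and the sketch simply assumes that detection means "find the single subclass of a base class in a module".

```python
import inspect
import types


def find_single_subclass(module, base):
    """Return the only class in `module` that subclasses `base` (hypothetical helper)."""
    candidates = [
        obj
        for _, obj in inspect.getmembers(module, inspect.isclass)
        if issubclass(obj, base) and obj is not base
    ]
    if len(candidates) != 1:
        raise ValueError(
            f"expected exactly one subclass of {base.__name__}, found {len(candidates)}"
        )
    return candidates[0]


# Stand-in base class and module so the sketch runs without Lighteval installed.
class Base:
    pass


demo = types.ModuleType("demo")
demo.Base = Base
demo.MyModel = type("MyModel", (Base,), {})

found = find_single_subclass(demo, Base)
```

Under this assumption, a file with two subclasses would be ambiguous, which is one plausible reason to keep helper classes that also inherit from LightevalModel out of the model file.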

You can find a complete custom model implementation in examples/custom_models/google_translate_model.py.

Running the Evaluation

You can evaluate your custom model using either the command-line interface or the Python API.

Using the Command Line

lighteval custom \
    "google-translate" \
    "examples/custom_models/google_translate_model.py" \
    "lighteval|wmt20:fr-de|0|0" \
    --max-samples 10

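The command takes three required arguments:

The model name (used for tracking in results and logs)
The path to your model implementation file
The tasks to evaluate (same format as for the other backends)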
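As a rough illustration of the task argument, the sketch below splits a specification string like the one in the command above into its pipe-separated fields. The helper `split_task_spec` and the field labels (`suite`, `task`, `few_shot`, `truncate_few_shot`) are assumptions chosen for readability and are not taken from this page; check the task documentation for the exact format your Lighteval version expects.

```python
def split_task_spec(spec):
    """Split a four-field, pipe-separated task string (hypothetical helper)."""
    fields = spec.split("|")
    if len(fields) != 4:
        raise ValueError(f"expected 4 pipe-separated fields, got {len(fields)}: {spec!r}")
    suite, task, few_shot, truncate = fields
    return {
        "suite": suite,
        "task": task,
        "few_shot": int(few_shot),  # assumed: number of few-shot examples
        "truncate_few_shot": truncate == "1",  # assumed: 0/1 flag
    }


spec = split_task_spec("lighteval|wmt20:fr-de|0|0")
```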

Using the Python API

from lighteval.logging.evaluation_tracker import EvaluationTracker
from lighteval.models.custom.custom_model import CustomModelConfig
from lighteval.pipeline import ParallelismManager, Pipeline, PipelineParameters

# Set up evaluation tracking
evaluation_tracker = EvaluationTracker(
    output_dir="results",
    save_details=True
)

# Configure the pipeline
pipeline_params = PipelineParameters(
    launcher_type=ParallelismManager.CUSTOM,
)

# Configure your custom model
model_config = CustomModelConfig(
    model="my-custom-model",
    model_definition_file_path="path/to/my_model.py"
)

# Create and run the pipeline
pipeline = Pipeline(
    tasks="leaderboard|truthfulqa:mc|0|0",
    pipeline_parameters=pipeline_params,
    evaluation_tracker=evaluation_tracker,
    model_config=model_config
)

pipeline.evaluate()
pipeline.save_and_push_results()

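Required Methods

Your custom model must implement the following core methods:

greedy_until: generates text until a stop sequence or the maximum number of tokens is reached
loglikelihood: computes the log probability of a specific continuation
loglikelihood_rolling: computes rolling log probabilities over a sequence
loglikelihood_single_token: computes the log probability of a single token

See the LightevalModel base class documentation for detailed method signatures and requirements.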
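To make the stopping behaviour described for greedy_until concrete, here is a minimal sketch of the post-processing a wrapper might apply to raw backend output. It makes several assumptions: `truncate_generation` is a hypothetical helper, "tokens" are approximated by whitespace-separated words, and the stop sequence itself is excluded from the result. The request and response objects Lighteval actually passes are not modelled here.

```python
def truncate_generation(text, stop_sequences=None, max_tokens=None):
    """Cut `text` at the earliest stop sequence, then cap its length (hypothetical helper)."""
    cut = len(text)
    for stop in stop_sequences or []:
        idx = text.find(stop)
        if idx != -1:
            cut = min(cut, idx)  # keep the earliest match across all stop sequences
    text = text[:cut]
    if max_tokens is not None:
        # Assumption: whitespace-separated words stand in for real tokens.
        text = " ".join(text.split()[:max_tokens])
    return text


raw = "Paris is the capital.\nQuestion: What is next?"
answer = truncate_generation(raw, stop_sequences=["\n", "Question:"])
short = truncate_generation("a b c d", max_tokens=2)
```

A real implementation would usually count tokens with the model's own tokenizer, or pass the limits to the backend so that generation stops early instead of being trimmed afterwards.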

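Best Practices

Error handling: implement robust error handling in your model methods so that edge cases are handled gracefully.
Batching: consider implementing efficient batching in your model methods to improve performance.
Resource management: manage any resources (for example, API connections or model weights) properly in your model's __init__ and __del__ methods.
Documentation: add clear docstrings to your model class and methods, explaining any specific requirements or limitations.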
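The batching and error-handling advice above can be sketched as a pair of small helpers. Everything here is illustrative: `batched`, `call_with_retry`, `run_in_batches`, and the `flaky_upper` backend are hypothetical names, and retrying with exponential backoff is just one possible policy, not something Lighteval prescribes.

```python
import time


def batched(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def call_with_retry(fn, batch, retries=3, delay=0.0):
    """Call `fn(batch)`, retrying on failure with exponential backoff."""
    for attempt in range(retries):
        try:
            return fn(batch)
        except Exception:  # real code should catch narrower, backend-specific errors
            if attempt == retries - 1:
                raise
            time.sleep(delay * (2 ** attempt))


def run_in_batches(fn, items, batch_size=8, retries=3):
    """Process `items` batch by batch, preserving input order in the output."""
    results = []
    for batch in batched(items, batch_size):
        results.extend(call_with_retry(fn, batch, retries=retries))
    return results


# Hypothetical backend that fails once, then succeeds.
calls = {"n": 0}


def flaky_upper(batch):
    calls["n"] += 1
    if calls["n"] == 1:
        raise RuntimeError("transient failure")
    return [s.upper() for s in batch]


outputs = run_in_batches(flaky_upper, ["a", "b", "c"], batch_size=2)
```

A model method such as greedy_until could route its backend calls through a helper of this kind, so that one transient API failure does not abort the whole evaluation run.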

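Example Use Cases

Custom models are particularly useful in the following scenarios:

Evaluating models accessed through a custom API
Wrapping models that need specialized pre-processing or post-processing
Testing novel model architectures
Evaluating ensemble models
Integrating with external services or tools

For a complete example of a custom model that wraps the Google Translate API, see examples/custom_models/google_translate_model.py.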
