RLOO Trainer

TRL supports training LLMs with REINFORCE Leave-One-Out (RLOO). The idea is that instead of using a value function, RLOO generates K completions for each prompt. For each completion, RLOO uses the mean score of the other K-1 completions as a baseline to compute the advantage. RLOO also models the entire completion as a single action, whereas PPO models each token as an action. Note that REINFORCE / A2C is a special case of PPO when the number of PPO epochs is 1 and the number of mini-batches is 1, which is how we implement RLOO in TRL.
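
As a concrete illustration of the leave-one-out baseline, here is a minimal sketch for a single prompt with K completions (the scores are made-up numbers; the full vectorized batch implementation used in TRL is shown in the Implementation details section below):

import torch

# hypothetical reward-model scores for K = 4 completions of one prompt
scores = torch.tensor([1.0, 2.0, 3.0, 10.0])
K = scores.numel()

# leave-one-out baseline: mean score of the other K - 1 completions
baseline = (scores.sum() - scores) / (K - 1)

# advantage of each completion relative to its baseline
advantages = scores - baseline
print(advantages)  # tensor([-4.0000, -2.6667, -1.3333,  8.0000])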

Get started

To just run the RLOO script to make sure the trainer can run, you can run the following command to train an RLOO model with a dummy reward model.

python examples/scripts/rloo/rloo.py \
    --learning_rate 3e-6 \
    --output_dir models/minimal/rloo \
    --per_device_train_batch_size 64 \
    --gradient_accumulation_steps 1 \
    --total_episodes 10000 \
    --model_name_or_path EleutherAI/pythia-14m \
    --reward_model_path EleutherAI/pythia-14m \
    --missing_eos_penalty 1.0

Explanation of the logged metrics

The logged metrics are as follows. Here is an example of a run tracked on Weights and Biases.

  • eps: Tracks the number of episodes per second.
  • objective/kl: The mean Kullback-Leibler (KL) divergence between the current policy and the reference policy.
  • objective/entropy: The mean entropy of the policy, indicating the randomness of the actions chosen by the policy.
  • objective/non_score_reward: The mean reward from non-score-related sources, basically beta * kl.sum(1), where beta is the KL penalty coefficient and kl is the per-token KL divergence.
  • objective/rlhf_reward: The mean RLHF reward, which is score - non_score_reward (see the sketch after this list).
  • objective/scores: The mean scores returned by the reward model / environment.
  • policy/approxkl_avg: The average approximate KL divergence between consecutive PPO policies. Note that this is not the same as objective/kl.
  • policy/clipfrac_avg: The average fraction of policy updates that are clipped, indicating how often the policy updates are constrained to prevent large changes.
  • loss/policy_avg: The average policy loss, indicating how well the policy is performing.
  • val/clipfrac_avg: The average fraction of value function updates that are clipped, similar to policy/clipfrac_avg, but for the value function.
  • policy/entropy_avg: The average entropy of the policy during training, indicating how diverse the policy's actions are.
  • val/ratio: The mean ratio of the current policy probability to the old policy probability, providing a measure of how much the policy has changed.
  • val/ratio_var: The variance of val/ratio, indicating the variability of policy changes.
  • val/num_eos_tokens: The number of end-of-sequence (EOS) tokens generated, which can indicate the number of complete responses.
  • lr: The current learning rate used by the optimizer.
  • episode: The current global step or episode count in the training process.
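
To make the relationship between the objective/* quantities concrete, here is a minimal sketch that follows the definitions above (the tensor shapes and values are illustrative, not the trainer's internal variables):

import torch

# illustrative per-token KL divergences for two completions of length 3
kl = torch.tensor([[0.10, 0.05, 0.20],
                   [0.02, 0.03, 0.01]])
beta = 0.05                              # KL penalty coefficient (kl_coef)
scores = torch.tensor([1.5, 0.7])        # reward model scores for the two completions

non_score_reward = beta * kl.sum(1)      # objective/non_score_reward, per completion
rlhf_reward = scores - non_score_reward  # objective/rlhf_reward = score - non_score_reward
print(rlhf_reward.mean())                # the logged value is the batch mean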

Cookbook

  • Debugging TIP: objective/rlhf_reward: this is the ultimate objective of RLHF training. If training works as intended, this metric should keep going up.
  • Debugging TIP: val/ratio: this number should float around 1.0, and it gets clipped by --cliprange 0.2 in PPO's surrogate loss. If this ratio is too high, like 2.0 or 1000.0, or too small, like 0.1, it means the updates between consecutive policies are too drastic. You should try to understand why this is happening and try to fix it.
  • Memory TIP: If you are running out of memory, you can try reducing --per_device_train_batch_size or increasing --gradient_accumulation_steps to reduce the memory footprint.
  • Memory TIP: If you have multiple GPUs, you can also run training with DeepSpeed stage 3 to reduce the memory footprint: accelerate launch --config_file examples/accelerate_configs/deepspeed_zero3.yaml
  • Usage TIP: We recommend using the "EOS trick" via --missing_eos_penalty, which subtracts a static scalar penalty from the score of completions that do not end with an EOS token. This can help the model learn to generate more coherent completions.

What is my model doing exactly?

To help you understand what your model is doing, we periodically log some sample completions from the model. In a run tracked on Weights and Biases, the logged completions look like the example below, allowing you to see the model's responses at different stages of training. By default we generate --num_sample_generations 10 during training, but you can customize the number of generations.

In the logs, the sampled generations look like this:

┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━┓
┃ query                           ┃ model response                  ┃ score    ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━┩
│  SUBREDDIT: r/AskReddit         │  I'm in love with a friend, and │ 3.921875 │
│                                 │ I don't know how to get rid of  │          │
│ TITLE: How do you get someone   │ those feelings. I'm             │          │
│ out of your head?               │ desperate.<|endoftext|>[PAD][P… │          │
│                                 │                                 │          │
│ POST: Hi,                       │                                 │          │
│ I'm 22, and I have been with my │                                 │          │
│ girlfriend for 5 years now. We  │                                 │          │
│ recently moved together. We've  │                                 │          │
│ always loved each other         │                                 │          │
│ intensely.                      │                                 │          │
│                                 │                                 │          │
│ Problem, I recently started to  │                                 │          │
│ have feelings for an other      │                                 │          │
│ person (a friend). This person  │                                 │          │
│ has had a boyfriend for now 3   │                                 │          │
│ years, and has absolutely no    │                                 │          │
│ ideas. Those feelings were so   │                                 │          │
│ strong, it was hard to hide     │                                 │          │
│ them. After 2 months of me      │                                 │          │
│ being distant and really sad,   │                                 │          │
│ my girlfriend forced me to say  │                                 │          │
│ what was bothering me. I'm not  │                                 │          │
│ a good liar, and now she knows. │                                 │          │
│                                 │                                 │          │
│ We decided to give us a week    │                                 │          │
│ alone, I went to my parents.    │                                 │          │
│                                 │                                 │          │
│ Now, I'm completely lost. I     │                                 │          │
│ keep on thinking about this     │                                 │          │
│ person, and I hate that. I      │                                 │          │
│ would like for those feelings   │                                 │          │
│ to go away, to leave me alone.  │                                 │          │
│ But I can't.                    │                                 │          │
│                                 │                                 │          │
│ What do I do? It's been 3       │                                 │          │
│ months now, and I'm just        │                                 │          │
│ desperate.                      │                                 │          │
│                                 │                                 │          │
│ TL;DR:                          │                                 │          │
├─────────────────────────────────┼─────────────────────────────────┼──────────┤
│  SUBREDDIT: r/pettyrevenge      │  My mom woke me up with a loud  │ 6.84375  │
│                                 │ TV. I blasted Gangnam Style on  │          │
│ TITLE: So, my mom woke me up    │ repeat, with the bass cranked   │          │
│ with a loud TV.                 │ up as high as it could          │          │
│                                 │ go.<|endoftext|>[PAD][PAD][PAD… │          │
│ POST: She was in her living     │                                 │          │
│ room, watching TV. This was at  │                                 │          │
│ about 8:30 in the morning, and  │                                 │          │
│ she was exercising. She turned  │                                 │          │
│ the TV up extra loud to hear it │                                 │          │
│ over her excercycle, and woke   │                                 │          │
│ me up. I went in there asking   │                                 │          │
│ for her to turn it down. She    │                                 │          │
│ said she didn't have to; I      │                                 │          │
│ explained that I always used    │                                 │          │
│ headphones so she didn't have   │                                 │          │
│ to deal with my noise and that  │                                 │          │
│ she should give me a little     │                                 │          │
│ more respect, given that I paid │                                 │          │
│ rent at the time.               │                                 │          │
│                                 │                                 │          │
│ She disagreed. I went back to   │                                 │          │
│ my room, rather pissed off at   │                                 │          │
│ the lack of equality. I had no  │                                 │          │
│ lock on my door; but I had a    │                                 │          │
│ dresser right next to it, so I  │                                 │          │
│ pulled one of the drawers out   │                                 │          │
│ enough so that it caused the    │                                 │          │
│ door to not be openable. Then,  │                                 │          │
│ I turned my speakers up really  │                                 │          │
│ loud and blasted Gangnam Style  │                                 │          │
│ on repeat, with the bass        │                                 │          │
│ cranked up as high as it could  │                                 │          │
│ go.                             │                                 │          │
│                                 │                                 │          │
│ If you hate Gangnam Style for   │                                 │          │
│ being overplayed, you will see  │                                 │          │
│ why I chose that particular     │                                 │          │
│ song. I personally don't mind   │                                 │          │
│ it. But here's the thing about  │                                 │          │
│ my bass; it vibrates the walls, │                                 │          │
│ making one hell of a lot of     │                                 │          │
│ noise. Needless to say, my mom  │                                 │          │
│ was not pleased and shut off    │                                 │          │
│ the internet. But it was oh so  │                                 │          │
│ worth it.                       │                                 │          │
│                                 │                                 │          │
│ TL;DR:                          │                                 │          │
└─────────────────────────────────┴─────────────────────────────────┴──────────┘

Implementation details

Many of the implementation details of the RLOOTrainer are based on the PPO implementation, which itself is based on The N+ Implementation Details of RLHF with PPO: A Case Study on TL;DR Summarization.

Below is a vectorized advantage calculation for RLOO:

import torch


def test_rloo_reward():
    local_batch_size = 3
    rloo_k = 4
    rlhf_reward = torch.tensor([
        1, 2, 3, # first rlhf reward for three prompts
        2, 3, 4, # second rlhf reward for three prompts
        5, 6, 7, # third rlhf reward for three prompts
        8, 9, 10, # fourth rlhf reward for three prompts
    ]).float() # here we have 3 prompts which have 4 completions each

    advantages = torch.zeros_like(rlhf_reward)
    for i in range(0, len(advantages), local_batch_size):
        other_response_rlhf_rewards = []
        for j in range(0, len(advantages), local_batch_size):
            if i != j:
                other_response_rlhf_rewards.append(rlhf_reward[j : j + local_batch_size])
        advantages[i : i + local_batch_size] = rlhf_reward[i : i + local_batch_size] - torch.stack(other_response_rlhf_rewards).mean(0)
    
    assert abs(1 - (2 + 5 + 8) / 3 - advantages[0].item()) < 1e-6  # first rlhf reward for the first prompt
    assert abs(6 - (3 + 2 + 9) / 3 - advantages[7].item()) < 1e-6  # third rlhf reward for the second prompt

    # Vectorized implementation
    rlhf_reward = rlhf_reward.reshape(rloo_k, local_batch_size)
    baseline = (rlhf_reward.sum(0) - rlhf_reward) / (rloo_k - 1)
    vec_advantages = rlhf_reward - baseline
    torch.testing.assert_close(vec_advantages.flatten(), advantages)

Benchmark experiments

To validate that the RLOO implementation works, we ran experiments with a 1B parameter model. Here are the commands we used to run the experiments. We take the SFT / RM models directly from The N+ Implementation Details of RLHF with PPO: A Case Study on TL;DR Summarization.

accelerate launch --config_file examples/accelerate_configs/deepspeed_zero2.yaml \
    examples/scripts/rloo/rloo_tldr.py \
    --output_dir models/minimal/rloo_tldr \
    --num_ppo_epochs 2 \
    --num_mini_batches 2 \
    --learning_rate 3e-6 \
    --per_device_train_batch_size 8 \
    --gradient_accumulation_steps 8 \
    --total_episodes 1000000 \
    --model_name_or_path EleutherAI/pythia-1b-deduped \
    --sft_model_path cleanrl/EleutherAI_pythia-1b-deduped__sft__tldr \
    --reward_model_path cleanrl/EleutherAI_pythia-1b-deduped__reward__tldr \
    --local_rollout_forward_batch_size 16 \
    --missing_eos_penalty 1.0 \
    --stop_token eos \
    --kl_coef 0.03

Checkpoints and experiment tracking runs are available on the Hugging Face Hub and on Weights and Biases (the trained checkpoint vwxyzjn/rloo_tldr is used in the evaluation below).

For evaluation, we use vLLM to load the checkpoints and GPT-4o mini as a judge model to compare the generated TL;DRs against the reference TL;DRs. For more information on how to use judges, see Judges.

$ python examples/scripts/evals/judge_tldr.py --model_name_or_path cleanrl/EleutherAI_pythia-1b-deduped__sft__tldr --judge_model gpt-4o-mini --num_examples 1000
Model win rate: 33.00%
$ python examples/scripts/evals/judge_tldr.py --model_name_or_path vwxyzjn/rloo_tldr --judge_model gpt-4o-mini --num_examples 1000
Model win rate: 51.20%

The RLOO checkpoint gets a 51.2% preferred rate vs. the 33.0% preferred rate of the SFT checkpoint. This is a good sign that the RLOO training is working as intended.

Metrics

# pip install openrlbenchmark==0.2.1a5
# see https://github.com/openrlbenchmark/openrlbenchmark#get-started for documentation
# to use it, change `?we=huggingface&wpn=trl` to your own project and `?tag=pr-1540` to your own tag
python -m openrlbenchmark.rlops_multi_metrics \
    --filters '?we=huggingface&wpn=trl&xaxis=train/episode&ceik=output_dir&cen=sft_model_path&metrics=train/objective/rlhf_reward&metrics=train/objective/scores&metrics=train/objective/kl&metrics=train/objective/non_score_reward&metrics=train/objective/entropy&metrics=train/policy/approxkl_avg&metrics=train/policy/clipfrac_avg&metrics=train/loss/policy_avg&metrics=train/policy/entropy_avg&metrics=train/val/ratio&metrics=train/val/ratio_var&metrics=train/val/num_eos_tokens&metrics=train/lr&metrics=train/eps' \
        "cleanrl/EleutherAI_pythia-1b-deduped__sft__tldr?tag=pr-1540" \
    --env-ids models/minimal/rloo_tldr \
    --pc.ncols 4 \
    --pc.ncols-legend 1 \
    --pc.xlabel "Episode" \
    --output-filename benchmark/trl/pr-1540/rloo \
    --scan-history

RLOOTrainer

class trl.RLOOTrainer

( config: RLOOConfig tokenizer: PreTrainedTokenizer policy: Module ref_policy: Module reward_model: Module train_dataset: Dataset data_collator: Optional = None eval_dataset: Union = None optimizers: Tuple = (None, None) callbacks: Optional = None )
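
Below is a minimal sketch of wiring the trainer together based on the signature above. The base model, tokenizer setup, and tiny in-memory prompt dataset are placeholders for illustration, and the train dataset is assumed to hold pre-tokenized prompts under an input_ids column:

from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoModelForSequenceClassification, AutoTokenizer
from trl import RLOOConfig, RLOOTrainer

base_model = "EleutherAI/pythia-14m"  # placeholder model
tokenizer = AutoTokenizer.from_pretrained(base_model, padding_side="left")
tokenizer.pad_token = tokenizer.eos_token  # assumption: reuse EOS as the pad token

policy = AutoModelForCausalLM.from_pretrained(base_model)
ref_policy = AutoModelForCausalLM.from_pretrained(base_model)
reward_model = AutoModelForSequenceClassification.from_pretrained(base_model, num_labels=1)
reward_model.config.pad_token_id = tokenizer.pad_token_id

# toy dataset of tokenized prompts
prompts = ["TITLE: How do I learn to cook?\n\nTL;DR:", "TITLE: My cat ignores me.\n\nTL;DR:"] * 8
train_dataset = Dataset.from_dict({"input_ids": [tokenizer(p)["input_ids"] for p in prompts]})

config = RLOOConfig(output_dir="models/minimal/rloo", rloo_k=2, total_episodes=100)
trainer = RLOOTrainer(
    config=config,
    tokenizer=tokenizer,
    policy=policy,
    ref_policy=ref_policy,
    reward_model=reward_model,
    train_dataset=train_dataset,
)
trainer.train()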

RLOOConfig

class trl.RLOOConfig

( output_dir: str overwrite_output_dir: bool = False do_train: bool = False do_eval: bool = False do_predict: bool = False eval_strategy: Union = 'no' prediction_loss_only: bool = False per_device_train_batch_size: int = 8 per_device_eval_batch_size: int = 8 per_gpu_train_batch_size: Optional = None per_gpu_eval_batch_size: Optional = None gradient_accumulation_steps: int = 1 eval_accumulation_steps: Optional = None eval_delay: Optional = 0 torch_empty_cache_steps: Optional = None learning_rate: float = 5e-05 weight_decay: float = 0.0 adam_beta1: float = 0.9 adam_beta2: float = 0.999 adam_epsilon: float = 1e-08 max_grad_norm: float = 1.0 num_train_epochs: float = 3.0 max_steps: int = -1 lr_scheduler_type: Union = 'linear' lr_scheduler_kwargs: Union = <factory> warmup_ratio: float = 0.0 warmup_steps: int = 0 log_level: Optional = 'passive' log_level_replica: Optional = 'warning' log_on_each_node: bool = True logging_dir: Optional = None logging_strategy: Union = 'steps' logging_first_step: bool = False logging_steps: float = 500 logging_nan_inf_filter: bool = True save_strategy: Union = 'steps' save_steps: float = 500 save_total_limit: Optional = None save_safetensors: Optional = True save_on_each_node: bool = False save_only_model: bool = False restore_callback_states_from_checkpoint: bool = False no_cuda: bool = False use_cpu: bool = False use_mps_device: bool = False seed: int = 42 data_seed: Optional = None jit_mode_eval: bool = False use_ipex: bool = False bf16: bool = False fp16: bool = False fp16_opt_level: str = 'O1' half_precision_backend: str = 'auto' bf16_full_eval: bool = False fp16_full_eval: bool = False tf32: Optional = None local_rank: int = -1 ddp_backend: Optional = None tpu_num_cores: Optional = None tpu_metrics_debug: bool = False debug: Union = '' dataloader_drop_last: bool = False eval_steps: Optional = None dataloader_num_workers: int = 0 dataloader_prefetch_factor: Optional = None past_index: int = -1 run_name: Optional = None disable_tqdm: Optional = None remove_unused_columns: Optional = True label_names: Optional = None load_best_model_at_end: Optional = False metric_for_best_model: Optional = None greater_is_better: Optional = None ignore_data_skip: bool = False fsdp: Union = '' fsdp_min_num_params: int = 0 fsdp_config: Union = None fsdp_transformer_layer_cls_to_wrap: Optional = None accelerator_config: Union = None deepspeed: Union = None label_smoothing_factor: float = 0.0 optim: Union = 'adamw_torch' optim_args: Optional = None adafactor: bool = False group_by_length: bool = False length_column_name: Optional = 'length' report_to: Union = None ddp_find_unused_parameters: Optional = None ddp_bucket_cap_mb: Optional = None ddp_broadcast_buffers: Optional = None dataloader_pin_memory: bool = True dataloader_persistent_workers: bool = False skip_memory_metrics: bool = True use_legacy_prediction_loop: bool = False push_to_hub: bool = False resume_from_checkpoint: Optional = None hub_model_id: Optional = None hub_strategy: Union = 'every_save' hub_token: Optional = None hub_private_repo: bool = False hub_always_push: bool = False gradient_checkpointing: bool = False gradient_checkpointing_kwargs: Union = None include_inputs_for_metrics: bool = False include_for_metrics: List = <factory> eval_do_concat_batches: bool = True fp16_backend: str = 'auto' evaluation_strategy: Union = None push_to_hub_model_id: Optional = None push_to_hub_organization: Optional = None push_to_hub_token: Optional = None mp_parameters: str = '' auto_find_batch_size: bool = False 
full_determinism: bool = False torchdynamo: Optional = None ray_scope: Optional = 'last' ddp_timeout: Optional = 1800 torch_compile: bool = False torch_compile_backend: Optional = None torch_compile_mode: Optional = None dispatch_batches: Optional = None split_batches: Optional = None include_tokens_per_second: Optional = False include_num_input_tokens_seen: Optional = False neftune_noise_alpha: Optional = None optim_target_modules: Union = None batch_eval_metrics: bool = False eval_on_start: bool = False use_liger_kernel: Optional = False eval_use_gather_object: Optional = False dataset_num_proc: Optional = None num_mini_batches: int = 1 total_episodes: Optional = None local_rollout_forward_batch_size: int = 64 num_sample_generations: int = 10 response_length: int = 53 stop_token: Optional = None stop_token_id: Optional = None temperature: float = 0.7 missing_eos_penalty: Optional = None sft_model_path: str = 'EleutherAI/pythia-160m' world_size: Optional = None num_total_batches: Optional = None micro_batch_size: Optional = None local_batch_size: Optional = None batch_size: Optional = None local_mini_batch_size: Optional = None mini_batch_size: Optional = None exp_name: str = 'rloo_config' reward_model_path: str = 'EleutherAI/pythia-160m' num_ppo_epochs: int = 4 whiten_rewards: bool = False kl_coef: float = 0.05 cliprange: float = 0.2 rloo_k: int = 2 )

Parameters

  • exp_name (str, optional, defaults to os.path.basename(__file__)[: -len(".py")]) — Name of this experiment.
  • reward_model_path (str, optional, defaults to "EleutherAI/pythia-160m") — Path to the reward model.
  • num_ppo_epochs (int, optional, defaults to 4) — Number of epochs to train.
  • whiten_rewards (bool, optional, defaults to False) — Whether to whiten the rewards.
  • kl_coef (float, optional, defaults to 0.05) — KL coefficient.
  • cliprange (float, optional, defaults to 0.2) — Clip range.
  • rloo_k (int, optional, defaults to 2) — REINFORCE Leave-One-Out (RLOO) number of online samples per prompt.

Configuration class for the RLOOTrainer.

Using HfArgumentParser, we can turn this class into argparse arguments that can be specified on the command line.
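
For example, a minimal sketch of a command-line entry point (the script name below is hypothetical) could look like:

# rloo_cli.py (hypothetical file name)
from transformers import HfArgumentParser
from trl import RLOOConfig

# turn the RLOOConfig dataclass fields into argparse arguments
parser = HfArgumentParser(RLOOConfig)
(config,) = parser.parse_args_into_dataclasses()
print(config.output_dir, config.rloo_k, config.kl_coef)

which could then be invoked as, for instance, python rloo_cli.py --output_dir models/minimal/rloo --rloo_k 4 --kl_coef 0.03.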
