PPOv2 Trainer
TRL supports training LLMs with Proximal Policy Optimization (PPO).
References
Get started
To just run the PPO script and make sure the trainer works, you can train a PPO model with a dummy reward model using the following command.
python examples/scripts/ppo/ppo.py \
    --learning_rate 3e-6 \
    --num_ppo_epochs 1 \
    --num_mini_batches 1 \
    --output_dir models/minimal/ppo \
    --per_device_train_batch_size 64 \
    --gradient_accumulation_steps 1 \
    --total_episodes 10000 \
    --model_name_or_path EleutherAI/pythia-1b-deduped \
    --missing_eos_penalty 1.0
Explanation of the logged metrics
The logged metrics are listed below. Here is an example tracked run at Weights and Biases.
- `eps`: Tracks the number of episodes per second.
- `objective/kl`: The mean Kullback-Leibler (KL) divergence between the current policy and the reference policy.
- `objective/entropy`: The mean entropy of the policy, indicating the randomness of the actions chosen by the policy.
- `objective/non_score_reward`: The mean reward from non-score-related sources, basically `beta * kl.sum(1)`, where `beta` is the KL penalty coefficient and `kl` is the per-token KL divergence.
- `objective/rlhf_reward`: The mean RLHF reward, which is `score - non_score_reward` (see the sketch after this list).
- `objective/scores`: The mean scores returned by the reward model/environment.
- `policy/approxkl_avg`: The mean approximate KL divergence between consecutive PPO policies. Note that this is not the same as `objective/kl`.
- `policy/clipfrac_avg`: The mean fraction of policy updates that are clipped, indicating how often the policy updates are constrained to prevent large changes.
- `loss/policy_avg`: The mean policy loss, indicating how well the policy is performing.
- `loss/value_avg`: The mean value loss, indicating the difference between the predicted value and the actual reward.
- `val/clipfrac_avg`: The mean fraction of value function updates that are clipped, similar to `policy/clipfrac_avg` but for the value function.
- `policy/entropy_avg`: The mean entropy of the policy during training, indicating how diverse the policy's actions are.
- `val/ratio`: The mean ratio of the current policy probabilities to the old policy probabilities, providing a measure of how much the policy has changed.
- `val/ratio_var`: The variance of `val/ratio`, indicating the variability of the policy changes.
- `val/num_eos_tokens`: The number of end-of-sequence (EOS) tokens generated, which can indicate the number of complete responses.
- `lr`: The current learning rate used by the optimizer.
- `episode`: The current global step or episode count in the training process.
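To make the relationship between these objective metrics concrete, here is a minimal numerical sketch based only on the definitions above; the tensor names and shapes are illustrative and do not mirror TRL's internal implementation.
import torch

# Per the definitions above: non_score_reward is beta * kl.sum(1), and the
# RLHF reward is the reward-model score minus that KL penalty term.
# All names and shapes here are illustrative only.
beta = 0.05                              # KL penalty coefficient (kl_coef)
kl = torch.rand(8, 53)                   # per-token KL, shape (batch, response_length)
scores = torch.randn(8)                  # scalar reward-model score per completion

non_score_reward = beta * kl.sum(1)      # objective/non_score_reward (per sequence)
rlhf_reward = scores - non_score_reward  # objective/rlhf_reward = score - non_score_reward

print(rlhf_reward.mean())                # the logged value is the batch mean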
Cookbook
- Debugging TIP: `objective/rlhf_reward`: this is the ultimate objective of RLHF training. If training works as intended, this metric should keep going up.
- Debugging TIP: `val/ratio`: this number should float around 1.0, and it gets clipped by the PPO surrogate loss via `--cliprange 0.2`. If this `ratio` is too high, like 2.0 or 1000.0, or too small, like 0.1, it means the updates between consecutive policies are too drastic. You should try to understand why this is happening and try to fix it.
- Memory TIP: If you are running out of memory, you can try to reduce `--per_device_train_batch_size` or increase `--gradient_accumulation_steps` to reduce the memory footprint.
- Memory TIP: If you have multiple GPUs, you can also run training with DeepSpeed stage 3 to reduce the memory footprint: `accelerate launch --config_file examples/accelerate_configs/deepspeed_zero3.yaml`.
- Usage TIP: We recommend using the "EOS trick" via `--missing_eos_penalty`, which subtracts a static scalar penalty from the score of completions that do not end with an EOS token. This can help the model learn to generate more coherent completions (a minimal sketch follows this list).
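Here is a minimal sketch of the "EOS trick" in isolation, written against the behavior described in the usage tip above; the token ids, scores, and variable names are illustrative placeholders rather than TRL's internal code.
import torch

# Subtract a static scalar penalty from the score of completions that do not
# end with an EOS token (--missing_eos_penalty). Placeholder values throughout.
missing_eos_penalty = 1.0
eos_token_id = 0                                 # placeholder id for illustration

completions = torch.tensor([                     # toy batch of generated token ids
    [12, 57, 33, eos_token_id],                  # ends with EOS -> score kept
    [12, 57, 33, 99],                            # no EOS -> score penalized
])
scores = torch.tensor([3.9, 6.8])                # reward-model scores

ends_with_eos = completions[:, -1] == eos_token_id
scores = torch.where(ends_with_eos, scores, scores - missing_eos_penalty)
print(scores)                                    # tensor([3.9000, 5.8000])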
What is my model doing exactly?
To help you understand what your model is doing, we periodically log some sample completions from the model. Here is an example of a completion. In an example tracked run at Weights and Biases, it looks like the following, allowing you to see the model's responses at different stages of training. By default we generate `--num_sample_generations 10` during training, but you can customize the number of generations.
In the logs the sampled generations look like
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━┓
┃ query ┃ model response ┃ score ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━┩
│ SUBREDDIT: r/AskReddit │ I'm in love with a friend, and │ 3.921875 │
│ │ I don't know how to get rid of │ │
│ TITLE: How do you get someone │ those feelings. I'm │ │
│ out of your head? │ desperate.<|endoftext|>[PAD][P… │ │
│ │ │ │
│ POST: Hi, │ │ │
│ I'm 22, and I have been with my │ │ │
│ girlfriend for 5 years now. We │ │ │
│ recently moved together. We've │ │ │
│ always loved each other │ │ │
│ intensely. │ │ │
│ │ │ │
│ Problem, I recently started to │ │ │
│ have feelings for an other │ │ │
│ person (a friend). This person │ │ │
│ has had a boyfriend for now 3 │ │ │
│ years, and has absolutely no │ │ │
│ ideas. Those feelings were so │ │ │
│ strong, it was hard to hide │ │ │
│ them. After 2 months of me │ │ │
│ being distant and really sad, │ │ │
│ my girlfriend forced me to say │ │ │
│ what was bothering me. I'm not │ │ │
│ a good liar, and now she knows. │ │ │
│ │ │ │
│ We decided to give us a week │ │ │
│ alone, I went to my parents. │ │ │
│ │ │ │
│ Now, I'm completely lost. I │ │ │
│ keep on thinking about this │ │ │
│ person, and I hate that. I │ │ │
│ would like for those feelings │ │ │
│ to go away, to leave me alone. │ │ │
│ But I can't. │ │ │
│ │ │ │
│ What do I do? It's been 3 │ │ │
│ months now, and I'm just │ │ │
│ desperate. │ │ │
│ │ │ │
│ TL;DR: │ │ │
├─────────────────────────────────┼─────────────────────────────────┼──────────┤
│ SUBREDDIT: r/pettyrevenge │ My mom woke me up with a loud │ 6.84375 │
│ │ TV. I blasted Gangnam Style on │ │
│ TITLE: So, my mom woke me up │ repeat, with the bass cranked │ │
│ with a loud TV. │ up as high as it could │ │
│ │ go.<|endoftext|>[PAD][PAD][PAD… │ │
│ POST: She was in her living │ │ │
│ room, watching TV. This was at │ │ │
│ about 8:30 in the morning, and │ │ │
│ she was exercising. She turned │ │ │
│ the TV up extra loud to hear it │ │ │
│ over her excercycle, and woke │ │ │
│ me up. I went in there asking │ │ │
│ for her to turn it down. She │ │ │
│ said she didn't have to; I │ │ │
│ explained that I always used │ │ │
│ headphones so she didn't have │ │ │
│ to deal with my noise and that │ │ │
│ she should give me a little │ │ │
│ more respect, given that I paid │ │ │
│ rent at the time. │ │ │
│ │ │ │
│ She disagreed. I went back to │ │ │
│ my room, rather pissed off at │ │ │
│ the lack of equality. I had no │ │ │
│ lock on my door; but I had a │ │ │
│ dresser right next to it, so I │ │ │
│ pulled one of the drawers out │ │ │
│ enough so that it caused the │ │ │
│ door to not be openable. Then, │ │ │
│ I turned my speakers up really │ │ │
│ loud and blasted Gangnam Style │ │ │
│ on repeat, with the bass │ │ │
│ cranked up as high as it could │ │ │
│ go. │ │ │
│ │ │ │
│ If you hate Gangnam Style for │ │ │
│ being overplayed, you will see │ │ │
│ why I chose that particular │ │ │
│ song. I personally don't mind │ │ │
│ it. But here's the thing about │ │ │
│ my bass; it vibrates the walls, │ │ │
│ making one hell of a lot of │ │ │
│ noise. Needless to say, my mom │ │ │
│ was not pleased and shut off │ │ │
│ the internet. But it was oh so │ │ │
│ worth it. │ │ │
│ │ │ │
│ TL;DR: │ │ │
└─────────────────────────────────┴─────────────────────────────────┴──────────┘
Implementation details
This PPOv2 implementation is based on The N+ Implementation Details of RLHF with PPO: A Case Study on TL;DR Summarization.
Benchmark experiments
To validate the PPO implementation works, we ran experiments on the 1B model. Here are the commands we used to run the experiments. We took the SFT / reward models directly from The N+ Implementation Details of RLHF with PPO: A Case Study on TL;DR Summarization.
accelerate launch --config_file examples/accelerate_configs/deepspeed_zero2.yaml \
examples/scripts/ppo/ppo_tldr.py \
--output_dir models/minimal/ppo_tldr \
--learning_rate 3e-6 \
--per_device_train_batch_size 16 \
--gradient_accumulation_steps 4 \
--total_episodes 1000000 \
--model_name_or_path EleutherAI/pythia-1b-deduped \
--sft_model_path cleanrl/EleutherAI_pythia-1b-deduped__sft__tldr \
--reward_model_path cleanrl/EleutherAI_pythia-1b-deduped__reward__tldr \
--local_rollout_forward_batch_size 16 \
--missing_eos_penalty 1.0 \
--stop_token eos
Checkpoints and experiment tracking are available at:
For evaluation, we use vLLM to load the checkpoints and GPT-4o mini as the judge model to evaluate how well the generated TL;DRs compare to the reference TL;DRs. For more information on how to use judges, see Judges.
$ python examples/scripts/evals/judge_tldr.py --model_name_or_path cleanrl/EleutherAI_pythia-1b-deduped__sft__tldr --judge_model gpt-4o-mini --num_examples 1000
Model win rate: 33.00%
$ python examples/scripts/evals/judge_tldr.py --model_name_or_path vwxyzjn/ppo_tldr --judge_model gpt-4o-mini --num_examples 1000
Model win rate: 64.70%
The PPO checkpoint gets a 64.7% preferred rate vs. the 33.0% preference rate of the SFT checkpoint. This is a good sign that the PPO training is working as intended.
Metrics
# pip install openrlbenchmark==0.2.1a5
# see https://github.com/openrlbenchmark/openrlbenchmark#get-started for documentation
# to use it, change `?we=huggingface&wpn=trl` to your own project and `?tag=pr-1540` to your own tag
python -m openrlbenchmark.rlops_multi_metrics \
--filters '?we=huggingface&wpn=trl&xaxis=train/episode&ceik=output_dir&cen=sft_model_path&metrics=train/objective/rlhf_reward&metrics=train/objective/scores&metrics=train/objective/kl&metrics=train/objective/non_score_reward&metrics=train/objective/entropy&metrics=train/policy/approxkl_avg&metrics=train/policy/clipfrac_avg&metrics=train/loss/policy_avg&metrics=train/loss/value_avg&metrics=train/val/clipfrac_avg&metrics=train/policy/entropy_avg&metrics=train/val/ratio&metrics=train/val/ratio_var&metrics=train/val/num_eos_tokens&metrics=train/lr&metrics=train/eps' \
"cleanrl/EleutherAI_pythia-1b-deduped__sft__tldr?tag=pr-1540" \
--env-ids models/minimal/ppo_tldr \
--pc.ncols 4 \
--pc.ncols-legend 1 \
--pc.xlabel "Episode" \
--output-filename benchmark/trl/pr-1540/ppov2 \
--scan-history
PPOv2Trainer
class trl.PPOv2Trainer
< source >( config: PPOv2Config tokenizer: PreTrainedTokenizer policy: Module ref_policy: Module reward_model: Module train_dataset: Dataset value_model: Optional = None data_collator: Optional = None eval_dataset: Union = None optimizers: Tuple = (None, None) callbacks: Optional = None )
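Based on the signature above, instantiating the trainer directly in Python might look roughly like the sketch below. The concrete model classes, the pad-token handling, and the prompt-only pre-tokenized dataset are assumptions loosely modeled on the example scripts (see examples/scripts/ppo/ppo.py for the reference implementation); the dataset name and column are placeholders.
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoModelForSequenceClassification,
    AutoTokenizer,
)
from trl import PPOv2Config, PPOv2Trainer

# Assumed setup: the policy and reference policy are causal LMs; the reward and
# value models are sequence classifiers with a single output head. Using the raw
# base model as the reward model gives a dummy reward, as in the get-started
# command above; for the benchmark setup you would load the SFT/RM checkpoints
# from the Benchmark experiments section instead.
base = "EleutherAI/pythia-1b-deduped"
tokenizer = AutoTokenizer.from_pretrained(base, padding_side="left")
tokenizer.add_special_tokens({"pad_token": "[PAD]"})  # resize embeddings if the vocab grows
policy = AutoModelForCausalLM.from_pretrained(base)
ref_policy = AutoModelForCausalLM.from_pretrained(base)
reward_model = AutoModelForSequenceClassification.from_pretrained(base, num_labels=1)
value_model = AutoModelForSequenceClassification.from_pretrained(base, num_labels=1)

# The trainer expects a dataset of tokenized prompts (an input_ids column);
# "your-prompt-dataset" and the "prompt" column are placeholders.
dataset = load_dataset("your-prompt-dataset", split="train")
dataset = dataset.map(
    lambda row: tokenizer(row["prompt"]),
    remove_columns=dataset.column_names,
)

config = PPOv2Config(output_dir="models/minimal/ppo", total_episodes=10000)
trainer = PPOv2Trainer(
    config=config,
    tokenizer=tokenizer,
    policy=policy,
    ref_policy=ref_policy,
    reward_model=reward_model,
    train_dataset=dataset,
    value_model=value_model,
)
trainer.train()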
PPOv2Config
class trl.PPOv2Config
< 源代码 >( output_dir: str overwrite_output_dir: bool = False do_train: bool = False do_eval: bool = False do_predict: bool = False eval_strategy: Union = 'no' prediction_loss_only: bool = False per_device_train_batch_size: int = 8 per_device_eval_batch_size: int = 8 per_gpu_train_batch_size: Optional = None per_gpu_eval_batch_size: Optional = None gradient_accumulation_steps: int = 1 eval_accumulation_steps: Optional = None eval_delay: Optional = 0 torch_empty_cache_steps: Optional = None learning_rate: float = 5e-05 weight_decay: float = 0.0 adam_beta1: float = 0.9 adam_beta2: float = 0.999 adam_epsilon: float = 1e-08 max_grad_norm: float = 1.0 num_train_epochs: float = 3.0 max_steps: int = -1 lr_scheduler_type: Union = 'linear' lr_scheduler_kwargs: Union = <factory> warmup_ratio: float = 0.0 warmup_steps: int = 0 log_level: Optional = 'passive' log_level_replica: Optional = 'warning' log_on_each_node: bool = True logging_dir: Optional = None logging_strategy: Union = 'steps' logging_first_step: bool = False logging_steps: float = 500 logging_nan_inf_filter: bool = True save_strategy: Union = 'steps' save_steps: float = 500 save_total_limit: Optional = None save_safetensors: Optional = True save_on_each_node: bool = False save_only_model: bool = False restore_callback_states_from_checkpoint: bool = False no_cuda: bool = False use_cpu: bool = False use_mps_device: bool = False seed: int = 42 data_seed: Optional = None jit_mode_eval: bool = False use_ipex: bool = False bf16: bool = False fp16: bool = False fp16_opt_level: str = 'O1' half_precision_backend: str = 'auto' bf16_full_eval: bool = False fp16_full_eval: bool = False tf32: Optional = None local_rank: int = -1 ddp_backend: Optional = None tpu_num_cores: Optional = None tpu_metrics_debug: bool = False debug: Union = '' dataloader_drop_last: bool = False eval_steps: Optional = None dataloader_num_workers: int = 0 dataloader_prefetch_factor: Optional = None past_index: int = -1 run_name: Optional = None disable_tqdm: Optional = None remove_unused_columns: Optional = True label_names: Optional = None load_best_model_at_end: Optional = False metric_for_best_model: Optional = None greater_is_better: Optional = None ignore_data_skip: bool = False fsdp: Union = '' fsdp_min_num_params: int = 0 fsdp_config: Union = None fsdp_transformer_layer_cls_to_wrap: Optional = None accelerator_config: Union = None deepspeed: Union = None label_smoothing_factor: float = 0.0 optim: Union = 'adamw_torch' optim_args: Optional = None adafactor: bool = False group_by_length: bool = False length_column_name: Optional = 'length' report_to: Union = None ddp_find_unused_parameters: Optional = None ddp_bucket_cap_mb: Optional = None ddp_broadcast_buffers: Optional = None dataloader_pin_memory: bool = True dataloader_persistent_workers: bool = False skip_memory_metrics: bool = True use_legacy_prediction_loop: bool = False push_to_hub: bool = False resume_from_checkpoint: Optional = None hub_model_id: Optional = None hub_strategy: Union = 'every_save' hub_token: Optional = None hub_private_repo: bool = False hub_always_push: bool = False gradient_checkpointing: bool = False gradient_checkpointing_kwargs: Union = None include_inputs_for_metrics: bool = False include_for_metrics: List = <factory> eval_do_concat_batches: bool = True fp16_backend: str = 'auto' evaluation_strategy: Union = None push_to_hub_model_id: Optional = None push_to_hub_organization: Optional = None push_to_hub_token: Optional = None mp_parameters: str = '' auto_find_batch_size: bool = False 
full_determinism: bool = False torchdynamo: Optional = None ray_scope: Optional = 'last' ddp_timeout: Optional = 1800 torch_compile: bool = False torch_compile_backend: Optional = None torch_compile_mode: Optional = None dispatch_batches: Optional = None split_batches: Optional = None include_tokens_per_second: Optional = False include_num_input_tokens_seen: Optional = False neftune_noise_alpha: Optional = None optim_target_modules: Union = None batch_eval_metrics: bool = False eval_on_start: bool = False use_liger_kernel: Optional = False eval_use_gather_object: Optional = False dataset_num_proc: Optional = None num_mini_batches: int = 1 total_episodes: Optional = None local_rollout_forward_batch_size: int = 64 num_sample_generations: int = 10 response_length: int = 53 stop_token: Optional = None stop_token_id: Optional = None temperature: float = 0.7 missing_eos_penalty: Optional = None sft_model_path: str = 'EleutherAI/pythia-160m' world_size: Optional = None num_total_batches: Optional = None micro_batch_size: Optional = None local_batch_size: Optional = None batch_size: Optional = None local_mini_batch_size: Optional = None mini_batch_size: Optional = None exp_name: str = 'ppov2_config' reward_model_path: str = 'EleutherAI/pythia-160m' num_ppo_epochs: int = 4 whiten_rewards: bool = False kl_coef: float = 0.05 cliprange: float = 0.2 vf_coef: float = 0.1 cliprange_value: float = 0.2 gamma: float = 1.0 lam: float = 0.95 )
Parameters
- num_ppo_epochs (`int`, optional, defaults to `4`) — Number of epochs to train.
- whiten_rewards (`bool`, optional, defaults to `False`) — Whether to whiten the rewards.
- kl_coef (`float`, optional, defaults to `0.05`) — KL coefficient.
- cliprange (`float`, optional, defaults to `0.2`) — Clip range.
- vf_coef (`float`, optional, defaults to `0.1`) — Value function coefficient.
- cliprange_value (`float`, optional, defaults to `0.2`) — Clip range for the value function.
- gamma (`float`, optional, defaults to `1.0`) — Discount factor.
- lam (`float`, optional, defaults to `0.95`) — Lambda value for GAE.
Configuration class for the PPOv2Trainer.
Using HfArgumentParser we can turn this class into argparse arguments that can be specified on the command line.
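For example, a minimal launcher (a sketch; the script name and printout are illustrative, not an existing TRL script) can expose every PPOv2Config field as a command-line flag:
# sketch_ppo_config_cli.py — parse a PPOv2Config from the command line.
from transformers import HfArgumentParser

from trl import PPOv2Config

if __name__ == "__main__":
    parser = HfArgumentParser(PPOv2Config)
    (config,) = parser.parse_args_into_dataclasses()
    # Every dataclass field becomes a flag, e.g.:
    #   python sketch_ppo_config_cli.py --output_dir models/minimal/ppo --num_ppo_epochs 1 --kl_coef 0.05
    print(config.output_dir, config.num_ppo_epochs, config.kl_coef)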