Performing gradient accumulation with Accelerate
Gradient accumulation is a technique where you can train on bigger batch sizes than your machine would normally be able to fit into memory. This is done by accumulating gradients over several batches, and only stepping the optimizer after a certain number of batches have been performed. For example, accumulating over 4 batches of size 8 gives the optimizer gradients equivalent to an effective batch size of 32.
While technically standard gradient accumulation code would work fine in a distributed setup, it is not the most efficient method for doing so and you may experience considerable slowdowns!
In this tutorial you will see how to quickly set up gradient accumulation and perform it with the utilities provided in Accelerate, which in total can add as little as one new line of code!
This example will use a very simple PyTorch training loop that performs gradient accumulation every two batches:
device = "cuda"
model.to(device)

gradient_accumulation_steps = 2

for index, batch in enumerate(training_dataloader):
    inputs, targets = batch
    inputs = inputs.to(device)
    targets = targets.to(device)
    outputs = model(inputs)
    loss = loss_function(outputs, targets)
    loss = loss / gradient_accumulation_steps
    loss.backward()
    if (index + 1) % gradient_accumulation_steps == 0:
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()
Converting it to Accelerate
First the code shown earlier will be converted to use Accelerate without the special gradient accumulation helper:
+ from accelerate import Accelerator
+ accelerator = Accelerator()

+ model, optimizer, training_dataloader, scheduler = accelerator.prepare(
+     model, optimizer, training_dataloader, scheduler
+ )

for index, batch in enumerate(training_dataloader):
    inputs, targets = batch
-   inputs = inputs.to(device)
-   targets = targets.to(device)
    outputs = model(inputs)
    loss = loss_function(outputs, targets)
    loss = loss / gradient_accumulation_steps
+   accelerator.backward(loss)
    if (index + 1) % gradient_accumulation_steps == 0:
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()
In its current state, this code is not going to perform gradient accumulation efficiently due to a process called gradient synchronization. Read more about that in the Concepts tutorial!
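For context, the inefficiency comes from each call to backward() triggering an all-reduce of the gradients across processes when training with DistributedDataParallel, even on the batches where the optimizer does not step. A rough sketch of the manual workaround, using Accelerate's no_sync() context manager to skip that communication on the accumulation steps (the names training_dataloader, loss_function, optimizer, and scheduler are assumed to be the same objects as in the loop above):

for index, batch in enumerate(training_dataloader):
    inputs, targets = batch
    outputs = model(inputs)
    loss = loss_function(outputs, targets) / gradient_accumulation_steps
    if (index + 1) % gradient_accumulation_steps != 0:
        # accumulation step: compute gradients locally, skip the cross-process all-reduce
        with accelerator.no_sync(model):
            accelerator.backward(loss)
    else:
        # final step of the cycle: synchronize gradients and update the parameters
        accelerator.backward(loss)
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()

The accumulate() helper introduced below handles this bookkeeping for you, which is why the one-line change is preferable.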
Letting Accelerate handle gradient accumulation
All that is left now is to let Accelerate handle the gradient accumulation for us. To do so you should pass in a gradient_accumulation_steps parameter to Accelerator, dictating the number of steps to perform before each call to step() and how to automatically adjust the loss during the call to backward():
from accelerate import Accelerator
- accelerator = Accelerator()
+ accelerator = Accelerator(gradient_accumulation_steps=2)
Alternatively, you can pass in a gradient_accumulation_plugin parameter to the Accelerator object's __init__, which will allow you to further customize the gradient accumulation behavior. Read more about it in the GradientAccumulationPlugin docs.
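As a minimal sketch of that alternative (assuming the plugin's num_steps and adjust_scheduler fields, which mirror the gradient_accumulation_steps argument and the default scheduler handling):

from accelerate import Accelerator
from accelerate.utils import GradientAccumulationPlugin

# num_steps plays the same role as gradient_accumulation_steps;
# adjust_scheduler controls whether scheduler steps are adjusted for the accumulated batches
plugin = GradientAccumulationPlugin(num_steps=2, adjust_scheduler=True)
accelerator = Accelerator(gradient_accumulation_plugin=plugin)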
From here you can use the accumulate() context manager from inside your training loop to automatically perform the gradient accumulation for you! You just wrap it around the entire training part of your code:
- for index, batch in enumerate(training_dataloader):
+ for batch in training_dataloader:
+     with accelerator.accumulate(model):
          inputs, targets = batch
          outputs = model(inputs)
You can remove all the special checks for the step number and the loss adjustment:
- loss = loss / gradient_accumulation_steps
accelerator.backward(loss)
- if (index+1) % gradient_accumulation_steps == 0:
optimizer.step()
scheduler.step()
optimizer.zero_grad()
As you can see, the Accelerator is able to keep track of the batch number you are on and it will automatically know whether to step through the prepared optimizer and how to adjust the loss.
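If you need to know inside the loop whether the current iteration is the one where gradients are actually synchronized and the optimizer updates (for example, to clip gradients only once per effective batch), you can check accelerator.sync_gradients. A minimal sketch reusing the loop above; the clipping call and the max_norm value are only illustrative:

for batch in training_dataloader:
    with accelerator.accumulate(model):
        inputs, targets = batch
        outputs = model(inputs)
        loss = loss_function(outputs, targets)
        accelerator.backward(loss)
        # True only on the iteration where gradients are synchronized and an update happens
        if accelerator.sync_gradients:
            accelerator.clip_grad_norm_(model.parameters(), max_norm=1.0)
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()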
Typically with gradient accumulation, you would need to adjust the number of steps to reflect the change in total batches you are training on. Accelerate automatically does this for you by default. Behind the scenes we instantiate a GradientAccumulationPlugin configured to do this.
The state.GradientState is synced with the active dataloader being iterated upon. As such, it naively assumes that when we have reached the end of the dataloader everything will sync and a step will be performed. To disable this, set sync_with_dataloader to False in the GradientAccumulationPlugin:
from accelerate import Accelerator
from accelerate.utils import GradientAccumulationPlugin
plugin = GradientAccumulationPlugin(sync_with_dataloader=False)
accelerator = Accelerator(..., gradient_accumulation_plugin=plugin)
The finished code
Below is the finished implementation for performing gradient accumulation with Accelerate:
from accelerate import Accelerator
accelerator = Accelerator(gradient_accumulation_steps=2)
model, optimizer, training_dataloader, scheduler = accelerator.prepare(
    model, optimizer, training_dataloader, scheduler
)

for batch in training_dataloader:
    with accelerator.accumulate(model):
        inputs, targets = batch
        outputs = model(inputs)
        loss = loss_function(outputs, targets)
        accelerator.backward(loss)
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()
It's important that only **one forward/backward** should be done inside the context manager with accelerator.accumulate(model).
To learn more about what magic this wraps around, read the Gradient Synchronization concept guide.
Self-contained example
Here is a self-contained example that you can run to see gradient accumulation in action with Accelerate:
import torch
import copy
from accelerate import Accelerator
from accelerate.utils import set_seed
from torch.utils.data import TensorDataset, DataLoader

# seed
set_seed(0)

# define toy inputs and labels
x = torch.tensor([1., 2., 3., 4., 5., 6., 7., 8.])
y = torch.tensor([2., 4., 6., 8., 10., 12., 14., 16.])
gradient_accumulation_steps = 4
batch_size = len(x) // gradient_accumulation_steps

# define dataset and dataloader
dataset = TensorDataset(x, y)
dataloader = DataLoader(dataset, batch_size=batch_size)

# define model, optimizer and loss function
model = torch.zeros((1, 1), requires_grad=True)
model_clone = copy.deepcopy(model)
criterion = torch.nn.MSELoss()
model_optimizer = torch.optim.SGD([model], lr=0.02)
accelerator = Accelerator(gradient_accumulation_steps=gradient_accumulation_steps)
model, model_optimizer, dataloader = accelerator.prepare(model, model_optimizer, dataloader)
model_clone_optimizer = torch.optim.SGD([model_clone], lr=0.02)
print(f"initial model weight is {model.mean().item():.5f}")
print(f"initial model weight is {model_clone.mean().item():.5f}")

# train the prepared model with gradient accumulation
for i, (inputs, labels) in enumerate(dataloader):
    with accelerator.accumulate(model):
        inputs = inputs.view(-1, 1)
        print(i, inputs.flatten())
        labels = labels.view(-1, 1)
        outputs = inputs @ model
        loss = criterion(outputs, labels)
        accelerator.backward(loss)
        model_optimizer.step()
        model_optimizer.zero_grad()

# train the clone on the full batch in a single step for comparison
loss = criterion(x.view(-1, 1) @ model_clone, y.view(-1, 1))
model_clone_optimizer.zero_grad()
loss.backward()
model_clone_optimizer.step()

print(f"w/ accumulation, the final model weight is {model.mean().item():.5f}")
print(f"w/o accumulation, the final model weight is {model_clone.mean().item():.5f}")
initial model weight is 0.00000
initial model weight is 0.00000
0 tensor([1., 2.])
1 tensor([3., 4.])
2 tensor([5., 6.])
3 tensor([7., 8.])
w/ accumulation, the final model weight is 2.04000
w/o accumulation, the final model weight is 2.04000