Deep RL Course documentation


(Optional) The Policy Gradient Theorem

In this optional section, we will study how we differentiate the objective function that we will use to approximate the policy gradient.

Let's first recap the different formulas:

  1. The objective function:
$$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}[R(\tau)] = \sum_{\tau}P(\tau;\theta)R(\tau)$$
  2. The probability of a trajectory (given that actions come from \(\pi_\theta\)); a short numerical sketch of this formula follows the list:
$$P(\tau;\theta) = \mu(s_0)\prod_{t=0}^{H}P(s_{t+1}|s_{t},a_{t})\,\pi_\theta(a_{t}|s_{t})$$
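To make the trajectory probability concrete, here is a minimal NumPy sketch, assuming a hypothetical toy MDP (the arrays `mu0`, `P`, and `policy` and the sampled trajectory are made up purely for illustration):

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP, for illustration only
mu0 = np.array([0.8, 0.2])                      # initial state distribution mu(s0)
P = np.array([[[0.9, 0.1], [0.2, 0.8]],         # P[s, a, s']: state transition dynamics
              [[0.6, 0.4], [0.3, 0.7]]])
policy = np.array([[0.5, 0.5], [0.1, 0.9]])     # pi_theta[s, a]: action probabilities

def trajectory_probability(states, actions):
    """P(tau; theta) = mu(s0) * prod_t pi_theta(a_t|s_t) * P(s_{t+1}|s_t, a_t)."""
    prob = mu0[states[0]]
    for t, a in enumerate(actions):
        prob *= policy[states[t], a] * P[states[t], a, states[t + 1]]
    return prob

# Example trajectory: s0=0, a0=1, s1=1, a1=0, s2=0
print(trajectory_probability(states=[0, 1, 0], actions=[1, 0]))   # 0.8 * (0.5*0.8) * (0.1*0.6)
```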

So we have:
$$\nabla_\theta J(\theta) = \nabla_\theta \sum_{\tau}P(\tau;\theta)R(\tau)$$

We can rewrite the gradient of the sum as the sum of the gradients:
$$= \sum_{\tau} \nabla_\theta (P(\tau;\theta)R(\tau)) = \sum_{\tau} \nabla_\theta P(\tau;\theta)R(\tau)$$
since \(R(\tau)\) does not depend on \(\theta\).

We then multiply every term in the sum by \(\frac{P(\tau;\theta)}{P(\tau;\theta)}\) (which is possible since it equals 1):
$$= \sum_{\tau} \frac{P(\tau;\theta)}{P(\tau;\theta)}\nabla_\theta P(\tau;\theta)R(\tau)$$

We can simplify further because \(\frac{P(\tau;\theta)}{P(\tau;\theta)}\nabla_\theta P(\tau;\theta) = P(\tau;\theta)\frac{\nabla_\theta P(\tau;\theta)}{P(\tau;\theta)}\).

Thus, we can rewrite the sum as:
$$= \sum_{\tau} P(\tau;\theta) \frac{\nabla_\theta P(\tau;\theta)}{P(\tau;\theta)}R(\tau)$$

We can then use the derivative log trick (also called the likelihood ratio trick or REINFORCE trick), a simple rule in calculus that implies that \(\nabla_x \log f(x) = \frac{\nabla_x f(x)}{f(x)}\).
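As a quick sanity check of this identity, here is a small numerical sketch (the function `f` is arbitrary, chosen only for illustration):

```python
import numpy as np

# Check that d/dx log f(x) == f'(x) / f(x) for an arbitrary positive function f
f = lambda x: x**2 + 1.0
df = lambda x: 2.0 * x                                # analytic derivative of f

x, eps = 1.7, 1e-6
numeric_dlog = (np.log(f(x + eps)) - np.log(f(x - eps))) / (2 * eps)  # finite difference of log f(x)
print(numeric_dlog, df(x) / f(x))                     # both ~ 0.874
```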

So given that we have \(\frac{\nabla_\theta P(\tau;\theta)}{P(\tau;\theta)}\), we transform it into \(\nabla_\theta \log P(\tau|\theta)\).

So this is our likelihood ratio policy gradient:
$$\nabla_\theta J(\theta) = \sum_{\tau} P(\tau;\theta) \nabla_\theta \log P(\tau;\theta) R(\tau)$$

Thanks to this new formula, we can estimate the gradient using trajectory samples (we can approximate the likelihood ratio policy gradient with a sample-based estimate if you prefer):
$$\nabla_\theta J(\theta) = \frac{1}{m} \sum^{m}_{i=1} \nabla_\theta \log P(\tau^{(i)};\theta)R(\tau^{(i)})$$
where each \(\tau^{(i)}\) is a sampled trajectory.
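To see the sample-based estimate in action, here is a minimal NumPy sketch on a hypothetical one-step problem (a Bernoulli "policy" over two actions with made-up rewards), comparing the exact gradient of \(J(\theta)\) with its likelihood-ratio Monte Carlo estimate:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical one-step "policy": a Bernoulli over two actions, with theta = P(action 1)
theta = 0.3
rewards = np.array([1.0, 4.0])                        # R(a=0), R(a=1), made-up values

# Exact gradient: J(theta) = (1 - theta) * R(0) + theta * R(1)  =>  dJ/dtheta = R(1) - R(0)
exact_grad = rewards[1] - rewards[0]

# Likelihood-ratio estimate: (1/m) * sum_i grad_theta log p(a_i; theta) * R(a_i)
m = 100_000
actions = rng.random(m) < theta                       # sample a_i ~ Bernoulli(theta) as booleans
grad_log_p = np.where(actions, 1.0 / theta, -1.0 / (1.0 - theta))
estimate = np.mean(grad_log_p * rewards[actions.astype(int)])

print(exact_grad, estimate)                           # both close to 3.0
```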

But we still have some mathematical work to do here: we need to simplify \(\nabla_\theta \log P(\tau|\theta)\).

We know that:
$$\nabla_\theta \log P(\tau^{(i)};\theta) = \nabla_\theta \log\left[\mu(s_0) \prod_{t=0}^{H} P(s_{t+1}^{(i)}|s_{t}^{(i)}, a_{t}^{(i)})\,\pi_\theta(a_{t}^{(i)}|s_{t}^{(i)})\right]$$

where \(\mu(s_0)\) is the initial state distribution and \(P(s_{t+1}^{(i)}|s_{t}^{(i)}, a_{t}^{(i)})\) is the state transition dynamics of the MDP.

We know that the log of a product is equal to the sum of the logs:
$$\nabla_\theta \log P(\tau^{(i)};\theta)= \nabla_\theta \left[\log \mu(s_0) + \sum\limits_{t=0}^{H}\log P(s_{t+1}^{(i)}|s_{t}^{(i)}, a_{t}^{(i)}) + \sum\limits_{t=0}^{H}\log \pi_\theta(a_{t}^{(i)}|s_{t}^{(i)})\right]$$

We also know that the gradient of a sum is equal to the sum of the gradients:
$$\nabla_\theta \log P(\tau^{(i)};\theta)=\nabla_\theta \log\mu(s_0) + \nabla_\theta \sum\limits_{t=0}^{H} \log P(s_{t+1}^{(i)}|s_{t}^{(i)}, a_{t}^{(i)}) + \nabla_\theta \sum\limits_{t=0}^{H} \log \pi_\theta(a_{t}^{(i)}|s_{t}^{(i)})$$

Since neither the initial state distribution nor the state transition dynamics of the MDP depend on \(\theta\), the derivative of both terms is 0. So we can remove them.

Since \(\nabla_\theta \sum_{t=0}^{H} \log P(s_{t+1}^{(i)}|s_{t}^{(i)}, a_{t}^{(i)}) = 0\) and \(\nabla_\theta \log\mu(s_0) = 0\):
$$\nabla_\theta \log P(\tau^{(i)};\theta) = \nabla_\theta \sum_{t=0}^{H} \log \pi_\theta(a_{t}^{(i)}|s_{t}^{(i)})$$

We can rewrite the gradient of the sum as the sum of the gradients:
$$\nabla_\theta \log P(\tau^{(i)};\theta)= \sum_{t=0}^{H} \nabla_\theta \log \pi_\theta(a_{t}^{(i)}|s_{t}^{(i)})$$
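To illustrate why only the policy terms matter, here is a small PyTorch sketch, assuming a hypothetical tabular softmax policy whose logits play the role of \(\theta\) (the trajectory and the constant dynamics log-probability are made up for illustration):

```python
import torch

# Hypothetical softmax policy over 2 actions in 2 states; the logits play the role of theta
theta = torch.tensor([[0.2, -0.1], [0.5, 0.3]], requires_grad=True)
log_pi = torch.log_softmax(theta, dim=1)              # log pi_theta(a | s)

# A sampled trajectory, plus the (theta-independent) dynamics part of log P(tau; theta)
states, actions = torch.tensor([0, 1, 0]), torch.tensor([1, 0, 1])
log_dynamics = torch.tensor(-2.3)                     # log mu(s0) + sum_t log P(s_{t+1}|s_t, a_t)

# The dynamics term is a constant w.r.t. theta, so it contributes nothing to the gradient:
log_p_tau = log_dynamics + log_pi[states, actions].sum()
log_p_tau.backward()
print(theta.grad)   # identical to the gradient of sum_t log pi_theta(a_t|s_t) alone
```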

So, the final formula for estimating the policy gradient is:
$$\nabla_{\theta} J(\theta) = \hat{g} = \frac{1}{m} \sum^{m}_{i=1} \sum^{H}_{t=0} \nabla_\theta \log \pi_\theta(a^{(i)}_{t} | s_{t}^{(i)})R(\tau^{(i)})$$
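In practice, this estimator is usually computed by building a surrogate loss whose gradient is \(-\hat{g}\) and letting autograd do the differentiation. Below is a minimal PyTorch sketch (not the course's exact implementation; the `log_probs` and `returns` containers are assumed to have been filled while sampling trajectories with the current policy):

```python
import torch

def reinforce_loss(log_probs, returns):
    """Surrogate loss whose gradient is -g_hat.

    log_probs: list of m 1-D tensors, log_probs[i][t] = log pi_theta(a_t^(i) | s_t^(i))
    returns:   list of m floats,      returns[i]      = R(tau^(i))
    """
    per_trajectory = [lp.sum() * R for lp, R in zip(log_probs, returns)]
    # Minimizing this loss performs gradient ascent on J(theta)
    return -torch.stack(per_trajectory).mean()

# Hypothetical usage: while rolling out trajectory i, collect
#   dist = torch.distributions.Categorical(logits=policy_net(state))
#   action = dist.sample(); step_log_probs.append(dist.log_prob(action))
# then log_probs[i] = torch.stack(step_log_probs), and finally:
#   loss = reinforce_loss(log_probs, returns); loss.backward(); optimizer.step()
```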
