An Actor-Critic Algorithm for Sequence Prediction

Recurrent neural networks

RNNs for sequence prediction

In our models, the sequence of vectors is produced by either a bidirectional RNN (Schuster and Paliwal, 1997) or a convolutional encoder (Rush et al., 2015).

3 Actor-Critic for Sequence Prediction

We note that this way of re-writing the gradient of the expected reward is known in RL under the names policy gradient theorem (Sutton et al., 1999) and stochastic actor-critic (Sutton, 1984).
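To make the rewriting concrete, here is a minimal sketch for a single decision step, assuming a softmax policy over three actions with fixed action values. The names `softmax`, `expected_reward`, `policy_gradient`, and `finite_difference` are illustrative and do not come from the paper. It checks numerically that the gradient of the expected reward V = sum_a p(a) Q(a) equals sum_a dp(a)/dtheta * Q(a).

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of floats.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def expected_reward(logits, q_values):
    # V = sum_a p(a) * Q(a) for a one-step policy.
    probs = softmax(logits)
    return sum(p * q for p, q in zip(probs, q_values))

def policy_gradient(logits, q_values):
    # Analytic gradient: dV/dtheta_k = sum_a dp(a)/dtheta_k * Q(a).
    probs = softmax(logits)
    grad = []
    for k in range(len(logits)):
        g = 0.0
        for a in range(len(logits)):
            # d softmax_a / d logit_k = p_a * (1[a == k] - p_k)
            dp = probs[a] * ((1.0 if a == k else 0.0) - probs[k])
            g += dp * q_values[a]
        grad.append(g)
    return grad

def finite_difference(logits, q_values, eps=1e-6):
    # Central-difference estimate of the same gradient, used as a check.
    grad = []
    for k in range(len(logits)):
        up = list(logits)
        up[k] += eps
        down = list(logits)
        down[k] -= eps
        diff = expected_reward(up, q_values) - expected_reward(down, q_values)
        grad.append(diff / (2 * eps))
    return grad

# Hypothetical logits and action values, chosen only for illustration.
logits = [0.2, -0.5, 1.0]
q_values = [1.0, 0.0, 0.5]
analytic = policy_gradient(logits, q_values)
numeric = finite_difference(logits, q_values)
```

In the sequence setting the same sum over actions is taken at every time step and averaged over sampled prefixes; this sketch covers only the single-step case.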

Training the critic

Applying deep RL techniques

Attempts to remove the target network by propagating the gradient through $q_t$ resulted in a lower square error $(\hat{Q}(\hat{y}_t; \hat{Y}_{1 \ldots T}) - q_t)^2$, but the resulting $\hat{Q}$ values proved very unreliable as training signals for the actor.
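The role of the target network can be sketched as follows. This is an illustration under stated assumptions, not the paper's implementation: it assumes the target has the form q_t = r_t + sum_a p'(a) Q'(a), where p' and Q' come from separate, slowly changing copies of the actor and critic, and it treats q_t as a plain constant so that no gradient can flow through it. The function names and the toy numbers are hypothetical.

```python
def critic_targets(rewards, next_probs, next_q):
    # rewards[t]: reward received at step t.
    # next_probs[t], next_q[t]: target-network action probabilities and
    # action values for the step after t (ignored at the final step).
    targets = []
    last = len(rewards) - 1
    for t, r in enumerate(rewards):
        if t == last:
            bootstrap = 0.0  # nothing follows the final step
        else:
            bootstrap = sum(p * q for p, q in zip(next_probs[t], next_q[t]))
        targets.append(r + bootstrap)
    return targets

def critic_loss(q_pred, targets):
    # Sum of squared errors; targets are constants, not functions of the
    # trained critic's weights.
    return sum((q - y) ** 2 for q, y in zip(q_pred, targets))

# Toy two-step episode with two possible actions (made-up values).
rewards = [0.0, 1.0]
next_probs = [[0.5, 0.5], [0.5, 0.5]]
next_q = [[1.0, 0.0], [9.0, 9.0]]
targets = critic_targets(rewards, next_probs, next_q)
loss = critic_loss([0.5, 0.0], targets)
```

Because the targets are computed from separate networks and held fixed, lowering the loss can only happen by moving the critic's predictions toward the targets, not by moving the targets toward the predictions.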

Sampling (p. 5)
To compensate for this, we sample predictions from a delayed actor, whose weights are slowly updated to follow the actor that is actually trained. This is inspired by (Lillicrap et al., 2015), where a delayed actor is used for a similar purpose.
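A minimal sketch of such a slow update, assuming the exponential moving average rule used by Lillicrap et al. (2015), where each delayed weight moves a small fraction tau toward the trained weight. Weights are stored in plain dictionaries here for illustration; `soft_update` and the value of `tau` are hypothetical choices, not taken from the paper.

```python
def soft_update(delayed, trained, tau):
    # delayed <- (1 - tau) * delayed + tau * trained, applied per weight.
    # A small tau makes the delayed copy lag far behind the trained actor.
    for name, value in trained.items():
        delayed[name] = (1.0 - tau) * delayed[name] + tau * value
    return delayed

# Toy weights: the trained actor has moved, the delayed actor follows slowly.
trained_actor = {"w": 1.0, "b": -2.0}
delayed_actor = {"w": 0.0, "b": 0.0}
for _ in range(3):
    soft_update(delayed_actor, trained_actor, tau=0.1)
```

Predictions would then be sampled from the delayed copy while gradient updates are applied only to the trained actor.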

Explanation of the target critic network

Continuous Control with Deep Reinforcement Learning (Lillicrap et al., 2015), arXiv:1509.02971

Last edited: 2017.12.04 00:51:31

