Inverse Reward Design

Dylan Hadfield-Menell, Smitha Milli, Pieter Abbeel, Stuart Russell, Anca Dragan
Department of Electrical Engineering and Computer Science
University of California, Berkeley
Berkeley, CA 94709
{dhm, smilli, pabbeel, russell, anca}@cs.berkeley.edu
Abstract
Autonomous agents optimize the reward function we give them. What they don’t
know is how hard it is for us to design a reward function that actually captures
what we want. When designing the reward, we might think of some specific
training scenarios, and make sure that the reward will lead to the right behavior
in those scenarios. Inevitably, agents encounter new scenarios (e.g., new types of
terrain) where optimizing that same reward may lead to undesired behavior. Our
insight is that reward functions are merely observations about what the designer
actually wants, and that they should be interpreted in the context in which they were
designed. We introduce inverse reward design (IRD) as the problem of inferring the
true objective based on the designed reward and the training MDP. We present
approximate methods for solving IRD problems, and use their solutions to plan
risk-averse behavior in test MDPs. Empirical results suggest that this approach can
help alleviate negative side effects of misspecified reward functions and mitigate
reward hacking.
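
To make the idea of planning risk-averse behavior concrete, the following is a minimal sketch, not the method of this paper. It assumes that rewards are linear in trajectory features, that an approximate IRD solution is available as a handful of sampled weight vectors, and that risk aversion means maximizing the worst-case reward over those samples. The function name, the feature layout, and all numbers are hypothetical.

```python
import numpy as np

def risk_averse_choice(feature_counts, weight_samples):
    """Return the index of the trajectory with the best worst-case reward.

    feature_counts: (n_trajectories, n_features) feature totals per trajectory.
    weight_samples: (n_samples, n_features) plausible true reward weights
                    (assumed here to come from an approximate IRD posterior).
    """
    # rewards[i, j] = reward of trajectory j under sampled weight vector i
    rewards = weight_samples @ feature_counts.T
    # Score each trajectory by its lowest reward across all samples
    worst_case = rewards.min(axis=0)
    return int(np.argmax(worst_case))

# Hypothetical features: [dirt, grass, new terrain unseen in training]
feature_counts = np.array([
    [3.0, 1.0, 0.0],  # trajectory 0: longer path that avoids the new terrain
    [1.0, 0.0, 2.0],  # trajectory 1: shortcut across the new terrain
])

# Hypothetical samples: they agree on the terrain seen in training and
# disagree on the new terrain, which the designed reward never had to price
weight_samples = np.array([
    [-1.0, -2.0, 0.5],
    [-1.0, -2.0, 0.5],
    [-1.0, -2.0, -3.0],
])

best = risk_averse_choice(feature_counts, weight_samples)
# For comparison: the choice that maximizes the average reward over samples
mean_choice = int(np.argmax((weight_samples @ feature_counts.T).mean(axis=0)))
```

In this toy setup the worst-case rule selects the path that stays on familiar terrain, while averaging over the samples would select the shortcut. The gap between the two choices is what uncertainty about the true objective adds to planning.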
