A Collection of Recent DeepMind Papers

Neil Zhu, Jianshu ID Not_GOD, founder & Chief Scientist of University AI, dedicated to advancing the adoption of artificial intelligence worldwide. He sets and executes UAI's medium- and long-term growth strategy and goals, and has led the team to become one of the most professional forces in the AI field.
As an industry leader, he and UAI founded TASA (one of China's earliest AI societies) in 2014, along with the DL Center (a global value network for deep learning knowledge) and AI Growth (an industry think-tank and training program), supplying a steady pipeline of AI talent for China. He has organized and taken part in numerous international AI summits and events of considerable influence, written over 600,000 characters of high-quality AI technical content, and produced the translation of the first introductory book on deep learning, "Neural Networks and Deep Learning"; his content has been widely republished and serialized by professional media outlets and public accounts. He has been invited by top domestic universities to design AI study plans and teach courses on the AI frontier, to positive reviews from students and faculty alike.

Continuous Deep Q-Learning with Model-based Acceleration

http://arxiv.org/pdf/1603.00748v1.pdf

Abstract: Model-free reinforcement learning has been successfully applied to a range of challenging problems, and has recently been extended to handle large neural network policies and value functions. However, the sample complexity of model-free algorithms, particularly when using high-dimensional function approximators, tends to limit their applicability to physical systems. In this paper, we explore algorithms and representations to reduce the sample complexity of deep reinforcement learning for continuous control tasks. We propose two complementary techniques for improving the efficiency of such algorithms. First, we derive a continuous variant of the Q-learning algorithm, which we call normalized advantage functions (NAF), as an alternative to the more commonly used policy gradient and actor-critic methods. The NAF representation allows us to apply Q-learning with experience replay to continuous tasks, and substantially improves performance on a set of simulated robotic control tasks. To further improve the efficiency of our approach, we explore the use of learned models for accelerating model-free reinforcement learning. We show that iteratively refitted local linear models are especially effective for this, and demonstrate substantially faster learning on domains where such models are applicable.
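
The key construction behind NAF is that the advantage term is a quadratic in the action, so the Q-function's maximizer is available in closed form. The numbers below are a minimal hypothetical sketch of that decomposition, not the paper's trained network: a value V(x), a greedy action mu(x), and a Cholesky factor L(x) are just fixed stand-ins for network outputs.

```python
import numpy as np

# Hypothetical per-state network outputs (fixed here for illustration).
V = 1.5                          # state value V(x)
mu = np.array([0.3, -0.7])       # greedy action mu(x)
L = np.array([[1.2, 0.0],
              [0.4, 0.9]])       # lower-triangular factor with positive diagonal
P = L @ L.T                      # positive-definite matrix P(x)

def q_value(u):
    """Q(x, u) = V(x) - 1/2 (u - mu)^T P (u - mu): quadratic in the action."""
    d = u - mu
    return V - 0.5 * d @ P @ d

# The advantage term is non-positive, so Q is maximized exactly at u = mu and
# max_u Q(x, u) = V(x). This is what makes the greedy step of Q-learning
# tractable for continuous action spaces.
print(q_value(mu))                               # equals V = 1.5
print(q_value(mu) >= q_value(np.array([1.0, 1.0])))  # True
```

Because P(x) is built from a Cholesky factor it is positive definite by construction, so no numerical action maximization is ever needed during training.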

Learning functions across many orders of magnitudes

http://arxiv.org/pdf/1602.07714v1.pdf

Abstract: Learning non-linear functions can be hard when the magnitude of the target function is unknown beforehand, as most learning algorithms are not scale invariant. We propose an algorithm to adaptively normalize these targets. This is complementary to recent advances in input normalization. Importantly, the proposed method preserves the unnormalized outputs whenever the normalization is updated to avoid instability caused by non-stationarity. It can be combined with any learning algorithm and any non-linear function approximation, including the important special case of deep learning. We empirically validate the method in supervised learning and reinforcement learning and apply it to learning how to play Atari 2600 games. Previous work on applying deep learning to this domain relied on clipping the rewards to make learning in different games more homogeneous, but this uses the domain-specific knowledge that in these games counting rewards is often almost as informative as summing these. Using our adaptive normalization we can remove this heuristic without diminishing overall performance, and even improve performance on some games, such as Ms. Pac-Man and Centipede, on which previous methods did not perform well.
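
The abstract's central point, that unnormalized outputs are preserved whenever the normalization statistics change, can be sketched in a few lines. Assume a hypothetical linear last layer g(h) = w*h + b producing normalized predictions, with unnormalized prediction f(h) = sigma * g(h) + mu; when (mu, sigma) are updated, the last layer is rescaled so f(h) is unchanged. All numbers here are illustrative, not the paper's.

```python
# Last-layer parameters and current normalization statistics (hypothetical).
w, b = 0.8, 0.1
mu, sigma = 2.0, 5.0
h = 0.6                               # some fixed penultimate-layer feature

f_before = sigma * (w * h + b) + mu   # unnormalized prediction

# New statistics arrive, e.g. from a running estimate of the target scale.
mu_new, sigma_new = 3.5, 8.0
# Rescale the output layer so the unnormalized prediction is unchanged:
w = w * sigma / sigma_new
b = (sigma * b + mu - mu_new) / sigma_new
mu, sigma = mu_new, sigma_new

f_after = sigma * (w * h + b) + mu
print(abs(f_before - f_after) < 1e-12)   # True: outputs preserved
```

Without this rescaling step, every update of the statistics would abruptly change every prediction, which is exactly the non-stationarity instability the abstract mentions.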

Deep Exploration via Bootstrapped DQN

http://arxiv.org/pdf/1602.04621v1.pdf

Abstract: Efficient exploration in complex environments remains a major challenge for reinforcement learning. We propose bootstrapped DQN, a simple algorithm that explores in a computationally and statistically efficient manner through use of randomized value functions. Unlike dithering strategies such as ε-greedy exploration, bootstrapped DQN carries out temporally-extended (or deep) exploration; this can lead to exponentially faster learning. We demonstrate these benefits in complex stochastic MDPs and in the large-scale Arcade Learning Environment. Bootstrapped DQN substantially improves learning times and performance across most Atari games.
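
The "temporally-extended" exploration the abstract contrasts with ε-greedy can be illustrated with a toy tabular setup: K bootstrap heads, each holding its own Q estimate; one head is sampled per episode and followed greedily for the whole episode, so exploration is consistent over time rather than per-step dithering. Everything below (the random Q tables, the toy transition) is a hypothetical stand-in for the trained heads.

```python
import random

K, n_states, n_actions = 5, 4, 3
random.seed(0)
# Hypothetical Q estimates: one table per bootstrap head.
heads = [[[random.random() for _ in range(n_actions)]
          for _ in range(n_states)] for _ in range(K)]

def run_episode(length=4):
    k = random.randrange(K)            # sample ONE head at episode start
    actions, state = [], 0
    for _ in range(length):
        q = heads[k][state]
        a = max(range(n_actions), key=lambda i: q[i])  # greedy w.r.t. head k
        actions.append(a)
        state = (state + 1) % n_states  # toy deterministic transition
    return k, actions

k, actions = run_episode()
print(k, actions)   # the same head drives every step of the episode
```

An ε-greedy agent would instead inject an independent random choice at each step, which cannot commit to a multi-step exploratory plan.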

One-shot Learning with Memory-Augmented Neural Networks

https://arxiv.org/pdf/1605.06065v1.pdf

Abstract: Work on one-shot learning. Traditional approaches require large amounts of data to learn; when new data arrive, the model must inefficiently relearn its parameters to incorporate the new information smoothly. Architectures with augmented memory capacity, such as Neural Turing Machines (NTMs), offer the ability to encode and retrieve new information quickly, and can therefore potentially sidestep the weaknesses of conventional models. This paper shows that memory-augmented neural networks can rapidly assimilate new data and use it to make accurate predictions after only a few samples have been seen. It also introduces a new method for accessing external memory that focuses on memory content, unlike previous approaches that additionally used mechanisms based on memory location to address memory.
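
Purely content-based memory access, of the kind this paper favors over location-based addressing, can be sketched as a softmax over cosine similarities between a query key and each memory row. The memory contents and query below are hypothetical numbers, not the paper's learned representations.

```python
import numpy as np

memory = np.array([[1.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0],
                   [0.7, 0.7, 0.0]])       # stored rows (hypothetical)
key = np.array([0.9, 0.1, 0.0])            # query key emitted by a controller

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)

sims = np.array([cosine(key, row) for row in memory])
w = np.exp(sims) / np.exp(sims).sum()      # softmax attention weights
read = w @ memory                          # retrieved vector: weighted blend

print(w.argmax())   # 0: the row most similar to the key dominates the read
```

Because the weights depend only on content similarity, a newly written row becomes retrievable immediately, with no positional bookkeeping; this is the property that supports rapid one-shot binding of new information.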

Deep Reinforcement Learning with Attention for Slate Markov Decision Processes with High-Dimensional States and Actions

http://arxiv.org/pdf/1512.01124v2.pdf

Abstract: Many real-world problems come with action spaces represented as feature vectors. Although high-dimensional control is a largely unsolved problem, there has recently been progress for modest dimensionalities. Here we report on a successful attempt at addressing problems of dimensionality as high as 2000, of a particular form. Motivated by important applications such as recommendation systems that do not fit the standard reinforcement learning frameworks, we introduce Slate Markov Decision Processes (slate-MDPs).

A slate-MDP is an MDP with a combinatorial action space (tuples of primitive actions from an underlying MDP). The agent does not fully control which action is ultimately selected, and the executed action may not even come from the slate at all; in a recommendation system, for example, every recommendation can be ignored by the user. We use deep Q-learning with feature representations of both states and actions to learn the value of whole slates.

Unlike existing methods, we optimize for both the combinatorial and sequential aspects of our tasks. The new agent's superiority over agents that ignore either the combinatorial or the sequential long-term value aspect is demonstrated on a range of environments with dynamics from a real-world recommendation system. Further, we use deep deterministic policy gradients to learn a policy that, for each position of the slate, guides attention towards the part of the action space in which the value is highest, and we only evaluate actions in this area. The attention is used within a sequentially greedy procedure leveraging submodularity. Finally, we show how introducing risk-seeking can dramatically improve the agent's performance and its ability to discover more far-reaching strategies.
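
The "sequentially greedy procedure leveraging submodularity" amounts to filling slate positions one at a time, each time adding the remaining item with the highest marginal value given what is already on the slate. The value function below is a hypothetical stand-in with diminishing returns (later positions contribute less), which is what makes the greedy construction reasonable.

```python
# Hypothetical per-item base values.
items = {"a": 3.0, "b": 2.0, "c": 2.5, "d": 1.0}

def slate_value(slate):
    # Diminishing returns: each later position contributes half as much,
    # a toy submodular-style value, not the paper's learned Q-function.
    return sum(items[x] * 0.5 ** i for i, x in enumerate(slate))

def greedy_slate(size):
    slate, remaining = [], set(items)
    for _ in range(size):
        # Pick the item with the best marginal gain at the next position.
        best = max(remaining, key=lambda x: slate_value(slate + [x]))
        slate.append(best)
        remaining.remove(best)
    return slate

print(greedy_slate(3))   # ['a', 'c', 'b']
```

In the paper's setting the attention mechanism additionally restricts which candidate actions are scored at each position; here every remaining item is scored for simplicity.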

Increasing the Action Gap: New Operators for Reinforcement Learning

http://arxiv.org/pdf/1512.04860v1.pdf

Abstract: This paper introduces new optimality-preserving operators on Q-functions. We first describe an operator for tabular representations, the consistent Bellman operator, which incorporates a notion of local policy consistency. We show that this local consistency leads to an increase in the action gap at each state; increasing this gap, we argue, mitigates the undesirable effects of approximation and estimation errors on the induced greedy policies. This operator can also be applied to discretized continuous space and time problems, and we provide empirical results evidencing superior performance in this context. Extending the idea of a locally consistent operator, we then derive sufficient conditions for an operator to preserve optimality, leading to a family of operators which includes our consistent Bellman operator. As corollaries we provide a proof of optimality for Baird's advantage learning algorithm and derive other gap-increasing operators with interesting properties. We conclude with an empirical study on 60 Atari 2600 games illustrating the strong potential of these new operators.
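
One member of this gap-increasing family, the advantage learning operator that the paper proves optimality-preserving, is T Q(s,a) = (T_B Q)(s,a) - α (max_b Q(s,b) - Q(s,a)): a standard Bellman backup minus a penalty on non-greedy actions. The two-state deterministic MDP below is a hypothetical example showing the action gap widening under repeated application.

```python
gamma, alpha = 0.9, 0.5
# Toy deterministic MDP: rewards and next states for (state, action) pairs.
R = {(0, 0): 1.0, (0, 1): 0.8, (1, 0): 0.0, (1, 1): 0.5}
nxt = {(0, 0): 1, (0, 1): 1, (1, 0): 0, (1, 1): 0}
Q = {sa: 0.0 for sa in R}

def bellman(Q, s, a):
    s2 = nxt[(s, a)]
    return R[(s, a)] + gamma * max(Q[(s2, b)] for b in (0, 1))

def advantage_update(Q):
    # Bellman backup minus a gap-increasing penalty on non-greedy actions.
    return {(s, a): bellman(Q, s, a)
                    - alpha * (max(Q[(s, b)] for b in (0, 1)) - Q[(s, a)])
            for (s, a) in Q}

Q1 = advantage_update(Q)   # from Q = 0 the penalty term is zero everywhere
Q2 = advantage_update(Q1)
gap1 = abs(Q1[(0, 0)] - Q1[(0, 1)])   # 0.2
gap2 = abs(Q2[(0, 0)] - Q2[(0, 1)])   # 0.3
print(gap1, gap2, gap2 > gap1)        # the action gap at state 0 grows
```

The greedy action at each state is untouched (its penalty is zero), so the induced policy is preserved while suboptimal actions are pushed further below it, which is the robustness-to-error argument in the abstract.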

MuProp: Unbiased Backpropagation for Stochastic Neural Networks

http://arxiv.org/pdf/1511.05176v2.pdf

Abstract: Deep neural networks are powerful parametric models that can be trained efficiently using the backpropagation algorithm. Stochastic neural networks combine the power of large parametric functions with that of graphical models, which makes it possible to learn very complex distributions. However, as backpropagation is not directly applicable to stochastic networks that include discrete sampling operations within their computational graph, training such networks remains difficult. We present MuProp, an unbiased gradient estimator for stochastic networks, designed to make this task easier. MuProp improves on the likelihood-ratio estimator by reducing its variance using a control variate based on the first-order Taylor expansion of a mean-field network. Crucially, unlike prior attempts at using backpropagation for training stochastic networks, the resulting estimator is unbiased and well behaved. Our experiments on structured output prediction and discrete latent variable modeling demonstrate that MuProp yields consistently good performance across a range of difficult tasks.
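
The Taylor-based control variate can be shown on a one-variable toy problem: estimate d/dθ of E_{z~Bernoulli(p)}[f(z)] with p = sigmoid(θ). The first-order expansion of f around the mean-field value μ = p is subtracted inside the likelihood-ratio term, and its known expectation, f'(μ)·dμ/dθ, is added back analytically, so the estimator stays unbiased. The function f is a hypothetical shifted quadratic (where the plain likelihood-ratio estimator has high variance), not anything from the paper's experiments.

```python
import math, random

random.seed(1)
theta = 0.3
p = 1.0 / (1.0 + math.exp(-theta))   # mean-field value mu
dp_dtheta = p * (1.0 - p)

f = lambda z: (z - 0.4) ** 2 + 10.0  # hypothetical objective (large offset)
df = lambda z: 2.0 * (z - 0.4)       # its derivative

def estimates(n=20000):
    lr, muprop = [], []
    for _ in range(n):
        z = 1.0 if random.random() < p else 0.0
        dlogp = z - p                # d log p(z) / dtheta for a Bernoulli
        lr.append(f(z) * dlogp)      # plain likelihood-ratio estimator
        # Subtract the first-order Taylor expansion of f around mu = p,
        # then add its expectation back deterministically:
        muprop.append((f(z) - f(p) - df(p) * (z - p)) * dlogp
                      + df(p) * dp_dtheta)
    return lr, muprop

def var(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

lr, mp = estimates()
print(var(mp) < var(lr))   # True: the control variate slashes the variance
```

Both estimators have the same expectation (here 0.2 · dμ/dθ), but the constant offset in f, which inflates the plain likelihood-ratio estimator's variance, cancels entirely inside the MuProp residual.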

Policy Distillation

http://arxiv.org/pdf/1511.06295.pdf

