Pointer Networks
Oriol Vinyals, Meire Fortunato, Navdeep Jaitly
Google, Berkeley
NIPS 2015
Introduction
The key motivation and contribution of this work is a pointer network framework which can solve discrete mapping problems with different dictionary size across instances.
Our model solves the problem of variable size output dictionaries using a recently proposed mechanism of neural attention.
The method is to
use attention as a pointer to select a member of the input sequence as the output.
This work provides a new approach to solve discrete optimization problems with sequential model.
Problem Setup
This paper focuses on solving a specific type of seqence-to-sequence task in a supervised learning approach. In the training data, the inputs are planar point sets with elements each, where are the cartesian coordinates of the points. The training instances are sampled from a uniform distribution in . The outputs are sequences of point indices representing the solution associated to the point set .
Models
Sequence-to-Sequence Model
This model learns the parameters of an encoder-decoder to maximize the conditional probabilitity of the output sequence on the training samples:
where
Note that this model makes no statistical independence assumptions, thus it is not a Markov chain.
During inference, as the number of possible suquences grows exponentially with the sequence length, beam search is utilized to find the best possible sequence.
Notice that in the standard sequence-to-sequence model, the output dictionary size for all symbols is fixed and equal to , which means that the model trained for a particular sequence length can not generalize to other sequences with different length.
Content Based Input Attention
The vanilla sequence-to-sequence model only uses the final state of the encoder to represent the whole input sequence, which constrains the amount of information and computation that can flow through to the decoder. The attention model augments the decoder RNNs with an attention module over the encoder states to provide further information:
where and are encoder state and decoder state respectively. and are concatenated and used as the hidden state for prediction and input to the next time step.
Pointer Network
The pointer network simplifies the attention mechanism by normalizing the vector to be an output distribution over the dictionary of inputs, which guarantees that the dictionary size is always consistent with the input dictionary size:
Experiments
The paper experiments on three different problems, i.e. convex hull, Delaunay triangulation and TSP, all relating to finding a solution with respect to a discrete input sequence.
(The output is actually a cycle or set in these problems, which means that any point in the solution can be the start point in the decoder sequence. RNN actually can't reflect this property, and the authors had to artificially define a start point of the output sequence in the experimental setup. )
Convex Hull
- Instance representation: The elements are indices between 1 and n corresponding to positions in the sequence . To represent the output as a sequence, start from the point with the lowest index, and go counter-clockwise.
Delaunay Triangulation
- Instance representation: The outputs are the corresponding sequences representing the triangulation of the point set . Each is a triple of integers from 1 to n corresponding to the position of triangle vertices in .
TSP
- Instance representation: For consistency, in the training dataset, always start in the first city without loss of generality.
Generally speaking, the pointer network can work across different sequence length, and perform relatively well for small instance size. However, as it uses 1M training instances for each task, and all of them are uniformly sampled in an unit square, I doubt that it is fitting instead of actually learning.