Stacked Capsule Autoencoders

By Adam R. Kosiorek, Sara Sabour, Yee Whye Teh, Geoffrey E. Hinton
Original paper: https://arxiv.org/abs/1906.06818

Abstract

An object can be seen as a geometrically organized set of interrelated parts. A system that makes explicit use of these geometric relationships to recognize objects should be naturally robust to changes in viewpoint, because the intrinsic geometric relationships are viewpoint-invariant.

We describe an unsupervised version of capsule networks, in which a neural encoder, which looks at all of the parts, is used to infer the presence and poses of object capsules. The encoder is trained by backpropagating through a decoder, which predicts the pose of each already discovered part using a mixture of pose predictions.

The parts are discovered directly from an image, in a similar manner, by using a neural encoder, which infers parts and their affine transformations. The corresponding decoder models each image pixel as a mixture of predictions made by affine-transformed parts.
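
The pixel-level mixture described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the paper's implementation: the part templates are taken as already affine-transformed, each mixture component is a Gaussian with a shared standard deviation, and each part has a single mixing weight for the whole image. The names `pixel_log_likelihood`, `image_log_likelihood`, `presences` and `std` are hypothetical.

```python
import math

def gaussian_pdf(x, mean, std):
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def pixel_log_likelihood(pixel, predictions, weights, std=0.1):
    """Log-likelihood of one pixel under a mixture of per-part predictions."""
    total = sum(weights)
    mix = [w / total for w in weights]  # normalise the mixing weights
    density = sum(p * gaussian_pdf(pixel, mu, std)
                  for p, mu in zip(mix, predictions))
    return math.log(density)

def image_log_likelihood(image, transformed_templates, presences, std=0.1):
    """Sum of pixel log-likelihoods; each part adds one component per pixel."""
    ll = 0.0
    for i, row in enumerate(image):
        for j, pixel in enumerate(row):
            predictions = [t[i][j] for t in transformed_templates]
            ll += pixel_log_likelihood(pixel, predictions, presences, std)
    return ll

# Two hypothetical 2x2 templates, assumed to be already affine-transformed.
template_a = [[1.0, 0.0], [0.0, 0.0]]
template_b = [[0.0, 0.0], [0.0, 1.0]]
presences = [0.5, 0.5]  # assumed mixing weights, one per part

explained = [[1.0, 0.0], [0.0, 1.0]]    # every pixel matches some template
unexplained = [[0.0, 1.0], [1.0, 0.0]]  # bright pixels match no template

ll_explained = image_log_likelihood(explained, [template_a, template_b], presences)
ll_unexplained = image_log_likelihood(unexplained, [template_a, template_b], presences)
```

An image whose pixels are each predicted well by at least one part scores a higher log-likelihood than one whose bright pixels no part predicts, which is the signal a decoder of this kind would be trained on.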

We learn object capsules and their part capsules on unlabeled data, and then cluster the vectors of presences of object capsules. When told the names of these clusters, we achieve state-of-the-art results for unsupervised classification on SVHN (55%) and near state-of-the-art on MNIST (98.5%).

1 Introduction

Convolutional neural networks (CNNs) work better than networks without weight-sharing because of their inductive bias: if a local feature is useful in one image location, the same feature is likely to be useful in other locations. It is tempting to exploit other effects of viewpoint changes by replicating features across scale, orientation and other affine degrees of freedom, but this quickly leads to cumbersome high-dimensional feature maps.
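
The weight-sharing bias can be seen in a one-dimensional toy case: because the same kernel is applied at every position, shifting the input shifts the response by the same amount. The sketch below is illustrative only; the signal, the kernel and the helper names `conv1d` and `shift_right` are assumptions made for this example.

```python
def conv1d(signal, kernel):
    """Valid cross-correlation: the same kernel weights are reused everywhere."""
    k = len(kernel)
    return [
        sum(kernel[j] * signal[i + j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]

def shift_right(values, steps):
    """Shift a list right by `steps`, padding with zeros on the left."""
    return [0] * steps + values[: len(values) - steps]

signal = [0, 0, 1, 2, 1, 0, 0, 0, 0]
kernel = [1, 0, -1]  # hypothetical edge-like feature detector

response = conv1d(signal, kernel)
shifted_response = conv1d(shift_right(signal, 2), kernel)

# Translating the input translates the feature response by the same amount.
is_equivariant = shifted_response == shift_right(response, 2)
```

The same trick does not come for free with rotation or scale: covering those would need a separate copy of the feature for every orientation and size, which is the blow-up in feature maps that the paragraph above warns about.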

An alternative to replicating features across the non-translational degrees of freedom is to explicitly learn transformations between the natural coordinate frame of a whole object and the natural coordinate frames of each of its parts. Computer graphics relies on such object→part coordinate transformations to represent the geometry of an object in a viewpoint-invariant manner. Moreover, there is strong evidence that, unlike standard CNNs, human vision also relies on coordinate frames: imposing an unfamiliar coordinate frame on a familiar object makes it difficult to recognize the object or its geometry (Rock, 1973; Hinton, 1979).
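
A small numerical sketch shows why such coordinate transformations are viewpoint-invariant. It assumes 2D poses written as 3x3 homogeneous matrices that map local coordinates into the parent frame, so a part's pose relative to the viewer is the object's pose multiplied by the object-part transform. The helper names and all numeric values are hypothetical.

```python
import math

def affine(theta=0.0, scale=1.0, tx=0.0, ty=0.0):
    """3x3 homogeneous matrix: rotate by theta, scale uniformly, translate."""
    c, s = scale * math.cos(theta), scale * math.sin(theta)
    return [[c, -s, tx], [s, c, ty], [0.0, 0.0, 1.0]]

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def inverse(m):
    """Invert a 2D affine matrix whose last row is [0, 0, 1]."""
    (a, b, tx), (c, d, ty), _ = m
    det = a * d - b * c
    ia, ib, ic, idd = d / det, -b / det, -c / det, a / det
    return [[ia, ib, -(ia * tx + ib * ty)],
            [ic, idd, -(ic * tx + idd * ty)],
            [0.0, 0.0, 1.0]]

def close(a, b, tol=1e-9):
    return all(abs(x - y) < tol for ra, rb in zip(a, b) for x, y in zip(ra, rb))

# Fixed object->part transform (hypothetical: a small part offset from the centre).
object_to_part = affine(scale=0.3, tx=1.0, ty=-0.5)

# The same object seen from two different viewpoints.
object_pose_a = affine(theta=0.2, scale=1.0, tx=5.0, ty=2.0)
object_pose_b = affine(theta=-1.1, scale=2.0, tx=-3.0, ty=4.0)

# The part's pose relative to the viewer changes with the viewpoint...
part_pose_a = matmul(object_pose_a, object_to_part)
part_pose_b = matmul(object_pose_b, object_to_part)

# ...but the part's pose expressed in the object's frame does not.
relative_a = matmul(inverse(object_pose_a), part_pose_a)
relative_b = matmul(inverse(object_pose_b), part_pose_b)
```

Both `relative_a` and `relative_b` recover `object_to_part` up to floating-point error, even though `part_pose_a` and `part_pose_b` differ.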

A neural system can learn to reason about transformations between objects, their parts and the viewer, but each kind of transformation is likely to require a different representation. An object-part-relationship (OP) is viewpoint-invariant and is naturally coded by learned weights. The relationship of an object or part to the viewer changes with the viewpoint (it is viewpoint-equivariant) and is naturally coded using neural activations. With this representation, the pose of a single object is represented by its relationship to the viewer. Consequently, representing a single object does not necessitate replicating neural activations across space, unlike in CNNs. It is only processing two (or more) different instances of the same type of object in parallel that requires spatial replicas of both model parameters and neural activations.

In this paper we propose the Stacked Capsule Autoencoder (SCAE), which has two stages (Fig. 1). The first stage, the Part Capsule Autoencoder (PCAE), segments an image into constituent parts, infers their poses, and reconstructs each image pixel as a mixture of the pixels of transformed part templates. The second stage, the Object Capsule Autoencoder (OCAE), tries to organize discovered parts and their poses into a smaller set of objects that can explain the part poses using a separate mixture of predictions for each part. Every object capsule contributes components to each of these mixtures by multiplying its pose—the object-viewer-relationship (OV)—by the relevant object-part-relationship (OP).
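
The second stage can be illustrated with a toy calculation: each object capsule predicts a part's pose as OV multiplied by OP, and the observed part pose is scored under a mixture of those predictions. This is a rough sketch under stated assumptions rather than the paper's model: poses are 3x3 homogeneous matrices, each component is an isotropic Gaussian over the six affine parameters, and the mixing weights are hypothetical presence values. All names and numbers are made up for the example.

```python
import math

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def flatten(m):
    """Keep the six free affine parameters (top two rows)."""
    return m[0] + m[1]

def log_gauss(x, mean, std):
    return sum(-0.5 * ((xi - mi) / std) ** 2 - math.log(std * math.sqrt(2.0 * math.pi))
               for xi, mi in zip(x, mean))

def part_log_likelihood(part_pose, ov_list, op_list, presences, std=0.5):
    """Mixture over object capsules; capsule k predicts the pose as OV_k @ OP_k."""
    total = sum(presences)
    log_terms = []
    for ov, op, presence in zip(ov_list, op_list, presences):
        prediction = matmul(ov, op)
        log_terms.append(math.log(presence / total)
                         + log_gauss(flatten(part_pose), flatten(prediction), std))
    top = max(log_terms)  # log-sum-exp for numerical stability
    log_lik = top + math.log(sum(math.exp(t - top) for t in log_terms))
    responsibilities = [math.exp(t - log_lik) for t in log_terms]
    return log_lik, responsibilities

# Two hypothetical object capsules with their OV poses and OP transforms.
ov_list = [[[1.0, 0.0, 2.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]],
           [[1.0, 0.0, -3.0], [0.0, 1.0, 1.0], [0.0, 0.0, 1.0]]]
op_list = [[[1.0, 0.0, 0.5], [0.0, 1.0, 0.5], [0.0, 0.0, 1.0]],
           [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0], [0.0, 0.0, 1.0]]]
presences = [0.6, 0.4]

# An observed part pose that coincides with the first capsule's prediction.
observed = matmul(ov_list[0], op_list[0])
log_lik, responsibilities = part_log_likelihood(observed, ov_list, op_list, presences)
```

Here nearly all of the responsibility for the observed part goes to the first capsule, which is the sense in which an object "explains" a part pose.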

Stacked Capsule Autoencoders (Section 2) capture spatial relationships between whole objects and their parts when trained on unlabelled data. The vectors of presence probabilities for the object capsules tend to form tight clusters, and when we assign a class to each cluster we achieve state-of-the-art results for unsupervised classification on SVHN (55%) and near state-of-the-art on MNIST (98.5%), which can be further improved to 67% and 99%, respectively, by learning fewer than 300 parameters. We also present promising proof-of-concept results on CIFAR10 (Section 3). We describe related work in Section 4 and discuss implications of our work and future directions in Section.
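
To make "assign a class to each cluster" concrete, here is a minimal sketch of one way to score a clustering against ground-truth labels. It uses majority voting, which is an assumption of this example and not necessarily the assignment procedure used in the paper. The cluster ids stand in for the output of any clustering algorithm run on the presence vectors, and the data is made up.

```python
from collections import Counter, defaultdict

def label_clusters(cluster_ids, true_labels):
    """Give each cluster the most common true label among its members."""
    votes = defaultdict(Counter)
    for cluster, label in zip(cluster_ids, true_labels):
        votes[cluster][label] += 1
    return {cluster: counts.most_common(1)[0][0]
            for cluster, counts in votes.items()}

def clustering_accuracy(cluster_ids, true_labels):
    """Fraction of examples whose cluster's assigned label matches their own."""
    mapping = label_clusters(cluster_ids, true_labels)
    correct = sum(mapping[c] == y for c, y in zip(cluster_ids, true_labels))
    return correct / len(true_labels)

# Hypothetical cluster assignments and ground-truth digits for eight images.
cluster_ids = [0, 0, 0, 1, 1, 2, 2, 2]
true_labels = [7, 7, 3, 3, 3, 1, 1, 1]

mapping = label_clusters(cluster_ids, true_labels)
accuracy = clustering_accuracy(cluster_ids, true_labels)
```

The labels are used only to name the clusters after training, so the representation itself is still learned without supervision.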

To be continued...
