paper: https://arxiv.org/pdf/1704.07556.pdf
code: https://github.com/FudanNLP
Abstract
Chinese word segmentation (CWS) has many different segmentation criteria. This paper uses adversarial learning to extract the knowledge shared across multiple heterogeneous criteria.
In this paper, we propose adversarial multi-criteria learning for CWS by integrating shared knowledge from multiple heterogeneous segmentation criteria.
Earlier work that exploits multiple corpora mostly relied on linear classifiers with discrete features. This paper is essentially multi-task learning: each segmentation criterion is treated as one task, and three different shared-private models are proposed. The shared layer extracts criterion-invariant features, the private layer extracts criterion-specific ones, and adversarial training forces the shared layer to capture common underlying, criterion-invariant features.
The contributions of this paper could be summarized as follows.
• Multi-criteria learning is first introduced for CWS, in which we propose three shared-private models to integrate multiple segmentation criteria.
• An adversarial strategy is used to force the shared layer to learn criteria-invariant features, in which a new objective function is also proposed instead of the original cross-entropy loss.
• We conduct extensive experiments on eight CWS corpora with different segmentation criteria, which is by far the largest number of datasets used simultaneously.
Methods
Each character is tagged with one of {B, M, E, S} (begin, middle, end, single). Base architecture: character embedding layer -> feature layers (BLSTM) -> tag inference layer (CRF).
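The {B, M, E, S} scheme maps a segmented sentence to per-character labels. A minimal helper illustrating the scheme (a sketch for these notes, not code from the paper):

```python
def words_to_tags(words):
    """Map a segmented sentence (list of words) to per-character
    {B, M, E, S} tags: Begin / Middle / End of a multi-character
    word, or Single for a one-character word."""
    tags = []
    for w in words:
        if len(w) == 1:
            tags.append("S")
        else:
            tags.extend(["B"] + ["M"] * (len(w) - 2) + ["E"])
    return tags

# One possible segmentation of "我喜欢自然语言处理" under some criterion:
print(words_to_tags(["我", "喜欢", "自然语言处理"]))
# → ['S', 'B', 'E', 'B', 'M', 'M', 'M', 'M', 'E']
```

Different criteria segment the same sentence into different word lists, hence different tag sequences, which is exactly why the tasks share inputs but differ in outputs.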
Model 1: Parallel Shared-Private Model
The private and shared layers run in parallel: their hidden states are computed independently, and the two hidden states are then fed into the CRF layer together.
Model 2: Stacked Shared-Private Model
The shared layer's output is also used as part of the private layer's input, and only the private hidden states are fed into the CRF layer.
Model 3: Skip-Layer Shared-Private Model
A combination of Models 1 and 2: the shared layer's output feeds into the private layer (as in Model 2) and also skips directly to the CRF layer alongside the private output (as in Model 1). See Eqs. 14-16 in the paper.
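The three wirings can be sketched with numpy. This is only an illustration of how the shared and private features are combined before the tag-inference layer: the random linear maps stand in for the BLSTMs, and a real model would end in a CRF rather than raw features.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d_in, d_h = 5, 8, 4  # sequence length, embedding dim, hidden dim

# Stand-ins for the BLSTM feature layers (random projections, not real LSTMs).
W_shared = rng.normal(size=(d_in, d_h))
W_private = rng.normal(size=(d_in, d_h))            # one private layer per criterion
W_private_stacked = rng.normal(size=(d_in + d_h, d_h))

X = rng.normal(size=(T, d_in))                      # character embeddings

# Model 1 (parallel): shared and private run independently on X;
# both outputs reach the CRF layer.
h_s = X @ W_shared
h_p = X @ W_private
feats_parallel = np.concatenate([h_s, h_p], axis=-1)   # (T, 2*d_h)

# Model 2 (stacked): the shared output is appended to the private
# layer's input; only the private output reaches the CRF.
feats_stacked = np.concatenate([X, h_s], axis=-1) @ W_private_stacked  # (T, d_h)

# Model 3 (skip-layer): stacked wiring, plus the shared output
# skips directly to the CRF next to the private output.
feats_skip = np.concatenate([feats_stacked, h_s], axis=-1)             # (T, 2*d_h)

print(feats_parallel.shape, feats_stacked.shape, feats_skip.shape)
```

The key design difference is where the criterion-invariant features enter: next to the private features (Model 1), below them (Model 2), or both (Model 3).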
Adversarial Training for Shared Layer
To make the features extracted by the shared layer criterion-invariant, a criterion discriminator is trained to predict, from the shared features, which criterion a sentence was annotated under; the shared layer is trained adversarially so that the discriminator cannot tell.
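The paper replaces the usual negative cross-entropy adversarial objective with one that pushes the discriminator's output distribution over criteria toward uniform. A minimal numpy sketch of that entropy idea (the exact formulation is in the paper; the numbers here are illustrative):

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def discriminator_entropy(logits):
    """Entropy of the criterion discriminator's predicted distribution.
    The shared layer is trained to MAXIMIZE this entropy, so the shared
    features carry no information about which criterion produced the
    sentence; the maximum, log(K), is reached at the uniform distribution."""
    p = softmax(logits)
    return float(-(p * np.log(p + 1e-12)).sum())

K = 8  # eight criteria / corpora in the experiments
# A confident discriminator (shared features leak the criterion): low entropy.
print(discriminator_entropy(np.array([10.0] + [0.0] * (K - 1))))
# A fooled discriminator (uniform over the 8 criteria): entropy = log(8) ≈ 2.079.
print(discriminator_entropy(np.zeros(K)))
```

Maximizing entropy instead of minimizing the discriminator's accuracy gives the shared layer a smooth target (uniform confusion) rather than an unstable minimax signal.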
Training
Experiments
CWS
dataset: MSRA, AS, PKU, CTB, CKIP, CITYU, NCC, SXU
Knowledge Transfer
1. Simplified Chinese to traditional Chinese: first train on the simplified-Chinese datasets, then train on the traditional-Chinese datasets with the shared-layer parameters fixed. Test on the traditional-Chinese datasets: AS, CKIP, CITYU.
2. Formal texts to informal texts: train on NLPCC2016, then test on the Weibo data.
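Both transfer settings follow the same recipe: pre-train, then freeze the shared layer while the criterion-specific parts keep training on the target data. A tiny sketch of that freezing step (parameter names are hypothetical; the arrays stand in for real layers):

```python
import numpy as np

rng = np.random.default_rng(1)
params = {
    "shared": rng.normal(size=(4, 4)),   # stand-in for the shared BLSTM
    "private": rng.normal(size=(4, 4)),  # stand-in for target-criterion layers
}
frozen = {"shared"}  # fixed after pre-training on the source corpora

def sgd_step(params, grads, lr=0.1):
    """Plain SGD that skips every frozen parameter."""
    return {k: (v if k in frozen else v - lr * grads[k])
            for k, v in params.items()}

grads = {k: np.ones_like(v) for k, v in params.items()}
new_params = sgd_step(params, grads)
print(np.allclose(new_params["shared"], params["shared"]))    # shared unchanged
print(np.allclose(new_params["private"], params["private"]))  # private updated
```

Only the transferred criterion-invariant knowledge is held fixed; everything criterion-specific adapts to the target segmentation standard.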