Finding Function in Form: Compositional Character Models for Open Vocabulary Word Representation
1. Paper Overview
Abstract: proposes a new way of building word representations from characters; the model can learn complex formal (orthographic) structure within words and achieves state-of-the-art results on two tasks.
Introduction: words should not be treated as independent of one another; consistency in form often implies consistency in function. The paper learns this form–function regularity with a bidirectional LSTM and obtains very strong results.
Experiments: Language Model: evaluates the C2W model on the language modeling task.
C2W Model: introduces the C2W model and analyzes its complexity.
Word Vectors and Wordless Word Vectors: describes two drawbacks of word lookup tables and introduces "wordless" alternatives.
Experiments: Part-of-speech Tagging: evaluates the C2W model on the POS tagging task.
Related Work
Conclusion
2. Goals:
(1) Background:
Drawbacks of the word lookup table approach
Why C2W matters
(2) The C2W model
Model description
Efficiency analysis
(3) Experimental analysis
Language modeling experiments
POS tagging experiments
(4) Code experiments
3. Detailed Walkthrough
Abstract
We introduce a model for constructing vector representations of words by composing characters using bidirectional LSTMs. Relative to traditional word representation models that have independent vectors for each word type, our model requires only a single vector per character type and a fixed set of parameters for the compositional model. Despite the compactness of this model and, more importantly, the arbitrary nature of the form–function relationship in language, our "composed" word representations yield state-of-the-art results in language modeling and part-of-speech tagging. Benefits over traditional baselines are particularly pronounced in morphologically rich languages (e.g., Turkish).
1. We propose a new model that builds word representations from characters using bidirectional LSTMs.
2. Compared with a traditional word lookup table, the C2W model needs far fewer parameters: one set mapping characters to vectors, and one set for the compositional bidirectional LSTM.
3. Despite having few parameters, and despite the form–function relationship within words being hard to learn, the model achieves state-of-the-art results on language modeling and POS tagging.
4. The advantage is even more pronounced in morphologically rich languages.
(1) Introduction
Why word representations matter: Good representations of words are important for good generalization in natural language processing applications. Of central importance are vector space models that capture functional (i.e., semantic and syntactic) similarity in terms of geometric locality. However, when word vectors are learned—a practice that is becoming increasingly common—most models assume that each word type has its own vector representation that can vary independently of other model components (this is the independence assumption, the problem with word lookup tables). This paper argues that this independence assumption is inherently problematic, in particular in morphologically rich languages (e.g., Turkish). In such languages, a more reasonable assumption, and the starting point of this paper, would be that orthographic (formal) similarity is evidence for functional similarity.
However, it is manifestly clear that similarity in form is neither a necessary nor a sufficient condition for similarity in function (this difficulty is what motivates the use of LSTMs): small orthographic differences may correspond to large semantic or syntactic differences (butter vs. batter), and large orthographic differences may obscure nearly perfect functional correspondence (rich vs. affluent). Thus, any orthographically aware model must be able to capture non-compositional effects in addition to more regular effects due to, e.g., morphological processes. To model the complex form–function relationship, we turn to long short-term memories (LSTMs), which are designed to be able to capture complex non-linear and non-local dynamics in sequences (Hochreiter and Schmidhuber, 1997). We use bidirectional LSTMs to "read" the character sequences that constitute each word and combine them into a vector representation of the word. This model assumes that each character type is associated with a vector, and the LSTM parameters encode both idiosyncratic lexical and regular morphological knowledge.
To evaluate our model, we use a vector-based model for part-of-speech (POS) tagging and for language modeling, and we report experiments on these tasks in several languages comparing to baselines that use more traditional, orthographically-unaware parameterizations. These experiments show the contributions: (i) our character-based model is able to generate similar representations for words that are semantically and syntactically similar, even for words that are orthographically distant (e.g., October and January); (ii) our model achieves improvements over word lookup tables using only a fraction of the number of parameters in two tasks; (iii) our model obtains state-of-the-art performance on POS tagging (including establishing a new best performance in English); and (iv) performance improvements are especially dramatic in morphologically rich languages.
The paper is organized as follows: Section 2 presents our character-based model to generate word embeddings. Experiments on Language Modeling and POS tagging are described in Sections 4 and 5. We present related work in Section 6; and we conclude in Section 7.
1. Learning word vectors matters for NLP applications: word vectors capture syntactic and semantic similarity between words as geometric proximity.
2. However, in a word lookup table each word's vector is independent of every other word's. This independence assumption is problematic: similarity in form is, to some extent, evidence for similarity in function, especially in morphologically rich languages.
3. The form–function relationship is not absolute, though; to learn it, the paper runs a bidirectional LSTM over character embeddings.
4. The resulting C2W model captures syntactic and semantic similarity between words well and achieves state-of-the-art results on the two tasks.
(2) Word Vectors and Wordless Word Vectors
Word vectors in brief: It is commonplace to represent words as vectors. In contrast to naïve models in which all word types in a vocabulary V are equally different from each other, vector space models capture the intuition that words may be different or similar along a variety of dimensions. Learning vector representations of words by treating them as optimizable parameters in various kinds of language models has been found to be a remarkably effective means for generating vector representations that perform well in other tasks (Collobert et al., 2011; Kalchbrenner and Blunsom, 2013; Liu et al., 2014; Chen and Manning, 2014). Formally, such models define a matrix P ∈ R^{d×|V|}, which contains d parameters for each word in the vocabulary V. For a given word type w ∈ V, a column is selected by right-multiplying P by a one-hot vector of length |V|, written 1_w, which is zero in every dimension except for the element corresponding to w. Thus, P is often referred to as a word lookup table, and we denote by e^W_w ∈ R^d the embedding obtained from a word lookup table for w as e^W_w = P · 1_w. This allows tasks with low amounts of annotated data to be trained jointly with other tasks with large amounts of data and leverage the similarities in these tasks. A common practice to this end is to initialize the word lookup table with the parameters trained on an unsupervised task. Some examples of these include the skip-n-gram and CBOW models of Mikolov et al. (2013).
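As a concrete picture of the lookup-table formalism e^W_w = P · 1_w, here is a minimal PyTorch sketch (the sizes and the word index are illustrative, not values from the paper):

```python
import torch
import torch.nn as nn

# A word lookup table is a d x |V| parameter matrix; selecting a column with a
# one-hot vector 1_w is equivalent to indexing one row of nn.Embedding.
vocab_size, d = 10_000, 50            # illustrative sizes
lookup = nn.Embedding(vocab_size, d)  # stores P (as a |V| x d table)

word_id = torch.tensor([42])          # index of word type w in V (made up)
e_w = lookup(word_id)                 # e^W_w = P . 1_w, shape (1, d)
print(e_w.shape)
```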
2.1 Problem: Independent Parameters
Problems with word lookup tables:
There are two practical problems with word lookup tables. Firstly, while they can be pretrained on large amounts of data to learn semantic and syntactic similarities between words, each vector is independent. That is, even though models based on word lookup tables are often observed to learn that cats, kings and queens exist in roughly the same linear correspondences to each other as cat, king and queen do, the model does not represent the fact that adding an s at the end of the word is evidence for this transformation. This means that word lookup tables cannot generate representations for previously unseen words, such as Frenchification, even if the components, French and -ification, are observed in other contexts. (In other words, the lookup table cannot generalize, e.g., to plural forms or suffixed forms it has not seen.)
Second, even if copious data is available, it is impractical to actually store vectors for all word types. As each word type gets its own set of d parameters, the total number of parameters is d × |V|, where |V| is the size of the vocabulary. Even in relatively morphologically poor English, the number of word types tends to scale to the order of hundreds of thousands, and in noisier domains, such as online data, the number of word types grows considerably. For instance, in the English Wikipedia dump with 60 million sentences, there are approximately 20 million different lowercased and tokenized word types, each of which would need its own vector. Intuitively, it is not sensible to use the same number of parameters for each word type.
Finally, it is important to remark that it is uncontroversial among cognitive scientists that our lexicon is structured into related forms—i.e., their parameters are not independent. The well-known "past tense debate" between connectionists and proponents of symbolic accounts concerns disagreements about how humans represent knowledge of inflectional processes (e.g., the formation of the English past tense), not whether such knowledge exists (Marslen-Wilson and Tyler, 1998). (The structure clearly exists; the open question is how to exploit it.)
2.2 Solution: Compositional Models
Our solution to these problems is to construct a vector representation of a word by composing smaller pieces into a representation of the larger form. This idea has been explored in prior work by composing morphemes into representations of words (Luong et al., 2013; Botha and Blunsom, 2014; Soricut and Och, 2015). Morphemes are an ideal primitive for such a model since they are—by definition—the minimal meaning-bearing (or syntax-bearing) units of language. The drawback to such approaches is that they depend on a morphological analyzer.
In contrast, we would like to compose representations of characters into representations of words. However, the relationship between word forms and their meanings is non-trivial (de Saussure, 1916). While some compositional relationships exist, e.g., morphological processes such as adding -ing or -ly to a stem have relatively regular effects, many words with lexical similarities convey different meanings, such as the word pairs lesson ⇐⇒ lessen and coarse ⇐⇒ course.
(3) C2W Model
Our compositional character-to-word (C2W) model is based on bidirectional LSTMs (Graves and Schmidhuber, 2005), which are able to learn complex non-local dependencies in sequence models. An illustration is shown in Figure 1. Input and output of the model: the input of the C2W model (illustrated at the bottom) is a single word type w, and the output is a d-dimensional vector used to represent w. This model shares the same input and output as a word lookup table (illustrated at the top), allowing it to easily replace the lookup table in any network (a drop-in replacement).
Formally: as input, we define an alphabet of characters C. For English, this vocabulary would contain an entry for each uppercase and lowercase letter as well as numbers and punctuation. The input word w is decomposed into a sequence of characters c_1, …, c_m, where m is the length of w. Each c_i is defined as a one-hot vector 1_{c_i}, with a one at the index of c_i in the alphabet. We define a projection layer P^C ∈ R^{d_C × |C|}, where d_C is the number of parameters for each character in the character set C. This is, of course, just a character lookup table, and is used to capture similarities between characters in a language (e.g., vowels vs. consonants). Thus, we write the projection of each input character c_i as e^C_{c_i} = P^C · 1_{c_i}.
Given the input vectors x_1, …, x_m, an LSTM computes the state sequence h_1, …, h_{m+1} by iteratively applying the following (standard LSTM) updates:

i_t = σ(W_ix · x_t + W_ih · h_{t−1} + b_i)
f_t = σ(W_fx · x_t + W_fh · h_{t−1} + b_f)
o_t = σ(W_ox · x_t + W_oh · h_{t−1} + b_o)
c_t = f_t ⊙ c_{t−1} + i_t ⊙ tanh(W_cx · x_t + W_ch · h_{t−1} + b_c)
h_t = o_t ⊙ tanh(c_t)

where σ is the component-wise logistic sigmoid function and ⊙ is the component-wise (Hadamard) product. LSTMs define an extra cell memory c_t, which is combined linearly at each timestep t. The information propagated from c_{t−1} to c_t is controlled by the three gates i_t, f_t and o_t, which determine what to include from the input x_t, what to forget from c_{t−1}, and what is relevant to the current state h_t. We write W to refer to all parameters of the LSTM (W_ix, W_fx, b_f, …). Thus, given a sequence of character representations e^C_{c_1}, …, e^C_{c_m} as input, the forward LSTM yields the state sequence s^f_0, …, s^f_m, while the backward LSTM receives the reversed sequence as input and yields the states s^b_m, …, s^b_0. The two LSTMs use different sets of parameters W^f and W^b. The representation of the word w is obtained by combining the final forward and backward states:

e^C_w = D^f · s^f_m + D^b · s^b_0 + b_d

where D^f, D^b and b_d are parameters that determine how the states are combined.
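To make the composition concrete, the following is a minimal PyTorch sketch of the C2W idea described above. It is a paraphrase for illustration (the class name and the character ids in the example are made up), not the authors' released implementation:

```python
import torch
import torch.nn as nn

class C2W(nn.Module):
    """Compose a word embedding from its characters with a bidirectional LSTM.
    Dimensions follow the paper's setup (d_C = 50, d_CS = 150, d = 50)."""
    def __init__(self, n_chars, d_char=50, d_state=150, d_word=50):
        super().__init__()
        self.char_lookup = nn.Embedding(n_chars, d_char)       # P^C
        self.bilstm = nn.LSTM(d_char, d_state,
                              bidirectional=True, batch_first=True)
        self.combine = nn.Linear(2 * d_state, d_word)           # D^f, D^b, b_d

    def forward(self, char_ids):                 # (batch, word_len) of char ids
        e_c = self.char_lookup(char_ids)         # (batch, word_len, d_char)
        _, (h_n, _) = self.bilstm(e_c)           # h_n: (2, batch, d_state)
        h_fwd, h_bwd = h_n[0], h_n[1]            # s^f_m and s^b_0
        return self.combine(torch.cat([h_fwd, h_bwd], dim=-1))  # e^C_w

# Example: embed a 4-character word (the ids are arbitrary placeholders).
c2w = C2W(n_chars=100)
word = torch.tensor([[12, 7, 29, 4]])            # e.g., "c", "a", "t", "s"
print(c2w(word).shape)                           # torch.Size([1, 50])
```

One design note: the final hidden states returned by nn.LSTM already correspond to the last forward state and the "last" backward state (the state after reading the word right-to-left), which is exactly what the linear combination above needs.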
Caching for Efficiency. Reducing the cost: relative to e^W_w, computing e^C_w is computationally expensive, as it requires two LSTM traversals of length m. However, e^C_w depends only on the character sequence of that word, which means that unless the parameters are updated, it is possible to cache the value of e^C_w for each distinct w that will be used repeatedly. Thus, the model can keep a list of the most frequently occurring word types in memory and run the compositional model only for rare words. Obviously, caching all words would yield the same performance as using a word lookup table e^W_w, but would also use the same amount of memory. Consequently, the number of word types kept in the cache can be adjusted to satisfy the memory vs. performance requirements of a particular application.
At training time, when parameters are changing, repeated words within the same batch only need to be computed once, and the gradient at the output can be accumulated within the batch so that only one update needs to be done per word type. For this reason, it is preferable to define larger batches.
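A possible caching wrapper, sketched under the assumption that the word-to-character-id mapping is provided by a hypothetical helper to_char_ids; the cache policy here (cache until full) is a simplification of "keep the most frequent word types":

```python
import torch

class CachedC2W:
    """Wraps a trained C2W module with a word-type cache, so the bidirectional
    LSTM runs only for words not already held in the cache."""
    def __init__(self, c2w, to_char_ids, max_cache_size=100_000):
        self.c2w, self.to_char_ids = c2w, to_char_ids
        self.cache, self.max_cache_size = {}, max_cache_size

    @torch.no_grad()
    def embed(self, word):
        if word in self.cache:                      # frequent / already-seen word
            return self.cache[word]
        e_w = self.c2w(self.to_char_ids(word))      # run the composition
        if len(self.cache) < self.max_cache_size:
            self.cache[word] = e_w
        return e_w
```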
(4) Experiments: Language Modeling
Our proposed model is similar to models used to compute composed representations of sentences from words (Cho et al., 2014; Li et al., 2015). However, the relationship between the meanings of individual words and the composite meaning of a phrase or sentence is arguably more regular than the relationship between representations of characters and the meaning of a word. Is our model capable of learning such an irregular relationship? We now explore this question empirically.
Language modeling is a task with many applications in NLP. An effective LM must model syntactic aspects of language, such as word orderings (e.g., "John is smart" vs. "John smart is"), but also semantic aspects (e.g., "John ate fish" vs. "fish ate John"). Thus, if our C2W model only captures regular aspects of words, such as prefixes and suffixes, it will yield worse results than word lookup tables.
4.1 Language Model
Language modeling amounts to learning a function that computes the log probability, log p(w), of a sentence w = (w_1, …, w_n). This quantity can be decomposed according to the chain rule into the sum of conditional log probabilities ∑_{i=1}^{n} log p(w_i | w_1, …, w_{i−1}). Our language model computes log p(w_i | w_1, …, w_{i−1}) by composing representations of the words w_1, …, w_{i−1} using a recurrent LSTM model (Mikolov et al., 2010; Sundermeyer et al., 2012).
The model is illustrated in Figure 2, where we observe on the first level that each word w_i is projected into its word representation. This can be done with a word lookup table e^W_{w_i}, in which case we have a regular recurrent language model. To use our C2W model, we simply replace the word lookup table with the composition f(w_i) = e^C_{w_i}. Each LSTM state s_i is used to predict word w_{i+1}, by projecting s_i into a vector of the size of the vocabulary V and applying a softmax.
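A sketch of how the C2W module slots into the word-level LSTM language model; the output layer is still a closed-vocabulary softmax, as discussed in the next paragraph. Sizes follow the experimental setup below (d = 50, state size 150, 5000-word output vocabulary); padding of variable-length words is ignored for brevity:

```python
import torch
import torch.nn as nn

class CharAwareLM(nn.Module):
    """Word-level LSTM language model whose input word vectors come from the
    C2W module sketched earlier, replacing the word lookup table."""
    def __init__(self, c2w, out_vocab=5000, d_word=50, d_state=150):
        super().__init__()
        self.c2w = c2w                              # f(w_i) = e^C_{w_i}
        self.lstm = nn.LSTM(d_word, d_state, batch_first=True)
        self.out = nn.Linear(d_state, out_vocab)    # closed-vocabulary softmax

    def forward(self, char_ids):   # (batch, sent_len, max_word_len)
        b, n, m = char_ids.shape
        e = self.c2w(char_ids.view(b * n, m)).view(b, n, -1)  # word vectors
        s, _ = self.lstm(e)                         # states s_1..s_n
        return self.out(s)                          # logits for predicting w_{i+1}
```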
The softmax is still simply a d × V table, which encodes the likelihood of every word type in a given context, so this remains a closed-vocabulary model; out-of-vocabulary (OOV) words at test time still cannot be addressed at the output. Handling OOV at the output: the strategy generally applied is to prune the vocabulary V by replacing low-frequency word types with an OOV token (i.e., set a minimum count when building the vocabulary). At test time, the probability of words not in the vocabulary is estimated via the OOV token. Thus, depending on how many word types are pruned, the global perplexity may decrease, since there are fewer outcomes in the softmax, which makes the absolute value of perplexity uninformative when comparing models with different vocabulary sizes. (With a smaller vocabulary, each outcome receives more probability mass, so perplexity tends to drop.) Yet, the relative perplexity between different models still indicates which models can better predict words based on their contexts.
OOV handling in the baseline: OOV words are replaced by an unknown token, which is also associated with its own embedding. To make training match this test-time condition, word types that occur only once in the training data are stochastically replaced with the unknown token with probability 0.5. The same process is applied at the character level for the C2W model.
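A small preprocessing sketch of the stochastic unknown-token replacement described above (the function name is hypothetical; the singleton criterion and the 0.5 probability mirror the text, the rest is an assumption about how one might implement it):

```python
import random
from collections import Counter

def unk_singletons(train_sents, unk="<unk>", p=0.5, seed=0):
    """Replace word types that occur exactly once in the training data with the
    unknown token, stochastically with probability p, so the model learns a
    useful <unk> embedding (the same idea applies to characters for C2W)."""
    rng = random.Random(seed)
    counts = Counter(w for sent in train_sents for w in sent)
    return [[unk if counts[w] == 1 and rng.random() < p else w for w in sent]
            for sent in train_sents]

print(unk_singletons([["the", "cat", "sat"], ["the", "dog", "sat"]]))
```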
4.2 Experiments
Datasets. We look at language model performance on English, Portuguese, Catalan, German and Turkish, which cover a broad range of morphological typologies. While all these languages contain inflections, in agglutinative languages affixes tend to be unchanged, while in fusional languages they are not. For each language, Wikipedia articles were randomly extracted until 1 million words were obtained, and these were used for training. For development and testing, we extracted an additional set of 20,000 words.
Setup. We set the size of the word representation d to 50. The C2W model additionally requires setting the character dimensionality d_C and the LSTM state size d_CS; we set d_C = 50 (the character embedding size) and d_CS = 150 (the number of hidden units). Each LSTM state s_i used in the language model sequence is also of size 150, for both states and cell memories. Training is performed with mini-batch gradient descent with 100 sentences per batch; the learning rate and momentum were set to 0.2 and 0.95. The softmax over words is always performed on lowercased words. We restrict the output vocabulary to the most frequent 5000 words; remaining word types are replaced by an unknown token, which must also be predicted. The word representation layer is still computed over all word types (i.e., a completely open input vocabulary). When using word lookup tables, the input words are also lowercased, as this setup produces the best results. In the C2W model, case information is preserved.
Evaluation is performed by computing perplexity over the test data, using the parameters that yield the lowest perplexity on the development data.
Perplexities. Perplexities over the test set are reported in Table 4. From these results it is clear that C2W consistently outperforms word lookup tables (row "Word"), and that improvements are especially pronounced in Turkish, a highly morphological language in which word meanings differ radically depending on the suffixes used (evde → in the house vs. evden → from the house).
Number of Parameters. As for the number of parameters (block "#Parameters"), the number of parameters in a word lookup table is |V| × d. If a language contains 80,000 word types (a conservative estimate for morphologically rich languages), 4 million parameters would be necessary. On the other hand, the compositional model consists of 8 matrices of dimensions d_CS × d_C + 2d_CS, plus the matrix of size d × 2d_CS that combines the forward and backward states. Thus, the number of parameters is roughly 150,000 parameters—substantially fewer. The model also needs a character lookup table with d_C parameters per entry; for English, there are 618 characters, adding 30,900 parameters. So the total number of parameters for English is roughly 180,000 (2 to 3 parameters per word type), an order of magnitude lower than a word lookup table.
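A rough back-of-the-envelope check, reusing the C2W sketch from earlier. PyTorch stores both directions' input and recurrent matrices plus biases, so the count it reports is somewhat higher than the paper's ~180k figure, but the order-of-magnitude gap to an 80k-type lookup table still holds:

```python
word_lookup_params = 80_000 * 50                       # |V| x d
c2w = C2W(n_chars=618)                                 # English character set size from the paper
c2w_params = sum(p.numel() for p in c2w.parameters())
print(f"word lookup table: {word_lookup_params:,}")    # 4,000,000
print(f"C2W (this sketch): {c2w_params:,}")            # a few hundred thousand
```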
Performance. As for efficiency, both representations can label sentences at a rate of approximately 300 words per second during training. While this may be surprising, given that the C2W model requires a composition over characters, the main bottleneck of the system is the softmax over the vocabulary. Furthermore, caching is used to avoid composing the same word type twice in the same batch. This shows that the C2W model is relatively fast compared to operations such as the softmax.
Representations of (nonce) words. While it is promising that the model is not simply learning lexical features, what is most interesting is that the model can propose embeddings for nonce words, in stark contrast to the situation observed with lookup table models. We show the 5 most similar in-vocabulary words (measured with cosine similarity) as computed by our character model for two in-vocabulary words and two nonce words. This makes our model generalize significantly better than lookup tables, which generally fall back to an unknown token for OOV words. Furthermore, this ability to generalize is much more similar to that of human beings, who are able to infer meanings for new words based on their form.
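A sketch of the nearest-neighbour query used for this kind of analysis: embed a (possibly nonce) word with C2W and rank in-vocabulary words by cosine similarity. to_char_ids is again a hypothetical helper returning a (1, word_len) tensor of character ids:

```python
import torch
import torch.nn.functional as F

def most_similar(query_word, vocab, to_char_ids, c2w, k=5):
    """Return the k most similar in-vocabulary words to query_word under the
    C2W embedding, measured by cosine similarity."""
    with torch.no_grad():
        q = c2w(to_char_ids(query_word))                              # (1, d)
        vocab_vecs = torch.cat([c2w(to_char_ids(w)) for w in vocab])  # (|V|, d)
        sims = F.cosine_similarity(q, vocab_vecs)                     # (|V|,)
        top = sims.topk(min(k, len(vocab))).indices.tolist()
    return [vocab[i] for i in top]
```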
(5) Experiments: Part-of-speech Tagging
As a second illustration of the utility of our model, we turn to POS tagging. As morphology is a strong indicator of syntax in many languages, much effort has been spent on engineering features (Nakagawa et al., 2001; Mueller et al., 2013). We now show that some of these features can be learnt automatically by our model.
5.1 Bi-LSTM Tagging Model
Our tagging model is likewise novel, but very straightforward. It builds a Bi-LSTM over words, as illustrated in Figure 3. The input of the model is a sequence of features f(w_1), …, f(w_n). Once again, word vectors can be generated either by the C2W model, f(w_i) = e^C_{w_i}, or by a word lookup table, f(w_i) = e^W_{w_i}. We also test the use of hand-engineered features, in which case f(w_i) is built from features f_1(w_i), …, f_n(w_i). The sequence f(w_1), …, f(w_n) is then fed into a bidirectional LSTM, yielding the forward states s^f_0, …, s^f_n and the backward states s^b_{n+1}, …, s^b_0. Thus, state s^f_i contains the information of all words from 0 to i, and s^b_i from n to i. The forward and backward states are combined, for each index i from 1 to n, as:

l_i = L^f · s^f_i + L^b · s^b_i + b_l

where L^f, L^b and b_l are parameters defining how the forward and backward states are combined.
The sizes of the forward states s^f, the backward states s^b, and the combined state l are hyperparameters of the model, denoted d^f_WS, d^b_WS and d_WS, respectively. Finally, the output label for index i is obtained as a softmax over the POS tagset, by projecting the combined state l_i.
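A sketch of the Bi-LSTM tagger wiring, with C2W providing f(w_i). The state sizes of 50 follow the setup below, and the combination l_i is kept linear as in the base model (the "C2W(tanh)" variant mentioned later adds a tanh here). This is illustrative, not the authors' code:

```python
import torch
import torch.nn as nn

class BiLSTMTagger(nn.Module):
    """Sentence-level bidirectional LSTM POS tagger over C2W word vectors."""
    def __init__(self, c2w, n_tags, d_word=50, d_ws=50):
        super().__init__()
        self.c2w = c2w
        self.bilstm = nn.LSTM(d_word, d_ws,
                              bidirectional=True, batch_first=True)
        self.combine = nn.Linear(2 * d_ws, d_ws)    # L^f, L^b, b_l
        self.out = nn.Linear(d_ws, n_tags)          # projection onto the tagset

    def forward(self, char_ids):   # (batch, sent_len, max_word_len)
        b, n, m = char_ids.shape
        f = self.c2w(char_ids.view(b * n, m)).view(b, n, -1)  # f(w_1..w_n)
        states, _ = self.bilstm(f)                  # [s^f_i ; s^b_i] per word
        l = self.combine(states)                    # combined state l_i
        return self.out(l)                          # tag logits per position
```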
5.2 Experiments
Datasets. For English, we conduct experiments on the Wall Street Journal portion of the Penn Treebank (Marcus et al., 1993), using the standard splits (sections 1–18 for training, 19–21 for tuning and 22–24 for testing). We also run tests on 4 other languages, obtained from the CoNLL shared tasks (Martí et al., 2007; Brants et al., 2002; Afonso et al., 2002; Atalay et al., 2003). While the PTB dataset provides standard train, tuning and test splits, there are no tuning sets for the other languages, so we withdraw the last 100 sentences from each training set and use them for tuning.
Setup. The POS model requires two sets of hyperparameters. Firstly, words must be converted into continuous representations; the same hyperparameters as in language modeling (Section 4) are used. Secondly, word representations are combined to encode context. Our POS tagger has three hyperparameters d^f_WS, d^b_WS and d_WS, corresponding to the sizes of the LSTM states, all set to 50. As for the learning algorithm, we use the same setup (learning rate, momentum and mini-batch size) as in language modeling.
Once again, in the setup that uses word lookup tables we replace OOV words with an unknown token, and we do the same with OOV characters in the C2W model. In setups using pre-trained word embeddings, we consider a word OOV if it was seen neither in the labelled training data nor in the unlabeled data used for pre-training.
Compositional Model Comparison. A comparison of different recurrent neural networks for the C2W model is presented in Table 3. We used our proposed tagger in all experiments, and results are reported for the English Penn Treebank. Label accuracy on the test set is shown in column "acc", the number of parameters in the word composition model in column "parameters", and the number of words processed per second at test time in column "words/sec".
We observe that approaches using plain RNNs yield worse results than their LSTM counterparts, with a difference of approximately 2%. This suggests that while regular RNNs can learn shorter character-sequence dependencies, they are not ideal for learning longer dependencies. LSTMs, on the other hand, obtain relatively higher results, on par with word lookup tables (row "Word Lookup"), even when using the forward (row "Forward LSTM") and backward (row "Backward LSTM") LSTMs individually. The best results are obtained with the bidirectional LSTM (row "Bi-LSTM"), which achieves an accuracy of 97.29% on the test set, surpassing the word lookup table.
There are approximately 40k lowercased word types in the PTB training data. Thus, a word lookup table with 50 dimensions per type contains approximately 2 million parameters. In the C2W models, the number of character types (including uppercase and lowercase) is approximately 80, so the character lookup table consists of only 4k parameters, which is negligible compared to the number of parameters in the compositional model, once again about 150k. One could argue that the Bi-LSTM results are higher than those of other models because it contains more parameters, so we also set the state size d_CS = 50 (row "Bi-LSTM dCS = 50") and obtained similar results.
In terms of computational speed, there is a more noticeable slowdown when applying the C2W models than in language modeling, because the softmax over the whole word vocabulary is no longer the main bottleneck of the network. However, while the Bi-LSTM system is 3 times slower, this does not significantly hurt the performance of the system.
Results on Multiple Languages. Results on 5 languages are shown in Table 4. In general, the model using word lookup tables (row "Word") performs consistently worse than the C2W model (row "C2W"). We also compare our results with Stanford's POS tagger with its default set of features, also in Table 4; results using that tagger are comparable to or better than state-of-the-art systems. In most cases we can slightly outperform the scores it obtains. This is a promising result, considering that we use the same training data and do not handcraft any features. Furthermore, for Turkish our results are significantly higher (>4%).
Comparison with Benchmarks. Most state-of-the-art POS tagging systems are obtained by either learning or handcrafting good lexical features (Manning, 2011; Sun, 2014), or by using additional raw data to learn features in an unsupervised fashion; generally, optimal results are obtained by doing both. Table 5 shows the current benchmarks for the English PTB. Accuracy on the test set is reported in column "acc"; columns "feat" and "data" indicate whether hand-crafted features and additional data were used. Even without feature engineering or unsupervised pretraining, our C2W model (row "C2W") is on par with the current state-of-the-art system (row "structReg"). However, if we add hand-crafted features, we obtain further improvements on this dataset (row "C2W + features").
However, there are many words that do not contain morphological cues to their part-of-speech. For instance, the word snake does not contain any morphological cues that determine its tag. In these cases, if they are not found labelled in the training data, the model would be dependent on context to determine their tags, which could lead to errors in ambiguous contexts. Unsupervised training methods such as the Skip-n-gram model (Mikolov et al., 2013) can be used to pretrain the word representations on unannotated corpora. If such pretraining places cat, dog and snake near each other in vector space, and the supervised POS data contains evidence that cat and dog are nouns, our model will be likely to label snake with the same tag.
We train embeddings on English Wikipedia with the dataset used in Ling et al. (2015) and the Structured Skip-n-gram model. Results using pre-trained word lookup tables, and the C2W model with the pre-trained word lookup tables as additional parameters, are shown in rows "word (sskip)" and "C2W + word (sskip)". Both systems improve over their randomly initialized counterparts (rows "word" and "C2W").
Finally, we also found that when using the C2W model in conjunction with pre-trained word embeddings, adding a non-linearity to the representations extracted from the C2W model e^C_w improves the results over using a simple linear transformation (row "C2W(tanh) + word (sskip)"), i.e., the linear combination is replaced by a tanh activation. This setup obtains a 0.28-point improvement over the current state-of-the-art system (row "SCCN").
A similar model, which uses a convolutional network to learn additional representations for words, was proposed by Santos and Zadrozny (2014) (row "CNN (Santos and Zadrozny, 2014)"). However, the results are not directly comparable, as a different set of embeddings is used to initialize the word lookup table.
5.3 Discussion
It is important to note that these results do not imply that our model always outperforms existing benchmarks; in fact, in most experiments results are fairly similar to existing systems. Even in Turkish, using morphological analysers to extract additional features could accomplish similar results. The goal of our work is not to overcome existing benchmarks, but to show that much of the feature engineering done in the benchmarks can be learnt automatically from task-specific data. More importantly, we wish to show that high-dimensional word lookup tables can be compacted into a character lookup table plus a compositional model, allowing the model to scale better with the size of the training data. This is a desirable property as data becomes more abundant in many NLP tasks.
(6) Related Work
(7) Conclusion
We propose a C2W model that builds word embeddings without an explicit word lookup table. It benefits from being sensitive to lexical aspects within words, as it takes characters as the atomic units from which a word's embedding is derived. On POS tagging, our models using characters alone achieve results comparable to or better than state-of-the-art systems, without the need to manually engineer lexical features. Although language modeling and POS tagging both benefit strongly from morphological cues, the success of our models in languages with impoverished morphological cues shows that they are able to learn non-compositional aspects of how letters fit together.
The code for the C2W model and our language model and POS tagger implementations is available from https://github.com/wlin12/JNN.
4. Results and Significance
(1) Results
Achieves the best language modeling results on all five languages: English, Portuguese, Catalan, German and Turkish.
Achieves state-of-the-art results on English POS tagging.
(2) Significance
Provides a new way of training word representations, and is a first step toward learning the internal form (structure) of words.
(3) Advantages
Captures structural information among the characters inside a word.
Can infer reasonable representations for words with similar structure, including OOV words.
(4) Disadvantages
At training time, word representations must be produced by running the LSTM, which is slower than a word lookup table.
At test time, frequent words can be pre-computed and cached, but building representations for OOV words is still slow.
(5) Suitable scenarios
Sequence labeling tasks
Tasks with many OOV words, e.g., adversarial examples or public-opinion (social media) text
5. Summary
Key points
Two problems with word lookup tables: 1. the independence assumption; 2. the parameter table grows with the vocabulary.
How to learn the form–function (semantic and syntactic) relationship within words.
C2W
Novel contributions
Proposes a new way of building word representations.
Achieves very strong results on language modeling and POS tagging.
Works even better on morphologically rich languages.
Takeaways
The independence assumption over words is inherently problematic, especially in morphologically rich languages, where a more reasonable assumption is that words similar in form are also likely to be similar in function (syntax and semantics).
The goal of this work is not to beat the benchmark systems, but to show that the feature engineering used in those benchmarks can be learned automatically from the data.