Overview of ACL 2015 Conference Papers

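I wrote summaries of a few dozen ACL 2015 papers. I have read most of them, but there are probably quite a few mistakes, and corrections are welcome. These first appeared on the WeChat public account "程序媛的日常", which I run with friends; they are now collected as several long Weibo posts: ACL 2015 selected paper overview (1); ACL 2015 selected paper overview (2); ACL 2015 selected paper overview (3); ACL 2015 selected paper overview (4).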

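After the ACL 2015 conference I picked a few dozen papers that interested me, most of which I had already read, and wrote short overviews. I am sure they contain many mistakes, and corrections are welcome. A fully illustrated version is available on the public account; the long Weibo posts are copied images, so a lot may fail to display correctly there.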

1. Text to 3D Scene Generation with Rich Lexical Grounding

Angel Chang, Will Monroe, Manolis Savva, Christopher Potts, Christopher D. Manning

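This paper is fancy: it builds 3D scenes from simple text. For example, how to render, from a description, the corner of a room with a refrigerator and a potted plant on top of it. The work is very careful, and the corpus is unusual too!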

2. MultiGranCNN: An Architecture for General Matching of Text Chunks on Multiple Levels of Granularity

Wenpeng Yin and Hinrich Schütze

Over the past two years his research has focused mostly on representations of text chunks. This line of work (including this paper) stresses handling various granularities in the sentence representation; in the model this shows up as unigram (word) features, short n-gram features, long n-gram features, and sentence features. As I understand it, "various granularity" is meant to model two advantages at once: 1) different granularities should be compared (between two sentence representations) at the corresponding granularity level (do not compare single words with entire sentences); 2) interactions among different granularities should be modeled. For 1), they argue this is better than Socher'11's RNN work, and they extend that line of work in this ACL'15 paper; for 2), they add an interaction NN to their model.

The paper has two techniques. The first is an unsupervised CNN pretraining scheme, which they say is very useful. Roughly, the output of the top sentence-representation layer is treated as one unit, and together with the original single-word units of the network input it forms a new sequence; combined with NCE (noise-contrastive estimation), this becomes a kind of sentence-enhanced word prediction. The idea draws on two papers: one is obviously word2vec, with its unsupervised prediction of the central word; the other is Baroni'14's "Don't count, predict!", which argues that predict-style LMs work better.

That is the first technique, and he reuses it throughout the later papers in this series. How much it really helps is something readers can check for themselves.

The second technique is another form of dynamic pooling, following Socher'11's work.

3. [TACL]* Improving Distributional Similarity with Lessons Learned from Word Embeddings

Omer Levy, Yoav Goldberg, Ido Dagan

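Reportedly Levy was in full fighting form at the oral presentation, stating outright that even after 5600 sets of experiments he could not reproduce the good results reported for some models.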

4. Learning Word Representations by Jointly Modeling Syntagmatic and Paradigmatic Relations

Fei Sun, Jiafeng Guo, Yanyan Lan, Jun Xu, Xueqi Cheng

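First, the paper is written in a plain, down-to-earth way. If the writing reflects the person, you can see in it the author's steady attitude toward research. There are very few fancy or embellishing adjectives; it describes the work soberly and compares carefully on every front. There is math, there is analysis, there is a case study. As a further recommendation, the paper can almost serve as a short survey: the Introduction, Related Work and the later Discussion summarize syntagmatic and paradigmatic models very thoroughly, with objective and apt assessments. Recommended to anyone who wants to get to know this area.

Now to the main points: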

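Motivation: As we all know, looking for word similarity usually turns up two kinds of results: one is closer to word relatedness, the other is word similarity proper. You can think of them as "horizontal" and "vertical" similarity. The paper explains this with two sentences: "The wolf is a fierce animal." and "The tiger is a fierce animal." (wolf, tiger) is a paradigmatic relation, while (wolf, fierce) and (tiger, fierce) are each syntagmatic relations. Many earlier models capture only one of the two relations. This paper wants to learn both jointly, and argues that joint learning lets each boost the other and the overall result (with analysis in the final case study).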

Concepts: For syntagmatic vs. paradigmatic, the paper actually has four pairs of similar concepts. The first is (syntagmatic, paradigmatic); the second is (representations based on the text region, representations based on similar contexts); the third is (combinatorial relations, substitutional relations); and the fourth is (words-by-documents co-occurrence matrix, words-by-words co-occurrence matrix).

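Idea: Joint learning is something of a selling point in NLP. What it attacks (a word Hanyang likes to use) is the pipeline framework so common in NLP: joint approaches can reduce error propagation and accumulation. Although this paper does not involve a pipeline, joint learning of the two relations really can make them boost each other.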

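Models: Two models are adapted from word2vec's CBOW and SkipGram. There are already far too many papers modifying these two models, but this modification does feel a little refreshing. It also gives rigorous mathematical derivations (and source code). The exposition is clear, a blessing for those of us who are weak at math (myself included). In short, both models keep using word2vec's contexts (neighboring words) to capture paradigmatic relations, and use the whole document to capture syntagmatic relations. The CBOW variant simply puts the two in "parallel"; the cleverer one is the SkipGram variant, which becomes "hierarchical": the document first predicts (conditions) the center word w_0, and then, as in SkipGram, w_0 predicts the context words, so both relations are captured at once.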
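As a rough sketch of the two objectives just described (the notation here is my own assumption and may differ from the paper's): let d_n be a document, w_i a word in it, h_i its window of neighboring words, and c_j an individual context word in that window. The "parallel" CBOW-style variant adds a document term next to the context term, while the "hierarchical" SkipGram-style variant lets the document predict the center word, which in turn predicts its context words:

```latex
\ell_{\text{parallel}} = \sum_{n} \sum_{w_i \in d_n}
  \Big[ \log p(w_i \mid h_i) + \log p(w_i \mid d_n) \Big]

\ell_{\text{hierarchical}} = \sum_{n} \sum_{w_i \in d_n}
  \Big[ \log p(w_i \mid d_n) + \sum_{c_j \in h_i} \log p(c_j \mid w_i) \Big]
```

In both cases the document term carries the syntagmatic signal and the window term carries the paradigmatic one; how each probability is parameterized (for example with negative sampling) is left open in this sketch.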

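Experiments: On large public datasets, performance on word similarity and word analogy is compared along several axes (multiple dimensionalities, multiple baseline models). It beats the baselines across the board.

Case Study: I think this is the most careful part. I like it a lot.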

5. Compositional Vector Space Models for Knowledge Base Completion

Arvind Neelakantan, Benjamin Roth, Andrew McCallum

The idea is simple: fill in knowledge paths, from which some transitional & compositional relations in the KB can then be inferred.

1. Learning Answer-Entailing Structures for Machine Comprehension

Mrinmaya Sachan, Kumar Dubey, Eric Xing, Matthew Richardson

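From CMU, Eric Xing's group. The paper is not NN-based, and the math is fairly simple. I see two highlights: one is the assumption of an intermediate hypothesis; the other is that the math incorporates multi-task learning and uses a feature map technique to "reduce" the multi-task problem back to the original problem.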

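On the first: they use the Question and the Answer to learn a hypothesis. This hypothesis is a kind of latent variable, and can also be seen as an embedded fact, if we take question + answer to jointly describe a fact/truth/event. Based on the hypothesis, they then match relevant words in the original paragraph/text. See Figure 1 for details. I find this quite interesting because it reminds me of encoding and decoding. The Question + Answer combination is one expression of the doc, and the doc itself is another. These are two representations, so what is the real thing in between? What is the so-called complete information? A hypothesis formed by directly combining them surely loses information too. I think Machine Translation/Conversation work is doing something similar now: rather than mapping directly one-to-one, have an unseen "hypothesis" in the middle.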

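The second is multi-task, which relates to another paper they use, "Toward AI-Complete Question Answering: A Set of Prerequisite Toy Tasks", which defines 20 kinds of problems an AI needs to solve. That is, the questions mentioned above are categorized (how/what/which/why/when/who and so on). They use these 20 categories to split the task into 20 subtasks, which turns it into a multi-task problem. They then use the feature map technique (Evgeniou 2004) to convert the multi-task problem back into the original problem. I find this quite interesting. Of course there are already many ways to handle multi-task learning; this is just a simple and effective one that suits their model.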
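To make the feature map trick concrete, here is a minimal sketch in the spirit of regularized multi-task learning, assuming each task's weight vector is a shared part plus a task-specific offset. The function names, the toy weights and the parameter `mu` are all hypothetical and not taken from the paper; the point is only that one stacked weight vector in the mapped space reproduces the per-task scores, so a single-task learner can be reused.

```python
import math


def multitask_feature_map(x, task, num_tasks, mu=1.0):
    """Map (x, task) to one long vector so a single weight vector serves all tasks.

    Layout: [x / sqrt(mu)] followed by one block of len(x) per task;
    only the block belonging to `task` holds x, the others stay zero.
    """
    d = len(x)
    shared = [value / math.sqrt(mu) for value in x]
    blocks = [0.0] * (d * num_tasks)
    blocks[task * d:(task + 1) * d] = x
    return shared + blocks


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


# Hypothetical toy weights: w0 is shared, v[t] is the offset for task t.
mu = 4.0
w0 = [1.0, -2.0]
v = [[0.5, 0.0], [0.0, 0.25], [-1.0, 1.0]]

# The single stacked weight vector that lives in the mapped space.
w = [math.sqrt(mu) * a for a in w0] + [a for vt in v for a in vt]

x = [3.0, 1.0]
task = 2
score_mapped = dot(w, multitask_feature_map(x, task, num_tasks=3, mu=mu))
score_direct = dot(w0, x) + dot(v[task], x)  # (w0 + v_t) . x
```

Both scores agree, which is the sense in which the multi-task problem "reduces" to the original one: the learner only ever sees one weight vector and one feature vector.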

2. A Generalisation of Lexical Functions for Composition in Distributional Semantics

Antoine Bride, Tim Van de Cruys, Nicholas Asher

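This paper also targets a hot topic, compositionality. It proposes a fairly general framework to cover composition, and focuses in particular on the compositional behavior of adjectives (adj) and nouns (noun).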

3. Simple Learning and Compositional Application of Perceptually Grounded Word Meanings for Incremental Reference Resolution

Casey Kennington and David Schlangen

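The talk for this paper was really, really cute! Three Tetris-like blocks in the lower right corner were the animated stars throughout. The content is fancy too! A grounded word meaning is a descriptive, factual modifier, for example "cross", "red", "block". The dataset is self-made and public. A very nice and fun paper.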

4. Learning to Adapt Credible Knowledge in Cross-lingual Sentiment Analysis

Qiang Chen, Wenjie Li, Yu Lei, Xule Liu, Yanxiang He

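In this work the authors use sentiment information to supervise translation between two languages. The intuitive assumption is that sentiment polarity should stay unchanged between source language and target language: a sentence cannot be positive before translation and negative after. They use knowledge validation to verify this over multiple rounds.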

5. Event-Driven Headline Generation

Rui Sun, Yue Zhang, Meishan Zhang, Donghong Ji

The paper very naturally uses event structure and information to trade off the strengths and weaknesses of extractive and abstractive methods. Its discussion of these two families is well written and worth a look (both Related Work and Background cover it).

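The idea is that event structure already contains highly informative material for summarization. An event is defined as a tuple. All event tuples are extracted first, and generation comes after. Both the event tuples and the generation step are clever. The clever part is that the event structure covers almost all of the NP and VP information used in another ACL'15 paper, "Abstractive Multi-Document Summarization via Phrase Selection and Merging" (see Section 3.1.1); better still, the second element of the event tuple, the predicate, can be used for deduplication. This exploits the tuple data structure: it takes the dependency parsing output and uses the NSUBJ and DOBJ relations to handle NP and VP.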

Section 3.1.3 follows the standard idea of graph-based summarization very naturally. Words and events form alignment pairs, and the classic move in that situation is: A should be more important if it occurs in more important B, and vice versa. So events are linked with words (in the lexical chains).
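The "A is important if it occurs in important B, and vice versa" idea can be sketched as a HITS-style iteration on a bipartite event-word graph. This is only an illustration of the general mutual reinforcement principle, not the paper's actual scoring formula; the function name, the toy edges and the normalization are my own assumptions.

```python
def mutual_reinforcement(edges, iterations=50):
    """HITS-style scoring on a bipartite event-word graph.

    edges: iterable of (event, word) pairs meaning the word occurs in the event.
    Returns (event_scores, word_scores), each normalized to sum to 1.
    """
    edges = list(edges)
    events = {event for event, _ in edges}
    words = {word for _, word in edges}
    event_score = {event: 1.0 / len(events) for event in events}
    word_score = {word: 1.0 / len(words) for word in words}

    for _ in range(iterations):
        # A word is important if it occurs in important events.
        new_word = {word: 0.0 for word in words}
        for event, word in edges:
            new_word[word] += event_score[event]
        total = sum(new_word.values())
        word_score = {word: s / total for word, s in new_word.items()}

        # An event is important if it contains important words.
        new_event = {event: 0.0 for event in events}
        for event, word in edges:
            new_event[event] += word_score[word]
        total = sum(new_event.values())
        event_score = {event: s / total for event, s in new_event.items()}

    return event_score, word_score


# Hypothetical toy graph: "storm" is shared by two events, "mayor" is isolated.
edges = [
    ("e1", "storm"), ("e1", "city"),
    ("e2", "storm"), ("e2", "flood"),
    ("e3", "mayor"),
]
event_score, word_score = mutual_reinforcement(edges)
```

On this toy graph the word shared by two events ends up ranked above words that appear in only one, and the isolated event falls behind the connected ones.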

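Up to this point you can see that the tuple structure of events does the heavy lifting. The authors realize this as well: they regard the tuple structure as a good tradeoff between extractive and abstractive, carrying more grammatical information than purely phrase-based abstractive methods while easing the sparsity problem of extractive ones (see the Introduction).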

1. How Far are We from Fully Automatic High Quality Grammatical Error Correction?

Christopher Bryant and Hwee Tou Ng

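The starting point is good: they use human evaluation to measure agreement. It turns out that even humans do not reach 90%, so we cannot demand that an automatic correction system should...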

2. Efficient Methods for Inferring Large Sparse Topic Hierarchies

Doug Downey, Chandra Bhagavatula, Yi Yang

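I think its selling point is still the hierarchy, and it appears to solve the efficiency problem of hierarchical models. Although this paper also uses a pre-defined topic/structure, it offers a form of expansion: an already-trained hierarchical model serves as a "seed" for learning a new one, which speeds things up. I also think this matches how cognition works.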

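Next, the two models this paper mainly attacks, and how it differs from them. The differences show why it is fast. LDA, the most widely used topic model, is simple and effective, no question. But the biggest problem of LDA and its variants is the probabilistic assumption: topics must be independent of each other (the problem is not the sum-to-one assumption). The consequence of this independence is that when data is insufficient and the number of topics is set high, the model learns many very general, nonsensical topics, which in Chinese would look like "我，的，我们，一个，一个人，生活" (roughly "I, of, we, a, a person, life"). This is also why LDA is not hierarchical (hierarchical LDA does not break this assumption either). So the first key difference is that PAM and this paper's SBT both break the assumption and can model correlations between topics. What then separates SBT from PAM? The tree prior with the complicated, fancy name. As I see it, the motivation for this prior is to assume the hierarchy already at the prior stage, so that sampling can be "recursive". In detail, this makes topics more coherent during sampling: sampling does not wander, and is more inclined to draw related topics.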

3. Jointly optimizing word representations for lexical and sentential tasks with the C-PHRASE model

Nghia The Pham, Germán Kruszewski, Angeliki Lazaridou, Marco Baroni

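A model adapted from CBOW. The authors' starting point: since CBOW predicts the center word from the combination of words in the context (n-gram), there should be a way to make the context no longer a plain surface combination but a combination that follows linguistic rules and syntax.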

4. Co-training for Semi-supervised Sentiment Classification Based on Dual-view Bags-of-words Representation

Rui Xia, Cheng Wang, Xin-Yu Dai, Tao Li

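The starting point here is fun: manufacture your own counterexamples! In sentiment tasks, data sparsity can mean that positive or negative sentiment words never appear in the training instances, and self-made counterexamples can reduce this sparsity. Concretely, lexical rules are used to match sentiment words, and then the sentiment label is flipped, 0 to 1 and 1 to 0, giving the corresponding negative example.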
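A minimal sketch of building such label-flipped examples, assuming the lexical rule is a plain antonym lookup. The `ANTONYMS` dictionary and both function names are hypothetical stand-ins; real rules would also need to handle negation and words missing from the lexicon.

```python
# Hypothetical antonym lexicon standing in for the paper's lexical rules.
ANTONYMS = {"good": "bad", "bad": "good", "love": "hate", "hate": "love"}


def make_reversed_sample(tokens, label):
    """Swap matched sentiment words for their antonyms and flip the 0/1 label.

    Returns None when no sentiment word matches, since flipping the label
    of an unchanged text would only add noise.
    """
    flipped = [ANTONYMS.get(token, token) for token in tokens]
    if flipped == tokens:
        return None
    return flipped, 1 - label


def dual_views(dataset):
    """Split a labeled dataset into an original view and a reversed view."""
    original_view, reversed_view = [], []
    for tokens, label in dataset:
        original_view.append((tokens, label))
        reversed_sample = make_reversed_sample(tokens, label)
        if reversed_sample is not None:
            reversed_view.append(reversed_sample)
    return original_view, reversed_view


dataset = [(["i", "love", "it"], 1), (["the", "plot"], 1), (["bad", "acting"], 0)]
original_view, reversed_view = dual_views(dataset)
```

The two views can then be handed to two classifiers, which is the setup co-training needs.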

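The original examples and the reversed examples then go into two separate views, which gives co-training. When I talked with the authors, Rui Xia said he believes this method only works for problems like sentiment, where the label can be reversed.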

5. A Hierarchical Neural Autoencoder for Paragraphs and Documents

Jiwei Li, Thang Luong, Dan Jurafsky

The authors verify that making LSTMs hierarchical is feasible and give several intuitive designs. The third is an LSTM with partial alignment based on an attention mechanism. The hierarchical LSTM can represent text at several levels: sentence, paragraph, document.

6. A Re-ranking Model for Dependency Parser with Recursive Convolutional Neural Network

Chenxi Zhu, Xipeng Qiu, Xinchi Chen, Xuanjing Huang

The biggest contribution of this paper is improving on Socher's earlier method of modeling compositional relations with the original RNN. It is no longer limited to binary composition; it can handle three or even more units. See the paragraph at the start of Section 4; it comes down to constituent parsing vs. dependency parsing. In short: a variant of RNN to handle more-than-two units of composition.

Also, the distance embedding in Section 3.1 uses relative positions in the range [-2,2] directly as a feature and concatenates it onto the embedding vector (see Equ. 4). The method comes from the COLING 2014 best paper, "Relation Classification via Convolutional Deep Neural Network".
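A small sketch of the concatenation step, assuming offsets outside [-2, 2] are clipped to the boundary and each offset has its own small vector. The lookup table, its values and the function names are hypothetical; the paper's Equ. 4 may differ in detail.

```python
def clip_distance(offset, max_dist=2):
    """Clip a signed relative position into [-max_dist, max_dist]."""
    return max(-max_dist, min(max_dist, offset))


def with_distance_feature(word_vec, word_index, head_index, distance_table, max_dist=2):
    """Concatenate the distance embedding for (word - head) onto the word vector."""
    offset = clip_distance(word_index - head_index, max_dist)
    return list(word_vec) + list(distance_table[offset])


# Hypothetical 1-dimensional distance embeddings for offsets -2..2.
distance_table = {-2: [-1.0], -1: [-0.5], 0: [0.0], 1: [0.5], 2: [1.0]}

vec = with_distance_feature(
    [0.1, 0.2, 0.3], word_index=5, head_index=1, distance_table=distance_table
)
```

Here the true offset is 4, which is clipped to 2 before the lookup, so the result is the three-dimensional word vector with one extra distance dimension appended.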

7. Cross-lingual Dependency Parsing Based on Distributed Representations

Jiang Guo, Wanxiang Che, David Yarowsky, Haifeng Wang, Ting Liu

The authors use bilingual correspondence information, injecting it into a conventional NN-based dependency parser via alignment and via CCA. The alignment method allows one-to-many alignments, while CCA is one-to-one only.

8. A Unified Multilingual Semantic Representation of Concepts

José Camacho-Collados, Mohammad Taher Pilehvar, Roberto Navigli

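The authors have been honing their craft on word semantic representation / word sense disambiguation for years: http://wwwusers.di.uniroma1.it/~navigli/pubs_by_cat.html. Even this year they have related work at WWW, TACL and NAACL. The 2013 predecessor of this work was also nominated as an ACL best paper candidate.

First, a few works related to this paper: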

Socher 2013a, Bilingual Word Embeddings for Phrase-Based Machine Translation,

Guo 2014, Learning Sense-specific Word Embeddings By Exploiting Bilingual Resources

NAACL 2015, Deep Multilingual Correlation for Improved Word Embeddings

NAACL 2015 (same author as this paper), Simple task-specific bilingual word embeddings

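Socher 2013a was probably the first (I am not certain) to propose mapping two languages into the same space, that is, learning a shared word embedding space. The idea was later carried over to text/image pairs and all sorts of other settings. The results were quite good; for a brief introduction see this blog post by the prodigy of the day: http://colah.github.io/posts/2014-07-NLP-RNNs-Representations/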

Later work actually splits into two directions. Using multi-lingual (multi-resource) data, we can (1) strengthen word representations, and (2) obtain finer-grained concept representations (disambiguation).

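On (1), besides Socher 2013a, the NAACL'15 "Deep" paper also uses two languages to improve representations. It builds on CCA/DCCA, feeding MT pairs in as the CCA/DCCA input (this is the same CCA as in the notes on the model-based ACL'15 paper and the NIPS work). The paper's main claim is that DCCA, as a nonlinear subspace transformation, is superior to CCA, which is a linear transformation.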

On (2), take Guo 2014: Table 1 of the paper makes it clear at a glance. Based on an MT alignment model, it step by step removes/selects the desired clusters, splitting a polysemous word into multiple clusters.

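Now for this paper, A Unified Multilingual Semantic Representation of Concepts. It also serves (2): it learns concept embeddings, where each sense of a polysemous word is treated as a concept. What sets it apart? It uses not only multilingual data but also external information, namely Wikipedia. The neat trick is that, rather than picking translation pairs, it uses a "naturally occurring" multilingual synset database, Babelnet (http://babelnet.org/), which claims to integrate WordNet, Wikipedia and more, so that every concept in it has synset words in many languages. That gives them a starting point! They take these synset words and concepts and then, following certain rules, crawl Wikipedia to enlarge their semantic corpus.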

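The work is very linguistic, but one thing is quite interesting (apart from Babelnet): the similarity metric used in Section 3.1. It is not the usual cosine or Hamming but square-rooted Weighted Overlap (WO), which my ill-informed self had never heard of, orz. They say it has been shown to beat the traditional cosine. Given this WO metric (for vector representations of words), the similarity between two words still needs one more conversion (Equation 3).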
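For readers who, like me, had not met Weighted Overlap before, here is a sketch of the general form as I understand it: dimensions shared by two sparse vectors are compared by rank rather than by raw weight, and the sum is normalized by its best possible value. How ranks are assigned (here, over each vector's own dimensions, ties broken by name) is my assumption, so check the paper's Section 3.1 before relying on the details.

```python
import math


def weighted_overlap(v1, v2):
    """Square-rooted Weighted Overlap between two sparse vectors.

    v1, v2: dicts mapping dimension name -> weight.
    Shared dimensions that rank high in both vectors contribute the most.
    """
    overlap = set(v1) & set(v2)
    if not overlap:
        return 0.0

    def ranks(vector):
        # Rank 1 is the heaviest dimension; ties are broken by name.
        ordered = sorted(vector, key=lambda dim: (-vector[dim], dim))
        return {dim: position + 1 for position, dim in enumerate(ordered)}

    r1, r2 = ranks(v1), ranks(v2)
    numerator = sum(1.0 / (r1[dim] + r2[dim]) for dim in overlap)
    # Best case: the shared dimensions occupy ranks 1..|overlap| in both vectors.
    denominator = sum(1.0 / (2 * i) for i in range(1, len(overlap) + 1))
    return math.sqrt(numerator / denominator)


a = {"x": 3.0, "y": 2.0, "z": 1.0}
b = {"z": 5.0, "x": 1.0}
same = weighted_overlap(a, a)
partial = weighted_overlap(a, b)
disjoint = weighted_overlap(a, {"q": 1.0})
```

Identical vectors score 1, vectors with no shared dimension score 0, and partially overlapping vectors fall in between. Unlike cosine, only the ordering of weights matters, not their magnitudes.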

1. Dependency-based Convolutional Neural Networks for Sentence Embedding

Mingbo Ma, Liang Huang, Bowen Zhou, Bing Xiang

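A paper with Liang Huang as second author, presented by the first-author student. The talk was extremely clear: fast, forceful delivery, with slide visualizations that helped understanding. The idea is very straightforward: instead of a plain sequential Convolutional NN, convolve along dependency relations. It is a bit like injecting dependency relation information when modifying CBOW/Skip-Gram.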

2. A Unified Learning Framework of Skip-Grams and Global Vectors

Jun Suzuki and Masaaki Nagata

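A paper that aims, from a mathematical (Machine Learning) angle, to bring SkipGram (with negative sampling, SGNS) and GloVe under one framework. The contentious point is that the formulas they use for the two models omit the bias terms, so to some extent this is not a fully exact unification.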

3. Distributional Neural Networks for Automatic Resolution of Crossword Puzzles

Aliaksei Severyn, Massimo Nicosia, Gianni Barlacchi, Alessandro Moschitti

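A fun task: crossword puzzles. The authors also released the dataset. During the presentation they played a little game, giving four clues and asking the audience to guess a word; the answer turned out to be Tux the little penguin. Crossword puzzles are in fact not as simple as one might think. One distinctive aspect of their model is that after computing the similarity of two input units, input unit x, input unit y, the similarity and the other features are all embedded together in the same layer.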

4. A Dependency-Based Neural Network for Relation Classification

Yang Liu, Furu Wei, Sujian Li, Heng Ji, Ming Zhou, Houfeng WANG

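The paper makes two contributions. First, it proposes a new dependency-related path, the ADP (Augmented dependency path). The ADP contains not only the dependency shortest path classic in relation classification but also the subtrees attached to the path. Second, based on the ADP, it adapts a Recursive NN into a model called DepNN.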

1. Machine Comprehension with Discourse Relations

Karthik Narasimhan and Regina Barzilay

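From MIT CSAIL. Open source. A very neat paper, and not NN-based. Its selling point is discourse information + less human annotation. Their model uses discourse relations (relations between sentences, learned, not annotated) to improve machine comprehension performance. Concretely, they first use parsing and similar methods to pick the single sentence most relevant to the question (Model 1) or several sentences (Model 2 and Model 3), establish relations along the way, and finally predict. It is all the simplest discriminative-model thinking: find the hidden variables and multiply the probabilities. If the paper interests you, I recommend Section 3.1, which discusses the four classes of features they consider potentially relevant to this task.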

2. Model-based Word Embeddings from Decompositions of Count Matrices

Karl Stratos, Michael Collins, Daniel Hsu

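First, I recommend this paper to anyone interested in word embeddings or low-dimensional lexical representations. It mainly tries to understand word embeddings mathematically, and to propose a template that meets our embedding goal (which is really just dimensionality reduction).

If a method can reduce the estimation error found in, say, negative-sampling-derived word embeddings (that is, make the estimate more accurate, though it is still an estimate), it can improve word embedding performance.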

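So the paper starts from CCA (used to solve the Pearson ranking in word similarity evaluation), stressing that CCA can optimize two vectors so that they are maximally correlated (isn't that exactly the assumption of context-based models? The famous quote: "You shall know a word by the company it keeps"). It then takes the central word and its surrounding context words in the corpus as two such vectors (really vector pairs: with center word c and window size K there are 2K pairs of vectors) and puts them into the CCA optimization. This is obviously computationally expensive, so through various lemmas and observations it moves to an approximate solution (valid when data is plentiful). The approximate solution links up with using CCA for parameter estimation, i.e. spectral estimation, which leads to the proposed spectral template for word embeddings. Existing decompositions for word embeddings (such as Levy's PPMI) are also "folded" into this template (Section 5, Figure 2). Then come the experiments. So I think that, from a different mathematical angle, they optimize the whole word embedding business in terms of estimation error (directly taking negative-sampling-derived word embeddings as the target rather than trying to explain them), which counts as a further step.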
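For reference, this is the textbook form of CCA rather than the paper's own derivation. Given random vectors X and Y with covariance blocks C_XX, C_YY and cross-covariance C_XY, CCA looks for projections a and b whose images are maximally correlated, and the solution comes from an SVD of the whitened cross-covariance matrix:

```latex
(a^\star, b^\star)
  = \arg\max_{a,\,b} \operatorname{Corr}\!\left(a^\top X,\; b^\top Y\right)
  = \arg\max_{a,\,b}
    \frac{a^\top C_{XY}\, b}
         {\sqrt{a^\top C_{XX}\, a}\;\sqrt{b^\top C_{YY}\, b}}

\Omega = C_{XX}^{-1/2}\, C_{XY}\, C_{YY}^{-1/2}
```

The leading singular vectors of \Omega, mapped back through C_{XX}^{-1/2} and C_{YY}^{-1/2}, give the projections. If X and Y are taken to be one-hot indicators of the word and of a context word (my reading of the setup, and ignoring mean-centering), the covariance blocks become diagonal count matrices, so each entry of \Omega reduces to a co-occurrence count divided by the square root of the product of the two marginal counts. That is where the link to count-matrix decompositions comes from.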

Since this ACL'15 paper also cites the NIPS'11 one, let me first paste its own comment from where it cites it:

Dhillon et al. (2011) and (2012) propose novel modifications of CCA (LRMVL and two-step CCA) to derive word embeddings, but do not establish any explicit connection to learning HMM parameters or justify the squareroot transformation.

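Having read the papers, I still find this remark quite fair. Below I compare the two papers:

1. First, the ACL'15 paper covers more than NIPS'11, so the comparison below only concerns the parts that continue the NIPS'11 work.

2. In NIPS'11, the so-called Multi-View is really the left contexts L, the right contexts R, and the current target word W: three views. On top of that there are the previous and future views (the hidden states in the HMM), which the authors do not emphasize much. It helps to think of two parts: L, R and W jointly account for context information, no question; the previous and future views exploit the HMM state assumption (during learning, the state is iterated roughly 5-7 times).

3. Bringing the HMM assumption into word representation, as NIPS'11 does, is nothing new. But I think the hidden state assumed and learned in the HMM differs from our word embedding: both are low-rank/low-dimensional representations, but the hidden state can be further used to learn context-specific word embeddings. In other words, the word embedding is a result, a projected result, while the hidden state is a learning method, a projection. (This is only my understanding.)

4. So in practice NIPS'11 first uses CCA to learn a reduced-dimension A from L and R under the hidden state assumption, then uses this A in a second CCA against W: two steps, two CCAs. The authors discuss that with an infinite corpus this is equivalent to a single CCA, but when the corpus follows Zipf's law, the two-step procedure is more accurate.

5. As for the ACL'15 paper, one can say ACL'15 = NIPS'11 + Stratos (2014) + strict condition (square-root transformation). That is, under the strict condition, it applies Stratos (2014) to NIPS'11, and thereby achieves what it said was missing: "establish any explicit connection to learning HMM parameters or justify the squareroot transformation". This is the content of Section 4 of the ACL'15 paper.

6. Naturally, the entry point and line of argument therefore differ. NIPS'11 tells the reader that CCA can learn low-rank representations, and what assumptions and tricks are needed to get there. ACL'15 says: we know CCA can do that, but CCA can also be understood as parameter estimation for an HMM (opening of Section 4.1). What does parameter estimation mean here? From the estimation viewpoint we only need to find a matrix O, but this O should ideally have two properties, and satisfying them takes two extra tricks.

7. A concrete example: in NIPS'11, exponential smoothing serves the low-rank representation of L and R and is introduced naturally, from a smoothing viewpoint; in ACL'15, exponential smoothing is introduced via an explicit proof that it is needed for O to have the required properties.

8. NIPS'11 is convex, solved directly, with no local optimum problem; ACL'15 is non-convex (because the Stratos 2014 work is non-convex), so it is a bit more troublesome.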

3. Entity Hierarchy Embedding

Zhiting Hu, Poyao Huang, Yuntian Deng, Yingkai Gao, Eric Xing

4. The Users Who Say 'Ni': Audience Identification in Chinese-language Restaurant Reviews

Rob Voigt and Dan Jurafsky

5. PPDB 2.0: Better paraphrase ranking, fine-grained entailment relations, word embeddings, and style classification

Ellie Pavlick, Pushpendre Rastogi, Juri Ganitkevitch, Benjamin Van Durme, Chris Callison-Burch

Recommended only because the poster had so much personality...

6. Non-distributional Word Vector Representations

Manaal Faruqui and Chris Dyer

7. A Hierarchical Knowledge Representation for Expert Finding on Social Media

Yanran Li, Wenjie Li, Sujian Li

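Using a hierarchical model, the authors represent all posts of each Sina Weibo user as that user's hierarchical knowledge structure, and compare it with the knowledge structures of experts in different domains to judge whether the user is an expert in some domain. Concretely, the knowledge structure is built with the Pachinko Allocation Model; unlike LDA, this model relaxes LDA's assumption that topics are independent, which allows hierarchical modeling. For structure matching, an approximate tree matching algorithm based on tree edit distance is adapted to incorporate semantic matching with word embeddings, which improves results.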

8. Learning Summary Prior Representation for Extractive Summarization

Ziqiang Cao, Furu Wei, Sujian Li, Wenjie Li, Ming Zhou, Houfeng WANG

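The traditional framework has two steps: first sentence ranking, then sentence selection using the ranking scores. Both steps are largely feature-based, so most past work focused on feature engineering, each showing off its own tricks. This paper applies a CNN in the ranking step and improves results.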

1. Improved Semantic Representations From Tree-Structured Long Short-Term Memory Networks

Kai Sheng Tai, Richard Socher, Christopher D. Manning

The idea is simple: just as Liang Huang's group, discussed earlier, turned a sequential CNN into a CNN over dependency relations, this paper turns the sequential LSTM into a Tree-Structured LSTM.

2. genCNN: A Convolutional Architecture for Word Sequence Prediction

Mingxuan Wang, Zhengdong Lu, Hang Li, Wenbin Jiang, Qun Liu

This paper basically uses several CNNs to emulate an RNN, adding shared weight / no shared weight (two feature maps). The results are good.

3. Abstractive Multi-Document Summarization via Phrase Selection and Merging

Lidong Bing, Piji Li, Yi Liao, Wai Lam, Weiwei Guo, Rebecca Passonneau

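Their main idea is to build abstractive summarization on extracting and combining phrases. The basic unit is the phrase. They make two observations: NP phrases mainly express concepts, and VP phrases mainly express facts. So the work focuses only on extracting these two kinds of phrase and builds abstractive summarization on them. The framework has three parts: phrase extraction; phrase salience scoring and sentence generation as an optimization problem (simultaneously); and postprocessing. I find it quite intuitive. All scoring and selection operate on the phrase as the unit, and sentence generation is then treated as an optimization problem. All three parts contain many heuristics, but it does not come across as dirty. The second part of the evaluation uses the five DUC aspects: grammaticality, non-redundancy, referential clarity, focus and coherence. I do not know whether this is already "standard". Finally, I think the introduction is well written, though singling out compression-based methods from extractive ones as a second category may be a little unconventional.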

4. Deep Unordered Composition Rivals Syntactic Methods for Text Classification

Mohit Iyyer, Varun Manjunatha, Jordan Boyd-Graber, Hal Daumé III

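The idea is very, very, very simple (a bit like the HRM architecture from SIGIR'15): the deep averaging network, DAN. What is DAN for? Their argument goes: you ReNN people (RecNN is what the authors call it, though I seem to remember Socher calling it ReNN), i.e. recursive NN, can handle very complex syntactic + ordered composition, including syntactic phenomena such as negation. But so what? The model is too complex: to improve accuracy you add a classifier at every ReNN node for supervision, and every node has its own computation, so training is too slow. Could this be a sledgehammer to crack a nut?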

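So the authors built this simple but useful architecture. Each input sentence is split into words, and the input units are the word embeddings. These are simply averaged; the authors note that in earlier work most people found averaging to work better than summing. That is the simple neural bag of words, NBOW. Then it goes deep, since the idea of a deep FFNN is that each extra layer is more abstract. Experiments show that this deep averaging (DAN) really performs almost the same as ReNN, while training speed is almost the same as single-layer NBOW. The task is simple, text classification, but the analysis after the experiments is very good. If you are interested, just look at the architecture Figure and Section 5.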
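A minimal forward-pass sketch of the averaging-then-deep idea, in plain Python. The ReLU nonlinearity, the toy embeddings and weights, and the function names are all my own assumptions for illustration; training and any regularization the paper uses are left out.

```python
import math


def average(vectors):
    """Element-wise mean of equally sized vectors (the NBOW step)."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def dense(x, weights, bias):
    """One fully connected layer: weights is a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def relu(x):
    return [max(0.0, v) for v in x]


def softmax(x):
    m = max(x)
    exps = [math.exp(v - m) for v in x]
    total = sum(exps)
    return [e / total for e in exps]


def dan_forward(tokens, embeddings, hidden_layers, output_layer):
    """Average the word embeddings, pass through hidden layers, then classify."""
    h = average([embeddings[token] for token in tokens])
    for weights, bias in hidden_layers:
        h = relu(dense(h, weights, bias))
    weights, bias = output_layer
    return softmax(dense(h, weights, bias))


# Hypothetical toy parameters: 2-d embeddings, one hidden layer, two classes.
embeddings = {"good": [1.0, 0.0], "bad": [0.0, 1.0], "movie": [0.5, 0.5]}
hidden_layers = [([[1.0, -1.0], [-1.0, 1.0]], [0.0, 0.0])]
output_layer = ([[2.0, 0.0], [0.0, 2.0]], [0.0, 0.0])

probs = dan_forward(["good", "movie"], embeddings, hidden_layers, output_layer)
probs_shuffled = dan_forward(["movie", "good"], embeddings, hidden_layers, output_layer)
```

Because the first step is an average, word order has no effect on the output, which is exactly the "unordered" part of the title; all the modeling capacity sits in the layers stacked on top.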

This year's Best Student Paper went to Sascha Rothe of LMU Munich and his advisor Hinrich Schütze, for "AutoExtend: Extending Word Embeddings to Embeddings for Synsets and Lexemes".

The paper does not read very smoothly either, mainly because there are a lot of concepts. Let me reorganize:

The paper studies three data types: word, synset and lexeme. All three are common in lexical resources such as WordNet, Freebase and Wiktionary. The authors use the relations among them in such resources as constraints to learn word embeddings, synset embeddings and lexeme embeddings together in one space. Moreover, starting from any existing word embeddings and any existing resource, it obtains synset and lexeme embeddings without any extra training corpus.

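First, the three data types:

word: needs no explanation. synset: a set of synonyms, made up of lexemes belonging to different words. lexeme: I do not know the Chinese term; it carries both the sense of one word having several meanings and of one word having several forms (syntactic). See the second paragraph of Section 2 for examples.

Based on the three data types, the authors give two motivations, two observations and two assumptions (all the same thing):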

A word in WordNet can be viewed as a composition of several lexemes. Lexemes from different words together can form a synset. When a synset is given, it can be decomposed into its lexemes. And these lexemes then join to form words. These observations are the basis for the formalization of the constraints encoded in WordNet that will be presented in the next section: we view words as the sum of their lexemes and, analogously, synsets as the sum of their lexemes.

This can then be used as constraints, namely Equations (1) and (2), which is also the main order of the architecture in Figure 1: word->lexeme->synset->lexeme->word.
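Written out, the two sum constraints quoted above look roughly like this. The notation is my own and may not match the paper's Equations (1) and (2) exactly: w^{(i)} is the embedding of word i, s^{(j)} the embedding of synset j, and l^{(i,j)} the embedding of the lexeme linking word i to synset j, taken to be zero when word i does not belong to synset j.

```latex
w^{(i)} = \sum_{j} l^{(i,j)}
\qquad\qquad
s^{(j)} = \sum_{i} l^{(i,j)}
```

Reading the first equation left to right decomposes a word into its lexemes, and the second recombines lexemes into synsets. Running the same two steps in reverse gives the second half of the word->lexeme->synset->lexeme->word chain.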

Besides these two motivations and two constraints, the authors have a third motivation and a third constraint:

From Section 1, which argues:

The next thing to notice is that this does not only work for words that combine several properties, but also for words that combine several senses. The vector of suit can be seen as the sum of a vector representing lawsuit and a vector representing business suit. AutoExtend is designed to take word vectors as input and unravel the word vectors to the vectors of their lexemes. The lexeme vectors will then give us the synset vectors

The third constraint is based on properties of the resources; it is in Section 2.4 and handles the case where a word has no synset.
