Repost: Research Directions in Natural Language Processing

Sebastian Ruder

6 Mar, 2018


Research directions at AYLIEN in NLP and transfer learning


In a recent blog post I outlined some interesting research directions for people who are just getting into NLP and ML (you can read the original post here). These ideas can also serve as a starting point and source of inspiration if you are considering starting a PhD in this field, which is why I’m highlighting them again here. At AYLIEN, we offer the opportunity for people to join our research team and complete a PhD, while also working on the application of that research directly as part of our products. If this sounds interesting, you should consider applying to work with us on these (or related) ideas. See this blog post for more information.

Table of contents:

Task-independent data augmentation for NLP

Few-shot learning for NLP

Transfer learning for NLP

Multi-task learning

Cross-lingual learning

Task-independent architecture improvements

It can be hard to find compelling topics to work on and know what questions are interesting to ask when you are just starting as a researcher in a new field. Machine learning research in particular moves so fast these days that it is difficult to find an opening.

This post aims to provide inspiration and ideas for research directions to junior researchers and those trying to get into research. It gathers a collection of research topics that are interesting to me, with a focus on NLP and transfer learning. As such, they might obviously not be of interest to everyone. If you are interested in Reinforcement Learning, OpenAI provides a selection of interesting RL-focused research topics. In case you’d like to collaborate with others or are interested in a broader range of topics, have a look at the Artificial Intelligence Open Network.

Most of these topics are not thoroughly thought out yet; in many cases, the general description is quite vague and subjective and many directions are possible. In addition, most of these are not low-hanging fruit, so serious effort is necessary to come up with a solution. I am happy to provide feedback with regard to any of these, but will not have time to provide more detailed guidance unless you have a working proof-of-concept. I will update this post periodically with new research directions and advances in already listed ones. Note that this collection does not attempt to review the extensive literature but only aims to give a glimpse of a topic; consequently, the references won’t be comprehensive.

I hope that this collection will pique your interest and serve as inspiration for your own research agenda.

Task-independent data augmentation for NLP

Data augmentation aims to create additional training data by producing variations of existing training examples through transformations, which can mirror those encountered in the real world. In Computer Vision (CV), common augmentation techniques are mirroring, random cropping, shearing, etc. Data augmentation is super useful in CV. For instance, it has been used to great effect in AlexNet (Krizhevsky et al., 2012) [1] to combat overfitting and in most state-of-the-art models since. In addition, data augmentation makes intuitive sense as it makes the training data more diverse and should thus increase a model’s generalization ability.
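To make the CV transformations mentioned above concrete, here is a minimal sketch using torchvision; it is purely illustrative and not taken from the original post or the cited papers.

```python
# Minimal sketch of common CV augmentations (mirroring, random cropping, shearing),
# using torchvision.transforms purely for illustration.
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),        # mirroring
    transforms.RandomResizedCrop(size=224),        # random cropping
    transforms.RandomAffine(degrees=0, shear=10),  # shearing
    transforms.ToTensor(),
])

# augmented = augment(pil_image)  # apply to a PIL image during training
```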

However, in NLP, data augmentation is not widely used. In my mind, this is for two reasons:

Data in NLP is discrete. This prevents us from applying simple transformations directly to the input data. Most recently proposed augmentation methods in CV focus on such transformations, e.g. domain randomization (Tobin et al., 2017) [2].

Small perturbations may change the meaning. Deleting a negation may change a sentence’s sentiment, while modifying a word in a paragraph might inadvertently change the answer to a question about that paragraph. This is not the case in CV where perturbing individual pixels does not change whether an image is a cat or dog and even stark changes such as interpolation of different images can be useful (Zhang et al., 2017) [3].
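The interpolation idea from Zhang et al. (2017), mixup, is simple enough to show in a few lines; the sketch below (NumPy, illustration only) mixes pairs of images and their one-hot labels, which has no obvious analogue for discrete text.

```python
# A compact sketch of mixup (Zhang et al., 2017): interpolate pairs of images
# and their one-hot labels. NumPy, for illustration only.
import numpy as np

def mixup(x1, y1, x2, y2, alpha=0.2):
    """Return a convex combination of two training examples."""
    lam = np.random.beta(alpha, alpha)
    x = lam * x1 + (1.0 - lam) * x2
    y = lam * y1 + (1.0 - lam) * y2   # labels must be one-hot / soft labels
    return x, y
```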

Existing approaches that I am aware of are either rule-based (Li et al., 2017) [5] or task-specific, e.g. for parsing (Wang and Eisner, 2016) [6] or zero-pronoun resolution (Liu et al., 2017) [7]. Xie et al. (2017) [39] replace words with samples from different distributions for language modelling and Machine Translation. Recent work focuses on creating adversarial examples either by replacing words or characters (Samanta and Mehta, 2017; Ebrahimi et al., 2017) [8, 9], concatenation (Jia and Liang, 2017) [11], or adding adversarial perturbations (Yasunaga et al., 2017) [10]. An adversarial setup is also used by Li et al. (2017) [16] who train a system to produce sequences that are indistinguishable from human-generated dialogue utterances.
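In the spirit of the replacement-based methods above, a naive rule-based augmenter might swap words for WordNet synonyms. The sketch below (NLTK, illustration only, not any of the cited systems) also shows why this is risky: nothing guarantees that the replacement preserves meaning.

```python
# A naive rule-based augmenter that replaces words with WordNet synonyms.
# Only a sketch; it can easily violate meaning preservation.
import random
from nltk.corpus import wordnet  # requires: nltk.download('wordnet')

def synonym_replace(tokens, p=0.1):
    out = []
    for tok in tokens:
        lemmas = {l.name().replace('_', ' ')
                  for s in wordnet.synsets(tok) for l in s.lemmas()}
        lemmas.discard(tok)
        if lemmas and random.random() < p:
            out.append(random.choice(sorted(lemmas)))
        else:
            out.append(tok)
    return out

# synonym_replace("the movie was really good".split())
```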

Back-translation (Sennrich et al., 2015; Sennrich et al., 2016) [12, 13] is a common data augmentation method in Machine Translation (MT) that allows us to incorporate monolingual training data. For instance, when training an EN→FR system, monolingual French text is translated to English using an FR→EN system; the synthetic parallel data can then be used for training. Back-translation can also be used for paraphrasing (Mallinson et al., 2017) [14]. Paraphrasing has been used for data augmentation for QA (Dong et al., 2017) [15], but I am not aware of its use for other tasks.
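The paraphrasing use of back-translation amounts to a round trip through a pivot language. The sketch below uses Hugging Face MarianMT pipelines purely for illustration; these tools postdate this post and are not what the cited papers used.

```python
# Back-translation as paraphrase-style augmentation: EN -> FR -> EN.
# Illustrative only; any pair of translation systems would do.
from transformers import pipeline

en_fr = pipeline("translation", model="Helsinki-NLP/opus-mt-en-fr")
fr_en = pipeline("translation", model="Helsinki-NLP/opus-mt-fr-en")

def back_translate(sentence):
    french = en_fr(sentence)[0]["translation_text"]
    return fr_en(french)[0]["translation_text"]

# back_translate("The film was surprisingly good.")  # hopefully a paraphrase
```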

Another method that is close to paraphrasing is generating sentences from a continuous space using a variational autoencoder (Bowman et al., 2016; Guu et al., 2017) [17, 19]. If the representations are disentangled as in (Hu et al., 2017) [18], then we are also not too far from style transfer (Shen et al., 2017) [20].

There are a few research directions that would be interesting to pursue:

Evaluation study: Evaluate a range of existing data augmentation methods as well as techniques that have not been widely used for augmentation such as paraphrasing and style transfer on a diverse range of tasks including text classification and sequence labelling. Identify what types of data augmentation are robust across tasks and which are task-specific. This could be packaged as a software library to make future benchmarking easier (think CleverHans for NLP).

Data augmentation with style transfer: Investigate if style transfer can be used to modify various attributes of training examples for more robust learning.

Learn the augmentation: Similar to Dong et al. (2017) we could learn either to paraphrase or to generate transformations for a particular task.

Learn a word embedding space for data augmentation: A typical word embedding space clusters synonyms and antonyms together; using nearest neighbours in this space for replacement is thus infeasible (see the sketch after this list). Inspired by recent work (Mrkšić et al., 2017) [21], we could specialize the word embedding space to make it more suitable for data augmentation.

Adversarial data augmentation: Related to recent work in interpretability (Ribeiro et al., 2016) [22], we could change the most salient words in an example, i.e. those that a model depends on for a prediction. This still requires a semantics-preserving replacement method, however.
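To illustrate the nearest-neighbour problem mentioned above, here is a small gensim sketch; the vector file path is a placeholder and the behaviour depends on the embeddings used.

```python
# Why naive embedding-space replacement fails: in a standard word2vec/GloVe
# space, antonyms are often among the nearest neighbours of a word.
from gensim.models import KeyedVectors

kv = KeyedVectors.load_word2vec_format("vectors.bin", binary=True)  # placeholder path
print(kv.most_similar("good", topn=5))
# Typically returns words like "great" but often also "bad" -- swapping in a
# nearest neighbour can flip sentiment, which is why a specialised space
# would be needed for augmentation.
```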

Few-shot learning for NLP

Zero-shot, one-shot and few-shot learning are among the most interesting recent research directions IMO. Following the key insight from Vinyals et al. (2016) [4] that a few-shot learning model should be explicitly trained to perform few-shot learning, we have seen several recent advances (Ravi and Larochelle, 2017; Snell et al., 2017) [23, 24].
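To make the episodic-training idea concrete, here is a minimal sketch in the style of Prototypical Networks (Snell et al., 2017); `encoder` is an assumed text encoder that maps a batch of examples to d-dimensional vectors, and the sketch is not the authors' implementation.

```python
# Episodic few-shot classification: class prototypes are the means of the
# embedded support examples; queries are classified by distance to prototypes.
import torch

def prototypical_loss(encoder, support_x, support_y, query_x, query_y, n_classes):
    s = encoder(support_x)                     # [n_support, d]
    q = encoder(query_x)                       # [n_query, d]
    protos = torch.stack([s[support_y == c].mean(dim=0) for c in range(n_classes)])
    dists = torch.cdist(q, protos) ** 2        # [n_query, n_classes]
    return torch.nn.functional.cross_entropy(-dists, query_y)
```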

Learning from few labeled samples is one of the hardest problems IMO and one of the core capabilities that separates the current generation of ML models from more generally applicable systems. Zero-shot learning has only been investigated in the context of learning word embeddings for unknown words AFAIK. Dataless classification (Song and Roth, 2014; Song et al., 2016) [2526] is an interesting related direction that embeds labels and documents in a joint space, but requires interpretable labels with good descriptions.

Potential research directions are the following:

Standardized benchmarks: Create standardized benchmarks for few-shot learning for NLP. Vinyals et al. (2016) introduce a one-shot language modelling task for the Penn Treebank. The task, while useful, is dwarfed by the extensive evaluation on CV benchmarks and has not seen much use AFAIK. A few-shot learning benchmark for NLP should contain a large number of classes and provide a standardized split for reproducibility. Good candidate tasks would be topic classification or fine-grained entity recognition.

Evaluation study: After creating such a benchmark, the next step would be to evaluate how well existing few-shot learning models from CV perform for NLP.

Novel methods for NLP: Given a dataset for benchmarking and an empirical evaluation study, we could then start developing novel methods that can perform few-shot learning for NLP.

Transfer learning for NLP

Transfer learning has had a large impact on computer vision (CV) and has greatly lowered the entry threshold for people wanting to apply CV algorithms to their own problems. CV practitioners are no longer required to perform extensive feature-engineering for every new task, but can simply fine-tune a model pretrained on a large dataset with a small number of examples.

In NLP, however, we have so far only been pretraining the first layer of our models via pretrained embeddings. Recent approaches (Peters et al., 2017, 2018) [31, 32] add pretrained language model embeddings, but these still require custom architectures for every task. In my opinion, in order to unlock the true potential of transfer learning for NLP, we need to pretrain the entire model and fine-tune it on the target task, akin to fine-tuning ImageNet models. Language modelling, for instance, is a great task for pretraining and could be to NLP what ImageNet classification is to CV (Howard and Ruder, 2018) [33].
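Schematically, the recipe is: pretrain a language model, swap its output layer for a task head, and fine-tune. The PyTorch sketch below assumes a pretrained `lm_encoder` (e.g. an LSTM) and is only an illustration of the idea, not the ULMFiT code.

```python
# 'Pretrain a language model, then fine-tune the whole model' as a sketch.
import torch.nn as nn

class TextClassifier(nn.Module):
    def __init__(self, lm_encoder, hidden_dim, n_classes):
        super().__init__()
        self.encoder = lm_encoder                     # weights from LM pretraining
        self.head = nn.Linear(hidden_dim, n_classes)  # new, randomly initialised

    def forward(self, tokens):
        hidden = self.encoder(tokens)   # assumed to return [batch, hidden_dim]
        return self.head(hidden)

# Stage 1: train only the new head; Stage 2: unfreeze and fine-tune everything,
# possibly with a lower learning rate for the encoder.
def set_encoder_trainable(model, trainable):
    for p in model.encoder.parameters():
        p.requires_grad = trainable
```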

Here are some potential research directions in this context:

Identify useful pretraining tasks: The choice of the pretraining task is very important as even fine-tuning a model on a related task might only provide limited success (Mou et al., 2016) [38]. Other tasks such as those explored in recent work on learning general-purpose sentence embeddings (Conneau et al., 2017; Subramanian et al., 2018; Nie et al., 2017) [34, 35, 40] might be complementary to language model pretraining or suitable for other target tasks.

Fine-tuning of complex architectures: Pretraining is most useful when a model can be applied to many target tasks. However, it is still unclear how to pretrain more complex architectures, such as those used for pairwise classification tasks (Augenstein et al., 2018) or reasoning tasks such as QA or reading comprehension.

Multi-task learning

Multi-task learning (MTL) has become more commonly used in NLP. See here for a general overview of multi-task learning and here for MTL objectives for NLP. However, there is still much we don’t understand about multi-task learning in general.

The main questions regarding MTL give rise to many interesting research directions:

Identify effective auxiliary tasks: One of the main questions is which tasks are useful for multi-task learning. Label entropy has been shown to be a predictor of MTL success (Alonso and Plank, 2017) [28], but this does not tell the whole story. In recent work (Augenstein et al., 2018) [27], we have found that auxiliary tasks with more data and more fine-grained labels are more useful. It would be useful if future MTL papers would not only propose a new model or auxiliary task, but also try to understand why a certain auxiliary task might be better than another closely related one.

Alternatives to hard parameter sharing: Hard parameter sharing is still the default modus operandi for MTL (see the sketch after this list), but it places a strong constraint on the model to compress knowledge pertaining to different tasks into the same parameters, which often makes learning difficult. We need better ways of doing MTL that are easy to use and work reliably across many tasks. Recently proposed methods such as cross-stitch units (Misra et al., 2016; Ruder et al., 2017) [29, 30] and a label embedding layer (Augenstein et al., 2018) are promising steps in this direction.

Artificial auxiliary tasks: The best auxiliary tasks are those that are tailored to the target task and do not require any additional data. I have outlined a list of potential artificial auxiliary tasks here. However, it is not clear which of these work reliably across a number of diverse tasks or what variations or task-specific modifications are useful.
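The sketch below shows hard parameter sharing in its simplest form: one shared encoder and one output head per task. It is a generic PyTorch illustration (shapes and encoder choice are assumptions), meant only to make the constraint discussed above concrete.

```python
# Hard parameter sharing: a shared encoder, task-specific output heads.
import torch.nn as nn

class HardSharingModel(nn.Module):
    def __init__(self, vocab_size, emb_dim, hidden_dim, n_labels_per_task):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.shared = nn.LSTM(emb_dim, hidden_dim, batch_first=True)  # shared by all tasks
        self.heads = nn.ModuleList(
            [nn.Linear(hidden_dim, n) for n in n_labels_per_task]     # one head per task
        )

    def forward(self, tokens, task_id):
        out, _ = self.shared(self.embed(tokens))
        return self.heads[task_id](out)   # per-token logits for the chosen task
```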

Cross-lingual learning

Creating models that perform well across languages and that can transfer knowledge from resource-rich to resource-poor languages is one of the most important research directions IMO. There has been much progress in learning cross-lingual representations that project different languages into a shared embedding space. Refer to Ruder et al. (2017) [36] for a survey.
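One common way to obtain such a shared space is to learn a linear map that aligns source-language vectors to their translations' vectors using a seed dictionary; the orthogonal Procrustes solution below is a NumPy sketch of that general idea, not the method of any single cited paper.

```python
# Align two monolingual embedding spaces with an orthogonal linear map.
import numpy as np

def procrustes(X, Y):
    """X, Y: [n_pairs, d] row-aligned embeddings of seed dictionary word pairs."""
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt                 # orthogonal W minimising ||X @ W - Y||_F

# W = procrustes(X_dict, Y_dict)
# mapped = X_all @ W   # project the whole source vocabulary into the target space
```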

Cross-lingual representations are commonly evaluated either intrinsically on similarity benchmarks or extrinsically on downstream tasks, such as text classification. While recent methods have advanced the state-of-the-art for many of these settings, we do not have a good understanding of the tasks or languages for which these methods fail and how to mitigate these failures in a task-independent manner, e.g. by injecting task-specific constraints (Mrkšić et al., 2017).

Task-independent architecture improvements

Novel architectures that outperform the current state-of-the-art and are tailored to specific tasks are regularly introduced, superseding the previous architecture. I have outlined best practices for different NLP tasks before, but without comparing such architectures on different tasks, it is often hard to gain insights from specialized architectures and tell which components would also be useful in other settings.

A particularly promising recent model is the Transformer (Vaswani et al., 2017) [37]. While the complete model might not be appropriate for every task, components such as multi-head attention or positional encoding could be building blocks that are generally useful for many NLP tasks.
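The core computation behind multi-head attention is scaled dot-product attention; the single-head PyTorch sketch below shows it in isolation (multi-head attention runs several of these in parallel on linear projections of the inputs and concatenates the results).

```python
# Scaled dot-product attention, the building block of multi-head attention.
import math
import torch

def scaled_dot_product_attention(Q, K, V):
    # Q: [batch, n_q, d_k], K: [batch, n_k, d_k], V: [batch, n_k, d_v]
    scores = Q @ K.transpose(-2, -1) / math.sqrt(Q.size(-1))
    weights = torch.softmax(scores, dim=-1)   # attention distribution over keys
    return weights @ V                        # [batch, n_q, d_v]
```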

Conclusion

I hope you’ve found this collection of research directions useful. If you have suggestions on how to tackle some of these problems or ideas for related research topics, feel free to comment below.

References

Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems (pp. 1097-1105). 

Tobin, J., Fong, R., Ray, A., Schneider, J., Zaremba, W., & Abbeel, P. (2017). Domain Randomization for Transferring Deep Neural Networks from Simulation to the Real World. arXiv Preprint arXiv:1703.06907. Retrieved from http://arxiv.org/abs/1703.06907 

Zhang, H., Cisse, M., Dauphin, Y. N., & Lopez-Paz, D. (2017). mixup: Beyond Empirical Risk Minimization, 1–11. Retrieved from http://arxiv.org/abs/1710.09412 

Vinyals, O., Blundell, C., Lillicrap, T., Kavukcuoglu, K., & Wierstra, D. (2016). Matching Networks for One Shot Learning. NIPS 2016. Retrieved from http://arxiv.org/abs/1606.04080 

Li, Y., Cohn, T., & Baldwin, T. (2017). Robust Training under Linguistic Adversity. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics (Vol. 2, pp. 21–27). 

Wang, D., & Eisner, J. (2016). The Galactic Dependencies Treebanks: Getting More Data by Synthesizing New Languages. TACL, 4, 491–505. Retrieved from https://transacl.org/ojs/index.php/tacl/article/view/917

Liu, T., Cui, Y., Yin, Q., Zhang, W., Wang, S., & Hu, G. (2017). Generating and Exploiting Large-scale Pseudo Training Data for Zero Pronoun Resolution. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (pp. 102–111). 

Samanta, S., & Mehta, S. (2017). Towards Crafting Text Adversarial Samples. arXiv preprint arXiv:1707.02812. 

Ebrahimi, J., Rao, A., Lowd, D., & Dou, D. (2017). HotFlip: White-Box Adversarial Examples for NLP. Retrieved from http://arxiv.org/abs/1712.06751 

Yasunaga, M., Kasai, J., & Radev, D. (2017). Robust Multilingual Part-of-Speech Tagging via Adversarial Training. In Proceedings of NAACL 2018. Retrieved from http://arxiv.org/abs/1711.04903 

Jia, R., & Liang, P. (2017). Adversarial Examples for Evaluating Reading Comprehension Systems. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. 

Sennrich, R., Haddow, B., & Birch, A. (2015). Improving neural machine translation models with monolingual data. arXiv preprint arXiv:1511.06709. 

Sennrich, R., Haddow, B., & Birch, A. (2016). Edinburgh neural machine translation systems for wmt 16. arXiv preprint arXiv:1606.02891. 

Mallinson, J., Sennrich, R., & Lapata, M. (2017). Paraphrasing revisited with neural machine translation. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers (Vol. 1, pp. 881-893). 

Dong, L., Mallinson, J., Reddy, S., & Lapata, M. (2017). Learning to Paraphrase for Question Answering. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. 

Li, J., Monroe, W., Shi, T., Ritter, A., & Jurafsky, D. (2017). Adversarial Learning for Neural Dialogue Generation. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. Retrieved from http://arxiv.org/abs/1701.06547 

Bowman, S. R., Vilnis, L., Vinyals, O., Dai, A. M., Jozefowicz, R., & Bengio, S. (2016). Generating Sentences from a Continuous Space. In Proceedings of the 20th SIGNLL Conference on Computational Natural Language Learning (CoNLL). Retrieved from http://arxiv.org/abs/1511.06349 

Hu, Z., Yang, Z., Liang, X., Salakhutdinov, R., & Xing, E. P. (2017). Toward Controlled Generation of Text. In Proceedings of the 34th International Conference on Machine Learning. 

Guu, K., Hashimoto, T. B., Oren, Y., & Liang, P. (2017). Generating Sentences by Editing Prototypes. 

Shen, T., Lei, T., Barzilay, R., & Jaakkola, T. (2017). Style Transfer from Non-Parallel Text by Cross-Alignment. In Advances in Neural Information Processing Systems. Retrieved from http://arxiv.org/abs/1705.09655 

Mrkšić, N., Vulić, I., Séaghdha, D. Ó., Leviant, I., Reichart, R., Gašić, M., … Young, S. (2017). Semantic Specialisation of Distributional Word Vector Spaces using Monolingual and Cross-Lingual Constraints. TACL. Retrieved from http://arxiv.org/abs/1706.00374 

Ribeiro, M. T., Singh, S., & Guestrin, C. (2016, August). Why should i trust you?: Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 1135-1144). ACM. 

Ravi, S., & Larochelle, H. (2017). Optimization as a Model for Few-Shot Learning. In ICLR 2017. 

Snell, J., Swersky, K., & Zemel, R. S. (2017). Prototypical Networks for Few-shot Learning. In Advances in Neural Information Processing Systems. 

Song, Y., & Roth, D. (2014). On dataless hierarchical text classification. Proceedings of AAAI, 1579–1585. Retrieved from http://cogcomp.cs.illinois.edu/papers/SongSoRo14.pdf 

Song, Y., Upadhyay, S., Peng, H., & Roth, D. (2016). Cross-Lingual Dataless Classification for Many Languages. IJCAI, 2901–2907. 

Augenstein, I., Ruder, S., & Søgaard, A. (2018). Multi-task Learning of Pairwise Sequence Classification Tasks Over Disparate Label Spaces. In Proceedings of NAACL 2018. 

Alonso, H. M., & Plank, B. (2017). When is multitask learning effective? Multitask learning for semantic sequence prediction under varying data conditions. In EACL. Retrieved from http://arxiv.org/abs/1612.02251 

Misra, I., Shrivastava, A., Gupta, A., & Hebert, M. (2016). Cross-stitch Networks for Multi-task Learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. http://doi.org/10.1109/CVPR.2016.433 

Ruder, S., Bingel, J., Augenstein, I., & Søgaard, A. (2017). Sluice networks: Learning what to share between loosely related tasks. arXiv preprint arXiv:1705.08142. 

Peters, M. E., Ammar, W., Bhagavatula, C., & Power, R. (2017). Semi-supervised sequence tagging with bidirectional language models. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (ACL 2017). 

Peters, M. E., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., & Zettlemoyer, L. (2018). Deep contextualized word representations. Proceedings of NAACL. 

Howard, J., & Ruder, S. (2018). Fine-tuned Language Models for Text Classification. arXiv preprint arXiv:1801.06146. 

Conneau, A., Kiela, D., Schwenk, H., Barrault, L., & Bordes, A. (2017). Supervised Learning of Universal Sentence Representations from Natural Language Inference Data. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. 

Subramanian, S., Trischler, A., Bengio, Y., & Pal, C. J. (2018). Learning General Purpose Distributed Sentence Representations via Large Scale Multi-task Learning. In Proceedings of ICLR 2018. 

Ruder, S., Vulić, I., & Søgaard, A. (2017). A Survey of Cross-lingual Word Embedding Models. arXiv Preprint arXiv:1706.04902. Retrieved from http://arxiv.org/abs/1706.04902 

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., … Polosukhin, I. (2017). Attention Is All You Need. In Advances in Neural Information Processing Systems. 

Mou, L., Meng, Z., Yan, R., Li, G., Xu, Y., Zhang, L., & Jin, Z. (2016). How Transferable are Neural Networks in NLP Applications? Proceedings of 2016 Conference on Empirical Methods in Natural Language Processing. 

Xie, Z., Wang, S. I., Li, J., Levy, D., Nie, A., Jurafsky, D., & Ng, A. Y. (2017). Data Noising as Smoothing in Neural Network Language Models. In Proceedings of ICLR 2017. 

Nie, A., Bennett, E. D., & Goodman, N. D. (2017). DisSent: Sentence Representation Learning from Explicit Discourse Relations. arXiv Preprint arXiv:1710.04334. Retrieved from http://arxiv.org/abs/1710.04334 
