Tianchi CIKM 2018 - Sentence Similarity

AliMe is a chatbot for online shopping in a global context; this task addresses the short-text matching problem across different languages (Spanish & English).

Competition Website: https://tianchi.aliyun.com/competition/introduction.htm?spm=5176.100150.711.6.600e2784M5Z0uW&raceId=231661

Github Link: https://github.com/MarvinLSJ/LSTM-siamese

Result:
Single Deep Model: 66/1027
Ensemble Model: 38/1027

Competition Introduction

Data Description


Training Data

21400 Labeled Spanish sentence pairs & English sentence pairs are provided;

55669 Unlabeled Spanish sentences & corresponding English translations are provided.

Test Data

5000 Spanish sentence pairs

Goal and Evaluation

Predict the similarity of the Spanish sentence pairs in the test set.

Results are evaluated by log loss.
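
For reference, the metric is the standard binary log loss; a minimal NumPy version (my own sketch, not the official scorer):

    import numpy as np

    def log_loss(y_true, p_pred, eps=1e-15):
        """Binary log loss: -mean(y*log(p) + (1-y)*log(1-p))."""
        y = np.asarray(y_true, dtype=float)
        p = np.clip(np.asarray(p_pred, dtype=float), eps, 1 - eps)
        return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))

    print(log_loss([1, 0, 1], [0.9, 0.2, 0.6]))  # ~0.28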

ML Model

Developed by freedomwyl (Link).

Deep Model

A common approach is to find a way to represent each sentence and then calculate their similarity; with a little elaboration, this leads to the basic model.

Basic Model: LSTM-Siamese


Name Origin

The name comes from the Siamese twins of Thailand (Siam), conjoined twins whose bodies are partially shared. The word "Siamese" later came to refer to twin structures in general, such as the two identical branches of this neural network.

Main Idea

This model takes in a sentence pair and encodes each sentence into a vector representation with an LSTM, word by word (so the sentence embedding carries word-order information). It then generates vector features from the two representations and feeds them into a classifier to predict the similarity.
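
As an illustration, here is a minimal PyTorch sketch of this architecture (dimensions follow the baseline configuration below; this is my simplified sketch, not the repository's exact implementation):

    import torch
    import torch.nn as nn

    class SiameseLSTM(nn.Module):
        """Encode both sentences with a shared LSTM, combine the two sentence
        vectors into features, and classify the pair's similarity."""

        def __init__(self, vocab_size=20000, embed_size=300, hidden_size=150, fc_dim=100):
            super().__init__()
            self.embedding = nn.Embedding(vocab_size, embed_size)
            self.encoder = nn.LSTM(embed_size, hidden_size, batch_first=True)
            self.classifier = nn.Sequential(
                nn.Linear(4 * hidden_size, fc_dim),
                nn.ReLU(),
                nn.Linear(fc_dim, 1),
            )

        def encode(self, tokens):
            # Use the LSTM's last hidden state as the sentence representation
            _, (h_n, _) = self.encoder(self.embedding(tokens))
            return h_n[-1]

        def forward(self, s1, s2):
            v1, v2 = self.encode(s1), self.encode(s2)
            features = torch.cat([v1, v2, torch.abs(v1 - v2), v1 * v2], dim=1)
            return torch.sigmoid(self.classifier(features)).squeeze(1)

    # Example: a batch of two token-id sequences per side
    model = SiameseLSTM()
    s1 = torch.randint(0, 20000, (2, 12))
    s2 = torch.randint(0, 20000, (2, 12))
    print(model(s1, s2))  # similarity probabilities in [0, 1]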

Baseline

With the standard parameter settings below, the validation loss reaches 0.3463, which is a pretty good offline score.

Baseline configuration:

    experiment_name: 'siamese-baseline'

    task: 'train'
    make_dict: False
    data_preprocessing: False

    ckpt_dir: 'ckpt/'

    training:
      num_epochs: 20
      learning_rate: 0.01
      # options = ['adam', 'adadelta', 'rmsprop']
      optimizer: 'sgd'

    embedding:
      full_embedding_path: 'input/wiki.es.vec'
      cur_embedding_path: 'input/embedding.pkl'

    model:
      fc_dim: 100
      name: 'siamese'
      embed_size: 300
      batch_size: 1
      embedding_freeze: False
      encoder:
        hidden_size: 150
        num_layers: 1
        bidirectional: False
        dropout: 0.5

    result:
      filename: 'result.txt'
      filepath: 'res/'
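
A config in this shape can be parsed with PyYAML into nested dictionaries; a minimal sketch (the file name config.yaml is an assumption, and the repository may load it differently):

    import yaml

    # Parse the experiment configuration into nested dictionaries
    with open('config.yaml') as f:
        config = yaml.safe_load(f)

    print(config['training']['optimizer'])             # 'sgd'
    print(config['model']['encoder']['hidden_size'])   # 150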

Some Attempts

Tuning Parameters
  1. Classifier

    fc_dim: classifier fully connected layer size

  2. Encoder

    hidden_size: lstm hidden size

    num_layers: number of LSTM layers

    bidirectional: a bidirectional LSTM can capture more information

    dropout: avoid overfitting

  3. Embedding

    embedding_freeze: if set to False, the embeddings participate in backpropagation. In my experience this does not work well, especially with a small training dataset.
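
For illustration, this is how embedding_freeze maps onto loading pre-trained embeddings in PyTorch (a sketch; the random tensor stands in for the fastText vectors):

    import torch
    import torch.nn as nn

    vocab_size, embed_size = 20000, 300
    pretrained = torch.randn(vocab_size, embed_size)   # stand-in for wiki.es.vec vectors

    # embedding_freeze: False -> freeze=False, so the vectors are updated by
    # backpropagation; True keeps the pre-trained vectors fixed.
    embedding = nn.Embedding.from_pretrained(pretrained, freeze=False)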

Structure
  1. Classifier

    fc layers, or non-linear fc layers (adding ReLU)

  2. Encoder

    Feature generation: the current method uses (v1, v2, abs(v1-v2), v1*v2); could more features based on different vector distance measures help? See the sketch below.
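
A sketch of the current feature set plus a couple of distance-based candidates (the extra features are my illustration, not something tested here):

    import torch
    import torch.nn.functional as F

    v1 = torch.randn(8, 150)   # batch of sentence vectors for the left sentences
    v2 = torch.randn(8, 150)   # batch of sentence vectors for the right sentences

    # Current feature set: (v1, v2, |v1 - v2|, v1 * v2)
    features = torch.cat([v1, v2, torch.abs(v1 - v2), v1 * v2], dim=1)

    # Candidate extra features based on vector distances
    cos = F.cosine_similarity(v1, v2, dim=1).unsqueeze(1)   # cosine similarity
    euc = torch.norm(v1 - v2, p=2, dim=1, keepdim=True)     # Euclidean distance
    features_ext = torch.cat([features, cos, euc], dim=1)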

Training
  1. Early stopping

    Stop training whenever the |valid loss - train loss| <= 0.02

  2. Optimizer

    Default SGD;

    RMSprop for an adaptive learning rate;

    Adam for an adaptive learning rate plus momentum, to help escape local optima;

  3. Learning rate

    It should be small enough to avoid oscillation. Further exploration could be dynamic learning-rate clipping.
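
A minimal sketch of how the optimizer choice and the early-stopping rule above could be wired together (names and structure are my assumptions, not the repository's exact training loop):

    import torch.nn as nn
    import torch.optim as optim

    model = nn.Linear(600, 1)   # stand-in for the Siamese model's parameters
    cfg = {'optimizer': 'sgd', 'learning_rate': 0.01}

    # Pick the optimizer named in the config
    optimizers = {'sgd': optim.SGD, 'adam': optim.Adam,
                  'adadelta': optim.Adadelta, 'rmsprop': optim.RMSprop}
    optimizer = optimizers[cfg['optimizer']](model.parameters(), lr=cfg['learning_rate'])

    # Early stopping as described above: stop once train and valid loss are close
    def should_stop(train_loss, valid_loss, threshold=0.02):
        return abs(valid_loss - train_loss) <= threshold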

Baseline result

The basic model turns out to perform badly online; the reasons are probably:

  1. The test set is very different from the training set, both in class distribution (pos:neg = 1:3 in the training set) and in sentence features.

  2. The deep model is too sophisticated: with so many weights in the LSTM and the fully connected classifier, it overfits and gets overtrained easily.

Data Augmentation

Based on the baseline result, we need to consider other ways to avoid overfitting. The amount of data can always give us a surprise: we have an unexploited treasure of 55669 unlabeled sentences, which can be critical if used properly.

Main Idea

Here's how we do it:

Construct Spanish sentence pairs by aligning the unlabeled sentences along the rows and columns of a matrix and calculating their pairwise similarities in an unsupervised way.


The first question is how to embed the sentences.

Following the simple-and-effective fashion, the first choice is to average the word embeddings of every word in the sentence.

Alternatively, it could be done in a more elaborate way by training a sentence encoder with an autoencoder; since the amount of data is large enough, the encoder may be able to capture a proper representation.

Second, the similarity between two sentences can be measured with several kinds of distances; I prefer cosine distance and Word Mover's Distance. Here is an example from my internship applying these two methods to calculate the similarity of phrases (store tags). (Link)
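
A minimal sketch of the average-embedding-plus-cosine approach (the toy word_vectors dict stands in for the full fastText vocabulary):

    import numpy as np

    def sentence_vector(tokens, word_vectors, dim=300):
        """Average the embeddings of the words that have vectors."""
        vecs = [word_vectors[w] for w in tokens if w in word_vectors]
        return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

    def cosine_similarity(u, v, eps=1e-8):
        return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + eps))

    # word_vectors would normally be loaded from the wiki.es.vec fastText file
    word_vectors = {w: np.random.rand(300) for w in ['hola', 'muchas', 'gracias']}
    s1 = sentence_vector(['hola', 'muchas', 'gracias'], word_vectors)
    s2 = sentence_vector(['muchas', 'gracias'], word_vectors)
    print(cosine_similarity(s1, s2))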

Here are some other thoughts on data augmentation: a traditional way using synonym substitution, and an effective but not so practical way using double translation. (Link)

Problems

In doing so, I ran into a big problem when computing the huge similarity matrix: it takes O(n^2) work to build the matrix, and at best O(n log n) per sentence to select the k best and worst matches. With n near 50k, that is impossible to run on a single PC, and I still have not figured out how to do it.

Thus, I ran the augmentation with some tweaking on 700 sentences to get 13216 positive and 11569 negative samples, and had another run on 1000 sentences to get 38345 positive and 28635 negative samples (to balance the 3:1 neg:pos ratio in the original dataset).

Augmentation result

With augmentation from 1000 sentences, the local loss is around 0.1, which is really good, but the online score is still not ideal.

That may be caused by the selection from the similarity matrix: taking the 10 best and 10 worst matches as positive and negative examples makes the augmented data look good in quantity, boosting the training data by roughly 24000 pairs from only 700 sentences, but it contains many repeated pairs like (s1, s2) and (s2, s1), which leads to even more severe overfitting.

The ideal way would be to use all sentences, keep only the top-1 and bottom-1 matches, and avoid duplicated sentence pairs. How to do this efficiently still puzzles me; I hope readers can give me some hints. Even then, how much augmented data to add to the training set remains open to discussion: how much is suitable to alleviate the overfitting?
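
As one possible direction (purely my sketch, not something tried here): with L2-normalized sentence vectors, cosine similarity is just a matrix product, and the full n x n matrix never has to be stored if it is processed in row blocks while keeping a running top/bottom selection:

    import numpy as np

    def chunked_top_bottom(vectors, k=1, chunk=512):
        """For each sentence, return indices of its k most and k least similar
        neighbours without materializing the full n x n similarity matrix."""
        X = vectors / (np.linalg.norm(vectors, axis=1, keepdims=True) + 1e-8)
        n = X.shape[0]
        best = np.zeros((n, k), dtype=np.int64)
        worst = np.zeros((n, k), dtype=np.int64)
        for start in range(0, n, chunk):
            block = X[start:start + chunk] @ X.T   # only a (chunk, n) slice in memory
            rows = np.arange(block.shape[0])
            cols = start + rows
            block[rows, cols] = np.inf             # exclude self from "least similar"
            worst[cols] = np.argsort(block, axis=1)[:, :k]
            block[rows, cols] = -np.inf            # exclude self from "most similar"
            best[cols] = np.argsort(block, axis=1)[:, -k:]
        return best, worst

    vectors = np.random.rand(5000, 300)            # stand-in sentence embeddings
    best, worst = chunked_top_bottom(vectors, k=1)

The matrix product is still O(n^2) work overall, but memory stays bounded by the block size; argsort could be replaced by np.argpartition to cut the per-row selection cost, and de-duplicating (s1, s2) versus (s2, s1) then only needs a set of sorted index pairs.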

Transfer Learning


As we are provided with labeled English data, another thought is to use transfer learning.

A number of animal words went directly from indigenous American languages into Spanish and then English ("puma" originated in Quechua, while "jaguar" comes from "yaguar"), so I thought transfer might be useful for this task.

Main Idea

The idea is rather simple: train the Siamese LSTM on the labeled English data first, then transfer the network's weights to initialize the Spanish model.
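
Reusing the SiameseLSTM sketch from earlier, the transfer itself amounts to copying the trained English weights into a freshly built Spanish model (a sketch; in the repository this would go through a saved checkpoint):

    # model_en: the SiameseLSTM sketch from above, imagined as already trained
    # on the labeled English pairs; model_es: same architecture for Spanish.
    model_en = SiameseLSTM()
    model_es = SiameseLSTM()
    model_es.load_state_dict(model_en.state_dict())
    # In practice the English weights would be loaded from a checkpoint file,
    # e.g. model_es.load_state_dict(torch.load(<path to English checkpoint>)).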

Transfer result


This was a quick and not fully extended attempt. As the result table below shows, the result gets better with a 2-layer LSTM, but the transfer result still cannot beat the former results.

Some after-thoughts: after the transfer there should be a mix of frozen and unfrozen layers, especially for the classifier layers. The English Siamese model may learn different features than the Spanish input provides, so the transferred classifier is doing a totally different job, which leads to a worse loss. Maybe we could freeze the classifier first and train the encoder part, and then fine-tune from there; see the sketch below.
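
One possible schedule for that idea, continuing the sketch above (my assumption, not what was tried in these experiments):

    import torch.optim as optim

    # Stage 1: freeze the transferred classifier, let only the encoder (and
    # embeddings) adapt to Spanish input.
    for p in model_es.classifier.parameters():
        p.requires_grad = False
    stage1_optimizer = optim.SGD([p for p in model_es.parameters() if p.requires_grad], lr=0.01)

    # Stage 2: unfreeze the classifier and fine-tune the whole model,
    # e.g. with a smaller learning rate.
    for p in model_es.classifier.parameters():
        p.requires_grad = True
    stage2_optimizer = optim.SGD(model_es.parameters(), lr=0.001)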

Result

| Siamese-LSTM | Train Loss | Valid Loss | Optimizer | Learning Rate | Explanation | Analysis |
| --- | --- | --- | --- | --- | --- | --- |
| Baseline | 0.3464 | 0.3463 | SGD | 0.01 | baseline model with 0.5 dropout | |
| | 0.3651 | 0.3667 | Adam | 0.0001 | changed optimizer | |
| Bidirectional | 0.4427 | 0.4413 | SGD | 0.01 | bidirectional LSTM | not helpful |
| Dropout | 0.3833 | 0.3928 | SGD | 0.01 | dropout 0.7 | too much dropout |
| 2-features | 0.3421 | 0.3668 | SGD | 0.01 | using the embedded sentence vectors v1, v2 as features | discriminating ability is constrained with only 2 features, but may generalize better |
| 3-features | 0.4974 | 0.5100 | SGD | 0.01 | v1, v2, v1-v2 | |
| | 0.4096 | 0.4415 | Adadelta | 0.01 | changed optimizer | Adadelta does better with an adaptive learning rate |
| 4-features | 0.3914 | 0.3972 | SGD | 0.01 | v1, v2, v1-v2, (v1+v2)/2 | changed v1*v2 to (v1+v2)/2, thinking the average could extract more information than v1*v2; apparently not |
| | 0.3801 | 0.3740 | RMSprop | 0.0001 | changed optimizer | adaptive learning rate wins again |
| 5-features | 0.4112 | 0.4407 | Adadelta | 0.01 | v1, v2, v1-v2, (v1+v2)/2, v1*v2 | adding the average feature even has a negative effect |
| Transfer | 0.3657-0.4765 | 0.3794-0.4986 | SGD | 0.01 | all-trainable transfer from English to Spanish model | English and Spanish may not be that similar, at least according to this model |
| | 0.4208-0.3605 | 0.4376-0.3699 | SGD | 0.01 | 2-layer LSTM | adding one layer gives some hope, but it is only slightly better |
| Data Augmentation | 0.1082 | 0.1136 | SGD | 0.01 | added 38345 positive and 28635 negative samples generated from 1000 sentences | proves data is the most critical point, but the way we augment needs to be modified |

Ensemble

  1. Weighted Average

    The outputs are probabilities in [0, 1], so the simplest way to ensemble is a weighted average of these probabilities. The weight on each model can be manually adjusted according to its single-model performance. As the deep and ML models may perform well on different parts of the data, this simple method gives a good result; our final submission uses a weight of 0.5 on each model (see the sketch after this list).

  2. Stacking


Stacking can be more comprehensive, using the first-level models' outputs as features for a second-level model.
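
A minimal sketch of the weighted-average ensemble (the arrays are illustrative; the clipping is a common log-loss precaution, not necessarily what was submitted):

    import numpy as np

    # Predicted probabilities from the two models on the same test pairs
    p_deep = np.array([0.91, 0.12, 0.55])   # deep (Siamese-LSTM) model
    p_ml = np.array([0.85, 0.20, 0.40])     # ML model

    # Final submission: weight 0.5 on each model
    w_deep = 0.5
    p_final = w_deep * p_deep + (1 - w_deep) * p_ml

    # Keeping probabilities away from exact 0/1 avoids an unbounded log loss
    p_final = np.clip(p_final, 1e-6, 1 - 1e-6)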

Implementation Details

Basic Model

Step by step Jupyter Notebook explanation: Explanation

Main : Run this to train the model and run inference

Configuration File : All configurations and parameters are set here

Model : The Siamese-LSTM model in PyTorch

Dataset : How samples are stored and extracted

Pre-processing for Sentences & Embedding : Pre-processing from raw data, and embedding

Data Augmentation

Data Augmentation Jupyter notebook : Details in data augmentation using unlabeled data

Train with augmented data : Using data augmented from 700 unlabeled sentences to train the model

Other Augmentation Methods: Augmentation with synonym substitution and double-translation

Transfer Learning

Transfer learning Jupyter Notebook explanation: Transfer Explanation

Transfer Main : Run this to train the transfer model and run inference

Transfer Configuration : Configuration file for transfer learning
