Paper: Bag of Tricks for Efficient Text Classification
1. Introduction
We evaluate the quality of our approach fastText on two different tasks, namely tag prediction and sentiment analysis.
Two evaluation tasks: tag prediction and sentiment analysis.
2. Model architecture
A simple and efficient baseline for sentence classification is to represent sentences as bag of words (BoW) and train a linear classifier, e.g., a logistic regression or an SVM.
Sentence classification: represent each sentence with a bag-of-words (BoW) model, then train a linear classifier.
However, linear classifiers do not share parameters among features and classes. This possibly limits their generalization in the context of large output space where some classes have very few examples. Common solutions to this problem are to factorize the linear classifier into low rank matrices or to use multilayer neural networks.
Drawback of linear classifiers: parameters are not shared across features and classes, which hurts generalization when the output space is large and some classes have very few examples.
Solutions: factorize the linear classifier into low-rank matrices, or use multilayer neural networks.
The first weight matrix A is a look-up table over the words.
The word representations are then averaged into a text representation, which is in turn fed to a linear classifier.
The text representation is a hidden variable which can potentially be reused.
The fastText model is similar to CBOW: CBOW averages the word vectors of the context words to predict the center word, whereas fastText averages the word vectors of the whole document to predict the label.
A softmax is used to compute the class probabilities, and the negative log-likelihood is used as the loss function.
This model is trained asynchronously on multiple CPUs using stochastic gradient descent and a linearly decaying learning rate.
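A minimal NumPy sketch of this pipeline; the names A (the word look-up table) and B (the linear classifier) follow the description above, while the sizes and the toy input are illustrative:

```python
import numpy as np

vocab_size, embed_dim, num_classes = 10000, 10, 4
A = np.random.randn(vocab_size, embed_dim) * 0.01   # look-up table over the words
B = np.random.randn(embed_dim, num_classes) * 0.01  # linear classifier

def forward(word_ids, label):
    hidden = A[word_ids].mean(axis=0)   # average word vectors into a text representation
    logits = hidden @ B                 # feed the text representation to the linear classifier
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                # softmax over the classes
    loss = -np.log(probs[label])        # negative log-likelihood for this example
    return probs, loss

probs, loss = forward(word_ids=[3, 42, 7, 999], label=2)
print(probs, loss)
```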
2.1 Hierarchical softmax
When the number of classes is large, computing the linear classifier is computationally expensive. To improve the running time, a hierarchical softmax based on the Huffman coding tree is used.
Based on Huffman coding: each node is associated with the probability of the path from the root to that node.
Complexity: drops from O(kh) to O(h log2(k)), where k is the number of classes and h the dimension of the text representation (rough numbers below).
Advantage: the hierarchical softmax is also fast at test time when searching for the most likely class.
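For a rough sense of the savings (illustrative numbers, not from the paper): with k = 300,000 classes and h = 50,

kh ≈ 50 × 3×10^5 = 1.5×10^7, while h \log_2(k) ≈ 50 × 18 ≈ 9×10^2,

i.e. roughly four orders of magnitude fewer operations per example.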
Each node is associated with a probability that is the probability of the path from the root to that node. If the node is at depth l+1 with parents n_1, . . . , n_l, its probability is P(n_{l+1}) = \prod_{i=1}^{l} P(n_i).
A node's probability is always lower than that of its parent. Exploring the tree with a depth-first search and tracking the maximum probability among the leaves lets us discard any branch whose probability is already smaller.
This approach is further extended to compute the T-top targets at the cost of O(log(T)), using a binary heap.
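A small sketch of such a best-first search driven by a binary heap; the Node structure, the branch probabilities and the toy tree are illustrative, not fastText's actual internals:

```python
import heapq
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

@dataclass
class Node:
    label: Optional[int] = None                                          # set on leaves only
    children: List[Tuple["Node", float]] = field(default_factory=list)   # (child, branch probability)

def top_t(root: Node, T: int) -> List[Tuple[int, float]]:
    # A node's probability never exceeds its parent's, so the first T leaves
    # popped from the (max-)heap are exactly the T most likely classes.
    heap = [(-1.0, 0, root)]   # (negated path probability, tie-breaker, node)
    out, tie = [], 1
    while heap and len(out) < T:
        neg_p, _, node = heapq.heappop(heap)
        if node.label is not None:
            out.append((node.label, -neg_p))
        else:
            for child, p in node.children:
                heapq.heappush(heap, (neg_p * p, tie, child))
                tie += 1
    return out

# Toy tree: class 0 with probability 0.5; classes 1 and 2 under the other branch.
tree = Node(children=[(Node(label=0), 0.5),
                      (Node(children=[(Node(label=1), 0.75),
                                      (Node(label=2), 0.25)]), 0.5)])
print(top_t(tree, 2))   # -> [(0, 0.5), (1, 0.375)]
```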
The words and phrases in the input layer form a feature vector, which is mapped to the hidden layer by a linear transformation; training maximizes the likelihood, and a Huffman tree built from the weight of each class and the model parameters serves as the output layer.
fastText also exploits the fact that the classes are imbalanced (some classes occur far more often than others): the tree representing the classes is built with the Huffman algorithm, so frequent classes sit at a smaller depth than infrequent ones, which makes the computation even more efficient.
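A toy sketch of the Huffman construction over class counts, just to show that frequent classes end up with shorter codes (the class names and counts are made up):

```python
import heapq
from collections import Counter

def huffman_code_lengths(class_counts):
    """Merge the two least frequent subtrees repeatedly (classic Huffman);
    returns each class's code length, i.e. its depth in the resulting tree."""
    heap = [(count, i, {cls: 0}) for i, (cls, count) in enumerate(class_counts.items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        c1, _, depths1 = heapq.heappop(heap)
        c2, _, depths2 = heapq.heappop(heap)
        merged = {cls: d + 1 for cls, d in {**depths1, **depths2}.items()}
        heapq.heappush(heap, (c1 + c2, next_id, merged))
        next_id += 1
    return heap[0][2]

# The frequent class gets a shorter code (smaller depth) than the rare ones.
print(huffman_code_lengths(Counter({"sports": 900, "politics": 60, "art": 40})))
# -> {'art': 2, 'politics': 2, 'sports': 1}
```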
2.2 N-gram features
The bag-of-words (BoW) model ignores word order, grammar and syntax: a text is treated simply as a collection of words, and each word's occurrence is independent of whether any other word occurs.
we use a bag of n-grams as additional features to capture some partial information about the local word order.
Add n-gram features to capture local word order; use the hashing trick to keep the dimensionality (and memory) bounded.
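A minimal sketch of the hashing trick for n-gram features; the bucket count is arbitrary and Python's built-in hash() stands in for the hashing function used by the real implementation:

```python
def ngram_hash_ids(tokens, n=2, num_buckets=2_000_000):
    """Map each word n-gram to a bucket id, so the n-gram embedding table
    has a fixed size no matter how many distinct n-grams appear."""
    ids = []
    for i in range(len(tokens) - n + 1):
        gram = " ".join(tokens[i:i + n])
        ids.append(hash(gram) % num_buckets)   # collisions may map several n-grams to one vector
    return ids

print(ngram_hash_ids("the cat sat on the mat".split()))
```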
3 Experiments
First, we compare it to existing text classifiers on the problem of sentiment analysis.
Then, we evaluate its capacity to scale to large output space on a tag prediction dataset.
① Sentiment analysis
② Tag prediction with a very large output space
3.1 Sentiment analysis
We present the results in Figure 1. We use 10 hidden units and run fastText for 5 epochs with a learning rate selected on a validation set from {0.05, 0.1, 0.25, 0.5}.
On this task, adding bigram information improves the performance by 1-4%. Overall our accuracy is slightly better than char-CNN and char-CRNN, and a bit worse than VDCNN.
Note that we can increase the accuracy slightly by using more n-grams, for example with trigrams.
We tune the hyperparameters on the validation set and observe that using n-grams up to 5 leads to the best performance.
3.2 Tag prediction
To test the scalability of our approach, further evaluation is carried out on the YFCC100M dataset, which consists of almost 100M images with captions, titles and tags. We focus on predicting the tags according to the title and caption (we do not use the images).
We remove the words and tags occurring less than 100 times and split the data into a train, validation and test set.
We consider a frequency-based baseline which predicts the most frequent tag. For Tagspace, we consider the linear version.
We run fastText for 5 epochs and compare it to Tagspace for two sizes of the hidden layer, i.e., 50 and 200. Both models achieve a similar performance with a small hidden layer, but adding bigrams gives us a significant boost in accuracy.
At test time, Tagspace needs to compute the scores for all the classes which makes it relatively slow, while our fast inference gives a significant speed-up when the number of classes is large (more than 300K here).
Overall, we are more than an order of magnitude faster to obtain a model with better quality.
4 Discussion and conclusion
Unlike unsupervisedly trained word vectors from word2vec, our word features can be averaged together to form good sentence representations.
In several tasks, fastText obtains performance on par with recently proposed methods inspired by deep learning, while being much faster.
Although deep neural networks have in theory much higher representational power than shallow models, it is not clear if simple text classification problems such as sentiment analysis are the right ones to evaluate them.
The input is a sentence; x1 to xN are the words (or n-grams) of that sentence. Each of them corresponds to a vector, and averaging these vectors gives the text vector, which is then used to predict the label. When there are not many classes, this is just the plain softmax; when the number of labels is huge, the hierarchical softmax is needed. Because the paper also introduces n-gram vectors on top of word vectors, and the number of n-grams is enormous, the parameter count would explode; hashing buckets are therefore used, possibly mapping several n-grams to the same vector, which saves a great deal of memory.
Comparison between word2vec and fastText:
word2vec averages the word vectors of the words in a local context window to predict the center word; fastText averages the word vectors of the whole sentence (or document) to predict the label.
word2vec does not use a plain softmax, because there are far too many words to predict; it uses hierarchical softmax or negative sampling instead. fastText uses the plain softmax when the number of labels is small and hierarchical softmax when it is large. fastText does not use negative sampling, because negative sampling does not produce proper (normalized) probabilities.
Supplementary notes:
Negative log-likelihood function
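As a quick reminder: for N training documents with inputs x_n and labels y_n, the negative log-likelihood is

L = -\frac{1}{N} \sum_{n=1}^{N} \log p(y_n \mid x_n)

and, writing A for the look-up table over words and n-grams, B for the classifier weights and f for the softmax, this is the objective the paper minimizes,

-\frac{1}{N} \sum_{n=1}^{N} y_n \log\big( f(B A x_n) \big),

where x_n is the normalized bag of features of the n-th document.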
Code:
After embedding each word in the sentence, the word representations are averaged into a text representation, which is in turn fed to a linear classifier.
A softmax function is used to compute the probability distribution over the predefined classes.
Cross entropy is then used to compute the loss.
The bag-of-words representation does not consider word order.
To take word order into account, n-gram features are used to capture some partial information about the local word order.
When the number of classes is large, computing the linear classifier is computationally expensive, so hierarchical softmax is used to speed up training.
Use bi-grams and/or tri-grams.
Use NCE loss to speed up the softmax computation (instead of the hierarchical softmax from the original paper).
Training the model (a code sketch follows the steps):
1. load data (X: list of int, y: int).
2. create session.
3. feed data.
4. training.
(5. validation)
(6. prediction)
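A minimal end-to-end sketch of these steps, assuming TensorFlow 1.x (session-based, as the steps suggest) and NCE loss in place of the paper's hierarchical softmax; all sizes, variable names and the random toy data below are assumptions, not the repository's actual code:

```python
import numpy as np
import tensorflow as tf   # assumes TensorFlow 1.x

VOCAB_SIZE, NUM_CLASSES, EMBED_DIM, NUM_SAMPLED, SEQ_LEN = 50000, 300, 100, 10, 20

# 3. feed data through placeholders
x = tf.placeholder(tf.int32, [None, SEQ_LEN])   # word / n-gram ids, padded
y = tf.placeholder(tf.int64, [None, 1])         # class id

# Embed each token and average into a single text representation.
embeddings = tf.get_variable("embeddings", [VOCAB_SIZE, EMBED_DIM])
text_repr = tf.reduce_mean(tf.nn.embedding_lookup(embeddings, x), axis=1)

# Linear classifier on top of the averaged representation.
W = tf.get_variable("W", [NUM_CLASSES, EMBED_DIM])
b = tf.get_variable("b", [NUM_CLASSES], initializer=tf.zeros_initializer())
logits = tf.matmul(text_repr, W, transpose_b=True) + b

# NCE loss approximates the full softmax during training.
loss = tf.reduce_mean(tf.nn.nce_loss(weights=W, biases=b, labels=y,
                                     inputs=text_repr,
                                     num_sampled=NUM_SAMPLED,
                                     num_classes=NUM_CLASSES))
train_op = tf.train.GradientDescentOptimizer(0.05).minimize(loss)
predictions = tf.argmax(tf.nn.softmax(logits), axis=1)   # 6. prediction

# 1. load data (random toy batch standing in for X: list of int, y: int)
X_batch = np.random.randint(0, VOCAB_SIZE, size=(32, SEQ_LEN))
y_batch = np.random.randint(0, NUM_CLASSES, size=(32, 1))

with tf.Session() as sess:                                # 2. create session
    sess.run(tf.global_variables_initializer())
    for epoch in range(5):                                # 4. training
        _, l = sess.run([train_op, loss], feed_dict={x: X_batch, y: y_batch})
        print("epoch", epoch, "loss", l)
    print(sess.run(predictions, feed_dict={x: X_batch})[:5])
```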