A Survey of Text Classification

Continuously updated

Introduction

1. Definition

What is text classification? Put simply, it means assigning a piece of text to one or more predefined categories. It falls within the scope of document classification.
Input:
a document d
a fixed set of classes C = {c1, c2, ... , cn}
Output:
a predicted class ci from C

2. Some simple applications

  1. spam detection
  2. authorship attribution
  3. age/gender identification
  4. sentiment analysis
  5. assigning subject categories, topics or genres
    ...

Traditional methods

1. Naive Bayes

two assumptions:

  1. Bag of words assumption:
    position doesn't matter
  2. Conditional independence assumption:
    the feature probabilities P(xi | c) are independent given the class c

To compute these probabilities:

Add-one smoothing prevents zero probability estimates (a constant other than one can be added as well):

To deal with unknown words, i.e. words that never appear in the training data:
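
The formulas that these lines introduce are not reproduced in this text. As a sketch, assuming the standard multinomial Naive Bayes formulation (the symbols N_c, N, count(w, c), V and w_u are notation chosen here, not taken from the original), they can be written as:

```latex
% Classifier under the bag-of-words and conditional independence assumptions
c_{NB} = \arg\max_{c \in C} P(c) \prod_{i} P(w_i \mid c)

% Maximum-likelihood estimates from training counts
% (N_c: number of training documents in class c, N: number of training documents)
\hat{P}(c) = \frac{N_c}{N}, \qquad
\hat{P}(w \mid c) = \frac{\mathrm{count}(w, c)}{\sum_{w' \in V} \mathrm{count}(w', c)}

% Add-one (Laplace) smoothing
\hat{P}(w \mid c) = \frac{\mathrm{count}(w, c) + 1}{\sum_{w' \in V} \mathrm{count}(w', c) + |V|}

% One common convention for unknown words: add a single token w_u with zero count
\hat{P}(w_u \mid c) = \frac{1}{\sum_{w' \in V} \mathrm{count}(w', c) + |V| + 1}
```

With the last convention the vocabulary size becomes |V| + 1 in every smoothed estimate. Another common choice is simply to ignore unknown words at prediction time.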

main features:

  1. very fast, low storage requirements
  2. robust to irrelevant features
  3. good in domains with many equally important features
  4. optimal if the independence assumptions hold
  5. lacks accuracy in general
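
To make the method concrete, here is a minimal sketch of a multinomial Naive Bayes classifier with add-one smoothing, using only the Python standard library. The function names (`train_naive_bayes`, `predict`) and the toy spam/ham data are made up for illustration, and skipping unknown words at prediction time is an assumption, not necessarily what the original formulas did.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(docs, labels):
    """Estimate log-priors and add-one-smoothed log-likelihoods from tokenized docs."""
    vocab = {w for doc in docs for w in doc}
    class_docs = Counter(labels)
    word_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc)

    log_prior = {c: math.log(n / len(docs)) for c, n in class_docs.items()}
    log_likelihood = {}
    for c in class_docs:
        total = sum(word_counts[c].values())
        # Add-one smoothing: every vocabulary word gets a pseudo-count of 1
        log_likelihood[c] = {
            w: math.log((word_counts[c][w] + 1) / (total + len(vocab)))
            for w in vocab
        }
    return log_prior, log_likelihood, vocab


def predict(doc, log_prior, log_likelihood, vocab):
    """Return the class with the highest log-score for a tokenized document."""
    scores = {}
    for c in log_prior:
        # Assumption: words outside the training vocabulary are skipped
        scores[c] = log_prior[c] + sum(
            log_likelihood[c][w] for w in doc if w in vocab
        )
    return max(scores, key=scores.get)


# Toy training set (hypothetical data)
train_docs = [
    "win cash prize now".split(),
    "cheap prize click now".split(),
    "meeting agenda for monday".split(),
    "please review the meeting notes".split(),
]
train_labels = ["spam", "spam", "ham", "ham"]

model = train_naive_bayes(train_docs, train_labels)
print(predict("claim your cash prize".split(), *model))
print(predict("notes for the monday meeting".split(), *model))
```

Working in log space avoids numerical underflow when many small probabilities are multiplied.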

2. SVM

1. Cost function of SVM:

2. SVM decision boundary
When C is very large:
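
The formulas for these two items are not reproduced in this text. As a sketch, assuming the formulation used in the Coursera machine learning course listed in the references (labels y in {0, 1}, parameters θ, m training examples, n features), they can be written as:

```latex
% SVM objective; cost_1 and cost_0 are the hinge-style costs for y = 1 and y = 0
\min_{\theta} \; C \sum_{i=1}^{m} \left[ y^{(i)} \, \mathrm{cost}_1(\theta^{T} x^{(i)})
  + (1 - y^{(i)}) \, \mathrm{cost}_0(\theta^{T} x^{(i)}) \right]
  + \frac{1}{2} \sum_{j=1}^{n} \theta_j^{2}

% When C is very large and the data are linearly separable, the first term is
% driven to zero, which leaves a large-margin problem
\min_{\theta} \; \frac{1}{2} \sum_{j=1}^{n} \theta_j^{2}
\quad \text{s.t.} \quad
\theta^{T} x^{(i)} \ge 1 \ \text{if } y^{(i)} = 1, \qquad
\theta^{T} x^{(i)} \le -1 \ \text{if } y^{(i)} = 0
```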

About kernels:
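
The original illustration for kernels is not reproduced in this text. As an example of one widely used kernel (an assumption about which kernel was meant), the Gaussian (RBF) kernel measures the similarity between an input x and a landmark point l, with bandwidth σ:

```latex
k(x, l) = \exp\!\left( -\frac{\lVert x - l \rVert^{2}}{2\sigma^{2}} \right)
```

The value is close to 1 when x is near l and decays toward 0 as x moves away. A larger σ makes the decay slower.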

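Up to this point, SVMs appear to apply only to two-class classification. A common extension to K classes is the one-vs-all scheme: train one binary SVM per class and predict the class whose classifier gives the highest score.

Comparison with logistic regression: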

To apply an SVM or logistic regression to text classification, all you need is labeled data and a suitable way to represent each text as a vector (for example a one-hot representation, word2vec, or doc2vec).
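
The pipeline described above (represent each text as a vector, then fit a linear classifier) can be sketched in plain Python. This is an illustration under stated assumptions, not a production implementation: the representation is a bag-of-words count vector, the SVM is linear and trained by subgradient descent on the hinge loss rather than by a dedicated solver, labels are +1/-1, and all function names and the toy data are hypothetical.

```python
def build_vocab(docs):
    """Map each distinct token to a column index."""
    vocab = {}
    for doc in docs:
        for tok in doc.split():
            vocab.setdefault(tok, len(vocab))
    return vocab


def vectorize(doc, vocab):
    """Bag-of-words count vector; tokens outside the vocabulary are ignored."""
    vec = [0.0] * len(vocab)
    for tok in doc.split():
        if tok in vocab:
            vec[vocab[tok]] += 1.0
    return vec


def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=100):
    """Subgradient descent on (lam / 2) * ||w||^2 + hinge loss; labels are +1 / -1."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            # The L2 penalty shrinks the weights on every step
            w = [wj * (1.0 - lr * lam) for wj in w]
            if margin < 1.0:  # inside the margin or misclassified
                w = [wj + lr * yi * xj for wj, xj in zip(w, xi)]
                b += lr * yi
    return w, b


def svm_predict(x, w, b):
    """Return +1 or -1 depending on the side of the decision boundary."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1


# Toy data (hypothetical): +1 = spam, -1 = ham
docs = [
    "win cash prize now",
    "cheap prize click now",
    "meeting agenda for monday",
    "please review the meeting notes",
]
labels = [1, 1, -1, -1]

vocab = build_vocab(docs)
X = [vectorize(d, vocab) for d in docs]
w, b = train_linear_svm(X, labels)

print(svm_predict(vectorize("cash prize now", vocab), w, b))
print(svm_predict(vectorize("agenda for the meeting", vocab), w, b))
```

Swapping the hinge loss for the logistic loss in the update step would give logistic regression on the same vectors, which is why the two methods share the same data preparation.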

Neural network methods

1. CNN

(1) the paper "Convolutional Neural Networks for Sentence Classification", which appeared at EMNLP 2014
(2) the paper "A Sensitivity Analysis of (and Practitioners' Guide to) Convolutional Neural Networks for Sentence Classification"

The model uses multiple filters to obtain multiple features. These features form the penultimate layer and are passed to a fully connected softmax layer whose output is the probability distribution over labels.

For regularization we employ dropout on the penultimate layer with a constraint on l2-norms of the weight vectors. Dropout prevents co-adaptation of hidden units by randomly dropping out (i.e., setting to zero) a proportion p of the hidden units during forward-backpropagation.
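
The forward pass described above (several filters, max-over-time pooling into a penultimate layer, dropout, softmax) can be sketched in plain Python. This is a toy illustration only: the sizes, the random weights and the random "word vectors" are placeholders, all names are hypothetical, training is not shown, and rescaling the features by the keep probability at test time is an assumption about how dropout is compensated.

```python
import math
import random

random.seed(0)

EMBED_DIM = 4            # toy size, far smaller than real word2vec vectors
FILTER_WIDTHS = (2, 3)   # window sizes in words (toy values)
FILTERS_PER_WIDTH = 2
NUM_CLASSES = 2
DROPOUT_P = 0.5          # probability of dropping a penultimate-layer unit


def rand_matrix(rows, cols):
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def conv_max_pool(sentence, filt, bias, width):
    """Slide one filter over every window of `width` words, apply ReLU, keep the max."""
    feats = []
    for i in range(len(sentence) - width + 1):
        # Concatenate the word vectors in the window
        window = [x for vec in sentence[i:i + width] for x in vec]
        feats.append(max(0.0, sum(w * x for w, x in zip(filt, window)) + bias))
    return max(feats)


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def clip_l2_norm(w, s=3.0):
    """Rescale w so that its l2-norm is at most s (applied after a gradient step)."""
    norm = math.sqrt(sum(x * x for x in w))
    return [x * s / norm for x in w] if norm > s else list(w)


def forward(sentence, filters, W_out, b_out, train=False):
    # Penultimate layer: one max-pooled feature per filter
    z = [conv_max_pool(sentence, f, b, width) for width, f, b in filters]
    if train:
        # Dropout: zero each unit with probability DROPOUT_P
        z = [0.0 if random.random() < DROPOUT_P else v for v in z]
    else:
        # Assumption: at test time, scale by the keep probability instead
        z = [v * (1.0 - DROPOUT_P) for v in z]
    scores = [sum(w * v for w, v in zip(row, z)) + b for row, b in zip(W_out, b_out)]
    return softmax(scores)


filters = [
    (width, [random.uniform(-0.5, 0.5) for _ in range(width * EMBED_DIM)], 0.0)
    for width in FILTER_WIDTHS
    for _ in range(FILTERS_PER_WIDTH)
]
W_out = rand_matrix(NUM_CLASSES, len(filters))
b_out = [0.0] * NUM_CLASSES

sentence = rand_matrix(6, EMBED_DIM)  # six random vectors standing in for word embeddings
probs = forward(sentence, filters, W_out, b_out)
print(probs)
```

Max-over-time pooling is what lets sentences of different lengths map to a penultimate layer of fixed size: each filter contributes exactly one number regardless of how many windows it saw.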

Pre-trained Word Vectors
We use the publicly available word2vec vectors that were trained on 100 billion words from Google News.

Results

A simplified implementation using TensorFlow is available on GitHub: https://github.com/dennybritz/cnn-text-classification-tf

2. RNN

the paper "Hierarchical Attention Networks for Document Classification", which appeared at NAACL 2016

The paper tests the hypothesis that better representations can be obtained by incorporating knowledge of document structure in the model architecture.

  1. It is observed that different words and sentences in a document are differentially informative.
  2. Moreover, the importance of words and sentences is highly context dependent,
    i.e. the same word or sentence may be differentially important in different contexts.

Attention serves two benefits: not only does it often result in better performance, but it also provides insight into which words and sentences contribute to the classification decision, which can be of value in applications and analysis.
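
To illustrate how attention weights expose which words matter, here is a minimal sketch of attention pooling in plain Python: each hidden state is scored against a context vector, the scores are normalized with a softmax, and the weighted sum becomes the sentence vector. The hidden states, `W`, `b` and `context` are hand-picked numbers for illustration (in a real model they come from the encoder and from training), and the function name `attention_pool` is hypothetical.

```python
import math


def attention_pool(hidden_states, W, b, context):
    """Return (attention weights, weighted sum) for a list of hidden-state vectors."""
    scores = []
    for h in hidden_states:
        # u = tanh(W h + b): a hidden representation of h
        u = [math.tanh(sum(wij * hj for wij, hj in zip(row, h)) + bi)
             for row, bi in zip(W, b)]
        # Similarity between u and the context vector
        scores.append(sum(ui * ci for ui, ci in zip(u, context)))
    # Softmax over positions
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted sum of the hidden states
    dim = len(hidden_states[0])
    pooled = [sum(a * h[k] for a, h in zip(weights, hidden_states)) for k in range(dim)]
    return weights, pooled


words = ["the", "food", "was", "delicious"]
hidden_states = [[0.1, 0.0], [0.5, 0.2], [0.0, 0.1], [2.0, 1.5]]  # made-up values
W = [[1.0, 0.0], [0.0, 1.0]]
b = [0.0, 0.0]
context = [1.0, 1.0]

weights, sentence_vector = attention_pool(hidden_states, W, b, context)
for word, a in zip(words, weights):
    print(f"{word}: {a:.3f}")
```

Printing the weights next to the words is the kind of inspection the paragraph above refers to: the position with the largest weight contributes most to the pooled vector.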

Hierarchical Attention Network

If you want to learn more about attention mechanisms: http://www.wildml.com/2016/01/attention-and-memory-in-deep-learning-and-nlp/

The model uses a GRU-based sequence encoder.
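
For reference, a sketch of the standard GRU update, written in the usual notation (x_t is the input at time t, h_t the hidden state, σ the logistic sigmoid, ⊙ element-wise multiplication; the weight names are conventional, not taken from this text):

```latex
% Update gate, reset gate, candidate state and new state at time t
z_t = \sigma(W_z x_t + U_z h_{t-1} + b_z)
r_t = \sigma(W_r x_t + U_r h_{t-1} + b_r)
\tilde{h}_t = \tanh(W_h x_t + r_t \odot (U_h h_{t-1}) + b_h)
h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t
```

The update gate z_t decides how much of the previous state is kept, and the reset gate r_t controls how much of the previous state feeds into the candidate state.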
1. Word Encoder:

2. Word Attention:

3. Sentence Encoder:

4. Sentence Attention:

5. Document Classification:
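Because the document vector v is a high-level representation of document d, it can be used as features for document classification.

The training loss is the negative log-likelihood of the correct labels, where j is the label of document d.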
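
The formulas for the five steps are not reproduced in this text. As a sketch that follows the formulation of the cited NAACL 2016 paper as closely as recalled here (w_it is the t-th word of sentence i, W_e an embedding matrix, u_w and u_s learned word-level and sentence-level context vectors, T the number of words in a sentence and L the number of sentences; the loss is written as a calligraphic L to avoid a clash):

```latex
% 1. Word encoder: bidirectional GRU over the words of sentence i
x_{it} = W_e w_{it}, \quad t \in [1, T]
\overrightarrow{h}_{it} = \overrightarrow{\mathrm{GRU}}(x_{it}), \qquad
\overleftarrow{h}_{it} = \overleftarrow{\mathrm{GRU}}(x_{it}), \qquad
h_{it} = [\overrightarrow{h}_{it}, \overleftarrow{h}_{it}]

% 2. Word attention
u_{it} = \tanh(W_w h_{it} + b_w), \qquad
\alpha_{it} = \frac{\exp(u_{it}^{\top} u_w)}{\sum_{t'} \exp(u_{it'}^{\top} u_w)}, \qquad
s_i = \sum_{t} \alpha_{it} h_{it}

% 3. Sentence encoder: bidirectional GRU over the sentence vectors
\overrightarrow{h}_{i} = \overrightarrow{\mathrm{GRU}}(s_i), \qquad
\overleftarrow{h}_{i} = \overleftarrow{\mathrm{GRU}}(s_i), \qquad
h_i = [\overrightarrow{h}_{i}, \overleftarrow{h}_{i}], \quad i \in [1, L]

% 4. Sentence attention
u_i = \tanh(W_s h_i + b_s), \qquad
\alpha_i = \frac{\exp(u_i^{\top} u_s)}{\sum_{i'} \exp(u_{i'}^{\top} u_s)}, \qquad
v = \sum_{i} \alpha_i h_i

% 5. Document classification and training loss
p = \mathrm{softmax}(W_c v + b_c), \qquad
\mathcal{L} = -\sum_{d} \log p_{dj}
```

The word level and the sentence level use the same two-stage pattern (encode, then attend), which is what makes the network hierarchical.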

Results

A simplified implementation written in Python is available on GitHub: https://github.com/richliao/textClassifier

References

https://www.cs.cmu.edu/%7Ediyiy/docs/naacl16.pdf
https://www.coursera.org/learn/machine-learning/home/
https://www.youtube.com/playlist?list=PL6397E4B26D00A269
