NLP Learning: Embeddings
Text data must be encoded as numbers to be used as input or output for machine learning and deep learning models.
The first step is to split the text into words. Words are called tokens, and the process of splitting text into tokens is called tokenization.
In short: define the examples, encode them as integers, then pad the sequences to the same length.
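A minimal sketch of these three steps, assuming the Keras preprocessing utilities are available (the documents, vocab_size, and maxlen are toy assumptions):

    # Define example documents, integer-encode them, then pad to equal length.
    from tensorflow.keras.preprocessing.text import one_hot
    from tensorflow.keras.preprocessing.sequence import pad_sequences

    docs = ['well done', 'good work', 'nice effort', 'could have done better']
    vocab_size = 50  # assumed upper bound on the vocabulary size

    # one_hot hashes each word to an integer in [1, vocab_size)
    encoded = [one_hot(d, vocab_size) for d in docs]

    # pad every sequence to the same length (4 is an assumed choice)
    padded = pad_sequences(encoded, maxlen=4, padding='post')
    print(padded)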
Alternatively, consider the Tokenizer class for preparing text documents for deep learning: the tokenizer must be constructed and then fit on either raw text documents or integer-encoded text documents.
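A short sketch of that workflow, again using the Keras API (the example documents are made up):

    from tensorflow.keras.preprocessing.text import Tokenizer

    docs = ['well done', 'good work', 'nice effort', 'poor work']
    tokenizer = Tokenizer()
    tokenizer.fit_on_texts(docs)  # fit on raw text documents

    print(tokenizer.word_index)                # learned word -> integer mapping
    print(tokenizer.texts_to_sequences(docs))  # documents as integer sequences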
Key topic: word embeddings
Word embeddings can be learned from text data and reused among projects. They can also be learned as part of fitting a neural network on text data.
A word embedding is a class of approaches for representing words and documents using a dense vector representation.
In short, an embedding layer turns positive integers (indices) into dense vectors of a fixed size, e.g. [[4], [20]] -> [[0.25, 0.1], [0.6, -0.2]].
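To make that mapping concrete, here is a tiny sketch (assuming TensorFlow/Keras) that feeds those index sequences through an untrained Embedding layer; the actual output values will differ from the illustration above because the weights are randomly initialized:

    import numpy as np
    from tensorflow.keras.layers import Embedding

    layer = Embedding(input_dim=21, output_dim=2)  # indices 0..20 -> 2-dim vectors
    out = layer(np.array([[4], [20]]))
    print(out.shape)  # (2, 1, 2): one 2-dim vector per input index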
By contrast, in an embedding each word is represented by a dense vector, where the vector is a projection of the word into a continuous vector space. A word's position in the vector space is learned from text and is based on the words that surround it when it is used. The position of a word in the learned vector space is referred to as its embedding.
Word embeddings can be learned as part of a deep learning model. This can be a slower approach, but it tailors the model to a specific training dataset. The Embedding layer is initialized with random weights and will learn an embedding for all of the words in the training dataset.
The Embedding layer is very flexible and can be used in several ways, for example: a. it can be used alone to learn a word embedding that can be saved and used in another model later; b. it can be used as part of a deep learning model where the embedding is learned along with the model itself; c. it can be used to load a pre-trained word embedding model, a type of transfer learning.
The Embedding layer is defined as the first hidden layer of a network. It has weights that are learned; if you save your model to file, the saved model will include the weights for the Embedding layer. For example:
e = Embedding(200, 32, input_length=50) specifies a vocabulary of 200 words (integer-encoded from 0 to 199, inclusive), a vector space of 32 dimensions in which the words will be embedded, and input documents that have 50 words each.
The output of the Embedding layer is a 2D array, with one embedding vector for each word in the input sequence of words (the input document). If you wish to connect a Dense layer directly to an Embedding layer, you must first flatten the 2D output matrix to a 1D vector using a Flatten layer.
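Putting the pieces together, a runnable sketch of a small model with an Embedding layer as its first hidden layer (vocab_size, max_length, the 8-dimensional embedding, and the binary output are all toy assumptions):

    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import Embedding, Flatten, Dense

    vocab_size, max_length = 50, 4  # matching the padding sketch above
    model = Sequential()
    model.add(Embedding(vocab_size, 8, input_length=max_length))  # learned weights
    model.add(Flatten())  # flatten the 4x8 output matrix to a 32-element vector
    model.add(Dense(1, activation='sigmoid'))
    model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
    model.summary()
    # model.fit(padded, labels, ...) would then learn the embedding with the model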
For the code, see my own GitHub: NLP learning.ipynb
Reference: https://blog.csdn.net/weixin_42078618/article/details/82999906 (a concise explanation with illustrations)
For pre-trained embeddings, the two common choices are GloVe and word2vec; for the differences between them, see: https://zhuanlan.zhihu.com/p/31023929
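As a final sketch, this is how a pre-trained embedding such as GloVe is commonly loaded into an Embedding layer (use case c above); the file name glove.6B.100d.txt and the toy word_index are assumptions, and in practice word_index would come from a fitted Tokenizer:

    import numpy as np
    from tensorflow.keras.layers import Embedding

    # Parse the GloVe text file: one word followed by its vector per line.
    embedding_index = {}
    with open('glove.6B.100d.txt', encoding='utf-8') as f:
        for line in f:
            values = line.split()
            embedding_index[values[0]] = np.asarray(values[1:], dtype='float32')

    word_index = {'good': 1, 'work': 2}  # toy vocabulary; normally tokenizer.word_index
    vocab_size = len(word_index) + 1     # +1 because index 0 is reserved for padding

    # Copy the pre-trained vectors into a weight matrix aligned with our indices.
    embedding_matrix = np.zeros((vocab_size, 100))
    for word, i in word_index.items():
        vector = embedding_index.get(word)
        if vector is not None:
            embedding_matrix[i] = vector

    # trainable=False freezes the weights, i.e. transfer learning.
    embedding_layer = Embedding(vocab_size, 100, weights=[embedding_matrix],
                                trainable=False)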