English corpora
Google word2vec
Pre-trained word vectors from Google News (about 100 billion words): 300-dimensional vectors for about 3 million words and phrases. Implementation paper
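A minimal sketch of loading the Google News vectors with gensim, assuming the downloaded archive is named GoogleNews-vectors-negative300.bin.gz and sits in the working directory:

```python
# Sketch: load the Google News vectors with gensim.
from gensim.models import KeyedVectors

# binary=True because the Google News archive is in word2vec binary format
wv = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin.gz", binary=True
)

print(wv["computer"][:5])                    # first 5 dimensions of one word vector
print(wv.most_similar("computer", topn=3))   # nearest neighbours by cosine similarity
```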
Facebook fastText
1 million word vectors trained on Wikipedia 2017, UMBC webbase corpus and statmt.org news dataset (16B tokens).
1 million word vectors trained with subword information on Wikipedia 2017, UMBC webbase corpus and statmt.org news dataset (16B tokens).
2 million word vectors trained on Common Crawl (600B tokens).
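The .vec files above are plain-text word2vec format and can be read directly with gensim; a short sketch, assuming the file is wiki-news-300d-1M.vec (substitute whichever file you downloaded):

```python
# Sketch: load a fastText .vec text file with gensim.
from gensim.models import KeyedVectors

# .vec files are text word2vec format: a header line, then one vector per line
wv = KeyedVectors.load_word2vec_format("wiki-news-300d-1M.vec", binary=False)

print(wv.most_similar("king", topn=5))
```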
Stanford GloVe
Wikipedia 2014 + Gigaword 5 (6B tokens, 400K vocab, uncased, 50d, 100d, 200d, & 300d vectors, 822 MB download)
Common Crawl (42B tokens, 1.9M vocab, uncased, 300d vectors, 1.75 GB download)
Common Crawl (840B tokens, 2.2M vocab, cased, 300d vectors, 2.03 GB download)
Twitter (2B tweets, 27B tokens, 1.2M vocab, uncased, 25d, 50d, 100d, & 200d vectors, 1.42 GB download)
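GloVe text files lack the word2vec header line, so gensim needs to be told that explicitly. A sketch assuming gensim >= 4.0 (which accepts no_header=True) and the file glove.6B.300d.txt from the 6B download above:

```python
# Sketch: load a GloVe file (no header line) with gensim.
from gensim.models import KeyedVectors

wv = KeyedVectors.load_word2vec_format(
    "glove.6B.300d.txt", binary=False, no_header=True
)

print(wv.most_similar("frog", topn=5))
```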
Chinese corpora
word2vec
Wikipedia database, Vector Size 300, Corpus Size 1G, Vocabulary Size 50101, Jieba tokenizer
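A sketch of querying this model: segment Chinese text with Jieba, then look each token up in the vectors. The path zh/zh.bin follows the Kyubyong/wordvectors layout (linked in the appendix below) and is an assumption:

```python
# Sketch: Jieba segmentation + lookup in the Chinese Wikipedia word2vec model.
import jieba
from gensim.models import Word2Vec

model = Word2Vec.load("zh/zh.bin")            # assumed path to the saved gensim model
tokens = jieba.lcut("自然语言处理很有趣")      # segment a sentence into words

for tok in tokens:
    if tok in model.wv:                        # skip out-of-vocabulary tokens
        print(tok, model.wv[tok][:3])
```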
fastText
Trained on Common Crawl and Wikipedia using fastText. These models were trained using CBOW with position-weights, in dimension 300, with character n-grams of length 5, a window of size 5, and 10 negatives. The Stanford word segmenter was used for tokenization.
download link | source link
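Because the .bin model keeps its character n-gram buckets, it can compose vectors for words that never appeared in training. A sketch using the official fasttext package, assuming the downloaded file is named cc.zh.300.bin:

```python
# Sketch: load the Common Crawl + Wikipedia Chinese fastText model.
import fasttext

model = fasttext.load_model("cc.zh.300.bin")

vec = model.get_word_vector("深度学习")        # works even for out-of-vocabulary words
print(vec.shape)                               # (300,)
```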
Appendix: processing method:
https://github.com/Kyubyong/wordvectors