A Curated List of Pretrained Word Embeddings

English Corpora

word2vec

Pre-trained vectors trained on part of the Google News dataset (about 100 billion words). The model contains 300-dimensional vectors for 3 million words and phrases. The phrases were obtained using a simple data-driven approach described in this paper.

download link | source link
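The GoogleNews file ships in word2vec's C binary format. In practice the easiest route is gensim's `KeyedVectors.load_word2vec_format(path, binary=True)`, but the format is simple enough to parse by hand. The sketch below is only an illustration, exercised on a tiny in-memory sample rather than the real multi-gigabyte file: an ASCII header line `vocab_size dim`, then for each word its token, a space, and `dim` little-endian float32 values.

```python
import io
import struct

def load_word2vec_binary(stream):
    """Parse the word2vec C binary format: an ASCII header line
    "vocab_size dim", then per word the token bytes up to a space,
    followed by dim little-endian float32 values (newline-terminated)."""
    vocab_size, dim = map(int, stream.readline().split())
    vectors = {}
    for _ in range(vocab_size):
        chars = []
        while True:                      # read the token byte by byte
            ch = stream.read(1)
            if ch == b' ':
                break
            if ch != b'\n':              # skip the previous record's newline
                chars.append(ch)
        word = b''.join(chars).decode('utf-8')
        vectors[word] = struct.unpack('<%df' % dim, stream.read(4 * dim))
    return vectors

# A tiny in-memory file in the same layout, standing in for GoogleNews-vectors.
buf = io.BytesIO()
buf.write(b'2 3\n')
buf.write(b'king ' + struct.pack('<3f', 0.5, 0.7, 0.1) + b'\n')
buf.write(b'queen ' + struct.pack('<3f', 0.5, 0.6, 0.2) + b'\n')
buf.seek(0)

vecs = load_word2vec_binary(buf)
print(sorted(vecs))        # ['king', 'queen']
print(len(vecs['king']))   # 3
```

For real work, prefer the gensim loader, which handles edge cases (encoding errors, optional newlines) that this sketch glosses over.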

fastText

1 million word vectors trained on Wikipedia 2017, the UMBC webbase corpus, and the statmt.org news dataset (16B tokens).

download link | source link

1 million word vectors trained with subword information on Wikipedia 2017, the UMBC webbase corpus, and the statmt.org news dataset (16B tokens).

download link | source link

2 million word vectors trained on Common Crawl (600B tokens).

download link | source link
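The fastText `.vec` downloads above are plain text: a header line `vocab_size dim`, then one `word v1 … v_dim` line per entry. A minimal hand-rolled loader, demonstrated here on a toy in-memory sample (the vectors and words are made up for illustration), looks like:

```python
import io
import math

def load_vec(fileobj):
    """Parse the fastText .vec text format: a "vocab_size dim" header,
    then one "word v1 v2 ... v_dim" line per word."""
    n_words, dim = map(int, fileobj.readline().split())
    vectors = {}
    for _ in range(n_words):
        parts = fileobj.readline().rstrip().split(' ')
        vectors[parts[0]] = [float(x) for x in parts[1:1 + dim]]
    return vectors

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Toy sample in the same layout as e.g. wiki-news-300d-1M.vec.
sample = io.StringIO(
    "3 4\n"
    "cat 0.1 0.2 0.3 0.4\n"
    "dog 0.1 0.2 0.3 0.5\n"
    "car 0.9 -0.1 0.0 0.2\n")
vecs = load_vec(sample)
print(round(cosine(vecs['cat'], vecs['dog']), 3))  # → 0.994
```

gensim reads the same files via `KeyedVectors.load_word2vec_format(path, binary=False)`, since the `.vec` layout matches the word2vec text format.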

GloVe

Wikipedia 2014 + Gigaword 5 (6B tokens, 400K vocab, uncased, 50d, 100d, 200d, & 300d vectors, 822 MB download)

download link | source link

Common Crawl (42B tokens, 1.9M vocab, uncased, 300d vectors, 1.75 GB download)

download link | source link

Common Crawl (840B tokens, 2.2M vocab, cased, 300d vectors, 2.03 GB download)

download link | source link

Twitter (2B tweets, 27B tokens, 1.2M vocab, uncased, 25d, 50d, 100d, & 200d vectors, 1.42 GB download)

download link | source link
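Note that GloVe's `.txt` files differ from the word2vec/fastText text format in one detail: there is no header line, so word2vec-style loaders fail on the first line. A loader just reads `word v1 … v_dim` lines directly, as sketched below on a toy sample (the real files load the same way, only larger):

```python
import io

def load_glove(fileobj):
    """Parse a GloVe .txt file: no header line; every line is
    "word v1 v2 ... v_dim"."""
    vectors = {}
    for line in fileobj:
        parts = line.rstrip().split(' ')
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors

# Toy sample standing in for e.g. glove.6B.50d.txt.
sample = io.StringIO(
    "the 0.1 0.2 0.3\n"
    "of 0.4 0.5 0.6\n")
vecs = load_glove(sample)
print(len(vecs), len(vecs['the']))  # 2 3
```

Alternatively, gensim ships a `glove2word2vec` conversion script that prepends the missing header so the file can be loaded with the standard word2vec loader.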

Chinese Corpora

word2vec

Trained on a Chinese Wikipedia dump. Vector size 300, corpus size 1 GB, vocabulary size 50,101; tokenized with Jieba.

download link | source link

fastText

Trained on Common Crawl and Wikipedia using fastText. These models were trained using CBOW with position-weights, in dimension 300, with character n-grams of length 5, a window of size 5, and 10 negatives. The Stanford word segmenter was used for tokenization.

download link | source link

References

https://github.com/Hironsan/awesome-embedding-models
http://ahogrammer.com/2017/01/20/the-list-of-pretrained-word-embeddings/
https://code.google.com/archive/p/word2vec/
https://github.com/facebookresearch/fastText/blob/master/pretrained-vectors.md
https://fasttext.cc/docs/en/english-vectors.html
https://arxiv.org/pdf/1310.4546.pdf

github link
