[Repost] Google's pre-trained Word2Vec model in Python

From here: http://mccormickml.com/2016/04/12/googles-pretrained-word2vec-model-in-python/

12 Apr 2016

In this post I’m going to describe how to get Google’s pre-trained Word2Vec model up and running in Python to play with.

As an interface to word2vec, I decided to go with a Python package called gensim. gensim appears to be a popular NLP package, and has some nice documentation and tutorials, including for word2vec.

You can download Google’s pre-trained model here. It’s 1.5GB! It includes word vectors for a vocabulary of 3 million words and phrases that they trained on roughly 100 billion words from a Google News dataset. The vector length is 300 features.

Loading this model using gensim is a piece of cake; you just need to pass in the path to the model file (update the path in the code below to wherever you’ve placed the file).
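A minimal sketch of that load, assuming gensim 4.x (where the loader lives on KeyedVectors; older releases exposed it as gensim.models.Word2Vec.load_word2vec_format) and a placeholder path:

```python
from gensim.models import KeyedVectors

# Update this path to wherever you've placed the downloaded .bin file.
model = KeyedVectors.load_word2vec_format(
    './GoogleNews-vectors-negative300.bin', binary=True
)

# Quick sanity check that the vectors are usable.
print(model['king'].shape)                 # (300,)
print(model.most_similar('king', topn=3))  # nearest neighbours by cosine similarity
```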

However, if you’re running 32-bit Python (like I was) you’re going to get a memory error!

This is because gensim allocates a big matrix to hold all of the word vectors, and if you do the math…

…that’s a big matrix!
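For reference, here's the back-of-the-envelope version of that math, assuming the vectors are stored as 32-bit floats (gensim's default):

```python
# 3 million vocabulary entries x 300 dimensions x 4 bytes per float32
vocab_size = 3_000_000
vector_dim = 300
bytes_per_float = 4

total_bytes = vocab_size * vector_dim * bytes_per_float
print(total_bytes / 1024**3)  # ~3.35 GiB -- more than a 32-bit Python process can realistically allocate
```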

Assuming you’ve got a 64-bit machine and a decent amount of RAM (I’ve got 16GB; maybe you could get away with 8GB?), your best bet is to switch to 64-bit Python. I had a little trouble with this–see my notes down at the end of the post.

Inspecting the Model

I have a small Python project on GitHub called inspect_word2vec that loads Google’s model, and inspects a few different properties of it.

If you’d like to browse the 3M word list in Google’s pre-trained model, you can just look at the text files in the vocabulary folder of that project. I split the word list across 50 files, and each text file contains 100,000 entries from the model. I split it up like this so your editor wouldn’t completely choke (hopefully) when you try to open them. The words are stored in their original order–I haven’t sorted the list alphabetically. I don’t know what determined the original order.
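If you'd rather generate that kind of listing yourself, here's a minimal sketch under the same assumptions as above (the model loaded earlier; index_to_key is the gensim 4.x attribute, older versions called it index2word). The chunk size and file names are just illustrative:

```python
# Dump the vocabulary, in its original order, into chunked text files.
vocab = model.index_to_key  # gensim 4.x; older versions: model.index2word
chunk_size = 100_000

for i in range(0, len(vocab), chunk_size):
    with open(f'vocabulary_{i // chunk_size:02d}.txt', 'w', encoding='utf-8') as f:
        f.write('\n'.join(vocab[i:i + chunk_size]))
```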

Here are some of the questions I had about the vocabulary, which I answered in this project (a quick way to check them yourself is sketched after the list):

Does it include stop words?

Answer: Some stop words like “a”, “and”, “of” are excluded, but others like “the”, “also”, “should” are included.

Does it include misspellings of words?

Answer: Yes. For instance, it includes both “mispelled” and “misspelled”–the latter is the correct one.

Does it include commonly paired words?

Answer: Yes. For instance, it includes “Soviet_Union” and “New_York”.

Does it include numbers?

Answer: Not directly; e.g., you won’t find “100”. But it does include entries like “###MHz_DDR2_SDRAM” where I’m assuming the ‘#’ are intended to match any digit.
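If you want to run these membership checks yourself rather than browse the text files, they're one-liners against the loaded model (key_to_index is the gensim 4.x lookup table; older versions used model.vocab):

```python
# Check which of these terms made it into the 3M-entry vocabulary.
for term in ['a', 'and', 'the', 'also', 'misspelled', 'mispelled',
             'Soviet_Union', 'New_York', '100', '###MHz_DDR2_SDRAM']:
    present = term in model.key_to_index  # gensim 4.x; older: term in model.vocab
    print(f'{term!r}: {present}')
```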

Here’s a selection of 30 “terms” from the vocabulary. Pretty weird stuff in there!

Al_Qods
Surendra_Pal
Leaflet
guitar_harmonica
Yeoval
Suhardi
VoATM
Streaming_Coverage
Vawda
Lisa_Vanderpump
Nevern
Saleema
Saleemi
rbracken@centredaily.com
yellow_wagtails
P_&C;
CHICOPEE_Mass._WWLP
Gardiners_Rd
Nevers
Stocks_Advance_Paced
IIT_alumnus
Popery
Kapumpa
fashionably_rumpled
WDTV_Live
ARTICLES_##V_##W
Yerga
Weegs
Paris_IPN_Euronext
##bFM_Audio_Simon
