Assignment code:
import graphlab
# Limit number of worker processes. This preserves system memory, which prevents hosted notebooks from crashing.
graphlab.set_runtime_config('GRAPHLAB_DEFAULT_NUM_PYLAMBDA_WORKERS', 4)
# Load the data
people = graphlab.SFrame("people_wiki.gl/")
# Build a word-count vector for each article (tokenization)
people["word_count"] = graphlab.text_analytics.count_words(people["text"])
# Compute TF-IDF
tfidf = graphlab.text_analytics.tf_idf(people["word_count"])
people['tfidf'] = tfidf
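To make the two steps above less of a black box, here is a minimal pure-Python sketch of word counting and TF-IDF weighting on a toy corpus. The helper names `toy_count_words` and `toy_tf_idf` and the three sample sentences are made up for illustration. The weighting assumed here is count * log(N / df), which is what I understand the GraphLab Create documentation to describe; the library's tokenization may differ from a plain whitespace split.

```python
import math
from collections import Counter


def toy_count_words(text):
    """Hypothetical helper: lowercase, split on whitespace, count tokens."""
    return dict(Counter(text.lower().split()))


def toy_tf_idf(word_counts):
    """Hypothetical helper: weight each count by log(N / df).

    N is the number of documents and df is the number of documents
    that contain the word (assumed weighting, see the text above).
    """
    n_docs = len(word_counts)
    doc_freq = Counter()
    for counts in word_counts:
        doc_freq.update(counts.keys())
    return [
        {
            word: count * math.log(n_docs / float(doc_freq[word]))
            for word, count in counts.items()
        }
        for counts in word_counts
    ]


# Toy corpus, not taken from people_wiki.gl
docs = ["the piano man sings", "the band plays", "the piano"]
counts = [toy_count_words(d) for d in docs]
weights = toy_tf_idf(counts)
```

A word such as "the" that appears in every document gets weight log(1) = 0, which is why common words drop out of the TF-IDF ranking while they dominate the raw word-count ranking.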
1 . Top word count words for Elton John
elton = people[people["name"] == "Elton John"]
elton[["word_count"]].stack("word_count",new_column_name = ["word","count"]).sort("count",ascending = False)
The output is as follows:
2 . Top TF-IDF words for Elton John
elton[["tfidf"]].stack("tfidf",new_column_name = ["word","tfidf"]).sort("tfidf",ascending = False)
The output is as follows:
3 . The cosine distance between 'Elton John's and 'Victoria Beckham's articles (represented with TF-IDF) falls within which range?
4 . The cosine distance between 'Elton John's and 'Paul McCartney's articles (represented with TF-IDF) falls within which range?
5 . Who is closer to 'Elton John', 'Victoria Beckham' or 'Paul McCartney'?
victoria = people[people['name'] == 'Victoria Beckham']
paul = people[people["name"] == "Paul McCartney"]
graphlab.distances.cosine(elton['tfidf'][0],victoria['tfidf'][0])
graphlab.distances.cosine(elton["tfidf"][0],paul["tfidf"][0])
The output is as follows (Victoria Beckham first, then Paul McCartney):
0.9567006376655429
0.8250310029221779
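Cosine distance is 1 minus the cosine similarity, so a smaller value means the two articles are closer. With the values above, Paul McCartney (about 0.825) is closer to Elton John than Victoria Beckham (about 0.957). A minimal sketch of the computation on sparse word-weight dictionaries, using a hypothetical helper `cosine_distance` and made-up toy vectors:

```python
import math


def cosine_distance(a, b):
    """1 - cos(a, b) for two sparse vectors stored as {word: weight} dicts.

    Assumes both vectors have at least one non-zero weight.
    """
    # Only words present in both vectors contribute to the dot product.
    dot = sum(weight * b[word] for word, weight in a.items() if word in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return 1.0 - dot / (norm_a * norm_b)


# Toy vectors, not taken from the dataset
same_direction = cosine_distance({"piano": 1.0}, {"piano": 2.0})
no_overlap = cosine_distance({"piano": 1.0}, {"fashion": 3.0})
```

Two vectors that point in the same direction have distance 0 regardless of their length, and two vectors with no words in common have distance 1.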
knn_tfdif_model = graphlab.nearest_neighbors.create(people,features = ["tfidf"],label = "name",distance = "cosine")
knn_wordcount_model = graphlab.nearest_neighbors.create(people,features = ["word_count"],label = "name",distance = "cosine")
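Conceptually, a nearest-neighbors model with cosine distance compares the query row against every row and sorts by distance. The sketch below is a brute-force version on toy data. The function `toy_nearest_neighbors` and the rows are hypothetical, and this is not how GraphLab Create implements the search internally.

```python
import math


def cosine_distance(a, b):
    """1 - cos(a, b) for sparse {word: weight} dicts (non-empty vectors assumed)."""
    dot = sum(weight * b[word] for word, weight in a.items() if word in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return 1.0 - dot / (norm_a * norm_b)


def toy_nearest_neighbors(rows, query_name, feature="tfidf", k=5):
    """Hypothetical brute-force search: return the k closest (name, distance) pairs."""
    query = next(row for row in rows if row["name"] == query_name)
    scored = [
        (row["name"], cosine_distance(query[feature], row[feature]))
        for row in rows
    ]
    # Sort by distance, smallest first
    return sorted(scored, key=lambda pair: pair[1])[:k]


# Toy rows, not taken from people_wiki.gl
rows = [
    {"name": "A", "tfidf": {"piano": 2.0, "song": 1.0}},
    {"name": "B", "tfidf": {"piano": 1.0, "song": 1.0}},
    {"name": "C", "tfidf": {"fashion": 3.0}},
]
neighbors = toy_nearest_neighbors(rows, "A", k=2)
```

Because the query row is itself part of the dataset, the first result is the query at distance 0. When reading the `query` output for the questions below, the nearest neighbor is therefore the second row.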
6 . Who is the nearest neighbor to 'Elton John' using raw word counts?
8 . Who is the nearest neighbor to 'Victoria Beckham' using raw word counts?
knn_wordcount_model.query(elton)
knn_wordcount_model.query(victoria)
The output is as follows:
7 . Who is the nearest neighbor to 'Elton John' using TF-IDF?
9 . Who is the nearest neighbor to 'Victoria Beckham' using TF-IDF?
knn_tfdif_model.query(elton)
knn_tfdif_model.query(victoria)
The output is as follows: