Web Crawler, Updated Version

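def crawl_web(seed):
    # get_page, get_all_links and union are helpers that are not shown in this post.
    tocrawl = [seed]   # pages still to visit
    crawled = []       # pages already visited
    index = []         # list of [keyword, [url, ...]] entries

    while tocrawl:
        page = tocrawl.pop()
        if page not in crawled:
            content = get_page(page)
            add_page_to_index(index, page, content)
            union(tocrawl, get_all_links(content))
            crawled.append(page)

    return index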
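
crawl_web calls get_page, get_all_links and union, none of which appear in this post. The block below is a minimal sketch of what they might look like. The bodies are assumptions, not the author's code: get_page is assumed to return an empty string on failure, get_all_links is assumed to do a naive scan for double-quoted `<a href=` attributes, and get_next_target is a hypothetical helper introduced only for this sketch.

```python
from urllib.request import urlopen


def get_page(url):
    """Return the page body as text, or "" if the URL is malformed or the fetch fails."""
    try:
        with urlopen(url, timeout=10) as response:
            return response.read().decode("utf-8", errors="replace")
    except (OSError, ValueError):
        # URLError/HTTPError are OSError subclasses; ValueError covers bad URLs.
        return ""


def get_next_target(page):
    """Return (url, end_position) for the first '<a href=' link, or (None, 0)."""
    start_link = page.find('<a href=')
    if start_link == -1:
        return None, 0
    start_quote = page.find('"', start_link)
    if start_quote == -1:
        return None, 0
    end_quote = page.find('"', start_quote + 1)
    if end_quote == -1:
        return None, 0
    return page[start_quote + 1:end_quote], end_quote


def get_all_links(page):
    """Collect every double-quoted href target found in the page text."""
    links = []
    while True:
        url, end_pos = get_next_target(page)
        if url is None:
            break
        links.append(url)
        page = page[end_pos:]  # continue scanning after this link
    return links


def union(target, items):
    """Append to target, in place, each item it does not already contain."""
    for item in items:
        if item not in target:
            target.append(item)
```

A string scan like this misses single-quoted or unquoted href values; a real crawler would more likely use an HTML parser.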

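def add_to_index(index, keyword, url):
    # If the keyword already has an entry, append the url to it.
    for entry in index:
        if entry[0] == keyword:
            entry[1].append(url)
            return
    # Otherwise start a new [keyword, [url]] entry.
    index.append([keyword, [url]])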

def lookup(index, keyword):
    # Return the list of urls recorded for keyword, or [] if there is none.
    for entry in index:
        if entry[0] == keyword:
            return entry[1]
    return []

def add_page_to_index(index, url, content):
    # Index every whitespace-separated word on the page under this url.
    words = content.split()
    for word in words:
        add_to_index(index, word, url)
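
The list-based index above scans every entry on each add_to_index and lookup call. As an alternative, here is a sketch of the same three functions backed by a dict. This variant is not from the post. Unlike the original, it also skips a url that is already recorded for a keyword, so a word repeated on one page is stored once. The example URLs and page text are made up for illustration.

```python
def add_to_index(index, keyword, url):
    """Record that keyword appears at url; index is a dict of keyword -> [urls]."""
    urls = index.setdefault(keyword, [])
    if url not in urls:  # unlike the list version, do not store repeats
        urls.append(url)


def lookup(index, keyword):
    """Return the urls recorded for keyword, or [] if it was never indexed."""
    return index.get(keyword, [])


def add_page_to_index(index, url, content):
    """Index every whitespace-separated word of content under url."""
    for word in content.split():
        add_to_index(index, word, url)


# Hypothetical pages, indexed without any network access.
index = {}
add_page_to_index(index, "http://a.example/", "web crawler web index")
add_page_to_index(index, "http://b.example/", "web search")

print(lookup(index, "web"))      # ['http://a.example/', 'http://b.example/']
print(lookup(index, "missing"))  # []
```

To use this variant with crawl_web, the crawler would have to start from `index = {}` instead of `index = []`.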

Last edited: 2017.12.05 15:27:28
