02 Scraping web pages with a crawler, detecting page changes, and tracking GitHub topic popularity

Scraping a simple web page

from bs4 import BeautifulSoup
import requests

url = 'https://knewone.com/?page=2'
wb_data = requests.get(url)
soup = BeautifulSoup(wb_data.text, 'lxml')

# The trailing comments keep the full selector path copied from the browser
imgs = soup.select('article > header > a > img')  #wrapper > ul > li:nth-child(39) > article > header > a > img
titles = soup.select('article > section > h4 > a')  #wrapper > ul > li:nth-child(39) > article > section > h4 > a
links = soup.select('article > section > h4 > a')  #wrapper > ul > li:nth-child(39) > article > section > h4 > a

# Walk the three lists in parallel, one dict per entry
for img, title, link in zip(imgs, titles, links):
    data = {
        'img': img.get('src'),
        'title': title.get('title'),
        'link': 'https://knewone.com/' + link.get('href')
    }
    print(data)

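If the page loads its content dynamically (asynchronously), open the browser's Inspect Element tools, switch to the Network tab, and look under XHR. Trigger another load of content on the page, and the request that appears there shows the URL suffix you need.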
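As a rough illustration, once the suffix is known the URLs for further pages can be generated instead of typed by hand. The sketch below assumes the suffix has the form ?page=N, as in the URL used in the example above; whether the asynchronous requests of a given site follow the same pattern is an assumption to verify in the Network tab. The helper name page_url is hypothetical.

```python
from urllib.parse import urlencode

BASE_URL = 'https://knewone.com/'  # site used in the example above


def page_url(page, base_url=BASE_URL):
    """Build the URL for one page (hypothetical helper, assumes a ?page=N suffix)."""
    return base_url + '?' + urlencode({'page': page})


# URLs for the first three pages; each one can be passed to requests.get()
urls = [page_url(n) for n in range(1, 4)]
for url in urls:
    print(url)
```

Each generated URL can then go through the same requests.get and BeautifulSoup steps as a single page.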

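Suppose we want to detect changes to the following page, to see whether the author has updated it. First, the page address:

https://github.com/lennylxx/ipv6-hosts

[Screenshot of the repository page]

The corresponding API endpoint is https://api.github.com/repos/lennylxx/ipv6-hosts
Opening it returns JSON like the following, which looks a lot like a Python dictionary: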

{
  "id": 21858929,
  "node_id": "MDEwOlJlcG9zaXRvcnkyMTg1ODkyOQ==",
  "name": "ipv6-hosts",
  "full_name": "lennylxx/ipv6-hosts",
  "owner": {
    "login": "lennylxx",
    "id": 5811576,
    "node_id": "MDQ6VXNlcjU4MTE1NzY=",
    "avatar_url": "https://avatars3.githubusercontent.com/u/5811576?v=4",
    "gravatar_id": "",
    "url": "https://api.github.com/users/lennylxx",
    "html_url": "https://github.com/lennylxx",
    "followers_url": "https://api.github.com/users/lennylxx/followers",
    "following_url": "https://api.github.com/users/lennylxx/following{/other_user}",
    "gists_url": "https://api.github.com/users/lennylxx/gists{/gist_id}",
    "starred_url": "https://api.github.com/users/lennylxx/starred{/owner}{/repo}",
    "subscriptions_url": "https://api.github.com/users/lennylxx/subscriptions",
    "organizations_url": "https://api.github.com/users/lennylxx/orgs",
    "repos_url": "https://api.github.com/users/lennylxx/repos",
    "events_url": "https://api.github.com/users/lennylxx/events{/privacy}",
    "received_events_url": "https://api.github.com/users/lennylxx/received_events",
    "type": "User",
    "site_admin": false
  },
  "private": false,
  "html_url": "https://github.com/lennylxx/ipv6-hosts",
  "description": null,
  "fork": false,
  "url": "https://api.github.com/repos/lennylxx/ipv6-hosts",
  "forks_url": "https://api.github.com/repos/lennylxx/ipv6-hosts/forks",
  "keys_url": "https://api.github.com/repos/lennylxx/ipv6-hosts/keys{/key_id}",
  "collaborators_url": "https://api.github.com/repos/lennylxx/ipv6-hosts/collaborators{/collaborator}",
  "teams_url": "https://api.github.com/repos/lennylxx/ipv6-hosts/teams",
  "hooks_url": "https://api.github.com/repos/lennylxx/ipv6-hosts/hooks",
  "issue_events_url": "https://api.github.com/repos/lennylxx/ipv6-hosts/issues/events{/number}",
  "events_url": "https://api.github.com/repos/lennylxx/ipv6-hosts/events",
  "assignees_url": "https://api.github.com/repos/lennylxx/ipv6-hosts/assignees{/user}",
  "branches_url": "https://api.github.com/repos/lennylxx/ipv6-hosts/branches{/branch}",
  "tags_url": "https://api.github.com/repos/lennylxx/ipv6-hosts/tags",
  "blobs_url": "https://api.github.com/repos/lennylxx/ipv6-hosts/git/blobs{/sha}",
  "git_tags_url": "https://api.github.com/repos/lennylxx/ipv6-hosts/git/tags{/sha}",
  "git_refs_url": "https://api.github.com/repos/lennylxx/ipv6-hosts/git/refs{/sha}",
  "trees_url": "https://api.github.com/repos/lennylxx/ipv6-hosts/git/trees{/sha}",
  "statuses_url": "https://api.github.com/repos/lennylxx/ipv6-hosts/statuses/{sha}",
  "languages_url": "https://api.github.com/repos/lennylxx/ipv6-hosts/languages",
  "stargazers_url": "https://api.github.com/repos/lennylxx/ipv6-hosts/stargazers",
  "contributors_url": "https://api.github.com/repos/lennylxx/ipv6-hosts/contributors",
  "subscribers_url": "https://api.github.com/repos/lennylxx/ipv6-hosts/subscribers",
  "subscription_url": "https://api.github.com/repos/lennylxx/ipv6-hosts/subscription",
  "commits_url": "https://api.github.com/repos/lennylxx/ipv6-hosts/commits{/sha}",
  "git_commits_url": "https://api.github.com/repos/lennylxx/ipv6-hosts/git/commits{/sha}",
  "comments_url": "https://api.github.com/repos/lennylxx/ipv6-hosts/comments{/number}",
  "issue_comment_url": "https://api.github.com/repos/lennylxx/ipv6-hosts/issues/comments{/number}",
  "contents_url": "https://api.github.com/repos/lennylxx/ipv6-hosts/contents/{+path}",
  "compare_url": "https://api.github.com/repos/lennylxx/ipv6-hosts/compare/{base}...{head}",
  "merges_url": "https://api.github.com/repos/lennylxx/ipv6-hosts/merges",
  "archive_url": "https://api.github.com/repos/lennylxx/ipv6-hosts/{archive_format}{/ref}",
  "downloads_url": "https://api.github.com/repos/lennylxx/ipv6-hosts/downloads",
  "issues_url": "https://api.github.com/repos/lennylxx/ipv6-hosts/issues{/number}",
  "pulls_url": "https://api.github.com/repos/lennylxx/ipv6-hosts/pulls{/number}",
  "milestones_url": "https://api.github.com/repos/lennylxx/ipv6-hosts/milestones{/number}",
  "notifications_url": "https://api.github.com/repos/lennylxx/ipv6-hosts/notifications{?since,all,participating}",
  "labels_url": "https://api.github.com/repos/lennylxx/ipv6-hosts/labels{/name}",
  "releases_url": "https://api.github.com/repos/lennylxx/ipv6-hosts/releases{/id}",
  "deployments_url": "https://api.github.com/repos/lennylxx/ipv6-hosts/deployments",
  "created_at": "2014-07-15T12:36:53Z",
  "updated_at": "2018-07-04T07:31:08Z",
  "pushed_at": "2018-06-22T01:57:04Z",
  "git_url": "git://github.com/lennylxx/ipv6-hosts.git",
  "ssh_url": "git@github.com:lennylxx/ipv6-hosts.git",
  "clone_url": "https://github.com/lennylxx/ipv6-hosts.git",
  "svn_url": "https://github.com/lennylxx/ipv6-hosts",
  "homepage": "",
  "size": 7345,
  "stargazers_count": 2858,
  "watchers_count": 2858,
  "language": "Python",
  "has_issues": true,
  "has_projects": true,
  "has_downloads": true,
  "has_wiki": true,
  "has_pages": false,
  "forks_count": 861,
  "mirror_url": null,
  "archived": false,
  "open_issues_count": 12,
  "license": {
    "key": "mit",
    "name": "MIT License",
    "spdx_id": "MIT",
    "url": "https://api.github.com/licenses/mit",
    "node_id": "MDc6TGljZW5zZTEz"
  },
  "forks": 861,
  "open_issues": 12,
  "watchers": 2858,
  "default_branch": "master",
  "network_count": 861,
  "subscribers_count": 313
}

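Where is the update time?
The JSON above contains "updated_at": "2018-07-04T07:31:08Z", which is the time of the last update.
To find out whether the page has changed, just check this timestamp.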
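The comparison itself can be tried offline before any polling is involved. The following is a minimal sketch, assuming the timestamp always has the format shown above; the sample string is a trimmed copy of the response, the helper names parse_time and has_changed are made up for illustration, and the "newer" timestamp is a hypothetical value.

```python
import json
from datetime import datetime

# Trimmed copy of the response shown above (only the fields used here)
sample = '''{
  "full_name": "lennylxx/ipv6-hosts",
  "updated_at": "2018-07-04T07:31:08Z"
}'''


def parse_time(stamp):
    """Turn a timestamp such as 2018-07-04T07:31:08Z into a datetime."""
    return datetime.strptime(stamp, '%Y-%m-%dT%H:%M:%SZ')


def has_changed(last_seen, current):
    """True if the current timestamp is later than the one seen before."""
    return parse_time(current) > parse_time(last_seen)


repo = json.loads(sample)          # JSON text -> Python dict
last_seen = repo['updated_at']
newer = '2018-07-05T00:00:00Z'     # hypothetical value from a later request

print(has_changed(last_seen, last_seen))
print(has_changed(last_seen, newer))
```

Parsing into datetime objects is optional here, since timestamps in this fixed format also compare correctly as plain strings, but it fails loudly if the format ever differs.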

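import time
import webbrowser

import requests

# Repository API endpoint, and the page to open when it changes
api = 'https://api.github.com/repos/lennylxx/ipv6-hosts'
web_page = 'https://github.com/lennylxx/ipv6-hosts'
last_update = None

while True:
    # Fetch the current timestamp on every pass
    all_info = requests.get(api).json()
    cur_update = all_info['updated_at']
    print(cur_update)

    if not last_update:
        # First pass: record the baseline
        last_update = cur_update

    # Timestamps in this fixed format compare correctly as strings
    if last_update < cur_update:
        webbrowser.open(web_page)
        last_update = cur_update
    time.sleep(600)  # check again in 10 minutes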

Comparing the popularity of several popular libraries

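GitHub has a ready-made search API for this: https://developer.github.com/v3/search/. I use the q parameter.
Take django as an example: https://api.github.com/search/repositories?q=django returns the projects related to django. A nice thing about the API is its simplicity, since the results come back as JSON. The response object in requests has a .json() method that converts the JSON into Python dictionaries, lists, and so on.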


Likewise, repositories whose topic is Django are available through the same API: https://api.github.com/search/repositories?q=topic:django
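The two query forms differ only in the value of q, so a small helper can build either one. This is a sketch rather than part of the original script; search_url is a hypothetical name, and it percent-encodes the query with the standard library, so the colon in topic:django becomes %3A in the resulting URL.

```python
from urllib.parse import urlencode

SEARCH_API = 'https://api.github.com/search/repositories'


def search_url(name, topic=False):
    """Build a repository search URL, by keyword or by topic (hypothetical helper)."""
    query = 'topic:' + name if topic else name
    return SEARCH_API + '?' + urlencode({'q': query})


print(search_url('django'))
print(search_url('django', topic=True))
```

Encoding the query this way also keeps the URL valid when a name contains spaces or other special characters.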

In practice, use it like this:

# https://api.github.com/search/repositories?q=topic:django
# https://api.github.com/search/repositories?q=django

# get_names -- check_repos

import requests


def get_names():
    print('Separate each name with Space')
    names = input()
    return names.split()


def check_repos(names):
    repo_api = 'https://api.github.com/search/repositories?q='
    ecosys_api = 'https://api.github.com/search/repositories?q=topic:'
    for name in names:
        # JSON -> dict -> dict['items'] (a list) -> list[0], the first hit, e.g.
        # django: {"name": "django", "stargazers_count": 34961}
        repo_info = requests.get(repo_api + name).json()['items'][0]
        stars = repo_info['stargazers_count']
        forks = repo_info['forks_count']
        # total_count is the number of repositories found for this topic
        ecosys_info = requests.get(ecosys_api + name).json()['total_count']
        print(name)
        print('Stars:' + str(stars))
        print('Forks:' + str(forks))
        print('Ecosys:' + str(ecosys_info))
        print('-------------------')


names = get_names()
check_repos(names)

Output:

>>>Separate each name with Space
flask django sanic bottle
flask
Stars:37174
Forks:11015
Ecosys:6734
-------------------
django
Stars:34965
Forks:14861
Ecosys:10212
-------------------
sanic
Stars:9640
Forks:895
Ecosys:158
-------------------
bottle
Stars:5528
Forks:1125
Ecosys:117
-------------------
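To make the comparison easier to read, the numbers can be collected into a dictionary and sorted by each metric. The sketch below uses the figures from the sample output above; the results layout and the rank_by helper are illustrative choices, not part of the original script.

```python
# Figures copied from the sample output above
results = {
    'flask': {'stars': 37174, 'forks': 11015, 'ecosys': 6734},
    'django': {'stars': 34965, 'forks': 14861, 'ecosys': 10212},
    'sanic': {'stars': 9640, 'forks': 895, 'ecosys': 158},
    'bottle': {'stars': 5528, 'forks': 1125, 'ecosys': 117},
}


def rank_by(metric, data=results):
    """Return the library names sorted by one metric, highest first."""
    return sorted(data, key=lambda name: data[name][metric], reverse=True)


for metric in ('stars', 'forks', 'ecosys'):
    print(metric, rank_by(metric))
```

With these figures flask leads on stars, while django leads on forks and on the number of repositories tagged with its topic.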