Python: fetching a page's HTML with the third-party requests library and searching it with regular expressions

The Jianshu home page is used as the example.

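If nothing is matched, copy the HTML that was actually loaded and check the regular expression against it. The page's tags and attributes may have changed, in which case the pattern no longer matches.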
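One way to run that check offline is to paste the copied HTML into a string (or read it from a saved file) and apply the pattern to it directly. The sketch below is only an illustration: `check_pattern` is a hypothetical helper name, and `SAMPLE_HTML` is made-up markup standing in for the copied page, so the real Jianshu markup may differ.

```python
import re

# Hypothetical sample standing in for HTML copied from the loaded page.
SAMPLE_HTML = '''
<li>
  <a class="title" target="_blank" href="/p/aaa111">Sample title one</a>
</li>
<li>
  <a class="title" target="_blank" href="/p/bbb222">Sample title two</a>
</li>
'''

# The same pattern the script uses for article links and titles.
TITLE_PATTERN = r'<a class="title" target="_blank" href="(.*?)">(.*?)</a>'


def check_pattern(pattern, html):
    """Return every (href, title) pair the pattern finds in html."""
    return re.findall(pattern, html)


if __name__ == '__main__':
    matches = check_pattern(TITLE_PATTERN, SAMPLE_HTML)
    if not matches:
        print('No matches: the page markup has probably changed.')
    for href, title in matches:
        print(href + ' ' + title)
```

An empty result on the copied HTML means the pattern is stale, not that the request failed.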

#!/usr/bin/python
# coding: utf-8

import os, sys
import requests
import re

# page = 1
# url = 'http://www.qiushibaike.com/hot/page/' + str(page)
# user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
# headers = { 'User-Agent' : user_agent }

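# Qiushibaike: fetch a page with custom request headers (e.g. a User-Agent)
def getHTMLHeader(urlString, headers):
    try:
        r = requests.get(urlString, headers=headers)
        r.raise_for_status()
        # Guess the encoding from the body so that r.text decodes correctly
        r.encoding = r.apparent_encoding
        return r.text
    except requests.RequestException as e:
        # Covers connection errors, timeouts and the HTTPError from raise_for_status()
        print(e)
        return ''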

def getHTML(urlString):
    try:
        r = requests.get(urlString)
        r.raise_for_status()
        # Guess the encoding from the body so that r.text decodes correctly
        r.encoding = r.apparent_encoding
        return r.text
    except requests.RequestException as e:
        # Covers connection errors, timeouts and the HTTPError from raise_for_status()
        print(e)
        return ''


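# Extract data with a regular expression, using the Jianshu home page as the example
# (article titles, links, read counts, comment counts; this pattern captures only the link and the title)
def regJianshuHtml(url):
    if url.strip() == '':
        html = getHTML("http://www.jianshu.com")
    else:
        html = getHTML(url)
    # Group 1 is the href, group 2 is the title text
    reg = r'<a class="title" target="_blank" href="(.*?)">(.*?)</a>'
    hotre = re.compile(reg)
    # Fall back to an empty string if the fetch returned nothing
    artlist = hotre.findall(html or '')

    for article in artlist:
        for com in article:
            if com.startswith("/p/"):
                # Relative article path: prepend the site root
                print("http://www.jianshu.com" + com)
            else:
                print(com)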

if __name__ == '__main__':
    # url = 'https://www.pmcaff.com/'
    # print(getHTML(url))

    # html = getHTMLHeader(url,headers)
    # print html

    regJianshuHtml("")

Result

http://www.jianshu.com/p/2622723e95b2
薛之谦高磊鑫：一别两宽，各生欢喜，你好，再见。
http://www.jianshu.com/p/1a5d3310b672
古巴女人迷死多少男人女人和我.....古巴裸游记2
http://www.jianshu.com/p/3298cc246015
中英双语：女性和公司（1）前言
http://www.jianshu.com/p/59a0e9694498
你只是表面上很努力，所以依然过得很煎熬
http://www.jianshu.com/p/bc2282da3b33
【原创育儿故事大赛】小恐龙给姥爷泡的茶
http://www.jianshu.com/p/72d8ce040ca7
不是你幸福少，而是你丧失了获得幸福的能力
http://www.jianshu.com/p/967cfeab62de
你为什么开始谈恋爱？
http://www.jianshu.com/p/cbc0566766d1
别再对我好了，我会当真的
http://www.jianshu.com/p/3d27a8603948
那些年我们追过的梦，还在吗？
http://www.jianshu.com/p/f15e454ac4c5
写在大三末尾：10条不怎么重要的大学建议
http://www.jianshu.com/p/86b10f6ea7d9
《简书·大学生活专题5月刊》|我超级自律，就是为了和别人不一样
http://www.jianshu.com/p/91347494f045
当写作无法跟上野心时，大学老师让我静下心努力
http://www.jianshu.com/p/d8e696fef7d9
如何打造自己的核心竞争力
http://www.jianshu.com/p/0858c9d5ad10
99%有钱人不愿告诉您的赚钱秘诀
http://www.jianshu.com/p/5e0905397803
看懂可口可乐，就能学会“定位”
http://www.jianshu.com/p/12b01f16e929
自律不一定会带来自由，但会带来一个更好的自己
http://www.jianshu.com/p/9c70cad93f1b
高考以后，你的人生才刚刚开始
http://www.jianshu.com/p/4688ba26e22a
情景剧|失恋女的胶原蛋白补充之旅
http://www.jianshu.com/p/7f4427d0ebdd
职场新人：有哪些靠谱的工作基本功
http://www.jianshu.com/p/d01ce7c931e5
夜深了，我来哄你的孩子睡觉

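Summary:

Be clear about the structure of the HTML page.

Know regular expressions well; they are the key to extracting the data.
Using BeautifulSoup would make this much simpler, though regular expressions still come into play there too.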
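
As a rough illustration of the BeautifulSoup route, the sketch below pulls the same link and title pairs with a parser instead of a regex. It assumes the third-party `bs4` package is installed; `extract_articles`, `BASE_URL` and `SAMPLE_HTML` are hypothetical names and made-up markup used only for this example, not something taken from the live site.

```python
from bs4 import BeautifulSoup

BASE_URL = 'http://www.jianshu.com'

# Hypothetical sample markup; the live page may be structured differently.
SAMPLE_HTML = '''
<ul>
  <li><a class="title" target="_blank" href="/p/aaa111">Sample title one</a></li>
  <li><a target="_blank" class="title" href="/p/bbb222">Sample title two</a></li>
  <li><a class="avatar" href="/u/ccc333">Not an article</a></li>
</ul>
'''


def extract_articles(html):
    """Return (absolute_url, title) pairs for every <a class="title"> link."""
    soup = BeautifulSoup(html, 'html.parser')
    articles = []
    for link in soup.find_all('a', class_='title'):
        href = link.get('href', '')
        if href.startswith('/p/'):
            # Relative article path: prepend the site root
            href = BASE_URL + href
        articles.append((href, link.get_text(strip=True)))
    return articles


if __name__ == '__main__':
    for url, title in extract_articles(SAMPLE_HTML):
        print(url)
        print(title)
```

Because the parser matches on the tag and its class rather than on the literal text of the tag, the second sample link is still found even though its attributes appear in a different order, which the regex in the script would miss.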
