Python Web Crawler Notes

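Web crawlers are a common kind of program and are easy to write in Python, which has several libraries for the job. This post records the crawler code I use most often, for future reference.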

1 requests

import requests
url ='http://onevanillachecker.com/'
rep = requests.get(url)
rep.encoding = 'utf-8'
print(rep.text)

Notes on some parameters

import random
import requests
url = 'http://onevanillachecker.com/'
# Browser-like request headers
header = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Encoding': 'gzip, deflate, sdch',
    'Accept-Language': 'zh-CN,zh;q=0.8',
    'Connection': 'keep-alive',
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X x.y; rv:42.0) Gecko/20100101 Firefox/42.0'
}
# Timeout in seconds, picked at random from 80 to 179
timeout = random.choice(range(80, 180))
rep = requests.get(url, headers=header, timeout=timeout)
rep.encoding = 'utf-8'
print(rep.text)
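
When several pages are fetched with the same headers, the parameters above can be set once on a requests.Session. The following is only a minimal sketch: make_session and fetch_text are hypothetical helper names, the (5, 30) connect/read timeout is an assumed value rather than one from the notes above, and raise_for_status() is added so that HTTP error pages are not silently treated as content.

```python
import requests

# Headers reused for every request; values copied from the example above
DEFAULT_HEADERS = {
    'Accept-Language': 'zh-CN,zh;q=0.8',
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X x.y; rv:42.0) Gecko/20100101 Firefox/42.0',
}


def make_session(headers=DEFAULT_HEADERS):
    """Create a Session that sends `headers` with every request (hypothetical helper)."""
    session = requests.Session()
    session.headers.update(headers)
    return session


def fetch_text(session, url, timeout=(5, 30)):
    """Fetch `url` and return its body as text.

    `timeout` is a (connect, read) pair in seconds; the default is an assumption.
    """
    rep = session.get(url, timeout=timeout)
    rep.raise_for_status()  # raise requests.HTTPError on 4xx/5xx responses
    rep.encoding = 'utf-8'
    return rep.text
```

Typical use would be session = make_session() followed by print(fetch_text(session, 'http://onevanillachecker.com/')).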

2 urllib2

import urllib2  # Python 2 only; in Python 3 this module became urllib.request
req = urllib2.Request('http://onevanillachecker.com/')
response = urllib2.urlopen(req)
html = response.read()  # raw page body
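
urllib2 exists only in Python 2. On Python 3 the same request can be made with urllib.request from the standard library. A minimal sketch, where fetch is a hypothetical helper name and the 10-second timeout, the User-Agent value and the utf-8 fallback are assumptions:

```python
import urllib.request


def fetch(url, timeout=10):
    """Return the body of `url` decoded to text (hypothetical helper)."""
    req = urllib.request.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
    with urllib.request.urlopen(req, timeout=timeout) as response:
        raw = response.read()  # bytes
        # Use the charset declared in Content-Type; fall back to utf-8 (assumption)
        charset = response.headers.get_content_charset() or 'utf-8'
    return raw.decode(charset, errors='replace')
```

Calling fetch('http://onevanillachecker.com/') would return the page as a str, ready to pass to a parser.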

3 beautifulsoup

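BeautifulSoup is a library for parsing pages and is very convenient to use.
Documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc.zh/
Below are a few commonly used features, for reference.
Installation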

pip install beautifulsoup4

Basic usage

from bs4 import BeautifulSoup
import urllib2
req = urllib2.Request('http://onevanillachecker.com/')
response = urllib2.urlopen(req)
html = response.read()

# Parse the page; naming the parser explicitly avoids bs4's "no parser specified" warning
soup = BeautifulSoup(html, 'html.parser')
print(soup.title)
# <title>One Vanilla Gift Card Balance Check -Official Website</title>
print(soup.title.name)
# title
print(soup.title.string)
# One Vanilla Gift Card Balance Check -Official Website
print(soup.title.parent.name)
# head
print(soup.p)
# <p>Life happens every day. And OneVanilla <br/>helps make it simpler. Shop, dine, fill 'er up <br/>and more - all with one prepaid card.</p>
# print(soup.p['class'])  # raises KeyError here: this <p> has no class attribute
print(soup.a)
# <a href="#">Vanilla Gift Card</a>
print(soup.find_all('a'))
# <a href="#">Vanilla Gift Card</a>, <a href="#">Check Vanilla 3 Balance</a>,
# <a href="#">Vanilla Gift Cards</a>, <a href="#">Where to Buy</a>,
# <a href="#">Sign In</a>, <a href="#">About Vanilla Gift Card</a>,
# <a href="#">Using Your Vanilla Gift Card</a>, <a href="#">Try Vanilla Gift</a>
# ......
print(soup.find(alt="2"))
# <img alt="2" src="OneVanilla_files\thumb_2.png"/>
print(soup.get_text())
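
The calls above can be combined into a small link extractor. This is only a sketch: SAMPLE_HTML is a made-up page used so the example runs without network access, and page_title and extract_links are hypothetical helper names.

```python
from bs4 import BeautifulSoup

# Made-up page for illustration; not the real site's markup
SAMPLE_HTML = """
<html><head><title>Demo</title></head>
<body>
<p class="intro">Hello</p>
<a href="/buy">Where to Buy</a>
<a href="/signin">Sign In</a>
<a>No link</a>
</body></html>
"""


def page_title(html):
    """Return the text of the <title> tag."""
    soup = BeautifulSoup(html, 'html.parser')
    return soup.title.string


def extract_links(html):
    """Return (text, href) pairs for every <a> tag that has an href attribute."""
    soup = BeautifulSoup(html, 'html.parser')
    # href=True skips anchors without an href, so a['href'] cannot raise KeyError
    return [(a.get_text(strip=True), a['href'])
            for a in soup.find_all('a', href=True)]


print(page_title(SAMPLE_HTML))
print(extract_links(SAMPLE_HTML))
```

Filtering with href=True in find_all is one way to avoid the KeyError that indexing a missing attribute raises; tag.get('href') returns None instead and is an alternative.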