Crawlers are a common kind of program, and they are very easy to write in Python thanks to a few relevant libraries. This post records some commonly used Python crawler snippets, for future reference.
1 requests
import requests
url = 'http://onevanillachecker.com/'
rep = requests.get(url)
rep.encoding = 'utf-8'
print(rep.text)
Notes on some common parameters:
import random
import requests

url = 'http://onevanillachecker.com/'
header = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Encoding': 'gzip, deflate, sdch',
    'Accept-Language': 'zh-CN,zh;q=0.8',
    'Connection': 'keep-alive',
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X x.y; rv:42.0) Gecko/20100101 Firefox/42.0'
}
# timeout is in seconds; pick a random value so requests are less uniform
timeout = random.choice(range(80, 180))
rep = requests.get(url, headers=header, timeout=timeout)
rep.encoding = 'utf-8'
print(rep.text)
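The snippets above assume the request succeeds. In practice it helps to check the status code and catch network errors; a minimal sketch of that pattern (same example URL, with a shorter illustrative timeout):

```python
import requests

url = 'http://onevanillachecker.com/'
try:
    rep = requests.get(url, timeout=10)
    rep.raise_for_status()   # raises HTTPError on a 4xx/5xx status
    rep.encoding = 'utf-8'
    print(rep.text[:200])
except requests.exceptions.RequestException as e:
    # RequestException is the base class for connection errors,
    # timeouts, and HTTP errors alike
    print('request failed:', e)
```

Catching `RequestException` keeps the crawler running when one page fails instead of crashing the whole run.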
2 urllib2
import urllib2
req = urllib2.Request('http://onevanillachecker.com/')
response = urllib2.urlopen(req)
html = response.read()
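Note that urllib2 exists only in Python 2; in Python 3 the same calls live in urllib.request. A minimal sketch of the equivalent (same example URL; the actual fetch is commented out so the snippet is safe to run offline):

```python
# Python 3 equivalent of the urllib2 snippet above
from urllib.request import Request, urlopen

# Build the request first; a User-Agent header is optional, but some
# sites reject the default Python one
req = Request('http://onevanillachecker.com/',
              headers={'User-Agent': 'Mozilla/5.0'})

# Uncomment to actually fetch the page:
# response = urlopen(req, timeout=10)
# html = response.read().decode('utf-8')

print(req.full_url)   # http://onevanillachecker.com/
```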
3 beautifulsoup
BeautifulSoup is a library for parsing pages, and it is very convenient to use.
Documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc.zh/
Below are some commonly used pieces, noted down for reference.
Installation
pip install beautifulsoup4
Basic usage
from bs4 import BeautifulSoup
import urllib2
req = urllib2.Request('http://onevanillachecker.com/')
response = urllib2.urlopen(req)
html = response.read()
# parse with BeautifulSoup; pass a parser explicitly to avoid a warning
soup = BeautifulSoup(html, 'html.parser')
print(soup.title)
# <title>One Vanilla Gift Card Balance Check -Official Website</title>
print(soup.title.name)
# title
print(soup.title.string)
# One Vanilla Gift Card Balance Check -Official Website
print(soup.title.parent.name)
# head
print(soup.p)
# <p>Life happens every day. And OneVanilla <br/>helps make it simpler. Shop, dine, fill 'er up <br/>and more - all with one prepaid card.</p>
# print(soup.p['class'])
print(soup.a)
# <a href="#">Vanilla Gift Card</a>
print(soup.find_all('a'))
# <a href="#">Vanilla Gift Card</a>, <a href="#">Check Vanilla 3 Balance</a>
# <a href="#">Vanilla Gift Cards</a>, <a href="#">Where to Buy</a> # <a href="#">Sign In</a>, <a href="#">About Vanilla Gift Card</a>
# <a href="#">Using Your Vanilla Gift Card</a>, <a href="#">Try Vanilla Gift</a>
# ......
print(soup.find(alt="2"))
# <img alt="2" src="OneVanilla_files\thumb_2.png"/>
print(soup.get_text())
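The snippets above all parse a live page. For a self-contained illustration, the same accessors work on an inline HTML string; the small document below is made up for the example, loosely modeled on the page above:

```python
from bs4 import BeautifulSoup

# a small hand-written document, purely for demonstration
html = """
<html><head><title>Demo Page</title></head>
<body>
<p class="intro">Hello</p>
<a href="/cards">Gift Cards</a>
<a href="/buy">Where to Buy</a>
<img alt="2" src="thumb_2.png"/>
</body></html>
"""

soup = BeautifulSoup(html, 'html.parser')

print(soup.title.string)             # Demo Page
print(soup.p['class'])               # ['intro']
# collect the href attribute of every link on the page
links = [a['href'] for a in soup.find_all('a')]
print(links)                         # ['/cards', '/buy']
print(soup.find(alt="2")['src'])     # thumb_2.png
```

Extracting attributes from `find_all` results like this (rather than printing whole tags) is the usual way to collect URLs for a crawler's next round of requests.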