After three years in testing I have picked up a fair amount of development knowledge, so on a whim I decided to build a small side project during a quiet stretch of testing.
The overall plan:
1. A Python crawler scrapes Qiushibaike and inserts the extracted fields into a database.
2. A Java service exposes the data as a paginated JSON API.
3. An Android app parses the JSON API and shows the data in a paginated ListView.
This post covers the Python crawler.
Prerequisites: a Python environment with requests, lxml, and pymysql installed; you can install them from the Scripts directory of your Python installation with pip install requests lxml pymysql.
Database setup: install MySQL and create the table:
CREATE TABLE `qiushibaike` (
  `id` INT NOT NULL AUTO_INCREMENT,
  `imgUrl` VARCHAR(3000),
  `username` VARCHAR(3000),
  `content` VARCHAR(3000),
  `vote` VARCHAR(3000),
  `comments` VARCHAR(3000),
  `imgpath` VARCHAR(3000),
  PRIMARY KEY (`id`)
) DEFAULT CHARSET=utf8mb4;
Open the Qiushibaike site and page through it; you will notice that the number after page/ in the URL is the page number.
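That URL pattern can be captured in a small helper; this is just an illustration (the function name page_url is my own), using the same base URL as the crawler below:

```python
def page_url(page):
    # The listing URL for page N is simply the base path plus the page number.
    return 'http://www.qiushibaike.com/8hr/page/' + str(page)

print(page_url(3))  # http://www.qiushibaike.com/8hr/page/3
```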
(The code below targets Python 3.)
#!/usr/bin/env python
# -*- coding:utf-8 -*-
import requests
from lxml import etree
import pymysql
def insert(imgUrl, username, content, vote, comments, imgpath):
    # Connect to the database
    connection = pymysql.connect(host='127.0.0.1', port=3306, user='root', password='root', db='shop',
                                 charset='utf8mb4', cursorclass=pymysql.cursors.DictCursor)
    try:
        # Create a cursor and run a parameterized INSERT
        with connection.cursor() as cursor:
            sql = 'INSERT INTO `qiushibaike` (`imgUrl`,`username`,`content`,`vote`,`comments`,`imgpath`) VALUES (%s,%s,%s,%s,%s,%s)'
            cursor.execute(sql, (imgUrl, username, content, vote, comments, imgpath))
        # Commit the transaction
        connection.commit()
    finally:
        connection.close()
def loadPage(page):
    url = 'http://www.qiushibaike.com/8hr/page/' + str(page)
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36',
        'Accept-Language': 'zh-CN,zh;q=0.8'}
    try:
        response = requests.get(url, headers=headers)
        html = etree.HTML(response.text)
        result = html.xpath('//div[contains(@id,"qiushi_tag")]')
        # Iterate over the posts and extract each field
        for site in result:
            # Avatar URL; fall back to an empty string if the xpath matches nothing
            try:
                imgUrl = site.xpath('./div/a/img/@src')[0]
            except IndexError:
                imgUrl = ""
            # Username
            try:
                username = site.xpath('./div/a/h2/text()')[0]
            except IndexError:
                username = ""
            # Post content
            try:
                content = site.xpath('.//div[@class="content"]/span')[0].text.strip()
            except (IndexError, AttributeError):
                content = ""
            # Vote count
            try:
                vote = site.xpath('.//i')[0].text
            except IndexError:
                vote = ""
            # Comment count
            try:
                comments = site.xpath('.//i')[1].text
            except IndexError:
                comments = ""
            # Content image
            try:
                imgpath = site.xpath('./div/a/img/@src')[1]
            except IndexError:
                imgpath = ""
            print(imgUrl, username, content, vote, comments, imgpath)
            # Insert the row into the database
            insert(imgUrl, username, content, vote, comments, imgpath)
    except Exception as e:
        print(e)

if __name__ == '__main__':
    # Load pages 1-12
    for num in range(1, 13):
        loadPage(num)
        print("=============== Page " + str(num) + " loaded ================")
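The repeated try/except blocks in loadPage all implement the same "first match or empty string" fallback; they could be factored into a small helper, sketched below (the name first_or_empty is my own, not part of the original code):

```python
def first_or_empty(nodes, index=0):
    # Return nodes[index] if it exists, otherwise an empty string --
    # the same fallback used for each scraped field in loadPage.
    try:
        return nodes[index]
    except IndexError:
        return ""

# Usage inside the loop over posts, e.g.:
#   imgUrl  = first_or_empty(site.xpath('./div/a/img/@src'))
#   imgpath = first_or_empty(site.xpath('./div/a/img/@src'), 1)
```

This keeps the extraction loop short and makes the fallback behavior obvious in one place.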
After the crawl finishes, check the database; if the table is populated with rows like these, the scrape succeeded.