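The third project in the Python hands-on plan: scraping rental listings.
The final result is as follows:
The crawl covers 9 search pages with 24 listings each, for a total of 216 listings, i.e. 216 records.
Each record contains 7 fields: title, address, daily rent, link to the first room image, link to the host's photo, host's gender, and host's name.
The code is as follows: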
import requests
from bs4 import BeautifulSoup
import time
def get_links(url):
    # Fetch one search-result page and scrape every listing it links to.
    wb_data = requests.get(url)
    soup = BeautifulSoup(wb_data.text, 'lxml')
    links = soup.select('#page_list > ul > li > a')
    for link in links:
        href = link.get('href')
        one(href)
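def if_sex(sexname):
    # sexname is the tag's class attribute, which BeautifulSoup returns as a list.
    if sexname == ['member_girl_ico']:
        return '女'  # female
    elif sexname == ['member_boy_ico']:
        return '男'  # male
    else:
        return '没填写'  # not specified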
def one(url, data=None):
    # Scrape a single listing page and print its seven fields.
    wb_data = requests.get(url)
    soup = BeautifulSoup(wb_data.text, 'lxml')
    titles = soup.select('div.pho_info > h4 > em')
    addres = soup.select('div.pho_info > p > span.pr5')
    prices = soup.select('#pricePart > div.day_l > span')
    images = soup.select('#curBigImage')
    pictures = soup.select('#floatRightBox > div.js_box.clearfix > div.member_pic > a > img')
    sexes = soup.select('#floatRightBox > div.js_box.clearfix > div.w_240 > h6 > span')
    names = soup.select('#floatRightBox > div.js_box.clearfix > div.w_240 > h6 > a')
    # print(titles,addres,prices,pictures,names)
    if data is None:
        for title, addre, price, picture, name, sex, image in zip(titles, addres, prices, pictures, names, sexes,
                                                                  images):
            data = {
                'title': title.get_text(),
                # Strip newlines and spaces from the address text.
                'addre': addre.get_text().replace('\n', '').replace(' ', ''),
                'price': price.get_text(),
                'picture': picture.get('src'),
                'name': name.get_text(),
                'sex': if_sex(sex.get('class')),
                'image': image.get('src')
            }
            print(data)
urls = ['http://wh.xiaozhu.com/search-duanzufang-p{}-0/?startDate=2016-07-17&endDate=2016-08-24'.format(i) for i in
        range(1, 10)]
for url in urls:
    get_links(url)
    time.sleep(2)
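As a quick sanity check that needs no network access, the sketch below rebuilds the nine search-page URLs from the same template and confirms the expected total of 216 records. The names `URL_TEMPLATE`, `LISTINGS_PER_PAGE` and `build_page_urls` are introduced here for illustration only, and the figure of 24 listings per page is taken from the result described above rather than checked against the live site.

```python
# Hypothetical helper names; the template is the one used in the script above.
URL_TEMPLATE = ('http://wh.xiaozhu.com/search-duanzufang-p{}-0/'
                '?startDate=2016-07-17&endDate=2016-08-24')
LISTINGS_PER_PAGE = 24  # assumption taken from the result described above


def build_page_urls(first_page=1, last_page=9):
    """Return the search-page URLs for pages first_page..last_page inclusive."""
    return [URL_TEMPLATE.format(page) for page in range(first_page, last_page + 1)]


page_urls = build_page_urls()
expected_records = len(page_urls) * LISTINGS_PER_PAGE
print(len(page_urls), expected_records)
```

Keeping the URL construction in its own function makes it easy to change the page range later without touching the scraping code.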
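Summary:
1. Break a large task into small ones wherever possible, and pay attention to the inputs and outputs of each piece.
2. The replace method: replace('a', 'b') returns a new string in which every occurrence of a is replaced by b.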
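Both points can be tried out together in a small sketch: the address cleanup and the gender lookup are pure functions, so they can be exercised without fetching any page. `clean_address`, `SEX_BY_CLASS` and `sex_from_classes` are hypothetical names, the sample address is made up, and the dictionary lookup is a variant of the script's if/elif chain rather than the original code.

```python
def clean_address(raw):
    """Strip newlines and spaces, as the script does with two replace() calls."""
    # str.replace(old, new) returns a new string; the original is left unchanged.
    return raw.replace('\n', '').replace(' ', '')


# Class names used by the script, mapped to the labels it returns.
SEX_BY_CLASS = {
    'member_girl_ico': '女',  # female
    'member_boy_ico': '男',   # male
}


def sex_from_classes(classes):
    """Map a tag's class list to a gender label; '没填写' means not specified."""
    for class_name in classes or []:
        if class_name in SEX_BY_CLASS:
            return SEX_BY_CLASS[class_name]
    return '没填写'


sample = '\n   Wuhan  Example Road  \n'  # made-up input for illustration
print(clean_address(sample))
print(sex_from_classes(['member_girl_ico']))
print(sex_from_classes(None))
```

Because each function takes plain values and returns a plain value, a wrong result can be traced to one small piece instead of the whole crawl.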