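Day 3 of learning web scraping: scraping rental listings from Xiaozhu (xiaozhu.com).
Because the site has been redesigned, the pages no longer show gender information, so I dropped that field for this exercise.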
http://bj.xiaozhu.com/search-duanzufang-p1-0/
The code is as follows:
#!/usr/bin/env python
# coding: utf-8
__author__ = 'lucky'

from bs4 import BeautifulSoup
import requests


# Extract the details from a single listing page
def get_info(url):
    wb_data = requests.get(url)
    soup = BeautifulSoup(wb_data.text, 'lxml')
    titles = soup.select('div.con_l > div.pho_info > h4 > em')
    addresses = soup.select('div.con_l > div.pho_info > p > span.pr5')
    rents = soup.select('#pricePart > div.day_l > span')
    imgs = soup.select('#curBigImage')
    host_imgs = soup.select('div.member_pic > a > img')
    host_names = soup.select('div.w_240 > h6 > a')
    for title, address, rent, img, host_img, host_name in zip(titles, addresses, rents, imgs, host_imgs, host_names):
        data = {
            "title": title.get_text(),
            "address": address.get_text().split('\n')[0],
            "rent": rent.get_text(),
            "img": img.get('src'),
            "host_img": host_img.get('src'),
            "host_name": host_name.get_text()
        }
        print(data)


# Collect the listing links from one search-results page
def get_links(one_url):
    wb_data = requests.get(one_url)
    soup = BeautifulSoup(wb_data.text, 'lxml')
    links = soup.select('#page_list > ul > li > a')
    for link in links:
        href = link.get("href")  # get the link to each listing
        get_info(href)  # open the link and extract the listing details


# Search-results pages 1 through 9 (range(1, 10) excludes 10)
url_links = ["http://bj.xiaozhu.com/search-duanzufang-p{}-0/".format(number) for number in range(1, 10)]
for url in url_links:
    get_links(url)
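One thing worth noting about the code above: get_info combines its six result lists with zip, and zip stops at the shortest input. If any single selector matches nothing (for example after another page redesign), the loop body never runs and nothing is printed, with no error. The sketch below shows that behaviour using made-up placeholder lists instead of real selector results, and also checks which page numbers range(1, 10) produces. The sample values and the use of zip_longest are my own illustration, not part of the original exercise.

```python
from itertools import zip_longest

# Hypothetical stand-ins for the lists returned by the selectors in get_info
titles = ["Sunny room", "Loft near subway"]
rents = ["298", "350"]
host_names = []  # pretend this selector matched nothing

# zip stops at the shortest input, so one empty list yields no rows at all
rows = list(zip(titles, rents, host_names))

# zip_longest pads the missing values instead, so the gap becomes visible
padded = list(zip_longest(titles, rents, host_names, fillvalue=None))

# range(1, 10) excludes its upper bound, so this builds pages 1 through 9
pages = ["http://bj.xiaozhu.com/search-duanzufang-p{}-0/".format(n) for n in range(1, 10)]

print(len(rows), padded, len(pages))
```

If the scraper suddenly prints nothing, checking the length of each selected list is a reasonable first debugging step.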
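Summary:
1. Deepened my understanding of making GET requests with requests.get.
2. Got more practice locating elements on a web page.
3. Reviewed how to wrap logic in functions and call them.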
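To make point 2 concrete, here is a small offline sketch of how the child-combinator selectors behave. The HTML snippet is made up for illustration and only imitates the class names and ids used in the selectors; it is not copied from the real site. It uses Python's built-in "html.parser" instead of lxml so that it runs without an extra dependency, which is an assumption of this sketch rather than what the script itself does.

```python
from bs4 import BeautifulSoup

# Made-up HTML for illustration only; not copied from the real site
html = """
<div class="con_l">
  <div class="pho_info">
    <h4><em>Sunny room near the subway</em></h4>
    <p><span class="pr5">Chaoyang District, Beijing
    </span></p>
  </div>
</div>
<div id="pricePart"><div class="day_l"><span>298</span></div></div>
"""

soup = BeautifulSoup(html, "html.parser")

# "A > B" matches B only when it is a direct child of A
title = soup.select("div.con_l > div.pho_info > h4 > em")[0].get_text()

# The address text contains a trailing newline, so keep only the first line
address = soup.select("div.con_l > div.pho_info > p > span.pr5")[0].get_text().split("\n")[0]

# "#pricePart" selects by id, ".day_l" by class
rent = soup.select("#pricePart > div.day_l > span")[0].get_text()

# <em> is a descendant of div.con_l but not a direct child, so nothing matches
no_match = soup.select("div.con_l > em")

print(title, address, rent, no_match)
```

select always returns a list, which is why an unmatched selector gives an empty list rather than raising an error.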