The goal of these notes is to practice Scrapy by crawling second-hand housing listings for Tianjin from Fang.com (房天下), and to take a first look at two major data-analysis libraries, pandas and matplotlib.
1. Analyzing the pages
Fang.com shows at most 100 pages of second-hand listings per search. This crawler is only for practice, so I crawl at the level of the whole city of Tianjin (mostly out of laziness). If you want more, and more precise, data you can just as easily crawl district by district: put every district's URL into the start_urls list, as sketched below.
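A minimal sketch of the per-district approach. The district paths below are placeholders made up for illustration; copy the real ones from the district filter links on the listing page:
start_urls = [
    'http://tj.esf.fang.com/house-a01/',  # placeholder district path
    'http://tj.esf.fang.com/house-a02/',  # placeholder district path
    # ...one entry per district
]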
Clicking into one of the entries on the listing page opens a detail page with the full information for that property. This crawl extracts the following fields: district, neighborhood, layout, area, floor, unit price, and total price (the listing page itself seems to carry almost all of these already, oh well).
2. Creating the crawler
Create the Scrapy project:
scrapy startproject fangtianxia
cd fangtianxia
scrapy genspider ershoufang fang.com
Open PyCharm, find the project folder, and write items.py:
import scrapy

class FangtianxiaItem(scrapy.Item):
    address = scrapy.Field()      # district
    location = scrapy.Field()     # neighborhood
    mode = scrapy.Field()         # layout
    area = scrapy.Field()         # floor area
    floor = scrapy.Field()        # floor description
    price = scrapy.Field()        # unit price
    total_price = scrapy.Field()  # total price
Configure settings.py:
ROBOTSTXT_OBEY = False
DOWNLOAD_DELAY = 1
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.92 Safari/537.36',
}
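Since the crawl later gets cut off by an anti-scraping captcha (see below), it may also help to pace requests more gently. These are standard Scrapy settings, offered here as an optional addition rather than part of the original setup:
# Optional throttling additions (not in the original settings):
RANDOMIZE_DOWNLOAD_DELAY = True  # jitter each delay between 0.5x and 1.5x of DOWNLOAD_DELAY
AUTOTHROTTLE_ENABLED = True      # let Scrapy adapt the delay to server latency
AUTOTHROTTLE_START_DELAY = 1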
Write a launcher file, run.py:
from scrapy import cmdline

# Equivalent to running `scrapy crawl ershoufang -o ershou.csv` in a terminal;
# the -o flag exports the scraped items to ershou.csv.
cmdline.execute('scrapy crawl ershoufang -o ershou.csv'.split())
Write the spider:
import scrapy
import re

from fangtianxia.items import FangtianxiaItem

class ErshoufangSpider(scrapy.Spider):
    name = 'ershoufang'
    allowed_domains = ['fang.com']
    start_urls = ['http://tj.esf.fang.com/']

    def parse(self, response):
        # Each listing on the index page sits in a <dl dataflag="bg"> block.
        lists = response.xpath("//dl[@dataflag='bg']")
        for listing in lists:
            href = listing.xpath(".//h4[@class='clearfix']/a/@href").get()
            url = response.urljoin(href)
            yield scrapy.Request(url, callback=self.parse_detail)
        # Follow the "next page" link until there isn't one.
        next_url = response.xpath("//div[@class='page_al']/p[1]/a/@href").get()
        if next_url:
            next_url = response.urljoin(next_url)
            yield scrapy.Request(next_url, callback=self.parse)

    def parse_detail(self, response):
        item = FangtianxiaItem()
        address = response.xpath("//div[@id='address']/a/text()").get().strip()
        location = response.xpath("//div[@class='rcont']/a/text()").get().strip()
        mode = response.xpath("//div[@class='trl-item1 w146']/div/text()").get().strip()
        area = response.xpath("//div[@class='trl-item1 w182']/div/text()").get()
        area = float(re.findall(r'\d+', area)[0])
        # Absolute XPath copied from the browser's dev tools; brittle, but it works here.
        floor = response.xpath("/html/body/div[5]/div[1]/div[4]/div[3]/div[2]/div[1]/text()").get()
        price = response.xpath("//div[@class='trl-item1 w132']/div/text()").get()
        price = float(re.findall(r'\d+', price)[0])
        # The total price is quoted in 万 (10,000 yuan), so convert to yuan.
        total_price = response.xpath("//div[@class='trl-item price_esf sty1']/i/text()").get()
        total_price = float(total_price) * 10000
        item['address'] = address
        item['location'] = location
        item['mode'] = mode
        item['area'] = area
        item['floor'] = floor
        item['price'] = price
        item['total_price'] = total_price
        yield item
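One caveat: every .get() above returns None when its XPath matches nothing, so .strip() or re.findall will raise on detail pages with missing fields. A small defensive helper could wrap the extraction; this is my addition, not part of the original spider:
def safe_get(response, xpath, default=''):
    # Return the stripped first match, or `default` when nothing matched.
    value = response.xpath(xpath).get()
    return value.strip() if value else default

# e.g. address = safe_get(response, "//div[@id='address']/a/text()")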
Run run.py and the crawler is off and running.
The run produced a file called ershou.csv. Since I didn't use proxy IPs, I only collected 194 listings before being redirected to a captcha page. As mentioned above, though, even without proxies you can get more data by crawling district by district, which should yield roughly 194 listings per district.
3. Simple visualization
import pandas as pd
import matplotlib.pyplot as plt
info = pd.read_csv(r'F:\scrapy\fangtianxia\fangtianxia\ershou.csv')
print(info.head())
Reading the scraped data with pandas produces a DataFrame object.
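Before plotting, a quick cleanup pass can be worthwhile, since pages served by the anti-scraping redirect can leave rows with missing fields. A minimal sketch, assuming the column names defined in items.py above:
# Drop rows with a missing district or price, and force price to be numeric.
info = info.dropna(subset=['address', 'price'])
info['price'] = pd.to_numeric(info['price'], errors='coerce')
info = info.dropna(subset=['price'])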
Then plot the average price per district with matplotlib:
info_mean = info.groupby('address')['price'].mean()    # average unit price per district
info_count = info.groupby('address')['price'].count()  # number of listings per district

plt.figure(figsize=(10, 6))
plt.rc('font', family='SimHei', size=13)  # SimHei so the Chinese labels render correctly
plt.title('天津各区平均房价')
plt.ylabel('平均房价')
plt.bar(info_mean.index, info_mean.values, color='r')
plt.show()  # note the parentheses -- plt.show without them does nothing
The resulting chart:
With this little data the result is not necessarily accurate, but it does broadly reflect how prices differ across Tianjin's districts.
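The info_count series computed above goes unused in the snippet; one way to put it to work is to annotate each bar with its sample size, which makes the small-data caveat visible in the chart itself. A sketch (run it before plt.show() so the labels land on the figure):
# Label each bar with the number of listings behind its average.
for x, (mean, count) in enumerate(zip(info_mean.values, info_count.values)):
    plt.text(x, mean, 'n={}'.format(count), ha='center', va='bottom')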