The first article pretty much laid out the whole framework; this one focuses on how to obtain the data source, which should be the simplest step of getting data with a crawler. If you need the attachments or install packages, DM me and I'll send them whenever I'm at a computer. The code at the end can be copied, pasted, and run as-is.
Data Scraping Steps
a. This scrape targets Anjuke's housing-price listings in Tianjin. The tools and techniques used are:
Python libraries: PyQuery, requests, csv
The results are saved in CSV format (Excel or similar also works), and the pages are opened in the Chrome browser.
First, find the page's request header information, as shown in the figure; the code is as follows:
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.61 Safari/537.36'}
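Before writing the full fetch function, a quick sanity check confirms the site answers normally with these headers (this snippet is just a check I'd suggest, using the first-page URL from step d):

import requests

resp = requests.get("https://tj.fang.anjuke.com/loupan/all/p1/", headers=headers, timeout=10)
print(resp.status_code)  # expect 200 if the User-Agent passes the site's checks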
b. First, fetch the first page:

response = requests.get(url, headers=headers)
if response.status_code == 200:  # check the HTTP status of the request
    return response.content.decode("utf-8")
else:
    return None

c. Every listing Anjuke publishes has the same HTML structure, so parsing one listing on the first page reveals the layout of all of them. In the developer tools, the Elements tab shows the page markup (see figure); expand it along the path to reach the house details you want to extract (see figure). Since a listing's price and its other details live under the same parent node, they can all be handled in one loop. Note that Anjuke's area block sometimes includes the room layout (e.g. 3室2厅) and sometimes does not, so it must be checked; when the loop reaches the area field, the layout text spans more than one line, so the carriage returns and line feeds have to be replaced:

if area:
    area = area.strip().replace("\r\n", "")

The price field also needs attention: some listings show the house's own price, others only the average price of the surrounding area, so different values have to be assigned. The same check applies here; when the scan hits the surrounding average price, replace the line breaks as well:

if aroundprice:
    aroundprice = aroundprice.strip().replace("\r\n", "")
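As a quick illustration of what that cleanup does, here is a toy example (the sample string is made up, not actual scraped output):

raw = "  87-143平米\r\n3室2厅  "  # hypothetical scraped area text
print(raw.strip().replace("\r\n", ""))  # -> 87-143平米3室2厅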

Putting it together, the parsing loop is:

doc = py(html)
items = doc(".key-list.imglazyload .item-mod").items()
for item in items:
    con = py(item)
    name = con(".infos .lp-name .items-name").text()
    address = con(".infos .address .list-map").text()
    area = con(".infos .huxing span").text()
    if area:
        area = area.strip().replace("\r\n", "")
    price = con("a.favor-pos > p.price").text()
    aroundprice = con(".favor-pos .around-price").text()
    if aroundprice:
        aroundprice = aroundprice.strip().replace("\r\n", "")
    yield {
        "name": name,
        "address": address.replace("\xa0", ""),
        "area": area,
        "price": price,
        "aroundprice": aroundprice,
    }

d. Analyzing how the page URLs are composed (see figure), you can see the pattern is https://tj.fang.anjuke.com/loupan/all/p plus the page number, so the first page is https://tj.fang.anjuke.com/loupan/all/p1/. Since we want to scrape every house listing in Tianjin, we need a way to turn the page:
def next_page(html):
    doc = py(html)
    next_url = doc(".pagination .next-page").attr.href
    return next_url
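Side note: because the URL pattern is so regular, you could also generate the page URLs directly instead of following the next-page link. A minimal sketch (page_urls and the 50-page cap are my own illustration, not part of the original code):

def page_urls(max_pages=50):
    # max_pages is an assumed upper bound; read the real page count from the site
    for n in range(1, max_pages + 1):
        yield f"https://tj.fang.anjuke.com/loupan/all/p{n}/"

The code below sticks with the next-page link instead, which adapts automatically when the page count changes.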
e. Finally, write the data to the file. The run results are shown in the figure: 1,629 rows in total. The source code is in Attachment 1.
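Since parse_first_page already yields dicts, csv.DictWriter is a natural alternative for the writing step. A sketch under that assumption (write_rows is my own helper name; the path, mode, and encoding mirror the full code below):

import csv

FIELDS = ["name", "address", "area", "price", "aroundprice"]

def write_rows(path, rows):
    # append rows; write the header only when the file is still empty
    with open(path, "a+", encoding="utf-8-sig", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if f.tell() == 0:
            writer.writeheader()
        writer.writerows(rows)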

Finally, the complete source code:
import requests
import csv
from pyquery import PyQuery as py
headers = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.61 Safari/537.36'}
def first_page(url):
    response = requests.get(url, headers=headers)
    if response.status_code == 200:  # check the HTTP status of the request
        return response.content.decode("utf-8")
    else:
        return None
def parse_first_page(html):
    doc = py(html)
    items = doc(".key-list.imglazyload .item-mod").items()
    for item in items:
        con = py(item)
        name = con(".infos .lp-name .items-name").text()
        address = con(".infos .address .list-map").text()
        area = con(".infos .huxing span").text()
        if area:
            area = area.strip().replace("\r\n", "")
        price = con("a.favor-pos > p.price").text()
        aroundprice = con(".favor-pos .around-price").text()
        if aroundprice:
            aroundprice = aroundprice.strip().replace("\r\n", "")
        yield {
            "name": name,
            "address": address.replace("\xa0", ""),
            "area": area,
            "price": price,
            "aroundprice": aroundprice,
        }
# get the URL of the next page
def next_page(html):
    doc = py(html)
    next_url = doc(".pagination .next-page").attr.href
    return next_url
# write the CSV header row
def write_title_file():
    with open("C:\\Users\\admin\\Desktop\\天津房屋数据.csv", "a+", encoding="utf-8-sig", newline="") as f:
        wea_for = csv.writer(f, delimiter=",")
        wea_for.writerow(["房屋名称", "地址", "户型面积", "价格", "周边均价"])
# write one CSV data row
def write_content_file(content):
    with open("C:\\Users\\admin\\Desktop\\天津房屋数据.csv", "a+", encoding="utf-8-sig", newline="") as f:
        wea_for = csv.writer(f, delimiter=",")
        wea_for.writerow([content["name"], content["address"], content["area"], content["price"], content["aroundprice"]])
def main():
    firsturl = "https://tj.fang.anjuke.com/loupan/all/p1/"
    html = first_page(firsturl)
    for content in parse_first_page(html):
        write_content_file(content)
        print(content)
    next_url = next_page(html)  # parse the next-page link once per page instead of twice
    while next_url:
        html = first_page(next_url)
        if html is None:  # stop if a request failed
            break
        for content in parse_first_page(html):
            write_content_file(content)
            print(content)
        next_url = next_page(html)

if __name__ == '__main__':
    write_title_file()
    main()
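One optional hardening step, not part of the original code: pausing between requests and setting a timeout keeps the scraper polite and less likely to get blocked. A minimal variant of first_page (the 1-second delay is an arbitrary choice):

import time

def first_page(url, delay=1.0):
    time.sleep(delay)  # pause between requests; 1 second is arbitrary
    try:
        response = requests.get(url, headers=headers, timeout=10)
    except requests.RequestException:
        return None  # treat network errors like a bad status code
    if response.status_code == 200:
        return response.content.decode("utf-8")
    return None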
One last reminder: if you need the attachments or install packages, just DM me and I'll send them whenever I'm at a computer. And if you could leave a follow or a bookmark, even better, haha.