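What is asynchronous data
Data that a page keeps loading through JavaScript without the user explicitly requesting a new page, for example more items appearing as you scroll down.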
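Scraping method
- Watch the XHR requests in the browser's Network panel to identify the request URL the page uses to load data automatically
- Generate the URLs to scrape automatically, following the pattern in that URL
- Scrape the information page by page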
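The last two steps can be sketched as follows. This is a minimal illustration, not the definitive approach: `build_page_url`, `crawl_pages`, and `fake_fetch` are hypothetical names introduced here, the `page` query parameter is taken from the example URL used in this section, and stopping at the first empty page is an assumption about how the site signals the end of its data. The fetch function is passed in as a callback so the paging logic can be tried offline.

```python
import time
from urllib.parse import urlencode


def build_page_url(base_url, page, param="page"):
    """Step 2: build the URL for one page number."""
    return f"{base_url}?{urlencode({param: page})}"


def crawl_pages(base_url, start, end, fetch_items, delay=0.0):
    """Step 3: visit pages start..end-1 in order, stopping at the first empty page.

    fetch_items is a hypothetical callback: it takes a URL and returns
    the list of items parsed from that page.
    """
    results = []
    for page in range(start, end):
        items = fetch_items(build_page_url(base_url, page))
        if not items:  # assumption: an empty page means nothing more to load
            break
        results.extend(items)
        time.sleep(delay)  # pause between requests
    return results


# Offline stand-in for a real fetch-and-parse function (made-up data).
def fake_fetch(url):
    page = int(url.rsplit("=", 1)[1])
    return [f"item-{page}-{i}" for i in range(2)] if page <= 3 else []


print(build_page_url("https://knewone.com/discover", 2))
print(crawl_pages("https://knewone.com/discover", 1, 10, fake_fetch))
```

In real use, `fetch_items` would be a function that downloads the URL and parses the items out of the response, and `delay` would be set to a second or two so the site is not hit too quickly.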
Example code
from bs4 import BeautifulSoup
import requests
import time

# Base URL of the paginated request; the page number is appended to it.
url = 'https://knewone.com/discover?page='
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36'
}

def get_page(url, data=None):
    # Fetch one page and print the image, title, and link of every item on it.
    web_data = requests.get(url, headers=headers)
    soup = BeautifulSoup(web_data.text, 'lxml')  # the 'lxml' parser requires the lxml package
    imgs = soup.select('article > header > a > img')
    titles = soup.select('article > section > h4 > a')
    links = soup.select('article > section > h4 > a')
    if data is None:
        for img, title, link in zip(imgs, titles, links):
            # Use a separate name so the data parameter is not overwritten.
            item = {
                'img': img.get('src'),
                'title': title.get('title'),
                'link': link.get('href')
            }
            print(item)

def get_more_pages(start, end):
    # Crawl pages start..end-1, pausing two seconds between requests.
    for one in range(start, end):
        get_page(url + str(one))
        time.sleep(2)

get_more_pages(1, 5)  # range(1, 5) covers pages 1 to 4
Output
Homework
Scrape the detailed information of every item on one page of the second-hand goods listings on 58同城 (58.com).