The key point is the use of the repeating (loop) node; instead of copying the HTML, the XPath is copied.
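When a field like this clearly contains a separator, it can be split on that separator.
Why [0] is added to the whole expression in the program (the part circled in red):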
name=info.xpath('div[2]/p[2]/span/text()')[0] # why is [0] needed? xpath() returns a list; [0] takes its first element
name1=name.split('-')[0]
name2 = name.split('-')[1]
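2. With [0] appended, the visible difference is whether the result printed below is wrapped in [] brackets.
As for the greyed-out part, the reason still needs to be worked out.
- Saving the data requires defining a function again, as in the earlier regex version
Before defining a function for saving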
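The effect of [0] can be checked on a small inline snippet. The following is a minimal sketch, not the real page: the markup and the sample text 'Acme-Internet' are made up for illustration. It shows that xpath() hands back a list, that [0] turns it into a single string, and that only the string can be split on '-'.

```python
from lxml import etree

# Hypothetical markup standing in for one search-result item (not the real page).
snippet = '<div><p><span>Acme-Internet</span></p></div>'
node = etree.HTML(snippet)

# A text() expression returns a list of strings, even when there is one match.
matches = node.xpath('//p/span/text()')
print(matches)      # printed with [] brackets, because it is a list
print(matches[0])   # printed without brackets, because it is a single string

# Only the string has string methods; the list itself has no split()/partition().
name = matches[0]
name1, _, name2 = name.partition('-')  # like split('-'), but safe when '-' is missing
print(name1, name2)
```

Here partition('-') is swapped in for split('-')[0] and split('-')[1]; it gives the same two parts when the separator is present and an empty second part when it is absent, instead of raising IndexError.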
import requests
from lxml import etree

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36'
}
url = 'https://xiaoyuan.zhaopin.com/full/industry/0/0_0_0_0_-1_0_1_0'
res = requests.get(url, headers=headers)
html = etree.HTML(res.text)
# Each <li> under the result list is one repeating item (the loop node)
infos = html.xpath('//ul[@class="searchResultListUl"]/li')
for info in infos:
    # rank_1=info.xpath('span[3]')[0]
    # rank=rank_1.xpath('string(.)').strip()
    # xpath() returns a list; [0] takes the first string so that split() can be used
    name = info.xpath('div[2]/p[2]/span/text()')[0]
    name1 = name.split('-')[0]
    name2 = name.split('-')[1]
    job = info.xpath('div[2]/p[1]/a/text()')[0]
    place = info.xpath('div[2]/p[3]/span[1]/span/em/text()')[0]
    # No [0] here, so job_type stays a list and is printed with [] brackets
    job_type = info.xpath('div[2]/p[4]/span[4]/span/em/text()')
    print(name1, name2, job, place, job_type)
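Indexing with [0] raises IndexError whenever an expression matches nothing for a given item. One possible guard, sketched here under the assumption that an empty string is an acceptable placeholder, is a small helper; the name first_text is hypothetical and not part of lxml.

```python
def first_text(results, default=''):
    """Return the first XPath result with whitespace stripped, or `default` if the list is empty."""
    return results[0].strip() if results else default


# Lists like these are what info.xpath('.../text()') would return.
print(first_text(['  Beijing ']))
print(first_text([], default='N/A'))
```

In the loop it would be used as place = first_text(info.xpath('div[2]/p[3]/span[1]/span/em/text()')), so one incomplete item does not stop the whole run.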
After defining a function for saving
Regex version, after defining the function
XPath version, after defining the function
- Complete code after the improvements
import requests
from lxml import etree
import csv

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36'
}
url = 'https://xiaoyuan.zhaopin.com/full/industry/0/0_0_0_0_-1_0_1_0'


def get_info(url, writer):
    """Scrape one result page and write one CSV row per item."""
    res = requests.get(url, headers=headers)
    html = etree.HTML(res.text)
    infos = html.xpath('//ul[@class="searchResultListUl"]/li')
    for info in infos:
        # rank_1=info.xpath('span[3]')[0]
        # rank=rank_1.xpath('string(.)').strip()
        # xpath() returns a list; [0] takes the first string so that split() can be used
        name = info.xpath('div[2]/p[2]/span/text()')[0]
        name1 = name.split('-')[0]
        name2 = name.split('-')[1]
        job = info.xpath('div[2]/p[1]/a/text()')[0]
        place = info.xpath('div[2]/p[3]/span[1]/span/em/text()')[0]
        job_type = info.xpath('div[2]/p[4]/span[4]/span/em/text()')[0]
        print(name1, name2, job, place, job_type)
        # The original only printed the fields; write them so the CSV is not left with just the header
        writer.writerow([name1, name2, job, place, job_type])


if __name__ == '__main__':
    # The with block closes the file even if a request fails
    with open('C:/Users/秦振凯/Desktop/text2.csv', 'w', encoding='utf-8', newline='') as fp:
        writer = csv.writer(fp)
        writer.writerow(['name1', 'name2', 'job', 'place', 'job_type'])
        urls = ['https://xiaoyuan.zhaopin.com/full/industry/0/0_0_0_0_-1_0_{}_0'.format(str(i)) for i in range(0, 5)]
        for url in urls:
            get_info(url, writer)
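The saving step can also be tried on its own, without any network access. The sketch below is one possible arrangement, not the method used in these notes: save_rows is a hypothetical helper, the rows are made-up sample data, and the file goes to the system temp directory instead of the Desktop path above.

```python
import csv
import os
import tempfile


def save_rows(path, rows):
    """Write the header plus the given rows to a CSV file at `path`."""
    # newline='' stops the csv module from producing blank lines on Windows.
    with open(path, 'w', encoding='utf-8', newline='') as fp:
        writer = csv.writer(fp)
        writer.writerow(['name1', 'name2', 'job', 'place', 'job_type'])
        writer.writerows(rows)


# Hypothetical rows standing in for scraped results.
sample_rows = [
    ['Acme', 'Internet', 'Engineer', 'Beijing', 'Full-time'],
    ['Globex', 'Finance', 'Analyst', 'Shanghai', 'Internship'],
]
out_path = os.path.join(tempfile.gettempdir(), 'text2_demo.csv')
save_rows(out_path, sample_rows)

# Read the file back to confirm that the header and both rows were written.
with open(out_path, encoding='utf-8', newline='') as fp:
    saved = list(csv.reader(fp))
print(len(saved))
```

Collecting rows first and writing them in one place keeps the scraping function free of file handling, at the cost of holding all rows in memory until the end.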