S01E07. Smart pagination and batch file downloads [Jingkelong (京客隆) Supermarket]

Problems encountered so far

1. The site's URL scheme has changed. New links now take the following form:

http://www.jkl.com.cn/newsList.aspx?current=2&TypeId=10009

whereas the link the scraper currently obtains may look like this:

http://www.jkl.com.cn/newsList.aspx?TypeId=10009

The "current=2" parameter in the middle still has to be inserted somehow.
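2. The program currently fails with the error below. The cause was unknown at the time of writing, but the traceback gives a hint: the host is spelled 'wwww.jkl.com.cn' (four w's), so the DNS lookup (getaddrinfo) fails. The prefix used to build the links should be 'www.jkl.com.cn'.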

ConnectionError: HTTPConnectionPool(host='wwww.jkl.com.cn', port=80): Max retries exceeded with url: /newsList.aspx?TypeId=10009 (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x000001C43C03F760>: Failed to establish a new connection: [Errno 11001] getaddrinfo failed'))
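Both problems above come down to how the URLs are assembled. The following is a minimal sketch, not the final implementation: it assumes the hrefs scraped from the page are relative links such as "newsList.aspx?TypeId=10009", and the helper name build_page_url is hypothetical. It resolves the href against the entry URL with urljoin (so the host name is never typed a second time) and then sets the "current" parameter.

```python
from urllib.parse import urljoin, urlsplit, urlunsplit, parse_qsl, urlencode


def build_page_url(entry_url, href, page):
    """Resolve href against entry_url and set the `current` page parameter.

    Hypothetical helper for illustration; not part of the original script.
    """
    # urljoin reuses the scheme and host of entry_url, which avoids
    # retyping (and mistyping) the host name.
    absolute = urljoin(entry_url, href)
    parts = urlsplit(absolute)
    # Drop any existing `current` value, then put the new one first.
    query = [(k, v) for k, v in parse_qsl(parts.query) if k != "current"]
    query.insert(0, ("current", str(page)))
    return urlunsplit(parts._replace(query=urlencode(query)))


entry = "http://www.jkl.com.cn/newsList.aspx?TypeId=10009"
# Assumed href shape; the real page may return a different form.
print(build_page_url(entry, "newsList.aspx?TypeId=10009", 2))
# -> http://www.jkl.com.cn/newsList.aspx?current=2&TypeId=10009
```

Working on the parsed query instead of concatenating strings also means a link that already carries a "current" value is overwritten, not duplicated.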
Code completed so far
import requests
from lxml import etree
import re
import os

web_adress='http://www.jkl.com.cn/newsList.aspx?TypeId=10009'
My_agent={'user-agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36'}

# Fetch the entry page, then collect each category's name and relative link
xiangying_date=requests.get(url=web_adress,headers=My_agent).text
jiexi_date=etree.HTML(xiangying_date)
project_name=jiexi_date.xpath('//div[@class="infoLis"]//a/text()')
project_adress=jiexi_date.xpath('//div[@class="infoLis"]//@href')
#print(project_name)
# Turn the relative links into absolute ones (the host is www, not wwww)
project_adress=['http://www.jkl.com.cn/'+href for href in project_adress]
# Map each category name to its absolute URL
jiangzhidui=dict(zip(project_name,project_adress))
for project_name,project_adress in jiangzhidui.items():
    #print(project_name)
    # Make the category name usable as a folder name
    project_name=project_name.replace('/','.')
    project_name=project_name.replace('...','报表')
    #print(project_name)
    # Create one folder per category under D:/
    lujing='D:/'+project_name
    if not os.path.exists(lujing):
        os.mkdir(lujing)
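    # Fetch the category page and read the page count from the "尾页" (last page) link
    xiangying_date1=requests.get(url=project_adress,headers=My_agent).text
    jiexi_date=etree.HTML(xiangying_date1)
    weiye=jiexi_date.xpath('//a[text()="尾页"]//@href')
    if weiye!=[]:
        # Raw string, so \d and \) are passed to the regex engine unchanged
        zhengze=re.search(r"(\d+)'\)",weiye[0])
        page_number=zhengze.group(1)
    else:
        page_number=1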
    for page_number in range(1,int(page_number)+1):
        print(project_adress)
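        # Work in progress, kept as comments so the script still runs.
        # page_number is an int here, so it needs str() before concatenation.
        # new_project_adress='http://www.jkl.com.cn/newsList.aspx?current='+str(page_number)+'&TypeId=10009'
        # xiangying_date1=requests.get(url=new_project_adress,headers=My_agent).text
        # jiexi_date=etree.HTML(xiangying_date1)
        # weiye=jiexi_date.xpath('//a[text()="尾页"]//@href')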
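
The pagination step that the draft leaves unfinished could be sketched roughly as follows. This is only an illustration under stated assumptions: the function names last_page_number and page_urls are hypothetical, the sample href is invented purely to match the regex used in the draft, and the URL layout is the "current=...&TypeId=..." form described earlier. It makes no network requests.

```python
import re


def last_page_number(weiye_href):
    """Return the page count encoded in the last-page href, or 1 if absent."""
    if not weiye_href:
        return 1
    # Same pattern as the draft: digits followed by a quote and a closing paren.
    match = re.search(r"(\d+)'\)", weiye_href)
    return int(match.group(1)) if match else 1


def page_urls(type_id, total_pages):
    """List one URL per page, using the current=<n>&TypeId=<id> layout."""
    template = "http://www.jkl.com.cn/newsList.aspx?current={page}&TypeId={type_id}"
    return [template.format(page=n, type_id=type_id)
            for n in range(1, total_pages + 1)]


# Hypothetical href value, shaped only to match the regex above.
sample_href = "javascript:__doPostBack('AspNetPager1','3')"
total = last_page_number(sample_href)
for url in page_urls(10009, total):
    print(url)
```

Using a separate name for the total (total) and for the loop counter avoids reusing page_number for both, which the draft does and which makes the str/int mix-up easier to miss.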

(To be continued...)
