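Crawler Series (1): Scraping an IP Proxy Pool

Scraping IPs for the proxy pool

Websites that offer free proxies:

Take the two proxy sites above as examples. Sites like these generally present their proxy data as a table, as shown in the figure below.

Table-style data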

Overall code

Crawl
Parse
Validate
Store

Crawling

Wrap the file read and write operations
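
The fetching code later in this section calls crower_table.write_file_content and crower_table.get_file_content, but the post does not show their bodies. The following is a minimal sketch of what such helpers might look like. The function names are taken from those calls; the encoding parameter and the automatic folder creation are assumptions, not the author's implementation.

```python
import os


def write_file_content(path, content, encoding="utf-8"):
    """Write text to path, creating the parent folder if it is missing."""
    folder = os.path.dirname(path)
    if folder:
        os.makedirs(folder, exist_ok=True)
    with open(path, "w", encoding=encoding) as f:
        f.write(content)


def get_file_content(path, encoding="utf-8"):
    """Return the text stored at path."""
    with open(path, "r", encoding=encoding) as f:
        return f.read()
```

Passing an explicit encoding keeps pages containing Chinese text readable regardless of the platform default.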

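Fetch pages with requests, using a cache

The cache stores each fetched page in the temp folder under a name derived by a fixed rule, and the cached copies are checked periodically.
The built-in hash cannot be used directly to tell the files apart^[1]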

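    # File cache: download the page only if no cached copy exists at `path`.
    # (This fragment is a function body; it needs `import os` and `import requests`.)
    if not os.path.isfile(path):
        print('writing file')
        headers = {'user-agent': 'Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 6.3; Win64; x64)'}
        content = requests.get(url=url, headers=headers).text
        crower_table.write_file_content(path, content)
        return content
    else:
        return crower_table.get_file_content(path)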

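Parsing

Only two kinds of table are considered here: one where the table tag contains nothing but tr tags, and the standard format (which the code below detects by its thead tag).

Take the tr tags and treat the first one as the header row
Strip special characters from the text
Output a standard JSON string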

def get_table_content(table):
    """
    Handle tables of the form
    <tr></tr>
    <tr></tr>
    ...
    The first row defines the column names, stored in titles; the remaining rows hold the data.
    """
    if table.find('thead') is None:
        trs = table.find_all('tr')
        first_tr = trs[0]
        others_tr = trs[1:]
        titles = [th.get_text() for th in first_tr.find_all('th')]
        results = []
        for one_other in others_tr:
            temp = [t.get_text().replace('\n', '') for t in one_other.find_all('td')]
            # zip pairs every title with its cell, so the last column is kept
            # and a short row no longer raises IndexError.
            per = dict(zip(titles, temp))
            results.append(per)
        return results

Correctly normalizing every kind of table-style data is a fairly complex job. The code here handles only one of those cases.
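
The steps listed earlier end with cleaning the text and emitting a JSON string, which get_table_content does not do itself. A minimal sketch of that last step, assuming the rows are the list of dicts it returns, might look like this. clean_text and rows_to_json are hypothetical helper names.

```python
import json
import re


def clean_text(text):
    """Collapse newlines, tabs and non-breaking spaces into single blanks."""
    # In str patterns, \s matches all Unicode whitespace.
    return re.sub(r"\s+", " ", text).strip()


def rows_to_json(rows):
    """Serialize parsed table rows (a list of dicts) as a JSON string."""
    cleaned = [
        {clean_text(key): clean_text(value) for key, value in row.items()}
        for row in rows
    ]
    # ensure_ascii=False keeps Chinese column names readable in the output.
    return json.dumps(cleaned, ensure_ascii=False)
```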

Validation

Access baidu through the proxy and check whether the returned status code is 200
Write the result to the output

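def validate_ip(ip, protocol):
    test_url = 'http://baidu.com'
    headers = {'user-agent': 'Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 6.3; Win64; x64)'}
    try:
        # requests picks the proxy by the scheme of the target URL,
        # so an 'https' entry is not used for this http test URL.
        proxy_host = {protocol: protocol + "://" + ip}
        response = requests.get(test_url, proxies=proxy_host, timeout=3, headers=headers)
        if response.status_code == 200:
            print('success', proxy_host)
            return True
        else:
            print('Failed', proxy_host)
            return False
    except Exception:
        # A proxy that times out or refuses the connection is invalid.
        # Returning the string 'error' here would be truthy for callers.
        return False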
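
With a 3-second timeout per proxy, checking a long list one by one is slow. One common option is to run the checks in a thread pool. The sketch below assumes a checker with the same (ip, protocol) signature as validate_ip; filter_working and the max_workers default are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def filter_working(candidates, check, max_workers=20):
    """Return the (ip, protocol) pairs for which check(ip, protocol) is True."""
    candidates = list(candidates)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so results line up with candidates.
        results = list(pool.map(lambda pair: check(*pair), candidates))
    # Compare with `is True` so only an explicit True counts as success.
    return [pair for pair, ok in zip(candidates, results) if ok is True]
```

It could be called as filter_working(pairs, validate_ip), where pairs is a list such as [('1.2.3.4:8080', 'http')].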

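Storage

For talking to MySQL directly from Python 3, pymysql is recommended. An ORM tool would of course work as well.

Install python3-devel and mysql-devel
Install pymysql
Install cryptography

Basic operations for inserting into the database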
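
The post does not include the insert code, so the following is only a sketch under stated assumptions: the proxy table, its ip, port and protocol columns, and every connection setting in connect_mysql are hypothetical placeholders. insert_proxies works with any DB-API connection; pymysql uses %s as its parameter placeholder.

```python
def insert_proxies(conn, rows, placeholder="%s"):
    """Insert (ip, port, protocol) tuples using a parameterized query."""
    sql = "INSERT INTO proxy (ip, port, protocol) VALUES ({0}, {0}, {0})".format(placeholder)
    cursor = conn.cursor()
    try:
        cursor.executemany(sql, rows)
        conn.commit()
    finally:
        cursor.close()
    return len(rows)


def connect_mysql():
    """Open a pymysql connection; all credentials here are placeholders."""
    import pymysql  # imported lazily so insert_proxies works without it

    return pymysql.connect(host="localhost", user="crawler", password="secret",
                           database="proxy_pool", charset="utf8mb4")
```

Passing the values as parameters instead of formatting them into the SQL string avoids quoting problems with scraped data.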

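Further considerations

How to keep the proxies up to date
Using several proxy IPs at once to download a large file from different offsets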
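
For the second point, one possible starting place is to split the file into byte ranges and request each range with an HTTP Range header, each through a different proxy. This only works when the server supports range requests and answers with 206 Partial Content. The sketch below covers only the range arithmetic; split_ranges and range_header are hypothetical helper names.

```python
def split_ranges(total_size, parts):
    """Split [0, total_size) into contiguous inclusive (start, end) byte ranges."""
    if total_size <= 0 or parts <= 0:
        raise ValueError("total_size and parts must be positive")
    parts = min(parts, total_size)  # never create empty ranges
    base, extra = divmod(total_size, parts)
    ranges, start = [], 0
    for i in range(parts):
        # The first `extra` chunks take one additional byte each.
        length = base + (1 if i < extra else 0)
        ranges.append((start, start + length - 1))
        start += length
    return ranges


def range_header(start, end):
    """Build the HTTP Range header for one chunk."""
    return {"Range": "bytes={}-{}".format(start, end)}
```

Each header could then be passed to requests.get together with a different proxies dict, and the chunks written back at their offsets.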

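Materials

Scraping proxies with Python and validating them

[1] The original idea was to name each cached temp file hash(url), but Python 3 adds a random seed to the hash, so the value is only guaranteed to stay the same within a single run.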
