Week 2 / Week 2 Hands-On Assignment: Scraping 100,000 Product Listings

1. Introduction


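Item	Description
URL	http://sh.ganji.com/wu/
Requirement 1	Open the page and scroll to the bottom
Requirement 2	Scrape the product information for every category in the Ganji (赶集网) Shanghai second-hand market
Requirement 3	On each list page you land on, scrape all posts under the `个人` (individual seller) tab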


2. Analysis

Finding the categories:

In the browser's Inspect Element panel, searching for .fenlei > dt > a returns all of the category links.
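The hrefs matched by .fenlei > dt > a are site-relative paths, so each one has to be joined with the site root before it can be requested. Here is a minimal sketch using only the standard library. It assumes hrefs shaped like /jiadian/; the sample values and the helper name build_category_urls are illustrative, not taken from the live page.

```python
from urllib.parse import urljoin


def build_category_urls(index_url, hrefs):
    """Join site-relative category hrefs with the index page URL."""
    urls = []
    for href in hrefs:
        # Normalise to exactly one leading and one trailing slash
        path = '/{}/'.format(href.strip('/'))
        urls.append(urljoin(index_url, path))
    return urls


# Hypothetical hrefs, shaped like those the selector would return
sample_hrefs = ['/jiadian/', '/shouji/', 'bangong']
category_urls = build_category_urls('http://sh.ganji.com/wu/', sample_hrefs)
print(category_urls)
```

Because the path starts with a slash, urljoin replaces the /wu/ part of the index URL instead of appending to it.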


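Category list pages:

Individual-seller list page within a category: http://sh.ganji.com/ershoubijibendiannao/o2/

The page number is the number at the end of the path.

Merchant list page within a category: http://sh.ganji.com/ershoubijibendiannao/a2o2/

The path gains the extra characters a2; the page number is still the number at the end.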
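The two URL patterns above differ only in the optional a2 marker and the page number, so one small helper can produce both. This is a sketch; the function name build_list_url and its merchant flag are assumptions made for illustration.

```python
def build_list_url(category_url, page, merchant=False):
    """Build a list-page URL for a category.

    category_url must end with '/', for example
    'http://sh.ganji.com/ershoubijibendiannao/'.
    """
    # Merchant pages carry an extra 'a2' before the page marker
    seller_prefix = 'a2' if merchant else ''
    return '{}{}o{}/'.format(category_url, seller_prefix, page)


base = 'http://sh.ganji.com/ershoubijibendiannao/'
individual_page_2 = build_list_url(base, 2)
merchant_page_2 = build_list_url(base, 2, merchant=True)
print(individual_page_2)
print(merchant_page_2)
```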

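Finding links on a list page:

.ft-tit matches every listing link. Next, filter out promoted listings (their links contain the keyword click) and zhuanzhuan (转转) listings.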
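The filtering rule is a plain substring test on each href. A minimal sketch follows; the helper name is_regular_listing is an assumption, and the second and third sample links are invented purely to exercise the filter.

```python
def is_regular_listing(link):
    """Return True for ordinary listings, False for promoted or zhuanzhuan ones."""
    return 'click' not in link and 'zhuanzhuan' not in link


# Hypothetical hrefs; only the first is shaped like a real listing URL
sample_links = [
    'http://sh.ganji.com/shouji/1525463821x.htm',
    'http://example.com/click?target=123',
    'http://zhuanzhuan.example.com/detail/456',
]
kept = [link for link in sample_links if is_regular_listing(link)]
print(kept)
```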


End of the list:

The final list page has only 5 entries.
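A short page is therefore a usable stop signal for pagination. The sketch below mirrors the thresholds used by the module in section 3.1 (at most 199 pages, stop below 10 entries), but replaces the real HTTP request with a stand-in function. The name collect_pages and the per-page counts are assumptions.

```python
def collect_pages(count_items_on_page, max_pages=200, min_items=10):
    """Visit pages 1..max_pages-1 and stop at the first short page.

    count_items_on_page(page) returns how many entries that page holds.
    Returns the list of page numbers that were visited.
    """
    visited = []
    for page in range(1, max_pages):
        count = count_items_on_page(page)
        visited.append(page)
        # Fewer than min_items entries means the list has run out
        if count < min_items:
            break
    return visited


# Stand-in for a real request: three full pages, then a 5-entry tail page
fake_counts = {1: 50, 2: 50, 3: 50, 4: 5}
visited_pages = collect_pages(lambda page: fake_counts.get(page, 0))
print(visited_pages)
```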


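Sold-out items return a page with HTTP status code 404.
Detail pages are scraped with multiple processes, and a run can resume from where the previous one stopped.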
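Resuming comes down to remembering which URLs are finished and skipping them on the next run. The sketch below uses plain lists in place of the MongoDB collections and a set for the membership test. The helper name pending_urls is an assumption, and the sample URLs are copied from the program output shown later in this post.

```python
def pending_urls(all_urls, finished_urls):
    """Return the URLs still to be scraped, keeping order and dropping duplicates."""
    finished = set(finished_urls)
    seen = set()
    pending = []
    for url in all_urls:
        if url not in finished and url not in seen:
            seen.add(url)
            pending.append(url)
    return pending


all_urls = [
    'http://sh.ganji.com/jiadian/2181455484x.htm',
    'http://sh.ganji.com/jiadian/2181302594x.htm',
    'http://sh.ganji.com/jiadian/2181455484x.htm',  # duplicate link
    'http://sh.ganji.com/jiaju/2182971064x.htm',
]
finished_urls = ['http://sh.ganji.com/jiadian/2181302594x.htm']
todo = pending_urls(all_urls, finished_urls)
print(todo)
```

Computing the pending list once, before the work is handed to the process pool, avoids a linear list lookup for every URL inside each worker.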

3. Implementation

3.1 Base module

# vim spider_ganji.py  // base module

#!/usr/bin/env python3
# -*- coding: utf-8 -*-

__author__ = 'jhw'


from bs4 import BeautifulSoup
from pymongo import MongoClient
import requests


headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.106 Safari/537.36',
    'Accept-Encoding': 'gzip, deflate',
    'Accept-Language': 'zh-CN,zh',
    'Connection': 'keep-alive',
}

# Region to scrape, given as its pinyin abbreviation
locale = 'sh'

client = MongoClient('10.66.17.17', 27017)
database = client['ganji']
# Each region gets its own collections in MongoDB
url_info = database['{}_ershou_url'.format(locale)]
# URLs that have already been scraped are stored in this collection
url_info_exists = database['{}_ershou_url_exists'.format(locale)]
# Collection that stores the product information
item_info = database['{}_ershou_item'.format(locale)]
# List of every URL that is to be scraped
url_info_list = [item['url'] for item in url_info.find()]
# List of the URLs that have already been scraped
url_info_exists_list = [item['url'] for item in url_info_exists.find()]


                                                                                                                             
# Fetch the links of the top-level categories
def get_url_index(locale):

    # The URL depends on the region chosen above
    url = 'http://{}.ganji.com/wu/'.format(locale)
    data = requests.get(url, headers=headers)
    soup = BeautifulSoup(data.text, 'lxml')
    links = soup.select('.fenlei > dt > a')

    # List that collects the category links
    L = []
    for link in links:
        # Build the full category link
        link_pre = url.split('wu/')[0]+'{}/'.format(link.get('href').strip('/'))
        # Store the category link in the list
        L.append(link_pre)

    # Return the list of category links, used later by the process pool
    return L

                                                                                                                             
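# Walk through every list page of one category and collect its product URLs;
# takes a top-level category link
def get_url_from_list(url, who_sells=''):

    # Append each page number to the category link in turn
    for page in range(1, 200):
        # Full list-page URL including the page number
        url_full = url + '{}o{}'.format(who_sells, page)
        # Collect the product links on this page and get back the number of entries
        code = get_url_info(url_full, page)

        # Fewer than 10 entries means the end of the list has been reached, so stop
        if code < 10:
            print('$'*20, '%s End of the pages!!!' % url_full, '$'*20, '\n\n')
            break

        print('\n')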
                                                                                                                             
                                                                                                                             
# Collect the product links from one list page of a category
def get_url_info(url, page):

    # Work out which category the URL belongs to
    cate = url.split('/')[-2]
    data = requests.get(url, headers=headers)
    # Number of entries on this page; stays 0 if the request fails
    judge = 0

    # Skip the page if the request failed
    if data.status_code != 200:
        print('%s Request Error!!!' % data.status_code)
    else:
        soup = BeautifulSoup(data.text, 'lxml')
        # links = soup.select('.ft-db ul li > a')
        links = soup.select('.ft-tit')
        # judge = len(soup.select('dl.list-bigpic'))
        # Total number of entries on the page
        judge = len(links)

        # Exactly 5 entries means this is the last page of the list; do not scrape it
        if judge == 5:
            print(cate, '-', page, "Error, We can't find anything because there is nothing to be found.")
        else:
            for i in links:
                link = i.get('href')

                # Filter out promoted listings and zhuanzhuan listings
                if 'click' not in link and 'zhuanzhuan' not in link:
                    # Store the product link in the MongoDB collection of product URLs
                    url_info.insert_one({'url': link})
                    # Print which category and which page is being scraped
                    print(cate, '-', page, '=>', link)

    # Return the number of entries on this page; get_url_from_list relies on it
    return judge
                                                                                                                             
                                                                                                                             
# Scrape the information on a product detail page
def get_item_from(url):

    # Skip the product if it has been scraped before
    if url in url_info_exists_list:
        print('#'*20, '"%s" has been opened before...' % url, '#'*20)
    else:
        print(url)
        data = requests.get(url, headers=headers)
        # A 404 means the item has been sold; skip it
        if data.status_code == 404:
            print('W'*20, 'This article has been to Mars...', 'W'*20)
        # Any other non-200 status is a request error; skip it
        elif data.status_code != 200:
            print('E'*20, '%s request error...' % data.status_code, 'E'*20)
        else:
            soup = BeautifulSoup(data.text, 'lxml')
            # Product title
            titles = soup.select('.title-name')
            # Time the product was posted
            updates = soup.select('.pr-5')
            # views = soup.select('#pageviews')
            # Product type
            types = soup.select('.det-infor > li:nth-of-type(1) > span')
            # Product price
            prices = soup.select('.f22')
            # Product area
            areas = soup.select('.det-infor > li:nth-of-type(3) > a')
            # Product condition
            degrees = soup.select('.second-det-infor > li')

            # Some products have no posting time, and sometimes the time in the
            # fetched page is wrong; this check is a stopgap for now
            if updates:
                if len(updates[0].get_text()) <= 3:
                    update = None
                else:
                    update = updates[0].get_text().strip().split()[0]
            else:
                update = None

            data = {
                'title': titles[0].get_text() if titles else None,
                'update': update,
                'type': types[0].get_text().replace('\n', '').replace(' ', '') if types else None,
                'price': int(prices[0].get_text()) if prices else 0,
                'area': [i.get_text().strip() for i in areas] if areas else None,
                'degree': degrees[0].get_text().split('\n')[-1].replace(' ', '') if degrees else None,
                'cate': url.split('/')[-2],
            }

            print(data)
            # Store the product information in MongoDB
            item_info.insert_one(data)
            # Once the product is scraped, record its URL in the collection of finished URLs
            url_info_exists.insert_one({'url': url})


# Fetch the list of top-level category links
url_index = get_url_index(locale)

3.2 Scraping all product links

# vim main.py  // program entry point

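#!/usr/bin/env python3
# -*- coding: utf-8 -*-

__author__ = 'jhw'


# Import the link-collecting function and the category list from the custom module
from spider_ganji import get_url_from_list, url_index
# Import the detail-scraping function and the product URL list from the custom module
# from spider_ganji import get_item_from, url_info_list
# Import the multiprocessing pool
from multiprocessing import Pool


if __name__ == '__main__':
    pool = Pool()
    # Collect all product links with multiple processes
    pool.map(get_url_from_list, url_index)
    # Scrape all product details with multiple processes
    # pool.map(get_item_from, url_info_list)
    # close() must be called before join(); after close() no new tasks can be submitted
    pool.close()
    # join() on the Pool waits for all worker processes to finish
    pool.join()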

# python3 main.py  // scrape product links with multiple processes; a 4-core CPU starts 4 processes

shouji - 2 => http://sh.ganji.com/shouji/1525463821x.htm
shouji - 2 => http://sh.ganji.com/shouji/1637265573x.htm
shouji - 2 => http://sh.ganji.com/shouji/1469334107x.htm
.
jiadian - 3 => http://sh.ganji.com/jiadian/2181455484x.htm
jiadian - 3 => http://sh.ganji.com/jiadian/2181302594x.htm
jiadian - 3 => http://sh.ganji.com/jiadian/1899011509x.htm
.
jiaju - 3 => http://sh.ganji.com/jiaju/2182971064x.htm
jiaju - 3 => http://sh.ganji.com/jiaju/2170617991x.htm
jiaju - 3 => http://sh.ganji.com/jiaju/2109235459x.htm
.
bangong - 3 => http://sh.ganji.com/bangong/2050123549x.htm
bangong - 3 => http://sh.ganji.com/bangong/2207236120x.htm
bangong - 3 => http://sh.ganji.com/bangong/1880908704x.htm

3.3 Scraping all product details

# vim main.py    // program entry point

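# -*- coding: utf-8 -*-

__author__ = 'jhw'


# Import the link-collecting function and the category list from the custom module
# from spider_ganji import get_url_from_list, url_index
# Import the detail-scraping function and the product URL list from the custom module
from spider_ganji import get_item_from, url_info_list
# Import the multiprocessing pool
from multiprocessing import Pool


if __name__ == '__main__':
    pool = Pool()
    # Collect all product links with multiple processes
    # pool.map(get_url_from_list, url_index)
    # Scrape all product details with multiple processes
    pool.map(get_item_from, url_info_list)
    # close() must be called before join(); after close() no new tasks can be submitted
    pool.close()
    # join() on the Pool waits for all worker processes to finish
    pool.join()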

# python3 main.py  // scrape product details with multiple processes; a 4-core CPU starts 4 processes

#################### "http://sh.ganji.com/meironghuazhuang/2197468267x.htm" has been opened before... ####################
#################### "http://sh.ganji.com/meironghuazhuang/2119557406x.htm" has been opened before... ####################
#################### "http://sh.ganji.com/meironghuazhuang/2118152395x.htm" has been opened before... ####################
#################### "http://sh.ganji.com/xianzhilipin/2022660697x.htm" has been opened before... ####################
.
http://sh.ganji.com/jiadian/2196681538x.htm
{'title': '出售5.2KG海尔滚筒洗衣机 - 500元', 'type': '大家电', 'degree': None, 'price': 500, 'cate': 'jiadian', 'update': '07-06', 'area': ['上海', '长宁', '中山公园']}
http://sh.ganji.com/jiadian/2205076770x.htm
{'title': '品牌办公家具大量低价抛售!震旦、美时、欧美、励致、天坛等! - 88元', 'type': '其他办公家具', 'degree': '95成新，可送货', 'price': 88, 'cate': 'bangong', 'update': None, 'area': ['上海', '浦东', '八佰伴']}
.
http://sh.ganji.com/bangong/1660785908x.htm
{'title': '进口音箱功放机投影机空调3P红木办公桌书柜茶桌房子卖了搬家。 - 2500元', 'type': '书柜', 'degree': '95成新，可送货', 'price': 2500, 'cate': 'jiadian', 'update': '07-06', 'area': ['上海', '浦东']}
http://sh.ganji.com/jiadian/2205106758x.htm
{'title': '前台,移动柜子,办公椅子 - 300元', 'type': '前台桌', 'degree': '8成新，不包送货', 'price': 300, 'cate': 'jiaju', 'update': '07-05', 'area': ['上海', '普陀', '曹杨新村']}

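4. Summary

With multiprocessing, map applies the main function to every element of a list.
A looping program can decide whether to exit based on the status a call returns.
The default size of a Pool is the number of CPU cores, so on this machine at most 4 processes run at the same time. This is a deliberate design limit of Pool, not a limit of the operating system. If you change it to:
p = Pool(5)
then 5 processes can run at the same time.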
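As a self-contained illustration of the pool size, the sketch below maps a trivial function over a list with an explicit worker count. The function names square and run_with_pool are assumptions, and square stands in for a real task such as fetching one URL.

```python
from multiprocessing import Pool


def square(n):
    # Stand-in for a real task such as fetching one URL
    return n * n


def run_with_pool(numbers, workers=None):
    """Map square over numbers; workers=None lets Pool pick its default size."""
    with Pool(processes=workers) as pool:
        return pool.map(square, numbers)


if __name__ == '__main__':
    # Five workers, regardless of how many cores the machine has
    print(run_with_pool([1, 2, 3, 4, 5], workers=5))
```

The function passed to map has to be defined at module level so that the worker processes can locate it, and the `if __name__ == '__main__'` guard keeps workers from re-running the entry point on platforms that start them by re-importing the script.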
