This project crawls a large number of web links with a crawler and stores them in MongoDB, then reads the links back out of MongoDB and scrapes the details behind each link. For this project I am reviewing how to crawl data from the web, I am completely new to using MongoDB to store and filter data, and I am also quite new to using "if __name__ == '__main__':" to start a program.
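For reference, the basic pattern behind both of those pieces looks like this; it is only a minimal sketch, and the database name 'demo' and collection name 'links' are placeholders, not names the project actually uses:
<code>
import pymongo

# connect to a local MongoDB server (same host/port as the project uses)
client = pymongo.MongoClient('localhost', 27017)
demo = client['demo']          # placeholder database name
links = demo['links']          # placeholder collection name

def save_link(url):
    # documents must be dicts, not bare strings
    links.insert_one({'url': url})

def find_ganji_links():
    # filter: only documents whose 'url' field contains 'ganji.com'
    return links.find({'url': {'$regex': 'ganji.com'}})

if __name__ == '__main__':
    # this block only runs when the file is executed directly,
    # not when it is imported by another module
    save_link('http://bj.ganji.com/shouji/')
    for doc in find_ganji_links():
        print(doc['url'])
</code>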
We split the code into 4 parts. The first one, 'channel_extracing.py', collects the channel links from the start page; at the moment it prints them and keeps them in the channel_list string (a sketch for storing them in MongoDB follows the code). Below is the code.
<code>
import requests
from bs4 import BeautifulSoup

start_url = "http://bj.ganji.com/wu/"
base_url = "http://bj.ganji.com"

# fetch the category page and pull out every channel link under a <dt>
wb_data = requests.get(start_url)
soup = BeautifulSoup(wb_data.text, 'lxml')
info_list = soup.select("dt > a")
for info in info_list:
    url = base_url + info.get('href')
    print(url)

# the printed channel links were pasted into this string for later use
channel_list = """
http://bj.ganji.com/shouji/
http://bj.ganji.com/shoujihaoma/
http://bj.ganji.com/shoujipeijian/
http://bj.ganji.com/bijibendiannao/
http://bj.ganji.com/taishidiannaozhengji/
http://bj.ganji.com/diannaoyingjian/
http://bj.ganji.com/wangluoshebei/
http://bj.ganji.com/shumaxiangji/
http://bj.ganji.com/youxiji/
http://bj.ganji.com/xuniwupin/
http://bj.ganji.com/jiaju/
http://bj.ganji.com/jiadian/
http://bj.ganji.com/zixingchemaimai/
http://bj.ganji.com/rirongbaihuo/
http://bj.ganji.com/yingyouyunfu/
http://bj.ganji.com/fushixiaobaxuemao/
http://bj.ganji.com/meironghuazhuang/
http://bj.ganji.com/yundongqicai/
http://bj.ganji.com/yueqi/
http://bj.ganji.com/tushu/
http://bj.ganji.com/bangongjiaju/
http://bj.ganji.com/wujingongju/
http://bj.ganji.com/nongyongpin/
http://bj.ganji.com/xianzhilipin/
http://bj.ganji.com/shoucangpin/
http://bj.ganji.com/baojianpin/
http://bj.ganji.com/laonianyongpin/
http://bj.ganji.com/gou/
http://bj.ganji.com/qitaxiaochong/
http://bj.ganji.com/xiaofeika/
http://bj.ganji.com/menpiao/
http://bj.ganji.com/jiaju/
http://bj.ganji.com/rirongbaihuo/
http://bj.ganji.com/shouji/
http://bj.ganji.com/shoujihaoma/
http://bj.ganji.com/bangong/
http://bj.ganji.com/nongyongpin/
http://bj.ganji.com/jiadian/
http://bj.ganji.com/ershoubijibendiannao/
http://bj.ganji.com/ruanjiantushu/
http://bj.ganji.com/yingyouyunfu/
http://bj.ganji.com/diannao/
http://bj.ganji.com/xianzhilipin/
http://bj.ganji.com/fushixiaobaxuemao/
http://bj.ganji.com/meironghuazhuang/
http://bj.ganji.com/shuma/
http://bj.ganji.com/laonianyongpin/
http://bj.ganji.com/xuniwupin/
http://bj.ganji.com/qitawupin/
http://bj.ganji.com/ershoufree/
http://bj.ganji.com/wupinjiaohuan/
"""
</code>
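The description above says these channel links should end up in MongoDB, but the script only prints them and keeps them in the channel_list string. A minimal sketch of storing them instead, assuming it is appended to the end of channel_extracing.py and reusing the 'ganji' database from page_parsing.py (the collection name 'channel_urls' is my own placeholder):
<code>
import pymongo

client = pymongo.MongoClient('localhost', 27017)
channel_urls = client['ganji']['channel_urls']   # placeholder collection name

# one document per channel link instead of one big hard-coded string
for channel in channel_list.split():
    channel_urls.insert_one({'url': channel})
</code>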
The second part, 'page_parsing.py', stores the item links in MongoDB (get_links_from) and scrapes the details of a single item from one link (get_item_info_from); the links for the detail step are supposed to come back out of MongoDB (see the sketch after the code). Here is the code:
<code>
import requests
from bs4 import BeautifulSoup
import time
import pymongo
import random

# one database ('ganji') with two collections: item links and item details
client = pymongo.MongoClient('localhost', 27017)
ganji = client['ganji']
item_url = ganji['item_url']
item_info = ganji['item_info']

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.143 Safari/537.36',
    'Connection': 'keep-alive'
}

proxy_list = [
    'http://117.177.250.151:8081',
    'http://111.85.219.250:3129',
    'http://122.70.183.138:8118',
]
proxy_ip = random.choice(proxy_list)
proxies = {'http': proxy_ip}

def get_links_from(channel, pages):
    # list pages look like <channel>o<page number>, e.g. .../shouji/o2
    page_link = channel + 'o{}'.format(str(pages))
    wb_data = requests.get(page_link, headers=headers, proxies=proxies)
    soup = BeautifulSoup(wb_data.text, 'lxml')
    if soup.find('td', 't'):
        urls = soup.select("td.t > a")
        for url_sub in urls:
            url = url_sub.get('href').split('?')[0]
            print(url)
            # insert_one() needs a document (dict), not a bare string
            item_url.insert_one({'url': url})
    else:
        # no result table on this page, nothing to store
        pass

def get_item_info_from(url):
    wb_data = requests.get(url, headers=headers)
    # status_code is an int, so compare with 404, not the string '404'
    if wb_data.status_code == 404:
        pass
    else:
        soup = BeautifulSoup(wb_data.text, 'lxml')
        titles = soup.select('h1.info_titile')
        prices = soup.select('span.price_now > i')
        places = soup.select('div.palce_li > span > i')
        for title, price, place in zip(titles, prices, places):
            data = {
                'url': url,
                'title': title.get_text(),
                'price': price.get_text(),
                'place': place.get_text()
            }
            print(data)
            item_info.insert_one(data)

# quick test call; note this also runs whenever the module is imported
get_item_info_from('http://zhuanzhuan.ganji.com/detail/811531368570765314z.shtml')
</code>
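The detail step is only exercised by the hard-coded test URL on the last line, and that call also fires whenever the module is imported (main.py imports it), so it is worth moving under an "if __name__ == '__main__':" guard. Nothing yet reads the stored links back out of MongoDB; a minimal sketch of doing that, assuming the {'url': ...} documents written by get_links_from:
<code>
from page_parsing import get_item_info_from, item_url

# walk every stored link document and scrape its detail page
for doc in item_url.find():
    get_item_info_from(doc['url'])
</code>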
Next is the "main.py" code, which we use to start the whole crawl. The code is below:
<code>
from multiprocessing import Pool
from channel_extracing import channel_list
from page_parsing import get_links_from, get_item_info_from, item_url, item_info

def get_all_links(channel):
    # crawl list pages 1-100 of one channel and store every item link
    for page in range(1, 101):
        get_links_from(channel, page)

if __name__ == '__main__':
    # the guard is required with multiprocessing: worker processes re-import
    # this module, and without it each import would try to start its own pool
    pool = Pool()
    pool.map(get_all_links, channel_list.split())
</code>
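main.py imports get_item_info_from, item_url and item_info but never uses them, so the detail phase was presumably meant to run here as well. The loop sketched after page_parsing.py can be parallelised with the same kind of pool, again assuming the {'url': ...} document shape:
<code>
from multiprocessing import Pool
from page_parsing import get_item_info_from, item_url

if __name__ == '__main__':
    # pull every stored link out of MongoDB and scrape the details in parallel
    stored_urls = [doc['url'] for doc in item_url.find()]
    pool = Pool()
    pool.map(get_item_info_from, stored_urls)
</code>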
The last part is a separate script named "counts.py". Every 5 seconds it prints how many records we have collected so far. We run it separately from the main code (the three files above). Here is the code:
<code>
import time
# page_parsing defines no url_list_v1; the stored links live in item_url
from page_parsing import item_url

while True:
    # cursor.count() was removed in PyMongo 4, so count the documents directly
    print(item_url.count_documents({}))
    time.sleep(5)
</code>