Week2_Retrieve and Store Information from Web with MongoDB

This project crawls a large number of web links with a crawler and stores them in MongoDB; it then reads the links back from MongoDB and scrapes the details from each link. For this project I am reviewing how to crawl data from the web, while I am totally new to using MongoDB to store and filter data. I am also quite new to using "if __name__ == '__main__':" to start a program.
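Since the "if __name__ == '__main__':" idiom is new to me, here is a minimal example I wrote for myself (not part of the project code) to show what it does:
<code>

# demo.py
def greet():
    print('hello')

if __name__ == '__main__':
    # this runs when we execute "python demo.py" directly,
    # but NOT when another file does "import demo"
    greet()
</code>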

We split the code into four parts. The first one, 'channel_extracing.py', retrieves the channel links from the start page; I then pasted the printed links into the channel_list string at the bottom of the file. Below is the code.
<code>

import requests
from bs4 import BeautifulSoup

start_url = "http://bj.ganji.com/wu/"
base_url = "http://bj.ganji.com"

if __name__ == '__main__':
    # crawl the category page and print every channel link;
    # guarded so it does not re-run when main.py imports channel_list
    wb_data = requests.get(start_url)
    soup = BeautifulSoup(wb_data.text, 'lxml')
    info_list = soup.select("dt > a")
    for info in info_list:
        url = base_url + info.get('href')
        print(url)

channel_list = """
http://bj.ganji.com/shouji/
http://bj.ganji.com/shoujihaoma/
http://bj.ganji.com/shoujipeijian/
http://bj.ganji.com/bijibendiannao/
http://bj.ganji.com/taishidiannaozhengji/
http://bj.ganji.com/diannaoyingjian/
http://bj.ganji.com/wangluoshebei/
http://bj.ganji.com/shumaxiangji/
http://bj.ganji.com/youxiji/
http://bj.ganji.com/xuniwupin/
http://bj.ganji.com/jiaju/
http://bj.ganji.com/jiadian/
http://bj.ganji.com/zixingchemaimai/
http://bj.ganji.com/rirongbaihuo/
http://bj.ganji.com/yingyouyunfu/
http://bj.ganji.com/fushixiaobaxuemao/
http://bj.ganji.com/meironghuazhuang/
http://bj.ganji.com/yundongqicai/
http://bj.ganji.com/yueqi/
http://bj.ganji.com/tushu/
http://bj.ganji.com/bangongjiaju/
http://bj.ganji.com/wujingongju/
http://bj.ganji.com/nongyongpin/
http://bj.ganji.com/xianzhilipin/
http://bj.ganji.com/shoucangpin/
http://bj.ganji.com/baojianpin/
http://bj.ganji.com/laonianyongpin/
http://bj.ganji.com/gou/
http://bj.ganji.com/qitaxiaochong/
http://bj.ganji.com/xiaofeika/
http://bj.ganji.com/menpiao/
http://bj.ganji.com/jiaju/
http://bj.ganji.com/rirongbaihuo/
http://bj.ganji.com/shouji/
http://bj.ganji.com/shoujihaoma/
http://bj.ganji.com/bangong/
http://bj.ganji.com/nongyongpin/
http://bj.ganji.com/jiadian/
http://bj.ganji.com/ershoubijibendiannao/
http://bj.ganji.com/ruanjiantushu/
http://bj.ganji.com/yingyouyunfu/
http://bj.ganji.com/diannao/
http://bj.ganji.com/xianzhilipin/
http://bj.ganji.com/fushixiaobaxuemao/
http://bj.ganji.com/meironghuazhuang/
http://bj.ganji.com/shuma/
http://bj.ganji.com/laonianyongpin/
http://bj.ganji.com/xuniwupin/
http://bj.ganji.com/qitawupin/
http://bj.ganji.com/ershoufree/
http://bj.ganji.com/wupinjiaohuan/
"""
</code>
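
The code above only prints the channel links; I pasted the output into the channel_list string by hand. If we wanted to keep the channel links in MongoDB as well, a minimal sketch could look like this (the 'channel_url' collection name is my own choice, not from the original code):
<code>

import pymongo
from channel_extracing import channel_list

client = pymongo.MongoClient('localhost', 27017)
ganji = client['ganji']
channel_url = ganji['channel_url']  # hypothetical collection for channel links

for channel in channel_list.split():
    # one document per channel link
    channel_url.insert_one({'url': channel})
</code>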

The second part does two jobs: get_links_from collects the item links from each channel's list pages and stores them in MongoDB, and get_item_info_from scrapes the details of a single item from one of those links. Here is the code, 'page_parsing.py':
<code>

import requests
from bs4 import BeautifulSoup
import time
import pymongo
import random

# MongoDB: one collection for item links, one for item details
client = pymongo.MongoClient('localhost', 27017)
ganji = client['ganji']
item_url = ganji['item_url']
item_info = ganji['item_info']

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.143 Safari/537.36',
    'Connection': 'keep-alive'
}

# rotate through a small proxy pool to reduce the chance of being blocked
proxy_list = [
    'http://117.177.250.151:8081',
    'http://111.85.219.250:3129',
    'http://122.70.183.138:8118',
]
proxy_ip = random.choice(proxy_list)
proxies = {'http': proxy_ip}

def get_links_from(channel, pages):
    # list pages look like <channel>o<page number>, e.g. .../shouji/o2
    page_link = channel + 'o{}'.format(pages)
    wb_data = requests.get(page_link, headers=headers, proxies=proxies)
    soup = BeautifulSoup(wb_data.text, 'lxml')
    if soup.find('td', 't'):
        urls = soup.select("td.t > a")
        for url_sub in urls:
            url = url_sub.get('href').split('?')[0]
            print(url)
            # insert_one() needs a document (dict), not a bare string
            item_url.insert_one({'url': url})
    else:
        # no listings on this page, so the channel has no more pages
        pass

def get_item_info_from(url):
    wb_data = requests.get(url, headers=headers)
    if wb_data.status_code == 404:  # status_code is an int, not a string
        pass
    else:
        soup = BeautifulSoup(wb_data.text, 'lxml')
        # the class names below ('info_titile', 'palce_li') are copied
        # from the page's own HTML source as-is
        titles = soup.select('h1.info_titile')
        prices = soup.select('span.price_now > i')
        places = soup.select('div.palce_li > span > i')
        for title, price, place in zip(titles, prices, places):
            data = {
                'url': url,
                'title': title.get_text(),
                'price': price.get_text(),
                'place': place.get_text()
            }
            print(data)
            item_info.insert_one(data)

if __name__ == '__main__':
    # quick single-page test; guarded so it does not run when main.py imports this module
    get_item_info_from('http://zhuanzhuan.ganji.com/detail/811531368570765314z.shtml')
</code>
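
One caveat I noticed: if the crawl is restarted, get_links_from will insert the same URLs again. A small guard (my own addition, not in the original code) could check the collection before inserting:
<code>

from page_parsing import item_url

def save_link(url):
    # only insert the link if it is not already stored
    if item_url.find_one({'url': url}) is None:
        item_url.insert_one({'url': url})
</code>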

Next is the "main.py" code, which we use to start the program. The code is below:
<code>

from multiprocessing import Pool
from channel_extracing import channel_list
from page_parsing import get_links_from, get_item_info_from, item_url, item_info

def get_all_links(channel):
    # walk through list pages 1-100 of one channel
    for page in range(1, 101):
        get_links_from(channel, page)

if __name__ == '__main__':
    # one worker process per CPU core; each takes channels off the list
    pool = Pool()
    pool.map(get_all_links, channel_list.split())
</code>
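
Note that main.py only runs the first stage, collecting the item links into MongoDB. To actually scrape the item details, a second stage could map get_item_info_from over the stored links once the first stage finishes. A rough sketch of how that might look:
<code>

from multiprocessing import Pool
from page_parsing import get_item_info_from, item_url

if __name__ == '__main__':
    # read back every stored link and fetch its detail page in parallel
    db_urls = [record['url'] for record in item_url.find()]
    pool = Pool()
    pool.map(get_item_info_from, db_urls)
</code>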

The last part is a separate script named "counts.py". It reports every 5 seconds how many links we have collected so far. We run it separately from the main program (the three files above). Here is the code:
<code>

import time
from page_parsing import item_url

while True:
    # report how many item links have been collected so far
    print(item_url.count_documents({}))
    time.sleep(5)
</code>
