Today's crawler scrapes all the item links from a classifieds site (58.com). It involves MongoDB with a few simple operations, plus multiprocessing. The data scraped is simple, but being able to collect this much of it is still exciting.
The code is as follows:
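For reference, here is a minimal sketch of the "simple operations" the scripts rely on from pymongo (insert_one to store a document and find to read them back); the database and collection names here are only placeholders for illustration.

import pymongo

# Placeholder names: any local database/collection will do for a quick test
client = pymongo.MongoClient('localhost', 27017)
db = client['test_db']
collection = db['url_list']

collection.insert_one({'url': 'http://example.com/item/1'})  # store one document
for doc in collection.find():                                 # iterate over everything stored
    print(doc['url'])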
channel_extract.py
from bs4 import BeautifulSoup
import requests

start_url = 'http://cd.58.com/sale.shtml'
url_host = 'http://cd.58.com'

def get_channel_urls(url):  # scrape all second-level channel links from the category page
    wb_data = requests.get(url)
    soup = BeautifulSoup(wb_data.text, 'lxml')
    links = soup.select('ul.ym-submnu > li > b > a')
    for link in links:
        page_url = url_host + link.get('href')
        print(page_url)

# get_channel_urls(start_url)  # run once, then paste the printed URLs into channel_list below

channel_list = '''
'''
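The triple-quoted channel_list above is left empty here; it is meant to hold the URLs printed by get_channel_urls, one per line, so that main.py can split it into a Python list. A hypothetical example of what it might look like (these two entries are made up for illustration):

channel_list = '''
    http://cd.58.com/shouji/
    http://cd.58.com/diannao/
'''
print(channel_list.split())  # -> ['http://cd.58.com/shouji/', 'http://cd.58.com/diannao/']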
page_parsing.py
from bs4 import BeautifulSoup
import requests
import time
import pymongo

# Connect to a local MongoDB instance; the database/collection names are placeholders
client = pymongo.MongoClient('localhost', 27017)
db = client['tc_58']
url_list = db['url_list']

def get_links_from(channel, pages, who_sells=0):  # scrape every item link on one page of a channel
    list_view = '{}{}/pn{}'.format(channel, str(who_sells), str(pages))
    wb_data = requests.get(list_view)
    time.sleep(1)
    soup = BeautifulSoup(wb_data.text, 'lxml')
    if soup.find('td', 't'):
        for link in soup.select('td.t a.t'):
            item_link = link.get('href').split('?')[0]
            if len(item_link) <= 56 and 'jump' not in item_link:  # filter out invalid links
                url_list.insert_one({'url': item_link})
                print(item_link)
    else:
        pass
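Since insert_one is called for every link seen, re-running the crawl stores duplicates. One optional refinement, not part of the original script, is a unique index on the url field together with an upsert:

# Optional deduplication (my addition, not in the original script)
url_list.create_index('url', unique=True)

def save_link(item_link):
    # upsert: insert the URL if it is new, otherwise leave the existing record alone
    url_list.update_one({'url': item_link}, {'$setOnInsert': {'url': item_link}}, upsert=True)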
main.py
from multiprocessing import Pool
from channel_extract import channel_list
from page_parsing import get_links_from

def get_all_links_from(channel):
    # Walk through pages 1-100 of one channel
    for page in range(1, 101):
        get_links_from(channel, page)

if __name__ == '__main__':
    pool = Pool()  # by default, one worker process per CPU core
    pool.map(get_all_links_from, channel_list.split())
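Pool() with no arguments spawns one worker per CPU core. If the site starts rate limiting, the worker count can be capped explicitly; a sketch (the value 4 is just an example):

if __name__ == '__main__':
    pool = Pool(processes=4)          # limit to 4 concurrent workers
    pool.map(get_all_links_from, channel_list.split())
    pool.close()                      # no more tasks will be submitted
    pool.join()                       # wait for all workers to finish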
And finally, a small script that counts the stored records:
count.py
import time
from page_parsing import url_list

while True:
    print(url_list.find().count())  # number of records currently stored in MongoDB
    time.sleep(5)
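Note that Cursor.count() was deprecated and later removed from pymongo; on PyMongo 4.x the equivalent would be:

# Equivalent on newer pymongo releases, where Cursor.count() no longer exists
print(url_list.count_documents({}))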
The amount of data scraped isn't huge, but it's enough to keep me motivated to continue learning.