In a previous post, we used the urllib and BeautifulSoup modules to crawl the top 100 Chinese universities and write them into MySQL. In this post, we will use Scrapy together with BeautifulSoup to crawl the same top-100 ranking and write it into a MongoDB database. The page to crawl is http://gaokao.xdf.cn/201702/10612921.html; a (partial) screenshot follows:
First, log in to MongoDB and create the testdb database and the university_rank collection. Then we can start writing the Scrapy spider.
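A minimal setup sketch in the mongo shell (assuming a MongoDB instance is running locally on the default port 27017; note that MongoDB would also create both the database and the collection lazily on the first insert, so the explicit createCollection step is optional):

```shell
$ mongo
> use testdb
> db.createCollection("university_rank")
> show collections
```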
The complete Python code is as follows:
# import modules
import bs4
import scrapy
from bs4 import BeautifulSoup
from pymongo import MongoClient


class UniversityRankSpider(scrapy.Spider):
    name = "university-rank"  # name of the spider
    start_urls = ['http://gaokao.xdf.cn/201702/10612921.html']  # page to crawl

    def parse(self, response):  # parse callback
        content = response.xpath("//tbody").extract()[0]
        soup = BeautifulSoup(content, "lxml")  # re-parse the table with BeautifulSoup
        table = soup.find('tbody')
        count = 0
        lst = []  # list to hold the data rows of the table
        for tr in table.children:  # iterate over the rows of the table
            if isinstance(tr, bs4.element.Tag):
                td = tr('td')  # all <td> cells of this row
                if count >= 2:  # skip the first two (header) rows
                    lst.append([td[i]('p')[0].string.replace('\n', '').replace('\t', '')
                                for i in range(8)])
                count += 1

        conn = MongoClient('mongodb://localhost:27017/')  # connect to MongoDB
        db = conn.testdb
        for item in lst:  # insert each row into the university_rank collection
            db.university_rank.insert_one(
                {'rank': item[0], 'university': item[1], 'address': item[2],
                 'local_rank': item[3], 'total grade': item[4], 'type': item[5],
                 'star rank': item[6], 'class': item[7]}
            )

        print('Successfully downloaded the data from the website and wrote it to the MongoDB database!')
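The cleaning step inside parse() (stripping newlines and tabs from every cell, then pairing the eight values with field names) can be exercised in isolation, without Scrapy or a running MongoDB. The sample row below is invented for illustration, and row_to_document is a hypothetical helper, not part of the spider:

```python
# Field names matching the documents the spider inserts.
FIELDS = ['rank', 'university', 'address', 'local_rank',
          'total grade', 'type', 'star rank', 'class']

def row_to_document(cells):
    """Strip newlines/tabs from every cell and map the values onto FIELDS."""
    cleaned = [c.replace('\n', '').replace('\t', '') for c in cells]
    return dict(zip(FIELDS, cleaned))

# Hypothetical row in the shape the spider scrapes from the table:
sample = ['1', '北京大学\n', '\t北京', '1', '100.00', '综合', '8星级', '世界一流大学']
doc = row_to_document(sample)
print(doc['university'])  # -> 北京大学
```

The resulting dict is exactly what gets passed to insert_one, so any change to the table layout only needs to be reflected in FIELDS and the cleaning logic.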
The spider can be run with scrapy runspider (or scrapy crawl university-rank inside a Scrapy project); the output is as follows:
Next, open Robo 3T to inspect the MongoDB database; the university_rank collection looks like this:
Bingo, we have successfully written the data into the MongoDB database!
That is all for this post. Comments and discussion are welcome~~