Crawler Series Part I

Scrapy

First sample

Preparation work

Python (3.5+ suggested)
Requests (2.9.1 for me; see the install command below)
http://maoyan.com/board/4 (Top 100 films ranking at http://maoyan.com)
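
If Requests is not installed yet, a plain pip install is enough:

pip install requests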

Code (incomplete)

# import modules
import json
import re

import requests
from requests.exceptions import RequestException

# fetch the index page for the given url
def getIndexPage(url):
    headers = {...}  # fill in request headers here (see the sketch below)
    try:
        response = requests.get(url, headers=headers)
        if response.status_code == 200:
            return response.text
        return None
    except RequestException:
        return None

# extract the content we want from the html
def parseHtml(html):
    pattern = re.compile('...', re.S)  # fill in the regex (see the sketch below)
    ...

# append one record to the result file as a JSON line
def writeToFile(content):
    with open('result.txt', 'a', encoding='utf-8') as f:
        f.write(json.dumps(content, ensure_ascii=False) + '\n')

# main func
def main(offset):
    url = 'http://maoyan.com/board/4?offset=' + str(offset)
    html = getIndexPage(url)
    for item in parseHtml(html):
        writeToFile(item)

if __name__ == "__main__":
    # 10 films per page, 100 films in total
    for i in range(10):
        main(i * 10)
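
The two elided pieces depend on Maoyan's markup, which changes over time. As a hedged illustration only (the class names below are assumptions about the board page's HTML, not the author's original code), headers and parseHtml could look roughly like this; note that parseHtml is written as a generator because main iterates over its result:

# a minimal sketch, assuming the board page wraps each film in a <dd> tag
# with board-index / name / star / releasetime classes; check the live
# HTML before relying on it
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',  # any common browser UA
}

def parseHtml(html):
    pattern = re.compile(
        r'<dd>.*?board-index.*?>(\d+)</i>'     # ranking number
        r'.*?name"><a.*?>(.*?)</a>'            # film title
        r'.*?star">(.*?)</p>'                  # starring actors
        r'.*?releasetime">(.*?)</p>.*?</dd>',  # release date
        re.S)
    for rank, title, actors, releasetime in re.findall(pattern, html):
        yield {
            'rank': rank,
            'title': title,
            'actors': actors.strip(),
            'releasetime': releasetime,
        }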

In this sample we used requests, re, and a few other simple modules to pull information from the internet.

Second sample

We will use the popular framework Scrapy to get hundreds of job postings about Python.

Preparation work

Python (3.5+ suggested)
Scrapy (1.5.0 for me; see the install commands below)
https://www.zhipin.com/
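
Scrapy pulls in a few heavier dependencies (Twisted among them), but pip handles them for you; you can verify the install afterwards:

pip install scrapy
scrapy version   # should print something like Scrapy 1.5.0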

Initialize the project

First, create a new folder named zhipin.

Under zhipin, run

scrapy startproject zhipin

Your folder structure will look like this:

zhipin
|- scrapy.cfg
|- zhipin
   |- __init__.py
   |- items.py
   |- middlewares.py
   |- pipelines.py
   |- settings.py
   |- spiders
      |- __init__.py

Get the start URL

We need to go to zhipin.com to find the start URL for the spider. You can press F12 and inspect the page for more details; for now, just note the URL we will use: https://www.zhipin.com/job_detail/?query=python&scity=100010000&page=1&ka=page-1
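
Before writing the spider, you can poke at the page interactively with Scrapy's shell; the XPath here is the same one the spider below relies on:

scrapy shell "https://www.zhipin.com/job_detail/?query=python&scity=100010000&page=1&ka=page-1"
>>> response.xpath('//div[@class="job-list"]/ul/li')   # should return one selector per job posting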

Edit files

Since this is such a simple project, there are only a few files to edit.
zhipin/items.py

import scrapy

class ZhipinItem(scrapy.Item):
    jobTitle = scrapy.Field()
    salary = scrapy.Field()
    area = scrapy.Field()
    workExperience = scrapy.Field()
    education = scrapy.Field()
    company = scrapy.Field()
    description = scrapy.Field()
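
Item instances support dict-style access, which is how the spider below fills them in. A quick illustration (the value is hypothetical):

item = ZhipinItem()
item['jobTitle'] = 'Python Developer'   # hypothetical value
print(dict(item))                       # {'jobTitle': 'Python Developer'}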

zhipin/settings.py

# to output Chinese correctly, add the line below
FEED_EXPORT_ENCODING = 'utf-8'
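
Depending on how aggressively the site filters bots, a few more standard Scrapy settings may help. The values here are illustrative assumptions to tune for your own runs, not part of the original project:

# pretend to be a regular browser; many sites reject Scrapy's default user agent
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'
# wait a second between requests to stay polite
DOWNLOAD_DELAY = 1
# set to False only if you have a reason to ignore robots.txt
ROBOTSTXT_OBEY = True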

zhipin/spiders/spider.py

import scrapy
from zhipin.items import ZhipinItem

class ZhipinSpider(scrapy.Spider):
    name = 'zhipin'  # must match the name used in `scrapy crawl zhipin`
    start_urls = ['https://www.zhipin.com/job_detail/?query=python&scity=100010000&page=1&ka=page-1']

    def parse(self, response):
        try:
            # each <li> under the job list is one job posting
            for info in response.xpath('//div[@class="job-list"]/ul/li'):
                items = ZhipinItem()
                items['jobTitle'] = info.xpath('div[1]/div[1]/h3/a/text()').extract_first()
                items['description'] = info.xpath('div[1]/div[1]/h3/a/@href').extract_first()
                items['salary'] = info.xpath('div/div[1]/h3/a/span/text()').extract_first()
                items['area'] = info.xpath('div/div[1]/p/text()[1]').extract_first()
                items['workExperience'] = info.xpath('div[1]/div[1]/p/text()[2]').extract_first()
                items['education'] = info.xpath('div[1]/div[1]/p/text()[3]').extract_first()
                items['company'] = info.xpath('div[1]/div[2]/div/h3/a/@href').extract_first()
                yield items
        except Exception:
            print("******************ERROR***********************")
        # follow the "next page" link until it turns into a dead javascript link
        next_page_url = response.xpath('//*[@ka="page-next"]/@href').extract_first()
        if next_page_url is not None and next_page_url != 'javascript:;':
            yield scrapy.Request(response.urljoin(next_page_url))
        else:
            print("No more pages!")

You can find the original files for all of the code above here.

Then, run it:

scrapy crawl zhipin

The scraped information will be printed to your terminal.
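
While you are still debugging, you probably don't want to crawl every page. One option is Scrapy's built-in CloseSpider extension, which can stop the crawl after a fixed number of pages:

# stop automatically after 5 pages (any setting can be overridden with -s)
scrapy crawl zhipin -s CLOSESPIDER_PAGECOUNT=5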

You can also export the results to a JSON file like this:

scrapy crawl zhipin -o sample.json
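
Scrapy picks the feed format from the file extension, so other formats need no extra code:

scrapy crawl zhipin -o sample.csv   # CSV
scrapy crawl zhipin -o sample.jl    # JSON Lines, one item per line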

Finally

We have just built a simple Scrapy project to collect job information. Scrapy can do much more than this; we will cover more of its usage later.
