First sample
Preparation work
Python (3.5+ suggested)
Requests (2.9.1 for me)
http://maoyan.com/board/4 (Top 100 films ranking at http://maoyan.com)
Code (incomplete)
# import modules
import requests
import re
import json
from requests.exceptions import RequestException

# fetch the index page for the given url
def getIndexPage(url):
    headers = {...}  # fill in your request headers (see the sketch below)
    try:
        response = requests.get(url, headers=headers)
        if response.status_code == 200:
            return response.text
        return None
    except RequestException:
        return None

# extract the content we want from the html
def parseHtml(html):
    pattern = re.compile('...', re.S)  # see the sketch below for one possible pattern
    ...

# append each item to a file, one JSON object per line
def writeToFile(content):
    # the with statement closes the file automatically
    with open('result.txt', 'a', encoding='utf-8') as f:
        f.write(json.dumps(content, ensure_ascii=False) + '\n')

# main func
def main(offset):
    url = 'http://maoyan.com/board/4?offset=' + str(offset)
    html = getIndexPage(url)
    for item in parseHtml(html):
        writeToFile(item)

if __name__ == "__main__":
    # 10 films per page, offsets 0-90 cover the top 100
    for i in range(10):
        main(i * 10)
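The headers and the parsing pattern above are left incomplete. As a minimal sketch only: a browser-like User-Agent is usually enough for the headers, and the regular expression below assumes each film on the Maoyan board page sits inside a <dd> block containing board-index, data-src, name, star and releasetime in that order; the markup may have changed, so verify it against the live HTML first.

import re

# hypothetical headers: the value is an assumption, any browser UA string works
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}

# hypothetical pattern, assuming the <dd> structure described above
def parseHtml(html):
    pattern = re.compile(
        r'<dd>.*?board-index.*?>(\d+)</i>.*?data-src="(.*?)"'
        r'.*?name"><a.*?>(.*?)</a>.*?star">(.*?)</p>'
        r'.*?releasetime">(.*?)</p>.*?</dd>', re.S)
    # yield one dict per film so main() can write them line by line
    for rank, image, title, actors, release in re.findall(pattern, html):
        yield {
            'rank': rank,
            'image': image,
            'title': title,
            'actors': actors.strip(),
            'releasetime': release.strip(),
        }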
In this sample we used requests, re, and a few other simple modules to get information from the internet.
Second sample
We will use a popular framework, Scrapy, to collect hundreds of job postings about Python.
Preparation work
Python (3.5+ suggested)
Scrapy (1.5.0 for me)
https://www.zhipin.com/
Initialize the project
First, create a new folder named zhipin.
Under zhipin, run
scrapy startproject zhipin
Your folder structure will look like this:
zhipin
|-scrapy.cfg
|-zhipin
  |-__init__.py
  |-items.py
  |-middlewares.py
  |-pipelines.py
  |-settings.py
  |-spiders
    |-__init__.py
Get the start URL
We need to go to zhipin to find the URL for the spider. You can press F12 to inspect the page for more details; for now, just know the URL we use: https://www.zhipin.com/job_detail/?query=python&scity=100010000&page=1&ka=page-1 (query is the search keyword and page controls pagination).
Edit files
Because this is a simple project, there are only a few files to edit.
zhipin/items.py
import scrapy

class ZhipinItem(scrapy.Item):
    # fields we want to collect for each job posting
    jobTitle = scrapy.Field()
    salary = scrapy.Field()
    area = scrapy.Field()
    workExperience = scrapy.Field()
    education = scrapy.Field()
    company = scrapy.Field()
    description = scrapy.Field()
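A scrapy.Item behaves like a dict with a fixed set of keys, which is how the spider below fills in its fields. A quick illustration (the values here are made up):

from zhipin.items import ZhipinItem

item = ZhipinItem()
item['jobTitle'] = 'Python Developer'  # hypothetical value
item['salary'] = '15k-25k'             # hypothetical value
print(dict(item))  # {'jobTitle': 'Python Developer', 'salary': '15k-25k'}
# assigning a key that was not declared as a Field raises KeyError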
zhipin/settings.py
# to support Chinese in the exported data, add the line below
FEED_EXPORT_ENCODING = 'utf-8'
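Depending on how the site treats crawlers, a couple of other standard Scrapy settings can help; the values below are assumptions, so adjust them for your own run:

# send a browser-like User-Agent instead of Scrapy's default (value is an assumption)
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'
# be polite and wait a second between requests
DOWNLOAD_DELAY = 1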
zhipin/spiders/spider.py
import scrapy
from zhipin.items import ZhipinItem

class ZhipinSpider(scrapy.Spider):
    # "scrapy crawl" refers to this name
    name = 'zhipin'
    start_urls = ['https://www.zhipin.com/job_detail/?query=python&scity=100010000&page=1&ka=page-1']

    def parse(self, response):
        try:
            for info in response.xpath('//div[@class="job-list"]/ul/li'):
                # create a fresh item for every job listing
                item = ZhipinItem()
                item['jobTitle'] = info.xpath('div[1]/div[1]/h3/a/text()').extract_first()
                item['description'] = info.xpath('div[1]/div[1]/h3/a/@href').extract_first()
                item['salary'] = info.xpath('div/div[1]/h3/a/span/text()').extract_first()
                item['area'] = info.xpath('div/div[1]/p/text()[1]').extract_first()
                item['workExperience'] = info.xpath('div[1]/div[1]/p/text()[2]').extract_first()
                item['education'] = info.xpath('div[1]/div[1]/p/text()[3]').extract_first()
                item['company'] = info.xpath('div[1]/div[2]/div/h3/a/@href').extract_first()
                yield item
        except Exception:
            print("******************ERROR***********************")
        # follow the "next page" link until it becomes a dead javascript href
        next_page_url = response.xpath('//*[@ka="page-next"]/@href').extract_first()
        if next_page_url is not None and next_page_url != 'javascript:;':
            yield scrapy.Request(response.urljoin(next_page_url))
        else:
            print("No more pages!")
You can find the original files for all of the code above here.
Then, run it:
scrapy crawl zhipin
The information will be shown in your terminal.
You can also export it to a JSON file like this:
scrapy crawl zhipin -o sample.json
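Scrapy infers the export format from the file extension, so other formats work the same way, for example CSV:
scrapy crawl zhipin -o sample.csv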
Finally
We have just built a simple Scrapy project to collect job information. Scrapy can do far more than this, and we will cover more of its usage later.