urllib scraping hands-on exercise: crawl all links on a home page
- Crawl all links on the home page at https://china-testing.github.io/
Reference answer
#!/usr/bin/python3
# -*- coding: utf-8 -*-
# Discussion: DingTalk group 21745728, QQ groups 144081101 / 567351477
# CreateDate: 2018-10-20
from urllib.request import urlopen
from urllib.parse import urljoin  # urljoin lives in urllib.parse, not urllib.request
import re

def download_page(url):
    """Fetch a page and decode it as UTF-8 text."""
    return urlopen(url).read().decode('utf-8')

def extract_links(page):
    """Return the href values of all <a> tags in the page."""
    link_regex = re.compile(r'<a[^>]+href=["\'](.*?)["\']', re.IGNORECASE)
    return link_regex.findall(page)

if __name__ == '__main__':
    target_url = 'https://china-testing.github.io/'
    page = download_page(target_url)
    links = extract_links(page)
    for link in links:
        # Resolve relative links against the base URL before printing
        print(urljoin(target_url, link))
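A quick offline sanity check of the regex-based extractor, run against a small hand-written HTML snippet (the snippet and its links are made up for illustration; the regex is the same one used in the answer):

```python
import re

# Same pattern as extract_links: capture the quoted href value of each <a> tag
link_regex = re.compile(r'<a[^>]+href=["\'](.*?)["\']', re.IGNORECASE)

html = '<a href="/about">About</a> <A class="ext" href=\'https://example.com\'>Ext</A>'
print(link_regex.findall(html))  # → ['/about', 'https://example.com']
```

Note that `re.IGNORECASE` lets the pattern match `<A ...>` as well as `<a ...>`; a regex like this is fine for an exercise, but a real parser such as `html.parser` is more robust against unquoted or malformed attributes.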
This example is based on the book: Website Scraping with Python - 2018.pdf
- Interview questions
1. What is the result of urljoin('https://china-testing.github.io/', 'test')?
2. What is the result of urljoin('https://china-testing.github.io/', 'https://www.google.com')?
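Both answers can be checked directly with the standard library (a minimal sketch using only `urllib.parse`):

```python
from urllib.parse import urljoin

base = 'https://china-testing.github.io/'

# 1. A relative path is resolved against the base URL
print(urljoin(base, 'test'))                    # https://china-testing.github.io/test

# 2. An absolute URL in the second argument replaces the base entirely
print(urljoin(base, 'https://www.google.com'))  # https://www.google.com
```

This second behavior is why the answer script can pass every extracted href through `urljoin`: relative links get resolved, and already-absolute links pass through unchanged.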