Basics:
1. URL (Uniform Resource Locator): the address of a standard resource on the Internet, commonly known as a "web address".
2. Python 3.x no longer has the urllib2 library; there is only the single urllib library.
3. URL encoding is also called percent-encoding.
4. Python 2.7's urllib2 corresponds to urllib.request in Python 3;
robotparser became a module inside the urllib package.
According to the official documentation, urllib is a package for working with URLs.
It contains four modules:
1. urllib.request is used to open and read URLs.
1.1. The urlopen function is the usual way to open a URL.
1.2. Building an opener with the build_opener function is the advanced way to open web pages.
2. urllib.error contains the exceptions raised while running urllib.request.
3. urllib.parse is used to parse URLs.
4. urllib.robotparser is used to parse robots.txt files.
I. Common functions in urllib.request
urllib.request.urlopen(url, data=None, [timeout,], cafile=None, capath=None, cadefault=False, context=None)
1. The urllib.request module uses HTTP/1.1 and includes a Connection: close header in its HTTP requests.
2. The optional timeout parameter specifies the timeout, in seconds, for blocking operations such as the connection attempt; if the connection has not been established after timeout seconds, a timeout exception is raised. If it is not set, the global default timeout is used.
3. For HTTP and HTTPS URLs, this function returns an http.client.HTTPResponse object (slightly modified), which has the following methods (a short usage sketch follows this list):
- The object is file-like, so the usual file methods can be used (read, readline, fileno, close).
- geturl(): returns the URL that was retrieved.
- getcode(): returns the HTTP status code of the response; 200 means the request succeeded, 404 means the resource was not found.
- info(): returns an http.client.HTTPMessage object holding the headers returned by the remote server.
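eg (a minimal sketch, assuming network access; http://example.com is just a placeholder URL):
>>> from urllib import request
>>> resp = request.urlopen('http://example.com', timeout=10)
>>> resp.getcode()  # HTTP status code
200
>>> resp.geturl()   # final URL after any redirects
'http://example.com'
>>> body = resp.read()  # raw bytes; decode them with the page's charset
>>> resp.info().get_content_charset()  # charset advertised in the Content-Type header
'utf-8'
>>> resp.close()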
II. Common functions in urllib.parse:
1. urllib.parse.urlparse(url, scheme='', allow_fragments=True):
- Parses a URL and splits it into 6 components.
- Returns a 6-element tuple (scheme, netloc, path, params, query, fragment); the result is a urllib.parse.ParseResult object
with attributes for each of those 6 components.
eg:
>>>from urllib import parse
>>>url = r'https://docs.python.org/3.5/search.html?q=parse&check_keywords=yes&area=default'
>>>parseResult= parse.urlparse(url)
>>>parseResult  # parse the address into its components
ParseResult(scheme='https', netloc='docs.python.org', path='/3.5/search.html', params='', query='q=parse&check_keywords=yes&area=default', fragment='')
>>>parseResult.query
'q=parse&check_keywords=yes&area=default'
The output makes the meaning of each component clear.
2. urllib.parse.urlunparse(components)
- The inverse of urlparse.
- Takes a 6-element tuple as input and outputs the complete URL; a round-trip sketch follows.
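eg (a minimal round-trip sketch, reusing parseResult from the urlparse example above):
>>> parse.urlunparse(parseResult)  # rebuild the original URL from its parts
'https://docs.python.org/3.5/search.html?q=parse&check_keywords=yes&area=default'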
3.urllib.parse.urljoin
urljoin(base, url, allow_fragments=True)
Join a base URL and a possibly relative URL to form an absolute
interpretation of the latter.
- base is the base URL.
- base is combined with the relative address in the second argument to form an absolute URL.
eg:
>>>scheme='http'
>>>netloc='www.python.org'
>>>path='lib/module-urlparse.html'
>>>modlist=('urllib','urllib2','httplib')
>>> unparsed_url=parse.urlunparse((scheme,netloc,path,'','',''))
>>> unparsed_url
'http://www.python.org/lib/module-urlparse.html'
>>> for mod in modlist:
...     url = parse.urljoin(unparsed_url, 'module-%s.html' % mod)
...     print(url)
# the replacement starts at the last "/" in the base URL's path
http://www.python.org/lib/module-urllib.html
http://www.python.org/lib/module-urllib2.html
http://www.python.org/lib/module-httplib.html
4. urllib.parse.parse_qs(qs, keep_blank_values=False, strict_parsing=False, encoding='utf-8', errors='replace'):
- Parses a query given as a string argument.
- qs: the percent-encoded query string of a GET request.
- Returns a dict of the query parameters.
eg:
Continuing from the urlparse example above:
>>> param_dict=parse.parse_qs(parseResult.query)
>>> param_dict
{'area': ['default'], 'check_keywords': ['yes'], 'q': ['parse']}
5. urlencode(query, doseq=False, safe='', encoding=None, errors=None, quote_via=quote_plus)
# merges the query pairs into a single string and percent-encodes it
>>> from urllib import parse
>>> query={'name':'walker','age':99}
>>> parse.urlencode(query)
'name=walker&age=99'
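When a value is itself a sequence, pass doseq=True to emit one parameter per element (a minimal sketch):
>>> parse.urlencode({'tag': ['python', 'urllib']}, doseq=True)
'tag=python&tag=urllib'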
Summary:
Items 1 and 2 handle the URL as a whole (splitting and joining).
Items 4 and 5 handle the query component of the URL.
6. urllib.parse.quote(string, safe='/', encoding=None, errors=None)
# percent-encodes a string
1. If a URL string contains Chinese characters, encode the Chinese part first (converting from gbk to utf-8 where needed),
then apply parse.quote() before using the URL; otherwise the encoding will break when you try to open it.
2. Likewise, to extract a Chinese field from a URL, first unquote it and then decode it, using gbk or utf-8 as the situation requires
(see the round-trip sketch after the unquote examples below).
eg:
>>>from urllib import parse
>>>parse.quote('a&b/c')  # the slash is not encoded
'a%26b/c'
>>>parse.quote_plus('a&b/c')  # the slash is encoded too
'a%26b%2Fc'
7. unquote(string, encoding='utf-8', errors='replace')
>>>parse.unquote('1+2')
'1+2'
>>> parse.unquote_plus('1+2')
'1 2'
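eg (a minimal round-trip sketch for non-ASCII text, assuming UTF-8, the Python 3 default for quote/unquote):
>>> encoded = parse.quote('关键词')  # percent-encode Chinese text as UTF-8
>>> encoded
'%E5%85%B3%E9%94%AE%E8%AF%8D'
>>> parse.unquote(encoded)  # decode back to the original string
'关键词'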
III. urllib.robotparser
Used to parse robots.txt files and check whether a given crawler is allowed to fetch a site.
eg:
>>>from urllib import robotparser
>>>rp=robotparser.RobotFileParser()
>>>rp.set_url('http://example.webscraping.com/robots.txt')  # point the parser at the robots.txt file
>>>rp.read()  # fetch and parse it
>>>url='http://example.webscraping.com'
>>>user_agent='GoodCrawler'
>>>rp.can_fetch(user_agent,url)
True
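A minimal sketch of a polite fetch that consults robots.txt before downloading, reusing the example.webscraping.com URLs above (assumed to be reachable):
>>> from urllib import request, robotparser
>>> agent = 'GoodCrawler'  # the example user-agent from above
>>> rp = robotparser.RobotFileParser()
>>> rp.set_url('http://example.webscraping.com/robots.txt')
>>> rp.read()
>>> url = 'http://example.webscraping.com'
>>> if rp.can_fetch(agent, url):  # fetch only if robots.txt allows this agent
...     req = request.Request(url, headers={'User-Agent': agent})
...     with request.urlopen(req, timeout=10) as resp:
...         html = resp.read()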
For details, see the function documentation below (pydoc output for urllib.parse):
FUNCTIONS
parse_qs(qs, keep_blank_values=False, strict_parsing=False, encoding='utf-8', errors='replace')
Parse a query given as a string argument.
Arguments:
qs: percent-encoded query string to be parsed
keep_blank_values: flag indicating whether blank values in
percent-encoded queries should be treated as blank strings.
A true value indicates that blanks should be retained as
blank strings. The default false value indicates that
blank values are to be ignored and treated as if they were
not included.
strict_parsing: flag indicating what to do with parsing errors.
If false (the default), errors are silently ignored.
If true, errors raise a ValueError exception.
encoding and errors: specify how to decode percent-encoded sequences
into Unicode characters, as accepted by the bytes.decode() method.
parse_qsl(qs, keep_blank_values=False, strict_parsing=False, encoding='utf-8', errors='replace')
Parse a query given as a string argument.
Arguments:
qs: percent-encoded query string to be parsed
keep_blank_values: flag indicating whether blank values in
percent-encoded queries should be treated as blank strings. A
true value indicates that blanks should be retained as blank
strings. The default false value indicates that blank values
are to be ignored and treated as if they were not included.
strict_parsing: flag indicating what to do with parsing errors. If
false (the default), errors are silently ignored. If true,
errors raise a ValueError exception.
encoding and errors: specify how to decode percent-encoded sequences
into Unicode characters, as accepted by the bytes.decode() method.
Returns a list, as G-d intended.
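A quick sketch of the practical difference: parse_qs returns a dict of lists, parse_qsl a list of (key, value) pairs in input order.
>>> from urllib import parse
>>> parse.parse_qs('a=1&a=2&b=3')
{'a': ['1', '2'], 'b': ['3']}
>>> parse.parse_qsl('a=1&a=2&b=3')
[('a', '1'), ('a', '2'), ('b', '3')]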
quote(string, safe='/', encoding=None, errors=None)
quote('abc def') -> 'abc%20def'
Each part of a URL, e.g. the path info, the query, etc., has a
different set of reserved characters that must be quoted.
RFC 2396 Uniform Resource Identifiers (URI): Generic Syntax lists
the following reserved characters.
reserved = ";" | "/" | "?" | ":" | "@" | "&" | "=" | "+" |
"$" | ","
Each of these characters is reserved in some component of a URL,
but not necessarily in all of them.
By default, the quote function is intended for quoting the path
section of a URL. Thus, it will not encode '/'. This character
is reserved, but in typical usage the quote function is being
called on a path where the existing slash characters are used as
reserved characters.
string and safe may be either str or bytes objects. encoding and errors
must not be specified if string is a bytes object.
The optional encoding and errors parameters specify how to deal with
non-ASCII characters, as accepted by the str.encode method.
By default, encoding='utf-8' (characters are encoded with UTF-8), and
errors='strict' (unsupported characters raise a UnicodeEncodeError).
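A short sketch of the safe parameter described above:
>>> parse.quote('/path with spaces/')  # '/' is kept by default
'/path%20with%20spaces/'
>>> parse.quote('/path with spaces/', safe='')  # encode '/' as well
'%2Fpath%20with%20spaces%2F'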
quote_from_bytes(bs, safe='/')
Like quote(), but accepts a bytes object rather than a str, and does
not perform string-to-bytes encoding. It always returns an ASCII string.
quote_from_bytes(b'abc def?') -> 'abc%20def%3f'
quote_plus(string, safe='', encoding=None, errors=None)
Like quote(), but also replace ' ' with '+', as required for quoting
HTML form values. Plus signs in the original string are escaped unless
they are included in safe. It also does not have safe default to '/'.
unquote(string, encoding='utf-8', errors='replace')
Replace %xx escapes by their single-character equivalent. The optional
encoding and errors parameters specify how to decode percent-encoded
sequences into Unicode characters, as accepted by the bytes.decode()
method.
By default, percent-encoded sequences are decoded with UTF-8, and invalid
sequences are replaced by a placeholder character.
unquote('abc%20def') -> 'abc def'.
unquote_plus(string, encoding='utf-8', errors='replace')
Like unquote(), but also replace plus signs by spaces, as required for
unquoting HTML form values.
unquote_plus('%7e/abc+def') -> '~/abc def'
unquote_to_bytes(string)
unquote_to_bytes('abc%20def') -> b'abc def'.
urldefrag(url)
Removes any existing fragment from URL.
Returns a tuple of the defragmented URL and the fragment. If
the URL contained no fragments, the second element is the
empty string.
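A one-line sketch:
>>> parse.urldefrag('https://docs.python.org/3.5/search.html#results')
DefragResult(url='https://docs.python.org/3.5/search.html', fragment='results')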
urlencode(query, doseq=False, safe='', encoding=None, errors=None, quote_via=quote_plus)
Encode a dict or sequence of two-element tuples into a URL query string.
If any values in the query arg are sequences and doseq is true, each
sequence element is converted to a separate parameter.
If the query arg is a sequence of two-element tuples, the order of the
parameters in the output will match the order of parameters in the
input.
The components of a query arg may each be either a string or a bytes type.
The safe, encoding, and errors parameters are passed down to the function
specified by quote_via (encoding and errors only if a component is a str).
urljoin(base, url, allow_fragments=True)
Join a base URL and a possibly relative URL to form an absolute
interpretation of the latter.
urlparse(url, scheme='', allow_fragments=True)
Parse a URL into 6 components:
<scheme>://<netloc>/<path>;<params>?<query>#<fragment>
Return a 6-tuple: (scheme, netloc, path, params, query, fragment).
Note that we don't break the components up in smaller bits
(e.g. netloc is a single string) and we don't expand % escapes.
urlsplit(url, scheme='', allow_fragments=True)
Parse a URL into 5 components:
<scheme>://<netloc>/<path>?<query>#<fragment>
Return a 5-tuple: (scheme, netloc, path, query, fragment).
Note that we don't break the components up in smaller bits
(e.g. netloc is a single string) and we don't expand % escapes.
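A sketch of the difference: urlparse splits ;params off the path, urlsplit leaves them in it.
>>> parse.urlparse('http://host/path;key=val?q=1')
ParseResult(scheme='http', netloc='host', path='/path', params='key=val', query='q=1', fragment='')
>>> parse.urlsplit('http://host/path;key=val?q=1')
SplitResult(scheme='http', netloc='host', path='/path;key=val', query='q=1', fragment='')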
urlunparse(components)
Put a parsed URL back together again. This may result in a
slightly different, but equivalent URL, if the URL that was parsed
originally had redundant delimiters, e.g. a ? with an empty query
(the draft states that these are equivalent).
urlunsplit(components)
Combine the elements of a tuple as returned by urlsplit() into a
complete URL as a string. The data argument can be any five-item iterable.
This may result in a slightly different, but equivalent URL, if the URL that
was parsed originally had unnecessary delimiters (for example, a ? with an
empty query; the RFC states that these are equivalent).