Introduction to urllib (Python 2.7)

https://docs.python.org/2/library/urllib.html

20.5. urllib: Open arbitrary resources by URL

Note

The urllib module has been split into parts and renamed in Python 3 to urllib.request, urllib.parse, and urllib.error. The 2to3 tool will automatically adapt imports when converting your sources to Python 3. Also note that the urllib.request.urlopen() function in Python 3 is equivalent to urllib2.urlopen() and that urllib.urlopen() has been removed.

This module provides a high-level interface for fetching data across the World Wide Web. In particular, the urlopen() function is similar to the built-in function open(), but accepts Universal Resource Locators (URLs) instead of filenames. Some restrictions apply: it can only open URLs for reading, and no seek operations are available.

See also

The Requests package is recommended for a higher-level HTTP client interface.

Changed in version 2.7.9: For HTTPS URIs, urllib performs all the necessary certificate and hostname checks by default.

Warning

For Python versions earlier than 2.7.9, urllib does not attempt to validate the server certificates of HTTPS URIs. Use at your own risk!

20.5.1. High-level interface

urllib.urlopen(url[, data[, proxies[, context]]])

Open a network object denoted by a URL for reading. If the URL does not have a scheme identifier, or if it has file: as its scheme identifier, this opens a local file (without universal newlines); otherwise it opens a socket to a server somewhere on the network. If the connection cannot be made the IOError exception is raised. If all went well, a file-like object is returned. This supports the following methods: read(), readline(), readlines(), fileno(), close(), info(), getcode() and geturl(). It also has proper support for the iterator protocol. One caveat: the read() method, if the size argument is omitted or negative, may not read until the end of the data stream; there is no good way to determine that the entire stream from a socket has been read in the general case.

Except for the info(), getcode() and geturl() methods, these methods have the same interface as for file objects; see section File Objects in this manual. (It is not a built-in file object, however, so it can’t be used at those few places where a true built-in file object is required.)

The info() method returns an instance of the class mimetools.Message containing meta-information associated with the URL. When the method is HTTP, these headers are those returned by the server at the head of the retrieved HTML page (including Content-Length and Content-Type). When the method is FTP, a Content-Length header will be present if (as is now usual) the server passed back a file length in response to the FTP retrieval request. A Content-Type header will be present if the MIME type can be guessed. When the method is local-file, returned headers will include a Date representing the file’s last-modified time, a Content-Length giving file size, and a Content-Type containing a guess at the file’s type. See also the description of the mimetools module.
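The local-file case can be tried without a network connection. The following is a minimal sketch, not part of the original documentation: the helper name read_local and the temporary demo file are made up for illustration, and the try/except import is an assumption that lets the same code run on Python 2 (urllib) and on Python 3 (urllib.request).

```python
import contextlib
import os
import tempfile

try:
    from urllib import urlopen, pathname2url          # Python 2
except ImportError:
    from urllib.request import urlopen, pathname2url  # Python 3


def read_local(path):
    """Open a local file through a file: URL and return (body, headers)."""
    url = "file:" + pathname2url(os.path.abspath(path))
    # closing() is used because the returned object is only file-like.
    with contextlib.closing(urlopen(url)) as response:
        return response.read(), response.info()


# Hypothetical demo file, created only for this sketch.
handle, demo_path = tempfile.mkstemp(suffix=".txt")
os.write(handle, b"hello")
os.close(handle)
try:
    body, headers = read_local(demo_path)
finally:
    os.remove(demo_path)

print(body)
print(headers["Content-Length"])
print(headers["Content-Type"])
```

The headers object is indexed by header name here; for a remote HTTP URL the same calls would return whatever the server sent.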

The geturl() method returns the real URL of the page. In some cases, the HTTP server redirects a client to another URL. The urlopen() function handles this transparently, but in some cases the caller needs to know which URL the client was redirected to. The geturl() method can be used to get at this redirected URL.

The getcode() method returns the HTTP status code that was sent with the response, or None if the URL is no HTTP URL.

If the url uses the http: scheme identifier, the optional data argument may be given to specify a POST request (normally the request type is GET). The data argument must be in standard application/x-www-form-urlencoded format; see the urlencode() function below.

The urlopen() function works transparently with proxies which do not require authentication. In a Unix or Windows environment, set the http_proxy, or ftp_proxy environment variables to a URL that identifies the proxy server before starting the Python interpreter. For example (the '%' is the command prompt):

    % http_proxy="http://www.someproxy.com:3128"
    % export http_proxy
    % python
    ...

The no_proxy environment variable can be used to specify hosts which shouldn’t be reached via proxy; if set, it should be a comma-separated list of hostname suffixes, optionally with :port appended, for example cern.ch,ncsa.uiuc.edu,some.host:8080.

In a Windows environment, if no proxy environment variables are set, proxy settings are obtained from the registry’s Internet Settings section.

In a Mac OS X environment, urlopen() will retrieve proxy information from the OS X System Configuration Framework, which can be managed with the Network System Preferences panel.

Alternatively, the optional proxies argument may be used to explicitly specify proxies. It must be a dictionary mapping scheme names to proxy URLs, where an empty dictionary causes no proxies to be used, and None (the default value) causes environmental proxy settings to be used as discussed above. For example:

    # Use http://www.someproxy.com:3128 for HTTP proxying
    proxies = {'http': 'http://www.someproxy.com:3128'}
    filehandle = urllib.urlopen(some_url, proxies=proxies)
    # Don't use any proxies
    filehandle = urllib.urlopen(some_url, proxies={})
    # Use proxies from environment - both versions are equivalent
    filehandle = urllib.urlopen(some_url, proxies=None)
    filehandle = urllib.urlopen(some_url)

Proxies which require authentication for use are not currently supported; this is considered an implementation limitation.

The context parameter may be set to a ssl.SSLContext instance to configure the SSL settings that are used if urlopen() makes an HTTPS connection.

Changed in version 2.3: Added the proxies support.

Changed in version 2.6: Added getcode() to returned object and support for the no_proxy environment variable.

Changed in version 2.7.9: The context parameter was added. All the necessary certificate and hostname checks are done by default.

Deprecated since version 2.6: The urlopen() function has been removed in Python 3 in favor of urllib2.urlopen().

urllib.urlretrieve(url[, filename[, reporthook[, data]]])

Copy a network object denoted by a URL to a local file, if necessary. If the URL points to a local file, or a valid cached copy of the object exists, the object is not copied. Return a tuple (filename, headers) where filename is the local file name under which the object can be found, and headers is whatever the info() method of the object returned by urlopen() returned (for a remote object, possibly cached). Exceptions are the same as for urlopen().

The second argument, if present, specifies the file location to copy to (if absent, the location will be a tempfile with a generated name). The third argument, if present, is a hook function that will be called once on establishment of the network connection and once after each block read thereafter. The hook will be passed three arguments; a count of blocks transferred so far, a block size in bytes, and the total size of the file. The third argument may be -1 on older FTP servers which do not return a file size in response to a retrieval request.
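A hook with that three-argument shape might look like the sketch below. The function names percent_done and print_progress are hypothetical; the arithmetic assumes only what the paragraph above states, namely that the hook receives a block count, a block size in bytes, and a total size that may be -1.

```python
def percent_done(block_count, block_size, total_size):
    """Return progress as a percentage, or None when the total size is unknown."""
    if total_size <= 0:
        # -1 means the server did not report a file size.
        return None
    # The last block is usually partial, so cap the estimate at the total.
    received = min(block_count * block_size, total_size)
    return 100.0 * received / total_size


def print_progress(block_count, block_size, total_size):
    """A callable with the reporthook signature that prints progress."""
    percent = percent_done(block_count, block_size, total_size)
    if percent is None:
        print("%d blocks read, total size unknown" % block_count)
    else:
        print("%.1f%% done" % percent)


# Simulate the calls a 16384-byte download in 8192-byte blocks would make.
for count in range(3):
    print_progress(count, 8192, 16384)
```

Such a function would be passed as the third argument, for example urlretrieve(url, filename, print_progress).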

If the url uses the http: scheme identifier, the optional data argument may be given to specify a POST request (normally the request type is GET). The data argument must be in standard application/x-www-form-urlencoded format; see the urlencode() function below.

Changed in version 2.5: urlretrieve() will raise ContentTooShortError when it detects that the amount of data available was less than the expected amount (which is the size reported by a Content-Length header). This can occur, for example, when the download is interrupted.

The Content-Length is treated as a lower bound: if there’s more data to read, urlretrieve() reads more data, but if less data is available, it raises the exception.

You can still retrieve the downloaded data in this case; it is stored in the content attribute of the exception instance.

If no Content-Length header was supplied, urlretrieve() can not check the size of the data it has downloaded, and just returns it. In this case you just have to assume that the download was successful.
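One way to act on a short download is sketched below. The wrapper retrieve_or_partial and the stand-in fake_short_retrieve are hypothetical names used only for illustration; the stand-in raises the exception by hand so the sketch runs without a network, and the import fallback assumes the Python 3 location urllib.error.

```python
try:
    from urllib import ContentTooShortError        # Python 2
except ImportError:
    from urllib.error import ContentTooShortError  # Python 3


def retrieve_or_partial(retrieve, url):
    """Call retrieve(url); on a short read, fall back to the partial content.

    Returns (result, True) on success or (exception.content, False) when
    ContentTooShortError is raised.
    """
    try:
        return retrieve(url), True
    except ContentTooShortError as error:
        return error.content, False


def fake_short_retrieve(url):
    """Hypothetical stand-in for urlretrieve() that always comes up short."""
    raise ContentTooShortError("retrieval incomplete", "partial data")


result, complete = retrieve_or_partial(fake_short_retrieve, "http://example.invalid/")
print(result)
print(complete)
```

In real use the first argument would be urlretrieve itself, and the caller decides whether the partial content is worth keeping.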

urllib._urlopener

The public functions urlopen() and urlretrieve() create an instance of the FancyURLopener class and use it to perform their requested actions. To override this functionality, programmers can create a subclass of URLopener or FancyURLopener, then assign an instance of that class to the urllib._urlopener variable before calling the desired function. For example, applications may want to specify a different User-Agent header than URLopener defines. This can be accomplished with the following code:

    import urllib

    class AppURLopener(urllib.FancyURLopener):
        version = "App/1.7"

    urllib._urlopener = AppURLopener()

urllib.urlcleanup()

Clear the cache that may have been built up by previous calls to urlretrieve().

20.5.2. Utility functions

urllib.quote(string[, safe])

Replace special characters in string using the %xx escape. Letters, digits, and the characters '_.-' are never quoted. By default, this function is intended for quoting the path section of the URL. The optional safe parameter specifies additional characters that should not be quoted; its default value is '/'.

Example: quote('/~connolly/') yields '/%7econnolly/'.
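The effect of the safe parameter can be seen in a short sketch. The sample strings are made up, and the import fallback is an assumption so that the snippet also runs on Python 3, where quote() lives in urllib.parse.

```python
try:
    from urllib import quote        # Python 2
except ImportError:
    from urllib.parse import quote  # Python 3

# The space is escaped; '/' survives because it is the default value of safe.
path = quote("/docs/my file.txt")

# With safe='' the slash is escaped too, which suits a single path segment.
segment = quote("a/b c", safe="")

print(path)
print(segment)
```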

urllib.quote_plus(string[, safe])

Like quote(), but also replaces spaces by plus signs, as required for quoting HTML form values when building up a query string to go into a URL. Plus signs in the original string are escaped unless they are included in safe. It also does not have safe default to '/'.

urllib.unquote(string)

Replace %xx escapes by their single-character equivalent.

Example: unquote('/%7Econnolly/') yields '/~connolly/'.

urllib.unquote_plus(string)

Like unquote(), but also replaces plus signs by spaces, as required for unquoting HTML form values.
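The difference between the plain and the plus variants shows up in a round trip. This is a small sketch with a made-up form value; the import fallback is an assumption for running it on Python 3.

```python
try:
    from urllib import quote_plus, unquote, unquote_plus        # Python 2
except ImportError:
    from urllib.parse import quote_plus, unquote, unquote_plus  # Python 3

original = "fish & chips + tea"

# Spaces become '+', while the literal '&' and '+' are percent-escaped.
encoded = quote_plus(original)

# unquote_plus() reverses both steps; unquote() leaves the '+' signs alone.
decoded_plus = unquote_plus(encoded)
decoded_plain = unquote(encoded)

print(encoded)
print(decoded_plus)
print(decoded_plain)
```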

urllib.urlencode(query[, doseq])

Convert a mapping object or a sequence of two-element tuples to a “percent-encoded” string, suitable to pass to urlopen() above as the optional data argument. This is useful to pass a dictionary of form fields to a POST request. The resulting string is a series of key=value pairs separated by '&' characters, where both key and value are quoted using quote_plus() above. When a sequence of two-element tuples is used as the query argument, the first element of each tuple is a key and the second is a value. The value element in itself can be a sequence and in that case, if the optional parameter doseq evaluates to True, individual key=value pairs separated by '&' are generated for each element of the value sequence for the key. The order of parameters in the encoded string will match the order of parameter tuples in the sequence. The urlparse module provides the functions parse_qs() and parse_qsl() which are used to parse query strings into Python data structures.
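A sequence value combined with doseq can be illustrated as follows. The field names are invented for the example, and the import fallback assumes the Python 3 layout in which both functions live in urllib.parse.

```python
try:
    from urllib import urlencode                  # Python 2
    from urlparse import parse_qs
except ImportError:
    from urllib.parse import urlencode, parse_qs  # Python 3

# A sequence of two-element tuples keeps the parameter order predictable.
pairs = [("tag", ["python", "web"]), ("page", 2)]

# With doseq true, each element of the list becomes its own key=value pair.
encoded = urlencode(pairs, doseq=True)

# parse_qs() turns the query string back into a dict of lists of strings.
decoded = parse_qs(encoded)

print(encoded)
print(decoded)
```

Without doseq, the list would be converted with str() and quoted as a single value, which is rarely what a form handler expects.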

urllib.pathname2url(path)

Convert the pathname path from the local syntax for a path to the form used in the path component of a URL. This does not produce a complete URL. The return value will already be quoted using the quote() function.

urllib.url2pathname(path)

Convert the path component path from a percent-encoded URL to the local syntax for a path. This does not accept a complete URL. This function uses unquote() to decode path.
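The two conversions are inverses of each other, as this sketch with a made-up relative path suggests. The import fallback is an assumption for Python 3, where both functions are found in urllib.request.

```python
import os

try:
    from urllib import pathname2url, url2pathname          # Python 2
except ImportError:
    from urllib.request import pathname2url, url2pathname  # Python 3

# Build the path with os.path.join so it uses the local separator.
local_path = os.path.join("docs", "my file.txt")

# Local syntax -> URL path component (quoted, forward slashes).
url_path = pathname2url(local_path)

# URL path component -> local syntax again.
round_trip = url2pathname(url_path)

print(url_path)
print(round_trip)
```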

urllib.getproxies()

This helper function returns a dictionary of scheme to proxy server URL mappings. It scans the environment for variables named <scheme>_proxy, in a case insensitive way, for all operating systems first, and when it cannot find it, looks for proxy information from Mac OS X System Configuration for Mac OS X and Windows Systems Registry for Windows. If both lowercase and uppercase environment variables exist (and disagree), lowercase is preferred.
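The environment scan can be observed by setting a variable temporarily. This is a sketch only: the proxy URL is a placeholder, and the import fallback is an assumption for Python 3, where the function is urllib.request.getproxies().

```python
import os

try:
    from urllib import getproxies          # Python 2
except ImportError:
    from urllib.request import getproxies  # Python 3

# Remember any existing setting so the demo leaves the environment unchanged.
saved = os.environ.get("http_proxy")
os.environ["http_proxy"] = "http://proxy.example.com:3128"
try:
    proxies = getproxies()
finally:
    if saved is None:
        del os.environ["http_proxy"]
    else:
        os.environ["http_proxy"] = saved

# The scheme name is the part of the variable name before "_proxy".
print(proxies.get("http"))
```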

Note

If the environment variable REQUEST_METHOD is set, which usually indicates your script is running in a CGI environment, the environment variable HTTP_PROXY (uppercase _PROXY) will be ignored. This is because that variable can be injected by a client using the “Proxy:” HTTP header. If you need to use an HTTP proxy in a CGI environment, either use ProxyHandler explicitly, or make sure the variable name is in lowercase (or at least the _proxy suffix).

Note

urllib also exposes certain utility functions like splittype, splithost and others parsing URL into various components. But it is recommended to use urlparse for parsing URLs rather than using these functions directly. Python 3 does not expose these helper functions from urllib.parse module.
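As a rough illustration of the recommended route, the sketch below splits a made-up URL with urlparse(). The import fallback assumes the Python 3 location urllib.parse.

```python
try:
    from urlparse import urlparse      # Python 2
except ImportError:
    from urllib.parse import urlparse  # Python 3

parts = urlparse("http://www.example.com:8080/cgi-bin/query?spam=1&eggs=2")

# The result exposes each component as a named attribute.
print(parts.scheme)
print(parts.netloc)
print(parts.hostname)
print(parts.port)
print(parts.path)
print(parts.query)
```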

20.5.3. URL Opener objects

class urllib.URLopener([proxies[, context[, **x509]]])

Base class for opening and reading URLs. Unless you need to support opening objects using schemes other than http:, ftp:, or file:, you probably want to use FancyURLopener.

By default, the URLopener class sends a User-Agent header of urllib/VVV, where VVV is the urllib version number. Applications can define their own User-Agent header by subclassing URLopener or FancyURLopener and setting the class attribute version to an appropriate string value in the subclass definition.

The optional proxies parameter should be a dictionary mapping scheme names to proxy URLs, where an empty dictionary turns proxies off completely. Its default value is None, in which case environmental proxy settings will be used if present, as discussed in the definition of urlopen(), above.

The context parameter may be a ssl.SSLContext instance. If given, it defines the SSL settings the opener uses to make HTTPS connections.

Additional keyword parameters, collected in x509, may be used for authentication of the client when using the https: scheme. The keywords key_file and cert_file are supported to provide an SSL key and certificate; both are needed to support client authentication.

URLopener objects will raise an IOError exception if the server returns an error code.

open(fullurl[, data])

Open fullurl using the appropriate protocol. This method sets up cache and proxy information, then calls the appropriate open method with its input arguments. If the scheme is not recognized, open_unknown() is called. The data argument has the same meaning as the data argument of urlopen().

open_unknown(fullurl[, data])

Overridable interface to open unknown URL types.

retrieve(url[, filename[, reporthook[, data]]])

Retrieves the contents of url and places it in filename. The return value is a tuple consisting of a local filename and either a mimetools.Message object containing the response headers (for remote URLs) or None (for local URLs). The caller must then open and read the contents of filename. If filename is not given and the URL refers to a local file, the input filename is returned. If the URL is non-local and filename is not given, the filename is the output of tempfile.mktemp() with a suffix that matches the suffix of the last path component of the input URL. If reporthook is given, it must be a function accepting three numeric parameters. It will be called after each chunk of data is read from the network. reporthook is ignored for local URLs.

If the url uses the http: scheme identifier, the optional data argument may be given to specify a POST request (normally the request type is GET). The data argument must be in standard application/x-www-form-urlencoded format; see the urlencode() function below.

version

Variable that specifies the user agent of the opener object. To get urllib to tell servers that it is a particular user agent, set this in a subclass as a class variable or in the constructor before calling the base constructor.

class urllib.FancyURLopener(...)

FancyURLopener subclasses URLopener providing default handling for the following HTTP response codes: 301, 302, 303, 307 and 401. For the 30x response codes listed above, the Location header is used to fetch the actual URL. For 401 response codes (authentication required), basic HTTP authentication is performed. For the 30x response codes, recursion is bounded by the value of the maxtries attribute, which defaults to 10.

For all other response codes, the method http_error_default() is called which you can override in subclasses to handle the error appropriately.

Note

According to the letter of RFC 2616, 301 and 302 responses to POST requests must not be automatically redirected without confirmation by the user. In reality, browsers do allow automatic redirection of these responses, changing the POST to a GET, and urllib reproduces this behaviour.

The parameters to the constructor are the same as those for URLopener.

Note

When performing basic authentication, a FancyURLopener instance calls its prompt_user_passwd() method. The default implementation asks the users for the required information on the controlling terminal. A subclass may override this method to support more appropriate behavior if needed.

The FancyURLopener class offers one additional method that should be overloaded to provide the appropriate behavior:

prompt_user_passwd(host, realm)

Return information needed to authenticate the user at the given host in the specified security realm. The return value should be a tuple, (user, password), which can be used for basic authentication.

The implementation prompts for this information on the terminal; an application should override this method to use an appropriate interaction model in the local environment.

exception urllib.ContentTooShortError(msg[, content])

This exception is raised when the urlretrieve() function detects that the amount of the downloaded data is less than the expected amount (given by the Content-Length header). The content attribute stores the downloaded (and supposedly truncated) data.

New in version 2.5.

20.5.4. urllib Restrictions

Currently, only the following protocols are supported: HTTP (versions 0.9 and 1.0), FTP, and local files.

The caching feature of urlretrieve() has been disabled until I find the time to hack proper processing of Expiration time headers.

There should be a function to query whether a particular URL is in the cache.

For backward compatibility, if a URL appears to point to a local file but the file can’t be opened, the URL is re-interpreted using the FTP protocol. This can sometimes cause confusing error messages.

The urlopen() and urlretrieve() functions can cause arbitrarily long delays while waiting for a network connection to be set up. This means that it is difficult to build an interactive Web client using these functions without using threads.

The data returned by urlopen() or urlretrieve() is the raw data returned by the server. This may be binary data (such as an image), plain text or (for example) HTML. The HTTP protocol provides type information in the reply header, which can be inspected by looking at the Content-Type header. If the returned data is HTML, you can use the module htmllib to parse it.

The code handling the FTP protocol cannot differentiate between a file and a directory. This can lead to unexpected behavior when attempting to read a URL that points to a file that is not accessible. If the URL ends in a /, it is assumed to refer to a directory and will be handled accordingly. But if an attempt to read a file leads to a 550 error (meaning the URL cannot be found or is not accessible, often for permission reasons), then the path is treated as a directory in order to handle the case when a directory is specified by a URL but the trailing / has been left off. This can cause misleading results when you try to fetch a file whose read permissions make it inaccessible; the FTP code will try to read it, fail with a 550 error, and then perform a directory listing for the unreadable file. If fine-grained control is needed, consider using the ftplib module, subclassing FancyURLopener, or changing _urlopener to meet your needs.

This module does not support the use of proxies which require authentication. This may be implemented in the future.

Although the urllib module contains (undocumented) routines to parse and unparse URL strings, the recommended interface for URL manipulation is in module urlparse.

20.5.5. Examples

Here is an example session that uses the GET method to retrieve a URL containing parameters:

    >>> import urllib
    >>> params = urllib.urlencode({'spam': 1, 'eggs': 2, 'bacon': 0})
    >>> f = urllib.urlopen("http://www.musi-cal.com/cgi-bin/query?%s" % params)
    >>> print f.read()

The following example uses the POST method instead:

    >>> import urllib
    >>> params = urllib.urlencode({'spam': 1, 'eggs': 2, 'bacon': 0})
    >>> f = urllib.urlopen("http://www.musi-cal.com/cgi-bin/query", params)
    >>> print f.read()

The following example uses an explicitly specified HTTP proxy, overriding environment settings:

    >>> import urllib
    >>> proxies = {'http': 'http://proxy.example.com:8080/'}
    >>> opener = urllib.FancyURLopener(proxies)
    >>> f = opener.open("http://www.python.org")
    >>> f.read()

The following example uses no proxies at all, overriding environment settings:

    >>> import urllib
    >>> opener = urllib.FancyURLopener({})
    >>> f = opener.open("http://www.python.org/")
    >>> f.read()
