======================================================
urllib -- simple interface for network resource access
======================================================

.. module:: urllib
    :synopsis: Accessing remote resources that don't need authentication, cookies, etc.

:Purpose: Accessing remote resources that don't need authentication, cookies, etc.
:Available In: 1.4 and later

The :mod:`urllib` module provides a simple interface for network
resource access. Although :mod:`urllib` can be used with gopher and
ftp, these examples all use http.

HTTP GET
========

.. note::

    The test server for these examples is in BaseHTTPServer_GET.py,
    from the PyMOTW examples for :mod:`BaseHTTPServer`. Start the
    server in one terminal window, then run these examples in another.

An HTTP GET operation is the simplest use of urllib. Simply pass the
URL to :func:`urlopen()` to get a "file-like" handle to the remote
data.

.. include:: urllib_urlopen.py
    :literal:
    :start-after: #end_pymotw_header

The example server takes the incoming values and formats a plain text
response to send back. The return value from :func:`urlopen()` gives
access to the headers from the HTTP server through the :func:`info()`
method, and the data for the remote resource via methods like
:func:`read()` and :func:`readlines()`.

::

    $ python urllib_urlopen.py
    RESPONSE: >
    URL : http://localhost:8080/
    DATE : Sun, 30 Mar 2008 16:27:10 GMT
    HEADERS :
    ---------
    Server: BaseHTTP/0.3 Python/2.5.1
    Date: Sun, 30 Mar 2008 16:27:10 GMT
    LENGTH : 221
    DATA :
    ---------
    CLIENT VALUES:
    client_address=('127.0.0.1', 54354) (localhost)
    command=GET
    path=/
    real path=/
    query=
    request_version=HTTP/1.0
    SERVER VALUES:
    server_version=BaseHTTP/0.3
    sys_version=Python/2.5.1
    protocol_version=HTTP/1.0

The file-like object is also iterable:

.. include:: urllib_urlopen_iterator.py
    :literal:
    :start-after: #end_pymotw_header

Since the lines are returned with newlines and carriage returns
intact, this example strips them before printing the output.
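
As a rough, server-free illustration of iterating over the handle, here is a minimal sketch rather than the article's own script. It assumes a temporary local file read through a ``file:`` URL in place of the test server, and the helper name ``read_stripped_lines()`` is made up for this example. The import fallback is there because on Python 3 :func:`urlopen()` lives in ``urllib.request`` instead of :mod:`urllib`.

```python
import os
import tempfile

try:
    from urllib.request import urlopen, pathname2url  # Python 3
except ImportError:
    from urllib import urlopen, pathname2url  # Python 2, as in this article


def read_stripped_lines(url):
    """Iterate over the handle from urlopen() and strip line endings.

    Hypothetical helper, not part of urllib.
    """
    response = urlopen(url)
    try:
        # Each item still ends with its newline (and carriage return, if
        # the data had one), so rstrip() removes them.
        return [line.decode('utf-8').rstrip() for line in response]
    finally:
        response.close()


# Assumption: a temporary local file stands in for the test server.
fd, path = tempfile.mkstemp()
try:
    with os.fdopen(fd, 'wb') as f:
        f.write(b'command=GET\r\npath=/\r\n')
    url = 'file:' + pathname2url(os.path.abspath(path))
    lines = read_stripped_lines(url)
finally:
    os.remove(path)

for line in lines:
    print(line)
```

Stripping only the right-hand side of each line removes the line endings while leaving any leading whitespace in the data alone.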
::

    $ python urllib_urlopen_iterator.py
    CLIENT VALUES:
    client_address=('127.0.0.1', 54380) (localhost)
    command=GET
    path=/
    real path=/
    query=
    request_version=HTTP/1.0
    SERVER VALUES:
    server_version=BaseHTTP/0.3
    sys_version=Python/2.5.1
    protocol_version=HTTP/1.0

.. _urllib-urlencode:

Encoding Arguments
==================

Arguments can be passed to the server by encoding them and appending
them to the URL.

.. include:: urllib_urlencode.py
    :literal:
    :start-after: #end_pymotw_header

Notice that the query, in the list of client values, contains the
encoded query arguments.

::

    $ python urllib_urlencode.py
    Encoded: q=query+string&foo=bar
    CLIENT VALUES:
    client_address=('127.0.0.1', 54415) (localhost)
    command=GET
    path=/?q=query+string&foo=bar
    real path=/
    query=q=query+string&foo=bar
    request_version=HTTP/1.0
    SERVER VALUES:
    server_version=BaseHTTP/0.3
    sys_version=Python/2.5.1
    protocol_version=HTTP/1.0

To pass a sequence of values using separate occurrences of the
variable in the query string, set *doseq* to ``True`` when calling
:func:`urlencode()`.

.. include:: urllib_urlencode_doseq.py
    :literal:
    :start-after: #end_pymotw_header

::

    $ python urllib_urlencode_doseq.py
    Single : foo=%5B%27foo1%27%2C+%27foo2%27%5D
    Sequence: foo=foo1&foo=foo2

To decode the query string, see the :class:`FieldStorage` class from
the :mod:`cgi` module.

Special characters within the query arguments that might cause parse
problems with the URL on the server side are "quoted" when passed to
:func:`urlencode()`. To quote them locally to make safe versions of
the strings, you can use the :func:`quote()` or :func:`quote_plus()`
functions directly.

.. include:: urllib_quote.py
    :literal:
    :start-after: #end_pymotw_header

Notice that :func:`quote_plus()` is more aggressive about the
characters it replaces.

.. {{{cog
.. cog.out(run_script(cog.inFile, 'urllib_quote.py'))
.. }}}

::

    $ python urllib_quote.py
    urlencode() : url=http%3A%2F%2Flocalhost%3A8080%2F%7Edhellmann%2F
    quote() : http%3A//localhost%3A8080/%7Edhellmann/
    quote_plus(): http%3A%2F%2Flocalhost%3A8080%2F%7Edhellmann%2F

.. {{{end}}}

To reverse the quote operations, use :func:`unquote()` or
:func:`unquote_plus()`, as appropriate.

.. include:: urllib_unquote.py
    :literal:
    :start-after: #end_pymotw_header

.. {{{cog
.. cog.out(run_script(cog.inFile, 'urllib_unquote.py'))
.. }}}

::

    $ python urllib_unquote.py
    http://localhost:8080/~dhellmann/
    http://localhost:8080/~dhellmann/

.. {{{end}}}

HTTP POST
=========

.. note::

    The test server for these examples is in BaseHTTPServer_POST.py,
    from the PyMOTW examples for :mod:`BaseHTTPServer`. Start the
    server in one terminal window, then run these examples in another.

To send data to the remote server with POST rather than GET, pass the
encoded query arguments to :func:`urlopen()` as data instead of
appending them to the URL.

.. include:: urllib_urlopen_post.py
    :literal:
    :start-after: #end_pymotw_header

::

    $ python urllib_urlopen_post.py
    Client: ('127.0.0.1', 54545)
    Path: /
    Form data:
    q=query string
    foo=bar

You can send any byte-string as data, in case the server expects
something other than url-encoded form arguments in the posted data.

Paths vs. URLs
==============

Some operating systems separate the components of local file paths
with different characters than URLs use. To make your code portable,
you should use the functions :func:`pathname2url()` and
:func:`url2pathname()` to convert back and forth. Since I am working
on a Mac, I have to explicitly import the Windows versions of the
functions. Using the versions of the functions exported by
:mod:`urllib` gives you the correct defaults for your platform, so you
do not need to do this.

.. include:: urllib_pathnames.py
    :literal:
    :start-after: #end_pymotw_header

There are two Windows examples, with and without a drive letter at
the start of the path.

.. {{{cog
.. cog.out(run_script(cog.inFile, 'urllib_pathnames.py'))
.. }}}

::

    $ python urllib_pathnames.py
    == Default ==
    Original: /a/b/c
    URL : /a/b/c
    Path : /d/e/f
    == Windows, without drive letter ==
    Original: \a\b\c
    URL : /a/b/c
    Path : \d\e\f
    == Windows, with drive letter ==
    Original: C:\\a\b\c
    URL : ///C:/a/b/c
    Path : \d\e\f

.. {{{end}}}

Simple Retrieval with Cache
===========================

Retrieving data is a common operation, and :mod:`urllib` includes the
:func:`urlretrieve()` function so you don't have to write your own.
:func:`urlretrieve()` takes arguments for the URL, a temporary file to
hold the data, a function to report on download progress, and data to
pass if the URL refers to a form where data should be POSTed. If no
filename is given, :func:`urlretrieve()` creates a temporary file. You
can delete the file yourself, or treat the file as a cache and use
:func:`urlcleanup()` to remove it.

This example uses GET to retrieve some data from a web server:

.. include:: urllib_urlretrieve.py
    :literal:
    :start-after: #end_pymotw_header

Since the server does not return a ``Content-length`` header,
:func:`urlretrieve()` does not know how big the data should be, and
passes -1 as the *total_size* argument to :func:`reporthook()`.
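
A progress callback therefore has to cope with both a known and an unknown size. The following is only a minimal sketch of such a hook, not the article's script. It assumes a local file fetched through a ``file:`` URL instead of a remote server (which normally does report a size), and the names ``progress``, ``source.dat`` and ``copy.dat`` are invented for the illustration. On Python 3 the functions are imported from ``urllib.request``.

```python
import os
import tempfile

try:
    from urllib.request import urlretrieve, pathname2url  # Python 3
except ImportError:
    from urllib import urlretrieve, pathname2url  # Python 2, as in this article

progress = []  # messages collected by the hook, for inspection afterwards


def reporthook(blocks_read, block_size, total_size):
    """Record progress; total_size is -1 when the size is unknown."""
    if not blocks_read:
        message = 'Connection opened'
    elif total_size < 0:
        message = 'Read %d blocks' % blocks_read
    else:
        # The last block may be short, so cap the byte count.
        amount_read = min(blocks_read * block_size, total_size)
        message = 'Read %d of %d bytes' % (amount_read, total_size)
    progress.append(message)
    print(message)


# Assumption: a local file served through a file: URL replaces the
# remote server, so no network access is needed.
workdir = tempfile.mkdtemp()
source = os.path.join(workdir, 'source.dat')
target = os.path.join(workdir, 'copy.dat')
with open(source, 'wb') as f:
    f.write(b'x' * 20000)

filename, headers = urlretrieve(
    'file:' + pathname2url(source), target, reporthook=reporthook)
copied_size = os.path.getsize(filename)
print('File: %s (%d bytes)' % (filename, copied_size))

# Clean up the scratch files created for the sketch.
os.remove(source)
os.remove(target)
os.rmdir(workdir)
```

Passing an explicit target filename makes :func:`urlretrieve()` copy the data block by block, which is what triggers the hook calls.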
::

    $ python urllib_urlretrieve.py
    Connection opened
    Read 1 blocks
    Read 2 blocks
    Read 3 blocks
    Read 4 blocks
    Read 5 blocks
    Read 6 blocks
    Read 7 blocks
    Read 8 blocks
    Read 9 blocks
    Read 10 blocks
    Read 11 blocks
    Read 12 blocks
    Read 13 blocks
    Read 14 blocks
    Read 15 blocks
    Read 16 blocks
    Read 17 blocks
    Read 18 blocks
    Read 19 blocks
    File: /var/folders/9R/9R1t+tR02Raxzk+F71Q50U+++Uw/-Tmp-/tmp3HRpZP
    Headers:
    Content-Type: text/html; charset=UTF-8
    Last-Modified: Tue, 25 Mar 2008 23:09:10 GMT
    Cache-Control: max-age=0 private
    ETag: "904b02e0-c7ff-47f6-9f35-cc6de5d2a2e5"
    Server: GFE/1.3
    Date: Sun, 30 Mar 2008 17:36:48 GMT
    Connection: Close
    File exists before cleanup: True
    File still exists: False

URLopener
=========

:mod:`urllib` provides a :class:`URLopener` base class, and
:class:`FancyURLopener` with default handling for the supported
protocols. If you find yourself needing to change their behavior, you
are probably better off looking at the :mod:`urllib2` module, added in
Python 2.1.

.. seealso::

    `urllib `_
        Standard library documentation for this module.

    :mod:`urllib2`
        Updated API for working with URL-based services.

    :mod:`urlparse`
        Parse URL values to access their components.
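
For quick experiments, the encoding and quoting calls from the Encoding Arguments section can be tried without a server. This is a minimal sketch under a few assumptions: the sample URL with a space in it is invented for the example, and the import fallback reflects that on Python 3 these functions live in ``urllib.parse`` rather than :mod:`urllib`.

```python
try:
    from urllib.parse import (  # Python 3
        urlencode, quote, quote_plus, unquote, unquote_plus)
except ImportError:
    from urllib import (  # Python 2, as in this article
        urlencode, quote, quote_plus, unquote, unquote_plus)

# A list of pairs keeps the argument order predictable.
encoded = urlencode([('q', 'query string'), ('foo', 'bar')])
print('Encoded : %s' % encoded)

# With doseq, each item in the sequence becomes its own name=value pair.
sequence = urlencode({'foo': ['foo1', 'foo2']}, doseq=True)
print('Sequence: %s' % sequence)

# Hypothetical URL chosen to show how the two functions treat ' ' and '/'.
url = 'http://localhost:8080/a b/'
quoted = quote(url)            # leaves '/' alone, space becomes %20
quoted_plus = quote_plus(url)  # also quotes '/', space becomes '+'
print('quote() : %s' % quoted)
print('quote_plus(): %s' % quoted_plus)

# Each unquote function reverses its matching quote function.
print('unquote() : %s' % unquote(quoted))
print('unquote_plus(): %s' % unquote_plus(quoted_plus))
```

Which characters :func:`quote()` leaves untouched can vary between Python versions, so round-tripping through the matching unquote function is a safer check than comparing quoted strings by eye.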