urllib2 – Library for opening URLs.

Purpose:A library for opening URLs that can be extended by defining custom protocol handlers.
Available In:2.1

The urllib2 module provides an updated API for using internet resources identified by URLs. It is designed to be extended by individual applications to support new protocols or add variations to existing protocols (such as handling HTTP basic authentication).

HTTP GET

Note

The test server for these examples is in BaseHTTPServer_GET.py, from the PyMOTW examples for BaseHTTPServer. Start the server in one terminal window, then run these examples in another.

As with urllib, an HTTP GET operation is the simplest use of urllib2. Pass the URL to urlopen() to get a “file-like” handle to the remote data.

import urllib2

response = urllib2.urlopen('http://localhost:8080/')
print 'RESPONSE:', response
print 'URL     :', response.geturl()

headers = response.info()
print 'DATE    :', headers['date']
print 'HEADERS :'
print '---------'
print headers

data = response.read()
print 'LENGTH  :', len(data)
print 'DATA    :'
print '---------'
print data

The example server accepts the incoming values and formats a plain text response to send back. The return value from urlopen() gives access to the headers from the HTTP server through the info() method, and the data for the remote resource via methods like read() and readlines().

$ python urllib2_urlopen.py
RESPONSE: <addinfourl at 11940488 whose fp = <socket._fileobject object at 0xb573f0>>
URL     : http://localhost:8080/
DATE    : Sun, 19 Jul 2009 14:01:31 GMT
HEADERS :
---------
Server: BaseHTTP/0.3 Python/2.6.2
Date: Sun, 19 Jul 2009 14:01:31 GMT

LENGTH  : 349
DATA    :
---------
CLIENT VALUES:
client_address=('127.0.0.1', 55836) (localhost)
command=GET
path=/
real path=/
query=
request_version=HTTP/1.1

SERVER VALUES:
server_version=BaseHTTP/0.3
sys_version=Python/2.6.2
protocol_version=HTTP/1.0

HEADERS RECEIVED:
accept-encoding=identity
connection=close
host=localhost:8080
user-agent=Python-urllib/2.6

The file-like object returned by urlopen() is iterable:

import urllib2

response = urllib2.urlopen('http://localhost:8080/')
for line in response:
    print line.rstrip()

This example strips the trailing newlines and carriage returns before printing the output.

$ python urllib2_urlopen_iterator.py
CLIENT VALUES:
client_address=('127.0.0.1', 55840) (localhost)
command=GET
path=/
real path=/
query=
request_version=HTTP/1.1

SERVER VALUES:
server_version=BaseHTTP/0.3
sys_version=Python/2.6.2
protocol_version=HTTP/1.0

HEADERS RECEIVED:
accept-encoding=identity
connection=close
host=localhost:8080
user-agent=Python-urllib/2.6

Encoding Arguments

Arguments can be passed to the server by encoding them with urllib.urlencode() and appending them to the URL.

import urllib
import urllib2

query_args = { 'q':'query string', 'foo':'bar' }
encoded_args = urllib.urlencode(query_args)
print 'Encoded:', encoded_args

url = 'http://localhost:8080/?' + encoded_args
print urllib2.urlopen(url).read()

The list of client values returned in the example output contains the encoded query arguments.

$ python urllib2_http_get_args.py
Encoded: q=query+string&foo=bar
CLIENT VALUES:
client_address=('127.0.0.1', 55849) (localhost)
command=GET
path=/?q=query+string&foo=bar
real path=/
query=q=query+string&foo=bar
request_version=HTTP/1.1

SERVER VALUES:
server_version=BaseHTTP/0.3
sys_version=Python/2.6.2
protocol_version=HTTP/1.0

HEADERS RECEIVED:
accept-encoding=identity
connection=close
host=localhost:8080
user-agent=Python-urllib/2.6

HTTP POST

Note

The test server for these examples is in BaseHTTPServer_POST.py, from the PyMOTW examples for the BaseHTTPServer. Start the server in one terminal window, then run these examples in another.

To POST form-encoded data to the remote server, instead of using GET, pass the encoded query arguments as data to urlopen().

import urllib
import urllib2

query_args = { 'q':'query string', 'foo':'bar' }
encoded_args = urllib.urlencode(query_args)
url = 'http://localhost:8080/'
print urllib2.urlopen(url, encoded_args).read()

The server can decode the form data and access the individual values by name.

$ python urllib2_urlopen_post.py
Client: ('127.0.0.1', 55943)
User-agent: Python-urllib/2.6
Path: /
Form data:
    q=query string
    foo=bar

Working with Requests Directly

urlopen() is a convenience function that hides some of the details of how the request is made and handled for you. For more precise control, you may want to instantiate and use a Request object directly.

Adding Outgoing Headers

As the examples above illustrate, the default User-agent header value is made up of the constant Python-urllib, followed by the Python interpreter version. If you are creating an application that will access other people’s web resources, it is courteous to include real user agent information in your requests, so they can identify the source of the hits more easily. Using a custom agent also allows them to control crawlers using a robots.txt file (see robotparser).

import urllib2

request = urllib2.Request('http://localhost:8080/')
request.add_header('User-agent', 'PyMOTW (http://www.doughellmann.com/PyMOTW/)')

response = urllib2.urlopen(request)
data = response.read()
print data

After creating a Request object, use add_header() to set the user agent value before opening the request. The last line of the output shows our custom value.

$ python urllib2_request_header.py
CLIENT VALUES:
client_address=('127.0.0.1', 55876) (localhost)
command=GET
path=/
real path=/
query=
request_version=HTTP/1.1

SERVER VALUES:
server_version=BaseHTTP/0.3
sys_version=Python/2.6.2
protocol_version=HTTP/1.0

HEADERS RECEIVED:
accept-encoding=identity
connection=close
host=localhost:8080
user-agent=PyMOTW (http://www.doughellmann.com/PyMOTW/)

Posting Form Data

You can set the outgoing data on the Request to post it to the server.

import urllib
import urllib2

query_args = { 'q':'query string', 'foo':'bar' }

request = urllib2.Request('http://localhost:8080/')
print 'Request method before data:', request.get_method()

request.add_data(urllib.urlencode(query_args))
print 'Request method after data :', request.get_method()
request.add_header('User-agent', 'PyMOTW (http://www.doughellmann.com/PyMOTW/)')

print
print 'OUTGOING DATA:'
print request.get_data()

print
print 'SERVER RESPONSE:'
print urllib2.urlopen(request).read()

The HTTP method used by the Request changes from GET to POST automatically after the data is added.

$ python urllib2_request_post.py
Request method before data: GET
Request method after data : POST

OUTGOING DATA:
q=query+string&foo=bar

SERVER RESPONSE:
Client: ('127.0.0.1', 56044)
User-agent: PyMOTW (http://www.doughellmann.com/PyMOTW/)
Path: /
Form data:
    q=query string
    foo=bar

Note

Although the method is add_data(), its effect is not cumulative. Each call replaces the previous data.

Uploading Files

Encoding files for upload requires a little more work than simple forms. A complete MIME message needs to be constructed in the body of the request, so that the server can distinguish incoming form fields from uploaded files.

import itertools
import mimetools
import mimetypes
from cStringIO import StringIO
import urllib
import urllib2

class MultiPartForm(object):
    """Accumulate the data to be used when posting a form."""

    def __init__(self):
        self.form_fields = []
        self.files = []
        self.boundary = mimetools.choose_boundary()
        return
    
    def get_content_type(self):
        return 'multipart/form-data; boundary=%s' % self.boundary

    def add_field(self, name, value):
        """Add a simple field to the form data."""
        self.form_fields.append((name, value))
        return

    def add_file(self, fieldname, filename, fileHandle, mimetype=None):
        """Add a file to be uploaded."""
        body = fileHandle.read()
        if mimetype is None:
            mimetype = mimetypes.guess_type(filename)[0] or 'application/octet-stream'
        self.files.append((fieldname, filename, mimetype, body))
        return
    
    def __str__(self):
        """Return a string representing the form data, including attached files."""
        # Build a list of lists, each containing "lines" of the
        # request.  Each part is separated by a boundary string.
        # Once the list is built, return a string where each
        # line is separated by '\r\n'.  
        parts = []
        part_boundary = '--' + self.boundary
        
        # Add the form fields
        parts.extend(
            [ part_boundary,
              'Content-Disposition: form-data; name="%s"' % name,
              '',
              value,
            ]
            for name, value in self.form_fields
            )
        
        # Add the files to upload
        parts.extend(
            [ part_boundary,
              'Content-Disposition: file; name="%s"; filename="%s"' % \
                 (field_name, filename),
              'Content-Type: %s' % content_type,
              '',
              body,
            ]
            for field_name, filename, content_type, body in self.files
            )
        
        # Flatten the list and add closing boundary marker,
        # then return CR+LF separated data
        flattened = list(itertools.chain(*parts))
        flattened.append('--' + self.boundary + '--')
        flattened.append('')
        return '\r\n'.join(flattened)

if __name__ == '__main__':
    # Create the form with simple fields
    form = MultiPartForm()
    form.add_field('firstname', 'Doug')
    form.add_field('lastname', 'Hellmann')
    
    # Add a fake file
    form.add_file('biography', 'bio.txt', 
                  fileHandle=StringIO('Python developer and blogger.'))

    # Build the request
    request = urllib2.Request('http://localhost:8080/')
    request.add_header('User-agent', 'PyMOTW (http://www.doughellmann.com/PyMOTW/)')
    body = str(form)
    request.add_header('Content-type', form.get_content_type())
    request.add_header('Content-length', len(body))
    request.add_data(body)

    print
    print 'OUTGOING DATA:'
    print request.get_data()

    print
    print 'SERVER RESPONSE:'
    print urllib2.urlopen(request).read()

The MultiPartForm class can represent an arbitrary form as a multi-part MIME message with attached files.

$ python urllib2_upload_files.py

OUTGOING DATA:
--192.168.1.17.527.30074.1248020372.206.1
Content-Disposition: form-data; name="firstname"

Doug
--192.168.1.17.527.30074.1248020372.206.1
Content-Disposition: form-data; name="lastname"

Hellmann
--192.168.1.17.527.30074.1248020372.206.1
Content-Disposition: file; name="biography"; filename="bio.txt"
Content-Type: text/plain

Python developer and blogger.
--192.168.1.17.527.30074.1248020372.206.1--


SERVER RESPONSE:
Client: ('127.0.0.1', 57126)
User-agent: PyMOTW (http://www.doughellmann.com/PyMOTW/)
Path: /
Form data:
    lastname=Hellmann
    Uploaded biography as "bio.txt" (29 bytes)
    firstname=Doug

Custom Protocol Handlers

urllib2 has built-in support for HTTP(S), FTP, and local file access. If you need to add support for other URL types, you can register your own protocol handler to be invoked as needed. For example, if you want to support URLs pointing to arbitrary files on remote NFS servers, without requiring your users to mount the path manually, would create a class derived from BaseHandler and with a method nfs_open().

The protocol open() method takes a single argument, the Request instance, and it should return an object with a read() method that can be used to read the data, an info() method to return the response headers, and geturl() to return the actual URL of the file being read. A simple way to achieve that is to create an instance of urllib.addurlinfo, passing the headers, URL, and open file handle in to the constructor.

import mimetypes
import os
import tempfile
import urllib
import urllib2

class NFSFile(file):
    def __init__(self, tempdir, filename):
        self.tempdir = tempdir
        file.__init__(self, filename, 'rb')
    def close(self):
        print
        print 'NFSFile:'
        print '  unmounting %s' % self.tempdir
        print '  when %s is closed' % os.path.basename(self.name)
        return file.close(self)

class FauxNFSHandler(urllib2.BaseHandler):
    
    def __init__(self, tempdir):
        self.tempdir = tempdir
    
    def nfs_open(self, req):
        url = req.get_selector()
        directory_name, file_name = os.path.split(url)
        server_name = req.get_host()
        print
        print 'FauxNFSHandler simulating mount:'
        print '  Remote path: %s' % directory_name
        print '  Server     : %s' % server_name
        print '  Local path : %s' % tempdir
        print '  File name  : %s' % file_name
        local_file = os.path.join(tempdir, file_name)
        fp = NFSFile(tempdir, local_file)
        content_type = mimetypes.guess_type(file_name)[0] or 'application/octet-stream'
        stats = os.stat(local_file)
        size = stats.st_size
        headers = { 'Content-type': content_type,
                    'Content-length': size,
                  }
        return urllib.addinfourl(fp, headers, req.get_full_url())

if __name__ == '__main__':
    tempdir = tempfile.mkdtemp()
    try:
        # Populate the temporary file for the simulation
        with open(os.path.join(tempdir, 'file.txt'), 'wt') as f:
            f.write('Contents of file.txt')
        
        # Construct an opener with our NFS handler
        # and register it as the default opener.
        opener = urllib2.build_opener(FauxNFSHandler(tempdir))
        urllib2.install_opener(opener)

        # Open the file through a URL.
        response = urllib2.urlopen('nfs://remote_server/path/to/the/file.txt')
        print
        print 'READ CONTENTS:', response.read()
        print 'URL          :', response.geturl()
        print 'HEADERS:'
        for name, value in sorted(response.info().items()):
            print '  %-15s = %s' % (name, value)
        response.close()
    finally:
        os.remove(os.path.join(tempdir, 'file.txt'))
        os.removedirs(tempdir)

The FauxNFSHandler and NFSFile classes print messages to illustrate where a real implementation would add mount and unmount calls. Since this is just a simulation, FauxNFSHandler is primed with the name of a temporary directory where it should look for all of its files.

$ python urllib2_nfs_handler.py

FauxNFSHandler simulating mount:
  Remote path: /path/to/the
  Server     : remote_server
  Local path : /var/folders/9R/9R1t+tR02Raxzk+F71Q50U+++Uw/-Tmp-/tmppv5Efn
  File name  : file.txt

READ CONTENTS: Contents of file.txt
URL          : nfs://remote_server/path/to/the/file.txt
HEADERS:
  Content-length  = 20
  Content-type    = text/plain

NFSFile:
  unmounting /var/folders/9R/9R1t+tR02Raxzk+F71Q50U+++Uw/-Tmp-/tmppv5Efn
  when file.txt is closed

See also

urllib2
The standard library documentation for this module.
urllib
Original URL handling library.
urlparse
Work with the URL string itself.
urllib2 – The Missing Manual
Michael Foord’s write-up on using urllib2.
Upload Scripts
Example scripts from Michael Foord that illustrate how to upload a file using HTTP and then receive the data on the server.
HTTP client to POST using multipart/form-data
Python cookbook recipe showing how to encode and post data, including files, over HTTP.
Form content types
W3C specification for posting files or large amounts of data via HTTP forms.
mimetypes
Map filenames to mimetype.
mimetools
Tools for parsing MIME messages.