urllib2 – Library for opening URLs.
Purpose: A library for opening URLs that can be extended by defining custom protocol handlers.
Available In: 2.1
The urllib2 module provides an updated API for using internet resources identified by URLs. It is designed to be extended by individual applications to support new protocols or add variations to existing protocols (such as handling HTTP basic authentication).
HTTP GET
Note
The test server for these examples is in BaseHTTPServer_GET.py, from the PyMOTW examples for BaseHTTPServer. Start the server in one terminal window, then run these examples in another.
As with urllib, an HTTP GET operation is the simplest use of urllib2. Pass the URL to urlopen() to get a “file-like” handle to the remote data.
import urllib2
response = urllib2.urlopen('http://localhost:8080/')
print 'RESPONSE:', response
print 'URL :', response.geturl()
headers = response.info()
print 'DATE :', headers['date']
print 'HEADERS :'
print '---------'
print headers
data = response.read()
print 'LENGTH :', len(data)
print 'DATA :'
print '---------'
print data
The example server accepts the incoming values and formats a plain text response to send back. The return value from urlopen() gives access to the headers from the HTTP server through the info() method, and the data for the remote resource via methods like read() and readlines().
$ python urllib2_urlopen.py
RESPONSE: <addinfourl at 11940488 whose fp = <socket._fileobject object at 0xb573f0>>
URL : http://localhost:8080/
DATE : Sun, 19 Jul 2009 14:01:31 GMT
HEADERS :
---------
Server: BaseHTTP/0.3 Python/2.6.2
Date: Sun, 19 Jul 2009 14:01:31 GMT
LENGTH : 349
DATA :
---------
CLIENT VALUES:
client_address=('127.0.0.1', 55836) (localhost)
command=GET
path=/
real path=/
query=
request_version=HTTP/1.1
SERVER VALUES:
server_version=BaseHTTP/0.3
sys_version=Python/2.6.2
protocol_version=HTTP/1.0
HEADERS RECEIVED:
accept-encoding=identity
connection=close
host=localhost:8080
user-agent=Python-urllib/2.6
The file-like object returned by urlopen() is iterable:
import urllib2
response = urllib2.urlopen('http://localhost:8080/')
for line in response:
    print line.rstrip()
This example strips the trailing newlines and carriage returns before printing the output.
$ python urllib2_urlopen_iterator.py
CLIENT VALUES:
client_address=('127.0.0.1', 55840) (localhost)
command=GET
path=/
real path=/
query=
request_version=HTTP/1.1
SERVER VALUES:
server_version=BaseHTTP/0.3
sys_version=Python/2.6.2
protocol_version=HTTP/1.0
HEADERS RECEIVED:
accept-encoding=identity
connection=close
host=localhost:8080
user-agent=Python-urllib/2.6
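The same file-like interface can be tried without the test server, because urllib2 also opens local files through file: URLs. The following is a minimal sketch rather than one of the original examples: the temporary file and its contents are made up for illustration, and the import fallback is an assumption that lets the block run on Python 3 as well, where these functions live in urllib.request.

```python
import os
import tempfile

try:  # Python 2
    from urllib2 import urlopen
    from urllib import pathname2url
except ImportError:  # Python 3 (assumed fallback, not part of the original)
    from urllib.request import urlopen, pathname2url

# Create a throwaway file to stand in for a remote resource.
fd, path = tempfile.mkstemp(suffix='.txt')
with os.fdopen(fd, 'wb') as f:
    f.write(b'first line\r\nsecond line\r\n')

try:
    response = urlopen('file:' + pathname2url(path))
    # info() returns the headers; for local files they are synthesized.
    length = response.info()['Content-length']
    # Iterating yields one line at a time, line endings included,
    # so rstrip() removes the trailing CR and LF.
    lines = [line.rstrip() for line in response]
    response.close()
finally:
    os.remove(path)

print('LENGTH: %s' % length)
for line in lines:
    print(line)
```

Collecting the lines in a list before closing the response keeps the cleanup in the finally block simple.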
Encoding Arguments
Arguments can be passed to the server by encoding them with urllib.urlencode() and appending them to the URL.
import urllib
import urllib2
query_args = { 'q':'query string', 'foo':'bar' }
encoded_args = urllib.urlencode(query_args)
print 'Encoded:', encoded_args
url = 'http://localhost:8080/?' + encoded_args
print urllib2.urlopen(url).read()
The list of client values returned in the example output contains the encoded query arguments.
$ python urllib2_http_get_args.py
Encoded: q=query+string&foo=bar
CLIENT VALUES:
client_address=('127.0.0.1', 55849) (localhost)
command=GET
path=/?q=query+string&foo=bar
real path=/
query=q=query+string&foo=bar
request_version=HTTP/1.1
SERVER VALUES:
server_version=BaseHTTP/0.3
sys_version=Python/2.6.2
protocol_version=HTTP/1.0
HEADERS RECEIVED:
accept-encoding=identity
connection=close
host=localhost:8080
user-agent=Python-urllib/2.6
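Dictionary ordering is not guaranteed on every Python version, so the order of the encoded arguments can differ from the order in which they were written. As a minimal sketch, assuming the order matters to the receiving server, urlencode() also accepts a sequence of two-item tuples, and its doseq argument expands a list value into repeated parameters. The 'tag' field is hypothetical, and the import fallback is an assumption for running the block on Python 3.

```python
try:  # Python 2
    from urllib import urlencode
except ImportError:  # Python 3 (assumed fallback)
    from urllib.parse import urlencode

# A sequence of pairs is encoded in the order given.
ordered = urlencode([('q', 'query string'), ('foo', 'bar')])
print(ordered)

# With doseq=True, each item of a list value becomes its own parameter.
repeated = urlencode([('tag', ['a', 'b'])], doseq=True)
print(repeated)
```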
HTTP POST
Note
The test server for these examples is in BaseHTTPServer_POST.py, from the PyMOTW examples for BaseHTTPServer. Start the server in one terminal window, then run these examples in another.
To POST form-encoded data to the remote server, instead of using GET, pass the encoded query arguments as data to urlopen().
import urllib
import urllib2
query_args = { 'q':'query string', 'foo':'bar' }
encoded_args = urllib.urlencode(query_args)
url = 'http://localhost:8080/'
print urllib2.urlopen(url, encoded_args).read()
The server can decode the form data and access the individual values by name.
$ python urllib2_urlopen_post.py
Client: ('127.0.0.1', 55943)
User-agent: Python-urllib/2.6
Path: /
Form data:
q=query string
foo=bar
Working with Requests Directly
urlopen() is a convenience function that hides some of the details of how the request is made and handled for you. For more precise control, you may want to instantiate and use a Request object directly.
Adding Outgoing Headers
As the examples above illustrate, the default User-agent header value is made up of the constant Python-urllib, followed by the Python interpreter version. If you are creating an application that will access other people’s web resources, it is courteous to include real user agent information in your requests, so they can identify the source of the hits more easily. Using a custom agent also allows them to control crawlers using a robots.txt file (see robotparser).
import urllib2
request = urllib2.Request('http://localhost:8080/')
request.add_header('User-agent', 'PyMOTW (http://www.doughellmann.com/PyMOTW/)')
response = urllib2.urlopen(request)
data = response.read()
print data
After creating a Request object, use add_header() to set the user agent value before opening the request. The last line of the output shows our custom value.
$ python urllib2_request_header.py
CLIENT VALUES:
client_address=('127.0.0.1', 55876) (localhost)
command=GET
path=/
real path=/
query=
request_version=HTTP/1.1
SERVER VALUES:
server_version=BaseHTTP/0.3
sys_version=Python/2.6.2
protocol_version=HTTP/1.0
HEADERS RECEIVED:
accept-encoding=identity
connection=close
host=localhost:8080
user-agent=PyMOTW (http://www.doughellmann.com/PyMOTW/)
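Headers can also be inspected on the Request before anything is sent, which is a convenient way to check them without a server. The sketch below is illustrative only: it passes the user agent through the headers argument of the constructor instead of add_header(), the 'Accept' header is a hypothetical extra, and the import fallback is an assumption for running on Python 3.

```python
try:  # Python 2
    from urllib2 import Request
except ImportError:  # Python 3 (assumed fallback)
    from urllib.request import Request

agent = 'PyMOTW (http://www.doughellmann.com/PyMOTW/)'

# Headers given to the constructor are stored the same way as
# headers added later with add_header().
request = Request('http://localhost:8080/', headers={'User-agent': agent})
request.add_header('Accept', 'text/plain')

# Header names are stored capitalized ('User-agent'), so look them
# up with that spelling.
print(request.has_header('User-agent'))
print(request.get_header('User-agent'))
for name, value in sorted(request.header_items()):
    print('%s: %s' % (name, value))
```

No network connection is made until the Request is passed to urlopen().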
Posting Form Data
You can set the outgoing data on the Request to post it to the server.
import urllib
import urllib2
query_args = { 'q':'query string', 'foo':'bar' }
request = urllib2.Request('http://localhost:8080/')
print 'Request method before data:', request.get_method()
request.add_data(urllib.urlencode(query_args))
print 'Request method after data :', request.get_method()
request.add_header('User-agent', 'PyMOTW (http://www.doughellmann.com/PyMOTW/)')
print
print 'OUTGOING DATA:'
print request.get_data()
print
print 'SERVER RESPONSE:'
print urllib2.urlopen(request).read()
The HTTP method used by the Request changes from GET to POST automatically after the data is added.
$ python urllib2_request_post.py
Request method before data: GET
Request method after data : POST
OUTGOING DATA:
q=query+string&foo=bar
SERVER RESPONSE:
Client: ('127.0.0.1', 56044)
User-agent: PyMOTW (http://www.doughellmann.com/PyMOTW/)
Path: /
Form data:
q=query string
foo=bar
Note
Although the method is add_data(), its effect is not cumulative. Each call replaces the previous data.
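The switch from GET to POST can also be observed when the data is supplied to the Request constructor rather than through add_data(). This is a minimal sketch under stated assumptions: the body is encoded to bytes, which Python 3 expects and Python 2 accepts, and the import fallback exists only so the block runs on both versions.

```python
try:  # Python 2
    from urllib2 import Request
    from urllib import urlencode
except ImportError:  # Python 3 (assumed fallback)
    from urllib.request import Request
    from urllib.parse import urlencode

body = urlencode([('q', 'query string'), ('foo', 'bar')]).encode('ascii')

# Without data the request is a GET; with data it becomes a POST.
get_request = Request('http://localhost:8080/')
post_request = Request('http://localhost:8080/', data=body)

print('Without data: %s' % get_request.get_method())
print('With data   : %s' % post_request.get_method())
```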
Uploading Files
Encoding files for upload requires a little more work than simple forms. A complete MIME message needs to be constructed in the body of the request, so that the server can distinguish incoming form fields from uploaded files.
import itertools
import mimetools
import mimetypes
from cStringIO import StringIO
import urllib
import urllib2
class MultiPartForm(object):
    """Accumulate the data to be used when posting a form."""

    def __init__(self):
        self.form_fields = []
        self.files = []
        self.boundary = mimetools.choose_boundary()
        return

    def get_content_type(self):
        return 'multipart/form-data; boundary=%s' % self.boundary

    def add_field(self, name, value):
        """Add a simple field to the form data."""
        self.form_fields.append((name, value))
        return

    def add_file(self, fieldname, filename, fileHandle, mimetype=None):
        """Add a file to be uploaded."""
        body = fileHandle.read()
        if mimetype is None:
            mimetype = mimetypes.guess_type(filename)[0] or 'application/octet-stream'
        self.files.append((fieldname, filename, mimetype, body))
        return

    def __str__(self):
        """Return a string representing the form data, including attached files."""
        # Build a list of lists, each containing "lines" of the
        # request.  Each part is separated by a boundary string.
        # Once the list is built, return a string where each
        # line is separated by '\r\n'.
        parts = []
        part_boundary = '--' + self.boundary

        # Add the form fields
        parts.extend(
            [ part_boundary,
              'Content-Disposition: form-data; name="%s"' % name,
              '',
              value,
            ]
            for name, value in self.form_fields
            )

        # Add the files to upload
        parts.extend(
            [ part_boundary,
              'Content-Disposition: file; name="%s"; filename="%s"' % \
                 (field_name, filename),
              'Content-Type: %s' % content_type,
              '',
              body,
            ]
            for field_name, filename, content_type, body in self.files
            )

        # Flatten the list and add closing boundary marker,
        # then return CR+LF separated data
        flattened = list(itertools.chain(*parts))
        flattened.append('--' + self.boundary + '--')
        flattened.append('')
        return '\r\n'.join(flattened)

if __name__ == '__main__':
    # Create the form with simple fields
    form = MultiPartForm()
    form.add_field('firstname', 'Doug')
    form.add_field('lastname', 'Hellmann')

    # Add a fake file
    form.add_file('biography', 'bio.txt',
                  fileHandle=StringIO('Python developer and blogger.'))

    # Build the request
    request = urllib2.Request('http://localhost:8080/')
    request.add_header('User-agent', 'PyMOTW (http://www.doughellmann.com/PyMOTW/)')
    body = str(form)
    request.add_header('Content-type', form.get_content_type())
    request.add_header('Content-length', len(body))
    request.add_data(body)

    print
    print 'OUTGOING DATA:'
    print request.get_data()

    print
    print 'SERVER RESPONSE:'
    print urllib2.urlopen(request).read()
The MultiPartForm class can represent an arbitrary form as a multi-part MIME message with attached files.
$ python urllib2_upload_files.py
OUTGOING DATA:
--192.168.1.17.527.30074.1248020372.206.1
Content-Disposition: form-data; name="firstname"
Doug
--192.168.1.17.527.30074.1248020372.206.1
Content-Disposition: form-data; name="lastname"
Hellmann
--192.168.1.17.527.30074.1248020372.206.1
Content-Disposition: file; name="biography"; filename="bio.txt"
Content-Type: text/plain
Python developer and blogger.
--192.168.1.17.527.30074.1248020372.206.1--
SERVER RESPONSE:
Client: ('127.0.0.1', 57126)
User-agent: PyMOTW (http://www.doughellmann.com/PyMOTW/)
Path: /
Form data:
lastname=Hellmann
Uploaded biography as "bio.txt" (29 bytes)
firstname=Doug
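The encoding steps can also be expressed as a single function that works on bytes, which keeps binary file contents intact. This is a hedged sketch, not the MultiPartForm class: encode_multipart() and its argument layout are hypothetical, the boundary comes from uuid instead of mimetools.choose_boundary(), and file parts use the form-data disposition that RFC 2388 describes for multipart/form-data instead of file.

```python
import uuid


def encode_multipart(fields, files, boundary=None):
    """Return (content_type, body) for a multipart/form-data POST.

    fields -- sequence of (name, value) pairs of text
    files  -- sequence of (field_name, filename, mimetype, data)
              tuples, with data given as bytes
    """
    if boundary is None:
        # Random marker, assumed not to occur in the data itself.
        boundary = uuid.uuid4().hex
    delimiter = ('--' + boundary).encode('ascii')
    lines = []
    for name, value in fields:
        lines.extend([
            delimiter,
            ('Content-Disposition: form-data; name="%s"'
             % name).encode('utf-8'),
            b'',
            value.encode('utf-8'),
        ])
    for field_name, filename, mimetype, data in files:
        lines.extend([
            delimiter,
            ('Content-Disposition: form-data; name="%s"; filename="%s"'
             % (field_name, filename)).encode('utf-8'),
            ('Content-Type: %s' % mimetype).encode('ascii'),
            b'',
            data,
        ])
    # Closing delimiter, followed by a final CR+LF.
    lines.append(delimiter + b'--')
    lines.append(b'')
    content_type = 'multipart/form-data; boundary=%s' % boundary
    return content_type, b'\r\n'.join(lines)


content_type, body = encode_multipart(
    fields=[('firstname', 'Doug'), ('lastname', 'Hellmann')],
    files=[('biography', 'bio.txt', 'text/plain',
            b'Python developer and blogger.')],
    boundary='EXAMPLE-BOUNDARY',
)
print(content_type)
print(len(body))
```

The returned content type and body would then be attached to a Request in the same way as in the listing above.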
Custom Protocol Handlers
urllib2 has built-in support for HTTP(S), FTP, and local file access. If you need to add support for other URL types, you can register your own protocol handler to be invoked as needed. For example, if you want to support URLs pointing to arbitrary files on remote NFS servers, without requiring your users to mount the path manually, you would create a class derived from BaseHandler with a method nfs_open().
The protocol-specific open() method takes a single argument, the Request instance, and should return an object with a read() method for reading the data, an info() method that returns the response headers, and a geturl() method that returns the actual URL of the file being read. A simple way to achieve that is to create an instance of urllib.addinfourl, passing the open file handle, headers, and URL to the constructor.
import mimetypes
import os
import tempfile
import urllib
import urllib2
class NFSFile(file):
    def __init__(self, tempdir, filename):
        self.tempdir = tempdir
        file.__init__(self, filename, 'rb')

    def close(self):
        print
        print 'NFSFile:'
        print ' unmounting %s' % self.tempdir
        print ' when %s is closed' % os.path.basename(self.name)
        return file.close(self)

class FauxNFSHandler(urllib2.BaseHandler):

    def __init__(self, tempdir):
        self.tempdir = tempdir

    def nfs_open(self, req):
        url = req.get_selector()
        directory_name, file_name = os.path.split(url)
        server_name = req.get_host()
        print
        print 'FauxNFSHandler simulating mount:'
        print ' Remote path: %s' % directory_name
        print ' Server : %s' % server_name
        # Use the directory given to the constructor rather than
        # relying on a module-level variable of the same name.
        print ' Local path : %s' % self.tempdir
        print ' File name : %s' % file_name
        local_file = os.path.join(self.tempdir, file_name)
        fp = NFSFile(self.tempdir, local_file)
        content_type = mimetypes.guess_type(file_name)[0] or 'application/octet-stream'
        stats = os.stat(local_file)
        size = stats.st_size
        headers = { 'Content-type': content_type,
                    'Content-length': size,
                    }
        return urllib.addinfourl(fp, headers, req.get_full_url())

if __name__ == '__main__':
    tempdir = tempfile.mkdtemp()
    try:
        # Populate the temporary file for the simulation
        with open(os.path.join(tempdir, 'file.txt'), 'wt') as f:
            f.write('Contents of file.txt')

        # Construct an opener with our NFS handler
        # and register it as the default opener.
        opener = urllib2.build_opener(FauxNFSHandler(tempdir))
        urllib2.install_opener(opener)

        # Open the file through a URL.
        response = urllib2.urlopen('nfs://remote_server/path/to/the/file.txt')
        print
        print 'READ CONTENTS:', response.read()
        print 'URL :', response.geturl()
        print 'HEADERS:'
        for name, value in sorted(response.info().items()):
            print ' %-15s = %s' % (name, value)
        response.close()
    finally:
        os.remove(os.path.join(tempdir, 'file.txt'))
        os.removedirs(tempdir)
The FauxNFSHandler and NFSFile classes print messages to illustrate where a real implementation would add mount and unmount calls. Since this is just a simulation, FauxNFSHandler is primed with the name of a temporary directory where it should look for all of its files.
$ python urllib2_nfs_handler.py
FauxNFSHandler simulating mount:
Remote path: /path/to/the
Server : remote_server
Local path : /var/folders/9R/9R1t+tR02Raxzk+F71Q50U+++Uw/-Tmp-/tmppv5Efn
File name : file.txt
READ CONTENTS: Contents of file.txt
URL : nfs://remote_server/path/to/the/file.txt
HEADERS:
Content-length = 20
Content-type = text/plain
NFSFile:
unmounting /var/folders/9R/9R1t+tR02Raxzk+F71Q50U+++Uw/-Tmp-/tmppv5Efn
when file.txt is closed
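The handler mechanism can be reduced to a smaller sketch that needs no files at all. Everything in it is hypothetical: the mem:// scheme, the MemoryHandler class, and the dictionary it is primed with are invented for illustration, and the import fallback is an assumption for running on Python 3, where addinfourl lives in urllib.response.

```python
import io

try:  # Python 2
    from urllib2 import BaseHandler, build_opener
    from urllib import addinfourl
except ImportError:  # Python 3 (assumed fallback)
    from urllib.request import BaseHandler, build_opener
    from urllib.response import addinfourl


class MemoryHandler(BaseHandler):
    """Serve mem:// URLs from a dictionary (hypothetical scheme)."""

    def __init__(self, contents):
        self.contents = contents

    def mem_open(self, req):
        # The method name selects the scheme: <scheme>_open.
        url = req.get_full_url()
        name = url[len('mem://'):]
        data = self.contents[name]
        headers = {
            'Content-type': 'text/plain',
            'Content-length': str(len(data)),
        }
        # Wrap the data so it offers read(), info(), and geturl().
        return addinfourl(io.BytesIO(data), headers, url)


# Use the opener directly instead of installing it globally.
opener = build_opener(MemoryHandler({'greeting.txt': b'hello'}))
response = opener.open('mem://greeting.txt')
body = response.read()
final_url = response.geturl()
content_type = response.info()['Content-type']
response.close()

print(body)
print(final_url)
print(content_type)
```

Calling opener.open() leaves the behavior of urllib2.urlopen() unchanged for the rest of the program, which can be preferable to install_opener() when only some requests need the custom scheme.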
See also
- urllib2
- The standard library documentation for this module.
- urllib
- Original URL handling library.
- urlparse
- Work with the URL string itself.
- urllib2 – The Missing Manual
- Michael Foord’s write-up on using urllib2.
- Upload Scripts
- Example scripts from Michael Foord that illustrate how to upload a file using HTTP and then receive the data on the server.
- HTTP client to POST using multipart/form-data
- Python cookbook recipe showing how to encode and post data, including files, over HTTP.
- Form content types
- W3C specification for posting files or large amounts of data via HTTP forms.
- mimetypes
- Map filenames to mimetype.
- mimetools
- Tools for parsing MIME messages.