urlparse – Split URL into component pieces.
| Purpose: | Split URL into component pieces. |
|---|---|
| Available In: | since 1.4 |
The urlparse module provides functions for breaking URLs down into their component parts, as defined by the relevant RFCs.
Parsing
The return value from the urlparse() function is an object which acts like a tuple with 6 elements.
from urlparse import urlparse
parsed = urlparse('http://netloc/path;parameters?query=argument#fragment')
print parsed
The parts of the URL available through the tuple interface are the scheme, network location, path, parameters, query, and fragment.
$ python urlparse_urlparse.py
ParseResult(scheme='http', netloc='netloc', path='/path', params='parameters', query='query=argument', fragment='fragment')
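As a small sketch of what "acts like a tuple" means in practice, the result can be indexed, measured with len(), and unpacked into six names. The try/except import is an assumption added here so the snippet also runs on Python 3, where the same functions live in urllib.parse; the article itself uses the Python 2 urlparse module.

```python
try:
    from urllib.parse import urlparse  # Python 3 location (assumption for portability)
except ImportError:
    from urlparse import urlparse      # Python 2 location, as used in this article

parsed = urlparse('http://netloc/path;parameters?query=argument#fragment')

# Index access follows the order scheme, netloc, path, params, query, fragment.
first = parsed[0]
last = parsed[5]

# Unpacking works because the result behaves like a 6-element tuple.
scheme, netloc, path, params, query, fragment = parsed

print('length  : %d' % len(parsed))
print('first   : %s' % first)
print('last    : %s' % last)
print('path    : %s' % path)
```

Unpacking is convenient when all six values are needed at once, but it depends on remembering the order of the elements.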
Although the return value acts like a tuple, it is really based on a namedtuple, a subclass of tuple that supports accessing the parts of the URL via named attributes instead of indexes. That’s especially useful if, like me, you can’t remember the index order. In addition to being easier to use for the programmer, the attribute API also offers access to several values not available in the tuple API.
from urlparse import urlparse
parsed = urlparse('http://user:pass@NetLoc:80/path;parameters?query=argument#fragment')
print 'scheme :', parsed.scheme
print 'netloc :', parsed.netloc
print 'path :', parsed.path
print 'params :', parsed.params
print 'query :', parsed.query
print 'fragment:', parsed.fragment
print 'username:', parsed.username
print 'password:', parsed.password
print 'hostname:', parsed.hostname, '(netloc in lower case)'
print 'port :', parsed.port
The username and password are available when present in the input URL and None when not. The hostname is the host portion of netloc, converted to lower case, with the user information and port removed. The port is converted to an integer when present and None when not.
$ python urlparse_urlparseattrs.py
scheme : http
netloc : user:pass@NetLoc:80
path : /path
params : parameters
query : query=argument
fragment: fragment
username: user
password: pass
hostname: netloc (netloc in lower case)
port : 80
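To illustrate the "None when not present" cases, here is a minimal sketch using a hypothetical URL that has no user information and no explicit port. The URL and the fallback import for Python 3 are assumptions added for this example.

```python
try:
    from urllib.parse import urlparse  # Python 3 location (assumption for portability)
except ImportError:
    from urlparse import urlparse      # Python 2 location, as used in this article

# Hypothetical URL with no user:password prefix and no port suffix.
bare = urlparse('http://Example.COM/path')

print('netloc  : %r' % bare.netloc)    # case is preserved in netloc
print('hostname: %r' % bare.hostname)  # host only, lower-cased
print('username: %r' % bare.username)  # None, nothing before an "@"
print('password: %r' % bare.password)  # None
print('port    : %r' % bare.port)      # None, no ":port" in the URL
```

Code that needs a port number should therefore be prepared to substitute a default for the scheme when port is None.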
The urlsplit() function is an alternative to urlparse(). It behaves a little differently, because it does not split the parameters from the URL. This is useful for URLs following RFC 2396, which supports parameters for each segment of the path.
from urlparse import urlsplit
parsed = urlsplit('http://user:pass@NetLoc:80/path;parameters/path2;parameters2?query=argument#fragment')
print parsed
print 'scheme :', parsed.scheme
print 'netloc :', parsed.netloc
print 'path :', parsed.path
print 'query :', parsed.query
print 'fragment:', parsed.fragment
print 'username:', parsed.username
print 'password:', parsed.password
print 'hostname:', parsed.hostname, '(netloc in lower case)'
print 'port :', parsed.port
Since the parameters are not split out, the tuple API will show 5 elements instead of 6, and there is no params attribute.
$ python urlparse_urlsplit.py
SplitResult(scheme='http', netloc='user:pass@NetLoc:80', path='/path;parameters/path2;parameters2', query='query=argument', fragment='fragment')
scheme : http
netloc : user:pass@NetLoc:80
path : /path;parameters/path2;parameters2
query : query=argument
fragment: fragment
username: user
password: pass
hostname: netloc (netloc in lower case)
port : 80
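The difference is easier to see when both functions are given the same URL. The sketch below is an illustration rather than part of the original examples; it shows that urlparse() splits off only the parameters of the last path segment, while urlsplit() leaves the path untouched. The fallback import for Python 3 is an assumption.

```python
try:
    from urllib.parse import urlparse, urlsplit  # Python 3 location (assumption)
except ImportError:
    from urlparse import urlparse, urlsplit      # Python 2 location, as in this article

url = 'http://netloc/path;parameters/path2;parameters2?query=argument#fragment'

six = urlparse(url)   # 6 elements, parameters of the last segment split out
five = urlsplit(url)  # 5 elements, parameters left inside the path

print('urlparse path  : %s' % six.path)
print('urlparse params: %s' % six.params)
print('urlsplit path  : %s' % five.path)
print('lengths        : %d vs %d' % (len(six), len(five)))
print('has params attr: %s' % hasattr(five, 'params'))
```

With urlparse(), the parameters of the first segment stay embedded in the path, which is why urlsplit() is the more predictable choice for such URLs.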
To simply strip the fragment identifier from a URL, as you might need to do to find a base page name from a URL, use urldefrag().
from urlparse import urldefrag
original = 'http://netloc/path;parameters?query=argument#fragment'
print original
url, fragment = urldefrag(original)
print url
print fragment
The return value is a tuple containing the base URL and the fragment.
$ python urlparse_urldefrag.py
http://netloc/path;parameters?query=argument#fragment
http://netloc/path;parameters?query=argument
fragment
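When the URL has no fragment, the second value comes back as an empty string rather than None. A short sketch, with a hypothetical fragment-free URL and a fallback import for Python 3 as assumptions:

```python
try:
    from urllib.parse import urldefrag  # Python 3 location (assumption for portability)
except ImportError:
    from urlparse import urldefrag      # Python 2 location, as used in this article

# Hypothetical URL without a "#fragment" part.
plain = 'http://netloc/path?query=argument'
url, fragment = urldefrag(plain)

print('url     : %s' % url)
print('fragment: %r' % fragment)

# An empty string is falsy, so a simple truth test detects a missing fragment.
has_fragment = bool(fragment)
print('has fragment: %s' % has_fragment)
```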
Unparsing
There are several ways to assemble a split URL back together into a single string. The parsed URL object has a geturl() method.
from urlparse import urlparse
original = 'http://netloc/path;parameters?query=argument#fragment'
print 'ORIG :', original
parsed = urlparse(original)
print 'PARSED:', parsed.geturl()
geturl() only works on the object returned by urlparse() or urlsplit().
$ python urlparse_geturl.py
ORIG : http://netloc/path;parameters?query=argument#fragment
PARSED: http://netloc/path;parameters?query=argument#fragment
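The same method is available on the result of urlsplit(). The sketch below also shows one small normalization that, as far as I know, happens during parsing: the scheme is converted to lower case, so geturl() may not return the input byte for byte. The upper-case URL and the fallback import for Python 3 are assumptions added for this example.

```python
try:
    from urllib.parse import urlsplit  # Python 3 location (assumption for portability)
except ImportError:
    from urlparse import urlsplit      # Python 2 location, as used in this article

original = 'http://netloc/path;parameters?query=argument#fragment'

# geturl() on a SplitResult reassembles the five parts.
rebuilt = urlsplit(original).geturl()
print('REBUILT   : %s' % rebuilt)

# Hypothetical input with an upper-case scheme.
normalized = urlsplit('HTTP://netloc/path').geturl()
print('NORMALIZED: %s' % normalized)
```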
If you have a regular tuple of values, you can use urlunparse() to combine them into a URL.
from urlparse import urlparse, urlunparse
original = 'http://netloc/path;parameters?query=argument#fragment'
print 'ORIG :', original
parsed = urlparse(original)
print 'PARSED:', type(parsed), parsed
t = parsed[:]
print 'TUPLE :', type(t), t
print 'NEW :', urlunparse(t)
While the ParseResult returned by urlparse() can be used as a tuple, in this example I explicitly create a new tuple to show that urlunparse() works with normal tuples, too.
$ python urlparse_urlunparse.py
ORIG : http://netloc/path;parameters?query=argument#fragment
PARSED: <class 'urlparse.ParseResult'> ParseResult(scheme='http', netloc='netloc', path='/path', params='parameters', query='query=argument', fragment='fragment')
TUPLE : <type 'tuple'> ('http', 'netloc', '/path', 'parameters', 'query=argument', 'fragment')
NEW : http://netloc/path;parameters?query=argument#fragment
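Two related approaches may be useful here, sketched under the assumption that the goal is to build or modify a URL: passing a hand-written 6-element tuple to urlunparse(), and using the namedtuple method _replace() to swap one component before calling geturl(). The replacement query value is hypothetical, and the fallback import for Python 3 is an assumption.

```python
try:
    from urllib.parse import urlparse, urlunparse  # Python 3 location (assumption)
except ImportError:
    from urlparse import urlparse, urlunparse      # Python 2 location, as in this article

# A tuple written by hand, in the order
# scheme, netloc, path, params, query, fragment.
parts = ('http', 'netloc', '/path', 'parameters', 'query=argument', 'fragment')
built = urlunparse(parts)
print('BUILT   : %s' % built)

# _replace() is a standard namedtuple method; it returns a new result
# object with the named field changed and leaves the original untouched.
parsed = urlparse(built)
changed = parsed._replace(query='query=other').geturl()
print('CHANGED : %s' % changed)
```

Using _replace() avoids having to remember which index holds the query string.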
If the input URL included superfluous parts, those may be dropped from the unparsed version of the URL.
from urlparse import urlparse, urlunparse
original = 'http://netloc/path;?#'
print 'ORIG :', original
parsed = urlparse(original)
print 'PARSED:', type(parsed), parsed
t = parsed[:]
print 'TUPLE :', type(t), t
print 'NEW :', urlunparse(t)
In this case, the parameters, query, and fragment are all empty in the original URL; only their delimiters are present. The new URL does not look the same as the original, but is equivalent according to the standard.
$ python urlparse_urlunparseextra.py
ORIG : http://netloc/path;?#
PARSED: <class 'urlparse.ParseResult'> ParseResult(scheme='http', netloc='netloc', path='/path', params='', query='', fragment='')
TUPLE : <type 'tuple'> ('http', 'netloc', '/path', '', '', '')
NEW : http://netloc/path
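The 5-element counterpart, urlunsplit(), behaves the same way for results produced by urlsplit(). A minimal sketch, with a shortened version of the URL above (no ";" delimiter) and a fallback import for Python 3 as assumptions:

```python
try:
    from urllib.parse import urlsplit, urlunsplit  # Python 3 location (assumption)
except ImportError:
    from urlparse import urlsplit, urlunsplit      # Python 2 location, as in this article

original = 'http://netloc/path?#'
parts = urlsplit(original)[:]   # plain 5-element tuple

print('TUPLE : %r' % (parts,))

# Empty query and fragment values are left out of the rebuilt URL.
rebuilt = urlunsplit(parts)
print('NEW   : %s' % rebuilt)
```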
Joining
In addition to parsing URLs, urlparse includes urljoin() for constructing absolute URLs from relative references.
from urlparse import urljoin
print urljoin('http://www.example.com/path/file.html', 'anotherfile.html')
print urljoin('http://www.example.com/path/file.html', '../anotherfile.html')
In the example, the relative portion of the path ("../") is taken into account when the second URL is computed.
$ python urlparse_urljoin.py
http://www.example.com/path/anotherfile.html
http://www.example.com/anotherfile.html
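A few more cases are worth sketching, since the result depends on the form of both arguments: a second argument that starts with "/" replaces the whole path, a second argument that is already an absolute URL is returned as is, and a trailing slash on the base decides whether its last segment is kept. The file names and the second host are hypothetical, and the fallback import for Python 3 is an assumption.

```python
try:
    from urllib.parse import urljoin  # Python 3 location (assumption for portability)
except ImportError:
    from urlparse import urljoin      # Python 2 location, as used in this article

base = 'http://www.example.com/path/file.html'

# A leading "/" is resolved against the server root, not the base directory.
rooted = urljoin(base, '/top.html')

# An absolute URL ignores the base entirely.
absolute = urljoin(base, 'http://other.example.org/page.html')

# The last segment of the base is replaced unless the base ends with "/".
with_slash = urljoin('http://www.example.com/path/', 'file.html')
without_slash = urljoin('http://www.example.com/path', 'file.html')

for result in (rooted, absolute, with_slash, without_slash):
    print(result)
```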