The urllib.parse module in the standard library defines several functions for working with Uniform Resource Locators (URLs). In this article we will focus on the urlsplit() and urlunsplit() functions.
The urlsplit() function works much like the more widely known urlparse() function. Both split a URL string into its individual components: scheme, netloc, path, query string, and fragment. The key difference is that urlparse() returns an extra component called "params", which holds any parameters attached to the last segment of the path.
The official Python docs say that:
urlsplit() should generally be used instead of urlparse() if the more recent URL syntax allowing parameters to be applied to each segment of the path portion of the URL is wanted....
Splitting URLs into individual components
The urlsplit() function breaks a URL into 5 basic components, i.e. scheme, netloc, path, query string, and fragment. It returns a namedtuple object containing the components.
urlsplit(url, scheme = '', allow_fragments = True)
Named tuples allow an element to be accessed either by its index (as in regular tuples) or by an attribute name. Thus, with the namedtuple object returned by urlsplit(), we can access the URL components using either the index notation or the attribute notation.
The following table shows the index and attribute name of each component:
component | index | attribute |
---|---|---|
URL scheme, e.g. https | 0 | scheme |
network location | 1 | netloc |
relative path | 2 | path |
query string | 3 | query |
fragment string | 4 | fragment |
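As a minimal sketch of the index notation, the snippet below splits a made-up URL (the address is an assumption chosen only for illustration) and reads each component by its position in the table:

```python
from urllib.parse import urlsplit

# Hypothetical URL, used only for illustration.
url = "https://www.example.com/docs/index.html?lang=en#intro"
parts = urlsplit(url)

print(parts[0])  # https
print(parts[1])  # www.example.com
print(parts[2])  # /docs/index.html
print(parts[3])  # lang=en
print(parts[4])  # intro
```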
In the above example we used the index notation to access individual URL components from the named tuple. Let us now see how we can achieve the same using the attribute notation.
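A short sketch of the attribute notation, again with a made-up URL that is only an assumption for illustration. Each attribute returns the same value as the matching index:

```python
from urllib.parse import urlsplit

# Hypothetical URL, used only for illustration.
url = "https://www.example.com/docs/index.html?lang=en#intro"
parts = urlsplit(url)

print(parts.scheme)    # https
print(parts.netloc)    # www.example.com
print(parts.path)      # /docs/index.html
print(parts.query)     # lang=en
print(parts.fragment)  # intro
```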
When using the attribute notation as above, we can access four additional components, namely hostname, port, username and password. Note that this is not possible when using the index notation.
If any of these four additional components does not exist in the given URL, None will be returned.
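To illustrate, here is a small sketch with two made-up URLs (the credentials, host and port are assumptions for demonstration only). The first carries all four extra components in its network location; the second carries only a hostname:

```python
from urllib.parse import urlsplit

# Hypothetical URL with credentials and an explicit port.
with_auth = urlsplit("https://alice:secret@www.example.com:8080/docs")
print(with_auth.hostname)  # www.example.com
print(with_auth.port)      # 8080 (returned as an int)
print(with_auth.username)  # alice
print(with_auth.password)  # secret

# Hypothetical URL without credentials or port.
plain = urlsplit("https://www.example.com/docs")
print(plain.hostname)                              # www.example.com
print(plain.port, plain.username, plain.password)  # None None None
```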
scheme and allow_fragments parameters
Apart from the URL, we can pass two more parameters to the urlsplit() function: scheme and allow_fragments.
When the scheme parameter is given, its value will be used only if the given URL does not have a scheme. Consider the following example:
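A minimal sketch of this behaviour, using made-up URLs as an assumption. The scheme-less URL starts with // so that urlsplit() still recognizes the network location:

```python
from urllib.parse import urlsplit

# Hypothetical URL without a scheme: the default is applied.
no_scheme = urlsplit("//www.example.com/index.html", scheme="https")
print(no_scheme.scheme)  # https
print(no_scheme.netloc)  # www.example.com

# Hypothetical URL that already has a scheme: the default is ignored.
has_scheme = urlsplit("http://www.example.com/index.html", scheme="https")
print(has_scheme.scheme)  # http
```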
The allow_fragments parameter controls whether fragment identifiers are recognized. It defaults to True. If it is set to False, fragment will be set to an empty string regardless of whether the input URL has a fragment, and the fragment text is left as part of the preceding component (the path or the query).
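The effect can be sketched as follows, with a made-up URL assumed only for illustration. With allow_fragments=False the text after # stays attached to the query:

```python
from urllib.parse import urlsplit

# Hypothetical URL, used only for illustration.
url = "https://www.example.com/page?x=1#top"

default = urlsplit(url)
print(default.query)     # x=1
print(default.fragment)  # top

no_frag = urlsplit(url, allow_fragments=False)
print(no_frag.query)           # x=1#top
print(repr(no_frag.fragment))  # ''
```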
Merging components to form a URL
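The urlunsplit() function is the opposite of urlsplit(). It takes an iterable of the five components, such as the tuple returned by the urlsplit() function, and reconstructs the URL. The result may differ slightly from the original while remaining equivalent, for example when the original contained an unnecessary delimiter such as a ? followed by an empty query.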
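A brief sketch of the round trip, with made-up URLs and component values that are assumptions for illustration. The second call shows that any five-item iterable works, not only the result of urlsplit():

```python
from urllib.parse import urlsplit, urlunsplit

# Hypothetical URL, used only for illustration.
url = "https://www.example.com/docs/index.html?lang=en#intro"
parts = urlsplit(url)

# Rebuild the URL from the split result.
rebuilt = urlunsplit(parts)
print(rebuilt == url)  # True

# Build a URL from a plain tuple: (scheme, netloc, path, query, fragment).
custom = urlunsplit(("https", "www.example.com", "/search", "q=python", ""))
print(custom)  # https://www.example.com/search?q=python
```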