The urllib.parse module in the standard library defines several functions for working with Uniform Resource Locators (URLs). In this article, we will focus on the urlsplit() and urlunsplit() functions.

The urlsplit() function works much like the more widely known urlparse() function. Both split a URL string into its individual components: scheme, netloc, path, query string, and fragment. The key difference is that urlparse() returns an extra component called "params", which holds any parameters attached to the last segment of the path, whereas urlsplit() leaves such parameters inside the path.

The official Python documentation says:

urlsplit() should generally be used instead of urlparse() if the more recent URL syntax allowing parameters to be applied to each segment of the path portion of the URL is wanted....
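To make the difference concrete, the short sketch below parses the same URL with both functions. The URL is a made-up example (it is not taken from the documentation), chosen because its last path segment carries a ";type=a" parameter.

```python
from urllib.parse import urlparse, urlsplit

# Hypothetical URL with a ";type=a" parameter on the last path segment
url = 'http://www.example.com/docs/page;type=a?lang=en'

parsed = urlparse(url)  # 6 components: params is split out of the path
split = urlsplit(url)   # 5 components: the parameter stays in the path

print(parsed.path, parsed.params)  # /docs/page type=a
print(split.path)                  # /docs/page;type=a
```

With urlsplit(), separating the parameter from the path segment is left to the caller.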

Splitting URLs into individual components

The urlsplit() function breaks a URL into 5 basic components, i.e. scheme, netloc, path, query string, and fragment. It returns a namedtuple object containing the components.

urlsplit(url, scheme = '', allow_fragments = True)

Named tuples allow an element to be accessed either by its index (as in regular tuples) or by an attribute name. Thus, with the namedtuple object returned by urlsplit(), we can access the URL components using either index notation or attribute notation.

The following table shows the index and attribute name of each component:

Component                  Index   Attribute
URL scheme (e.g. https)    0       scheme
Network location           1       netloc
Hierarchical path          2       path
Query string               3       query
Fragment identifier        4       fragment
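Because the result is a tuple of exactly five items in the order listed above, it can also be unpacked directly into variables. This is a minimal sketch; the variable names are arbitrary.

```python
from urllib.parse import urlsplit

# Unpack the five components in index order
scheme, netloc, path, query, fragment = urlsplit(
    'https://www.google.com/search?q=python+functions#1&2'
)

print(scheme)    # https
print(netloc)    # www.google.com
print(path)      # /search
print(query)     # q=python+functions
print(fragment)  # 1&2
```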

Using indexes

from urllib.parse import urlsplit

URL = 'https://www.google.com/search?q=python+functions#1&2'
components = urlsplit(URL)
print(components, '\n')

# access individual components by index
print('scheme: ', components[0])
print('netloc: ', components[1])
print('path: ', components[2])
print('query: ', components[3])
print('fragment: ', components[4])

In the above example, we used index notation to access individual URL components from the named tuple. Let us now see how to achieve the same using attribute notation.

Using attributes

from urllib.parse import urlsplit

URL = 'https://www.google.com/search?q=python+functions#1&2'
components = urlsplit(URL)

# access individual components by attribute name
print('scheme: ', components.scheme)
print('netloc: ', components.netloc)
print('path: ', components.path)
print('query: ', components.query)
print('fragment: ', components.fragment)

With attribute notation, four additional read-only attributes are available: hostname, port, username and password. They are derived from netloc and are not elements of the tuple, so they cannot be accessed by index.

If any of these four additional components is absent from the given URL, the corresponding attribute is None.

from urllib.parse import urlsplit

URL = 'https://www.google.com:3000/search?q=python+functions#1&2'
components = urlsplit(URL)

#access individual components by attributes
print('hostname: ', components.hostname)
print('port: ', components.port)
print('username: ', components.username)
print('password: ', components.password)
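For comparison, here is a sketch with a URL in which all four extra attributes are present. The credentials, host and port are invented for illustration only.

```python
from urllib.parse import urlsplit

# Hypothetical URL carrying credentials and an explicit port
components = urlsplit('https://alice:s3cret@WWW.Example.com:8443/account')

print(components.netloc)    # alice:s3cret@WWW.Example.com:8443
print(components.hostname)  # www.example.com (always lowercased)
print(components.port)      # 8443 (an int, not a string)
print(components.username)  # alice
print(components.password)  # s3cret
```

Note that netloc keeps the text exactly as written, while hostname is normalised to lower case and port is converted to an integer.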

The scheme and allow_fragments parameters

Apart from the URL, we can pass two more parameters to the urlsplit() function: scheme and allow_fragments.

When the scheme parameter is given, its value will be used only if the given URL does not have a scheme. Consider the following example:

from urllib.parse import urlsplit

URL = 'www.google.com/search?q=python+functions' # no scheme
components = urlsplit(URL, scheme = "http")

print('scheme: ', components.scheme)
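One caveat is worth checking when supplying a default scheme: urlsplit() only recognises a network location when it is introduced by //. In a scheme-less URL like the one above, the host therefore ends up in path rather than in netloc. The sketch below shows this, together with one possible workaround (prefixing the string with //); whether that workaround is appropriate is an assumption about the input, not something the example above does.

```python
from urllib.parse import urlsplit

bare = urlsplit('www.google.com/search?q=python+functions', scheme='http')
print(repr(bare.netloc))  # ''
print(bare.path)          # www.google.com/search

# Prefixing '//' marks the start of the network location
fixed = urlsplit('//www.google.com/search?q=python+functions', scheme='http')
print(fixed.scheme)  # http
print(fixed.netloc)  # www.google.com
print(fixed.path)    # /search
```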

The allow_fragments parameter controls whether fragment identifiers are recognized; it defaults to True. If it is set to False, the fragment component is always an empty string, and any "#..." text in the URL is left as part of the preceding component (the path or the query).
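The effect can be seen by splitting the same URL with and without fragment recognition. This is a small sketch that reuses the URL from the earlier examples.

```python
from urllib.parse import urlsplit

URL = 'https://www.google.com/search?q=python+functions#1&2'

with_fragment = urlsplit(URL)  # default: allow_fragments=True
without_fragment = urlsplit(URL, allow_fragments=False)

print(with_fragment.query, '|', with_fragment.fragment)
# q=python+functions | 1&2

print(without_fragment.query, '|', repr(without_fragment.fragment))
# q=python+functions#1&2 | ''
```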

Merging components to form a URL

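The urlunsplit() function is the inverse of urlsplit(). It takes an iterable of the five components, such as the tuple returned by urlsplit(), and combines them back into a URL string. The result can differ slightly from the original URL while remaining equivalent, for example when the original contained a redundant delimiter such as a ? followed by an empty query.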

from urllib.parse import urlsplit, urlunsplit

URL = 'https://www.google.com/search?q=python+functions'
components = urlsplit(URL)

original = urlunsplit(components)  # reconstruct the URL from its components
print('original: ', original)
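urlunsplit() does not require the result of urlsplit(); any five-item iterable works. The sketch below builds a URL from a plain tuple and then changes a single component of a parsed URL before merging it again. The component values are illustrative, and the use of the namedtuple _replace() method is just one convenient way to do the substitution.

```python
from urllib.parse import urlsplit, urlunsplit

# Build a URL from a plain 5-item tuple (values are illustrative)
parts = ('https', 'www.example.com', '/search', 'q=python', 'top')
built = urlunsplit(parts)
print(built)  # https://www.example.com/search?q=python#top

# Swap one component of a parsed URL, then merge the parts again
components = urlsplit('https://www.google.com/search?q=python+functions')
modified = components._replace(query='q=urllib')
print(urlunsplit(modified))  # https://www.google.com/search?q=urllib
print(modified.geturl())     # same string, via the result's geturl() method
```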