The urllib package in the standard library includes the parse sub-module, which provides tools for working with Uniform Resource Locators (URLs). The module is especially useful for retrieving the individual components that make up a URL. It also has functions for constructing a URL from its components.

To use the parse module, we will first have to import it in our program.

importing the parse module

from urllib import parse

print(parse)

In this article, we will explore the parse module, its usage and the various functions defined in the module.

Parsing URLs

The urlparse() function is the basic tool for parsing a URL into its individual components.

A standard URL is made up of six components with the structure shown below:

<scheme>://<netloc>/<path>;<params>?<query>#<fragment>

More details on the components are shown below:

  • scheme - the protocol name, usually http or https.
  • netloc - the network location, which normally includes the domain name and may also include a port number and credentials, e.g. www.pynerds.com:8000
  • path - the path to the resource on the target server, e.g. in www.google.com/search, /search is the path.
  • params - optional parameters for the last path segment, introduced by a semicolon (;). Rarely used in practice.
  • query - the query string that follows the question mark (?), e.g. s=functions
  • fragment - the identifier that follows the hash sign (#), usually pointing to a location within the resource.

The urlparse() function returns a ParseResult object, a namedtuple holding the six values for the input URL.

It has the following syntax:

urlparse(url, scheme='', allow_fragments=True)

from urllib.parse import urlparse

URL = 'https://www.google.com/search?s=functions'
result = urlparse(URL)

print('Parsed URL: ', result.geturl())
print('Parse Result: ', result)

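As you can see in the above example, the urlparse() function returns an object containing the individual components of the URL. The geturl() method reassembles those components and returns the resulting URL, which for a well-formed input like this one is the same as the URL that was parsed.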
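The two optional parameters in the syntax above can be explored with a short sketch. It assumes a scheme-less variant of the same Google URL: scheme only supplies a default when the URL does not carry one, and allow_fragments=False stops the function from splitting off the part after #.

```python
from urllib.parse import urlparse

# A URL without a scheme; the leading '//' marks the start of the netloc.
no_scheme = urlparse('//www.google.com/search', scheme='https')
print('scheme: ', no_scheme.scheme)   # falls back to the default we passed
print('netloc: ', no_scheme.netloc)

# With allow_fragments=False the '#...' part is not split off;
# it stays attached to the component that precedes it (here, the query).
no_frag = urlparse('https://www.google.com/search?s=functions#dummy-fragment',
                   allow_fragments=False)
print('query: ', no_frag.query)
print('fragment: ', repr(no_frag.fragment))
```

Note that without the leading // the whole string would be treated as a path, so the default scheme alone does not make www.google.com a network location.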

The returned object is a namedtuple (from the collections module). This means that we can conveniently use either an index or an attribute name to access the value of an individual component.

access elements with index

from urllib.parse import urlparse

URL = 'https://www.google.com/search?s=functions#dummy-fragment'
result = urlparse(URL)

print('scheme: ', result[0])
print('netloc: ', result[1])
print('path: ', result[2])
print('params: ', result[3])
print('query: ', result[4])
print('fragments: ', result[5])

access elements with attribute names

from urllib.parse import urlparse

URL = 'https://www.google.com/search?s=functions#dummy-fragment'
result = urlparse(URL)

print('scheme: ', result.scheme)
print('netloc: ', result.netloc)
print('path: ', result.path)
print('params: ', result.params)
print('query: ', result.query)
print('fragments: ', result.fragment)
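Because the result is a namedtuple, the usual tuple tools apply to it as well. The following is a minimal sketch, reusing the same example URL, that unpacks the six components in one statement and uses the namedtuple method _replace() to build a modified copy. Switching the scheme to http is purely an illustrative choice.

```python
from urllib.parse import urlparse

URL = 'https://www.google.com/search?s=functions#dummy-fragment'
result = urlparse(URL)

# Unpack the six components in index order.
scheme, netloc, path, params, query, fragment = result
print(scheme, netloc, path)

# namedtuples are immutable, so _replace() returns a new object
# and leaves the original result untouched.
changed = result._replace(scheme='http')
print(changed.geturl())
print(result.scheme)
```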

When using attribute names, we can access additional values that are not available when using the index approach. These values are hostname, port, username and password.

access additional values

from urllib.parse import urlparse

URL = 'https://www.google.com:8000/search?s=functions'
result = urlparse(URL)

print('scheme: ', result.scheme)
print('host: ', result.hostname)
print('port: ', result.port)

All of the values available in the returned namedtuple object are shown in the following table. The last four are available by attribute name only and have no index:

attribute name    index
scheme            0
netloc            1
path              2
params            3
query             4
fragment          5
username          -
password          -
hostname          -
port              -
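The attribute-only values in the table can be seen together in a short sketch. The URL below is hypothetical, with made-up credentials and mixed-case host, chosen only so that every attribute has something to show.

```python
from urllib.parse import urlparse

# Hypothetical URL with made-up credentials, for illustration only.
URL = 'https://alice:s3cret@WWW.Example.com:8080/docs'
result = urlparse(URL)

print('netloc: ', result.netloc)      # raw network location, credentials included
print('username: ', result.username)
print('password: ', result.password)
print('hostname: ', result.hostname)  # returned in lowercase
print('port: ', result.port)          # an int, not a string

# When the URL has no port or credentials, the attributes are None.
plain = urlparse('https://www.example.com/docs')
print('port: ', plain.port)
print('username: ', plain.username)
```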

Construct URL from components

We can also assemble URL components to form the complete URL.

using urlunparse()

The urlunparse() function takes the URL components and returns the assembled URL. When given the result of urlparse(), it reconstructs the original URL.

from urllib.parse import urlparse, urlunparse

URL = 'https://www.google.com:8000/search?s=functions'

parsed = urlparse(URL)
unparsed = urlunparse(parsed)
print(unparsed)

Note that the argument given to urlunparse() can be any iterable of length six that holds the URL's components in the order scheme, netloc, path, params, query, fragment.
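As a minimal sketch of this, the components can be supplied as a plain tuple instead of a parsed result. The values below are illustrative assumptions; empty strings stand in for components the URL does not have.

```python
from urllib.parse import urlunparse

# Order matters: scheme, netloc, path, params, query, fragment.
components = ('https', 'www.pynerds.com', '/search', '', 'q=python', 'top')

url = urlunparse(components)
print(url)

# A list with the same six items works just as well.
print(urlunparse(list(components)) == url)
```

Empty components are simply left out of the assembled URL, so no stray ; appears for the empty params entry.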

Quote and unquote URLs

Some URLs may contain special characters that will need to be escaped. The parse module provides convenient functions for escaping and unescaping special characters in a given URL.

quote() and quote_plus()

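The quote() function replaces special characters with %xx escape sequences. Letters, digits and the characters _, -, ~ and . are never escaped. By default the forward slash (/) is also left alone, because the function is intended for quoting the path section of a URL; this is controlled by the optional safe parameter.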

escaping special characters 

from urllib.parse import quote

URL = 'https://www.google.com:8000/search?s=python  functions'

print(quote(URL))
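In practice, quote() is usually applied to a single component rather than to a whole URL, since it also escapes delimiters such as : and ?. The sketch below assumes a made-up path and shows the effect of the optional safe parameter, whose default value is '/'.

```python
from urllib.parse import quote

# Illustrative path containing spaces and an ampersand.
path = '/docs/python functions & more.html'

# Default (safe='/'): slashes are kept, spaces and '&' are escaped.
default_quoted = quote(path)
print(default_quoted)

# safe='' escapes the slashes as well.
fully_quoted = quote(path, safe='')
print(fully_quoted)
```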

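The quote_plus() function works like quote() except that it replaces spaces with a plus sign (+) and, by default, also escapes the forward slash (/). It is meant for quoting values that go into the query string of a URL.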

from urllib.parse import quote_plus

URL = 'https://www.google.com:8000/search?s=python  functions'

print(quote_plus(URL))
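A more typical use of quote_plus() is to escape only the value that goes into the query string and leave the rest of the URL untouched. The following is a minimal sketch under that assumption; the search term is made up for illustration.

```python
from urllib.parse import quote_plus

base = 'https://www.google.com:8000/search'   # left as-is
term = 'python  functions & more'             # illustrative search term

# Escape only the query value, then attach it to the base URL.
quoted_term = quote_plus(term)
url = base + '?s=' + quoted_term
print(url)
```

Escaping only the value keeps the scheme, host and ? delimiter readable while still protecting characters such as & that would otherwise be read as query separators.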

unquote() and unquote_plus()

unquote() and unquote_plus() are the inverses of quote() and quote_plus() respectively.

Both functions replace the %xx escape sequences in the URL with their original characters.

from urllib.parse import unquote

escaped = "https%3A//www.google.com%3A8000/search%3Fs%3Dpython%20%20functions"

print(unquote(escaped))

If the spaces have been escaped with the + character, we can use unquote_plus() as shown below:

from urllib.parse import unquote_plus

escaped = "https%3A%2F%2Fwww.google.com%3A8000%2Fsearch%3Fs%3Dpython++functions"
print(unquote_plus(escaped))
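The pairing between the quoting and unquoting functions can be checked with a short round-trip sketch. The sample text is an assumption chosen to contain spaces; the last two lines show why the matching function matters when + is involved.

```python
from urllib.parse import quote, quote_plus, unquote, unquote_plus

text = 'python  functions'

# Each unquoting function undoes its quoting counterpart.
print(unquote(quote(text)) == text)
print(unquote_plus(quote_plus(text)) == text)

# unquote() leaves '+' untouched, unquote_plus() turns it into a space.
plus_kept = unquote('python++functions')
plus_decoded = unquote_plus('python++functions')
print(plus_kept)
print(plus_decoded)
```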