The urllib module in the standard library includes the parse sub-module. parse provides tools for working with Uniform Resource Locators (URLs). This module is especially useful for retrieving the individual components that make up a URL. It also has functions for constructing a URL from its components.
To use the parse module, we first have to import it in our program.
import the parse module
from urllib import parse
print(parse)
In this article, we will explore the parse module, its usage and the various functions defined in the module.
Parsing URLs
The urlparse() function is the basic tool for parsing a URL into its individual components.
A standard URL is made up of six components with the structure shown below:
<scheme>://<netloc>/<path>;<params>?<query>#<fragments>
More details on the components are shown below:
scheme - The protocol name, usually http or https.
netloc - The network location, which normally includes the domain name, the port number and any additional credentials, e.g. www.pynerds.com:8000
path - The path to the resource on the target server, e.g. in www.google.com/search, /search is the path.
params, query, fragments - Provide additional information on the resource being opened by the specified URL.
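As a rough illustration of how the six components map onto a real string, the sketch below parses a made-up URL (the address, path and values are assumptions chosen only so that no component is empty) and prints each part.

```python
from urllib.parse import urlparse

# Hypothetical URL chosen so that none of the six components is empty.
url = 'https://www.pynerds.com:8000/articles/page;type=a?q=python#top'
parts = urlparse(url)

print(parts.scheme)    # https
print(parts.netloc)    # www.pynerds.com:8000
print(parts.path)      # /articles/page
print(parts.params)    # type=a
print(parts.query)     # q=python
print(parts.fragment)  # top
```

Note that the leading / belongs to the path, while the ;, ? and # delimiters are dropped from the values they introduce.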
The urlparse() function returns a namedtuple object holding the six values for the input URL.
It has the following syntax:
urlparse(url, scheme = '', allow_fragments = True)
from urllib.parse import urlparse
URL = 'https://www.google.com/search?s=functions'
result = urlparse(URL)
print('Parsed URL: ', result.geturl())
print('Parse Result: ', result)
As you can see in the above example, the urlparse() function returns an object containing the individual components of the URL. The geturl() method returns the URL that was parsed.
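The scheme and allow_fragments parameters from the syntax are easier to understand with a small experiment. The following is a minimal sketch; the URLs are only illustrative. It shows scheme acting as a fallback for a URL that does not specify one, and allow_fragments=False leaving the #... part attached to the preceding component.

```python
from urllib.parse import urlparse

# A scheme-less URL: the leading // marks the start of the netloc,
# and the scheme argument is used as the default.
relative = urlparse('//www.google.com/search?s=functions', scheme='https')
print(relative.scheme)   # https
print(relative.netloc)   # www.google.com

# With allow_fragments=False the '#...' part is not split off.
no_frag = urlparse('https://www.google.com/search?s=functions#top',
                   allow_fragments=False)
print(no_frag.query)     # s=functions#top
print(no_frag.fragment)  # prints an empty line
```

The scheme argument only applies when the URL itself has no scheme; a URL that already starts with http:// or https:// keeps its own.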
The returned object is a namedtuple (from the collections module). This means that we can conveniently use either an index or an attribute name to access the value of an individual component.
access elements with index
from urllib.parse import urlparse
URL = 'https://www.google.com/search?s=functions#dummy-fragment'
result = urlparse(URL)
print('scheme: ', result[0])
print('netloc: ', result[1])
print('path: ', result[2])
print('params: ', result[3])
print('query: ', result[4])
print('fragments: ', result[5])
access elements with attribute names
from urllib.parse import urlparse
URL = 'https://www.google.com/search?s=functions#dummy-fragment'
result = urlparse(URL)
print('scheme: ', result.scheme)
print('netloc: ', result.netloc)
print('path: ', result.path)
print('params: ', result.params)
print('query: ', result.query)
print('fragments: ', result.fragment)
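Since the result behaves like an ordinary tuple, it can also be unpacked in one step or converted to a dictionary. The sketch below is one possible way to do this; _asdict() is the standard namedtuple method, not something specific to the parse module.

```python
from urllib.parse import urlparse

URL = 'https://www.google.com/search?s=functions#dummy-fragment'

# Unpack all six fields in index order.
scheme, netloc, path, params, query, fragment = urlparse(URL)
print(scheme, netloc, path)

# _asdict() returns a mapping of field names to values.
fields = urlparse(URL)._asdict()
print(fields['query'])   # s=functions
```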
When using attribute names, we can access additional values that are not available through the index approach. These values are username, password, hostname and port.
access additional values
from urllib.parse import urlparse
URL = 'https://www.google.com:8000/search?s=functions'
result = urlparse(URL)
print('scheme: ', result.scheme)
print('host: ', result.hostname)
print('port: ', result.port)
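The username and password attributes work the same way when the netloc carries credentials. The example below is a small sketch using a made-up URL with invented credentials; it also shows that hostname is returned in lower case, that port is an integer, and that a missing value is reported as None.

```python
from urllib.parse import urlparse

# Hypothetical URL with made-up credentials, for illustration only.
result = urlparse('https://alice:s3cret@WWW.Example.com:8000/search')

print(result.netloc)    # alice:s3cret@WWW.Example.com:8000
print(result.username)  # alice
print(result.password)  # s3cret
print(result.hostname)  # www.example.com (lower-cased)
print(result.port)      # 8000 (an int)

# Attributes that are absent from the URL come back as None.
no_port = urlparse('https://www.example.com/search').port
print(no_port)          # None
```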
All of the values available in the returned namedtuple object are shown in the following table:
attribute name | index |
---|---|
scheme | 0 |
netloc | 1 |
path | 2 |
params | 3 |
query | 4 |
fragment | 5 |
username | |
password | |
hostname | |
port | |
Construct URL from components
We can also assemble url components to form the complete url.
using urlunparse()
The urlunparse() function takes the URL components, reassembles them and returns the resulting URL.
from urllib.parse import urlparse, urlunparse
URL = 'https://www.google.com:8000/search?s=functions'
parsed = urlparse(URL)
unparsed = urlunparse(parsed)
print(unparsed)
Note that the argument given to urlunparse() can be any iterable of length 6 containing the URL's components.
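To illustrate the point about iterables, here is a minimal sketch that builds a URL from a plain tuple and then changes a single field of a parsed URL before reassembling it. The _replace() call is the standard namedtuple method; the component values are only examples.

```python
from urllib.parse import urlparse, urlunparse

# A plain tuple in the order scheme, netloc, path, params, query, fragment.
components = ('https', 'www.google.com:8000', '/search', '', 's=functions', '')
built = urlunparse(components)
print(built)     # https://www.google.com:8000/search?s=functions

# Change one field of a parsed URL, then reassemble it.
parsed = urlparse(built)
modified = urlunparse(parsed._replace(query='s=classes'))
print(modified)  # https://www.google.com:8000/search?s=classes
```

Empty strings stand in for components that are absent, so the tuple always has exactly six items.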
quote and unquote urls
Some URLs may contain special characters that need to be escaped. The parse module provides convenient functions for escaping and unescaping special characters in a given URL.
quote() and quote_plus()
The quote() function replaces special characters with %xx escape sequences. Letters, digits and the characters _, -, ~ and . are never escaped. By default, the / character is also left unescaped.
escaping special characters
from urllib.parse import quote
URL = 'https://www.google.com:8000/search?s=python functions'
print(quote(URL))
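Quoting a whole URL also escapes delimiters such as : and ?, which is often not what is wanted. The sketch below shows two variations under that assumption: using the safe parameter of quote() to control which characters are left alone, and escaping only the query value before attaching it to the URL. The file path and search term are made up for the example.

```python
from urllib.parse import quote

# '/' is in the default safe set, so it is kept as-is.
default_safe = quote('docs/my file.txt')
print(default_safe)  # docs/my%20file.txt

# An empty safe set escapes '/' as well.
nothing_safe = quote('docs/my file.txt', safe='')
print(nothing_safe)  # docs%2Fmy%20file.txt

# Escape only the query value, so the URL's own delimiters stay intact.
search_url = 'https://www.google.com:8000/search?s=' + quote('python functions')
print(search_url)    # https://www.google.com:8000/search?s=python%20functions
```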
The quote_plus() function works like quote() except that it replaces spaces with a plus sign (+). It also escapes the / character by default.
from urllib.parse import quote_plus
URL = 'https://www.google.com:8000/search?s=python functions'
print(quote_plus(URL))
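The difference between the two functions is easiest to see side by side. The following sketch runs both on the same made-up search term.

```python
from urllib.parse import quote, quote_plus

term = 'python functions & classes/modules'

with_quote = quote(term)
with_plus = quote_plus(term)

print(with_quote)  # python%20functions%20%26%20classes/modules
print(with_plus)   # python+functions+%26+classes%2Fmodules
```

The plus form is the one typically seen in query strings produced by HTML forms, which is why quote_plus() is usually applied to query values rather than to paths.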
unquote() and unquote_plus()
unquote() and unquote_plus() are the inverse of quote() and quote_plus() respectively.
The two functions replace the escaped characters in the URL with their original equivalents.
from urllib.parse import unquote
escaped = "https%3A//www.google.com%3A8000/search%3Fs%3Dpython%20%20functions"
print(unquote(escaped))
If the spaces have been escaped with the + character, we can use unquote_plus() as shown below:
from urllib.parse import unquote_plus
escaped = "https%3A%2F%2Fwww.google.com%3A8000%2Fsearch%3Fs%3Dpython++functions"
print(unquote_plus(escaped))
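As a final check, the sketch below round-trips a made-up string through each pair of functions and shows why the matching decoder matters: unquote() leaves a + untouched, while unquote_plus() turns it back into a space.

```python
from urllib.parse import quote, quote_plus, unquote, unquote_plus

original = 'python functions & more'

# Each decoder reverses its matching encoder.
round_trip = unquote(quote(original))
round_trip_plus = unquote_plus(quote_plus(original))
print(round_trip == original)       # True
print(round_trip_plus == original)  # True

# Only unquote_plus() treats '+' as an escaped space.
kept_plus = unquote('python+functions')
as_space = unquote_plus('python+functions')
print(kept_plus)  # python+functions
print(as_space)   # python functions
```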