The urllib package in the standard library includes the request sub-module, which offers an interface for fetching data from the web. It allows us to open URLs (Uniform Resource Locators) and retrieve data using HTTP methods such as GET, POST, PUT, DELETE, and HEAD.

urllib.request supports the major URL schemes, such as https, http, ftp, and file.
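As a small illustration of a non-HTTP scheme, the sketch below reads a local file through a file: URL. The helper name read_local_file_via_url and the file name sample.txt are made up for this example; the file is created in a temporary directory so the snippet needs no network access.

```python
import tempfile
from pathlib import Path
from urllib import request


def read_local_file_via_url(text):
    """Write text to a temporary file, then read it back through a file: URL."""
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "sample.txt"          # hypothetical file name
        path.write_text(text, encoding="utf-8")
        file_url = path.as_uri()                 # builds a file:// URL from the path
        with request.urlopen(file_url) as response:
            return response.read().decode("utf-8")


print(read_local_file_via_url("hello from a file URL"))
```

The same urlopen() call is used regardless of the scheme; only the URL changes.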

To use the module and its various resources, we will first need to import it in our program:

from urllib import request

print(request)
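The HTTP methods mentioned in the introduction can be selected through the Request class in the same module. The sketch below only builds request objects and inspects them; nothing is sent. The endpoint URL and the payload are placeholders chosen for illustration.

```python
from urllib import request
from urllib.parse import urlencode

# Hypothetical endpoint, used only to build the request objects below.
url = "https://example.com/api/items"

# No data and no explicit method: the request defaults to GET.
get_req = request.Request(url)

# Supplying a bytes payload makes the default method POST.
payload = urlencode({"name": "widget"}).encode("utf-8")
post_req = request.Request(url, data=payload)

# Any other method can be named explicitly.
delete_req = request.Request(url, method="DELETE")

for req in (get_req, post_req, delete_req):
    print(req.get_method(), req.full_url)
```

Passing one of these objects to request.urlopen() in place of a URL string would send it with the chosen method.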

Basic Usage

Whenever you access a URL over the internet, the server where that URL exists sends back a response. The response carries important information, including:

  1. Status code - Tells the client whether the request was successful or an error was encountered. Common status codes include 200 (OK), 404 (Not Found), and 500 (Server Error).
  2. Headers - Additional pieces of information such as the content type, cache-control, and server name.
  3. Body - This is the actual content of the response, which could be HTML, XML, JSON, or any other format depending on the request.
  4. Cookies - Small pieces of data that the server asks the client to store, sent in Set-Cookie headers and used for purposes such as tracking and authentication.
  5. Redirects - If the requested resource has moved to a different location, the server can send a redirect response (a 3xx status code with the new URL in the Location header) for the client to follow.
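To see these pieces side by side without relying on an external site, the sketch below starts a throwaway HTTP server on the local machine and fetches from it. Everything about the server is an assumption made for the demo: the DemoHandler class, the /old and /new paths, and the cookie value. Proxy settings are disabled explicitly so that the loopback request is not routed through a proxy.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib import request


class DemoHandler(BaseHTTPRequestHandler):
    """Hypothetical handler: /old redirects to /new, which returns JSON."""

    def do_GET(self):
        if self.path == "/old":
            self.send_response(302)                       # redirect status code
            self.send_header("Location", "/new")          # where to go instead
            self.send_header("Content-Length", "0")
            self.end_headers()
        else:
            body = b'{"ok": true}'
            self.send_response(200)                       # success status code
            self.send_header("Content-Type", "application/json")
            self.send_header("Set-Cookie", "session=abc123")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)                        # response body

    def log_message(self, format, *args):
        pass  # keep the demo output quiet


def fetch_demo():
    """Fetch /old from a local server and summarize the final response."""
    server = HTTPServer(("127.0.0.1", 0), DemoHandler)    # port 0: pick a free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        opener = request.build_opener(request.ProxyHandler({}))  # ignore proxies
        url = f"http://127.0.0.1:{server.server_port}/old"
        with opener.open(url, timeout=5) as response:
            return {
                "status": response.status,
                "final_path": response.url.rsplit("/", 1)[-1],
                "content_type": response.headers["Content-Type"],
                "cookie": response.headers["Set-Cookie"],
                "body": response.read(),
            }
    finally:
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    print(fetch_demo())
```

The redirect is followed automatically, so the reported status and URL belong to the final response. The cookie is only visible here as a header; it is not stored for later requests in this sketch.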

The most direct way to open a URL is through the urlopen() function defined in the request module. The following shows a basic example:

from urllib import request

url = "https://www.pynerds.com"
response = request.urlopen(url)

html = response.read()
print('Status: ', response.status)
print('Body: ', html)

response.close()

Status:  200
Body:  b'<!doctype html><html lang="en-us"><head><script>var __ezHttpConsent={setByCat:function(src,tagType,attributes,category,force){var setScript=function(){if(force||window.ezTcfConsent[category]){var scriptElement=document.createElement(tagType);scriptElement.src=src;attributes.forEach(function(attr){for(var key in attr){if(attr.hasOwnProperty(key)){scriptEleme...........

In the above example, we accessed the URL "https://www.pynerds.com". As you can see from the output, the status code is 200, meaning that the URL was accessed without any errors. We also displayed the HTML content of the accessed page.

Note the last line, response.close(). We can automate the call to close() by using the response object as a context manager, i.e. in a with statement. This way we do not need to call response.close() manually; it is called for us when the with block exits, which is more convenient.

Response as a context manager

from urllib import request

url = "https://www.pynerds.com"

with request.urlopen(url) as response:
    html = response.read()
    print('Status: ', response.status)
    print('Body: ', html)

Status:  200
Body:  b'<!doctype html><html lang="en-us"><head><script>var __ezHttpConsent={setByCat:function(src,tagType,attributes,category,force){var setScript=function(){if(force||window.ezTcfConsent[category]){var scriptElement=document.createElement(tagType);scriptElement.src=src;attributes.forEach(function(attr){for(var key in attr){if(attr.hasOwnProperty(key)){scriptEleme........... 
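One way to confirm that the with statement really closes the response is to check its closed attribute inside and after the block. The sketch below uses a data: URL as a stand-in resource so that it runs without network access; that URL is an assumption of the example, and the same check should apply to a response fetched over HTTP.

```python
from urllib import request

# Stand-in resource: a data: URL carries its own content, so no network is needed.
DATA_URL = "data:text/plain,hello"

with request.urlopen(DATA_URL) as response:
    body = response.read()
    closed_inside = response.closed   # still open while inside the block

closed_after = response.closed        # closed once the block has exited

print('Body: ', body)
print('Closed inside the block: ', closed_inside)
print('Closed after the block: ', closed_after)
```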

Response objects

We have already interacted with response objects in the previous examples. In this section we will look more closely at how these objects work, their attributes, and how to use them.

A response object is the return value of the request.urlopen function.

from urllib import request

url = "https://www.pynerds.com"

with request.urlopen(url) as response:
    print(response)

<http.client.HTTPResponse object at 0x000001865B312410>

Basic attributes

Response objects store some useful attributes about the accessed resource.

from urllib import request

url = "https://www.pynerds.com"

with request.urlopen(url) as response:
    print('url: ', response.url)
    print('code: ', response.code)
    print('status: ', response.status)
    print('msg: ', response.msg)
    print('headers: ', response.headers)

url:  https://www.pynerds.com
code:  200
status:  200
msg:  OK
headers:  Date: Thu, 02 May 2024 01:45:29 GMT
Content-Type: text/html; charset=utf-8
Transfer-Encoding: chunked
Connection: close
Cache-Control: private, max-age=0, must-revalidate, no-cache, no-store
cross-origin-opener-policy: same-origin
display: pub_site_sol
expires: Wed, 01 May 2024 01:45:29 GMT....

In the above example, we accessed some attributes of the response, such as url, code, and msg.

Some of the attributes are summarized below:

  • url - The URL that was fetched.
  • code - The status code, e.g. 200 (success), 404 (not found), 500 (server error), etc.
  • status - The same value as code.
  • msg - The status message, e.g. OK (200), Not Found (404), Internal Server Error (500).
  • headers - The headers sent by the server.
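The headers attribute can also be queried for individual values instead of being printed as a whole. The sketch below shows a few lookups on it. It uses a data: URL as a self-contained stand-in resource (an assumption of this example), and X-Not-Sent is a made-up header name used to show the fallback.

```python
from urllib import request

# Stand-in resource: an HTML snippet embedded in a data: URL (no network needed).
DATA_URL = "data:text/html;charset=utf-8,%3Cp%3Ehello%3C%2Fp%3E"

with request.urlopen(DATA_URL) as response:
    headers = response.headers
    content_type = headers["content-type"]           # header names are case-insensitive
    charset = headers.get_content_charset("utf-8")   # fall back to UTF-8 if none is declared
    missing = headers.get("X-Not-Sent", "n/a")       # default for a header that is absent
    text = response.read().decode(charset)           # decode the body using that charset

print('content type: ', content_type)
print('charset: ', charset)
print('missing header: ', missing)
print('text: ', text)
```

Using the declared charset to decode the body avoids hard-coding an encoding that the server may not actually use.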

The file API

In their most basic form, response objects behave like file objects opened for reading in binary mode, so you can read from them as you would from a file.

Calling the read methods of a response object (i.e. read(), readline(), or readlines()) retrieves the body (the main content) of the accessed resource as bytes. This may be HTML, JSON, XML, etc.

from urllib import request

url = "https://www.pynerds.com"

with request.urlopen(url) as response:
    print(response.readline())

b'<!doctype html><html lang="en-us"><head><script>var __ezHttpConsent={setByCat:function(src,tagType,attributes,category,force){var setScript=function(){if(force||window.ezTcfConsent[category]){var scriptElement=document.createElement(tagType);scriptElement.src=src;attributes.forEach(function(attr){for(var key in attr){if(attr.hasOwnProperty(key)){scriptElement.setAttribute(key,attr[key]);}}});var firstScript=document.getElementsByTagName(tagType)[0];firstScript.parentNode.insertBefore(scriptElement,firstScript);}};if(force||(window.ezTcfConsent&&window.ezTcfConsent.loaded)){setScript();}else if(typeof getEzConsentData==="function"){getEzConsentData().then(function(ezTcfConsent){if(ezTcfConsent&&ezTcfConsent.loaded){setScript();}else{console.error("cannot get ez consent data");force=true;setScript();}});}else{force=true;setScript();console.error("getEzConsentData is not a function");}},};</script>\n'
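The read methods can be mixed, and each call continues from where the previous one stopped, just as with a regular file. The sketch below illustrates this on a three-line body supplied through a data: URL, which is a stand-in chosen so the example runs offline.

```python
from urllib import request

# Stand-in resource: three short lines encoded in a data: URL (%0A is a newline).
DATA_URL = "data:text/plain,first%0Asecond%0Athird%0A"

with request.urlopen(DATA_URL) as response:
    first = response.readline()          # reads up to and including the first newline
    chunk = response.read(3)             # reads the next three bytes only
    rest = [line for line in response]   # iterating yields the remaining lines

print(first)
print(chunk)
print(rest)
print(first.decode("utf-8").strip())     # bytes must be decoded to get a str
```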

Errors while fetching URLs

When the urlopen() function fails to locate the requested resource or encounters an error when doing so, it raises an exception.

We can use exception classes defined in the urllib.error sub-module to catch the exception and perform a corrective action. This can be achieved as shown below:

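from urllib import request
from urllib.error import HTTPError, URLError

url = "https://www.pynerds.com/non-existent/"

try:
    with request.urlopen(url) as response:
        print(response.readline())

except HTTPError as e:
    # The server responded, but with an error status; HTTPError has a code.
    print('Server failed to handle the request.')
    print('Reason: ', e.reason)
    print('error code: ', e.code)
except URLError as e:
    # No HTTP response (e.g. DNS or connection failure); there is no code here.
    print('Failed to reach the server.')
    print('Reason: ', e.reason)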

Server failed to handle the request.
Reason:  Not Found
error code:  404
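Not every failure comes with a status code. HTTPError, which carries a code, is a subclass of URLError, and a plain URLError is raised when no HTTP response was received at all. The sketch below wraps both cases in a hypothetical helper named try_fetch. It is exercised with a file: URL pointing at a path assumed not to exist, so it runs without network access.

```python
from urllib import request
from urllib.error import HTTPError, URLError


def try_fetch(url, timeout=5):
    """Return ("ok", body), ("http-error", code) or ("url-error", reason)."""
    try:
        with request.urlopen(url, timeout=timeout) as response:
            return ("ok", response.read())
    except HTTPError as e:      # must come first: HTTPError subclasses URLError
        return ("http-error", e.code)
    except URLError as e:       # no HTTP response at all, so no status code
        return ("url-error", str(e.reason))


# A file: URL that points at a path assumed not to exist.
kind, detail = try_fetch("file:///no/such/dir/urllib-demo-missing.txt")
print(kind, detail)
```

Catching HTTPError before URLError matters: with the order reversed, the URLError branch would also catch HTTP errors and the more specific branch would never run.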