Working with PDF files in Python: the PyPDF2 Library

PDF (Portable Document Format) is a widely used file format developed by Adobe Systems. It is one of the most extensively used file formats today, with over 73 million new PDF files reportedly saved every day on Gmail and Drive. You are likely already familiar with this type of file.

PDF files are specifically designed to retain the formatting and layout of documents, ensuring a consistent appearance across different software, hardware, and operating systems. This feature makes them ideal for sharing various types of documents.

By convention, PDF files use the .pdf extension.

Python does not offer a built-in module designed specifically for working with PDF files. However, several third-party modules and packages are made for this purpose. PyPDF2, Slate, pdfminer, and xpdf are some of the most popular tools for working with PDFs.

In this article we will be using the PyPDF2 library. It is designed with a focus on simplicity, which makes it an excellent choice for performing basic PDF manipulation tasks. PyPDF2 provides a straightforward API for extracting text, merging or splitting PDFs, adding watermarks, encrypting or decrypting files, and performing other common PDF operations.

To install PyPDF2, run the following command from the command line:

pip3 install PyPDF2

Note: The module name is case-sensitive in Python code, so it must be imported as PyPDF2 (not pypdf2). If you are typing the command, make sure it is spelled correctly.

Introduction to PyPDF2

PyPDF2 is a Python library used for manipulating and extracting data from PDF documents. It can be used to read and extract text, images, metadata, and other content from PDFs.

With PyPDF2, you can append pages to existing PDFs, create new pages, repair corrupt PDFs, and more. The library is written in pure Python and runs on any platform that supports Python. Releases from 2.0 onward require Python 3.6 or newer; only the older 1.x releases also supported Python 2.

PyPDF2 is an open source library supported by a community of developers (its development now continues under the package name pypdf). It is easy to install and use, making it a good choice for manipulating PDF documents.

Before we start, let us use the built-in dir() function to view the public functions and classes defined in the library.

import PyPDF2 as pypdf

print(list(filter(lambda x: not x.startswith('_'), dir(pypdf))))

['DocumentInformation', 'PageObject', 'PageRange', 'PaperSize', 'PasswordType', 'PdfFileMerger', 'PdfFileReader', 'PdfFileWriter', 'PdfMerger', 'PdfReader', 'PdfWriter', 'Transformation', 'constants', 'errors', 'filters', 'generic', 'pagerange', 'papersizes', 'parse_filename_page_ranges', 'types', 'warnings', 'xmp']

The most important of the above classes are PdfReader and PdfWriter. This is because most operations involve reading data from a PDF file using PdfReader, modifying the data, and then writing it back to the PDF file (or to a PDF file with a new name) using PdfWriter.
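The read, modify, write pattern described above can be sketched as follows. This is a minimal illustration rather than part of the PyPDF2 API: copy_pages and save_first_pages are hypothetical helper names. The helper copies only the pages whose zero-based index passes a test, which is also a simple way to split a PDF.

```python
def copy_pages(reader, writer, keep=lambda index: True):
    """Copy pages from a PdfReader-like object to a PdfWriter-like object.

    `keep` receives the zero-based page index and returns True for pages
    that should be copied. Returns the number of pages copied.
    """
    copied = 0
    for index, page in enumerate(reader.pages):
        if keep(index):
            writer.add_page(page)
            copied += 1
    return copied


def save_first_pages(src_path, dst_path, count):
    """Write the first `count` pages of src_path to dst_path (hypothetical helper)."""
    # Imported here so copy_pages() stays usable with any reader/writer pair.
    from PyPDF2 import PdfReader, PdfWriter

    reader = PdfReader(src_path)
    writer = PdfWriter()
    copy_pages(reader, writer, keep=lambda index: index < count)
    with open(dst_path, "wb") as f:
        writer.write(f)
```

For example, save_first_pages("input.pdf", "first-five.pdf", 5) would write the first five pages of a file named input.pdf to a new file.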

We will use the following PDF file as the demo file. You can download it and save it as 'demo.pdf', or use an existing PDF on your computer.

Extracting Document Information

Reading metadata

In this part we will use the PdfReader class to extract information about the PDF file. Such information includes:

Author
Creator
Title
Subject
Page count

from PyPDF2 import PdfReader

pdf = PdfReader('demo.pdf')

number_of_pages = len(pdf.pages)

print('number of pages: %s'%number_of_pages)

meta = pdf.metadata

print('author: %s'%meta.author)
print('title: %s'%meta.title)
print('creator: %s'%meta.creator)
print('subject: %s'%meta.subject)

number of pages: 30
author: John Doe
title: PDF demo File
creator: Pynerds
subject: working with pdf files in python
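Metadata fields are optional, so a PDF may lack some or all of them. In that case the corresponding attributes are None, and reader.metadata itself may be None when the file carries no document information at all. A more defensive version of the example might look like the sketch below, where summarize_metadata is a hypothetical helper and "unknown" is an arbitrary placeholder.

```python
def summarize_metadata(reader, default="unknown"):
    """Return the page count and common metadata fields as a plain dict."""
    meta = reader.metadata  # may be None when the file has no info dictionary
    summary = {"pages": len(reader.pages)}
    for field in ("author", "creator", "title", "subject"):
        value = getattr(meta, field, None) if meta is not None else None
        # Missing or empty fields fall back to the placeholder.
        summary[field] = value if value else default
    return summary
```

With the demo file this would be called as summarize_metadata(PdfReader("demo.pdf")).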

Writing/Updating pdf metadata

from PyPDF2 import PdfReader, PdfWriter

reader = PdfReader("demo.pdf")
writer = PdfWriter()

# Add all pages to the writer
for page in reader.pages:
    writer.add_page(page)

# Add the metadata
writer.add_metadata(
    {
        "/Author": "Jane Doe",
        "/Producer": "Anonymous Writer",
    }
)

# Save the new PDF to a file
with open("modified-demo.pdf", "wb") as f:
    writer.write(f)
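The keys passed to add_metadata are PDF names, which is why each one starts with a slash. The sketch below builds such a dictionary from keyword arguments. info_dict is a hypothetical helper; it only capitalizes the first letter of each keyword, so it suits simple keys such as /Author, /Title, and /Subject.

```python
def info_dict(**fields):
    """Turn keyword arguments into a dict keyed by PDF names.

    info_dict(author="Jane Doe") -> {"/Author": "Jane Doe"}
    Fields set to None are skipped.
    """
    return {
        "/" + key[:1].upper() + key[1:]: str(value)
        for key, value in fields.items()
        if value is not None
    }
```

The result can be passed straight to the writer, for example writer.add_metadata(info_dict(author="Jane Doe", producer="Anonymous Writer")).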

Extract Text from a PDF

The PdfReader class takes a PDF file as its argument and divides the PDF into pages. Each page is represented as an object that can be manipulated using various methods. These methods can be used to extract text, images, annotations, and other information from the document.

To extract text from a given page, we use the extract_text() method of the page object.

Syntax:

page_object.extract_text()

Our demo pdf file has 30 pages. The following example shows how we can extract text from a given page.

from PyPDF2 import PdfReader

reader = PdfReader("demo.pdf")
page = reader.pages[29]
print(page.extract_text())

Etiam vehicula luctus fermentum. In vel metus congue, pulvinar lectus vel, fermentum dui.
Maecenas ante orci, egestas ut aliquet sit amet, sagittis a magna. Aliquam ante quam,
pellentesque ut dignissim quis, laoreet eget est. Aliquam erat volutpat. Class aptent taciti
sociosqu ad litora torquent per conubia nosatra, per inceptos himenaeos. Ut ullamcorper
justo sapien, in cursus libero viverra eget. Vivamus auctor imperdiet urna, at pulvinar leo
posuere laoreet. Suspendisse neque nisl, fringilla at iaculis scelerisque, ornare vel dolor. Ut
et pulvinar nunc. Pellentesque fringilla mollis efficitur. Nullam venenatis commodo
imperdiet. Morbi velit neque, semper quis lorem quis, efficitur dignissim ipsum. Ut ac lorem
sed turpis imperdiet eleifend sit amet id sapien.

In the above case, we have read the last page (page 30) of our demo.pdf file. Pages are indexed from zero, so the last of 30 pages is at index 29.
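To read a whole document rather than a single page, the same call can be made in a loop. The following is a sketch under the assumption that every page behaves like the page object used above; extract_all_text and extract_text_from_file are hypothetical names. Page numbers in the result start at 1, while reader.pages is indexed from 0.

```python
def extract_all_text(reader):
    """Return a list of (page_number, text) tuples, numbering pages from 1."""
    results = []
    for page_number, page in enumerate(reader.pages, start=1):
        # Fall back to an empty string if the page yields no text.
        text = page.extract_text() or ""
        results.append((page_number, text))
    return results


def extract_text_from_file(path):
    """Open a PDF with PyPDF2 and extract the text of every page."""
    # Imported here so extract_all_text() works with any reader-like object.
    from PyPDF2 import PdfReader

    return extract_all_text(PdfReader(path))
```

Calling extract_text_from_file("demo.pdf") would return one tuple per page of the demo file.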

Extract Images

Each page object of a PdfReader holds references to the images on that page. We can access the images on a particular page using the page.images attribute.

The pillow package is required for manipulating the images in the pdf files. Run the following command to install pillow.

pip install pillow

The following example shows how to extract images from a given page of the PdfReader object.

from PyPDF2 import PdfReader

reader = PdfReader("demo.pdf")

page = reader.pages[0]

# Prefix each image name with its index so that file names stay unique
for count, image_file_object in enumerate(page.images):
    with open(str(count) + image_file_object.name, "wb") as fp:
        fp.write(image_file_object.data)

The above program will extract the images present on the first page and save them in the current working directory.
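The same approach extends to every page of the document. The sketch below is one possible arrangement: save_all_images is a hypothetical helper, and the file name pattern (page number, image index, then the original image name) is an assumption chosen to avoid name clashes between pages.

```python
from pathlib import Path


def save_all_images(reader, out_dir="."):
    """Save every image of every page into out_dir and return the paths written."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for page_number, page in enumerate(reader.pages, start=1):
        for index, image in enumerate(page.images):
            # File name pattern: page<number>-<index>-<original image name>
            target = out_dir / f"page{page_number:03d}-{index}-{image.name}"
            target.write_bytes(image.data)
            written.append(target)
    return written
```

With the demo file, save_all_images(PdfReader("demo.pdf"), "images") would place the extracted files in an images folder.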

Merge PDF files

We can merge multiple PDF files together into a single document.

Basic Example:

from PyPDF2 import PdfMerger

# PdfMerger is the PyPDF2 class that provides append() and merge()
merger = PdfMerger()

for pdf in ["file1.pdf", "file2.pdf", "file3.pdf"]:
    merger.append(pdf)

merger.write("merged-pdf.pdf")
merger.close()

Showing more merging options

from PyPDF2 import PdfMerger

# PdfMerger is the PyPDF2 class that provides append() and merge()
merger = PdfMerger()

input1 = open("document1.pdf", "rb")
input2 = open("document2.pdf", "rb")
input3 = open("document3.pdf", "rb")

# add the first 3 pages of input1 document to output
merger.append(fileobj=input1, pages=(0, 3))

# insert the first page of input2 into the output beginning after the second page
merger.merge(position=2, fileobj=input2, pages=(0, 1))

# append entire input3 document to the end of the output document
merger.append(input3)

# Write to an output PDF document
output = open("document-output.pdf", "wb")
merger.write(output)

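# Close file descriptors, including the input files opened above
merger.close()
output.close()
for input_file in (input1, input2, input3):
    input_file.close()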
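The pages argument and the insertion position can be easier to follow with a small model. The sketch below uses plain Python lists in place of PDF documents. It assumes, as the comments in the example indicate, that a (start, stop) tuple selects pages the way range(start, stop) does and that position is a zero-based insertion index. model_merge is a hypothetical helper, not a PyPDF2 function.

```python
def model_merge(output, position, source, pages=None):
    """Insert the selected pages of `source` into `output` at `position`.

    `pages` is None (all pages) or a (start, stop) tuple, read like range().
    """
    if pages is None:
        selected = list(source)
    else:
        selected = [source[i] for i in range(*pages)]
    output[position:position] = selected  # slice assignment inserts in place
    return output


doc1 = ["1a", "1b", "1c", "1d"]
doc2 = ["2a", "2b"]
doc3 = ["3a"]

result = []
model_merge(result, len(result), doc1, pages=(0, 3))  # like append(..., pages=(0, 3))
model_merge(result, 2, doc2, pages=(0, 1))            # like merge(position=2, ...)
model_merge(result, len(result), doc3)                # like append(input3)
print(result)
```

The printed order is 1a, 1b, 2a, 1c, 3a: the first page of the second document lands after the second page of the output, and the third document follows at the end.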

Compress a PDF File

Sometimes we want to reduce the size of a PDF file without losing its content. We can achieve this through lossless compression, meaning that the resulting PDF looks exactly the same but has a smaller file size.

from PyPDF2 import PdfReader, PdfWriter

reader = PdfReader("example.pdf")
writer = PdfWriter()

for page in reader.pages:
    page.compress_content_streams()  # This is CPU intensive!
    writer.add_page(page)

with open("out.pdf", "wb") as f:
    writer.write(f)
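To check whether compression actually helped, the sizes of the input and output files can be compared. This is a small sketch using only the standard library; percent_saved and report_savings are hypothetical helpers, and the default file names are the ones used in the example above.

```python
from pathlib import Path


def percent_saved(original_size, new_size):
    """Return the size reduction as a percentage of the original size."""
    if original_size <= 0:
        raise ValueError("original_size must be positive")
    return (original_size - new_size) / original_size * 100


def report_savings(original_path="example.pdf", compressed_path="out.pdf"):
    """Print how much smaller the compressed file is than the original."""
    before = Path(original_path).stat().st_size
    after = Path(compressed_path).stat().st_size
    print(f"{before} -> {after} bytes ({percent_saved(before, after):.1f}% saved)")
```

A negative percentage means the output file is larger than the input.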

Encrypting and Decrypting a PDF file

File Encryption is the process of encoding data with a key/password so that it cannot be read by anyone except those who have the key. Decryption, on the other hand, is the process of taking the encrypted data and translating it back into its original form so it can be read freely.

Add a password to a PDF

from PyPDF2 import PdfReader, PdfWriter

reader = PdfReader("example.pdf")
writer = PdfWriter()

# Add all pages to the writer
for page in reader.pages:
    writer.add_page(page)

# Add a password to the new PDF
writer.encrypt("my-secret-password")

# Save the new PDF to a file
with open("encrypted-pdf.pdf", "wb") as f:
    writer.write(f)

Remove the password from a PDF

Note: You will need the password of the file you want to decrypt.

from PyPDF2 import PdfReader, PdfWriter

reader = PdfReader("encrypted-pdf.pdf")
writer = PdfWriter()

if reader.is_encrypted:
    reader.decrypt("my-secret-password")

# Add all pages to the writer
for page in reader.pages:
    writer.add_page(page)

# Save the new PDF to a file
with open("decrypted-pdf.pdf", "wb") as f:
    writer.write(f)
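decrypt() reports whether the password was accepted through its return value, which is falsy when decryption fails. The sketch below checks that result before any pages are read. unlock is a hypothetical helper, and raising ValueError is an assumption about how a caller might want to handle a wrong password.

```python
def unlock(reader, password):
    """Decrypt `reader` in place if it is encrypted; raise on a wrong password."""
    if not reader.is_encrypted:
        return reader
    # decrypt() returns a falsy value (0) when the password does not match.
    if not reader.decrypt(password):
        raise ValueError("incorrect password")
    return reader
```

In the example above, the two lines that check is_encrypted and call decrypt could be replaced with unlock(reader, "my-secret-password").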
