The difflib module in the standard library provides advanced text comparison features. We can use its tools to measure how similar two texts are and to get a summary of the differences between them.

While the module can be used with any sequences of hashable items, it is most often used for comparing text. It defines several useful classes and functions that facilitate text comparison.

Classes defined in the module

The module contains two primary classes, SequenceMatcher and Differ.

Exploring the SequenceMatcher class

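The difflib.SequenceMatcher class is designed to compare pairs of sequences whose elements are hashable. It measures the similarity between two sequences by finding their longest contiguous matching subsequence and then repeating the search on the pieces to the left and right of that match.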
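
Because the elements only need to be hashable, the same class works on lists as well as strings. The following is a minimal sketch, using two made-up lists of integers, of how the longest contiguous matching block can be inspected with find_longest_match():

```python
from difflib import SequenceMatcher

# Illustrative data: any sequences of hashable items will do
a = [1, 2, 3, 4, 5]
b = [1, 2, 4, 5, 6]

matcher = SequenceMatcher(a=a, b=b)

# Longest block of consecutive items shared by a[0:5] and b[0:5]
longest = matcher.find_longest_match(0, len(a), 0, len(b))
print(longest)

# Similarity ratio of the two lists
list_ratio = matcher.ratio()
print(list_ratio)
```

The first block found is [1, 2]. The search then continues to the right of it and finds [4, 5], which gives 4 matched items out of 10 in total.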

Get the similarity ratio between two strings

#import the difflib module
import difflib

#demo strings
S1 = "Welcome to Pynerds"
S2 = "Welcome to Python"

#Create a matcher object
matcher = difflib.SequenceMatcher(a = S1, b = S2)

#print the similarity ratio
print(matcher.ratio())

To instantiate a SequenceMatcher object, we use the following syntax:

SequenceMatcher(isjunk = None, a = '', b = '', autojunk = True)
isjunk A one-argument function that returns True for elements that should be treated as junk and ignored. It defaults to None, meaning that no element is ignored.
a The first sequence. It defaults to an empty string.
b The second sequence. It defaults to an empty string.
autojunk An optional flag that enables the automatic junk heuristic: when the second sequence is at least 200 items long, items that account for more than about 1% of it are treated as junk. This speeds up matching but can give surprising results on some inputs. It defaults to True.
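
To illustrate isjunk, here is a small sketch adapted from the example in the official difflib documentation. It treats the space character as junk, so the longest match can no longer start with a space:

```python
from difflib import SequenceMatcher

a = " abcd"
b = "abcd abcd"

# No junk function: the longest match includes the leading space
plain = SequenceMatcher(None, a, b).find_longest_match(0, len(a), 0, len(b))
print(plain)

# Spaces are junk: the match is built from the non-space characters only
skip_spaces = SequenceMatcher(lambda ch: ch == " ", a, b)
no_space = skip_spaces.find_longest_match(0, len(a), 0, len(b))
print(no_space)
```

Each result is a named tuple (a, b, size) giving the start of the match in each sequence and its length.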
Changing the comparison sequences after initialization

The class lets us change the sequences being compared without instantiating a new SequenceMatcher object. This is done with three methods.

set_seq1(a) Sets or changes the first sequence (a).
set_seq2(b) Sets or changes the second sequence (b).
set_seqs(a, b) Sets or changes both sequences (a and b).
from difflib import SequenceMatcher


S = SequenceMatcher()

S.set_seqs("Python", "Pynerds") #The values of a and b are now set to  'Python' and 'Pynerds' respectively.
print(S.ratio())

S.set_seq1("Django") #Changes the value of a from 'Python' to 'Django'
print(S.ratio())

S.set_seq2("Django") #Changes the value of b from 'Pynerds' to 'Django'
print(S.ratio())
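
These setters are handy when one string has to be compared against many others. According to the difflib documentation, SequenceMatcher caches detailed information about the second sequence, so it is cheaper to fix b once and vary a. A minimal sketch, with an illustrative candidates list:

```python
from difflib import SequenceMatcher

candidates = ["Python", "Pynerds", "Django"]  # illustrative data

matcher = SequenceMatcher()
matcher.set_seq2("Pythons")  # fixed second sequence, analysed once

scores = {}
for candidate in candidates:
    matcher.set_seq1(candidate)  # only the first sequence changes
    scores[candidate] = matcher.ratio()

# Candidate with the highest similarity ratio
best = max(scores, key=scores.get)
print(best, scores[best])
```
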
The ratio() method 

The SequenceMatcher class includes the ratio() method, which calculates the similarity ratio between two sequences. The ratio is a float between 0 and 1, with 1 representing an exact match and 0 indicating that the sequences have nothing in common.

#import the difflib module
import difflib

S1 = "Welcome to Pynerds"
S2 = "Welcome to Pynerds"
S3 = "Hello, World!"

matcher = difflib.SequenceMatcher(a = S1, b = S2)
print(matcher.ratio())

matcher.set_seq2(S3)
print(matcher.ratio())
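
The value returned by ratio() follows the formula given in the difflib documentation: 2.0 * M / T, where M is the number of matched elements and T is the total number of elements in both sequences. The sketch below recomputes it by hand from get_matching_blocks(); the names matched and total are ours:

```python
from difflib import SequenceMatcher

a = "Welcome to Pynerds"
b = "Welcome to Python"
matcher = SequenceMatcher(a=a, b=b)

# M: number of elements covered by the matching blocks
matched = sum(block.size for block in matcher.get_matching_blocks())

# T: total number of elements in both sequences
total = len(a) + len(b)

manual_ratio = 2.0 * matched / total
print(manual_ratio, matcher.ratio())
```

For these two strings the matching blocks are "Welcome to Py" and "n", so M is 14 and T is 35.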
The quick_ratio() method

The ratio() method is relatively slow when large sequences are involved. The quick_ratio() method is faster: instead of aligning the sequences, it only counts the elements they have in common, regardless of order, and returns an upper bound on the similarity ratio. Its result is therefore never lower than that of ratio(), but it can be noticeably higher.

#import the difflib module
import difflib

S1 = "Welcome to Pynerds"
S2 = "Welcome to Python"
S3 = "Hello, World!"

matcher = difflib.SequenceMatcher(a = S1, b = S2)
print(matcher.quick_ratio())

matcher.set_seq2(S3)
print(matcher.quick_ratio())
The real_quick_ratio() method

This method is quicker still than both ratio() and quick_ratio(). It also returns an upper bound on the ratio, but computes it from the lengths of the two sequences alone, which makes it the loosest estimate of the three.

#import the difflib module
import difflib


S1 = "Welcome to Pynerds"
S2 = "Welcome to Python"
S3 = "Hello, World!"

matcher = difflib.SequenceMatcher(a = S1, b = S2)
print(matcher.real_quick_ratio())

matcher.set_seqs(S2, S3)
print(matcher.real_quick_ratio())
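
To see how the three methods relate, the sketch below runs them on one pair of made-up strings. The strings share three letters, but in a different order, so the two estimates come out well above the true ratio:

```python
from difflib import SequenceMatcher

# Illustrative strings: 'b', 'c' and 'd' are shared, but in a different order
matcher = SequenceMatcher(a="abcd", b="dcbzzz")

exact = matcher.ratio()                  # full alignment of the two sequences
quick = matcher.quick_ratio()            # counts shared elements, ignoring order
real_quick = matcher.real_quick_ratio()  # looks at the two lengths only

print(exact, quick, real_quick)
```

A common pattern is to call the cheaper methods first and fall back to ratio() only when they exceed the threshold of interest.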

There are more methods defined on the class. You can use the builtin help() function to learn more about each method and its usage.

Where can you apply the SequenceMatcher class?
  • Comparing two text documents to identify similarities and differences. This can be useful, for example, in detecting plagiarism.
  • Determining the distance between two strings for spelling correction.
  • Finding approximate matches in a large list of strings.
  • Identifying the closest matches between two sequences in a given dataset.
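
For the first use case, the get_opcodes() method describes how to turn the first sequence into the second. A short sketch, using the sample strings from the official difflib documentation:

```python
from difflib import SequenceMatcher

a = "qabxcd"
b = "abycdf"

matcher = SequenceMatcher(None, a, b)

# Each opcode is a tuple (tag, i1, i2, j1, j2) relating a[i1:i2] to b[j1:j2]
opcodes = matcher.get_opcodes()

for tag, i1, i2, j1, j2 in opcodes:
    print(f"{tag:7} a[{i1}:{i2}] ({a[i1:i2]!r}) -> b[{j1}:{j2}] ({b[j1:j2]!r})")
```

The tag is one of 'replace', 'delete', 'insert' or 'equal'.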

Exploring the Differ class

While SequenceMatcher compares two sequences and returns a similarity measurement, the difflib.Differ class does the complementary job: it examines sequences of lines of text and produces human-readable differences between them.

The class actually uses the SequenceMatcher class for the comparisons.

#import the Differ class
from difflib import Differ

#sample Texts
text1 = "Half a pound of tuppenny rice".splitlines()
text2 = "Half a pound of treacle".splitlines()

#Instantiate a Differ object
d = Differ()

#Compare the two texts
result = list(d.compare(text1, text2))

#print the results
print('\n'.join(result))

- Half a pound of tuppenny rice
?                  ^^^ ^^^^^^
+ Half a pound of treacle
?                  ^ ^ +

How the compare() method works

The compare() method compares two sequences of lines of text and produces the differences between them in an easy-to-read format. It is the only public method defined by the Differ class.

The method first compares the two sequences line by line. When a line in one sequence closely resembles a line in the other, it then compares those two lines character by character. The result is yielded as a series of human-readable diff lines.

Each diff line starts with a two-character tag ('- ', '+ ', '  ' or '? ') that tells where the line comes from.

'- ' marks a line found only in the first sequence, '+ ' marks a line found only in the second sequence, and '  ' (two spaces) marks a line common to both. Lines starting with '? ' are present in neither input. They sit under a '-' or '+' line and point at the characters that differ: '^' marks a replaced character, '-' a deleted one and '+' an inserted one.

#import the Differ class
from difflib import Differ

#sample Texts
text1 = """Half a pound of tuppenny rice,
Pop goes the weasel!
""".splitlines(keepends = True)

text2 = """Half a pound of treacle,
Pop! Goes the weasel!
""".splitlines(keepends = True)

#Instantiate a Differ object
d = Differ()

#Compare the two texts
result = list(d.compare(text1, text2))

#print the results
print(''.join(result), end = '')

- Half a pound of tuppenny rice,
?                  ^^^ ^^^^^^
+ Half a pound of treacle,
?                  ^ ^ +
- Pop goes the weasel!
?     ^
+ Pop! Goes the weasel!
?    + ^
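
The examples above contain only changed lines. As a complementary sketch, with made-up input, the following shows how unchanged, removed and added lines are tagged in a diff that needs no character-level hints:

```python
from difflib import Differ

# Illustrative input: lines keep their trailing newline
before = ["one\n", "two\n", "three\n"]
after = ["one\n", "three\n", "four\n"]

# compare() yields the diff lines lazily, so collect them into a list
diff = list(Differ().compare(before, after))

print("".join(diff), end="")
```

Lines shared by both inputs start with two spaces, "two" is reported with '- ' and "four" with '+ '.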

Functions defined in the module

The module also defines some standalone utility functions.

get_close_matches()

This function is used to find the close matches of the given word from a collection of possibilities in an iterable.

difflib.get_close_matches(word, possibilities, n=3, cutoff=0.6)
word The string that you want to find close matches for.
possibilities An iterable containing the strings to compare the specified word against.
n The maximum number of close matches to return. It defaults to 3.
cutoff An optional value from 0 to 1 specifying the minimum similarity score a possibility must reach to count as a match. A lower cutoff will result in more possible matches being returned. It defaults to 0.6.

The function compares word against the possibilities and returns a list of no more than n "good enough" matches, ordered from most to least similar, as per the given cutoff value.

import difflib

matches = difflib.get_close_matches("appel", ["ape", "apple", "peach", "puppy"])

print(matches)
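
The optional arguments can be explored with the same word list. In this sketch, n=1 keeps only the best match, while a stricter cutoff of 0.9 (an arbitrary value chosen for illustration) rejects every candidate:

```python
from difflib import get_close_matches

words = ["ape", "apple", "peach", "puppy"]

# Default settings: at most 3 matches scoring 0.6 or more
default_matches = get_close_matches("appel", words)
print(default_matches)

# Keep only the single best match
best_only = get_close_matches("appel", words, n=1)
print(best_only)

# Stricter cutoff: no candidate is similar enough
strict = get_close_matches("appel", words, cutoff=0.9)
print(strict)
```
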

A final example:

import difflib

word_list = ["Clothes", "Clothed", "Closely", "Closure", "Closable", "Closeness", "Closer"]

#Get up to 5 of the closest words to "Close"
print(difflib.get_close_matches('Close', word_list, n = 5))