The filecmp module in the standard library contains useful functions for comparing files and directories.

The module must be imported before its functions can be used.

Example:

# import the filecmp module

import filecmp

print(filecmp)
Output:
<module 'filecmp' from '/app/.heroku/python/lib/python3.11/filecmp.py'>

Compare Two Files

The cmp() function is the basic tool offered by the filecmp module. It checks whether two files have the same basic properties and can be considered duplicates.

The cmp() function has the following syntax:

Syntax:
filecmp.cmp(file1, file2, shallow=True)
file1 Name of the first file.
file2 Name of the second file.
shallow Optional. Indicates whether to compare only the files' metadata (os.stat) instead of their entire contents. It defaults to True.

The cmp() function returns True if file1 and file2 are considered equal and False otherwise. If shallow is True, files whose os.stat() signatures (file type, size and modification time) are identical are regarded as equal without their contents being read. If shallow is False, the contents of the files are compared byte by byte.

Consider the following working directory:

wdir
├── main.py
├── test_file1.txt
└── test_file2.txt
Example:
#main.py

import filecmp

f1 = 'test_file1.txt'
f2 = 'test_file2.txt'

# cmp() lives in the filecmp module, so it must be called through it
print(filecmp.cmp(f1, f2))
Output:

True

If the cmp() function returns True with the default shallow=True, it means that the two files, test_file1.txt and test_file2.txt, have the same os.stat() signature (file type, size and modification time) or, failing that, the same contents. In other words, calling os.stat() on both files would typically return matching values for those fields.

Example:
import os

f1 = 'test_file1.txt'
f2 = 'test_file2.txt'

print('f1: ', os.stat(f1))
print('f2: ', os.stat(f2))
Output:

f1: os.stat_result(st_mode=33206, st_ino=2251799814338969, st_dev=9704861891122717451, st_nlink=1, st_uid=0, st_gid=0, st_size=0, st_atime=1711061338, st_mtime=1711061338, st_ctime=1705709008)
f2: os.stat_result(st_mode=33206, st_ino=2251799814338969, st_dev=9704861891122717451, st_nlink=1, st_uid=0, st_gid=0, st_size=0, st_atime=1711061338, st_mtime=1711061338, st_ctime=1705709008)
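The fields that matter for a shallow comparison can be isolated from the full os.stat() result. The following is a minimal sketch under stated assumptions: stat_signature() is a hypothetical helper written for illustration (it is not part of filecmp), and the files are created in a temporary directory with a fixed, arbitrary modification time so that the result is reproducible.

```python
import os
import stat
import tempfile


def stat_signature(path):
    """Return (file type, size, modification time) for path.

    Hypothetical helper: it picks out the fields that a shallow
    comparison is documented to look at.
    """
    st = os.stat(path)
    return (stat.S_IFMT(st.st_mode), st.st_size, st.st_mtime)


with tempfile.TemporaryDirectory() as tmp:
    f1 = os.path.join(tmp, 'test_file1.txt')
    f2 = os.path.join(tmp, 'test_file2.txt')
    for path in (f1, f2):
        with open(path, 'w') as fh:
            fh.write('same text')
        # Force an identical (arbitrary) access and modification time.
        os.utime(path, (1_700_000_000, 1_700_000_000))

    sig1 = stat_signature(f1)
    sig2 = stat_signature(f2)

print(sig1 == sig2)
```

Fields such as st_ino or st_atime are not part of this signature, which is why two distinct files can still count as equal in a shallow comparison.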

When the shallow parameter is True, as in the previous example, the cmp() function returns True if the files are identical in terms of their basic properties as given by os.stat(). This suffices in most cases but does not guarantee that the contents are identical.

To check for exact content equality, set the shallow parameter to False. The cmp() function then compares the contents of the two files byte by byte, so even small differences are detected. The comparison can take longer because the files have to be read.

Example:
import filecmp

f1 = 'test_file1.txt'
f2 = 'test_file2.txt'

# shallow=False forces a byte-by-byte comparison of the contents
print(filecmp.cmp(f1, f2, shallow=False))
Output:

False
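The difference between the two modes can be reproduced without any pre-existing files. This is a sketch under stated assumptions: the two files are created in a temporary directory, given the same size but different contents, and their modification times are forced to the same arbitrary value with os.utime() so that their os.stat() signatures match.

```python
import filecmp
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    f1 = os.path.join(tmp, 'test_file1.txt')
    f2 = os.path.join(tmp, 'test_file2.txt')

    # Same size (4 bytes) but different contents.
    with open(f1, 'w') as fh:
        fh.write('aaaa')
    with open(f2, 'w') as fh:
        fh.write('bbbb')

    # Give both files the same modification time.
    for path in (f1, f2):
        os.utime(path, (1_700_000_000, 1_700_000_000))

    shallow_result = filecmp.cmp(f1, f2)               # shallow=True by default
    deep_result = filecmp.cmp(f1, f2, shallow=False)   # reads the contents

print(shallow_result)  # True: type, size and mtime all match
print(deep_result)     # False: the bytes differ
```

When a false positive like this would be costly, for example before deleting a supposed duplicate, shallow=False is the safer choice.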

Compare files across two directories

The cmpfiles() function compares a set of files across two directories non-recursively. It has the following syntax:

Syntax:
filecmp.cmpfiles(dir1, dir2, common, shallow=True)

where dir1 and dir2 are the names of the two directories and common is a list of the file names to be compared in both directories. As with cmp(), shallow defaults to True.

The function returns a tuple of three lists:

  1. match - A list of files that are in both directories and have the same properties (can be considered identical).
  2. mismatch - A list of files that are in both directories but have differing properties.
  3. errors - A list of files that could not be compared, for example because they do not exist in one of the directories or the user lacks permission to read them.

Consider the following working directory structure.

wdir
├── main.py
├── first_dir
│   ├── hello.txt  
│   ├── demo.txt
│   ├── data.txt
│   └── simple.txt
└── second_dir
    ├── hello.txt 
    ├── demo.txt
    └── data.txt
Example:

#main.py

import filecmp

common_files = ['hello.txt', 'demo.txt', 'data.txt', 'simple.txt']
match, mismatch, errors = filecmp.cmpfiles('first_dir', 'second_dir', common=common_files)

print('Matched: ', match)
print('Mismatch: ', mismatch)
print('Errors: ', errors)
Output:

Matched: ['hello.txt', 'data.txt']
Mismatch: ['demo.txt']
Errors:  ['simple.txt'] 

The output above means that hello.txt and data.txt exist in both first_dir and second_dir and have identical properties. demo.txt exists in both directories, but its properties differ. simple.txt could not be compared; in this case the reason is that it does not exist in second_dir.

Typing the file names manually, as we did previously, is tedious and sometimes impractical if the two directories contain many files. In such a case we can use os.listdir() together with os.path.isfile() to collect the file names from both directories programmatically. Note that the example takes the union of the two sets of names, so a file that exists in only one directory, such as simple.txt, is still passed to cmpfiles() and ends up in the errors list.

Example:
import filecmp, os

d1_files = {f for f in os.listdir('first_dir') 
            if os.path.isfile(os.path.join('first_dir',f))
            }
d2_files = {f for f in os.listdir('second_dir')
            if os.path.isfile(os.path.join('second_dir',f)) 
           }

common_files = d1_files | d2_files

print('Common: ', common_files)

match, mismatch, errors = filecmp.cmpfiles('first_dir', 'second_dir', common = common_files )

print('Matched: ', match)
print('Mismatch: ',  mismatch)
print('Errors: ', errors)
Output:

Common: ['hello.txt', 'demo.txt', 'data.txt', 'simple.txt']
Matched: ['hello.txt', 'data.txt']
Mismatch: ['demo.txt']
Errors:  ['simple.txt']  
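If only the files that are actually present in both directories should be compared, a set intersection (&) can be used in place of the union. The following is a minimal sketch under stated assumptions: regular_files() is a hypothetical helper, the directory layout is recreated in a temporary directory with made-up file contents, and shallow=False is passed so that the result depends only on the contents.

```python
import filecmp
import os
import tempfile


def regular_files(directory):
    """Return the names of regular files directly inside directory."""
    return {name for name in os.listdir(directory)
            if os.path.isfile(os.path.join(directory, name))}


with tempfile.TemporaryDirectory() as tmp:
    first_dir = os.path.join(tmp, 'first_dir')
    second_dir = os.path.join(tmp, 'second_dir')
    os.mkdir(first_dir)
    os.mkdir(second_dir)

    # hello.txt is identical, demo.txt differs, simple.txt is only in first_dir.
    layout = {
        first_dir: {'hello.txt': 'hello', 'demo.txt': 'short', 'simple.txt': 'x'},
        second_dir: {'hello.txt': 'hello', 'demo.txt': 'a longer text'},
    }
    for directory, files in layout.items():
        for name, text in files.items():
            with open(os.path.join(directory, name), 'w') as fh:
                fh.write(text)

    # Intersection: only names found in both directories.
    common_files = sorted(regular_files(first_dir) & regular_files(second_dir))
    match, mismatch, errors = filecmp.cmpfiles(
        first_dir, second_dir, common_files, shallow=False)

print('Common: ', common_files)
print('Matched: ', match)
print('Mismatch: ', mismatch)
print('Errors: ', errors)
```

With the intersection, simple.txt is never passed to cmpfiles(), so the errors list is reserved for files that exist on both sides but could not be read.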

Compare Directories

The filecmp module contains the dircmp class for comparing two directories. The class provides a more detailed report of the differences between the two directories, including sub-directories and files that are present in one directory but not the other.

We start by creating a dircmp() object. The constructor has the following syntax:

Syntax:
filecmp.dircmp(dir1, dir2, ignore=None, hide=None)
dir1 Name of the first directory.
dir2 Name of the second directory.
ignore A list of names to be ignored during the comparison. Defaults to filecmp.DEFAULT_IGNORES.
hide A list of names that will not be displayed in the output. Defaults to [os.curdir, os.pardir].
Example:
#main.py

import filecmp

cmp_result = filecmp.dircmp('first_dir', 'second_dir')
cmp_result.report()
Output:

diff first_dir second_dir
Only in first_dir : ['simple.txt']
Identical files : ['data.txt', 'hello.txt']
Differing files : ['demo.txt']

As shown above, the report() method prints a summary of the comparison between the two directories: names found in only one of them, identical files, differing files, common subdirectories, and so on.

dircmp objects have attributes that let you retrieve information about the compared directories and their contents. Some of the attributes are listed below:
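The effect of the ignore parameter can be sketched as follows. The example assumes a throwaway layout in a temporary directory; scratch.tmp is a made-up file name that exists only in the first directory.

```python
import filecmp
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    first_dir = os.path.join(tmp, 'first_dir')
    second_dir = os.path.join(tmp, 'second_dir')
    os.mkdir(first_dir)
    os.mkdir(second_dir)

    # scratch.tmp exists only in first_dir.
    for directory, names in ((first_dir, ['hello.txt', 'scratch.tmp']),
                             (second_dir, ['hello.txt'])):
        for name in names:
            with open(os.path.join(directory, name), 'w') as fh:
                fh.write('hello')

    plain = filecmp.dircmp(first_dir, second_dir)
    filtered = filecmp.dircmp(first_dir, second_dir, ignore=['scratch.tmp'])

    # Attributes are computed on first access, so read them
    # while the directories still exist.
    plain_left_only = plain.left_only
    filtered_left_only = filtered.left_only

print(plain_left_only)     # ['scratch.tmp']
print(filtered_left_only)  # []
```

Passing a custom ignore list replaces the default one, so names from filecmp.DEFAULT_IGNORES need to be added explicitly if they should still be skipped.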

common A list of files and directories that are in both dir1 and dir2
common_dirs A list of directories that are in both dir1 and dir2
common_files A list of files that are in both dir1 and dir2
same_files A list of identical files.
diff_files A list of differing files
left_list, right_list All files and directories in dir1 (left_list) and dir2 (right_list).
left_only, right_only Files and directories that are only in dir1 (left_only) or only in dir2 (right_only).
Example:
#main.py

import filecmp

cmp_result = filecmp.dircmp('first_dir', 'second_dir')
print(cmp_result.common)
print(cmp_result.left_only)
print(cmp_result.right_only)
Output:

['hello.txt', 'demo.txt', 'data.txt']
['simple.txt']
[]
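The same_files and diff_files attributes can be read in the same way. Below is a small self-contained sketch; the directory layout and file contents are assumptions made up for the example, and the lists are sorted because the order of the attribute values is not something the example relies on.

```python
import filecmp
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    first_dir = os.path.join(tmp, 'first_dir')
    second_dir = os.path.join(tmp, 'second_dir')
    os.mkdir(first_dir)
    os.mkdir(second_dir)

    # hello.txt is identical, demo.txt differs in size and content,
    # simple.txt exists only in first_dir.
    layout = {
        first_dir: {'hello.txt': 'hello', 'demo.txt': 'short', 'simple.txt': 'x'},
        second_dir: {'hello.txt': 'hello', 'demo.txt': 'a longer text'},
    }
    for directory, files in layout.items():
        for name, text in files.items():
            with open(os.path.join(directory, name), 'w') as fh:
                fh.write(text)

    cmp_result = filecmp.dircmp(first_dir, second_dir)

    # Read the attributes while the directories still exist.
    common = sorted(cmp_result.common)
    same_files = sorted(cmp_result.same_files)
    diff_files = sorted(cmp_result.diff_files)
    left_only = cmp_result.left_only

print(common)      # ['demo.txt', 'hello.txt']
print(same_files)  # ['hello.txt']
print(diff_files)  # ['demo.txt']
print(left_only)   # ['simple.txt']
```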

Reports on sub-directories

The dircmp object provides two methods that also report on sub-directories, report_partial_closure() and report_full_closure().

The report_partial_closure() method prints a report on the comparison between dir1 and dir2 as well as on their immediate common sub-directories.

The report_full_closure() method is like report_partial_closure() but fully recursive, meaning that it also reports on all common sub-directories of dir1 and dir2, not just the immediate ones.

Example:
#main.py

import filecmp

cmp_result = filecmp.dircmp('first_dir', 'second_dir')
cmp_result.report_full_closure()
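The report methods only print their findings. To collect the results of a recursive comparison in a list instead, the subdirs attribute of a dircmp object (a dictionary mapping each common sub-directory name to its own dircmp object) can be walked manually. The following is a sketch under stated assumptions: collect_diff_files() is a hypothetical helper, and the nested layout is made up and created in a temporary directory.

```python
import filecmp
import os
import tempfile


def collect_diff_files(dcmp, prefix=''):
    """Recursively gather relative paths of differing files (hypothetical helper)."""
    found = [prefix + name for name in sorted(dcmp.diff_files)]
    for sub_name in sorted(dcmp.subdirs):
        sub_dcmp = dcmp.subdirs[sub_name]
        found.extend(collect_diff_files(sub_dcmp, prefix + sub_name + '/'))
    return found


with tempfile.TemporaryDirectory() as tmp:
    first_dir = os.path.join(tmp, 'first_dir')
    second_dir = os.path.join(tmp, 'second_dir')

    # Both trees contain hello.txt and nested/deep.txt; only deep.txt differs.
    for root, text in ((first_dir, 'short'), (second_dir, 'a longer text')):
        os.makedirs(os.path.join(root, 'nested'))
        with open(os.path.join(root, 'hello.txt'), 'w') as fh:
            fh.write('hello')
        with open(os.path.join(root, 'nested', 'deep.txt'), 'w') as fh:
            fh.write(text)

    cmp_result = filecmp.dircmp(first_dir, second_dir)
    differing = collect_diff_files(cmp_result)

print(differing)  # ['nested/deep.txt']
```

Returning a list rather than printing makes the result usable in further processing, for example to copy or log the differing files.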