The filecmp module in the standard library contains useful functions for comparing files and directories.
We first have to import the module before using it.
# import the filecmp module
import filecmp
print(filecmp)
Compare Two Files
The cmp() function is the basic tool offered by the filecmp module. It checks whether two files have the same basic properties and can be considered duplicates.
The cmp() function has the following syntax:
filecmp.cmp(file1, file2, shallow=True)
file1 | Name of the first file.
file2 | Name of the second file.
shallow | Optional. Indicates whether the comparison may rely on the files' metadata (os.stat()) alone instead of reading their entire contents. It defaults to True.
The cmp() function returns True if file1 and file2 are considered equal and False otherwise. If the shallow parameter is True, the files are regarded as equal as soon as their os.stat() signatures (file type, size, and modification time) are identical, and their contents are not read. If the signatures differ, or if the shallow parameter is False, the files are treated as different when their sizes differ; otherwise their contents are compared byte by byte.
Consider the following working directory:
wdir
├── main.py
├── test_file1.txt
└── test_file2.txt
# main.py
import filecmp
f1 = 'test_file1.txt'
f2 = 'test_file2.txt'
# cmp() must be called through the module it was imported with
print(filecmp.cmp(f1, f2))
True
If the cmp() function returns True with the default shallow comparison, the two files, test_file1.txt and test_file2.txt, either have identical os.stat() signatures (the same file type, size, and modification time) or have the same size and contents. You can inspect the metadata of both files by calling the os.stat() function on them.
import os
f1 = 'test_file1.txt'
f2 = 'test_file2.txt'
print('f1: ', os.stat(f1))
print('f2: ', os.stat(f2))
f1: os.stat_result(st_mode=33206, st_ino=2251799814338969, st_dev=9704861891122717451, st_nlink=1, st_uid=0, st_gid=0, st_size=0, st_atime=1711061338, st_mtime=1711061338, st_ctime=1705709008)
f2: os.stat_result(st_mode=33206, st_ino=2251799814338969, st_dev=9704861891122717451, st_nlink=1, st_uid=0, st_gid=0, st_size=0, st_atime=1711061338, st_mtime=1711061338, st_ctime=1705709008)
When the shallow parameter is True, as in the previous example, the cmp() function returns True if the files are identical in terms of their basic properties as given by os.stat(). This suffices in most cases but does not guarantee that the contents are identical.
To check for actual content equality, you need to set the shallow parameter to False. This way, the cmp() function compares the contents of the two files byte by byte. This ensures that even small differences in the contents of the files are detected, so the comparison is more accurate. It also means that the comparison takes longer to execute because the files have to be read.
import filecmp
f1 = 'test_file1.txt'
f2 = 'test_file2.txt'
print(filecmp.cmp(f1, f2, shallow=False))
False
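The difference between the two modes can be reproduced in isolation. The following is a minimal sketch, not part of the original example: the file names a.txt and b.txt and the fixed timestamp are assumptions chosen for illustration. It creates two files with the same size and modification time but different contents, so the shallow comparison accepts them while the byte-by-byte comparison does not.

```python
import filecmp
import os
import tempfile

# Hypothetical demo files created in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    f1 = os.path.join(tmp, "a.txt")
    f2 = os.path.join(tmp, "b.txt")

    # Same length, different bytes.
    with open(f1, "w") as fh:
        fh.write("hello")
    with open(f2, "w") as fh:
        fh.write("world")

    # Give both files the same access/modification time so that their
    # os.stat() signatures (type, size, mtime) are identical.
    fixed_time = 1_700_000_000  # arbitrary timestamp, assumed for the demo
    os.utime(f1, (fixed_time, fixed_time))
    os.utime(f2, (fixed_time, fixed_time))

    shallow_result = filecmp.cmp(f1, f2, shallow=True)   # signatures only
    deep_result = filecmp.cmp(f1, f2, shallow=False)     # reads the contents

print("shallow:", shallow_result)
print("deep:", deep_result)
```

With these inputs the shallow call reports the files as equal and the deep call reports them as different, which is why shallow=False is the safer choice when the contents matter.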
Compare files across two directories
The cmpfiles() function compares a set of files across two directories non-recursively. It has the following syntax:
filecmp.cmpfiles(dir1, dir2, common, shallow=True)
Where dir1 and dir2 are the names of the two directories and common is a list of the names of the files to be compared, which are expected to exist in both directories.
The function returns a tuple of three lists:
match - A list of files that are in both directories and have the same properties (can be considered identical).
mismatch - A list of files that are in both directories but have differing properties.
errors - A list of files that could not be compared, for example because they do not exist in one of the directories or cannot be read due to missing permissions.
Consider the following working directory structure.
wdir
├── main.py
├── first_dir
│ ├── hello.txt
│ ├── demo.txt
│ ├── data.txt
│ └── simple.txt
└── second_dir
  ├── hello.txt
  ├── demo.txt
  └── data.txt
#main.py
import filecmp
common_files = ['hello.txt', 'demo.txt', 'data.txt', 'simple.txt']
match, mismatch, errors = filecmp.cmpfiles('first_dir', 'second_dir', common = common_files )
print('Matched: ', match)
print('Mismatch: ', mismatch)
print('Errors: ', errors)
Matched: ['hello.txt', 'data.txt']
Mismatch: ['demo.txt']
Errors: ['simple.txt']
In the above example, the files hello.txt and data.txt exist in both first_dir and second_dir and have identical properties. demo.txt exists in both directories, but its two copies have different properties. simple.txt could not be compared; in this case, we know the reason is that it does not exist in second_dir.
Typing the names of the files manually, as we did previously, can be tedious and sometimes impractical if the two directories are large and contain many files. In such cases, we can use the os.listdir() function alongside os.path.isfile() to get the names of the files in both directories programmatically, as shown below:
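import filecmp, os
# names of the regular files directly inside each directory
d1_files = {f for f in os.listdir('first_dir')
            if os.path.isfile(os.path.join('first_dir', f))
}
d2_files = {f for f in os.listdir('second_dir')
            if os.path.isfile(os.path.join('second_dir', f))
}
# union of both sets: a name missing from one directory ends up in errors
# (use & instead of | to keep only the names present in both directories)
# a set has no fixed order, so the order of the names may vary between runs
common_files = list(d1_files | d2_files)
print('Common: ', common_files)
match, mismatch, errors = filecmp.cmpfiles('first_dir', 'second_dir', common=common_files)
print('Matched: ', match)
print('Mismatch: ', mismatch)
print('Errors: ', errors)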
Common: ['hello.txt', 'demo.txt', 'data.txt', 'simple.txt']
Matched: ['hello.txt', 'data.txt']
Mismatch: ['demo.txt']
Errors: ['simple.txt']
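If only the files that really exist in both directories should be compared, the two sets of names can be intersected instead. The following is a minimal, self-contained sketch under stated assumptions: the helper regular_files() and the file contents are hypothetical, and the directories are created in a temporary location so the snippet can run anywhere. It also passes shallow=False so that the contents are compared.

```python
import filecmp
import os
import tempfile


def regular_files(directory):
    """Return the names of the regular files directly inside `directory`."""
    return {
        name
        for name in os.listdir(directory)
        if os.path.isfile(os.path.join(directory, name))
    }


with tempfile.TemporaryDirectory() as tmp:
    d1 = os.path.join(tmp, "first_dir")
    d2 = os.path.join(tmp, "second_dir")
    os.mkdir(d1)
    os.mkdir(d2)

    # (directory, file name, content) triples used to build the demo layout
    for directory, name, text in [
        (d1, "hello.txt", "hi"),
        (d2, "hello.txt", "hi"),
        (d1, "demo.txt", "version one"),
        (d2, "demo.txt", "version two, longer"),
        (d1, "simple.txt", "only here"),
    ]:
        with open(os.path.join(directory, name), "w") as fh:
            fh.write(text)

    # Intersection: only names present in both directories are compared.
    common_files = sorted(regular_files(d1) & regular_files(d2))
    match, mismatch, errors = filecmp.cmpfiles(d1, d2, common_files, shallow=False)

print("Common:", common_files)
print("Matched:", match)
print("Mismatch:", mismatch)
print("Errors:", errors)
```

Because simple.txt is filtered out before the call, it no longer appears in the errors list; whether that is desirable depends on whether missing files should be reported or silently skipped.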
Compare Directories
The filecmp module contains the dircmp class for comparing two directories. The class provides a more detailed report of the differences between the two directories, including sub-directories and files that are present in one directory but not the other.
We start by creating a dircmp object. The constructor has the following syntax:
filecmp.dircmp(dir1, dir2, ignore=None, hide=None)
dir1 | Name of the first directory.
dir2 | Name of the second directory.
ignore | A list of names to be ignored during the comparison. Defaults to filecmp.DEFAULT_IGNORES.
hide | A list of names that will not be displayed in the output. Defaults to [os.curdir, os.pardir].
#main.py
import filecmp
cmp_result = filecmp.dircmp('first_dir', 'second_dir')
cmp_result.report()
diff first_dir second_dir
Only in first_dir : ['simple.txt']
Identical files : ['data.txt', 'hello.txt']
Differing files : ['demo.txt']
As shown above, the report() method prints a summary of the comparison between the two directories, such as the files found in only one of them, the identical files, the differing files, and the common sub-directories.
dircmp objects have attributes that let you retrieve information about the compared directories and their contents. Some of the attributes are shown in the following table:
common | A list of files and directories that are in both dir1 and dir2.
common_dirs | A list of directories that are in both dir1 and dir2.
common_files | A list of files that are in both dir1 and dir2.
same_files | A list of files that are identical in both dir1 and dir2.
diff_files | A list of files that are in both dir1 and dir2 but differ.
left_list, right_list | All files and directories in dir1 (left_list) and dir2 (right_list).
left_only, right_only | Files and directories that are only in dir1 (left_only) or only in dir2 (right_only).
#main.py
import filecmp
cmp_result = filecmp.dircmp('first_dir', 'second_dir')
print(cmp_result.common)
print(cmp_result.left_only)
print(cmp_result.right_only)
['hello.txt', 'demo.txt', 'data.txt']
['simple.txt']
[]
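The attributes from the table can also be explored on a throwaway directory layout. The sketch below is illustrative only: the write() helper and the file contents are assumptions, and the two copies of demo.txt are deliberately given different sizes so that the default (shallow) comparison of dircmp reliably reports them as different.

```python
import filecmp
import os
import tempfile


def write(path, text):
    """Hypothetical helper: create a small text file."""
    with open(path, "w") as fh:
        fh.write(text)


with tempfile.TemporaryDirectory() as tmp:
    d1 = os.path.join(tmp, "first_dir")
    d2 = os.path.join(tmp, "second_dir")
    os.mkdir(d1)
    os.mkdir(d2)

    write(os.path.join(d1, "hello.txt"), "hi")
    write(os.path.join(d2, "hello.txt"), "hi")
    write(os.path.join(d1, "demo.txt"), "short")
    write(os.path.join(d2, "demo.txt"), "a much longer text")
    write(os.path.join(d1, "simple.txt"), "only in first_dir")

    result = filecmp.dircmp(d1, d2)
    # dircmp computes its attributes lazily, so read them while the
    # temporary directories still exist.
    summary = {
        "common": sorted(result.common),
        "same_files": sorted(result.same_files),
        "diff_files": sorted(result.diff_files),
        "left_only": sorted(result.left_only),
        "right_only": sorted(result.right_only),
    }

for name, value in summary.items():
    print(f"{name}: {value}")
```

Sorting each list keeps the printed order stable, since the order in which the operating system lists directory entries is not guaranteed.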
Reports on sub-directories
The dircmp object provides two methods for reporting on sub-directories: report_partial_closure() and report_full_closure().
The report_partial_closure() method displays a report on the differences between dir1 and dir2 as well as a report on their common immediate sub-directories.
The report_full_closure() method is like report_partial_closure() but fully recursive, meaning that it also reports on all common sub-directories of dir1 and dir2 at every level, not just the immediate ones.
#main.py
import filecmp
cmp_result = filecmp.dircmp('first_dir', 'second_dir')
cmp_result.report_full_closure()
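The report methods only print their findings. When the differing files are needed as data, one option is to walk the subdirs attribute of the dircmp object, a dictionary that maps each common sub-directory name to its own dircmp object. This attribute is not covered in the table above, and the function collect_diff_files() and the directory layout below are hypothetical names used for illustration; treat this as a sketch rather than the only way to do it.

```python
import filecmp
import os
import tempfile


def collect_diff_files(dcmp, prefix=""):
    """Recursively gather relative paths of differing files from a dircmp object."""
    found = [os.path.join(prefix, name) for name in dcmp.diff_files]
    # subdirs maps each common sub-directory name to a nested dircmp object
    for sub_name, sub_dcmp in dcmp.subdirs.items():
        found.extend(collect_diff_files(sub_dcmp, os.path.join(prefix, sub_name)))
    return sorted(found)


def write(path, text):
    """Hypothetical helper: create a small text file."""
    with open(path, "w") as fh:
        fh.write(text)


with tempfile.TemporaryDirectory() as tmp:
    d1 = os.path.join(tmp, "first_dir")
    d2 = os.path.join(tmp, "second_dir")
    for root in (d1, d2):
        os.makedirs(os.path.join(root, "nested"))

    # Differing files have different sizes so the default comparison detects them.
    write(os.path.join(d1, "top.txt"), "one")
    write(os.path.join(d2, "top.txt"), "one plus more")
    write(os.path.join(d1, "nested", "deep.txt"), "abc")
    write(os.path.join(d2, "nested", "deep.txt"), "abcdef")
    write(os.path.join(d1, "nested", "same.txt"), "same")
    write(os.path.join(d2, "nested", "same.txt"), "same")

    differing = collect_diff_files(filecmp.dircmp(d1, d2))

print(differing)
```

The returned paths are relative to the two compared directories, so each one can be joined with either dir1 or dir2 to locate the corresponding copy.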