glob is one of the modules in Python's standard library. It is primarily used to find all pathnames that match a specified pattern. For example, instead of writing custom code to find all files with a certain extension, you can use glob to do this quickly and easily.
The module supports wildcards and patterns as defined by UNIX shell rules.
The primary tool offered by the module is the glob.glob() function, which returns a list of pathnames matching a specified pattern. The function has the following syntax.
glob.glob(pattern, recursive=False, include_hidden=False)
pattern | The pattern to be matched; it may contain simple wildcards. (The official signature names this parameter pathname.)
recursive | If True, the pattern '**' matches any files and zero or more directories and subdirectories.
include_hidden | If True, the '**' pattern also matches hidden directories (available from Python 3.11).
The function returns a list containing the paths that matched the pattern.
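As a minimal sketch of the return value, the snippet below runs glob.glob() in a throwaway directory. The file names notes.txt and script.py are hypothetical and exist only for this illustration.

```python
import glob
import os
import tempfile

# Work in a throwaway directory so the example has no side effects.
with tempfile.TemporaryDirectory() as tmp:
    # Two hypothetical files: one matches *.txt, the other does not.
    for name in ("notes.txt", "script.py"):
        open(os.path.join(tmp, name), "w").close()

    txt_files = glob.glob(os.path.join(tmp, "*.txt"))
    csv_files = glob.glob(os.path.join(tmp, "*.csv"))

# glob() returns a list in both cases; it is simply empty when nothing matches.
print(type(txt_files).__name__, len(txt_files))  # list 1
print(csv_files)                                 # []
```

Because an unmatched pattern gives an empty list rather than an error, the result can be looped over without a separate existence check.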
For illustration purposes, the following sections work with the glob module on the assumption that the following files and directories exist in the current working directory.
project
project/file.py
project/file1.py
project/file2.py
project/file3.py
project/tests
project/tests/test.py
project/tests/test1.py
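If you want to follow along, one way to create this layout is sketched below. The helper name create_sample_tree is hypothetical, and the demonstration builds the tree in a temporary directory rather than the current working directory so that nothing is left behind.

```python
import os
import tempfile

def create_sample_tree(root="."):
    """Create the sample 'project' layout under root (hypothetical helper)."""
    files = [
        "project/file.py",
        "project/file1.py",
        "project/file2.py",
        "project/file3.py",
        "project/tests/test.py",
        "project/tests/test1.py",
    ]
    for rel in files:
        path = os.path.join(root, *rel.split("/"))
        # Create parent directories as needed, then an empty file.
        os.makedirs(os.path.dirname(path), exist_ok=True)
        open(path, "a").close()
    return files

# Demonstrate in a scratch directory and list what was created.
with tempfile.TemporaryDirectory() as tmp:
    create_sample_tree(tmp)
    created = sorted(os.listdir(os.path.join(tmp, "project")))

print(created)  # ['file.py', 'file1.py', 'file2.py', 'file3.py', 'tests']
```

Calling create_sample_tree() with no argument inside an empty scratch directory would produce the same layout relative to the current working directory.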
glob() with wildcards
As mentioned earlier, the glob module supports wildcard characters:
- An asterisk (*) matches zero or more characters of any type.
- A question mark (?) matches a single character.
Match zero or more characters (*)
Use an asterisk for matching
import glob
# Match every direct child of the "project" directory.
matches = glob.glob('project/*')
for p in matches:
    print(p)
project\file.py
project\file1.py
project\file2.py
project\file3.py
project\tests
In the above example, only paths that are direct children of the "project" directory are returned, i.e. those that match the pattern "project/*". Note that this includes the "tests" directory itself but not its contents. To include paths in sub-directories, the sub-directory must also be included in the pattern.
import glob
# Name the sub-directory explicitly in the pattern.
matches = glob.glob('project/tests/*')
for p in matches:
    print(p)
project\tests\test.py
project\tests\test1.py
In the above example, we explicitly included the name of the sub-directory. We can also rely on a wildcard to find the directory, as shown below.
import glob
# The first * stands for any sub-directory of "project".
matches = glob.glob('project/*/*')
for p in matches:
    print(p)
While the previous approach matches only the paths under the "tests" subdirectory, using a wildcard as above matches the contents of every subdirectory of "project".
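To make the result of the 'project/*/*' pattern concrete, here is a small sketch that rebuilds part of the sample layout in a temporary directory (an assumption made so the snippet is self-contained) and prints the matches relative to that directory.

```python
import glob
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    # Recreate a reduced version of the sample layout.
    os.makedirs(os.path.join(tmp, "project", "tests"))
    for rel in ("file.py", "file1.py", "tests/test.py", "tests/test1.py"):
        open(os.path.join(tmp, "project", *rel.split("/")), "w").close()

    matches = glob.glob(os.path.join(tmp, "project", "*", "*"))
    # Show paths relative to the scratch directory, with forward slashes.
    relative = sorted(
        os.path.relpath(p, tmp).replace(os.sep, "/") for p in matches
    )

print(relative)  # ['project/tests/test.py', 'project/tests/test1.py']
```

The files directly inside "project" are absent from the result because the pattern requires exactly two path components below it.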
Match a single character (?)
A question mark in the pattern will match a single character in the specified position.
import glob
# ? stands for exactly one character between "file" and ".py".
matches = glob.glob('project/file?.py')
for p in matches:
    print(p)
project\file1.py
project\file2.py
project\file3.py
As you can see in the above example, only paths with exactly one character in the position marked by ? are returned. "project/file.py" is not matched because it has no character in that position.
Instead of a question mark, we can also use a simple character range to match a single character from a limited set. For example, [0-9] in the pattern matches only digits from 0 to 9 rather than any character.
import glob
# [0-9] stands for exactly one digit.
matches = glob.glob('project/file[0-9].py')
for p in matches:
    print(p)
project\file1.py
project\file2.py
project\file3.py
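Character sets can be narrower than [0-9], and a leading ! negates the set. The sketch below tries a few variations against the same four file names, created in a temporary directory; the helper names() is hypothetical and only shortens the output.

```python
import glob
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    project = os.path.join(tmp, "project")
    os.makedirs(project)
    for name in ("file.py", "file1.py", "file2.py", "file3.py"):
        open(os.path.join(project, name), "w").close()

    def names(pattern):
        """Return the sorted base names in project/ that match pattern."""
        hits = glob.glob(os.path.join(project, pattern))
        return sorted(os.path.basename(p) for p in hits)

    narrow_range = names("file[1-2].py")  # digits 1 to 2 only
    explicit_set = names("file[13].py")   # either 1 or 3
    negated = names("file[!1].py")        # any single character except 1

print(narrow_range)  # ['file1.py', 'file2.py']
print(explicit_set)  # ['file1.py', 'file3.py']
print(negated)       # ['file2.py', 'file3.py']
```

In every case the bracket expression still consumes exactly one character, which is why file.py never appears.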
Matching files recursively
Normally the glob() function matches paths only in the specified directory, without descending into sub-directories. We can, however, set the recursive parameter to True so that glob traverses all sub-directories of the specified directory and matches the given pattern accordingly.
We use a double star wildcard (**) to indicate that the search should be recursive.
Consider the following example.
import glob
# ** together with recursive=True searches "project" and all its sub-directories.
for p in glob.glob("project/**/*.py", recursive=True):
    print(p)
project\file.py
project\file1.py
project\file2.py
project\file3.py
project\tests\test.py
project\tests\test1.py
In the above example, we were able to recursively search through the "project" directory to get all files with a ".py" extension. The double star wildcard '**' indicates recursive searching, meaning the function searches the directory given in the pattern and all of its subdirectories.
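The '**' wildcard only has this effect when recursive is True; otherwise it behaves like a single '*'. The following sketch compares the two settings on the sample layout, rebuilt here in a temporary directory so the snippet is self-contained.

```python
import glob
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    # Recreate the sample layout.
    os.makedirs(os.path.join(tmp, "project", "tests"))
    for rel in ("file.py", "file1.py", "file2.py", "file3.py",
                "tests/test.py", "tests/test1.py"):
        open(os.path.join(tmp, "project", *rel.split("/")), "w").close()

    pattern = os.path.join(tmp, "project", "**", "*.py")

    def relative(paths):
        """Return sorted paths relative to tmp, with forward slashes."""
        return sorted(os.path.relpath(p, tmp).replace(os.sep, "/") for p in paths)

    flat = relative(glob.glob(pattern))                  # ** acts like *
    deep = relative(glob.glob(pattern, recursive=True))  # ** spans zero or more levels

print(len(flat), flat)
print(len(deep), deep)
```

With the default setting only the two files under "tests" are found; with recursive=True all six ".py" files are returned, including those directly inside "project".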
Recursive searching can be expensive, especially if the directory tree is large. The iglob() function can be used instead of glob() to reduce memory use. It works just like glob() except that it returns an iterator instead of a list, which makes it suitable for large directory trees because each path is retrieved only when it is needed.
import glob
paths = glob.iglob("project/**/*.py", recursive=True)
print(next(paths))
print(next(paths))
print(next(paths))
project\file.py
project\file1.py
project\file2.py
As you can see in the above example, the iglob() function does not load all the paths into memory at once; instead, it yields them one at a time, as needed.
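Calling next() repeatedly raises StopIteration once the iterator is exhausted, so in practice it is often more convenient to take a bounded number of results. The sketch below uses itertools.islice for that, again on a temporary copy of the sample layout (an assumption made to keep the snippet self-contained).

```python
import glob
import itertools
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    # Recreate the sample layout.
    os.makedirs(os.path.join(tmp, "project", "tests"))
    for rel in ("file.py", "file1.py", "file2.py", "file3.py",
                "tests/test.py", "tests/test1.py"):
        open(os.path.join(tmp, "project", *rel.split("/")), "w").close()

    pattern = os.path.join(tmp, "project", "**", "*.py")
    paths = glob.iglob(pattern, recursive=True)

    is_list = isinstance(paths, list)             # iglob() does not build a list
    first_two = list(itertools.islice(paths, 2))  # pull only two paths
    remaining = list(paths)                       # the iterator resumes where it stopped

print(is_list, len(first_two), len(remaining))  # False 2 4
```

The iterator is consumed inside the with block because the temporary directory is deleted as soon as the block ends.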