glob is one of the modules in Python's standard library. It is primarily used for finding all pathnames that match a specified pattern. For example, instead of writing custom code to find all files with a certain extension, you can use glob to do this quickly and easily.

The module supports wildcards and patterns as defined by UNIX shell rules.

The primary tool offered in the module is the glob.glob() function, which returns a list of pathnames matching a specified pattern. In simplified form, the function has the following syntax.

glob.glob(pathname, *, recursive=False, include_hidden=False)
pathname The pattern to be matched; it may contain simple wildcards.
recursive If True, the pattern '**' matches any files and zero or more directories and subdirectories.
include_hidden If True, the '**' pattern also matches hidden directories (available in Python 3.11 and later).

The function returns a list containing the paths that matched the pattern.
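As a minimal sketch of the call and its return value, the snippet below creates a throwaway directory containing one file and runs two patterns against it. The file name notes.txt and the variable names are made up for illustration; glob.escape() is used only so that any special characters in the temporary path are treated literally.

```python
import glob
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical file, created only for this demonstration.
    open(os.path.join(tmp, "notes.txt"), "w").close()

    base = glob.escape(tmp)  # treat the temp path itself literally
    txt_files = glob.glob(os.path.join(base, "*.txt"))
    csv_files = glob.glob(os.path.join(base, "*.csv"))

    print([os.path.basename(p) for p in txt_files])  # ['notes.txt']
    print(csv_files)                                 # []
```

A pattern that matches nothing yields an empty list rather than raising an error.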

For illustration purposes, the following sections work with the glob module on the assumption that the following files and directories exist in the current working directory.

project
project/file.py
project/file1.py
project/file2.py
project/file3.py
project/tests
project/tests/test.py
project/tests/test1.py
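To follow along, the layout above can be recreated with a short script. This is only a sketch: the helper name make_sample_tree is hypothetical, and the tree is built in a temporary directory rather than the current working directory so that nothing existing is touched.

```python
import tempfile
from pathlib import Path


def make_sample_tree(base):
    """Create the example layout under `base` and return the project directory."""
    project = Path(base) / "project"
    (project / "tests").mkdir(parents=True)
    for name in ("file.py", "file1.py", "file2.py", "file3.py",
                 "tests/test.py", "tests/test1.py"):
        (project / name).touch()  # empty placeholder files
    return project


# Build the tree in a throwaway directory and list what was created.
with tempfile.TemporaryDirectory() as tmp:
    project = make_sample_tree(tmp)
    created = sorted(p.relative_to(tmp).as_posix() for p in project.rglob("*"))
    for path in created:
        print(path)
```

To work in a real directory instead, pass its path to make_sample_tree() and drop the temporary-directory wrapper.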

glob() with wildcards

As mentioned earlier, the glob module supports wildcard characters:

  • An asterisk (*) matches zero or more characters within a single path component (it does not cross directory separators).
  • A question mark (?) matches exactly one character.

Match zero or more characters (*)

Use an asterisk for matching

import glob

matches = glob.glob('project/*')
for p in matches:
    print(p)

project\file.py
project\file1.py
project\file2.py
project\file3.py
project\tests 

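In the above example, only the direct children of the "project" directory are returned, i.e. the files and directories that match the pattern "project/*". To include paths in sub-directories, the sub-directory must also be included in the pattern. The backslashes in the output are Windows path separators; on Linux and macOS the same paths are printed with forward slashes.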

import glob

matches = glob.glob('project/tests/*')
for p in matches:
    print(p)

project\tests\test.py
project\tests\test1.py 

In the above example, we explicitly included the name of the sub-directory. We can also rely on a wildcard to find the directory, as shown below.

import glob

matches = glob.glob('project/*/*')
for p in matches:
    print(p)

While the previous approach matches only the paths inside the "tests" subdirectory, using a wildcard as above matches the contents of every subdirectory of "project".
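The difference is easier to see with more than one subdirectory. The sketch below adds a hypothetical "docs" directory (not part of the tree used elsewhere in this article) and compares the two patterns; paths are printed relative to "project" and sorted so the result does not depend on the platform.

```python
import glob
import os
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    project = Path(tmp) / "project"
    # "docs/readme.txt" is a hypothetical addition for this comparison.
    for item in ("file.py", "tests/test.py", "tests/test1.py", "docs/readme.txt"):
        target = project / item
        target.parent.mkdir(parents=True, exist_ok=True)
        target.touch()

    def relative(paths):
        # Strip the temp prefix and sort for a stable, portable result.
        return sorted(Path(p).relative_to(project).as_posix() for p in paths)

    base = glob.escape(str(project))
    one_dir = relative(glob.glob(os.path.join(base, "tests", "*")))
    any_dir = relative(glob.glob(os.path.join(base, "*", "*")))

    print(one_dir)  # ['tests/test.py', 'tests/test1.py']
    print(any_dir)  # ['docs/readme.txt', 'tests/test.py', 'tests/test1.py']
```

Note that "file.py" does not appear in either result: it sits directly in "project", one level above what these patterns describe.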

Match a single character (?)

A question mark in the pattern will match a single character in the specified position.

import glob

matches = glob.glob('project/file?.py')
for p in matches:
    print(p)

project\file1.py
project\file2.py
project\file3.py 

As you can see in the above example, only paths with exactly one character in the position of the ? are returned. "project/file.py" is not matched because it has no character in that position.

Instead of a question mark, we can also use a simple character range to match a single character in a more restricted way. For example, [0-9] in the pattern matches only the digits 0 to 9 rather than any character.

import glob

matches = glob.glob('project/file[0-9].py')
for p in matches:
    print(p)

project\file1.py
project\file2.py
project\file3.py
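With the sample tree, both patterns happen to return the same files, so the distinction is not visible. The sketch below adds a hypothetical file named "filex.py" to show where ? and [0-9] differ; the file names are assumptions made for this illustration only.

```python
import glob
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    # "filex.py" is a hypothetical extra file, not part of the original tree.
    for name in ("file.py", "file1.py", "file2.py", "filex.py"):
        open(os.path.join(tmp, name), "w").close()

    base = glob.escape(tmp)
    any_char = sorted(os.path.basename(p)
                      for p in glob.glob(os.path.join(base, "file?.py")))
    digits = sorted(os.path.basename(p)
                    for p in glob.glob(os.path.join(base, "file[0-9].py")))

    print(any_char)  # ['file1.py', 'file2.py', 'filex.py']
    print(digits)    # ['file1.py', 'file2.py']
```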

Matching files recursively

Normally, the glob() function matches paths only in the specified directory, without descending into sub-directories. We can, however, set the recursive parameter to True so that glob traverses all sub-directories of the specified directory and matches paths at any depth.

We use a double star wildcard (**) in the pattern to indicate where the search should be recursive.

Consider the following example.

import glob

for p in glob.glob("project/**/*.py", recursive=True):
    print(p)

project\file.py
project\file1.py
project\file2.py
project\file3.py
project\tests\test.py
project\tests\test1.py

In the above example, we recursively searched the "project" directory for all files with a ".py" extension. The double star wildcard '**' matches zero or more directory levels, so the function returns matching files in "project" itself as well as in all of its subdirectories. Without recursive=True, '**' behaves like a single '*'.
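The effect of the recursive flag can be checked by running the same pattern with and without it. The following is a sketch that rebuilds the sample tree in a temporary directory; the helper relative() and the variable names shallow and deep are made up for this example.

```python
import glob
import os
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    project = Path(tmp) / "project"
    (project / "tests").mkdir(parents=True)
    for name in ("file.py", "file1.py", "file2.py", "file3.py",
                 "tests/test.py", "tests/test1.py"):
        (project / name).touch()

    pattern = os.path.join(glob.escape(str(project)), "**", "*.py")

    def relative(paths):
        # Sort and strip the temp prefix so the result is platform-independent.
        return sorted(Path(p).relative_to(project).as_posix() for p in paths)

    shallow = relative(glob.glob(pattern))                  # '**' acts like '*'
    deep = relative(glob.glob(pattern, recursive=True))     # '**' spans all levels

    print(shallow)  # ['tests/test.py', 'tests/test1.py']
    print(deep)     # all six .py files, including those directly in project
```

Without the flag, the pattern is equivalent to "project/*/*.py", which is why the files directly inside "project" are missing from the first result.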

Recursive searching can be expensive, especially if the directory tree is large. The iglob() function can be used instead of glob() to reduce memory use and to start processing results before the whole tree has been scanned. It works just like glob() except that it returns an iterator instead of a list. This makes it suitable for large directory trees because each path is only retrieved when it is needed.

import glob

paths = glob.iglob("project/**/*.py", recursive=True)
print(next(paths))
print(next(paths))
print(next(paths))

project\file.py
project\file1.py
project\file2.py

As you can see in the above example, the iglob() function does not load the paths into memory all at once; instead, it yields them one by one, as needed.
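A common way to take only the first few results from such an iterator is itertools.islice(). The sketch below, again using a temporary copy of the sample tree, pulls two paths and then consumes the rest; since the order of results depends on the file system, only the counts are printed.

```python
import glob
import itertools
import os
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    project = Path(tmp) / "project"
    (project / "tests").mkdir(parents=True)
    for name in ("file.py", "file1.py", "file2.py", "file3.py",
                 "tests/test.py", "tests/test1.py"):
        (project / name).touch()

    pattern = os.path.join(glob.escape(str(project)), "**", "*.py")
    paths = glob.iglob(pattern, recursive=True)
    is_list = isinstance(paths, list)  # False: iglob() returns an iterator

    first_two = list(itertools.islice(paths, 2))  # pull only two results
    remaining = list(paths)                       # continues where islice stopped

    print(is_list, len(first_two), len(remaining))  # False 2 4
```

Because the iterator keeps its position, the second call picks up the four paths that were not yet consumed.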