Regular expressions, often shortened to regex, are an expressive and powerful way of manipulating text data. They are particularly useful for matching patterns, searching, replacing text, validating strings against a given pattern, and extracting information from text.
In Python, regular expressions are accessed through the re module, which is part of the standard library. This means we can simply import the module without any additional setup or installation.
The module provides a number of functions and constants that make it easier to work with regular expressions. Some common functions defined in the module are match, search, findall, split, and sub.
Finding Patterns in text
The re.search() function is one of the most used tools in the module. It takes a pattern and the text to be scanned, and returns a Match object if the pattern is found, otherwise None.
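As a minimal sketch of this behaviour, re.search() can be called and its result checked against None before use. The sample text and variable names below are illustrative assumptions, not part of the original tutorial.

```python
import re

text = "Order 66 was shipped on Monday"  # illustrative sample text

# The pattern "shipped" occurs in the text, so a Match object is returned.
found = re.search(r"shipped", text)

# The pattern "cancelled" does not occur, so the result is None.
missing = re.search(r"cancelled", text)

if found is not None:
    print("Found:", found.group())
print("Missing result:", missing)
```

Checking for None first matters because calling a method such as group() on None raises an AttributeError.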
The match object
The Match object contains information about a successful match of a regular expression. It has several methods and attributes that allow us to extract information from the match. The re.search() function and some other re functions, as we will see in a while, return Match objects.
Some useful methods and attributes in the Match objects
Attribute | Description |
---|---|
string | The input string passed to the function involved (either search() or match()). |
pos | The index in the string at which the regex engine started looking for a match. |
re | The compiled regular expression object used for the search. |
lastgroup | The name of the last matched capturing group, or None if that group has no name or no group was matched. |
lastindex | The integer index of the last matched capturing group, or None if no group was matched. |
Method | Description |
---|---|
group() | Returns the whole match, or the contents of one or more capturing groups when group numbers or names are passed. |
groups() | Returns a tuple containing the contents of all the capturing groups. |
span() | Returns a tuple containing the start and end positions of the match. |
start() | Returns the start index of the match within the input string. |
end() | Returns the end index of the match within the input string. |
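The following sketch exercises the attributes and methods listed above on a single match. The pattern, the group names user and domain, and the sample text are illustrative assumptions.

```python
import re

# Illustrative pattern with two named capturing groups.
pattern = re.compile(r"(?P<user>\w+)@(?P<domain>\w+\.\w+)")
text = "Contact: alice@example.com"
m = pattern.search(text)

whole = m.group()        # the whole match
user = m.group("user")   # one capturing group, selected by name
parts = m.groups()       # all capturing groups as a tuple
bounds = m.span()        # (start, end) of the whole match

print(whole, user, parts, bounds)
print(m.start(), m.end())         # the same positions, one at a time
print(m.lastindex, m.lastgroup)   # index and name of the last matched group
print(m.string == text, m.re is pattern, m.pos)
```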
Regex Patterns and their syntax
In Python, a regex pattern is a string of characters that is used to match character combinations in another string. Patterns are constructed from a combination of literal characters (digits, letters, symbols, etc.) and metacharacters.
For example, if we simply want to match an exact word in a given string, we can use the word itself as the pattern.
However, sometimes we want to match text against a more general pattern instead of a literal string. For example, we may need to match any digit, or to search for text that starts with a certain character. In such cases, using exact literal patterns would be tedious and sometimes impossible. This is where metacharacters come in.
Metacharacters are special characters that have a predefined meaning in the regex context. By using metacharacters, you can match a variety of specific patterns instead of having to match them as literal strings. This allows you to simplify and speed up the task of searching for patterns within strings.
Special sequences combine the backslash metacharacter with an ordinary character, as in \d or \w.
For example, the metacharacter "." matches any single character except the newline character, so the pattern "." will match any non-line-break character in the given string.
For instance, in a two-line string that holds 4 letters, the "." pattern retrieves the 4 letters but not the line break between them. We will look more closely at the search function and its usage later.
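A minimal sketch of the two cases described above, a literal pattern and the "." metacharacter. The sample strings are illustrative assumptions.

```python
import re

# A literal pattern matches exactly the characters it spells out.
literal = re.search(r"cat", "concatenate")
print(literal.group(), literal.span())

# "." matches every character except the newline between "ab" and "cd".
chars = re.findall(r".", "ab\ncd")
print(chars)
```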
Note: The 'r' before a pattern indicates that the pattern is a raw string. This means that Python does not treat backslashes in the string literal as the start of escape sequences, so they reach the regex engine unchanged. This is necessary in most cases because the backslash ('\') is one of the most used metacharacters, as we will see in the following section.
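To make the note concrete, the sketch below compares a normal string literal with a raw one. The variable names and sample text are illustrative assumptions.

```python
import re

# In a normal literal the backslash must be doubled, because Python
# itself would read "\b" as the backspace character.
normal = "\\bcat\\b"
raw = r"\bcat\b"
same_pattern = normal == raw
print(same_pattern)

# "\n" is one newline character; r"\n" is a backslash followed by "n".
lengths = (len("\n"), len(r"\n"))
print(lengths)

# \b is a word boundary, so "cat" inside "concat" is not matched.
whole_words = re.findall(raw, "cat concat cat.")
print(whole_words)
```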
regex metacharacters and usage
You already have an idea of what metacharacters are from the previous section. In this section we will explore the available metacharacters and how they are used to achieve complex pattern matching on strings.
The following table summarizes the metacharacters and their usage.
Metacharacter | Name | Usage |
---|---|---|
. | Dot | It matches any single character except the newline character. |
^ | Caret | It is used to match the pattern from the start of the string. (Starts With). |
$ | Dollar sign | Used to match the end of a string or line.(Ends With). |
* | Asterisk | Matches the preceding expression 0 or more times |
+ | Plus | Matches the preceding expression 1 or more times |
? | Question mark | Matches the preceding expression 0 or 1 time |
[] | Brackets | Matches a character set. |
{} | Curly braces | Matches the specified number of occurrences of the preceding element. |
| | Pipe | Matches either the expression before it or the expression after it (alternation). |
\ | Backslash | Escapes the following special character, or starts a special sequence such as \d. |
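The sketch below tries each metacharacter from the table on a small input. All sample strings and variable names are illustrative assumptions.

```python
import re

starts = re.findall(r"^Hello", "Hello world")       # ^  start of string
ends = re.findall(r"world$", "Hello world")         # $  end of string
star = re.findall(r"ab*", "a ab abbb")              # *  zero or more "b"
plus = re.findall(r"ab+", "a ab abbb")              # +  one or more "b"
optional = re.findall(r"colou?r", "color colour")   # ?  optional "u"
vowels = re.findall(r"[aeiou]", "regex")            # [] character set
triples = re.findall(r"\d{3}", "12 345 6789")       # {} exactly three digits
either = re.findall(r"cat|dog", "cat, dog, bird")   # |  alternation
dots = re.findall(r"\.", "a.b.c")                   # \  escaped literal dot

for result in (starts, ends, star, plus, optional,
               vowels, triples, either, dots):
    print(result)
```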
Shorthand Character classes
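In regex, shorthand character classes consist of a backslash ('\') followed by a letter. They provide a compact way to write common patterns that match particular sets of characters. The table also includes the anchors \A, \b, \B and \Z, which use the same backslash syntax but match a position rather than a character.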
Character | Usage |
---|---|
\d | Matches any single digit. Equivalent to [0-9] for ASCII text. |
\D | Matches any single character that is not a digit. Equivalent to [^0-9] for ASCII text. |
\s | Matches any whitespace character (" ", "\t", "\n", "\r", "\f", "\v"). Equivalent to [ \t\n\r\f\v] for ASCII text. |
\S | Matches any non-whitespace character. Equivalent to [^ \t\n\r\f\v] for ASCII text. |
\w | Matches any alphanumeric character, including the underscore. Equivalent to [a-zA-Z0-9_] for ASCII text. |
\W | Matches any character that is not alphanumeric or an underscore. Equivalent to [^a-zA-Z0-9_] for ASCII text. |
\A | Matches the defined pattern at the start of the string. |
\b | r"\b..." - It matches the pattern at the beginning of a word in a string. r"...\b" - It matches the pattern at the end of a word in a string. |
\B | This is the opposite of \b. It matches only at positions that are not the beginning or end of a word. |
\Z | Matches if the pattern is found at the end of the string. |
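A short sketch of several entries from the table applied to one string. The sample text and variable names are illustrative assumptions.

```python
import re

text = "Room 42, floor_3!"

digits = re.findall(r"\d", text)             # single digits
chunks = re.findall(r"\S+", text)            # runs of non-whitespace
words = re.findall(r"\w+", text)             # letters, digits and underscore
non_word = re.findall(r"\W", text)           # everything \w does not match
at_start = re.findall(r"\ARoom", text)       # anchored to the string start
whole_word = re.findall(r"\bfloor\b", text)  # "floor" as a complete word

for result in (digits, chunks, words, non_word, at_start, whole_word):
    print(result)
```

The last result is empty because the underscore counts as a word character, so there is no word boundary between "floor" and "_3".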
Negation in Regex Patterns
Negation inverts what a pattern matches. In regex, a caret (^) placed as the first character inside square brackets negates the character class that follows. For example, the regex [^\d] will match any character that is not a digit.
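Note: The caret is also used to match a pattern from the start of a string, which can introduce ambiguity in some cases. Its meaning depends on its position: outside square brackets it anchors the match to the start of the string, while as the first character inside square brackets it negates the character class.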
Negation is commonly used with character classes. A character class is a set of characters within square brackets that represents a group of characters, such as [0-9], [a-z], or [A-Z]. When a caret is placed just after the opening bracket of a character class, the meaning of the character class is negated. For example, [^0-9] will match any character that is not a digit, whereas ^[0-9] matches a digit at the start of the string.
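The two roles of the caret can be compared side by side. This is a minimal sketch with illustrative sample strings and variable names.

```python
import re

# Caret inside the brackets: negated class, matches every non-digit.
negated = re.findall(r"[^0-9]", "a1b2")

# Caret outside the brackets: anchor, matches a digit only at the start.
anchored = re.findall(r"^[0-9]", "1a2b")

# The same anchored pattern finds nothing when the string starts with a letter.
no_hit = re.findall(r"^[0-9]", "a1b2")

print(negated, anchored, no_hit)
```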
RegEx Functions and Usage.
Most of the functions in the re module have a similar syntax. They typically accept a pattern as the first argument, followed by the string to be matched or manipulated.
The following snippet shows some of these functions and their basic syntax:
re.match(pattern, string, flags=0)
re.search(pattern, string, flags=0)
re.findall(pattern, string, flags=0)
re.sub(pattern, repl, string, count=0, flags=0)
re.split(pattern, string, maxsplit=0, flags=0)
re.compile()
This function creates a regular expression object from a regular expression pattern. This can save time when the same expression will be used multiple times in a single program.
Even though we can use the functions in the re module without first compiling the patterns, a compiled regular expression is built only once and can then be reused for any number of matches. This makes it more efficient when the same regular expression is used again and again.
Syntax:
re.compile(pattern, flags=0)
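The pattern argument is the regular expression, given as a string, that other strings will be matched against. The optional flags argument modifies how the pattern is interpreted. Flags can be combined with the bitwise OR operator (|) and include the following:
Flag | Effect |
---|---|
re.I (IGNORECASE) | Performs case-insensitive matching. |
re.M (MULTILINE) | Makes ^ and $ match at the start and end of each line, not only of the whole string. |
re.S (DOTALL) | Makes a dot (.) match any character, including a newline. |
re.U (UNICODE) | Makes \w, \W, \b, \B, \d, \D, \s and \S follow Unicode rules. In Python 3 this is already the default for string patterns, so the flag is kept only for backward compatibility. |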
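As a sketch of compiling once and reusing the result, the example below builds two pattern objects with different flags. The patterns, sample text and variable names are illustrative assumptions.

```python
import re

# Compile once with two flags combined, then reuse the object.
line_start = re.compile(r"^python", re.I | re.M)
text = "Python is fun\npython is popular\nI like PYTHON"
hits = line_start.findall(text)
print(hits)

# Without re.S the dot stops at a newline; with re.S it does not.
plain = re.compile(r"a.b")
dotall = re.compile(r"a.b", re.S)
plain_result = plain.search("a\nb")
dotall_result = dotall.search("a\nb")
print(plain_result, dotall_result is not None)
```

The third line of the text is not matched because it does not begin with the word, even though the match is case-insensitive.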
re.match():
This function matches a pattern at the beginning of a string. The pattern argument is a regular expression, which can be either a string or a compiled pattern object. If the match is successful, the function returns a Match object; otherwise it returns None.
Syntax:
re.match(pattern, string, flags=0)
re.search()
This function searches for the first occurrence of the regex pattern anywhere within the specified string.
Syntax:
re.search(pattern, string, flags=0)
The function returns a Match object if a match is found; otherwise, it returns None.
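The difference between re.match() and re.search() can be sketched as follows. The sample text and variable names are illustrative assumptions.

```python
import re

text = "regex in Python"

# re.match() only succeeds if the pattern matches at position 0.
at_start = re.match(r"regex", text)
not_at_start = re.match(r"Python", text)

# re.search() scans the whole string for the first occurrence.
anywhere = re.search(r"Python", text)

print(at_start.group())
print(not_at_start)
print(anywhere.group(), anywhere.span())
```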
re.findall()
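This function returns all non-overlapping occurrences of the pattern in the string as a list. If the pattern contains capturing groups, the list holds the group contents instead of the whole matches.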
Syntax:
re.findall(pattern, string, flags=0)
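A small sketch of re.findall() with and without capturing groups. The date-like sample text and variable names are illustrative assumptions.

```python
import re

text = "2023-01-15 and 2024-12-31"

# No capturing groups: a list of the whole matches.
dates = re.findall(r"\d{4}-\d{2}-\d{2}", text)

# Two capturing groups: a list of tuples, one tuple per match.
year_month = re.findall(r"(\d{4})-(\d{2})", text)

# Matches do not overlap, so "aaaa" contains "aa" twice, not three times.
pairs = re.findall(r"aa", "aaaa")

print(dates, year_month, pairs)
```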
re.sub()
This function replaces all non-overlapping occurrences of pattern in string with the replacement repl, and returns the modified string.
Syntax:
re.sub(pattern, repl, string, count=0, flags=0)
The pattern parameter is the regular expression to be matched. The repl parameter is the string to be substituted for each match; it can also be a function that receives the Match object and returns the replacement string. The string parameter is the string being processed. The count parameter is the maximum number of replacements to be made (the default is 0, which means all matches are replaced).
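The parameters described above can be sketched as follows. The sample text, the "#" placeholder and the doubling function are illustrative assumptions.

```python
import re

text = "a1b22c333"

# Replace every run of digits.
all_replaced = re.sub(r"\d+", "#", text)

# Replace only the first run of digits.
first_only = re.sub(r"\d+", "#", text, count=1)

# Use a function as repl: it receives each Match and returns a string.
doubled = re.sub(r"\d+", lambda m: str(int(m.group()) * 2), text)

print(all_replaced, first_only, doubled)
```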
re.split()
This function splits a string into substrings wherever the specified regular expression pattern matches. It returns a list of the resulting substrings. The optional maxsplit parameter limits the number of splits (the default is 0, which means no limit).
Syntax:
re.split(pattern, string, maxsplit=0, flags=0)
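A brief sketch of re.split() with and without maxsplit. The separator pattern, sample text and variable names are illustrative assumptions.

```python
import re

text = "one, two;three   four"

# Split on any run of commas, semicolons or whitespace.
parts = re.split(r"[,;\s]+", text)

# maxsplit=1 stops after the first split and keeps the rest intact.
limited = re.split(r"[,;\s]+", text, maxsplit=1)

print(parts)
print(limited)
```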
re.finditer()
This function returns an iterator over all non-overlapping matches of a pattern in a given string. It is similar to re.findall(), but instead of returning a list of the matched strings, it returns an iterator that yields a Match object for each match.
Syntax:
re.finditer(pattern, string, flags=0)
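Because re.finditer() yields Match objects, positions are available for every match. A minimal sketch with illustrative sample text and variable names:

```python
import re

text = "cat bat rat"

# Collect each matched word together with its start index.
positions = [(m.group(), m.start()) for m in re.finditer(r"[a-z]at", text)]

for word, index in positions:
    print(word, "starts at index", index)
```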
re.escape()
This function escapes any special characters in a string, so that the string can be used inside a pattern and matched literally in a text.
Syntax:
re.escape(pattern)
The pattern value represents the text string that will be escaped.
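The sketch below shows why escaping matters when the text to find contains metacharacters. The sample strings and variable names are illustrative assumptions; the exact escaped form is not shown because it can vary between Python versions.

```python
import re

literal = "1+1=2 (true?)"           # contains +, ( ) and ? metacharacters
target = "Claim: 1+1=2 (true?)"

# Used directly, the metacharacters change the meaning of the pattern.
unescaped = re.search(literal, target)

# After re.escape(), the same text is matched character for character.
escaped = re.search(re.escape(literal), target)

print(unescaped)
print(escaped.group())
```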
re.purge()
The re.purge() function is used to clear all the caches associated with compiled regular expressions.