
Regular expressions, often shortened to regex, are an expressive and powerful way of manipulating text data. They are particularly useful for matching patterns, searching and replacing text, verifying that strings conform to a given pattern, and extracting information from text.

In Python, regular expressions are accessed through the re module, which is part of the standard library, so we can simply import it without any additional setup or installation.

The module provides a number of functions and constants that make it easier to work with regular expressions. Some common functions defined in the module are match, search, findall, split and sub.

Finding Patterns in text

The re.search() function is one of the most used tools in the module. It takes a pattern and the text to be scanned, and returns a Match object if the pattern is found, otherwise None.

#import the module
import re

#the pattern
pattern = "valuable"

#text to be scanned
text = "Eric has proved himself to be a valuable asset to the team."

match = re.search(pattern, text)

s = match.start() #The starting index
e = match.end() #The ending index

print("""Found "%s"
in: %s
from index %d to index %d("%s")."""%(match.re.pattern, match.string, s, e, text[s:e]))
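Because re.search() returns None when nothing matches, calling .start() on the result would raise an AttributeError in that case. The following is a minimal sketch of guarding against it; the helper name describe_match is hypothetical and exists only for this illustration.

```python
import re

def describe_match(pattern, text):
    """Return a short description of the first match, or a fallback message."""
    match = re.search(pattern, text)
    if match is None:
        # No match: there is no Match object to query.
        return "No match for %r" % pattern
    return "Found %r at [%d, %d)" % (match.group(), match.start(), match.end())

text = "Eric has proved himself to be a valuable asset to the team."
print(describe_match("valuable", text))
print(describe_match("priceless", text))
```

Checking the result before using it keeps the rest of the code free of try/except blocks.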

The match object

The Match object contains information about a successful match of a regular expression. It has several methods and attributes that allow us to extract information from the match. re.search() and some other re functions, as we will see in a while, return Match objects.

import re

pattern = 'world'
text = 'Hello, world!'

#create a match object
match_object = re.search(pattern, text)

print(match_object)
Some useful methods and attributes in the Match objects
attributes
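string     The input string passed to the function involved (search() or match()).
pos        The index in the string at which the regex engine started looking for a match. It is 0 unless a start position was given to a compiled pattern's search() or match().
re         The compiled regular expression object used for the search.
lastgroup  The name of the last matched capturing group, or None if that group has no name or no group matched.
lastindex  The integer index of the last matched capturing group, or None if no group matched.
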
import re

pattern = 'world'
text = 'Hello, world!'

#create a match object
match_object = re.search(pattern, text)

print(match_object.string)
print(match_object.re)
print(match_object.pos)
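The lastgroup and lastindex attributes only carry information when the pattern contains capturing groups, and pos is only non-zero when a start position is supplied. The sketch below assumes a compiled pattern with two named groups; the group names and the sample text are illustrative.

```python
import re

# A compiled pattern with two named groups (the group names are illustrative).
date_re = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})")
text = "Released 2023-04, updated 2024-01."

# Pattern.search() accepts an optional start position, reported back as .pos.
m = date_re.search(text, 20)

print(m.group())     # the matched text
print(m.pos)         # index at which the search started
print(m.lastindex)   # index of the last matched group
print(m.lastgroup)   # name of the last matched group
```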
methods
group()   Returns the whole matched string, or the text of a specific capturing group when given its index or name.
groups()  Returns a tuple containing the substrings matched by all capturing groups.
span()    Returns a tuple containing the start and end positions of the match.
start()   Returns the start index of the match within the input string.
end()     Returns the end index of the match within the input string (one past the last matched character).

import re

pattern = 'world'
text = 'Hello, world!'

#create a match object
match_object = re.search(pattern, text)

#prints the matched string
print(match_object.group()) 

#print a tuple containing the starting and the ending indices
print(match_object.span())

#print the starting index
print(match_object.start())

#print the ending index
print(match_object.end())
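The methods above become more useful once the pattern contains capturing groups (parentheses). A small sketch, assuming a simplified e-mail-like pattern chosen only for illustration:

```python
import re

# Two capturing groups: the user name and the domain (illustrative pattern).
pattern = r"(\w+)@(\w+\.\w+)"
text = "Contact: alice@example.com"

m = re.search(pattern, text)
print(m.group())    # whole match
print(m.group(1))   # first group
print(m.group(2))   # second group
print(m.groups())   # all groups as a tuple
print(m.span(2))    # (start, end) of the second group
```

group(0) is the same as group(), and start(), end() and span() accept a group index in the same way.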

RegEx Patterns and their syntax

In Python, a regex pattern is a string of characters that is used to match character combinations in another string. Patterns are constructed using a combination of literal characters (digits, letters, symbols, etc.) and metacharacters.

For example, if we simply want to match an exact word in a given string, we can use the word itself as the pattern.

import re

#The pattern is the literal string we want to search
pattern = r'oriented'

text = "Python is an object oriented language."

match_object = re.search(pattern, text)

if match_object:
   print("The pattern was found from index %d to index %d"%(match_object.start(), match_object.end()))

However, sometimes we want to match text against a more general pattern instead of a literal string. For example, we may need to match any digit, or to search for text that starts with a certain character. In such cases it would be tedious, and sometimes impossible, to use exact literal patterns. This is where metacharacters come in.

Metacharacters are special characters that have a predefined meaning in the regex context. By using metacharacters, you can describe a whole family of strings with one pattern instead of listing each of them literally. This simplifies the task of searching for patterns within strings.

Special sequences are made of a backslash followed by a character, for example \d for a digit.

For example, the metacharacter "." matches any single character except the newline character, thus the pattern "." will match the first non line-break character in the given string.

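import re
pat = r"."
match = re.search(pat, "\n\n\n4\n\n\n")
#search() returns None when nothing matches, so test the result before calling group()
x = match.group() if match else "No non line-break character found!"
print(x)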

As you can see above, the "." pattern skipped the line-break characters and retrieved the "4" found among them. We will look more closely at the search function and its usage later.

Note: The 'r' before a pattern indicates that the pattern is a raw string. This means that Python leaves backslashes in the literal untouched, so they reach the regex engine exactly as written instead of being processed as string escapes first. This is necessary in most cases because the backslash ('\') is one of the most used metacharacters, as we will see in the following section.
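The effect of the raw-string prefix can be seen with \b, which Python itself interprets as the backspace character in an ordinary string literal. The sketch below compares the two spellings on an illustrative sample text.

```python
import re

text = "one two"

# In a normal string literal, "\b" is the backspace character (ASCII 8).
plain = "\btwo"
# In a raw string, the backslash reaches the regex engine, where \b is a word boundary.
raw = r"\btwo"

print(len(plain), len(raw))
print(re.search(plain, text))
print(re.search(raw, text))
```

The first search looks for a literal backspace followed by "two" and finds nothing; the second finds "two" at a word boundary.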

Regex metacharacters and usage

You already have an idea of what metacharacters are from the previous sub-heading. In this section we will explore the available metacharacters and how they are used to achieve complex pattern matching on strings.

The following table summarizes the metacharacters and their usage.

Metacharacter  Name           Usage
.              Dot            Matches any single character except the newline character.
^              Caret          Matches at the start of the string ("starts with").
$              Dollar sign    Matches at the end of the string or line ("ends with").
*              Asterisk       Matches the preceding expression 0 or more times.
+              Plus           Matches the preceding expression 1 or more times.
?              Question mark  Matches the preceding expression 0 or 1 time.
[]             Brackets       Matches any one character from the enclosed set.
{}             Curly braces   Matches the specified number of occurrences of the preceding element.
|              Pipe           Matches either the expression before it or the expression after it.
\              Backslash      Escapes the following special character, or introduces a special sequence such as \d.

import re

pat = "[a-z]" #matches any single lowercase letter from a to z
S = "Hello, World!"

#search once and reuse the result
match = re.search(pat, S)
if match:
   print('Found "%s"'%match.group())
else:
   print('No Match')

#Check whether a given string starts and ends with specified characters. 
import re

pat1 = "^H" #tests whether the string starts with letter 'H'
pat2 = "!$" #tests whether the string ends with '!'
S = 'Hello World!'

if re.search(pat1, S) and re.search(pat2, S):
   print(True)
else:
   print(False)

#Combining the two patterns 
pat = "^H.*!$"
if re.search(pat, S):
   print(True)
else:
   print(False)
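The quantifiers, curly braces, pipe and backslash from the table can be sketched in the same way. The sample text below is made up for illustration, and re.findall() is used only because it shows every match at once.

```python
import re

text = "color colour colouur 2024-01-15 cat dog"

# ? makes the preceding 'u' optional.
print(re.findall(r"colou?r", text))
# + requires one or more 'u' characters.
print(re.findall(r"colou+r", text))
# {n} repeats the preceding element exactly n times.
print(re.findall(r"\d{4}-\d{2}-\d{2}", text))
# | matches the expression on either side.
print(re.findall(r"cat|dog", text))
# A backslash escapes a metacharacter so it is matched literally.
print(re.findall(r"\.", "a.b.c"))
```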

Shorthand Character classes

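In regex, shorthand character classes consist of a backslash ('\') followed by a letter. They are compact forms of commonly needed patterns, such as "any digit" or "any whitespace character". The table below also lists the related anchors \A, \b, \B and \Z, which match positions rather than characters.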

Character  Usage
\d         Matches any single digit. Equivalent to [0-9] for ASCII text.
\D         Matches any single character that is not a digit. Equivalent to [^0-9].
\s         Matches any whitespace character (" ", "\t", "\n", "\r", "\f", "\v"). Equivalent to [ \t\n\r\f\v].
\S         Matches any non-whitespace character. Equivalent to [^ \t\n\r\f\v].
\w         Matches any alphanumeric character or the underscore. Equivalent to [a-zA-Z0-9_] for ASCII text.
\W         Matches any character that is neither alphanumeric nor an underscore. Equivalent to [^a-zA-Z0-9_].
\A         Matches only at the start of the string.
\b         Matches at a word boundary: r"\b..." matches the pattern at the beginning of a word, and r"...\b" matches it at the end of a word.
\B         The opposite of \b: matches only where there is no word boundary.
\Z         Matches only at the end of the string.

import re

pat = r"\w*Python\w*" # Matches any word that contains "Python"

if re.search(pat, "I have been learning the programming language Python for the last few months."):
   print('python Found!')
else:
   print(None)
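A few of the shorthand classes and anchors from the table can be sketched side by side. The sample strings are illustrative; the list comprehensions print the start index of each match so that the effect of \b and \B is visible.

```python
import re

words = "cat concatenate cats"

# \b requires a word boundary, \B requires that there is none.
print([m.start() for m in re.finditer(r"\bcat\b", words)])  # 'cat' as a whole word
print([m.start() for m in re.finditer(r"\bcat", words)])    # 'cat' at the start of a word
print([m.start() for m in re.finditer(r"\Bcat", words)])    # 'cat' not at the start of a word

# \d, \s and \w match digits, whitespace and word characters.
print(re.findall(r"\d+", "Order 66 shipped in 3 days"))
print(re.split(r"\s+", "a  b\tc\nd"))
print(re.findall(r"\w+", "snake_case, kebab-case"))
```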

Negation in Regex Patterns

Negation inverts what a pattern element matches. In regex, a caret (^) negates only when it is the first character inside a character class. For example, the regex [^\d] (equivalent to \D) will match any character that is not a digit.

Note: Outside square brackets the caret is an anchor that matches at the start of a string, so ^\d means "a digit at the start of the string" and not "anything but a digit". The two uses are easy to confuse. Consider the example below.

import re

pat = r"^\d" #anchor: a digit at the very start of the string
S1 = "Python 3"
S2 = "1 and 2"
if not re.search(pat, S1):
   print("No match: 'Python 3' does not start with a digit.")
if re.search(pat, S2):
   print("Match: '1 and 2' starts with a digit, so ^ acted as an anchor, not as a negation.")

Negation is therefore used with character classes. A character class is a set of characters within square brackets that represents a group of characters, such as [0-9], [a-z], or [A-Z]. When a caret is placed just after the opening bracket of a character class, the meaning of the character class is negated. For example, [^0-9] will match any character that is not a digit.

import re
pat = "[^0-9]"

S = '123abc'
print(re.search(pat, S).group())
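A common use of a negated character class is to strip out everything that does not belong to the set. The sketch below uses re.sub(), which is described later in this article, and a made-up phone number.

```python
import re

phone = "+1 (555) 010-9999"

# Remove every character that is not a digit.
digits_only = re.sub(r"[^0-9]", "", phone)
print(digits_only)

# For ASCII input such as this one, \D gives the same result as [^0-9].
print(re.sub(r"\D", "", phone) == digits_only)
```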

RegEx Functions and Usage.

Most of the functions in the re module have a similar syntax. They typically accept a pattern as the first argument, followed by the string to be matched or manipulated.

The following snippet shows some of these functions and their basic syntax:

re.match(pattern, string, flags=0)

re.search(pattern, string, flags=0)

re.findall(pattern, string, flags=0)

re.sub(pattern, repl, string, count=0, flags=0)

re.split(pattern, string, maxsplit=0, flags=0)

re.compile()

This function creates a regular expression object from a regular expression pattern. This can save time when the same expression will be used multiple times in a single program.

Even though we can use the functions in the re module without first compiling the patterns, a compiled pattern is parsed only once and can then be reused directly. This can make the program more efficient if the same regular expression is used again and again.

Syntax:

re.compile(pattern, flags=0)

The pattern is the regular expression string against which other strings will be matched. The optional flags argument modifies how the pattern is interpreted. Flags can be any of the following, and several flags can be combined with the | operator:

re.I (IGNORECASE): performs case-insensitive matching.
re.M (MULTILINE): makes the anchors ^ and $ match at the start and end of every line, not only of the whole string.
re.S (DOTALL): makes a dot (.) match any character, including a newline.
re.U (UNICODE): makes \w, \W, \b, \B, \d, \D, \s and \S follow Unicode rules. In Python 3 this is already the default for str patterns.

import re
pattern = r'\d'
regexp = re.compile(pattern)
print(regexp)

def demo(s, regexp):
   if regexp.search(s):
       print('digit Found.')
   else:
       print('no digit found')


demo('Hello world!', regexp)
demo('Hello 2 World!', regexp)
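The flags described above can be passed to re.compile() or to the module-level functions. A brief sketch on an illustrative two-line string:

```python
import re

text = "first line\nSecond LINE"

# re.I: ignore case.
print(re.findall(r"line", text, flags=re.I))
# re.M: ^ also matches right after each newline.
print(re.findall(r"^\w+", text, flags=re.M))
# re.S: the dot also matches the newline, so the match can span lines.
print(re.search(r"first.*LINE", text, flags=re.S) is not None)
# Flags are combined with the bitwise OR operator.
both = re.compile(r"^second", re.I | re.M)
print(both.search(text).group())
```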

re.match():

This function is used to match a pattern at the beginning of a string. pattern is a regular expression, which can be either a string or a compiled pattern object. If the match is successful it returns a Match object, otherwise it returns None.

Syntax:

re.match(pattern, string, flags=0)
import re

S = "Hello, my name is Joe" # Search for a basic string

match = re.match("Hello", S)
if match: 
    print("Match found")
else: 
    print("No match found")
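The difference between re.match() and re.search() shows up as soon as the pattern does not sit at the start of the string. The sketch below reuses the same sentence and adds re.fullmatch(), another function of the re module, for comparison.

```python
import re

S = "Hello, my name is Joe"

print(re.match("name", S))       # None: 'name' is not at the start
print(re.search("name", S))      # a Match object: 'name' is found further in
print(re.fullmatch("Hello", S))  # None: the pattern must cover the entire string
```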

re.search()

This function searches for the first occurrence of the regex pattern within the specified string.

Syntax:

re.search(pattern, string, flags=0)

The function returns a Match object if a match is found, otherwise it returns None.

import re 

string = "The quick brown fox jumps over the lazy dog." 

# 1 - search for the word "fox":
match = re.search(r"fox", string)
print(match)

# 2 - search for a non-word character: 
match = re.search(r"\W", string)
print(match) 

# 3 - search for a series of characters:
match = re.search(r"quick\sbrown", string)
print(match)

re.findall()

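This function returns all non-overlapping occurrences of the pattern in the string as a list of strings. If the pattern contains capturing groups, the list holds the contents of the groups instead of the whole matches.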

Syntax:

re.findall(pattern, string, flags = 0)
import re

text = 'My name is Prakriti, I am in class 8,  I live in India. I was born in 2000.' 

# find all numbers present in text

result = re.findall(r'\d+', text) 

print(result)
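The shape of the list returned by re.findall() depends on how many capturing groups the pattern contains. A short sketch on an illustrative key=value string:

```python
import re

text = "width=80, height=24"

print(re.findall(r"\w+=\d+", text))       # no groups: whole matches
print(re.findall(r"\w+=(\d+)", text))     # one group: the group's text
print(re.findall(r"(\w+)=(\d+)", text))   # two groups: a tuple per match
```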

re.sub()

This replaces all occurrences of pattern in string with replacement, and returns the modified string.

Syntax:

re.sub(pattern, repl, string, count=0, flags=0)

The pattern parameter is the regular expression to be matched. The repl parameter is the string to be substituted for each match. The string parameter is the string being processed. The count parameter is the maximum number of replacements to be made (the default is 0, which means all matches are replaced).

#1 Replace all occurrences of the letter "a" with the letter "b" 
import re

text = 'This is a test sentence.'

result = re.sub('a', 'b', text)

print(result)
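Beyond plain string replacement, re.sub() also supports the count parameter, backreferences to captured groups in repl, and a function as repl. A brief sketch on made-up dates:

```python
import re

text = "2024-01-15 and 2023-12-31"

# count=1 replaces only the first match.
print(re.sub(r"\d{4}", "YYYY", text, count=1))
# \1, \2 and \3 in repl refer to the captured groups.
print(re.sub(r"(\d{4})-(\d{2})-(\d{2})", r"\3/\2/\1", text))
# repl may also be a function that receives each Match object.
print(re.sub(r"\d+", lambda m: str(len(m.group())), text))
```

Writing repl as a raw string keeps the backreferences from being read as string escapes.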

re.split()

This function is used to split a string into substrings based on a specified regular expression pattern. It returns a list of strings that were split from the original string.

Syntax:

re.split(pattern, string, maxsplit=0, flags=0)
import re 

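test_data = "I love working with Python"
#a lowercase 'i' followed by whitespace never occurs here, so the string is returned unsplit
print(re.split(r'i\s',test_data)) 

test_data = "Python is a great language and very easy to learn"
#split on every whitespace character
print(re.split(r'\s',test_data))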
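re.split() is most useful when the separator varies, which a plain str.split() cannot express. The sketch below, on illustrative data, also shows maxsplit and the effect of a capturing group in the pattern.

```python
import re

data = "apple, banana;cherry  date"

# Split on any run of commas, semicolons or whitespace.
print(re.split(r"[,;\s]+", data))
# maxsplit limits the number of splits; the remainder stays intact.
print(re.split(r"[,;\s]+", data, maxsplit=2))
# A capturing group keeps the separators in the result.
print(re.split(r"(-)", "a-b-c"))
```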

re.finditer()

This function returns an iterator over all non-overlapping matches of a pattern in a given string. It is similar to re.findall(), but instead of returning a list of all matched strings, it returns an iterator that yields a Match object for each match.

Syntax:

re.finditer(pattern, string, flags=0)
import re

string = 'This is a sample string'

pattern = 'sample' 
match = re.finditer(pattern, string) 
print(match)

string = 'This is a sample string with a number 123, 124 and 125 in it'

pattern = r'\d+'
matches = re.finditer(pattern, string)
for match in matches:
    print(match.group())
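Because re.finditer() yields Match objects, each result carries its groups and its position, which re.findall() does not provide. A small sketch, assuming a made-up log line and a hypothetical pattern name entry_re:

```python
import re

log = "error at line 12, warning at line 40, error at line 97"

# entry_re is an illustrative name; group 1 is the kind, group 2 the line number.
entry_re = re.compile(r"(error|warning) at line (\d+)")

for m in entry_re.finditer(log):
    kind, line_no = m.groups()
    print(kind, int(line_no), m.span())

# Collect only the line numbers of the errors.
errors = [int(m.group(2)) for m in entry_re.finditer(log) if m.group(1) == "error"]
print(errors)
```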

re.escape()

This function escapes any special characters within a string, so that the string can be used as a pattern that matches its own text literally.

Syntax:

re.escape(pattern)

The ‘pattern’ value represents the text string that will be escaped.

import re 

string = 'This is (an example)'
escaped_string = re.escape(string)
print(escaped_string)

string = 'This string contains [] special characters!'
escaped_string = re.escape(string)
print(escaped_string)
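The escaped string is typically used as a pattern, for instance when the text to look for comes from user input. A minimal sketch, where needle stands in for such input:

```python
import re

text = "Price: $5.00 (approx.), not $5000"
needle = "$5.00"

# Used directly, '$' and '.' act as metacharacters, so this finds nothing.
print(re.findall(needle, text))
# Escaped, every character is matched literally.
print(re.findall(re.escape(needle), text))
```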

re.purge()

The re.purge() function clears the cache of compiled regular expressions that the re module maintains internally. Matching still works afterwards; patterns are simply compiled again on their next use.

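import re

pattern1 = r'\d+'

#the first call compiles the pattern and caches it
print(re.search(pattern1, '123abc'))

#clear the cache; the next call compiles the pattern again
re.purge()
print(re.search(pattern1, '456def'))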