Regex character classes, also known as character sets, are used in regular expressions to define a group or range of characters that can be matched within a pattern. They allow you to specify a set of possible characters that can occur at a particular position in the text you're searching or manipulating.
Basic Character Classes
Character classes are enclosed within square brackets `[]` and can contain individual characters, character ranges, or predefined character classes. For example, if you want to match an a or an e, use `[ae]`. You could use this in `gr[ae]y` to match either "gray" or "grey".
import re
pattern = r"gr[ae]y"
print(re.findall(pattern, "The sky was clear, but as the sun set, it transformed into a beautiful shade of grey."))
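A character class always stands for exactly one character at its position in the pattern. The short sketch below illustrates this with a hypothetical word list (the list and the use of `re.fullmatch` are illustrative choices, not part of the original example):

```python
import re

pattern = r"gr[ae]y"
# Hypothetical test words: only those with exactly one 'a' or 'e' in the third position match.
words = ["gray", "grey", "graey", "gry", "groy"]
matched = [w for w in words if re.fullmatch(pattern, w)]
print(matched)  # ['gray', 'grey']
```

"graey" fails because the class consumes a single character, and "gry" fails because the class cannot match nothing.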
Ranges in Character Sets
A hyphen '-' inside a character class specifies a range of characters. For example, `[a-z]` matches any lowercase letter from 'a' to 'z', `[A-Z]` matches any uppercase letter from 'A' to 'Z', and `[0-9]` matches any digit from 0 to 9.
import re
pattern = r"[a-f]"  # matches letters from a to f
S = "abxcywez"
print(re.findall(pattern, S))
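As a rough illustration of how ranges behave, the sketch below shows the output for the string above, two ranges placed in one class, and a hyphen that is treated literally because it sits at the end of the class. The extra sample strings are made up for the example.

```python
import re

S = "abxcywez"
a_to_f = re.findall(r"[a-f]", S)
print(a_to_f)  # ['a', 'b', 'c', 'e']

# Two ranges in one class: lowercase hexadecimal digits.
hex_digits = re.findall(r"[0-9a-f]", "#1a2bzz")
print(hex_digits)  # ['1', 'a', '2', 'b']

# A hyphen placed last in the class is a literal hyphen, not a range.
a_or_hyphen = re.findall(r"[a-]", "a-b-c")
print(a_or_hyphen)  # ['a', '-', '-']
```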
Metacharacters inside Character Classes
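Most metacharacters lose their special meaning inside a character class, so only a few need escaping with a backslash: the closing bracket ']', the backslash itself, a caret '^' in first position, and a hyphen '-' between two characters. Escaping the others is harmless and can make the pattern easier to read. For example, `[\[\]()]` matches any of the characters '[', ']', '(', or ')'.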
import re
pattern = r"[\[\]\(\)\{\}]"
S = "Programming language syntax includes various symbols, such as square brackets [], parentheses (), and curly braces {}."
print(re.findall(pattern, S))
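To see which characters actually need a backslash inside brackets, here is a small sketch with made-up sample strings. It assumes Python's `re` module, where `.`, `+`, `*` and `?` are ordinary characters inside a class.

```python
import re

# Inside a class, ., +, * and ? are ordinary characters; no backslash is needed.
punct = r"[.+*?]"
found = re.findall(punct, "a.b+c*d?e")
print(found)  # ['.', '+', '*', '?']

# The backslash, the closing bracket, and the hyphen are escaped here
# so that they are read as literal members of the class.
special = r"[\\\]\-]"
found_special = re.findall(special, r"x\y]z-w")
print(found_special)  # ['\\', ']', '-']
```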
Character classes are case-sensitive by default. Use the `re.I` flag or include both uppercase and lowercase letters to make a class case-insensitive.
import re
pattern = r"[a-z]"
S = "123a45x56Y7z8"
print(re.findall(pattern, S, re.I))  # the re.I flag makes the pattern case-insensitive
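The second option, listing both cases in the class, can be sketched as follows. For ASCII input such as this string, the two approaches should give the same result; the variable names are illustrative.

```python
import re

S = "123a45x56Y7z8"

# Option 1: keep the class lowercase and pass the re.I flag.
with_flag = re.findall(r"[a-z]", S, re.I)
# Option 2: list both ranges explicitly and use no flag.
explicit = re.findall(r"[a-zA-Z]", S)

print(with_flag)              # ['a', 'x', 'Y', 'z']
print(with_flag == explicit)  # True
```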
Combining Multiple Character Classes
You can combine multiple character classes by placing their contents adjacent to each other inside a single pair of brackets. For example, `[a-zA-Z0-9]` matches any alphanumeric character. The combined character class matches any character that matches at least one of the individual character classes. Do not use the pipe '|' for this: inside brackets it is a literal character, so `[a-zA-Z|0-9]` would also match '|'.
import re
pattern = r"[a-zA-Z0-9]"
S = "1@#Rj.*k"
print(re.findall(pattern, S))
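One detail worth checking when combining classes is how a pipe behaves inside brackets. The sketch below uses a variant of the sample string with a '|' added (an assumption made for the demonstration) to show that the pipe is matched as an ordinary character rather than acting as alternation.

```python
import re

S = "1@#R|j.*k"  # sample string with a literal pipe added

with_pipe = re.findall(r"[a-zA-Z|0-9]", S)
without_pipe = re.findall(r"[a-zA-Z0-9]", S)

print(with_pipe)     # ['1', 'R', '|', 'j', 'k']
print(without_pipe)  # ['1', 'R', 'j', 'k']
```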
Negation
A caret '^' at the beginning of a character class negates it. For example, `[^0-9]` matches any character that is not a digit.
When the caret '^' is not the first character within a character class, it matches a literal caret.
import re
pattern = r"[^0-9]"
S = "abcd123456xyz"
print(re.findall(pattern, S))
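The position of the caret changes its meaning, which a short sketch can make concrete. The sample string is made up for the example.

```python
import re

S = "2^10 = 1024"

# Caret first: negation, so everything except digits is matched.
not_digits = re.findall(r"[^0-9]", S)
print(not_digits)  # ['^', ' ', '=', ' ']

# Caret not first: a literal caret alongside the digits.
digit_or_caret = re.findall(r"[0-9^]", S)
print(digit_or_caret)  # ['2', '^', '1', '0', '1', '0', '2', '4']
```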
Predefined Character Classes
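\d matches any digit character. For ASCII text this is the same as `[0-9]`; in Python 3 string patterns it also matches other Unicode digits unless the re.ASCII flag is set.
\D matches any non-digit character `[^0-9]`.
\w matches any word character (alphanumeric and underscore). For ASCII text this is the same as `[a-zA-Z0-9_]`.
\W matches any non-word character `[^a-zA-Z0-9_]`.
\s matches any whitespace character (space, tab, newline, etc).
\S matches any non-whitespace character.
\b matches a word boundary (the position between a word character and a non-word character). It is a zero-width assertion rather than a character class, and inside brackets `[\b]` means a backspace character instead.
\B matches any position that is not a word boundary.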
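A few of these shorthands in action, as a minimal sketch with a made-up sample string. The `+` quantifier (one or more repetitions) is used only to group consecutive characters into a single match.

```python
import re

S = "Order 66 shipped on 2024-05-01"

digits = re.findall(r"\d+", S)
print(digits)  # ['66', '2024', '05', '01']

non_space = re.findall(r"\S+", S)
print(non_space)  # ['Order', '66', 'shipped', 'on', '2024-05-01']

# Shorthands can also appear inside brackets: a digit or a literal hyphen.
date_like = re.findall(r"[\d-]+", S)
print(date_like)  # ['66', '2024-05-01']

# \b keeps "on" from matching inside "onward".
whole_on = re.findall(r"\bon\b", "on and on, onward")
print(whole_on)  # ['on', 'on']
```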
Character classes in regular expressions provide a powerful way to define patterns for matching specific sets of characters in text. Understanding and effectively using character classes can greatly enhance your regex skills. Experiment and practice with different character class combinations to become proficient in their usage.