RegEx - Regular Expressions
A regular expression, or RegEx, is a sequence of characters that defines a search pattern for finding specific character sequences in a text string. The interpreter used for processing the regular expressions is Microsoft VBScript Regular Expressions 5.5.
Search Patterns
A RegEx search pattern is a sequence of characters consisting of one or more of the following components:
- Literal characters.
- Metacharacters.
Literal characters simply match their counterparts in the input text. For instance, the literal character sequence abc in a search pattern will simply match all occurrences of abc in the input text. Metacharacters, on the other hand, are far more abstract.
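
As a minimal sketch of literal matching, the snippet below uses Python's standard `re` module. This is an assumption made purely for illustration: it is not the VBScript interpreter named above, although literal characters behave the same way in both. The sample text and variable names are made up. Note that the VBScript RegExp object returns only the first match unless its Global property is set to True.

```python
import re

text = "abc, abcabc and ABC"

# The literal pattern "abc" matches each occurrence of the exact sequence "abc".
literal_matches = re.findall(r"abc", text)
print(literal_matches)  # ['abc', 'abc', 'abc']

# Literal matching is case-sensitive by default, so "ABC" is not counted.
print(len(literal_matches))  # 3
```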
Common metacharacters
A metacharacter is a character, or combination of characters, with a special meaning in a RegEx pattern. Unlike literal characters, metacharacters can match several different characters in the input text or they can even represent something other than a character. I know some of these descriptions may be a bit confusing, but I encourage you to take a moment to familiarize yourself with the following metacharacters:
Metacharacter | Description |
---|---|
. | "Wildcard." The unescaped period matches any single character except a newline. |
^ | Beginning of a string. |
$ | End of a string. |
\ | "Escape." The backslash in front of a metacharacter turns it into a literal character. |
\b | "Word boundary" or "backspace character." Outside character classes, \b matches the position at the start or end of a word in the input text. Within character classes, \b denotes the backspace character. |
\B | "Not a word boundary." \B is the negation of \b, but has no alternate meaning within character classes. |
\d | "Digit." Matches any digit from 0-9. |
\D | "Not digit." Matches any character that's not a digit. |
\s | "Whitespace." Matches a whitespace character such as a space, newline or tab. |
\S | "Not whitespace." Matches any character that's not whitespace. |
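
The character-matching metacharacters from the table can be tried out with a short sketch. It assumes Python's standard `re` module rather than the VBScript engine. The sample text is hypothetical and plain ASCII, so the shorthand classes should behave as described in the table.

```python
import re

text = "Room 42 costs $3.50\tper night"

# "." stands in for any single character except a newline.
wildcard = re.findall(r"R..m", text)

# The escaped period "\." matches only a literal period.
price = re.findall(r"\d\.\d\d", text)

# "\d" matches each digit; "\D+" matches runs of non-digit characters.
digits = re.findall(r"\d", text)
non_digits = re.findall(r"\D+", text)

# "\s" matches whitespace (here, spaces and one tab);
# "\S+" matches the runs of non-whitespace characters in between.
whitespace = re.findall(r"\s", text)
words = re.findall(r"\S+", text)

print(wildcard, price, digits, non_digits, whitespace, words, sep="\n")
```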
The metacharacters ^, $, \b and \B are called anchors, since they match a position before, after, or between characters rather than a character itself.
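
Because anchors match positions rather than characters, it can help to look at where a pattern matches instead of what it matches. The sketch below assumes Python's standard `re` module as a stand-in for the VBScript engine. The helper `positions` and the sample text are hypothetical, introduced only for this illustration.

```python
import re

def positions(pattern, text):
    """Return the start index of every match (hypothetical helper)."""
    return [m.start() for m in re.finditer(pattern, text)]

text = "cat concat category"

# "^" anchors the match to the beginning of the string.
at_start = positions(r"^cat", text)

# "$" anchors the match to the end of the string; the text ends in "category".
at_end = positions(r"cat$", text)

# "\b" requires a word boundary at that position.
word_start = positions(r"\bcat", text)    # "cat" and "category"
word_end = positions(r"cat\b", text)      # "cat" and "concat"
whole_word = positions(r"\bcat\b", text)  # only the standalone "cat"

# "\B" requires that the position is not a word boundary.
inside_word = positions(r"\Bcat", text)   # only the "cat" in "concat"

print(at_start, at_end, word_start, word_end, whole_word, inside_word)
```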
Operators
Some metacharacters change how other components of the search pattern are interpreted, i.e. they perform operations on those components. RegEx comes with three types of operators. The last type in the table, quantification, is actually a group of metacharacter expressions:
Operator | Metacharacter | Description | Example |
---|---|---|---|
Boolean "or" | | | The vertical bar denotes the boolean "or" operator. | a|b matches either "a" or "b". |
Grouping | () | Parentheses are used for several purposes: 1) to define the scope and precedence of operators; 2) to group characters and remember the matched text. | h(a|e)y matches either "hay" or "hey". |
Quantification | ? | Zero or one occurrences of the preceding element. | colou?r matches both "color" and "colour". |
Quantification | * | Zero or more occurrences of the preceding element. | ab*c matches "ac", "abc", "abbc", "abbbc", and so on. |
Quantification | + | One or more occurrences of the preceding element. | ab+c matches "abc", "abbc", "abbbc", and so on, but not "ac". |
Quantification | {n} | The preceding item is matched exactly n times. | a{3} matches "aaa". |
Quantification | {min,} | The preceding item is matched min or more times. | a{1,} matches "a", "aa", "aaa" and so on. |
Quantification | {min,max} | The preceding item is matched at least min times, but not more than max times. | a{1,3} matches "a", "aa" and "aaa", but not "aaaa". |
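
The examples in the operator table can be checked with a small sketch. It assumes Python's standard `re` module instead of the VBScript engine; the operators shown are written the same way in both. The helper `whole_matches` is hypothetical and tests whether a pattern matches an entire candidate string, which is the sense in which the table says a pattern "matches" a word.

```python
import re

def whole_matches(pattern, candidates):
    """Return the candidates that the pattern matches in full (hypothetical helper)."""
    return [c for c in candidates if re.fullmatch(pattern, c) is not None]

# Boolean "or": either alternative may match.
either = re.findall(r"a|b", "cab")

# Grouping: the parentheses limit the scope of "|" and remember the matched text.
hay_hey = whole_matches(r"h(a|e)y", ["hay", "hey", "hoy"])
remembered = re.fullmatch(r"h(a|e)y", "hey").group(1)

# Quantification: how often the preceding element may occur.
optional = whole_matches(r"colou?r", ["color", "colour", "colouur"])
zero_or_more = whole_matches(r"ab*c", ["ac", "abc", "abbc"])
one_or_more = whole_matches(r"ab+c", ["ac", "abc", "abbc"])
exactly_three = whole_matches(r"a{3}", ["aa", "aaa", "aaaa"])
at_least_one = whole_matches(r"a{1,}", ["", "a", "aaa"])
one_to_three = whole_matches(r"a{1,3}", ["a", "aa", "aaa", "aaaa"])

print(either, hay_hey, remembered)
print(optional, zero_or_more, one_or_more)
print(exactly_three, at_least_one, one_to_three)
```

When searching inside a longer text instead of matching a whole string, a{1,3} still finds a match in "aaaa": it matches the first three characters and stops there.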
Character classes
Character classes or character sets are specified with square brackets [ ]. Some of the most common ones are:
[a-z] The set of lower-case letters ranging from a to z.
[A-Z] The set of upper-case letters ranging from A to Z.
[0-9] The set of single digits ranging from 0 to 9.
Character classes are frequently used in conjunction with operators in the search pattern. For instance, [0-5]+ translates to “find one or more consecutive digits in the range from zero to five”, rather than just a single such digit.
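
To see how a character class combines with a quantifier, here is a minimal sketch. It assumes Python's standard `re` module rather than the VBScript engine, and uses a made-up sample text.

```python
import re

text = "Code 1052 and 6403 and 99"

# Without a quantifier, [0-5] matches one digit at a time.
single_digits = re.findall(r"[0-5]", text)

# With "+", [0-5]+ matches whole runs of digits in the range 0 to 5.
# The 6 in "6403" and both 9s fall outside the range and are skipped.
low_digit_runs = re.findall(r"[0-5]+", text)

# [0-9]+ accepts every digit, so each number is matched in full.
numbers = re.findall(r"[0-9]+", text)

# Letter classes work the same way.
lower_runs = re.findall(r"[a-z]+", text)
upper_letters = re.findall(r"[A-Z]", text)

print(single_digits, low_digit_runs, numbers, lower_runs, upper_letters, sep="\n")
```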