A regular expression is a pattern, written as a string, that describes the format of expected results according to certain rules. It is very useful for defining patterns such as dates, invoice numbers, or credit card numbers.
Below is a basic introduction to regular expressions and some of the patterns that are possible. SmartZone ICR/OCR uses regular expressions to define the expected format of recognized text.
Basic Patterns (Literals)
For basic regular expressions the pattern is matched to the exact recognition results. For example, to find the exact string "123", the regular expression pattern is "123".
If the masking pattern includes special characters (asterisk, plus, question mark, period, square or curly brackets, bar, parentheses, dash, or dollar sign), then each of these characters requires escaping.
- To escape the special characters, precede them with a backslash ("\") character. For example, to define the string "$123", the pattern is "\$123".
- In C#, "\" is also the string escape character, so a pattern such as "\.123" must be typed as either @"\.123" or "\\.123". To match a literal backslash in the recognized text, escape it with another backslash. For example, to match the string "\123", the necessary pattern is "\\123" (typed in C# as @"\\123" or "\\\\123").
The period "." matches any single character.
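The escaping rules above can be tried out with a short sketch. It uses Python's `re` module as a stand-in, on the assumption that SmartZone's engine treats these literals and escapes the same way; the helper name `matches_exactly` is hypothetical. Python raw strings (`r"..."`) play the same role as C# verbatim strings (`@"..."`).

```python
import re

def matches_exactly(pattern, text):
    """Return True when the whole text matches the pattern."""
    return re.fullmatch(pattern, text) is not None

# A plain literal pattern matches exactly the same text.
print(matches_exactly(r"123", "123"))        # True

# "$" is special, so it needs a backslash to be matched literally.
print(matches_exactly(r"\$123", "$123"))     # True

# An escaped period matches only a period; an unescaped one matches anything.
print(matches_exactly(r"\.123", ".123"))     # True
print(matches_exactly(r"\.123", "x123"))     # False
print(matches_exactly(r".123", "x123"))      # True

# A literal backslash in the text is matched by an escaped backslash.
print(matches_exactly(r"\\123", "\\123"))    # True

# The raw string and the doubled-backslash string are the same pattern.
print(r"\.123" == "\\.123")                  # True
```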
Submatches
Regular expressions are more versatile than the basic patterns they contain. Patterns can contain submatches. A submatch is enclosed in parentheses and groups part of the regular expression. Within a submatch, the "|" character means "or" and separates the alternatives.
- To look for the pattern "ab" or "bc" or "cd", use the submatch expression "(ab|bc|cd)".
- To look for the pattern "[weekend day name] [time of day]", use the pattern "(Saturday|Sunday) (Morning|Afternoon|Night)". This matches every sequence of text that contains "Saturday Morning", "Saturday Afternoon", "Saturday Night", "Sunday Morning", "Sunday Afternoon", or "Sunday Night".
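As an illustration of the weekend example, the following sketch runs the same pattern through Python's `re` module (a stand-in for the SmartZone engine, which may differ in details). The function name `find_weekend_slots` is hypothetical.

```python
import re

# The pattern from the example above: one alternative from each submatch.
WEEKEND_SLOT = re.compile(r"(Saturday|Sunday) (Morning|Afternoon|Night)")

def find_weekend_slots(text):
    """Return every (day, time of day) pair found in the text."""
    return WEEKEND_SLOT.findall(text)

print(find_weekend_slots("See you Saturday Night or Sunday Morning"))
# [('Saturday', 'Night'), ('Sunday', 'Morning')]

print(find_weekend_slots("Monday Morning"))
# []
```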
Repetition Operators
The text to be matched is often of variable length. In such cases, it is useful to have repetition operators.
Pattern | Description |
* | Use the "*" to match the preceding sequence 0 or more times (SmartZone tries to match as many times as possible). For example, the pattern "(sunny )*day" matches "day", "sunny day", "sunny sunny day", etc. |
+ | "+" works the same as "*", but there must be a minimum of one instance of the submatch or characters present. For example, the pattern "(sunny )+day" matches "sunny day", "sunny sunny day", etc., but not "day" alone. |
? | Use the "?" to match the preceding sequence 0 or 1 time. For example, the pattern "(sunny )?day" will match "day" and "sunny day". Adding a "?" to a repeat operator makes the subexpression minimal or non-greedy. Normally, a repeated expression is greedy, in other words, it matches as many characters as possible. A non-greedy subexpression matches as few characters as possible. |
{n} | Use "{n}" to match the preceding sequence n times. For example, the pattern "(sunny ){2}day" only matches "sunny sunny day". |
{n,m} | Use "{n,m}" to match the preceding sequence between n and m times. For example, the pattern "(sunny ){2,3}day" matches "sunny sunny day" and "sunny sunny sunny day". |
{n,} | Use "{n,}" to match the preceding sequence at least n times. For example, the pattern "(sunny ){2,}day" matches "sunny sunny day" and "sunny sunny sunny sunny day". |
It is preferable to use the most restrictive pattern possible to improve recognition results.
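The repetition operators can be checked with a small sketch. It again uses Python's `re` module as a stand-in engine, and the helper names `matches_exactly` and `first_match` are hypothetical. The last two lines contrast a greedy repeat with its non-greedy form.

```python
import re

def matches_exactly(pattern, text):
    """Return True when the whole text matches the pattern."""
    return re.fullmatch(pattern, text) is not None

def first_match(pattern, text):
    """Return the first substring that matches, or None."""
    found = re.search(pattern, text)
    return found.group() if found else None

# Count-based operators applied to the "(sunny )" submatch.
print(matches_exactly(r"(sunny )*day", "day"))                  # True
print(matches_exactly(r"(sunny )+day", "day"))                  # False
print(matches_exactly(r"(sunny )?day", "sunny sunny day"))      # False
print(matches_exactly(r"(sunny ){2}day", "sunny sunny day"))    # True
print(matches_exactly(r"(sunny ){2,3}day", "sunny day"))        # False
print(matches_exactly(r"(sunny ){2,}day",
                      "sunny sunny sunny sunny day"))           # True

# Greedy: ".+" takes as much as it can. Non-greedy: ".+?" takes as little.
print(first_match(r"\(.+\)", "(a) (b)"))    # (a) (b)
print(first_match(r"\(.+?\)", "(a) (b)"))   # (a)
```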
Character Classes
Character classes are more convenient than submatches for matching one of a range of characters at a single position in a pattern. There are two ways to specify a character class: square brackets or the special backslash options.
Square Bracket Character Classes
Square brackets are the more versatile, but also the more verbose, option. They let you specify a set or range of characters to include or exclude. For example, "[A-Za-z]" matches any single uppercase or lowercase letter in the Latin character set, and "[aeiouy]" matches any one of the listed lowercase vowel characters.
- When two characters are separated by the dash "-" character, it implies the full range of characters between the two inclusive characters in the collating sequence. So, "[0-9]" will match any decimal digit.
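- To match a literal dash "-" inside a character class, place it first or last within the brackets, as in "[0-9-]", or write it as the collating element "[.-.]" inside the brackets, as in "[[.-.]0-9]".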
- Putting the caret ("^") character immediately after the opening square bracket excludes the listed characters or ranges: any character other than those in the brackets matches at this position. So, "[^aeiouy]" matches any character that is not one of those lowercase vowels (consonants, but also digits, spaces, and punctuation), and "[^0-9]" matches any character that is not a digit.
- There are also built-in character classes that can be incorporated within the brackets. These are recommended because their use remains standard, even if for some reason the standard character values change. Their syntax is "[[:class:]]" or "[^[:class:]]". A few useful classes are:
- alnum - alphanumeric characters
- alpha - alphabetic characters
- lower - lower case letters
- punct - punctuation: printable characters other than spaces and alphanumerics
- space - whitespace characters
- upper - upper case letters
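The bracket classes above can be sketched as follows. Python's `re` module is used as a stand-in, and it does not accept the "[[:class:]]" syntax, so the `NAMED_CLASSES` table below spells out ASCII approximations of each named class. That table and the `find_class` helper are assumptions made for this illustration, not part of SmartZone.

```python
import re
import string

# ASCII approximations of the named classes (an assumption for this sketch;
# an engine that supports [[:class:]] may also cover non-ASCII characters).
NAMED_CLASSES = {
    "alnum": "[A-Za-z0-9]",
    "alpha": "[A-Za-z]",
    "lower": "[a-z]",
    "upper": "[A-Z]",
    "space": r"[ \t\n\r\f\v]",
    "punct": "[" + re.escape(string.punctuation) + "]",
}

def find_class(name, text):
    """Return every character in text that belongs to the named class."""
    return re.findall(NAMED_CLASSES[name], text)

# An explicit set, an excluded range, and a set containing a literal dash.
print(re.findall(r"[aeiouy]", "rhythm and blues"))     # ['y', 'a', 'u', 'e']
print(re.findall(r"[^0-9]", "a1-b2"))                  # ['a', '-', 'b']
print(re.findall(r"[0-9-]+", "call 555-1234 now"))     # ['555-1234']

# Named classes, via the approximations above.
print(find_class("upper", "Invoice No. A-17"))         # ['I', 'N', 'A']
print(find_class("punct", "Invoice No. A-17"))         # ['.', '-']
```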
Backslash Character Classes
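Backslash classes are the less versatile, but more concise, option. Like the bracket classes, they specify predefined ranges of characters. Each is written as a single character preceded by the backslash ("\") character. Each matches exactly one character, except the two word-boundary entries, which match a position in the text rather than a character: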
- < - beginning of word
- > - end of word
- d - digit, equivalent to [[:digit:]]
- D - non-digit, equivalent to [^[:digit:]]
- s - whitespace, equivalent to [[:space:]]
- S - non-whitespace, equivalent to [^[:space:]]
- w - word character, equivalent to [[:alnum:]] (many regular expression engines also count the underscore as a word character)
- W - non-word character, equivalent to [^[:alnum:]]
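A brief sketch of the backslash classes, again using Python's `re` module as a stand-in. Two differences are assumptions worth noting: Python has no separate beginning-of-word and end-of-word markers, so the sketch uses its single word-boundary marker `\b` for both, and Python's `\w` includes the underscore.

```python
import re

# \d and \D: digits and runs of non-digits.
print(re.findall(r"\d", "Room 42b"))        # ['4', '2']
print(re.findall(r"\D+", "Room 42b"))       # ['Room ', 'b']

# \s: split on any run of whitespace (spaces or tabs here).
print(re.split(r"\s+", "a  b\tc"))          # ['a', 'b', 'c']

# \w: word characters; in Python this includes the underscore.
print(re.findall(r"\w+", "snake_case, 2 words"))
# ['snake_case', '2', 'words']

# Word boundaries match a position, not a character, so only the
# standalone occurrences of "cat" are found.
print(re.findall(r"\bcat\b", "cat concat cats cat."))   # ['cat', 'cat']
```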
More Examples
Pattern |
Description |
(\d{4} ){3}\d{4} |
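This pattern matches a standard 16-digit credit card number written as four groups of four digits separated by single spaces, for example "1234 5678 9012 3456".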
|
(\(?\d{3}\)?-?)?\d{3}-?\d{4} |
This pattern matches phone numbers written in a format that is standard in the United States (though it also matches some numbers in a non-standard format *). Numbers written in the following formats match this pattern:
- (555)-222-1234
- (555)222-1234
- (555)2221234
- 555-222-1234
- 222-1234
- 2221234
- 5552221234
* Poorly formatted numbers that this pattern may also match include:
- (555-222-1234
- 555)-222-1234
- 555222-1234
- (555-2221234
- 555)-2221234
|
\d{3}-?\d{2}-?\d{4} |
This pattern matches social security numbers, whether or not they are delimited by dashes:
- 078-05-1120
- 07805-1120
- 078-051120
- 078051120
|
\$(\d|,\d)+(\.\d\d)? |
This pattern matches monetary amounts, preceded by a “$” sign:
- $20,000.00
- $20
- $20.00
- $20,0.00 (a more specific pattern would be required to ensure that each comma is followed by exactly 3 digits)
|
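The example patterns can be checked against the listed strings with a short sketch, using Python's `re` module as a stand-in engine. `STRICT_MONEY` is a hypothetical stricter variant added for illustration, not a pattern from SmartZone. It requires comma-separated groups of exactly three digits, and as a consequence it also rejects amounts of four or more digits written without commas.

```python
import re

CARD = r"(\d{4} ){3}\d{4}"
PHONE = r"(\(?\d{3}\)?-?)?\d{3}-?\d{4}"
SSN = r"\d{3}-?\d{2}-?\d{4}"
MONEY = r"\$(\d|,\d)+(\.\d\d)?"
# Hypothetical stricter variant: 1 to 3 leading digits, then groups of ",ddd".
STRICT_MONEY = r"\$\d{1,3}(,\d{3})*(\.\d\d)?"

def is_full_match(pattern, text):
    """Return True when the whole text matches the pattern."""
    return re.fullmatch(pattern, text) is not None

print(is_full_match(CARD, "1234 5678 9012 3456"))     # True
print(is_full_match(PHONE, "(555)-222-1234"))         # True
print(is_full_match(PHONE, "(555-222-1234"))          # True (poorly formatted)
print(is_full_match(SSN, "07805-1120"))               # True
print(is_full_match(MONEY, "$20,0.00"))               # True (too permissive)
print(is_full_match(STRICT_MONEY, "$20,000.00"))      # True
print(is_full_match(STRICT_MONEY, "$20,0.00"))        # False
```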