ImageGear for C and C++ on Windows v19.1 - Updated
Regular Expressions

A regular expression is a pattern in the form of a string that describes or matches the format of expected results, according to certain rules. Below is a basic introduction to regular expressions and some patterns that are possible. ImageGear uses regular expressions to define the expected format of recognized text.

Basic Patterns (Literals)

For basic regular expressions, the pattern is matched to the exact recognition results. For example, to find the exact string "123", the regular expression pattern is "123".

If the pattern includes special characters that should be matched literally, each of these characters requires escaping.

To escape the special characters, precede them with a backslash ("\", U+005C) character. For example, to define the string "$123", the pattern is "\$123".

In C and C++ source code, the backslash ("\", U+005C) is itself the escape character in string literals, so you have to type "\\$123". Likewise, to match a literal backslash, escape it with another backslash in the pattern. For example, to match the string "\123", the necessary pattern is "\\123", which is typed as "\\\\123" in a C or C++ string literal.

The period "." matches any single character.
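The escaping rules above can be sketched in C++ as follows. This is only an illustration: std::regex (default ECMAScript grammar) stands in for the pattern engine, and the helper contains_match and the constant names are hypothetical, not part of ImageGear.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Returns true if `pattern` matches anywhere inside `text`.
// Assumption: std::regex is used only as a stand-in engine for illustration.
bool contains_match(const std::string& text, const std::string& pattern) {
    return std::regex_search(text, std::regex(pattern));
}

// The regular expression \$123 is typed as "\\$123" in a string literal.
const char* const kDollarPattern = "\\$123";

// A raw string literal (C++11) holds the same pattern without doubling.
const char* const kDollarPatternRaw = R"(\$123)";

// The regular expression \\123 (a literal backslash, then 123)
// needs four backslashes in an ordinary string literal.
const char* const kBackslashPattern = "\\\\123";

// An unescaped period matches any single character.
const char* const kAnyCharPattern = "1.3";
```

For instance, contains_match("Total: $123", kDollarPattern) is true, while the same call on "Total: 123" is false because the escaped "$" must appear literally.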

Sub-Matches

Regular expressions are more versatile than the basic patterns they contain. Patterns can contain sub-matches. Sub-matches are located between parentheses and enable splitting the regular expression. The "|" character means "or" in regular expressions.

To look for the pattern "ab" or "bc" or "cd", use the sub-match expression "(ab|bc|cd)".

To look for the pattern "[weekend day name] [time of day]", use the pattern "(Saturday|Sunday) (Morning|Afternoon|Night)". This matches every sequence of text that contains "Saturday Morning", "Saturday Afternoon", "Saturday Night", "Sunday Morning", "Sunday Afternoon", or "Sunday Night".
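As a hedged illustration of sub-matches, the following C++ sketch applies the weekend pattern and reads back which alternative was chosen in each pair of parentheses. std::regex is an assumed stand-in engine, and find_weekend_slot is a hypothetical helper name.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Finds the first "<weekend day> <time of day>" phrase in `text`.
// On success, stores the two sub-matches and returns true.
bool find_weekend_slot(const std::string& text,
                       std::string& day,
                       std::string& time_of_day) {
    static const std::regex pattern(
        "(Saturday|Sunday) (Morning|Afternoon|Night)");
    std::smatch m;
    if (!std::regex_search(text, m, pattern)) {
        return false;
    }
    day = m[1].str();          // first parenthesized group
    time_of_day = m[2].str();  // second parenthesized group
    return true;
}
```

Called on "Meet on Sunday Afternoon at the lake", the function returns true with day set to "Sunday" and time_of_day set to "Afternoon". A weekday phrase such as "Monday Morning" is not matched.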

Repetition Operators

The text to be matched is often of variable length. In such cases, repetition operators are useful: they specify how many times the preceding character, character class, or sub-match may occur.

It is preferable to use the most restrictive pattern possible to improve recognition results.
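The sketch below contrasts a loose pattern with a more restrictive one. It assumes the repetition operators common to most regular expression syntaxes ("?", "*", "+", "{n}", "{n,m}"). std::regex is a stand-in engine, and the five-digit code with an optional four-digit suffix is a hypothetical format chosen only for illustration.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Returns true only if the whole of `text` matches `pattern`.
bool matches_fully(const std::string& text, const std::string& pattern) {
    return std::regex_match(text, std::regex(pattern));
}

// "?"     : the preceding item is optional (zero or one time)
// "*"     : zero or more times
// "+"     : one or more times
// "{n}"   : exactly n times
// "{n,m}" : between n and m times, inclusive

// Loose: any run of one or more digits.
const char* const kLoosePattern = "\\d+";

// Restrictive (hypothetical format): exactly five digits,
// optionally followed by a dash and exactly four digits.
const char* const kStrictPattern = "\\d{5}(-\\d{4})?";
```

The loose pattern accepts "1234", which the restrictive pattern rejects. The restrictive pattern accepts only "12345" or "12345-6789" style input, so it leaves less room for accidental matches.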

Character Classes

Character classes are more convenient than sub-matches for matching one of a set of characters at a single position in a pattern. There are two ways to specify a character class: square brackets or the special backslash options. Each is described in the following sections.

Square Bracket Character Classes

This is the more versatile, but also the more verbose, option. It allows you to specify a set or range of characters to include or exclude. For example, "[A-Za-z]" matches any single uppercase or lowercase letter of the basic Latin alphabet, and "[aeiouy]" matches any one of the listed vowel characters.

When two characters are separated by the dash "-" character, the class covers the full range of characters between the two, inclusive, in the collating sequence. So, "[0-9]" will match any decimal digit.

To match the dash "-" character itself, use the collating-symbol form "[.-.]" inside the brackets, as in "[[.-.]]".

Placing the caret ("^") character immediately after the opening square bracket negates the class: any character matches at this position except those listed in the brackets. So, "[^aeiouy]" matches any character that is not one of those vowels (consonants, but also digits, spaces, and punctuation), and "[^0-9]" matches any character that is not a decimal digit.
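The bracket classes above can be tried out with a short C++ sketch. std::regex is assumed here as a stand-in engine, and class_matches is a hypothetical helper that tests one character against one class.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Returns true if the single character `c` matches `char_class`,
// where `char_class` is a pattern that consumes exactly one character.
bool class_matches(char c, const std::string& char_class) {
    return std::regex_match(std::string(1, c), std::regex(char_class));
}
```

With this helper, class_matches('q', "[A-Za-z]") and class_matches('7', "[0-9]") are true. A negated class accepts everything not listed, so class_matches('7', "[^aeiouy]") is also true even though "7" is not a consonant.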

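There are also built-in character classes that can be incorporated within the brackets. These are recommended because their use remains standard, even if for some reason the standard character values change. Their syntax is "[[:class:]]" or "[^[:class:]]". A few useful classes are "alpha" (letters), "digit" (decimal digits), "alnum" (letters and digits), "upper" and "lower" (uppercase and lowercase letters), "space" (whitespace), and "punct" (punctuation).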
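A minimal sketch of the "[[:class:]]" syntax follows. It assumes the standard POSIX class names (digit, alpha, upper, space, alnum, punct), which std::regex accepts. Which names ImageGear accepts is not confirmed here, and in_class is a hypothetical helper.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Returns true if the single character `c` matches the bracket
// expression `bracket_expr`, for example "[[:digit:]]".
bool in_class(char c, const std::string& bracket_expr) {
    return std::regex_match(std::string(1, c), std::regex(bracket_expr));
}

// A built-in class can be combined with ordinary characters in the
// same brackets: letters, digits, or an underscore.
const char* const kIdentifierChar = "[[:alnum:]_]";
```

For example, in_class('7', "[[:digit:]]") is true and in_class(' ', "[^[:space:]]") is false. in_class('_', kIdentifierChar) is true because the underscore is listed alongside the built-in class.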

Backslash Character Classes

This is the less versatile, but more concise, option. Like the bracket classes, these specify predefined sets of characters, and each matches a single character. Each is written as a single character preceded by the backslash ("\") character. For example, "\d" matches any decimal digit.
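The following sketch exercises a few backslash classes. Only "\d" appears elsewhere in this topic. "\w", "\s", and the negated form "\D" are assumed from common regular expression syntax, and std::regex is used as a stand-in engine. The helper shorthand_matches is hypothetical.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Returns true if the single character `c` matches `shorthand`,
// a backslash class written as a C++ string literal (backslash doubled).
bool shorthand_matches(char c, const std::string& shorthand) {
    return std::regex_match(std::string(1, c), std::regex(shorthand));
}

const char* const kDigit      = "\\d";  // decimal digit, like [0-9]
const char* const kNonDigit   = "\\D";  // anything except a decimal digit
const char* const kWordChar   = "\\w";  // letter, digit, or underscore
const char* const kWhitespace = "\\s";  // space, tab, newline, and similar
```

Here shorthand_matches('4', kDigit) gives the same result as a test against "[0-9]", which is why the backslash form is described as the more concise option.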

More Examples

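Each pattern below is followed by its description. The patterns are written as they appear in a C or C++ string literal, with every backslash doubled.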

(\\d{4} ){3}\\d{4}

This pattern finds the standard 16-digit credit card number:

  • 1234 5678 9012 3456

(\\(?\\d{3}\\)?-?)?\\d{3}-?\\d{4}

This pattern matches phone numbers written in a format that is standard in the United States (though it sometimes will match numbers in a non-standard format). Numbers written in the following formats match this pattern:

  • (555)-222-1234
  • (555)222-1234
  • (555)2221234
  • 555-222-1234
  • 222-1234
  • 2221234
  • 5552221234

Poorly formatted numbers that this pattern also matches include:

  • (555-222-1234
  • 555)-222-1234
  • 555222-1234
  • (555-2221234
  • 555)-2221234

\\d{3}-?\\d{2}-?\\d{4}

This pattern matches social security numbers, whether or not they are delimited by dashes:

  • 078-05-1120
  • 07805-1120
  • 078-051120
  • 078051120

\\$(\\d|,\\d)+(\\.\\d\\d)?

This pattern matches monetary amounts, preceded by a "$" sign:

  • $20,000.00
  • $20
  • $20.00
  • $20,0.00 (a more specific pattern would be required to ensure that a comma is followed by at least 3 digits)
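The patterns in this table can be checked against their sample strings with a short C++ sketch. std::regex is an assumed stand-in engine. The helpers all_match and mask_matches and the replacement text "[REDACTED]" are hypothetical, and they illustrate pattern behavior only, not ImageGear's redaction API.

```cpp
#include <cassert>
#include <regex>
#include <string>
#include <vector>

// The four patterns, exactly as typed in a C++ string literal.
const char* const kCreditCard = "(\\d{4} ){3}\\d{4}";
const char* const kPhone      = "(\\(?\\d{3}\\)?-?)?\\d{3}-?\\d{4}";
const char* const kSsn        = "\\d{3}-?\\d{2}-?\\d{4}";
const char* const kMoney      = "\\$(\\d|,\\d)+(\\.\\d\\d)?";

// Returns true if every sample is matched in full by `pattern`.
bool all_match(const char* pattern, const std::vector<std::string>& samples) {
    const std::regex re(pattern);
    for (const std::string& s : samples) {
        if (!std::regex_match(s, re)) {
            return false;
        }
    }
    return true;
}

// Replaces every occurrence of `pattern` in `text` with "[REDACTED]".
std::string mask_matches(const std::string& text, const char* pattern) {
    return std::regex_replace(text, std::regex(pattern), "[REDACTED]");
}
```

For example, mask_matches("SSN 078-05-1120 on file", kSsn) yields "SSN [REDACTED] on file". The phone pattern also accepts unbalanced input such as "(555-222-1234", which shows why a more restrictive pattern is preferable when the expected format is known.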