Accusoft.SmartZoneOCR4.Net
Regular Expressions


A regular expression is a pattern, written as a string, that describes the format of expected results according to certain rules. SmartZone OCR uses regular expressions to define the expected format of recognized text. They are very useful for defining patterns such as dates, invoice numbers, or credit card numbers. Below is a basic introduction to regular expressions and some of the patterns that are possible.

Basic Patterns (Literals)

For basic regular expressions the pattern is matched to the exact recognition results. For example, to find the exact string "123", the regular expression pattern is "123".

If the masking pattern includes special characters (asterisk, plus, question mark, period, square or curly brackets, bar, parentheses, dash, or dollar sign), then each of these characters requires escaping. To escape a special character, precede it with a backslash ("\") character. For example, to match the string "$123", the pattern is "\$123".
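The effect of escaping can be sketched with Python's standard `re` module. Using Python here is an assumption made purely for illustration; SmartZone OCR's own expression engine may differ in details. The sketch compares the escaped pattern from the example above with its unescaped form.

```python
import re

# Python's `re` module stands in for the OCR engine here (an assumption).
escaped = r"\$123"   # "\$" means a literal dollar sign
unescaped = "$123"   # "$" keeps its special meaning

escaped_hit = re.fullmatch(escaped, "$123") is not None
unescaped_hit = re.fullmatch(unescaped, "$123") is not None

print(escaped_hit)    # True
print(unescaped_hit)  # False
```

The unescaped form fails because "$" is read as a special character rather than as the dollar sign itself.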

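In C# string literals, "\" is itself an escape character, so a pattern such as \.123 must be written either as the verbatim string @"\.123" or with the backslash doubled, "\\.123". Separately, to match a literal backslash in the recognized text, escape it in the pattern with another backslash. For example, to match the string "\123", the necessary pattern is \\123 (in C#, @"\\123").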

The period "." matches any single character.
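The difference between the period as a wildcard and an escaped period can be sketched as follows. Python's `re` module is used only as a stand-in (an assumption), and the sample strings are made up for the illustration.

```python
import re

any_char = "1.3"        # "." stands for any single character
literal_dot = r"1\.3"   # "\." stands for a period only

samples = ["123", "1a3", "1.3", "13"]
any_char_hits = [s for s in samples if re.fullmatch(any_char, s)]
literal_hits = [s for s in samples if re.fullmatch(literal_dot, s)]

print(any_char_hits)  # ['123', '1a3', '1.3']
print(literal_hits)   # ['1.3']
```

Note that "13" matches neither pattern: the period always consumes exactly one character.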

Submatches

Regular expressions can do more than match literal text. A pattern can contain submatches, which are enclosed in parentheses and split the regular expression into parts. The "|" character means "or" in regular expressions.

To look for the pattern "ab" or "bc" or "cd", use the submatch expression "(ab|bc|cd)".

To look for the pattern "[weekend day name] [time of day]", use the pattern "(Saturday|Sunday) (Morning|Afternoon|Night)". This matches every sequence of text that contains "Saturday Morning", "Saturday Afternoon", "Saturday Night", "Sunday Morning", "Sunday Afternoon", or "Sunday Night".
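The weekend pattern above can be tried out with a short sketch. Python's `re` module is assumed here only for demonstration, and the sample sentence is invented; the OCR engine's behavior may differ in details.

```python
import re
from itertools import product

pattern = "(Saturday|Sunday) (Morning|Afternoon|Night)"

# Build the six phrases the pattern is expected to accept.
days = ["Saturday", "Sunday"]
times = ["Morning", "Afternoon", "Night"]
phrases = [f"{d} {t}" for d, t in product(days, times)]
all_match = all(re.fullmatch(pattern, p) for p in phrases)

# Each pair of parentheses captures its own part of the match.
m = re.search(pattern, "See you Sunday Night at the lake")
found = m.group(0)
day, time_of_day = m.group(1), m.group(2)

# A weekday is not one of the listed alternatives, so nothing is found.
weekday_hit = re.search(pattern, "Monday Morning")

print(all_match)         # True
print(found)             # Sunday Night
print(day, time_of_day)  # Sunday Night
print(weekday_hit)       # None
```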

Repetition Operators

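The text to be matched is often of variable length. In such cases, it is useful to have repetition operators. A repetition operator follows the item it applies to; the examples in this topic use "?" (the item is optional), "+" (the item occurs one or more times), and "{n}" (the item occurs exactly n times).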

It is preferable to use the most restrictive pattern possible to improve recognition results.
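The repetition operators that appear in the examples in this topic ("?", "+", and "{n}") can be sketched as below, assuming they carry their conventional meaning. Python's `re` module is a stand-in for the OCR engine (an assumption), and the helper name `matches` is hypothetical.

```python
import re

def matches(pattern, text):
    """Return True when the whole text fits the pattern (illustrative helper)."""
    return re.fullmatch(pattern, text) is not None

samples = ("ac", "abc", "abbc")
optional = [matches("ab?c", s) for s in samples]     # "b" zero or one time
one_or_more = [matches("ab+c", s) for s in samples]  # "b" one or more times
exactly_two = [matches("ab{2}c", s) for s in samples]  # "b" exactly twice

print(optional)     # [True, True, False]
print(one_or_more)  # [False, True, True]
print(exactly_two)  # [False, False, True]

# A restrictive pattern rejects text that a loose one would accept.
loose_accepts = matches(r"\d+", "12345")
strict_accepts = matches(r"\d{4}", "12345")
print(loose_accepts, strict_accepts)  # True False
```

The last two lines show why a restrictive pattern helps: if a field must hold exactly four digits, "\d{4}" rules out a five-digit misread that "\d+" would let through.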

Character Classes

Character classes are more convenient than submatches for matching one of a range of characters at a single position in a pattern. There are two ways to specify a character class: square brackets or the special backslash options.

Square Bracket Character Classes

This is the more versatile, but also the more verbose, option. It allows you to specify a set or range of characters to include or exclude. For example, "[A-Za-z]" matches any single uppercase or lowercase letter in the Latin character set, and "[aeiouy]" matches any one of the listed vowel characters.

When two characters are separated by the dash "-" character, the class covers the full range of characters between them in the collating sequence, including the two characters themselves. So, "[0-9]" matches any decimal digit.

To match the dash "-" character itself within square brackets, use "[.-.]" inside the brackets.

Putting the caret ("^") character immediately after the opening square bracket means that any character other than those in the brackets matches at this position. In other words, the specified characters or range of characters are excluded. So, "[^0-9]" matches any character that is not a digit, and "[^aeiouy]" matches any character other than the listed vowels, which includes consonants but also digits, spaces, and punctuation.
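The bracket classes described above can be compared side by side in a short sketch. Python's `re` module is assumed only for illustration, and the helper name `hits` and the sample characters are hypothetical.

```python
import re

def hits(pattern, chars):
    """Return the characters that the one-position pattern accepts."""
    return [c for c in chars if re.fullmatch(pattern, c)]

chars = ["a", "b", "Q", "7", "-", "y"]

letters = hits("[A-Za-z]", chars)
digits = hits("[0-9]", chars)
vowels = hits("[aeiouy]", chars)
non_digits = hits("[^0-9]", chars)
non_vowels = hits("[^aeiouy]", chars)

print(letters)     # ['a', 'b', 'Q', 'y']
print(digits)      # ['7']
print(vowels)      # ['a', 'y']
print(non_digits)  # ['a', 'b', 'Q', '-', 'y']
print(non_vowels)  # ['b', 'Q', '7', '-']
```

The last result shows that an excluding class accepts everything outside the brackets, here the digit and the dash as well as the consonants.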

There are also built-in character classes that can be incorporated within the brackets. These are recommended because their use remains standard even if, for some reason, the standard character values change. Their syntax is "[[:class:]]" or "[^[:class:]]", where "class" is the name of the class. In standard POSIX notation, for example, "[[:digit:]]" matches any decimal digit and "[[:alpha:]]" matches any letter.

Backslash Character Classes

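This is the less versatile, but more concise, option. Like the bracket classes, the backslash classes specify predefined ranges of characters, and each matches one character. Each is written as a single character preceded by the backslash ("\") character; for example, "\d", which is used in the examples in this topic, matches any single decimal digit.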
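One backslash class, "\d", appears throughout the examples in this topic. The sketch below compares it with the bracket class "[0-9]" on a few made-up samples, using Python's `re` module as a stand-in for the OCR engine (an assumption).

```python
import re

samples = ["7", "x", "42", " "]

# Both forms describe exactly one character position.
backslash_hits = [s for s in samples if re.fullmatch(r"\d", s)]
bracket_hits = [s for s in samples if re.fullmatch("[0-9]", s)]

print(backslash_hits)  # ['7']
print(bracket_hits)    # ['7']

# Note: in Python, "\d" also accepts non-ASCII digits unless re.ASCII is set;
# whether the OCR engine does the same is not known here.
```

"42" is rejected by both forms because each class matches a single character; a repetition operator is needed to match more.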

More Examples

Each example below gives a pattern, followed by its description.

Pattern: (\d{4} ){3}\d{4}

This pattern matches a 16-digit credit card number written as four groups of four digits separated by spaces:

  • 1234 5678 9012 3456

Pattern: (\(?\d{3}\)?-?)?\d{3}-?\d{4}

This pattern matches phone numbers written in formats that are standard in the United States, although it also matches some poorly formatted numbers. Numbers written in the following formats match this pattern:

  • (555)-222-1234
  • (555)222-1234
  • (555)2221234
  • 555-222-1234
  • 222-1234
  • 2221234
  • 5552221234

Poorly formatted numbers that this pattern also matches include:

  • (555-222-1234
  • 555)-222-1234
  • 555222-1234
  • (555-2221234
  • 555)-2221234

Pattern: \d{3}-?\d{2}-?\d{4}

This pattern matches social security numbers, whether or not they are delimited by dashes:

  • 078-05-1120
  • 07805-1120
  • 078-051120
  • 078051120

Pattern: \$(\d|,\d)+(\.\d\d)?

This pattern matches monetary amounts preceded by a "$" sign:

  • $20,000.00
  • $20
  • $20.00
  • $20,0.00 (a more specific pattern would be required to ensure that each comma is followed by exactly three digits)
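The example patterns can be checked against the listed samples with a short sketch. Python's `re` module is used as a stand-in for the OCR engine, which is an assumption, so results in SmartZone OCR may differ in details. `STRICT_MONEY` and the helper `ok` are hypothetical names, and the stricter pattern is just one possible way to require three digits after each comma.

```python
import re

# Patterns copied from the examples in this topic.
CREDIT_CARD = r"(\d{4} ){3}\d{4}"
PHONE = r"(\(?\d{3}\)?-?)?\d{3}-?\d{4}"
SSN = r"\d{3}-?\d{2}-?\d{4}"
MONEY = r"\$(\d|,\d)+(\.\d\d)?"
# Hypothetical stricter variant: plain digits, or groups of three after commas.
STRICT_MONEY = r"\$(\d+|\d{1,3}(,\d{3})+)(\.\d\d)?"

def ok(pattern, text):
    """Return True when the whole text fits the pattern (illustrative helper)."""
    return re.fullmatch(pattern, text) is not None

good_phones = ["(555)-222-1234", "(555)222-1234", "(555)2221234",
               "555-222-1234", "222-1234", "2221234", "5552221234"]
odd_phones = ["(555-222-1234", "555)-222-1234", "555222-1234",
              "(555-2221234", "555)-2221234"]
ssns = ["078-05-1120", "07805-1120", "078-051120", "078051120"]
amounts = ["$20,000.00", "$20", "$20.00", "$20,0.00"]

card_ok = ok(CREDIT_CARD, "1234 5678 9012 3456")
phones_ok = all(ok(PHONE, p) for p in good_phones + odd_phones)
ssns_ok = all(ok(SSN, s) for s in ssns)
money_ok = [ok(MONEY, a) for a in amounts]
strict_ok = [ok(STRICT_MONEY, a) for a in amounts]

print(card_ok, phones_ok, ssns_ok)  # True True True
print(money_ok)                     # [True, True, True, True]
print(strict_ok)                    # [True, True, True, False]
```

The stricter variant rejects "$20,0.00" while still accepting the well-formed amounts, which illustrates the benefit of using the most restrictive pattern that fits the data.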

 


©2013. Accusoft Corporation. All Rights Reserved.