SmartZone v6.1 for .NET - Updated
Regular Expressions

A regular expression is a pattern, written as a string, that describes the format of expected results according to certain rules. It is very useful for defining patterns such as dates, invoice numbers, or credit card numbers.

Below is a basic introduction to regular expressions and some of the patterns that are possible. SmartZone ICR/OCR uses regular expressions to define the expected format of recognized text.

Basic Patterns (Literals)

For basic regular expressions, the pattern is matched against the exact recognition results. For example, to find the exact string "123", the regular expression pattern is "123".

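If the masking pattern includes special characters (asterisk, plus, question mark, period, square or curly brackets, bar, parentheses, dash, or dollar sign) that should be matched literally, each of these characters requires escaping with a preceding backslash. For example, "\$" matches a literal dollar sign.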

The period "." matches any single character.
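The difference between a literal, an unescaped period, and an escaped period can be sketched with Python's standard re module. Python is used here only as a stand-in: the SmartZone engine may differ in detail, and the helper name matches is hypothetical. The sketch assumes the pattern must match the whole text.

```python
import re

def matches(pattern: str, text: str) -> bool:
    """Return True when the whole text matches the pattern (hypothetical helper)."""
    return re.fullmatch(pattern, text) is not None

# A literal pattern matches only the exact text.
print(matches(r"123", "123"))    # True
print(matches(r"123", "1234"))   # False

# An unescaped period matches any single character.
print(matches(r"1.3", "1X3"))    # True

# Escaped with a backslash, the period matches only a literal period.
print(matches(r"1\.3", "1.3"))   # True
print(matches(r"1\.3", "1X3"))   # False

# re.escape() adds the backslashes when the literal text is known in advance.
literal_price = re.escape("$9.99")
print(matches(literal_price, "$9.99"))  # True
```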

Submatches

Regular expressions can be more versatile than basic literal patterns. A pattern can contain submatches. A submatch is enclosed in parentheses and groups part of the regular expression so that it can be treated as a unit. The "|" character means "or" in regular expressions, so a submatch can list alternatives.
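As an illustration, a submatch with two alternatives might look like the following in Python's re module. This is a sketch only: the pattern "(sunny|rainy) day" and the name weather are invented for the example, and Python stands in for the SmartZone engine.

```python
import re

# The submatch in parentheses offers two alternatives separated by "|".
weather = re.compile(r"(sunny|rainy) day")

for text in ["sunny day", "rainy day", "windy day"]:
    found = weather.fullmatch(text)
    # group(1) holds the text matched by the first submatch.
    print(text, "->", found.group(1) if found else "no match")
```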

Repetition Operators

Patterns are generally of variable length. In such cases, it is useful to have repetition operators.

Pattern Description
* Use the "*" to match the preceding sequence 0 or more times (SmartZone tries to match as many times as possible). 
For example, the pattern "(sunny )*day" matches "day", "sunny day", "sunny sunny day", etc.
+ Use the "+" to match the preceding sequence 1 or more times. It works the same as "*", but at least one instance of the preceding submatch or character must be present.
For example, the pattern "(sunny )+day" matches "sunny day", "sunny sunny day", etc., but not "day" alone.
? Use the "?" to match the preceding sequence 0 or 1 time.
For example, the pattern "(sunny )?day" will match "day" and "sunny day".

Adding a "?" to a repeat operator makes the subexpression minimal or non-greedy. Normally, a repeated expression is greedy, in other words, it matches as many characters as possible. A non-greedy subexpression matches as few characters as possible.

{n} Use "{n}" to match the preceding sequence n times.
For example, the pattern "(sunny ){2}day" only matches "sunny sunny day".
{n,m} Use "{n,m}" to match the preceding sequence between n and m times.
For example, the pattern "(sunny ){2,3}day" matches "sunny sunny day" and "sunny sunny sunny day".
{n,} Use "{n,}" to match the preceding sequence at least n times.
For example, the pattern "(sunny ){2,}day" matches "sunny sunny day" and "sunny sunny sunny sunny day".
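The operators in the table above can be checked with a short script. The sketch below uses Python's re module as a stand-in for the SmartZone engine and assumes whole-text matching. The helper names matches and sunny_counts are hypothetical. For each pattern it lists how many repetitions of "sunny " (from 0 to 5) are accepted, then compares a greedy and a non-greedy repeat.

```python
import re

def matches(pattern: str, text: str) -> bool:
    """Return True when the whole text matches the pattern."""
    return re.fullmatch(pattern, text) is not None

def sunny_counts(pattern: str, limit: int = 5) -> list:
    """List how many repetitions of "sunny " (0..limit) the pattern accepts."""
    return [n for n in range(limit + 1)
            if matches(pattern, "sunny " * n + "day")]

for pattern in [r"(sunny )*day", r"(sunny )+day", r"(sunny )?day",
                r"(sunny ){2}day", r"(sunny ){2,3}day", r"(sunny ){2,}day"]:
    print(pattern, sunny_counts(pattern))

# Greedy versus non-greedy: the same repeat, with and without a trailing "?".
text = "sunny sunny day"
greedy = re.match(r"(sunny )+", text).group(0)   # takes as many as possible
lazy = re.match(r"(sunny )+?", text).group(0)    # takes as few as possible
print(repr(greedy), repr(lazy))
```

The greedy repeat consumes both occurrences of "sunny ", while the non-greedy form stops after the first.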

It is preferable to use the most restrictive pattern possible to improve recognition results.

Character Classes

Character classes are more convenient than submatches for matching one of a range of characters at a single position in a pattern. There are two ways to specify a character class: square brackets or the special backslash options.

Square Bracket Character Classes

Square brackets are the more versatile, but also the more verbose, option. They allow you to specify a range of characters to include or exclude. For example, "[A-Za-z]" matches any single uppercase or lowercase letter in the Latin character set, and "[aeiouy]" matches any single vowel character.
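A minimal sketch of bracket classes, again using Python's re module as a stand-in. The exclusion form "[^0-9]" (a caret as the first character inside the brackets) is Python's syntax and is assumed here for illustration. The variable names are hypothetical.

```python
import re

letter = re.compile(r"[A-Za-z]")    # any single upper or lowercase Latin letter
vowel = re.compile(r"[aeiouy]")     # any single character from the listed set
not_digit = re.compile(r"[^0-9]")   # leading "^" excludes the listed range (Python syntax)

print(bool(letter.fullmatch("q")))   # True
print(bool(letter.fullmatch("7")))   # False
print(vowel.findall("regular"))      # ['e', 'u', 'a']
print(not_digit.findall("A1-B2"))    # ['A', '-', 'B']
```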

Backslash Character Classes

These are the less versatile, but more concise, options. Like the bracket classes, these specify predefined ranges of characters, and each matches one character. Each is written as a single character preceded by the backslash ("\") character. For example, "\d", which is used in the examples in this topic, matches a single digit.
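As an illustration of a backslash class, the sketch below uses "\d", the class that appears in the examples in this topic. It runs on Python's re module, which is only a stand-in for the SmartZone engine. The sample text and variable names are invented.

```python
import re

digit = re.compile(r"\d")            # one digit
four_digits = re.compile(r"\d{4}")   # a backslash class combined with a repetition operator

print(digit.findall("Invoice 4071"))        # ['4', '0', '7', '1']
print(bool(four_digits.fullmatch("4071")))  # True
print(bool(four_digits.fullmatch("40A1")))  # False
```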

More Examples

Pattern Description
(\d{4} ){3}\d{4}

This pattern finds a 16-digit credit card number written as four groups of four digits separated by spaces:

  • 1234 5678 9012 3456
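The credit card pattern can be tried out as follows. This is a sketch using Python's re module rather than SmartZone itself, and it assumes the whole text must match. The name is_card_number is hypothetical.

```python
import re

# Three groups of "four digits plus a space", followed by four more digits.
CARD = re.compile(r"(\d{4} ){3}\d{4}")

def is_card_number(text: str) -> bool:
    """Return True when the whole text has the 4-4-4-4 digit layout."""
    return CARD.fullmatch(text) is not None

print(is_card_number("1234 5678 9012 3456"))  # True
print(is_card_number("1234567890123456"))     # False: the spaces are required
print(is_card_number("1234 5678 9012 345"))   # False: the last group is too short
```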

(\(?\d{3}\)?-?)?\d{3}-?\d{4}

This pattern matches phone numbers written in formats that are standard in the United States (though it sometimes matches numbers in a non-standard format *). Numbers written in the following formats match this pattern:

  • (555)-222-1234
  • (555)222-1234
  • (555)2221234
  • 555-222-1234
  • 222-1234
  • 2221234
  • 5552221234

* Poorly formatted numbers that this pattern also matches include:

  • (555-222-1234
  • 555)-222-1234
  • 555222-1234
  • (555-2221234
  • 555)-2221234
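Both lists above can be verified with a short script. The sketch uses Python's re module as a stand-in for the SmartZone engine and assumes whole-text matching. The list names and the rejected sample "22-1234" are invented for the example.

```python
import re

PHONE = re.compile(r"(\(?\d{3}\)?-?)?\d{3}-?\d{4}")

well_formed = ["(555)-222-1234", "(555)222-1234", "(555)2221234",
               "555-222-1234", "222-1234", "2221234", "5552221234"]

# Each parenthesis and dash is optional on its own, so unbalanced forms pass too.
poorly_formed = ["(555-222-1234", "555)-222-1234", "555222-1234",
                 "(555-2221234", "555)-2221234"]

for number in well_formed + poorly_formed:
    print(number, PHONE.fullmatch(number) is not None)

# A number with too few digits is still rejected.
print("22-1234", PHONE.fullmatch("22-1234") is not None)
```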

\d{3}-?\d{2}-?\d{4}

This pattern matches Social Security numbers, whether or not they are delimited by dashes:

  • 078-05-1120
  • 07805-1120
  • 078-051120
  • 078051120
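A quick check of the four formats above, sketched with Python's re module as a stand-in and assuming whole-text matching. The name is_ssn_format is hypothetical, and the test only checks the layout of the digits, not whether the number is valid.

```python
import re

SSN = re.compile(r"\d{3}-?\d{2}-?\d{4}")

def is_ssn_format(text: str) -> bool:
    """Return True when the text has the 3-2-4 digit layout, dashes optional."""
    return SSN.fullmatch(text) is not None

for sample in ["078-05-1120", "07805-1120", "078-051120", "078051120"]:
    print(sample, is_ssn_format(sample))    # True for all four

print(is_ssn_format("078-05-112"))          # False: the last group needs four digits
```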

\$(\d|,\d)+(\.\d\d)?

This pattern matches monetary amounts, preceded by a “$” sign:

  • $20,000.00
  • $20
  • $20.00
  • $20,0.00 (a more specific pattern would be required to ensure that each comma is followed by exactly 3 digits)
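The note about commas can be made concrete. The sketch below compares the pattern above with one possible stricter variant, which is an assumption for illustration and not a pattern taken from SmartZone. It runs on Python's re module as a stand-in and assumes whole-text matching. The names LOOSE and STRICT are hypothetical.

```python
import re

# The pattern from this topic: any mix of digits and comma-digit pairs.
LOOSE = re.compile(r"\$(\d|,\d)+(\.\d\d)?")

# Hypothetical stricter variant: either 1-3 digits followed by one or more
# ",ddd" groups, or a plain run of digits with no commas at all.
STRICT = re.compile(r"\$(\d{1,3}(,\d{3})+|\d+)(\.\d\d)?")

for amount in ["$20,000.00", "$20", "$20.00", "$20,0.00"]:
    loose_ok = LOOSE.fullmatch(amount) is not None
    strict_ok = STRICT.fullmatch(amount) is not None
    print(amount, loose_ok, strict_ok)
```

Only "$20,0.00" is treated differently: the original pattern accepts it and the stricter variant rejects it.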

 
