You can force a numeric string to contain an exact or a maximum number of digits (see Example 1). If the range of digits in a number can be limited, stating this in the regular expression helps avoid typical recognition errors, e.g., confusion between 3 and 8, or between the different styles of 1 and 7 (see the date examples). Our examples use both the longer and the shorter notation.
Example 1: |
A price |
Task: |
To read prices in dollars with a maximum permitted value of $999.99. The decimal point and a cents value must be present. |
Expression: |
\$[0-9]{0,3}\.[0-9]{2} or \$\d{0,3}\.\d\d |
Backslashes are needed to give $ and . (the dollar sign and the dot) their literal values. In the short notation, \d denotes a predefined set of characters, the digits (equivalent to [0-9]).
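As a rough check of the price expression, the short notation can be tried in Python's `re` module. This is only a sketch: the helper name `is_price` is made up for illustration, and it assumes the expression must cover the whole field, which is why `fullmatch` is used.

```python
import re

# Assumption: the expression must cover the entire field, hence fullmatch().
PRICE = re.compile(r"\$\d{0,3}\.\d\d")


def is_price(text: str) -> bool:
    """Return True if text is a dollar amount of at most $999.99 with cents."""
    return PRICE.fullmatch(text) is not None


# {0,3} also permits no dollar digits at all, so "$.50" is accepted.
for sample in ["$999.99", "$5.00", "$.50", "$1000.00", "$12.5", "$12"]:
    print(f"{sample!r}: {is_price(sample)}")
```

Under these assumptions, values with four dollar digits, a single cents digit, or no decimal point are rejected.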
Example 2: |
U.S. Social Security Number (SSN) |
Task: |
To match an SSN in the usual 123-45-6789 format: |
Expression: |
[0-9]{3}-[0-9]{2}-[0-9]{4} or \d{3}-\d\d-\d{4} |
The dash (-) takes its literal value here, since it is not in a context where it could denote a range.
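The long and short notations are meant to be interchangeable. A small Python sketch can confirm that both SSN expressions agree on a few samples; the sample strings and the `results` dictionary are illustrative, and whole-field matching is assumed.

```python
import re

SSN_LONG = re.compile(r"[0-9]{3}-[0-9]{2}-[0-9]{4}")
SSN_SHORT = re.compile(r"\d{3}-\d\d-\d{4}")

samples = ["123-45-6789", "123456789", "12-345-6789"]

# Map each sample to (long notation matched, short notation matched).
results = {
    s: (SSN_LONG.fullmatch(s) is not None, SSN_SHORT.fullmatch(s) is not None)
    for s in samples
}

for sample, (long_ok, short_ok) in results.items():
    print(f"{sample!r}: long={long_ok} short={short_ok}")
```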
Example 3: |
North American telephone numbers |
Task: |
To match either of the two frequently used forms of North American telephone numbers: (123) 456-7890 and 123-456-7890. |
Expression: |
(\(\d{3}\) ?|\d{3}-)\d{3}-\d{4} |
We needed to use parentheses, because the priority of the (implicit) concatenation operator is higher than that of the OR (|) operator. On the other hand, when we want to match parenthesis characters themselves we need to escape them with the backslash character. We also made the space character between the closing parenthesis and the fourth digit optional using the question mark.
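The effect of the grouping parentheses can be seen by comparing the expression with a variant that omits them. The sketch below uses Python's `re` module and assumes whole-field matching; `UNGROUPED` and `is_phone` are names invented for this illustration.

```python
import re

PHONE = re.compile(r"(\(\d{3}\) ?|\d{3}-)\d{3}-\d{4}")

# Same expression without the grouping parentheses: the | now splits the
# whole pattern into two alternatives, "(ddd) " alone or "ddd-ddd-dddd".
UNGROUPED = re.compile(r"\(\d{3}\) ?|\d{3}-\d{3}-\d{4}")


def is_phone(text: str) -> bool:
    """Return True if text is in one of the two supported phone formats."""
    return PHONE.fullmatch(text) is not None


for sample in ["(123) 456-7890", "(123)456-7890", "123-456-7890", "123 456-7890"]:
    print(f"{sample!r}: {is_phone(sample)}")

# Without grouping, a bare area code is accepted as a complete match.
print(UNGROUPED.fullmatch("(123) ") is not None)
```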
Example 4: |
Integer numbers with optional thousands separators |
Task: |
To match numbers like 123456 or 12,345,678. |
Expression: |
\d{1,3}(,?\d{3})* |
Note that it will not accept a comma that is not a proper thousands separator; e.g., 12,34 is not matched, but 1234 is matched.
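A short Python sketch can illustrate which strings the expression accepts and how a matched value might be turned into an integer. The helper `to_int` is hypothetical, and whole-field matching is assumed.

```python
import re
from typing import Optional

THOUSANDS = re.compile(r"\d{1,3}(,?\d{3})*")


def to_int(text: str) -> Optional[int]:
    """Return the integer value of text, or None if the field does not match."""
    if THOUSANDS.fullmatch(text) is None:
        return None
    return int(text.replace(",", ""))


for sample in ["123456", "12,345,678", "1234", "12,34", "1,234567"]:
    print(f"{sample!r}: {to_int(sample)}")
```

Because each comma is optional on its own, a mixed form such as 1,234567 is also accepted by this expression.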
Example 5: |
U.S. dates (short notation) |
Task: |
To match the usual short ways of expressing a date in U.S. format; i.e., 3/2/11, 3/25/11, 12/31/11, 03/02/11. In addition, the year part can contain the century, as in 1/1/2011. |
Expression: |
(0?[1-9]|1[0-2])/(0?[1-9]|[12]\d|3[01])/(\d\d)?(\d\d) |
The four pairs of parentheses enclose the four parts of a date: month, day, century (can be empty), and year. This is a classic case of range limitation. The expression takes care not to accept months like 0, 00, 16, etc., and it applies a similar check to the day.
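The four groups can be inspected directly in Python. The sketch below writes the expression with no embedded space and assumes whole-field matching; `parse_us_date` is a name chosen for illustration.

```python
import re
from typing import Optional, Tuple

US_DATE = re.compile(r"(0?[1-9]|1[0-2])/(0?[1-9]|[12]\d|3[01])/(\d\d)?(\d\d)")


def parse_us_date(text: str) -> Optional[Tuple]:
    """Return (month, day, century, year) as matched, or None on failure.

    The century is None when the year is written with two digits.
    """
    match = US_DATE.fullmatch(text)
    return match.groups() if match else None


for sample in ["3/2/11", "03/02/11", "12/31/11", "1/1/2011", "13/1/11", "0/1/11"]:
    print(f"{sample!r}: {parse_us_date(sample)}")
```

The expression checks ranges per field only; it does not know the calendar, so a date such as 2/30/11 still matches.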
Example 6: |
International dates (short and long notation) |
Task: |
To read a date in the format DD MM YYYY with only dates 1998-2019 valid, but no certainty on the spacing convention used. |
Expression: |
[\s0-3][0-9][\-/.\s][\s01][0-9][\-/.\s](1998|1999|20[01][0-9]) or [\s0-3]\d[\-/.\s][\s01]\d[\-/.\s](199[89]|20[01]\d) |
i.e.,
Character 1: |
a space or 0, 1, 2, or 3 |
Character 2: |
any digit |
Character 3: |
a hyphen, slash, dot, or space |
Character 4: |
a space or 0 or 1 |
Character 5: |
any digit |
Character 6: |
a hyphen, slash, dot, or space |
Character 7: |
1 or 2 |
Character 8: |
9 or 0 |
Character 9: |
9 or 0 or 1 |
Character 10: |
any digit |
Characters 7 to 10 are treated as a group to avoid years like 2999. Unfortunately, there are many different ways of writing dates, to name but three:
MM-DD-YY (One U.S. style)
YY-MM-DD (One European style)
DD-MM-YYYY
It is possible to write expressions to accept any of these. But then the range of possible characters cannot be narrowed so precisely; e.g., if you have to accommodate both MM-DD-YY and DD-MM-YY, without knowing which convention is being used, a value 17 for the two middle characters would have to be accepted, even though it is acceptable only in the U.S. notation and not the European. For this reason (and also to know whether 5-8-11 is the 5th of August or the 8th of May), form design should try to specify the date system to be used, so as to allow the regular expressions to be as precise as possible.
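To see how precisely the DD MM YYYY expression from Example 6 narrows the input, its short notation can be tried in Python. This is a sketch assuming whole-field matching; `is_intl_date` is an illustrative name.

```python
import re

INTL_DATE = re.compile(r"[\s0-3]\d[\-/.\s][\s01]\d[\-/.\s](199[89]|20[01]\d)")


def is_intl_date(text: str) -> bool:
    """Return True if text fits the DD MM YYYY pattern for 1998-2019."""
    return INTL_DATE.fullmatch(text) is not None


for sample in ["25.12.2011", "05/08/1998", " 5- 8-2011", "25.12.1997",
               "25.12.2020", "25.12.2999", "39.19.2005"]:
    print(f"{sample!r}: {is_intl_date(sample)}")
```

Since the day and month are checked character by character rather than as grouped values, an impossible date such as 39.19.2005 still passes, whereas the grouped year is restricted exactly to 1998-2019.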
A regular expression can also help prevent the even more common misrecognitions between letters and digits; e.g., B and 8, Z and 2, S and 5, 1 and I or l, O and 0 (zero).
Example 7: |
Hexadecimal numbers |
Task: |
To match a 16-bit hexadecimal value; e.g., 12AB, ff55, or 0. Remember that \d means the digits (equivalent to [0-9]), and we use it here to build another set. |
Expression: |
[\da-fA-F]{1,4} |
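The hexadecimal expression can be combined with a conversion step, as in the following Python sketch. The helper `parse_hex16` is hypothetical; `re.ASCII` is added so that `\d` means exactly [0-9] in Python, and whole-field matching is assumed.

```python
import re
from typing import Optional

# re.ASCII restricts \d to 0-9; without it Python also accepts other
# Unicode decimal digits.
HEX16 = re.compile(r"[\da-fA-F]{1,4}", re.ASCII)


def parse_hex16(text: str) -> Optional[int]:
    """Return the value of a 1-4 digit hexadecimal field, or None."""
    if HEX16.fullmatch(text) is None:
        return None
    return int(text, 16)


for sample in ["12AB", "ff55", "0", "12345", "G1"]:
    print(f"{sample!r}: {parse_hex16(sample)}")
```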
Example 8: |
E-mail addresses |
Task: |
To match an e-mail address that ends in one of the common U.S. top-level domains: |
Expression: |
.*@.*\.(com|org|gov|mil|net|edu) |
Note that the dot character has two functions: the first two occurrences use its meta-character meaning of any character, while in the third case we want to use it literally, so we needed to escape it with a backslash.
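Because `.*` matches any run of characters, including an empty one, this expression is quite permissive. The Python sketch below probes its limits with a few made-up sample strings, assuming whole-field matching; `looks_like_email` is an illustrative name.

```python
import re

EMAIL = re.compile(r".*@.*\.(com|org|gov|mil|net|edu)")


def looks_like_email(text: str) -> bool:
    """Return True if text contains @ and ends in one of the listed domains."""
    return EMAIL.fullmatch(text) is not None


for sample in ["jane.doe@example.com", "jane@example.co.uk", "@.com",
               "a b@c d.org", "no-at-sign.com"]:
    print(f"{sample!r}: {looks_like_email(sample)}")
```

The expression only verifies the @ sign and the final domain; strings such as "@.com" pass, so it is best seen as a coarse filter rather than a full address check.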
Example 9: |
Airline Flight Numbers |
Task: |
To read scheduled (non-charter) airline flight numbers. |
Expression: |
[A-Z][A-Z]\s[0-9]{2,3} or [A-Z]{2}\s\d\d\d? |
\u is not used in the short notation, since it might enable accented uppercase letters, which are not used in airline codes. The numeric part is designed to accept two-digit or three-digit numbers.
Since the two-letter airline codes are a finite known set, it would be possible to define each one in a separate expression, so that any illegal letter combination would be refused. If the range of valid flight numbers for each airline were known, it would be possible to flag even non-existent flight numbers, or to use expressions to capture only documents relating to one specified flight or a group of flights.
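Restricting the letter part to known airline codes could look roughly like the Python sketch below. The tuple `AIRLINE_CODES` is a tiny illustrative subset chosen for this example, not a complete or authoritative list, and the helper names are hypothetical; whole-field matching is assumed.

```python
import re

GENERIC = re.compile(r"[A-Z]{2}\s\d\d\d?")

# Illustrative subset only; a real deployment would load the full code list.
AIRLINE_CODES = ("AA", "BA", "LH")

# Build one alternation from the known codes: (?:AA|BA|LH)\s\d{2,3}
RESTRICTED = re.compile(rf"(?:{'|'.join(AIRLINE_CODES)})\s\d{{2,3}}")


def is_flight(text: str) -> bool:
    """Return True if text matches the generic two-letter pattern."""
    return GENERIC.fullmatch(text) is not None


def is_known_flight(text: str) -> bool:
    """Return True only if the airline code is in AIRLINE_CODES."""
    return RESTRICTED.fullmatch(text) is not None


for sample in ["BA 117", "LH 40", "ZZ 117", "BA 1", "BA117"]:
    print(f"{sample!r}: generic={is_flight(sample)} known={is_known_flight(sample)}")
```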
Example 10: |
Canadian Postal Codes |
Description: |
These always conform to a single pattern: Three characters, space, three characters; more precisely: Uppercase Letter, Digit, U_Letter, Space, Digit, U_Letter, Digit. |
Expression: |
[A-Z][0-9][A-Z]\s[0-9][A-Z][0-9] or [A-Z]\d[A-Z]\s\d[A-Z]\d |
Example 11: |
British Postal Codes |
Description: |
These are more complex; only the last three characters conform to a precise pattern. One or two U_Letters, one or two digits, space, one digit, two U_Letters. That is: |
Expression: |
[A-Z]{1,2}[0-9]{1,2}\s[0-9][A-Z]{2} |
A set of more precise expressions could be devised, since the first one or two letters define a geographical region (L = Liverpool, B = Birmingham, CB = Cambridge...). Furthermore, each region has a finite number of numbered districts, represented by the following one or two digits. By creating a regular expression for each regional code, together with each one's maximum number of districts, it would be possible to reject any incoming code whose characters before the space do not form a valid postcode value.
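The per-region idea can be sketched in Python by generating one expression per regional code. The district limits in `MAX_DISTRICT` are placeholder numbers invented for this illustration and are not real postal data; the function names are likewise hypothetical, and whole-field matching is assumed.

```python
import re

GENERAL = re.compile(r"[A-Z]{1,2}[0-9]{1,2}\s[0-9][A-Z]{2}")

# PLACEHOLDER limits for illustration only; not verified postal data.
MAX_DISTRICT = {"B": 50, "CB": 12}


def region_pattern(region: str, max_district: int) -> "re.Pattern[str]":
    """Build an expression accepting districts 1..max_district for a region."""
    districts = "|".join(str(n) for n in range(1, max_district + 1))
    return re.compile(rf"{region}(?:{districts})\s[0-9][A-Z]{{2}}")


PRECISE = [region_pattern(r, m) for r, m in MAX_DISTRICT.items()]


def is_general(code: str) -> bool:
    """Return True if code fits the general postcode shape."""
    return GENERAL.fullmatch(code) is not None


def is_precise(code: str) -> bool:
    """Return True if code fits one of the per-region expressions."""
    return any(p.fullmatch(code) for p in PRECISE)


for sample in ["CB2 1TN", "CB13 1AA", "ZZ99 9ZZ"]:
    print(f"{sample!r}: general={is_general(sample)} precise={is_precise(sample)}")
```

With the placeholder table, codes with an unknown region or an out-of-range district pass the general expression but are rejected by the per-region ones.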
Example 12: |
Dutch Postal Codes |
Description: |
DDDD LL (four digits and two uppercase letters). In international traffic, the land code is needed, but in inland traffic it is often omitted, especially when it doesn't fit neatly with the remaining postcode, as in the Netherlands. So, our task is to read Dutch postcodes, with or without the land code NL-. The space between the digits and the letters may be very small or large, so we allow 0, 1, or 2 space characters. |
Expression: |
(NL-){0,1}[0-9]{4}\s{0,2}[A-Z]{2} or (NL-)?\d{4}\s{0,2}[A-Z]{2} |
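Once a Dutch postcode has been matched, it is often convenient to bring it into one canonical form. The Python sketch below adds two capturing groups to the short-notation expression for that purpose; the groups and the helper `normalize_nl` are additions made for this illustration, and whole-field matching is assumed.

```python
import re
from typing import Optional

# Short-notation expression with extra groups around digits and letters.
NL_POSTCODE = re.compile(r"(NL-)?(\d{4})\s{0,2}([A-Z]{2})")


def normalize_nl(text: str) -> Optional[str]:
    """Return the postcode as 'DDDD LL' without land code, or None."""
    match = NL_POSTCODE.fullmatch(text)
    if match is None:
        return None
    return f"{match.group(2)} {match.group(3)}"


for sample in ["1012 AB", "NL-1012AB", "1012  AB", "1012   AB", "NL 1012 AB"]:
    print(f"{sample!r}: {normalize_nl(sample)}")
```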
Example 13: |
Other European postcodes |
Description: |
All countries have an official ISO-assigned two-letter land code, though some still persist in using a single letter. Leaving aside exceptional cases (e.g., the Netherlands and Poland), Iceland has three-digit codes, many countries have four or five digits (some with spaces, most without), and three countries use six digits. The general expression below allows one- or two-letter land codes (any uppercase letters are permitted, including invalid ones), then a hyphen, then anywhere from three to six digits. |
General Expression: |
[A-Z]{1,2}-[0-9]{3,6} |
This general expression is of limited value. The following more precise expressions allow one- or two-letter land codes, check that only valid land codes are used, and check that the correct number of digits and spaces follows the hyphen: |
Precise Expressions: |
(IS-)[0-9]{3} |
Parentheses are used when several alternatives are concatenated with other characters.
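The difference between the general expression and a precise one can be illustrated in Python using the Icelandic expression given above. This is a sketch assuming whole-field matching; the sample strings are made up.

```python
import re

GENERAL = re.compile(r"[A-Z]{1,2}-[0-9]{3,6}")
ICELAND = re.compile(r"(IS-)[0-9]{3}")

samples = ["IS-101", "IS-1010", "XX-101"]

# Map each sample to (general matched, precise Icelandic matched).
results = {
    s: (GENERAL.fullmatch(s) is not None, ICELAND.fullmatch(s) is not None)
    for s in samples
}

for sample, (general_ok, precise_ok) in results.items():
    print(f"{sample!r}: general={general_ok} precise={precise_ok}")
```

The general expression accepts an invalid land code and a wrong digit count, while the precise expression rejects both.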