The checking subsystem also makes use of a User dictionary. A User dictionary is a collection of user-specific elements called UDitems. A UDitem is one of two types: a literal string (usually a word, as in any word processor's user dictionary) or a regular expression. A regular expression defines a pattern, range, or class of characters, either singly or as a group. A string being checked is accepted if it conforms to at least one item in the specified section of the User dictionary. When an item is a regular expression, the strings that a recognition module passes for UD checking are tested against the pattern it defines.
In the following example, regular expressions are used to check whether the recognized strings comply with postal or ZIP code formats used mostly in Europe or in the US.
```c
AT_ERRCOUNT nErrCount;
LPCSTR sect1 = "ZIP_Section";
LPCWSTR item_literal = L"Accusoft";
// US postal ZIP code: 12345 or 12345-6789 (ZIP+4)
LPCWSTR US_postal_zip = L"\\d{5}(-\\d{4})?";
// European postal code: D-12345 or H-1234
LPCWSTR European_postal_zip = L"[A-Z]-\\d{4,5}";

// Open user dictionary
nErrCount = IG_REC_UD_edit_open();
// 0 as a third parameter denotes a literal string
// and 1 denotes a regular expression
nErrCount = IG_REC_UD_item_add(sect1, item_literal, 0);
nErrCount = IG_REC_UD_item_add(sect1, US_postal_zip, 1);
nErrCount = IG_REC_UD_item_add(sect1, European_postal_zip, 1);
// Close user dictionary
nErrCount = IG_REC_UD_edit_close();
```
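The acceptance rule (a string passes if it conforms to at least one item in the section) can be illustrated outside the toolkit. The following is a minimal sketch, not the toolkit's implementation: `ud_item`, `matches_whole`, `ud_accepts`, and `zip_section` are hypothetical names, literal items are assumed to be compared exactly, and regular expressions are assumed to match the whole string. It uses POSIX `regex.h`, whose extended syntax has no `\d`, so the patterns use `[0-9]`; the US pattern is written for the ZIP+4 form `12345-6789`.

```c
#include <regex.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for a UDitem: a literal string or a regular expression. */
typedef struct {
    const char *text;
    bool is_regex;
} ud_item;

/* Returns true when `s` matches `pattern` in full (POSIX extended syntax). */
bool matches_whole(const char *pattern, const char *s)
{
    char anchored[256];
    regex_t re;
    int n = snprintf(anchored, sizeof anchored, "^(%s)$", pattern);
    if (n < 0 || (size_t)n >= sizeof anchored)
        return false;   /* pattern too long for this sketch */
    if (regcomp(&re, anchored, REG_EXTENDED | REG_NOSUB) != 0)
        return false;   /* invalid pattern: treated as "no match" */
    bool ok = regexec(&re, s, 0, NULL, 0) == 0;
    regfree(&re);
    return ok;
}

/* Assumed rule: a string is accepted if it conforms to at least one item. */
bool ud_accepts(const ud_item *items, size_t count, const char *s)
{
    for (size_t i = 0; i < count; i++) {
        bool hit = items[i].is_regex ? matches_whole(items[i].text, s)
                                     : strcmp(items[i].text, s) == 0;
        if (hit)
            return true;
    }
    return false;
}

/* One literal item and two postal-code patterns, as in a ZIP section. */
const ud_item zip_section[] = {
    { "Accusoft",             false },
    { "[0-9]{5}(-[0-9]{4})?", true  },  /* US: 12345 or 12345-6789 */
    { "[A-Z]-[0-9]{4,5}",     true  },  /* European: D-12345 or H-1234 */
};
```

With these items, `ud_accepts(zip_section, 3, "H-1234")` succeeds, while a four-digit string such as `"1234"` is rejected because it conforms to none of the three items.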
Within the User Dictionary, the UDitems can be organized under different sections. Zones are always associated with a section of the User dictionary when they are created.
UD checking is worth performing in several situations.
If the application uses spell checking and consistently encounters correctly spelled words marked as uncertain, or the document is known to contain many proper nouns, the application can reduce unwanted marking and improve recognition accuracy by performing UD checking to supplement the spell checking. This assumes that the User dictionary has been prepared beforehand by adding the required words to it. In this case, UD checking is complementary to spell checking.
UD checking without spell checking enabled is typically used in form-like applications where the data to be recognized is highly structured and follows predictable patterns (e.g., questionnaires).
Specifying the User dictionary file is a page-level setting: once specified, it applies to all zones on the page. However, because the User dictionary may have several sections, each of which can be assigned to different zones, different sets of dictionary items can be used for different zones. UD checking can be disabled for particular zones with the IG_REC_ZCF_USERDICT_PROHIBIT flag.
In addition to enabling the Checking Subsystem (as explained in Improve Accuracy with Checking), you also need to specifically enable User Dictionary checking.
The prerequisites for checking with a User dictionary are as follows:
The checking subsystem can handle two kinds of User dictionaries: native dictionary files (created or updated by a previous IG_REC_UD_save() call) and word-list files. The preparation of a native User dictionary file is described in the next topic. A word-list file is a text file that contains one word per line.
Before recognition, the IG_REC_UD_set() function must be called with the name of the User dictionary file and a section name. This section name defines the default section in the User dictionary. The items under this section are used by any AT_REC_ZONE in the zone list whose UserDictionarySection field contains an empty string. (Note that the auto-zoning function always creates zones with an empty string in this field.) Other sections in the User dictionary can be referred to by name: for each zone that needs one, the UserDictionarySection field must contain the section name. This way, different zones on a page can be UD-checked with different sections. A zone is subject to UD checking only if the IG_REC_ZCF_USERDICT_PROHIBIT flag of its Checking field is off.
```c
AT_ERRCOUNT nErrCount;
nErrCount = IG_REC_spelling_is_enabled_set(TRUE);
nErrCount = IG_REC_correction_is_enabled_set(TRUE);
nErrCount = IG_REC_UD_set("MYWORDS.DCT", "DEFSECT");
```
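The way a zone is paired with a dictionary section, as described above, can be sketched as a small lookup. This is an illustration under stated assumptions rather than toolkit code: `sketch_zone` is a hypothetical stand-in for AT_REC_ZONE that keeps only the two relevant fields, `SKETCH_ZCF_USERDICT_PROHIBIT` is a placeholder bit value (the real constant comes from the ImageGear headers), and `effective_ud_section` is an invented helper name.

```c
#include <stdbool.h>
#include <stddef.h>

/* Placeholder value for illustration; not the real IG_REC_ZCF_USERDICT_PROHIBIT. */
#define SKETCH_ZCF_USERDICT_PROHIBIT 0x1u

/* Hypothetical stand-in for the two AT_REC_ZONE fields discussed above. */
typedef struct {
    const char *user_dictionary_section; /* "" means: use the default section */
    unsigned checking;                   /* bit flags */
} sketch_zone;

/* Returns the section a zone would be checked against,
   or NULL when UD checking is prohibited for that zone. */
const char *effective_ud_section(const sketch_zone *zone,
                                 const char *default_section)
{
    if (zone->checking & SKETCH_ZCF_USERDICT_PROHIBIT)
        return NULL;
    if (zone->user_dictionary_section == NULL ||
        zone->user_dictionary_section[0] == '\0')
        return default_section;
    return zone->user_dictionary_section;
}
```

A zone produced by auto-zoning (empty section name, flag off) would resolve to the default section passed to IG_REC_UD_set(), here "DEFSECT"; a zone that names "ZIP_Section" would resolve to that section instead.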
After recognition, the Checking field of a zone might be updated with one of the flags IG_REC_ZCF_LANGDICT_USED, IG_REC_ZCF_USERDICT_USED, IG_REC_ZCF_CHECKCBF_USED, or IG_REC_ZCF_VERTDICT_USED.
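Inspecting these flags is an ordinary bit test on the Checking field. The sketch below assumes the flags are independent bits; the `SKETCH_*` values are placeholders, not the real constants, and `describe_checking` and the source descriptions are hypothetical. In application code, the IG_REC_ZCF_* constants from the ImageGear headers would be tested against the zone's Checking field instead.

```c
#include <stddef.h>
#include <stdio.h>

/* Placeholder bit values for illustration only. */
enum {
    SKETCH_LANGDICT_USED = 1 << 0,
    SKETCH_USERDICT_USED = 1 << 1,
    SKETCH_CHECKCBF_USED = 1 << 2,
    SKETCH_VERTDICT_USED = 1 << 3
};

/* Writes a comma-separated list of the checking sources recorded in
   `checking` into `buf` and returns how many were written. */
int describe_checking(unsigned checking, char *buf, size_t size)
{
    static const struct { unsigned bit; const char *name; } sources[] = {
        { SKETCH_LANGDICT_USED, "language dictionary" },
        { SKETCH_USERDICT_USED, "user dictionary" },
        { SKETCH_CHECKCBF_USED, "checking callback" },
        { SKETCH_VERTDICT_USED, "vertical dictionary" },
    };
    int count = 0;
    size_t used = 0;

    if (size > 0)
        buf[0] = '\0';
    for (size_t i = 0; i < sizeof sources / sizeof sources[0]; i++) {
        if (!(checking & sources[i].bit))
            continue;
        int n = snprintf(buf + used, size - used, "%s%s",
                         count ? ", " : "", sources[i].name);
        if (n < 0 || (size_t)n >= size - used) {
            if (used < size)
                buf[used] = '\0';   /* drop the partially written entry */
            break;                  /* buffer full: stop appending */
        }
        used += (size_t)n;
        count++;
    }
    return count;
}
```

For a zone whose Checking field carries only the user-dictionary bit, the helper fills the buffer with "user dictionary" and returns 1.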