Improve Accuracy with Checking
The checking subsystem is mainly used to achieve better accuracy in the recognized output text and/or to make proofing more effective. Its services are closely integrated into the recognition process; they are activated internally either during recognition (by some recognition modules) or immediately after it. For this reason, all checking-related settings have to be set BEFORE calling the Recognize Method.
This section provides information about the checking subsystem, which consists of three independent parts: spell checking, User dictionary checking (UD-checking), and user-written checking.
Asian Recognition Module: The checking subsystem is not available. This means spell checking, UD-checking, and user-written checking cannot be used when the Asian Recognition Module is active. See the Asian Recognition Module topic for more details.
Any combination of these parts is possible during the recognition process as steps in determining the acceptability of words. Note that the use of these checking types can still be enabled or disabled separately at zone level.
Spell checking uses third-party language-specific spell checkers. There are two kinds of spell checkers: those based on Language dictionaries and those based on Vertical dictionaries.
The recognition engine is delivered with seventeen different Language dictionaries (English, German, Dutch, Polish, Russian, etc.). These are generic language dictionaries, which contain between 100,000 and 200,000 entries.
Vertical dictionaries apply to special professions. In the toolkit, they can be treated as extensions to the Language dictionaries, though they can also be used when no Language dictionary is specified. The recognition engine is delivered with eight different Vertical dictionaries for two professions (medical and legal) in four languages (English, German, French, and Dutch).
Specifying the Spelling language (Language dictionary) and/or a Vertical dictionary is a page-level setting. Once a Language dictionary has been specified, the language checking will be applied to all zones on the page, unless their LANGDICT_PROHIBIT flag is set.
Similarly, once a Vertical dictionary has been specified, the vertical language checking will be applied to all zones on the page, unless their VERTDICT_PROHIBIT flag is set.
To check if a specific language has spell checking available, see ImGearRecLanguage Enumeration.
The most important property in controlling the checking subsystem is the ImGearRecRecognitionSettings.SpellingEnabled Property. This property is used to enable or disable the running of the checking subsystem.
If it is enabled (the default), the correction of non-compliant words can in turn be enabled or disabled with the CorrectionEnabled Property, used together with the SpellingLanguage Property. If correction is disabled, non-compliant and suspicious words are flagged, but no auto-correction is done. If correction is enabled, words left unchanged are still flagged, while some characters may be corrected and are marked as changed. Correction is enabled by default, so the application needs to set this property only when it wants to explicitly disable word correction.
The remaining checking-related properties are specific to the particular checking system.
The spelling languages available in the current recognition engine configuration can be accessed via the ImGearRecRecognitionSettings.SpellingLanguages Property. The application can specify a language setting as the spelling language, valid for the next page or pages with the SpellingLanguage Property.
Additionally, vertical language checking can also be activated when one of the Vertical dictionaries (e.g., LEGAL_GER.DMD) has been set through the VerticalDictionary Property.
When the specified spelling language is set to AUTO (default), the language checking is performed based on the recognition language selection (LanguageEnabled Property). The following example instead selects English explicitly as the spelling language:
C#

    igRecognition.Recognition.SpellingEnabled = true;
    igRecognition.Recognition.CorrectionEnabled = true;
    igRecognition.Recognition.SpellingLanguage = ImGearRecLanguage.ENG;

VB.NET

    igRecognition.Recognition.SpellingEnabled = True
    igRecognition.Recognition.CorrectionEnabled = True
    igRecognition.Recognition.SpellingLanguage = ImGearRecLanguage.ENG

When the spelling language is set to NO, or when it specifies a language for which there is no language dictionary, the language checking will not be activated for the zones of the page.
C#

    igRecognition.Recognition.LanguageEnabled[ImGearRecLanguage.DUT] = true;
    igRecognition.Recognition.SpellingLanguage = ImGearRecLanguage.DUT;
    igRecognition.Recognition.VerticalDictionary = "Dutch Legal Dictionary";

VB.NET

    igRecognition.Recognition.LanguageEnabled(ImGearRecLanguage.DUT) = True
    igRecognition.Recognition.SpellingLanguage = ImGearRecLanguage.DUT
    igRecognition.Recognition.VerticalDictionary = "Dutch Legal Dictionary"

The checking subsystem also makes use of a User dictionary. A User dictionary is a collection of user-specific elements, the so-called UDitems. UDitems can be of two types: literal strings (usually words, as in the case of any word processor's user dictionary) or regular expressions. A string being checked will be accepted if it conforms to at least one item in the specified section of the User dictionary. A regular expression defines a pattern, range, or class of characters, either singly or as a group. When an item is a regular expression, it means that during the UD-checking, strings passed for checking by a recognition module will be checked to see whether they conform to the pattern defined by the regular expression.
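The acceptance rule described above (a string passes if it conforms to at least one item of the section) can be illustrated with a small stand-alone sketch. It does not use the toolkit: `UdItemSketch` and `UdCheckSketch` are hypothetical names, .NET regular expressions stand in for the engine's own regular-expression dialect (which may differ), and requiring the whole string to match is an assumption.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical stand-in for a UD item: a text plus a flag marking it as a regular expression.
sealed class UdItemSketch
{
    public string Text { get; }
    public bool IsRegex { get; }

    public UdItemSketch(string text, bool isRegex)
    {
        Text = text;
        IsRegex = isRegex;
    }
}

static class UdCheckSketch
{
    // Returns true if the candidate conforms to at least one item of the section.
    public static bool Conforms(string candidate, IEnumerable<UdItemSketch> section)
    {
        foreach (UdItemSketch item in section)
        {
            bool ok = item.IsRegex
                // Assumption: the pattern must cover the whole string, so it is anchored.
                ? Regex.IsMatch(candidate, @"\A(?:" + item.Text + @")\z")
                // Assumption: literal items are compared exactly, including case.
                : string.Equals(candidate, item.Text, StringComparison.Ordinal);
            if (ok)
            {
                return true;
            }
        }
        return false;
    }

    static void Main()
    {
        var section = new List<UdItemSketch>
        {
            new UdItemSketch("Accusoft", false),
            new UdItemSketch(@"[A-Z]-\d{4,5}", true),
        };

        Console.WriteLine(Conforms("Accusoft", section)); // matched by the literal item
        Console.WriteLine(Conforms("H-1234", section));   // matched by the regular-expression item
        Console.WriteLine(Conforms("H-12", section));     // matched by no item
    }
}
```

How the engine treats case, partial matches, and near misses is not specified here, so the sketch should be read only as a model of the "at least one item" rule.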
The prerequisites for checking with a User dictionary are described below.
The checking subsystem can handle two kinds of User dictionaries: native dictionary files (created or updated by a previous ImGearRecUserDictionary.Save Method call) and word-list files. The preparation of a native User dictionary file is described in the next topic. A word-list file is a text file that contains words, one per line.
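As a minimal sketch of the word-list layout, the following stand-alone program writes such a file with the standard .NET `File.WriteAllLines` call. The file name, the words, and the `WordListSketch` class are hypothetical, and the text encoding the engine expects is not stated in this guide; `File.WriteAllLines` writes UTF-8 without a byte order mark by default.

```csharp
using System.IO;

static class WordListSketch
{
    // Writes the words to a text file, one word per line.
    public static void Write(string path, string[] words)
    {
        File.WriteAllLines(path, words);
    }

    static void Main()
    {
        // Hypothetical file name and entries, chosen for illustration only.
        Write("MYWORDS.TXT", new[] { "Accusoft", "ImageGear", "deskew" });
    }
}
```

The resulting file name would then be passed to the ImGearRecUserDictionary.Load Method together with a section name.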
Before recognition, the ImGearRecUserDictionary.Load Method must be called with the name of the User dictionary file and also with a section name. This section name defines the default section in the User dictionary. The items under this section will be "used" by any zone in the zone list whose UserDictionarySection Property contains an empty string. (Note that the auto-zoning feature always creates zones with an empty string in this property.) Other sections in the User dictionary can be referred to by name. In this case, for those zones where it is needed, the UserDictionarySection Property must contain the section name. This way different zones on a page can be UD-checked with different sections. A zone will be subject to UD-checking only if the USERDICT_PROHIBIT flag of its Checking Property is off.
C#

    igRecognition.Recognition.SpellingEnabled = true;
    igRecognition.Recognition.CorrectionEnabled = true;
    igRecognition.Recognition.UserDictionary.Load("MYWORDS.DCT", "DEFSECT");

VB.NET

    igRecognition.Recognition.SpellingEnabled = True
    igRecognition.Recognition.CorrectionEnabled = True
    igRecognition.Recognition.UserDictionary.Load("MYWORDS.DCT", "DEFSECT")

After recognition, the Checking Property of the zones might be updated by one of the flags: LANGDICT_USED, USERDICT_USED, CHECKCBF_USED, or VERTDICT_USED.
In the following example, regular expressions will be applied to check whether the recognized strings comply with post or zip code formats used mostly in Europe or in the US.
C#

    string sect1 = "ZIP_Section";
    string item_literal = "Accusoft";
    // US postal ZIP code: 12345 or 12345-6789 (ZIP+4 has a four-digit suffix)
    string US_postal_zip = "\\d{5}(-\\d{4})?";
    // European postal code: D-12345 or H-1234
    string European_postal_zip = "[A-Z]-\\d{4,5}";
    // This assumes the UD is already open for maintenance
    igRecognition.Recognition.UserDictionary.Create();
    ImGearRecUserDictionary igRecUserDictionary = igRecognition.Recognition.UserDictionary;
    igRecUserDictionary.AddItem(new ImGearRecUDItem(sect1, item_literal));
    igRecUserDictionary.AddItem(new ImGearRecUDItem(sect1, US_postal_zip, true));
    igRecUserDictionary.AddItem(new ImGearRecUDItem(sect1, European_postal_zip, true));

VB.NET

    Dim sect1 As String = "ZIP_Section"
    Dim item_literal As String = "Accusoft"
    ' US postal ZIP code: 12345 or 12345-6789 (ZIP+4 has a four-digit suffix)
    Dim US_postal_zip As String = "\d{5}(-\d{4})?"
    ' European postal code: D-12345 or H-1234
    Dim European_postal_zip As String = "[A-Z]-\d{4,5}"
    ' This assumes the UD is already open for maintenance
    igRecognition.Recognition.UserDictionary.Create()
    Dim igRecUserDictionary As ImGearRecUserDictionary = igRecognition.Recognition.UserDictionary
    igRecUserDictionary.AddItem(New ImGearRecUDItem(sect1, item_literal))
    igRecUserDictionary.AddItem(New ImGearRecUDItem(sect1, US_postal_zip, True))
    igRecUserDictionary.AddItem(New ImGearRecUDItem(sect1, European_postal_zip, True))

Within the User dictionary, the UDitems can be organized under different sections. Zones are always associated with a section of the User dictionary when they are created.
UD-checking is worthwhile in several situations.
If the application uses spell checking, and it consistently encounters words marked as uncertain that are spelled correctly, or it is known that the document contains many proper nouns, the application can reduce unwanted marking and improve recognition accuracy by performing UD-checking, to supplement the spell checking (assuming that the User dictionary has been prepared previously by adding the required words to it). In this case the UD-checking is complementary to the spell checking.
UD-checking without spell checking enabled is typically used in form-like applications (e.g., questionnaires), i.e., where the data to be recognized is highly structured and follows predictable patterns.
Specifying the User dictionary file itself is a page-level setting. Once it is specified, it will be applied to all zones on the page. However, since the User dictionary may have several sections, each to be assigned to the different zones, different sets of dictionary items can be used for the different zones. For particular zones the UD-checking can be disabled with the USERDICT_PROHIBIT flag.
User-written checking is performed through the ImGearRecZone.CheckWord Event by the integrating application. When creating or updating a zone, the ImGearRecCheckWordEventArgs.CheckWord Property of the zone may be used to register an event listener for the CheckWord Event. During recognition, the CheckWord Event is fired, and the user-written event handler is passed the string to be checked and the index of the zone (from which the string to be checked is derived). The user-written event handler should evaluate the string, knowing the pattern of permissible zone content. Its opinion on the recognized string's acceptability should be expressed with one of the five values from the ImGearRecCheckWordOpinion Enumeration.
See the C# OCR sample, ImageOCRtoFile, for a demonstration of how to handle the CheckWord Event.