OCR Performance Considerations
Several factors affect recognition accuracy. Changing them typically also affects processing speed.
Image Quality
Image quality is one of the most important factors influencing accuracy.
- A resolution of 300 dpi or 400 dpi is best for recognition.
- Use image processing to enhance the quality of the given image to get more accurate auto-zoning and recognition.
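As a rough illustration of the resolution guideline, the following Python sketch computes the scale factor and pixel dimensions needed to bring a scan to 300 dpi. The function name `target_size_for_dpi` is hypothetical and is not part of the ImageGear API. Resampling a low-resolution scan does not restore lost detail, so rescanning at the recommended resolution is preferable where possible.

```python
def target_size_for_dpi(width_px, height_px, source_dpi, target_dpi=300):
    """Return (scale, new_width, new_height) needed to resample to target_dpi."""
    if source_dpi <= 0 or target_dpi <= 0:
        raise ValueError("dpi values must be positive")
    scale = target_dpi / source_dpi
    return scale, round(width_px * scale), round(height_px * scale)


# A US Letter page (8.5 x 11 in) scanned at 150 dpi is 1275 x 1650 pixels.
scale, new_w, new_h = target_size_for_dpi(1275, 1650, source_dpi=150)
print(scale, new_w, new_h)  # 2.0 2550 3300
```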
Zone Locating
If both processing speed and accuracy are important, consider how the recognition zones are located and defined. Use ImGearOCR to work with zones.
Recognition zones may be defined in two ways:
The Checking Subsystem
This consists of any combination of the following checking tools:
- Spell checking (language-based checking)
- Character set
- User dictionary
The Checking subsystem:
- uses only the Character Set of the specified language when finding characters. Characters from other languages' Character Sets are not considered. This increases both the accuracy and the performance of recognition.
- applies dictionaries to the recognized words. Only dictionaries that correspond to the given language are applied to the recognized text, which improves its accuracy.
Spell Checking
Language selection is another way to control recognition accuracy and performance. Setting the wrong language(s) or language dictionary, or leaving unneeded ones enabled, is likely to slow down recognition and reduce accuracy considerably.
Use the ImGearOCRSettings.LanguageEnabled property to control the set of languages used in the recognition process. Only the dictionaries that correspond to enabled languages are used.
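The effect of leaving an unneeded language enabled can be pictured with a toy model. The Python sketch below is not the ImageGear engine or its API. The dictionaries, the function name, and the sample word are all made up for illustration. It shows a misread word passing the spell check only because a dictionary for an irrelevant language is enabled.

```python
def languages_accepting(word, dictionaries, enabled_languages):
    """Return the enabled languages whose dictionary contains the word."""
    key = word.lower()
    return [lang for lang in enabled_languages
            if key in dictionaries.get(lang, set())]


# Tiny, made-up dictionaries for illustration only.
dictionaries = {
    "English": {"hand", "page", "text"},
    "German": {"hund", "seite", "text"},
}

# Suppose "Hand" on an English page was misread as "Hund".
misread = "Hund"

print(languages_accepting(misread, dictionaries, ["English", "German"]))  # ['German']
print(languages_accepting(misread, dictionaries, ["English"]))            # []
```

With only English enabled, the misread word is flagged as unknown and can be corrected. With German also enabled, it is accepted as valid, and an extra dictionary has to be searched.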
Character Set
The character set determines, at the engine level, which characters are considered valid. Eliminating characters that are known not to appear on the page can improve accuracy and performance.
Use the ImGearOCRSettings.UserCharacterSet property to set the character set used for the page being recognized.
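The idea behind restricting the character set can be sketched conceptually as follows. This Python example is a simplified model, not the engine's actual algorithm. The candidate scores and the `best_candidate` helper are hypothetical. It shows how excluding characters that cannot occur resolves an ambiguity between visually similar shapes.

```python
def best_candidate(candidates, allowed_chars):
    """Pick the highest-scoring candidate whose character is allowed.

    candidates: list of (character, confidence) pairs for one glyph.
    Returns None if no candidate is in allowed_chars.
    """
    allowed = [(ch, conf) for ch, conf in candidates if ch in allowed_chars]
    if not allowed:
        return None
    return max(allowed, key=lambda pair: pair[1])[0]


# Hypothetical scores for one glyph in a numeric-only form field.
glyph = [("O", 0.48), ("0", 0.45), ("Q", 0.07)]

print(best_candidate(glyph, set("OQ0")))         # O
print(best_candidate(glyph, set("0123456789")))  # 0
```

When letters are allowed, the letter "O" narrowly wins. When the character set is limited to digits, the correct "0" is selected and fewer candidates need to be evaluated.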
User Dictionary
If the page contains some specific words or phrases, the performance and accuracy of the recognition process may decrease. To avoid this, a user dictionary may be provided for the recognition process. The user dictionary is a file that contains a set of lines. Each line of the file represents a word or phrase that will be checked for inclusion in the recognized page text.
The user dictionary is exposed in the product as the ImGearOCRDictionary class. The dictionary may be loaded from a file or created programmatically. To attach the user dictionary to the recognition process, use the ImGearOCRSettings.UserDictionary property.
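The line-per-entry layout described above can be produced with a few lines of code. The following Python sketch writes such a file. The helper name, the file name, the UTF-8 encoding, and the removal of blank and duplicate entries are assumptions made for illustration. Check the product documentation for the exact file format that ImGearOCRDictionary expects.

```python
import tempfile
from pathlib import Path


def write_user_dictionary(path, entries):
    """Write one word or phrase per line, skipping blanks and duplicates."""
    seen = set()
    lines = []
    for entry in entries:
        text = entry.strip()
        if text and text not in seen:
            seen.add(text)
            lines.append(text)
    # UTF-8 is an assumption; the required encoding is not stated in this topic.
    Path(path).write_text("".join(line + "\n" for line in lines), encoding="utf-8")
    return lines


# Example: domain-specific terms and a phrase, written to a temporary file.
with tempfile.TemporaryDirectory() as tmp:
    dict_path = Path(tmp) / "user_dictionary.txt"
    written = write_user_dictionary(
        dict_path, ["ImGearOCR", "auto-zoning", "", "recognition zone", "ImGearOCR"]
    )
    print(written)  # ['ImGearOCR', 'auto-zoning', 'recognition zone']
```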