ImageGear .NET - Updated
Confidence Reporting

For some applications, it may be important to know the reliability of the recognized text. These applications may require having additional confidence information for the recognized characters and/or words.

Confidence information can be retrieved directly into application memory by calling the GetLetters Method just after the Recognize Method call. The GetLetters Method provides the most detailed information about the recognized data: it returns an ImGearRecLetter Class object for each recognized character.

The ImGearRecLetter Class provides character recognition confidence information via its ConfidenceInfo Property. The ConfidenceInfo Property is a combined value. Its most significant bit expresses the certainty of the word: the word is uncertain if this bit is set to 1. The remaining bits express the certainty of the character recognition as a value between 0 and 100, where lower values are better: a value of zero (0) means that the recognition engine recognized the character with high confidence. In some cases, some or all characters of a word may be individually suspicious even though the word bit does not mark the word as uncertain. This is usually a result of language or user dictionary checking: it means the word was validated by the checking subsystem.
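
The bit layout described above can be sketched as follows. This is a language-neutral illustration written in Python, not ImageGear code. It assumes the combined value is 8 bits wide (so the word flag is bit 7), which this guide does not state, and the names WORD_UNCERTAIN_BIT, CHAR_VALUE_MASK, and decode_confidence_info are hypothetical.

```python
# Assumption: the combined value is modeled as 8 bits wide.
# The actual width of ConfidenceInfo is not stated in this guide.
WORD_UNCERTAIN_BIT = 0x80  # most significant bit of an 8-bit value
CHAR_VALUE_MASK = 0x7F     # the remaining seven bits


def decode_confidence_info(confidence_info: int) -> tuple[bool, int]:
    """Split a combined value into (word_uncertain, character_value)."""
    if not 0 <= confidence_info <= 0xFF:
        raise ValueError("expected an 8-bit value")
    word_uncertain = bool(confidence_info & WORD_UNCERTAIN_BIT)
    char_value = confidence_info & CHAR_VALUE_MASK
    return word_uncertain, char_value
```

Under that assumption, a combined value of 0x80 | 70 decodes to an uncertain word with a character value of 70, while a combined value of 10 decodes to a certain word with a character value of 10.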

Applications that examine the character confidence information can use a threshold value, at or above which the character is treated as a suspicious result. A value of 64 is recommended for this purpose. A value less than 64 indicates that the character was recognized with high confidence. A value of 64 or greater marks the character as suspicious. This value (64) is also used internally in the same manner when the output-marking feature for suspicious characters in the output text is enabled.
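
The threshold test can be sketched in a few lines. This Python snippet is an illustration only: it operates on plain integers that stand for the character portion of the confidence information, and the names SUSPICIOUS_THRESHOLD, is_suspicious, and suspicious_positions are hypothetical rather than part of ImageGear.

```python
SUSPICIOUS_THRESHOLD = 64  # the value recommended in this guide


def is_suspicious(char_value: int, threshold: int = SUSPICIOUS_THRESHOLD) -> bool:
    """A character value at or above the threshold is treated as suspicious."""
    return char_value >= threshold


def suspicious_positions(char_values, threshold: int = SUSPICIOUS_THRESHOLD):
    """Return the indexes of the characters whose value reaches the threshold."""
    return [i for i, value in enumerate(char_values) if is_suspicious(value, threshold)]
```

An application that needs stricter or looser checking could pass a different threshold, at the cost of no longer matching the value used internally for output marking.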

The ImGearRecLetter Class also has two convenience properties, the Confidence Property and the WordUncertain Property, which decode the contents of the ConfidenceInfo Property. The Confidence Property ranges between 0 and 100, and indicates confidence rather than error: 0 indicates low confidence, and 100 indicates high confidence. If the WordUncertain Property is true, the word containing the letter is uncertain.

The confidence reporting system works best when all three recognition modules are used in the voting scheme (OMNIFONT_PLUS3W). If other machine print recognition modules are used (OMNIFONT_PLUS2W, OMNIFONT_MTX, etc.), confidence information is still available, but the system reports it less reliably: more suspicious results go unflagged (false negatives), and more correct results are flagged as suspicious (false positives). See the ImGearRecRecognitionModule enumeration for information on recognition modules.

Alternatives

The ImGearRecLetter Class provides two other properties, AlternativeCharacters and WordSuggestions, which report possible alternative values for a character or word. These values, used in conjunction with the confidence values, may be helpful in a user-checking scenario when low confidence results are detected.

Alternative character values, found in the AlternativeCharacters Property, are a list of additional character options for the current letter result. Alternative character values are commonly available for characters that are frequently misrecognized during the OCR process. For example, ‘i’ and ‘l’ are common alternatives for each other. Characters recognized with high confidence will typically not have any alternative character values listed.

Alternative word suggestions, found in the WordSuggestions Property, are a list of additional word options for the current word result. Word suggestions are only available from the first letter of a word. Consecutive words can have the same suggestions, which is the case when a suggestion combines two or more space-separated words into a single word without spaces. For example, the consecutive words “Image” and “Gear” might both have word suggestions of “ImageGear” and “Image Gear” if the space between the words is not large enough to confidently determine the word breaks. Words recognized with high confidence will typically not have any word suggestion values listed.
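
A user-checking pass that combines confidence values with alternatives might look like the following sketch. It is written in Python against a hypothetical Letter record whose fields only mirror the properties described above. It is not the ImageGear class, and the names Letter, char_value, and letters_to_review are assumptions made for illustration. A letter is queued for review when its character value reaches the recommended threshold of 64, or when its word is uncertain and the letter carries word suggestions (which, as noted above, happens only on the first letter of a word).

```python
from dataclasses import dataclass, field

SUSPICIOUS_THRESHOLD = 64  # the value recommended in this guide


@dataclass
class Letter:
    """Hypothetical stand-in for one recognized character (not the ImageGear class)."""
    char: str
    char_value: int                 # 0 = high confidence; 64 or more = suspicious
    word_uncertain: bool = False
    alternative_characters: list = field(default_factory=list)
    word_suggestions: list = field(default_factory=list)


def letters_to_review(letters):
    """Return (index, char, alternatives, suggestions) for letters a user should check."""
    queue = []
    for index, letter in enumerate(letters):
        char_suspicious = letter.char_value >= SUSPICIOUS_THRESHOLD
        word_flagged = letter.word_uncertain and bool(letter.word_suggestions)
        if char_suspicious or word_flagged:
            queue.append((index, letter.char,
                          letter.alternative_characters, letter.word_suggestions))
    return queue
```

Each queue entry gives a review interface the position to highlight and the candidate replacements to offer. Letters recognized with high confidence in words that are not flagged never reach the user.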