Module name: | MOR |
Module identifier: | OMNIFONT_MOR |
Filling methods supported: | OMNIFONT, DRAFTDOT24, OCRA, OCRB |
Filters supported: | all filter elements |
Trade-off supported: | FAST, BALANCED, ACCURATE |
Knowledge base files: | RECOGN.BCT and RECOGN24.BCT |
The PLUS2W and PLUS3W recognition modules also require the presence of this module.
Application Areas
This module recognizes machine-printed text, that is, text from printed publications, laser or ink-jet printers and electric typewriters. Output from mechanical typewriters in good condition may also be acceptable. The module can also be used for letter-quality or near-letter-quality (LQ, NLQ) output from dot-matrix printers. For draft-quality 24-pin dot-matrix documents, use the DRAFTDOT24 filling method. NLQ or LQ output is usually recognized better without DRAFTDOT24.
The maximum number of zones defined on an image that this module can handle is 500.
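The limits stated so far (the four supported filling methods and the 500-zone ceiling) lend themselves to a pre-flight check before a page is handed to the module. The following is a minimal sketch only: `MOR_FILLING_METHODS`, `MAX_ZONES` and `check_zones` are hypothetical names, not part of the Engine's API, and each zone is modelled simply as the name of its filling method.

```python
# Hypothetical pre-flight check; none of these names come from the Engine's API.
MOR_FILLING_METHODS = {"OMNIFONT", "DRAFTDOT24", "OCRA", "OCRB"}
MAX_ZONES = 500  # maximum number of zones per image this module can handle


def check_zones(zone_filling_methods):
    """Return a list of problems for a list of per-zone filling method names."""
    problems = []
    if len(zone_filling_methods) > MAX_ZONES:
        problems.append(
            f"{len(zone_filling_methods)} zones defined, limit is {MAX_ZONES}"
        )
    for index, method in enumerate(zone_filling_methods):
        if method not in MOR_FILLING_METHODS:
            problems.append(f"zone {index}: {method} is not handled by this module")
    return problems
```

An empty result means the zone list stays within the documented limits; how the Engine itself reacts when a limit is exceeded is not described here.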
Range of Characters
This module can recognize about 500 characters, termed the Engine’s Total Character Set. It includes the letters of the Latin, Greek and Cyrillic alphabets, with enough accented letters to recognize the 119 Languages supported by the Engine.
The set is classified as follows:
Character category | Non-accented | Accented |
Latin alphabet upper case letters | 26 | 89 |
Latin alphabet lower case letters | 26 | 91 |
Digits | 10 | |
Punctuation | 29 | |
Miscellaneous (math symbols, etc.) | 55 | |
Cyrillic upper case letters | 33 | 14 |
Cyrillic lower case letters | 33 | 14 |
Greek upper case letters | 24 | 9 |
Greek lower case letters | 25 | 11 |
OCR (OCR-A) characters | 3 | |
The characters are listed in category and alphanumeric order, together with their Code Page values, in Characters and Code Pages. These are the character categories used by the filter elements.
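As a quick cross-check of the "about 500 characters" figure, the counts in the classification table can be added up. The sketch below only restates the table; the dictionary name and the short category labels are illustrative, and empty "Accented" cells are treated as zero.

```python
# Counts copied from the classification table: (non-accented, accented).
# Empty "Accented" cells in the table are treated as 0.
CHARACTER_COUNTS = {
    "Latin upper case": (26, 89),
    "Latin lower case": (26, 91),
    "Digits": (10, 0),
    "Punctuation": (29, 0),
    "Miscellaneous": (55, 0),
    "Cyrillic upper case": (33, 14),
    "Cyrillic lower case": (33, 14),
    "Greek upper case": (24, 9),
    "Greek lower case": (25, 11),
    "OCR (OCR-A)": (3, 0),
}

non_accented = sum(plain for plain, _ in CHARACTER_COUNTS.values())
accented = sum(acc for _, acc in CHARACTER_COUNTS.values())
total = non_accented + accented

print(non_accented, accented, total)  # prints: 264 228 492
```

The listed categories sum to 492, which is consistent with the stated figure of about 500.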
Character Attributes
The omnifont recognition module can detect and transmit character attributes: bold, italic or underlined text (or any combination of them). It can also detect and transmit character size, and can classify font types into three broad categories: serif, sans serif and monospaced.
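One way a client application might model these attributes is sketched below. All type and field names (`Style`, `FontClass`, `CharAttributes`) are hypothetical and do not reflect the Engine's actual output structures; the unit of the size value is likewise an assumption, since it is not stated here.

```python
from dataclasses import dataclass
from enum import Enum, Flag, auto


class Style(Flag):
    """Bold, italic and underlined can be combined freely."""
    NONE = 0
    BOLD = auto()
    ITALIC = auto()
    UNDERLINED = auto()


class FontClass(Enum):
    """The three broad font categories the module distinguishes."""
    SERIF = "serif"
    SANS_SERIF = "sans serif"
    MONOSPACED = "monospaced"


@dataclass(frozen=True)
class CharAttributes:
    style: Style
    size: float  # unit is an assumption of this sketch (e.g. points)
    font_class: FontClass


# A bold italic serif character.
attrs = CharAttributes(Style.BOLD | Style.ITALIC, 10.0, FontClass.SERIF)
is_bold = Style.BOLD in attrs.style
is_underlined = Style.UNDERLINED in attrs.style
```

A flag type fits here because the three style attributes may occur in any combination, whereas the font category is a single choice.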
Speed/Accuracy Choices
The multi-lingual omnifont recognition module relies primarily on contour analysis, and can supplement it with a form of pattern matching that does not require large pre-stored shape libraries.
This module interprets all three page-level recognition trade-off settings: ACCURATE, BALANCED and FAST.
The module is tightly integrated with the checking module, giving a total of five speed/accuracy choices.
- Level 1: FAST without checking.
Fastest. The module reads text once and uses feature extraction only. Even this setting can give excellent accuracy on high-quality documents. Also recommended when accuracy is not critical (e.g. when OCR only has to support fuzzy keyword searching in a document retrieval system) or for high-volume work where processing speed matters most.
- Level 2: FAST with checking.
The recognition module reads text only once, with feature extraction, but sends words containing suspect or reject characters to a checker, together with its first and second guesses for unsure characters. The checker tries to find solutions based only on those characters. It also tries to repair other typical OCR faults (e.g. di9its embedded in words) and flags all non-dictionary words it was unable to solve. Recommended, for example, when a Language dictionary is available and the texts are mono-lingual and likely to contain normal language (if not, a User dictionary could be employed).
- Level 3: BALANCED without checking.
Two-pass recognition. During the first pass, with feature extraction, the program builds up a library of sample characters and ligatured character pairs from the page whose recognition was very sure. During the second reading pass it stops on all reject and unsure characters, consults its library and uses pattern matching to try to find solutions. For this reason the second pass is not very useful for pages with very little text, because the library is too small. Recommended for multi-lingual documents or when a checker is not available.
- Level 4: BALANCED with checking.
Two-pass recognition. Reading combines the two processes used in levels 2 and 3. It is more accurate, but processing takes more time.
- Level 5: ACCURATE with checking compulsory.
Most accurate but slowest. Designed for use on very degraded mono-lingual documents or when maximum accuracy is very important. It involves two-pass recognition with Adaptive Cell Analysis (ACA). ACA is used to obtain a bigger library for the pattern matching: uniformly highly degraded documents typically cannot yield enough surely recognized characters to form a useful library. With ACA recognition, characters with somewhat lower certainty are accepted, provided they fall within words accepted by the checking module. This allows the pattern matching to work more successfully.
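The five levels above are determined by two inputs: the page-level trade-off setting and whether the checking module is used. A minimal sketch of that mapping follows. The names `LEVELS` and `speed_accuracy_level` are hypothetical and not part of the Engine's API, and raising an error for ACCURATE without checking is a choice made for this sketch; the text above only states that checking is compulsory at that level.

```python
# Hypothetical lookup from (trade-off setting, checking enabled) to the
# five speed/accuracy levels described above.
LEVELS = {
    ("FAST", False): 1,      # one pass, feature extraction only
    ("FAST", True): 2,       # one pass, suspect words sent to the checker
    ("BALANCED", False): 3,  # two passes, pattern matching against page library
    ("BALANCED", True): 4,   # two passes plus checking
    ("ACCURATE", True): 5,   # two passes with Adaptive Cell Analysis
}


def speed_accuracy_level(tradeoff, checking):
    """Return the level (1-5) for a trade-off setting and checking flag."""
    if tradeoff == "ACCURATE" and not checking:
        # Checking is compulsory for ACCURATE; rejecting the combination
        # is this sketch's own behaviour, not documented Engine behaviour.
        raise ValueError("ACCURATE requires the checking module")
    try:
        return LEVELS[(tradeoff, checking)]
    except KeyError:
        raise ValueError(f"unknown trade-off setting: {tradeoff!r}") from None
```

For example, `speed_accuracy_level("BALANCED", True)` returns 4, the combination of the processes used in levels 2 and 3.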