MOR Multi-Lingual Omnifont Recognition Module

Module name:	MOR
Module identifier:	OMNIFONT_MOR
Filling methods supported:	OMNIFONT, DRAFTDOT24, OCRA, OCRB
Filters supported:	all filter elements
Trade-off supported:	FAST, BALANCED, ACCURATE
Knowledge base files:	RECOGN.BCT and RECOGN24.BCT

The PLUS2W and PLUS3W recognition modules also require the presence of this module.

Application Areas

This module recognizes machine-printed text; that is, text from printed publications, laser or ink-jet printers and electric typewriters. Output from mechanical typewriters in good condition may also be acceptable. The module can also be used for letter quality or near letter quality (LQ, NLQ) output from dot-matrix printers. For draft-quality 24-pin dot-matrix documents, use the DRAFTDOT24 filling method. NLQ or LQ output is usually recognized better without DRAFTDOT24.

This module can handle at most 500 zones defined on an image.
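As a rough illustration of the two rules above, the following Python sketch picks a filling method from the print quality and checks a zone count against the limit. The function names and the quality label "draft_24pin" are hypothetical and are not part of the Engine's API; only the names OMNIFONT and DRAFTDOT24 and the limit of 500 come from this page. The sketch covers machine-printed text only and ignores the OCRA and OCRB filling methods.

```python
MAX_ZONES = 500  # zone limit stated on this page for the MOR module


def choose_filling_method(print_quality: str) -> str:
    """Return a filling-method name for a hypothetical print-quality label."""
    # Draft-quality 24-pin dot-matrix output is the one case for DRAFTDOT24.
    if print_quality == "draft_24pin":
        return "DRAFTDOT24"
    # Everything else here, including NLQ/LQ dot-matrix output, uses OMNIFONT.
    return "OMNIFONT"


def check_zone_count(zone_count: int) -> None:
    """Raise ValueError if an image defines more zones than the module handles."""
    if zone_count > MAX_ZONES:
        raise ValueError(
            f"{zone_count} zones defined, but at most {MAX_ZONES} are supported"
        )
```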

Range of Characters

This module can recognize about 500 characters, termed the Engine’s Total Character Set. It includes the letters of the Latin, Greek and Cyrillic alphabets, with enough accented letters to recognize the 119 Languages supported by the Engine.

The set is classified as follows:

	Non-accented	Accented
Latin alphabet upper case letters	26	89
Latin alphabet lower case letters	26	91
Digits	10
Punctuation	29
Miscellaneous (math symbols, etc.)	55
Cyrillic upper case letters	33	14
Cyrillic lower case letters	33	14
Greek upper case letters	24	9
Greek lower case letters	25	11
OCR (OCR-A) characters	3

The characters are listed in category and alphanumeric order, together with their Code Page values, in Characters and Code Pages. These are the character categories used by the filter elements.
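The figure of "about 500" can be checked against the table with a short tally. The Python sketch below copies the counts from the table; it assumes that rows showing a single figure (digits, punctuation, miscellaneous, OCR) have no accented variants. The variable and function names are illustrative only.

```python
# (category, non-accented, accented); counts copied from the table above.
# Rows with a single figure in the table are assumed to have 0 accented forms.
CHARACTER_SET = [
    ("Latin upper case", 26, 89),
    ("Latin lower case", 26, 91),
    ("Digits", 10, 0),
    ("Punctuation", 29, 0),
    ("Miscellaneous", 55, 0),
    ("Cyrillic upper case", 33, 14),
    ("Cyrillic lower case", 33, 14),
    ("Greek upper case", 24, 9),
    ("Greek lower case", 25, 11),
    ("OCR (OCR-A)", 3, 0),
]


def total_characters(rows):
    """Sum non-accented and accented counts over all categories."""
    return sum(plain + accented for _, plain, accented in rows)


print(total_characters(CHARACTER_SET))  # 492
```

The tally gives 492 characters, which is consistent with the rounded figure quoted above.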

Character Attributes

The omnifont recognition module can detect and transmit character attributes: bold, italic or underlined text (or any combination of them). It can also detect and transmit character size, and can classify font types into three broad categories: serif, sans serif and monospaced.

Speed/Accuracy Choices

The multi-lingual omnifont recognition module relies primarily on contour analysis, but can supplement this with an innovative form of pattern matching that does not require enormous pre-stored shape libraries.

This module interprets all three page-level recognition trade-off settings: ACCURATE, BALANCED and FAST.

The module is tightly integrated with the checking module, giving a total of five speed/accuracy choices.
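The count of five follows from combining the three trade-off settings with checking switched off or on, and then dropping ACCURATE without checking, since checking is compulsory at that setting (see Level 5 below). The Python sketch below simply enumerates the combinations; the function name and the tuple representation are illustrative and are not part of the Engine's API.

```python
from itertools import product

TRADE_OFFS = ["FAST", "BALANCED", "ACCURATE"]


def speed_accuracy_choices():
    """List the valid (trade-off, checking) pairs in level order."""
    choices = []
    for trade_off, checking in product(TRADE_OFFS, [False, True]):
        # ACCURATE always runs with the checker, so it has no unchecked variant.
        if trade_off == "ACCURATE" and not checking:
            continue
        choices.append((trade_off, checking))
    return choices


for level, (trade_off, checking) in enumerate(speed_accuracy_choices(), start=1):
    suffix = "with checking" if checking else "without checking"
    print(f"Level {level}: {trade_off} {suffix}")
```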

Level 1: FAST without checking.
Fastest. The module reads text once and uses feature extraction only. Even this setting can give excellent accuracy on high-quality documents. Also recommended when accuracy is less critical (e.g. when OCR only has to support fuzzy keyword searching in a document retrieval system) or for high-volume work where processing speed matters most.
Level 2: FAST with checking.
The recognition module reads text only once, with feature extraction, but sends words containing suspect or reject characters to a checker, together with its first and second guesses for unsure characters. The checker tries to find solutions based only on those characters. It also tries to repair other typical OCR faults (e.g. di9its embedded in words) and flags all non-dictionary words it was unable to solve. Recommended, for example, when a Language dictionary is available and the texts are mono-lingual and likely to contain ordinary language (if not, a User dictionary could be employed).
Level 3: BALANCED without checking.
Two-pass recognition. During the first pass, with feature extraction, the program builds up a library of sample characters and ligatured character pairs from the page whose recognition was very sure. During the second reading pass it stops on all reject and unsure characters, consults its library and uses pattern matching to try to find solutions. For this reason the second pass is of little use on pages with very little text: the library is too small. Recommended for multi-lingual documents or when a checker is not available.
Level 4: BALANCED with checking.
Two-pass recognition. Reading combines the two processes used in levels 2 and 3. More accurate, but processing takes more time.
Level 5: ACCURATE with checking compulsory.
Most accurate but slowest. Designed for use on very degraded mono-lingual documents or when maximum accuracy is very important. It involves two-pass recognition with Adaptive Cell Analysis (ACA). ACA is used to obtain a larger library for the pattern matching, because uniformly and heavily degraded documents typically cannot yield enough confidently recognized characters to form a useful library. With ACA recognition, characters with somewhat lower certainty are accepted, provided they fall within words accepted by the checking module. This allows the pattern matching to work more successfully.
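The two-pass idea used in levels 3 to 5 can be illustrated with a toy Python sketch. This is not the Engine's algorithm: the glyph representation (a tuple of bits), the confidence threshold, the Hamming-distance matcher and all names are assumptions made for illustration. The first pass is taken as given (each glyph already carries a guess and a confidence), and ACA and the checker are not modeled.

```python
SURE = 0.9  # hypothetical confidence threshold for "very sure" characters


def hamming(a, b):
    """Count positions at which two equal-length bit tuples differ."""
    return sum(x != y for x, y in zip(a, b))


def two_pass(glyphs, max_distance=1):
    """glyphs: list of (shape, guess, confidence) from a first, feature-based pass."""
    # Pass 1: collect confidently recognized shapes as a page-specific library.
    library = [(shape, guess) for shape, guess, conf in glyphs if conf >= SURE]
    result = []
    for shape, guess, conf in glyphs:
        # Sure characters are kept; with an empty library nothing can be matched.
        if conf >= SURE or not library:
            result.append(guess)
            continue
        # Pass 2: match the unsure shape against the closest library sample.
        best_shape, best_label = min(library, key=lambda e: hamming(shape, e[0]))
        close_enough = hamming(shape, best_shape) <= max_distance
        result.append(best_label if close_enough else guess)
    return "".join(result)


page = [
    ((1, 1, 0, 1), "o", 0.95),
    ((0, 1, 1, 0), "l", 0.97),
    ((1, 1, 0, 0), "c", 0.40),  # unsure: one bit away from the sure "o"
]
print(two_pass(page))  # olo
```

When the page yields no sure characters, the library is empty and the first-pass guesses are returned unchanged, which mirrors the remark under Level 3 that the second pass helps little on pages with very little text.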