Recognition Modules

The ImageGear Recognition API can load a number of recognition modules: which ones are present depends on its configuration. You can program which one should run in a given zone with the zone's ImGearRecZone.RecognitionModule Property. This allows your application to perform "multi-module" recognition on any image.

The ImGearRecRecognitionModule Enumeration lists the possible recognition modules:

PLUS3W omnifont module for machine-printed text (default)
PLUS2W omnifont module for machine-printed text
MTX omnifont module for machine-printed text
MOR multi-lingual omnifont module for machine-printed text
FRX multi-lingual omnifont module for machine-printed text
DOT module for 9-pin draft dot-matrix printouts
ASIAN module for Asian language text

Corresponding to the above list, the RecognitionModule Property of any zone should contain one of the following recognition module IDs: OMNIFONT_PLUS3W, OMNIFONT_PLUS2W, OMNIFONT_MTX, OMNIFONT_MOR, OMNIFONT_FRX, DOT, or one of the two special ones: NO_OCR and AUTO. The NO_OCR value means that no recognition will be done in the zone concerned.

AUTO means that the recognition engine will choose the module most likely to be appropriate. It does this first of all by consulting the filling method set for the zone.

The filling method describes the type of data expected in the zone, e.g., dot-matrix printed text or machine-generated text. A degree of auto-detection is available for the filling method through the ImGearRecPage.DetectFillingMethod Method, which is useful when the precise filling method used on incoming documents is not known in advance. It is the programmer's responsibility to specify a valid recognition module-filling method pair. Any incorrectly set zone will have no recognition result. Some filling methods can be linked successfully with one and only one recognition module. However, some recognition modules support more than one filling method, and some filling methods are accepted by more than one recognition module. For example, with mid-quality 9-pin dot-matrix text, either DOT or one of the omnifont modules could give better results. With mid-quality 24-pin dot-matrix text, try the recognition module (RM) OMNIFONT_MOR with each of the filling methods (FM) DRAFTDOT24 and OMNIFONT.

You can find detailed information on this and the different recognition modules in this topic.

RM AUTO reads the filling method; if only one recognition module is suitable, it is used. When there is a choice, RM AUTO uses various checks (character set, image size, etc.) to select the best one. Thus, it protects against an invalid FM-RM pair.

If the recognition module is not present when it is needed, the Recognize Method throws an ImGearException with error code and value ImGearRecErrorCodes.REC_ENGINE_FAILURE / API_MODULEMISSING_ERR, and there will be no recognized data for the zone concerned. To avoid this risk, we recommend checking the presence and correct installation of the necessary recognition modules by examining the modules collection using the ImGearRecognition.Modules Property just after the Recognition API's initialization.

Filling Method - Recognition Module Combinations
Recognition Modules and the Widest Available Character Set
Recognition Modules and Filters
Recognition Modules and Trade-off Settings
Recognition Modules and the Checking Subsystem
DOT 9-Pin Draft Dot-Matrix Recognition Module
FRX Multi-Lingual Omnifont Recognition Module
MOR Multi-Lingual Omnifont Recognition Module
MTX Omnifont Recognition Module
PLUS2W and PLUS3W Omnifont Recognition Modules
Languages and Modules
Asian Recognition Module

Filling Method - Recognition Module Combinations

When a zone recognition module setting is ImGearRecRecognitionModule Enumeration.AUTO, set either by default or explicitly, the Engine takes care of recognition module selection for any filling method. When setting specific values for filling methods and recognition modules, it is the programmer’s responsibility to specify a valid recognition module-filling method pair. Any incorrectly set zone will have no recognition result. The following table shows which modules are considered by the automatic recognition module selection, called up by the ImGearRecRecognitionModule Enumeration.AUTO value. The order of the recognition modules in the second column shows the priority order for the automatic recognition module selection.

Filling Method	Permissible Recognition Module
OMNIFONT	OMNIFONT_PLUS2W, OMNIFONT_PLUS3W, OMNIFONT_MOR, OMNIFONT_FRX, OMNIFONT_MTX
DRAFTDOT9	DOT, OMNIFONT_MTX
DRAFTDOT24	OMNIFONT_MOR, OMNIFONT_MTX
OCRA	OMNIFONT_MOR, OMNIFONT_MTX
OCRB	OMNIFONT_MOR, OMNIFONT_MTX
FM_NO_OCR	-
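
The table above can be encoded as a small lookup, so that an application checks a filling method/recognition module pair before assigning it to a zone. The following is a minimal stand-alone Python sketch of that idea and does not call the ImageGear API: the names PERMISSIBLE_MODULES, auto_candidates and is_valid_pair are hypothetical, and the strings simply mirror the identifiers in the table. The special NO_OCR module value is not modeled.

```python
# Stand-alone model of the table above; it does not call the ImageGear API.
# The order inside each list is the AUTO priority order given in the table.
PERMISSIBLE_MODULES = {
    "OMNIFONT": ["OMNIFONT_PLUS2W", "OMNIFONT_PLUS3W", "OMNIFONT_MOR",
                 "OMNIFONT_FRX", "OMNIFONT_MTX"],
    "DRAFTDOT9": ["DOT", "OMNIFONT_MTX"],
    "DRAFTDOT24": ["OMNIFONT_MOR", "OMNIFONT_MTX"],
    "OCRA": ["OMNIFONT_MOR", "OMNIFONT_MTX"],
    "OCRB": ["OMNIFONT_MOR", "OMNIFONT_MTX"],
    "FM_NO_OCR": [],
}


def auto_candidates(filling_method):
    """Return the modules AUTO would consider, highest priority first."""
    return list(PERMISSIBLE_MODULES[filling_method])


def is_valid_pair(filling_method, module):
    """Return True if the table lists this module for the filling method."""
    if module == "AUTO":
        # AUTO leaves the choice to the engine for any known filling method.
        return filling_method in PERMISSIBLE_MODULES
    return module in PERMISSIBLE_MODULES.get(filling_method, [])
```

For example, is_valid_pair("DRAFTDOT9", "OMNIFONT_MOR") is False, which corresponds to a zone that would produce no recognition result.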

Recognition Modules and the Widest Available Character Set

The correct assignment of a recognition module and a filling method to a zone should mean that the recognition module is able to satisfactorily process the contents of that zone. But it does not guarantee that the recognition module will be able to process every possible character. The characters supported by the Engine are listed in Characters and Code Pages. Most recognition modules recognize only a subset of these. Even if the character set is restricted to a limited Language Environment, e.g., by selecting the German language, the recognition module may not be able to process all the enabled characters. Automatic recognition module selection takes the Character Set support of the modules into consideration. When selecting a recognition module directly, it is the programmer’s responsibility to choose one capable of supporting the widest character set enabled in the zone. Otherwise, the zone may have an incomplete recognition result. The precise character and language support for each module is given in the appropriate recognition module specifications.

Recognition Modules and Filters

Narrowing the Character Set has two effects:

It influences the automatic recognition module selection
It may increase accuracy

The filtering system allows the Language environment to be narrowed, by enabling only certain character classes, and also by enabling individual characters. A filter is built up from filter elements, as detailed under ImGearRecFilter.

Each filter element name tells which character class is enabled, e.g., ALL means no filtering. Not all recognition modules interpret all filter elements. Precise information appears in the sub-heading for each module.

Applying a filter may not always enable the same number of characters. E.g., MISCELLANEOUS can enable only those miscellaneous characters supported by the recognition module assigned to the zone.

Recognition Modules and Trade-off Settings

Three accuracy/speed trade-off settings can be specified at page or document level: ACCURATE, BALANCED, and FAST. Five recognition modules can interpret these. Precise information appears in the sub-heading for each module.

Recognition Modules and the Checking Subsystem

The checking module has two basic services. It can flag unacceptable recognition results without changing them or it can be permitted to modify recognition results using checking module feedback. The available acceptance rules can come from the following:

Language and/or Vertical dictionaries
User dictionaries containing precise entries and/or required patterns
User-written checking functions

These three sources may be combined freely. The checking module and each of its three parts can be enabled or disabled on a per-zone basis. The integrator should try to match the particular parts of the checking module to the contents and recognition modules of individual zones.

DOT 9-Pin Draft Dot-Matrix Recognition Module

Module name:	DOT
Module identifier:	DOT
Filling methods supported:	DRAFTDOT9
Filters supported:	all filter elements
Trade-off supported:	none
Knowledge base file:	`TUDAS.FA`

Application Areas

This module is designed for ONLY draft-quality 9-pin dot-matrix texts. For NLQ or LQ texts, the OMNIFONT_PLUS2W, OMNIFONT_PLUS3W, OMNIFONT_MTX, OMNIFONT_FRX or OMNIFONT_MOR modules are likely to give better results.

Range of Characters

This module supports 76 languages, of which 14 have dictionary support: Catalan, Danish, Dutch, English, Finnish, French, German, Greek, Hungarian, Italian, Norwegian, Portuguese, Spanish and Swedish. It can read multiple languages. It can read 18 of the 29 punctuation characters. (The Low Double Comma Quotation Mark is missing). It supports 24 of the 55 miscellaneous characters. (Some missing ones are: the Euro Sign, the Small Script F, the Copyright Sign, Registered Trade Mark Sign and the Degree Sign.)

The following table lists all accented letters this module can recognize. The range of uppercase ones supported by this module is limited, since in 9-pin dot-matrix text, there is difficulty in printing these letters. Those with lowercase support only are listed separately:

Upper & Lowercase	Lowercase Only
Cap.&Small A Acute (A’)	Small A Circumflex (a^)
Cap.&Small AE (Ae)	Small A Macron (a-)
Cap.&Small A Ring (Ao)	Small A Grave (a`)
Cap.&Small A Umlaut (A:)	Small E Umlaut (e:)
Cap.&Small A Tilde (A~)	Small E Circumflex (e^)
Cap.&Small C Cedilla (C,)	Small E Grave (e`)
Cap.&Small E Acute (E’)	Small I Umlaut (i:)
Cap.&Small I Acute (I’)	Small I Circumflex (i^)
Cap.&Small N Tilde (N~)	Small I Grave (i`)
Cap.&Small O Double Acute (O")	Small O Circumflex (o^)
Cap.&Small O Acute (O’)	Small O Macron (o-)
Cap.&Small O Umlaut (O:)	Small O Grave (o`)
Cap.&Small O Tilde (O~)	Small S Hacek (sv)
Cap.&Small O Slash (O/)	Small U Circumflex (u^)
Cap.&Small OE (Oe)	Small U Grave (u`)
Cap.&Small U Double Acute (U")
Cap.&Small U Acute (U’)
Cap.&Small U Umlaut (U:)

Accuracy Issues

This module does not interpret the recognition trade-off setting.

Character Attributes

Since this module recognizes draft dot-matrix texts, character attributes are not applicable. Expanded characters are not recognized. Condensed printout can be recognized, but the accuracy is liable to be low.

Conditions

This module is used if it is directly specified in a zone structure.

If DRAFTDOT9 filling method is set together with AUTO, OMNIFONT_MTX is used, provided that all characters (or languages or filters) validated for the zone are supported by it. If any are not supported, this module is used.

It can generate confidence data on recognized characters and can interpret all filter values.

FRX Multi-Lingual Omnifont Recognition Module

Module name:	FRX
Module identifier:	OMNIFONT_FRX
Filling methods supported:	OMNIFONT
Filters supported:	all filter elements
Trade-off supported:	none
Knowledge base files:	none

The OMNIFONT_PLUS2W and OMNIFONT_PLUS3W recognition modules require the presence of this module.

Its associated files are:

baltic.shp	Frx shape pack (code page) file.
cyrillic.shp	Frx shape pack (code page) file.
greek.shp	Frx shape pack (code page) file.
latin1.shp	Frx shape pack (code page) file.
latin2.shp	Frx shape pack (code page) file.
turkish.shp	Frx shape pack (code page) file.
charsettable.chr
asciieng.lng	Frx language dictionary. Used in case of multi-language selection.
czech.lng	Frx language dictionary data file.
danish.lng	Frx language dictionary data file.
dutch.lng	Frx language dictionary data file.
english.lng	Frx language dictionary data file.
finnish.lng	Frx language dictionary data file.
french.lng	Frx language dictionary data file.
german.lng	Frx language dictionary data file.
greek.lng	Frx language dictionary data file.
hungar.lng	Frx language dictionary data file.
italian.lng	Frx language dictionary data file.
norsk.lng	Frx language dictionary data file.
polish.lng	Frx language dictionary data file.
port.lng	Frx language dictionary data file.
russian.lng	Frx language dictionary data file.
spanish.lng	Frx language dictionary data file.
swedish.lng	Frx language dictionary data file.
turkish.lng	Frx language dictionary data file.

Application Areas

This module recognizes machine printed text; i.e. from printed publications, laser or ink-jet printers and electric typewriters. Output from mechanical typewriters in good condition may also be acceptable. It should also be used for letter or near letter quality (NLQ, LQ) output from dot-matrix printers.

Range of Characters

This module supports the recognition of the Latin, Greek and Cyrillic alphabets, with enough accented letters to recognize 54 languages.

The characters are listed in category and alphanumeric order, together with their Code Page values, in Characters and Code Pages.

Multi-Lingual Language Support

The language support of this module is based on the module's internal code pages, which contain characters from a related group of languages. The internal code pages of this module are American/European (Latin 1, 1252), Baltic (1257), Central-European (Latin 2, 1250), Cyrillic (1251), Greek (1253) and Turkish (1254).

The module supports multi-language selection for recognition, but only for language combinations within the same code page; languages from different language groups may not be recognized properly. For example, it properly processes the English, German and Italian language combination, since all these languages belong to the Latin 1 (1252) code page. However, when both French and Czech are specified, OMNIFONT_FRX may fail to recognize some accented characters of the Czech alphabet, since these languages are not in the same code page. The following table lists the languages supported by FRX, grouped by code page.

Code Page	Languages
Latin 2 (1250)	Polish, Czech, Hungarian, Romanian, Albanian, Croatian, Wend (Sorbian), Slovak, Slovenian
Cyrillic (1251)	Russian, Ukrainian, Byelorussian, Bulgarian, Macedonian, Serbian
Latin 1 (1252)	English, German, French, Spanish, Italian, Dutch, Swedish, Norwegian, Finnish, Danish, Portuguese, Portuguese (Brazilian), Catalan, Afrikaans, Aymara, Basque, Breton, Faroese, Friulian, Gaelic, Galician, Eskimo, Icelandic, Indonesian, Latin, Malaysian, Pidgin English, Swahili, Tahitian, Welsh, Frisian, Zulu
Greek (1253)	Greek
Turkish (1254)	Turkish, Kurdish (written in Latin alphabet)
Baltic (1257)	Estonian, Hawaiian, Latvian, Lithuanian
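
Because FRX handles only language combinations that fall within one code page, an application may want to check a planned language selection against the table above before choosing this module. The following is a minimal stand-alone Python sketch, not ImageGear API code: FRX_CODE_PAGES, code_pages_for and share_one_code_page are hypothetical names, and the language names are plain strings copied from the table (Kurdish is listed without its parenthetical note).

```python
# Stand-alone model of the code page table above; not ImageGear API code.
FRX_CODE_PAGES = {
    1250: {"Polish", "Czech", "Hungarian", "Romanian", "Albanian",
           "Croatian", "Wend (Sorbian)", "Slovak", "Slovenian"},
    1251: {"Russian", "Ukrainian", "Byelorussian", "Bulgarian",
           "Macedonian", "Serbian"},
    1252: {"English", "German", "French", "Spanish", "Italian", "Dutch",
           "Swedish", "Norwegian", "Finnish", "Danish", "Portuguese",
           "Portuguese (Brazilian)", "Catalan", "Afrikaans", "Aymara",
           "Basque", "Breton", "Faroese", "Friulian", "Gaelic", "Galician",
           "Eskimo", "Icelandic", "Indonesian", "Latin", "Malaysian",
           "Pidgin English", "Swahili", "Tahitian", "Welsh", "Frisian",
           "Zulu"},
    1253: {"Greek"},
    1254: {"Turkish", "Kurdish"},
    1257: {"Estonian", "Hawaiian", "Latvian", "Lithuanian"},
}


def code_pages_for(languages):
    """Return the set of FRX code pages needed for the given languages."""
    pages = set()
    for language in languages:
        matches = [page for page, names in FRX_CODE_PAGES.items()
                   if language in names]
        if not matches:
            raise ValueError(f"{language!r} is not in the FRX table")
        pages.add(matches[0])
    return pages


def share_one_code_page(languages):
    """Return True if all languages belong to a single FRX code page."""
    return len(code_pages_for(languages)) == 1
```

With this model, share_one_code_page(["English", "German", "Italian"]) is True, while share_one_code_page(["French", "Czech"]) is False, matching the two examples in the text.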

Character Attributes

The omnifont recognition module can detect and transmit character attributes: bold, italic or underlined text (or any combination of them). It can also detect and transmit character size, and can classify font types into three broad categories: serif, sans serif and monospaced.

MOR Multi-Lingual Omnifont Recognition Module

Module name:	MOR
Module identifier:	OMNIFONT_MOR
Filling methods supported:	OMNIFONT, DRAFTDOT24, OCRA, OCRB
Filters supported:	all filter elements
Trade-off supported:	FAST, BALANCED, ACCURATE
Knowledge base files:	RECOGN.BCT and RECOGN24.BCT

The PLUS2W and PLUS3W recognition modules also require the presence of this module.

Application Areas

This module recognizes machine printed text; i.e., from printed publications, laser or ink-jet printers and electric typewriters. Output from mechanical typewriters in good condition may also be acceptable. It could also be used for letter or near letter quality (NLQ, LQ) output from dot-matrix printers. For Draft quality 24-pin dot-matrix documents use the DRAFTDOT24 filling method. NLQ or LQ quality output can usually be better recognized without using DRAFTDOT24.

This module can handle a maximum of 500 zones defined on an image.

Range of Characters

This module can recognize about 500 characters, termed the Engine’s Total Character Set. It includes the letters of the Latin, Greek and Cyrillic alphabets with enough accented letters to recognize the 119 Languages supported by the Engine.

The set is classified as follows:

	Non-accented	Accented
Latin alphabet upper case letters	26	89
Latin alphabet lower case letters	26	91
Digits	10
Punctuation	29
Miscellaneous (math symbols, etc.)	55
Cyrillic upper case letters	33	14
Cyrillic lower case letters	33	14
Greek upper case letters	24	9
Greek lower case letters	25	11
OCR (OCR-A) characters	3

The characters are listed in category and alphanumeric order, together with their Code Page values, in Characters and Code Pages. These are the character categories used by the filter elements.
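
As a quick arithmetic check of the "about 500 characters" figure, the counts in the classification table can be added up. The short Python snippet below is only an illustration: CHARACTER_COUNTS is a hypothetical name, and categories that have no accented column in the table are entered with 0.

```python
# (non-accented, accented) counts copied from the classification table;
# categories without an accented column use 0 for the second value.
CHARACTER_COUNTS = {
    "Latin upper case": (26, 89),
    "Latin lower case": (26, 91),
    "Digits": (10, 0),
    "Punctuation": (29, 0),
    "Miscellaneous": (55, 0),
    "Cyrillic upper case": (33, 14),
    "Cyrillic lower case": (33, 14),
    "Greek upper case": (24, 9),
    "Greek lower case": (25, 11),
    "OCR (OCR-A)": (3, 0),
}

total = sum(plain + accented for plain, accented in CHARACTER_COUNTS.values())
print(total)  # prints 492
```

The table entries therefore sum to 492, which is consistent with the rounded figure of about 500.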

Character Attributes

Speed/Accuracy Choices

The multi-lingual omnifont recognition module basically uses contour analysis, but can supplement this with an innovative form of pattern matching not requiring enormous pre-stored shape libraries.

This module interprets all three page-level recognition trade-off settings: ACCURATE, BALANCED and FAST.

The module is tightly integrated with the checking module, giving a total of five speed/accuracy choices.

Level 1: FAST without checking.
Fastest. The module reads text once and uses feature extraction only. Even this setting can give excellent accuracy on high-quality documents. Recommended also when accuracy is not a big issue (e.g. when OCR is only to allow fuzzy keyword searching in a document retrieval system) or for high-volume work when processing speed is most important.
Level 2: FAST with checking.
The recognition module reads text only once, with feature extraction, but sends words containing suspect or reject characters to a checker, together with its first and second guesses for unsure characters. The checker tries to find solutions based only on those characters. It also tries to repair other typical OCR faults (e.g. di9its embedded in words) and will flag all non-dictionary words it was unable to solve. Recommended e.g. when a Language dictionary is available and the texts are mono-lingual and liable to contain normal language (if not, a User dictionary could be employed).
Level 3: BALANCED without checking.
Two-pass recognition. During the first pass with feature extraction, the program builds up a library of sample characters and ligatured character pairs from the page, whose recognition was very sure. During the second reading pass it stops on all reject and unsure characters, consults its library and uses pattern matching to try and find solutions. That’s why the second pass is not very useful for pages with very little text – the library is too small. Recommended for multi-lingual documents or when a checker is not available.
Level 4: BALANCED with checking.
Two-pass recognition. Reading is a combination of the two processes used in levels 2 and 3. More accurate but processing will take more time.
Level 5: ACCURATE with checking compulsory.
Most accurate but slowest. Designed for use on very degraded mono-lingual documents or when maximum accuracy is very important. It involves two-pass recognition with Adaptive Cell Analysis. This is used to get a bigger library for the pattern matching: uniformly highly degraded documents typically can’t yield enough surely recognized characters to form a useful library. With ACA recognition, characters with somewhat lower certainty are accepted, provided they fall within words accepted by the checking module. This allows the pattern matching to work more successfully.
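
The five levels above are simply combinations of the trade-off setting and whether checking is enabled. The following minimal Python sketch restates that mapping; it is a stand-alone model, not ImageGear API code, and LEVELS and mor_level are hypothetical names. The text does not say how the engine behaves when ACCURATE is requested without checking, so this sketch merely rejects that combination.

```python
# Stand-alone model of the five levels listed above; not ImageGear API code.
LEVELS = {
    ("FAST", False): 1,
    ("FAST", True): 2,
    ("BALANCED", False): 3,
    ("BALANCED", True): 4,
    ("ACCURATE", True): 5,
}


def mor_level(trade_off, checking_enabled):
    """Return the level (1-5) for a trade-off setting and checking flag."""
    try:
        return LEVELS[(trade_off, checking_enabled)]
    except KeyError:
        # Assumption of this sketch: combinations not listed are rejected,
        # e.g. ACCURATE without checking, since level 5 requires checking.
        raise ValueError(
            f"no level defined for {trade_off!r} with "
            f"checking_enabled={checking_enabled}"
        ) from None
```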

MTX Omnifont Recognition Module

Module name:	MTX
Module identifier:	OMNIFONT_MTX
Filling methods supported:	OMNIFONT, DRAFTDOT9, DRAFTDOT24, OCRA, OCRB
Filters supported:	ALL, DIGIT and ALPHA
Trade-off supported:	FAST, ACCURATE (BALANCED is treated as ACCURATE)
Knowledge base file:	N / A

The PLUS2W and PLUS3W recognition modules also require the presence of this module.

Recognition module language binaries are xi*.bin, as follows:

For recognizing an English document, the filenames include the identifier eng:
- xiengb.bin
- xiengc.bin
- xiengd.bin
- xienge.bin
- xiengf.bin (used unaltered for all languages)
- xiengl.bin (used unaltered for all languages)
- xiengp.bin
- xiengs.bin
- xiengz.bin

The files xiengf.bin and xiengl.bin are required, unaltered, for all languages. All other languages have language-specific equivalents of the remaining seven files. The identifier eng is changed as follows:

French: frn
Italian: itl
German: grm
Dutch: dut
Spanish: spn
Portuguese: prt
Swedish: swd
Norwegian: nrw
Danish: dan
Finnish: fin
Portuguese (Brazilian): brz
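
The naming rule above makes it possible to list the binaries a deployment needs for a given language. The following minimal Python sketch assumes, as the text implies, that the language-specific files are named by substituting the identifier into the English file names; LANGUAGE_IDS, SHARED_FILES, LANGUAGE_SUFFIXES and mtx_binaries are hypothetical names, and the code does not touch the file system or the ImageGear API.

```python
# Stand-alone model of the file naming rule above; not ImageGear API code.
LANGUAGE_IDS = {
    "English": "eng", "French": "frn", "Italian": "itl", "German": "grm",
    "Dutch": "dut", "Spanish": "spn", "Portuguese": "prt", "Swedish": "swd",
    "Norwegian": "nrw", "Danish": "dan", "Finnish": "fin",
    "Portuguese (Brazilian)": "brz",
}

# These two files are required, unaltered, for every language.
SHARED_FILES = ["xiengf.bin", "xiengl.bin"]

# Suffix letters of the seven language-specific files.
LANGUAGE_SUFFIXES = ["b", "c", "d", "e", "p", "s", "z"]


def mtx_binaries(language):
    """Return the sorted list of nine file names needed for a language."""
    ident = LANGUAGE_IDS[language]
    specific = [f"xi{ident}{suffix}.bin" for suffix in LANGUAGE_SUFFIXES]
    return sorted(specific + SHARED_FILES)
```

For French, for instance, the result contains xifrnb.bin through xifrnz.bin plus the shared xiengf.bin and xiengl.bin.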

Application Areas

This recognition module recognizes machine printed text; i.e. from printed publications, laser or ink-jet printers and electric typewriters. Output from mechanical typewriters in good condition may also be acceptable. It should also be used for Letter or Near Letter Quality output from dot-matrix printers, and can also be used for Draft Quality.

Range of Characters

This module supports the characters of the following languages:

English
French
Spanish
Italian
German
Norwegian
Portuguese
Danish
Dutch
Finnish
Swedish
Brazilian

Any of these languages can be combined.

Accuracy Issues

This module is influenced by the page-level trade-off setting, but reduces the three settings to two: FAST is respected, while BALANCED and ACCURATE are merged to one value.
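
The "Trade-off supported" entries in the module specifications of this topic, including the MTX merging rule just described, can be summarized in one function. The following is a minimal stand-alone Python sketch, not ImageGear API code; effective_trade_off is a hypothetical name, and returning None for modules that list no trade-off support is a convention of this sketch.

```python
# Stand-alone model of the per-module trade-off handling described in the
# module specifications; not ImageGear API code.
SETTINGS = ("FAST", "BALANCED", "ACCURATE")
NO_TRADE_OFF = ("DOT", "OMNIFONT_FRX")
FULL_TRADE_OFF = ("OMNIFONT_MOR", "OMNIFONT_PLUS2W", "OMNIFONT_PLUS3W")


def effective_trade_off(module, setting):
    """Return the trade-off a module actually applies, or None if ignored."""
    if setting not in SETTINGS:
        raise ValueError(f"unknown trade-off setting: {setting!r}")
    if module in NO_TRADE_OFF:
        return None  # the specification lists no trade-off support
    if module == "OMNIFONT_MTX":
        # FAST is respected; BALANCED and ACCURATE are merged into one value.
        return "FAST" if setting == "FAST" else "ACCURATE"
    if module in FULL_TRADE_OFF:
        return setting
    raise ValueError(f"unknown module: {module!r}")
```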

Character Attributes

PLUS2W and PLUS3W Omnifont Recognition Modules

Module name:	PLUS2W and PLUS3W
Module identifier:	OMNIFONT_PLUS2W and OMNIFONT_PLUS3W
Filling methods supported:	OMNIFONT
Filters supported:	ALL, DIGIT and ALPHA
Trade-off supported:	FAST, BALANCED, ACCURATE
Knowledge base file:	RECOGN.BCT, RECOGN24.BCT

Both PLUS2W and PLUS3W require the presence of FRX, MTX and MOR recognition modules.

Application Areas

Range of Characters

These modules support the same set of characters as the OMNIFONT_MOR module.

Accuracy Issues

The PLUS2W and PLUS3W modules use voting technology to improve recognition results: depending on the trade-off setting, they combine the results from one or more of the FRX, MOR and MTX modules. Accuracy is considerably better with either of these two voting modules, but recognition may take significantly more time than with any single module.

Suspicious Marking

With these modules, the suspicious character and word marking feature is different from that used in MOR, MTX or FRX. These modules do not mark characters as suspicious if all the voting modules provided the same recognition result, even if they were suspiciously recognized in any of them. Consequently, there are likely to be fewer words marked as non-dictionary.

Character Attributes

Languages and Modules

The following table shows the text recognition module support for each of the 119 languages.

Language	MOR	MTX	FRX	PLUS2W	PLUS3W	DOT
Afrikaans	Yes	No	Yes	Yes	Yes	Yes C
Albanian	Yes	No	Yes	Yes	Yes	Yes C
Aymara	Yes	No	Yes	Yes	Yes	Yes
Basque	Yes	No	Yes	Yes	Yes	No
Bemba	Yes	Yes EN	No	Yes	Yes	Yes
Blackfoot	Yes	Yes EN	No	Yes	Yes	Yes
Brazilian B	Yes B	Yes	Yes	Yes	Yes	Yes
Breton	Yes	No	Yes	Yes	Yes	Yes C
Bugotu	Yes	Yes EN	No	Yes	Yes	Yes
Bulgarian	Yes	No	Yes	Yes	Yes	No
Byelorussian	Yes	No	Yes	Yes	Yes	No
Catalan	Yes	No	Yes	Yes	Yes	Yes C
Chamorro	Yes	No	No	Yes	Yes	Yes
Chechen	Yes	No	No	Yes	Yes	No
Corsican	Yes	No	No	Yes	Yes	Yes
Croatian	Yes	No	Yes	Yes	Yes	No
Crow	Yes	Yes EN	No	Yes	Yes	Yes
Czech	Yes	No	Yes	Yes	Yes	No
Danish	Yes	Yes	Yes	Yes	Yes	Yes
Dutch	Yes	Yes	Yes	Yes	Yes	Yes C
English	Yes	Yes	Yes	Yes	Yes	Yes
Eskimo (Inuit)	Yes	No	Yes	Yes	Yes	No
Esperanto	Yes	No	No	Yes	Yes	No
Estonian	Yes	No	Yes	Yes	Yes	Yes
Faroese	Yes	No	Yes	Yes	Yes	No
Fijian	Yes	No	No	Yes	Yes	No
Finnish	Yes	Yes	Yes	Yes	Yes	Yes
French	Yes	Yes	Yes	Yes	Yes	Yes C
Frisian	Yes	No	Yes	Yes	Yes	Yes C
Friulian	Yes	No	Yes	Yes	Yes	Yes C
Gaelic (Irish)	Yes	No	Yes	Yes	Yes	Yes
Gaelic (Scottish)	Yes	No	Yes	Yes	Yes	Yes C
Galician	Yes	Yes	Yes	Yes	Yes	Yes
Ganda	Yes	No	No	Yes	Yes	No
German	Yes	Yes	Yes	Yes	Yes	Yes
Greek	Yes	No	Yes	Yes	Yes	Yes
Guarani	Yes	No	No	Yes	Yes	Yes C
Hani *	Yes	Yes EN	No	Yes	Yes	Yes
Hawaiian	Yes	Yes EN	Yes	Yes	Yes	Yes
Hungarian	Yes	No	Yes	Yes	Yes	Yes
Icelandic	Yes	No	Yes	Yes	Yes	No
Ido	Yes	Yes EN	No	Yes	Yes	Yes
Indonesian	Yes	Yes EN	Yes	Yes	Yes	Yes
Interlingua	Yes	Yes EN	No	Yes	Yes	Yes
Italian	Yes	Yes	Yes	Yes	Yes	Yes C
Kabardian	Yes	No	No	Yes	Yes	No
Kasub	Yes	No	No	Yes	Yes	No
Kawa *	Yes	Yes EN	No	Yes	Yes	Yes
Kikuyu	Yes	No	No	Yes	Yes	No
Kongo	Yes	Yes EN	No	Yes	Yes	Yes
Kpelle	Yes	Yes EN	No	Yes	Yes	Yes
Kurdish *	Yes	No	Yes	Yes	Yes	No
Latin L	Yes	Yes L	Yes	Yes	Yes	Yes L
Latvian	Yes	No	Yes	Yes	Yes	No
Lithuanian	Yes	No	Yes	Yes	Yes	No
Luba	Yes	No	No	Yes	Yes	No
Luxembourgish	Yes	No	No	Yes	Yes	Yes C
Macedonian	Yes	No	Yes	Yes	Yes	No
Malagasy	Yes	Yes EN/M	No	Yes	Yes	Yes C
Malay	Yes	No	Yes	Yes	Yes	No
Malinke	Yes	No	No	Yes	Yes	Yes C
Maltese	Yes	No	No	Yes	Yes	No
Maori	Yes	Yes EN	No	Yes	Yes	Yes
Mayan	Yes	No	No	Yes	Yes	Yes
Miao *	Yes	Yes EN	No	Yes	Yes	Yes
Minangkabau	Yes	No	No	Yes	Yes	No
Mohawk	Yes	Yes EN	No	Yes	Yes	Yes
Moldavian	Yes	No	No	Yes	Yes	No
Nahuatl	Yes	Yes EN	No	Yes	Yes	Yes
Norwegian	Yes	Yes	Yes	Yes	Yes	Yes
Nyanja	Yes	Yes EN	No	Yes	Yes	Yes
Occidental	Yes	No	No	Yes	Yes	Yes
Ojibway	Yes	No	No	Yes	Yes	No
Papiamento	Yes	No	No	Yes	Yes	Yes
Pidgin English	Yes	Yes EN	Yes	Yes	Yes	Yes
Polish	Yes	No	Yes	Yes	Yes	No
Portuguese	Yes	Yes	Yes	Yes	Yes	Yes C
Provençal	Yes	No	No	Yes	Yes	Yes C
Quechua	Yes	No	No	Yes	Yes	Yes
Rhaetic	Yes	No	No	Yes	Yes	Yes C
Romanian	Yes	No	Yes	Yes	Yes	No
Romany	Yes	No	No	Yes	Yes	No
Rwanda	Yes	Yes EN	No	Yes	Yes	Yes
Rundi	Yes	Yes EN	No	Yes	Yes	Yes
Russian	Yes	No	Yes	Yes	Yes	No
Sami	Yes	No	No	Yes	Yes	No
Sami, Lule	Yes	No	No	Yes	Yes	No
Sami, Northern	Yes	No	No	Yes	Yes	No
Sami, Southern	Yes	No	No	Yes	Yes	No
Samoan	Yes	No	No	Yes	Yes	Yes C
Sardinian	Yes	No	No	Yes	Yes	Yes C
Serbian	Yes	No	Yes	Yes	Yes	No
Serbian, Latinic	Yes	No	Yes	Yes	Yes	No
Shona S	Yes	Yes S	No	Yes	Yes	Yes S
Sioux	Yes	Yes EN	No	Yes	Yes	Yes
Slovak	Yes	No	Yes	Yes	Yes	No
Slovenian	Yes	No	Yes	Yes	Yes	No
Somali	Yes	Yes EN	No	Yes	Yes	Yes
Sorbian (Wend)	Yes	No	Yes	Yes	Yes	No
Sotho	Yes	No	No	Yes	Yes	Yes
Spanish	Yes	Yes	Yes	Yes	Yes	Yes
Sundanese SN	Yes	No	No	Yes	Yes	Yes SN
Swahili	Yes	Yes EN	Yes	Yes	Yes	Yes
Swazi	Yes	No	No	Yes	Yes	No
Swedish	Yes	Yes	Yes	Yes	Yes	Yes
Tagalog	Yes	Yes EN	No	Yes	Yes	Yes
Tahitian	Yes	No	Yes	Yes	Yes	Yes C
Tinpo	Yes	Yes EN	No	Yes	Yes	Yes
Tongan	Yes	Yes EN	No	Yes	Yes	Yes
Tswana (Chuana)	Yes	No	No	Yes	Yes	Yes C
Tun *	Yes	Yes EN	No	Yes	Yes	Yes
Turkish	Yes	No	Yes	Yes	Yes	No
Ukrainian	Yes	No	Yes	Yes	Yes	No
Visayan	Yes	Yes EN	No	Yes	Yes	Yes
Welsh	Yes	No	Yes	Yes	Yes	Yes W
Wolof	Yes	No	No	Yes	Yes	Yes C
Xhosa	Yes	Yes EN	No	Yes	Yes	Yes
Zapotec	Yes	Yes EN	No	Yes	Yes	Yes
Zulu	Yes	No	Yes	Yes	Yes	No

The following table summarizes the above:

LANGUAGES	MOR	MTX	FRX	PLUS2W	PLUS3W	DOT
With dictionary support	18	12	17	17	17	14
Accented, non-dictionary	65	0	33	66	66	31
Non-accented, non-dictionary	31	31	4	31	31	31
Directly selectable	119	12	56	119	119	76
Total	119	43	56	119	119	76

Footnotes on Languages / General:

* = This language can be handled only if it is written in the Latin alphabet.
B = Brazilian has a separate dictionary from Portuguese in the MTX and FRX modules. Other modules treat Brazilian as Portuguese. Brazilian is available for language marking in the output document.
L = Latin is usually written without accented letters, but sometimes breves or macrons are placed over vowels. In these cases, the indicated modules do not provide support.
M = Some dialects of Malagasy are written without accents. In these cases, MTX provides support.
S = Shona may be written without accents, but sometimes uses acutes and graves on vowels. In these cases the indicated modules do not provide full support.
SN = Sundanese uses only one accented letter; sometimes this is E-breve, sometimes E-acute. The indicated modules support E-acute but not E-breve.
W = Welsh contains two rarely used characters: W-circumflex and Y-circumflex. These modules can handle Welsh with the exception of these two characters.

Footnotes on Modules / MTX:

The twelve selectable languages are those with Yes with no added footnote letter. For these languages this module uses its own language dictionaries.
EN = Languages denoted are thought to contain no accented letters. To read them, select English and disable spell checking from a main dictionary.

Footnotes on Modules / DOT:

C = Not all uppercase letters are supported. See the module specification for a precise listing. This is probably not a serious restriction, since many 9-pin dot-matrix printers cannot print all the accented uppercase characters.

Asian Recognition Module

Module name:	ASN
Module identifier:	ASIAN
Filling methods supported:	ASIAN
Filters supported:	Not used
Trade-off supported:	Not used

The Asian Recognition Module requires the ImGearRecLicenseFeature.AsianOcr license feature to be enabled.

Application Areas

This module provides recognition services for four Asian languages with horizontal or vertical text direction; these languages are Japanese, Korean and Chinese – Traditional and Simplified. It can also recognize short lengths of embedded English text, without explicitly enabling English in the Languages collection.

The Asian language module differs somewhat from those of Western languages. Below is a list of differences that should be taken into account when performing recognition of Asian text:

The checking subsystem is not available. This means spell checking, UD-Checking and User-Written Checking cannot be used when the Asian Recognition Module is active.
Only one Asian language should be set for recognition at a time.
Western languages should not be set alongside an Asian language.
Note: the Asian Recognition module can recognize short lengths of embedded English text, without English needing to be set. If text from other Latin character sets is embedded, these languages similarly do not need to be set; however, accented characters may not always be handled correctly.
Character attributes, such as bold and italic styling, cannot be retrieved for Asian text, or for embedded English text.
The Deskew3D Method does not support images with text from Asian languages.

For the Asian Recognition Module to work correctly, the selected Asian language should be set before performing preprocessing.

Asian text can be horizontal and left-to-right (FLOW), or vertical, with characters flowing top-to-bottom and lines flowing right-to-left (VERTTEXT).

Non-Asian texts embedded in vertical texts can have three orientations: vertical (neon), right-rotated and side-by-side. All embedded texts will be converted to right rotation when exported to a formatted output document.

The orientation of Asian text is auto-detected on pages where user zones have not been inserted or on AUTO user zones. Auto-detection runs zone-by-zone, so pages with both horizontal and vertical text blocks (such as for picture captions) can be handled.

Digital camera input can be used for Asian-language input, but the automatic 3D deskewing is not useful in these cases.

Table zones can be inserted into Asian pages, but if the OCR engine cannot detect a table within such a zone, the zone is likely to produce zero recognition results.

Conditions

The ideal font point size for Asian language body text is 12 points, scanned at 300 dpi, resulting in characters with around 48 x 48 pixels. The minimum pixel count is about 30 x 30, that is 10.5 points at 300 dpi. For characters smaller than this, 400 dpi should be used.

When zones are defined by the user, it is recommended to create homogeneous user zones as much as possible, because they may give better results. It is especially important in the case of Asian languages. Zones that are automatically located can be inhomogeneous.

Automatic Deskew and Orientation

Support for images with text in Asian languages by the automatic deskew and orientation process can be turned on or off. If the ImGearRecAsianSettings.IgnoreAsianTextForDeskew and ImGearRecAsianSettings.IgnoreAsianTextForRotation properties are set to true and the Asian Recognition module is enabled, calling the ImGearRecImage.PreProcess Method with DeskewMode and OrientationMode set to AUTO will not deskew or rotate the image.

Character Attributes

The character attributes, such as bold and italic styling, cannot be retrieved for Asian text, or for embedded English text.

Confidence Data and Choices

Recognition results can be saved to memory as a LETTER array, making the confidence data and alternate character choices available for Asian languages.