ImageGear PDF v24.14 - Updated
Recognition Modules

The ImageGear Recognition API can load a number of recognition modules: which ones are present depends on its configuration. You can program which one should run in a given zone with the zone's ImGearRecZone.RecognitionModule Property. This allows your application to perform "multi-module" recognition on any image.

The ImGearRecRecognitionModule Enumeration lists the possible recognition modules. The RecognitionModule Property of any zone should contain one of these recognition module IDs: OMNIFONT_PLUS3W, OMNIFONT_PLUS2W, OMNIFONT_MTX, OMNIFONT_MOR, OMNIFONT_FRX, DOT, or one of the two special ones: NO_OCR and AUTO. The NO_OCR value means that no recognition will be done in the zone concerned.

AUTO means that the recognition engine will choose the module most likely to be appropriate. It does this first of all by consulting the filling method set for the zone.

The filling method (FM) describes the type of data expected in the zone, e.g., dot-matrix printed text or machine-generated text. A degree of auto-detection is available for the filling method through the ImGearRecPage.DetectFillingMethod Method, which is useful when the precise filling method used on incoming documents is not known in advance.

It is the programmer's responsibility to specify a valid recognition module (RM)-filling method pair. Any incorrectly set zone will have no recognition result. Some filling methods can be linked successfully with one and only one recognition module. However, some recognition modules support more than one filling method, and some filling methods are accepted by more than one recognition module. For example, with mid-quality 9-pin dot-matrix text, either DOT or one of the omnifont modules could give better results. With mid-quality 24-pin dot-matrix text, try RM OMNIFONT_MOR with both FM DRAFTDOT24 and FM OMNIFONT.

You can find detailed information on this and the different recognition modules in this topic.

RM AUTO reads the filling method; if only one recognition module is suitable, it is used. When there is a choice, RM AUTO uses various checks (character set, image size, etc.) to select the best one. Thus, it protects against an invalid FM-RM pair.

If the recognition module is not present when it is needed, the Recognize Method throws an ImGearException with error code and value ImGearRecErrorCodes.REC_ENGINE_FAILURE / API_MODULEMISSING_ERR, and there will be no recognized data for the zone concerned. To avoid this risk, we recommend checking the presence and correct installation of the necessary recognition modules by examining the modules collection using the ImGearRecognition.Modules Property just after the Recognition API's initialization.

Filling Method - Recognition Module Combinations

When a zone recognition module setting is ImGearRecRecognitionModule Enumeration.AUTO, set either by default or explicitly, the Engine takes care of recognition module selection for any filling method. When setting specific values for filling methods and recognition modules, it is the programmer’s responsibility to specify a valid recognition module-filling method pair. Any incorrectly set zone will have no recognition result. The following table shows which modules are considered by the automatic recognition module selection, called up by the ImGearRecRecognitionModule Enumeration.AUTO value. The order of the recognition modules in the second column shows the priority order for the automatic recognition module selection.

| Filling Method | Permissible Recognition Module |
|---|---|
| OMNIFONT | OMNIFONT_PLUS2W, OMNIFONT_PLUS3W, OMNIFONT_MOR, OMNIFONT_FRX, OMNIFONT_MTX |
| DRAFTDOT9 | DOT, OMNIFONT_MTX |
| DRAFTDOT24 | OMNIFONT_MOR, OMNIFONT_MTX |
| OCRA | OMNIFONT_MOR, OMNIFONT_MTX |
| OCRB | OMNIFONT_MOR, OMNIFONT_MTX |
| FM_NO_OCR | - |
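The table above can be captured as a small lookup for checking a pair before it is assigned to a zone. The following Python sketch is illustrative only: the dictionary and function names are hypothetical and not part of the ImageGear API, and `first_available` is a simplified stand-in that considers only priority order and module presence, whereas the engine also applies other checks (character set, image size, etc.).

```python
# Hypothetical lookup mirroring the table above; not part of the ImageGear API.
# Each list keeps the priority order shown in the table's second column.
AUTO_CANDIDATES = {
    "OMNIFONT": ["OMNIFONT_PLUS2W", "OMNIFONT_PLUS3W", "OMNIFONT_MOR",
                 "OMNIFONT_FRX", "OMNIFONT_MTX"],
    "DRAFTDOT9": ["DOT", "OMNIFONT_MTX"],
    "DRAFTDOT24": ["OMNIFONT_MOR", "OMNIFONT_MTX"],
    "OCRA": ["OMNIFONT_MOR", "OMNIFONT_MTX"],
    "OCRB": ["OMNIFONT_MOR", "OMNIFONT_MTX"],
    "FM_NO_OCR": [],
}


def is_valid_pair(filling_method, module):
    """Return True if the module is listed for the filling method."""
    return module in AUTO_CANDIDATES.get(filling_method, [])


def first_available(filling_method, installed):
    """Return the highest-priority listed module that is installed, or None.

    Simplification: only priority and presence are considered here.
    """
    for module in AUTO_CANDIDATES.get(filling_method, []):
        if module in installed:
            return module
    return None
```

A check such as `is_valid_pair("OCRA", "DOT")` returns `False`, which corresponds to a zone that would produce no recognition result.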

Recognition Modules and the Widest Available Character Set

The correct assignment of a recognition module and a filling method to a zone should mean that the recognition module is able to satisfactorily process the contents of that zone. But it does not guarantee that the recognition module will be able to process every possible character. The characters supported by the Engine are listed in Characters and Code Pages. Most recognition modules recognize only a subset of these. Even if we restrict the character set to a limited Language Environment, e.g., selecting the German language, the recognition module may not be able to process all the enabled characters. Automatic recognition module selection takes Character Set support of modules into consideration. Selecting a recognition module directly, it is the programmer’s responsibility to select a recognition module capable of supporting the widest character set enabled in the zone. Otherwise this zone may have an incomplete recognition result. The precise character and language support for each module is given in the appropriate recognition module specifications.

Recognition Modules and Filters

Narrowing the Character Set has two effects:

The filtering system allows the Language environment to be narrowed, by enabling only certain character classes, and also by enabling individual characters. A filter is built up from filter elements, as detailed under ImGearRecFilter.

Each filter element name tells which character class is enabled, e.g., ALL means no filtering. Not all recognition modules interpret all filter elements. Precise information appears in the sub-heading for each module.

Applying a filter may not always enable the same number of characters. E.g., MISCELLANEOUS can enable only those miscellaneous characters supported by the recognition module assigned to the zone.

Recognition Modules and Trade-off Settings

Three accuracy/speed trade-off settings can be specified at page or document level: ACCURATE, BALANCED, and FAST. Five recognition modules can interpret these. Precise information appears in the sub-heading for each module.

Recognition Modules and the Checking Subsystem

The checking module has two basic services. It can flag unacceptable recognition results without changing them or it can be permitted to modify recognition results using checking module feedback. The available acceptance rules can come from the following:

These three sources may be combined freely. The checking module and each of its three parts can be enabled or disabled on a per-zone basis. The integrator should try to match the particular parts of the checking module to the contents and recognition modules of individual zones.

DOT 9-Pin Draft Dot-Matrix Recognition Module

Module name: DOT
Module identifier: DOT
Filling methods supported: DRAFTDOT9
Filters supported: all filter elements
Trade-off supported: none
Knowledge base file: TUDAS.FA

Application Areas

This module is designed only for draft-quality 9-pin dot-matrix text. For NLQ or LQ text, the OMNIFONT_PLUS2W, OMNIFONT_PLUS3W, OMNIFONT_MTX, OMNIFONT_FRX or OMNIFONT_MOR modules are likely to give better results.

Range of Characters

This module supports 76 languages, of which 14 have dictionary support: Catalan, Danish, Dutch, English, Finnish, French, German, Greek, Hungarian, Italian, Norwegian, Portuguese, Spanish and Swedish. It can read multiple languages. It can read 18 of the 29 punctuation characters (the Low Double Comma Quotation Mark is among those missing). It supports 24 of the 55 miscellaneous characters (missing ones include the Euro Sign, the Small Script F, the Copyright Sign, the Registered Trade Mark Sign and the Degree Sign).

The following table lists all accented letters this module can recognize. The range of uppercase ones supported by this module is limited, since in 9-pin dot-matrix text, there is difficulty in printing these letters. Those with lowercase support only are listed separately:

| Upper & Lowercase | Lowercase Only |
|---|---|
| Cap.&Small A Acute (A’) | Small A Circumflex (a^) |
| Cap.&Small AE (Ae) | Small A Macron (a-) |
| Cap.&Small A Ring (Ao) | Small A Grave (a`) |
| Cap.&Small A Umlaut (A:) | Small E Umlaut (e:) |
| Cap.&Small A Tilde (A~) | Small E Circumflex (e^) |
| Cap.&Small C Cedilla (C,) | Small E Grave (e`) |
| Cap.&Small E Acute (E’) | Small I Umlaut (I:) |
| Cap.&Small I Acute (I’) | Small I Circumflex (I^) |
| Cap.&Small N Tilde (N~) | Small I Grave (I`) |
| Cap.&Small O Double Acute (O") | Small O Circumflex (O^) |
| Cap.&Small O Acute (O’) | Small O Macron (O-) |
| Cap.&Small O Umlaut (O:) | Small O Grave (O`) |
| Cap.&Small O Tilde (O~) | Small S Hacek (Sv) |
| Cap.&Small O Slash (O/) | Small U Circumflex (U^) |
| Cap.&Small AE(OE) | Small U Grave (U`) |
| Cap.&Small U Double Acute (U") | |
| Cap.&Small U Acute (U’) | |
| Cap.&Small U Umlaut (U:) | |

Accuracy Issues

This module does not interpret the recognition trade-off setting.

Character Attributes

Since this module recognizes draft dot-matrix text, character attributes are not applicable. Expanded characters are not recognized. Condensed printout can be recognized, but the accuracy is liable to be low.

Conditions

This module is used if it is directly specified in a zone structure.

If DRAFTDOT9 filling method is set together with AUTO, OMNIFONT_MTX is used, provided that all characters (or languages or filters) validated for the zone are supported by it. If any are not supported, this module is used.

It can generate confidence data on recognized characters and can interpret all filter values.
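The DRAFTDOT9 rule under AUTO described above can be sketched as a simple fallback. This Python sketch is an assumption-laden simplification: the function name is hypothetical, only languages are checked (the text also mentions characters and filters), and the language set is taken from the MTX module's Range of Characters list in this topic.

```python
# Languages listed for the MTX module in this topic (Range of Characters).
MTX_LANGUAGES = frozenset({
    "English", "French", "Spanish", "Italian", "German", "Norwegian",
    "Portuguese", "Danish", "Dutch", "Finnish", "Swedish", "Brazilian",
})


def module_for_draftdot9_auto(zone_languages):
    """Hypothetical helper: pick the module for FM DRAFTDOT9 with RM AUTO.

    OMNIFONT_MTX is chosen when it supports every language enabled for
    the zone; otherwise the DOT module is used.
    """
    if set(zone_languages) <= MTX_LANGUAGES:
        return "OMNIFONT_MTX"
    return "DOT"
```

For example, a zone with English and German enabled maps to OMNIFONT_MTX in this sketch, while adding Hungarian makes it fall back to DOT.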

FRX Multi-Lingual Omnifont Recognition Module

Module name: FRX
Module identifier: OMNIFONT_FRX
Filling methods supported: OMNIFONT
Filters supported: all filter elements
Trade-off supported: none
Knowledge base files: none

The OMNIFONT_PLUS2W and OMNIFONT_PLUS3W recognition modules require the presence of this module.

Its associated files are:

| File | Description |
|---|---|
| baltic.shp | Frx shape pack (code page) file. |
| cyrillic.shp | Frx shape pack (code page) file. |
| greek.shp | Frx shape pack (code page) file. |
| latin1.shp | Frx shape pack (code page) file. |
| latin2.shp | Frx shape pack (code page) file. |
| turkish.shp | Frx shape pack (code page) file. |
| charsettable.chr | |
| asciieng.lng | Frx language dictionary. Used in case of multi-language selection. |
| czech.lng | Frx language dictionary data file. |
| danish.lng | Frx language dictionary data file. |
| dutch.lng | Frx language dictionary data file. |
| english.lng | Frx language dictionary data file. |
| finnish.lng | Frx language dictionary data file. |
| french.lng | Frx language dictionary data file. |
| german.lng | Frx language dictionary data file. |
| greek.lng | Frx language dictionary data file. |
| hungar.lng | Frx language dictionary data file. |
| italian.lng | Frx language dictionary data file. |
| norsk.lng | Frx language dictionary data file. |
| polish.lng | Frx language dictionary data file. |
| port.lng | Frx language dictionary data file. |
| russian.lng | Frx language dictionary data file. |
| spanish.lng | Frx language dictionary data file. |
| swedish.lng | Frx language dictionary data file. |
| turkish.lng | Frx language dictionary data file. |

Application Areas

This module recognizes machine printed text; i.e. from printed publications, laser or ink-jet printers and electric typewriters. Output from mechanical typewriters in good condition may also be acceptable. It should also be used for letter or near letter quality (NLQ, LQ) output from dot-matrix printers.

Range of Characters

This module supports the recognition of Latin, Greek and Cyrillic alphabets with enough accented letters to recognize the 54 languages.

The characters are listed in category and alphanumeric order, together with their Code Page values, in Characters and Code Pages.

Multi-Lingual Language Support

The language support of this module is based on the module's internal code pages, which contain characters from a related group of languages. The internal code pages of this module are American/European (Latin 1, 1252), Baltic (1257), Central-European (Latin 2, 1250), Cyrillic (1251), Greek (1253) and Turkish (1254).

The module supports multi-language selection for recognition, though it may not recognize languages from different language groups properly. It supports only language combinations within the same Code Page. For example, it properly processes the English, German and Italian language combination, since all these languages belong to the Latin 1 (1252) code page. However, when specifying e.g. both the French and Czech languages, OMNIFONT_FRX may fail to properly recognize some accented characters in the Czech alphabet, since these languages are not in the same code page. The following table contains the languages by code pages supported by FRX.

| Code page | Languages |
|---|---|
| Latin 2 (1250) | Polish, Czech, Hungarian, Romanian, Albanian, Croatian, Wend (Sorbian), Slovak, Slovenian |
| Cyrillic (1251) | Russian, Ukrainian, Byelorussian, Bulgarian, Macedonian, Serbian |
| Latin 1 (1252) | English, German, French, Spanish, Italian, Dutch, Swedish, Norwegian, Finnish, Danish, Portuguese, Portuguese (Brazilian), Catalan, Afrikaans, Aymara, Basque, Breton, Faroese, Friulian, Gaelic, Galician, Eskimo, Icelandic, Indonesian, Latin, Malaysian, Pidgin English, Swahili, Tahitian, Welsh, Frisian, Zulu |
| Greek (1253) | Greek |
| Turkish (1254) | Turkish, Kurdish (written in Latin alphabet) |
| Baltic (1257) | Estonian, Hawaiian, Latvian, Lithuanian |
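Because FRX only handles language combinations within a single code page, it can help to check a planned combination against the table above before selecting OMNIFONT_FRX. The Python sketch below is illustrative: the dictionary and function names are hypothetical (not ImageGear API), and the language names are copied from the table.

```python
# Hypothetical lookup built from the code page table above.
FRX_CODE_PAGES = {
    1250: {"Polish", "Czech", "Hungarian", "Romanian", "Albanian",
           "Croatian", "Wend (Sorbian)", "Slovak", "Slovenian"},
    1251: {"Russian", "Ukrainian", "Byelorussian", "Bulgarian",
           "Macedonian", "Serbian"},
    1252: {"English", "German", "French", "Spanish", "Italian", "Dutch",
           "Swedish", "Norwegian", "Finnish", "Danish", "Portuguese",
           "Portuguese (Brazilian)", "Catalan", "Afrikaans", "Aymara",
           "Basque", "Breton", "Faroese", "Friulian", "Gaelic", "Galician",
           "Eskimo", "Icelandic", "Indonesian", "Latin", "Malaysian",
           "Pidgin English", "Swahili", "Tahitian", "Welsh", "Frisian",
           "Zulu"},
    1253: {"Greek"},
    1254: {"Turkish", "Kurdish"},
    1257: {"Estonian", "Hawaiian", "Latvian", "Lithuanian"},
}


def shared_code_page(languages):
    """Return the code page containing every given language, or None."""
    wanted = set(languages)
    if not wanted:
        return None
    for page, members in FRX_CODE_PAGES.items():
        if wanted <= members:
            return page
    return None
```

With this lookup, English, German and Italian resolve to code page 1252, while French combined with Czech resolves to no common code page, matching the example in the text.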

Character Attributes

The omnifont recognition module can detect and transmit character attributes: bold, italic or underlined text (or any combination of them). It can also detect and transmit character size, and can classify font types into three broad categories: serif, sans serif and monospaced.

MOR Multi-Lingual Omnifont Recognition Module

Module name: MOR
Module identifier: OMNIFONT_MOR
Filling methods supported: OMNIFONT, DRAFTDOT24, OCRA, OCRB
Filters supported: all filter elements
Trade-off supported: FAST, BALANCED, ACCURATE
Knowledge base files: RECOGN.BCT and RECOGN24.BCT

The PLUS2W and PLUS3W recognition modules also require the presence of this module.

Application Areas

This module recognizes machine printed text; i.e., from printed publications, laser or ink-jet printers and electric typewriters. Output from mechanical typewriters in good condition may also be acceptable. It could also be used for letter or near letter quality (NLQ, LQ) output from dot-matrix printers. For Draft quality 24-pin dot-matrix documents use the DRAFTDOT24 filling method. NLQ or LQ quality output can usually be better recognized without using DRAFTDOT24.

This module can handle at most 500 zones defined on an image.

Range of Characters

This module can recognize about 500 characters, termed the Engine’s Total Character Set. It includes the letters of the Latin, Greek and Cyrillic alphabets with enough accented letters to recognize the 119 languages supported by the Engine.

The set is classified as follows:

| Character class | Non-accented | Accented |
|---|---|---|
| Latin alphabet upper case letters | 26 | 89 |
| Latin alphabet lower case letters | 26 | 91 |
| Digits | 10 | |
| Punctuation | 29 | |
| Miscellaneous (math symbols, etc.) | 55 | |
| Cyrillic upper case letters | 33 | 14 |
| Cyrillic lower case letters | 33 | 14 |
| Greek upper case letters | 24 | 9 |
| Greek lower case letters | 25 | 11 |
| OCR (OCR-A) characters | 3 | |

The characters are listed in category and alphanumeric order, together with their Code Page values, in Characters and Code Pages. These are the character categories used by the filter elements.

Character Attributes

The omnifont recognition module can detect and transmit character attributes: bold, italic or underlined text (or any combination of them). It can also detect and transmit character size, and can classify font types into three broad categories: serif, sans serif and monospaced.

Speed/Accuracy Choices

The multi-lingual omnifont recognition module basically uses contour analysis, but can supplement this with an innovative form of pattern matching not requiring enormous pre-stored shape libraries.

This module interprets all three page-level recognition trade-off settings: ACCURATE, BALANCED and FAST.

The module is tightly integrated with the checking module, giving a total of five speed/accuracy choices.

MTX Omnifont Recognition Module

Module name: MTX
Module identifier: OMNIFONT_MTX
Filling methods supported: OMNIFONT, DRAFTDOT9, DRAFTDOT24, OCRA, OCRB
Filters supported: ALL, DIGIT and ALPHA
Trade-off supported: FAST, ACCURATE (BALANCED is equal to this)
Knowledge base file: N/A

The PLUS2W and PLUS3W recognition modules also require the presence of this module.

Recognition module language binaries are xi*.bin, as follows:

The files xiengf.bin and xiengl.bin are required, unaltered, for all languages. All other languages have language-specific equivalents of the remaining seven files. The identifier eng is changed as follows:

Application Areas

This recognition module recognizes machine printed text; i.e. from printed publications, laser or ink-jet printers and electric typewriters. Output from mechanical typewriters in good condition may also be acceptable. It should also be used for Letter or Near Letter Quality output from dot-matrix printers, and can also be used for Draft Quality.

Range of Characters

This module supports the characters of the following languages:

Language

English

French

Spanish

Italian

German

Norwegian

Portuguese

Danish

Dutch

Finnish

Swedish

Brazilian

Any of these languages can be combined.

Accuracy Issues

This module is influenced by the page-level trade-off setting, but reduces the three settings to two: FAST is respected, while BALANCED and ACCURATE are merged to one value.
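The reduction from three settings to two can be written as a small mapping. This Python sketch only restates the rule above; the function name is hypothetical and the setting names are treated as plain strings rather than ImageGear enumeration values.

```python
def effective_mtx_tradeoff(setting):
    """Map a page-level trade-off setting to what MTX effectively uses.

    FAST is respected; BALANCED and ACCURATE are treated as one value.
    """
    if setting not in ("FAST", "BALANCED", "ACCURATE"):
        raise ValueError(f"unknown trade-off setting: {setting!r}")
    return "FAST" if setting == "FAST" else "ACCURATE"
```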

Character Attributes

The omnifont recognition module can detect and transmit character attributes: bold, italic or underlined text (or any combination of them). It can also detect and transmit character size, and can classify font types into three broad categories: serif, sans serif and monospaced.

PLUS2W and PLUS3W Omnifont Recognition Modules

Module name: PLUS2W and PLUS3W
Module identifier: OMNIFONT_PLUS2W and OMNIFONT_PLUS3W
Filling methods supported: OMNIFONT
Filters supported: ALL, DIGIT and ALPHA
Trade-off supported: FAST, BALANCED, ACCURATE
Knowledge base file: RECOGN.BCT, RECOGN24.BCT

Both PLUS2W and PLUS3W require the presence of FRX, MTX and MOR recognition modules.

Application Areas

This recognition module recognizes machine printed text; i.e. from printed publications, laser or ink-jet printers and electric typewriters. Output from mechanical typewriters in good condition may also be acceptable.

Range of Characters

This module supports the same set of characters as the OMNIFONT_MOR module.

Accuracy Issues

The PLUS2W and PLUS3W modules use voting technology to provide improved recognition results. The PLUS2W and PLUS3W modules use the results from one or more of FRX, MOR and MTX modules according to the trade-off. With either of these two voting modules, the accuracy is considerably better, but the recognition may need significantly more time than any single module.

Suspicious Marking

With these modules, the suspicious character and word marking feature is different from that used in MOR, MTX or FRX. These modules do not mark characters as suspicious if all the voting modules provided the same recognition result, even if they were suspiciously recognized in any of them. Consequently, there are likely to be fewer words marked as non-dictionary.
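To make the idea of voting and agreement-based marking concrete, here is a generic per-character majority vote in Python. It is not the engine's algorithm, which this topic does not describe: the function name is hypothetical, the readings are assumed to be already aligned character by character, and flagging every disagreement as suspicious is an assumption (the text only states that unanimous results are not marked).

```python
from collections import Counter


def vote(readings):
    """Combine equal-length readings from several modules.

    Returns a list of (character, suspicious) pairs. A position is
    flagged suspicious only when the modules disagree on it.
    """
    if not readings or len({len(r) for r in readings}) != 1:
        raise ValueError("readings must be non-empty and of equal length")
    result = []
    for chars in zip(*readings):
        # Most frequent character wins; ties go to the earliest reading.
        winner, _ = Counter(chars).most_common(1)[0]
        suspicious = len(set(chars)) > 1
        result.append((winner, suspicious))
    return result
```

For three readings "cat", "cot" and "cat", the middle position resolves to "a" and is the only one flagged in this sketch.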

Character Attributes

The omnifont recognition module can detect and transmit character attributes: bold, italic or underlined text (or any combination of them). It can also detect and transmit character size, and can classify font types into three broad categories: serif, sans serif and monospaced.

Languages and Modules

The following table shows the text recognition module support for each of the 119 languages.

| Language | MOR | MTX | FRX | PLUS2W | PLUS3W | DOT |
|---|---|---|---|---|---|---|
| Afrikaans | Yes | No | Yes | Yes | Yes | Yes C |
| Albanian | Yes | No | Yes | Yes | Yes | Yes C |
| Aymara | Yes | No | Yes | Yes | Yes | Yes |
| Basque | Yes | No | Yes | Yes | Yes | No |
| Bemba | Yes | Yes EN | No | Yes | Yes | Yes |
| Blackfoot | Yes | Yes EN | No | Yes | Yes | Yes |
| Brazilian B | Yes B | Yes | Yes | Yes | Yes | Yes |
| Breton | Yes | No | Yes | Yes | Yes | Yes C |
| Bugotu | Yes | Yes EN | No | Yes | Yes | Yes |
| Bulgarian | Yes | No | Yes | Yes | Yes | No |
| Byelorussian | Yes | No | Yes | Yes | Yes | No |
| Catalan | Yes | No | Yes | Yes | Yes | Yes C |
| Chamorro | Yes | No | No | Yes | Yes | Yes |
| Chechen | Yes | No | No | Yes | Yes | No |
| Corsican | Yes | No | No | Yes | Yes | Yes |
| Croatian | Yes | No | Yes | Yes | Yes | No |
| Crow | Yes | Yes EN | No | Yes | Yes | Yes |
| Czech | Yes | No | Yes | Yes | Yes | No |
| Danish | Yes | Yes | Yes | Yes | Yes | Yes |
| Dutch | Yes | Yes | Yes | Yes | Yes | Yes C |
| English | Yes | Yes | Yes | Yes | Yes | Yes |
| Eskimo (Inuit) | Yes | No | Yes | Yes | Yes | No |
| Esperanto | Yes | No | No | Yes | Yes | No |
| Estonian | Yes | No | Yes | Yes | Yes | Yes |
| Faroese | Yes | No | Yes | Yes | Yes | No |
| Fijian | Yes | No | No | Yes | Yes | No |
| Finnish | Yes | Yes | Yes | Yes | Yes | Yes |
| French | Yes | Yes | Yes | Yes | Yes | Yes C |
| Frisian | Yes | No | Yes | Yes | Yes | Yes C |
| Friulian | Yes | No | Yes | Yes | Yes | Yes C |
| Gaelic (Irish) | Yes | No | Yes | Yes | Yes | Yes |
| Gaelic (Scottish) | Yes | No | Yes | Yes | Yes | Yes C |
| Galician | Yes | Yes | Yes | Yes | Yes | Yes |
| Ganda | Yes | No | No | Yes | Yes | No |
| German | Yes | Yes | Yes | Yes | Yes | Yes |
| Greek | Yes | No | Yes | Yes | Yes | Yes |
| Guarani | Yes | No | No | Yes | Yes | Yes C |
| Hani * | Yes | Yes EN | No | Yes | Yes | Yes |
| Hawaiian | Yes | Yes EN | Yes | Yes | Yes | Yes |
| Hungarian | Yes | No | Yes | Yes | Yes | Yes |
| Icelandic | Yes | No | Yes | Yes | Yes | No |
| Ido | Yes | Yes EN | No | Yes | Yes | Yes |
| Indonesian | Yes | Yes EN | Yes | Yes | Yes | Yes |
| Interlingua | Yes | Yes EN | No | Yes | Yes | Yes |
| Italian | Yes | Yes | Yes | Yes | Yes | Yes C |
| Kabardian | Yes | No | No | Yes | Yes | No |
| Kasub | Yes | No | No | Yes | Yes | No |
| Kawa * | Yes | Yes EN | No | Yes | Yes | Yes |
| Kikuyu | Yes | No | No | Yes | Yes | No |
| Kongo | Yes | Yes EN | No | Yes | Yes | Yes |
| Kpelle | Yes | Yes EN | No | Yes | Yes | Yes |
| Kurdish * | Yes | No | Yes | Yes | Yes | No |
| Latin L | Yes | Yes L | Yes | Yes | Yes | Yes L |
| Latvian | Yes | No | Yes | Yes | Yes | No |
| Lithuanian | Yes | No | Yes | Yes | Yes | No |
| Luba | Yes | No | No | Yes | Yes | No |
| Luxembourgish | Yes | No | No | Yes | Yes | Yes C |
| Macedonian | Yes | No | Yes | Yes | Yes | No |
| Malagasy | Yes | Yes EN/M | No | Yes | Yes | Yes C |
| Malay | Yes | No | Yes | Yes | Yes | No |
| Malinke | Yes | No | No | Yes | Yes | Yes C |
| Maltese | Yes | No | No | Yes | Yes | No |
| Maori | Yes | Yes EN | No | Yes | Yes | Yes |
| Mayan | Yes | No | No | Yes | Yes | Yes |
| Miao * | Yes | Yes EN | No | Yes | Yes | Yes |
| Minangkabau | Yes | No | No | Yes | Yes | No |
| Mohawk | Yes | Yes EN | No | Yes | Yes | Yes |
| Moldavian | Yes | No | No | Yes | Yes | No |
| Nahuatl | Yes | Yes EN | No | Yes | Yes | Yes |
| Norwegian | Yes | Yes | Yes | Yes | Yes | Yes |
| Nyanja | Yes | Yes EN | No | Yes | Yes | Yes |
| Occidental | Yes | No | No | Yes | Yes | Yes |
| Ojibway | Yes | No | No | Yes | Yes | No |
| Papiamento | Yes | No | No | Yes | Yes | Yes |
| Pidgin English | Yes | Yes EN | Yes | Yes | Yes | Yes |
| Polish | Yes | No | Yes | Yes | Yes | No |
| Portuguese | Yes | Yes | Yes | Yes | Yes | Yes C |
| Provençal | Yes | No | No | Yes | Yes | Yes C |
| Quechua | Yes | No | No | Yes | Yes | Yes |
| Rhaetic | Yes | No | No | Yes | Yes | Yes C |
| Romanian | Yes | No | Yes | Yes | Yes | No |
| Romany | Yes | No | No | Yes | Yes | No |
| Rwanda | Yes | Yes EN | No | Yes | Yes | Yes |
| Rundi | Yes | Yes EN | No | Yes | Yes | Yes |
| Russian | Yes | No | Yes | Yes | Yes | No |
| Sami | Yes | No | No | Yes | Yes | No |
| Sami, Lule | Yes | No | No | Yes | Yes | No |
| Sami, Northern | Yes | No | No | Yes | Yes | No |
| Sami, Southern | Yes | No | No | Yes | Yes | No |
| Samoan | Yes | No | No | Yes | Yes | Yes C |
| Sardinian | Yes | No | No | Yes | Yes | Yes C |
| Serbian | Yes | No | Yes | Yes | Yes | No |
| Serbian, Latinic | Yes | No | Yes | Yes | Yes | No |
| Shona S | Yes | Yes S | No | Yes | Yes | Yes S |
| Sioux | Yes | Yes EN | No | Yes | Yes | Yes |
| Slovak | Yes | No | Yes | Yes | Yes | No |
| Slovenian | Yes | No | Yes | Yes | Yes | No |
| Somali | Yes | Yes EN | No | Yes | Yes | Yes |
| Sorbian (Wend) | Yes | No | Yes | Yes | Yes | No |
| Sotho | Yes | No | No | Yes | Yes | Yes |
| Spanish | Yes | Yes | Yes | Yes | Yes | Yes |
| Sundanese SN | Yes | No | No | Yes | Yes | Yes SN |
| Swahili | Yes | Yes EN | Yes | Yes | Yes | Yes |
| Swazi | Yes | No | No | Yes | Yes | No |
| Swedish | Yes | Yes | Yes | Yes | Yes | Yes |
| Tagalog | Yes | Yes EN | No | Yes | Yes | Yes |
| Tahitian | Yes | No | Yes | Yes | Yes | Yes C |
| Tinpo | Yes | Yes EN | No | Yes | Yes | Yes |
| Tongan | Yes | Yes EN | No | Yes | Yes | Yes |
| Tswana (Chuana) | Yes | No | No | Yes | Yes | Yes C |
| Tun * | Yes | Yes EN | No | Yes | Yes | Yes |
| Turkish | Yes | No | Yes | Yes | Yes | No |
| Ukrainian | Yes | No | Yes | Yes | Yes | No |
| Visayan | Yes | Yes EN | No | Yes | Yes | Yes |
| Welsh | Yes | No | Yes | Yes | Yes | Yes W |
| Wolof | Yes | No | No | Yes | Yes | Yes C |
| Xhosa | Yes | Yes EN | No | Yes | Yes | Yes |
| Zapotec | Yes | Yes EN | No | Yes | Yes | Yes |
| Zulu | Yes | No | Yes | Yes | Yes | No |
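When several languages are enabled in a zone, a table like the one above can be used to find the modules that support all of them. The Python sketch below uses a four-language excerpt copied from the table, with footnote markers dropped; the names `MODULES`, `SUPPORT` and `modules_for` are hypothetical and not part of the ImageGear API.

```python
# Column order of the table above.
MODULES = ("MOR", "MTX", "FRX", "PLUS2W", "PLUS3W", "DOT")

# Excerpt of the table above; footnote markers (C, EN, ...) are dropped.
SUPPORT = {
    "English":   (True, True, True, True, True, True),
    "Czech":     (True, False, True, True, True, False),
    "Greek":     (True, False, True, True, True, True),
    "Esperanto": (True, False, False, True, True, False),
}


def modules_for(languages):
    """Return the modules (in table order) that support every language."""
    rows = [SUPPORT[lang] for lang in languages]
    return [name for i, name in enumerate(MODULES)
            if all(row[i] for row in rows)]
```

In this excerpt, English combined with Czech leaves MOR, FRX, PLUS2W and PLUS3W, while Greek combined with Esperanto leaves MOR, PLUS2W and PLUS3W.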


The following table summarizes the above:

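| LANGUAGES | MOR | MTX | FRX | PLUS2W | PLUS3W | DOT |
|---|---|---|---|---|---|---|
| With dictionary support | 18 | 12 | 17 | 17 | 17 | 14 |
| Accented, non-dictionary | 65 | 0 | 33 | 66 | 66 | 31 |
| Non-accented, non-dictionary | 31 | 31 | 4 | 31 | 31 | 31 |
| Directly selectable | 119 | 12 | 56 | 119 | 119 | 76 |
| Total | 119 | 43 | 56 | 119 | 119 | 76 |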


Footnotes on Languages / General:

Footnotes on Modules / MTX:

Footnotes on Modules / DOT:

Asian Recognition Module

Module name: ASN
Module identifier: ASIAN
Filling methods supported: ASIAN
Filters supported: Not used
Trade-off supported: Not used
The Asian Recognition Module requires the ImGearRecLicenseFeature.AsianOcr license feature to be enabled.

Application Areas

This module provides recognition services for four Asian languages with horizontal or vertical text direction; these languages are Japanese, Korean and Chinese – Traditional and Simplified. It can also recognize short lengths of embedded English text, without explicitly enabling English in the Languages collection.

The Asian language module differs somewhat from those of Western languages. Below is a list of differences that should be taken into account when performing recognition of Asian text:

For the Asian Recognition Module to work correctly, the selected Asian language should be set before performing preprocessing.

Asian text can be horizontal and left-to-right (FLOW) or vertical - character flow top-to-bottom with line flow from right-to-left (VERTTEXT).

Non-Asian texts embedded in vertical texts can have three orientations: vertical (neon), right-rotated and side-by-side. All embedded texts will be converted to right rotation when exported to a formatted output document.

The orientation of Asian text is auto-detected on pages where user zones have not been inserted or on AUTO user zones. Auto-detection runs zone-by-zone, so pages with both horizontal and vertical text blocks (such as for picture captions) can be handled.

Digital camera input can be used for Asian-language input, but the automatic 3D deskewing is not useful in these cases.

Table zones can be inserted into Asian pages, but if the OCR engine cannot detect a table within such a zone, the zone is likely to produce zero recognition results.

Conditions

The ideal font point size for Asian language body text is 12 points, scanned at 300 dpi, resulting in characters with around 48 x 48 pixels. The minimum pixel count is about 30 x 30, that is 10.5 points at 300 dpi. For characters smaller than this, 400 dpi should be used.

When zones are defined by the user, it is recommended to create homogeneous user zones as much as possible, because they may give better results. It is especially important in the case of Asian languages. Zones that are automatically located can be inhomogeneous.

Automatic Deskew and Orientation

Support for images with text in Asian languages by the automatic deskew and orientation process can be turned on or off. When the ImGearRecAsianSettings.IgnoreAsianTextForDeskew and ImGearRecAsianSettings.IgnoreAsianTextForRotation properties are set to true and the Asian Recognition module is enabled, calling the ImGearRecImage.PreProcess Method with DeskewMode and OrientationMode set to AUTO does not deskew or rotate the image.

Character Attributes

The character attributes, such as bold and italic styling, cannot be retrieved for Asian text, or for embedded English text.

Confidence Data and Choices

Recognition results can be saved to memory as a LETTER array, making the confidence data and alternate character choices available for Asian languages.