Recognition Modules
The ImageGear Recognition API can load a number of recognition modules: which ones are present depends on its configuration. You can program which one should run in a given zone with the zone's ImGearRecZone.RecognitionModule Property. This allows your application to perform "multi-module" recognition on any image.
The ImGearRecRecognitionModule Enumeration lists the possible recognition modules:
Corresponding to the above list, the RecognitionModule Property of any zone should contain one of the following recognition module IDs: OMNIFONT_PLUS3W, OMNIFONT_PLUS2W, OMNIFONT_MTX, OMNIFONT_MOR, OMNIFONT_FRX, DOT, or one of the two special ones: NO_OCR and AUTO. The NO_OCR value means that no recognition will be done in the zone concerned.
AUTO means that the recognition engine will choose the module most likely to be appropriate. It does this first of all by consulting the filling method set for the zone.
The filling method describes the type of data expected in the zone, e.g., dot-matrix printed text or machine-generated text. A degree of auto-detection is available through the ImGearRecPage.DetectFillingMethod Method, which is useful when the filling method of incoming documents is not known in advance.
It is the programmer's responsibility to specify a valid recognition module (RM) and filling method (FM) pair; any zone with an invalid pair produces no recognition result. Some filling methods can be paired with one and only one recognition module. However, some recognition modules support more than one filling method, and some filling methods are accepted by more than one recognition module. For example, with mid-quality 9-pin dot-matrix text, either DOT or one of the omnifont modules may give the better result, so it is worth trying both. With mid-quality 24-pin dot-matrix text, try RM OMNIFONT_MOR with both FM DRAFTDOT24 and FM OMNIFONT.
You can find detailed information on this and the different recognition modules in this topic.
RM AUTO reads the zone's filling method. If only one recognition module is suitable for it, that module is used. When there is a choice, RM AUTO applies various checks (character set, image size, etc.) to select the best one. It therefore protects against an invalid FM-RM pair.
If the recognition module is not present when it is needed, the Recognize Method throws an ImGearException with error code and value ImGearRecErrorCodes.REC_ENGINE_FAILURE / API_MODULEMISSING_ERR, and there will be no recognized data for the zone concerned. To avoid this risk, we recommend checking the presence and correct installation of the necessary recognition modules by examining the modules collection using the ImGearRecognition.Modules Property just after the Recognition API's initialization.
When a zone's recognition module is set to ImGearRecRecognitionModule.AUTO, either by default or explicitly, the Engine selects the recognition module for any filling method. When you set specific filling methods and recognition modules yourself, you must specify a valid recognition module-filling method pair; any incorrectly set zone will have no recognition result.
The following table shows which modules the automatic selection considers for each filling method. The order of the recognition modules in the second column is the priority order used by the automatic selection.
Filling Method | Permissible Recognition Module
OMNIFONT | OMNIFONT_PLUS2W, OMNIFONT_PLUS3W, OMNIFONT_MOR, OMNIFONT_FRX, OMNIFONT_MTX
DRAFTDOT9 | DOT, OMNIFONT_MTX
DRAFTDOT24 | OMNIFONT_MOR, OMNIFONT_MTX
OCRA | OMNIFONT_MOR, OMNIFONT_MTX
OCRB | OMNIFONT_MOR, OMNIFONT_MTX
FM_NO_OCR | -
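The priority table can be modelled in a few lines of Python. This is only an illustrative sketch, not ImageGear code: the names `AUTO_PRIORITY`, `is_valid_pair`, `pick_module`, and the `accepts` callback are hypothetical. The callback stands in for the additional checks (character set, image size, etc.) that the Engine performs internally and that this topic does not spell out.

```python
from typing import Callable, Iterable, Optional

# Priority order per filling method, copied from the table above.
AUTO_PRIORITY = {
    "OMNIFONT": ["OMNIFONT_PLUS2W", "OMNIFONT_PLUS3W", "OMNIFONT_MOR",
                 "OMNIFONT_FRX", "OMNIFONT_MTX"],
    "DRAFTDOT9": ["DOT", "OMNIFONT_MTX"],
    "DRAFTDOT24": ["OMNIFONT_MOR", "OMNIFONT_MTX"],
    "OCRA": ["OMNIFONT_MOR", "OMNIFONT_MTX"],
    "OCRB": ["OMNIFONT_MOR", "OMNIFONT_MTX"],
    "FM_NO_OCR": [],
}


def is_valid_pair(filling_method: str, module: str) -> bool:
    """True if the table lists the module for the given filling method."""
    return module in AUTO_PRIORITY.get(filling_method, [])


def pick_module(filling_method: str,
                installed: Iterable[str],
                accepts: Callable[[str], bool] = lambda module: True
                ) -> Optional[str]:
    """Return the first installed module in priority order, or None.

    `accepts` is a hypothetical hook for extra suitability checks.
    """
    available = set(installed)
    for module in AUTO_PRIORITY.get(filling_method, []):
        if module in available and accepts(module):
            return module
    return None


if __name__ == "__main__":
    print(pick_module("DRAFTDOT9", ["OMNIFONT_MTX", "OMNIFONT_MOR"]))
    print(is_valid_pair("DRAFTDOT9", "OMNIFONT_MOR"))
```

A check like `is_valid_pair` is useful before assigning a module explicitly, since an invalid pair produces no recognition result for the zone.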
Correctly assigning a recognition module and a filling method to a zone should mean that the recognition module can satisfactorily process the contents of that zone. It does not guarantee that the recognition module can process every possible character. The characters supported by the Engine are listed in Characters and Code Pages, and most recognition modules recognize only a subset of these. Even if the character set is restricted to a limited Language Environment, e.g., by selecting the German language, the recognition module may not be able to process all the enabled characters.
Automatic recognition module selection takes the character set support of the modules into consideration. When you select a recognition module directly, it is your responsibility to choose one that supports the widest character set enabled in the zone; otherwise, the zone may have an incomplete recognition result. The precise character and language support for each module is given in the appropriate recognition module specifications.
Narrowing the Character Set has two effects:
The filtering system allows the Language environment to be narrowed, by enabling only certain character classes, and also by enabling individual characters. A filter is built up from filter elements, as detailed under ImGearRecFilter.
Each filter element name tells which character class is enabled, e.g., ALL means no filtering. Not all recognition modules interpret all filter elements. Precise information appears in the sub-heading for each module.
Applying a filter may not always enable the same number of characters. E.g., MISCELLANEOUS can enable only those miscellaneous characters supported by the recognition module assigned to the zone.
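The effect described above amounts to a set intersection: a filter element can only enable characters that the assigned module supports. The sketch below illustrates this with made-up data. The function name and both character sets are hypothetical stand-ins, not the Engine's real character classes.

```python
def effective_characters(filter_classes, module_supported):
    """Characters actually enabled in a zone.

    filter_classes: iterable of character collections enabled by the filter.
    module_supported: characters the assigned recognition module can read.
    """
    enabled = set()
    for chars in filter_classes:
        enabled |= set(chars)
    # A filter cannot enable what the module does not support.
    return enabled & set(module_supported)


if __name__ == "__main__":
    # Hypothetical miniature "miscellaneous" class and module character set.
    miscellaneous = {"€", "©", "%", "&"}
    module_chars = {"%", "&", "A", "B"}
    print(sorted(effective_characters([miscellaneous], module_chars)))
```

The same filter therefore enables a different number of characters depending on which module is assigned to the zone.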
Three accuracy/speed trade-off settings can be specified at page or document level: ACCURATE, BALANCED, and FAST. Five recognition modules can interpret these. Precise information appears in the sub-heading for each module.
The checking module has two basic services. It can flag unacceptable recognition results without changing them or it can be permitted to modify recognition results using checking module feedback. The available acceptance rules can come from the following:
These three sources may be combined freely. The checking module and each of its three parts can be enabled or disabled on a per-zone basis. The integrator should try to match the particular parts of the checking module to the contents and recognition modules of individual zones.
Module name: | DOT
Module identifier: | DOT
Filling methods supported: | DRAFTDOT9
Filters supported: | all filter elements
Trade-off supported: | none
Knowledge base file: |
Application Areas
This module is designed only for draft-quality 9-pin dot-matrix text. For NLQ or LQ text, the OMNIFONT_PLUS2W, OMNIFONT_PLUS3W, OMNIFONT_MTX, OMNIFONT_FRX or OMNIFONT_MOR modules are likely to give better results.
Range of Characters
This module supports 76 languages, of which 14 have dictionary support: Catalan, Danish, Dutch, English, Finnish, French, German, Greek, Hungarian, Italian, Norwegian, Portuguese, Spanish and Swedish. It can read multiple languages. It can read 18 of the 29 punctuation characters. (The Low Double Comma Quotation Mark is missing). It supports 24 of the 55 miscellaneous characters. (Some missing ones are: the Euro Sign, the Small Script F, the Copyright Sign, Registered Trade Mark Sign and the Degree Sign.)
The following table lists all accented letters this module can recognize. The range of uppercase ones supported by this module is limited, since in 9-pin dot-matrix text, there is difficulty in printing these letters. Those with lowercase support only are listed separately:
Upper & Lowercase | Lowercase Only
Cap.&Small A Acute (A’) | Small A Circumflex (a^)
Cap.&Small AE (Ae) | Small A Macron (a-)
Cap.&Small A Ring (Ao) | Small A Grave (a`)
Cap.&Small A Umlaut (A:) | Small E Umlaut (e:)
Cap.&Small A Tilde (A~) | Small E Circumflex (e^)
Cap.&Small C Cedilla (C,) | Small E Grave (e`)
Cap.&Small E Acute (E’) | Small I Umlaut (I:)
Cap.&Small I Acute (I’) | Small I Circumflex (I^)
Cap.&Small N Tilde (N~) | Small I Grave (I`)
Cap.&Small O Double Acute (O") | Small O Circumflex (O^)
Cap.&Small O Acute (O’) | Small O Macron (O-)
Cap.&Small O Umlaut (O:) | Small O Grave (O`)
Cap.&Small O Tilde (O~) | Small S Hacek (Sv)
Cap.&Small O Slash (O/) | Small U Circumflex (U^)
Cap.&Small OE (OE) | Small U Grave (U`)
Cap.&Small U Double Acute (U") |
Cap.&Small U Acute (U’) |
Cap.&Small U Umlaut (U:) |
Accuracy Issues
This module does not interpret the recognition trade-off setting.
Character Attributes
Since this module recognizes draft dot-matrix text, character attributes are not applicable. Expanded characters are not recognized. Condensed printout can be recognized, but accuracy is liable to be low.
Conditions
This module is used if it is directly specified in a zone structure.
If DRAFTDOT9 filling method is set together with AUTO, OMNIFONT_MTX is used, provided that all characters (or languages or filters) validated for the zone are supported by it. If any are not supported, this module is used.
It can generate confidence data on recognized characters and can interpret all filter values.
Module name: | FRX
Module identifier: | OMNIFONT_FRX
Filling methods supported: | OMNIFONT
Filters supported: | all filter elements
Trade-off supported: | none
Knowledge base files: | none
The OMNIFONT_PLUS2W and OMNIFONT_PLUS3W recognition modules require the presence of this module.
Its associated files are:
baltic.shp | Frx shape pack (code page) file.
cyrillic.shp | Frx shape pack (code page) file.
greek.shp | Frx shape pack (code page) file.
latin1.shp | Frx shape pack (code page) file.
latin2.shp | Frx shape pack (code page) file.
turkish.shp | Frx shape pack (code page) file.
charsettable.chr |
asciieng.lng | Frx language dictionary. Used in case of multi-language selection.
czech.lng | Frx language dictionary data file.
danish.lng | Frx language dictionary data file.
dutch.lng | Frx language dictionary data file.
english.lng | Frx language dictionary data file.
finnish.lng | Frx language dictionary data file.
french.lng | Frx language dictionary data file.
german.lng | Frx language dictionary data file.
greek.lng | Frx language dictionary data file.
hungar.lng | Frx language dictionary data file.
italian.lng | Frx language dictionary data file.
norsk.lng | Frx language dictionary data file.
polish.lng | Frx language dictionary data file.
port.lng | Frx language dictionary data file.
russian.lng | Frx language dictionary data file.
spanish.lng | Frx language dictionary data file.
swedish.lng | Frx language dictionary data file.
turkish.lng | Frx language dictionary data file.
Application Areas
This module recognizes machine printed text; i.e. from printed publications, laser or ink-jet printers and electric typewriters. Output from mechanical typewriters in good condition may also be acceptable. It should also be used for letter or near letter quality (NLQ, LQ) output from dot-matrix printers.
Range of Characters
This module supports the recognition of Latin, Greek and Cyrillic alphabets with enough accented letters to recognize the 54 languages.
The characters are listed in category and alphanumeric order, together with their Code Page values, in Characters and Code Pages.
Multi-Lingual Language Support
The language support of this module is based on the module's internal code pages, which contain characters from a related group of languages. The internal code pages of this module are American/European (Latin 1, 1252), Baltic (1257), Central-European (Latin 2, 1250), Cyrillic (1251), Greek (1253) and Turkish (1254).
The module supports multi-language selection for recognition, though it may not recognize languages from different language groups properly. It supports only language combinations within the same Code Page. For example, it properly processes the English, German and Italian language combination, since all these languages belong to the Latin 1 (1252) code page. However, when specifying e.g. both the French and Czech languages, OMNIFONT_FRX may fail to properly recognize some accented characters in the Czech alphabet, since these languages are not in the same code page. The following table contains the languages by code pages supported by FRX.
Latin 2 (1250) | Polish, Czech, Hungarian, Romanian, Albanian, Croatian, Wend (Sorbian), Slovak, Slovenian
Cyrillic (1251) | Russian, Ukrainian, Byelorussian, Bulgarian, Macedonian, Serbian
Latin 1 (1252) | English, German, French, Spanish, Italian, Dutch, Swedish, Norwegian, Finnish, Danish, Portuguese, Portuguese (Brazilian), Catalan, Afrikaans, Aymara, Basque, Breton, Faroese, Friulian, Gaelic, Galician, Eskimo, Icelandic, Indonesian, Latin, Malaysian, Pidgin English, Swahili, Tahitian, Welsh, Frisian, Zulu
Greek (1253) | Greek
Turkish (1254) | Turkish, Kurdish (written in Latin alphabet)
Baltic (1257) | Estonian, Hawaiian, Latvian, Lithuanian
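Before enabling several languages for FRX, an application could check that they share a code page. The following Python sketch encodes the table above as plain data and returns the code pages that contain every requested language. The names `FRX_CODE_PAGES` and `shared_code_pages` are hypothetical, and the language strings are just the labels used in the table, not API identifiers.

```python
# Languages per FRX internal code page, copied from the table above.
FRX_CODE_PAGES = {
    "Latin 2 (1250)": {
        "Polish", "Czech", "Hungarian", "Romanian", "Albanian", "Croatian",
        "Wend (Sorbian)", "Slovak", "Slovenian"},
    "Cyrillic (1251)": {
        "Russian", "Ukrainian", "Byelorussian", "Bulgarian", "Macedonian",
        "Serbian"},
    "Latin 1 (1252)": {
        "English", "German", "French", "Spanish", "Italian", "Dutch",
        "Swedish", "Norwegian", "Finnish", "Danish", "Portuguese",
        "Portuguese (Brazilian)", "Catalan", "Afrikaans", "Aymara", "Basque",
        "Breton", "Faroese", "Friulian", "Gaelic", "Galician", "Eskimo",
        "Icelandic", "Indonesian", "Latin", "Malaysian", "Pidgin English",
        "Swahili", "Tahitian", "Welsh", "Frisian", "Zulu"},
    "Greek (1253)": {"Greek"},
    "Turkish (1254)": {"Turkish", "Kurdish"},
    "Baltic (1257)": {"Estonian", "Hawaiian", "Latvian", "Lithuanian"},
}


def shared_code_pages(languages):
    """Return the code pages that contain all of the given languages."""
    wanted = set(languages)
    return [page for page, langs in FRX_CODE_PAGES.items() if wanted <= langs]


if __name__ == "__main__":
    print(shared_code_pages(["English", "German", "Italian"]))
    print(shared_code_pages(["French", "Czech"]))
```

An empty result signals a combination such as French plus Czech, for which FRX may fail to recognize some accented characters.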
Character Attributes
The omnifont recognition module can detect and transmit character attributes: bold, italic or underlined text (or any combination of them). It can also detect and transmit character size, and can classify font types into three broad categories: serif, sans serif and monospaced.
Module name: | MOR
Module identifier: | OMNIFONT_MOR
Filling methods supported: | OMNIFONT, DRAFTDOT24, OCRA, OCRB
Filters supported: | all filter elements
Trade-off supported: | FAST, BALANCED, ACCURATE
Knowledge base files: | RECOGN.BCT and RECOGN24.BCT
The PLUS2W and PLUS3W recognition modules also require the presence of this module.
Application Areas
This module recognizes machine printed text; i.e., from printed publications, laser or ink-jet printers and electric typewriters. Output from mechanical typewriters in good condition may also be acceptable. It could also be used for letter or near letter quality (NLQ, LQ) output from dot-matrix printers. For Draft quality 24-pin dot-matrix documents use the DRAFTDOT24 filling method. NLQ or LQ quality output can usually be better recognized without using DRAFTDOT24.
This module can handle a maximum of 500 zones defined on an image.
Range of Characters
This module can recognize about 500 characters, termed the Engine’s Total Character Set. It includes the letters of the Latin, Greek and Cyrillic alphabets, with enough accented letters to recognize the 119 languages supported by the Engine.
The set is classified as follows:
Character category | Non-accented | Accented
Latin alphabet upper case letters | 26 | 89
Latin alphabet lower case letters | 26 | 91
Digits | 10 |
Punctuation | 29 |
Miscellaneous (math symbols, etc.) | 55 |
Cyrillic upper case letters | 33 | 14
Cyrillic lower case letters | 33 | 14
Greek upper case letters | 24 | 9
Greek lower case letters | 25 | 11
OCR (OCR-A) characters | 3 |
The characters are listed in category and alphanumeric order, together with their Code Page values, in Characters and Code Pages. These are the character categories used by the filter elements.
Character Attributes
The omnifont recognition module can detect and transmit character attributes: bold, italic or underlined text (or any combination of them). It can also detect and transmit character size, and can classify font types into three broad categories: serif, sans serif and monospaced.
Speed/Accuracy Choices
The multi-lingual omnifont recognition module basically uses contour analysis, but can supplement this with an innovative form of pattern matching not requiring enormous pre-stored shape libraries.
This module interprets all three page-level recognition trade-off settings: ACCURATE, BALANCED and FAST.
The module is tightly integrated with the checking module, giving a total of five speed/accuracy choices.
Module name: | MTX
Module identifier: | OMNIFONT_MTX
Filling methods supported: | OMNIFONT, DRAFTDOT9, DRAFTDOT24, OCRA, OCRB
Filters supported: | ALL, DIGIT and ALPHA
Trade-off supported: | FAST, ACCURATE (BALANCED is treated as ACCURATE)
Knowledge base file: | N/A
The PLUS2W and PLUS3W recognition modules also require the presence of this module.
Recognition module language binaries are xi*.bin, as follows:
The files xiengf.bin and xiengl.bin are required, unaltered, for all languages. All other languages have language-specific equivalents of the remaining seven files. The identifier eng is changed as follows:
Application Areas
This recognition module recognizes machine printed text; i.e. from printed publications, laser or ink-jet printers and electric typewriters. Output from mechanical typewriters in good condition may also be acceptable. It should also be used for Letter or Near Letter Quality output from dot-matrix printers, and can also be used for Draft Quality.
Range of Characters
This module supports the characters of the following languages:
Language |
English |
French |
Spanish |
Italian |
German |
Norwegian |
Portuguese |
Danish |
Dutch |
Finnish |
Swedish |
Brazilian |
Any of these languages can be combined.
Accuracy Issues
This module is influenced by the page-level trade-off setting, but reduces the three settings to two: FAST is respected, while BALANCED and ACCURATE are merged to one value.
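The way each module treats the page-level trade-off setting can be summarized as a small lookup. The Python sketch below is an illustration built from the "Trade-off supported" entries in the module tables of this topic. The names `SUPPORTED` and `effective_tradeoff` are hypothetical and do not correspond to ImageGear API members.

```python
TRADEOFFS = ("FAST", "BALANCED", "ACCURATE")

# "Trade-off supported" entries as listed in the module tables of this topic.
SUPPORTED = {
    "DOT": (),
    "OMNIFONT_FRX": (),
    "OMNIFONT_MOR": TRADEOFFS,
    "OMNIFONT_MTX": ("FAST", "ACCURATE"),
    "OMNIFONT_PLUS2W": TRADEOFFS,
    "OMNIFONT_PLUS3W": TRADEOFFS,
}


def effective_tradeoff(module, setting):
    """Return the setting the module effectively applies, or None if ignored."""
    if setting not in TRADEOFFS:
        raise ValueError(f"unknown trade-off setting: {setting}")
    supported = SUPPORTED[module]
    if not supported:
        return None  # the module does not interpret the trade-off setting
    if setting in supported:
        return setting
    # Only MTX with BALANCED reaches this point: it is merged into ACCURATE.
    return "ACCURATE"


if __name__ == "__main__":
    for module in SUPPORTED:
        print(module, [effective_tradeoff(module, s) for s in TRADEOFFS])
```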
Character Attributes
The omnifont recognition module can detect and transmit character attributes: bold, italic or underlined text (or any combination of them). It can also detect and transmit character size, and can classify font types into three broad categories: serif, sans serif and monospaced.
Module name: | PLUS2W and PLUS3W
Module identifier: | OMNIFONT_PLUS2W and OMNIFONT_PLUS3W
Filling methods supported: | OMNIFONT
Filters supported: | ALL, DIGIT and ALPHA
Trade-off supported: | FAST, BALANCED, ACCURATE
Knowledge base file: | RECOGN.BCT, RECOGN24.BCT
Both PLUS2W and PLUS3W require the presence of FRX, MTX and MOR recognition modules.
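Because the voting modules depend on three other modules, a startup check can be expressed as a simple dependency lookup. The following Python sketch shows the idea under the assumption that the application already has the list of installed module identifiers (for example, gathered from the modules collection after initialization). The names `REQUIRES`, `missing_dependencies`, and `usable_modules` are hypothetical.

```python
_VOTERS = {"OMNIFONT_FRX", "OMNIFONT_MTX", "OMNIFONT_MOR"}

# Modules that require other modules to be present, as stated above.
REQUIRES = {
    "OMNIFONT_PLUS2W": _VOTERS,
    "OMNIFONT_PLUS3W": _VOTERS,
}


def missing_dependencies(module, installed):
    """Return the required modules that are not installed."""
    return REQUIRES.get(module, set()) - set(installed)


def usable_modules(installed):
    """Return the installed modules whose dependencies are all present."""
    present = set(installed)
    return {m for m in present if not missing_dependencies(m, present)}


if __name__ == "__main__":
    installed = ["OMNIFONT_PLUS3W", "OMNIFONT_MOR", "OMNIFONT_MTX"]
    print(sorted(missing_dependencies("OMNIFONT_PLUS3W", installed)))
    print(sorted(usable_modules(installed)))
```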
Application Areas
This recognition module recognizes machine printed text; i.e. from printed publications, laser or ink-jet printers and electric typewriters. Output from mechanical typewriters in good condition may also be acceptable.
Range of Characters
This module supports the same set of characters as the OMNIFONT_MOR module.
Accuracy Issues
The PLUS2W and PLUS3W modules use voting technology to provide improved recognition results. They combine the results from one or more of the FRX, MOR and MTX modules, according to the trade-off setting. With either of these two voting modules, accuracy is considerably better, but recognition may take significantly more time than with any single module.
Suspicious Marking
With these modules, the suspicious character and word marking feature is different from that used in MOR, MTX or FRX. These modules do not mark characters as suspicious if all the voting modules provided the same recognition result, even if they were suspiciously recognized in any of them. Consequently, there are likely to be fewer words marked as non-dictionary.
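The unanimity rule described above can be stated compactly. The Python sketch below is an illustration only: the `Vote` type and the `suspicion_suppressed` function are hypothetical. It covers just the case this topic describes (all voting modules agree); how the Engine marks characters when the modules disagree is not specified here and is left out.

```python
from typing import NamedTuple, Sequence


class Vote(NamedTuple):
    """One voting module's result for a single character position."""
    char: str
    suspicious: bool  # whether that module itself considered the result doubtful


def suspicion_suppressed(votes: Sequence[Vote]) -> bool:
    """True when all voting modules returned the same character.

    In that case the voting module does not mark the character as
    suspicious, even if an individual module flagged it.
    """
    if not votes:
        raise ValueError("at least one vote is required")
    return len({vote.char for vote in votes}) == 1


if __name__ == "__main__":
    agreed = [Vote("e", True), Vote("e", False), Vote("e", False)]
    split = [Vote("e", False), Vote("c", False)]
    print(suspicion_suppressed(agreed), suspicion_suppressed(split))
```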
Character Attributes
The omnifont recognition module can detect and transmit character attributes: bold, italic or underlined text (or any combination of them). It can also detect and transmit character size, and can classify font types into three broad categories: serif, sans serif and monospaced.
The following table shows the text recognition module support for each of the 119 languages.
Language | MOR | MTX | FRX | PLUS2W | PLUS3W | DOT
Afrikaans | Yes | No | Yes | Yes | Yes | Yes C
Albanian | Yes | No | Yes | Yes | Yes | Yes C
Aymara | Yes | No | Yes | Yes | Yes | Yes
Basque | Yes | No | Yes | Yes | Yes | No
Bemba | Yes | Yes EN | No | Yes | Yes | Yes
Blackfoot | Yes | Yes EN | No | Yes | Yes | Yes
Brazilian B | Yes B | Yes | Yes | Yes | Yes | Yes
Breton | Yes | No | Yes | Yes | Yes | Yes C
Bugotu | Yes | Yes EN | No | Yes | Yes | Yes
Bulgarian | Yes | No | Yes | Yes | Yes | No
Byelorussian | Yes | No | Yes | Yes | Yes | No
Catalan | Yes | No | Yes | Yes | Yes | Yes C
Chamorro | Yes | No | No | Yes | Yes | Yes
Chechen | Yes | No | No | Yes | Yes | No
Corsican | Yes | No | No | Yes | Yes | Yes
Croatian | Yes | No | Yes | Yes | Yes | No
Crow | Yes | Yes EN | No | Yes | Yes | Yes
Czech | Yes | No | Yes | Yes | Yes | No
Danish | Yes | Yes | Yes | Yes | Yes | Yes
Dutch | Yes | Yes | Yes | Yes | Yes | Yes C
English | Yes | Yes | Yes | Yes | Yes | Yes
Eskimo (Inuit) | Yes | No | Yes | Yes | Yes | No
Esperanto | Yes | No | No | Yes | Yes | No
Estonian | Yes | No | Yes | Yes | Yes | Yes
Faroese | Yes | No | Yes | Yes | Yes | No
Fijian | Yes | No | No | Yes | Yes | No
Finnish | Yes | Yes | Yes | Yes | Yes | Yes
French | Yes | Yes | Yes | Yes | Yes | Yes C
Frisian | Yes | No | Yes | Yes | Yes | Yes C
Friulian | Yes | No | Yes | Yes | Yes | Yes C
Gaelic (Irish) | Yes | No | Yes | Yes | Yes | Yes
Gaelic (Scottish) | Yes | No | Yes | Yes | Yes | Yes C
Galician | Yes | Yes | Yes | Yes | Yes | Yes
Ganda | Yes | No | No | Yes | Yes | No
German | Yes | Yes | Yes | Yes | Yes | Yes
Greek | Yes | No | Yes | Yes | Yes | Yes
Guarani | Yes | No | No | Yes | Yes | Yes C
Hani * | Yes | Yes EN | No | Yes | Yes | Yes
Hawaiian | Yes | Yes EN | Yes | Yes | Yes | Yes
Hungarian | Yes | No | Yes | Yes | Yes | Yes
Icelandic | Yes | No | Yes | Yes | Yes | No
Ido | Yes | Yes EN | No | Yes | Yes | Yes
Indonesian | Yes | Yes EN | Yes | Yes | Yes | Yes
Interlingua | Yes | Yes EN | No | Yes | Yes | Yes
Italian | Yes | Yes | Yes | Yes | Yes | Yes C
Kabardian | Yes | No | No | Yes | Yes | No
Kasub | Yes | No | No | Yes | Yes | No
Kawa * | Yes | Yes EN | No | Yes | Yes | Yes
Kikuyu | Yes | No | No | Yes | Yes | No
Kongo | Yes | Yes EN | No | Yes | Yes | Yes
Kpelle | Yes | Yes EN | No | Yes | Yes | Yes
Kurdish * | Yes | No | Yes | Yes | Yes | No
Latin L | Yes | Yes L | Yes | Yes | Yes | Yes L
Latvian | Yes | No | Yes | Yes | Yes | No
Lithuanian | Yes | No | Yes | Yes | Yes | No
Luba | Yes | No | No | Yes | Yes | No
Luxembourgish | Yes | No | No | Yes | Yes | Yes C
Macedonian | Yes | No | Yes | Yes | Yes | No
Malagasy | Yes | No | Yes | Yes | Yes C |
Malay | Yes | No | Yes | Yes | Yes | No
Malinke | Yes | No | No | Yes | Yes | Yes C
Maltese | Yes | No | No | Yes | Yes | No
Maori | Yes | Yes EN | No | Yes | Yes | Yes
Mayan | Yes | No | No | Yes | Yes | Yes
Miao * | Yes | Yes EN | No | Yes | Yes | Yes
Minangkabau | Yes | No | No | Yes | Yes | No
Mohawk | Yes | Yes EN | No | Yes | Yes | Yes
Moldavian | Yes | No | No | Yes | Yes | No
Nahuatl | Yes | Yes EN | No | Yes | Yes | Yes
Norwegian | Yes | Yes | Yes | Yes | Yes | Yes
Nyanja | Yes | Yes EN | No | Yes | Yes | Yes
Occidental | Yes | No | No | Yes | Yes | Yes
Ojibway | Yes | No | No | Yes | Yes | No
Papiamento | Yes | No | No | Yes | Yes | Yes
Pidgin English | Yes | Yes EN | Yes | Yes | Yes | Yes
Polish | Yes | No | Yes | Yes | Yes | No
Portuguese | Yes | Yes | Yes | Yes | Yes | Yes C
Provençal | Yes | No | No | Yes | Yes | Yes C
Quechua | Yes | No | No | Yes | Yes | Yes
Rhaetic | Yes | No | No | Yes | Yes | Yes C
Romanian | Yes | No | Yes | Yes | Yes | No
Romany | Yes | No | No | Yes | Yes | No
Rwanda | Yes | Yes EN | No | Yes | Yes | Yes
Rundi | Yes | Yes EN | No | Yes | Yes | Yes
Russian | Yes | No | Yes | Yes | Yes | No
Sami | Yes | No | No | Yes | Yes | No
Sami, Lule | Yes | No | No | Yes | Yes | No
Sami, Northern | Yes | No | No | Yes | Yes | No
Sami, Southern | Yes | No | No | Yes | Yes | No
Samoan | Yes | No | No | Yes | Yes | Yes C
Sardinian | Yes | No | No | Yes | Yes | Yes C
Serbian | Yes | No | Yes | Yes | Yes | No
Serbian, Latinic | Yes | No | Yes | Yes | Yes | No
Shona S | Yes | Yes S | No | Yes | Yes | Yes S
Sioux | Yes | Yes EN | No | Yes | Yes | Yes
Slovak | Yes | No | Yes | Yes | Yes | No
Slovenian | Yes | No | Yes | Yes | Yes | No
Somali | Yes | Yes EN | No | Yes | Yes | Yes
Sorbian (Wend) | Yes | No | Yes | Yes | Yes | No
Sotho | Yes | No | No | Yes | Yes | Yes
Spanish | Yes | Yes | Yes | Yes | Yes | Yes
Sundanese SN | Yes | No | No | Yes | Yes | Yes SN
Swahili | Yes | Yes EN | Yes | Yes | Yes | Yes
Swazi | Yes | No | No | Yes | Yes | No
Swedish | Yes | Yes | Yes | Yes | Yes | Yes
Tagalog | Yes | Yes EN | No | Yes | Yes | Yes
Tahitian | Yes | No | Yes | Yes | Yes | Yes C
Tinpo | Yes | Yes EN | No | Yes | Yes | Yes
Tongan | Yes | Yes EN | No | Yes | Yes | Yes
Tswana (Chuana) | Yes | No | No | Yes | Yes | Yes C
Tun * | Yes | Yes EN | No | Yes | Yes | Yes
Turkish | Yes | No | Yes | Yes | Yes | No
Ukrainian | Yes | No | Yes | Yes | Yes | No
Visayan | Yes | Yes EN | No | Yes | Yes | Yes
Welsh | Yes | No | Yes | Yes | Yes | Yes W
Wolof | Yes | No | No | Yes | Yes | Yes C
Xhosa | Yes | Yes EN | No | Yes | Yes | Yes
Zapotec | Yes | Yes EN | No | Yes | Yes | Yes
Zulu | Yes | No | Yes | Yes | Yes | No
The following table summarizes the above:
LANGUAGES | MOR | MTX | FRX | PLUS2W | PLUS3W | DOT
With dictionary support | 18 | 12 | 17 | 17 | 17 | 14
Accented, non-dictionary | 65 | 0 | 33 | 66 | 66 | 31
Non-accented, non-dictionary | 31 | 31 | 4 | 31 | 31 | 31
Directly selectable | 119 | 12 | 56 | 119 | 119 | 76
Total | 119 | 43 | 56 | 119 | 119 | 76
Module name: | ASN |
Module identifier: | ASIAN |
Filling methods supported: | ASIAN |
Filters supported: | Not used |
Trade-off supported: | Not used |
The Asian Recognition Module requires the ImGearRecLicenseFeature.AsianOcr license feature to be enabled.
Application Areas
This module provides recognition services for four Asian languages with horizontal or vertical text direction; these languages are Japanese, Korean and Chinese – Traditional and Simplified. It can also recognize short lengths of embedded English text, without explicitly enabling English in the Languages collection.
The Asian language module differs somewhat from those of Western languages. Below is a list of differences that should be taken into account when performing recognition of Asian text:
For the Asian Recognition Module to work correctly, the selected Asian language should be set before performing preprocessing.
Asian text can be horizontal and left-to-right (FLOW) or vertical - character flow top-to-bottom with line flow from right-to-left (VERTTEXT).
Non-Asian texts embedded in vertical texts can have three orientations: vertical (neon), right-rotated and side-by-side. All embedded texts will be converted to right rotation when exported to a formatted output document.
The orientation of Asian text is auto-detected on pages where user zones have not been inserted or on AUTO user zones. Auto-detection runs zone-by-zone, so pages with both horizontal and vertical text blocks (such as for picture captions) can be handled.
Digital camera input can be used for Asian-language input, but the automatic 3D deskewing is not useful in these cases.
Table zones can be inserted into Asian pages, but if the OCR engine cannot detect a table within such a zone, the zone is likely to produce zero recognition results.
Conditions
The ideal font point size for Asian language body text is 12 points, scanned at 300 dpi, resulting in characters with around 48 x 48 pixels. The minimum pixel count is about 30 x 30, that is 10.5 points at 300 dpi. For characters smaller than this, 400 dpi should be used.
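The relationship between point size, scan resolution, and pixel size follows from the definition of the typographic point (72 points per inch). The Python sketch below works through that arithmetic and applies the 30-pixel minimum stated above. The function names are hypothetical, and the computed value is the nominal font height, which is slightly larger than the printed glyph itself.

```python
POINTS_PER_INCH = 72  # one typographic point is 1/72 inch
MIN_CHAR_PX = 30      # minimum character size in pixels, as stated above


def char_size_px(point_size, dpi):
    """Nominal font height in pixels for a point size at a scan resolution."""
    return point_size * dpi / POINTS_PER_INCH


def suggested_scan_dpi(char_px_at_300dpi):
    """Suggest 400 dpi when characters fall below the minimum at 300 dpi."""
    return 300 if char_px_at_300dpi >= MIN_CHAR_PX else 400


if __name__ == "__main__":
    # 12-point text at 300 dpi has a nominal height of 50 pixels, in line
    # with the "around 48 x 48 pixels" quoted for the glyphs themselves.
    print(char_size_px(12, 300))
    print(suggested_scan_dpi(24))
```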
When zones are defined by the user, it is recommended to create homogeneous user zones as much as possible, because they may give better results. It is especially important in the case of Asian languages. Zones that are automatically located can be inhomogeneous.
Automatic Deskew and Orientation
Automatic deskew and orientation support for images containing Asian text can be turned on or off. When the ImGearRecAsianSettings.IgnoreAsianTextForDeskew and ImGearRecAsianSettings.IgnoreAsianTextForRotation properties are set to true and the Asian Recognition module is enabled, calling the ImGearRecImage.PreProcess Method with DeskewMode and OrientationMode set to AUTO does not deskew or rotate the image.
Character Attributes
The character attributes, such as bold and italic styling, cannot be retrieved for Asian text, or for embedded English text.
Confidence Data and Choices
Recognition results can be saved to memory as a LETTER array, making the confidence data and alternate character choices available for Asian languages.