ImageGear for C and C++ on Windows v19.10 - Updated
Asian Recognition Module

Module name: ASIAN

Module identifier: IG_REC_RM_ASIAN

Filling methods supported: IG_REC_FM_ASIAN

Filters supported: Not used

Trade-off supported: Not used

Training supported: No

The Asian Recognition Module requires the IG_REC_FEATURE_ASIAN license feature to be enabled.

Application Areas

This module provides recognition services for four Asian languages with horizontal or vertical text direction: Japanese, Korean, Traditional Chinese, and Simplified Chinese. It can also recognize short lengths of embedded English text without English being explicitly enabled in the Languages collection.

The Asian Recognition Module differs somewhat from the modules for Western languages. The following differences should be taken into account when recognizing Asian text:

The checking subsystem is not available. This means that spell checking, user dictionary (UD) checking, and user-written checking cannot be used when the Asian Recognition Module is active.

Only one Asian language should be set for recognition at a time.

Western languages should not be set alongside an Asian language.

The Asian Recognition Module can recognize short lengths of embedded English text without English needing to be set. If text from other Latin character sets is embedded, those languages similarly do not need to be set; however, accented characters may not always be handled correctly.

Character attributes, such as bold and italic styling, cannot be retrieved for Asian text, or for embedded English text.

The Deskew3D Method does not support images with text from Asian languages.

For the Asian Recognition Module to work correctly, the selected Asian language should be set before performing preprocessing.

Asian text can be horizontal, reading left to right (FLOW), or vertical, with characters flowing top to bottom and lines flowing right to left (VERTTEXT).

Non-Asian text embedded in vertical text can have three orientations: vertical (neon), right-rotated, and side-by-side. The last of these is usually limited to three characters and is most often used for Arabic numerals. All embedded text is converted to right rotation when exported to a formatted output document.

The orientation of Asian text is auto-detected on pages where no user zones have been inserted, and in AUTO user zones. Auto-detection runs zone by zone, so pages that mix horizontal and vertical text blocks (for example, picture captions) can be handled.

Digital camera input can be used for Asian-language input, but automatic 3D deskewing is not useful in these cases.

Table zones can be inserted into Asian pages, but if the OCR engine cannot detect a table within such a zone, the zone is likely to produce zero recognition results.
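The language-selection rules in the list above (only one Asian language at a time, and no Western language alongside it) can be checked before recognition starts. The following is a minimal sketch of such a check. The SketchLanguage enum and the asian_selection_is_valid function are hypothetical names invented for this illustration; they are not ImageGear identifiers, and an application would map its own language settings onto them.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical language tags for this sketch; not ImageGear identifiers. */
typedef enum {
    LANG_ENGLISH,
    LANG_GERMAN,
    LANG_JAPANESE,
    LANG_KOREAN,
    LANG_CHINESE_TRADITIONAL,
    LANG_CHINESE_SIMPLIFIED
} SketchLanguage;

/* True for the four languages handled by the Asian Recognition Module. */
static bool is_asian(SketchLanguage lang)
{
    return lang == LANG_JAPANESE || lang == LANG_KOREAN ||
           lang == LANG_CHINESE_TRADITIONAL || lang == LANG_CHINESE_SIMPLIFIED;
}

/* Returns true when the selection follows the rules stated above:
   exactly one Asian language and no Western language alongside it. */
bool asian_selection_is_valid(const SketchLanguage *langs, size_t count)
{
    size_t asian = 0;
    size_t western = 0;

    assert(langs != NULL || count == 0);

    for (size_t i = 0; i < count; ++i) {
        if (is_asian(langs[i])) {
            ++asian;
        } else {
            ++western;
        }
    }
    return asian == 1 && western == 0;
}
```

A selection containing only Japanese passes this check. English does not need to be added to it, because the module recognizes short runs of embedded English on its own.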

Conditions

The ideal font point size for Asian language body text is 12 points, scanned at 300 dpi, resulting in characters with around 48 x 48 pixels. The minimum pixel count is about 30 x 30, that is 10.5 points at 300 dpi. For characters smaller than this, 400 dpi should be used.
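The resolution guidance above can be expressed as a small helper. This is only a sketch under stated assumptions: the character height is assumed to have been measured in pixels on a 300 dpi scan, the 30-pixel threshold is the approximate minimum quoted above, and the function names are hypothetical rather than part of ImageGear.

```c
#include <assert.h>

/* Approximate minimum character size (in pixels) quoted in the text above. */
enum { MIN_CHAR_PIXELS_AT_300_DPI = 30 };

/* Returns the scan resolution suggested by the guidance above for a
   character whose height was measured in pixels on a 300 dpi scan. */
int recommended_scan_dpi(int char_height_px_at_300_dpi)
{
    assert(char_height_px_at_300_dpi > 0);
    return char_height_px_at_300_dpi < MIN_CHAR_PIXELS_AT_300_DPI ? 400 : 300;
}

/* Scales a pixel measurement from one resolution to another.
   Pixel size grows linearly with dpi; the result is rounded to nearest. */
int scale_pixels(int pixels, int from_dpi, int to_dpi)
{
    assert(pixels >= 0 && from_dpi > 0 && to_dpi > 0);
    return (pixels * to_dpi + from_dpi / 2) / from_dpi;
}
```

For example, a character that is 24 pixels high at 300 dpi falls below the minimum, and rescanning at 400 dpi yields roughly 32 pixels, which is above it.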

When zones are defined by the user, it is recommended to make them as homogeneous as possible, because homogeneous zones may give better results. This is especially important for Asian languages. Zones that are located automatically can be inhomogeneous.

Deskew and Orientation

Deskew and orientation detection work differently in this module than in the other recognition modules. Both operations can be adjusted with the functions IG_REC_asian_deskew_enabled_set and IG_REC_asian_orientation_enabled_set. If these settings are FALSE (the default), the AUTO methods of these operations (IG_REC_IMG_DESKEW_AUTO, IG_REC_IMG_ROTATE_AUTO) behave for Asian OCR as if the operations were switched off (IG_REC_IMG_DESKEW_NO, IG_REC_IMG_ROTATE_NO). If the settings are TRUE, or if deskew and orientation are not set to AUTO, both operations work the same way for Asian and Western text.
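The rule described above can be summarized as a small decision function. The sketch below is a model of the documented behavior only; it does not call ImageGear. SketchMode and effective_mode are hypothetical names, with MODE_NO and MODE_AUTO standing in for the IG_REC_IMG_DESKEW_NO / IG_REC_IMG_ROTATE_NO and IG_REC_IMG_DESKEW_AUTO / IG_REC_IMG_ROTATE_AUTO constants, and MODE_EXPLICIT standing in for any other, non-AUTO setting.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the NO / AUTO / other settings of the
   deskew and rotate operations; not ImageGear constants. */
typedef enum { MODE_NO, MODE_AUTO, MODE_EXPLICIT } SketchMode;

/* Models the rule above: with the Asian module active, AUTO behaves
   like NO unless the matching Asian flag has been set to TRUE.
   Every other combination behaves as requested. */
SketchMode effective_mode(SketchMode requested,
                          bool asian_module_active,
                          bool asian_flag_enabled)
{
    assert(requested == MODE_NO || requested == MODE_AUTO ||
           requested == MODE_EXPLICIT);

    if (asian_module_active && requested == MODE_AUTO && !asian_flag_enabled) {
        return MODE_NO;
    }
    return requested;
}
```

The same function models both operations: pass the deskew flag when reasoning about deskew and the orientation flag when reasoning about rotation.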

Character Attributes

The character attributes, such as bold and italic styling, cannot be retrieved for Asian text, or for embedded English text.

Confidence Data and Choices

Recognition results can be saved to memory as an AT_REC_LETTER array, which makes the confidence data and alternate character choices available for Asian languages.