ImageGear for C and C++ on Windows v19.1 - Updated
Store Output as Code Pages

One of the recognition engine's settings is its Code Page setting, which can be set with IG_REC_output_codepage_set() and queried with IG_REC_output_codepage_get().

Recognized characters are stored internally in the recognition engine in their UNICODE representation. The current Code Page is taken into account in two situations: when converting a character to or from this UNICODE representation, and when converting the recognition data to the final output document. The first kind of conversion is performed with the IG_REC_util_codepage_to_unicode() and IG_REC_util_unicode_to_codepage() functions.

The IG_REC_util_codepage_to_unicode() function is useful when an application requires a UNICODE character or character string, and you know only the character code in your target Code Page.
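To illustrate what a Code Page to UNICODE conversion involves, here is a minimal, self-contained sketch that does not call ImageGear at all. The names CodePageEntry, kCp1250Sample, sample_cp1250_to_unicode() and sample_unicode_to_cp1250() are hypothetical, and the table is deliberately partial: it covers only ASCII and four Hungarian letters of Windows Code Page 1250. The real IG_REC_util_codepage_to_unicode() and IG_REC_util_unicode_to_codepage() functions work against the engine's current Code Page and their signatures are not reproduced here.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical type for this sketch: one (Code Page byte, UNICODE) pair. */
typedef struct {
    uint8_t  code;     /* character code in the Code Page */
    uint16_t unicode;  /* corresponding UNICODE code point */
} CodePageEntry;

/* Partial Windows-1250 table: only four Hungarian letters are listed. */
static const CodePageEntry kCp1250Sample[] = {
    { 0xD5, 0x0150 },  /* capital O with double acute */
    { 0xDB, 0x0170 },  /* capital U with double acute */
    { 0xF5, 0x0151 },  /* small o with double acute */
    { 0xFB, 0x0171 },  /* small u with double acute */
};

#define SAMPLE_COUNT (sizeof(kCp1250Sample) / sizeof(kCp1250Sample[0]))

/* Code Page -> UNICODE. ASCII maps to itself; codes outside the
   partial table yield U+FFFD (the replacement character). */
uint16_t sample_cp1250_to_unicode(uint8_t code)
{
    size_t i;
    if (code < 0x80) {
        return code;
    }
    for (i = 0; i < SAMPLE_COUNT; ++i) {
        if (kCp1250Sample[i].code == code) {
            return kCp1250Sample[i].unicode;
        }
    }
    return 0xFFFD;
}

/* UNICODE -> Code Page. Returns 1 and stores the byte in *code on
   success, or returns 0 if the sample table has no entry. */
int sample_unicode_to_cp1250(uint16_t unicode, uint8_t *code)
{
    size_t i;
    if (unicode < 0x80) {
        *code = (uint8_t)unicode;
        return 1;
    }
    for (i = 0; i < SAMPLE_COUNT; ++i) {
        if (kCp1250Sample[i].unicode == unicode) {
            *code = kCp1250Sample[i].code;
            return 1;
        }
    }
    return 0;
}
```

The same byte value can denote different characters in different Code Pages (0xF5 is a different letter in Code Page 1252), which is why the conversion always depends on the Code Page in effect.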

When either the IG_REC_output_direct_text_write() or IG_REC_document_write() function is called, it performs character code conversions from UNICODE into the current Code Page setting of the recognition engine while producing the final output document.

In most cases, the Code Page setting of the recognition engine must be specified together with the Output Text Format or Output Document Format for the final output document. Some output converters ignore the Code Page setting; others, typically the text converters, apply it when they run.

The current Code Page setting should be able to express all characters validated for recognition (i.e., the Character Set). You can use the IG_REC_output_codepage_check() function to decide whether the current Code Page fulfills this requirement.

The IG_REC_output_codepage_first_get() and IG_REC_output_codepage_next_get() functions can be used together to enumerate the available Code Pages.

Enumerating the List of Available Code Pages

C
AT_ERRCOUNT nErrCount;
AT_CHAR aCodepage[32];
/* Fetch the first Code Page name; print only if the call succeeded. */
nErrCount = IG_REC_output_codepage_first_get(aCodepage, sizeof(aCodepage));
while (nErrCount == 0)
{
    printf("Codepage: %s\n", aCodepage);
    nErrCount = IG_REC_output_codepage_next_get(aCodepage, sizeof(aCodepage));
    if (nErrCount == 0)
    {
        /* A pending warning also ends the enumeration. */
        nErrCount = IG_warning_check();
    }
}

The set of characters validated for recognition (i.e., the Character Set) can conflict with the Code Page selection, because a selected Code Page may not support some of those characters. For example, if you select the Hungarian language and the current Code Page is Windows ANSI (Code Page 1252), the final output document will not contain some accented characters of that language. Use the IG_REC_output_codepage_check() function to check whether the current Code Page contains all the characters of the current Language environment (the language selection and the LanguagesPlus characters) and any characters listed as FilterPlus characters. The output of IG_REC_output_codepage_check() is a string of the characters that the current Code Page does not support (non-supported characters).
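The kind of check described above can be sketched independently of ImageGear. The example below is only an illustration under stated assumptions: cp1252_supports() and collect_unsupported() are hypothetical helper names, the Character Set is passed in as an array of UNICODE code points, and the Code Page is hard-coded to Windows Code Page 1252. It does not show the signature or the internals of IG_REC_output_codepage_check().

```c
#include <stddef.h>
#include <stdint.h>

/* UNICODE code points that Code Page 1252 places in 0x80..0x9F. */
static const uint16_t kCp1252Extras[] = {
    0x20AC, 0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021, 0x02C6,
    0x2030, 0x0160, 0x2039, 0x0152, 0x017D, 0x2018, 0x2019, 0x201C,
    0x201D, 0x2022, 0x2013, 0x2014, 0x02DC, 0x2122, 0x0161, 0x203A,
    0x0153, 0x017E, 0x0178
};

/* Hypothetical helper: returns 1 if Code Page 1252 can express cp. */
int cp1252_supports(uint16_t cp)
{
    size_t i;
    /* ASCII and U+00A0..U+00FF keep the same value in Code Page 1252. */
    if (cp < 0x80 || (cp >= 0xA0 && cp <= 0xFF)) {
        return 1;
    }
    for (i = 0; i < sizeof(kCp1252Extras) / sizeof(kCp1252Extras[0]); ++i) {
        if (kCp1252Extras[i] == cp) {
            return 1;
        }
    }
    return 0;
}

/* Hypothetical helper: writes the non-supported characters of charset
   into out (up to outCap entries) and returns how many were found. */
size_t collect_unsupported(const uint16_t *charset, size_t n,
                           uint16_t *out, size_t outCap)
{
    size_t i;
    size_t count = 0;
    for (i = 0; i < n; ++i) {
        if (!cp1252_supports(charset[i])) {
            if (count < outCap) {
                out[count] = charset[i];
            }
            ++count;
        }
    }
    return count;
}
```

With a Character Set containing the Hungarian letters U+00E1, U+00E9, U+0151 and U+0171, this sketch reports U+0151 and U+0171 as non-supported, which matches the Hungarian example in the text.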

If the language and Code Page settings leave some characters non-supported when output conversion is performed, the recognition engine tries to replace each non-supported character with a similarly shaped one in the final output document. This substitution does not work in all cases; it is mainly useful for replacing non-supported accented characters with unaccented ones. Where a character was recognized correctly but could be neither exported nor substituted, the final output document contains a missing symbol in its place.

The application can call IG_REC_output_missing_symbol_set() to define which character from the current Code Page should be used to indicate a missing symbol.
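The fallback behavior described above (export directly, else substitute a similar character, else emit the missing symbol) can be sketched as follows. This is a hypothetical illustration, not the engine's algorithm: export_char() and the kFallbacks table are invented for the example, the fallback list holds only four Hungarian letters, and the output Code Page is simplified to the range that Code Page 1252 shares with UNICODE (ASCII and U+00A0 to U+00FF).

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical substitution table: accented letter -> unaccented letter. */
typedef struct {
    uint16_t      unicode;
    unsigned char fallback;
} Fallback;

static const Fallback kFallbacks[] = {
    { 0x0150, 'O' },  /* capital O with double acute */
    { 0x0151, 'o' },  /* small o with double acute */
    { 0x0170, 'U' },  /* capital U with double acute */
    { 0x0171, 'u' },  /* small u with double acute */
};

/* Converts one UNICODE code point to a single output byte.
   Order of attempts: direct export, substitution, missing symbol. */
unsigned char export_char(uint16_t cp, unsigned char missingSymbol)
{
    size_t i;

    /* Simplification: only ASCII and U+00A0..U+00FF export directly. */
    if (cp < 0x80 || (cp >= 0xA0 && cp <= 0xFF)) {
        return (unsigned char)cp;
    }

    /* Try a similarly shaped replacement. */
    for (i = 0; i < sizeof(kFallbacks) / sizeof(kFallbacks[0]); ++i) {
        if (kFallbacks[i].unicode == cp) {
            return kFallbacks[i].fallback;
        }
    }

    /* Neither exportable nor substitutable. */
    return missingSymbol;
}
```

In this sketch the caller supplies the missing symbol on each call. In ImageGear, that choice is made once through IG_REC_output_missing_symbol_set().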