ImageGear for C and C++ on Windows v19.6 - Updated
Define the Character Set

You can improve text recognition accuracy by narrowing the range of characters valid for recognition, so that the recognition engine doesn't always have to choose its solutions from all 500 characters in its Total Character Set. (The multi-lingual omnifont MOR recognition module supports all of these characters; other recognition modules recognize fewer of them.) The Character Set concept is documented in detail in the Character Set topic. Broadly, the Set is compiled from the Language environment (the selected languages plus any LanguagesPlus characters), which a filter and locally (zone) validated FilterPlus characters can then modify.
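
As a rough mental model of this composition, the sketch below treats a character set as a membership table over 7-bit ASCII, builds a "Language environment" by adding characters, and then narrows it with a filter. It is a conceptual illustration only: `CharSet`, `set_add`, `set_filter`, and `set_count` are hypothetical names that are not part of ImageGear, and the intersection used for filtering is an assumption about how narrowing behaves, not a description of the engine's implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Conceptual model only; none of these names come from ImageGear. */
#define SET_SIZE 128

typedef struct {
    bool has[SET_SIZE];   /* has[c] is true when character c is valid */
} CharSet;

/* Adds every 7-bit ASCII character of `chars` to the set. */
static void set_add(CharSet *set, const char *chars)
{
    assert(set != NULL && chars != NULL);
    for (; *chars != '\0'; ++chars) {
        unsigned char c = (unsigned char)*chars;
        if (c < SET_SIZE) {
            set->has[c] = true;
        }
    }
}

/* Keeps only the members that `filter` also contains (assumed filter behavior). */
static void set_filter(CharSet *set, const CharSet *filter)
{
    assert(set != NULL && filter != NULL);
    for (int i = 0; i < SET_SIZE; ++i) {
        set->has[i] = set->has[i] && filter->has[i];
    }
}

/* Returns the number of characters currently in the set. */
static int set_count(const CharSet *set)
{
    int n = 0;
    assert(set != NULL);
    for (int i = 0; i < SET_SIZE; ++i) {
        if (set->has[i]) {
            ++n;
        }
    }
    return n;
}
```

For instance, adding the 26 lowercase letters and ten digits as a stand-in for a selected language, plus two extra characters as a stand-in for LanguagesPlus, gives a set of 38 members; filtering it with a digits-only set leaves 10.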

The following examples illustrate various techniques for limiting the Character Set:

OCR a Bi-Lingual Document

This example demonstrates recognizing a bi-lingual document (German and Spanish). The default recognition module (omnifont, unless specifically changed) is assigned to all zones, as is the filter value IG_REC_FILTER_DEFAULT; that is, there is no local modification of the Character Set.

C
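// nErrCount should be checked after each call; the checks are omitted here for brevity
AT_ERRCOUNT nErrCount;
AT_INT i;
HIGEAR hIGear;
HIG_REC_IMAGE hImg;
enumIGRecLangEnable langs[IG_REC_LANG_SIZE];
// Disable every language, then enable only German and Spanish
for (i = 0; i < IG_REC_LANG_SIZE; i++)
{
    langs[i] = IG_REC_LANG_DISABLED;
}
langs[IG_REC_LANG_GER] = IG_REC_LANG_ENABLED;
langs[IG_REC_LANG_SPA] = IG_REC_LANG_ENABLED;
nErrCount = IG_REC_languages_set(langs);
// Load the page, import it for recognition, and release the original image
nErrCount = IG_load_file("Image.tif", &hIGear);
nErrCount = IG_REC_image_import(hIGear, &hImg);
nErrCount = IG_image_delete(hIGear);
// Recognize the page
nErrCount = IG_REC_image_recognize(hImg);
//...
// Release the recognition image
nErrCount = IG_REC_image_delete(hImg);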

Use a Character Set with No Language Selection

This example illustrates a rare case in which a page contains zones with a very restricted number of characters to be recognized, for example when recognizing forms or multiple-choice test papers. Here the application doesn't enable any language, but instead defines the few necessary characters as LanguagesPlus characters ("a, A, b, B, c, C, d, D, e, E" as the only validated characters). Because there is no language selection, the Language environment consists solely of the individually defined LanguagesPlus characters. Also note that there is no filtering and there are no locally (zone) validated FilterPlus characters; the Language environment therefore fully defines the Character Set, which is valid for the defined zone and for any others inserted with an identical zone structure.

In this case, the zone list is not empty, so the function IG_REC_image_recognize() will not perform auto-decomposition (auto-zoning), but will act on the inserted zone(s).

C
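AT_ERRCOUNT nErrCount;
HIGEAR hIGear;
HIG_REC_IMAGE hImg;
AT_INT i;
enumIGRecLangEnable langs[IG_REC_LANG_SIZE];
// Disable every language so that no language contributes to the Character Set
for (i = 0; i < IG_REC_LANG_SIZE; i++)
{
    langs[i] = IG_REC_LANG_DISABLED;
}
nErrCount = IG_REC_languages_set(langs);
// Specify the LanguagesPlus characters; they are the only valid characters
nErrCount = IG_REC_languagesplus_set(L"aAbBcCdDeE");
// Load the page, import it for recognition, and release the original image
nErrCount = IG_load_file("Image.tif", &hIGear);
nErrCount = IG_REC_image_import(hIGear, &hImg);
nErrCount = IG_image_delete(hIGear);
// Insert the zone(s) to be recognized here; zone setup is not shown in this example
nErrCount = IG_REC_image_recognize(hImg);
//...
nErrCount = IG_REC_image_delete(hImg);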
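
When the Character Set is this small, an application may also want to double-check that whatever text it retrieves contains nothing outside the allowed characters. The following is a minimal sketch of such an application-side check using only the standard C library; `only_allowed_chars` is a hypothetical helper, not an ImageGear function, and how the recognized text is obtained is outside the scope of this topic.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <wchar.h>

/* Returns true when every character of `text` also appears in `allowed`.
   Both arguments are NUL-terminated wide strings. Hypothetical helper. */
static bool only_allowed_chars(const wchar_t *text, const wchar_t *allowed)
{
    assert(text != NULL && allowed != NULL);
    /* wcsspn() returns the length of the leading run of `text` that consists
       only of characters from `allowed`; the text passes the check when that
       run covers the whole string. */
    return wcsspn(text, allowed) == wcslen(text);
}
```

Called with the allowed string L"aAbBcCdDeE", the helper accepts L"AbCde" and rejects L"AbF". If the retrieved text can contain spaces or line breaks, those characters would have to be added to the allowed string as well.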

Use Language Selection, LanguagesPlus Characters, and Local Filter

The third example reads a printed page in Hungarian in which a Croatian town name appears repeatedly; the name contains the character "z-hacek" in lowercase and uppercase. The Windows Eastern Europe Code Page (1250) is needed as the current Code Page (and for export). The page also includes a table containing numbers, which should be zoned separately for digits-only recognition.

In this example, the Language environment is formed from the language selection (Hungarian) plus the two additional LanguagesPlus characters "z-hacek" and "Z-hacek". There is no global filter, but there is a local one, IG_REC_FILTER_DIGIT, defined for one zone.

C
AT_ERRCOUNT nErrCount;
HIGEAR hIGear;
HIG_REC_IMAGE hImg;
AT_INT  i;
char c;
WCHAR uni[256];
AT_REC_ZONE zone;
AT_INT characterCount;
enumIGRecLangEnable langs[IG_REC_LANG_SIZE];
IG_REC_output_codepage_set("Windows Eastern"); // Code Page 1250
for (i=0; i< IG_REC_LANG_SIZE; i++)
{
    langs[i] = IG_REC_LANG_DISABLED;
}
langs[IG_REC_LANG_HUN] = IG_REC_LANG_ENABLED;
nErrCount = IG_REC_languages_set(langs);
memset(uni, 0, sizeof(uni)); // Zero the whole buffer; only one terminating wide NUL is strictly necessary
i = 0;
c = (char)0x9E; // Code of z-hacek in Code Page 1250
nErrCount = IG_REC_util_codepage_to_unicode(&c, 1, &uni[i++], &characterCount);
c = (char)0x8E; // Code of Z-hacek in Code Page 1250
nErrCount = IG_REC_util_codepage_to_unicode(&c, 1, &uni[i++], &characterCount);
nErrCount = IG_REC_languagesplus_set(uni);
nErrCount = IG_load_file("Image.tif", &hIGear );
nErrCount = IG_REC_image_import(hIGear, &hImg);
nErrCount = IG_image_delete(hIGear);
// 1st zone contains a table with digits
memset(&zone, 0, sizeof(AT_REC_ZONE));
zone.Rect.left = 10;      zone.Rect.top = 20;
zone.Rect.right = 330;    zone.Rect.bottom = 50;
zone.FillingMethod = IG_REC_FM_OMNIFONT;
zone.RecognitionModule = IG_REC_RM_OMNIFONT_MOR;
zone.Filter = IG_REC_FILTER_DIGIT;
zone.Type = IG_REC_WT_TABLE;
nErrCount = IG_REC_zone_insert(hImg, 0, &zone);
// 2nd zone contains flowed text without filtering
memset(&zone, 0, sizeof(AT_REC_ZONE));
zone.Rect.left = 10;      zone.Rect.top = 80;
zone.Rect.right = 330;    zone.Rect.bottom = 120;
zone.FillingMethod = IG_REC_FM_OMNIFONT;
zone.RecognitionModule = IG_REC_RM_OMNIFONT_MOR;
zone.Type = IG_REC_WT_FLOW;
zone.Filter = IG_REC_FILTER_ALL;
nErrCount = IG_REC_zone_insert(hImg, 1, &zone);
// 3rd zone contains flowed text without filtering
// etc.
nErrCount = IG_REC_image_recognize(hImg);
//...
nErrCount = IG_REC_image_delete(hImg);
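
For reference, the two Code Page 1250 bytes converted in this example correspond to the Unicode code points U+017E (z-hacek, small z with caron) and U+017D (Z-hacek, capital Z with caron). The sketch below hard-codes that mapping in standard C so the resulting two-character wide string can be inspected without the library; `zhacek_from_cp1250` and `build_zhacek_string` are hypothetical helpers, not ImageGear functions, and they cover only these two bytes rather than the whole code page.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/* Maps the two Code Page 1250 bytes used in the example to Unicode.
   Returns 0 for any other byte; this is not a general CP1250 decoder. */
static wchar_t zhacek_from_cp1250(unsigned char byte)
{
    switch (byte) {
    case 0x9E: return (wchar_t)0x017E; /* LATIN SMALL LETTER Z WITH CARON */
    case 0x8E: return (wchar_t)0x017D; /* LATIN CAPITAL LETTER Z WITH CARON */
    default:   return 0;
    }
}

/* Fills `out` with z-hacek and Z-hacek followed by a terminating NUL.
   `capacity` is the size of `out` in wide characters and must be at least 3. */
static void build_zhacek_string(wchar_t *out, size_t capacity)
{
    assert(out != NULL && capacity >= 3);
    out[0] = zhacek_from_cp1250(0x9E);
    out[1] = zhacek_from_cp1250(0x8E);
    out[2] = L'\0';
}
```

A buffer filled this way holds the same two characters that the example builds in uni before passing it to IG_REC_languagesplus_set().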