ImageGear .NET - Updated
Store Output as Code Pages

One of the recognition engine's settings is its Code Page setting, which can be set using the ImGearRecOutputManager.CodePage Property.

Recognized characters are stored internally in the recognition engine in their UNICODE representation. The current Code Page is taken into account either when converting a character to or from this UNICODE representation, or when converting the recognition data to the final output document. The ConvertCodePageToUnicode and ConvertUnicodeToCodePage utility methods are provided in case the application needs to perform such conversions itself, e.g., when configuring the language environment.

The ConvertCodePageToUnicode Method is useful when an API requires a UNICODE character or character string parameter, and you know only the character code in your target Code Page.
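
The kind of conversion described above can be illustrated outside the recognition engine. The following is a minimal sketch, not the ImageGear API: it uses the standard .NET System.Text.Encoding class, with ISO-8859-1 (Latin-1) as an assumed stand-in for the current Code Page, to turn a single code page value into its UNICODE character and back. The class and method names are hypothetical.

```csharp
using System;
using System.Text;

// Illustration only: uses System.Text.Encoding, not the ImageGear conversion methods.
public static class CodePageConversionSketch
{
    // ISO-8859-1 (Latin-1) is an assumed stand-in for the engine's current Code Page.
    private static readonly Encoding CodePage = Encoding.GetEncoding("iso-8859-1");

    // Converts one character code in the code page to its UNICODE character.
    public static char ToUnicode(byte codePageValue)
    {
        return CodePage.GetString(new[] { codePageValue })[0];
    }

    // Converts a UNICODE character to its character code in the code page.
    public static byte ToCodePage(char unicodeChar)
    {
        return CodePage.GetBytes(new[] { unicodeChar })[0];
    }

    public static void Main()
    {
        char c = ToUnicode(0xE9);                      // 0xE9 is U+00E9 in Latin-1
        Console.WriteLine("U+{0:X4}", (int)c);         // prints "U+00E9"
        Console.WriteLine("0x{0:X2}", ToCodePage(c));  // prints "0xE9"
    }
}
```

In an application that uses the recognition engine, the ConvertCodePageToUnicode and ConvertUnicodeToCodePage methods would play the role of the two helper methods shown here.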

The output conversion process converts character codes from UNICODE into the current Code Page while producing the final output document.

In most cases, the Code Page setting of the recognition engine must be specified together with the Output Text Format for the final output document. Some output converters ignore the Code Page setting; others, typically the text converters, apply it when they run.

The current Code Page setting should be able to express all characters validated for recognition (i.e., the Character Set). You can use the OutOfCodePageCharacters Property to decide whether the current Code Page fulfills this requirement.

The CodePages Property can be used to access the list of available Code Pages.

Enumerating the List of Available Code Pages

C#
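// Assumes igRecognition is an already initialized recognition engine instance.
// Collect the name of every available Code Page, one per line.
string codePageList = "";
for (int i = 0; i < igRecognition.OutputManager.CodePages.Count; i++)
    codePageList += igRecognition.OutputManager.CodePages[i].Name + Environment.NewLine;
System.Console.WriteLine(codePageList);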
VB.NET
' Assumes igRecognition is an already initialized recognition engine instance.
' Collect the name of every available Code Page, one per line.
Dim codePageList As String = ""
For i As Integer = 0 To igRecognition.OutputManager.CodePages.Count - 1
    codePageList &= igRecognition.OutputManager.CodePages(i).Name & Environment.NewLine
Next
System.Console.WriteLine(codePageList)

The set of characters validated for recognition (i.e., the Character Set) can conflict with the Code Page selection, because a selected Code Page may not support some of those characters. For example, if you select the Hungarian language and the current Code Page is Windows ANSI (Code Page 1252), the final output document will not contain some accented characters of that language (such as ő and ű). Use the OutOfCodePageCharacters Property to check whether the current Code Page contains all the characters of the current Language environment (the language selection plus the LanguagesPlus characters) and any characters listed as FilterPlus characters. The property returns a string of the characters that the current Code Page does not support (non-supported characters).
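
The check described above can be approximated with standard .NET classes. The following is a minimal sketch, not the engine's implementation of OutOfCodePageCharacters: it tests each character of a character set against an encoding that throws on unsupported characters. ISO-8859-1 is used as an assumed stand-in for Code Page 1252 because it is available without extra packages; the class name, method name, and sample string are hypothetical.

```csharp
using System;
using System.Text;

// Illustration only: mimics the idea of an "out of code page" check with System.Text.Encoding.
public static class OutOfCodePageSketch
{
    // Returns the characters of characterSet that the named encoding cannot represent.
    // Assumes characterSet contains only Basic Multilingual Plane characters.
    public static string FindUnsupported(string characterSet, string encodingName)
    {
        // An exception fallback makes the encoder throw instead of silently substituting.
        Encoding strict = Encoding.GetEncoding(
            encodingName,
            EncoderFallback.ExceptionFallback,
            DecoderFallback.ExceptionFallback);

        var unsupported = new StringBuilder();
        foreach (char ch in characterSet)
        {
            try
            {
                strict.GetBytes(new[] { ch });
            }
            catch (EncoderFallbackException)
            {
                unsupported.Append(ch);
            }
        }
        return unsupported.ToString();
    }

    public static void Main()
    {
        // Accented lowercase letters of the Hungarian alphabet, written as escapes.
        string hungarianAccents =
            "\u00E1\u00E9\u00ED\u00F3\u00F6\u0151\u00FA\u00FC\u0171";

        // Print the code point of every character the code page cannot express.
        foreach (char ch in FindUnsupported(hungarianAccents, "iso-8859-1"))
            Console.WriteLine("U+{0:X4}", (int)ch);
    }
}
```

For the sample string, the two characters reported are U+0151 (ő) and U+0171 (ű). An empty result would mean that the code page can express the whole character set.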

If the language and Code Page settings leave non-supported characters when output conversion is performed, the recognition engine tries to replace each of them with a similarly shaped character in the final output document. This substitution does not work in all cases; it is mainly useful for replacing non-supported accented characters with unaccented ones. Characters that were recognized correctly but could be neither exported nor substituted appear as a missing symbol in the final output document.

The application can use the MissingSymbol Property to define which character from the current Code Page should be used to indicate a missing symbol.
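
The substitution and missing-symbol behavior can be sketched with standard .NET classes. This is an illustration under stated assumptions, not the engine's actual substitution rules, which are not described here: accents are stripped by UNICODE decomposition (a swapped-in technique), ISO-8859-1 stands in for the current Code Page, and the class and method names are hypothetical.

```csharp
using System;
using System.Globalization;
using System.Text;

// Illustration only: the recognition engine's own substitution logic may differ.
public static class MissingSymbolSketch
{
    // Decomposes a character and drops its combining accents (for example, U+0151 becomes "o").
    private static string RemoveAccents(char ch)
    {
        var result = new StringBuilder();
        foreach (char part in ch.ToString().Normalize(NormalizationForm.FormD))
        {
            if (CharUnicodeInfo.GetUnicodeCategory(part) != UnicodeCategory.NonSpacingMark)
                result.Append(part);
        }
        return result.ToString();
    }

    // Returns true when the strict encoding can represent the whole text.
    private static bool CanEncode(Encoding strict, string text)
    {
        try { strict.GetBytes(text); return true; }
        catch (EncoderFallbackException) { return false; }
    }

    // Keeps supported characters, substitutes unaccented forms where possible,
    // and writes missingSymbol for everything else.
    // Assumes text contains only Basic Multilingual Plane characters.
    public static string Export(string text, string encodingName, char missingSymbol)
    {
        Encoding strict = Encoding.GetEncoding(
            encodingName,
            EncoderFallback.ExceptionFallback,
            DecoderFallback.ExceptionFallback);

        var output = new StringBuilder();
        foreach (char ch in text)
        {
            if (CanEncode(strict, ch.ToString())) { output.Append(ch); continue; }

            string plain = RemoveAccents(ch);
            if (plain.Length > 0 && CanEncode(strict, plain))
                output.Append(plain);          // substituted by an unaccented character
            else
                output.Append(missingSymbol);  // neither exportable nor substitutable
        }
        return output.ToString();
    }

    public static void Main()
    {
        // U+0151 has an unaccented base letter; U+20AC (euro sign) has none.
        Console.WriteLine(Export("felh\u0151 \u20AC", "iso-8859-1", '?'));
    }
}
```

In this sketch the character passed as missingSymbol corresponds to the value an application would assign through the MissingSymbol Property; like that value, it must itself be expressible in the target code page.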