ImageGear for C and C++ on Windows v19.4 - Updated
Export to a Formatted Document

The ImageGear Recognition document API allows saving recognized data to a number of document formats, such as RTF, Microsoft Office Word, or Excel.

This API group requires IG_REC_FEATURE_FORMATTED_OUTPUT to be enabled.

After you have successfully recognized an image (or a series of images), create an HIG_REC_DOCUMENT object to accumulate the recognized pages and write them to the final output document. Use the IG_REC_document_create function to create an empty HIG_REC_DOCUMENT object, then use the IG_REC_document_page_insert function to insert recognized pages into the document. The Recognition document API also allows you to remove, update, or reorder pages. You can save the document to an intermediate file in the native data format with IG_REC_document_save and reopen it later with IG_REC_document_open. When the document is no longer needed, close it with the IG_REC_document_close function.

When a page is added to the document, the document takes ownership of the recognized data, and the page object becomes invalid. If you need to re-recognize an image that has already been added to a document, import it again from the HIGEAR. You can then recognize it and update the corresponding page in the document using IG_REC_document_page_update.
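
Because the page object becomes invalid once it is inserted, it can help to clear the caller's handle at the same moment so that it cannot be reused by accident. The following is a minimal sketch of that idea only. It does not use the ImageGear SDK: `mock_page`, `mock_doc`, and `insert_and_release` are hypothetical names invented for this illustration, and the real IG_REC_document_page_insert takes the page handle by value, so a wrapper of this kind would have to be written by the application.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for a recognized page and a document.
   They are NOT ImageGear types. */
typedef struct mock_page { int recognized; } mock_page;
typedef struct mock_doc  { int pageCount;  } mock_doc;

/* Adds the page to the document and clears the caller's handle, mirroring
   the rule that the document owns the page data after insertion.
   Returns 0 on success, 1 if there is no valid page to insert. */
int insert_and_release(mock_doc *doc, mock_page **page)
{
    assert(doc != NULL);
    if (page == NULL || *page == NULL) {
        return 1;           /* nothing to insert, or handle already released */
    }
    doc->pageCount++;       /* the document now owns the page */
    *page = NULL;           /* the caller's handle is no longer valid */
    return 0;
}
```

With this shape, a second attempt to insert the same handle fails cleanly instead of touching an invalid page object.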

When all document pages have been recognized, write the final document with the IG_REC_document_write function. Before calling it, specify the code page, the format of the final output document, and the level of format retention using the IG_REC_output_codepage_set, IG_REC_output_format_set, and IG_REC_output_level_set functions, respectively. The full list of supported output formats is given in the topic Output Text Format List.

Use the IG_REC_output_format_first_get and IG_REC_output_format_next_get functions to enumerate the supported output formats programmatically.

This topic provides information about how to...

Enumerate the Available Output Text Formats

C
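AT_ERRCOUNT nErrCount;
AT_CHAR szFormatName[128];
// Get the name of the first available output format
nErrCount = IG_REC_output_format_first_get((LPSTR)szFormatName, sizeof(szFormatName));
while (nErrCount == 0)
{
    printf("%s\n", szFormatName);
    // Get the next name; a non-zero error count ends the loop
    nErrCount = IG_REC_output_format_next_get((LPSTR)szFormatName, sizeof(szFormatName));
    if (nErrCount == 0)
    {
        // No error was reported: stop if the call raised any warnings instead
        nErrCount = IG_warning_check();
    }
}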
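
The same first/next loop can be used to check whether a particular format name is offered before passing it to IG_REC_output_format_set. The sketch below shows only the loop shape and is written so that it compiles without the SDK: `mock_format_first_get` and `mock_format_next_get` are hypothetical stand-ins that imitate the first/next calling pattern over a fixed list. Of the names in that list, only "Converters.Text.Word97" appears in this topic; the others are placeholders, not real converter names.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in data. Only "Converters.Text.Word97" comes from this
   topic; the other entries are placeholders. */
static const char *const kMockFormats[] = {
    "Example.Placeholder.A",
    "Converters.Text.Word97",
    "Example.Placeholder.B"
};
static size_t g_mockPos = 0;

/* Copies the next name into buf. Returns 0 on success, or 1 when the list
   is exhausted or the buffer is too small. */
static int mock_format_next_get(char *buf, size_t bufSize)
{
    size_t count = sizeof(kMockFormats) / sizeof(kMockFormats[0]);
    if (g_mockPos >= count) {
        return 1;
    }
    if (strlen(kMockFormats[g_mockPos]) + 1 > bufSize) {
        return 1;
    }
    strcpy(buf, kMockFormats[g_mockPos]);
    g_mockPos++;
    return 0;
}

/* Restarts the enumeration and returns the first name. */
static int mock_format_first_get(char *buf, size_t bufSize)
{
    g_mockPos = 0;
    return mock_format_next_get(buf, bufSize);
}

/* Returns 1 if name is found in the enumerated list, 0 otherwise. */
int format_is_available(const char *name)
{
    char current[128];
    int err;
    assert(name != NULL);
    err = mock_format_first_get(current, sizeof(current));
    while (err == 0) {
        if (strcmp(current, name) == 0) {
            return 1;
        }
        err = mock_format_next_get(current, sizeof(current));
    }
    return 0;
}
```

In an application, the two mock functions would be replaced by the corresponding IG_REC_output_format_first_get and IG_REC_output_format_next_get calls, with the warning check kept as in the sample above.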

Recognize a Multi-Page Document

C
AT_ERRCOUNT nErrCount;
AT_INT i;
AT_INT nPageCount;
HIGEAR hIGear;
HIG_REC_IMAGE hImg;
HIG_REC_DOCUMENT hDocument;
LPSTR szFile = "Multipage.tif";
// Create an empty document backed by the recognition data file MULTIPAG.RDO
nErrCount = IG_REC_document_create("MULTIPAG.RDO", &hDocument);
nErrCount = IG_page_count_get(szFile, &nPageCount);
for (i = 0; i < nPageCount; i++)
{
    // Load one page of the source file (page numbers passed here start at 1)
    nErrCount = IG_fltr_load_file(szFile, i + 1, &hIGear);
    // Import the page into the recognition engine, then release the HIGEAR
    nErrCount = IG_REC_image_import(hIGear, &hImg);
    nErrCount = IG_image_delete(hIGear);
    nErrCount = IG_REC_image_preprocess(hImg);
    nErrCount = IG_REC_image_recognize(hImg);
    // Add the recognized page to the document; the document now owns the page data
    nErrCount = IG_REC_document_page_insert(hDocument, hImg, -1);
}
// Specify the code page and the file format for the final output document
nErrCount = IG_REC_output_codepage_set("Windows ANSI");
nErrCount = IG_REC_output_format_set("Converters.Text.Word97");
// Save the recognized pages as MS Word 97
nErrCount = IG_REC_document_write(hDocument, "MULTIPAG.DOC");
// Close the document
nErrCount = IG_REC_document_close(hDocument);

In this example, the application selects the output format (MS Word 97) for the final output document (MULTIPAG.DOC) with the IG_REC_output_format_set() call. The IG_REC_document_write() call then converts the multi-page recognition result, stored in the recognition data file (MULTIPAG.RDO), into the requested format. The application should delete the MULTIPAG.RDO file when it is no longer needed.
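
The sample above stores every return value in nErrCount but never inspects it. One possible control flow, sketched below, stops processing at the first failing page, skips the write step in that case, and still closes the document on every path. This is an illustration of the flow only and does not call the SDK: `mock_document`, `mock_process_page`, and `convert_all_pages` are hypothetical names, and the `failOnPage` parameter exists solely to simulate a failure.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the document state; not an ImageGear type. */
typedef struct {
    int pagesInserted;
    int written;
    int closed;
} mock_document;

/* Hypothetical stand-in for the per-page load, import, recognize, and
   insert steps. Returns a non-zero error count for page failOnPage
   (pass 0 to simulate a run without failures). */
static int mock_process_page(mock_document *doc, int page, int failOnPage)
{
    if (page == failOnPage) {
        return 1;
    }
    doc->pagesInserted++;
    return 0;
}

/* Processes pages 1..pageCount, writes the output only if every page
   succeeded, and always closes the document. Returns the error count. */
int convert_all_pages(mock_document *doc, int pageCount, int failOnPage)
{
    int err = 0;
    int page;
    assert(doc != NULL);
    for (page = 1; page <= pageCount && err == 0; page++) {
        err = mock_process_page(doc, page, failOnPage);
    }
    if (err == 0) {
        doc->written = 1;   /* stands in for the write step */
    }
    doc->closed = 1;        /* stands in for the close step, run on every path */
    return err;
}
```

Whether a partially recognized document should still be written is an application decision; this sketch assumes it should not.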

When IG_REC_document_create is called with NULL as the first parameter, the application does not have to manage the recognition data file: the recognition component handles it internally (that is, a default recognition data file is used automatically and deleted when the document is closed).

Retain Format in Final Output Documents

RecAPIPlus can produce output that accurately retains complex layouts in several file formats, such as RTF, DOC, WordML, XLS, and WP. The IG_REC_document_write function exports the given document into these output file formats.

In many cases, the original layout is retained in the output document as far as possible. Converters differ in their ability to retain layout. There are five output levels (enumIGRecOutputLevel), each corresponding to a different degree of layout retention, and not every converter supports every level. For example, the Word converter offers the Flowing Page and True Page modes, which produce output very similar to the original layout, whereas simple text converters can retain only the bare text in Plain Text mode (formerly No Format mode) or the text with its attributes in Formatted Text mode (formerly Retain Font and Paragraphs mode).
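
Since not every converter supports every output level, an application may want a fallback policy for cases where the preferred level is unavailable. The sketch below shows one such policy: use the requested level if the converter supports it, otherwise the nearest simpler one. Everything in it is an assumption made for illustration. The `sketch_level` enum is hypothetical and is not enumIGRecOutputLevel, it lists only the four modes named in this topic, the ordering from simplest to richest is assumed, and the bitmask describing what a converter supports is invented for the example.

```c
#include <assert.h>

/* Hypothetical levels, ordered from least to most layout retention.
   This is NOT enumIGRecOutputLevel; names and order are assumptions. */
typedef enum {
    SKETCH_PLAIN_TEXT     = 0,
    SKETCH_FORMATTED_TEXT = 1,
    SKETCH_FLOWING_PAGE   = 2,
    SKETCH_TRUE_PAGE      = 3
} sketch_level;

/* supportedMask has bit (1u << level) set for each level a converter
   supports. Returns the requested level if it is supported, otherwise the
   closest simpler level, or -1 if no level at or below it is supported. */
int choose_level(unsigned supportedMask, sketch_level requested)
{
    int level;
    assert(requested >= SKETCH_PLAIN_TEXT && requested <= SKETCH_TRUE_PAGE);
    for (level = (int)requested; level >= 0; level--) {
        if (supportedMask & (1u << level)) {
            return level;
        }
    }
    return -1;
}
```

For a text-only converter whose mask covers Plain Text and Formatted Text, a request for True Page would fall back to Formatted Text under this policy.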

Besides the output modes, converters have many settings that can influence the layout. These settings can be controlled using the IG_REC_converter_... API functions.

The ImageGear Recognition API can export images from graphics zones to the final output document. These images are stored internally and inserted into the final output document if the output format and level allow it.