The following list contains all the selectable output formats of the converters.
Commonly Used Output Formats
Office Formats
The toolkit can generate output in the Office file types DOCX, XLSX, and PPTX. These files can be opened in Office 2007 and later versions.
The DOCX file type specification can be downloaded from: http://www.ecma-international.org/news/TC45_current_work/TC45_available_docs.htm.
The DOCX / XLSX / PPTX file types conform to a Microsoft standard called "Open Packaging Conventions (OPC)", with specifications available for download at http://go.microsoft.com/fwlink/?linkID=71255.
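Because an OPC package is a ZIP archive with a `[Content_Types].xml` part at its root, a generated Office file can be inspected with ordinary ZIP tooling. The following is a minimal sketch in Python using only the standard library. It is not part of the toolkit, and the helper names `list_opc_parts` and `looks_like_opc` are hypothetical.

```python
import zipfile


def list_opc_parts(package):
    """Return the names of all parts stored in an OPC package (a ZIP archive).

    `package` may be a file path or a binary file-like object.
    """
    with zipfile.ZipFile(package) as zf:
        return zf.namelist()


def looks_like_opc(package):
    """Heuristic check: an OPC package has a [Content_Types].xml part at its root."""
    with zipfile.ZipFile(package) as zf:
        return "[Content_Types].xml" in zf.namelist()
```

For example, calling `list_opc_parts("output.docx")` on a converted file (the file name is an assumption) would list the XML parts and any embedded media that make up the document.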
PDF Formats
- PDF with image on text - A PDF converter where the original (input) image is retained in the foreground with the recognized text hidden in the background (but in the correct position). Well suited to archiving and indexing documents.
- PDF - A highly configurable, general PDF output converter. It supports many PDF features, but relies heavily on the position of the recognized characters.
- PDF with image substitutes - A special PDF converter, where the suspect words are covered by their images cut out from the original image.
- PDF - Edited - This PDF converter does not rely on the position of the recognized characters, so it can be used even after inserting large new text portions in the editor.
Text Formats
- Text - This converter writes the recognized text into a simple text file that can be read by most text editors and word processors.
- Comma Separated Text - This converter writes the recognized text into a tabular text file (comma-delimited text file) that can be read by Excel. The “List Separator” character separates the cells, and NL (the new line character) separates the rows of the table.
- Formatted Text - This converter writes the recognized text into a text file, but tries to retain the layout of the page by inserting extra spaces.
- Text with line breaks - The same as the Text converter, except that it inserts a line break at the end of every line rather than only at the end of each paragraph.
- Unicode Text - Same as Text, but using two-byte Unicode characters.
- Unicode Comma Separated Text - Same as Comma Separated Text, but using two-byte Unicode characters.
- Unicode Formatted Text - Same as Formatted Text, but using two-byte Unicode characters.
- Unicode Text with line breaks - Same as Text with line breaks, but using two-byte Unicode characters.
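A downstream program can read the separated text formats with a standard CSV parser, as long as it uses the same list separator that the converter was configured with. The sketch below uses Python's standard library. It assumes that the "two-byte Unicode" variants are UTF-16 files that begin with a byte order mark, which the descriptions above do not state. The function names are hypothetical.

```python
import csv
import io


def read_table(text, list_separator=","):
    """Parse separated text into a list of rows, each a list of cell strings.

    `list_separator` must match the "List Separator" used when the file
    was written (assumption: a single character).
    """
    return list(csv.reader(io.StringIO(text, newline=""), delimiter=list_separator))


def decode_unicode_output(raw):
    """Decode the bytes of a Unicode text output.

    Assumption: the file is UTF-16 with a byte order mark.
    """
    return raw.decode("utf-16")
```

If the list separator is a semicolon, `read_table(text, ";")` returns one list of cells per table row.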
ML Formats
- XML - An XML file format conforming to the Nuance XML schema, ssdoc-schema3.xsd, distributed in the ImageGear installation's Bin directory. It contains almost all layout-related information as well as paragraph and character attributes. The "XML output format" page contains a general description of this format.
- HTML 4.0 - The HTML 4.0 format is not as clear as HTML 3.2, but Cascading Style Sheets (CSS) can be used for box-like, absolutely positioned objects, for styles, and for manipulating all paragraph and character attributes.
Ebook Formats
Direct Text Output Formats
This group of output formats allows you to convert recognized text simply and quickly by using the output of the recognition module as is, without reading-order and paragraph detection. Direct Outputs are therefore faster to produce, because they skip these slow detection processes.
- Direct Text - The Direct Text output is a simple text file.
- Direct CSV - The Direct CSV output is a simple format to represent tables. Microsoft Excel can read this format.
- Direct Formatted Text - The Direct Formatted Text output delivers plain text but attempts to keep the layout detected in the original image: it creates a text file that simulates columns and boxes using tab characters.
- Direct XML - The Direct XML output is typically used for further processing of the recognized data. You can easily parse the output XML file (e.g., with MSXML) or transform it (with XSLT). The format of the XML output is specified by the Nuance XML schema, ssdoc-schema3.xsd, distributed in the ImageGear installation's Bin directory.
- Direct Binary - The Direct Binary output is used for creating files directly from the recognition data without any character conversion or formatting.
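As an illustration of the further processing mentioned for Direct XML, the sketch below parses an XML string with Python's standard library instead of MSXML. It deliberately avoids assuming any element names from ssdoc-schema3.xsd: it only counts elements by local name and collects their text. The function names are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def local_name(tag):
    """Strip a '{namespace}' prefix from an ElementTree tag name."""
    return tag.rsplit("}", 1)[-1]


def summarize(xml_text):
    """Return (element counts by local name, whitespace-joined text content)."""
    root = ET.fromstring(xml_text)
    counts = Counter(local_name(el.tag) for el in root.iter())
    text = " ".join(t.strip() for t in root.itertext() if t.strip())
    return counts, text
```

Running `summarize` on a real output file's contents shows which elements the schema actually produces, which is a reasonable first step before writing schema-specific extraction code.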
Legacy and Deprecated Output Formats