Extracting Text from Word Documents
ImageGear Office Assembly provides an API for extracting plain text from a Word document. ImGearWordDocument.ExtractText extracts text from the following Word page areas:
Use ImGearWordTextExtractionOptions to specify the areas from which text should be extracted. By default, ImageGear extracts text from all of the areas mentioned above. The order of extraction is as follows:
For a multi-page document, each type of header and footer is extracted only once, at the end of the file.
ImGearWordTextExtractionMethod specifies how the text is extracted. By default, ImageGear uses paragraph break order: it extracts words in the native order in which they appear in the Word file and does not consider their coordinates on the page. Line break order instead extracts words according to their coordinates on the page.
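The difference between the two orders can be illustrated with a small, library-independent sketch. It is not ImageGear code: the Word record, the two functions, and the line_tolerance parameter are hypothetical names introduced only to show how file order and page-coordinate order can disagree, for example when a text box is stored early in the file but placed at the bottom of the page.

```python
from typing import NamedTuple


class Word(NamedTuple):
    """Hypothetical word record: text plus its position on the page."""
    text: str
    x: float  # left edge
    y: float  # top edge (grows downward)


def native_order(words):
    """Paragraph-break style: keep the order in which words are stored."""
    return " ".join(w.text for w in words)


def coordinate_order(words, line_tolerance=2.0):
    """Line-break style: group words into visual lines by y, then order by x."""
    lines = []
    for w in sorted(words, key=lambda w: (w.y, w.x)):
        # Words whose y values are close enough belong to the same visual line.
        if lines and abs(lines[-1][0].y - w.y) <= line_tolerance:
            lines[-1].append(w)
        else:
            lines.append([w])
    return "\n".join(
        " ".join(w.text for w in sorted(line, key=lambda w: w.x))
        for line in lines
    )


# Stored order: the note comes first in the file but sits lower on the page.
words = [
    Word("Note", 10, 100), Word("here", 40, 100),
    Word("Title", 10, 10), Word("text", 45, 10),
]

print(native_order(words))      # Note here Title text
print(coordinate_order(words))  # Title text / Note here (on two lines)
```

Native order is usually the better choice when the logical reading order matters; coordinate order is closer to what a reader sees on the rendered page.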
The ImGearWordTextExtractionOptions.Encoding property specifies the encoding to use for the extracted text. By default, ImageGear uses UTF-8, which is the native encoding of the OpenXML package.
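Why the encoding choice matters to whoever reads the output can be shown with a generic sketch. It uses only the Python standard library and does not call ImageGear; the sample string is arbitrary.

```python
text = "Résumé"

utf8_bytes = text.encode("utf-8")       # accented letters take 2 bytes each
utf16_bytes = text.encode("utf-16-le")  # every character here takes 2 bytes

print(len(utf8_bytes), len(utf16_bytes))  # 8 12

# Reading the bytes back with the wrong encoding garbles non-ASCII characters.
print(utf8_bytes.decode("latin-1"))  # RÃ©sumÃ©
print(utf8_bytes.decode("utf-8"))    # Résumé
```

The consumer of the extracted text must decode it with the same encoding that was selected for extraction.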
ImGearWordLineEndingType specifies the character sequence used to insert line breaks in the output stream. By default, ImageGear uses the ImGearWordLineEndingType.CRLF sequence.
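The effect of the line-ending choice on the output bytes can be sketched as follows. This is an illustration only, not ImageGear code: the LINE_ENDINGS table and join_lines helper are hypothetical, and only CRLF is confirmed by the text above (LF is included purely for comparison).

```python
# Hypothetical name-to-sequence table; only CRLF is named in the guide.
LINE_ENDINGS = {"CRLF": "\r\n", "LF": "\n"}


def join_lines(lines, ending="CRLF"):
    """Join extracted lines with the chosen line-break sequence."""
    return LINE_ENDINGS[ending].join(lines)


lines = ["Header", "Body text"]
crlf_bytes = join_lines(lines).encode("utf-8")
lf_bytes = join_lines(lines, "LF").encode("utf-8")

print(crlf_bytes)  # b'Header\r\nBody text'
print(lf_bytes)    # b'Header\nBody text'
```

CRLF output is one byte longer per line break, which matters when comparing extracted text byte for byte across platforms.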