Word: Extracting Text from Word Documents
ImageGear Office Assembly provides an API for extracting plain text from a Word document. ImGearWordDocument.ExtractText extracts text from the following Word page areas:
- Regular body
- Header
- Footer
Using ImGearWordTextExtractionOptions, it is possible to specify the area from which the text should be extracted. By default, ImageGear extracts the text from all areas mentioned above. The order of extraction is as follows:
- Extract body
- Extract header
- Extract footer
For a multi-page document, each type of header and footer is extracted only once, at the end of the file.
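The ordering rule above can be pictured with a small toy model. This is a minimal Python sketch, not ImageGear code: the function `extraction_order`, the dictionary layout, and the header/footer type names ("first", "default") are hypothetical and chosen only for illustration.

```python
def extraction_order(body_paragraphs, headers, footers):
    """Toy model of the documented order: body first, then headers, then footers.

    headers and footers map a (hypothetical) type name to its text, so each
    type contributes exactly one chunk regardless of the page count.
    """
    chunks = list(body_paragraphs)      # 1. body text, in document order
    chunks.extend(headers.values())     # 2. each header type, once
    chunks.extend(footers.values())     # 3. each footer type, once
    return chunks


chunks = extraction_order(
    body_paragraphs=["Page 1 text", "Page 2 text", "Page 3 text"],
    headers={"first": "Title page header", "default": "Report header"},
    footers={"default": "Confidential"},
)
```

In this model a three-page document still yields only two header chunks and one footer chunk, all placed after the body text.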
ImGearWordTextExtractionMethod specifies how the text is extracted. By default, ImageGear uses paragraph break order: words are extracted in the native order in which they appear in the Word file, without regard to their coordinates on the page. Line break order extracts words according to their coordinates on the page.
The ImGearWordTextExtractionOptions.Encoding property specifies the encoding used for the extracted text. By default, ImageGear uses UTF-8, which is the native encoding of the OpenXML package.
ImGearWordLineEndingType specifies which character sequence is used to insert line breaks in the output stream. By default, ImageGear uses the ImGearWordLineEndingType.CRLF sequence.
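To show what the encoding and line-ending settings mean for the bytes in the output stream, here is a minimal Python sketch. It does not call ImageGear. The helper `encode_extracted` is hypothetical, and whether ImageGear also writes a line break after the last line is not stated here, so the sketch only places breaks between lines.

```python
def encode_extracted(lines, line_ending="\r\n", encoding="utf-8"):
    """Join extracted lines with the chosen line ending and encode them.

    The defaults mirror the documented defaults: a CRLF sequence and UTF-8.
    """
    return line_ending.join(lines).encode(encoding)


default_bytes = encode_extracted(["First line", "Second line"])
lf_bytes = encode_extracted(["First line", "Second line"], line_ending="\n")
```

With the defaults, the two lines are separated by the two bytes 0x0D 0x0A. With an LF ending they are separated by the single byte 0x0A.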
- Graphical content (including text boxes and shapes with text) will not be extracted.
- Non-textual numbering bullets are extracted as the '*' character, and hollow ellipse bullets are extracted as the 'o' character. The indentation between the bullet or number and the paragraph text is replaced by a white space in paragraph break order, and by a tab character '\t' in line break order.
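The bullet substitution described above can be sketched as follows. This is an illustrative Python model only. The function `render_list_item` and the bullet kind names are hypothetical and are not part of the ImageGear API.

```python
def render_list_item(text, bullet_kind, line_break_order=False):
    """Model how a bulleted paragraph appears in the extracted text."""
    # Non-textual bullets become '*', hollow ellipse bullets become 'o'.
    marker = {"non_textual": "*", "hollow_ellipse": "o"}[bullet_kind]
    # The indentation becomes a space in paragraph break order
    # and a tab character in line break order.
    separator = "\t" if line_break_order else " "
    return marker + separator + text


paragraph_order = render_list_item("First item", "non_textual")
line_order = render_list_item("First item", "hollow_ellipse", line_break_order=True)
```

Here `paragraph_order` is `"* First item"`, and `line_order` is `"o"` followed by a tab and `"First item"`.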