ImageGear support for text extraction includes:
- Extracting words from a PDF document or specified page
- Enumerating and sorting the words
- Getting word layouts, styles, and characters
To extract text from a PDF, you can use the ExtractText method or ImGearPDFWordFinder. Use ExtractText for single-call plain-text extraction from a PDF, and ImGearPDFWordFinder to retrieve the page, position, style, and other information about each word.
Our PDFContentExtractText sample on GitHub shows how to convert an ImGearDocument into a string using ExtractText.
The options parameter controls how the text is extracted. For example, given the following text in a PDF:
When options = ImGearPDFContextFlags.XY_ORDER, the call to System.Diagnostics.Debug.WriteLine would produce:
The quick brown
fox jumps over
the lazy dog. The green turtle
watched closely
and entertained.
When options = ImGearPDFContextFlags.PDF_ORDER, the same call would produce:
The quick brown fox jumps over the lazy dog. The green turtle watched closely and entertained.
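The difference between the two outputs can be illustrated with a small, library-independent sketch. It is not ImageGear code and does not claim to reproduce how ImageGear implements either flag. It assumes a hypothetical list of word records (text, x, y) stored in content-stream order, with made-up coordinates chosen so that two text blocks share one baseline. The function names pdf_order and xy_order are illustrative only.

```python
from itertools import groupby

# Hypothetical word records in content-stream order: (text, x, y).
# Coordinates are invented for illustration; y grows upward, as in PDF user space.
WORDS = [
    ("The", 72, 700), ("quick", 100, 700), ("brown", 140, 700),
    ("fox", 72, 680), ("jumps", 100, 680), ("over", 140, 680),
    ("the", 72, 660), ("lazy", 100, 660), ("dog.", 140, 660),
    # Second text block; its first line shares a baseline with "the lazy dog."
    ("The", 200, 660), ("green", 230, 660), ("turtle", 270, 660),
    ("watched", 200, 640), ("closely", 250, 640),
    ("and", 200, 620), ("entertained.", 230, 620),
]


def pdf_order(words):
    """Join words in the order they were stored, ignoring their positions."""
    return " ".join(text for text, _x, _y in words)


def xy_order(words):
    """Sort words top-to-bottom, then left-to-right, one output line per baseline."""
    ordered = sorted(words, key=lambda w: (-w[2], w[1]))
    lines = []
    # Words with exactly the same y value form one line in this simplified sketch;
    # a real extractor would likely need a tolerance when comparing baselines.
    for _y, group in groupby(ordered, key=lambda w: w[2]):
        lines.append(" ".join(w[0] for w in group))
    return "\n".join(lines)
```

With these assumed coordinates, xy_order(WORDS) yields the five-line layout shown for XY_ORDER, and pdf_order(WORDS) yields the single line shown for PDF_ORDER.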