ImageGear v26.3 - Updated November 9, 2022
Extract Text from a PDF

ImageGear support for text extraction includes:

  • Extracting words from a PDF document or specified page
  • Enumerating and sorting the words
  • Getting word layouts, styles, and characters

To extract text from a PDF, you can use the ExtractText method or ImGearPDFWordFinder. Use ExtractText for single-call plain-text extraction from a PDF, and use ImGearPDFWordFinder to retrieve the page, position, style, and other information about each word.

Our PDFContentExtractText sample on GitHub shows how to convert an ImGearDocument into a string using ExtractText.

The options parameter controls how the text is extracted. For example, given the following text in a PDF:

[Image: PDF_textextraction]

When options = ImGearPDFContextFlags.XY_ORDER, passing the extracted string to System.Diagnostics.Debug.WriteLine would produce:

The quick brown
fox jumps over
the lazy dog. The green turtle
watched closely
and entertained.

When options = ImGearPDFContextFlags.PDF_ORDER, the same call would produce:

The quick brown fox jumps over the lazy dog. The green turtle watched closely and entertained.
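
The difference between the two outputs can be illustrated with a small, library-independent sketch. It is written in Python only so that it stays self-contained. It does not call ImageGear, and it is not ImageGear's implementation. The Word type, the make_line helper, the coordinates, and the line tolerance are all hypothetical. They were chosen so that the result resembles the outputs above, on the assumption that the sample page holds two text blocks and that the second block starts on the baseline of the first block's last line.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Word:
    """Hypothetical word record; not an ImageGear type."""
    text: str
    x: float  # left edge, in points
    y: float  # baseline, in points; a larger y is higher on the page


def pdf_order_text(words):
    """Join words in the order they were stored, ignoring position."""
    return " ".join(w.text for w in words)


def xy_order_text(words, line_tolerance=2.0):
    """Group words into lines by baseline, then read each line left to right."""
    lines = []  # each entry: (baseline_y, [words on that line])
    for w in sorted(words, key=lambda w: -w.y):  # top of page first
        if lines and abs(lines[-1][0] - w.y) <= line_tolerance:
            lines[-1][1].append(w)
        else:
            lines.append((w.y, [w]))
    return "\n".join(
        " ".join(w.text for w in sorted(line, key=lambda w: w.x))
        for _, line in lines
    )


def make_line(text, x, y, advance=40.0):
    """Hypothetical helper: place each word of `text` on one baseline."""
    return [Word(t, x + i * advance, y) for i, t in enumerate(text.split())]


# Assumed layout: two text blocks, stored one after the other.
words = (
    make_line("The quick brown", 50, 700)
    + make_line("fox jumps over", 50, 680)
    + make_line("the lazy dog.", 50, 660)
    + make_line("The green turtle", 300, 660)  # shares a baseline with the line above
    + make_line("watched closely", 300, 640)
    + make_line("and entertained.", 300, 620)
)

print(xy_order_text(words))   # five lines, grouped by position on the page
print(pdf_order_text(words))  # one line, in stored order
```

In this sketch, position-based ordering merges "the lazy dog." and "The green turtle" into one line because they share a baseline, while stored-order output ignores position entirely. Whether ImageGear groups and separates words in exactly this way is an assumption. Consult the ImGearPDFContextFlags reference for the authoritative behavior.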