ImageGear v26.3 - Updated
    Extract Text from a PDF

    ImageGear support for text extraction includes:

    • Extracting words from a PDF document or specified page
    • Enumerating and sorting the words
    • Getting word layouts, styles, and characters

    To extract text from a PDF, use either the ExtractText method or ImGearPDFWordFinder. Use ExtractText for single-call plain-text extraction, and use ImGearPDFWordFinder when you need the page, position, style, and other information about each word.

    Our PDFContentExtractText sample on GitHub shows how to convert an ImGearDocument into a string using ExtractText.

    The options parameter controls how the text is extracted. For example, given the following text in a PDF:

    [Image: PDF_textextraction]

    When options = ImGearPDFContextFlags.XY_ORDER, the call to System.Diagnostics.Debug.WriteLine would produce:

    The quick brown
    fox jumps over
    the lazy dog. The green turtle
    watched closely
    and entertained.
    

    When options = ImGearPDFContextFlags.PDF_ORDER, the same call would produce:

    The quick brown fox jumps over the lazy dog. The green turtle watched closely and entertained.
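
The difference between the two orderings can be illustrated without the ImageGear API. The following is a minimal, library-independent sketch in Python. It does not call ImageGear, and everything in it is an assumption made for illustration: the `WORDS` list, its coordinates, and the helper names `pdf_order` and `xy_order` are hypothetical. It assumes that PDF_ORDER follows the order in which words are stored in the page content and that XY_ORDER arranges words by their position on the page, which is consistent with the two outputs shown above.

```python
from itertools import groupby

# Hypothetical word records (text, x, y) listed in the order they are stored
# in the page content. The coordinates are invented for this illustration;
# y grows upward, as in PDF user space.
WORDS = [
    ("The", 10, 100), ("quick", 40, 100), ("brown", 80, 100),
    ("fox", 10, 85), ("jumps", 40, 85), ("over", 80, 85),
    ("the", 10, 70), ("lazy", 40, 70), ("dog.", 80, 70),
    ("The", 120, 70), ("green", 150, 70), ("turtle", 190, 70),
    ("watched", 120, 55), ("closely", 170, 55),
    ("and", 120, 40), ("entertained.", 150, 40),
]


def pdf_order(words):
    """Join the words exactly in the order they were stored."""
    return " ".join(text for text, _x, _y in words)


def xy_order(words):
    """Sort words top to bottom, then left to right, one output line per row.

    Simplification: words belong to the same row only if their y values are
    exactly equal. A real layout would need a tolerance.
    """
    ordered = sorted(words, key=lambda w: (-w[2], w[1]))
    rows = groupby(ordered, key=lambda w: w[2])
    return "\n".join(" ".join(w[0] for w in row) for _y, row in rows)


if __name__ == "__main__":
    print(xy_order(WORDS))
    print()
    print(pdf_order(WORDS))
```

With this sample data, `xy_order` yields the five-line output shown for XY_ORDER and `pdf_order` yields the single line shown for PDF_ORDER. In the sketch, reordering the entries of `WORDS` changes the result of `pdf_order` but not the result of `xy_order`, because the latter depends only on the coordinates.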