ImageGear .NET v25.2 - Updated
    Extract Text from a PDF

    ImageGear .NET supports two ways to extract text from a PDF: the ExtractText method and ImGearPDFWordFinder. Use ExtractText for single-call plain-text extraction, and use ImGearPDFWordFinder when you need the page, position, style, and other information about each word.

    The following example shows how to convert an ImGearDocument into a string using ExtractText:

    C#
    // Returns the text extracted from all pages of the PDF as a single string.
    public string ExtractTextFromPDF(ImGearDocument igDocument)
    {
        // An explicit cast throws InvalidCastException if the document is not a PDF,
        // which matches the DirectCast used in the VB.NET version.
        ImGearPDFDocument pdfDocument = (ImGearPDFDocument)igDocument;
        using (MemoryStream textFromPDF = new MemoryStream())
        {
            // Extract text from all pages, starting at page 0.
            pdfDocument.ExtractText(0, igDocument.Pages.Count, ImGearPDFContextFlags.PDF_ORDER, textFromPDF);
            // GetEncoding(0) selects the system default encoding.
            return System.Text.Encoding.GetEncoding(0).GetString(textFromPDF.ToArray());
        }
    }
    
    VB.NET
    ' Returns the text extracted from all pages of the PDF as a single string.
    Public Function ExtractTextFromPDF(igDoc As ImGearDocument) As String
        Dim pdfDocument As ImGearPDFDocument = DirectCast(igDoc, ImGearPDFDocument)
        Using textFromPDF As New MemoryStream()
            ' Extract text from all pages.
            pdfDocument.ExtractText(0, igDoc.Pages.Count, ImGearPDFContextFlags.PDF_ORDER, textFromPDF)
            Return System.Text.Encoding.GetEncoding(0).GetString(textFromPDF.ToArray())
        End Using
    End Function
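
    For per-word details, ImGearPDFWordFinder can be used instead of ExtractText. The following is a minimal sketch rather than a definitive implementation. The WordFinderSketch class and ListWords method are hypothetical names, and the constructor arguments and the AcquireWordList, GetWord, ReleaseWordList, and String members are written from memory of the ImageGear API reference, so treat them as assumptions and verify them against the reference for your installed version.

```csharp
using System.Collections.Generic;
using ImageGear.Formats.PDF;

public static class WordFinderSketch
{
    // Collects the text of every word on every page, in XY order.
    // Assumption: the ImGearPDFWordFinder member names and signatures below
    // should be checked against the API reference for your version.
    public static List<string> ListWords(ImGearPDFDocument pdfDocument)
    {
        var words = new List<string>();
        using (var wordFinder = new ImGearPDFWordFinder(
            pdfDocument,
            ImGearPDFWordFinderVersion.LATEST_VERSION,
            ImGearPDFContextFlags.XY_ORDER))
        {
            for (int page = 0; page < pdfDocument.Pages.Count; page++)
            {
                // Build the word list for this page; the return value is assumed
                // to be the number of words found.
                int wordCount = wordFinder.AcquireWordList(page);
                try
                {
                    for (int i = 0; i < wordCount; i++)
                    {
                        ImGearPDFWord word = wordFinder.GetWord(ImGearPDFContextFlags.XY_ORDER, i);
                        words.Add(word.String);
                    }
                }
                finally
                {
                    // Release this page's word list before moving to the next page.
                    wordFinder.ReleaseWordList(page);
                }
            }
        }
        return words;
    }
}
```

    This sketch keeps only the text of each word. The position and style information mentioned above would be read from the same word object inside the loop.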
    

    The options parameter (the ImGearPDFContextFlags argument passed to ExtractText) controls the order in which the text is extracted. For example, given the following text in a PDF:

     

    When options is ImGearPDFContextFlags.XY_ORDER, writing the extracted string with System.Diagnostics.Debug.WriteLine would produce:

    The quick brown

    fox jumps over

    the lazy dog. The green turtle

    watched closely

    and entertained.

    When options is ImGearPDFContextFlags.PDF_ORDER, the output would be:

    The quick brown fox jumps over the lazy dog. The green turtle watched closely and entertained.
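
    The difference between the two outputs can be illustrated without ImageGear. The sketch below is not the library's algorithm. It assumes that PDF_ORDER follows the order in which text is stored in the file and that XY_ORDER follows position on the page. The ReadingOrderSketch class, its methods, and the fragment coordinates are all made up for illustration.

```csharp
using System.Collections.Generic;
using System.Linq;

public static class ReadingOrderSketch
{
    // Made-up fragments and coordinates (PDF points, origin at bottom-left),
    // listed in the order a content stream might store them.
    public static readonly List<(string Text, double X, double Y)> SampleFragments =
        new List<(string Text, double X, double Y)>
        {
            ("The quick brown", 72.0, 700.0),
            ("fox jumps over", 72.0, 680.0),
            ("the lazy dog.", 72.0, 660.0),
            ("The green turtle", 200.0, 660.0),
            ("watched closely", 72.0, 640.0),
            ("and entertained.", 72.0, 620.0),
        };

    // Stored order: join the fragments exactly as they were listed.
    public static string InStreamOrder(IEnumerable<(string Text, double X, double Y)> fragments) =>
        string.Join(" ", fragments.Select(f => f.Text));

    // Positional order: group fragments into rows by Y (larger Y is higher on
    // the page), then read each row left to right by X.
    public static string InXYOrder(IEnumerable<(string Text, double X, double Y)> fragments) =>
        string.Join("\n", fragments
            .GroupBy(f => f.Y)
            .OrderByDescending(row => row.Key)
            .Select(row => string.Join(" ", row.OrderBy(f => f.X).Select(f => f.Text))));
}
```

    With the sample data, InStreamOrder returns the single line shown for PDF_ORDER, and InXYOrder returns five lines, with "the lazy dog. The green turtle" sharing the third line because both fragments sit at the same height. Grouping by exact Y equality is a simplification. Real page text would need a tolerance when deciding which fragments share a row.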