Extract Text from a PDF

ImageGear .NET support for text extraction includes:

Extracting words from a PDF document or specified page
Enumerating and sorting the words
Getting word layouts, styles, and characters

To extract text from a PDF, use either the ExtractText method or the ImGearPDFWordFinder class. ExtractText performs plain-text extraction in a single call; ImGearPDFWordFinder retrieves the page, position, style, and other information about each word.

The following example shows how to convert an ImGearDocument into a string using ExtractText:

C#

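    using System.IO;
    using ImageGear.Core;
    using ImageGear.Formats.PDF;

    // Returns a string corresponding to the text extracted from the PDF.
    public string ExtractTextFromPDF(ImGearDocument igDocument)
    {
        // A direct cast throws InvalidCastException if the document is not a PDF,
        // instead of failing later with a NullReferenceException.
        ImGearPDFDocument pdfDocument = (ImGearPDFDocument)igDocument;
        using (MemoryStream textFromPDF = new MemoryStream())
        {
            // Extract text from all pages.
            pdfDocument.ExtractText(0, igDocument.Pages.Count, ImGearPDFContextFlags.PDF_ORDER, textFromPDF);
            // Decode the extracted bytes using the platform default encoding (code page 0).
            return System.Text.Encoding.GetEncoding(0).GetString(textFromPDF.ToArray());
        }
    }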

VB.NET

    Imports System.IO
    Imports ImageGear.Core
    Imports ImageGear.Formats.PDF

    ' Returns a string corresponding to the text extracted from the PDF.
    Public Function ExtractTextFromPDF(igDoc As ImGearDocument) As String
        Dim pdfDocument As ImGearPDFDocument = DirectCast(igDoc, ImGearPDFDocument)
        Using textFromPDF As New MemoryStream()
            ' Extract text from all pages.
            pdfDocument.ExtractText(0, igDoc.Pages.Count, ImGearPDFContextFlags.PDF_ORDER, textFromPDF)
            ' Decode the extracted bytes using the platform default encoding (code page 0).
            Return System.Text.Encoding.GetEncoding(0).GetString(textFromPDF.ToArray())
        End Using
    End Function
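
The last line of both versions decodes the stream with System.Text.Encoding.GetEncoding(0). Code page 0 selects the platform default encoding: the system ANSI code page on .NET Framework, and UTF-8 on .NET Core and later. The following is a minimal, self-contained sketch of that decoding step only. It does not call ImageGear; the class name StreamDecodingSketch and the Decode helper are hypothetical, and the bytes written to the stream stand in for whatever ExtractText would write.

```csharp
using System;
using System.IO;
using System.Text;

public static class StreamDecodingSketch
{
    // Hypothetical helper: decodes everything written to the stream with the given encoding.
    public static string Decode(MemoryStream stream, Encoding encoding)
    {
        // ToArray copies all written bytes, regardless of the stream's current position.
        return encoding.GetString(stream.ToArray());
    }

    public static void Main()
    {
        using (MemoryStream stream = new MemoryStream())
        {
            // Stand-in for ExtractText: write some encoded text into the stream.
            byte[] bytes = Encoding.GetEncoding(0).GetBytes("The quick brown fox");
            stream.Write(bytes, 0, bytes.Length);

            // Decode with the same (platform default) encoding used above.
            Console.WriteLine(Decode(stream, Encoding.GetEncoding(0)));
        }
    }
}
```

Because the default encoding differs between platforms, text outside the ASCII range may decode differently on different machines. If the extracted text looks garbled, passing an explicit encoding instead of code page 0 is one thing to try.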

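The options parameter of ExtractText (the ImGearPDFContextFlags argument, PDF_ORDER in the example above) controls how the text is extracted. The two outputs below come from the same sample PDF text.

When options = ImGearPDFContextFlags.XY_ORDER, passing the extracted string to System.Diagnostics.Debug.WriteLine would produce: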

The quick brown

fox jumps over

the lazy dog. The green turtle

watched closely

and entertained.

When options = ImGearPDFContextFlags.PDF_ORDER, the same call would produce:

The quick brown fox jumps over the lazy dog. The green turtle watched closely and entertained.
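
For the per-word route mentioned at the start of this topic, the following is a rough sketch of enumerating words with ImGearPDFWordFinder. It is not taken from this page: the constructor arguments and the AcquireWordList, GetWord, ReleaseWordList, and String members are assumptions about the ImageGear API and should be checked against the API reference before use. The class name WordFinderSketch and the ListWords helper are hypothetical. The sketch needs the ImageGear assemblies and an initialized PDF component to run.

```csharp
using System.Collections.Generic;
using ImageGear.Formats.PDF;

public static class WordFinderSketch
{
    // Hypothetical helper: collects the text of every word on every page.
    public static List<string> ListWords(ImGearPDFDocument pdfDocument)
    {
        List<string> words = new List<string>();

        // Assumed constructor: document, word finder version, ordering flags.
        using (ImGearPDFWordFinder wordFinder = new ImGearPDFWordFinder(
            pdfDocument,
            ImGearPDFWordFinderVersion.LATEST_VERSION,
            ImGearPDFContextFlags.XY_ORDER))
        {
            for (int page = 0; page < pdfDocument.Pages.Count; page++)
            {
                // Assumed to build the word list for the page and return the word count.
                int wordCount = wordFinder.AcquireWordList(page);
                try
                {
                    for (int i = 0; i < wordCount; i++)
                    {
                        // Assumed to return the i-th word in the requested order.
                        ImGearPDFWord word = wordFinder.GetWord(ImGearPDFContextFlags.XY_ORDER, i);
                        words.Add(word.String);
                    }
                }
                finally
                {
                    // Release the page's word list even if reading a word fails.
                    wordFinder.ReleaseWordList(page);
                }
            }
        }

        return words;
    }
}
```

Compared with ExtractText, this trades a single call for a loop over pages and words, which is where layout, style, and character details for each word would be read.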
