ImageGear .NET
Extract Text from a PDF

ImageGear .NET support for text extraction includes:

To extract text from a PDF you can use the ExtractText method. The following example shows how to convert an ImGearDocument into a string using ExtractText:

C#
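// Returns a string corresponding to the text extracted from the PDF.
public string ExtractTextFromPDF(ImGearDocument igDocument)
{
    // The "as" cast yields null when the document is not a PDF document.
    ImGearPDFDocument pdfDocument = igDocument as ImGearPDFDocument;
    if (pdfDocument == null)
    {
        throw new System.ArgumentException("The document is not a PDF document.", "igDocument");
    }

    using (MemoryStream textFromPDF = new MemoryStream())
    {
        // Extract text from all pages.
        pdfDocument.ExtractText(0, igDocument.Pages.Count, ImGearPDFContextFlags.PDF_ORDER, textFromPDF);
        // Decode the extracted bytes with the default encoding (code page 0).
        return System.Text.Encoding.GetEncoding(0).GetString(textFromPDF.ToArray());
    }
}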
VB.NET
' Returns a string corresponding to the text extracted from the PDF.
Public Function ExtractTextFromPDF(igDoc As ImGearDocument) As String
    Dim pdfDocument As ImGearPDFDocument = DirectCast(igDoc, ImGearPDFDocument)
    Using textFromPDF As New MemoryStream()
        ' Extract text from all pages.
        pdfDocument.ExtractText(0, igDoc.Pages.Count, ImGearPDFContextFlags.PDF_ORDER, textFromPDF)
        Return System.Text.Encoding.GetEncoding(0).GetString(textFromPDF.ToArray())
    End Using
End Function

The options parameter (the ImGearPDFContextFlags value passed as the third argument to ExtractText) controls the order in which the text is extracted. For example, given the following text in a PDF:

 

When options is ImGearPDFContextFlags.XY_ORDER, passing the extracted text to System.Diagnostics.Debug.WriteLine would produce:

The quick brown

fox jumps over

the lazy dog. The green turtle

watched closely

and entertained.

When options is ImGearPDFContextFlags.PDF_ORDER, the same call would produce:

The quick brown fox jumps over the lazy dog. The green turtle watched closely and entertained.
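
The comparison above can be reproduced with a small helper that takes the ordering flag as a parameter. The following is a minimal sketch, not a definitive implementation: it reuses only the ExtractText call and the two ImGearPDFContextFlags values shown on this page. The names PdfTextOrderExample, ExtractTextWithOrder, and DumpBothOrders are hypothetical, and the ImageGear.Formats.PDF using directive is an assumption about where the PDF types live in your installed version.

```csharp
using System.Diagnostics;
using System.IO;
using System.Text;
using ImageGear.Formats.PDF; // assumed namespace for ImGearPDFDocument and ImGearPDFContextFlags

public static class PdfTextOrderExample
{
    // Hypothetical helper: extracts text from all pages using the given ordering flag.
    public static string ExtractTextWithOrder(ImGearPDFDocument pdfDocument, ImGearPDFContextFlags order)
    {
        using (MemoryStream textFromPDF = new MemoryStream())
        {
            // Same call as in ExtractTextFromPDF, with the flag supplied by the caller.
            pdfDocument.ExtractText(0, pdfDocument.Pages.Count, order, textFromPDF);
            return Encoding.GetEncoding(0).GetString(textFromPDF.ToArray());
        }
    }

    // Hypothetical helper: writes the text in both orders to the debug output for comparison.
    public static void DumpBothOrders(ImGearPDFDocument pdfDocument)
    {
        Debug.WriteLine(ExtractTextWithOrder(pdfDocument, ImGearPDFContextFlags.XY_ORDER));
        Debug.WriteLine(ExtractTextWithOrder(pdfDocument, ImGearPDFContextFlags.PDF_ORDER));
    }
}
```

Calling DumpBothOrders on a document that is already loaded writes the XY_ORDER text first and the PDF_ORDER text second, which makes differences in line breaks easy to spot in the debugger's output window.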

 

 


©2017. Accusoft Corporation. All Rights Reserved.
