Extract Text from a PDF
ImageGear for .NET support for text extraction includes:
To extract text from a PDF you can use the ExtractText method. The following example shows how to convert an ImGearDocument into a string using ExtractText:
C#

    using System.IO;
    using ImageGear.Core;
    using ImageGear.Formats.PDF;

    // Returns a string corresponding to the text extracted from the PDF.
    public string ExtractTextFromPDF(ImGearDocument igDocument)
    {
        // The cast yields null if igDocument is not a PDF document.
        ImGearPDFDocument pdfDocument = igDocument as ImGearPDFDocument;
        using (MemoryStream textFromPDF = new MemoryStream())
        {
            // Extract text from all pages.
            pdfDocument.ExtractText(0, igDocument.Pages.Count, ImGearPDFContextFlags.PDF_ORDER, textFromPDF);
            // Decode the extracted bytes with the default encoding.
            return System.Text.Encoding.GetEncoding(0).GetString(textFromPDF.ToArray());
        }
    }
VB.NET

    Imports System.IO
    Imports ImageGear.Core
    Imports ImageGear.Formats.PDF

    ' Returns a string corresponding to the text extracted from the PDF.
    Public Function ExtractTextFromPDF(igDoc As ImGearDocument) As String
        ' DirectCast throws InvalidCastException if igDoc is not a PDF document.
        Dim pdfDocument As ImGearPDFDocument = DirectCast(igDoc, ImGearPDFDocument)
        Using textFromPDF As New MemoryStream()
            ' Extract text from all pages.
            pdfDocument.ExtractText(0, igDoc.Pages.Count, ImGearPDFContextFlags.PDF_ORDER, textFromPDF)
            ' Decode the extracted bytes with the default encoding.
            Return System.Text.Encoding.GetEncoding(0).GetString(textFromPDF.ToArray())
        End Using
    End Function
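Both examples finish by decoding the stream's bytes with Encoding.GetEncoding(0), which selects the platform's default encoding. As a minimal sketch, the decode step can be isolated in a small helper so that a different encoding can be passed in when the default does not match the extracted bytes. StreamText and Decode are hypothetical names and are not part of ImageGear. Which encoding ExtractText writes is not stated here, so the fallback simply mirrors the examples above.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical helper, not part of ImageGear.
public static class StreamText
{
    // Decodes the full contents of the stream into a string.
    // When no encoding is supplied, falls back to Encoding.GetEncoding(0),
    // the same default-encoding choice used in the examples above.
    public static string Decode(MemoryStream stream, Encoding encoding = null)
    {
        if (stream == null)
        {
            throw new ArgumentNullException(nameof(stream));
        }

        Encoding effective = encoding ?? Encoding.GetEncoding(0);

        // ToArray copies every byte written so far, regardless of Position.
        return effective.GetString(stream.ToArray());
    }
}
```

With this helper, the return statement in ExtractTextFromPDF could become `return StreamText.Decode(textFromPDF);`, or pass an explicit encoding as the second argument if the output needs one.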
The options parameter, passed as the third argument to ExtractText in the examples above, controls how the text is extracted. For example, given the following text in a PDF:
When options = ImGearPDFContextFlags.XY_ORDER, writing the extracted string with System.Diagnostics.Debug.WriteLine would produce:
The quick brown
fox jumps over
the lazy dog. The green turtle
watched closely
and entertained.
When options = ImGearPDFContextFlags.PDF_ORDER, the same call would produce:
The quick brown fox jumps over the lazy dog. The green turtle watched closely and entertained.
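To compare the two orderings on a real document, the flag can be made a parameter. The following is a sketch only: it reuses the ExtractText call, the XY_ORDER and PDF_ORDER flags, and the decoding step exactly as they appear on this page. TextOrderComparison, ExtractWithOrder, and WriteBothOrders are hypothetical names, and the ImageGear.Core and ImageGear.Formats.PDF namespaces are assumed to be where the ImageGear types live.

```csharp
using System;
using System.Diagnostics;
using System.IO;
using System.Text;
using ImageGear.Core;        // assumed namespace for ImGearDocument
using ImageGear.Formats.PDF; // assumed namespace for ImGearPDFDocument and ImGearPDFContextFlags

// Hypothetical helper class, not part of ImageGear.
public static class TextOrderComparison
{
    // Extracts the text of all pages using the given ordering flag.
    public static string ExtractWithOrder(ImGearDocument igDocument, ImGearPDFContextFlags options)
    {
        // The cast yields null when the document is not a PDF document.
        ImGearPDFDocument pdfDocument = igDocument as ImGearPDFDocument;
        if (pdfDocument == null)
        {
            throw new ArgumentException("Expected an ImGearPDFDocument.", nameof(igDocument));
        }

        using (MemoryStream textFromPDF = new MemoryStream())
        {
            pdfDocument.ExtractText(0, igDocument.Pages.Count, options, textFromPDF);
            return Encoding.GetEncoding(0).GetString(textFromPDF.ToArray());
        }
    }

    // Writes the same document's text in both orders so they can be compared.
    public static void WriteBothOrders(ImGearDocument igDocument)
    {
        Debug.WriteLine(ExtractWithOrder(igDocument, ImGearPDFContextFlags.XY_ORDER));
        Debug.WriteLine(ExtractWithOrder(igDocument, ImGearPDFContextFlags.PDF_ORDER));
    }
}
```

Calling WriteBothOrders on an already loaded PDF document writes both extractions to the debug output, which makes it easy to see which ordering suits a given page layout.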