ImageGear .NET support for text extraction includes:
- Extracting words from a PDF document or specified page
- Enumerating and sorting the words
- Getting word layouts, styles, and characters
To extract text from a PDF, use the ExtractText method of ImGearPDFDocument. The following example shows how to convert an ImGearDocument into a string using ExtractText:
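C#
    using System;
    using System.IO;
    using ImageGear.Core;
    using ImageGear.Formats.PDF;

    // Returns a string corresponding to the text extracted from the PDF.
    public string ExtractTextFromPDF(ImGearDocument igDocument)
    {
        // The "as" cast yields null when the document is not a PDF, so check before use.
        ImGearPDFDocument pdfDocument = igDocument as ImGearPDFDocument;
        if (pdfDocument == null)
            throw new ArgumentException("Document is not a PDF.", nameof(igDocument));
        using (MemoryStream textFromPDF = new MemoryStream())
        {
            // Extract text from all pages.
            pdfDocument.ExtractText(0, igDocument.Pages.Count, ImGearPDFContextFlags.PDF_ORDER, textFromPDF);
            // Decode the extracted bytes using the system default encoding.
            return System.Text.Encoding.GetEncoding(0).GetString(textFromPDF.ToArray());
        }
    }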
VB.NET
    ' Returns a string corresponding to the text extracted from the PDF.
    Public Function ExtractTextFromPDF(igDoc As ImGearDocument) As String
        Dim pdfDocument As ImGearPDFDocument = DirectCast(igDoc, ImGearPDFDocument)
        Using textFromPDF As New MemoryStream()
            ' Extract text from all pages.
            pdfDocument.ExtractText(0, igDoc.Pages.Count, ImGearPDFContextFlags.PDF_ORDER, textFromPDF)
            Return System.Text.Encoding.GetEncoding(0).GetString(textFromPDF.ToArray())
        End Using
    End Function
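The same call can be narrowed to a single page, which the feature list at the top mentions. The following is a minimal sketch, not an official sample. It assumes that the first two arguments of ExtractText are a zero-based start page and a page count, as the whole-document call suggests, and that the types live in the ImageGear.Formats.PDF namespace. The class name PdfTextSketch, the helper ExtractTextFromPage, and its bounds check are hypothetical and are not part of ImageGear.

```csharp
using System;
using System.IO;
using System.Text;
using ImageGear.Formats.PDF; // assumed namespace for ImGearPDFDocument and ImGearPDFContextFlags

public static class PdfTextSketch
{
    // Hypothetical helper: returns the text of a single page of a PDF document.
    // Assumes ExtractText(startPage, pageCount, flags, stream) with a zero-based startPage.
    public static string ExtractTextFromPage(ImGearPDFDocument pdfDocument, int pageIndex)
    {
        if (pdfDocument == null)
            throw new ArgumentNullException(nameof(pdfDocument));
        if (pageIndex < 0 || pageIndex >= pdfDocument.Pages.Count)
            throw new ArgumentOutOfRangeException(nameof(pageIndex));

        using (MemoryStream textFromPage = new MemoryStream())
        {
            // Extract exactly one page, starting at pageIndex.
            pdfDocument.ExtractText(pageIndex, 1, ImGearPDFContextFlags.PDF_ORDER, textFromPage);

            // Decode with the system default encoding, as the whole-document example does.
            return Encoding.GetEncoding(0).GetString(textFromPage.ToArray());
        }
    }
}
```

For example, `PdfTextSketch.ExtractTextFromPage(pdfDocument, 0)` would return the text of the first page under these assumptions.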
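The options parameter (the third argument to ExtractText, an ImGearPDFContextFlags value) controls the order in which the text is extracted. For example, consider a PDF page that contains the two sentences shown in the outputs below.
When options = ImGearPDFContextFlags.XY_ORDER, passing the extracted string to System.Diagnostics.Debug.WriteLine would produce: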
The quick brown
fox jumps over
the lazy dog. The green turtle
watched closely
and entertained.
When options = ImGearPDFContextFlags.PDF_ORDER, the output would be:
The quick brown fox jumps over the lazy dog. The green turtle watched closely and entertained.
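To see the difference on your own document, you can extract the text once with each flag and write both results to the debug output. The sketch below uses only the ExtractText call and the two flags shown above. The class ExtractionOrderSketch and the methods Extract and CompareOrders are hypothetical names introduced for illustration, and the ImageGear.Formats.PDF namespace is an assumption.

```csharp
using System;
using System.Diagnostics;
using System.IO;
using System.Text;
using ImageGear.Formats.PDF; // assumed namespace for ImGearPDFDocument and ImGearPDFContextFlags

public static class ExtractionOrderSketch
{
    // Hypothetical helper: extracts the text of all pages using the given ordering flag.
    private static string Extract(ImGearPDFDocument pdfDocument, ImGearPDFContextFlags options)
    {
        using (MemoryStream buffer = new MemoryStream())
        {
            pdfDocument.ExtractText(0, pdfDocument.Pages.Count, options, buffer);
            return Encoding.GetEncoding(0).GetString(buffer.ToArray());
        }
    }

    // Writes the extracted text in both orders to the debug output for side-by-side comparison.
    public static void CompareOrders(ImGearPDFDocument pdfDocument)
    {
        if (pdfDocument == null)
            throw new ArgumentNullException(nameof(pdfDocument));

        Debug.WriteLine("XY_ORDER:");
        Debug.WriteLine(Extract(pdfDocument, ImGearPDFContextFlags.XY_ORDER));

        Debug.WriteLine("PDF_ORDER:");
        Debug.WriteLine(Extract(pdfDocument, ImGearPDFContextFlags.PDF_ORDER));
    }
}
```

Debug.WriteLine only emits output in builds where the DEBUG symbol is defined, so run this in a debug configuration or swap in Console.WriteLine.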