Extract Text from a PDF
Before working with a PDF document, make sure to initialize the PDF component (see Getting Started with PDF).
Refer to the Extract Text from a PDF Sample for complete sample code that illustrates how to use this capability.
The simplest way to extract text from a PDF is to call the IG_PDF_text_extract method. It reads a PDF and writes the text of the requested pages to a TXT file.
C

```c
LPSTR fileIn = "input.pdf";
LPSTR fileOut = "output.txt";
// The page index starts at 1.
UINT startPage = 1;
UINT pageCount = 1;
IG_PDF_text_extract(fileIn, fileOut, startPage, pageCount);
```
C++

```cpp
std::string fileIn = "input.pdf";
std::string fileOut = "output.txt";
// The page index starts at 1.
UINT startPage = 1;
UINT pageCount = 1;
IG_PDF_text_extract((LPSTR)fileIn.c_str(), (LPSTR)fileOut.c_str(), startPage, pageCount);
```
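Once the text file has been written, it can be read back with the standard library for further processing. The following is a minimal sketch, not part of the ImageGear API: `readTextFile` and `splitWords` are hypothetical helper names, and the code assumes the output file holds plain text that can be read through `char` streams (the file's encoding is not stated here).

```cpp
#include <fstream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: loads a whole text file into a string.
// Returns an empty string if the file cannot be opened.
std::string readTextFile(const std::string& path)
{
    std::ifstream file(path.c_str());
    std::ostringstream contents;
    contents << file.rdbuf();
    return contents.str();
}

// Hypothetical helper: splits text into whitespace-separated tokens.
std::vector<std::string> splitWords(const std::string& text)
{
    std::istringstream in(text);
    std::vector<std::string> words;
    std::string word;
    while (in >> word)
    {
        words.push_back(word);
    }
    return words;
}
```

For example, `splitWords(readTextFile("output.txt"))` would return the tokens of the file written by the snippet above, assuming the extraction succeeded.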
If you want to manipulate the text in memory, then you should use a wordfinder to extract the text.
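First, open the PDF and retrieve its document handle.

C

```c
LPSTR fileIn = "input.pdf";
HMIGEAR document = 0;
HIG_PDF_DOC hDoc = 0;
// Create a multi-page document object.
IG_mpi_create(&document, 0);
// Open the PDF read-only and associate it with the document object.
IG_mpi_file_open(fileIn, document, IG_FORMAT_PDF, IG_MP_OPENMODE_READONLY);
// Retrieve the PDF document handle.
IG_mpi_info_get(document, IG_MP_DOCUMENT, &hDoc, sizeof(hDoc));
```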
C++

```cpp
std::string fileIn = "input.pdf";
HMIGEAR document = 0;
IG_mpi_create(&document, 0);
IG_mpi_file_open((LPSTR)fileIn.c_str(), document, IG_FORMAT_PDF, IG_MP_OPENMODE_READONLY);
HIG_PDF_DOC hDoc = 0;
IG_mpi_info_get(document, IG_MP_DOCUMENT, &hDoc, sizeof(hDoc));
```
Next, create a wordfinder for the document.

C

```c
// Specifies the type of character at each position. If set to NULL the default encoding is
// used.
WORD encodingInfo = NULL;
// Glyph names to use in conjunction with encodingInfo. If set to NULL the default encoding
// vector is used.
char* encodingVector = NULL;
// The ligature table to be used in conjunction with the encodingVector and encodingInfo.
// If set to NULL the default ligature table is used.
char* ligatureTable = NULL;
// IG_PDF_WF_LATEST_VERSION is the best option unless you are using an older version
// of Acrobat.
SHORT algorithmVersion = IG_PDF_WF_LATEST_VERSION;
// Determines the sort order.
WORD flags = IG_PDF_XY_SORT;
// Pointer to user-supplied data to pass to the newly created word finder. Set to NULL.
LPVOID clientData = NULL;
HIG_PDF_WORDFINDER wordFinder;
IG_PDF_doc_create_wordfinder(hDoc, encodingInfo, encodingVector, ligatureTable,
    algorithmVersion, flags, clientData, &wordFinder);
```
C++

```cpp
// Specifies the type of character at each position. If set to NULL the default encoding is
// used.
WORD encodingInfo = NULL;
// Glyph names to use in conjunction with encodingInfo. If set to NULL the default encoding
// vector is used.
char* encodingVector = NULL;
// The ligature table to be used in conjunction with the encodingVector and encodingInfo.
// If set to NULL the default ligature table is used.
char* ligatureTable = NULL;
// IG_PDF_WF_LATEST_VERSION is the best option unless you are using an older version
// of Acrobat.
SHORT algorithmVersion = IG_PDF_WF_LATEST_VERSION;
// Determines the sort order.
WORD flags = IG_PDF_XY_SORT;
// Pointer to user-supplied data to pass to the newly created word finder. Set to NULL.
LPVOID clientData = NULL;
HIG_PDF_WORDFINDER wordFinder;
IG_PDF_doc_create_wordfinder(hDoc, encodingInfo, encodingVector, ligatureTable,
    algorithmVersion, flags, clientData, &wordFinder);
```
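Finally, retrieve each word and copy its text into a buffer.

C

```c
HIG_PDF_WORD word;
WORD length;
char* buffer;
int i;
// This example reads the first 10 words; use the actual word count in real code.
for (i = 0; i < 10; i++)
{
    IG_PDF_wordfinder_get_word(wordFinder, IG_PDF_XY_SORT, i, &word);
    IG_PDF_word_get_length(word, &length);
    // Allocate a zero-filled buffer with room for a terminator (requires <stdlib.h>).
    buffer = (char*)calloc(length + 2, 1);
    if (buffer != NULL)
    {
        IG_PDF_word_get_string(word, buffer, length);
        // Here you can do what you want with each word.
        free(buffer);
    }
    // Clean up each word.
    IG_PDF_word_delete(word);
}
```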
C++

```cpp
// wordCount is assumed to hold the number of words available from the wordfinder;
// it is not defined in this snippet. Requires <vector>.
for (int i = 0; i < wordCount; i++)
{
    HIG_PDF_WORD word;
    IG_PDF_wordfinder_get_word(wordFinder, IG_PDF_XY_SORT, i, &word);
    WORD length;
    IG_PDF_word_get_length(word, &length);
    // Zero-filled buffer with room for a terminator.
    std::vector<char> buffer(length + 2, 0);
    IG_PDF_word_get_string(word, &buffer[0], length);
    // Here you can do what you want with each word.
    // Clean up each word.
    IG_PDF_word_delete(word);
}
```
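As one illustration of what can be done with each word inside the loop, the sketch below collects the word buffers into a single space-separated string. It uses only the standard library; `appendWord` is a hypothetical helper, not an ImageGear function, and it assumes each buffer holds single-byte text padded with zeros, as in the C++ loop above (the encoding of the returned text is not stated here).

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical helper: appends the text held in a zero-padded buffer to the
// running text, inserting one space between consecutive words.
void appendWord(std::string& text, const std::vector<char>& buffer)
{
    // Copy characters up to the first zero byte (or the whole buffer if none).
    std::string word(buffer.begin(),
                     std::find(buffer.begin(), buffer.end(), '\0'));
    if (word.empty())
    {
        return; // Nothing to add for an empty word.
    }
    if (!text.empty())
    {
        text += ' ';
    }
    text += word;
}
```

Calling `appendWord(text, buffer)` at the "do what you want with each word" step, with a `std::string text` declared before the loop, would leave the collected words in `text` once the loop finishes.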
To learn more about these word objects, you may want to use these two methods:
If you would prefer to access the text through each PDE element, you can do that as well. Refer to the AddNewPageWithImage sample for a demonstration of manipulating PDE elements. For each HIG_PDE_TEXT you can use a variety of methods, such as IG_PDE_text_get_text_unicode.