Extract Text from a PDF

Before working with a PDF document, make sure to initialize the PDF component (see Getting Started with PDF).

Refer to the Extract Text from a PDF Sample for complete sample code that illustrates how to use this capability.

The simplest way to extract text from a PDF is to call IG_PDF_text_extract, which reads a PDF and writes its text to a TXT file.

C

LPSTR fileIn = "input.pdf";
LPSTR fileOut = "output.txt";
// The page index starts at 1.
UINT startPage = 1;
UINT pageCount = 1;
IG_PDF_text_extract(fileIn, fileOut, startPage, pageCount);

C++

std::string fileIn = "input.pdf";
std::string fileOut = "output.txt";
// The page index starts at 1.
UINT startPage = 1;
UINT pageCount = 1;
IG_PDF_text_extract((LPSTR)fileIn.c_str(), (LPSTR)fileOut.c_str(), startPage, pageCount);

To work with the text in memory instead of writing it to a file, extract it with a wordfinder.

Open the PDF document and load it into an HIG_PDF_DOC:

C

LPSTR fileIn = "input.pdf";
HMIGEAR document = 0;
HIG_PDF_DOC hDoc = 0;
IG_mpi_create(&document, 0);
IG_mpi_file_open(fileIn, document, IG_FORMAT_PDF, IG_MP_OPENMODE_READONLY);
IG_mpi_info_get(document, IG_MP_DOCUMENT, &hDoc, sizeof(hDoc));

C++

std::string fileIn = "input.pdf";
HMIGEAR document = 0;
IG_mpi_create(&document, 0);
IG_mpi_file_open((LPSTR)fileIn.c_str(), document, IG_FORMAT_PDF, IG_MP_OPENMODE_READONLY);
HIG_PDF_DOC hDoc = 0;
IG_mpi_info_get(document, IG_MP_DOCUMENT, &hDoc, sizeof(hDoc));

Create a wordfinder for that PDF:

C

// Specifies the type of character at each position. If set to NULL, the default encoding
// is used.
WORD encodingInfo = NULL;
// Glyph names to use in conjunction with encodingInfo. If set to NULL, the default encoding
// vector is used.
char* encodingVector = NULL;
// The ligature table to be used in conjunction with encodingVector and encodingInfo.
// If set to NULL, the default ligature table is used.
char* ligatureTable = NULL;
// IG_PDF_WF_LATEST_VERSION is the best option unless you are using an older version
// of Acrobat.
SHORT algorithmVersion = IG_PDF_WF_LATEST_VERSION;
// Determines the sort order.
WORD flags = IG_PDF_XY_SORT;
// Pointer to user-supplied data to pass to the newly created wordfinder. Set to NULL.
LPVOID clientData = NULL;
HIG_PDF_WORDFINDER wordFinder;
IG_PDF_doc_create_wordfinder(hDoc, encodingInfo, encodingVector, ligatureTable,
       algorithmVersion, flags, clientData, &wordFinder);

C++

// Specifies the type of character at each position. If set to NULL, the default encoding
// is used.
WORD encodingInfo = NULL;
// Glyph names to use in conjunction with encodingInfo. If set to NULL, the default encoding
// vector is used.
char* encodingVector = NULL;
// The ligature table to be used in conjunction with encodingVector and encodingInfo.
// If set to NULL, the default ligature table is used.
char* ligatureTable = NULL;
// IG_PDF_WF_LATEST_VERSION is the best option unless you are using an older version
// of Acrobat.
SHORT algorithmVersion = IG_PDF_WF_LATEST_VERSION;
// Determines the sort order.
WORD flags = IG_PDF_XY_SORT;
// Pointer to user-supplied data to pass to the newly created wordfinder. Set to NULL.
LPVOID clientData = NULL;
HIG_PDF_WORDFINDER wordFinder;
IG_PDF_doc_create_wordfinder(hDoc, encodingInfo, encodingVector, ligatureTable,
       algorithmVersion, flags, clientData, &wordFinder);

Get the number of words on the page so that we can iterate through them:

C

LONG wordCount;
LONG pageNumber = 0;
// A nonzero error count means the call failed.
AT_ERRCOUNT errCount = IG_PDF_wordfinder_acquire_wordlist(wordFinder, pageNumber, &wordCount);

C++

LONG wordCount;
LONG pageNumber = 0;
// A nonzero error count means the call failed.
AT_ERRCOUNT errCount = IG_PDF_wordfinder_acquire_wordlist(wordFinder, pageNumber, &wordCount);

Then iterate through each word:

C

HIG_PDF_WORD word;
WORD length;
char* buffer;
int i;
// Loop over every word reported by IG_PDF_wordfinder_acquire_wordlist.
for (i = 0; i < wordCount; i++)
{
    IG_PDF_wordfinder_get_word(wordFinder, IG_PDF_XY_SORT, i, &word);

    IG_PDF_word_get_length(word, &length);

    // Zero-filled buffer with room for the terminator (calloc and free require <stdlib.h>).
    buffer = (char*)calloc(length + 2, 1);
    if (buffer != NULL)
    {
        IG_PDF_word_get_string(word, buffer, length);

        // Here you can do what you want with each word.

        free(buffer);
    }

    // Clean up each word.
    IG_PDF_word_delete(word);
}

C++

for (int i = 0; i < wordCount; i++)
{
    HIG_PDF_WORD word;
    IG_PDF_wordfinder_get_word(wordFinder, IG_PDF_XY_SORT, i, &word);

    WORD length;
    IG_PDF_word_get_length(word, &length);

    // Zero-filled buffer with room for the terminator (requires <vector>).
    std::vector<char> buffer(length + 2, 0);
    IG_PDF_word_get_string(word, &buffer[0], length);

    // Here you can do what you want with each word.

    // Clean up each word.
    IG_PDF_word_delete(word);
}
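
A common next step is to collect the extracted words into a single string. The following is a minimal sketch in plain C that makes no ImageGear calls. TextBuffer, text_buffer_append_word, and text_buffer_free are hypothetical names introduced here for illustration, not part of the ImageGear API. The sketch assumes that each word arrives as a NUL-terminated string, which holds when the word buffer is zero-filled and slightly larger than the word length.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper type (not an ImageGear type): a growable,
   NUL-terminated text buffer. Initialize it to { NULL, 0, 0 }. */
typedef struct
{
    char*  data;     /* NULL until the first successful append */
    size_t length;   /* characters stored, excluding the terminator */
    size_t capacity; /* bytes currently allocated for data */
} TextBuffer;

/* Appends one word, inserting a single space between consecutive words.
   Returns 1 on success, or 0 if memory could not be allocated. */
static int text_buffer_append_word(TextBuffer* text, const char* word)
{
    size_t wordLength = strlen(word);
    size_t separator = (text->length > 0) ? 1 : 0;
    size_t needed = text->length + separator + wordLength + 1;

    if (needed > text->capacity)
    {
        size_t newCapacity = (text->capacity > 0) ? text->capacity : 64;
        char* newData;

        /* Double the capacity until the new word fits. */
        while (newCapacity < needed)
        {
            newCapacity *= 2;
        }

        newData = (char*)realloc(text->data, newCapacity);
        if (newData == NULL)
        {
            return 0; /* The existing contents are left untouched. */
        }
        text->data = newData;
        text->capacity = newCapacity;
    }

    if (separator)
    {
        text->data[text->length++] = ' ';
    }

    /* Copy the word together with its terminating NUL. */
    memcpy(text->data + text->length, word, wordLength + 1);
    text->length += wordLength;
    return 1;
}

/* Releases the storage and resets the buffer to its empty state. */
static void text_buffer_free(TextBuffer* text)
{
    free(text->data);
    text->data = NULL;
    text->length = 0;
    text->capacity = 0;
}
```

To use it, declare a TextBuffer initialized to { NULL, 0, 0 } before the loop, pass each word's character buffer to text_buffer_append_word at the point where the loop processes the word, and call text_buffer_free once the combined text is no longer needed.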

To learn more about these word objects, you may want to use these two methods:

IG_PDF_word_get_char_style - get the color, font, or style
IG_PDF_word_get_quad - get the location of the word
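
When working with word locations, it is often convenient to reduce one or more quads to a single bounding rectangle, for example to highlight a word. The sketch below shows that calculation in plain C. Point2D, Quad, Bounds, and quads_bounds are hypothetical stand-ins introduced for illustration. They are not the ImageGear types, and the four-corner layout with floating-point coordinates is an assumption. Check the IG_PDF_word_get_quad reference for the actual structure and units.

```c
#include <stddef.h>

/* Hypothetical stand-ins, not the ImageGear types. */
typedef struct { double x; double y; } Point2D;
typedef struct { Point2D corner[4]; } Quad;
typedef struct { double minX; double minY; double maxX; double maxY; } Bounds;

/* Returns the smallest axis-aligned rectangle that contains every corner
   of every quad. The caller must pass at least one quad. */
static Bounds quads_bounds(const Quad* quads, size_t count)
{
    Bounds bounds;
    size_t q;
    size_t c;

    /* Start from the first corner, then widen the rectangle as needed. */
    bounds.minX = bounds.maxX = quads[0].corner[0].x;
    bounds.minY = bounds.maxY = quads[0].corner[0].y;

    for (q = 0; q < count; q++)
    {
        for (c = 0; c < 4; c++)
        {
            const Point2D point = quads[q].corner[c];
            if (point.x < bounds.minX) bounds.minX = point.x;
            if (point.x > bounds.maxX) bounds.maxX = point.x;
            if (point.y < bounds.minY) bounds.minY = point.y;
            if (point.y > bounds.maxY) bounds.maxY = point.y;
        }
    }
    return bounds;
}
```

Scanning all four corners rather than only two opposite ones keeps the result correct for rotated or skewed text, where a quad is not axis-aligned.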

If you would prefer to access the text through each PDE element, you can do that as well. Refer to the AddNewPageWithImage sample for a demonstration of manipulating PDE elements. For each HIG_PDE_TEXT you can use a variety of methods, such as IG_PDE_text_get_text_unicode.
