ImageGear for C and C++ on Linux v18.8 - Updated
Extract Text from a PDF

Before working with a PDF document, make sure to initialize the PDF component (see Getting Started with PDF).

Refer to the Extract Text from a PDF Sample for complete sample code that illustrates how to use this capability.

The simplest way to extract text from a PDF is to call IG_PDF_text_extract, which reads a PDF and writes the text of the requested pages to a TXT file.

C
LPSTR fileIn = "input.pdf";
LPSTR fileOut = "output.txt";
// The page index starts at 1.
UINT startPage = 1;
UINT pageCount = 1;
IG_PDF_text_extract(fileIn, fileOut, startPage, pageCount);
C++
std::string fileIn = "input.pdf";
std::string fileOut = "output.txt";
// The page index starts at 1.
UINT startPage = 1;
UINT pageCount = 1;
IG_PDF_text_extract((LPSTR)fileIn.c_str(), (LPSTR)fileOut.c_str(), startPage, pageCount);
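
Once the text file has been written, it can be processed with ordinary C file I/O. The following is a minimal sketch, not part of the ImageGear API: count_words_in_file is a hypothetical helper that counts whitespace-separated words in a file such as output.txt. It assumes the file uses an encoding in which whitespace is single-byte ASCII (for example ASCII or UTF-8); the actual output encoding should be checked against the API reference.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>

// Hypothetical helper (not an ImageGear function): counts whitespace-separated
// words in a text file such as the one written by IG_PDF_text_extract.
// Returns -1 if the file cannot be opened.
long count_words_in_file(const char *path)
{
    FILE *fp;
    long count = 0;
    int c;
    int inWord = 0;

    assert(path != NULL);
    fp = fopen(path, "r");
    if (fp == NULL)
        return -1;

    while ((c = fgetc(fp)) != EOF)
    {
        if (isspace((unsigned char)c))
        {
            // Whitespace ends the current word, if any.
            inWord = 0;
        }
        else if (!inWord)
        {
            // First non-whitespace character of a new word.
            inWord = 1;
            count++;
        }
    }

    fclose(fp);
    return count;
}
```

A call such as count_words_in_file("output.txt") after the extraction gives a quick sanity check that text was actually produced.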

If you want to manipulate the text in memory, then you should use a wordfinder to extract the text.

  1. Open the PDF document and load it into an HIG_PDF_DOC:
    C
    LPSTR fileIn = "input.pdf";
    HMIGEAR document = 0;
    HIG_PDF_DOC hDoc = 0;
    IG_mpi_create(&document, 0);
    IG_mpi_file_open(fileIn, document, IG_FORMAT_PDF, IG_MP_OPENMODE_READONLY);
    IG_mpi_info_get(document, IG_MP_DOCUMENT, &hDoc, sizeof(hDoc));
    
    C++
    std::string fileIn = "input.pdf";
    HMIGEAR document = 0;
    IG_mpi_create(&document, 0);
    IG_mpi_file_open((LPSTR)fileIn.c_str(), document, IG_FORMAT_PDF, IG_MP_OPENMODE_READONLY);
    HIG_PDF_DOC hDoc = 0;
    IG_mpi_info_get(document, IG_MP_DOCUMENT, &hDoc, sizeof(hDoc));
    
  2. Create a wordfinder for that PDF:
    C
    // Specifies the type of character at each position. If set to NULL the default encoding is
    // used.
    WORD encodingInfo = NULL;
    // Glyph names to use in conjunction with encodingInfo. If set to NULL the default encoding
    // vector is used.
    char* encodingVector = NULL;
    // The ligature table to be used in conjunction with the encodingVector and encodingInfo.
    // If set to NULL the default ligature table is used.
    char* ligatureTable = NULL;
    // IG_PDF_WF_LATEST_VERSION is the best option unless you are using an older version
    // of Acrobat.
    SHORT algorithmVersion = IG_PDF_WF_LATEST_VERSION;
    // Determines the sort order.
    WORD flags = IG_PDF_XY_SORT;
    // Pointer to user-supplied data to pass to the newly created word finder. Set to NULL.
    LPVOID clientData = NULL;
    HIG_PDF_WORDFINDER wordFinder;
    IG_PDF_doc_create_wordfinder(hDoc, encodingInfo, encodingVector, ligatureTable,
           algorithmVersion, flags, clientData, &wordFinder);
    
    C++
    // Specifies the type of character at each position. If set to NULL the default encoding is
    // used.
    WORD encodingInfo = NULL;
    // Glyph names to use in conjunction with encodingInfo. If set to NULL the default encoding
    // vector is used.
    char* encodingVector = NULL;
    // The ligature table to be used in conjunction with the encodingVector and encodingInfo.
    // If set to NULL the default ligature table is used.
    char* ligatureTable = NULL;
    // IG_PDF_WF_LATEST_VERSION is the best option unless you are using an older version
    // of Acrobat.
    SHORT algorithmVersion = IG_PDF_WF_LATEST_VERSION;
    // Determines the sort order.
    WORD flags = IG_PDF_XY_SORT;
    // Pointer to user-supplied data to pass to the newly created word finder. Set to NULL.
    LPVOID clientData = NULL;
    HIG_PDF_WORDFINDER wordFinder;
    IG_PDF_doc_create_wordfinder(hDoc, encodingInfo, encodingVector, ligatureTable,
           algorithmVersion, flags, clientData, &wordFinder);
    
  3. Get the number of words on the page so that we can iterate through them:
    C
    LONG wordCount;
    LONG pageNumber = 0;
    AT_ERRCOUNT errCount = IG_PDF_wordfinder_acquire_wordlist(wordFinder, pageNumber, &wordCount);
    
    C++
    LONG wordCount;
    LONG pageNumber = 0;
    AT_ERRCOUNT errCount = IG_PDF_wordfinder_acquire_wordlist(wordFinder, pageNumber, &wordCount);
    
  4. Iterate through each word:
    C
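    HIG_PDF_WORD word;
    WORD length;
    char* buffer;
    int i;
    // Loop over every word reported by IG_PDF_wordfinder_acquire_wordlist.
    for (i = 0; i < wordCount; i++)
    {
        IG_PDF_wordfinder_get_word(wordFinder, IG_PDF_XY_SORT, i, &word);
    
        IG_PDF_word_get_length(word, &length);
    
        // Allocate a zero-filled buffer with room for the terminator
        // (calloc and free require <stdlib.h>).
        buffer = (char*)calloc((size_t)length + 2, 1);
        if (buffer != NULL)
        {
            IG_PDF_word_get_string(word, buffer, length);
    
            // Here you can do what you want with each word.
    
            free(buffer);
        }
    
        // Clean up each word.
        IG_PDF_word_delete(word);
    }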
    
    C++
    for (int i = 0; i < wordCount; i++)
    {
        HIG_PDF_WORD word;
        IG_PDF_wordfinder_get_word(wordFinder, IG_PDF_XY_SORT, i, &word);
    
        WORD length;
        IG_PDF_word_get_length(word, &length);
    
        // Zero-initialized buffer with room for the terminator (requires <vector>).
        std::vector<char> buffer(length + 2, 0);
        IG_PDF_word_get_string(word, &buffer[0], length);
    
        // Here you can do what you want with each word.
    
        // Clean up each word.
        IG_PDF_word_delete(word);
    }
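
As one illustration of what "do what you want with each word" might look like, the words retrieved in step 4 can be collected into a single in-memory string. The sketch below is an assumption-laden example, not ImageGear code: TextBuffer, text_buffer_append_word, and text_buffer_free are hypothetical names, and it assumes each word is passed as a byte string together with its length in bytes (excluding any terminator), separated from the previous word by one space.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

// Hypothetical growable text buffer; not part of the ImageGear API.
typedef struct
{
    char *data;      // NUL-terminated text, or NULL while empty
    size_t length;   // bytes used, excluding the terminator
    size_t capacity; // bytes allocated
} TextBuffer;

// Appends one word, inserting a single space between consecutive words.
// Returns 1 on success, 0 if memory could not be allocated.
int text_buffer_append_word(TextBuffer *tb, const char *word, size_t wordLength)
{
    size_t needed;

    assert(tb != NULL && word != NULL);

    // Existing text + optional separator + new word + terminator.
    needed = tb->length + (tb->length > 0 ? 1 : 0) + wordLength + 1;
    if (needed > tb->capacity)
    {
        size_t newCapacity = tb->capacity ? tb->capacity * 2 : 64;
        char *grown;

        while (newCapacity < needed)
            newCapacity *= 2;

        grown = (char *)realloc(tb->data, newCapacity);
        if (grown == NULL)
            return 0; // the old buffer is still valid and unchanged

        tb->data = grown;
        tb->capacity = newCapacity;
    }

    if (tb->length > 0)
        tb->data[tb->length++] = ' ';

    memcpy(tb->data + tb->length, word, wordLength);
    tb->length += wordLength;
    tb->data[tb->length] = '\0';
    return 1;
}

// Releases the buffer and resets it to the empty state.
void text_buffer_free(TextBuffer *tb)
{
    assert(tb != NULL);
    free(tb->data);
    tb->data = NULL;
    tb->length = 0;
    tb->capacity = 0;
}
```

Inside the loop from step 4, the string and length obtained for each word would be passed to text_buffer_append_word on a TextBuffer initialized to {NULL, 0, 0}, and text_buffer_free would be called once the collected text is no longer needed. Doubling the capacity keeps the number of reallocations small even for pages with many words.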
    

To learn more about these word objects, you may want to use these two methods:

If you would prefer to access the text through each PDE element, you can do that as well. Refer to the AddNewPageWithImage sample for a demonstration of manipulating PDE elements. For each HIG_PDE_TEXT you can use a variety of methods, such as IG_PDE_text_get_text_unicode.