ImageGear for C and C++ on Windows v21.0 - Updated
    Extract Text from a PDF

    Before working with a PDF document, make sure to initialize the PDF component (see Getting Started with PDF).

    Refer to the Extract Text from a PDF Sample for complete sample code that illustrates how to use this capability.

    The simplest way to extract text from a PDF is to call the IG_PDF_text_extract function, which reads a PDF and writes its text to a TXT file.

    C
    LPSTR fileIn = "input.pdf";
    LPSTR fileOut = "output.txt";
    // The page index starts at 1.
    UINT startPage = 1;
    UINT pageCount = 1;
    IG_PDF_text_extract(fileIn, fileOut, startPage, pageCount);
    
    C++
    std::string fileIn = "input.pdf";
    std::string fileOut = "output.txt";
    // The page index starts at 1.
    UINT startPage = 1;
    UINT pageCount = 1;
    IG_PDF_text_extract((LPSTR)fileIn.c_str(), (LPSTR)fileOut.c_str(), startPage, pageCount);
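
    Once IG_PDF_text_extract has written the TXT file, the text can be read back with standard C++ file streams. The following is a minimal sketch, not part of ImageGear: readTextFile is a hypothetical helper, and it assumes the output file is small enough to hold in memory. The encoding of the output file is not stated in this topic, so the bytes are returned unmodified.

```cpp
#include <fstream>
#include <sstream>
#include <stdexcept>
#include <string>

// Hypothetical helper (not an ImageGear function): reads a whole file into a string.
// Bytes are returned as-is; no assumption is made about the text encoding.
std::string readTextFile(const std::string& path)
{
    std::ifstream in(path, std::ios::binary);
    if (!in)
    {
        throw std::runtime_error("Cannot open " + path);
    }
    std::ostringstream contents;
    contents << in.rdbuf();
    return contents.str();
}
```

    For example, calling readTextFile("output.txt") after the extraction call above would return the extracted text as a single string.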
    

    If you want to manipulate the text in memory, use a wordfinder to extract the text instead:

    1. Open the PDF document and load it into an HIG_PDF_DOC:
      C
      LPSTR fileIn = "input.pdf";
      HMIGEAR document = 0;
      HIG_PDF_DOC hDoc = 0;
      IG_mpi_create(&document, 0);
      IG_mpi_file_open(fileIn, document, IG_FORMAT_PDF, IG_MP_OPENMODE_READONLY);
      IG_mpi_info_get(document, IG_MP_DOCUMENT, &hDoc, sizeof(hDoc));
      
      C++
      std::string fileIn = "input.pdf";
      HMIGEAR document = 0;
      IG_mpi_create(&document, 0);
      IG_mpi_file_open((LPSTR)fileIn.c_str(), document, IG_FORMAT_PDF, IG_MP_OPENMODE_READONLY);
      HIG_PDF_DOC hDoc = 0;
      IG_mpi_info_get(document, IG_MP_DOCUMENT, &hDoc, sizeof(hDoc));
      
    2. Create a wordfinder for that PDF:
      C
      // Specifies the type of character at each position. If set to NULL the default encoding is
      // used.
      WORD encodingInfo = NULL;
      // Glyph names to use in conjunction with encodingInfo. If set to NULL the default encoding
      // vector is used.
      char* encodingVector = NULL;
      // The ligature table to be used in conjunction with the encodingVector and encodingInfo.
      // If set to NULL the default ligature table is used.
      char* ligatureTable = NULL;
      // IG_PDF_WF_LATEST_VERSION is the best option unless you are using an older version
      // of Acrobat.
      SHORT algorithmVersion = IG_PDF_WF_LATEST_VERSION;
      // Determines the sort order.
      WORD flags = IG_PDF_XY_SORT;
      // Pointer to user-supplied data to pass to the newly created word finder. Set to NULL.
      LPVOID clientData = NULL;
      HIG_PDF_WORDFINDER wordFinder;
      IG_PDF_doc_create_wordfinder(hDoc, encodingInfo, encodingVector, ligatureTable,
             algorithmVersion, flags, clientData, &wordFinder);
      
      C++
      // Specifies the type of character at each position. If set to NULL the default encoding is
      // used.
      WORD encodingInfo = NULL;
      // Glyph names to use in conjunction with encodingInfo. If set to NULL the default encoding
      // vector is used.
      char* encodingVector = NULL;
      // The ligature table to be used in conjunction with the encodingVector and encodingInfo.
      // If set to NULL the default ligature table is used.
      char* ligatureTable = NULL;
      // IG_PDF_WF_LATEST_VERSION is the best option unless you are using an older version
      // of Acrobat.
      SHORT algorithmVersion = IG_PDF_WF_LATEST_VERSION;
      // Determines the sort order.
      WORD flags = IG_PDF_XY_SORT;
      // Pointer to user-supplied data to pass to the newly created word finder. Set to NULL.
      LPVOID clientData = NULL;
      HIG_PDF_WORDFINDER wordFinder;
      IG_PDF_doc_create_wordfinder(hDoc, encodingInfo, encodingVector, ligatureTable,
             algorithmVersion, flags, clientData, &wordFinder);
      
    3. Get the number of words on the page so that we can iterate through them:
      C
      LONG wordCount = 0;
      LONG pageNumber = 0;
      // The return value is the number of errors reported by the call.
      AT_ERRCOUNT errCount = IG_PDF_wordfinder_acquire_wordlist(wordFinder, pageNumber, &wordCount);
      
      C++
      LONG wordCount = 0;
      LONG pageNumber = 0;
      // The return value is the number of errors reported by the call.
      AT_ERRCOUNT errCount = IG_PDF_wordfinder_acquire_wordlist(wordFinder, pageNumber, &wordCount);
      
    4. Iterate through each word:
      C
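      HIG_PDF_WORD word;
      WORD length;
      char* buffer;
      LONG i;
      // Loop over every word reported for the page (wordCount comes from step 3).
      for (i = 0; i < wordCount; i++)
      {
          IG_PDF_wordfinder_get_word(wordFinder, IG_PDF_XY_SORT, i, &word);

          IG_PDF_word_get_length(word, &length);

          // Allocate a zero-filled buffer with room for a terminator (requires <stdlib.h>).
          buffer = (char*)calloc(length + 2, 1);
          if (buffer == NULL)
              break;
          IG_PDF_word_get_string(word, buffer, length);

          // Here you can do what you want with each word.

          free(buffer);

          // Clean up each word.
          IG_PDF_word_delete(word);
      }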
      
      C++
      for (int i = 0; i < wordCount; i++)
      {
          HIG_PDF_WORD word;
          IG_PDF_wordfinder_get_word(wordFinder, IG_PDF_XY_SORT, i, &word);
      
          WORD length;
          IG_PDF_word_get_length(word, &length);
      
          // Zero-filled buffer with room for a terminator (requires #include <vector>).
          std::vector<char> buffer(length + 2, 0);
          IG_PDF_word_get_string(word, &buffer[0], length);
      
          // Here you can do what you want with each word.
      
          // Clean up each word.
          IG_PDF_word_delete(word);
      }
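
    Inside the loop above, each word is copied into a zero-filled character buffer. One common next step is to join the words into a single string. The following is a minimal sketch in standard C++ only: appendWord is a hypothetical helper, not an ImageGear function, and the sketch assumes each buffer holds single-byte text and that words should be separated by one space.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper (not an ImageGear function): appends one word to the
// running text, separated from the previous word by a single space.
// Assumes the buffer holds single-byte text, optionally NUL-terminated.
void appendWord(std::string& text, const std::vector<char>& buffer)
{
    // Measure up to the first NUL without reading past the end of the buffer.
    std::size_t length = 0;
    while (length < buffer.size() && buffer[length] != '\0')
    {
        ++length;
    }
    if (length == 0)
    {
        return; // Skip empty words.
    }
    if (!text.empty())
    {
        text += ' ';
    }
    text.append(buffer.data(), length);
}
```

    In the C++ loop, a std::string declared before the loop could be passed together with buffer to appendWord at the point where each word is processed. Joining with a single space discards the original layout, so this suits searching or indexing more than faithful reproduction of the page.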
      

    To learn more about these word objects, you may want to use these two methods:

    If you would prefer to access the text through each PDE element, you can do that as well. Refer to the AddNewPageWithImage sample for a demonstration of manipulating PDE elements. For each HIG_PDE_TEXT you can use a variety of methods, such as IG_PDE_text_get_text_unicode.