ImageGear for C and C++ on Windows v19.6 - Updated
IG_PDF_doc_create_wordfinder

Creates a word finder that is used to extract text in the host encoding from a PDF file.

Declaration:

 
AT_ERRCOUNT ACCUAPI IG_PDF_doc_create_wordfinder(
        HIG_PDF_DOC hDoc,
        LPWORD lpOutEncInfo,
        LPCHAR* lpOutEncVec, 
        LPCHAR* lpLigatureTbl, 
        SHORT nAlgVersion, 
        WORD nFlags, 
        LPVOID lpClientData, 
        LPHIG_PDF_WORDFINDER lphWordFinder 
);

Arguments:

Name Type Description
hDoc HIG_PDF_DOC The document on which the word finder is used.
lpOutEncInfo LPWORD Array of 256 flags, each specifying the type of the character at the corresponding position in the encoding. Each flag is an OR of the Character Type Codes. If lpOutEncInfo is NULL, the platform's default encoding info is used. Use lpOutEncInfo and lpOutEncVec together: each entry in lpOutEncInfo describes the character whose glyph name is at the same index in lpOutEncVec.
lpOutEncVec LPCHAR* Array of 256 null-terminated strings that are the glyph names in encoding order. See the discussion of character names in Section 5.3 of the PostScript Language Reference Manual, Third Edition. If lpOutEncVec is NULL, the platform's default encoding vector is used. Use this parameter with lpOutEncInfo.
lpLigatureTbl LPCHAR* A null-terminated array of null-terminated strings. Each string is the glyph name of a ligature in the font. When a word contains a ligature, the glyph name of the ligature is substituted for the ligature (for example, ff is substituted for the ff ligature). If lpLigatureTbl is NULL, a default ligature table is used, containing the following ligatures: fi, ff, fl, ffi, ffl, ch, cl, ct, ll, ss, fs, st, oe, OE.
nAlgVersion SHORT The version of the word-finding algorithm to use.
nFlags WORD Word-finding options that determine which tables are filled when using IG_PDF_wordfinder_acquire_word_list. Must be an OR of one or more enumIGPDFWordFlags values.
lpClientData LPVOID Pointer to user-supplied data to pass to the newly created word finder. Set to NULL.
lphWordFinder LPHIG_PDF_WORDFINDER Pointer to a variable that receives the handle to the new word finder.
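
The array arguments described above can be prepared in plain C before the call. The following is a minimal sketch, not taken from the ImageGear headers: it builds a null-terminated ligature table holding the default ligatures listed for lpLigatureTbl, and a pair of 256-entry tables shaped like lpOutEncInfo and lpOutEncVec. The helper names, the EXAMPLE_CT_* flag values, and the use of unsigned short in place of WORD are assumptions made for illustration; the real Character Type Codes must come from the ImageGear headers. The sketch also assumes an ASCII-compatible host character set.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define ENCODING_SIZE 256

/* HYPOTHETICAL flag values for illustration only. The real Character Type
   Codes are defined by ImageGear and are not reproduced here. */
enum { EXAMPLE_CT_LETTER = 0x0001, EXAMPLE_CT_DIGIT = 0x0002 };

/* The default ligatures listed in the lpLigatureTbl description, stored as a
   null-terminated array of null-terminated strings. */
static const char *const example_ligatures[] = {
    "fi", "ff", "fl", "ffi", "ffl", "ch", "cl", "ct",
    "ll", "ss", "fs", "st", "oe", "OE",
    NULL /* terminator required by the table format */
};

/* Count entries up to, but not including, the NULL terminator. */
size_t ligature_table_length(const char *const *table)
{
    size_t n = 0;
    if (table == NULL)
        return 0;
    while (table[n] != NULL)
        n++;
    return n;
}

/* Return 1 if the glyph name is present in the table, 0 otherwise. */
int ligature_table_contains(const char *const *table, const char *name)
{
    size_t i;
    assert(name != NULL);
    for (i = 0; table != NULL && table[i] != NULL; i++) {
        if (strcmp(table[i], name) == 0)
            return 1;
    }
    return 0;
}

/* Fill paired tables so that info[i] describes the glyph named by vec[i].
   Only ASCII digits and letters are classified; every other position gets
   ".notdef" and no flags. */
void fill_example_encoding(unsigned short info[ENCODING_SIZE],
                           const char *vec[ENCODING_SIZE])
{
    static const char *const digit_names[10] = {
        "zero", "one", "two", "three", "four",
        "five", "six", "seven", "eight", "nine"
    };
    static char letters[52][2]; /* one-character glyph names "A".."Z", "a".."z" */
    int i;

    for (i = 0; i < ENCODING_SIZE; i++) {
        info[i] = 0;
        vec[i] = ".notdef";
    }
    for (i = 0; i < 10; i++) {
        vec['0' + i] = digit_names[i];
        info['0' + i] = EXAMPLE_CT_DIGIT;
    }
    for (i = 0; i < 26; i++) {
        letters[i][0] = (char)('A' + i);
        letters[i][1] = '\0';
        letters[26 + i][0] = (char)('a' + i);
        letters[26 + i][1] = '\0';
        vec['A' + i] = letters[i];
        vec['a' + i] = letters[26 + i];
        info['A' + i] = EXAMPLE_CT_LETTER;
        info['a' + i] = EXAMPLE_CT_LETTER;
    }
}
```

The tables use const pointers to keep the sketch warning-free. Because the declaration takes LPCHAR*, a cast or non-const storage may be needed when passing them to IG_PDF_doc_create_wordfinder. Passing NULL for any of the three arguments selects the defaults described in the table above.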

Return Value:

Error count.

Supported Raster Image Formats:

This function does not process image pixels.

Remarks:

The word finder also extracts text from Form XObjects that are executed in the page contents. For information about Form XObjects, see Section 4.9 in the PDF Reference.

This function also works for non-Roman (CJK, or Chinese-Japanese-Korean) viewers. In this case, words are extracted to the host encoding. To get Unicode output, use IG_PDF_doc_create_wordfinder_ucs instead, which extracts both Roman and non-Roman text.

The type of WordFinder determines the encoding of the string returned by IG_PDF_word_get_string. For instance, if IG_PDF_doc_create_wordfinder_ucs is used to create the word finder, IG_PDF_word_get_string returns only Unicode.

For CJK viewers, words are stored internally using CID encoding. For more information on CIDFonts and related topics, see Section 5.6 in the PDF Reference. For detailed information on CIDFonts, see Technical Note #5092, CID-Keyed Font Technology Overview, and Technical Note #5014, Adobe CMap and CIDFont Files Specification.