Extracting text for searchable PDF output
The document conversion feature converts vector or document file formats such as AFP/MO:DCA, PCL, and MS Word to vector PDF. The output is a true vector PDF rather than a bitmap: it retains the original text and graphics commands, and font information such as the typeface, font height, and bold/italic attributes is preserved. As a result, the output PDF is text searchable, so any text searching application can find words or phrases in it. The only input formats currently supported for creating searchable PDF output are AFP/MO:DCA, PTOCA, PCL, DOC (MS Word), and MS Excel.
Conversion and text extraction occur in the following two-step process:
1. A call is made to extract the text, graphics, and bitmap data. The IMGLOW_extract_text() method extracts text, graphics, and position information from the file whose name is passed in. The returned buffer is then used as an argument in the call that writes out the new PDF file. See IMGLOW_extract_text(String, int, int, int) for more information.
2. The IMG_save_document() method takes the buffer of text, graphics, and position information and creates the document output file, which contains searchable text. By contrast, the IMG_save_bitmap() methods create only a bitmap file. IMG_save_document() supports only PDF as an output format. See IMG_save_document(String, byte, int) and IMG_save_document(byte, byte, int) for more information.
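The two steps above can be sketched as follows. This is only an illustration of the call order, not the RasterMaster API: the VectorConverter interface, the FakeConverter class, the parameter lists, and the 0 / -1 return codes are all assumptions made so that the example compiles and runs without the library. The real signatures and error codes are on the reference pages for IMGLOW_extract_text() and IMG_save_document().

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the two library calls described above.
// It is NOT the RasterMaster API; see the reference pages for real signatures.
interface VectorConverter {
    // Step 1: plays the role of IMGLOW_extract_text(); returns null on failure (assumed).
    byte[] extractText(String inputFile, int page);

    // Step 2: plays the role of IMG_save_document(); returns -1 on failure (assumed).
    int saveDocument(String outputFile, byte[] extracted);
}

// In-memory fake so that the sketch runs without the library installed.
class FakeConverter implements VectorConverter {
    final Map<String, byte[]> saved = new HashMap<>();

    @Override
    public byte[] extractText(String inputFile, int page) {
        if (inputFile == null || page < 0) {
            return null;
        }
        String content = "text and position data for " + inputFile + " page " + page;
        return content.getBytes(StandardCharsets.UTF_8);
    }

    @Override
    public int saveDocument(String outputFile, byte[] extracted) {
        if (outputFile == null || extracted == null) {
            return -1;
        }
        saved.put(outputFile, extracted);
        return 0;
    }
}

class TwoStepConversion {
    // Runs the extract step, then the save step, stopping early if extraction fails.
    static int convertPage(VectorConverter converter, String in, String out, int page) {
        byte[] extracted = converter.extractText(in, page);   // step 1: extract
        if (extracted == null) {
            return -1;                                        // nothing to save
        }
        return converter.saveDocument(out, extracted);        // step 2: save as PDF
    }

    public static void main(String[] args) {
        FakeConverter converter = new FakeConverter();
        int status = convertPage(converter, "statement.afp", "statement.pdf", 0);
        System.out.println(status == 0 ? "saved" : "failed");
    }
}
```

The point of the sketch is that the buffer returned by the extract step is the only data handed to the save step, so a failed extraction should be checked before any output file is written.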
The IMGLOW_extract_page method extracts the specified page from a multi-page document. It differs from IMG_decompress_image in that it preserves the format of the original page rather than converting it to the RasterMaster common raster format. The currently supported input formats are raster PDF, searchable PDF, and TIFF.
Methods Used for Save Searchable PDF
The saveSearchablePDF sample extracts the text from the input document and saves it as a searchable PDF. You can also specify a text string to search for by assigning a value to the stringToSearch variable. The samples are in the following directory: [RM Java install dir]\Samples\com\snowbound\samples. For more information on the methods used in the saveSearchablePDF sample, see the following:
- IMG_save_document(String, byte, int)
- IMGLOW_get_pages(String)
- IMGLOW_extract_text(String, int, int, int)
- IMGLOW_search_text(byte[], String, int, int, int[])
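The way these methods fit together in a per-page search can be sketched as follows. This is a sketch under stated assumptions, not the sample's actual code: the PageSource interface and fakeSource helper are hypothetical stand-ins for IMGLOW_get_pages and IMGLOW_extract_text, pages are assumed to be zero-based, the extracted buffer is assumed to be plain UTF-8 text, and a simple substring match stands in for IMGLOW_search_text.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

class SearchLoopSketch {
    // Hypothetical stand-in for the library calls the sample uses.
    interface PageSource {
        int getPages(String file);                  // role of IMGLOW_get_pages
        byte[] extractText(String file, int page);  // role of IMGLOW_extract_text
    }

    // Builds a fake source whose pages hold the given strings (illustration only).
    static PageSource fakeSource(String... pageTexts) {
        return new PageSource() {
            @Override
            public int getPages(String file) {
                return pageTexts.length;
            }

            @Override
            public byte[] extractText(String file, int page) {
                return pageTexts[page].getBytes(StandardCharsets.UTF_8);
            }
        };
    }

    // Returns the zero-based indexes of the pages whose text contains stringToSearch.
    static List<Integer> pagesContaining(PageSource source, String file, String stringToSearch) {
        List<Integer> hits = new ArrayList<>();
        int pages = source.getPages(file);
        for (int page = 0; page < pages; page++) {
            byte[] buffer = source.extractText(file, page);
            if (buffer == null) {
                continue; // skip pages that could not be extracted
            }
            // A substring match stands in for IMGLOW_search_text here.
            String text = new String(buffer, StandardCharsets.UTF_8);
            if (text.contains(stringToSearch)) {
                hits.add(page);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        PageSource source = fakeSource("Invoice 1001", "Terms and conditions", "Invoice total");
        System.out.println(pagesContaining(source, "input.doc", "Invoice")); // prints [0, 2]
    }
}
```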