The search operation is conducted by an ImageGear approximate regex object (HIG_REC_APPROX_REGEX) against a page (HIGEAR or HIG_REC_IMAGE), a document (HMIGEAR or HIG_REC_DOCUMENT), or a Unicode string (LPAT_WCHAR) to produce an array of matches (AT_REC_MATCH_RESULT).
To search a page, document, or string:
Create an Approximate Regex Object
Create an approximate regex object with the function IG_REC_approx_regex_create. When the object is no longer needed, release its resources with the function IG_REC_approx_regex_delete.
ImageGear HIG_REC_APPROX_REGEX instances are not thread-safe. Callers are responsible for synchronizing access to instances shared across multiple threads before invoking operations that could modify or delete that instance.
Configure the Search
After creating an approximate regex object, configure it to perform an exact or approximate search and to broadcast notifications as the search is conducted. Configurable settings include:
- Search patterns. Choose one or more regular expression patterns to match text from the search domain. Use the function IG_REC_approx_regex_pattern_set to specify one or more valid regular expression patterns.
The approximate regex object supports both the POSIX 1003.2 Extended Regular Expression (ERE) syntax and the Basic Regular Expression (BRE) syntax. The supported regular expression syntax is described on the Regular Expressions page.
- Case-sensitive matches. Case-sensitive matches treat uppercase and lowercase letters as distinct letters. Case-insensitive matches treat uppercase and lowercase letters as the same letter. Use the function IG_REC_approx_regex_is_case_sensitive_set to toggle between case-sensitive and case-insensitive matches.
- Greedy matches. When two or more matches are made at the same position, the longest match is classified as the greedy match. Use the function IG_REC_approx_regex_is_greedy_set to enable or disable greedy matches.
- Fuzzy matches. Fuzzy matches tolerate incorrect, missing, or extraneous letters in the search domain. A fuzzy search will locate substrings within a search domain that only match a pattern after one or more characters are inserted, deleted, or substituted:
- Insert count. An insert adds a character to produce a match. For example, the pattern "she" will match "shoe" if the letter 'o' is inserted. Likewise, the pattern "lent" will match "learnt" if the letters 'a' and 'r' are inserted. Use the function IG_REC_approx_regex_maximum_insert_count_set to specify the maximum number of inserts that the search will tolerate.
- Delete count. A delete removes a character to produce a match. For example, the pattern "she" will match "he" if the letter 's' is deleted. Likewise, the pattern "them" will match "he" if the letters 't' and 'm' are deleted. Use the function IG_REC_approx_regex_maximum_delete_count_set to specify the maximum number of deletes that the search will tolerate.
- Substitute count. A substitute replaces a character to produce a match. For example, the pattern "milk" will match "mi1k" if the character '1' is replaced with 'l'. Likewise, the pattern "jail" will match "pain" if 'p' is replaced with 'j' and 'n' is replaced with 'l'. Use the function IG_REC_approx_regex_maximum_substitute_count_set to specify the maximum number of substitutions that the search will tolerate.
- Error count. Each insert, delete, or substitute applied to induce a match is treated as an error. When a potential match would require more than the maximum number of errors, it is rejected and the search continues. Use the function IG_REC_approx_regex_maximum_error_count_set to specify the maximum number of errors that the search will tolerate.
- Notifications. Install callbacks to be notified when a word is recognized, when a match is made, and as the search progresses:
- Recognize word callback. The recognize word callback is invoked for each recognized word during preparation of the search domain. These notifications are delivered before the search for pattern matches begins. Use the function IG_REC_approx_regex_recognize_word_cb_set to install a user-defined recognize word callback.
- Match callback. The match callback is invoked after each successful match is found. This callback presents an opportunity for the application to inspect and reject a match result. Rejected matches are excluded from the array of matches returned at the successful conclusion of the search. Use the function IG_REC_approx_regex_match_cb_set to install a user-defined match callback.
- Progress callback. The progress callback is invoked periodically during the search operation to report the estimated percentage of the search completed. This callback also presents an opportunity for the caller to stop the current search prior to completion. Use the function IG_REC_approx_regex_progress_cb_set to install a user-defined progress callback.
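The interaction between the insert, delete, substitute, and error limits described above can be illustrated with a small, self-contained sketch. It is not ImageGear code and does not call the ImageGear API. The function demo_fuzzy_equal and its parameters are hypothetical names. It compares a literal pattern (not a regular expression) against a whole narrow-character string, and it assumes that every insert, delete, or substitute also counts against the error limit.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Illustration only: hypothetical helper, not part of ImageGear.
 * Returns 1 when `text` equals `pattern` after at most:
 *   max_insert      characters present in text but absent from the pattern,
 *   max_delete      pattern characters missing from text,
 *   max_substitute  characters that differ between the two,
 * and at most max_error of these edits in total. Returns 0 otherwise.
 */
int demo_fuzzy_equal(const char *pattern, const char *text,
                     int max_insert, int max_delete,
                     int max_substitute, int max_error)
{
    assert(pattern != NULL && text != NULL);

    /* Any exhausted budget means this alignment is not allowed. */
    if (max_insert < 0 || max_delete < 0 ||
        max_substitute < 0 || max_error < 0)
        return 0;

    /* Both strings consumed: the alignment succeeded. */
    if (*pattern == '\0' && *text == '\0')
        return 1;

    /* Exact character match costs nothing. */
    if (*pattern != '\0' && *pattern == *text &&
        demo_fuzzy_equal(pattern + 1, text + 1, max_insert, max_delete,
                         max_substitute, max_error))
        return 1;

    /* Insert: skip an extra character in the text. */
    if (*text != '\0' &&
        demo_fuzzy_equal(pattern, text + 1, max_insert - 1, max_delete,
                         max_substitute, max_error - 1))
        return 1;

    /* Delete: skip a pattern character that the text lacks. */
    if (*pattern != '\0' &&
        demo_fuzzy_equal(pattern + 1, text, max_insert, max_delete - 1,
                         max_substitute, max_error - 1))
        return 1;

    /* Substitute: accept one differing character. */
    if (*pattern != '\0' && *text != '\0' && *pattern != *text &&
        demo_fuzzy_equal(pattern + 1, text + 1, max_insert, max_delete,
                         max_substitute - 1, max_error - 1))
        return 1;

    return 0;
}
```

Under these assumptions, demo_fuzzy_equal("she", "shoe", 1, 0, 0, 1) accepts the match with a single insert, while demo_fuzzy_equal("jail", "pain", 0, 0, 2, 1) rejects it because two substitutions exceed an error limit of one.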
Search for Matches
After configuration is complete, conduct a search of any of these supported types to recover an array of matches:
- Single page raster image (HIGEAR). Optical Character Recognition (OCR) is performed upon a single raster page. All recognized letters are aggregated into a text domain. The text domain is searched, and accepted pattern matches are returned as an array of matches. Invoke the function IG_REC_approx_regex_search_page to recover matches from an HIGEAR instance.
- Multi-page raster image (HMIGEAR). OCR is performed upon each page of the multi-page image. Recognized letters from each page are aggregated into separate text domains. Each text domain is searched, and accepted pattern matches are returned as an array of matches. Invoke the function IG_REC_approx_regex_search_document to recover matches from an HMIGEAR instance.
- Recognition page (HIG_REC_IMAGE). Pre-recognized letters are aggregated into a text domain. The text domain is searched, and accepted pattern matches are returned as an array of matches. Invoke the function IG_REC_approx_regex_search_rec_page to recover matches from an HIG_REC_IMAGE instance.
- Recognition document (HIG_REC_DOCUMENT). Pre-recognized letters from each page are aggregated into text domains. Each text domain is searched, and accepted pattern matches are returned as an array of matches. Invoke the function IG_REC_approx_regex_search_rec_document to recover matches from an HIG_REC_DOCUMENT instance.
- Unicode string (LPAT_WCHAR). The Unicode string is searched, and accepted pattern matches are returned as an array of matches. Invoke the function IG_REC_approx_regex_search_text to recover matches from a Unicode string.
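Each search above returns only the accepted matches. The following sketch shows, in general terms, how a match callback can reject a result and how a progress callback can stop a search early. Every name in it (demo_match, demo_match_cb, demo_progress_cb, demo_search, and the two sample callbacks) is hypothetical. The callback signatures and return conventions are assumptions made for illustration, not the ImageGear declarations, and the search is a plain literal scan over a narrow-character string.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical types for illustration only; not ImageGear declarations. */
typedef struct {
    size_t offset;  /* start of the match within the text */
    size_t length;  /* number of characters matched */
} demo_match;

/* Assumed convention: return nonzero to accept the match, zero to reject. */
typedef int (*demo_match_cb)(const demo_match *match, void *user_data);

/* Assumed convention: return nonzero to continue, zero to stop the search. */
typedef int (*demo_progress_cb)(int percent_complete, void *user_data);

/* Sample match callback: rejects any match at the very start of the text. */
int demo_reject_at_start(const demo_match *match, void *user_data)
{
    (void)user_data;
    return match->offset != 0;
}

/* Sample progress callback: stops once half of the text has been scanned. */
int demo_stop_at_half(int percent_complete, void *user_data)
{
    (void)user_data;
    return percent_complete < 50;
}

/*
 * Scans `text` for the literal `word`. The progress callback runs before
 * each position is examined (a real engine would report less often), and
 * the match callback runs for each hit. Accepted matches are stored in
 * `results`, up to `capacity`. Returns the number of matches stored.
 */
size_t demo_search(const char *text, const char *word,
                   demo_match_cb on_match, demo_progress_cb on_progress,
                   void *user_data, demo_match *results, size_t capacity)
{
    assert(text != NULL && word != NULL && results != NULL);

    size_t text_len = strlen(text);
    size_t word_len = strlen(word);
    size_t count = 0;

    if (word_len == 0 || word_len > text_len)
        return 0;

    size_t last = text_len - word_len;
    for (size_t pos = 0; pos <= last; ++pos) {
        int percent = (int)(pos * 100 / (last + 1));
        if (on_progress != NULL && !on_progress(percent, user_data))
            break;                      /* caller asked to stop early */
        if (strncmp(text + pos, word, word_len) != 0)
            continue;                   /* no match at this position */
        demo_match m = { pos, word_len };
        if (on_match != NULL && !on_match(&m, user_data))
            continue;                   /* rejected: excluded from results */
        if (count < capacity)
            results[count++] = m;
    }
    return count;
}
```

In this sketch, passing NULL for both callbacks returns every match, passing demo_reject_at_start drops a match at offset zero from the results, and passing demo_stop_at_half ends the scan before the second half of the text is examined.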
Zero Matches?
A search that returns zero matches is not necessarily correct. Image resolution affects the accuracy of the ImageGear Recognition engine's OCR, and consequently the accuracy of auto-redaction.
Consider an attempt to redact the text “football” from the 96 DPI page depicted below:
A search for the pattern “football” fails to recover any matches. The OCR text recovered from the 96 DPI page, which is used as the search domain, is not accurate:
eApefscp :sainmscins ON Jalial u :seinffisqns oivki suogwetrio :amffiscins auo
necnooi. :seielep ON Jouids :selelep omi illy° :alelep au°
sseuiseeivks :spesu! ON wndpuei]eizi :spesu! oftni
19(1981701e :pasu! auo
For this particular example, using the ImageGear function IG_image_resolution_set to change the page’s reported resolution from 96 DPI to 128 DPI is sufficient to produce the expected OCR text:
One insert: alph4abet Two inserts: refe[rendpum No inserts: sweetness
One delete: crittr Two deletes: spiner No deletes: football
One substitute: chambions Two substitutes: n telfer No substitutes: objective
Repeating the search for the pattern “football” locates a single match that is subsequently redacted, as depicted below:
The page OCR Performance Issues offers some additional suggestions that may improve the accuracy of ImageGear’s recognition engine for some images.