ImageGear .NET v24.14 - Updated September 15, 2020
OCR an Image or Document

The recognition process is initiated by the ImGearRecPage.Recognize Method.

The Recognize Method processes the single image associated with the ImGearRecPage Class. The method uses the zone list of the image or, if that list is empty, automatically runs the page-layout decomposition process (auto-zoning) to create one.

Internally, this method operates on a bitonal image. If the image is not bitonal, or if the image despeckle mode setting is enabled, an implicit conversion step runs automatically to create a secondary image. You can influence this implicit conversion by setting the ImGearRecPreprocessingSettings.SecondaryReductionMode Property and the DespeckleMode Property before calling Recognize.
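The rule in the paragraph above can be restated as a small predicate. The sketch below is a toy model and not ImageGear code: PageInfo, BitsPerPixel, DespeckleEnabled, and NeedsSecondaryImage are hypothetical names introduced for illustration, and treating a 1-bit-per-pixel image as bitonal is an assumption.

```csharp
using System;

// Hypothetical stand-in for the two facts that drive the decision.
// These names are illustrative and are not part of the ImageGear API.
public sealed class PageInfo
{
    public int BitsPerPixel { get; set; }
    public bool DespeckleEnabled { get; set; }
}

public static class SecondaryImageRule
{
    // Mirrors the documented rule: a secondary image is created when the
    // source is not bitonal OR when despeckling is enabled.
    public static bool NeedsSecondaryImage(PageInfo page)
    {
        if (page == null) throw new ArgumentNullException(nameof(page));
        bool isBitonal = page.BitsPerPixel == 1; // assumption: 1 bpp means bitonal
        return !isBitonal || page.DespeckleEnabled;
    }
}

public static class SecondaryImageDemo
{
    public static void Main()
    {
        // A 24-bit color scan is not bitonal, so a secondary image is needed.
        var scan = new PageInfo { BitsPerPixel = 24, DespeckleEnabled = false };
        Console.WriteLine(SecondaryImageRule.NeedsSecondaryImage(scan)); // True
    }
}
```

Under this model, only a bitonal image with despeckling disabled skips the conversion step.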

The Recognize Method enumerates the zones in the zone list and activates the appropriate recognition modules for them. The recognition modules are given the calculated Character Set information zone by zone.
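The control flow described above (fall back to auto-zoning when the zone list is empty, then hand each zone to the module that matches it) might be pictured as follows. This is a self-contained toy model under stated assumptions: Zone, ZoneKind, RecognizeFlow, and the string character sets are hypothetical and do not correspond to ImageGear types or to how the engine is actually implemented.

```csharp
using System;
using System.Collections.Generic;

// Toy model of the documented flow. None of these are ImageGear types.
public enum ZoneKind { Text, Numeric }

public sealed class Zone
{
    public ZoneKind Kind { get; }
    public string CharacterSet { get; }

    public Zone(ZoneKind kind, string characterSet)
    {
        Kind = kind;
        CharacterSet = characterSet;
    }
}

public static class RecognizeFlow
{
    // Step 1: use the caller's zone list, or call autoZone when it is empty.
    // Step 2: visit each zone and pass it (with its character set) to the
    //         module registered for that zone's kind.
    public static List<string> Run(
        List<Zone> zones,
        Func<List<Zone>> autoZone,
        IReadOnlyDictionary<ZoneKind, Func<Zone, string>> modules)
    {
        if (zones.Count == 0)
        {
            zones = autoZone();
        }

        var results = new List<string>();
        foreach (Zone zone in zones)
        {
            results.Add(modules[zone.Kind](zone));
        }
        return results;
    }
}

public static class RecognizeFlowDemo
{
    public static void Main()
    {
        var modules = new Dictionary<ZoneKind, Func<Zone, string>>
        {
            [ZoneKind.Text] = z => "text module, charset " + z.CharacterSet,
            [ZoneKind.Numeric] = z => "numeric module, charset " + z.CharacterSet,
        };

        // An empty zone list triggers the auto-zoning fallback.
        List<string> results = RecognizeFlow.Run(
            new List<Zone>(),
            () => new List<Zone> { new Zone(ZoneKind.Text, "A-Z") },
            modules);

        Console.WriteLine(results[0]); // text module, charset A-Z
    }
}
```

Passing a non-empty zone list to Run skips the autoZone callback entirely, which matches the behavior described for a page whose zones were defined ahead of time.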

This section provides examples for running OCR on a multi-page TIFF image, a single-page document, and a PDF document.

OCR a Multi-Page TIFF Image

In this example, the application loads every page of a multi-page TIFF image, recognizes the pages one at a time, and appends the recognized text of each page to the output file (outputDoc.txt) with the WriteDirectText Method of the recognition engine's OutputManager.

C#
// Initialize the Recognition Engine.
ImGearRecognition igRecognition = new ImGearRecognition();
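// Open a FileStream for our output document. FileMode.Create truncates an existing
// file so that text left over from a previous run does not remain at the end.
using (FileStream outputStream = new FileStream("outputDoc.txt", FileMode.Create, FileAccess.ReadWrite))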
{
  // Open a FileStream for our source multi-page image.
  using (FileStream multiPageDocument = new FileStream("Multi-Page.tif", FileMode.Open))
  {
    // Load every page of the multi-page document, starting at page 0.
    // A page count of -1 specifies that all pages are loaded.
    ImGearDocument doc = ImGearFileFormats.LoadDocument(multiPageDocument, 0, -1);
    // Determine the number of pages that were loaded.
    int numPages = doc.Pages.Count;
    // Recognize each page of the multi-page document and add the results to outputStream.
    for (int pageNumber = 0; pageNumber < numPages; pageNumber++)
    {                     
      // Cast the current page to a raster page and import that page.
      ImGearRecPage igRecPage = igRecognition.ImportPage((ImGearRasterPage)doc.Pages[pageNumber]);
      // Preprocess the page.
      igRecPage.Image.Preprocess();
      // Perform recognition.
      igRecPage.Recognize();
      // Add OCR results to the outputStream.
      igRecognition.OutputManager.WriteDirectText(igRecPage, outputStream);
    }
  }
}
// Dispose of objects we are no longer using.
igRecognition.Dispose();

OCR a Single-Page Document

C#
// Initialize the Recognition Engine.
ImGearRecognition igRecognition = new ImGearRecognition();
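// Open a FileStream for our output document. FileMode.Create truncates an existing
// file so that text left over from a previous run does not remain at the end.
using (FileStream outputStream = new FileStream("outputDoc.txt", FileMode.Create, FileAccess.ReadWrite))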
{
  // Open a FileStream for our source image.
  using (FileStream inputImage = new FileStream("Multi-Page.tif", FileMode.Open))
  {
    // Load the document, starting at page 0. A page count of -1 loads all pages,
    // but only the first page is recognized below.
    ImGearDocument doc = ImGearFileFormats.LoadDocument(inputImage, 0, -1);
    // Cast the first page to a raster page and import that page.
    ImGearRecPage igRecPage = igRecognition.ImportPage((ImGearRasterPage)doc.Pages[0]);
    // Preprocess the page.
    igRecPage.Image.Preprocess();
    // Perform recognition.
    igRecPage.Recognize();
    // Add OCR results to the outputStream.
    igRecognition.OutputManager.WriteDirectText(igRecPage, outputStream);
  }
}
// Dispose of objects we are no longer using.
igRecognition.Dispose();

OCR a PDF Document

C#
// Initialize support for Pdf Format.
ImGearFileFormats.Filters.Add(ImGearPDF.CreatePDFFormat());
ImGearPDF.Initialize();
// Initialize the Recognition Engine.
ImGearRecognition igRecognition = new ImGearRecognition();
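// Open a FileStream for our output document. FileMode.Create truncates an existing
// file so that text left over from a previous run does not remain at the end.
using (FileStream outputStream = new FileStream("outputDoc.txt", FileMode.Create, FileAccess.ReadWrite))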
{
  // Open a FileStream for our source PDF.
  using (FileStream multiPageDocument = new FileStream("pdf.pdf", FileMode.Open, FileAccess.Read, FileShare.Read))
  {
    // Load the entire multi-page document. This ImGearDocument holds the PDF in memory.
    ImGearDocument imGearDocument = ImGearFileFormats.LoadDocument(multiPageDocument);
    // Recognize each page of the multi-page document and add the results to outputStream.
    for (int pageNumber = 0; pageNumber < imGearDocument.Pages.Count; pageNumber++)
    {
      // Get the page specified by pageNumber.
      ImGearPage igPage = imGearDocument.Pages[pageNumber];
      // OCR only works on raster images, so we need to rasterize the page if it's an ImGearVectorPage.
      if (igPage is ImGearVectorPage)
      {
        igPage = (igPage as ImGearVectorPage).Rasterize(24, 300, 300);
      }
      // Cast igPage to a raster page and import that page.
      ImGearRecPage igRecPage = igRecognition.ImportPage((ImGearRasterPage)igPage);
      // Preprocess the page.
      igRecPage.Image.Preprocess();
      // Perform recognition.
      igRecPage.Recognize();
      // Add OCR results to the outputStream.
      igRecognition.OutputManager.WriteDirectText(igRecPage, outputStream);
    }
  }
}
// Dispose of objects we are no longer using.
igRecognition.Dispose();
