Perform Multi-Threaded OCR

This topic provides some background information about using ImageGear OCR in a multi-threaded application and walks you through creating a new multi-threaded application using OCR.

About Using OCR in a Multi-Threaded Application
Memory Usage Best Practices
Creating the Project
Creating the Page Processor Class
Calling the Page Processor Class

About Using OCR in a Multi-Threaded Application

The OCR API can be used in a multi-threaded application. However, it is generally the application's responsibility to ensure that the different recognition activities (e.g., pre-processing, decomposition, and recognition) occur sequentially for any given image or page.

Multiple instances of the ImGearOCR object are now supported in a single process. While all instances of this object will access the same underlying native OCR engine, each instance operates on its own set of pages and documents.

Using multiple instances of the ImGearOCR object in separate threads enables the application to perform recognition activities on different images and pages in parallel. However, it is important to remember that for each page, the recognition process must occur sequentially, meaning that a page must first be imported and pre-processed before it can be recognized and exported.

Re-entrance is not allowed within the same thread. Re-entrance would occur if an application-defined event handler called back into the OCR API: while control is inside a handler for an event fired by the API, no methods of the OCR API can be called.

Memory Usage Best Practices

The ImageGear OCR assembly provides you with the ability to perform recognition activities on multiple images in parallel. However, you will want to ensure that the amount of memory being consumed by this process does not exceed the limitations of the system. This can occur if, for example, all pages of a large PDF are opened at once and then sent to the OCR assembly for processing.

The following points should always be considered when using the OCR API, but especially when calling it within multiple threads or when large images are to be recognized:

The ImGearOCR.ImportPage method will create a copy of the imported ImGearRasterPage that will be used exclusively by the recognition engine. Care should be taken to account for this memory overhead in your application.
The OCR API uses native binaries, so its recognition objects implement the IDisposable interface. Call the Dispose method of each recognition object as soon as the object is no longer needed in the application.
ImageGear will not monitor memory usage for you. If the application does not manage memory itself, out-of-memory exceptions can occur in high-stress applications that use multiple threads and/or large images.

An alternative to the “all-at-once” processing style mentioned above is a processing technique where smaller chunks of images are opened, processed, exported and closed in an assembly line style. This technique ensures that only a specified number of images are opened and being processed at one time, keeping a consistent and manageable memory footprint throughout the process.
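
The chunked, assembly-line idea can be illustrated without any ImageGear types. The following is only a minimal sketch: FakePage and ChunkedPipeline are hypothetical names introduced here, and FakePage merely counts how many instances are open so that the bound on simultaneously open pages can be observed.

```csharp
using System;
using System.Threading.Tasks;

// Hypothetical stand-in for a loaded page. It only counts how many
// instances are open so the memory bound can be observed.
public sealed class FakePage : IDisposable
{
    public static int OpenCount { get; private set; }
    public static int PeakOpenCount { get; private set; }

    public int Index { get; private set; }

    public FakePage(int index)
    {
        Index = index;
        OpenCount++;
        PeakOpenCount = Math.Max(PeakOpenCount, OpenCount);
    }

    public void Dispose()
    {
        OpenCount--;
    }
}

public static class ChunkedPipeline
{
    // Opens, processes and closes pages in chunks of at most chunkSize.
    // Pages are opened and closed on the calling thread; only the
    // processing step runs in parallel.
    public static void Run(int pageCount, int chunkSize, Action<FakePage> process)
    {
        if (chunkSize < 1) throw new ArgumentOutOfRangeException("chunkSize");

        for (int start = 0; start < pageCount; start += chunkSize)
        {
            int count = Math.Min(chunkSize, pageCount - start);
            var chunk = new FakePage[count];
            try
            {
                // Open only the pages that belong to this chunk.
                for (int i = 0; i < count; i++)
                {
                    chunk[i] = new FakePage(start + i);
                }

                // Returns only after every page in the chunk is processed.
                Parallel.ForEach(chunk, process);
            }
            finally
            {
                // Close the chunk before the next one is opened.
                foreach (var page in chunk)
                {
                    if (page != null) page.Dispose();
                }
            }
        }
    }
}
```

With 10 pages and a chunk size of 4, this sketch never has more than 4 pages open at once, regardless of how long each page takes to process.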

The following walkthrough describes this technique using the System.Threading.Tasks.Parallel class included in the .NET Framework 4.0. In this walkthrough, you will create a .NET Framework 4 Windows application that processes all pages of a PDF file while ensuring a controlled memory footprint throughout the task.

Creating the Project

This section describes how to create the project for this sample:

Start Visual Studio and create a new Windows Forms Application project in C# named ParallelSample.
In Visual Studio, add the following ImageGear references to your project:
- ImageGear.Core
- ImageGear.Evaluation
- ImageGear.Formats.Advanced
- ImageGear.Formats.Common
- ImageGear.Formats.Pdf
- ImageGear.OCR
- ImageGear.Windows.Forms

Creating the Page Processor Class

This section describes how to create the class that will perform the parsing and recognition of the PDF file using multiple threads:

In Visual Studio, add a new class to the ParallelSample project called PageProcessorTest.

Add the following using statements to the top of the class file. This code imports the proper types for use in the class.

C#

using System;
using System.Collections.Concurrent;
using System.IO;
using System.Threading.Tasks;

using ImageGear.Core;
using ImageGear.Evaluation;
using ImageGear.Formats;
using ImageGear.Formats.PDF;
using ImageGear.OCR;

VB.NET

Imports System
Imports System.Collections.Concurrent
Imports System.IO
Imports System.Threading.Tasks

Imports ImageGear.Core
Imports ImageGear.Evaluation
Imports ImageGear.Formats
Imports ImageGear.Formats.PDF
Imports ImageGear.OCR

Add the following code to initialize a new instance of the class. This code sets up licensing, initializes the ImageGear Formats assembly and adds the required format filters. In this sample, we are only supporting the PDF and PostScript formats:

C#

public PageProcessorTest()
{
    // Initialize evaluation manager.
    ImGearEvaluationManager.Initialize();
    // ***The SetSolutionName, SetSolutionKey and possibly the SetOEMLicenseKey
    // methods must be called to distribute the runtime.***
    // ImGearLicense.SetSolutionName("YourSolutionName");
    // ImGearLicense.SetSolutionKey(12345, 12345, 12345, 12345);
    // Manually Reported Runtime licenses also require the following method
    // call to SetOEMLicenseKey.
    // ImGearLicense.SetOEMLicenseKey("2.0.AStringForOEMLicensing..."); 
    ImGearCommonFormats.Initialize();
    ImGearFileFormats.Filters.Insert(0, ImGearPDF.CreatePDFFormat());
    ImGearFileFormats.Filters.Insert(0, ImGearPDF.CreatePSFormat());
    ImGearPDF.Initialize();
}

VB.NET

Public Sub Initialize()
 ' Initialize evaluation manager.
 ImGearEvaluationManager.Initialize()

 ' ***The SetSolutionName, SetSolutionKey and possibly the SetOEMLicenseKey
 ' methods must be called to distribute the runtime.***
 ' ImGearLicense.SetSolutionName("YourSolutionName")
 ' ImGearLicense.SetSolutionKey(12345, 12345, 12345, 12345)
 ' Manually Reported Runtime licenses also require the following method
 ' call to SetOEMLicenseKey.
 ' ImGearLicense.SetOEMLicenseKey("2.0.AStringForOEMLicensing...")
 ImGearCommonFormats.Initialize()
 ImGearFileFormats.Filters.Insert(0, ImGearPDF.CreatePDFFormat())
 ImGearFileFormats.Filters.Insert(0, ImGearPDF.CreatePSFormat())
 ImGearPDF.Initialize()
End Sub

Add the following code to create a private method that processes a chunk of pages. This code takes an array of ImGearRasterPage objects and, in parallel, imports each page into its own ImGearOCR instance and recognizes it. After each page has been processed, its recognized text is added to the output collection.

In this example, the order of the recognized pages is not guaranteed to be the same as the order of the original pages, in favor of code simplicity.

C#

private void ProcessPageChunk(ImGearRasterPage[] recPagesChunk, BlockingCollection<string> recognizedContent)
        {
            Parallel.ForEach(recPagesChunk, pg =>
            {
                if (pg != null)
                {
                    using (var igRecognition = ImGearOCR.Create())
                    {
                        using (var ocrPage = igRecognition.ImportPage(pg))
                        {
                            ocrPage.Recognize();
                            recognizedContent.Add(ocrPage.Text);
                        }
                    }
                }
            });
        }

VB.NET

Private Sub ProcessPageChunk(ByRef recPagesChunk As ImGearRasterPage(), ByVal recognizedContent As BlockingCollection(Of String))
    Parallel.ForEach(Of ImGearRasterPage)(recPagesChunk, Sub(pg)
                                                             ' Skip empty slots in a partially filled chunk.
                                                             If pg IsNot Nothing Then
                                                                 Using igRecognition = ImGearOCR.Create()
                                                                     Using ocrPage = igRecognition.ImportPage(pg)
                                                                         ocrPage.Recognize()
                                                                         recognizedContent.Add(ocrPage.Text)
                                                                     End Using
                                                                 End Using
                                                             End If
                                                         End Sub)
End Sub
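
The ProcessPageChunk method above gives up page order for simplicity. If the output must follow the original page order, one possible approach is to write each result into a slot indexed by page number instead of adding it to a shared collection. The following is a minimal sketch of that idea only: OrderedResults and the recognize delegate are hypothetical stand-ins for the import-and-recognize step, not part of the ImageGear API.

```csharp
using System;
using System.Threading.Tasks;

public static class OrderedResults
{
    // 'recognize' is a hypothetical stand-in for the import-and-recognize
    // step: it receives a page index and returns that page's text.
    public static string[] RecognizeInOrder(int pageCount, Func<int, string> recognize)
    {
        var results = new string[pageCount];

        // Each iteration writes only to its own slot, so no locking is
        // needed and the array ends up in page order.
        Parallel.For(0, pageCount, i =>
        {
            results[i] = recognize(i);
        });

        return results;
    }
}
```

The trade-off is that the result array must be sized up front, which is straightforward here because the page count is known before processing starts.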

Add the following code to create a public method called Process. This code opens the input file, loads and rasterizes its pages, and calls the ProcessPageChunk method created in the previous step for each chunk until all pages are processed. It then writes the recognized text to the output file. In this example, a maximum of 4 pages will be processed in parallel. Adjust this value to suit the number of cores and threads your CPU supports.

C#

public void Process(FileInfo file)
        {
            string txtFileName = "output.txt";
            int numberOfCores = 4;
           
            // Open the input file; the stream stays open while pages are loaded from it.
            using (var content = new FileStream(file.FullName, FileMode.Open, FileAccess.Read))
            {
                int numberOfPages = ImGearFileFormats.GetPageCount(content, ImGearFormats.UNKNOWN);
                var recognizedPagesContent = new BlockingCollection<string>(numberOfPages);
                var recPagesChunk = new ImGearRasterPage[numberOfCores];
 
                for (int i = 0; i < numberOfPages; i++)
                {
                    // Index to track the current index within the smaller
                    // chunk of pages
                    int chunkIndex = i % numberOfCores;
                    ImGearPage igPage = ImGearFileFormats.LoadPage(content, i);
 
                    // Rasterize the page if it's a vector page
                    if (igPage is ImGearVectorPage)
                    {
                        ImGearPage tempPage = ((ImGearVectorPage)igPage).Rasterize();
                        if (igPage is IDisposable)
                        {
                            (igPage as IDisposable).Dispose();
                        }
                        igPage = tempPage;
                    }
 
                    recPagesChunk[chunkIndex] = (ImGearRasterPage)igPage;
                    if ((chunkIndex == numberOfCores - 1) || (i == numberOfPages - 1))
                    {
                        ProcessPageChunk(recPagesChunk, recognizedPagesContent);
                        recPagesChunk = new ImGearRasterPage[numberOfCores];
                    }
                }
 
                using (var streamWriter = new StreamWriter(txtFileName))
                {
                    foreach (var pageText in recognizedPagesContent)
                    {
                        streamWriter.Write(pageText);
                    }
                }
            }
        }

VB.NET

Public Sub Process(ByVal file As FileInfo)
        Dim txtFileName As String = "output_vb.txt"
        Dim numberOfCores As Integer = 4
 
        Using content = New FileStream(file.FullName, FileMode.Open, FileAccess.Read)
            Dim numberOfPages As Integer = ImGearFileFormats.GetPageCount(content, ImGearFormats.UNKNOWN)
            Dim recognizedPagesContent = New BlockingCollection(Of String)(numberOfPages)
            Dim recPagesChunk = New ImGearRasterPage(numberOfCores - 1) {}
 
            For i As Integer = 0 To numberOfPages - 1
                Dim chunkIndex As Integer = i Mod numberOfCores
                Dim igPage As ImGearPage = ImGearFileFormats.LoadPage(content, i)
 
                If TypeOf igPage Is ImGearVectorPage Then
                    Dim tempPage As ImGearPage = (CType(igPage, ImGearVectorPage)).Rasterize()
 
                    If TypeOf igPage Is IDisposable Then
                        Dim disposableInterface As IDisposable = CType(igPage, IDisposable)
                        disposableInterface.Dispose()
                    End If
 
                    igPage = tempPage
                End If
 
                recPagesChunk(chunkIndex) = CType(igPage, ImGearRasterPage)
 
                If (chunkIndex = numberOfCores - 1) OrElse (i = numberOfPages - 1) Then
                    ProcessPageChunk(recPagesChunk, recognizedPagesContent)
                    recPagesChunk = New ImGearRasterPage(numberOfCores - 1) {}
                End If
            Next
 
            Using streamWriter = New StreamWriter(txtFileName)
 
                For Each pageText In recognizedPagesContent
                    streamWriter.Write(pageText)
                Next
            End Using
        End Using
    End Sub
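
The Process method above hard-codes a chunk size of 4. As a hedged sketch of how that value might be derived instead, the helper below caps the chunk size by both the logical processor count and a caller-supplied limit on pages held in memory. ParallelismSettings, ChooseChunkSize and CreateOptions are hypothetical names introduced here; Environment.ProcessorCount and ParallelOptions.MaxDegreeOfParallelism are standard .NET members.

```csharp
using System;
using System.Threading.Tasks;

public static class ParallelismSettings
{
    // Picks a chunk size no larger than the logical processor count and
    // no larger than a caller-supplied cap (for example, a memory budget
    // expressed as a number of pages).
    public static int ChooseChunkSize(int maxPagesInMemory)
    {
        if (maxPagesInMemory < 1) throw new ArgumentOutOfRangeException("maxPagesInMemory");
        return Math.Max(1, Math.Min(Environment.ProcessorCount, maxPagesInMemory));
    }

    // Builds options that stop a Parallel loop from running more than
    // chunkSize iterations at the same time.
    public static ParallelOptions CreateOptions(int chunkSize)
    {
        return new ParallelOptions { MaxDegreeOfParallelism = chunkSize };
    }
}
```

The resulting options can be passed as the second argument to Parallel.ForEach, for example Parallel.ForEach(recPagesChunk, options, pg => { ... }).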

Calling the Page Processor Class

This section describes how to call the PageProcessorTest class that you created above:

Create your own user interface to enable the user to select a PDF file to process. This example assumes that a file has been selected and a valid file name is available in the filename variable.

Add the following code to call the PageProcessorTest class you created:

C#

PageProcessorTest processor = new PageProcessorTest();
processor.Process(new System.IO.FileInfo(filename));

VB.NET

Dim processor As New PageProcessorTest()
processor.Initialize()
processor.Process(New System.IO.FileInfo(filename))