How to OCR Images and PDF files to a Text-Searchable PDF

This guide explains how to OCR one or more images or PDF files to produce a new, text-searchable PDF.

First, create a PrizmDocServerClient:

var prizmDocServer = new PrizmDocServerClient(/* your connection info */);

Then, call OcrToPdfAsync with one or more local files to have PrizmDoc Server OCR them and produce a text-searchable PDF.

Inputs can be images (JPEG, PNG, TIFF, BMP, and more) or PDFs with image-only page data. The output is always a single, text-searchable PDF.

For example, you can use a single image as input:

ConversionResult result = await prizmDocServer.OcrToPdfAsync("scan.jpeg");

You can also use multiple images as input:

ConversionResult result = await prizmDocServer.OcrToPdfAsync(new ConversionSourceDocument[]
{
    new ConversionSourceDocument("page-1-scan.jpeg"),
    new ConversionSourceDocument("page-2-scan.jpeg"),
    new ConversionSourceDocument("page-3-scan.jpeg")
});

Or you can use a multi-page PDF as input:

ConversionResult result = await prizmDocServer.OcrToPdfAsync("scanned.pdf");

You can even combine these, optionally specifying which pages to use from a particular file:

ConversionResult result = await prizmDocServer.OcrToPdfAsync(new ConversionSourceDocument[]
{
    new ConversionSourceDocument("boilerplate-cover-page.png"),
    new ConversionSourceDocument("contract.pdf", pages: "2-5"),
    new ConversionSourceDocument("affidavit.tiff"),
    new ConversionSourceDocument("meeting-minutes.jpeg"),
});

Whether you use one input or many, the call to OcrToPdfAsync uploads the input files to PrizmDoc Server and asks it to perform OCR on them. The output is a single, text-searchable PDF that contains all of the input pages in order.
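Because the output pages follow the order of the inputs, it can help to build the input list deterministically when the scans live in a folder. The following is a minimal sketch, not part of the SDK: the ScanFolderOcr class, the OcrFolderAsync method, and the folder layout are hypothetical, and it uses only the OcrToPdfAsync, ConversionSourceDocument, and SaveAsync calls shown in this guide. The using directive mirrors the complete example in this guide and may need adjusting for your SDK version.

```csharp
using System;
using System.IO;
using System.Linq;
using System.Threading.Tasks;
using Accusoft.PrizmDocServer;

static class ScanFolderOcr
{
    // Hypothetical helper: OCR every .jpeg file in a folder into one searchable PDF.
    public static async Task OcrFolderAsync(
        PrizmDocServerClient prizmDocServer, string folder, string outputPath)
    {
        // Sort by path so the page order is deterministic.
        // Ordinal sorting places "page-10" before "page-2", so zero-pad
        // page numbers in file names (page-02, page-10) if order matters.
        ConversionSourceDocument[] sources = Directory.GetFiles(folder, "*.jpeg")
            .OrderBy(path => path, StringComparer.Ordinal)
            .Select(path => new ConversionSourceDocument(path))
            .ToArray();

        if (sources.Length == 0)
        {
            throw new InvalidOperationException($"No .jpeg files found in '{folder}'.");
        }

        // Upload the inputs, OCR them, and save the single output PDF locally.
        ConversionResult result = await prizmDocServer.OcrToPdfAsync(sources);
        await result.RemoteWorkFile.SaveAsync(outputPath);
    }
}
```

A call might look like await ScanFolderOcr.OcrFolderAsync(prizmDocServer, "scans", "output.pdf"), where "scans" is an assumed local folder name.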

The returned result is just metadata about the output; the actual output file has not been downloaded yet. To actually download the result from PrizmDoc Server, call result.RemoteWorkFile.SaveAsync:

await result.RemoteWorkFile.SaveAsync("output.pdf");

Or, if you'd prefer instead to download the bytes to a stream, call result.RemoteWorkFile.CopyToAsync:

await result.RemoteWorkFile.CopyToAsync(myStream);
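The myStream variable above can be any writable stream. As an illustration, the sketch below copies the output into a MemoryStream and returns the bytes, which may be convenient when the PDF is passed on to other code rather than written to disk. The OcrOutputDownload class and DownloadToMemoryAsync method are hypothetical names, and the only SDK call used is the CopyToAsync method shown above.

```csharp
using System.IO;
using System.Threading.Tasks;
using Accusoft.PrizmDocServer;

static class OcrOutputDownload
{
    // Hypothetical helper: download the OCR output into memory.
    public static async Task<byte[]> DownloadToMemoryAsync(ConversionResult result)
    {
        using (var memoryStream = new MemoryStream())
        {
            // Copy the remote output file's bytes into the in-memory stream.
            await result.RemoteWorkFile.CopyToAsync(memoryStream);

            // ToArray returns the full contents regardless of stream position.
            return memoryStream.ToArray();
        }
    }
}
```

For large documents, a FileStream is likely a better target than a MemoryStream, since the whole PDF would otherwise be held in memory.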

Here is a complete example:

using System;
using System.IO;
using System.Threading.Tasks;
using Accusoft.PrizmDocServer;

namespace Demos
{
    class Program
    {
        static void Main(string[] args)
        {
            MainAsync().GetAwaiter().GetResult();
        }

        static async Task MainAsync()
        {
            var prizmDocServer = new PrizmDocServerClient(/* your connection info */);

            // OCR an image-only PDF, creating a new PDF:
            ConversionResult result = await prizmDocServer.OcrToPdfAsync("scanned.pdf");
            await result.RemoteWorkFile.SaveAsync("output.pdf");

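            // OCR a collection of JPEG scans, creating a single output PDF.
            // The result variable is reused; declaring it a second time would not compile.
            result = await prizmDocServer.OcrToPdfAsync(new ConversionSourceDocument[]
            {
                new ConversionSourceDocument("scan-page-1.jpg"),
                new ConversionSourceDocument("scan-page-2.jpg"),
                new ConversionSourceDocument("scan-page-3.jpg")
            });
            // Save under a different name so the first output is not overwritten.
            await result.RemoteWorkFile.SaveAsync("output-from-scans.pdf");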
        }
    }
}

There are additional overloads of OcrToPdfAsync which offer more options.

Note that the OcrToPdfAsync methods are convenience wrappers around the lower-level ConvertAsync methods. You can achieve a similar result with a ConvertAsync call like the following (IEnumerable<T> requires using System.Collections.Generic, and Single() requires using System.Linq):

IEnumerable<ConversionResult> results = await prizmDocServer.ConvertAsync("project-proposal.docx", new DestinationOptions(DestinationFileFormat.Pdf)
{
    PdfOptions = new PdfDestinationOptions
    {
        Ocr = new OcrOptions()
        {
            Language = "english"
        }
    }
});
ConversionResult result = results.Single();

See the PrizmDocServerClient API reference for more information.