Distributed OCR

June 10, 2017

OCR, Document Management, AWS, Tesseract

Some estimate that more than 80% of the world's data lies in unstructured form. Often this is free text, but it can also include image-based formats such as PDFs and TIFs of scanned documents. To parse this information for valuable insights, or to reduce the data-entry and audit workload for humans, we must first perform optical character recognition (OCR) on the images. This is a processor-intensive task, and if you have a large corpus of historical data, you could be waiting a long time unless you distribute the work across multiple machines.

To get started with OCR, take a look at the open-source command-line OCR engine Tesseract. Installing it on Ubuntu is as easy as: apt-get install tesseract-ocr. While you're at it, you should also install ImageMagick, as the Tesseract documentation points out that preprocessing your PDFs may significantly impact OCR fidelity.
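If you want to script that preprocessing step, a minimal sketch in Python might look like the following. The specific ImageMagick flags (density, grayscale, bit depth) are illustrative assumptions, not a recommendation; tune them against your own documents.

```python
import subprocess

def preprocess_cmd(src, dst, density=300):
    """Build an ImageMagick `convert` command that rasterizes a PDF
    to a 300-DPI grayscale TIF, a common starting point for OCR.
    The flags here are assumptions -- adjust them for your corpus."""
    return [
        "convert",
        "-density", str(density),   # rasterize PDF pages at this DPI
        src,
        "-type", "Grayscale",       # drop color information
        "-depth", "8",              # 8 bits per pixel
        dst,
    ]

def preprocess(src, dst):
    # Requires ImageMagick's `convert` binary on the PATH.
    subprocess.run(preprocess_cmd(src, dst), check=True)
```

Wrapping the command construction in its own function makes it easy to log or dry-run the exact invocation before unleashing it on millions of pages.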

After you've experimented with ImageMagick to crop noisy footers, adjust contrast, or convert a PDF into a TIF, you can feed the document to Tesseract to produce a searchable PDF, a txt file, or even an hOCR file containing additional structured information such as the bounding boxes for each token. However, you will notice this takes some time for any document longer than a couple of pages. This is where GNU Parallel helps out. Parallel lets you pipe jobs on the command line to take advantage of multicore architectures. For example, let's tell Tesseract to OCR (in English) all the TIF files under the current directory into the hOCR format, using all cores:

find ./ -type f -name '*.tif' \
  | parallel -j0 tesseract -l eng {} {.} hocr

This is fine when you have a few dozen files to convert to text; beyond that, you will find even large machines bogged down. Faced with a repository of 2 million documents, I couldn't wait for weeks with my workstation tied up, so I increased the parallelization by distributing the work across multiple machines: 40 c3.8xlarge instances on AWS. Initially, I wanted to spread the documents over 26 servers (one for each letter of the alphabet), because I could use the file naming convention and awscli's pattern matching to quickly move the documents to their respective servers. However, the uneven distribution of work becomes a problem: some machines are overloaded while others finish shortly after starting.

Since I began with an S3 bucket of unprocessed files, I used awscli to get the names of the documents. Because the file names contain PHI and SQS is not HIPAA compliant, I chunked the list into smaller lists stored in S3 under names containing no PHI. Next, I created an SQS queue to pass the chunk references, so that an OCR worker can pull a message from SQS, retrieve the associated list from S3, and parse it for attachment file names in another bucket to pull for OCR.
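The chunking step above can be sketched in a few lines of Python. The chunk size and the "manifests/" key prefix are assumptions for illustration; the important property is that each manifest is named by index only, so no PHI-bearing file name ever appears in an SQS message body.

```python
import json

def chunk(names, size=100):
    """Split the full list of S3 keys into fixed-size chunks."""
    return [names[i:i + size] for i in range(0, len(names), size)]

def manifest_keys(chunks, prefix="manifests/"):
    """Name each chunk by its index only -- the SQS message carries
    just this key, and workers fetch the real file names from S3."""
    return [f"{prefix}chunk-{i:05d}.json" for i in range(len(chunks))]
```

Each chunk would then be serialized with json.dumps, uploaded to S3 under its manifest key, and that key sent as the body of an SQS message (via boto3's put_object and send_message, respectively).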

After setting up the SQS queue, I configured the machines that perform the work. Each instance uses an AMI with the necessary software: tesseract, boto3, imagemagick, etc. On startup, each retrieves and executes a script from S3. Specifically, each instance mounts EBS volumes for additional storage, pulls messages from SQS, retrieves the files listed in each message, preprocesses the image files, performs the OCR, copies the results to the appropriate bucket, and grabs new messages until there are none left, using all available cores and self-terminating when no documents remain. Check out the scripts here.
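The worker loop described above can be sketched with boto3 roughly as follows. The queue URL and bucket names are placeholders, and this single-threaded version omits the multicore fan-out and preprocessing for brevity; it is a sketch of the control flow, not the actual scripts.

```python
import json
import subprocess

QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/ocr-jobs"  # placeholder
DOC_BUCKET = "my-docs"        # hypothetical bucket holding input images
OUT_BUCKET = "my-ocr-output"  # hypothetical bucket for manifests and results

def hocr_key(tif_key):
    """Map an input key like 'docs/a.tif' to its hOCR output key."""
    return tif_key.rsplit(".", 1)[0] + ".hocr"

def work():
    import boto3  # imported here so the helper above runs without AWS deps
    sqs = boto3.client("sqs")
    s3 = boto3.client("s3")
    while True:
        resp = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=1)
        msgs = resp.get("Messages", [])
        if not msgs:
            break  # queue drained: the instance can self-terminate
        msg = msgs[0]
        manifest_key = msg["Body"]  # PHI-free manifest key, not file names
        obj = s3.get_object(Bucket=OUT_BUCKET, Key=manifest_key)
        for key in json.loads(obj["Body"].read()):
            local = "/mnt/work/" + key.rsplit("/", 1)[-1]
            s3.download_file(DOC_BUCKET, key, local)
            base = local.rsplit(".", 1)[0]
            # tesseract writes <base>.hocr next to the input file
            subprocess.run(["tesseract", "-l", "eng", local, base, "hocr"],
                           check=True)
            s3.upload_file(base + ".hocr", OUT_BUCKET, hocr_key(key))
        sqs.delete_message(QueueUrl=QUEUE_URL,
                           ReceiptHandle=msg["ReceiptHandle"])
```

Deleting the message only after its whole manifest is processed means a crashed worker's message reappears after the visibility timeout, so another instance picks up the batch.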