CyberData India

Clean Up Scanned Images for Quality OCR

Press Release   •   Sep 03, 2010 04:49 EDT

With the advent of eBooks, most publishers are rushing to convert hardcopy books. The convenience of reading on a small hand-held device called an eBook Reader is unmatched. Several eBook Readers are on the market, such as Amazon’s Kindle, Sony’s Reader and Barnes & Noble’s Nook.

Most eBook Readers support the .epub format, which means hardcopy books need to be converted to this format. The first step in the conversion is to scan the book to an image format such as JPG or TIF.

The image is then converted to a formatted text file with embedded graphics. The cheapest way to do this conversion is with OCR software (ABBYY’s FineReader is a good example). The accuracy of the OCR process is not very high: every little speck interspersed in the text block is read as punctuation by the software, and a smudge may be read as a word that is irrelevant to the text.

If the scanned images are cleaned before the OCR is run, the resulting OCR output is of higher quality. Images can be cleaned up using any standard image editing software. My favourite is IrfanView, and it’s free!

Image cleanup involves processes such as deskewing, despeckling and adjusting contrast/brightness, which can be run as a batch on all the images. In addition to the batch processing, every page image needs to be viewed for inconsistencies and corrected manually.
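
As a rough illustration of two of these batch steps, here is a minimal Python sketch of a contrast stretch and a despeckle pass. It is not how IrfanView or FineReader work internally. The function names `stretch_contrast` and `despeckle`, the `min_pixels` threshold and the representation of a page as nested lists of pixel values are all assumptions made for this example, and deskew is left out.

```python
from collections import deque


def stretch_contrast(gray):
    """Rescale grey levels (0-255) so the darkest pixel maps to 0 and the lightest to 255."""
    flat = [value for row in gray for value in row]
    if not flat:
        return []
    lo, hi = min(flat), max(flat)
    if hi == lo:
        # A flat page has no contrast to stretch; return an unchanged copy.
        return [row[:] for row in gray]
    return [[(value - lo) * 255 // (hi - lo) for value in row] for row in gray]


def despeckle(page, min_pixels=3):
    """Return a copy of a binary page (1 = ink, 0 = paper) with small ink blobs erased.

    min_pixels is a hypothetical tuning knob: blobs with fewer pixels are
    treated as specks.
    """
    height = len(page)
    width = len(page[0]) if height else 0
    cleaned = [row[:] for row in page]
    seen = [[False] * width for _ in range(height)]

    for y in range(height):
        for x in range(width):
            if page[y][x] != 1 or seen[y][x]:
                continue
            # Flood-fill one connected blob of ink (8-neighbour connectivity).
            blob = []
            queue = deque([(y, x)])
            seen[y][x] = True
            while queue:
                cy, cx = queue.popleft()
                blob.append((cy, cx))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = cy + dy, cx + dx
                        if (0 <= ny < height and 0 <= nx < width
                                and page[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            # Erase the blob if it is too small to be part of a character.
            if len(blob) < min_pixels:
                for by, bx in blob:
                    cleaned[by][bx] = 0
    return cleaned


if __name__ == "__main__":
    # Tiny made-up page: a three-pixel stroke and one stray speck.
    page = [
        [0, 0, 0, 0, 0],
        [0, 1, 1, 1, 0],
        [0, 0, 0, 0, 0],
        [1, 0, 0, 0, 0],
    ]
    for row in despeckle(page, min_pixels=3):
        print(row)
```

A threshold set too high would also erase genuine periods and the dots on letters, which is one reason every page still needs a manual check after the batch run.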

This involves huge labor costs if done in-house, but we can do it at a fraction of that cost. We have worked on millions of images, cleaning them and converting them to text.

CyberData India is an information management company with fifteen years of experience in India's data entry, image editing, indexing and OCR cleanup business, which gives us an edge over others. Our quality control processes are ISO 9001:2008 certified. For more information, visit us at http://www.edatashop.com.