Today, I had to convert a scanned 3-page PDF file back into an editable document. So, open source software to the rescue. I was able to complete the task with the help of:
- tesseract — for OCR, and
- imagemagick — for converting PDF pages to an image format that tesseract accepts.
Installing the software
sudo apt-get -y install tesseract-ocr imagemagick
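Before moving on, it may be worth checking that both commands actually ended up on your PATH (the imagemagick package provides convert, and tesseract-ocr provides tesseract). Here is a minimal sketch; missing_tools is a made-up helper name for illustration, not part of either package.

```shell
# Print the name of every given command that is not found on PATH.
# (missing_tools is a hypothetical helper name, not part of any package.)
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}
```

Running missing_tools tesseract convert should print nothing if the install went through; any name it does print still needs installing.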
Convert PDF pages to image
convert -density 300 -depth 8 scan.pdf[0] scan0.png
convert -density 300 -depth 8 scan.pdf[1] scan1.png
convert -density 300 -depth 8 scan.pdf[2] scan2.png
convert is a member of the ImageMagick tools. You can use it to convert between image formats as well as resize an image, blur, crop, despeckle, dither, draw on, flip, join, re-sample, and much more.

Here, I’m only using two options:
- -density width: sets the resolution of an image for rendering to devices. The default unit of measure is dots per inch, and the default resolution is 72 dpi.
- -depth value: sets the number of bits in a color sample within a pixel.

The numbers between the brackets mark the page in the PDF document to be converted. Of course, as any programmer can tell you, you start counting at zero.
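For a longer document, typing one convert line per page gets tedious. The sketch below wraps the same command in a loop. It assumes you already know the page count and pass it in yourself, and pdf_pages_to_png is a hypothetical helper name, not an ImageMagick command.

```shell
# Convert the first N pages of a PDF to PNG, one file per page.
# Usage: pdf_pages_to_png FILE.pdf PAGE_COUNT
# (pdf_pages_to_png is a hypothetical helper name; PAGE_COUNT is supplied by you.)
pdf_pages_to_png() {
    pdf=$1
    pages=$2
    base=${pdf%.pdf}          # strip the extension: scan.pdf -> scan
    i=0                       # page indexes start at zero
    while [ "$i" -lt "$pages" ]; do
        # Quote the input so the shell does not treat [N] as a glob pattern.
        convert -density 300 -depth 8 "${pdf}[${i}]" "${base}${i}.png" || return 1
        i=$((i + 1))
    done
}
```

Calling pdf_pages_to_png scan.pdf 3 would run the same convert command for pages 0, 1 and 2, writing scan0.png, scan1.png and scan2.png.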
OCR page images to text
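$ tesseract scan0.png scan0
Tesseract Open Source OCR Engine v3.02.01 with Leptonica
$ tesseract scan1.png scan1
Tesseract Open Source OCR Engine v3.02.01 with Leptonica
$ tesseract scan2.png scan2
Tesseract Open Source OCR Engine v3.02.01 with Leptonica
The second argument is the output base name. tesseract appends .txt to it, so passing scan0 produces scan0.txt (passing scan0.txt would produce scan0.txt.txt).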
Then just copy the OCR text from the text files into a new document, clean up any typos, and reformat it.
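If you would rather start from one combined text file than copy from several, the OCR step can be looped as well. This is only a sketch: ocr_pages is a hypothetical helper name, and it passes tesseract an output name without an extension, since tesseract adds .txt itself.

```shell
# OCR each page image and join the results into one text file.
# Usage: ocr_pages OUTPUT.txt IMAGE...
# (ocr_pages is a hypothetical helper name, not part of tesseract.)
ocr_pages() {
    out=$1
    shift
    : > "$out"                       # start with an empty output file
    for img in "$@"; do
        base=${img%.*}               # scan0.png -> scan0
        # tesseract writes its result to "$base.txt".
        tesseract "$img" "$base" || return 1
        cat "$base.txt" >> "$out"    # append pages in the order given
    done
}
```

For example, ocr_pages document.txt scan0.png scan1.png scan2.png would leave the per-page files in place and collect their text, in page order, in document.txt. Pick an output name that does not clash with the per-page files.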