OCR Text from PDF Document

✎ Last updated on 2014-07-11 at 12:40 EDT

Today, I had to convert a scanned 3-page PDF file back into an editable document. So, open source software to the rescue. I was able to complete the task with the help of:

tesseract — for OCR, and
imagemagick — for converting PDF pages to an image format that tesseract accepts.

Installing the software

sudo apt-get -y install tesseract-ocr imagemagick

Convert PDF pages to image

```
convert -density 300 -depth 8 scan.pdf[0] scan0.png
convert -density 300 -depth 8 scan.pdf[1] scan1.png
convert -density 300 -depth 8 scan.pdf[2] scan2.png
```
convert is part of the ImageMagick tool suite. You can use it to convert between image formats as well as to resize, blur, crop, despeckle, dither, draw on, flip, join, and resample images, and much more.

Here, I’m only using two options:

-density width
to set the resolution of an image for rendering to devices. The default unit of measure is dots per inch. The default resolution is 72 dpi.

-depth value
to set the number of bits in a color sample within a pixel.

The numbers between the brackets mark the page in the PDF document to be converted. Of course, as any programmer can tell you, you start counting at zero.
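
With more than a handful of pages, typing one convert line per page gets tedious. Here is a minimal sketch of a loop that issues the same commands. The function name pdf_pages_to_png and the RUN variable are my own inventions for illustration, and it assumes you pass the page count yourself.

```shell
# Convert the first N pages of a PDF to PNG files, one image per page.
# Usage: pdf_pages_to_png scan.pdf 3
# RUN is a hypothetical dry-run switch: set RUN=echo to print the
# commands instead of running them.
pdf_pages_to_png() {
    pdf=$1
    pages=$2
    base=${pdf%.pdf}    # scan.pdf -> scan
    i=0                 # page numbers start at zero
    while [ "$i" -lt "$pages" ]; do
        ${RUN:-} convert -density 300 -depth 8 "${pdf}[${i}]" "${base}${i}.png"
        i=$((i + 1))
    done
}
```

For example, RUN=echo pdf_pages_to_png scan.pdf 3 prints the three convert commands for pages 0 to 2 without touching any files.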

OCR page images to text

$ tesseract scan0.png scan0
Tesseract Open Source OCR Engine v3.02.01 with Leptonica
$ tesseract scan1.png scan1
Tesseract Open Source OCR Engine v3.02.01 with Leptonica
$ tesseract scan2.png scan2
Tesseract Open Source OCR Engine v3.02.01 with Leptonica

tesseract takes an output base name as its second argument and appends .txt to it, so the results land in scan0.txt, scan1.txt, and scan2.txt. Then just copy the OCR text from those files into a new document to clean up any typos and reformat it.
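
The OCR step can be looped in the same way. This is only a sketch: ocr_pages and the RUN dry-run variable are names I made up, and it assumes the page images are PNG files. It relies on tesseract appending .txt to the output base name it is given.

```shell
# Run tesseract on each PNG given on the command line.
# Usage: ocr_pages scan*.png
# RUN is a hypothetical dry-run switch: set RUN=echo to print the
# commands instead of running them.
ocr_pages() {
    for img in "$@"; do
        if [ ! -e "$img" ]; then
            echo "skipping $img: not found" >&2
            continue
        fi
        # scan0.png -> output base scan0, which tesseract writes as scan0.txt
        ${RUN:-} tesseract "$img" "${img%.png}"
    done
}
```

Afterwards, cat scan*.txt > scan-all.txt would gather the pages into one file for editing (scan-all.txt is just an example name).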

dircolors for better ls listing

✎ Last updated on 2014-12-25 at 22:07 EST

ls uses the environment variable LS_COLORS to determine the colors in which filenames are displayed. This environment variable is usually set by a command in the .bashrc file such as
eval "$(dircolors some_path/dir_colors)"
To create a customized .dircolors file, use the command

dircolors -p > .dircolors

and then edit the .dircolors file.

For example, to show files with execute permission in red, change the EXEC line to: EXEC 00;31
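
If you would rather script that edit than open an editor, something like the following could work. It is a rough sketch: set_dircolors_entry is a made-up helper, it assumes simple keywords such as EXEC or DIR that contain no regular expression characters, and it drops any trailing comment on the line it rewrites.

```shell
# Replace the color codes of one keyword in a dircolors file.
# Usage: set_dircolors_entry ~/.dircolors EXEC '00;31'
set_dircolors_entry() {
    file=$1
    key=$2
    value=$3
    tmp=$(mktemp) || return 1
    # Rewrite only the line that starts with the keyword; copy the rest as-is.
    if sed "s/^${key}[[:space:]].*/${key} ${value}/" "$file" > "$tmp"; then
        cat "$tmp" > "$file"    # write back in place, keeping file permissions
        status=$?
    else
        status=1
    fi
    rm -f "$tmp"
    return "$status"
}
```

After changing the file, re-run the eval line from .bashrc (or open a new terminal) so that LS_COLORS picks up the new value.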

The comments in the generated .dircolors file already list the color codes.
ISO 6429 color sequences are composed of sequences of numbers separated by semicolons. The most common codes are:

Attribute Codes:
00	none — to restore default color
01	bold — for brighter colors
04	underscore — for underlined text
05	blink — for flashing text
07	reverse — to reverse background and foreground colors
08	concealed — to hide text

Text Color Codes:		Background Color Codes:
30	for black foreground	40	for black background
31	for red foreground	41	for red background
32	for green foreground	42	for green background
33	for orange foreground	43	for orange background
34	for blue foreground	44	for blue background
35	for purple foreground	45	for purple background
36	for cyan foreground	46	for cyan background
37	for gray foreground	47	for gray background

Extra Text Color Codes:		Extra Background Color Codes:
90	dark gray	100	dark gray background
91	light red	101	light red background
92	light green	102	light green background
93	yellow	103	yellow background
94	light blue	104	light blue background
95	light purple	105	light purple background
96	turquoise	106	turquoise background
97	white	107	white background
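
To see what a code combination looks like before putting it in .dircolors, you can print some text wrapped in the matching escape sequence. This is a small sketch; preview_color is a name I picked, and the exact shade depends on your terminal's palette.

```shell
# Print TEXT using the given semicolon-separated codes, then reset.
# Usage: preview_color '00;31' 'red text'
preview_color() {
    # ESC [ <codes> m switches the attributes on; ESC [ 0 m resets them.
    printf '\033[%sm%s\033[0m\n' "$1" "$2"
}
```

For instance, preview_color '01;34' directory prints the word in bold blue, and preview_color '30;43' warning prints black text on code 43's background.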

Install TrueType fonts in Linux System

✎ Last updated on 2020-01-29 at 11:23 EST

Copy the font file to the system font directory at
/usr/share/fonts/truetype

or for a specific user, put the font files in
/home/<username>/.local/share/fonts

Then refresh the font cache with
fc-cache -f -v

Note: fc-cache is part of the fontconfig package.
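
The per-user steps can be wrapped in a small function. Treat this as a sketch: install_user_font is a hypothetical name, and it assumes the per-user font directory given above. It skips the cache refresh if fc-cache is not installed.

```shell
# Copy one or more font files into the per-user font directory
# and refresh the font cache.
# Usage: install_user_font MyFont.ttf
install_user_font() {
    if [ "$#" -eq 0 ]; then
        echo "usage: install_user_font FONT_FILE..." >&2
        return 1
    fi
    font_dir=$HOME/.local/share/fonts
    mkdir -p "$font_dir" || return 1
    cp "$@" "$font_dir/" || return 1
    # fc-cache comes from the fontconfig package; skip quietly if missing.
    if command -v fc-cache >/dev/null 2>&1; then
        fc-cache -f -v "$font_dir"
    fi
}
```

To check that the font was picked up, fc-list (also part of fontconfig) lists the fonts the system knows about.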

Changing LightDM Login Background

Edit the file /etc/lightdm/unity-greeter.conf

Replace the image path after the background= key with the full path of the new background image.

e.g.

#background=/usr/share/backgrounds/warty-final-ubuntu.png
background=/home/panda/Pictures/green-bamboo.jpg
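
The same edit can be scripted. The function below is a sketch under a few assumptions: set_greeter_background is a made-up name, the file already contains an active background= line, and you run it from a shell that is allowed to write the file (for /etc that means root). It comments out the old line and adds the new one right after it.

```shell
# Comment out the current background= line and add a new one below it.
# Usage: set_greeter_background /etc/lightdm/unity-greeter.conf /path/to/image.jpg
set_greeter_background() {
    conf=$1
    image=$2
    tmp=$(mktemp) || return 1
    if awk -v img="$image" '
        /^background=/ { print "#" $0; print "background=" img; found = 1; next }
        { print }
        END { exit (found ? 0 : 1) }
    ' "$conf" > "$tmp"; then
        cat "$tmp" > "$conf"    # write back in place, keeping permissions
        status=$?
    else
        echo "no background= line found in $conf" >&2
        status=1
    fi
    rm -f "$tmp"
    return "$status"
}
```

It is worth keeping a backup copy of the original file before trying this on a real system.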

Adding Desktop Entries to System

Desktop entry files provide information about items in menus.

Desktop entry files must have a .desktop file extension and must reside in the applications subdirectory of a directory listed in $XDG_DATA_DIRS. If $XDG_DATA_DIRS is not set, the default /usr/local/share/:/usr/share/ is used.

User-specific desktop entries may be placed in $XDG_DATA_HOME/applications, which is searched first. If $XDG_DATA_HOME is not set, the default path ~/.local/share is used. Desktop entries are then collected from all directories in the $XDG_DATA_DIRS environment variable. Directories that appear first in $XDG_DATA_DIRS take precedence when several .desktop files have the same name.
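
To make the search order concrete, here is a sketch that prints the applications directories in the order described above. desktop_entry_dirs is a hypothetical helper, and the fallback for $XDG_DATA_DIRS is the one given in the XDG Base Directory specification (/usr/local/share/:/usr/share/).

```shell
# Print the directories searched for .desktop files, highest precedence first.
desktop_entry_dirs() {
    data_home=${XDG_DATA_HOME:-$HOME/.local/share}
    data_dirs=${XDG_DATA_DIRS:-/usr/local/share/:/usr/share/}
    # The per-user directory is searched first.
    printf '%s\n' "${data_home%/}/applications"
    # Then each colon-separated entry of XDG_DATA_DIRS, in order.
    printf '%s\n' "$data_dirs" | tr ':' '\n' | while IFS= read -r dir; do
        if [ -n "$dir" ]; then
            printf '%s\n' "${dir%/}/applications"
        fi
    done
}
```

Running desktop_entry_dirs on your own machine shows where a new .desktop file can go and which copy wins when names collide.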

So, if you drop a correctly formatted .desktop file in any of the above mentioned locations, a new launcher icon will appear in the menu hierarchy as specified by the keywords in the .desktop file.

freedesktop.org maintains base platform software and specifications for desktop software on Linux and UNIX. For more information on desktop entry files, see the latest Desktop Entry Specification at freedesktop.org.

Here’s a sample desktop entry file for Eclipse:

[Desktop Entry]
Version=1.0
Name=Eclipse
GenericName=Eclipse IDE
Comment=Eclipse IDE
Exec=/home/puppychau/bin/eclipse/eclipse %F
Icon=/home/puppychau/bin/eclipse/icon.xpm
Type=Application
Terminal=false
Categories=Development;IDE
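
Installing an entry like this for the current user only takes a copy into the per-user applications directory. A minimal sketch, assuming the entry is saved as a file such as eclipse.desktop; install_desktop_entry is a name I made up.

```shell
# Copy a .desktop file into the per-user applications directory.
# Usage: install_desktop_entry eclipse.desktop
install_desktop_entry() {
    src=$1
    dest=${XDG_DATA_HOME:-$HOME/.local/share}/applications
    case $src in
        *.desktop) ;;
        *)
            echo "error: $src must have a .desktop extension" >&2
            return 1
            ;;
    esac
    mkdir -p "$dest" && cp "$src" "$dest/"
}
```

If the desktop-file-utils package is installed, desktop-file-validate eclipse.desktop can check the file for errors before you copy it.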
