OCR With Python
In many cases the data that you want to report on is already in a digital format.
But what if that is not the case?
Imagine a PDF that contains an image of text instead of the actual written words.
Now think about typing in all of that data by hand. Sounds like a monotonous, boring task, doesn’t it?
No one likes manual work, and there are several ways around it.
We will discuss one of them in this blog, and it is a pretty simple solution as well!
It’s called Optical Character Recognition, or OCR for short.
It is a technology that converts images, PDFs and similar files into digital text that can be searched, edited or, in our case, saved.
We will need the following toolkit:
- Python
- Tesseract-OCR
I’m assuming Python is already installed on your system; there are plenty of installation guides online. For Tesseract I’ll write a short guide, as its installation is not always as straightforward.
Installation of Tesseract
INSTALLATION
An installer for Tesseract can be found here:
https://github.com/UB-Mannheim/tesseract/wiki
ADD TO PATH
! The next step is important for an easier working environment.
To call Tesseract from anywhere, add its installation folder to the PATH environment variable of your system:
Advanced system settings > Advanced > Environment variables > PATH > New
‘C:\Program Files\Tesseract-OCR’
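To check from Python whether the PATH change took effect, a small sketch like the one below may help. It only uses the standard library. The helper name find_tesseract is made up for this example, and the fallback path assumes the default install location mentioned above.

```python
import shutil
from pathlib import Path

# Assumed default install location on Windows; adjust if you installed elsewhere.
DEFAULT_WINDOWS_PATH = Path(r"C:\Program Files\Tesseract-OCR\tesseract.exe")


def find_tesseract(fallback=DEFAULT_WINDOWS_PATH):
    """Return the path to the tesseract executable as a string, or None if not found."""
    # shutil.which searches the folders listed in the PATH environment variable.
    on_path = shutil.which("tesseract")
    if on_path:
        return on_path
    # Not on PATH: check whether the executable exists at the fallback location.
    if Path(fallback).is_file():
        return str(fallback)
    return None


if __name__ == "__main__":
    location = find_tesseract()
    if location:
        print("Tesseract found at:", location)
    else:
        print("Tesseract not found; check your PATH setting.")
```

If Tesseract is installed but not on PATH, pytesseract also lets you point to the executable directly by setting pytesseract.pytesseract.tesseract_cmd to its full path.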
TRAINING SET
OCR isn’t an out-of-the-box solution: a model has to be ‘trained’ to recognize the characters of a language. Luckily, pre-trained models are already available; these are called training sets.
It is advisable to have a training set in the language of your document. A training set is a file with the extension ‘.traineddata’.
The pre-trained models are available here: https://github.com/tesseract-ocr/tessdata
! It is important to place them in the correct folder: the ‘tessdata’ folder inside your Tesseract installation directory.
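To see which training sets are actually in place, you could list the ‘.traineddata’ files in the tessdata folder. The sketch below is one way to do that with the standard library; the function name installed_languages and the folder path in the usage part are assumptions for illustration.

```python
from pathlib import Path


def installed_languages(tessdata_dir):
    """Return the language codes for which a .traineddata file exists in tessdata_dir."""
    folder = Path(tessdata_dir)
    if not folder.is_dir():
        return []
    # 'eng.traineddata' -> 'eng'
    return sorted(p.stem for p in folder.glob("*.traineddata"))


if __name__ == "__main__":
    # Assumed location of the tessdata folder for a default Windows install.
    print(installed_languages(r"C:\Program Files\Tesseract-OCR\tessdata"))
```

The codes in the returned list are the values you can later pass as the language parameter, for example ‘eng’.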
TESSERACT PACKAGE
To access Tesseract from Python we install the Python package ‘pytesseract’.
To create an image object in Python we need the ‘Pillow’ package.
These two packages can be installed with pip (pip install pytesseract Pillow) or with any other package manager.
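A quick way to confirm that both packages are available to your interpreter is to ask Python whether it can find them. This is only a sketch; missing_packages is a hypothetical helper name. Note that Pillow is imported under the module name PIL.

```python
import importlib.util


def missing_packages(module_names=("pytesseract", "PIL")):
    """Return the module names from module_names that cannot be found by the interpreter."""
    return [name for name in module_names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Please install:", ", ".join(missing))
    else:
        print("pytesseract and Pillow are available.")
```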
Example in Python
Okay, let’s get to the real work. As things can get complicated really fast, I have created a short and simple guide on how to convert an image into text.
First, in our Python file, we import the packages we just installed.
You can do this with the following statements:
import pytesseract as pt
from PIL import Image
We will load the image that we want to convert to text into the variable ‘image’.
(Assuming ‘list.png’ is in the same folder as our script.)
image = Image.open('list.png')
Then we call the magical function that converts this image to a string.
We pass the image object that we just created and the required language training set as parameters.
Additionally, you can pass extra options for Tesseract depending on the image you provide.
Finally, you can use the print statement to view your data:
content = pt.image_to_string(image, lang='eng')
print('output:', content)
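The extra options mentioned above are passed to Tesseract as a single string through the config parameter. As a rough sketch, a small helper can assemble that string; build_config is a made-up name, and the two flags shown (--psm for the page segmentation mode and --oem for the OCR engine mode) are standard Tesseract command-line options.

```python
def build_config(psm=None, oem=None):
    """Build a Tesseract option string for pytesseract's config argument."""
    options = []
    if psm is not None:
        options.append(f"--psm {psm}")  # page segmentation mode
    if oem is not None:
        options.append(f"--oem {oem}")  # OCR engine mode
    return " ".join(options)


# Usage, assuming pytesseract is imported as pt and image is the object opened earlier:
# content = pt.image_to_string(image, lang="eng", config=build_config(psm=6))
```

Page segmentation mode 6 tells Tesseract to treat the image as a single uniform block of text, which tends to suit simple lists.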
This was our original image:
And this is the text recognized by our code.
Now that we have the data, we can add some code to our script to insert this data into a table in a database.
Of course this is just a small example to show you how fast and easy a proof of concept can be developed.
You will see that OCR can be terrifyingly accurate when using a screenshot.
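The step of inserting the recognized text into a database table could look roughly like the sketch below. It assumes every useful line has the form ‘name: value’. To keep the example self-contained it uses Python’s built-in sqlite3 module as a stand-in for a real database; the table name ocr_text and both function names are made up for this example.

```python
import sqlite3


def parse_lines(text):
    """Split OCR output into (name, value) pairs, skipping lines without both parts."""
    rows = []
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        if sep and name.strip() and value.strip():
            rows.append((name.strip(), value.strip()))
    return rows


def store_rows(connection, rows):
    """Insert the pairs into a hypothetical ocr_text table using bound parameters."""
    connection.execute("CREATE TABLE IF NOT EXISTS ocr_text (name TEXT, value TEXT)")
    # The ? placeholders let the database driver handle quoting of the values.
    connection.executemany("INSERT INTO ocr_text VALUES (?, ?)", rows)
    connection.commit()


if __name__ == "__main__":
    # Made-up text standing in for real OCR output.
    sample = "apples: 3\nnot a data line\npears: 5"
    conn = sqlite3.connect(":memory:")
    store_rows(conn, parse_lines(sample))
    print(conn.execute("SELECT name, value FROM ocr_text ORDER BY rowid").fetchall())
```

Binding the values as parameters rather than pasting them into the SQL string keeps stray quotes in the OCR output from breaking the statement.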
BUT WHAT ABOUT SCANNED DOCUMENTS?
From my own experience, I know that OCR is less accurate on ‘out-of-the-box’ scanned documents.
This is because scans contain more noise, which makes recognizing the characters more difficult.
In addition, the lines of text in a scanned document are not always parallel with the edges of the page; they are basically never perfectly horizontal.
But never say never: with a little bit of tweaking it’s possible to detect the angle, rotate the image accordingly and limit the noise around the characters. After these steps, the results are pretty accurate for scanned documents as well.
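As an illustration of this kind of tweaking, here is a minimal sketch that uses only Pillow. It is not the exact method I used. It assumes the skew angle is already known or estimated elsewhere (detecting it usually takes an extra library), and the function name preprocess_scan and the default threshold are assumptions for this example.

```python
from PIL import Image, ImageFilter, ImageOps


def preprocess_scan(image, angle=0.0, threshold=128):
    """Return a cleaned-up grayscale copy of a scanned page for OCR.

    angle is the counter-clockwise rotation in degrees needed to level the text;
    it is assumed to be known or estimated elsewhere.
    """
    gray = ImageOps.grayscale(image)
    # A 3x3 median filter removes isolated specks of noise.
    smooth = gray.filter(ImageFilter.MedianFilter(size=3))
    # Rotate so the text lines become horizontal; fill the new corners with white.
    level = smooth.rotate(angle, resample=Image.BICUBIC, expand=True, fillcolor=255)
    # Binarize: pixels brighter than the threshold become white, the rest black.
    return level.point(lambda p: 255 if p > threshold else 0)


# Usage, assuming a scanned page saved as 'scan.png' that is tilted by about 2 degrees:
# cleaned = preprocess_scan(Image.open("scan.png"), angle=2.0)
```

The cleaned image can then be passed to image_to_string in the same way as the screenshot earlier.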
As I said, this blog can be a first step into Optical Character Recognition with Python. Feel free to experiment with it and free yourself from (some) manual work. I hope this helped you out or at least inspired you to tackle problems in your business. If you want to know more, please get in touch with us.
Lastly, I will provide the entire code I used for this small script so you can use it to start in pole position:
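# Import libraries
import pytesseract as pt
from PIL import Image
import pyhdb

# Create a connection to the database
# ('server', port, 'USERNAME' and 'PASSWORD' are placeholders for your own connection details)
connection = pyhdb.connect(host='server', port=port, user='USERNAME', password='PASSWORD')
cursor = connection.cursor()

# Open the image you want to convert
image = Image.open('list.png')

# '--psm 6' tells Tesseract to treat the image as a single uniform block of text
content = pt.image_to_string(image, lang='eng', config='--psm 6')
print('output:', content)

# Each line is expected to look like 'name: value'
for line in content.split('\n'):
    try:
        output = line.split(':')
        # Values are interpolated directly into the SQL, so only use this with trusted input
        sql = f"INSERT INTO SCHEMA.OCR_TEXT VALUES ('{output[0]}', {output[1]})"
        cursor.execute(sql)
        connection.commit()
    except Exception:
        # Skip lines that do not match the expected format or fail to insert
        continue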