Optical character recognition (OCR) con Tesseract, OpenCV y Python

Contents

This article was published as part of the Data Science Blogathon

Intro

U.S, the humans, we read texts almost every minute of our life. Wouldn't it be great if our machines or systems could also read text like we do? But the most important question is “How do we make our machines read”? This is where Optical Character Recognition comes in. (OCR).

Optical character recognition (OCR)

Optical character recognition (OCR) is a technique of reading or capturing text from printed or scanned photographs, handwritten images and converting them into an editable and searchable digital format.

Applications

OCR has many applications in today's business. Some of them are listed below:

  • Passport recognition at airports
  • Data entry automation
  • License plate recognition
  • Extract business card information into a contact list
  • Converting handwritten documents into electronic images
  • Create searchable PDF files
  • Create audible files (text to audio)

Some of the open source OCR tools are Tesseract, OCRopus.

In this article, we will focus on Tesseract OCR. And to read the images we need OpenCV.

Installing Tesseract OCR:

Download the latest installer for Windows 10 from “https://github.com/UB-Mannheim/tesseract/wiki“. Run the .exe file once it is downloaded.

Note: Don't forget to copy the software installation path from file. We will need it later as we need to add the path of the tesseract executable in the code if the installation directory is different from the default.

The typical installation path on Windows systems is C: Program files.

Then, in my case, it is “C: Program Files Tesseract-OCRtesseract.exe“.

Then, to install Python container for Tesseract, open command prompt and run command “pip instalar pytesseract“.

OpenCV

OpenCV (Open Source Computer Vision) is an open source library for image processing applications, machine learning and computer vision.

OpenCV-Python is the Python API for OpenCV.

To install it, open command prompt and run command “pip instalar opencv-python“.

Create sample OCR script

1. Read a sample image

import cv2

Read the image using the cv2.imread method () and save it in a variable "img".

img = cv2.imread("image.jpg")

If required, resize image using cv2.resize method ()

img = cv2.resize(img, (400, 400))

Show the image using the cv2.imshow method ()

cv2.imshow("Image", img)

Show the window infinitely (to prevent the kernel from crashing)

cv2.waitKey(0)

Close all open windows

cv2.destroyAllWindows()

2. Chain image conversion

import pytesseract

Set path of tesseract in code

pytesseract.pytesseract.tesseract_cmd=r'C:Program FilesTesseract-OCRtesseract.exe'

The following error occurs if we do not set the path.

39840error-7169515

To convert an image to a string, use pytesseract.image_to_string (img) and save it in a variable “text”

text = pytesseract.image_to_string(img)

print the result

print(text)

Complete code:

import cv2
import pytesseract
pytesseract.pytesseract.tesseract_cmd=r'C:Program FilesTesseract-OCRtesseract.exe'
img = cv2.imread("image.jpg")
img = cv2.resize(img, (400, 450))
cv2.imshow("Image", img)
text = pytesseract.image_to_string(img)
print(text)
cv2.waitKey(0)
cv2.destroyAllWindows()

The output of the above code:

58994output-7193438

The output of the above code

If we look at the result, the main quote is perfectly extracted, but you don't get the philosopher's name and the text at the bottom of the image.

To extract text accurately and avoid precision drop, we must pre-process the image. I found this article (https://towardsdatascience.com/pre-processing-in-ocr-fc231c6035a7) quite useful. Please refer to it to better understand preprocessing techniques.

Perfect! Now that we have the required basics, let's look at some simple OCR applications.

1. Creating word clouds in review images

The word cloud is a visual representation of the frequency of words. The bigger the word appears in a word cloud, the word is most commonly used in the text.

For this, I took some Amazon review snapshots for the Apple iPad 8th Generation product.

Sample picture

7429015-8893161
Sample review image

Steps:

  1. Create a list of all available review images
  2. If required, view images using cv2.imshow method ()
  3. Read text from images using pytesseract
  4. Create a data frame
  5. Pre-process the text: remove special characters, stop words
  6. Build positive and negative word clouds

Paso 1: creates a list of all available review images

import os
folderPath = "Reviews"
myRevList = os.listdir(folderPath)

Paso 2: If required, view images using cv2.imshow method ()

for image in  myRevList:
    img = cv2.imread(f'{folderPath}/{image}')
    cv2.imshow("Image", img)
    cv2.waitKey(0)
    cv2.destroyAllWindows()

Paso 3: read text from images using pytesseract

import cv2
import pytesseract
pytesseract.pytesseract.tesseract_cmd=r'C:Program FilesTesseract-OCRtesseract.exe'
corpus = []
for images in myRevList:
    img = cv2.imread(f'{folderPath}/{images}')
    if img is None:
        corpus.append("Could not read the image.")
    else:
        rev = pytesseract.image_to_string(img)
        corpus.append(rev)
list(corpus)
corpus
30320pytesseract_output-8845577
The output of the above code

Paso 4: create a data frame

import pandas as pd
data = pd.DataFrame(list(corpus), columns=['Review'])
data
25529dataframe-9579722

Paso 5: preprocess the text: remove special characters, empty words

#removing special characters
import re
def clean(text):
    return re.sub('[^ A-Za-z0-9" "]+', ' ', text)
data['Cleaned Review'] = data['Review'].apply(clean)
data
25761cleaned-5396591

Removing Stop Words from the 'Clean Review’ and adding all the remaining words to a list variable “final_list”.

  1. # removing stopwords
    import nltk
    from nltk.corpus import stopwords
    nltk.download("Point")
    from nltk import word_tokenize
    stop_words = stopwords.words('english')
    
    final_list = []
    for column in data[['Cleaned Review']]:
        columnSeriesObj = data[column]
        all_rev = columnSeriesObj.values
    
        for i in range(len(all_rev)):
            tokens = word_tokenize(all_rev[i])
            for word in tokens:
                if word.lower() not in stop_words:
                    final_list.append(word)

Paso 6: Build positive and negative word clouds

Install the word cloud library with the command “pip instalar wordcloud“.

In english language, we have a predefined set of positive and negative words called Opinion Lexicons. These files can be downloaded from Link or directly from me GitHub repository.

Once the files are downloaded, read those files in the code and create a list of positive and negative words.

with open(r"opinion-lexicon-Englishpositive-words.txt","r") as pos:
  poswords = pos.read().split("n")
with open(r"opinion-lexicon-Englishnegative-words.txt","r") as neg:
  negwords = neg.read().split("n")

Importing libraries to generate and display word clouds.

import matplotlib.pyplot as plt
from wordcloud import WordCloud

Positive word cloud

# Choosing the only words which are present in poswords
pos_in_pos = " ".join([w for w in final_list if w in poswords])
wordcloud_pos = WordCloud(
                      background_color="black",
                      width=1800,
                      height=1400
                     ).generate(pos_in_pos)
plt.imshow(wordcloud_pos)
25036pos_wordcloud-3955480

The word "good" is the most used word that catches our attention. If we look back at the reviews, people have written reviews saying that the iPad has a good screen, good sound, good software and hardware.

Negative word cloud

# Choosing the only words which are present in negwords
neg_in_neg = " ".join([w for w in final_list if w in negwords])
wordcloud_neg = WordCloud(
                      background_color="black",
                      width=1800,
                      height=1400
                     ).generate(neg_in_neg)
plt.imshow(wordcloud_neg)
42575neg_wordcloud-7900282

The words expensive, stuck, beaten, disappointment stood out in negative word cloud. If we look at the context of the word stuck, dice “Although it has only 3 GB of RAM, never get stuck”, which is a positive thing about the device.

Therefore, it's good to create bigrama word clouds / trigram so as not to lose context.

2. Create audible files (text to audio)

gTTS is a Python library with Google Translate's text-to-speech API.

For install, run the command “pip install gtts"At the command prompt.

Import required libraries

import cv2
import pytesseract
from gtts import gTTS
import os

Set the path of tesseract

pytesseract.pytesseract.tesseract_cmd=r'C:Program FilesTesseract-OCRtesseract.exe'

Read the image using cv2.imread () and grab the text from the image using pytesseract and save it to a variable.

rev = cv2.imread("Reviews15.PNG")
# display the image using cv2.imshow() method
# cv2.imshow("Image", rev)
# cv2.waitKey(0)
# cv2.destroyAllWindows()
# grab the text from image using pytesseract
txt = pytesseract.image_to_string(rev)
print(txt)

Set the language and create a text to audio conversion using gTTS without going through the text, language

language="in"

outObj = gTTS(text=txt, lang=language, slow=False)

Save the audio file as “rev.mp3”

outObj.save("rev.mp3")

play the audio file

os.system('rev.mp3')
27896gtts_output-4924276

Complete code:

  1. import cv2
    import pytesseract
    from gtts import gTTS
    import os
    rev = cv2.imread("Reviews15.PNG")
    
    # cv2.imshow("Image", rev)
    # cv2.waitKey(0)
    # cv2.destroyAllWindows()
    
    txt = pytesseract.image_to_string(rev)
    print(txt)
    language="in"
    outObj = gTTS(text=txt, lang=language, slow=False)
    outObj.save("rev.mp3")
    print('playing the audio file')
    os.system('rev.mp3')

Final notes

At the end of this article, we have understood the concept of optical character recognition (OCR) and we are familiar with reading images with OpenCV and capturing text from images with pytesseract. We have seen two basic OCR applications: build word clouds, create audible files by converting text to speech using gTTS.

References:

I hope this article is informative and, please, let me know if you have any queries or comments related to this article in the comment section. Happy learning 😊

The media shown in this article is not the property of DataPeaker and is used at the author's discretion.

Subscribe to our Newsletter

We will not send you SPAM mail. We hate it as much as you.