December 16, 2024
From Image to Text: How to Build an OCR Tool with Tesseract and Python
Want to extract text from images automatically and save hours of data entry? Optical Character Recognition (OCR) is the technology that makes this possible. In this article, I'll help you create an OCR tool using Python and Tesseract, one of the most widely used open-source OCR engines.
What is OCR?
Optical Character Recognition (OCR) turns scanned paper documents, PDFs, and digital camera photos into editable and searchable text. Many companies rely on OCR to scan old documents or extract text from images for analysis. OCR automates manual input, saving time and resources.
Setting Up the Environment
Before writing any code, set up your Python environment and install the required tools.
Install Python
If you don't have Python installed, download it from python.org.
Install Tesseract
Tesseract is an open-source OCR engine. Download the installer or package for your operating system, and during installation make sure the Tesseract executable is added to your PATH so that Python can find it.
Install Required Libraries
Install the Python libraries used to talk to Tesseract and to process images. In your terminal or command prompt, run:
pip install pytesseract pillow opencv-python
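Before moving on, it can help to confirm that everything is actually in place. The following is a minimal sketch using only the standard library; the helper name check_ocr_setup is made up for this article, and it assumes the Tesseract executable is called tesseract on your system.

```python
import importlib.util
import shutil


def check_ocr_setup():
    """Report which parts of the OCR setup can be found (hypothetical helper)."""
    report = {
        # Full path to the Tesseract executable if it is on PATH, else None.
        "tesseract": shutil.which("tesseract"),
    }
    # Check that each Python package is importable without importing it.
    for module in ("pytesseract", "PIL", "cv2"):
        report[module] = importlib.util.find_spec(module) is not None
    return report


for name, found in check_ocr_setup().items():
    print(f"{name}: {found if found else 'NOT FOUND'}")
```

If tesseract shows up as NOT FOUND even though it is installed, one option is to point pytesseract at it directly by setting pytesseract.pytesseract.tesseract_cmd to the full path of the executable.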
Understanding the Code
With everything set up, it's time to start coding. I'll walk through loading an image, preprocessing it to improve OCR accuracy, and extracting text from it.
Import Libraries
Import the required libraries first: pytesseract for OCR, Pillow for image manipulation, and OpenCV for more advanced image preprocessing:
import pytesseract
from PIL import Image
import cv2
Loading the Image
Use OpenCV to load the image you want to process:
img = cv2.imread('image.png')
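Replace 'image.png' with the path to your image file. Note that cv2.imread() does not raise an error when the file is missing or unreadable; it returns None instead.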
Preprocessing the Image
Image preprocessing is essential for OCR accuracy. Start by converting the image to grayscale; thresholding it afterwards can make the text stand out further:
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
Extracting Text with Tesseract
Once the image is preprocessed, pass it to Tesseract to extract the text:
text = pytesseract.image_to_string(gray)
print(text)
Handling Different Languages and Output Formats
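Tesseract extracts text in English by default but supports many other languages. The image_to_string() function accepts a lang argument to select them, provided the matching language data is installed alongside Tesseract:
text = pytesseract.image_to_string(gray, lang='eng+spa')  # English and Spanish
Beyond plain text, pytesseract can also return word-level data with positions and confidence scores (image_to_data()) and searchable PDF or hOCR output (image_to_pdf_or_hocr()). Tesseract does not produce CSV or JSON directly, but the word-level data is easy to convert into either format yourself.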
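If you need structured output such as JSON, one approach is to ask pytesseract for word-level data with image_to_data(..., output_type=pytesseract.Output.DICT) and reshape the result yourself. The sketch below shows only the reshaping step, so it runs without Tesseract; the function name words_to_json and the sample dictionary are made up for illustration, and the sample contains just the keys the function reads.

```python
import json


def words_to_json(data, min_conf=0.0):
    """Convert an image_to_data-style dict into a JSON list of words."""
    words = []
    rows = zip(data["text"], data["conf"], data["left"],
               data["top"], data["width"], data["height"])
    for text, conf, left, top, width, height in rows:
        # Non-word rows have empty text and a confidence of -1; skip them.
        if text.strip() and float(conf) >= min_conf:
            words.append({
                "text": text,
                "conf": float(conf),
                "box": [left, top, width, height],
            })
    return json.dumps(words, indent=2)


# Hand-written sample in the same shape as the dict output (not real OCR output).
sample = {
    "text": ["", "Hello", "World", " "],
    "conf": ["-1", 96, 91.5, "-1"],
    "left": [0, 10, 70, 0],
    "top": [0, 12, 12, 0],
    "width": [200, 50, 55, 0],
    "height": [40, 18, 18, 0],
}
print(words_to_json(sample))
```

With a real image, you would build the dictionary with data = pytesseract.image_to_data(gray, output_type=pytesseract.Output.DICT) and pass it to words_to_json(data).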
Final Words
You've now created a basic OCR tool using Tesseract and Python. The technique is simple, but its strength lies in how flexible it is.
Experiment with different image types and preprocessing methods to improve your tool's accuracy. From here, you can use it to digitize old documents, analyze scanned books, and extract text for data entry.