1 year ago

#73601

Pouria P

How to do OCR on a single character

I am writing a program that should be able to recognize a single character from an image of it.

I think this should be fairly easy given how powerful OCR software has become, but I have no real idea how to do it.

Here are the specifics:

  • The language is Persian.

  • The character is not handwritten.

  • There are no words or sentences; the image is of a single character generated from a PDF file. It will look like this:

[Image: the Persian letter "seen"]

Ideally, I should be able to perform OCR on this image and determine the character.

So far, however, I have been using another approach. The fonts used in the PDF files come from a finite set (roughly 100 fonts), and of those only 2-3 are usually used. So I can "cheat": compare this character to all the characters of these fonts and determine what it is.

As an example, these are some of the characters in the font "Roya". I intended to compare my character image with all of them to determine the letter, repeating this for every other font until a match is found.

[Image: characters of the "Roya" font]
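
The lookup described above could be sketched roughly as follows in Java. This is only an illustration, not tested on real glyphs: `GlyphMatcher`, `mismatches`, `bestMatch`, and the `maxMismatches` tolerance are hypothetical names, and the sketch assumes every glyph has already been converted to a non-empty boolean bitmap of the same size (true = ink).

```java
import java.util.Map;

/** Hypothetical helper: picks the reference glyph closest to a query bitmap. */
final class GlyphMatcher {

    /** Counts the pixels that differ between two bitmaps of identical size. */
    static int mismatches(boolean[][] a, boolean[][] b) {
        if (a.length != b.length || a[0].length != b[0].length) {
            throw new IllegalArgumentException("bitmaps must have the same size");
        }
        int count = 0;
        for (int y = 0; y < a.length; y++) {
            for (int x = 0; x < a[y].length; x++) {
                if (a[y][x] != b[y][x]) {
                    count++;
                }
            }
        }
        return count;
    }

    /**
     * Returns the label of the closest reference glyph, or null when even the
     * best candidate differs in more than maxMismatches pixels.
     */
    static String bestMatch(boolean[][] query,
                            Map<String, boolean[][]> references,
                            int maxMismatches) {
        String bestLabel = null;
        int bestScore = Integer.MAX_VALUE;
        for (Map.Entry<String, boolean[][]> entry : references.entrySet()) {
            int score = mismatches(query, entry.getValue());
            if (score < bestScore) {
                bestScore = score;
                bestLabel = entry.getKey();
            }
        }
        return bestScore <= maxMismatches ? bestLabel : null;
    }
}
```

Calling `bestMatch` once per font, with one map of reference glyphs per font, would reproduce the "repeat for every other font" step. Picking the closest glyph rather than requiring an exact match means small rendering differences do not automatically cause a miss.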

I was doing a bitmap comparison with ImageMagick, but I realized that even when the font is the same, there are still small differences between character images generated from it.

As an example, these two are both the character "beh" from the font "Zar". But as you can see there won't be an exact match when doing a bitmap compare between them:

[Image: Persian letter "beh"] [Image: Persian letter "beh"]
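
One way to tolerate such differences is to score similarity instead of testing for equality. The sketch below is plain Java rather than the ImageMagick comparison mentioned above, and everything in it is an assumption for illustration: the class name `GlyphSimilarity`, the 32x32 grid, the brightness threshold of 128, and the premise that glyphs are dark on a light, opaque background. It crops each image to the bounding box of its ink, resamples onto a fixed grid, and reports the fraction of cells that agree.

```java
import java.awt.image.BufferedImage;

/** Hypothetical helper: scores how alike two glyph images are (0.0 to 1.0). */
final class GlyphSimilarity {
    static final int GRID = 32; // assumed grid size; tune as needed

    /** True when the pixel is dark enough to count as ink. */
    static boolean isInk(BufferedImage img, int x, int y) {
        int rgb = img.getRGB(x, y);
        int r = (rgb >> 16) & 0xFF;
        int g = (rgb >> 8) & 0xFF;
        int b = rgb & 0xFF;
        return (r + g + b) / 3 < 128;
    }

    /** Crops to the ink bounding box and resamples onto a GRID x GRID grid. */
    static boolean[][] normalize(BufferedImage img) {
        int minX = img.getWidth(), minY = img.getHeight(), maxX = -1, maxY = -1;
        for (int y = 0; y < img.getHeight(); y++) {
            for (int x = 0; x < img.getWidth(); x++) {
                if (isInk(img, x, y)) {
                    minX = Math.min(minX, x);
                    maxX = Math.max(maxX, x);
                    minY = Math.min(minY, y);
                    maxY = Math.max(maxY, y);
                }
            }
        }
        boolean[][] grid = new boolean[GRID][GRID];
        if (maxX < 0) {
            return grid; // blank image: no ink found
        }
        int w = maxX - minX + 1;
        int h = maxY - minY + 1;
        for (int gy = 0; gy < GRID; gy++) {
            for (int gx = 0; gx < GRID; gx++) {
                // Nearest-neighbour sample inside the bounding box.
                int sx = minX + gx * w / GRID;
                int sy = minY + gy * h / GRID;
                grid[gy][gx] = isInk(img, sx, sy);
            }
        }
        return grid;
    }

    /** Fraction of grid cells on which the two glyphs agree. */
    static double similarity(BufferedImage a, BufferedImage b) {
        boolean[][] ga = normalize(a);
        boolean[][] gb = normalize(b);
        int same = 0;
        for (int y = 0; y < GRID; y++) {
            for (int x = 0; x < GRID; x++) {
                if (ga[y][x] == gb[y][x]) {
                    same++;
                }
            }
        }
        return same / (double) (GRID * GRID);
    }
}
```

With a score like this, two renderings of the same letter would be accepted when `similarity` exceeds some cutoff, which would have to be tuned on real samples. Stretching every glyph to a square grid discards its aspect ratio, so letters that differ mainly in width or height may need an extra check.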

Given all this, how should I go about doing the OCR?

Other notes:

  • The program is written in Java, but a standalone application or a C/C++ library is also acceptable.
  • I tried Tesseract but could not get it to recognize the characters. Its Persian support was very poorly documented, and it looked like it would need a lot of calibration and training. It also seemed to be optimized for recognizing words and gave very poor results on single characters.

ocr

image-comparison

0 Answers
