Use pytesseract OCR to recognize text from an image

Question

I need to use Pytesseract to extract text from this picture:

and the code:

from PIL import Image, ImageEnhance, ImageFilter
import pytesseract
path = 'pic.gif'
img = Image.open(path)
img = img.convert('RGBA')
pix = img.load()
for y in range(img.size[1]):
    for x in range(img.size[0]):
        if pix[x, y][0] < 102 or pix[x, y][1] < 102 or pix[x, y][2] < 102:
            pix[x, y] = (0, 0, 0, 255)
        else:
            pix[x, y] = (255, 255, 255, 255)
img.save('temp.jpg')
text = pytesseract.image_to_string(Image.open('temp.jpg'))
# os.remove('temp.jpg')
print(text)

and the "temp.jpg" is

Not bad, but the result of print is `,2 WW`, not the right text `2HHH`. So how can I remove those black dots?

score 37 · Answer 1 · edited Jul 01 '17 at 13:52


Here is my solution:

import pytesseract
from PIL import Image, ImageEnhance, ImageFilter

im = Image.open("temp.jpg") # the second one 
im = im.filter(ImageFilter.MedianFilter())
enhancer = ImageEnhance.Contrast(im)
im = enhancer.enhance(2)
im = im.convert('1')
im.save('temp2.jpg')
text = pytesseract.image_to_string(Image.open('temp2.jpg'))
print(text)

answered Jun 10 '16 at 14:19 by Smith John · edited Jul 01 '17 at 13:52 by Olivier Coilland

Hi, when I use this code I get the error "UnicodeEncodeError: 'charmap' codec can't encode characters in position 11-12: character maps to <undefined>". Can you suggest a way to overcome this? – MAK Oct 31 '17 at 14:00
@MAK You will need to install win-unicode-console on Windows. – Moon Cheesez Nov 03 '17 at 12:59
This will not work when the text in the image is not English. When I tried this with Japanese and Arabic, the result was not good. – Hariharan AR Mar 22 '22 at 08:15
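
On the UnicodeEncodeError mentioned in the comments above: here is a minimal sketch, assuming the error comes from printing OCR output to a console whose encoding cannot represent some of the recognized characters. The helper name `console_safe` is hypothetical and not part of pytesseract; it only replaces unencodable characters with `?` before printing.

```python
import sys


def console_safe(text, encoding=None):
    """Return text with characters the target encoding cannot represent replaced by '?'.

    `encoding` defaults to the console's encoding (falling back to UTF-8).
    This is an illustrative helper, not a pytesseract function.
    """
    encoding = encoding or sys.stdout.encoding or "utf-8"
    return text.encode(encoding, errors="replace").decode(encoding)


# Example: a Japanese character cannot be encoded in cp1252, so it is replaced.
print(console_safe("2HHH\u3042", "cp1252"))
```

With this, `print(console_safe(text))` should no longer raise, at the cost of losing the characters the console cannot show.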

nathancy · Answer 2 · 2022-04-15T22:15:53.930

Here's a simple approach using OpenCV and Pytesseract OCR. To perform OCR on an image, it's important to preprocess it first. The idea is to obtain a processed image where the text to extract is black and the background is white. To do this, we convert to grayscale, apply a slight Gaussian blur, then apply Otsu's threshold to obtain a binary image. From here, we apply morphological operations to remove noise, and finally we invert the image. We perform text extraction with the --psm 6 configuration option, which assumes a single uniform block of text. See the Tesseract documentation for other page segmentation modes.

Here's a visualization of the image processing pipeline:

Input image

Convert to grayscale -> Gaussian blur -> Otsu's threshold

Notice the tiny specks of noise; to remove them, we can perform morphological operations

Finally we invert the image

Result from Pytesseract OCR

2HHH

Code

import cv2
import pytesseract

pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"

# Grayscale, Gaussian blur, Otsu's threshold
image = cv2.imread('1.png')
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
blur = cv2.GaussianBlur(gray, (3,3), 0)
thresh = cv2.threshold(blur, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)[1]

# Morph open to remove noise and invert image
kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (3,3))
opening = cv2.morphologyEx(thresh, cv2.MORPH_OPEN, kernel, iterations=1)
invert = 255 - opening

# Perform text extraction
data = pytesseract.image_to_string(invert, lang='eng', config='--psm 6')
print(data)

cv2.imshow('thresh', thresh)
cv2.imshow('opening', opening)
cv2.imshow('invert', invert)
cv2.waitKey()
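
To see what the morph-open step above does, here is a toy pure-Python sketch. It is not OpenCV's implementation: the function names and the tiny grid are made up for illustration, and pixels at the image border are simply treated as background, which may differ from OpenCV's border handling. Opening is an erosion followed by a dilation with the same 3x3 kernel. The erosion deletes anything smaller than the kernel, and the dilation grows the surviving shapes back to their original size.

```python
def erode(grid):
    """3x3 erosion: a pixel stays 1 only if its whole 3x3 neighbourhood is 1."""
    h, w = len(grid), len(grid[0])
    out = [[0] * w for _ in range(h)]
    # Border pixels have an incomplete neighbourhood and are left as 0 here.
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            if all(grid[y + dy][x + dx] for dy in (-1, 0, 1) for dx in (-1, 0, 1)):
                out[y][x] = 1
    return out


def dilate(grid):
    """3x3 dilation: a pixel becomes 1 if any pixel in its 3x3 neighbourhood is 1."""
    h, w = len(grid), len(grid[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    ny, nx = y + dy, x + dx
                    if 0 <= ny < h and 0 <= nx < w and grid[ny][nx]:
                        out[y][x] = 1
    return out


def morph_open(grid):
    """Opening = erosion followed by dilation."""
    return dilate(erode(grid))


# A 3x4 block standing in for a character stroke, plus one isolated noise pixel.
grid = [
    [0, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 1, 0],
    [0, 1, 1, 1, 1, 0, 0, 0, 0],
    [0, 1, 1, 1, 1, 0, 0, 0, 0],
    [0, 1, 1, 1, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 0],
]

opened = morph_open(grid)
for row in opened:
    print("".join("#" if v else "." for v in row))
```

In the printed result the block is intact and the single noise pixel is gone, which is the effect the `cv2.MORPH_OPEN` call is used for in the answer.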

this is one of the most accurate and neatly explained answers I have seen in SO! thanks! — Md. Rezaul Karim, Feb 23 '22 at 17:08

score 6 · Answer 3 · answered Dec 20 '18 at 12:01


I have a somewhat different pytesseract approach for the community. Here it is:

import pytesseract
from PIL import Image
text = pytesseract.image_to_string(Image.open("temp.jpg"), lang='eng',
                        config='--psm 10 --oem 3 -c tessedit_char_whitelist=0123456789')

print(text)

answered Dec 20 '18 at 12:01 by Dinesh Chandra Kumawat

I had tried `-psm` and nothing worked, but after seeing your post I tried `--psm` and it solved everything. Great! – RAno Jul 08 '19 at 09:47

score 4 · Answer 4 · answered Dec 13 '18 at 22:00

To extract the text directly from the web, you can try the following implementation (making use of the first image):

import io
import requests
import pytesseract
from PIL import Image, ImageFilter, ImageEnhance

response = requests.get('https://i.stack.imgur.com/HWLay.gif')
img = Image.open(io.BytesIO(response.content))
img = img.convert('L')
img = img.filter(ImageFilter.MedianFilter())
enhancer = ImageEnhance.Contrast(img)
img = enhancer.enhance(2)
img = img.convert('1')
img.save('image.jpg')
imagetext = pytesseract.image_to_string(img)
print(imagetext)

nishit chittora · Answer 5 · 2018-06-15T09:43:14.227

3

Here is my small improvement, which removes noise and arbitrary lines that fall within a certain colour range.

import pytesseract
from PIL import Image, ImageEnhance, ImageFilter

im = Image.open(img)  # img is the path of the image 
im = im.convert("RGBA")
newimdata = []
datas = im.getdata()

for item in datas:
    if item[0] < 112 or item[1] < 112 or item[2] < 112:
        newimdata.append(item)
    else:
        newimdata.append((255, 255, 255, 255))  # opaque white; the image is RGBA
im.putdata(newimdata)

im = im.filter(ImageFilter.MedianFilter())
enhancer = ImageEnhance.Contrast(im)
im = enhancer.enhance(2)
im = im.convert('1')
im.save('temp2.jpg')
text = pytesseract.image_to_string(Image.open('temp2.jpg'), config='--psm 6 -c tessedit_char_whitelist=0123456789abcdefghijklmnopqrstuvwxyz', lang='eng')
print(text)

answered Jun 14 '18 at 07:41 · edited Jun 15 '18 at 09:43

Something never worked with the image; can you edit and try again? – David Jun 14 '18 at 08:09
@David Can you please elaborate? What's not working? – nishit chittora Jun 15 '18 at 09:44
Hmm, I don't remember at the moment, but I'm sure it was not related to the code; it was probably an uploaded image here. Did you remove an upload? I don't see it anymore. – David Jun 15 '18 at 09:53

score 2 · Answer 6 · answered Jul 28 '19 at 08:22

You only need to enlarge the picture with cv2.resize:

image = cv2.resize(image,(0,0),fx=7,fy=7)

Original picture (200x40) -> HZUBS

Same picture resized (1400x300) -> A 1234 (this is the right text)

Then threshold and blur the enlarged image:

retval, image = cv2.threshold(image,200,255, cv2.THRESH_BINARY)
image = cv2.GaussianBlur(image,(11,11),0)
image = cv2.medianBlur(image,9)

You can also change parameters such as the page segmentation mode to improve the results:

Page segmentation modes:
  0    Orientation and script detection (OSD) only.
  1    Automatic page segmentation with OSD.
  2    Automatic page segmentation, but no OSD, or OCR.
  3    Fully automatic page segmentation, but no OSD. (Default)
  4    Assume a single column of text of variable sizes.
  5    Assume a single uniform block of vertically aligned text.
  6    Assume a single uniform block of text.
  7    Treat the image as a single text line.
  8    Treat the image as a single word.
  9    Treat the image as a single word in a circle.
 10    Treat the image as a single character.
 11    Sparse text. Find as much text as possible in no particular order.
 12    Sparse text with OSD.
 13    Raw line. Treat the image as a single text line,
            bypassing hacks that are Tesseract-specific.
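
To try several of these modes on one image, here is a small sketch. The helper names `build_config` and `ocr_with_modes` are hypothetical, not part of pytesseract; the only pytesseract call used is `image_to_string` with its `config` argument, as in the answers above. It assumes the double-dash `--psm` / `--oem` spelling, which one commenter above reports was needed on their setup.

```python
def build_config(psm, oem=None, whitelist=None):
    """Assemble a Tesseract config string such as '--psm 6'.

    psm is a page segmentation mode from the table above (0-13).
    oem and whitelist are optional and appended only when given.
    """
    if not 0 <= psm <= 13:
        raise ValueError("psm must be between 0 and 13")
    parts = [f"--psm {psm}"]
    if oem is not None:
        parts.append(f"--oem {oem}")
    if whitelist:
        parts.append(f"-c tessedit_char_whitelist={whitelist}")
    return " ".join(parts)


def ocr_with_modes(image, modes=(6, 7, 8, 13)):
    """Run OCR once per page segmentation mode and return {psm: text}."""
    # Imported here so build_config can be used without Tesseract installed.
    import pytesseract

    return {
        psm: pytesseract.image_to_string(image, config=build_config(psm)).strip()
        for psm in modes
    }


# The config string used in one of the answers above:
print(build_config(10, oem=3, whitelist="0123456789"))
```

Calling `ocr_with_modes(Image.open('temp.jpg'))` and comparing the returned strings is a quick way to see which mode suits a given picture; for a short single-line image like the one in the question, modes 7 and 8 are the natural ones to compare against 6.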

score 1 · Answer 7 · answered Dec 13 '21 at 11:13

from PIL import Image, ImageEnhance, ImageFilter
import pytesseract
path = 'hhh.gif'
img = Image.open(path)
img = img.convert('RGBA')
pix = img.load()
for y in range(img.size[1]):
    for x in range(img.size[0]):
        if pix[x, y][0] < 102 or pix[x, y][1] < 102 or pix[x, y][2] < 102:
            pix[x, y] = (0, 0, 0, 255)
        else:
            pix[x, y] = (255, 255, 255, 255)
text = pytesseract.image_to_string(img)  # OCR the thresholded image, not the original file
print(text)
