Issue
I am attempting to collect data from a shop in a game (Starbase) and feed it to a website so that it can be displayed as a candlestick chart.
So far I have been using Tesseract OCR 5.0.0, but I am running into issues because I cannot get the values reliably.
I have seen that images can be pre-processed to increase reliability, but I have hit a wall as I am not familiar enough with Tesseract and OpenCV to know what else to do.
Please note that since this is an in-game UI, the images are going to be very consistent: there are no colour variations, lighting changes, font size changes, and so on. I technically only need to get it to work once and that's it.
Here are the steps I have taken so far and the results:
I started by taking a screenshot of only the part of the UI I am interested in, to remove as much clutter as possible.
I then applied a threshold as shown here (I will also use the cropping part when doing the automation, but I am not there yet), set the language to English and the psm argument to 6, which gives me the following code:
import cv2
import pytesseract

def clean_text(text):
    ret = text.replace("\n\n", "\n")  # remove the blank lines
    return ret

pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract'

img = cv2.imread('screens/ressources_list_array_1.png', 0)
thresh = 255 - cv2.threshold(img, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)[1]

print("======= Output")
print(clean_text(pytesseract.image_to_string(thresh, lang='eng', config='--psm 6')))
cv2.imshow('thresh', thresh)
cv2.waitKey()
Here is an example of the output I get :
======= Output
Aegisium Ore 4490 456
Ajatite Ore 600 332
Arkanium Ore 84999 53
Bastium Ore 2350 421
Charodium Ore 5 280 366
Corazium Ore 39 896 212
Exorium Ore 5 380 112
Ice 980 141
Karnite Crystal ele) 111
Kutonium Ore 14 000 215
Lukium Ore 31 000 158
Nhurgite Crystal 3144 64
Surtrite Crystal 4198 70
Valkite Ore 545 150
Vokarium Ore 1850 415
Ymrium Ore 69 899 60
There are two main issues:
1 - It is not reliable enough: you can see it confused 6 000 with ele)
2 - It does not properly detect where the numbers start and end, which makes it difficult to tell the 2 columns apart
I think I can solve the second issue by further splitting the image into 3 columns, but I am unsure whether that would be a big hit on CPU / GPU usage, which I would prefer to avoid.
I also found the OpenCV documentation that lists all of the available image processing methods, but there are a lot of them and I am unsure which ones to use to further increase reliability.
Any help is much appreciated.
Solution
Pytesseract, on its own, doesn't handle table detection very well - the table format isn't retained in the output, which can make it difficult to parse, as seen in your output.
So splitting the table into distinct columns, performing OCR on each, and then rejoining the columns will help. This is slower, but it is more accurate.
Dilation can also help: it adds white pixels around existing white areas (using the threshold and image you currently have), which thickens the narrow strokes of the numbers.
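To make the effect of the kernel shape concrete, here is a small pure-Python sketch of binary dilation with an all-ones rectangular kernel. It is only an illustration of the idea, not OpenCV's implementation, and the `dilate` helper and the tiny test image are made up for this example. It shows why a tall, narrow kernel thickens strokes vertically without closing the gap between neighbouring characters, while a square kernel merges them.

```python
def dilate(image, kernel_rows, kernel_cols):
    """Binary dilation with an all-ones kernel anchored at its centre.

    `image` is a list of rows holding 0 (black) or 255 (white).
    Illustrative helper only; in practice cv2.dilate does this for you.
    """
    height, width = len(image), len(image[0])
    row_reach, col_reach = kernel_rows // 2, kernel_cols // 2
    out = [[0] * width for _ in range(height)]
    for r in range(height):
        for c in range(width):
            # A pixel turns white if any pixel under the kernel is white.
            out[r][c] = max(
                image[rr][cc]
                for rr in range(max(0, r - row_reach), min(height, r + row_reach + 1))
                for cc in range(max(0, c - col_reach), min(width, c + col_reach + 1))
            )
    return out


# Two white pixels separated by a one-pixel gap, like two adjacent strokes.
image = [
    [0, 0, 0, 0, 0],
    [0, 255, 0, 255, 0],
    [0, 0, 0, 0, 0],
]

vertical = dilate(image, 3, 1)  # 3 rows x 1 column: grows up and down only
square = dilate(image, 3, 3)    # 3 x 3: also grows sideways

for row in vertical:
    print(row)  # the gap in column 2 stays black
for row in square:
    print(row)  # every pixel is white, the two strokes have merged
```

With the 3x1 kernel the middle column stays black, so the two strokes remain separate; with the 3x3 kernel they smudge together.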
In my experience, improving the accuracy generally means splitting the table up into different sections, as well as testing different thresholds and dilation settings.
import cv2
import numpy as np
import pandas as pd
import pytesseract

def clean_text(text):
    '''
    Remove the blank lines (same helper as in the question).
    '''
    return text.replace("\n\n", "\n")

def read_img(img):
    '''
    Read in a grayscale image.
    '''
    img = cv2.imread(img)
    img = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    return img

img = read_img("img_path.png")
thresh = 255 - cv2.threshold(img, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)[1]  # your current threshold
dilated = cv2.dilate(thresh, np.ones((3, 1), np.uint8), iterations=1)  # dilate vertically (don't want to smudge the numbers together)

cols = []
# Split the image into columns by array slicing (the pixel ranges are specific to this screenshot).
# Note that the middle column isn't dilated; when it is, a decimal point is found.
for v in [dilated[:, 0:200], thresh[:, 200:500], dilated[:, 800:900]]:
    config_options = '--psm 6'
    cols.append(clean_text(pytesseract.image_to_string(v, lang='eng', config=config_options)).split('\n'))

pd.DataFrame(cols).T
                   0       1    2
0       Aegisium Ore    4490  456
1        Ajatite Ore     600  332
2       Arkanlum Ore   84999   53
3        Bastium Ore    2350  421
4      Charodium Ore   5 280  366
5       Corazium Ore  39 896  212
6        Exorlum Ore   5 380  112
7                Ice     980  141
8    Karnite Crystal   6 000  111
9       Kutonlum Ore  14 000  215
10        Lukium Ore  31 000  158
11  Nhurgite Crystal    3144   64
12  Surtrite Crystal    4198   70
13       Valkite Ore     545  150
14      Vokarlum Ore    1850  415
15        Ymrium Ore  69 899   60
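The np.ones call provides the kernel that the dilation uses: a shape of (3, 1) means three rows by one column, so white areas only grow vertically. See the documentation.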
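Once the columns are separated, the numeric cells still contain thousands separators read as spaces (for example `5 280`), so they need a small clean-up step before they can be charted. The following is a minimal sketch of that step, assuming each column arrives as a list of strings as in the `cols` list above. The helper names `parse_number` and `build_rows` and the keys `first_value` and `second_value` are hypothetical, since the meaning of the two numeric columns is not stated here.

```python
import re


def parse_number(cell):
    """Turn an OCR'd cell such as '5 280' into the integer 5280.

    Returns None when the cell contains anything other than digits and
    whitespace, so misreads like 'ele)' are flagged instead of guessed.
    """
    compact = re.sub(r"\s+", "", cell)
    if re.fullmatch(r"[0-9]+", compact):
        return int(compact)
    return None


def build_rows(names, first_values, second_values):
    """Zip three OCR'd columns back into one record per table row."""
    rows = []
    for name, first, second in zip(names, first_values, second_values):
        rows.append({
            "name": name.strip(),
            "first_value": parse_number(first),    # hypothetical key name
            "second_value": parse_number(second),  # hypothetical key name
        })
    return rows


rows = build_rows(
    ["Charodium Ore", "Ice", "Karnite Crystal"],
    ["5 280", "980", "ele)"],
    ["366", "141", "111"],
)
for row in rows:
    print(row)
```

Any row where a value comes back as `None` can be logged or re-scanned rather than silently written to the chart data.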
Lastly, depending on your use case, AWS Textract does a good job of parsing tables and numbers. The documentation provides sample Python code for connecting to the API, which worked really well for me, at least. Hopefully some of this is helpful.
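For reference, a rough sketch of what the Textract route could look like is below. This is not the sample code from the AWS documentation; it assumes `boto3` is installed, AWS credentials and a region are already configured, and the image contains a single table. The function names `analyze_table_image` and `cells_to_rows` are made up for this example. The first function calls Textract's `analyze_document` with the `TABLES` feature, and the second rebuilds rows from the returned `CELL` and `WORD` blocks.

```python
def analyze_table_image(path):
    """Send one image to Textract and return the list of result blocks.

    Requires boto3 plus configured AWS credentials; calls are billed by AWS.
    """
    import boto3  # imported here so cells_to_rows works without boto3

    client = boto3.client("textract")
    with open(path, "rb") as handle:
        response = client.analyze_document(
            Document={"Bytes": handle.read()},
            FeatureTypes=["TABLES"],
        )
    return response["Blocks"]


def cells_to_rows(blocks):
    """Rebuild table rows from Textract blocks (assumes a single table)."""
    by_id = {block["Id"]: block for block in blocks}
    grid = {}
    for block in blocks:
        if block["BlockType"] != "CELL":
            continue
        words = []
        for relationship in block.get("Relationships", []):
            if relationship["Type"] != "CHILD":
                continue
            for child_id in relationship["Ids"]:
                child = by_id.get(child_id)
                if child and child["BlockType"] == "WORD":
                    words.append(child["Text"])
        # RowIndex and ColumnIndex are 1-based in the Textract response.
        grid[(block["RowIndex"], block["ColumnIndex"])] = " ".join(words)
    if not grid:
        return []
    n_rows = max(row for row, _ in grid)
    n_cols = max(col for _, col in grid)
    return [
        [grid.get((row, col), "") for col in range(1, n_cols + 1)]
        for row in range(1, n_rows + 1)
    ]


# Hand-written blocks mimicking the response shape, so no AWS call is needed.
sample_blocks = [
    {"Id": "w1", "BlockType": "WORD", "Text": "Ice"},
    {"Id": "w2", "BlockType": "WORD", "Text": "980"},
    {"Id": "c1", "BlockType": "CELL", "RowIndex": 1, "ColumnIndex": 1,
     "Relationships": [{"Type": "CHILD", "Ids": ["w1"]}]},
    {"Id": "c2", "BlockType": "CELL", "RowIndex": 1, "ColumnIndex": 2,
     "Relationships": [{"Type": "CHILD", "Ids": ["w2"]}]},
]
print(cells_to_rows(sample_blocks))
```

In real use, `cells_to_rows(analyze_table_image("img_path.png"))` would replace the hand-written sample blocks.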
Answered By - jacob