Issue
I am trying to extract text from an image using pytesseract.
I want the output to keep the same layout as the image being processed.
By layout I mean that the output text should be arranged in rows and columns, as in the input image.
I have tried the following code. The text recognition is mostly accurate, but the output looks nothing like the input.
Code
import pytesseract
from pytesseract import Output
from PIL import Image
import pandas as pd

pytesseract.pytesseract.tesseract_cmd = r'/usr/bin/tesseract'
custom_config = r'-c preserve_interword_spaces=1 --oem 1 --psm 1 -l eng'
d = pytesseract.image_to_data(Image.open(r'_0.png'), config=custom_config, output_type=Output.DICT)
df = pd.DataFrame(d)

# clean up blanks
df1 = df[(df.conf!='-1')&(df.text!=' ')&(df.text!='')]
# sort blocks vertically
sorted_blocks = df1.groupby('block_num').first().sort_values('top').index.tolist()
for block in sorted_blocks:
    curr = df1[df1['block_num']==block]
    sel = curr[curr.text.str.len()>3]
    # average width of one character, in pixels
    char_w = (sel.width/sel.text.str.len()).mean()
    prev_par, prev_line, prev_left = 0, 0, 0
    text = ''
    for ix, ln in curr.iterrows():
        # add new line when necessary
        if prev_par != ln['par_num']:
            text += '\n'
            prev_par = ln['par_num']
            prev_line = ln['line_num']
            prev_left = 0
        elif prev_line != ln['line_num']:
            text += '\n'
            prev_line = ln['line_num']
            prev_left = 0
        added = 0  # num of spaces that should be added
        if ln['left']/char_w > prev_left + 1:
            added = int((ln['left'])/char_w) - prev_left
            text += ' ' * added
        text += ln['text'] + ' '
        prev_left += len(ln['text']) + added + 1
    text += '\n'
    print(text)
Input Image
Output
Solution
First of all, remove the noise from the image: noise produces extra recognition errors.
Next, try a different output format. For example, hOCR is an HTML/XML output that includes bounding-box information, so you can get the exact position on the page of every OCR result.
If you do not need exact positions, post-processing the plain-text output may be easier. For example, Tesseract 5 with tessdata_best produces this output:
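As a rough illustration of the hOCR route, the sketch below asks pytesseract for hOCR and pulls out each word with its bounding box. The helper names (`image_to_hocr`, `parse_hocr_words`), the regular expression, and the sample markup with its made-up coordinates are assumptions for illustration only. The regex assumes the attribute order Tesseract normally emits (`class` before `title`); a real HTML parser would be more robust.

```python
import html
import re
from typing import Dict, List

# Matches one hOCR word span such as:
# <span class='ocrx_word' id='word_1_1' title='bbox 36 92 96 116; x_wconf 95'>Knut</span>
# Assumption: the class attribute comes before the title attribute.
WORD_RE = re.compile(
    r"<span[^>]*class=['\"]ocrx_word['\"][^>]*"
    r"title=['\"]bbox (\d+) (\d+) (\d+) (\d+)[^'\"]*['\"][^>]*>(.*?)</span>",
    re.DOTALL,
)
TAG_RE = re.compile(r"<[^>]+>")


def image_to_hocr(image_path: str, config: str = "--psm 6") -> str:
    """Run Tesseract on an image and return the hOCR markup as text."""
    # Imported here so the parsing helper below works without Tesseract installed.
    import pytesseract
    from PIL import Image

    raw = pytesseract.image_to_pdf_or_hocr(
        Image.open(image_path), extension="hocr", config=config
    )
    return raw.decode("utf-8")


def parse_hocr_words(hocr: str) -> List[Dict]:
    """Return one dict per recognized word: its text and pixel bounding box."""
    words = []
    for m in WORD_RE.finditer(hocr):
        left, top, right, bottom = (int(g) for g in m.groups()[:4])
        # Drop nested markup (e.g. <strong>) and decode HTML entities.
        text = html.unescape(TAG_RE.sub("", m.group(5))).strip()
        if text:
            words.append(
                {"text": text, "left": left, "top": top, "right": right, "bottom": bottom}
            )
    return words


# Hypothetical hOCR fragment with made-up coordinates, for demonstration only.
SAMPLE_HOCR = (
    "<span class='ocr_line' id='line_1_1' title='bbox 36 92 580 116'>"
    "<span class='ocrx_word' id='word_1_1' title='bbox 36 92 96 116; x_wconf 95'>10020</span> "
    "<span class='ocrx_word' id='word_1_2' title='bbox 110 92 160 116; x_wconf 93'>Knut</span>"
    "</span>"
)

for word in parse_hocr_words(SAMPLE_HOCR):
    print(word)
```

With real input you would call `parse_hocr_words(image_to_hocr('_0.png'))` and then group the words by their `top` and `left` values to rebuild rows and columns.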
$ tesseract YaVQ3.jpg - --psm 6 --dpi 300 -c preserve_interword_spaces=1
2
wf
10020 Knut Bratli, Brandval P.b. Chrysler 1936
10033 Erland Berg, Gjes&sen P.b. Dodge 1939
10054 Edvart Sandmo, Gardvik P.b. Opel 1937
10057 Hjalmar Aanerud, Vinger P.b. Opel 1932
10075 Reidar Holth, Flisa P.b. Volvo . 1960
10076 Einar Bredalen, Braskereidfoss P.b. Dodge 1929
10077 Reidar Holth, Flisa P.b. Volkswagen 1961
10089 Sor-Odal Bulldozerdrift, Skarnes Lb. White 1944 "
10090 Arne Radford, Galterud Lb. Ford 1939
10093 Sverre Langbraten, Brandval L.b. Citroén 1950
10096 Karl Tuhus, Skotterud P.b. Chrysler 1936
10101 Gunnar Bie-Larsen, Kongsvinger P.b. Ford : ©1961
10110 Martin Albertsen, Flisa Pb. Opel . 1960
10111 Alf @degaard, Kongsvinger P.b. Volkswagen 1958
10112 Asbjern Elverhoi, Kongsvinger Pb. Ford 1961
10114 Olav Sunde jr., Skarnes ¢ P.b. Plymouth 1937
10116 John Erichsen, Skarnes P.b. Ford 1960
10118 Ole Hasleengen, Véler \ Pb. Morris 1931
10120 Harald Eggen, Vinger \ P.b. Peugeot 1938
10121 Ola N. Berg, Gjesisen Pb. Ford 1960
10125 Reldar Rapstad, Roverud Pb. Ford 1954 Pp
10129 Erling Johnsrud, Skarnes Pb. Overland 1939
10130 Reidar Vangen, Disend P.b. Hudson 1947 v
10133 Oddvar Lilleseth, Skarnes V.b. Ford 1934
10136 Hans K. Kolbjornsrud, Austmarka P.b. Volvo 1939
10140 Rolv Snare, Kongsvinger P.h. Mercedes Benz 1950
10143 Olaf Storberget, Grue Finnskog L.b. Land Rover 1951
10146 Helge Strand, Magnor P.b. Hudson 1946
10148 Arne Hagan, Brandval Pb. Volkswagen’ 1957
10159 Brodbelfoss, E.verk, Vinger P.b. Chevrolet 1939
10160 Lauritz Hove, Sander Pb. Ford 1959
10161 Rolf Johnsen, Matrand Lb. Ford * 1937
10168 Sten Sooth Knutsen, Skotterud Pb. Volkswagen 1962
10170 Odd Norli, Knapper P.b. Buick 1938
10175 Gustav Solvang, Kongsvinger L.b. Chevrolet 1939 4
10180 Trygve Wolden, Kongsvinger Pb. Dodge 1920
10182 Kongsv. Handelsgartneri, Kongsv. Stb. Opel 1957
10186 Oddvar Berget, Namni Lb. Fordson 1933
10188 Sander Idrettslag, Sander . Buss Austin +1951
10185 Karl O. Halvorsen, Br.foss L.b. Hanomag 1955
NN -
3
: ll
v -—
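One possible way to post-process plain-text output like the above is to match each row against a pattern and re-emit it in padded columns. This is only a sketch: the field names (`reg`, `owner`, `kind`, `make`, `year`) and the pattern are assumptions made from the rows shown here, not something Tesseract provides. Lines that do not match, such as the stray noise lines at the top and bottom, are skipped.

```python
import re
from typing import Dict, List, Optional

# Assumed row layout: 5-digit number, owner and place, a short vehicle-type
# code (P.b., Pb., L.b., Stb., Buss, ...), make, then a 4-digit year that may
# be preceded by stray punctuation left over from image noise.
ROW_RE = re.compile(
    r"^(?P<reg>\d{5})\s+"
    r"(?P<owner>.+?)\s+"
    r"(?P<kind>[A-Z][a-z]?\.?[a-z]\.|Buss)\s+"
    r"(?P<make>.+?)\s+"
    r"(?:[^\w\s]+\s*)*(?P<year>\d{4})\b"
)

FIELDS = ["reg", "owner", "kind", "make", "year"]


def parse_row(line: str) -> Optional[Dict[str, str]]:
    """Split one OCR line into fields, or return None if it does not fit."""
    m = ROW_RE.match(line.strip())
    return m.groupdict() if m else None


def format_table(rows: List[Dict[str, str]]) -> str:
    """Pad every field to its widest value so the columns line up."""
    widths = {f: max(len(r[f]) for r in rows) for f in FIELDS}
    return "\n".join(
        "  ".join(r[f].ljust(widths[f]) for f in FIELDS) for r in rows
    )


# A few lines copied from the Tesseract output above, plus one noise line.
ocr_lines = [
    "10020 Knut Bratli, Brandval P.b. Chrysler 1936",
    "10101 Gunnar Bie-Larsen, Kongsvinger P.b. Ford : ©1961",
    "10140 Rolv Snare, Kongsvinger P.h. Mercedes Benz 1950",
    "NN -",
]

rows = [r for r in map(parse_row, ocr_lines) if r is not None]
print(format_table(rows))
```

The pattern is deliberately strict, so it is worth logging the lines that return `None` and checking them by hand rather than dropping them silently.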
Answered By - user898678