Issue
I am working with the Python tesseract package (pytesseract), using sample code like the following:
import pytesseract
from PIL import Image
tessdata_dir_config = "--tessdata-dir \"/opt/homebrew/Cellar/tesseract-lang/4.1.0/share/tessdata/\""
image = Image.open("dataset/test.jpeg")
text = pytesseract.image_to_string(image, lang = "chi-sim", config = tessdata_dir_config)
print(text)
And I received the following error message:
pytesseract.pytesseract.TesseractError: (1, 'Error opening data file /opt/homebrew/Cellar/tesseract-lang/4.1.0/share/tessdata/chi-sim.traineddata Please make sure the TESSDATA_PREFIX environment variable is set to your "tessdata" directory. Failed loading language 'chi-sim' Tesseract couldn't load any languages! Could not initialize tesseract.')
From my understanding, the error occurs when reading the file chi-sim.traineddata (the trained data for Simplified Chinese). Below are the attempts I have made to solve this problem.
- My development environment is macOS on an M1 Mac, and I installed tesseract and tesseract-lang from Homebrew. I am pretty sure that the path specified above is exactly where the source files are located, since when I call
print(pytesseract.get_languages(config = ""))
I get a long list of languages printed, including chi-sim.
- Further, if I just use English instead of Chinese, the following code successfully recognizes the English text in an image:
text = pytesseract.image_to_string(image)
- I've tried to specify the environment variable TESSDATA_PREFIX in multiple ways, including:
  - Using the config parameter as in the original code.
  - Adding a global environment variable in PyCharm.
  - Adding the following line in the code:
    os.environ["TESSDATA_PREFIX"] = "tesseract/4.1.1/share/tessdata/"
  - Adding the following line to bash_profile in the terminal:
    export TESSDATA_PREFIX=/opt/homebrew/Cellar/tesseract-lang/4.1.0/share/tessdata/
  But unfortunately, none of these work.
- It seems as if my file chi-sim.traineddata is somehow broken, so I downloaded the trained data file directly from GitHub (https://github.com/tesseract-ocr/tessdata/blob/master/chi_sim.traineddata) using the "Download" button on the right, and placed the downloaded file in the tesseract-lang directory and in the original tesseract directory (where eng.traineddata is located). I've tried both, but neither works.
Are there any potential solutions to this issue?
Solution
The code works for me on Linux if I use lang="chi_sim" (with _ instead of -), because the file downloaded from the server is named chi_sim.traineddata, also with _ instead of -.
If I rename the file to chi-sim.traineddata, then I can use lang="chi-sim" (with - instead of _).
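As a rough sketch of the point above, the language name can be checked against the files that are actually in the tessdata directory before calling pytesseract. The helper names resolve_lang and ocr_image are hypothetical (they are not part of pytesseract), and the sketch assumes the only mismatch is - versus _ in the file name:

```python
from pathlib import Path


def resolve_lang(lang, tessdata_dir):
    """Return a language name whose <name>.traineddata file exists in tessdata_dir.

    Hypothetical helper: tries the name as given, then with '-' replaced by '_'.
    """
    directory = Path(tessdata_dir)
    for candidate in (lang, lang.replace("-", "_")):
        if (directory / f"{candidate}.traineddata").is_file():
            return candidate
    raise FileNotFoundError(
        f"No traineddata file for {lang!r} found in {directory}"
    )


def ocr_image(image_path, lang, tessdata_dir):
    """Run OCR with a language name that matches an existing traineddata file."""
    # Imported here so resolve_lang can be used without pytesseract or Pillow installed.
    import pytesseract
    from PIL import Image

    # Same --tessdata-dir option as in the question's code.
    config = f'--tessdata-dir "{tessdata_dir}"'
    with Image.open(image_path) as image:
        return pytesseract.image_to_string(
            image, lang=resolve_lang(lang, tessdata_dir), config=config
        )
```

With the paths from the question, calling ocr_image("dataset/test.jpeg", "chi-sim", "/opt/homebrew/Cellar/tesseract-lang/4.1.0/share/tessdata/") would pass lang="chi_sim" to pytesseract if only chi_sim.traineddata is present in that directory.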
Answered By - furas