Issue
I'm trying to get some OCR done in a Docker container, and since I couldn't get it to work with Tesseract I tried refactoring to use PyMuPDF instead. The error I get is quite simple:
File "/code/table.py", line 35, in <module>
import fitz
ModuleNotFoundError: No module named 'fitz'
On my local (Windows) machine I'm able to get it running with code that looks like this:
import fitz
pages = fitz.open(source_path)  # open document
for page in pages:
    page_data = page.get_textpage_ocr(language='eng', dpi=600, full=True)
    <etc>
However, in Docker the exact same code does not work.
Relevant parts of my Dockerfile look like this:
FROM python:3.10
WORKDIR /code
COPY ./requirements.txt ./
RUN pip install --upgrade pip
RUN pip install --no-cache-dir -r requirements.txt
# install PyMupdf
RUN pip install pymupdf
COPY . .
CMD ["python", "./run.py"]
I also have pymupdf in my requirements file, but I install it separately just in case. Building the image gives no errors and works as it should.
Relevant parts of Docker-compose.yml
build: .
container_name: ocr
command: python ./run.py
volumes:
  - .:/code
  - type: bind
    source: "C:/Program Files/Tesseract-OCR/tessdata"
    target: /code/tessdata
And in my .env file I have a reference to the bind mount: TESS_DATA_PREFIX='/code/tessdata'
I've added TESS_DATA_PREFIX to my environment variables, although it does not seem necessary anymore, and the error happens well before I even try to use OCR.
Solution
The issue was that Docker was not picking up my changes when rebuilding the image. I removed all containers and the build cache, and now it works.
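For anyone hitting the same thing, a small check like the one below may help confirm whether the interpreter inside the container can actually see the package. This is only a sketch: the helper name module_status is made up for illustration, and it relies only on the standard library.

```python
import importlib.util
import sys


def module_status(name: str) -> str:
    """Return a one-line report on whether `name` can be imported here."""
    # find_spec returns None when a top-level module cannot be located.
    spec = importlib.util.find_spec(name)
    if spec is None:
        return f"{name}: NOT FOUND by {sys.executable}"
    return f"{name}: found at {spec.origin}"


if __name__ == "__main__":
    # "fitz" is the import name used in the question; pymupdf is the pip package.
    print(module_status("fitz"))
```

If this reports NOT FOUND inside the container even though the Dockerfile installs pymupdf, the container is probably still running from an older image, which would fit the stale build described above.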
EDIT: also, the correct environment variable is called TESSDATA_PREFIX, not TESS_DATA_PREFIX. This was my next error, but after changing .env to the correct variable name the code works exactly as configured above.
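Since the misspelled variable name fails silently until OCR is actually attempted, a startup check along these lines could surface it earlier. It is a rough sketch, not part of PyMuPDF: the function check_tessdata_env and its messages are hypothetical, and it only inspects the environment and the filesystem.

```python
import os

CORRECT = "TESSDATA_PREFIX"
MISSPELLED = "TESS_DATA_PREFIX"  # the name used by mistake in the question


def check_tessdata_env(env=None):
    """Return (ok, message) describing the tessdata environment setup."""
    env = os.environ if env is None else env
    path = env.get(CORRECT)
    if path:
        # The variable is set; make sure it points at an existing directory.
        if os.path.isdir(path):
            return True, f"{CORRECT}={path}"
        return False, f"{CORRECT} is set to {path}, but that directory does not exist"
    if MISSPELLED in env:
        return False, f"{MISSPELLED} is set; rename it to {CORRECT}"
    return False, f"{CORRECT} is not set"


if __name__ == "__main__":
    ok, message = check_tessdata_env()
    print("OK:" if ok else "Problem:", message)
```

Calling it at the top of run.py, before any OCR call, would report the wrong variable name right away instead of failing later.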
Answered By - qoob