Issue
I can extract text from an image in a PDF. The extracted text looks like this:
Koopliedenweg 38\nDeb. nr. : 108636 2991 LN BARENDRECHT\nYour VAT nr. : NL851703884B01 Nederland\nFactuur datum : 19-11-21\nAantal Omschrijving Prijs Bedrag\nOrder number : 76372 Loading date : 15-11-21 Incoterm: : FOT\nYour ref. : SCHOOLFRUIT Delivery date :\nWK46\nVerdi Import Schoolfruit\n566 Ananas
Crownless 14kg 10 Sweet CR Klasse I € 7,00 € 3.962,00\n706 Appels Royal Gala 13kg 60/65 Generica PL Klasse I € 4,68 € 3.304,08\n598 Peen Waspeen 14x1lkg 200-400 Generica BE Klasse I € 6,30 3.767,40\nOrder number : 76462 Loading date : 18-11-21 Incoterm: : FOT\nYour ref. : SCHOOLFRUIT Delivery date
But now I want to return some specific text. In this case I am looking for the string: Appels Royal Gala 13kg. So I tried it like this:
import io
from PIL import Image
import pytesseract
from wand.image import Image as wi

pdfFile = wi(filename = "C:\\Users\\engel\\Documents\\python\\docs\\fixedPDF.pdf", resolution = 300)
image = pdfFile.convert('jpeg')

imageBlobs = []
for img in image.sequence:
    imgPage = wi(image = img)
    imageBlobs.append(imgPage.make_blob('jpeg'))

extract = []
for imgBlob in imageBlobs:
    image = Image.open(io.BytesIO(imgBlob))
    text = pytesseract.image_to_string(image, lang = 'eng')
    extract.append(text)

interested_string = 'Appels Royal Gala 13kg'
line = [l[1] for l in extract if 'Appels Royal Gala 13kg' in l[1]]
print(line)
But the result is an empty list: []
So what do I have to change?
Thank you
OK, if I do this:
line = [l[1] for l in extract if 'Appels Royal Gala 13kg' in l]
I get this as output:
['=']
and then if I do it like this:
for imgBlob in imageBlobs:
    image = Image.open(io.BytesIO(imgBlob))
    text = pytesseract.image_to_string(image, lang='eng')
    extract.append(text)

interested_string = 'Verdi Import Schoolfruit'
line = [l[1] for l in extract if 'It was the best of' in l]
print(line)
then the result is:
['t']
OK, if I do this:
line = [l for l in extract if 'It was the best of' in l]
print(extract)
print(line)
then I get this as output:
['It was the best of\ntimes, it was the worst\nof times, it was the age\nof wisdom, it was the\nage of foolishness...\n\x0c']
['It was the best of\ntimes, it was the worst\nof times, it was the age\nof wisdom, it was the\nage of foolishness...\n\x0c']
Solution
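You are doing:
line = [l[1] for l in extract if 'Appels Royal Gala 13kg' in l[1]]
You are looking for your string in l[1], but each l in extract is the full text of one page, so l[1] is only the second character of that text. A single character can never contain the whole search string, so the condition is always false and the list stays empty. Test against l instead, so maybe:
line = [l[1] for l in extract if 'Appels Royal Gala 13kg' in l]
Note that this still collects l[1], which is one character per matching page. Collect l itself to get the page text back.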
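If the goal is to get back only the invoice line that mentions the product rather than a whole page, one possible approach is to split each page's text on newlines and filter those lines. The sketch below is only an illustration: find_lines is a hypothetical helper name, and the sample extract list is a hand-written stand-in for the OCR output shown in the question, not real pytesseract output. It also assumes the OCR reproduced the search string exactly, which is not guaranteed.

```python
def find_lines(pages, needle):
    """Return every line, across all pages, that contains `needle`.

    `pages` is a list of strings, one per page, like the `extract`
    list built in the question. (Hypothetical helper, not a library API.)
    """
    matches = []
    for page_text in pages:
        # Each page is one long string; break it into individual lines.
        for line in page_text.splitlines():
            if needle in line:
                matches.append(line.strip())
    return matches


# Hand-written stand-in for `extract`: one string per page (assumption).
extract = [
    "Verdi Import Schoolfruit\n"
    "706 Appels Royal Gala 13kg 60/65 Generica PL Klasse I € 4,68 € 3.304,08\n"
    "598 Peen Waspeen 14x1lkg 200-400 Generica BE Klasse I € 6,30 3.767,40\n"
]

# Indexing a page string with [1] gives a single character, not a line.
print(repr(extract[0][1]))

# Filtering line by line returns only the rows that mention the product.
print(find_lines(extract, "Appels Royal Gala 13kg"))
```

With real OCR output an exact substring test can miss a row if a character was misread, so a looser comparison (for example, matching on a shorter fragment such as 'Royal Gala') may be worth trying.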
Answered By - PlainRavioli