Issue
I am trying to filter some text out of a string.
So I have this string of text:
'It was the best of\ntimes, it was the worst\nof times, it was the age\nof wisdom, it was the\nage of foolishness...\n\x0c'
and then I just want this part extracted from it:
It was the best of
So I try it like this:
import io
from PIL import Image
import pytesseract
from wand.image import Image as wi
pdfFile = wi(
    filename="C:\\Users\\engel\\Documents\\python\\docs\\text.png", resolution=300)
image = pdfFile.convert('jpeg')
imageBlobs = []
for img in image.sequence:
    imgPage = wi(image=img)
    imageBlobs.append(imgPage.make_blob('jpeg'))
extract = []
flag = False
string_to_check = ['It was the best of']
for imgBlob in imageBlobs:
    image = Image.open(io.BytesIO(imgBlob))
    text = pytesseract.image_to_string(image, lang='eng')
    extract.append(text)
for char in string_to_check:
    if char in extract:
        print("Char \"" + char + "\" detected!")
But the output is empty.
So my question is: how can I improve this?
Thank you
OK, this is the complete code fragment:
import io
from PIL import Image
import pytesseract
from wand.image import Image as wi
import re
pdfFile = wi(
    filename="C:\\Users\\engel\\Documents\\python\\docs\\fixedPDF.pdf", resolution=300)
image = pdfFile.convert('jpeg')
imageBlobs = []
for img in image.sequence:
    imgPage = wi(image=img)
    imageBlobs.append(imgPage.make_blob('jpeg'))
extract = []
flag = False
string_to_check = ['']
substring = 'Peen Waspeen 14x1lkg'
for imgBlob in imageBlobs:
    image = Image.open(io.BytesIO(imgBlob))
    text = pytesseract.image_to_string(image, lang='eng')
    extract.append(text)
    allSubstring = re.findall(r'{}'.format(substring),text)
    print(allSubstring[0])
But then I also get this error:
Peen Waspeen 14x1lkg
Traceback (most recent call last):
  File "c:\Users\engel\Documents\python\code\textFromImages.py", line 29, in <module>
    print(allSubstring[0])
IndexError: list index out of range
Solution
Use regex for substring filtering. If you want to find a particular substring within a string, you can simply use this:
import re
s = 'It was the best of\ntimes, it was the worst\nof times, it was the age\nof wisdom, it was the\nage of foolishness...\n\x0c'
substring = 'It was the best of'
allSubstrings = re.findall(r'{}'.format(substring), s)
print(allSubstrings[0])
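One caveat with the snippet above: re.findall returns an empty list when nothing matches, so indexing with [0] raises IndexError, which is the error shown in the question. The pattern is also interpreted as a regular expression, so characters such as "." or "(" in the substring have a special meaning. A minimal sketch that handles both cases, assuming a literal match is what you want (find_literal is a hypothetical helper name, not part of re):

```python
import re

def find_literal(substring, text):
    """Return every literal occurrence of substring in text (possibly an empty list)."""
    # re.escape keeps characters such as '.', '(' or '+' from being read as regex syntax.
    return re.findall(re.escape(substring), text)

s = 'It was the best of\ntimes, it was the worst\nof times, it was the age\nof wisdom, it was the\nage of foolishness...\n\x0c'

matches = find_literal('It was the best of', s)
# Only index into the result after checking that something matched.
first = matches[0] if matches else None
print(first)

# A substring that does not occur gives an empty list instead of an error.
missing = find_literal('Peen Waspeen', s)
print(missing)
```

If you only need to know whether the substring is present, the plain `substring in s` test is enough and avoids regex altogether.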
If you want the first of the substrings separated by \n (that is, the first line), you can split the string with:
import re
s = 'It was the best of\ntimes, it was the worst\nof times, it was the age\nof wisdom, it was the\nage of foolishness...\n\x0c'
allSubstrings = re.split(r'\n', s)
print(allSubstrings[0])
Both snippets print the substring you are looking for. Output:
It was the best of
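As for why the first attempt in the question prints nothing: extract is a list with one string per page, so `char in extract` asks whether a whole page equals the search string, not whether a page contains it. A sketch of a per-page check follows; the sample pages stand in for the OCR output, and pages_containing is a hypothetical helper name used only for illustration:

```python
# Stand-in for the list that the OCR loop builds, one string per page.
extract = [
    'It was the best of\ntimes, it was the worst\n',
    'of times, it was the age\nof wisdom\n\x0c',
]
strings_to_check = ['It was the best of', 'Peen Waspeen']

def pages_containing(needle, pages):
    """Return the indexes of the pages whose text contains needle."""
    # 'needle in page' is a substring test because each page is a str.
    return [i for i, page in enumerate(pages) if needle in page]

for needle in strings_to_check:
    hits = pages_containing(needle, extract)
    if hits:
        print(f'"{needle}" detected on page(s) {hits}')
    else:
        print(f'"{needle}" not found')
```

The same idea applies to the second fragment in the question: a page that does not contain the substring should be skipped rather than indexed with [0].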
Answered By - Nyquist