Friday, September 30, 2022

[FIXED] How to filter substring from string in python?

Labels: python, python-tesseract

Issue

I am trying to filter some text out of a string.

So I have this string of text:

'It was the best of\ntimes, it was the worst\nof times, it was the age\nof wisdom, it was the\nage of foolishness...\n\x0c'

and then I just want this part extracted from it:

It was the best of

So I tried it like this:

import io
from PIL import Image
import pytesseract
from wand.image import Image as wi

pdfFile = wi(
    filename="C:\\Users\\engel\\Documents\\python\\docs\\text.png", resolution=300)
image = pdfFile.convert('jpeg')

imageBlobs = []


for img in image.sequence:
    imgPage = wi(image=img)
    imageBlobs.append(imgPage.make_blob('jpeg'))

extract = []
flag = False
string_to_check = ['It was the best of']

for imgBlob in imageBlobs:
    image = Image.open(io.BytesIO(imgBlob))
    text = pytesseract.image_to_string(image, lang='eng')
    extract.append(text)
for char in string_to_check:
    if char in extract:
        print("Char \"" + char + "\" detected!")

But the output is empty.

So my question is: how can I improve this?

Thank you

OK, this is the complete code fragment:

import io
from PIL import Image
import pytesseract
from wand.image import Image as wi
import re

pdfFile = wi(
    filename="C:\\Users\\engel\\Documents\\python\\docs\\fixedPDF.pdf", resolution=300)
image = pdfFile.convert('jpeg')

imageBlobs = []


for img in image.sequence:
    imgPage = wi(image=img)
    imageBlobs.append(imgPage.make_blob('jpeg'))

extract = []
flag = False
string_to_check = ['']
substring = 'Peen Waspeen 14x1lkg'


for imgBlob in imageBlobs:
    image = Image.open(io.BytesIO(imgBlob))
    text = pytesseract.image_to_string(image, lang='eng')
    extract.append(text)
    allSubstring = re.findall(r'{}'.format(substring),text) 
    print(allSubstring[0])

But then I also get this error:

Peen Waspeen 14x1lkg
Traceback (most recent call last):
  File "c:\Users\engel\Documents\python\code\textFromImages.py", line 29, in <module>
    print(allSubstring[0])
IndexError: list index out of range

Solution

Use regex for substring filtering. If you want to find a particular substring within a string, you can simply use this:

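import re

s = 'It was the best of\ntimes, it was the worst\nof times, it was the age\nof wisdom, it was the\nage of foolishness...\n\x0c'
substring = 'It was the best of'

# re.escape treats the substring as literal text, even if it contains regex metacharacters
allSubstrings = re.findall(re.escape(substring), s)
# findall returns an empty list when nothing matches, so check before indexing
if allSubstrings:
    print(allSubstrings[0])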
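The IndexError in the question is what you get when re.findall returns an empty list for a page that does not contain the substring, and the code then indexes element 0. Here is a minimal sketch of a per-page search that returns None instead of raising. The helper name find_substring and the two sample page texts are made up for illustration; they stand in for the strings pytesseract would return.

```python
import re

def find_substring(text, substring):
    """Return the first literal occurrence of substring in text, or None."""
    # re.escape makes characters such as '.' or '(' match literally.
    match = re.search(re.escape(substring), text)
    return match.group(0) if match else None

# Hypothetical OCR results for two pages; only the first contains the target.
pages = ['Peen Waspeen 14x1lkg\nother text\n', 'a page without the target\n']

results = [find_substring(page, 'Peen Waspeen 14x1lkg') for page in pages]
print(results)  # ['Peen Waspeen 14x1lkg', None]
```

Returning None for a page without a match lets the caller decide what to do with it, rather than stopping at the first page that lacks the text.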

If you want the first line of a string whose lines are separated by \n, you can split the string:

import re

s = 'It was the best of\ntimes, it was the worst\nof times, it was the age\nof wisdom, it was the\nage of foolishness...\n\x0c'

allSubstrings = re.split(r'\n', s)
print(allSubstrings[0])
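For a fixed separator such as \n, a regular expression is not strictly required. A minimal sketch using only string methods follows. Stripping the trailing newline and form feed (\x0c) first is an assumption about what you want, so that the last element is not an empty or control-character-only entry.

```python
s = 'It was the best of\ntimes, it was the worst\nof times, it was the age\nof wisdom, it was the\nage of foolishness...\n\x0c'

# Drop the trailing newline and form feed, then split on newlines.
lines = s.strip('\n\x0c').split('\n')

first_line = lines[0]
print(first_line)   # It was the best of
print(len(lines))   # 5
```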

Both the re.findall and the re.split snippets print the substring you are looking for. Output:

It was the best of
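As for why the first attempt in the question printed nothing: extract is a list of page texts, so `char in extract` asks whether some element of the list is exactly equal to the string, not whether the string occurs inside one of the texts. A small sketch of the difference, using a shortened sample string in place of real OCR output (the two helper names are illustrative only):

```python
# Stand-in for the OCR result: one text string per page (sample data, not real output).
extract = ['It was the best of\ntimes, it was the worst\nof times...\n\x0c']
string_to_check = ['It was the best of']

def found_as_element(wanted, pages):
    # List membership: True only if one element equals `wanted` exactly.
    return wanted in pages

def found_in_any_page(wanted, pages):
    # Substring test, applied to each page's text in turn.
    return any(wanted in page for page in pages)

for wanted in string_to_check:
    print(found_as_element(wanted, extract))   # False
    print(found_in_any_page(wanted, extract))  # True
```

For a plain containment check like this, the `in` operator on each string is enough; a regular expression is only needed when you are matching a pattern rather than fixed text.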

Answered By - Nyquist

This answer was collected from Stack Overflow and tested by PythonFixing community admins. It is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
