Wednesday, June 8, 2022

[FIXED] Filtering through pytesseract results using regex

Labels: python, python-tesseract, regex, tesseract

Issue

I'm using pytesseract to extract names from images. Each image is a crop of the bounding box around one name, so it should contain the name by itself and nothing else.

I get good results, but because my ROI selection isn't very good, I sometimes get bounding boxes around things I don't care about.

My idea was to run pytesseract on all the images and keep only the results that are entirely upper case and that differ from two specific all-caps words (NOM and PRENOM) that I also don't care about.

This is the code:

import glob
import re

import pytesseract

# Adding custom options
folder = r"C:\Users\lenovo\PycharmProjects\SoftOCR_Final\names"
custom_config = r'--oem 3 --psm 6'
words = []
regex = r"\b[A-Z]+(?:\s+[A-Z]+)*\b"
for img in glob.glob(rf"{folder}\*.png") or range(20):
    text = pytesseract.image_to_string(img, config=custom_config)
    if re.search(regex, text) and text != 'NOM' and text != 'PRENOM':
        words.append(text)
print(words)

I still get values like the fourth entry below, which I don't want:

['HAREFED\n\x0c', 'ACHRAF\n\x0c', 'MANSOUR\n\x0c', 'Nom et Prénom Surveillant(s) | Signature(s)\nTE Rakes |\nFabel Sha! —— |\n|\n\x0c', 'ZAOQUAM\n\x0c', 'OUMAYMA\n\x0c']

I only want the all-caps names, that is, every entry except the fourth:

['HAREFED\n\x0c', 'ACHRAF\n\x0c', 'MANSOUR\n\x0c', 'Nom et Prénom Surveillant(s) | Signature(s)\nTE Rakes |\nFabel Sha! —— |\n|\n\x0c', 'ZAOQUAM\n\x0c', 'OUMAYMA\n\x0c']

Someone help please, I feel like I'm very close to cracking this. I could be wrong though; I'm really only a beginner.

Solution

I'm having a hard time understanding what you're trying to do, but if you're looking to grab all-caps words you can do:

re.match('[A-Z]+$', text.rstrip())

Note that I'm stripping the trailing whitespace (the newline and the form feed character \x0c) so that only the all-caps word remains. Is that what you want here?

>>> [re.match(r'[A-Z]+$', s.strip()) for s in ['HAREFED\n\x0c', 'ACHRAF\n\x0c', 'MANSOUR\n\x0c', 'Nom et Prénom Surveillant(s) | Signature(s)\nTE Rakes |\nFabel Sha! —— |\n|\n\x0c', 'ZAOQUAM\n\x0c', 'OUMAYMA\n\x0c']]
[
    <re.Match object; span=(0, 7), match='HAREFED'>,
    <re.Match object; span=(0, 6), match='ACHRAF'>,
    <re.Match object; span=(0, 7), match='MANSOUR'>,
    None, 
    <re.Match object; span=(0, 7), match='ZAOQUAM'>,
    <re.Match object; span=(0, 7), match='OUMAYMA'>
]
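Putting this back into the loop from the question, a minimal sketch might look like the following. The names `keep_name`, `NAME_RE` and `EXCLUDED` are illustrative and not part of the original post, and the two extra sample strings ('NOM' and the two-word name) are made up for the demonstration. The pattern is the question's own regex without the word boundaries, applied with `re.fullmatch` to the stripped text, so a long string that merely contains one capital letter no longer passes the way it did with `re.search`.

```python
import re

# Hypothetical helper names for illustration; not from the original post.
EXCLUDED = {"NOM", "PRENOM"}
NAME_RE = re.compile(r"[A-Z]+(?:\s+[A-Z]+)*")


def keep_name(text):
    """Return the cleaned name if text is an all-caps name, else None."""
    cleaned = text.strip()  # drops the trailing '\n' and '\x0c'
    # fullmatch requires the WHOLE string to be all-caps words
    if NAME_RE.fullmatch(cleaned) and cleaned not in EXCLUDED:
        return cleaned
    return None


samples = [
    'HAREFED\n\x0c',
    'NOM\n\x0c',                      # made-up sample: excluded word
    'Nom et Prénom Surveillant(s) | Signature(s)\nTE Rakes |\n\x0c',
    'EL AMRANI\n\x0c',                # made-up sample: two-word name
]
names = [n for n in map(keep_name, samples) if n is not None]
print(names)  # ['HAREFED', 'EL AMRANI']
```

In the original loop, the same helper would be called on the value returned by `pytesseract.image_to_string(img, config=custom_config)`, appending the result to `words` only when it is not None.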

And if it is as simple as that, then you don't need a regex at all; just check whether text == text.upper():

>>> terms = ['HAREFED\n\x0c', 'ACHRAF\n\x0c', 'MANSOUR\n\x0c', 'Nom et Prénom Surveillant(s) | Signature(s)\nTE Rakes |\nFabel Sha! —— |\n|\n\x0c', 'ZAOQUAM\n\x0c', 'OUMAYMA\n\x0c']
>>> [s.strip() for s in terms if s==s.upper()]
# ['HAREFED', 'ACHRAF', 'MANSOUR', 'ZAOQUAM', 'OUMAYMA']
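One caveat with the `s == s.upper()` comparison is that a string containing no letters at all is also equal to its upper-cased form, so stray OCR output made only of punctuation or digits would pass. The sketch below compares it with the built-in `str.isupper()`, which additionally requires at least one cased character. The sample strings '|' and '2022' are made up for illustration and do not come from the original post.

```python
# Made-up samples: one real name, two letter-free strings, one mixed-case string.
terms = ['HAREFED\n\x0c', '|\n\x0c', '2022\n\x0c', 'Fabel Sha!\n\x0c']

# Comparison with upper(): letter-free strings slip through.
by_upper = [s.strip() for s in terms if s == s.upper()]

# str.isupper(): needs at least one cased character, all of them upper case.
by_isupper = [s.strip() for s in terms if s.isupper()]

print(by_upper)    # ['HAREFED', '|', '2022']
print(by_isupper)  # ['HAREFED']
```

Note that `isupper()` still accepts strings that mix capital letters with digits or punctuation, so the regex check remains the stricter option when only letters are acceptable.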

Answered By - David542

This Answer collected from stackoverflow and tested by PythonFixing community admins, is licensed under cc by-sa 2.5 , cc by-sa 3.0 and cc by-sa 4.0
