Friday, September 30, 2022

[FIXED] Regex - occurrences of a batch of keywords in a text

Tags: python, python-3.x, regex

Issue

I'm doing keyword extraction on documents.

The inputs are:

thousands of documents (up to 2GB in size)
about 200k keywords, grouped by category

As of now, for every document, we search every keyword one by one, which I think is inefficient.

So I thought about compiling regexes by category of keywords using pipes:

import re

text = """
Contrary to popular belief, Lorem Ipsum is not simply random text.
It has roots in a piece of classical Latin literature from 45 BC,
making it over 2000 years old.
Richard McClintock, a Latin professor at Hampden-Sydney College in Virginia,
looked up one of the more obscure Latin words,
consectetur, from a Lorem Ipsum passage,
and going through the cites of the word in classical literature,
discovered the undoubtable source.
Lorem Ipsum comes from sections 1.10.32 and 1.10.33 of
"de Finibus Bonorum et Malorum" (The Extremes of Good and Evil) by Cicero,
written in 45 BC. This book is a treatise on the theory of ethics,
very popular during the Renaissance. The first line of Lorem Ipsum,
"Lorem ipsum dolor sit amet..", comes from a line in section 1.10.32. 
"""

regexes = [
    r'(?P<Writing__book>book)',
    r'(?P<Writing__word>word)',
    r'(?P<Writing__latin>latin)',
    r'(?P<Writing__text>text)',
    r'(?P<Writing__literature>literature)',
    r'(?P<Cities__virginia>virginia)',
    r'(?P<Genre__classical>classical)',
    r'(?P<Genre__renaissance>renaissance)',
]
compiled_regex = '|'.join(regexes)
results = re.findall(
        compiled_regex,
        text,
        flags=re.MULTILINE | re.IGNORECASE
    )
for result in results:
    print(result)

This prints:

('', '', '', 'text', '', '', '', '')
('', '', '', '', '', '', 'classical', '')
('', '', 'Latin', '', '', '', '', '')
('', '', '', '', 'literature', '', '', '')
('', '', 'Latin', '', '', '', '', '')
('', '', '', '', '', 'Virginia', '', '')
('', '', 'Latin', '', '', '', '', '')
('', 'word', '', '', '', '', '', '')
('', 'word', '', '', '', '', '', '')
('', '', '', '', '', '', 'classical', '')
('', '', '', '', 'literature', '', '', '')
('book', '', '', '', '', '', '', '')
('', '', '', '', '', '', '', 'Renaissance')
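
For context, when a pattern contains more than one capturing group, re.findall returns one tuple per match, with one slot per group and an empty string for every group that did not take part in the match. That is why each row above has a single non-empty entry. A minimal sketch with two made-up groups (the names a and b are only for illustration):

```python
import re

# Two alternatives, each wrapped in its own named group.
pattern = r'(?P<a>cat)|(?P<b>dog)'

# findall yields one tuple per match; groups that did not match are ''.
rows = re.findall(pattern, "cat dog cat")
print(rows)  # [('cat', ''), ('', 'dog'), ('cat', '')]
```

The group names are lost in this output, so mapping a match back to its category__keyword means relying on the tuple position.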

What I'd like to get is a dictionary with each category__keyword and the number of occurrences, like:

{'Writing__book': 1, 'Writing__word': 2, 'Cities__virginia': 1, ...}

Solution

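Here is a solution you can try: switch from re.findall to re.finditer, then use groupdict() on each match to see which named group matched and count it.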

import re

from collections import defaultdict

text = """..."""

regexes = ["..."]

compiled_regex = '|'.join(regexes)

results = re.finditer(  # <-- Changed to finditer, which returns an iterator (efficient on large data)
    compiled_regex,
    text,
    flags=re.MULTILINE | re.IGNORECASE
)

word_counts = defaultdict(int)  # <-- Default dict to track counts

for result in results:
    for key_, value_ in result.groupdict().items():  # <-- Use groupdict(), since the pattern uses named capturing groups
        if value_:
            word_counts[key_] += 1

print(word_counts)

This prints:

defaultdict(<class 'int'>, {'Writing__text': 1, 'Genre__classical': 2, 'Writing__latin': 3, 'Writing__literature': 2, 'Cities__virginia': 1, 'Writing__word': 2, 'Writing__book': 1, 'Genre__renaissance': 1})
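
With around 200k keywords, walking the whole groupdict() for every match may become the slow part, and a 2GB document may not be comfortable to hold in memory. A possible variation, offered as a sketch rather than a tested fix for that scale: read the name of the matching group from match.lastgroup and feed the document line by line. This assumes each alternative is exactly one named group with no nested groups inside it, and that no keyword spans a line break. The count_keywords helper and the sample text are hypothetical, not part of the original answer.

```python
import io
import re
from collections import Counter

# Same pattern shape as above: one named group per keyword, joined with "|".
regexes = [
    r'(?P<Writing__book>book)',
    r'(?P<Writing__word>word)',
    r'(?P<Writing__latin>latin)',
    r'(?P<Genre__classical>classical)',
]
pattern = re.compile('|'.join(regexes), flags=re.IGNORECASE)


def count_keywords(lines, pattern):
    """Count matches per named group over an iterable of text lines."""
    counts = Counter()
    for line in lines:
        # lastgroup is the name of the last matched capturing group; with
        # one named group per alternative, that is the keyword that matched.
        counts.update(m.lastgroup for m in pattern.finditer(line))
    return counts


# io.StringIO stands in for a real file object, e.g. open(path, encoding="utf-8").
sample = io.StringIO(
    "The book cites a Latin word.\n"
    "Another word follows in classical Latin.\n"
)
word_counts = count_keywords(sample, pattern)
print(dict(word_counts))
```

Because a file object yields lines lazily, only one line is in memory at a time. If keywords can contain spaces that wrap across lines, the text would need to be read in larger overlapping chunks instead.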

Answered By - sushanth

This answer, collected from Stack Overflow and tested by PythonFixing community admins, is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
