Issue
I'm doing keyword extraction on documents.
The inputs are:
- thousands of documents (up to 2GB in size)
- about 200k keywords, grouped by category
As of now, for every document, we search for every keyword one by one, which I think is inefficient.
So I thought about compiling regexes by keyword category, joined with pipes:
import re
text = """
Contrary to popular belief, Lorem Ipsum is not simply random text.
It has roots in a piece of classical Latin literature from 45 BC,
making it over 2000 years old.
Richard McClintock, a Latin professor at Hampden-Sydney College in Virginia,
looked up one of the more obscure Latin words,
consectetur, from a Lorem Ipsum passage,
and going through the cites of the word in classical literature,
discovered the undoubtable source.
Lorem Ipsum comes from sections 1.10.32 and 1.10.33 of
"de Finibus Bonorum et Malorum" (The Extremes of Good and Evil) by Cicero,
written in 45 BC. This book is a treatise on the theory of ethics,
very popular during the Renaissance. The first line of Lorem Ipsum,
"Lorem ipsum dolor sit amet..", comes from a line in section 1.10.32.
"""
regexes = [
    r'(?P<Writing__book>book)',
    r'(?P<Writing__word>word)',
    r'(?P<Writing__latin>latin)',
    r'(?P<Writing__text>text)',
    r'(?P<Writing__literature>literature)',
    r'(?P<Cities__virginia>virginia)',
    r'(?P<Genre__classical>classical)',
    r'(?P<Genre__renaissance>renaissance)',
]
compiled_regex = '|'.join(regexes)
results = re.findall(
    compiled_regex,
    text,
    flags=re.MULTILINE | re.IGNORECASE
)
for result in results:
    print(result)
This prints:
('', '', '', 'text', '', '', '', '')
('', '', '', '', '', '', 'classical', '')
('', '', 'Latin', '', '', '', '', '')
('', '', '', '', 'literature', '', '', '')
('', '', 'Latin', '', '', '', '', '')
('', '', '', '', '', 'Virginia', '', '')
('', '', 'Latin', '', '', '', '', '')
('', 'word', '', '', '', '', '', '')
('', 'word', '', '', '', '', '', '')
('', '', '', '', '', '', 'classical', '')
('', '', '', '', 'literature', '', '', '')
('book', '', '', '', '', '', '', '')
('', '', '', '', '', '', '', 'Renaissance')
What I'd like to get is a dictionary with each category__keyword
and the number of occurrences, like:
{'Writing__book': 1, 'Writing__word': 2, 'Cities__virginia': 1, ...}
Solution
Here is a solution you can try:
import re
from collections import defaultdict
text = """..."""
regexes = ["..."]
compiled_regex = '|'.join(regexes)
results = re.finditer(  # <-- Changed to finditer, which returns an iterator (memory-efficient on large data)
    compiled_regex,
    text,
    flags=re.MULTILINE | re.IGNORECASE
)
word_counts = defaultdict(int)  # <-- defaultdict to track counts
for result in results:
    for key_, value_ in result.groupdict().items():  # <-- Use groupdict(), since you have named capturing groups
        if value_:
            word_counts[key_] += 1
print(word_counts)
Output:
defaultdict(<class 'int'>, {'Writing__text': 1, 'Genre__classical': 2, 'Writing__latin': 3, 'Writing__literature': 2, 'Cities__virginia': 1, 'Writing__word': 2, 'Writing__book': 1, 'Genre__renaissance': 1})
Answered By - sushanth
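A possible variation, not part of the answer above: with around 200k named groups, looping over groupdict() for every match visits every group each time. Since the pattern is a flat alternation, only one named group can match at a time, and the match object's lastgroup attribute gives its name directly. The sketch below is only an illustration under assumptions: the helper name count_keywords and the keywords_by_category mapping are hypothetical, and it assumes each category and keyword contains only characters valid in a Python identifier (letters, digits, underscore), because they are used as group names.

```python
import re
from collections import Counter


def count_keywords(text, keywords_by_category):
    """Count keyword occurrences, keyed as 'Category__keyword'.

    Hypothetical helper: assumes categories and keywords are valid
    identifier fragments, since they end up inside group names.
    """
    # One named group per keyword; re.escape protects against any
    # regex metacharacters in the keyword itself.
    parts = [
        f"(?P<{category}__{keyword}>{re.escape(keyword)})"
        for category, keywords in keywords_by_category.items()
        for keyword in keywords
    ]
    pattern = re.compile("|".join(parts), flags=re.IGNORECASE)
    # In a flat alternation only one group participates in each match,
    # and lastgroup is its name, so groupdict() is not needed.
    return Counter(match.lastgroup for match in pattern.finditer(text))


sample = "A book about Latin words in a latin book from Virginia."
counts = count_keywords(
    sample,
    {"Writing": ["book", "word", "latin"], "Cities": ["virginia"]},
)
print(dict(counts))
# {'Writing__book': 2, 'Writing__latin': 2, 'Writing__word': 1, 'Cities__virginia': 1}
```

Like the original pattern, this matches substrings, so "word" is also counted inside "words". Wrapping each keyword in \b word boundaries would restrict it to whole words if that is what you need.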