Issue
If I use:
import re
words = re.findall(r"(?u)\b\w\w+\b", "aaa, bbb ccc. ddd\naaa xxx yyy")
print(words)
print(len(words))
as expected, I get:
['aaa', 'bbb', 'ccc', 'ddd', 'aaa', 'xxx', 'yyy']
7
Now I would like to modify the regular expression so that it also counts 2-grams and 3-grams. Words belong to the same n-gram only when they are separated by spaces; punctuation and newlines break the sequence. For this input, the result I expect is:
['aaa', 'bbb', 'ccc', 'ddd', 'aaa', 'xxx', 'yyy', 'bbb ccc', 'aaa xxx', 'xxx yyy', 'aaa xxx yyy']
11
How can I modify the regular expression to be able to do this?
Solution
Original answer
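import re
from itertools import chain
s = "aaa, bbb ccc. ddd\naaa xxx yyy"
# The capturing lookahead (?=(...)) matches without consuming text, so overlapping n-grams are all found.
# Note: \w\w\w+ needs at least three characters per word, whereas the question's \w\w+ needs only two.
result = list(chain(*(re.findall(r'(?=((?<!\w)\w\w\w+' + r' \w\w\w+' * n + r'(?!\w)))', s)
                      for n in range(3))))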
Output:
>>> result
['aaa', 'bbb', 'ccc', 'ddd', 'aaa', 'xxx', 'yyy', 'bbb ccc', 'aaa xxx', 'xxx yyy', 'aaa xxx yyy']
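The reason for wrapping the pattern in a lookahead may not be obvious, so here is a minimal sketch (not part of the original answer) comparing a plain 2-gram pattern with the same pattern inside a capturing lookahead. It uses the question's two-character minimum, \w\w+, and the variable names consuming and overlapping are only illustrative.

```python
import re

s = "aaa, bbb ccc. ddd\naaa xxx yyy"

# Plain pattern: each match consumes its text, so after "aaa xxx" is matched
# the scan resumes after "xxx" and the overlapping 2-gram "xxx yyy" is lost.
consuming = re.findall(r"\b\w\w+ \w\w+\b", s)

# Same pattern inside a capturing lookahead: the match itself is zero-width,
# so an attempt is made at every word boundary and overlaps are kept.
overlapping = re.findall(r"\b(?=(\w\w+ \w\w+\b))", s)

print(consuming)    # ['bbb ccc', 'aaa xxx']
print(overlapping)  # ['bbb ccc', 'aaa xxx', 'xxx yyy']
```

Because the separator in the pattern is a literal space, a comma, period, or newline between two words prevents them from forming a 2-gram.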
Improved answer (thanks to @CasimiretHippolyte for the useful comments)
import re
from itertools import chain
s = "aaa, bbb ccc. ddd\naaa xxx yyy"
# \b starts each attempt at a word boundary; the greedy \w+ then runs to the end of each word.
result = list(chain(*(re.findall(r'\b(?=(\w\w\w+' + r' \w\w\w+' * n + r'))', s)
                      for n in range(3))))
Answered By - Riccardo Bucco
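Since the question is about counting, the improved pattern could be wrapped in a small helper along these lines. This is only a sketch under stated assumptions: the function name ngrams and the parameters max_n and min_len are hypothetical and do not come from the answer above, and min_len defaults to 2 to follow the question's \w\w+ rather than the answer's \w\w\w+.

```python
import re
from collections import Counter


def ngrams(text, max_n=3, min_len=2):
    """Return every 1..max_n-gram of space-separated words (hypothetical helper)."""
    # A "word" here is min_len or more word characters.
    word = r"\w{%d,}" % min_len
    found = []
    for n in range(1, max_n + 1):
        # n words joined by single literal spaces, captured inside a
        # zero-width lookahead so that overlapping n-grams are reported.
        pattern = r"\b(?=(" + word + (" " + word) * (n - 1) + r"\b))"
        found.extend(re.findall(pattern, text))
    return found


grams = ngrams("aaa, bbb ccc. ddd\naaa xxx yyy")
counts = Counter(grams)

print(len(grams))     # 11
print(counts["aaa"])  # 2
```

Collecting the matches in a Counter gives per-n-gram frequencies, while len(grams) gives the total that the question asks for.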