Issue
I'm using the PyPI module regex for regex matching. Its documentation says:
Default Unicode word boundary
The WORD flag changes the definition of a ‘word boundary’ to that of a default Unicode word boundary. This applies to \b and \B.
But nothing seems to have changed:
>>> import regex
>>> r1 = regex.compile(r".\b.", flags=regex.UNICODE)
>>> r2 = regex.compile(r".\b.", flags=regex.UNICODE | regex.WORD)
>>> r1.findall("русский ελλανικα")
['й ', ' ε']
>>> r2.findall("русский ελλανικα")
['й ', ' ε']
I didn't observe any difference...?
Solution
The difference between using and not using the WORD flag is the way word boundaries are defined.
Given this example:
import regex
t = 'A number: 3.4 :)'
print(regex.search(r'\b3\b', t))
print(regex.search(r'\b3\b', t, flags=regex.WORD))
The first prints a match while the second prints None. Why? Because a “Unicode word boundary” is determined by a set of rules for distinguishing word boundaries, while Python's default word boundary is simply any position between a \w character and a \W character, or between a \w character and the start or end of the string (where \w still means Unicode alphanumerics).
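To make the default definition concrete, here is a minimal sketch that recomputes the default boundaries by hand and compares them with what \b finds. It uses the standard library re module rather than regex, on the assumption that both define the default \b the same way; the helper name default_boundaries is made up for this illustration.

```python
import re

def default_boundaries(text):
    """Return positions where exactly one neighbour is a \\w character."""
    is_word = [re.match(r"\w", ch) is not None for ch in text]
    positions = []
    for i in range(len(text) + 1):
        before = is_word[i - 1] if i > 0 else False      # start of string counts as non-word
        after = is_word[i] if i < len(text) else False   # end of string counts as non-word
        if before != after:
            positions.append(i)
    return positions

text = "A number: 3.4 :)"
by_hand = default_boundaries(text)
by_re = [m.start() for m in re.finditer(r"\b", text)]

print(by_hand)           # [0, 1, 2, 8, 10, 11, 12, 13]
print(by_hand == by_re)  # True
```

Positions 11 and 12 are the two sides of the period in 3.4, which is why \b3\b matches without the WORD flag.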
In the example, 3.4 is split by Python's default word boundary because a \W character, the period, is present, so there is a word boundary on each side of it. The Unicode word boundary rules, however, forbid breaks around a “.” inside a number such as “3.4”, so the period is not considered a word boundary.
See all the Unicode word boundary rules here: https://unicode.org/reports/tr29/#Word_Boundary_Rules
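As a rough illustration of why the second search returns None, the sketch below starts from the default boundaries and removes the ones on either side of a period that sits between two digits. This is a deliberately simplified stand-in for a single rule, not the regex module's implementation and not the full Unicode rule set; the function name numeric_aware_boundaries and the use of str.isdigit() are assumptions made for this example.

```python
import re

def numeric_aware_boundaries(text):
    """Default \\b positions, minus those touching a '.' between two digits."""
    default = [m.start() for m in re.finditer(r"\b", text)]
    forbidden = set()
    for i in range(1, len(text) - 1):
        if text[i] == "." and text[i - 1].isdigit() and text[i + 1].isdigit():
            forbidden.add(i)      # position between the digit and the "."
            forbidden.add(i + 1)  # position between the "." and the next digit
    return [pos for pos in default if pos not in forbidden]

text = "A number: 3.4 :)"
boundaries = numeric_aware_boundaries(text)
print(boundaries)  # [0, 1, 2, 8, 10, 13]

# \b3\b needs a boundary on both sides of the "3".
start = text.index("3")
print(start in boundaries and start + 1 in boundaries)  # False
```

With the boundary after the 3 gone, \b3\b has nothing to match, which mirrors the None printed by the search that uses regex.WORD.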
Conclusion:
Both work with Unicode or your LOCALE. By default, a word boundary is just the empty string between a \w and a \W, since “a word is defined as a sequence of word characters [\w]”. The WORD flag replaces that definition with an additional set of rules for distinguishing word boundaries.
Answered By - Taku