Issue
While working on an answer to this question, I came up with this regex:
(?:(?!\2)(?:,foo=([^,]*),(?=())|.))*\2bar=2
(Note: this regex requires the PyPI regex
module)
(Short explanation: The regex relies on the fact that capture groups in lookaheads can't change their value after they've matched once, so after the first foo=
is found, the (?=())
matches and from that point onwards (?!\2)
will always fail.)
This regex works correctly with the 2 examples given in the question:
>>> pattern = r'(?:(?!\2)(?:,foo=([^,]*),(?=())|.))*\2bar=2'
>>> regex.match(pattern, 'baz=0,foo=1,bar=2,foo=3,bar=4').group(1)
'1'
>>> regex.match(pattern, 'baz=0,foo=1,foo=1,bar=2')
>>>
But something strange happens if there's an occurrence of foo=
after a bar=2
:
>>> # this doesn't match, as expected:
>>> regex.match(pattern, 'notfoo=1,bar=2')
>>> # but how the heck does it match this ?!
>>> regex.match(pattern, 'notfoo=1,bar=2,foo=3,')
<regex.Match object; span=(0, 14), match='notfoo=1,bar=2'>
As you can see, the string 'notfoo=1,bar=2,foo=3,'
produced a match of notfoo=1,bar=2
. The foo=3,
isn't even included in the match, but if it's removed, the regex no longer matches! How is this possible? Is this a bug in the regex
module?
Solution
This actually makes perfect sense. The reason for this behavior is simple: backtracking.
The sequence of events is as follows:
- The greedy group
(?:...)*
advances one character at a time until it finally finds an occurrence offoo=
at,foo=3,
- The regex attempts to match
bar=2
, but this fails - The regex backtracks 1 character at a time until
bar=2
matches, giving us a result ofnotfoo=1,bar=2
.
So, what can we do about this? We can move the bar=2
into the greedy group and use another capture group to assert that the regex matched successfully:
(?:(?!\2)(?:,foo=([^,]*),(?=())bar=2()|.))*\3
>>> pattern = r'(?:(?!\2)(?:,foo=([^,]*),(?=())bar=2()|.))*\3'
>>> regex.match(pattern, 'baz=0,foo=1,bar=2,foo=3,bar=4').group(1)
'1'
>>> regex.match(pattern, 'baz=0,foo=1,foo=1,bar=2')
>>> regex.match(pattern, 'notfoo=1,bar=2')
>>> regex.match(pattern, 'notfoo=1,bar=2,foo=3,')
>>>
Answered By - Aran-Fey
0 comments:
Post a Comment
Note: Only a member of this blog may post a comment.