Issue
I'm attempting to parse a Wikipedia file dump with regex.
I want to match and remove everything between a set of brackets, including the brackets themselves. I also want to check whether the first word after the opening bracket is a certain word and, if it is, leave that bracketed section in place. In my case, a single bracket consists of two characters, say {{ and }}.
For example, take the following sequence into consideration:
{{{{}}{{}}{{}}}} Don't delete me {{notmeeither}}
Using the following regex:
{{(?!(notmeeither))(.|\n)*?\}}
results in matching only the first {{{{}}, which leaves stray brackets behind. Making the match greedy does not help, as it also swallows the text in between, including text that is not supposed to be matched. How would I go about this? TIA.
Edit: Made the requirements more specific.
Solution
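With the third-party regex package (not the standard-library re module) you can specify recursive patterns, where (?R) re-enters the whole pattern to match a nested pair: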
>>> import regex
>>> regex.sub(r"\((?!(notmeeither))((?>[^()]+|(?R))*)\)","","(()()()) Don't delete me (notmeeither)")
" Don't delete me (notmeeither)"
EDIT (since the question changed):
>>> regex.sub(r"{{(?!(notmeeither))((?>[^{}]+|(?R))*)}}","","{{{{}}{{}}{{}}}} Don't delete me {{notmeeither}}")
" Don't delete me {{notmeeither}}"
Answered By - logi-kal
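If installing the third-party regex package is not an option, the same nesting can be tracked by hand with a depth counter instead of a recursive pattern. The following is a minimal sketch of that alternative technique, using only the standard library. The function name strip_templates and its parameters are hypothetical, chosen for illustration. It assumes that only top-level spans are tested against the protected word (as a prefix, like the lookahead above), and that an unterminated span is left untouched.

```python
def strip_templates(text, keep="notmeeither", open_tok="{{", close_tok="}}"):
    """Remove balanced top-level open_tok...close_tok spans.

    A top-level span is kept verbatim when the text right after its
    opening token starts with `keep` (hypothetical parameter name).
    """
    out = []          # pieces of text that survive
    depth = 0         # current nesting depth
    start = 0         # index where the current top-level span began
    keeping = False   # whether the current top-level span is protected
    i = 0
    n = len(text)
    while i < n:
        if text.startswith(open_tok, i):
            if depth == 0:
                start = i
                keeping = text.startswith(keep, i + len(open_tok))
            depth += 1
            i += len(open_tok)
        elif depth > 0 and text.startswith(close_tok, i):
            depth -= 1
            i += len(close_tok)
            if depth == 0 and keeping:
                # Protected span closed: copy it through unchanged.
                out.append(text[start:i])
        else:
            if depth == 0:
                # Ordinary text outside any span is always kept.
                out.append(text[i])
            i += 1
    if depth > 0:
        # Assumption: an unterminated span is left as-is rather than dropped.
        out.append(text[start:])
    return "".join(out)


print(repr(strip_templates("{{{{}}{{}}{{}}}} Don't delete me {{notmeeither}}")))
```

This scans the input once, so it stays linear in the length of the text. Unlike the recursive pattern, it does not look inside a protected span, and it makes no attempt to handle triple braces, so treat it as a starting point rather than a drop-in replacement.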