Issue
My code is below. When I run it, I get this warning:
c:\Users\renne\Documents\Code\Text Analysis\Assignment1.1C.py:27:
FutureWarning: Possible nested set at position 54
  for item in re.finditer("(?P<host>\d{3}[.]\d{3}[.]\d{3}[.]\d{3})(?P<user_name>[[\w]+\d{4}]|[-])(?P<time>\d{2}/\w+/\d{4}:\d{2}:\d{2}:\d{2} -\d{4})(?P<request>[A-Z]+ \S* HTTP/\d[.]\d)", logdata):
I don't know how to solve this. I have looked over my code a few times and can't figure out the problem.
I used a single line from the test data instead of the entire txt file to make testing easier. Once this works, I'll change logdata = '...' to a file read.
import re
logdata = '146.204.224.152 - feest6811 [21/Jun/2019:15:45:24 -0700] "POST /incentivize HTTP/1.1" 302 4622'
dict = {}
expression = """
(?P<host>\d{3}[.]\d{3}[.]\d{3}[.]\d{3})
(?P<user_name>[[\w]+\d{4}]|[-])
(?P<time>\d{2}/\w+/\d{4}:\d{2}:\d{2}:\d{2} -\d{4})
(?P<request>[A-Z]+ \S* HTTP/\d[.]\d)
"""
for item in re.finditer("(?P<host>\d{3}[.]\d{3}[.]\d{3}[.]\d{3})(?P<user_name>[[\w]+\d{4}]|[-])(?P<time>\d{2}/\w+/\d{4}:\d{2}:\d{2}:\d{2} -\d{4})(?P<request>[A-Z]+ \S* HTTP/\d[.]\d)", logdata):
    print(item.groupdict()['host'])
    print(item.groupdict())
Solution
You get the warning because you have a pair of unescaped square brackets inside another pair of unescaped square brackets. See the re documentation:
Support of nested sets and set operations as in Unicode Technical Standard #18 might be added in the future. This would change the syntax, so to facilitate this change a FutureWarning will be raised in ambiguous cases for the time being. That includes sets starting with a literal '[' or containing literal character sequences '--', '&&', '~~', and '||'. To avoid a warning escape them with a backslash.
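To see the rule from the documentation in action, here is a minimal sketch that compiles a set starting with a literal [ and the same set with that [ escaped. The helper name compile_warnings is hypothetical and exists only for this illustration; it relies on the standard warnings.catch_warnings and re.purge calls.

```python
import re
import warnings


def compile_warnings(pattern):
    """Compile a pattern and return the messages of any FutureWarning raised."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")  # make sure no warning is filtered out
        re.purge()                       # clear the regex cache so the pattern is re-parsed
        re.compile(pattern)
    return [str(w.message) for w in caught if issubclass(w.category, FutureWarning)]


nested = compile_warnings(r"[[\w]+")    # set starts with a literal '['
escaped = compile_warnings(r"[\[\w]+")  # same set, with the '[' escaped

print(nested)   # one "Possible nested set ..." message
print(escaped)  # empty list: escaping the '[' silences the warning
```

Both patterns match the same characters; only the unescaped form is flagged as ambiguous.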
The [[\w]+\d{4}] part is wrong: it matches one or more [ or word characters (with [[\w]+), then four digits (with \d{4}), and then a literal ] character (with ]). You need to remove all square brackets here.
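A quick way to check that reading of the faulty subpattern is to run it against a few test strings. This is only a sketch: the test strings are made up for illustration, and the FutureWarning is suppressed on purpose so that the matching behavior is all that shows.

```python
import re
import warnings

faulty = r"[[\w]+\d{4}]"

# Suppress the FutureWarning here; we only want to inspect what the pattern matches.
with warnings.catch_warnings():
    warnings.simplefilter("ignore", FutureWarning)
    pattern = re.compile(faulty)

plain_user = pattern.fullmatch("feest6811")      # no trailing ']' in the input
with_bracket = pattern.fullmatch("feest6811]")   # word chars, four digits, then ']'
odd_input = pattern.fullmatch("[[ab1234]")       # '[' is accepted by the character class

print(plain_user)    # None
print(with_bracket)  # a match object
print(odd_input)     # a match object
```

The user name from the log line is rejected on its own, while strings containing stray brackets are accepted, which is the opposite of what was intended.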
You can use
r'(?P<host>\d{3}\.\d{3}\.\d{3}\.\d{3}) - (?P<user_name>\w+\d{4}|-) \[(?P<time>\d{2}/\w+/\d{4}:\d{2}:\d{2}:\d{2} -\d{4})] "(?P<request>[A-Z]+ \S* HTTP/\d\.\d)'
See the regex demo.
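As a sketch of how the corrected pattern could be plugged into the code from the question, the example below runs it against the sample log line. Splitting the pattern across several adjacent raw strings is just a readability choice; the names pattern and records are not from the original code.

```python
import re

logdata = '146.204.224.152 - feest6811 [21/Jun/2019:15:45:24 -0700] "POST /incentivize HTTP/1.1" 302 4622'

# Adjacent raw string literals are concatenated into one pattern.
pattern = re.compile(
    r'(?P<host>\d{3}\.\d{3}\.\d{3}\.\d{3}) - '
    r'(?P<user_name>\w+\d{4}|-) '
    r'\[(?P<time>\d{2}/\w+/\d{4}:\d{2}:\d{2}:\d{2} -\d{4})] '
    r'"(?P<request>[A-Z]+ \S* HTTP/\d\.\d)'
)

records = [m.groupdict() for m in pattern.finditer(logdata)]
for record in records:
    print(record)
# {'host': '146.204.224.152', 'user_name': 'feest6811',
#  'time': '21/Jun/2019:15:45:24 -0700', 'request': 'POST /incentivize HTTP/1.1'}
```

Note that \d{3} requires exactly three digits in every part of the host. If the full log file contains hosts with shorter parts, \d{1,3} may be needed instead; that depends on the data and is only an assumption here.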
If you encounter this warning in other scenarios, you may need to fix it differently:
- When you need to match a literal [ or ] inside square brackets (a character class), escape ] and do not escape [ (unless [ is the first character of the class, which is the case that triggers this warning). E.g. [a-zA-Z[\]] matches an ASCII letter, [ or ]. You may also keep ] unescaped if you put it at the start of the character class: []A-Za-z[] = [a-zA-Z[\]] = [][a-zA-Z].
- When you want to match a literal [ or ] outside of square brackets (a character class), you need to escape [ and can keep ] unescaped. E.g. \[[0-9]+] matches [, then one or more digits, and then a ] char.
- Note that wrapping a single shorthand or a single char in a character class is considered bad practice and may lead to misunderstandings that in turn might lead to issues like this one. Instead of [\w]+, always use \w+.
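The rules in the list above can be checked with a short sketch. The sample strings and variable names are made up for illustration; any warnings raised during compilation are recorded so that the last line can show that none of these forms is flagged.

```python
import re
import warnings

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    re.purge()  # clear the regex cache so every pattern is parsed again

    # Three equivalent ways to write "ASCII letter, '[' or ']'" as a character class.
    forms = [r"[a-zA-Z[\]]", r"[]A-Za-z[]", r"[][a-zA-Z]"]
    results = [re.findall(p, "a[1]Z") for p in forms]

    # Outside a character class: escape '[', leave ']' as it is.
    bracketed = re.findall(r"\[[0-9]+]", "x[12] y[345]")

    # [\w]+ and \w+ behave the same; the brackets only add noise.
    same = re.findall(r"[\w]+", "ab_1 c") == re.findall(r"\w+", "ab_1 c")

future_warnings = [w for w in caught if issubclass(w.category, FutureWarning)]

print(results)          # each inner list is ['a', '[', ']', 'Z']
print(bracketed)        # ['[12]', '[345]']
print(same)             # True
print(future_warnings)  # []
```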
Answered By - Wiktor Stribiżew