Issue
I'm trying to build a regex pattern in Python that will match strings like these:
"THEFT FROM MOTOR VEHICLE - GRAND ($950.01 AND OVER)", "VEHICLE - STOLEN", "TRANSPORTATION FACILITY (AIRPORT)", "5600 N FIGUEROA ST" and "400 WORLD WY".
import re
hello = {"meta": 1, "reza": [[ "row-f696.af3d.c3v9", "00000000-0000-0000-2D2F-EA38F9F11DB9", 0, 1642111191, 1642111191, "{ }", "201412343", "2020-06-15T00:00:00", "2020-06-15T00:00:00", "0700", "14", "Pacific", "1494", "1", "331", "THEFT FROM MOTOR VEHICLE - GRAND ($950.01 AND OVER)", "1606 0344 1300 1402", "60", "F", "W", "212", "TRANSPORTATION FACILITY (AIRPORT)", "IC", "Invest Cont", "331", "998", "400 WORLD WY", "33.9433", "-118.4072" ] ,
[ "row-f2wh.yte2-zhv8", "00000000-0000-0000-0BF4-2A6281C66DEF", 0, 1636553859, 1636553859, "{ }", "201107194", "2020-03-11T00:00:00", "2020-03-11T00:00:00", "1100", "11", "Northeast", "1118", "1", "510", "VEHICLE - STOLEN", "0", "108", "PARKING LOT", "IC", "Invest Cont", "510", "5600 N FIGUEROA ST", "34.114", "-118.1949" ]]}
crime = []
for items in hello["reza"]:
    for item in items:
        pattern = re.compile(r'[A-Z].*')
        crime = re.findall(pattern,str(item))
print(crime)
Solution
The most obvious problem in your code is that you're overwriting crime at each iteration of your nested loop, so you only print the result of the last findall call. Since findall returns a list (of all matches in str(item)), you end up with an empty list, because there is no match in your last item.
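As a minimal sketch of that difference, the snippet below runs the same pattern over a shortened, hand-picked list of values from the second row (the names items, overwritten and accumulated are only for illustration). Reassigning keeps the last result; extending keeps them all.

```python
import re

pattern = re.compile(r'[A-Z].*')
# Shortened sample taken from the second row; the last value has no uppercase letter
items = ["VEHICLE - STOLEN", "PARKING LOT", "-118.1949"]

# Reassigning inside the loop keeps only the result for the last item
overwritten = []
for item in items:
    overwritten = pattern.findall(item)

# Extending the list keeps the matches from every item
accumulated = []
for item in items:
    accumulated.extend(pattern.findall(item))

print(overwritten)   # []
print(accumulated)   # ['VEHICLE - STOLEN', 'PARKING LOT']
```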
Furthermore, you didn't describe how you want to filter your results. With findall, your pattern [A-Z].* matches from the first uppercase letter to the end of the string, so it cannot return 5600 N FIGUEROA ST in full: the leading house number is dropped and you only get N FIGUEROA ST.
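To make the behaviour of [A-Z].* concrete, here is a small sketch that applies it to a few values copied from the sample rows (the samples list and results dict are just illustrative names):

```python
import re

pattern = re.compile(r'[A-Z].*')
samples = ["5600 N FIGUEROA ST", "Pacific", "2020-06-15T00:00:00", "0700"]

# findall returns the text from the first uppercase letter to the end of the string
results = {s: pattern.findall(s) for s in samples}
for s, found in results.items():
    print(repr(s), '->', found)

# '5600 N FIGUEROA ST' -> ['N FIGUEROA ST']
# 'Pacific' -> ['Pacific']
# '2020-06-15T00:00:00' -> ['T00:00:00']
# '0700' -> []
```

So the pattern both truncates the addresses and lets through values such as Pacific or the tail of a timestamp, which is probably not what you want.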
Here is a suggestion that keeps strings containing at least three consecutive uppercase letters and not starting with digits directly followed by - (it also replaces runs of whitespace with a single space):
import re
hello = {"meta": 1, "reza": [[ "row-f696.af3d.c3v9", "00000000-0000-0000-2D2F-EA38F9F11DB9", 0, 1642111191, 1642111191, "{ }", "201412343", "2020-06-15T00:00:00", "2020-06-15T00:00:00", "0700", "14", "Pacific", "1494", "1", "331", "THEFT FROM MOTOR VEHICLE - GRAND ($950.01 AND OVER)", "1606 0344 1300 1402", "60", "F", "W", "212", "TRANSPORTATION FACILITY (AIRPORT)", "IC", "Invest Cont", "331", "998", "400 WORLD WY", "33.9433", "-118.4072" ] ,
[ "row-f2wh.yte2-zhv8", "00000000-0000-0000-0BF4-2A6281C66DEF", 0, 1636553859, 1636553859, "{ }", "201107194", "2020-03-11T00:00:00", "2020-03-11T00:00:00", "1100", "11", "Northeast", "1118", "1", "510", "VEHICLE - STOLEN", "0", "108", "PARKING LOT", "IC", "Invest Cont", "510", "5600 N FIGUEROA ST", "34.114", "-118.1949" ]]}
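crime = []
# At least three consecutive uppercase letters, not starting with digits followed by "-"
pattern = re.compile(r'(?!\d+-).*[A-Z]{3,}')
for items in hello["reza"]:
    for item in items:
        # Skip non-string values such as the integer timestamps
        if isinstance(item, str) and re.match(pattern, item):
            # Collapse runs of whitespace before storing the value
            crime.append(re.sub(r'\s+', ' ', item))
print(crime)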
Output:
['THEFT FROM MOTOR VEHICLE - GRAND ($950.01 AND OVER)', 'TRANSPORTATION FACILITY (AIRPORT)', '400 WORLD WY', 'VEHICLE - STOLEN', 'PARKING LOT', '5600 N FIGUEROA ST']
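If you wonder what the negative lookahead (?!\d+-) is for, the following sketch compares the pattern with and without it. The second pattern, without_lookahead, is a hypothetical variant written only for this comparison. The ID of the second row happens to end in DEF, so it would pass the "three uppercase letters" test if the lookahead were missing.

```python
import re

with_lookahead = re.compile(r'(?!\d+-).*[A-Z]{3,}')
# Hypothetical variant without the lookahead, for comparison only
without_lookahead = re.compile(r'.*[A-Z]{3,}')

guid = "00000000-0000-0000-0BF4-2A6281C66DEF"  # ends in "DEF"
address = "5600 N FIGUEROA ST"

guid_without = bool(without_lookahead.match(guid))
guid_with = bool(with_lookahead.match(guid))
address_with = bool(with_lookahead.match(address))

print(guid_without)  # True: "DEF" satisfies [A-Z]{3,}
print(guid_with)     # False: the string starts with digits followed by "-"
print(address_with)  # True: "5600" is followed by a space, not "-"
```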
Answered By - Tranbi