Issue
I have a variety of complex filenames that I need to match with a regex. This is the general pattern, optional groups in round brackets:
<main>_<country>-<region>(_<id>)(_<provider>)_<year><month><day>_<hour><minute><second>_<sensor>_<resolution>(_<bittype>).<format>
Here are some examples of what those filenames can look like:
fn1 = 'FOO_is-atest_123456_COMPANY_20190729_153343_SATEL_m0001_32bit.tif'
fn2 = 'FOO_is-atest_COMPANY_20190729_153343_SATEL_m0001_32bit.tif'
fn3 = 'FOO_is-atest_COMPANY_20190729_153343_SATEL_m0001.tif'
fn4 = 'FOO_is-atest_32tnt_20211125_120005_SATEL_m0001.tif'
fn5 = 'FOO_is-atest_20211125_120005_SATEL_cm070.tif'
fn6 = 'FOO_is-atest_20211125_120005_SATEL_cm070_32bit.tif'
The different components can sometimes vary in length. The tricky part is that tile and provider can basically have any length and contain any character.
I just can't get it to match all the cases. Here is the closest I came, tested with an online regex tester:
import re
pattern = '(?P<product>\w{3})' \
'_(?P<country>\w{2})' \
'-(?P<region>\w+)' \
'_?(?P<tile>\w+)?' \
'_?(?P<provider>\w+)?' \
'_(?P<year>\d{4})' \
'(?P<month>\d{2})' \
'(?P<day>\d{2})' \
'_(?P<hour>\d{2})' \
'(?P<minute>\d{2})' \
'(?P<second>\d{2})' \
'_(?P<sensor>\w{5})' \
'_(?P<res_unit>km|m|cm)' \
'(?P<resolution>\d{3,4})' \
'_?(?P<bittype>\d{1,2}bit)?' \
'.(?P<format>\w+)'
p = re.compile(pattern)
print(p.match(fn1).group('tile'), p.match(fn1).group('provider'))
print(p.match(fn2).group('provider'), p.match(fn2).group('bittype'))
print(p.match(fn3).group('provider'), p.match(fn3).group('resolution'))
print(p.match(fn4).group('tile'), p.match(fn4).group('year'))
print(p.match(fn5).group('provider'), p.match(fn5).group('resolution'))
print(p.match(fn6).group('provider'), p.match(fn5).group('bittype'))
# OUTPUTS:
>>> (None, None)
>>> (None, '32bit')
>>> (None, '0001')
>>> (None, '2021')
>>> (None, '070')
>>> (None, None)
As you can see, tile and provider are not correctly recognized, so something is still not right. Everything else seems to work fine. Regexes are still somewhat of a mystery to me, to be honest.
Solution
You can use
^(?P<product>[^\W_]{3})_(?P<country>[^\W_]{2})-(?P<region>\w+?)(?:_(?P<tile>[^_]+))??(?:_(?P<provider>[^\W_]+))?_(?P<year>\d{4})(?P<month>\d{2})(?P<day>\d{2})_(?P<hour>\d{2})(?P<minute>\d{2})(?P<second>\d{2})_(?P<sensor>[^\W_]{5})_(?P<res_unit>km|m|cm)(?P<resolution>\d{3,4})(?:_(?P<bittype>\d{1,2}bit))?\.(?P<format>\w+)$
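As a minimal sketch of how this pattern could be applied in Python, the snippet below compiles it (split across several raw strings for readability only) and runs it against the six filenames from the question. The names FILENAME_RE, filenames and parse are illustrative assumptions, not part of the original answer.

```python
import re

# The pattern from the answer, split into adjacent raw strings.
# Python concatenates them, so the regex itself is unchanged.
FILENAME_RE = re.compile(
    r'^(?P<product>[^\W_]{3})_(?P<country>[^\W_]{2})-(?P<region>\w+?)'
    r'(?:_(?P<tile>[^_]+))??(?:_(?P<provider>[^\W_]+))?'
    r'_(?P<year>\d{4})(?P<month>\d{2})(?P<day>\d{2})'
    r'_(?P<hour>\d{2})(?P<minute>\d{2})(?P<second>\d{2})'
    r'_(?P<sensor>[^\W_]{5})_(?P<res_unit>km|m|cm)(?P<resolution>\d{3,4})'
    r'(?:_(?P<bittype>\d{1,2}bit))?\.(?P<format>\w+)$'
)

# Example filenames taken from the question (fn1 to fn6).
filenames = [
    'FOO_is-atest_123456_COMPANY_20190729_153343_SATEL_m0001_32bit.tif',
    'FOO_is-atest_COMPANY_20190729_153343_SATEL_m0001_32bit.tif',
    'FOO_is-atest_COMPANY_20190729_153343_SATEL_m0001.tif',
    'FOO_is-atest_32tnt_20211125_120005_SATEL_m0001.tif',
    'FOO_is-atest_20211125_120005_SATEL_cm070.tif',
    'FOO_is-atest_20211125_120005_SATEL_cm070_32bit.tif',
]


def parse(filename):
    """Return a dict of named groups, or None if the name does not match."""
    m = FILENAME_RE.match(filename)
    return m.groupdict() if m else None


for fn in filenames:
    fields = parse(fn)
    # Optional groups that did not take part in the match are None.
    print(fields['tile'], fields['provider'], fields['bittype'])
```

Note that when only one of the two optional fields is present, the pattern has no way to tell them apart, so the value ends up in the provider group: for fn4 the result is tile None and provider '32tnt'. If such a value should count as a tile, that decision has to be made after matching.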
See the regex demo. Details:
^ - start of string
(?P<product>[^\W_]{3}) - Group "product": three alphanumeric chars
_ - an underscore
(?P<country>[^\W_]{2}) - Group "country": two alphanumeric chars
- - a hyphen
(?P<region>\w+?) - Group "region": one or more alphanumeric or underscore chars, as few as possible
(?:_(?P<tile>[^_]+))?? - an optional sequence of patterns that is matched only if the subsequent patterns in the regex fail to match (see the lazy ?? quantifier):
    _ - an underscore
    (?P<tile>[^_]+) - Group "tile": one or more chars other than _
(?:_(?P<provider>[^\W_]+))? - an optional sequence of _ and then Group "provider": one or more alphanumeric chars
_(?P<year>\d{4}) - an underscore and then Group "year": four digits
(?P<month>\d{2}) - Group "month": two digits
(?P<day>\d{2}) - Group "day": two digits
_ - an underscore
(?P<hour>\d{2}) - Group "hour": two digits
(?P<minute>\d{2}) - Group "minute": two digits
(?P<second>\d{2}) - Group "second": two digits
_ - an underscore
(?P<sensor>[^\W_]{5}) - Group "sensor": five alphanumeric chars
_ - an underscore
(?P<res_unit>km|m|cm) - Group "res_unit": km, m or cm (also, [kc]?m can be used)
(?P<resolution>\d{3,4}) - Group "resolution": three or four digits
(?:_(?P<bittype>\d{1,2}bit))? - an optional sequence of _ and then Group "bittype" capturing one or two digits and then the string bit
\. - a dot
(?P<format>\w+) - Group "format": one or more alphanumeric/underscore chars
$ - end of string.
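To illustrate what the lazy ?? on the tile group changes, here is a small sketch that compares it with a plain greedy ? on a shortened pattern. The shortened patterns (region, tile, provider and a single 8-digit date group), the sample strings and the helper tile_and_provider are assumptions made for this illustration only.

```python
import re

# Hypothetical, shortened patterns: the part after the tile group is shared,
# only the quantifier on the tile group differs (? versus ??).
TAIL = r'(?:_(?P<provider>[^\W_]+))?_(?P<date>\d{8})$'
greedy = re.compile(r'^(?P<region>\w+?)(?:_(?P<tile>[^_]+))?' + TAIL)
lazy = re.compile(r'^(?P<region>\w+?)(?:_(?P<tile>[^_]+))??' + TAIL)


def tile_and_provider(regex, name):
    """Return (tile, provider) for a matching name, else None."""
    m = regex.match(name)
    return (m.group('tile'), m.group('provider')) if m else None


one_field = 'atest_COMPANY_20190729'
two_fields = 'atest_123456_COMPANY_20190729'

# With one optional field, greedy ? grabs it as the tile,
# while lazy ?? skips the tile group and leaves it to provider.
print(tile_and_provider(greedy, one_field))   # ('COMPANY', None)
print(tile_and_provider(lazy, one_field))     # (None, 'COMPANY')

# With both fields present, the two variants agree.
print(tile_and_provider(greedy, two_fields))  # ('123456', 'COMPANY')
print(tile_and_provider(lazy, two_fields))    # ('123456', 'COMPANY')
```

In other words, the lazy quantifier only decides which group receives a lone optional field; it is tried as a tile only when the rest of the pattern cannot match otherwise.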
Answered By - Wiktor Stribiżew