Issue
Goal: replace the values in column que_text with the matches of a re.search pattern, else None.
Problem: I receive only None values in the que_text_new column, although the regex pattern is thoroughly tested!
def override(s):
    x = re.search(r'(an|frage(\s+ich)?)\s+d(i|ı)e\s+Staatsreg(i|ı)erung(.*)(Dresden(\.|,|\s+)?)?', str(s), flags=re.DOTALL | re.MULTILINE)
    if x:
        return x.group(5)
    return None

df2['que_text_new'] = df2['que_text'].apply(override)
What am I doing wrong? Removing return None doesn't help. There must be some structural error within my function, I assume.
Solution
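You can use a pattern with a single capturing group and then simply use Series.str.extract and chain .fillna(np.nan) to fill the non-matched values with NaN (str.extract already returns NaN for rows that do not match, so the .fillna call only makes this explicit):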
pattern = r'(?s)(?:an|frage(?:\s+ich)?)\s+d[iı]e\s+Staatsreg[iı]erung(.*)'
df2['que_text_new'] = df2['que_text'].astype(str).str.extract(pattern).fillna(np.nan)
Not sure you need .astype(str), but there is str(s) in your code, so it might be safer with this part.
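To see the approach end to end, here is a minimal sketch on made-up data. The sample rows are hypothetical (the real df2 is not shown in the question), and expand=False is an optional addition that makes str.extract return a Series rather than a one-column DataFrame.

```python
import pandas as pd

# Hypothetical sample rows; the real df2 is not shown in the question.
df2 = pd.DataFrame({
    "que_text": [
        "Ich frage die Staatsregierung:\n1. Wie viele Schulen gibt es?",
        "Anfrage an die Staatsregierung zum Haushalt",
        "Kein passender Text",
        None,
    ]
})

# Same pattern as in the answer: one capturing group, (?s) makes . match newlines.
pattern = r'(?s)(?:an|frage(?:\s+ich)?)\s+d[iı]e\s+Staatsreg[iı]erung(.*)'

# expand=False (an addition to the answer's line) returns a Series for a single group.
df2["que_text_new"] = df2["que_text"].astype(str).str.extract(pattern, expand=False)

print(df2["que_text_new"].tolist())
```

The first two rows yield the text after "Staatsregierung", while the rows without a match come back as missing values.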
Here,
- Capturing groups with single char alternatives are converted to character classes, e.g. (i|ı) -> [iı]
- Other capturing groups are converted to non-capturing ones, i.e. ( -> (?:
- To make np.nan work, do not forget to import numpy as np
- (?s) is an in-pattern re.DOTALL option.
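The conversions listed above can be checked with the standard re module alone. The sketch below uses a hypothetical sample string and compares group 5 of the original pattern with group 1 of the rewritten one. It also suggests why the trailing Dresden group could be dropped: with a greedy (.*) under DOTALL, that optional group has nothing left to match.

```python
import re

# Pattern from the question (7 capturing groups) and from the answer (1 group).
original = r'(an|frage(\s+ich)?)\s+d(i|ı)e\s+Staatsreg(i|ı)erung(.*)(Dresden(\.|,|\s+)?)?'
rewritten = r'(?s)(?:an|frage(?:\s+ich)?)\s+d[iı]e\s+Staatsreg[iı]erung(.*)'

# Hypothetical multi-line input, not taken from the question's data.
text = "Ich frage die Staatsregierung:\n1. Erste Frage\n2. Zweite Frage"

m_old = re.search(original, text, flags=re.DOTALL)
m_new = re.search(rewritten, text)  # (?s) inside the pattern replaces the flag

# Both patterns capture the same text after "Staatsregierung".
same_capture = m_old.group(5) == m_new.group(1)

# Without DOTALL, . stops at the first newline, so only ":" is captured.
no_dotall = re.search(rewritten.replace('(?s)', '', 1), text).group(1)

print(same_capture, re.compile(original).groups, re.compile(rewritten).groups)
print(repr(no_dotall))
```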
Answered By - Wiktor Stribiżew