Issue
Suppose I have a dataframe df as follows:
id value_type_and_model_name
0 1 actual_value
1 2 fitted_value_RUT_ARIMA
2 3 fitted_lower_value_RUT_ARIMA
3 4 fitted_upper_value_RUT_ARIMA
4 5 predicted_value_RUT_ARIMA
5 6 predicted_lower_value_RUT_ARIMA
6 7 predicted_upper_value_RUT_ARIMA
7 8 fitted_value_RUT_ES
8 9 fitted_lower_value_RUT_ES
9 10 fitted_upper_value_RUT_ES
10 11 predicted_value_RUT_ES
11 12 predicted_lower_value_RUT_ES
12 13 predicted_upper_value_RUT_ES
13 14 fitted_value_RUT_SARIMAX
14 15 fitted_lower_value_RUT_SARIMAX
15 16 fitted_upper_value_RUT_SARIMAX
16 17 predicted_value_RUT_SARIMAX
17 18 predicted_lower_value_RUT_SARIMAX
18 19 predicted_upper_value_RUT_SARIMAX
I need to split the value_type_and_model_name column (except 'actual_value' in this column) into two columns, value_type and model_name, using the second underscore from the right as the delimiter.
The expected result is as follows:
id value_type model_name
0 1 actual_value NaN
1 2 fitted_value RUT_ARIMA
2 3 fitted_lower_value RUT_ARIMA
3 4 fitted_upper_value RUT_ARIMA
4 5 predicted_value RUT_ARIMA
5 6 predicted_lower_value RUT_ARIMA
6 7 predicted_upper_value RUT_ARIMA
7 8 fitted_value RUT_ES
8 9 fitted_lower_value RUT_ES
9 10 fitted_upper_value RUT_ES
10 11 predicted_value RUT_ES
11 12 predicted_lower_value RUT_ES
12 13 predicted_upper_value RUT_ES
13 14 fitted_value RUT_SARIMAX
14 15 fitted_lower_value RUT_SARIMAX
15 16 fitted_upper_value RUT_SARIMAX
16 17 predicted_value RUT_SARIMAX
17 18 predicted_lower_value RUT_SARIMAX
18 19 predicted_upper_value RUT_SARIMAX
How can I achieve this? Thanks. I tried df['value_type_and_model_name'].str.rsplit('_', n=2, expand=True), but it does not give the expected result.
Solution
Since the second chunk (the model name) always contains exactly one underscore, the easiest approach is to craft a regex for that specific case:
out = (df['value_type_and_model_name']
       .str.extract(r'(?P<value_type>.*)_(?P<model_name>[^_]*_[^_]*)$')
       .fillna({'value_type': df['value_type_and_model_name']})
      )
Output:
value_type model_name
0 actual_value NaN
1 fitted_value RUT_ARIMA
2 fitted_lower_value RUT_ARIMA
3 fitted_upper_value RUT_ARIMA
4 predicted_value RUT_ARIMA
5 predicted_lower_value RUT_ARIMA
6 predicted_upper_value RUT_ARIMA
7 fitted_value RUT_ES
8 fitted_lower_value RUT_ES
9 fitted_upper_value RUT_ES
10 predicted_value RUT_ES
11 predicted_lower_value RUT_ES
12 predicted_upper_value RUT_ES
13 fitted_value RUT_SARIMAX
14 fitted_lower_value RUT_SARIMAX
15 fitted_upper_value RUT_SARIMAX
16 predicted_value RUT_SARIMAX
17 predicted_lower_value RUT_SARIMAX
18 predicted_upper_value RUT_SARIMAX
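To try the pattern on a small sample, here is a minimal self-contained sketch. The four sample rows are a subset of the question's data, and the regex is the same one used above. The greedy `.*` consumes as much as it can, which leaves the `model_name` group with exactly the last two underscore-free chunks. A row with fewer than two underscores (such as 'actual_value') does not match at all and gets NaN in both columns, which is why `fillna` restores `value_type` from the original column.

```python
import pandas as pd

# Small sample taken from the question's data
df = pd.DataFrame({
    'id': [1, 2, 3, 4],
    'value_type_and_model_name': [
        'actual_value',
        'fitted_value_RUT_ARIMA',
        'predicted_lower_value_RUT_ES',
        'fitted_upper_value_RUT_SARIMAX',
    ],
})

s = df['value_type_and_model_name']

# Greedy prefix, then an underscore, then "<chunk>_<chunk>" anchored at the end
out = (s.str.extract(r'(?P<value_type>.*)_(?P<model_name>[^_]*_[^_]*)$')
        # rows that did not match keep their original text as value_type
        .fillna({'value_type': s})
      )
print(out)
```

Note that this relies on every model name containing exactly one underscore; a model name with more or fewer underscores would need a different pattern.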
If you really want to split:
df['value_type_and_model_name'].str.split('_(?=[^_]*_[^_]*$)', expand=True)
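For comparison, the sketch below shows why the `rsplit('_', n=2, expand=True)` attempt from the question falls short and how the lookahead split differs. `rsplit` with `n=2` cuts at each of the last two underscores, so it yields three pieces per row and also breaks 'actual_value' apart. The lookahead `(?=[^_]*_[^_]*$)` only matches an underscore that is followed by exactly one more underscore before the end of the string, so each row is cut at most once. The two sample values are taken from the question; the variable names `three` and `two` are just illustrative.

```python
import pandas as pd

s = pd.Series(['actual_value', 'fitted_lower_value_RUT_ES'])

# rsplit cuts at each of the last two underscores: up to three pieces per row
three = s.str.rsplit('_', n=2, expand=True)
print(three)

# the lookahead only matches the second underscore from the right,
# so 'actual_value' (a single underscore) is left unsplit
two = s.str.split('_(?=[^_]*_[^_]*$)', expand=True)
print(two)
```

With a multi-character pattern, `str.split` treats the pattern as a regular expression by default, which is what makes the lookahead work here.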
If you want, you can assign to the original dataframe:
s = df.pop('value_type_and_model_name')
df[['value_type', 'model_name']] = (s.str.extract(r'(.*)_([^_]*_[^_]*)$')
                                     .fillna({0: s})
                                    )
Output:
id value_type model_name
0 1 actual_value NaN
1 2 fitted_value RUT_ARIMA
2 3 fitted_lower_value RUT_ARIMA
3 4 fitted_upper_value RUT_ARIMA
4 5 predicted_value RUT_ARIMA
5 6 predicted_lower_value RUT_ARIMA
6 7 predicted_upper_value RUT_ARIMA
7 8 fitted_value RUT_ES
8 9 fitted_lower_value RUT_ES
9 10 fitted_upper_value RUT_ES
10 11 predicted_value RUT_ES
11 12 predicted_lower_value RUT_ES
12 13 predicted_upper_value RUT_ES
13 14 fitted_value RUT_SARIMAX
14 15 fitted_lower_value RUT_SARIMAX
15 16 fitted_upper_value RUT_SARIMAX
16 17 predicted_value RUT_SARIMAX
17 18 predicted_lower_value RUT_SARIMAX
18 19 predicted_upper_value RUT_SARIMAX
Answered By - mozway