Issue
I am trying to join multiple columns in pandas, where the separator is defined in another column. The problem I am facing is how to omit the separator for cells that are blank.
My attempt and its output are shown below:
import pandas as pd
df = pd.DataFrame({'col_1': ['', '1', '1', '2', '2', '3', '3', '4', '', '4', '5', '5', '5', '5', '5', '5'],
                   'col_2': ['A', '', 'C', 'A', '', 'C', 'D', 'D', 'A', 'A', 'B', 'E', 'F', 'G', 'H', 'I'],
                   'col_3': ['256', '546', '985', '573', '265', '731', '968', '592', '364', '', '953', '476', '835',
                             '', '572', '903'],
                   'col_4': ['.', '.', '.', '-', '_', '_', '-', '.', '.', '/', '/', '.', '_', '_', '-', '.']})
df['concatenated'] = df['col_1'] + df['col_4'] + df['col_2'] + df['col_4'] + df['col_3']
print(df)
The output I am getting is:
    col_1  col_2  col_3  col_4  concatenated
0              A    256      .        .A.256
1       1           546      .        1..546
2       1      C    985      .       1.C.985
3       2      A    573      -       2-A-573
4       2           265      _        2__265
5       3      C    731      _       3_C_731
6       3      D    968      -       3-D-968
7       4      D    592      .       4.D.592
8              A    364      .        .A.364
9       4      A             /          4/A/
10      5      B    953      /       5/B/953
11      5      E    476      .       5.E.476
12      5      F    835      _       5_F_835
13      5      G             _          5_G_
14      5      H    572      -       5-H-572
15      5      I    903      .       5.I.903
But the expected output is:
    col_1  col_2  col_3  col_4  concatenated
0              A    256      .         A.256
1       1           546      .         1.546
2       1      C    985      .       1.C.985
3       2      A    573      -       2-A-573
4       2           265      _         2_265
5       3      C    731      _       3_C_731
6       3      D    968      -       3-D-968
7       4      D    592      .       4.D.592
8              A    364      .         A.364
9       4      A             /           4/A
10      5      B    953      /       5/B/953
11      5      E    476      .       5.E.476
12      5      F    835      _       5_F_835
13      5      G             _           5_G
14      5      H    572      -       5-H-572
15      5      I    903      .       5.I.903
The actual data contains many more columns, but I need to join only selected columns.
Can anyone help me find a solution or point me in the right direction?
Solution
You can use str.strip() to remove the extra separators at both ends, and str.replace() to collapse repeated consecutive separators, as follows:
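import re
seps = df['col_4'].unique()
sep = list(map(re.escape, seps))  # escape regex metacharacters such as '.'
sep_regex = '|'.join(sep)
# str.strip() takes a set of characters, not a regex, so pass the raw separators
df['concatenated'] = (df['concatenated'].str.strip(''.join(seps))
                      .str.replace(fr'({sep_regex})\1', r'\1', regex=True)
                     )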
Result:
print(df)
    col_1  col_2  col_3  col_4  concatenated
0              A    256      .         A.256
1       1           546      .         1.546
2       1      C    985      .       1.C.985
3       2      A    573      -       2-A-573
4       2           265      _         2_265
5       3      C    731      _       3_C_731
6       3      D    968      -       3-D-968
7       4      D    592      .       4.D.592
8              A    364      .         A.364
9       4      A             /           4/A
10      5      B    953      /       5/B/953
11      5      E    476      .       5.E.476
12      5      F    835      _       5_F_835
13      5      G             _           5_G
14      5      H    572      -       5-H-572
15      5      I    903      .       5.I.903
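As an aside, a different route (not the strip-and-replace approach of this answer) is to skip blank cells while joining, so that no cleanup is needed afterwards. The following is only a minimal sketch on a reduced copy of the sample data; the helper name join_non_empty and the list cols_to_join are hypothetical names introduced for illustration, and a row-wise apply may be slow on large frames.

```python
import pandas as pd

# Reduced copy of the sample data from the question.
df = pd.DataFrame({
    'col_1': ['', '1', '2', '4'],
    'col_2': ['A', '', '', 'A'],
    'col_3': ['256', '546', '265', ''],
    'col_4': ['.', '.', '_', '/'],
})

# Hypothetical helper: the name and signature are illustrative only.
def join_non_empty(row, value_cols, sep_col):
    """Join the non-empty values of value_cols with the separator in sep_col."""
    parts = [row[c] for c in value_cols if row[c] != '']
    return row[sep_col].join(parts)

cols_to_join = ['col_1', 'col_2', 'col_3']  # only the selected columns
df['concatenated'] = df.apply(join_non_empty, axis=1,
                              args=(cols_to_join, 'col_4'))
print(df['concatenated'].tolist())
```

Because blank cells never enter the join, this variant does not touch separator characters that legitimately appear inside the values themselves.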
Explanation:
The solution first builds a list of the unique symbols in col_4 and escapes any that are regex metacharacters using re.escape:
print(sep)
['\\.', '\\-', '_', '/']
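To see why the escaping matters, here is a small sketch using only the standard re module on one of the sample strings. The variable names are illustrative.

```python
import re

text = '1..546'

# Unescaped, '.' matches any character, so every pair of characters is replaced.
unescaped = re.sub('..', '.', text)
# Escaped, the pattern matches only a literal '..'.
escaped = re.sub(re.escape('..'), '.', text)

print(unescaped)  # ...
print(escaped)    # 1.546
```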
To match any of these separators in str.replace(), we then join the escaped separators with | (that is, "or") into a single regex of alternatives. Note that str.strip() interprets its argument as a set of characters to remove, not as a regex.
print(sep_regex)
\.|\-|_|/
We used the regex back-reference \1 to detect a separator that is immediately followed by the same separator, and replaced the pair with a single occurrence of that separator.
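A short sketch of how the back-reference behaves, using the standard re module with the same alternation of separators. The names collapse_pairs and collapse_runs are illustrative; the \1+ variant is an assumption about what you might want if more than three columns are joined, since longer runs of separators can then appear.

```python
import re

sep_regex = r'\.|\-|_|/'

# (sep)\1 matches a separator followed by the same separator again.
collapse_pairs = re.compile(fr'({sep_regex})\1')
# (sep)\1+ is a variant that also collapses runs longer than two.
collapse_runs = re.compile(fr'({sep_regex})\1+')

same_pair = collapse_pairs.sub(r'\1', '2__265')    # '2_265'
mixed_pair = collapse_pairs.sub(r'\1', '1._546')   # unchanged: separators differ
triple_pairs = collapse_pairs.sub(r'\1', 'a...b')  # 'a..b': only one pair collapsed
triple_runs = collapse_runs.sub(r'\1', 'a...b')    # 'a.b'

print(same_pair, mixed_pair, triple_pairs, triple_runs)
```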
Answered By - SeaBean