Tuesday, December 7, 2021

[FIXED] Pandas: Concatenate multiple columns using another separator column and avoid extra separators for blank values


Issue

I am trying to join multiple columns in pandas, where the separator for each row is defined in another column. The problem I am facing is how to avoid adding the separator next to cells that are blank.

My attempt and its output are shown below:

import pandas as pd
df = pd.DataFrame({'col_1': ['', '1', '1', '2', '2', '3', '3', '4', '', '4', '5', '5', '5', '5', '5', '5'],
                   'col_2': ['A', '', 'C', 'A', '', 'C', 'D', 'D', 'A', 'A', 'B', 'E', 'F', 'G', 'H', 'I'],
                   'col_3': ['256', '546', '985', '573', '265', '731', '968', '592', '364', '', '953', '476', '835',
                             '', '572', '903'],
                   'col_4': ['.', '.', '.', '-', '_', '_', '-', '.', '.', '/', '/', '.', '_', '_', '-', '.']})

df['concatenated'] = df['col_1'] + df['col_4'] + df['col_2'] + df['col_4'] + df['col_3']
print(df)

The output I am getting is:

     col_1 col_2 col_3 col_4    concatenated
0            A   256     .       .A.256
1      1         546     .       1..546
2      1     C   985     .      1.C.985
3      2     A   573     -      2-A-573
4      2         265     _       2__265
5      3     C   731     _      3_C_731
6      3     D   968     -      3-D-968
7      4     D   592     .      4.D.592
8            A   364     .       .A.364
9      4     A           /         4/A/
10     5     B   953     /      5/B/953
11     5     E   476     .      5.E.476
12     5     F   835     _      5_F_835
13     5     G           _         5_G_
14     5     H   572     -      5-H-572
15     5     I   903     .      5.I.903

But the expected output is:

     col_1 col_2 col_3 col_4   concatenated
0            A   256     .      A.256
1      1         546     .      1.546
2      1     C   985     .      1.C.985
3      2     A   573     -      2-A-573
4      2         265     _      2_265
5      3     C   731     _      3_C_731
6      3     D   968     -      3-D-968
7      4     D   592     .      4.D.592
8            A   364     .      A.364
9      4     A           /      4/A
10     5     B   953     /      5/B/953
11     5     E   476     .      5.E.476
12     5     F   835     _      5_F_835
13     5     G           _      5_G
14     5     H   572     -      5-H-572
15     5     I   903     .      5.I.903

The actual data contains many more columns, but I only need to join selected ones.

Can anyone help me find a solution or point me in the right direction?

Solution

You can use str.strip() to remove the extra separators at both ends and also str.replace() to remove repeated consecutive separators, as follows:

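import re
sep = list(map(re.escape, df['col_4'].unique()))   # escaped separators, safe to use in a regex
sep_regex = '|'.join(sep)                          # alternation matching any one separator
strip_chars = ''.join(df['col_4'].unique())        # str.strip() takes plain characters, not a regex

df['concatenated'] = (df['concatenated'].str.strip(strip_chars)
                                        .str.replace(fr'({sep_regex})\1', r'\1', regex=True)
                     )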

Result:

print(df)

   col_1 col_2 col_3 col_4 concatenated
0            A   256     .        A.256
1      1         546     .        1.546
2      1     C   985     .      1.C.985
3      2     A   573     -      2-A-573
4      2         265     _        2_265
5      3     C   731     _      3_C_731
6      3     D   968     -      3-D-968
7      4     D   592     .      4.D.592
8            A   364     .        A.364
9      4     A           /          4/A
10     5     B   953     /      5/B/953
11     5     E   476     .      5.E.476
12     5     F   835     _      5_F_835
13     5     G           _          5_G
14     5     H   572     -      5-H-572
15     5     I   903     .      5.I.903

Explanation:

Here, we create a list of the unique symbols in col_4 and use re.escape to escape any of them that are regex metacharacters:

print(sep)

['\\.', '\\-', '_', '/']
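To see why the escaping matters, here is a small sketch comparing an unescaped and an escaped dot in the same kind of pattern. The sample string "100..5" is made up for illustration and does not come from the data above.

```python
import re

# Unescaped, "." is a wildcard, so any doubled character is collapsed,
# including the "00" that belongs to the value itself.
unescaped = re.sub(r"(.)\1", r"\1", "100..5")

# Escaped, only a literal doubled dot is collapsed.
escaped = re.sub(fr"({re.escape('.')})\1", r"\1", "100..5")

print(unescaped)  # 10.5
print(escaped)    # 100.5
```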

str.replace() needs a single pattern that matches any one of these separators, so we join the escaped separators with | (regex "or"). Note that str.strip() does not interpret a regex: it treats its argument as a set of individual characters to remove from both ends.

print(sep_regex)

\.|\-|_|/
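The difference between the two methods can be checked with a minimal sketch. The sample values here are illustrative only, and the "|x|" entry is a made-up value chosen to show the side effect of handing a regex string to str.strip().

```python
import pandas as pd

s = pd.Series([".A.256", "4/A/", "|x|"])

# str.strip() reads the string as a set of characters, so "\" and "|"
# from the regex syntax are stripped as well as the separators.
as_regex_string = s.str.strip(r"\.|\-|_|/").tolist()

# Passing only the separator characters leaves "|" untouched.
as_chars = s.str.strip(".-_/").tolist()

print(as_regex_string)  # ['A.256', '4/A', 'x']
print(as_chars)         # ['A.256', '4/A', '|x|']
```

For the data in the question both calls give the same result, because none of the values start or end with "\" or "|".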

We use the regex back-reference \1 to detect a separator that is immediately followed by the same separator, and replace the pair with a single occurrence.
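With three joined columns, at most two separators can end up next to each other in the middle of the string, so collapsing pairs is enough. With more columns, two adjacent blank values would leave a run of three. The sketch below uses plain re.sub on made-up strings to show that case. The "+" quantifier after the back-reference is a suggested extension, not part of the answer above.

```python
import re

sep_regex = r"\.|\-|_|/"  # the alternation built above

# A doubled separator collapses to one.
pair = re.sub(fr"({sep_regex})\1", r"\1", "1..546")

# A run of three still leaves a doubled separator behind.
run_left = re.sub(fr"({sep_regex})\1", r"\1", "1...546")

# Adding "+" after the back-reference collapses a run of any length.
run_gone = re.sub(fr"({sep_regex})\1+", r"\1", "1...546")

print(pair)      # 1.546
print(run_left)  # 1..546
print(run_gone)  # 1.546
```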

Answered By - SeaBean

This answer was collected from Stack Overflow and tested by PythonFixing community admins. It is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
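A different route, not taken in the answer above, is to skip blank values while joining, so no cleanup is needed afterwards. The following is only a sketch under two assumptions: blanks are empty strings rather than NaN, and the list named cols (a name introduced here for illustration) holds the columns to join. It uses four rows taken from the question's data.

```python
import pandas as pd

df = pd.DataFrame({
    "col_1": ["", "1", "1", "4"],
    "col_2": ["A", "", "C", "A"],
    "col_3": ["256", "546", "985", ""],
    "col_4": [".", ".", ".", "/"],
})

cols = ["col_1", "col_2", "col_3"]  # hypothetical: the columns to join, in order

# For each row, join only the non-empty values with that row's separator.
df["concatenated"] = [
    sep.join(v for v in values if v)
    for sep, values in zip(df["col_4"], df[cols].to_numpy())
]

result = df["concatenated"].tolist()
print(result)  # ['A.256', '1.546', '1.C.985', '4/A']
```

Because separators are only ever placed between non-blank values, this version should also leave alone any separator characters that appear inside the values themselves.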
