Wednesday, April 13, 2022

[FIXED] Manual format data Python

April 13, 2022 pandas, python, scikit-learn No comments

Issue

Please help encode on my sequence dataset manually not with sklearn, previously I used the sklearn library to encode like



k1,    m2,   A3,  A4,   A5, P1
A1,    k2,   A7,  A9,   A9, P2
A99,   m77,  A22,  A22,   A22, P9

Solution

First do it for one column and later put code in function and run in for-loop with different columns

First step is to get unique values in column

unique = sorted(df['COL1'].unique())

And next use for-loop and for every value in unique you can run

df['COL1'] == value

to get column with True/False which you can convert to integer to get 1 /0

(df['COL1'] == val).astype(int)

and you can put it back to original dataframe with new name

df['COL1' + '_' + val] = (df['COL1'] == val).astype(int)

text = '''COL1, COL2, COL3, COL4, COL5, LABELS
A1,    A2,   A3,  A4,   A5, P1
A1,    A2,   A7,  A9,   A9, P2
A99,   A77,  A22,  A22,   A22, P9
A1,    A2,   A8,  A9,   A0, P7
A1,    A2,   A8,  A90,   A9, P2
A1,    A21,  A8,  A9,   A11, P1
A11,   A2,   A81,  A9,   A9, P1'''

import pandas as pd
import io

df = pd.read_csv(io.StringIO(text), sep=',\s+')

unique = sorted(df['COL1'].unique())
print(unique)

for val in unique:
    col = (df['COL1'] == val).astype(int)
    #col.name += '_' + val
    print(col)
    df[col.name + '_' + val] = col

print(df)

Result:

  COL1 COL2 COL3 COL4 COL5 LABELS  COL1_A1  COL1_A11  COL1_A99
0   A1   A2   A3   A4   A5     P1        1         0         0
1   A1   A2   A7   A9   A9     P2        1         0         0
2  A99  A77  A22  A22  A22     P9        0         0         1
3   A1   A2   A8   A9   A0     P7        1         0         0
4   A1   A2   A8  A90   A9     P2        1         0         0
5   A1  A21   A8   A9  A11     P1        1         0         0
6  A11   A2  A81   A9   A9     P1        0         1         0

It need also to remove COL1 and it need to remeber unique values in some dictionary so later you can convert back to these values.

Now you have to put it in some function and run for other columns.

text = '''COL1, COL2, COL3, COL4, COL5, LABELS
A1,    A2,   A3,  A4,   A5, P1
A1,    A2,   A7,  A9,   A9, P2
A99,   A77,  A22,  A22,   A22, P9
A1,    A2,   A8,  A9,   A0, P7
A1,    A2,   A8,  A90,   A9, P2
A1,    A21,  A8,  A9,   A11, P1
A11,   A2,   A81,  A9,   A9, P1'''

import pandas as pd
import io

def convert(df, col_name):
    unique = sorted(df[col_name].unique())
    #print(unique)

    for val in unique:
        df[col_name + '_' + val] = (df[col_name] == val).astype(int)
    
    df.drop(columns=col_name, inplace=True)
    
    return unique

# ---

df = pd.read_csv(io.StringIO(text), sep=',\s+')

transformations = {}

for name in df.columns:
    if name.startswith('COL'):
        transformations[name] = convert(df, name)
        
print(transformations)        
    
print(df)

Result:

{'COL1': ['A1', 'A11', 'A99'], 
'COL2': ['A2', 'A21', 'A77'], 
'COL3': ['A22', 'A3', 'A7', 'A8', 'A81'], 
'COL4': ['A22', 'A4', 'A9', 'A90'], 
'COL5': ['A0', 'A11', 'A22', 'A5', 'A9']}

  LABELS  COL1_A1  COL1_A11  COL1_A99  ...  COL5_A11  COL5_A22  COL5_A5  COL5_A9
0     P1        1         0         0  ...         0         0        1        0
1     P2        1         0         0  ...         0         0        0        1
2     P9        0         0         1  ...         0         1        0        0
3     P7        1         0         0  ...         0         0        0        0
4     P2        1         0         0  ...         0         0        0        1
5     P1        1         0         0  ...         1         0        0        0
6     P1        0         1         0  ...         0         0        0        1

[7 rows x 21 columns]

Answered By - furas

This Answer collected from stackoverflow and tested by PythonFixing community admins, is licensed under cc by-sa 2.5 , cc by-sa 3.0 and cc by-sa 4.0

Wednesday, April 13, 2022

[FIXED] Manual format data Python

Issue

Solution

0 comments:

Post a Comment

Popular Posts

Labels