Issue
Please help encode on my sequence dataset manually not with sklearn, previously I used the sklearn library to encode like
k1, m2, A3, A4, A5, P1
A1, k2, A7, A9, A9, P2
A99, m77, A22, A22, A22, P9
Solution
First do it for one column and later put code in function and run in for
-loop with different columns
First step is to get unique values in column
unique = sorted(df['COL1'].unique())
And next use for
-loop and for every value
in unique
you can run
df['COL1'] == value
to get column with True
/False
which you can convert to integer to get 1
/0
(df['COL1'] == val).astype(int)
and you can put it back to original dataframe with new name
df['COL1' + '_' + val] = (df['COL1'] == val).astype(int)
text = '''COL1, COL2, COL3, COL4, COL5, LABELS
A1, A2, A3, A4, A5, P1
A1, A2, A7, A9, A9, P2
A99, A77, A22, A22, A22, P9
A1, A2, A8, A9, A0, P7
A1, A2, A8, A90, A9, P2
A1, A21, A8, A9, A11, P1
A11, A2, A81, A9, A9, P1'''
import pandas as pd
import io
df = pd.read_csv(io.StringIO(text), sep=',\s+')
unique = sorted(df['COL1'].unique())
print(unique)
for val in unique:
col = (df['COL1'] == val).astype(int)
#col.name += '_' + val
print(col)
df[col.name + '_' + val] = col
print(df)
Result:
COL1 COL2 COL3 COL4 COL5 LABELS COL1_A1 COL1_A11 COL1_A99
0 A1 A2 A3 A4 A5 P1 1 0 0
1 A1 A2 A7 A9 A9 P2 1 0 0
2 A99 A77 A22 A22 A22 P9 0 0 1
3 A1 A2 A8 A9 A0 P7 1 0 0
4 A1 A2 A8 A90 A9 P2 1 0 0
5 A1 A21 A8 A9 A11 P1 1 0 0
6 A11 A2 A81 A9 A9 P1 0 1 0
It need also to remove COL1
and it need to remeber unique values in some dictionary so later you can convert back to these values.
Now you have to put it in some function and run for other columns.
text = '''COL1, COL2, COL3, COL4, COL5, LABELS
A1, A2, A3, A4, A5, P1
A1, A2, A7, A9, A9, P2
A99, A77, A22, A22, A22, P9
A1, A2, A8, A9, A0, P7
A1, A2, A8, A90, A9, P2
A1, A21, A8, A9, A11, P1
A11, A2, A81, A9, A9, P1'''
import pandas as pd
import io
def convert(df, col_name):
unique = sorted(df[col_name].unique())
#print(unique)
for val in unique:
df[col_name + '_' + val] = (df[col_name] == val).astype(int)
df.drop(columns=col_name, inplace=True)
return unique
# ---
df = pd.read_csv(io.StringIO(text), sep=',\s+')
transformations = {}
for name in df.columns:
if name.startswith('COL'):
transformations[name] = convert(df, name)
print(transformations)
print(df)
Result:
{'COL1': ['A1', 'A11', 'A99'],
'COL2': ['A2', 'A21', 'A77'],
'COL3': ['A22', 'A3', 'A7', 'A8', 'A81'],
'COL4': ['A22', 'A4', 'A9', 'A90'],
'COL5': ['A0', 'A11', 'A22', 'A5', 'A9']}
LABELS COL1_A1 COL1_A11 COL1_A99 ... COL5_A11 COL5_A22 COL5_A5 COL5_A9
0 P1 1 0 0 ... 0 0 1 0
1 P2 1 0 0 ... 0 0 0 1
2 P9 0 0 1 ... 0 1 0 0
3 P7 1 0 0 ... 0 0 0 0
4 P2 1 0 0 ... 0 0 0 1
5 P1 1 0 0 ... 1 0 0 0
6 P1 0 1 0 ... 0 0 0 1
[7 rows x 21 columns]
Answered By - furas
0 comments:
Post a Comment
Note: Only a member of this blog may post a comment.