Issue
In my DataFrame, I have many instances of same AutoNumber
having different KeyValue_String
. I would like to convert these instances to a single row where the KeyValue_String
is a list comprised of the multiple unique values.
AutoNumber KeyValue_String ReferralType Description
0 50899 DD 3 Web Search
1 50905 Cheque 1 Gatestone Collections
2 50906 DD 2 Centum Mortgage Brokers
3 50907 Cheque 1 Financial Debt Recovery Ltd.
4 50908 DD 2 Centum Mortgage Brokers
5 50909 DD 2 Centum Mortgage Brokers
6 50910 Cheque 1 Allied International Credit
7 50911 Cheque 1 D&A Collection Corp
8 50912 Cheque 1 Gatestone Collections
9 50913 Cheque 1 Financial Debt Recovery Ltd.
10 50914 Cheque 3 Existing Customer - Refinancing
11 50914 DD 3 Existing Customer - Refinancing
12 50915 Cheque 1 Gatestone Collections
13 50916 Cheque 3 Existing Customer - Refinancing
14 50916 Cheque 3 Existing Customer - Refinancing
The desired output would look like this, except I want to keep all of the other columns
AutoNumber KeyValue_String
0 50899 DD
1 50905 Cheque
2 50906 DD
3 50907 Cheque
4 50908 DD
5 50909 DD
6 50910 Cheque
7 50911 Cheque
8 50912 Cheque
9 50913 Cheque
10 50914 [Cheque, DD]
11 50915 Cheque
12 50916 Cheque
13 50917 Cheque
14 50918 Cheque
Solution
If I understand correctly, you could opt for using groupby
, transform
, and unique
.
df['KeyValue_String'] = df.groupby('AutoNumber').KeyValue_String.transform('unique')
Then you can drop duplicates assuming as mentioned in the comments that rows with the same AutoNumber contain duplicate information besides the KeyValue_String.
df = df.drop_duplicates(subset='AutoNumber')
I would advise if you want arrays you keep everything in the column as an array, and don't expend effort putting mixed types in the column which will just be harder to work with anyways.
Demo
>>> df
AutoNumber KeyValue_String
0 50899 DD
1 50905 Cheque
2 50906 DD
3 50907 Cheque
4 50908 DD
5 50909 DD
6 50910 Cheque
7 50911 Cheque
8 50912 Cheque
9 50913 Cheque
10 50914 Cheque
11 50914 DD
12 50915 Cheque
13 50916 Cheque
14 50916 Cheque
>>> df['KeyValue_String'] = df.groupby('AutoNumber').KeyValue_String.transform('unique')
>>> df.drop_duplicates(subset='AutoNumber')
AutoNumber KeyValue_String
0 50899 [DD]
1 50905 [Cheque]
2 50906 [DD]
3 50907 [Cheque]
4 50908 [DD]
5 50909 [DD]
6 50910 [Cheque]
7 50911 [Cheque]
8 50912 [Cheque]
9 50913 [Cheque]
10 50914 [Cheque, DD]
12 50915 [Cheque]
13 50916 [Cheque]
Answered By - miradulo
0 comments:
Post a Comment
Note: Only a member of this blog may post a comment.