Issue
I have a data frame with the same values in some of the rows. I would like to return the row number and values that has more than 1 of the same value starting from Num1 to Num7 in df.
Also would like to return the row number and same value. Either or I'm okay with.
For example
import numpy as np
import pandas as pd
df = pd.DataFrame([[1,1,2,4,5,6,7,7],
[2,5,6,7,22,23,34,48],
[3,3,5,6,7,45,46,48],
[4,6,7,14,29,32,6,29], # duplicate 6 and 29
[5,6,7,13,23,33,35,7], # duplicate 7 but nothing else
[6,1,6,7,8,9,10,8],
[7,0,2,5,7,19,7,5]], # duplicate 7,5
columns = ['Row_Num', 'Num1','Num2','Num3','Num4','Num5','Num6','Num7'])
First result
df2 = pd.DataFrame([[4,6,7,14,29,32], # Return 6 and 29
[7,0,2,5,7,19]], # Return 7,5
columns = ['Row_Num', 'Num1','Num2','Num3','Num4','Num5'])
Second result
df3 = pd.DataFrame([[4,6,29], # Return 6 and 29
[7,7,5]], # Return 7,5
columns = ['Row_Num', 'Num1','Num2'])
Solution
IIUC you can use pd.duplicated with a some data manipulation to get your result:
df = df.set_index('Row_Num') # set index
df_duplicated = df.transform(lambda x: x.duplicated(), axis=1) # returns if rows have duplicates
# First result where
res1 = df[df_duplicated.sum(axis=1) >= 2][~df_duplicated[df_duplicated.sum(axis=1) >= 2]].dropna(axis=1)
# second result
res2 = df[df_duplicated.sum(axis=1) >= 2][df_duplicated[df_duplicated.sum(axis=1) >= 2]].dropna(axis=1)
Output:
result1
Num1 Num2 Num3 Num4 Num5
Row_Num
4 6 7 14 29 32
7 0 2 5 7 19
result2
:
Num6 Num7
Row_Num
4 6 29
7 7 5
To exactly match your output just reset_index
and rename column names for the second result.
Answered By - Suraj Shourie
0 comments:
Post a Comment
Note: Only a member of this blog may post a comment.