Issue
I have a dataframe that looks like the following, with x number of person ids (more than 1000 persons), x number of transactions per person, and x number of variables (more than 1000 variables):
Person_ID | transaction_ID | variable_1 | variable_2 | variable_3 | variable_X |
---|---|---|---|---|---|
person1 | transaction1 | 123 | 0 | 1 | abc |
person1 | transaction2 | 456 | 1 | 0 | def |
person1 | transaction3 | 123 | 0 | 1 | abc |
personx | transaction1 | 123 | 0 | 1 | abc |
personx | transaction2 | 456 | 0 | 1 | def |
I want to pad it with rows containing -10 at the beginning of every person id group so that the total number of rows per person id group is 6, like the following:
Person_ID | transaction_ID | variable_1 | variable_2 | variable_3 | variable_X |
---|---|---|---|---|---|
person1 | -10 | -10 | -10 | -10 | -10 |
person1 | -10 | -10 | -10 | -10 | -10 |
person1 | -10 | -10 | -10 | -10 | -10 |
person1 | transaction1 | 123 | 0 | 1 | abc |
person1 | transaction2 | 456 | 1 | 0 | def |
person1 | transaction3 | 123 | 0 | 1 | abc |
personx | -10 | -10 | -10 | -10 | -10 |
personx | -10 | -10 | -10 | -10 | -10 |
personx | -10 | -10 | -10 | -10 | -10 |
personx | -10 | -10 | -10 | -10 | -10 |
personx | transaction1 | 123 | 0 | 1 | abc |
personx | transaction2 | 456 | 0 | 1 | def |
Here is the code I tried (updated with concat) and the error below it.
df2 = pd.DataFrame([[''] * len(newdf.columns)], columns=newdf.columns)
df2
for row in newdf.groupby('person_id')['transaction_id']:
x=newdf.groupby('person_id')['person_id'].nunique()
if x.any() < 6:
newdf=pd.concat([newdf, df2*(6-x)], ignore_index=True)
RuntimeWarning: '<' not supported between instances of 'int' and 'tuple', sort order is undefined for incomparable objects.
newdf=pd.concat([newdf, df2*(6-x)], ignore_index=True)
It appended several NaN rows to the bottom of the dataframe, but not inbetween groups as needed. Thank you in advance as I am a beginner.
Solution
Code
use groupby
+ apply
def func1(df):
n = 6 - len(df)
if n > 0:
df1 = pd.DataFrame(df['Person_ID'].iloc[0], columns=['Person_ID'], index=range(0, n))
return pd.concat([df1.reindex(df.columns, axis=1, fill_value=-10), df])
out = df.groupby('Person_ID', group_keys=False).apply(func1).reset_index(drop=True)
out
out
Example Code
import pandas as pd
data1 = {'Person_ID': ['person1', 'person1', 'person1', 'personx', 'personx'],
'transaction_ID': ['transaction1', 'transaction2', 'transaction3', 'transaction1', 'transaction2'],
'variable_1': [123, 456, 123, 123, 456],
'variable_2': [0, 1, 0, 0, 0],
'variable_3': [1, 0, 1, 1, 1],
'variable_X': ['abc', 'def', 'abc', 'abc', 'def']}
df = pd.DataFrame(data1)
Answered By - Panda Kim
0 comments:
Post a Comment
Note: Only a member of this blog may post a comment.