Issue
Stata's fillin
command makes a dataset rectangular. How can I do the same in Pandas?
I have tried to simulate the fillin
command in this way, but it is very slow and expensive:
from itertools import product
collapse_df = collapse_df[['cod', 'loc_id', 'fob']]
var = list(product(collapse_df['loc_id'], collapse_df['cod']))
var = list(set([i for i in var]))
var_df = collapse_df[0:0]
for idx,item in enumerate(var):
df_t = collapse_df[(collapse_df['loc_id'] == item[0]) & (collapse_df['cod'] == item[1])]
if (len(df_t) == 0):
df_t.loc[0, 'loc_id'] = item[0]
df_t.loc[0, 'cod'] = item[1]
var_df = pd.concat([var_df, df_t], axis=0)
collapse_df = var_df.drop_duplicates()
Edit:
input: https://drive.google.com/file/d/1giWlKXNFaXeLpaVSUDc04AwyogD-ASJK/view?usp=sharing
output: https://drive.google.com/file/d/1UcADbQnbDELGPHZVIt2BmCYYTzU5pZtf/view?usp=sharing
Resource: https://www.stata.com/manuals13/dfillin.pdf
Solution
I'm not 100% certain what your desired result is from the input that you've provided. However, I took a stab at solving this going off what the Stata docs state.
Set up test data (very important when asking a question, please provide in the future)
import pandas as pd
import numpy as np
import itertools
np.random.seed(42)
test_data = pd.DataFrame(
{
'AgeGroup': np.random.choice(['20-24', '18-19', '10-17'], size=10, p=[0.75, 0.20, 0.05]),
'Sex': np.random.choice(['male', 'female'], size=10),
'Race': np.random.choice(['black', 'white'], size=10, p=[0.3, 0.7]),
'x1': np.random.uniform(size=10),
'x2': np.random.normal(0, 1, size=10)
}
)
test_data
Out:
AgeGroup Sex Race x1 x2
0 20-24 female black 0.785176 -0.600254
1 10-17 male white 0.199674 0.947440
2 20-24 female white 0.514234 0.291034
3 20-24 female white 0.592415 -0.635560
4 20-24 female black 0.046450 -1.021552
5 20-24 female white 0.607545 -0.161755
6 20-24 female black 0.170524 -0.533649
7 18-19 female black 0.065052 -0.005528
8 20-24 female white 0.948886 -0.229450
9 20-24 female white 0.965632 0.389349
My approach is basically:
- Find all possible combinations of identity columns
- Find which combinations do not exist in the provided dataset
- Create an empty dataset with the missing combinations and concatenate it to the existing dataset
def fill_in(df, id_cols):
"""Fill in empty records for combinations of id_cols that do not exist
in dataset.
Args:
df: dataset
id_cols: list of identity columns
Returns:
filled_df: dataframe with empty records for missing combinations of id_cols
"""
# create all possible unique combinations of id_cols
# and find combos that do not exist in the dataset
id_combos = list(itertools.product(*[df[c].unique() for c in id_cols]))
existing_combos = df[id_cols].apply(tuple, axis=1).unique()
missing_combos = set(id_combos) - set(existing_combos)
# create an empty dataframe with the missing combos
other_cols = [c for c in df.columns if c not in id_cols]
new_idx = pd.MultiIndex.from_tuples(missing_combos, names=id_cols)
empty_data = np.empty(shape=(len(missing_combos), len(other_cols))).fill(np.nan)
filled_df = pd.DataFrame(data=empty_data, index=new_idx, columns=other_cols).reset_index()
# concat dataset with empty dataset for missing combos
return pd.concat([df.assign(_fill_in=0), filled_df.assign(_fill_in=1)])
Trying it out:
fill_df(test_data, ['AgeGroup', 'Sex', 'Race'])
Results:
AgeGroup Sex Race x1 x2 _fill_in
0 20-24 female black 0.785176 -0.600254 0
1 10-17 male white 0.199674 0.947440 0
2 20-24 female white 0.514234 0.291034 0
3 20-24 female white 0.592415 -0.635560 0
4 20-24 female black 0.046450 -1.021552 0
5 20-24 female white 0.607545 -0.161755 0
6 20-24 female black 0.170524 -0.533649 0
7 18-19 female black 0.065052 -0.005528 0
8 20-24 female white 0.948886 -0.229450 0
9 20-24 female white 0.965632 0.389349 0
0 10-17 female white NaN NaN 1
1 10-17 female black NaN NaN 1
2 18-19 male black NaN NaN 1
3 10-17 male black NaN NaN 1
4 18-19 male white NaN NaN 1
5 18-19 female white NaN NaN 1
6 20-24 male white NaN NaN 1
7 20-24 male black NaN NaN 1
Answered By - JWilliams1
0 comments:
Post a Comment
Note: Only a member of this blog may post a comment.