Monday, November 15, 2021

[FIXED] Is there a Pandas function equivalent to Stata fillin?

November 15, 2021 dataframe, pandas, python, python-3.x, stata No comments

Issue

Stata's fillin command makes a dataset rectangular. How can I do the same in Pandas?

I have tried to simulate the fillin command in this way, but it is very slow and expensive:

from itertools import product

collapse_df = collapse_df[['cod', 'loc_id', 'fob']]
var = list(product(collapse_df['loc_id'], collapse_df['cod']))
var = list(set([i for i in var]))
var_df = collapse_df[0:0]
for idx,item in enumerate(var):
    df_t = collapse_df[(collapse_df['loc_id'] == item[0]) & (collapse_df['cod'] == item[1])]
    if (len(df_t) == 0):
        df_t.loc[0, 'loc_id'] = item[0]
        df_t.loc[0, 'cod'] = item[1]
    var_df = pd.concat([var_df, df_t], axis=0)
collapse_df = var_df.drop_duplicates()

Edit:

input: https://drive.google.com/file/d/1giWlKXNFaXeLpaVSUDc04AwyogD-ASJK/view?usp=sharing

output: https://drive.google.com/file/d/1UcADbQnbDELGPHZVIt2BmCYYTzU5pZtf/view?usp=sharing

Resource: https://www.stata.com/manuals13/dfillin.pdf

Solution

I'm not 100% certain what your desired result is from the input that you've provided. However, I took a stab at solving this going off what the Stata docs state.

Set up test data (very important when asking a question, please provide in the future)

import pandas as pd 
import numpy as np 
import itertools

np.random.seed(42)
test_data = pd.DataFrame(
    {
        'AgeGroup': np.random.choice(['20-24', '18-19', '10-17'], size=10, p=[0.75, 0.20, 0.05]),
        'Sex': np.random.choice(['male', 'female'], size=10),
        'Race': np.random.choice(['black', 'white'], size=10, p=[0.3, 0.7]),
        'x1': np.random.uniform(size=10),
        'x2': np.random.normal(0, 1, size=10)
    }
)

test_data

Out:

  AgeGroup     Sex   Race        x1        x2
0    20-24  female  black  0.785176 -0.600254
1    10-17    male  white  0.199674  0.947440
2    20-24  female  white  0.514234  0.291034
3    20-24  female  white  0.592415 -0.635560
4    20-24  female  black  0.046450 -1.021552
5    20-24  female  white  0.607545 -0.161755
6    20-24  female  black  0.170524 -0.533649
7    18-19  female  black  0.065052 -0.005528
8    20-24  female  white  0.948886 -0.229450
9    20-24  female  white  0.965632  0.389349

My approach is basically:

Find all possible combinations of identity columns
Find which combinations do not exist in the provided dataset
Create an empty dataset with the missing combinations and concatenate it to the existing dataset

def fill_in(df, id_cols):
    """Fill in empty records for combinations of id_cols that do not exist
    in dataset.
    
    Args:
        df: dataset
        id_cols: list of identity columns

    Returns:
        filled_df: dataframe with empty records for missing combinations of id_cols
    """
    # create all possible unique combinations of id_cols
    # and find combos that do not exist in the dataset
    id_combos = list(itertools.product(*[df[c].unique() for c in id_cols]))
    existing_combos = df[id_cols].apply(tuple, axis=1).unique()
    missing_combos = set(id_combos) - set(existing_combos)

    # create an empty dataframe with the missing combos
    other_cols = [c for c in df.columns if c not in id_cols]
    new_idx = pd.MultiIndex.from_tuples(missing_combos, names=id_cols)
    empty_data = np.empty(shape=(len(missing_combos), len(other_cols))).fill(np.nan)
    filled_df = pd.DataFrame(data=empty_data, index=new_idx, columns=other_cols).reset_index()

    # concat dataset with empty dataset for missing combos
    return pd.concat([df.assign(_fill_in=0), filled_df.assign(_fill_in=1)])

Trying it out:

fill_df(test_data, ['AgeGroup', 'Sex', 'Race'])

Results:

  AgeGroup     Sex   Race        x1        x2  _fill_in
0    20-24  female  black  0.785176 -0.600254         0
1    10-17    male  white  0.199674  0.947440         0
2    20-24  female  white  0.514234  0.291034         0
3    20-24  female  white  0.592415 -0.635560         0
4    20-24  female  black  0.046450 -1.021552         0
5    20-24  female  white  0.607545 -0.161755         0
6    20-24  female  black  0.170524 -0.533649         0
7    18-19  female  black  0.065052 -0.005528         0
8    20-24  female  white  0.948886 -0.229450         0
9    20-24  female  white  0.965632  0.389349         0
0    10-17  female  white       NaN       NaN         1
1    10-17  female  black       NaN       NaN         1
2    18-19    male  black       NaN       NaN         1
3    10-17    male  black       NaN       NaN         1
4    18-19    male  white       NaN       NaN         1
5    18-19  female  white       NaN       NaN         1
6    20-24    male  white       NaN       NaN         1
7    20-24    male  black       NaN       NaN         1

Answered By - JWilliams1

This Answer collected from stackoverflow and tested by PythonFixing community admins, is licensed under cc by-sa 2.5 , cc by-sa 3.0 and cc by-sa 4.0

Monday, November 15, 2021

[FIXED] Is there a Pandas function equivalent to Stata fillin?

Issue

Solution

0 comments:

Post a Comment

Popular Posts

Labels