Issue
Context: I get a spreadsheet with numerous sheets. Each sheet references the same core columns but includes different subsets of the master sheet. How do I ensure each subset has the same ID married with the same secondary, tertiary column? This question starts with the assumption that we know how to load dataframes from excel:
Ex.
df_subset = pd.DataFrame([[1000, 'Jerry', 'BR1','BR1'],
[1001, 'Sal', 'BR2', 'BR1'],
[1005, 'Buck', 'BR3', 'BR2'],
[1003, 'Perry','BR4','BR1']],
columns=['ID', 'Name', 'Branch', 'Member of'])
df_master = pd.DataFrame([[1000, 'Jerry', 'BR1','Black'],
[1001, 'Sal', 'BR2', 'Mocha'],
[1002, 'Buck', 'BR3', 'Frap'],
[1003, 'Perry','BR4','Black'],
[1004, 'Mike', 'BR5', 'Mach']],
columns=['ID', 'Name', 'Branch', 'Coffee'])
Each of the above test frames represents a sheet. df_master
is the authoritative source of "Truth". How do I catch the error in df_subset
( See "Buck" )? Note the mismatched ID.
Expected output ( open to suggestions here, but it needs to flag errors ):
Master: 'Buck': 1002
Subset: 'Buck': 1005
If no mistakes, then no output.
UPDATE:
master_list = df_master[['ID','Name']].copy()
subset_list = df_subset[['ID','Name']].copy()
df_check = pd.concat([master_list,subset_list])
df_check.drop_duplicates(keep=False, ignore_index=False, subset=None)
The above code almost works but shows records present in the master that are not used in any subset. It does show the errors -- should I maybe do a merge
on ID
, then drop_duplicates
?
Solution
I found a better solution using indexes and isin
:
inconsistent_mask = (~df_master.set_index(['ID', 'Name']).index.isin(df_subset.set_index(['ID', 'Name']).index) & df_master.Name.isin(df_subset.Name))
names = df_master.loc[inconsistent_mask, 'Name'].tolist()
master_ids = df_master.loc[inconsistent_mask, 'ID'].tolist()
subset_ids = df_subset.set_index('Name').loc[names, 'ID'].tolist()
report_df = pd.DataFrame({'Name': names, 'Master': master_ids, 'Subset': subset_ids})
Output:
>>> report_df
Name Master Subset
0 Buck 1002 1005
Answered By - richardec
0 comments:
Post a Comment
Note: Only a member of this blog may post a comment.