Issue
I've downloaded the data (from https://www.nalpdirectory.com/) as seen in this Excel screenshot.
How can I reformat this into a pandas DataFrame? I'm trying to do as much as possible in Python instead of in Excel. I'm looking into DataFrame.stack() and DataFrame.unstack(), but I think I'm missing a few steps first. Thank you!
Solution
I would go for a DataFrame with a hierarchical index (2 levels) and hierarchical columns (3 levels):
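import pandas as pd

LB = 14  # Length of blocks (rows per block)
NH = 3   # Number of headers

# Read raw cells without a header row; "UNK" and "NC" become NaN
raw = pd.read_excel("file.xlsx", header=None, na_values=["UNK", "NC"])

# Column MultiIndex: forward-fill the header rows across merged cells
mux_cols = (pd.MultiIndex.from_frame(
    raw.iloc[:NH+1, 1:].ffill(axis=1).dropna().T).rename([None]*NH))

# Position of each row inside its block
idx_blocks = (raw.index % LB)

df = (
    raw.iloc[idx_blocks > NH]  # rows past the first NH+1 rows of each block
    # label from column 0 on the first NH-1 rows of each block, forward-filled
    .join(raw[0].where(idx_blocks < NH-1)
          .ffill().rename("Category"))
    .rename(columns={0: "Sub_Category"})
    .set_index(["Category", "Sub_Category"])
    .set_axis(mux_cols, axis=1)
    #.fillna(0) # optional
    .convert_dtypes()
)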
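To make the two reshaping tricks above easier to follow without the spreadsheet, here is a minimal sketch on made-up data. The toy header rows, the block length of 3, and the names header, col0, pos and category are all assumptions for illustration; they are not taken from the real file.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the header rows (made-up values). Merged Excel cells
# come back as NaN everywhere except the first cell of the merge.
header = pd.DataFrame(
    [
        ["Total Attorneys", np.nan, "Partners", np.nan],
        [2023, np.nan, 2023, np.nan],
        ["Men", "Women", "Men", "Women"],
    ]
)

# Forward-fill along each row so every column carries its full label,
# then transpose: each sheet column becomes one row of three labels.
labels = header.ffill(axis=1).T
mux_cols = pd.MultiIndex.from_frame(labels).rename([None] * 3)
print(mux_cols.tolist())

# Row side: assume the category name sits on the first row of each
# block of 3 rows in column 0 (hypothetical layout).
col0 = pd.Series(["Firm A", "Asian", "Black", "Firm B", "Asian", "Black"])
pos = col0.index % 3                      # position inside the block
category = col0.where(pos == 0).ffill()   # keep block titles, fill downward
sub_category = col0[pos > 0]              # the remaining rows are data rows
print(category.tolist())
```

The same idea scales to the real sheet by changing the block length and the number of header rows, which is what LB and NH do in the answer's code.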
Output (in Jupyter): (screenshot not reproduced here)
Now, suppose you're interested in one specific value. You can select it with .loc this way:
cat = "Cleary Gottlieb Steen & Hamilton LLP, NEW YORK, New York"
sub_cat = "2 or More Races"
agg, year, gender = "Total Attorneys", 2023, "Men"
df.loc[(cat, sub_cat), (agg, year, gender)] # 7
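Since the question mentions stack and unstack, here is a small sketch of how a frame of this shape can be sliced and turned into long format. The toy frame, its values, and the level names agg, year and gender are assumptions made for this example; the frame built above has unnamed column levels, so you would pass level positions instead.

```python
import pandas as pd

# Toy frame with 2 row levels and 3 column levels (made-up values).
cols = pd.MultiIndex.from_product(
    [["Total Attorneys"], [2022, 2023], ["Men", "Women"]],
    names=["agg", "year", "gender"],
)
rows = pd.MultiIndex.from_tuples(
    [("Firm A", "2 or More Races"), ("Firm A", "Asian")],
    names=["Category", "Sub_Category"],
)
toy = pd.DataFrame([[5, 6, 7, 8], [1, 2, 3, 4]], index=rows, columns=cols)

# Scalar lookup: a full row key and a full column key.
value = toy.loc[("Firm A", "2 or More Races"), ("Total Attorneys", 2023, "Men")]

# Cross-section: all rows, only the 2023 columns (the year level is dropped).
only_2023 = toy.xs(2023, axis=1, level="year")

# Long format: move all three column levels into the row index.
long = toy.stack(["agg", "year", "gender"])
print(long)
```

Calling long.unstack(["agg", "year", "gender"]) should move those levels back to the columns, which is handy once filtering is easier to do in long form.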
NB: I couldn't download the spreadsheet from the link you shared, but you can find the one I made and try my code online right here.
Answered By - Timeless