Issue
I am currently running a for loop with an if statement to check and replace the value in one column based on the value of another column in the same row.
Simply put, my dataframe is 9000 rows by 27 columns, similar to
I am using the code below to compare the first 6 characters (e.g. CONT-1) of the Package ID column with the first 6 characters of the Status ID column. If they match, the entire row is copied to a new dataframe; if not, the Status ID value is replaced with NaN before the row is copied.
import numpy as np
import pandas as pd

new_dr = pd.DataFrame(columns = pr_compliant.columns)
for index, row in pr_compliant.iterrows():
    # Compare the first 6 characters of the two ID columns
    col1 = row['Package ID'][:6]
    col2 = row['Status ID'][:6]
    if col1 == col2:
        new_dr = new_dr._append(row, ignore_index=True)
    else:
        # No match: blank out the status before copying the row
        row['Status ID'] = np.nan
        new_dr = new_dr._append(row, ignore_index=True)
new_dr = new_dr.drop_duplicates()
print(new_dr)
Here pr_compliant is the source dataframe and new_dr is the output dataframe. I want output as below
Currently it takes more than 30 seconds to compare 9000 rows and produce the output. I am looking for a more efficient way to do this, as the master file I will be running this code on is a 100000x27 dataframe.
Any thoughts on efficiency?
Solution
Try this:
pr_compliant['Status ID'] = (
    (pr_compliant['Status ID'].str[:6] == pr_compliant['Package ID'].str[:6])
    * pr_compliant['Status ID']
).replace({'': np.nan})
This is kind of a slick one-liner, so I'll break it down. Adding .str to a dataframe column lets us apply string operations to each entry, so pr_compliant['Status ID'].str[:6] will give us a column of just the first 6 characters of each entry. Then, when we do the comparison
pr_compliant['Status ID'].str[:6] == pr_compliant['Package ID'].str[:6]
that gives us a column of True and False values, with the Trues indicating the rows where the first 6 characters match. When we multiply that by the original pr_compliant['Status ID'] column, that puts the status ID where the Trues are and blank strings where the Falses are (multiplying a string by True keeps it, and multiplying by False gives an empty string). Finally, we replace those blank strings with NaNs using .replace({'': np.nan}).
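For readers who find the multiplication trick hard to follow, here is a minimal sketch of the same idea using Series.where instead. This is a swapped-in variant, not the code from the answer above. The sample rows and the name prefix_matches are made up for illustration; only the column names and pr_compliant / new_dr come from the question.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; only the two column names come from the question.
pr_compliant = pd.DataFrame({
    "Package ID": ["CONT-1-A", "CONT-1-B", "CONT-2-A", "CONT-3-A"],
    "Status ID": ["CONT-1-X", "CONT-2-Y", "CONT-2-Z", None],
})

# Boolean mask: True where the first 6 characters of both columns match.
prefix_matches = (
    pr_compliant["Status ID"].str[:6] == pr_compliant["Package ID"].str[:6]
)

# Keep the status where the prefixes match, otherwise set it to NaN.
pr_compliant["Status ID"] = pr_compliant["Status ID"].where(prefix_matches, np.nan)

# The loop in the question also dropped duplicate rows at the end.
new_dr = pr_compliant.drop_duplicates()
print(new_dr)
```

Like the one-liner, this works on whole columns at once instead of appending row by row, and it never creates the intermediate blank strings, so no .replace step is needed.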
Answered By - Jacob H