Issue
I have a large Pandas DataFrame made of 298k rows. One column is ID and it is unique for each row.
I want to perform two operations:
Filter the dataframe rows such that the ID value for each row is within a Python list, called my_list that contains 13k elements. I tried using query() and it works fine:
df= df.query(f'ID in {my_list}')
Now, I want to rearrange the dataframe such that the order of the IDs is the same as the one in my_list, which is an ordered list that may contain duplicates.
If there is a duplicate in the list, I want to add the duplicate row even in the dataframe, keeping the order.
I tried using categories to order the dataframe but it does not work with duplicates and I have no idea how to add the duplicate row if there is one in the list:df['ID'] = pd.Categorical(df['ID'], categories=mylist, ordered=True)
Any efficient solutions? Thanks
Numerical example:
I have a list with these IDs: [1, 2, 3, 1, 4, 2].
And I have a dataframe with these rows:
ID | Name | ... |
---|---|---|
0 | John | ... |
1 | Paul | ... |
2 | Rick | ... |
3 | Jean | ... |
4 | Mark | ... |
5 | Neil | ... |
6 | Martin | ... |
I want in output this dataframe, following the list order and adding duplicates in the df:
ID | Name | ... |
---|---|---|
1 | Paul | ... |
2 | Rick | ... |
3 | Jean | ... |
1 | Paul | ... |
4 | Mark | ... |
2 | Rick | ... |
Solution
If the original index is not critical, you can reindex
:
df.set_index('ID').reindex(my_list).reset_index()
Answered By - mozway
0 comments:
Post a Comment
Note: Only a member of this blog may post a comment.