Issue
I couldn't find a transformer in scikit-learn that removes duplicated entries, like drop_duplicates in pandas.
- How can I deal with this problem?
- Should I write a custom transformer? If yes, how? I would appreciate any help in this regard.
- Which one is faster, using pandas or scikit-learn?
Best regards,
Solution
It's better, simpler, and faster to drop duplicates first, before any pipeline work, so the data doesn't get mixed up in the pipeline and you can split it early using train_test_split. Simply use:
df_dropped = df.drop_duplicates()
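As a rough sketch of that order of operations (deduplicate first, then split), something like the following should work. The toy DataFrame, its column names (feature_a, feature_b, target), and the test_size and random_state values are made up for illustration and are not from the original question. Dropping rows before the split also keeps features and labels aligned, since the rows are removed from both at once.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: the first two rows are exact duplicates.
df = pd.DataFrame({
    "feature_a": [1, 1, 2, 3, 4, 5],
    "feature_b": [10, 10, 20, 30, 40, 50],
    "target": [0, 0, 1, 0, 1, 1],
})

# Drop exact duplicate rows and renumber the index.
df_dropped = df.drop_duplicates().reset_index(drop=True)

# Separate features and label only after deduplication,
# so X and y always have matching rows.
X = df_dropped.drop(columns="target")
y = df_dropped["target"]

# Split early; any scikit-learn pipeline is then fit on X_train only.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=0
)
```

If only some columns should define a duplicate, drop_duplicates also accepts a subset argument, e.g. df.drop_duplicates(subset=["feature_a", "feature_b"]).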
Answered By - Baraa Zaid