Issue
I have a dataframe with the following format:
ORDER_NUMBER | PRODUCT1 | PRODUCT2 | PRODUCT3 | ...
ORDER1 | 1.0 | 0.0 | 0.0 | ...
ORDER2 | 0.0 | 0.0 | 0.0 | ...
ORDER3 | 0.0 | 1.0 | 0.0 | ...
ORDER4 | 0.0 | 0.0 | 0.0 | ...
ORDER5 | 0.0 | 1.0 | 1.0 | ...
...
In essence, I have one-hot encoded all products per order so that it is clear that, e.g., ORDER1 contains PRODUCT1, but not PRODUCT2 and PRODUCT3, etc.
I also have a list of products that I am interested in, for example:
interesting_products = ['PRODUCT1', 'PRODUCT3']
What I want to do is keep only the orders (rows) in which at least one of the products in my interesting_products list appears (i.e., the column holds a 1.0 rather than a 0.0).
So the resulting dataframe should be:
ORDER_NUMBER | PRODUCT1 | PRODUCT2 | PRODUCT3 | ...
ORDER1 | 1.0 | 0.0 | 0.0 | ...
ORDER5 | 0.0 | 1.0 | 1.0 | ...
...
My problem is that this file is huge (a few GBs). What is the most efficient way of filtering my dataframe?
Solution
Select only the columns from the list, take the row-wise sum, test whether it is non-zero, and filter with boolean indexing:
df = df[df[interesting_products].sum(axis=1).ne(0)]
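A minimal, self-contained sketch of the one-liner in action (the DataFrame below is made-up data mirroring the question's table), together with an equivalent variant that builds the boolean mask directly:

```python
import pandas as pd

# Small stand-in for the one-hot encoded order data (hypothetical values).
df = pd.DataFrame(
    {
        "ORDER_NUMBER": ["ORDER1", "ORDER2", "ORDER3", "ORDER4", "ORDER5"],
        "PRODUCT1": [1.0, 0.0, 0.0, 0.0, 0.0],
        "PRODUCT2": [0.0, 0.0, 1.0, 0.0, 1.0],
        "PRODUCT3": [0.0, 0.0, 0.0, 0.0, 1.0],
    }
)

interesting_products = ["PRODUCT1", "PRODUCT3"]

# Keep rows where at least one interesting product is present.
filtered = df[df[interesting_products].sum(axis=1).ne(0)]

# Equivalent: test each interesting column for a non-zero value,
# then keep rows where any of them matches.
filtered_any = df[df[interesting_products].ne(0).any(axis=1)]
```

Both versions keep ORDER1 and ORDER5 and drop the rest; ORDER3 is excluded because its only product, PRODUCT2, is not in the list. If the file does not fit in memory, the same mask can be applied per chunk by reading with the `chunksize` argument of `pandas.read_csv` and concatenating the filtered pieces.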
Answered By - jezrael