Issue
I have a result of isna() like this:
data.isna().sum().sort_values()
index 0
Longtitude 0
Lattitude 0
Landsize 0
Bathroom 0
Bedroom2 0
Regionname 0
Distance 0
Postcode 0
SellerG 0
Method 0
Price 0
Type 0
Rooms 0
Address 0
Suburb 0
Date 0
Propertycount 0
Car 25
CouncilArea 553
YearBuilt 2130
BuildingArea 2542
dtype: int64
What I'd like to do is to get the column names in a list in ascending order of values where they're non-zero - so in the case above the last four. So essentially, I do this, but it gives me the wrong result:
>>> list(data.columns[data.isna().sum().sort_values() > 0])
['Lattitude', 'Longtitude', 'Regionname', 'Propertycount']
If I do not sort it, it works as expected:
>>> list(data.columns[data.isna().sum() > 0]) # no `.sort_values()`
['Car', 'BuildingArea', 'YearBuilt', 'CouncilArea']
but I'd like the list to be sorted.
BTW it's the same behaviour with isnull()
My questions are these:
- Why is the above happening? Why does sorting the result give this weird output (and it's the same every time, no matter how many times you run it)?
- How may I get the names of the columns in ascending order in a list?
Solution
list(data.columns[data.isna().sum().sort_values() > 0])
Let's break down this expression.
data.isna().sum() creates a Series with the column names as its index and the NaN counts as its values.
Applying sort_values() sorts that Series by NaN count, which reorders its index: the column names now appear in order of count, not in the DataFrame's original column order.
When you then filter with data.columns[data.isna().sum().sort_values() > 0], you index data.columns, which is still in the original column order, with a boolean Series whose labels are in sorted order. Indexing an Index with a boolean Series does not align on labels; the booleans are applied purely by position. Because sorting pushed the four True values to the end, you get whatever columns happen to sit last in data.columns, here Lattitude, Longtitude, Regionname and Propertycount. That is also why the result is the same on every run.
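To see the mechanism in isolation, here is a minimal sketch on a hypothetical four-column frame (the column names a to d and their values are made up for illustration, not taken from the question's dataset). It assumes, as the output in the question suggests, that data.columns treats a boolean Series as a plain positional mask and ignores its labels.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: column order is a, b, c, d
# and the NaN counts are a=2, b=0, c=1, d=0.
data = pd.DataFrame({
    "a": [np.nan, np.nan, 1.0],
    "b": [1.0, 2.0, 3.0],
    "c": [np.nan, 1.0, 2.0],
    "d": [1.0, 2.0, 3.0],
})

mask = data.isna().sum().sort_values() > 0

# The labels are now in sorted-by-count order, with the True values at the end.
print(mask.index.tolist())
print(mask.to_numpy())

# Indexing the columns with the Series gives the same result as
# indexing with its bare boolean values: the labels play no role.
wrong = list(data.columns[mask])
by_position = list(data.columns[mask.to_numpy()])
print(wrong)  # ['c', 'd'], although the columns with NaNs are 'a' and 'c'
```

The two True values land on the last two positions of data.columns, so 'd' is returned even though it has no NaNs, and 'a' is missed.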
To address this, you need to first sort the NaN count Series, then extract the column names with counts greater than 0 from this sorted Series. You can do something like the following:
sorted_counts = data.isna().sum().sort_values() # sorting the series
# extract column names from your sorted series
sorted_cols = sorted_counts[sorted_counts > 0].index.tolist()
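As a quick check, the same two lines can be run on a small hypothetical frame (the column names a to d and the data are invented for this sketch). The second variant, which filters before sorting, is an alternative suggestion rather than part of the original answer; it should give the same list.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: NaN counts are a=2, b=0, c=1, d=0.
data = pd.DataFrame({
    "a": [np.nan, np.nan, 1.0],
    "b": [1.0, 2.0, 3.0],
    "c": [np.nan, 1.0, 2.0],
    "d": [1.0, 2.0, 3.0],
})

# Sort first, then filter the sorted Series with a mask built from itself,
# so the mask and the data share the same index order.
sorted_counts = data.isna().sum().sort_values()
sorted_cols = sorted_counts[sorted_counts > 0].index.tolist()
print(sorted_cols)  # ['c', 'a']

# Alternative: drop the zero counts first, then sort what is left.
counts = data.isna().sum()
alt_cols = counts[counts > 0].sort_values().index.tolist()
print(alt_cols)  # ['c', 'a']
```

Both versions take the names from the index of the filtered Series itself, so there is no second, differently ordered object for the mask to be misapplied to.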
Answered By - pparker