Saturday, January 20, 2024

[FIXED] `columns[summary.sort_values() > 0]` behaviour not making sense


Issue

I have a result of isna() like this:

data.isna().sum().sort_values()

index               0
Longtitude          0
Lattitude           0
Landsize            0
Bathroom            0
Bedroom2            0
Regionname          0
Distance            0
Postcode            0
SellerG             0
Method              0
Price               0
Type                0
Rooms               0
Address             0
Suburb              0
Date                0
Propertycount       0
Car                25
CouncilArea       553
YearBuilt        2130
BuildingArea     2542
dtype: int64

I'd like to get a list of the column names whose counts are non-zero, in ascending order of count, so in the case above the last four. I tried this, but it gives the wrong result:

>>> list(data.columns[data.isna().sum().sort_values() > 0])
['Lattitude', 'Longtitude', 'Regionname', 'Propertycount']

If I do not sort it, it works as expected:

>>> list(data.columns[data.isna().sum() > 0])  # no `.sort_values()`
['Car', 'BuildingArea', 'YearBuilt', 'CouncilArea']

but I'd like the list to be sorted.

BTW, the behaviour is the same with isnull().

My questions are these:

1. Why is this happening? Why does sorting the result give this odd output (and it's the same every time, no matter how many times I run it)?
2. How can I get the column names in a list, in ascending order of NaN count?

Solution

list(data.columns[data.isna().sum().sort_values() > 0])

Let's break down this expression:

data.isna().sum() will create a series with column names as index and count of NaNs as values.

Applying sort_values() sorts that Series by NaN count, which also reorders its index. The column-name labels of the sorted Series therefore no longer follow the order of data.columns.
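The reordering can be seen on a small example. This is only a sketch: the frame below is a hypothetical toy dataset (column names A to D and all values are made up), not the data from the question.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame, made up for illustration only
data = pd.DataFrame({
    "A": [np.nan, np.nan, 1.0, 2.0],     # 2 missing values
    "B": [1.0, 2.0, 3.0, 4.0],           # 0 missing values
    "C": [np.nan, np.nan, np.nan, 1.0],  # 3 missing values
    "D": [np.nan, 1.0, 2.0, 3.0],        # 1 missing value
})

counts = data.isna().sum()            # index follows data.columns
sorted_counts = counts.sort_values()  # index follows ascending NaN count

print(list(counts.index))         # ['A', 'B', 'C', 'D']
print(list(sorted_counts.index))  # ['B', 'D', 'A', 'C']
```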

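Now, when you filter with data.columns[data.isna().sum().sort_values() > 0], you index the unsorted data.columns with a boolean Series whose index has been reordered. Indexing a pandas Index with a boolean mask is positional: the mask's labels are ignored and only its True/False values are used, in order. After sorting, the True values sit in the last four positions of the mask, so you get whichever columns occupy the last four positions of data.columns. That is why the result is wrong, yet identical on every run.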
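A minimal sketch of that mismatch, using made-up column names and NaN counts rather than the question's data:

```python
import pandas as pd

# Hypothetical column names and NaN counts, for illustration only
columns = pd.Index(["A", "B", "C", "D"])
counts = pd.Series([2, 0, 3, 1], index=["A", "B", "C", "D"])

# Labels after sorting: B, D, A, C -> values: False, True, True, True
mask = counts.sort_values() > 0

# The Index is filtered by position, so positions 1, 2 and 3 are kept
wrong = list(columns[mask])

# Passing the bare boolean array gives the same result: labels play no role
same = list(columns[mask.to_numpy()])

print(wrong)  # ['B', 'C', 'D'], even though B has no missing values
print(same)   # ['B', 'C', 'D']
```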

To address this, you need to first sort the NaN count Series, then extract the column names with counts greater than 0 from this sorted Series. You can do something like the following:

sorted_counts = data.isna().sum().sort_values()  # sorting the series
# extract column names from your sorted series
sorted_cols = sorted_counts[sorted_counts > 0].index.tolist()
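As a quick check, here is the same approach run end to end on a hypothetical toy frame (names and values are made up), next to a variant that filters before sorting. The variant is not part of the original answer, just an equivalent ordering of the two steps.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame, made up for illustration only
data = pd.DataFrame({
    "A": [np.nan, np.nan, 1.0, 2.0],     # 2 missing values
    "B": [1.0, 2.0, 3.0, 4.0],           # 0 missing values
    "C": [np.nan, np.nan, np.nan, 1.0],  # 3 missing values
    "D": [np.nan, 1.0, 2.0, 3.0],        # 1 missing value
})

# Sort first, then filter the sorted Series with a mask built from itself,
# so mask and Series share the same index order
sorted_counts = data.isna().sum().sort_values()
sorted_cols = sorted_counts[sorted_counts > 0].index.tolist()

# Variant: filter first, then sort the remaining counts
counts = data.isna().sum()
filtered_then_sorted = counts[counts > 0].sort_values().index.tolist()

print(sorted_cols)           # ['D', 'A', 'C']
print(filtered_then_sorted)  # ['D', 'A', 'C']
```

In both versions the boolean mask comes from the same Series it filters, so positions and labels cannot drift apart.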

Answered By - pparker

This answer was collected from Stack Overflow and tested by PythonFixing community admins. It is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
