Tuesday, January 9, 2024

[FIXED] Why does the output of Pandas DataFrame.sort_values differ from Series.sort_values?

January 09, 2024 dataframe, pandas, python, series

Issue

While teaching, one of my students pointed out that Pandas DataFrame.sort_values returns a different ordering (different tie breaks) from the equivalent Series.sort_values. Consider this example, which uses value_counts:

>>> import pandas as pd
>>> df = pd.read_csv('https://gist.githubusercontent.com/matthew-brett/806a356bb7b71f08c5c6d0c5235e2f3d/raw/facb1aab243a33033b46657378f65dcd41542596/business.csv')
>>> df['name'].value_counts().head(6)
name
Peet's Coffee & Tea    20
Starbucks Coffee       13
McDonald's             10
Jamba Juice            10
STARBUCKS               9
Proper Food             9
Name: count, dtype: int64
>>> df.value_counts('name').head(6)
name
Peet's Coffee & Tea    20
Starbucks Coffee       13
McDonald's             10
Jamba Juice            10
Proper Food             9
STARBUCKS               9
Name: count, dtype: int64

Of course, both of these orders are valid, given the unstable default sort (quicksort), but it's hard to see why they would differ, since the default method appears to be the same in both cases.

Solution

The results differ because the two methods use different strategies.

To compute the value_counts of a DataFrame, Pandas uses a groupby followed by size, and groupby sorts the group keys in lexicographic order by default.

A Series computes value_counts more directly: it uses IndexOpsMixin.value_counts, which calls pandas.core.algorithms.value_counts_internal.
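The effect of the groupby key sort can be seen on a tiny frame. This is only a minimal sketch: the `toy` frame and its values are made up for illustration and are not taken from business.csv.

```python
import pandas as pd

# Hypothetical toy data: 'b' appears before 'a' in the column.
toy = pd.DataFrame({'name': ['b', 'a', 'b', 'a', 'c']})

# Default groupby (sort=True) sorts the group keys lexicographically.
sorted_keys = toy.groupby('name').size()

# sort=False keeps the keys in order of first appearance.
appearance_keys = toy.groupby('name', sort=False).size()

print(list(sorted_keys.index))      # ['a', 'b', 'c']
print(list(appearance_keys.index))  # ['b', 'a', 'c']
```

The counts are identical in both cases; only the order handed to the later sort_values call differs, which is where the tie breaks can diverge.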

So to get the same result as for a Series, use:

>>> df.groupby('name', sort=False).size().sort_values(ascending=False).head(10)
name
Peet's Coffee & Tea          20
Starbucks Coffee             13
McDonald's                   10
Jamba Juice                  10
STARBUCKS                     9
Proper Food                   9
Mixt Greens/Mixt              8
Specialty's Cafe & Bakery     8
Philz Coffee                  7
The Organic Coup              7
dtype: int64

>>> df['name'].value_counts().head(10)
name
Peet's Coffee & Tea          20
Starbucks Coffee             13
McDonald's                   10
Jamba Juice                  10
STARBUCKS                     9
Proper Food                   9
Mixt Greens/Mixt              8
Specialty's Cafe & Bakery     8
Philz Coffee                  7
The Organic Coup              7
Name: count, dtype: int64
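If a reproducible tie order matters, one option is to ask sort_values for a stable sort explicitly through its `kind` parameter. The sketch below again uses made-up data (the `toy` frame is an assumption for illustration, not part of the original example):

```python
import pandas as pd

# Hypothetical toy data, not taken from business.csv.
toy = pd.DataFrame({'name': ['b', 'a', 'b', 'a', 'c']})

# Keep the group keys in order of first appearance.
counts = toy.groupby('name', sort=False).size()

# kind='stable' keeps tied counts in the order they had before sorting,
# so 'b' (seen first) stays ahead of 'a'.
stable_counts = counts.sort_values(ascending=False, kind='stable')
print(stable_counts)
```

With a stable sort, the tie order depends only on the order of the input, so it no longer depends on how quicksort happens to arrange equal values.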

Answered By - Corralien

This answer was collected from Stack Overflow and tested by PythonFixing community admins; it is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
