Issue
While teaching, one of my students pointed out that pandas DataFrame.value_counts returns a different ordering (different tie breaks) from the equivalent Series.value_counts. Consider this:
>>> import pandas as pd
>>> df = pd.read_csv('https://gist.githubusercontent.com/matthew-brett/806a356bb7b71f08c5c6d0c5235e2f3d/raw/facb1aab243a33033b46657378f65dcd41542596/business.csv')
>>> df['name'].value_counts().head(6)
name
Peet's Coffee & Tea 20
Starbucks Coffee 13
McDonald's 10
Jamba Juice 10
STARBUCKS 9
Proper Food 9
Name: count, dtype: int64
>>> df.value_counts('name').head(6)
name
Peet's Coffee & Tea 20
Starbucks Coffee 13
McDonald's 10
Jamba Juice 10
Proper Food 9
STARBUCKS 9
Name: count, dtype: int64
Of course, both of these orders are valid, given a non-stable default sort (quicksort), but it's difficult to see why they would differ, given that the default sort method appears to be the same in both cases.
Solution
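The ordering differs because the two methods use different strategies.
To compute value_counts for a DataFrame, pandas uses groupby followed by size, and groupby sorts the group keys in lexicographic order by default (sort=True). The counts are therefore in alphabetical key order before they are sorted by value.
A Series computes value_counts more directly: Series uses IndexOpsMixin.value_counts, which calls pandas.core.algorithms.value_counts_internal, with no groupby key sort involved. The two final sorts therefore start from differently ordered input, so ties can come out in a different order.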
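To make the difference in starting order visible, here is a minimal sketch on a small frame. The data is a hypothetical toy example, not the business.csv file from the question, and it only shows the order of the group keys before any sort by count is applied.

```python
import pandas as pd

# Hypothetical toy data, not the business.csv file from the question.
df = pd.DataFrame({"name": ["pear", "apple", "pear", "fig", "apple", "kiwi"]})

# Default groupby (sort=True): group keys come back in lexicographic order.
sorted_keys = df.groupby("name").size()

# sort=False: groups keep the order in which they first appear in the data.
appearance_keys = df.groupby("name", sort=False).size()

print(list(sorted_keys.index))      # ['apple', 'fig', 'kiwi', 'pear']
print(list(appearance_keys.index))  # ['pear', 'apple', 'fig', 'kiwi']

# The counts themselves agree between the two methods;
# only the order among tied counts can differ.
same_counts = (
    df.value_counts("name").to_dict() == df["name"].value_counts().to_dict()
)
print(same_counts)  # True
```

The two intermediate results hold the same counts in a different order, which is why a later non-stable sort by count may break ties differently.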
So to get the same result as a Series, use:
>>> df.groupby('name', sort=False).size().sort_values(ascending=False).head(10)
name
Peet's Coffee & Tea 20
Starbucks Coffee 13
McDonald's 10
Jamba Juice 10
STARBUCKS 9
Proper Food 9
Mixt Greens/Mixt 8
Specialty's Cafe & Bakery 8
Philz Coffee 7
The Organic Coup 7
dtype: int64
>>> df['name'].value_counts().head(10)
name
Peet's Coffee & Tea 20
Starbucks Coffee 13
McDonald's 10
Jamba Juice 10
STARBUCKS 9
Proper Food 9
Mixt Greens/Mixt 8
Specialty's Cafe & Bakery 8
Philz Coffee 7
The Organic Coup 7
Name: count, dtype: int64
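If a reproducible tie order matters more than matching Series.value_counts exactly, one option is to ask sort_values for a stable algorithm. The following is a sketch under that assumption, again on hypothetical toy data; it is not what either value_counts method does internally.

```python
import pandas as pd

# Hypothetical toy data: "pear" and "apple" tie at 2, "fig" and "kiwi" tie at 1.
df = pd.DataFrame({"name": ["pear", "apple", "pear", "fig", "apple", "kiwi"]})

# Count without sorting the group keys, then sort by count with a stable
# algorithm so that tied names keep their order of first appearance.
counts = (
    df.groupby("name", sort=False)
    .size()
    .sort_values(ascending=False, kind="stable")
)

print(counts)
```

With kind="stable", rows with equal counts stay in the order they had before the sort, so the result no longer depends on how quicksort happens to break ties.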
Answered By - Corralien