Thursday, April 14, 2022

[FIXED] Pandas `value_counts()` and `unique()` result in different category orders

Issue

For a given column, the pandas function value_counts() counts the number of occurrences of each value in that column, whereas unique() returns the distinct values that occur at least once.

As an example, take the mushroom dataset from the UCI Machine Learning Repository.

When I list the unique values in a particular column

df["class"].unique()

I get the output:

array(['p', 'e'], dtype=object)

However, when I count the number of occurrences

df["class"].value_counts()

I get the output:

e    4208
p    3916
Name: class, dtype: int64

Here, the categories appear in different orders: the first output starts with 'p', whereas the second starts with 'e'. I do not understand why there is such a mismatch, since one would expect the same order for consistency. Is there an explanation for this, and is there a good practice for fixing it? My first thought is to count the occurrences with value_counts() and then, instead of calling unique(), take the index of the result. Namely:

val_counts = df["class"].value_counts()
val_unique = np.array(val_counts.index)
val_unique

Output:

array(['e', 'p'], dtype=object)

Solution

pd.unique, np.unique, value_counts and groupby each have slightly different ordering rules. Choose whichever one gives the ordering you want.

import pandas as pd
import numpy as np

df = pd.DataFrame({'class': ['z', 'z', 'a', 'a', 'a', 'f', 'f', 'f', 'a', 'f', 'f']})

`pd.unique`

Does not sort; the output is ordered by first appearance.

df['class'].unique()
#array(['z', 'a', 'f'], dtype=object)

`np.unique`

Sorts the values.

np.unique(df['class'])
#array(['a', 'f', 'z'], dtype=object)
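If counts are needed in the same sorted order, one option is to let np.unique return them directly through its return_counts argument. This is only a sketch on the toy DataFrame from this answer; the names labels, counts and sorted_counts are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'class': ['z', 'z', 'a', 'a', 'a', 'f', 'f', 'f', 'a', 'f', 'f']})

# return_counts=True returns the sorted unique values and a parallel array of counts
labels, counts = np.unique(df['class'].to_numpy(), return_counts=True)

# Pair each label with its count; the pairs follow the sorted label order
sorted_counts = dict(zip(labels.tolist(), counts.tolist()))
print(sorted_counts)
```

Because both arrays come from the same call, the labels and counts are guaranteed to line up.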

`value_counts`

Sorts by descending count by default; pass sort=False to order by first appearance instead.

df['class'].value_counts()
#f    5
#a    4
#z    2
#Name: class, dtype: int64

df['class'].value_counts(sort=False)
#z    2
#a    4
#f    5
#Name: class, dtype: int64
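A third ordering, by label, can be obtained by sorting the result on its index. The snippet below is a small sketch on the same toy DataFrame; counts_by_label is an illustrative name.

```python
import pandas as pd

df = pd.DataFrame({'class': ['z', 'z', 'a', 'a', 'a', 'f', 'f', 'f', 'a', 'f', 'f']})

# Count first, then reorder the result by its index (the category labels)
counts_by_label = df['class'].value_counts().sort_index()

print(counts_by_label.index.tolist())
print(counts_by_label.tolist())
```

This gives the same label order as np.unique while keeping the counts attached to their labels.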

`groupby` + `size`

Sorts by group label by default; pass sort=False to order the groups by the first appearance of each label.

# Sorts output based on grouping keys (i.e. labels)
df.groupby('class').size()
#class
#a    4
#f    5
#z    2
#dtype: int64

# Output ordered by occurrence of grouping keys
df.groupby('class', sort=False).size()
#class
#z    2
#a    4
#f    5
#dtype: int64

In your case, you want either value_counts with sort=False or groupby + size with sort=False.
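If the goal is simply to make the counts follow whatever order unique() reports, another option is to reindex the counts explicitly. The sketch below uses a small made-up Series as a stand-in for the mushroom "class" column (the values are hypothetical, not the real data); order and aligned_counts are illustrative names.

```python
import pandas as pd

# Hypothetical stand-in for df["class"]; not the real mushroom data
s = pd.Series(['p', 'e', 'e', 'p', 'e'], name='class')

# Order of first appearance, as reported by unique()
order = list(s.unique())

# Count with the default (descending) sort, then realign to the unique() order
aligned_counts = s.value_counts().reindex(order)

print(order)
print(aligned_counts.tolist())
```

Reindexing makes the intended order explicit in the code rather than relying on the default behaviour of either function.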

Answered By - ALollz

This answer, collected from Stack Overflow and tested by PythonFixing community admins, is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
