Issue
For a given column, the value_counts() function of pandas counts the number of occurrences of each value that this column takes. On the other hand, the unique() function returns the unique values that occur at least once.
To give an example, take the mushroom dataset in the UCI Repository.
When I list the unique values in a particular column
df["class"].unique()
I get the output:
array(['p', 'e'], dtype=object)
However, when I count the number of occurrences
df["class"].value_counts()
I get the output:
e 4208
p 3916
Name: class, dtype: int64
Here, we can observe that the order of the categories differs: the first output starts with 'p', whereas the second starts with 'e'. I do not understand why there is such a mismatch, as one would typically expect the same order for consistency. Is there an explanation for this, and is there a good practice for fixing it? What comes to mind first is to count the occurrences with value_counts() and then, instead of calling unique(), take the index of the result. Namely:
val_counts = df["class"].value_counts()
val_unique = np.array(val_counts.index)
val_unique
Output:
array(['e', 'p'], dtype=object)
Solution
pd.unique, np.unique, value_counts and groupby all have slightly different ordering rules. You can choose whichever one gives you the desired ordering:
import pandas as pd
import numpy as np
df = pd.DataFrame({'class': ['z', 'z', 'a', 'a', 'a', 'f', 'f', 'f', 'a', 'f', 'f']})
pd.unique does not sort; the output is ordered by first appearance:
df['class'].unique()
#array(['z', 'a', 'f'], dtype=object)
np.unique sorts the values:
np.unique(df['class'])
#array(['a', 'f', 'z'], dtype=object)
value_counts sorts by descending count by default; with sort=False the output follows the order in which the values occur instead:
df['class'].value_counts()
#f 5
#a 4
#z 2
#Name: class, dtype: int64
df['class'].value_counts(sort=False)
#z 2
#a 4
#f 5
#Name: class, dtype: int64
groupby + size sorts by label by default; with sort=False the groups are ordered by the first occurrence of each label:
# Sorts output based on grouping keys (i.e. labels)
df.groupby('class').size()
#class
#a 4
#f 5
#z 2
#dtype: int64
# Output ordered by occurrence of grouping keys
df.groupby('class', sort=False).size()
#class
#z 2
#a 4
#f 5
#dtype: int64
In your case, you want either value_counts with sort=False, or groupby + size with sort=False.
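If the goal is simply to have the counts line up with whatever unique() returns, one further option (a sketch that is not part of the original answer) is to reorder the result of value_counts() with reindex, using the output of unique() as the target order. The variable names labels and counts are illustrative only, and the example reuses the toy DataFrame from above.

```python
import pandas as pd

# Same toy data as in the answer above
df = pd.DataFrame({'class': ['z', 'z', 'a', 'a', 'a', 'f', 'f', 'f', 'a', 'f', 'f']})

# Labels in order of first appearance, as returned by unique()
labels = df['class'].unique()

# Count occurrences, then reorder the counts to match the unique() order
counts = df['class'].value_counts().reindex(labels)

print(list(counts.index))  # ['z', 'a', 'f']
print(counts.tolist())     # [2, 4, 5]
```

Because the order is taken explicitly from unique(), this does not rely on how value_counts itself orders its output.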
Answered By - ALollz