Sunday, February 6, 2022

[FIXED] Finding the largest (N) proportion of percentage in pandas dataframe

Tags: dataframe, numpy, pandas, python-3.x

Issue

Suppose I have the following df:

df = pd.DataFrame({'name': ['Sara', 'John', 'Christine', 'Paul', 'Jo', 'Zack', 'Chris', 'Mathew', 'Suzan'],
                   'visits': [0, 0, 1, 2, 3, 9, 6, 10, 3]})

df

which looks like:

    name       visits
0   Sara            0
1   John            0
2   Christine       1
3   Paul            2
4   Jo              3
5   Zack            9
6   Chris           6
7   Mathew         10
8   Suzan           3

I wrote a few lines of code to compute each name's share of the total visits and sort the rows in descending order:

df['percent'] = (df['visits'] / np.sum(df['visits']))
df.sort_values(by='percent', ascending=False).reset_index(drop=True)

Now I have each name's visits as a fraction of the total visits across all names:

    name       visits   percent
0   Mathew         10  0.294118
1   Zack            9  0.264706
2   Chris           6  0.176471
3   Jo              3  0.088235
4   Suzan           3  0.088235
5   Paul            2  0.058824
6   Christine       1  0.029412
7   Sara            0  0.000000
8   John            0  0.000000

What I need is the group of names that accounts for the largest share of visits. For example, the first 3 rows represent ~73% of the total visits, and the remaining rows are negligible compared with the combined share of those first 3 rows.

I know I can select the top 3 by using nlargest:

df.nlargest(3, 'percent')

But the data varies a lot, and the dominant group could be the first 2 rows, the first 3, or even more.

EDIT:

How can I automatically find the N rows that make up the largest proportion of the total, without fixing N in advance?

Solution

You have to define what counts as an outlier in some way. One way is to use scipy.stats.zscore, like in this answer, and keep the rows whose z-score is above a cutoff you choose (0.6 here):

import pandas as pd
import numpy as np
from scipy import stats

df = pd.DataFrame({'name': ['Sara', 'John', 'Christine', 'Paul', 'Jo', 'Zack', 'Chris', 'Mathew', 'Suzan'],
                   'visits': [0, 0, 1, 2, 3, 9, 6, 10, 3]})

# Each name's share of the total visits
df['percent'] = df['visits'] / np.sum(df['visits'])
# Keep only the rows whose z-score exceeds the chosen cutoff of 0.6
df[stats.zscore(df['percent']) > 0.6]

which prints

     name  visits   percent
5    Zack       9  0.264706
6   Chris       6  0.176471
7  Mathew      10  0.294118
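
The 0.6 cutoff above is a judgment call. A different technique (not the one used in this answer) that matches the "first N rows cover ~73%" wording in the question is to sort by share, take the cumulative sum, and keep rows until a target share is reached. The sketch below assumes a target of 0.7; the helper name top_share and its parameters are hypothetical and only for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Sara', 'John', 'Christine', 'Paul', 'Jo', 'Zack', 'Chris', 'Mathew', 'Suzan'],
    'visits': [0, 0, 1, 2, 3, 9, 6, 10, 3],
})
df['percent'] = df['visits'] / df['visits'].sum()


def top_share(frame, target=0.7, col='percent'):
    """Return the top rows whose cumulative share first reaches `target`.

    Hypothetical helper; `target` is an assumed threshold, not a value
    taken from the original answer.
    """
    ranked = frame.sort_values(col, ascending=False).reset_index(drop=True)
    cumulative = ranked[col].cumsum()
    # Rows still below the target, plus the one row that crosses it
    n = int((cumulative < target).sum()) + 1
    return ranked.head(n)


top = top_share(df, target=0.7)
print(top)
```

With this data the cumulative shares of the top rows are roughly 0.29, 0.56 and 0.74, so a target of 0.7 keeps Mathew, Zack and Chris. Raising or lowering the target changes N automatically, which is the part nlargest cannot do on its own.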

Answered By - user1717828

This answer was collected from Stack Overflow and tested by PythonFixing community admins. It is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
