Monday, November 27, 2023

[FIXED] How to customise pandas box and whisker plot with additional percentiles?

November 27, 2023 boxplot, matplotlib, pandas, seaborn No comments

Issue

I am trying to produce the following plot, that was made in excel, by using pandas.

A lot of plotting at at work is done using excel and it is fiddly and tedious to get the data into the required format. I want to use pandas, but my bosses want to see the exact same (or very close) plots being produced.

I normally use seaborn for box plots and find it very convenient but I need to show more percentiles (5th, 10th, 25th, 50th, 75th, 90th, and 95th) as shown on the figure legend.

I know that seaborn/matplotlib lets me change the whiskers extent using whis=[10,90], and I can use showmean=True, but that leaves the other markers (95th and 5th percentiles) to add to each plot. How to overlay those?

I have my data grouped as I want, and I can extract the percentiles using .describe() as below pcntls=assay.groupby(['LocalSTRAT']).describe(percentiles=[0.1,0.05,0.25,0.5,0.75,0.9,0.95])

and transform which gives me this:

LocalSTRAT  AB  CBC     CLB     LAB     LBB     LBL     MS  TB  TBL     UAB     UBB
count   982.000000  234.000000  159.000000  530.000000  1136.000000     72.000000   267.000000  1741.000000     16.000000   1641.000000     2099.000000
mean    0.687658    1.410962    0.118302    0.211321    0.110251    0.077917    0.766124    0.262648    0.191875    0.119174    1.320357
std     0.814027    0.855342    0.148397    0.286574    0.146550    0.088921    0.647259    0.309134    0.125497    0.207197    1.393613
min     0.005000    0.005000    0.020000    0.005000    0.005000    0.010000    0.005000    0.005000    0.060000    0.005000    0.005000
5%  0.030000    0.196500    0.030000    0.020000    0.020000    0.020000    0.060000    0.020000    0.067500    0.005000    0.170000
10%     0.050000    0.363000    0.038000    0.020000    0.020000    0.021000    0.096000    0.030000    0.070000    0.020000    0.230000
25%     0.130000    0.825000    0.045000    0.050000    0.030000    0.030000    0.225000    0.050000    0.077500    0.030000    0.450000
50%     0.400000    1.260000    0.070000    0.120000    0.050000    0.050000    0.610000    0.150000    0.175000    0.060000    0.940000
75%     0.950000    1.947500    0.140000    0.250000    0.120000    0.072500    1.120000    0.350000    0.257500    0.130000    1.570000
90%     1.720000    2.411000    0.262000    0.520000    0.265000    0.149000    1.624000    0.640000    0.340000    0.250000    2.770000
95%     2.370000    2.967500    0.322000    0.685500    0.390000    0.237000    2.037000    0.880000    0.390000    0.410000    4.322000
max     7.040000    5.070000    1.510000    2.620000    1.450000    0.580000    3.530000    2.390000    0.480000    4.190000    11.600000

I am stuck with how to begin to construct boxplots from scratch using this output.

I think it's easier to construct some boxplots in the normal way, and then add the extra couple of data points (5th and 95th percentile markers) over the top but can't figure out how to do this.

(Bonus points for a way to make a legend like the one shown or how to insert an image file of this into my plot, and get the log-style gridlines, and include the counts in the x axis!)

Solution

Just overlay a scatter plot using the percentiles extracted from the .describe() output, remembering to sort both to make sure the order doesn't get mixed up. The legend was made as an image externally and inserted separately.

Counts were calculated and added using plt.text().

Logarithmic gridlines applied using plt.grid(True, which='both') and setting the axis to log.

code and result below.

import pandas as pd
import seaborn as sns
import matplotlib
import matplotlib.pyplot as plt


pathx = r"C:\boxplots2.xlsx"

pathx =  pathx.replace( "\\", "/")#avoid escape character issues
#print pathx
#pathx = pathx[1:len(pathx)-1]
df=pd.read_excel(pathx)

#this line removes missing data rows (where the strat is not specified)
df=df[df["STRAT"]!=0]



assay=df

factor_to_plot='Total %S'
f=factor_to_plot

x_axis_factor='STRAT'
g=x_axis_factor

pcntls=assay.groupby([g]).describe(percentiles=[0.05,0.1,0.25,0.5,0.75,0.9,0.95])
sumry= pcntls[f].T
#print sumry
ordered=sorted(assay[g].dropna().unique())

#set figure size and scale text
plt.rcParams['figure.figsize']=(15,10)
text_scaling=1.9
sns.set(style="whitegrid")
sns.set_context("paper", font_scale=text_scaling) 

#plot boxplot
ax=sns.boxplot(x=assay[g],y=assay[f],width=0.5,order=ordered, whis=[10,90],data=assay, showfliers=False,color='lightblue', 
            showmeans=True,meanprops={"marker":"x","markersize":12,"markerfacecolor":"white", "markeredgecolor":"black"})

plt.axhline(0.3, color='green',linestyle='dashed', label="S%=0.3")

#this line sets the scale to logarithmic
ax.set_yscale('log')

leg= plt.legend(markerscale=1.5,bbox_to_anchor=(1.0, 0.5) )#,bbox_to_anchor=(1.0, 0.5)
#plt.title("Assay data")
plt.grid(True, which='both')
ax.scatter(x=sorted(list(sumry.columns.values)),y=sumry.loc['5%'],s=120,color='white',edgecolor='black') 
ax.scatter(x=sorted(list(sumry.columns.values)),y=sumry.loc['95%'],s=120,color='white',edgecolor='black')


#add legend image
img = plt.imread("legend.jpg")
plt.figimage(img, 1900,900, zorder=1, alpha=1)


#next line is important, select a column that has no blanks or nans as the total items are counted. 
assay['value']=assay['From']

vals=assay.groupby([g])['value'].count()
j=vals

ymin, ymax = ax.get_ylim()
xmin, xmax = ax.get_xlim()
#print ymax

#put n= values at top of plot    
x=0
for i in range(len(j)):

    plt.text(x = x , y = ymax+0.2, s = "N=\n" +str(int(j[i])),horizontalalignment='center')
    #plt.text(x = x , y = 102.75, s = "n=",horizontalalignment='center')
    x+=1





#use the section below to adjust the y axis lable format to avoid default of 10^0 etc for log scale plots.
ylabels = ['{:.1f}'.format(y) for y in ax.get_yticks()]
ax.set_yticklabels(ylabels)

Which gives:

Answered By - flashliquid

This Answer collected from stackoverflow and tested by PythonFixing community admins, is licensed under cc by-sa 2.5 , cc by-sa 3.0 and cc by-sa 4.0

Monday, November 27, 2023

[FIXED] How to customise pandas box and whisker plot with additional percentiles?

Issue

Solution

0 comments:

Post a Comment

Popular Posts

Labels