Issue
I am trying to produce the following plot, that was made in excel, by using pandas.
A lot of plotting at at work is done using excel and it is fiddly and tedious to get the data into the required format. I want to use pandas, but my bosses want to see the exact same (or very close) plots being produced.
I normally use seaborn for box plots and find it very convenient but I need to show more percentiles (5th, 10th, 25th, 50th, 75th, 90th, and 95th) as shown on the figure legend.
I know that seaborn/matplotlib lets me change the whiskers extent using whis=[10,90], and I can use showmean=True, but that leaves the other markers (95th and 5th percentiles) to add to each plot. How to overlay those?
I have my data grouped as I want, and I can extract the percentiles using .describe() as below
pcntls=assay.groupby(['LocalSTRAT']).describe(percentiles=[0.1,0.05,0.25,0.5,0.75,0.9,0.95])
and transform which gives me this:
LocalSTRAT AB CBC CLB LAB LBB LBL MS TB TBL UAB UBB
count 982.000000 234.000000 159.000000 530.000000 1136.000000 72.000000 267.000000 1741.000000 16.000000 1641.000000 2099.000000
mean 0.687658 1.410962 0.118302 0.211321 0.110251 0.077917 0.766124 0.262648 0.191875 0.119174 1.320357
std 0.814027 0.855342 0.148397 0.286574 0.146550 0.088921 0.647259 0.309134 0.125497 0.207197 1.393613
min 0.005000 0.005000 0.020000 0.005000 0.005000 0.010000 0.005000 0.005000 0.060000 0.005000 0.005000
5% 0.030000 0.196500 0.030000 0.020000 0.020000 0.020000 0.060000 0.020000 0.067500 0.005000 0.170000
10% 0.050000 0.363000 0.038000 0.020000 0.020000 0.021000 0.096000 0.030000 0.070000 0.020000 0.230000
25% 0.130000 0.825000 0.045000 0.050000 0.030000 0.030000 0.225000 0.050000 0.077500 0.030000 0.450000
50% 0.400000 1.260000 0.070000 0.120000 0.050000 0.050000 0.610000 0.150000 0.175000 0.060000 0.940000
75% 0.950000 1.947500 0.140000 0.250000 0.120000 0.072500 1.120000 0.350000 0.257500 0.130000 1.570000
90% 1.720000 2.411000 0.262000 0.520000 0.265000 0.149000 1.624000 0.640000 0.340000 0.250000 2.770000
95% 2.370000 2.967500 0.322000 0.685500 0.390000 0.237000 2.037000 0.880000 0.390000 0.410000 4.322000
max 7.040000 5.070000 1.510000 2.620000 1.450000 0.580000 3.530000 2.390000 0.480000 4.190000 11.600000
I am stuck with how to begin to construct boxplots from scratch using this output.
I think it's easier to construct some boxplots in the normal way, and then add the extra couple of data points (5th and 95th percentile markers) over the top but can't figure out how to do this.
(Bonus points for a way to make a legend like the one shown or how to insert an image file of this into my plot, and get the log-style gridlines, and include the counts in the x axis!)
Solution
Just overlay a scatter plot using the percentiles extracted from the .describe() output, remembering to sort both to make sure the order doesn't get mixed up. The legend was made as an image externally and inserted separately.
Counts were calculated and added using plt.text().
Logarithmic gridlines applied using plt.grid(True, which='both')
and setting the axis to log.
code and result below.
import pandas as pd
import seaborn as sns
import matplotlib
import matplotlib.pyplot as plt
pathx = r"C:\boxplots2.xlsx"
pathx = pathx.replace( "\\", "/")#avoid escape character issues
#print pathx
#pathx = pathx[1:len(pathx)-1]
df=pd.read_excel(pathx)
#this line removes missing data rows (where the strat is not specified)
df=df[df["STRAT"]!=0]
assay=df
factor_to_plot='Total %S'
f=factor_to_plot
x_axis_factor='STRAT'
g=x_axis_factor
pcntls=assay.groupby([g]).describe(percentiles=[0.05,0.1,0.25,0.5,0.75,0.9,0.95])
sumry= pcntls[f].T
#print sumry
ordered=sorted(assay[g].dropna().unique())
#set figure size and scale text
plt.rcParams['figure.figsize']=(15,10)
text_scaling=1.9
sns.set(style="whitegrid")
sns.set_context("paper", font_scale=text_scaling)
#plot boxplot
ax=sns.boxplot(x=assay[g],y=assay[f],width=0.5,order=ordered, whis=[10,90],data=assay, showfliers=False,color='lightblue',
showmeans=True,meanprops={"marker":"x","markersize":12,"markerfacecolor":"white", "markeredgecolor":"black"})
plt.axhline(0.3, color='green',linestyle='dashed', label="S%=0.3")
#this line sets the scale to logarithmic
ax.set_yscale('log')
leg= plt.legend(markerscale=1.5,bbox_to_anchor=(1.0, 0.5) )#,bbox_to_anchor=(1.0, 0.5)
#plt.title("Assay data")
plt.grid(True, which='both')
ax.scatter(x=sorted(list(sumry.columns.values)),y=sumry.loc['5%'],s=120,color='white',edgecolor='black')
ax.scatter(x=sorted(list(sumry.columns.values)),y=sumry.loc['95%'],s=120,color='white',edgecolor='black')
#add legend image
img = plt.imread("legend.jpg")
plt.figimage(img, 1900,900, zorder=1, alpha=1)
#next line is important, select a column that has no blanks or nans as the total items are counted.
assay['value']=assay['From']
vals=assay.groupby([g])['value'].count()
j=vals
ymin, ymax = ax.get_ylim()
xmin, xmax = ax.get_xlim()
#print ymax
#put n= values at top of plot
x=0
for i in range(len(j)):
plt.text(x = x , y = ymax+0.2, s = "N=\n" +str(int(j[i])),horizontalalignment='center')
#plt.text(x = x , y = 102.75, s = "n=",horizontalalignment='center')
x+=1
#use the section below to adjust the y axis lable format to avoid default of 10^0 etc for log scale plots.
ylabels = ['{:.1f}'.format(y) for y in ax.get_yticks()]
ax.set_yticklabels(ylabels)
Which gives:
Answered By - flashliquid
0 comments:
Post a Comment
Note: Only a member of this blog may post a comment.