Wednesday, December 1, 2021

[FIXED] How to plot boxplots for two groups of data

Labels: boxplot, matplotlib, pandas, python, seaborn

Issue

I am plotting two different box plots with pandas with this:

# medianprops is assumed to be defined earlier in the script (not shown in the question)
plt.figure()
df['mean_train_score_error'] = 1 - df['mean_train_score']
df.boxplot(column=['mean_train_score_error'], by='modelo',
           medianprops=medianprops,
           autorange=True, showfliers=False, patch_artist=True,
           vert=True, showmeans=True, meanline=True)
plt.ylabel('Error: 1-F1 Score')
plt.title('Error de entrenamiento')  # "Training error"
plt.suptitle('')

df['mean_test_score_error'] = 1 - df['mean_test_score']
df.boxplot(column=['mean_test_score_error'], by='modelo',
           medianprops=medianprops,
           autorange=True, showfliers=False, patch_artist=True,
           vert=True, showmeans=True, meanline=True)
plt.ylabel('Error: 1-F1 Score')
plt.title('Error de validación')  # "Validation error"
plt.suptitle('')

And I am getting the following two plots:

Is it possible to draw all six boxplots on the same plot, using one color for the three training-error boxplots and a different color for the three validation-error boxplots?

Solution

The easiest way to do this is to transform the data from wide to long format, and then plot with seaborn, using the hue parameter.
pandas.wide_to_long
- There must be a unique id for each row, hence the added id column.
- The columns being transformed must share a common stubname, which is why I moved error to the front of the column names.
  - The error column names will end up in one column and their values in a separate column.
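
To make these requirements concrete, here is a minimal sketch of pandas.wide_to_long on a toy frame. The frame, its values, and the names wide and long_df are hypothetical and only serve as illustration; they are not the question's data.

```python
import pandas as pd

# Hypothetical toy data: two columns share the stubname 'error'
wide = pd.DataFrame({
    'id': [0, 1],                  # unique id per row, required by wide_to_long
    'modelo': ['lg', 'rf'],
    'error_train': [0.10, 0.20],
    'error_test': [0.30, 0.40],
})

# stubnames: the shared prefix; sep: the separator between stub and suffix;
# suffix: a regex for the (non-numeric) suffix; j: name of the new suffix column
long_df = pd.wide_to_long(wide, stubnames='error', i='id', j='score',
                          sep='_', suffix=r'\D+').reset_index()

print(long_df)
```

Each original row now appears once per suffix: the score column holds the suffix (train or test) and the error column holds the matching value, which is the shape seaborn's hue parameter expects.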

Imports and Test Data

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# setup data and dataframe
np.random.seed(365)
data = {'mod_lg': np.random.normal(0.3, .1, size=(30,)),
        'mod_rf': np.random.normal(0.05, .01, size=(30,)),
        'mod_bg': np.random.normal(0.02, 0.002, size=(30,)),
        'mean_train_score': np.random.normal(0.95, 0.3, size=(30,)),
        'mean_test_score': np.random.normal(0.86, 0.5, size=(30,))}

df = pd.DataFrame(data)
df['error_mean_test_score'] = 1 - df['mean_test_score']
df['error_mean_train_score'] = 1 - df['mean_train_score']
df["id"] = df.index

# raw string for the regex, so '\D' is not treated as an invalid string escape
df = pd.wide_to_long(df, stubnames='mod', i='id', j='mode', sep='_', suffix=r'\D+').reset_index()
df["id"] = df.index

# display dataframe: this is probably what your dataframe looks like to generate your current plots
   id mode  mean_train_score  error_mean_test_score  mean_test_score  error_mean_train_score       mod
0   0   lg          0.663855              -0.343961         1.343961                0.336145  0.316792
1   1   lg          0.990114               0.472847         0.527153                0.009886  0.352351
2   2   lg          1.179775               0.324748         0.675252               -0.179775  0.381738
3   3   lg          0.693155               0.519526         0.480474                0.306845  0.470385
4   4   lg          1.191048              -0.128033         1.128033               -0.191048  0.085305

Transform and plot

The error_score_name column contains the suffixes from error_mean_test_score and error_mean_train_score (that is, mean_test_score and mean_train_score).
The error_score_value column contains the corresponding values.

# convert df error columns to long format
dfl = pd.wide_to_long(df, stubnames='error', i='id', j='score', sep='_', suffix=r'\D+').reset_index(level=1)
dfl.rename(columns={'score': 'error_score_name', 'error': 'error_score_value'}, inplace=True)

# display dfl

   error_score_name  mean_train_score       mod  mean_test_score mode  error_score_value
id                                                                                      
0   mean_test_score          0.663855  0.316792         1.343961   lg          -0.343961
1   mean_test_score          0.990114  0.352351         0.527153   lg           0.472847
2   mean_test_score          1.179775  0.381738         0.675252   lg           0.324748
3   mean_test_score          0.693155  0.470385         0.480474   lg           0.519526
4   mean_test_score          1.191048  0.085305         1.128033   lg          -0.128033

# plot dfl
sns.boxplot(x='mode', y='error_score_value', data=dfl, hue='error_score_name')
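
The question also asks for a different color per group and uses options such as showmeans. The following is a sketch, not part of the original answer, of how that could be done with the palette parameter and by passing matplotlib boxplot options through sns.boxplot. The demo frame, its random values, and the chosen colors are assumptions for illustration; with the real data, dfl would be passed instead.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

rng = np.random.default_rng(365)

# Hypothetical long-format data shaped like dfl: 3 models x 2 error types
demo = pd.DataFrame({
    'mode': np.repeat(['lg', 'rf', 'bg'], 20),
    'error_score_name': np.tile(['mean_train_score', 'mean_test_score'], 30),
    'error_score_value': rng.normal(0.2, 0.05, size=60),
})

# One color per hue level (colors chosen arbitrarily)
colors = {'mean_train_score': 'tab:blue', 'mean_test_score': 'tab:orange'}

fig, ax = plt.subplots(figsize=(8, 5))
# showfliers, showmeans and meanline are forwarded to matplotlib's boxplot
sns.boxplot(x='mode', y='error_score_value', hue='error_score_name',
            data=demo, palette=colors,
            showfliers=False, showmeans=True, meanline=True, ax=ax)
ax.set_xlabel('modelo')
ax.set_ylabel('Error: 1-F1 Score')
```

Passing ax explicitly keeps all six boxplots on a single Axes, so the labels from the original two figures can be set once.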

Answered By - Trenton McKinney

This answer was collected from Stack Overflow and tested by PythonFixing community admins. It is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
