Issue
I'm using pandas to count the different types of errors and correct predictions for different (machine learning) models, in order to display confusion matrices.
A particular order of the prediction and ground truth labels makes sense, for example by putting the majority class 'B' first.
However, when I sort using pd.DataFrame.sort_index, the other index levels are also permuted. I'd like to sort the second level per unique value of the first index level.
import numpy as np
import pandas as pd

errors = pd.DataFrame([
    {'model': model, 'ground truth': ground_truth, 'prediction': prediction,
     'count': np.random.randint(0, (10000 if prediction=='B' else 1000) if prediction==ground_truth else 100)}
    for model in ['foo', 'bar']
    for prediction in 'ABC'
    for ground_truth in 'ABC'
])

def sort_index(index):
    # Map each label to its position in the desired order B, C, A
    return index.map('BCA'.index)

errors.pivot(
    index=['model', 'ground truth'],
    columns=['prediction'],
    values='count'
).fillna(0).astype(int).sort_index(level=1, key=sort_index)[['B', 'C', 'A']]
One solution is to sort by all earlier index levels as well, but it's quite verbose. It's silly to have one key function applied to every level, as if they were all semantically the same. Moreover, this also rearranges the order of the models, which isn't necessarily needed. Finally, it wastes compute in two ways: sorting smaller partitions is faster since sorting scales super-linearly, and element comparisons are slower when they consider more levels.
def sort_index(index):
    # Only remap the 'ground truth' level; leave other levels as they are
    if index.name == 'ground truth':
        return index.map('BCA'.index)
    return index

errors.pivot(
    index=['model', 'ground truth'],
    columns=['prediction'],
    values='count'
).fillna(0).astype(int).sort_index(level=[0, 1], key=sort_index)[['B', 'C', 'A']]
Is there a clean way to sort on a later index level while keeping the earlier levels grouped together?
Solution
You might want to use the reindex method.
Code:
import numpy as np
import pandas as pd
# Create a sample dataframe
errors = pd.DataFrame([
    {'model': model, 'ground truth': ground_truth, 'prediction': prediction,
     'count': np.random.randint(0, (10000 if prediction=='B' else 1000) if prediction==ground_truth else 100)}
    for model in ['foo', 'bar']
    for prediction in 'ABC'
    for ground_truth in 'ABC'
])

# Pivot and reindex the dataframe
errors.pivot(
    index=['model', 'ground truth'],
    columns=['prediction'],
    values='count'
).fillna(0).astype(int).reindex(['B', 'C', 'A'], level=1)[['B', 'C', 'A']]
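Since the sample data above is random, here is a small deterministic sketch of the same reindex call so the resulting row order is easy to check. The frame `table`, its values, and the `order` list are made up for illustration; only `reindex(..., level=1)` comes from the answer.

```python
import pandas as pd

# Desired label order, majority class first (illustrative)
order = ['B', 'C', 'A']

# Hypothetical stand-in for the pivoted confusion matrices
index = pd.MultiIndex.from_product(
    [['bar', 'foo'], ['A', 'B', 'C']],
    names=['model', 'ground truth'],
)
table = pd.DataFrame(
    {
        'A': [1, 2, 3, 4, 5, 6],
        'B': [10, 20, 30, 40, 50, 60],
        'C': [100, 200, 300, 400, 500, 600],
    },
    index=index,
)
table.columns.name = 'prediction'

# Reorder the second index level within each model, then reorder the columns
result = table.reindex(order, level=1)[order]
print(result)
# Row order: (bar, B), (bar, C), (bar, A), (foo, B), (foo, C), (foo, A)
```

Note that reindex selects labels rather than sorting them, so rows whose second-level label is missing from `order` are dropped from the result.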
Answered By - quasi-human