Saturday, March 5, 2022

[FIXED] Sort index at a level and per partitioning by earlier levels

Labels: dataframe, pandas, python-3.7

Issue

I'm using pandas to count the different types of errors and correct predictions for different (machine learning) models, in order to display confusion matrices.

A particular order of the prediction and ground truth labels makes sense, for example by putting the majority class 'B' first.

However, when I sort using pd.DataFrame.sort_index, the other index levels are also permuted. I'd like to sort the second level separately within each unique value of the first level.

import numpy as np
import pandas as pd

# Random counts per (model, prediction, ground truth); correct predictions get larger counts
errors = pd.DataFrame([
  {'model': model, 'ground truth': ground_truth, 'prediction': prediction,
   'count': np.random.randint(0, (10000 if prediction=='B' else 1000) if prediction==ground_truth else 100)}
  for model in ['foo', 'bar']
  for prediction in 'ABC'
  for ground_truth in 'ABC'
])

def sort_index(index):
  # Map each label to its position in the desired order B, C, A
  return index.map('BCA'.index)

errors.pivot(
  index=['model', 'ground truth'],
  columns=['prediction'],
  values='count'
).fillna(0).astype(int).sort_index(level=1, key=sort_index)[['B', 'C', 'A']]
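To make the problem reproducible, here is a minimal sketch that uses fixed, made-up counts in place of the random ones. The names rows, table and label_rank are illustrative and not part of the original code. It shows that sorting on level 1 alone makes 'ground truth' the primary sort key, so the rows of the two models end up interleaved.

```python
import pandas as pd

# Hypothetical fixed counts (the original draws random ones) so the result is reproducible
rows = [
    {'model': model, 'ground truth': gt, 'prediction': pred, 'count': 10 * i + j}
    for model in ['foo', 'bar']
    for i, pred in enumerate('ABC')
    for j, gt in enumerate('ABC')
]
table = pd.DataFrame(rows).pivot(
    index=['model', 'ground truth'], columns='prediction', values='count'
)

def label_rank(index):
    # Position of each label in the desired order B, C, A
    return index.map('BCA'.index)

# Sort on level 1 only: 'ground truth' becomes the primary key,
# and 'model' is only used to break ties
by_level_1 = table.sort_index(level=1, key=label_rank)
problem_order = list(by_level_1.index)
print(problem_order)
```

The printed list alternates between 'bar' and 'foo' for each label, which is the permutation of the first level described above.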

One solution is to also sort by all earlier index levels, but that is quite verbose. It's silly to have a single key function applied to every level, as if they were all semantically the same. It also reorders the models, which isn't necessarily wanted. Finally, it wastes compute in two ways: sorting smaller partitions is faster because sorting scales super-linearly, and element comparisons are slower when they have to consider more levels.

def sort_index(index):
  if index.name == 'ground truth':
    return index.map('BCA'.index)
  return index

errors.pivot(
  index=['model', 'ground truth'],
  columns=['prediction'],
  values='count'
).fillna(0).astype(int).sort_index(level=[0, 1], key=sort_index)[['B', 'C', 'A']]

Is there a clean way to sort on a later index level while keeping the rows of each earlier level grouped together?

Solution

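You might want to use the reindex method with its level argument, which reorders the labels of one index level without needing a sort key for the other levels.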

Code:

import numpy as np
import pandas as pd

# Create a sample dataframe
errors = pd.DataFrame([
  {'model': model, 'ground truth': ground_truth, 'prediction': prediction,
   'count': np.random.randint(0, (10000 if prediction=='B' else 1000) if prediction==ground_truth else 100)}
  for model in ['foo', 'bar']
  for prediction in 'ABC'
  for ground_truth in 'ABC'
])

# Pivot and reindex the dataframe
errors.pivot(
  index=['model', 'ground truth'],
  columns=['prediction'],
  values='count'
).fillna(0).astype(int).reindex(['B', 'C', 'A'], level=1)[['B', 'C', 'A']]

Output:
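The counts above are random, so the exact numbers differ between runs. To check the row order independently of the numbers, here is a minimal sketch with fixed, made-up counts. The names rows, table, order and per_model are illustrative. As a cross-check it also builds the same table with an explicit sort of each model's block, which is not part of the answer and is only there for comparison.

```python
import pandas as pd

# Hypothetical fixed counts so the row order is reproducible
rows = [
    {'model': model, 'ground truth': gt, 'prediction': pred, 'count': 10 * i + j}
    for model in ['foo', 'bar']
    for i, pred in enumerate('ABC')
    for j, gt in enumerate('ABC')
]
table = pd.DataFrame(rows).pivot(
    index=['model', 'ground truth'], columns='prediction', values='count'
)

order = ['B', 'C', 'A']

# Approach from the answer: reorder the labels of level 1 only
reindexed = table.reindex(order, level=1)[order]

# Comparison (an assumption of this sketch): sort each model's block on its own
per_model = pd.concat([
    block.sort_index(level='ground truth', key=lambda idx: idx.map(order.index))
    for _, block in table.groupby(level='model', sort=False)
])[order]

reindexed_rows = list(reindexed.index)
per_model_rows = list(per_model.index)
print(reindexed)
```

Both tables list the rows of each model together, with the ground truth labels in the order B, C, A inside each model.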

Answered By - quasi-human

This answer was collected from Stack Overflow and tested by PythonFixing community admins. It is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
