Issue
I am running a multi-label prediction model. As a performance measure, I am checking whether the top N predictions from my model contain the real cases where y=1.
For example, if my model's top predictions for a data point are Yellow(90%), Green(80%), Red(75%) while reality is Green and Red, I count it as a "Correct" prediction, while a measure such as (Exact) accuracy would count it as incorrect.
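The example above can be spelled out with plain Python sets. This is only an illustrative sketch: the names `top_predictions`, `reality`, `covered` and `exact_match` are hypothetical and do not appear in the implementation.

```python
# Top predictions for one data point (label -> predicted probability)
top_predictions = {"Yellow": 0.90, "Green": 0.80, "Red": 0.75}
# Labels that are really present (y == 1) for that data point
reality = {"Green", "Red"}

# "Correct" under the top-N measure: every real label is among the top predictions
covered = reality <= set(top_predictions)
# Exact accuracy would require the two label sets to be identical
exact_match = reality == set(top_predictions)

print(covered, exact_match)
```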
Below is my implementation, which has a somewhat realistic example of large X and y matrices (with many columns). I need to find an implementation (or a completely different solution) which runs faster.
Reproducible example (which runs too slow, ~2 min) below:
from scipy.sparse import random
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
import time
np.random.seed(14)
## Generate sparse X, and y
X = random(100_000, 1000, density=0.01, format='csr')
y = pd.DataFrame(np.random.choice([0, 1], size=(100_000, 10)))
# Define no change as 0 in all rows
y['no_change'] = np.where(y.sum(axis=1) == 0, 1, 0)
dt = DecisionTreeClassifier(max_depth=15)
dt.fit(X, y)
# Print precise accuracy -- truth must precisely match prediction
print(f"Accuracy score (precise): {accuracy_score(y_true=y, y_pred=dt.predict(X=X)):.1%}")
# Get top n predictions based on probability (in case of equality keep all)
def top_n_preds(row, n_top):
    topcols = row[row > 0].nlargest(n=n_top, keep='all')
    top_colnames = topcols.index.tolist()
    return top_colnames
start = time.time()
# Retrieve probabilities of predictions
pred_probs = np.asarray(dt.predict_proba(X=X))
pred_probs = pd.DataFrame(pred_probs[:, :, 1].T, columns=y.columns)
# Find top 5 predictions
pred_probs['top_preds'] = pred_probs.apply(top_n_preds, axis=1, n_top=5)
# List all real changes in y
pred_probs['real_changes'] = y.apply(lambda row: row[row == 1].index.tolist(), axis=1)
# Check if real changes are contained in top 5 predictions
pred_probs['preds_cover_reality'] = pred_probs.apply(lambda row: set(row['real_changes']).issubset(set(row['top_preds'])), axis=1)
print(f"Accuracy present in top n_top predictions: {pred_probs['preds_cover_reality'].sum() / y.shape[0]:.1%}")
print(f"Time elapsed: {(time.time()-start)/60:.1f} minutes")
Solution
The three consecutive .apply calls produce significant overhead in your case.
To improve performance, I suggest making a single pass over the paired datasets: pred_probs and y.values == 1 (a boolean mask, computed once, that identifies the real_changes columns).
The other expensive and slow part is calling pandas.Series.nlargest on each row of pred_probs.
One might think it can be replaced by numpy.argpartition, but that is not entirely true. For some rows of pred_probs, the number of values left after filtering may be smaller than top_N, and that breaks the np.argpartition() call.
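A minimal sketch of that failure mode, assuming a made-up row with only two positive probabilities (the names `row`, `positive` and `k` are illustrative, not part of the solution):

```python
import numpy as np

top_n = 5
row = np.array([0.0, 0.7, 0.0, 0.2])  # hypothetical probabilities for one row
positive = row[row > 0]               # only 2 values survive the filter

try:
    # Asks for the 5 largest of 2 values: kth is out of bounds
    np.argpartition(positive, -top_n)
    raised = False
except ValueError:
    raised = True

# One possible guard: never ask for more elements than are available
k = min(top_n, positive.size)
top_idx = np.argpartition(positive, -k)[-k:] if k > 0 else np.array([], dtype=int)
print(raised, sorted(positive[top_idx]))
```

Even with the guard, this selects exactly k elements and does not keep ties.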
More importantly, the special case you use, Series.nlargest(n=top_N, keep='all'), keeps all tied values, so the result can contain more than top_N entries.
To imitate that behavior I use a combination of np.sort + np.in1d + np.where.
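How that combination keeps ties can be seen on a single made-up row. This sketch uses np.isin, which newer NumPy releases recommend in place of np.in1d; for 1-D input the two behave the same. The array `row` and the value of `top_n` are assumptions chosen for illustration.

```python
import numpy as np

top_n = 2
row = np.array([0.9, 0.5, 0.5, 0.0, 0.5, 0.1])  # hypothetical probabilities

# The top_n largest positive values (duplicates included)
top_values = np.sort(row[row > 0])[-top_n:]
# Mark every position whose value equals one of those top values
mask = np.isin(row, top_values)
# Column positions of the selected predictions
top_positions = np.where(mask)[0]

print(top_values, top_positions)
```

Although top_n is 2, four positions are selected, because three columns tie at 0.5.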
My new version aggregates accuracy picks/marks for the final accuracy score in about 2.5 seconds.
top_N = 5
def agg_accuracy_picks(preds, y, top_n):
    """Aggregate accuracy picks/marks"""
    p_cols, y_cols = preds.columns, y.columns
    for p_row, y_row in zip(preds.values, y.values == 1):
        # top N values with all duplicates
        top_values = np.in1d(p_row, np.sort(p_row[p_row > 0])[-top_n:])
        top_cols = p_cols[np.where(top_values)[0]]
        yield set(y_cols[y_row]) <= set(top_cols)
start = time.time()
# Retrieve probabilities of predictions
pred_probs = np.asarray(dt.predict_proba(X=X))
pred_probs = pd.DataFrame(pred_probs[:, :, 1].T, columns=y.columns)
pred_probs['preds_cover_reality'] = list(agg_accuracy_picks(pred_probs, y=y, top_n=top_N))
print(f"Accuracy present in top n_top predictions: "
f"{pred_probs['preds_cover_reality'].sum() / y.shape[0]:.1%}")
print(f"Time elapsed: {(time.time() - start): .1f} seconds")
Sample output:
Accuracy score (precise): 0.4%
Accuracy present in top n_top predictions: 3.5%
Time elapsed: 2.5 seconds
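If the per-row Python loop is still too slow, the same check could probably be done on whole arrays at once. The sketch below is a different technique from the generator above, not part of the original solution, and has not been timed on the example data. The function name `top_n_coverage` is hypothetical, and it assumes `probs` and `truth` are 2-D arrays with matching column order, no NaN values, and `n_top >= 1`.

```python
import numpy as np

def top_n_coverage(probs, truth, n_top):
    """Per row: are all true labels among the top n_top positive predictions (ties kept)?"""
    probs = np.asarray(probs, dtype=float)
    truth = np.asarray(truth, dtype=bool)
    # Non-positive probabilities can never be selected
    masked = np.where(probs > 0, probs, -np.inf)
    k = min(n_top, masked.shape[1])
    # k-th largest value of each row; -inf when the row has fewer than k positives
    thresh = -np.partition(-masked, k - 1, axis=1)[:, k - 1]
    # Keep every positive value that reaches the threshold, so ties are included
    in_top = (probs > 0) & (masked >= thresh[:, None])
    # A row is covered when no true label falls outside the selection
    return ~(truth & ~in_top).any(axis=1)

probs = np.array([[0.90, 0.80, 0.75, 0.10],
                  [0.50, 0.50, 0.50, 0.00],
                  [0.50, 0.00, 0.00, 0.00]])
truth = np.array([[0, 1, 1, 0],
                  [1, 1, 1, 0],
                  [0, 1, 0, 0]])
print(top_n_coverage(probs, truth, n_top=3))
```

With the data from the example, it would be called as `top_n_coverage(pred_probs[y.columns].values, y.values, top_N)`, and the mean of the result gives the share of covered rows.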
Answered By - RomanPerekhrest