Issue
I trained two gradient-boosting models on the same data, one with scikit-learn and one with XGBoost.
Scikit-learn model
GradientBoostingClassifier(
    n_estimators=5,
    learning_rate=0.17,
    max_depth=5,
    verbose=2
)
XGBoost model
XGBClassifier(
    n_estimators=5,
    learning_rate=0.17,
    max_depth=5,
    verbosity=2,
    eval_metric="logloss"
)
Then I checked inference performance:
- XGBoost: 9.7 ms ± 84.6 µs per loop
- Scikit-learn: 426 µs ± 12.5 µs per loop
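The "per loop" figures above are the kind of output IPython's %timeit produces. For context, a minimal self-contained sketch of that kind of comparison (the dataset and the timing calls here are assumptions for illustration, not the question's actual code):

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from xgboost import XGBClassifier

# Toy data standing in for the (unspecified) training data
X_train, y_train = make_classification(n_samples=1000, n_features=20, random_state=0)
X_test = X_train[:100]

sk_clf = GradientBoostingClassifier(n_estimators=5, learning_rate=0.17,
                                    max_depth=5).fit(X_train, y_train)
xgb_clf = XGBClassifier(n_estimators=5, learning_rate=0.17, max_depth=5,
                        eval_metric="logloss").fit(X_train, y_train)

# In IPython/Jupyter:
# %timeit sk_clf.predict(X_test)
# %timeit xgb_clf.predict(X_test)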
Why is XGBoost so slow?
Solution
"Why is xgboost so slow?": XGBClassifier()
is the scikit-learn API for XGBoost (see e.g. https://xgboost.readthedocs.io/en/latest/python/python_api.html#xgboost.XGBClassifier for more details). If you call the function directly (not through an API) it will be faster. To compare the performance of the two functions it makes sense to call each function directly, instead of calling one function directly and one function through an API. Here is an example:
# benchmark_xgboost_vs_sklearn.py
# Adapted from `xgboost_test.py` by Jacob Schreiber
# (https://gist.github.com/jmschrei/6b447aada61d631544cd)
"""
Benchmarking scripts for XGBoost versus sklearn (time and accuracy)
"""
import time
import random
import numpy as np
import xgboost as xgb
from sklearn.ensemble import GradientBoostingClassifier
random.seed(0)
np.random.seed(0)
def make_dataset(n=500, d=10, c=2, z=2):
    """
    Make a dataset of n points per class, with d dimensions and c classes,
    with the class means separated by z in each dimension, making each
    feature equally informative.
    """
    # Generate our data and our labels
    X = np.concatenate([np.random.randn(n, d) + z*i for i in range(c)])
    y = np.concatenate([np.ones(n) * i for i in range(c)])
    # Generate a random indexing
    idx = np.arange(n*c)
    np.random.shuffle(idx)
    # Randomize the dataset, preserving data-label pairing
    X = X[idx]
    y = y[idx]
    # Return X_train, X_test, y_train, y_test (even/odd row split)
    return X[::2], X[1::2], y[::2], y[1::2]
def main():
    """
    Run scikit-learn's GradientBoostingClassifier, then XGBoost via its
    scikit-learn API wrapper (xgb.XGBModel), then XGBoost directly via
    xgb.train
    """
    # Generate the dataset
    X_train, X_test, y_train, y_test = make_dataset(10, z=100)
    n_estimators = 5
    max_depth = 5
    learning_rate = 0.17
    # sklearn first
    tic = time.time()
    clf = GradientBoostingClassifier(n_estimators=n_estimators,
                                     max_depth=max_depth,
                                     learning_rate=learning_rate)
    clf.fit(X_train, y_train)
    print("SKLearn GBClassifier: {}s".format(time.time() - tic))
    print("Acc: {}".format(clf.score(X_test, y_test)))
    print(y_test.sum())
    print(clf.predict(X_test))
    # Convert the data to DMatrix for xgboost's native interface
    dtrain = xgb.DMatrix(X_train, label=y_train)
    dtest = xgb.DMatrix(X_test, label=y_test)
    # Loop through multiple thread counts for xgboost
    for threads in 1, 2, 4:
        # xgboost's sklearn interface
        tic = time.time()
        clf = xgb.XGBModel(n_estimators=n_estimators, max_depth=max_depth,
                           learning_rate=learning_rate, nthread=threads)
        clf.fit(X_train, y_train)
        print("SKLearn XGBoost API Time: {}s".format(time.time() - tic))
        preds = np.round(clf.predict(X_test))
        acc = 1. - (np.abs(preds - y_test).sum() / y_test.shape[0])
        print("Acc: {}".format(acc))
        print("{} threads: ".format(threads))
        # xgboost's native interface
        tic = time.time()
        param = {
            'max_depth': max_depth,
            'eta': learning_rate,   # same learning rate as the two models above
            'verbosity': 0,         # 'silent' is deprecated in recent XGBoost
            'objective': 'binary:logistic',
            'nthread': threads
        }
        bst = xgb.train(param, dtrain, n_estimators,
                        [(dtest, 'eval'), (dtrain, 'train')])
        print("XGBoost (no wrapper) Time: {}s".format(time.time() - tic))
        preds = np.round(bst.predict(dtest))
        acc = 1. - (np.abs(preds - y_test).sum() / y_test.shape[0])
        print("Acc: {}".format(acc))

if __name__ == '__main__':
    main()
Summarised results:
sklearn.ensemble.GradientBoostingClassifier()
- Time: 0.003237009048461914s
- Accuracy: 1.0
XGBoost scikit-learn API wrapper (xgb.XGBModel, the base of XGBClassifier())
- Time: 0.3436141014099121s
- Accuracy: 1.0
XGBoost (no wrapper) xgb.train()
- Time: 0.0028612613677978516s
- Accuracy: 1.0
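The script above times training. To time just inference, as in the original question, the same principle applies: compare the wrapper's predict() with the underlying Booster called directly. A minimal sketch (it assumes a fitted XGBClassifier named clf and a NumPy test matrix X_test; the DMatrix is built outside the timed calls so that only prediction is measured):

import timeit
import xgboost as xgb

booster = clf.get_booster()   # the Booster object behind the sklearn wrapper
dtest = xgb.DMatrix(X_test)   # built once, outside the timed calls

t_wrapper = timeit.timeit(lambda: clf.predict(X_test), number=1000) / 1000
t_booster = timeit.timeit(lambda: booster.predict(dtest), number=1000) / 1000
print("wrapper predict: {:.6f}s  native Booster.predict: {:.6f}s".format(
    t_wrapper, t_booster))

Note that with objective binary:logistic, Booster.predict() returns probabilities rather than class labels, so they would still need to be thresholded (e.g. with np.round, as in the script above) before computing accuracy.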
Answered By - jared_mamrot