Issue
I trained two gradient-boosting models on the same data, one with scikit-learn and one with XGBoost.
Scikit-learn model
GradientBoostingClassifier(
    n_estimators=5,
    learning_rate=0.17,
    max_depth=5,
    verbose=2
)
XGBoost model
XGBClassifier(
    n_estimators=5,
    learning_rate=0.17,
    max_depth=5,
    verbosity=2,
    eval_metric="logloss"
)
Then I checked inference performance:
- XGBoost: 9.7 ms ± 84.6 µs per loop
- Scikit-learn: 426 µs ± 12.5 µs per loop
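The "per loop" figures above are the kind of output IPython's %timeit produces. For context, a minimal self-contained sketch of that kind of comparison (the dataset and the timing calls here are assumptions for illustration, not the question's actual code):

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from xgboost import XGBClassifier

# Toy data standing in for the (unspecified) training data
X_train, y_train = make_classification(n_samples=1000, n_features=20, random_state=0)
X_test = X_train[:100]

sk_clf = GradientBoostingClassifier(n_estimators=5, learning_rate=0.17,
                                    max_depth=5).fit(X_train, y_train)
xgb_clf = XGBClassifier(n_estimators=5, learning_rate=0.17, max_depth=5,
                        eval_metric="logloss").fit(X_train, y_train)

# In IPython/Jupyter:
# %timeit sk_clf.predict(X_test)
# %timeit xgb_clf.predict(X_test)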
Why is XGBoost so slow?
Solution
"Why is xgboost so slow?": XGBClassifier()
is the scikit-learn API for XGBoost (see e.g. https://xgboost.readthedocs.io/en/latest/python/python_api.html#xgboost.XGBClassifier for more details). If you call the function directly (not through an API) it will be faster. To compare the performance of the two functions it makes sense to call each function directly, instead of calling one function directly and one function through an API. Here is an example:
# benchmark_xgboost_vs_sklearn.py
# Adapted from `xgboost_test.py` by Jacob Schreiber
# (https://gist.github.com/jmschrei/6b447aada61d631544cd)
"""
Benchmarking scripts for XGBoost versus sklearn (time and accuracy)
"""
import time
import random
import numpy as np
import xgboost as xgb
from sklearn.ensemble import GradientBoostingClassifier
random.seed(0)
np.random.seed(0)
def make_dataset(n=500, d=10, c=2, z=2):
    """
    Make a dataset of n points per class, with d dimensions and c classes,
    with the class means separated by z in each dimension, making each
    feature equally informative.
    """
    # Generate our data and our labels
    X = np.concatenate([np.random.randn(n, d) + z*i for i in range(c)])
    y = np.concatenate([np.ones(n) * i for i in range(c)])
    # Generate a random indexing
    idx = np.arange(n*c)
    np.random.shuffle(idx)
    # Randomize the dataset, preserving data-label pairing
    X = X[idx]
    y = y[idx]
    # Return X_train, X_test, y_train, y_test (even/odd row split)
    return X[::2], X[1::2], y[::2], y[1::2]
def main():
    """
    Run scikit-learn's GradientBoostingClassifier, then XGBoost via its
    scikit-learn API wrapper (xgb.XGBModel), then XGBoost directly via
    xgb.train
    """
    # Generate the dataset
    X_train, X_test, y_train, y_test = make_dataset(10, z=100)
    n_estimators = 5
    max_depth = 5
    learning_rate = 0.17
    # sklearn first
    tic = time.time()
    clf = GradientBoostingClassifier(n_estimators=n_estimators,
                                     max_depth=max_depth,
                                     learning_rate=learning_rate)
    clf.fit(X_train, y_train)
    print("SKLearn GBClassifier: {}s".format(time.time() - tic))
    print("Acc: {}".format(clf.score(X_test, y_test)))
    print(y_test.sum())
    print(clf.predict(X_test))
    # Convert the data to DMatrix for xgboost's native interface
    dtrain = xgb.DMatrix(X_train, label=y_train)
    dtest = xgb.DMatrix(X_test, label=y_test)
    # Loop through multiple thread counts for xgboost
    for threads in 1, 2, 4:
        # xgboost's sklearn interface
        tic = time.time()
        clf = xgb.XGBModel(n_estimators=n_estimators, max_depth=max_depth,
                           learning_rate=learning_rate, nthread=threads)
        clf.fit(X_train, y_train)
        print("SKLearn XGBoost API Time: {}s".format(time.time() - tic))
        preds = np.round(clf.predict(X_test))
        acc = 1. - (np.abs(preds - y_test).sum() / y_test.shape[0])
        print("Acc: {}".format(acc))
        print("{} threads: ".format(threads))
        # xgboost's native interface
        tic = time.time()
        param = {
            'max_depth': max_depth,
            'eta': learning_rate,   # same learning rate as the two models above
            'verbosity': 0,         # 'silent' is deprecated in recent XGBoost
            'objective': 'binary:logistic',
            'nthread': threads
        }
        bst = xgb.train(param, dtrain, n_estimators,
                        [(dtest, 'eval'), (dtrain, 'train')])
        print("XGBoost (no wrapper) Time: {}s".format(time.time() - tic))
        preds = np.round(bst.predict(dtest))
        acc = 1. - (np.abs(preds - y_test).sum() / y_test.shape[0])
        print("Acc: {}".format(acc))

if __name__ == '__main__':
    main()
Summarised results:
sklearn.ensemble.GradientBoostingClassifier()
- Time: 0.003237009048461914s
- Accuracy: 1.0
XGBoost scikit-learn API wrapper (xgb.XGBModel, the base of XGBClassifier())
- Time: 0.3436141014099121s
- Accuracy: 1.0
XGBoost (no wrapper) xgb.train()
- Time: 0.0028612613677978516s
- Accuracy: 1.0
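The script above times training. To time just inference, as in the original question, the same principle applies: compare the wrapper's predict() with the underlying Booster called directly. A minimal sketch (it assumes a fitted XGBClassifier named clf and a NumPy test matrix X_test; the DMatrix is built outside the timed calls so that only prediction is measured):

import timeit
import xgboost as xgb

booster = clf.get_booster()   # the Booster object behind the sklearn wrapper
dtest = xgb.DMatrix(X_test)   # built once, outside the timed calls

t_wrapper = timeit.timeit(lambda: clf.predict(X_test), number=1000) / 1000
t_booster = timeit.timeit(lambda: booster.predict(dtest), number=1000) / 1000
print("wrapper predict: {:.6f}s  native Booster.predict: {:.6f}s".format(
    t_wrapper, t_booster))

Note that with objective binary:logistic, Booster.predict() returns probabilities rather than class labels, so they would still need to be thresholded (e.g. with np.round, as in the script above) before computing accuracy.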
Answered By - jared_mamrot