Issue
I've started working with quantile random forests (QRFs) from the scikit-garden
package. Previously I was creating regular random forests using RandomForestRegresser
from sklearn.ensemble
.
It appears that the speed of the QRF is comparable to the regular RF with small dataset sizes, but that as the size of the data increases, the QRF becomes MUCH slower at making predictions than the RF.
Is this expected? If so, could someone please explain why it takes such a long time to make these predictions and/or give any suggestions as to how I could get quantile predictions in a more timely manner.
See below for a toy example, where I test the training and predictive times for a variety of dataset sizes.
import matplotlib as mpl
mpl.use('Agg')
from sklearn.ensemble import RandomForestRegressor
from skgarden import RandomForestQuantileRegressor
from sklearn.model_selection import train_test_split
import numpy as np
import time
import matplotlib.pyplot as plt
log_ns = np.arange(0.5, 5, 0.5) # number of observations (log10)
ns = (10 ** (log_ns)).astype(int)
print(ns)
m = 14 # number of covariates
train_rf = []
train_qrf = []
pred_rf = []
pred_qrf = []
for n in ns:
# create dataset
print('n = {}'.format(n))
print('m = {}'.format(m))
rndms = np.random.normal(size=n)
X = np.random.uniform(size=[n,m])
betas = np.random.uniform(size=m)
y = 3 + np.sum(betas[None,:] * X, axis=1) + rndms
# split test/train
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
# random forest
rf = RandomForestRegressor(n_estimators=1000, random_state=0)
st = time.time()
rf.fit(X_train, y_train)
en = time.time()
print('Fit time RF = {} secs'.format(en - st))
train_rf.append(en - st)
# quantile random forest
qrf = RandomForestQuantileRegressor(random_state=0, min_samples_split=10, n_estimators=1000)
qrf.set_params(max_features = X.shape[1] // 3)
st = time.time()
qrf.fit(X_train, y_train)
en = time.time()
print('Fit time QRF = {} secs'.format(en - st))
train_qrf.append(en - st)
# predictions
st = time.time()
preds_rf = rf.predict(X_test)
en = time.time()
print('Prediction time RF = {}'.format(en - st))
pred_rf.append(en - st)
st = time.time()
preds_qrf = qrf.predict(X_test, quantile=50)
en = time.time()
print('Prediction time QRF = {}'.format(en - st))
pred_qrf.append(en - st)
fig, ax = plt.subplots()
ax.plot(np.log10(ns), train_rf, label='RF train', color='blue')
ax.plot(np.log10(ns), train_qrf, label='QRF train', color='red')
ax.plot(np.log10(ns), pred_rf, label='RF predict', color='blue', linestyle=':')
ax.plot(np.log10(ns), pred_qrf, label='QRF predict', color='red', linestyle =':')
ax.legend()
ax.set_xlabel('log(n)')
ax.set_ylabel('time (s)')
fig.savefig('time_comparison.png')
Here is the output: Time comparison of RF and QRF training and predictions
Solution
I am not a developer on this or any quantile regression package, but I've looked at the source code for both scikit-garden and quantRegForest/ranger, and I have some idea of why the R versions are so much faster:
EDIT: On a related github issue, lmssdd mentions how this method performs significantly worse than the 'standard procedure' from the paper. I haven't read the paper in detail, so take this answer with a grain of skepticism.
Explanation of difference in skgarden/quantregforest methods
The basic idea of the skgarden predict
function is to save all the y_train
values corresponding to all of the leaves. Then, when predicting a new sample, you gather the relevant leaves and corresponding y_train
values, and compute the (weighted) quantile of that array. The R versions take a shortcut: they only save a single, randomly chosen y_train
value per leaf node. This has two advantages: it makes the gathering of relevant y_train
values a lot simpler since there is always exactly one value in every leaf node. Secondly, it makes the quantile calculation a lot simpler since every leaf has the exact same weight.
Since you only use a single (random) value per leaf instead of all of them, this is an approximation method. In my experience, if you have enough trees, (at least 50-100 or so), this has very little effect on the result. However, I don't know enough about the math to say how good the approximation is exactly.
TL;DR: how to make skgarden predict faster
Below is an implementation of the simpler R method of quantile prediction, for a RandomForestQuantileRegressor model. Note that the first half of the function is the (one-time) process of selecting a random y_train value per leaf. If the author were to implement this method in skgarden, they would logically move this part to the fit
method, leaving only the last 6 or so lines, which makes for a much faster predict
method. Also in my example, I am using quantiles from 0 to 1, instead of from 0 to 100.
def predict_approx(model, X_test, quantiles=[0.05, 0.5, 0.95]):
"""
Function to predict quantiles much faster than the default skgarden method
This is the same method that the ranger and quantRegForest packages in R use
Output is (n_samples, n_quantiles) or (n_samples, ) if a scalar is given as quantiles
"""
# Begin one-time calculation of random_values. This only depends on model, so could be saved.
n_leaves = np.max(model.y_train_leaves_) + 1 # leaves run from 0 to max(leaf_number)
random_values = np.zeros((model.n_estimators, n_leaves))
for tree in range(model.n_estimators):
for leaf in range(n_leaves):
train_samples = np.argwhere(model.y_train_leaves_[tree, :] == leaf).reshape(-1)
if len(train_samples) == 0:
random_values[tree, leaf] = np.nan
else:
train_values = model.y_train_[train_samples]
random_values[tree, leaf] = np.random.choice(train_values)
# Optionally, save random_values as a model attribute for reuse later
# For each sample, get the random leaf values from all the leaves they land in
X_leaves = model.apply(X_test)
leaf_values = np.zeros((X_test.shape[0], model.n_estimators))
for i in range(model.n_estimators):
leaf_values[:, i] = random_values[i, X_leaves[:, i]]
# For each sample, calculate the quantiles of the leaf_values
return np.quantile(leaf_values, np.array(quantiles), axis=1).transpose()
Answered By - Fenno Vermeij
0 comments:
Post a Comment
Note: Only a member of this blog may post a comment.