Issue
I've trained a RandomForestClassifier model with the sklearn library and saved it with joblib. Now I have a joblib file of nearly 1GB, which I'm deploying on an Nginx/Flask/Gunicorn stack. The issue is that I have to find an efficient way to load this model from file and serve API requests. Is it possible to save the model without the datasets when doing:
joblib.dump(model, '/kaggle/working/mymodel.joblib')
print("random classifier saved")
Solution
The persistent representation of Scikit-Learn estimators DOES NOT include any training data.
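One rough way to check this for yourself is to compare the serialized size of a fitted model with the size of the data it was trained on. The sketch below uses synthetic data from make_classification and deliberately small, shallow trees; the dataset, the parameter values, and the use of pickle (which joblib builds on) to measure size in memory are all assumptions made for illustration, not part of the original question.

```python
import pickle

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (assumption): 50,000 rows x 20 float64 features.
X, y = make_classification(n_samples=50_000, n_features=20, random_state=0)

# A deliberately tiny forest, so the fitted structure itself is small.
model = RandomForestClassifier(n_estimators=5, max_depth=3, random_state=0)
model.fit(X, y)

data_bytes = X.nbytes + y.nbytes          # memory used by the training arrays
model_bytes = len(pickle.dumps(model))    # serialized size of the fitted model

print(f"training data: {data_bytes} bytes")
print(f"pickled model: {model_bytes} bytes")
```

If the training data were stored inside the estimator, the pickled model could not be smaller than the data. Here it should come out far smaller, which suggests that a 1GB file reflects the size of the trees themselves.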
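As for decision trees and their ensembles (such as random forests), the size of the estimator object is driven by the number of nodes in each tree. The node count can grow exponentially with tree depth (i.e. the max_depth parameter), up to 2^(max_depth + 1) - 1 nodes per tree, and it is also bounded by the number of training samples. This is because each fitted tree stores a fixed set of per-node values (child indices, split feature, threshold, impurity, sample counts, and class value counts) in arrays whose length equals the node count.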
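To see how estimator size relates to max_depth on your own data, you can count the nodes and measure the serialized size at a few depths. This is a minimal sketch on synthetic data; the forest_stats helper, the dataset, and the chosen depths are hypothetical, and pickle is used only as a convenient in-memory proxy for the size of a joblib file.

```python
import pickle

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (assumption); substitute your own training set.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)


def forest_stats(max_depth):
    """Return (total node count, pickled size in bytes) for a small forest."""
    model = RandomForestClassifier(
        n_estimators=10, max_depth=max_depth, random_state=0
    ).fit(X, y)
    # Each fitted tree exposes its node count through the tree_ attribute.
    total_nodes = sum(est.tree_.node_count for est in model.estimators_)
    return total_nodes, len(pickle.dumps(model))


stats = {depth: forest_stats(depth) for depth in (2, 4, 8)}
for depth, (nodes, size) in stats.items():
    print(f"max_depth={depth}: {nodes} nodes, {size} bytes pickled")
```

The exact numbers depend on the data, but both the node count and the pickled size should rise as max_depth increases.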
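You can make your random forest objects smaller by limiting the max_depth parameter. If you're worried about a potential loss of predictive performance, you may increase the number of child estimators (the n_estimators parameter), keeping in mind that the model size also grows with every tree you add.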
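The trade-off can be sketched by comparing an unrestricted forest with a shallower one that has more trees, looking at both serialized size and held-out accuracy. Everything specific here is an assumption for illustration: the synthetic dataset, the fit_and_measure helper, and the parameter values (30 unrestricted trees versus 60 trees with max_depth=4). Which setting is acceptable depends entirely on your own data.

```python
import pickle

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic, slightly noisy stand-in data (assumption).
X, y = make_classification(
    n_samples=5000, n_features=20, n_informative=5, flip_y=0.05, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)


def fit_and_measure(**params):
    """Fit a forest and return (pickled size in bytes, test accuracy)."""
    model = RandomForestClassifier(random_state=0, **params)
    model.fit(X_train, y_train)
    return len(pickle.dumps(model)), model.score(X_test, y_test)


# Baseline: fully grown trees (max_depth=None is the default).
full_size, full_acc = fit_and_measure(n_estimators=30)

# Smaller model: shallow trees, but twice as many of them.
small_size, small_acc = fit_and_measure(n_estimators=60, max_depth=4)

print(f"unrestricted: {full_size} bytes, accuracy {full_acc:.3f}")
print(f"max_depth=4:  {small_size} bytes, accuracy {small_acc:.3f}")
```

Checking accuracy on a held-out split alongside the file size is what tells you whether a given max_depth is too aggressive for your problem.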
Longer term, you may wish to explore alternative representations for Scikit-Learn models. For example, you can convert them to the PMML data format using the SkLearn2PMML package.
Answered By - user1808924