Issue
Why can pickling a sklearn decision tree generate a pickle thousands of times bigger (in terms of memory) than the original estimator?
I ran into this issue at work, where a random forest estimator (with 100 decision trees) trained on a dataset with around 1_000_000 samples and 7 features generated a pickle bigger than 2 GB.
I was able to track the issue down to the pickling of a single decision tree, and I was able to replicate it with a generated dataset, as shown below.
For memory estimations I used the pympler library. The sklearn version used is 1.0.1.
# here using a regressor tree, but I would expect the same issue to be present with a classification tree
import pickle
from sklearn.tree import DecisionTreeRegressor
from sklearn.datasets import make_friedman1  # using a dataset generation function from sklearn
from pympler import asizeof

# function that creates the dataset and trains the estimator
def make_example(n_samples: int):
    X, y = make_friedman1(n_samples=n_samples, n_features=7, noise=1.0, random_state=49)
    estimator = DecisionTreeRegressor(max_depth=50, max_features='auto', min_samples_split=5)
    estimator.fit(X, y)
    return X, y, estimator
# utilities to compute and compare the size of an object and its pickled version
def readable_size(size_in_bytes: int, suffix='B') -> str:
    num = size_in_bytes
    for unit in ['', 'k', 'M', 'G', 'T', 'P', 'E', 'Z']:
        if abs(num) < 1024.0:
            return "%3.1f %s%s" % (num, unit, suffix)
        num /= 1024.0
    return "%.1f%s%s" % (num, 'Yi', suffix)

def print_size(obj, skip_detail=False):
    obj_size = asizeof.asized(obj).size
    print(readable_size(obj_size))
    return obj_size

def compare_with_pickle(obj):
    size_obj = print_size(obj)
    size_pickle = print_size(pickle.dumps(obj))
    print(f"Ratio pickle/obj: {(size_pickle / size_obj):.2f}")
_, _, model100K = make_example(100_000)
compare_with_pickle(model100K)
_, _, model1M = make_example(1_000_000)
compare_with_pickle(model1M)
output:
1.7 kB
4.9 MB
Ratio pickle/obj: 2876.22
1.7 kB
49.3 MB
Ratio pickle/obj: 28982.84
Solution
As pointed out by @pygeek's answer and the subsequent comments, the wrong assumption in the question is that pickling increases the size of the object substantially. Instead, the issue lies with pympler.asizeof, which does not give a correct estimate of the size of the tree object.
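As a quick illustration (this snippet is not part of the original answer; it reuses the model1M object and the readable_size helper from the question), asizeof only sees the thin Python wrapper around the Cython Tree, while the pickle carries the full node arrays:

# illustration only: asizeof reports the Python wrapper, pickle carries the node arrays
print(readable_size(asizeof.asized(model1M).size))   # a few kB, misleading
print(readable_size(len(pickle.dumps(model1M))))     # tens of MB, the real payload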
Indeed, the DecisionTreeRegressor object has a tree_ attribute that holds a number of arrays of length tree_.node_count. Using help(sklearn.tree._tree.Tree) we can see that there are 8 such arrays (values, children_left, children_right, feature, impurity, threshold, n_node_samples, weighted_n_node_samples), and the underlying type of every array (except possibly the values array, see the note below) is a 64-bit integer or a 64-bit float (the underlying Tree object is a Cython object), so a better estimate of the size of a DecisionTree is estimator.tree_.node_count*8*8.
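As a cross-check (this helper is not from the original answer; it assumes the 8 arrays above are exposed as NumPy arrays on estimator.tree_, which they are in recent scikit-learn versions), one can sum the byte sizes of those arrays directly:

# rough cross-check: sum the buffers of the 8 per-node arrays listed above
def tree_arrays_nbytes(tree) -> int:
    arrays = [tree.values, tree.children_left, tree.children_right, tree.feature,
              tree.impurity, tree.threshold, tree.n_node_samples,
              tree.weighted_n_node_samples]
    return sum(a.nbytes for a in arrays)

print(readable_size(tree_arrays_nbytes(model1M.tree_)))  # close to the pickle size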
Computing this estimate for the models above:
def print_tree_estimate(tree):
    print(f"A tree with max_depth {tree.max_depth} can have up to {2**(tree.max_depth -1)} nodes")
    print(f"This tree has node_count {tree.node_count} and a size estimate is {readable_size(tree.node_count*8*8)}")
print_tree_estimate(model100K.tree_)
print()
print_tree_estimate(model1M.tree_)
gives as output:
A tree with max_depth 37 can have up to 68719476736 nodes
This tree has node_count 80159 and a size estimate is 4.9 MB
A tree with max_depth 46 can have up to 35184372088832 nodes
This tree has node_count 807881 and a size estimate is 49.3 MB
and indeed these estimates are in line with the sizes of the pickled objects.
Further note that the only way to be sure to bound the size of a DecisionTree is to bound max_depth, since a binary tree has a maximum number of nodes bounded by 2**(max_depth - 1), but the specific tree realizations above have a number of nodes well below this theoretical bound.
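For example (a minimal sketch, not part of the original answer; it reuses make_friedman1 and the same hyperparameters as the question, and the exact numbers will vary), refitting with a much tighter max_depth directly caps node_count and therefore the pickle size:

# sketch: a tighter max_depth bounds node_count, and therefore the pickle size
X, y = make_friedman1(n_samples=1_000_000, n_features=7, noise=1.0, random_state=49)
shallow = DecisionTreeRegressor(max_depth=10, max_features='auto', min_samples_split=5).fit(X, y)
print(shallow.tree_.node_count)                   # bounded by the depth limit
print(readable_size(len(pickle.dumps(shallow))))  # correspondingly small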
note: the above estimate is valid for this decision tree regressor, which has a single output and no classes. estimator.tree_.values is an array of shape [node_count, n_outputs, max_n_classes], so for n_outputs > 1 and/or max_n_classes > 1 the size estimate would need to take those into account, and the correct estimate would be estimator.tree_.node_count*8*(7 + n_outputs*max_n_classes)
Answered By - pietroppeter