Friday, February 11, 2022

[FIXED] Is it possible to fit() a scikit-learn model in a loop or with an iterator

Labels: python, scikit-learn

Issue

Usually people use scikit-learn to train a model this way:

from sklearn.ensemble import GradientBoostingClassifier as gbc
clf = gbc()
clf.fit(X_train, y_train)
predicted = clf.predict(X_test)

This works fine as long as the machine has enough memory to hold the entire dataset. My problem is exactly that: the dataset is too big for my memory. My current workaround is to enlarge my machine's virtual memory, but that has already made the system extremely slow. So I started to wonder whether it is possible to feed the fit() method samples in batches, like this (the answer is no; please keep reading and stop reminding me that the answer is no):

clf = gbc()
for i in range(X_train.shape[0]):
    clf.fit(X_train[i], y_train[i])

so that I can read the training set from the hard drive only when needed. I read the scikit-learn manual, and it seems that this is not supported:

Calling fit() more than once will overwrite what was learned by any previous fit()

So, is this possible?

Solution

After reading the section "6. Strategies to scale computationally: bigger data" of the official manual, mentioned by @StupidWolf in this post, I realized that there is more to this question than meets the eye.
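For context, that section of the manual describes out-of-core learning with estimators that expose a partial_fit() method. GradientBoostingClassifier is not one of them, so the sketch below swaps in SGDClassifier, which is a different (linear) model. The iter_batches() generator and its synthetic data are hypothetical stand-ins for reading chunks from the hard drive.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier


def iter_batches(n_batches=20, batch_size=100, n_features=5, seed=0):
    """Hypothetical stand-in for reading chunks from disk; yields (X, y) batches."""
    rng = np.random.default_rng(seed)
    for _ in range(n_batches):
        X = rng.normal(size=(batch_size, n_features))
        y = (X[:, 0] + X[:, 1] > 0).astype(int)  # synthetic binary labels
        yield X, y


clf = SGDClassifier(random_state=0)
classes = np.array([0, 1])  # all possible labels must be known on the first call

# Unlike fit(), partial_fit() updates the model instead of starting over
for X_batch, y_batch in iter_batches():
    clf.partial_fit(X_batch, y_batch, classes=classes)

# Evaluate on a separately generated batch
X_test, y_test = next(iter_batches(n_batches=1, batch_size=500, seed=1))
accuracy = clf.score(X_test, y_test)
print(accuracy)
```

Only one batch is held in memory at a time. The trade-off is that this only works for models whose training algorithm is incremental by design.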

The real difficulty lies in how many of the models are designed.

Take Random Forest as an example. One of the most important techniques that improves its performance over a simple Decision Tree is bagging: the algorithm draws random samples from the entire dataset to build several weak learners, which together form the Random Forest. With this design, feeding the model one sample after another won't work.
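To make the bagging point concrete, here is a minimal sketch of how a bootstrap sample is drawn. It uses plain NumPy, not scikit-learn's internal code, and the dataset size and number of trees are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 100_000  # arbitrary size of the full training set
n_trees = 3          # arbitrary number of trees

coverage = []
for _ in range(n_trees):
    # One bootstrap sample: n_samples row indices drawn with replacement
    idx = rng.integers(0, n_samples, size=n_samples)
    # Fraction of distinct rows this single tree would need to read
    coverage.append(np.unique(idx).size / n_samples)

print(coverage)
```

Each tree touches roughly 63% of the rows (about 1 - 1/e), and those rows are scattered across the whole dataset, so no single contiguous batch is enough to build even one tree.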

It would still be possible for scikit-learn to define an interface for end users to implement: scikit-learn would call it whenever it needs a random sample, and the user's implementation would decide how to fetch the requested data by scanning the dataset on the hard drive. But this is far more complicated than I initially thought, and the performance gain may not be significant, given that an I/O-heavy "full table scan" (in database terms) would be needed frequently.
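Such an interface might look something like the sketch below. It is purely hypothetical: scikit-learn defines no such hook, and DiskBackedSampler, its sample() method, and the file layout are invented here for illustration. It relies on numpy.memmap so that only the requested rows are read from disk.

```python
import os
import tempfile

import numpy as np


class DiskBackedSampler:
    """Hypothetical user-implemented interface: returns random rows from a file on disk."""

    def __init__(self, path, n_samples, n_features):
        # memmap maps the file lazily; rows are read only when they are indexed
        self.data = np.memmap(path, dtype=np.float64, mode="r",
                              shape=(n_samples, n_features))

    def sample(self, size, rng):
        # Sorting the indices tends to make the disk reads more sequential
        idx = np.sort(rng.integers(0, self.data.shape[0], size=size))
        return np.asarray(self.data[idx])


# Write a small synthetic dataset to disk as a stand-in for the real one
n_samples, n_features = 10_000, 4
path = os.path.join(tempfile.mkdtemp(), "train.dat")
writer = np.memmap(path, dtype=np.float64, mode="w+",
                   shape=(n_samples, n_features))
writer[:] = np.arange(n_samples * n_features, dtype=np.float64).reshape(
    n_samples, n_features)
writer.flush()
del writer

sampler = DiskBackedSampler(path, n_samples, n_features)
batch = sampler.sample(size=256, rng=np.random.default_rng(0))
print(batch.shape)
```

Even in this toy form, every call to sample() reads from locations spread across the whole file, which is the I/O cost described above.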

Answered By - user2379740

This answer was collected from Stack Overflow, tested by PythonFixing community admins, and is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
