Issue
I came across the sparse=False setting while pre-processing my data with a OneHotEncoder. I did:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

ct = ColumnTransformer([
    ("scaling", StandardScaler(), sca_col),  # sca_col contains 3 columns
    ("onehot", OneHotEncoder(sparse=False, handle_unknown='ignore'), ohe_col),  # ohe_col contains 15 columns
])
Then I train my model with:
from sklearn.model_selection import train_test_split

feat = df.drop("label", axis=1)
X_train, X_test, y_train, y_test = train_test_split(feat, df.label, random_state=0)
ct.fit(X_train)
I get the error:
[...]
MemoryError: Unable to allocate 151. GiB for an array with shape (239076, 84497) and data type float64
The shape matches my data and columns, but the array obviously does not fit in my RAM.
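The size matches a quick back-of-the-envelope calculation:

# 239076 rows x 84497 columns, 8 bytes per float64 value:
print(239076 * 84497 * 8 / 2**30)  # ~150.5 GiB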
If I set sparse=True, which is the default, it works.
In which cases do I need to set sparse=False? I did so for no obvious reason a couple of weeks ago.
Solution
With this flag you choose whether to represent your data in a sparse format. A sparse format saves a lot of memory when you have an array where most of the elements are zero.
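As a rough illustration with SciPy (a toy 1000 x 1000 array holding a single nonzero value):

import numpy as np
from scipy import sparse

dense = np.zeros((1000, 1000))
dense[0, 0] = 1.0
compressed = sparse.csr_matrix(dense)  # keep only the nonzero entries

print(dense.nbytes)  # 8000000 bytes: every float64 cell is materialized
print(compressed.data.nbytes + compressed.indices.nbytes
      + compressed.indptr.nbytes)  # ~4 KB: the nonzeros plus their index arrays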
From Scikit-learn's ColumnTransformer documentation:
sparse_threshold : float, default=0.3
If the output of the different transformers contains sparse matrices, these will be stacked as a sparse matrix if the overall density is lower than this value. Use sparse_threshold=0 to always return dense. When the transformed output consists of all dense data, the stacked result will be dense, and this keyword will be ignored.
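A minimal, self-contained sketch of that behavior (a toy frame with hypothetical columns num and cat; ten categories keep the stacked density around 0.18, below the threshold):

import pandas as pd
from scipy import sparse
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

df_toy = pd.DataFrame({"num": range(10), "cat": list("abcdefghij")})

ct_toy = ColumnTransformer([
    ("scaling", StandardScaler(), ["num"]),
    ("onehot", OneHotEncoder(handle_unknown='ignore'), ["cat"]),  # sparse output by default
])
out = ct_toy.fit_transform(df_toy)

# 10 dense scaled values + 10 one-hot ones out of 10 * 11 cells
# give a density of ~0.18 < 0.3, so the stacked result stays sparse.
print(sparse.issparse(out))  # True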
Whether to use a sparse matrix depends on the matrix's sparsity, i.e. the percentage of its values that are zero. In your case, if using a sparse matrix tackles your memory restrictions, then it's the way to go. If your matrix is not sparse enough, you won't see any memory savings, nor any speed-up from algorithms designed for sparse matrices.
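For a rough feel of the numbers above, assuming each of the 15 one-hot-encoded columns contributes exactly one 1 per row and the 3 scaled columns are fully dense (an assumption about your data):

# Assumed layout: 3 dense scaled columns plus 15 one-hot groups
# spread over the remaining 84494 of the 84497 output columns.
nonzeros_per_row = 3 + 15
print(nonzeros_per_row / 84497)  # ~0.0002, far below sparse_threshold=0.3

At roughly 18 stored values per row instead of 84497, that is exactly why the default sparse=True fits in your memory.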
Answered By - Alex Metsai