Issue
I have read this article https://towardsdatascience.com/do-decision-trees-need-feature-scaling-97809eaa60c6 and watched this Krish Naik video https://www.youtube.com/watch?v=nmBqnKSSKfM&ab_channel=KrishNaik, both of which state that you don't need a Standard Scaler for Decision Tree machine learning.
But what happens in my code is the opposite. Here's the code I am trying to run.
# importing libraries
import numpy as nm
import matplotlib.pyplot as mpl
import pandas as pd
# importing datasets
data_set = pd.read_csv('Social_Network_Ads.csv')
# extracting independent and dependent variables
x = data_set.iloc[:, [2, 3]].values
y = data_set.iloc[:, 4].values
# splitting the dataset into training and test set
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=0)
# feature scaling
from sklearn.preprocessing import StandardScaler
st_x = StandardScaler()
x_train = st_x.fit_transform(x_train)
x_test = st_x.transform(x_test)
# fitting Decision Tree classifier to the training set
from sklearn.tree import DecisionTreeClassifier
classifier = DecisionTreeClassifier(criterion='entropy', random_state=0)
classifier.fit(x_train, y_train)
I continue the question with the part where I try to visualize the data. Here's the code.
# visualizing the training set result
from matplotlib.colors import ListedColormap
x_set, y_set = x_train, y_train
x1, x2 = nm.meshgrid(nm.arange(start=x_set[:, 0].min() - 1, stop=x_set[:, 0].max() + 1, step=0.01),
                     nm.arange(start=x_set[:, 1].min() - 1, stop=x_set[:, 1].max() + 1, step=0.01))
mpl.contourf(x1, x2, classifier.predict(nm.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),
             alpha=0.75, cmap=ListedColormap(('purple', 'green')))
mpl.xlim(x1.min(), x1.max())
mpl.ylim(x2.min(), x2.max())
for i, j in enumerate(nm.unique(y_set)):
    mpl.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
                c=ListedColormap(('purple', 'green'))(i), label=j)
mpl.title('Decision Tree Algorithm (Training set)')
mpl.xlabel('Age')
mpl.ylabel('Estimated Salary')
mpl.legend()
mpl.show()
The output succeeds if I run it with the StandardScaler, and the graph is shown nicely. But when I comment out the StandardScaler part, I get a MemoryError.
MemoryError Traceback (most recent call last)
<ipython-input-8-1282bf709e27> in <module>
3 x_set,y_set = x_train, y_train
4 x1, x2 = nm.meshgrid(nm.arange(start = x_set[:, 0].min() - 1, stop = x_set[:, 0].max() + 1, step =0.01),
----> 5 nm.arange(start = x_set[:, 1].min() - 1, stop = x_set[:, 1].max() + 1, step = 0.01))
6 mpl.contourf(x1, x2, classifier.predict(nm.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),
7 alpha = 0.75, cmap = ListedColormap(('purple','green' )))
~\Anaconda3\lib\site-packages\numpy\lib\function_base.py in meshgrid(*xi, **kwargs)
4209
4210 if copy_:
-> 4211 output = [x.copy() for x in output]
4212
4213 return output
~\Anaconda3\lib\site-packages\numpy\lib\function_base.py in <listcomp>(.0)
4209
4210 if copy_:
-> 4211 output = [x.copy() for x in output]
4212
4213 return output
MemoryError:
The error only occurs in the visualization part; the other parts of the code, such as prediction, work fine without the Standard Scaler.
Can the Decision Tree work without the Standard Scaler? If yes, how can I fix this?
Solution
A Decision Tree can work both with and without a Standard Scaler. The important thing to note here is that scaling the data won't affect the performance of a Decision Tree model, because the tree's splits depend only on the ordering of each feature's values, not on their scale.
If you are plotting the data afterwards, though, I imagine you don't want to plot the scaled data but rather the original data; hence your problem. On the unscaled data, EstimatedSalary spans roughly 15,000 to 150,000, so nm.arange(..., step=0.01) produces over ten million points on that axis alone, and the meshgrid of that against the Age axis needs billions of cells. On the scaled data, both axes span only a few units, so the grid stays small.
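To make the first point concrete, here is a small sketch (not from the original post, using synthetic data with made-up Age/Salary-like ranges) showing that standardizing the features leaves a Decision Tree's predictions unchanged:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)
# Synthetic Age- and EstimatedSalary-like columns (assumed ranges, not the real dataset)
X = rng.uniform(low=[18, 15000], high=[60, 150000], size=(200, 2))
y = (X[:, 1] > 70000).astype(int)

# Fit one tree on the raw data and one on standardized data
raw_tree = DecisionTreeClassifier(criterion='entropy', random_state=0).fit(X, y)
X_scaled = StandardScaler().fit_transform(X)
scaled_tree = DecisionTreeClassifier(criterion='entropy', random_state=0).fit(X_scaled, y)

# Standardization is monotonic, so the two trees partition the samples identically
same = bool((raw_tree.predict(X) == scaled_tree.predict(X_scaled)).all())
print(same)
```

So scaling (or not) is purely a plotting concern here, not a modelling one.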
The simplest solution I can think of is to pass sparse=True as an argument to numpy.meshgrid, as that seems to be what's throwing the error in your traceback. There's some detail on that in a past question here.
So applied to your question, that would mean you change this line:
nm.meshgrid(
nm.arange(start=x_set[:, 0].min() - 1, stop=x_set[:, 0].max() + 1, step=0.01),
nm.arange(start=x_set[:, 1].min() - 1, stop=x_set[:, 1].max() + 1, step=0.01),
)
to
nm.meshgrid(
nm.arange(start=x_set[:, 0].min() - 1, stop=x_set[:, 0].max() + 1, step=0.01),
nm.arange(start=x_set[:, 1].min() - 1, stop=x_set[:, 1].max() + 1, step=0.01),
sparse=True,
)
Answered By - osint_alex