Tuesday, February 1, 2022

[FIXED] Bad MSE while using Pipes

Labels: python, scikit-learn, svm

Issue

I'm trying to predict prices from a dataset that I scraped. I have never used Python for this (I usually use the tidyverse), but this time I wanted to explore scikit-learn pipelines. Here is the code snippet:

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.svm import SVR
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
import numpy as np

df = pd.read_csv("https://raw.githubusercontent.com/norhther/idealista/main/idealistaBCN.csv")
df.drop("info", axis = 1, inplace = True)
df["floor"].fillna(1, inplace=True)
df.drop("neigh", axis = 1, inplace = True)
df.dropna(inplace = True)
df = df[df["habs"] < 11]
X = df.drop("price", axis = 1)
y = df["price"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)
ct = ColumnTransformer(
   [("standardScaler", StandardScaler(), ["habs", "m2", "floor"]),
   ("onehot", OneHotEncoder(), ["type"]
    )], remainder="passthrough")

pipe = Pipeline(steps = [("Transformer", ct),
                          ("svr", SVR())])

param_grid = {
  "svr__kernel" : ['linear', 'poly', 'rbf', 'sigmoid'],
  "svr__degree" : range(3,6),
  "svr__gamma" : ['scale', 'auto'],
  "svr__coef0" : np.linspace(0.01, 1, 2)
}

search = GridSearchCV(pipe, param_grid,  scoring = ['neg_mean_squared_error'], refit='neg_mean_squared_error')

search.fit(X_train, y_train)
print(search.best_score_)

pipe = Pipeline(steps = [("Transformer", ct),
                          ("svr", SVR(coef0 = search.best_params_["svr__coef0"],
                                     degree = search.best_params_["svr__degree"],
                                     kernel = search.best_params_["svr__kernel"]))])

from sklearn.metrics import mean_squared_error

pipe.fit(X_train, y_train)
preds = pipe.predict(X_train)
mean_squared_error(preds, y_train)

Here search.best_score_ is -443829697806.1671 and the MSE is 608953977916.3896. I think I messed something up, maybe with the transformer, but I'm not completely sure. This MSE seems exaggerated: I took a fairly similar approach with tidymodels and got much better results. So I want to know whether something is wrong with the transformer, or whether the model is just this bad.

Solution

The reason is that you did not include C in the parameter grid; you need to search over a wide range of C values. If we fit with the default C = 1, you can see where the problem lies:

import matplotlib.pyplot as plt
o = pipe.named_steps["Transformer"].fit_transform(X_train)
mdl = SVR(C=1)
mdl.fit(o,y_train)
plt.scatter(mdl.predict(o),y_train)

Some price values are 10x the average (1e7 versus a median of 5e5). If you use MSE or R^2, the score will be dominated by these extreme values. So the model needs to follow the data more closely, and this is controlled by the regularization parameter C (a larger C penalizes training errors more heavily). We try a range:

ct = ColumnTransformer(
   [("standardScaler", StandardScaler(), ["habs", "m2", "floor"]),
   ("onehot", OneHotEncoder(), ["type"]
    )], remainder="passthrough")

pipe = Pipeline(steps = [("Transformer", ct),
                          ("svr", SVR())])

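# Only the rbf kernel is searched here; 'linear', 'poly' and 'sigmoid' are left out.
param_grid = {
  "svr__kernel" : ['rbf'],
  "svr__gamma" : ['auto'],
  "svr__coef0" : [1,2],  # no effect with the rbf kernel (only used by 'poly' and 'sigmoid')
  "svr__C" : [1e-03,1e-01,1e1,1e3,1e5,1e7]
}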

search = GridSearchCV(pipe, param_grid, scoring = ['neg_mean_squared_error'], 
refit='neg_mean_squared_error')

search.fit(X_train, y_train)
print(search.best_score_)
-132061065775.25969
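
To see why C matters so much on a price-scale target, here is a minimal sketch on synthetic data. The data, the sample size, and the two C values are made up for illustration and are not taken from the question's dataset. With an rbf kernel every kernel value lies between 0 and 1 and every dual coefficient is bounded by C, so with C = 1 the predictions can barely move away from the intercept when y is in the hundreds of thousands.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.svm import SVR

# Synthetic, price-like target (hypothetical numbers, not the idealista data).
rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(100, 1))
y = 5e5 + 2e5 * X[:, 0] + rng.normal(0, 2e4, size=100)

ratios = {}
for C in (1.0, 1e6):
    model = SVR(kernel="rbf", gamma="scale", C=C).fit(X, y)
    mse = mean_squared_error(y, model.predict(X))
    # Training MSE as a fraction of the variance of y
    # (a value near 1 means "no better than predicting a constant").
    ratios[C] = mse / np.var(y)
    print(f"C={C:g}: MSE / var(y) = {ratios[C]:.3f}")
```

The small-C model should come out close to a constant predictor, while the large-C model is free to track the trend in y.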

Your y values are large, and the MSE will be on the scale of the variance of y, so check that:

y_train.var()
545423126823.4545

132061065775.25969 / y_train.var()
0.24212590057261346
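
This ratio is closely related to R^2: for a given set of predictions, R^2 = 1 - MSE / Var(y) when Var(y) is the population variance (ddof=0). Note that pandas' .var() defaults to ddof=1, so the two differ slightly on finite samples. A minimal sketch with made-up numbers (the arrays below are hypothetical, not the question's data):

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical prices and predictions, for illustration only.
y_true = np.array([250_000.0, 400_000.0, 520_000.0, 900_000.0, 3_000_000.0])
y_pred = np.array([300_000.0, 380_000.0, 600_000.0, 1_000_000.0, 2_400_000.0])

mse = mean_squared_error(y_true, y_pred)
ratio = mse / np.var(y_true)          # np.var uses ddof=0 (population variance)
r2 = r2_score(y_true, y_pred)

print(f"MSE / var(y) = {ratio:.4f}")
print(f"1 - R^2      = {1 - r2:.4f}")  # same quantity as the line above

# A constant prediction at the mean has MSE equal to the variance, so its ratio is 1.
baseline = np.full_like(y_true, y_true.mean())
baseline_ratio = mean_squared_error(y_true, baseline) / np.var(y_true)
print(f"baseline ratio = {baseline_ratio:.4f}")
```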

A ratio of about 0.24 is reasonable: the model reduces the MSE to roughly 25% of the variance. We can check this on the test data, and I guess it is somewhat lucky here that the chosen C works well:

from sklearn.metrics import mean_squared_error

o = pipe.named_steps["Transformer"].fit_transform(X_train)
mdl = SVR(C=10000000.0, coef0=1, gamma='auto')
mdl.fit(o,y_train)

o_test = pipe.named_steps["Transformer"].transform(X_test)  # transform only: reuse the scaling and encoding fitted on X_train

pred = mdl.predict(o_test)
print( mean_squared_error(pred,y_test) , mean_squared_error(pred,y_test)/y_test.var())
plt.scatter(mdl.predict(o_test),y_test)
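
As an alternative to pulling the transformer out by hand, the fitted search object can predict directly: with refit enabled, GridSearchCV refits the best pipeline on the full training set, and search.predict applies the already-fitted transformer followed by the SVR. The sketch below uses a small synthetic DataFrame with the same column names as the question. The values, the sample size, and the reduced C grid are assumptions made so the example runs without downloading the CSV.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in for the scraped data (hypothetical values).
rng = np.random.default_rng(42)
n = 150
df = pd.DataFrame({
    "habs": rng.integers(1, 6, size=n),
    "m2": rng.uniform(40, 200, size=n),
    "floor": rng.integers(0, 8, size=n),
    "type": rng.choice(["flat", "duplex", "penthouse"], size=n),
})
df["price"] = 4000 * df["m2"] + 20000 * df["floor"] + rng.normal(0, 30000, size=n)

X = df.drop("price", axis=1)
y = df["price"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

ct = ColumnTransformer(
    [("standardScaler", StandardScaler(), ["habs", "m2", "floor"]),
     # handle_unknown="ignore" avoids errors if a category is missing from a CV fold
     ("onehot", OneHotEncoder(handle_unknown="ignore"), ["type"])],
    remainder="passthrough")
pipe = Pipeline(steps=[("Transformer", ct), ("svr", SVR(kernel="rbf", gamma="auto"))])

search = GridSearchCV(pipe, {"svr__C": [1e3, 1e5]},
                      scoring="neg_mean_squared_error", cv=3)
search.fit(X_train, y_train)

# predict() uses the best pipeline, refit on all of X_train; X_test is only transformed.
pred = search.predict(X_test)
test_mse = mean_squared_error(y_test, pred)
print(search.best_params_, test_mse, test_mse / y_test.var())
```

Keeping the transformer inside the pipeline means the scaler and encoder are never refit on test data, which is the same point as using transform rather than fit_transform on X_test.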

Answered By - StupidWolf

This Answer collected from stackoverflow and tested by PythonFixing community admins, is licensed under cc by-sa 2.5 , cc by-sa 3.0 and cc by-sa 4.0
