Issue
I'm trying to predict some prices from a dataset that I scraped. I have never used Python for this (I usually use tidyverse), but this time I wanted to explore pipelines.
So here is the code snippet:
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.svm import SVR
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
import numpy as np
df = pd.read_csv("https://raw.githubusercontent.com/norhther/idealista/main/idealistaBCN.csv")
df.drop("info", axis = 1, inplace = True)
df["floor"].fillna(1, inplace=True)
df.drop("neigh", axis = 1, inplace = True)
df.dropna(inplace = True)
df = df[df["habs"] < 11]
X = df.drop("price", axis = 1)
y = df["price"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)
ct = ColumnTransformer(
[("standardScaler", StandardScaler(), ["habs", "m2", "floor"]),
("onehot", OneHotEncoder(), ["type"]
)], remainder="passthrough")
pipe = Pipeline(steps = [("Transformer", ct),
("svr", SVR())])
param_grid = {
"svr__kernel" : ['linear', 'poly', 'rbf', 'sigmoid'],
"svr__degree" : range(3,6),
"svr__gamma" : ['scale', 'auto'],
"svr__coef0" : np.linspace(0.01, 1, 2)
}
search = GridSearchCV(pipe, param_grid, scoring = ['neg_mean_squared_error'], refit='neg_mean_squared_error')
search.fit(X_train, y_train)
print(search.best_score_)
pipe = Pipeline(steps = [("Transformer", ct),
("svr", SVR(coef0 = search.best_params_["svr__coef0"],
degree = search.best_params_["svr__degree"],
kernel = search.best_params_["svr__kernel"]))])
from sklearn.metrics import mean_squared_error
pipe.fit(X_train, y_train)
preds = pipe.predict(X_train)
mean_squared_error(preds, y_train)
And search.best_score_ here is -443829697806.1671, and the MSE is 608953977916.3896.
I think I messed something up, maybe with the transformer, but I'm not completely sure. I think this is an exaggerated MSE. I took a fairly similar approach with tidymodels and got much better results.
So I wanted to know whether there is something wrong with the transformer, or whether the model is just this bad.
Solution
The reason is that you did not include C in the parameter grid; you need to search over a wide range of C values. If we fit with the default C = 1, you can see where the problem lies:
import matplotlib.pyplot as plt
o = pipe.named_steps["Transformer"].fit_transform(X_train)
mdl = SVR(C=1)
mdl.fit(o,y_train)
plt.scatter(mdl.predict(o),y_train)
There are some price values that are more than 10x the typical value (1e7 versus a median of 5e5). If you use MSE or R^2, the score will be dominated by these extreme values. So the model needs to follow the data more closely, and this is controlled by the regularization parameter C. We try a range:
ct = ColumnTransformer(
[("standardScaler", StandardScaler(), ["habs", "m2", "floor"]),
("onehot", OneHotEncoder(), ["type"]
)], remainder="passthrough")
pipe = Pipeline(steps = [("Transformer", ct),
("svr", SVR())])
#, 'poly', 'rbf', 'sigmoid'
param_grid = {
"svr__kernel" : ['rbf'],
"svr__gamma" : ['auto'],
"svr__coef0" : [1,2],
"svr__C" : [1e-03,1e-01,1e1,1e3,1e5,1e7]
}
search = GridSearchCV(pipe, param_grid, scoring = ['neg_mean_squared_error'],
refit='neg_mean_squared_error')
search.fit(X_train, y_train)
print(search.best_score_)
-132061065775.25969
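To illustrate why C matters so much at this price scale, here is a minimal sketch on synthetic data (not the idealista dataset; the feature, target scale, and noise level are assumptions chosen to resemble it). In an SVR each dual coefficient is bounded by C and the RBF kernel is at most 1, so with n training points the predictions cannot move more than n * C away from the intercept. With C = 1 and prices around 5e5, the model is therefore forced to predict an almost constant value.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.svm import SVR

rng = np.random.default_rng(0)
n = 150

# Synthetic stand-in for the real data (assumption): one standardized
# feature and a price-like target centred around 5e5.
x = rng.normal(size=(n, 1))
y = 5e5 + 3e5 * x[:, 0] + rng.normal(scale=2e4, size=n)

results = {}
for C in (1.0, 1e6):
    model = SVR(kernel="rbf", gamma="auto", C=C).fit(x, y)
    pred = model.predict(x)
    results[C] = {
        "mse": mean_squared_error(y, pred),
        # Range of the predictions: max(pred) - min(pred)
        "spread": np.ptp(pred),
    }
    print(f"C={C:g}  MSE/var={results[C]['mse'] / y.var():.3f}  "
          f"prediction spread={results[C]['spread']:.1f}")
```

With the small C the prediction spread stays tiny compared with the spread of y, which is the same pattern as the flat scatter plot from the default SVR above.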
Your y values are large, and the MSE will be on the order of the variance of your y values, so we check that:
y_train.var()
545423126823.4545
132061065775.25969 / y_train.var()
0.24212590057261346
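One way to read this ratio: a model that always predicts the mean has an MSE equal to the variance of y, so MSE / variance is the fraction of variance left unexplained, that is 1 - R^2. The sketch below checks this on synthetic numbers (the lognormal "prices" and the noisy predictions are assumptions for illustration only).

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

rng = np.random.default_rng(0)

# Synthetic, right-skewed "prices" and some imperfect predictions (assumption)
y_true = rng.lognormal(mean=13.0, sigma=0.6, size=1000)
y_pred = y_true + rng.normal(scale=0.5 * y_true.std(), size=y_true.size)

# Baseline: always predict the mean. Its MSE is the population variance.
baseline = np.full_like(y_true, y_true.mean())
mse_baseline = mean_squared_error(y_true, baseline)

# MSE relative to the variance is the unexplained fraction, 1 - R^2.
mse_model = mean_squared_error(y_true, y_pred)
ratio = mse_model / np.var(y_true)  # np.var uses ddof=0
r2 = r2_score(y_true, y_pred)

print(mse_baseline, np.var(y_true))
print(ratio, 1 - r2)
```

Note that pandas' `Series.var()` defaults to ddof=1 while `np.var` defaults to ddof=0, so a ratio computed with `y_train.var()` differs from 1 - R^2 by a factor of (n - 1) / n, which is negligible for a dataset of this size.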
A ratio of about 0.24 is pretty OK: the MSE is reduced to about 25% of the variance. We can check this on the test data, and I guess in this case we are quite lucky that the C values are pretty OK:
from sklearn.metrics import mean_squared_error
# Fit the transformer on the training data only
o = pipe.named_steps["Transformer"].fit_transform(X_train)
mdl = SVR(C=10000000.0, coef0=1, gamma='auto')
mdl.fit(o,y_train)
# Reuse the fitted transformer on the test data (transform, not fit_transform)
o_test = pipe.named_steps["Transformer"].transform(X_test)
pred = mdl.predict(o_test)
# Test MSE, and test MSE as a fraction of the test variance
print( mean_squared_error(y_test,pred) , mean_squared_error(y_test,pred)/y_test.var())
plt.scatter(pred,y_test)
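As a possible shortcut, when `refit` is set GridSearchCV already refits the whole pipeline with the best parameters on the full training set, so `search.best_estimator_` can be used for prediction directly. That avoids rebuilding the SVR by hand and accidentally dropping a tuned parameter. The sketch below uses a small synthetic DataFrame with the same column names; the values, the `type` categories, and the reduced grid are assumptions made so that the example runs on its own.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.svm import SVR

rng = np.random.default_rng(42)
n = 200

# Synthetic stand-in for the scraped data (assumption, not the real dataset)
demo = pd.DataFrame({
    "habs": rng.integers(1, 6, size=n),
    "m2": rng.uniform(30, 200, size=n),
    "floor": rng.integers(0, 10, size=n),
    "type": rng.choice(["piso", "atico", "duplex"], size=n),
})
demo["price"] = 3000 * demo["m2"] + 20000 * demo["habs"] + rng.normal(0, 20000, size=n)

X_train, X_test, y_train, y_test = train_test_split(
    demo.drop("price", axis=1), demo["price"], test_size=0.2, random_state=42
)

ct = ColumnTransformer([
    ("standardScaler", StandardScaler(), ["habs", "m2", "floor"]),
    # handle_unknown="ignore" keeps transform from failing on unseen categories
    ("onehot", OneHotEncoder(handle_unknown="ignore"), ["type"]),
])
pipe = Pipeline(steps=[("Transformer", ct), ("svr", SVR(kernel="rbf", gamma="auto"))])

search = GridSearchCV(
    pipe, {"svr__C": [1e3, 1e5]}, scoring="neg_mean_squared_error", cv=3
)
search.fit(X_train, y_train)

# best_estimator_ is the full pipeline, already refit on all of X_train,
# so the transformer is applied to X_test with the training-set statistics.
best_pipe = search.best_estimator_
pred = best_pipe.predict(X_test)
test_mse = mean_squared_error(y_test, pred)
print(search.best_params_, test_mse / y_test.var())
```

Calling `search.predict(X_test)` would give the same predictions, since it delegates to the refit best estimator.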
Answered By - StupidWolf