Issue
I'm trying to calculate the R-squared value after creating a model with sklearn's LinearRegression.
I'm simply:
- importing a CSV dataset
- filtering the columns of interest
- splitting the dataset into train and test sets
- creating the model
- making a prediction on the test set
- calculating R-squared to see how well the model fits the test dataset
The dataset is taken from https://www.kaggle.com/datasets/jeremylarcher/american-house-prices-and-demographics-of-top-cities
The code is as follows:
''' Let's verify whether there is a correlation between price and the number of beds and bathrooms '''
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
df = pd.read_csv('data/American_Housing_Data_20231209.csv')
df_interesting_columns = df[['Beds', 'Baths', 'Price']]
independent_variables = df_interesting_columns[['Beds', 'Baths']]
dependent_variable = df_interesting_columns[['Price']]
X_train, X_test, y_train, y_test = train_test_split(independent_variables, dependent_variable, test_size=0.2)
model = LinearRegression()
model.fit(X_train, y_train)
prediction = model.predict(X_test)
print(model.score(y_test, prediction))
but I get the error:
ValueError: The feature names should match those that were passed during fit.
Feature names unseen at fit time:
- Price
Feature names seen at fit time, yet now missing:
- Baths
- Beds
What am I doing wrong?
Solution
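Your last line is wrong: you misunderstood the score method. score takes X and y as parameters, not y_true and y_pred. Internally it calls predict on X and compares the result with y, so passing y_test as X makes the model see a Price column instead of Beds and Baths, which is exactly what the error reports.
Try: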
from sklearn.metrics import r2_score
print(r2_score(y_test, prediction))
# 0.24499127100887863
Or with the score method:
print(model.score(X_test, y_test))
# 0.24499127100887863
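To see why both calls print the same number, here is a minimal, self-contained sketch. It does not use the Kaggle file: the Beds, Baths and price values are synthetic and the coefficients are made up purely for illustration. It checks that model.score(X_test, y_test) matches r2_score(y_test, prediction), and that both agree with the textbook definition 1 - SS_res / SS_tot.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the housing data (hypothetical values, not the real dataset).
rng = np.random.default_rng(0)
n = 200
beds = rng.integers(1, 6, size=n)
baths = rng.integers(1, 4, size=n)
X = np.column_stack([beds, baths])
y = 50_000 * beds + 30_000 * baths + rng.normal(0, 40_000, size=n)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = LinearRegression().fit(X_train, y_train)
prediction = model.predict(X_test)

# Option 1: compare true targets with predictions directly.
r2_from_metric = r2_score(y_test, prediction)

# Option 2: let the model predict from X_test and score against y_test.
r2_from_score = model.score(X_test, y_test)

# The definition both options implement: 1 - SS_res / SS_tot.
ss_res = np.sum((y_test - prediction) ** 2)
ss_tot = np.sum((y_test - y_test.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(r2_from_metric, r2_from_score, r2_manual)
```

The three printed values should be equal up to floating-point rounding. Note that the original snippet calls train_test_split without random_state, so the R-squared shown above will vary from run to run.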
Answered By - Corralien