Issue
I have the created a simple multiple linear regression model and would like to print the model summary - ei the OLS/regression summary.
However, I'm unsure if I should use scikitlearn or the statsmodels libraries as I found other posts/youtube videos that use both. Any explanation behind your choice would also be appreciated.
full code
# Importing the libraries
import numpy as np
import pandas as pd
# Importing the dataset
dataset = pd.read_csv('50_Startups.csv')
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values
# Encoding categorical data
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [3])], remainder='passthrough')
X = np.array(ct.fit_transform(X))
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
# Training the Multiple Linear Regression model on the Training set
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)
# Predicting the Test set results
y_pred = regressor.predict(X_test)
np.set_printoptions(precision=2)
predictions = (np.concatenate((y_pred.reshape(len(y_pred),1), y_test.reshape(len(y_test),1)),1))
df
R&D Spend Administration Marketing Spend State Profit
0 165349.20 136897.80 471784.10 New York 192261.83
1 162597.70 151377.59 443898.53 California 191792.06
2 153441.51 101145.55 407934.54 Florida 191050.39
3 144372.41 118671.85 383199.62 New York 182901.99
4 142107.34 91391.77 366168.42 Florida 166187.94
5 131876.90 99814.71 362861.36 New York 156991.12
6 134615.46 147198.87 127716.82 California 156122.51
7 130298.13 145530.06 323876.68 Florida 155752.60
8 120542.52 148718.95 311613.29 New York 152211.77
9 123334.88 108679.17 304981.62 California 149759.96
10 101913.08 110594.11 229160.95 Florida 146121.95
11 100671.96 91790.61 249744.55 California 144259.40
12 93863.75 127320.38 249839.44 Florida 141585.52
13 91992.39 135495.07 252664.93 California 134307.35
14 119943.24 156547.42 256512.92 Florida 132602.65
15 114523.61 122616.84 261776.23 New York 129917.04
16 78013.11 121597.55 264346.06 California 126992.93
17 94657.16 145077.58 282574.31 New York 125370.37
18 91749.16 114175.79 294919.57 Florida 124266.90
19 86419.70 153514.11 0.00 New York 122776.86
Solution
There is no summary of an OLS model in sklearn
you will need to use statsmodel
and then call the summary()
method on the output of the OLS
model fit()
method. You can see more in the docs here
If you need R^2 for your sklearn
OLS model you will need to use the sklearn.meterics.r2_score
and pass it your predicted values to compare against the true values like so:
r2_score(y_true, y_pred)
With y_true
being the true values of the data and y_pred
being the predicted values from your OLS model. More info can be found on this method in the sklearn
docs here
Answered By - Matthew Barlowe
0 comments:
Post a Comment
Note: Only a member of this blog may post a comment.