Sunday, December 5, 2021

[FIXED] How to print summary of results for Multiple linear regression model (r2, etc) - Statsmodels vs SciKitLearn

Tags: numpy, pandas, python, scikit-learn

Issue

I have created a simple multiple linear regression model and would like to print the model summary, i.e. the OLS/regression summary.

However, I'm unsure whether I should use the scikit-learn or the statsmodels library, as I found other posts and YouTube videos that use both. Any explanation behind your choice would also be appreciated.

Full code:

# Importing the libraries
import numpy as np
import pandas as pd

# Importing the dataset
dataset = pd.read_csv('50_Startups.csv')
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values


# Encoding categorical data
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [3])], remainder='passthrough')
X = np.array(ct.fit_transform(X))


# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)


# Training the Multiple Linear Regression model on the Training set
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)

# Predicting the Test set results
y_pred = regressor.predict(X_test)
np.set_printoptions(precision=2)
predictions = (np.concatenate((y_pred.reshape(len(y_pred),1), y_test.reshape(len(y_test),1)),1))

First 20 rows of the dataset:

    R&D Spend  Administration  Marketing Spend       State     Profit
0   165349.20       136897.80        471784.10    New York  192261.83
1   162597.70       151377.59        443898.53  California  191792.06
2   153441.51       101145.55        407934.54     Florida  191050.39
3   144372.41       118671.85        383199.62    New York  182901.99
4   142107.34        91391.77        366168.42     Florida  166187.94
5   131876.90        99814.71        362861.36    New York  156991.12
6   134615.46       147198.87        127716.82  California  156122.51
7   130298.13       145530.06        323876.68     Florida  155752.60
8   120542.52       148718.95        311613.29    New York  152211.77
9   123334.88       108679.17        304981.62  California  149759.96
10  101913.08       110594.11        229160.95     Florida  146121.95
11  100671.96        91790.61        249744.55  California  144259.40
12   93863.75       127320.38        249839.44     Florida  141585.52
13   91992.39       135495.07        252664.93  California  134307.35
14  119943.24       156547.42        256512.92     Florida  132602.65
15  114523.61       122616.84        261776.23    New York  129917.04
16   78013.11       121597.55        264346.06  California  126992.93
17   94657.16       145077.58        282574.31    New York  125370.37
18   91749.16       114175.79        294919.57     Florida  124266.90
19   86419.70       153514.11             0.00    New York  122776.86

Solution

sklearn does not provide a summary for an OLS model. You will need to use statsmodels and then call the summary() method on the result returned by the OLS model's fit() method. You can see more in the statsmodels docs.

If you need R^2 for your sklearn OLS model, use sklearn.metrics.r2_score and pass it the true values and your predicted values, like so:

r2_score(y_true, y_pred)

Here y_true holds the true values of the data and y_pred holds the predicted values from your OLS model. More information on this function can be found in the sklearn docs.
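
For illustration, here is a small self-contained sketch of computing R^2 on a held-out test set with sklearn. The data is synthetic and hypothetical, used only so the snippet runs on its own. In the question's code, y_test and y_pred would take the place of the arrays generated here.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Hypothetical synthetic data standing in for the question's dataset
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = 1.0 + X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=50)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

regressor = LinearRegression().fit(X_train, y_train)
y_pred = regressor.predict(X_test)

# Argument order matters: true values first, predictions second
r2 = r2_score(y_test, y_pred)

# LinearRegression.score returns the same R^2 without a separate predict call
r2_alt = regressor.score(X_test, y_test)

print(r2, r2_alt)
```

Both calls report R^2 on the test set, which can differ from the training-set R^2 that a statsmodels summary shows.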

Answered By - Matthew Barlowe

This answer was collected from Stack Overflow and tested by PythonFixing community admins. It is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
