Issue
I have a dataset that shows whether a person has diabetes based on a set of indicators. It looks like this (original dataset):
I've created a straightforward model in order to predict the last column (Outcome).
#Libraries imported
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
#Dataset imported
data = pd.read_csv('diabetes.csv')
#Assign X and y
X = data.iloc[:,:-1].values
y = data.iloc[:,-1].values
#Data preprocessed
sc = StandardScaler()
X = sc.fit_transform(X)
#Dataset split between train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
model = LogisticRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
# Predicting the results for the whole dataset
y_pred2 = model.predict(data)
#Add prediction column to original dataset
data['prediction'] = y_pred2
However, I get the following error: ValueError: X has 9 features per sample; expecting 8.
My questions are:
- Why can't I create a new column with the predictions for my entire dataset?
- How can I make predictions for rows with a blank Outcome (rows that still need to be predicted)? Should I upload the file again? Let's say I want to predict the following:
Rows to predict:
Please let me know if my questions are clear!
Solution
You are feeding data (with all 9 initial features) to a model that was trained with X (8 features, since Outcome has been removed to create y), hence the error.
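To see the mismatch directly, you can compare the column counts. The following is only a minimal sketch: it uses a synthetic stand-in for diabetes.csv (the column names f0..f7 and the random values are made up for illustration), but the shapes mirror the ones in the question.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for diabetes.csv (assumption): 8 indicator columns + Outcome
rng = np.random.default_rng(0)
data = pd.DataFrame(rng.normal(size=(100, 8)), columns=[f"f{i}" for i in range(8)])
data["Outcome"] = (data["f0"] + data["f1"] > 0).astype(int)

# Same slicing as in the question
X = data.iloc[:, :-1].values   # every column except the last
y = data.iloc[:, -1].values    # the last column (Outcome)

model = LogisticRegression().fit(X, y)

n_data_cols = data.shape[1]     # columns in the full DataFrame, Outcome included
n_model_features = X.shape[1]   # columns the model was actually fitted on

print(n_data_cols, n_model_features)
```

The full DataFrame has one column more than X, which is exactly the difference the ValueError complains about.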
What you need to do is:
- Get predictions using X instead of data
- Append the predictions to your initial data set
i.e.:
y_pred2 = model.predict(X)
data['prediction'] = y_pred2
Keep in mind that this means your prediction variable will come both from data that have already been used for model fitting (i.e. the X_train part) and from data unseen by the model during training (the X_test part). I'm not quite sure what your final objective is (nor is that what the question is about), but this is a rather unusual situation from an ML point of view.
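If you do want predictions for every row, one option is to keep track of which rows the model has already seen, so the two groups can be judged separately. This is only a sketch under assumptions: the data are synthetic, and the names idx, seen, train_acc and test_acc are introduced here for illustration and are not part of the question's code.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic data (assumption): 200 rows, 8 features, noisy binary target
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))
y = (X[:, 0] + X[:, 1] + rng.normal(scale=0.5, size=200) > 0).astype(int)

# Split the row indices together with X and y so we know where each row went
idx = np.arange(len(X))
X_train, X_test, y_train, y_test, idx_train, idx_test = train_test_split(
    X, y, idx, test_size=0.2, random_state=0
)

model = LogisticRegression().fit(X_train, y_train)

# Accuracy on rows used for fitting vs. rows the model has never seen
train_acc = accuracy_score(y_train, model.predict(X_train))
test_acc = accuracy_score(y_test, model.predict(X_test))

# Boolean flag per original row: True if the row was part of the training set
seen = np.zeros(len(X), dtype=bool)
seen[idx_train] = True

print(train_acc, test_acc, seen.sum())
```

Only the accuracy on the unseen rows says anything about how the model is likely to behave on new data; the flag can be stored next to the prediction column if you need to tell the rows apart later.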
If you have a new dataset data_new for which to predict the outcome, you do it in a similar way, always assuming that X_new has the same features as X (i.e. again removing the Outcome column, as you have done with X):
y_new = model.predict(X_new)
data_new['prediction'] = y_new
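One detail worth spelling out: since X was standardized with sc before fitting, X_new should go through the same fitted scaler (transform, not fit_transform) before calling predict. Below is a rough end-to-end sketch; the data are synthetic and the column names are placeholders, so treat it as an illustration rather than a drop-in for your file.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the labelled data (assumption): 8 indicators + Outcome
rng = np.random.default_rng(0)
cols = [f"f{i}" for i in range(8)]
data = pd.DataFrame(rng.normal(loc=5.0, scale=2.0, size=(100, 8)), columns=cols)
data["Outcome"] = (data["f0"] > 5.0).astype(int)

# Fit the scaler and the model on the labelled rows
sc = StandardScaler()
X = sc.fit_transform(data[cols].values)
y = data["Outcome"].values
model = LogisticRegression().fit(X, y)

# New rows whose Outcome is blank and needs to be predicted
data_new = pd.DataFrame(rng.normal(loc=5.0, scale=2.0, size=(5, 8)), columns=cols)
data_new["Outcome"] = np.nan

# Drop Outcome and reuse the already-fitted scaler (transform only)
X_new = sc.transform(data_new[cols].values)
data_new["prediction"] = model.predict(X_new)

print(data_new[["Outcome", "prediction"]])
```

Calling fit_transform on the new rows instead would rescale them with their own mean and standard deviation, which no longer matches what the model saw during training.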
Answered By - desertnaut