Issue
Here is my code:
# Drop rows with NaN values from the training and testing data
training_data.dropna(inplace=True, axis=0)
testing_data.dropna(inplace=True, axis=0)
# Perform one-hot encoding on the categorical features
features = ['HomePlanet', 'Destination', 'CryoSleep', 'VIP' ]
X= pd.get_dummies(training_data[features]).astype(int)
y = pd.get_dummies(training_data.Transported).astype(int)
x_test = testing_data[features]
# Creating my model
X_train, X_test, y_train, y_test = train_test_split(X,y, train_size=0.6, test_size=0.4, random_state=42)
rt_model = RandomForestRegressor()
rt_model.fit(X_train,y_train)
predictions = rt_model.predict(X_test)
# Save the csv
output = pd.DataFrame({'PassengerId': testing_data.PassengerId, 'Transported': predictions})
output.to_csv('submission.csv', index=False)
print("Your submission was successfully saved!")
When I print the lengths of X, y and X_train, y_train after the train-test split, I get:
6606 6606
3963 3963
2643 2643
I tried reshaping X and y.
I tried performing one-hot encoding on my x_test dataframe.
I tried using the iloc method on my array.
The problem only comes from the last part, when I try to save the result as a csv.
Solution
From your first two lines, I assume that you already have testing data provided. There is no need to split the training data into additional testing data.
Therefore, your predictions should run on the provided test data x_test, not on the split X_test. Note that Python is case-sensitive, and naming variables like this is confusing and risks mixing them up.
Because you use X_test, predictions is an array with a different length than your testing_data. You therefore get a length mismatch when you create a DataFrame from testing_data and predictions and try to save this DataFrame.
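To make the mismatch concrete, here is a minimal sketch with made-up sizes (5 test rows and 3 predictions are assumptions for illustration, not the numbers from the question). It only shows that pandas refuses to combine a column and an array of different lengths:

```python
import numpy as np
import pandas as pd

# Hypothetical sizes: 5 rows in the provided test file, but only 3 predictions,
# as happens when predict() is called on a split of the training data.
testing_data = pd.DataFrame({"PassengerId": ["a", "b", "c", "d", "e"]})
predictions = np.zeros(3)

try:
    pd.DataFrame({"PassengerId": testing_data.PassengerId,
                  "Transported": predictions})
    mismatch_detected = False
except ValueError as err:
    # pandas raises ValueError because the array length does not match the index
    mismatch_detected = True
    print("Could not build the DataFrame:", err)
```

The error is raised while building the DataFrame, so to_csv is never reached.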
So using
predictions = rt_model.predict(x_test) # lowercase x
should work, provided x_test is one-hot encoded the same way as X. I would change the code further, though, and get rid of the additional split of the data, since it throws away training data.
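A rough sketch of what the code could look like without the extra split is shown here. The toy rows are invented stand-ins for the real files, and two details are my own assumptions rather than part of the original code: the target is kept as a single 1-D column instead of being passed through get_dummies, and the encoded test columns are aligned to the training columns with reindex.

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Toy stand-ins for the real training and testing files (hypothetical values)
training_data = pd.DataFrame({
    "PassengerId": ["0001_01", "0002_01", "0003_01", "0004_01"],
    "HomePlanet": ["Earth", "Europa", "Mars", "Earth"],
    "Destination": ["TRAPPIST-1e", "55 Cancri e", "TRAPPIST-1e", "PSO J318.5-22"],
    "CryoSleep": [False, True, False, True],
    "VIP": [False, False, True, False],
    "Transported": [False, True, False, True],
})
testing_data = pd.DataFrame({
    "PassengerId": ["0005_01", "0006_01", "0007_01"],
    "HomePlanet": ["Earth", "Mars", "Europa"],
    "Destination": ["TRAPPIST-1e", "TRAPPIST-1e", "55 Cancri e"],
    "CryoSleep": [True, False, False],
    "VIP": [False, False, False],
})

features = ["HomePlanet", "Destination", "CryoSleep", "VIP"]

# Train on all of the training data, with a 1-D target (assumption, see above)
X = pd.get_dummies(training_data[features]).astype(int)
y = training_data["Transported"].astype(int)

# Encode the provided test data the same way and align it to the training
# columns, so categories missing from the test set become all-zero columns
x_test = pd.get_dummies(testing_data[features]).astype(int)
x_test = x_test.reindex(columns=X.columns, fill_value=0)

rt_model = RandomForestRegressor(random_state=42)
rt_model.fit(X, y)
predictions = rt_model.predict(x_test)  # one prediction per test row

output = pd.DataFrame({"PassengerId": testing_data.PassengerId,
                       "Transported": predictions})
# output.to_csv("submission.csv", index=False)  # uncomment to write the file
```

Because predictions now has one entry per row of testing_data, the DataFrame can be built and saved. Note that a regressor returns numeric scores rather than True/False values, so the Transported column here holds floats.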
Answered By - Oskar Hofmann