Issue
I am running a linear regression on a data frame that ends at the end of January 2021. The target variable is a monthly average, so the model will predict the month of February.
I have the data ending at the end of January and at the end of February in separate datasets. I want to train the model on the January data and then compare its predictions to the data frame that ends at the end of February.
To do this, do I need to merge the target column (from the February data frame) into the January data frame and run the model like this:
january.drop('january_avg_colum', axis=1, inplace=True)
df = pd.merge(january, february[['ID', 'february_avg_colum']], how="inner", on=["ID", "ID"])
X = df.drop('february_avg_colum', axis=1)
y = df['february_avg_colum']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
model = LinearRegression().fit(X_train, y_train)
y_preds = model.predict((X_test))
print('RMSE:', metrics.mean_squared_error(y_test, y_preds, squared=False))
Do I need to drop the January average before merging? Is this the correct way to go about this? Is there a simpler or more efficient way? Any help is much appreciated!
Solution
If you already know that you want to train your model on January and test it on February, then no split is necessary: you already have your training and test datasets ready.
You would want to split your training data into virtual training and test datasets if you had to fine-tune the parameters of your model or compare several models. In that case it would be useful to do a train_test_split on your training data or, even better, to do several splits and pick the best model and parameters across all those runs, still without having seen the February data (this is very important).
Here, LinearRegression has no hyperparameters that need tuning, so there is no real need to split or shuffle anything.
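If tuning were needed, the "several splits" idea could be sketched roughly as follows. This is only an illustration under assumptions: the data is synthetic, the feature names (feature_a, feature_b) and the Ridge candidates are hypothetical and not part of the question, and 5-fold cross-validation is just one common way to do several splits. The point is that only January data is used to choose between models.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for the January data (hypothetical feature columns).
rng = np.random.default_rng(0)
january = pd.DataFrame({
    "feature_a": rng.normal(size=200),
    "feature_b": rng.normal(size=200),
})
january["january_info_colum"] = (
    2.0 * january["feature_a"] - january["feature_b"]
    + rng.normal(scale=0.1, size=200)
)

X_jan = january.drop("january_info_colum", axis=1)
y_jan = january["january_info_colum"]

# Five different train/validation splits of the January data only.
cv = KFold(n_splits=5, shuffle=True, random_state=0)

# Hypothetical candidates to compare; February is never touched here.
candidates = {
    "linear": LinearRegression(),
    "ridge_alpha_1": Ridge(alpha=1.0),
    "ridge_alpha_100": Ridge(alpha=100.0),
}

cv_rmse = {}
for name, estimator in candidates.items():
    # scikit-learn maximises scores, so it reports the negated MSE per fold.
    neg_mse = cross_val_score(estimator, X_jan, y_jan, cv=cv,
                              scoring="neg_mean_squared_error")
    # Root of the mean MSE across the five folds.
    cv_rmse[name] = float(np.sqrt(-neg_mse.mean()))

best_name = min(cv_rmse, key=cv_rmse.get)
print(cv_rmse, "-> best:", best_name)
```

The selected candidate would then be refit on all of January and evaluated once on February.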
Now let me show you what your code does, and then how it would look if we translate what you want to achieve into code:
What you are doing here
- Remove any existence of the target values for January:
january.drop('january_info_colum', axis=1, inplace=True)
- Merge February targets to January features:
df = pd.merge(january, february[['ID', 'february_info_colum']],
how="inner", on=["ID", "ID"])
- Split this dataset in 80% train/20% test:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
- Fit and predict:
model = LinearRegression().fit(X_train, y_train)
y_preds = model.predict((X_test))
- Print the metric:
print('RMSE:', metrics.mean_squared_error(y_test, y_preds, squared=False))
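To make the merge step concrete, here is a tiny sketch with made-up rows (the values and the feature_a column are hypothetical). It shows that an inner merge keeps only the IDs present in both months and pairs January features with February targets. A single key, on="ID", is used here.

```python
import pandas as pd

# Made-up January features and February data sharing some IDs.
january = pd.DataFrame({"ID": [1, 2, 3], "feature_a": [0.5, 1.5, 2.5]})
february = pd.DataFrame({
    "ID": [2, 3, 4],
    "feature_a": [1.0, 2.0, 3.0],
    "february_info_colum": [10.0, 20.0, 30.0],
})

# Inner merge: only IDs found in both frames survive (here 2 and 3).
df = pd.merge(january, february[["ID", "february_info_colum"]],
              how="inner", on="ID")
print(df)
```

With this layout, the later train_test_split holds out a random 20% of the merged rows, so both the training and the test part mix January features with February targets.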
What corresponds to the goal you describe
- Train on January features and targets:
model = LinearRegression().fit(january.drop('january_info_colum', axis=1), january['january_info_colum'])
- Predict on February features:
y_preds = model.predict(february.drop('february_info_colum', axis=1))
- Print the metric:
print('RMSE:', metrics.mean_squared_error(february['february_info_colum'], y_preds, squared=False))
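Put together, the three steps above could look like the following self-contained sketch. Several things here are assumptions rather than facts from the question: the data is synthetic, the feature columns (feature_a, feature_b) are hypothetical, both months are assumed to share the same feature columns, and ID is dropped because an identifier is rarely a useful predictor. The RMSE is computed as the square root of the MSE, since the squared argument has been deprecated in recent scikit-learn releases.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(42)

def make_month(target_name, n_rows=300):
    """Build a synthetic month: an ID, two features, and a noisy linear target."""
    frame = pd.DataFrame({
        "ID": np.arange(n_rows),
        "feature_a": rng.normal(size=n_rows),
        "feature_b": rng.normal(size=n_rows),
    })
    frame[target_name] = (
        3.0 * frame["feature_a"] + 0.5 * frame["feature_b"]
        + rng.normal(scale=0.2, size=n_rows)
    )
    return frame

january = make_month("january_info_colum")
february = make_month("february_info_colum")

# Assumption: ID is an identifier, not a predictor, so it is dropped as well.
X_jan = january.drop(["ID", "january_info_colum"], axis=1)
y_jan = january["january_info_colum"]
X_feb = february.drop(["ID", "february_info_colum"], axis=1)
y_feb = february["february_info_colum"]

# The feature columns should match between fitting and predicting.
assert list(X_jan.columns) == list(X_feb.columns)

# Train on January only, then predict on the unseen February features.
model = LinearRegression().fit(X_jan, y_jan)
y_preds = model.predict(X_feb)

# Square root of the MSE, so the code does not rely on the `squared` argument.
rmse = float(np.sqrt(mean_squared_error(y_feb, y_preds)))
print("RMSE:", rmse)
```

Because February is used only for the final evaluation, the reported RMSE reflects how the January model performs on a month it has never seen.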
Answered By - Guillaume Ansanay-Alex