Wednesday, February 2, 2022

[FIXED] Predict And Compare Data From Different Months

Tags: jupyter-notebook, numpy, pandas, python, regression

Issue

I am running a linear regression on a data frame that ends at the end of January 2021. The target variable is a monthly average, so the model will predict the value for February.

I have the data ending in January and the data ending in February in separate datasets. I want to train the model on the January data and then compare its predictions to the data frame that ends at the end of February.

To do this, do I need to merge the target column (from the February data frame) into the January data frame and run the model like this:

import pandas as pd
from sklearn import metrics
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Drop the January target, then attach the February target by ID
january.drop('january_avg_colum', axis=1, inplace=True)
df = pd.merge(january, february[['ID', 'february_avg_colum']], how="inner", on="ID")

X = df.drop('february_avg_colum', axis=1)
y = df['february_avg_colum']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

model = LinearRegression().fit(X_train, y_train)

y_preds = model.predict(X_test)

print('RMSE:', metrics.mean_squared_error(y_test, y_preds, squared=False))

Do I need to drop the January average before merging? Is this the correct way to go about this? Is there a simpler or more efficient way? Any help is much appreciated!

Solution

If you already know that you want to train your model on January and test it on February, then no split is necessary: your training and test datasets are already separate.

A split of the training data would be useful if you had to tune the parameters of your model or compare several models. In that case you would run train_test_split on the January data only, or, even better, run several splits and pick the best model and parameters across all of those runs. The important point is that none of this ever looks at the February data.
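As an illustration of "several splits on the training data only", here is a minimal sketch using 5-fold cross-validation. The data is synthetic, the feature columns `feature_a` and `feature_b` are hypothetical, and `Ridge` is included only as an example of a second candidate model; none of these come from the question.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for the January data frame (hypothetical feature columns).
rng = np.random.default_rng(0)
january = pd.DataFrame({
    "feature_a": rng.normal(size=200),
    "feature_b": rng.normal(size=200),
})
january["january_info_colum"] = (
    2.0 * january["feature_a"]
    - january["feature_b"]
    + rng.normal(scale=0.1, size=200)
)

X_jan = january.drop(columns="january_info_colum")
y_jan = january["january_info_colum"]

# Several splits of the January data only; February is never touched here.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
candidates = {"linear": LinearRegression(), "ridge": Ridge(alpha=1.0)}

cv_rmse = {}
for name, estimator in candidates.items():
    # scikit-learn reports negated MSE so that "higher is better" holds.
    scores = cross_val_score(
        estimator, X_jan, y_jan, cv=cv, scoring="neg_mean_squared_error"
    )
    # Square root of the mean MSE across folds, used as a summary RMSE.
    cv_rmse[name] = float(np.sqrt(-scores.mean()))

best_name = min(cv_rmse, key=cv_rmse.get)
print(cv_rmse, "best:", best_name)
```

Whichever candidate wins would then be refit on all of January and evaluated once on February.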

Here, LinearRegression has no hyperparameters that need tuning, so there is no real need to split or shuffle anything.

Let me first show what your code does, and then what the goal you describe looks like in code.

What you are doing here

Drop the January target column:

january.drop('january_info_colum', axis=1, inplace=True)

Merge February targets to January features:

df = pd.merge(january, february[['ID', 'february_info_colum']],
              how="inner", on="ID")

Split this dataset into 80% train / 20% test:

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

Fit and predict:

model = LinearRegression().fit(X_train, y_train)
y_preds = model.predict(X_test)

Print the metric:

print('RMSE:', metrics.mean_squared_error(y_test, y_preds, squared=False))
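One thing worth checking in the merge step above is how many rows survive it. The following sketch uses tiny made-up frames (the IDs and values are invented for illustration) to show that an inner merge keeps only IDs present in both months, and that `validate="one_to_one"` makes pandas raise an error if an ID is duplicated on either side.

```python
import pandas as pd

# Made-up data: IDs 1 and 5 each appear in only one of the two months.
january = pd.DataFrame({
    "ID": [1, 2, 3, 4],
    "feature_a": [0.1, 0.2, 0.3, 0.4],
})
february = pd.DataFrame({
    "ID": [2, 3, 4, 5],
    "february_info_colum": [10.0, 20.0, 30.0, 40.0],
})

# validate="one_to_one" raises MergeError if an ID is duplicated on either side.
df = pd.merge(
    january,
    february[["ID", "february_info_colum"]],
    how="inner",
    on="ID",
    validate="one_to_one",
)

# IDs present in only one month are silently dropped by an inner join.
dropped_ids = sorted(set(january["ID"]) ^ set(february["ID"]))
print(len(df), dropped_ids)
```

If many IDs are dropped, the model is trained and evaluated on a subset of the data, which may or may not be what you intend.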

What corresponds to the goal you describe

Train on January features and targets:

model = LinearRegression().fit(january.drop('january_info_colum', axis=1), january['january_info_colum'])

Predict on February features:

y_preds = model.predict(february.drop('february_info_colum', axis=1))

Print the metric:

print('RMSE:', metrics.mean_squared_error(february['february_info_colum'], y_preds, squared=False))
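Put together, the three steps above could look like the following self-contained sketch. The data is synthetic and the feature columns `feature_a` and `feature_b` are hypothetical. Two choices here differ from the snippets above and are my assumptions: the features are selected by an explicit column list (so that `ID` is not used as a predictor), and the RMSE is computed as the square root of the MSE with NumPy instead of passing `squared=False`, which recent scikit-learn releases have deprecated.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(42)


def make_month(target_col, n=300):
    """Build a synthetic monthly frame: ID, two features, one target."""
    df = pd.DataFrame({
        "ID": np.arange(n),
        "feature_a": rng.normal(size=n),
        "feature_b": rng.normal(size=n),
    })
    df[target_col] = (
        3.0 * df["feature_a"]
        + 0.5 * df["feature_b"]
        + rng.normal(scale=0.2, size=n)
    )
    return df


january = make_month("january_info_colum")
february = make_month("february_info_colum")

# Same feature columns, in the same order, for fitting and predicting.
# ID is an identifier, so it is left out of the features (an assumption).
feature_cols = ["feature_a", "feature_b"]

# Train on January features and targets.
model = LinearRegression().fit(january[feature_cols], january["january_info_colum"])

# Predict on February features.
y_preds = model.predict(february[feature_cols])

# Compare the predictions to the February targets.
rmse = float(np.sqrt(mean_squared_error(february["february_info_colum"], y_preds)))
print("RMSE:", rmse)
```

Note that this only works if both data frames expose the same feature columns; scikit-learn expects the columns seen at predict time to match those seen at fit time.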

Answered By - Guillaume Ansanay-Alex

This answer was collected from Stack Overflow and tested by PythonFixing community admins. It is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
