Issue
I'm predicting the duration of a drying process in different silos. The support vector regression is retrained for a every silo, because they are more different than expected.
I want to check if there is a correlation between the 4 different silos, but I have a hard time getting all predictions into one dataframe.
The dataframe:
df = pd.DataFrame({'date': ['2019-09-06', '2019-09-07', '2019-09-08', '2019-09-09', '2019-09-10'] , 'real_duration': [400, 420, 450,500, 380], 'silo_nr': [1, 2, 2, 3, 4], 'temp_outside' : [4, 5, 2, 6 , 15], 'zomer_winter_dummy': ["winter","winter","winter", "winter", "zommer"]})
df.set_index('date', inplace= True)
My current code:
creation dataframe where i want to store all predictions:
df_predict[['real_duration', 'silo_nr']] = df[['real_duration', 'silo_nr']]
Not optimal, but current best plan is to repeat this code 4 times for all 4 silos:
X = df_test[df_test.silo == '1'][['moist', 'temp_outside', 'zomer_winter_dummy']]
X = pd.get_dummies(X, columns=['zomer_winter_dummy'], drop_first=True)
y = df_test[df_test.silo == '1']['real_duration']
model = SVR(kernel='poly', C= 1, epsilon = 1, gamma = 0.0001)
df_predict[df_predict.silo_nr =='1']['predict_kiln1'] = cross_val_predict(model, X, y, cv=5)
But I get the error:
ValueError: Length of values does not match length of index.
I was trying to avoid this error by only selecting those rows of silo_nr == '1'
to avoid this.
Can you please suggest an improvement on my code?
Thanks a lot !
Solution
Seems like you should be storing the resultant dataframes in a list or dictionary, then concatenating them together to create your output.
Sample data:
import pandas as pd
import numpy as np
from sklearn.svm import SVR
from sklearn.model_selection import cross_val_predict
df = pd.DataFrame({
'date': pd.date_range('2019-09-06', '2019-10-15'),
'real_duration': np.random.randint(100, 500, size=40),
'silo_nr': np.repeat([1,2,3,4], 10),
'temp_outside' : np.random.randint(5, 20, size=40),
'zomer_winter_dummy': np.random.choice(["winter", "zommer"], size=40)
})
df = df.set_index('date')
print(df.head())
real_duration silo_nr temp_outside zomer_winter_dummy
date
2019-09-06 165 1 14 zommer
2019-09-07 370 1 19 zommer
2019-09-08 484 1 6 zommer
2019-09-09 358 1 16 zommer
2019-09-10 244 1 7 winter
groupby "silo_nr" and apply the model fit to each groups data. Store the resultant arrays into a dictionary where the key is the silo_nr value and the value is a Series of the predicted array. The series takes on the index of the input data to align for later use. Then use pd.concat
on that dictionary to join all of the Series together into a single dataframe.
model = SVR(kernel='poly', C= 1, epsilon = 1, gamma = 0.0001)
results = {}
for silo, group in df.groupby('silo_nr'):
# whats the point of subsetting if you overwrite X on the next line?
# X = group[['moist', 'temp_outside', 'zomer_winter_dummy']]
X = pd.get_dummies(group, columns=['zomer_winter_dummy'], drop_first=True)
y = group['real_duration']
results[silo] = pd.Series(cross_val_predict(model, X, y, cv=5), index=group.index, name="prediction")
result_df = pd.concat(results, names=["silo_nr"]).reset_index(level=0)
print(result_df.head(10)
silo_nr prediction
date
2019-09-06 1 237.048891
2019-09-07 1 348.963419
2019-09-08 1 539.037233
2019-09-09 1 338.392746
2019-09-10 1 259.937936
2019-09-11 1 359.046132
2019-09-12 1 432.049543
2019-09-13 1 573.457120
2019-09-14 1 279.745391
2019-09-15 1 292.807550
Lastly, if you want to join your predictions back with the original data, you can use the join
function since the index of both dataframes is the same (just need to drop one of the "silo_nr" columns since its present in both dataframes first).
final_df = result_df.drop("silo_nr", axis=1).join(df)
print(final_df.head(10))
prediction real_duration silo_nr temp_outside zomer_winter_dummy
date
2019-09-06 237.048891 165 1 14 zommer
2019-09-07 348.963419 370 1 19 zommer
2019-09-08 539.037233 484 1 6 zommer
2019-09-09 338.392746 358 1 16 zommer
2019-09-10 259.937936 244 1 7 winter
2019-09-11 359.046132 382 1 11 zommer
2019-09-12 432.049543 414 1 17 zommer
2019-09-13 573.457120 487 1 17 winter
2019-09-14 279.745391 237 1 10 winter
2019-09-15 292.807550 274 1 5 winter
To calculate the difference between predicted and observed results, you can calculate that on final_df
as you have done. However in order to pivot the results appropriately we're going to need to replace the index. In my data, none of the samples were measured on the same day meaning that "date" is not a particularly good index for these data (if you have a measurement for each silo on each date, then I would use "date" as the index).
final_df = (final_df
.assign(
prediction_difference=lambda df: df["real_duration"] - df["prediction"],
measurement_count=lambda df: df.groupby("silo_nr").cumcount()
)
.pivot(index="measurement_count", columns="silo_nr", values=["real_duration", "prediction"])
)
print(final_df)
real_duration prediction
silo_nr 1 2 3 4 1 2 3 4
measurement_count
0 497.0 161.0 178.0 174.0 552.684521 238.446608 209.933437 231.625109
1 377.0 486.0 166.0 227.0 373.993094 496.397113 206.773769 246.807474
2 429.0 490.0 166.0 312.0 419.621826 487.391489 174.880098 286.195194
3 176.0 196.0 232.0 388.0 268.697640 247.515843 205.233186 368.080966
4 377.0 435.0 414.0 437.0 368.084024 413.005861 432.138127 442.188875
5 470.0 177.0 485.0 417.0 484.663314 243.078444 602.527207 410.407151
6 325.0 354.0 349.0 458.0 331.416194 293.509960 310.791084 483.367988
7 318.0 499.0 236.0 235.0 326.878359 507.602317 198.263842 247.611772
8 181.0 357.0 100.0 266.0 265.770711 300.769953 162.219113 255.572195
9 352.0 280.0 393.0 264.0 345.185827 233.542568 376.882980 248.664156
Answered By - Cameron Riddell
0 comments:
Post a Comment
Note: Only a member of this blog may post a comment.