Issue
I am working with time series data to analyze the prices from 2018 until the end of 2023. However, it seems that regardless of the portion I take for training and testing data, there comes a point where the predictions become constant.
What could be causing this issue? Are there any methods or parameters that I can adjust to improve the model's performance? I tried using the Sliding Windows technique, but encountered the same problem.
I am importing the data like this:
df['Data'] = pd.to_datetime(df['Data'], format='%d.%m.%Y')
df.set_index('Data', inplace=True)
df = df.sort_values('Data')
And separating them like this:
train = df.loc[df.index < '01-01-2023']
test = df.loc[df.index >= '01-01-2023']
The definition of XGBoost is as follows:
model_XGB = xgb.XGBRegressor(n_estimators=300)
# Fit the model
model_XGB.fit(X_train, y_train,
              eval_set=[(X_train, y_train), (X_test, y_test)],
              verbose=100)
Solution
Why do you think there is a problem with your model? What you get is exactly what should be expected.
Let's do some stats on your numeric data:
>>> X_train.iloc[:, :3].describe()
       Valor_Londres     Valor_NY     ICCO_EUR
count    1287.000000  1287.000000  1287.000000
mean     1747.985478  2437.389534  2082.404064
std       114.495576   175.460090   168.307449
min      1379.000000  1893.670000  1574.000000
25%      1673.670000  2331.330000  1966.770000
50%      1750.670000  2451.000000  2089.550000
75%      1820.000000  2545.335000  2206.455000
max      2048.000000  2929.330000  2561.000000
>>> X_test.iloc[:, :3].describe()
       Valor_Londres     Valor_NY     ICCO_EUR
count     254.000000   254.000000   254.000000
mean     2585.255669  3282.550236  3009.645906
std       483.913608   490.309141   513.643522
min      1952.670000  2572.670000  2313.290000
25%      2139.835000  2874.670000  2543.147500
50%      2491.165000  3312.170000  2945.900000
75%      2952.165000  3617.417500  3393.330000
max      3475.000000  4263.670000  4051.170000
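Maybe you can already see what's wrong? In the training data, the max values are (2048, 2929, 2561), but in the test data these values are close to the min (1953, 2573, 2313)! The standard deviations also differ, by a factor of roughly 3 to 4. The same observation holds for the targets. You can also see that the curves do not have the same shape: the seasonality and the trend differ between the two periods. This matters because XGBoost builds decision trees, which predict piecewise-constant values and cannot extrapolate: once the inputs move beyond the range seen during training, the predictions flatten out.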
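The range comparison can also be done programmatically rather than by reading the two describe() outputs. The following is only a sketch: share_outside_train_range is a hypothetical helper name, and the toy numbers are made up for illustration, not taken from the question's data.

```python
import pandas as pd


def share_outside_train_range(train: pd.DataFrame, test: pd.DataFrame) -> pd.DataFrame:
    """For each numeric column present in both frames, report the training
    min/max and the fraction of test rows falling outside that range."""
    cols = train.select_dtypes("number").columns.intersection(test.columns)
    lo, hi = train[cols].min(), train[cols].max()
    # Boolean frame: True where a test value is below the training min
    # or above the training max; the column mean is the share of such rows.
    outside = (test[cols].lt(lo) | test[cols].gt(hi)).mean()
    return pd.DataFrame({"train_min": lo, "train_max": hi, "share_outside": outside})


# Toy data (made up): the test period sits mostly above the training range.
toy_train = pd.DataFrame({"price": [10.0, 12.0, 15.0, 20.0]})
toy_test = pd.DataFrame({"price": [18.0, 22.0, 25.0, 30.0]})
report = share_outside_train_range(toy_train, toy_test)
print(report)
```

On the real data this would be called as share_outside_train_range(X_train, X_test). A high share_outside for the price columns signals that the model is being asked to predict in a region it has never seen.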
However, the start of 2023 (January - February) appears to be correctly predicted, as the values are within the range of what the model has already seen during the training phase. After that, the regression is no longer valid.
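To see why the predictions become constant once the inputs leave the training range, here is a minimal sketch using a hand-written one-split regression tree. This is not XGBoost's actual algorithm (fit_stump is a hypothetical helper and the data is a made-up upward trend), but it shows the piecewise-constant behaviour that tree ensembles share.

```python
from statistics import mean


def fit_stump(x, y):
    """Fit a one-split regression tree by minimising the squared error.
    Returns a function mapping a feature value to a leaf mean."""
    pairs = sorted(zip(x, y))
    best = None
    for i in range(1, len(pairs)):
        left = [t for _, t in pairs[:i]]
        right = [t for _, t in pairs[i:]]
        l_mean, r_mean = mean(left), mean(right)
        sse = (sum((t - l_mean) ** 2 for t in left)
               + sum((t - r_mean) ** 2 for t in right))
        if best is None or sse < best[0]:
            # Split halfway between the two neighbouring feature values.
            threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
            best = (sse, threshold, l_mean, r_mean)
    _, threshold, l_mean, r_mean = best
    return lambda v: l_mean if v < threshold else r_mean


# Toy upward trend (made up): the target simply equals the feature.
x_train = list(range(10))
y_train = [float(v) for v in x_train]
predict = fit_stump(x_train, y_train)

# Inputs far beyond the training range all fall into the same leaf,
# so the prediction stays flat while the true values keep rising.
future = [predict(v) for v in (20, 50, 100)]
print(future)
```

An ensemble of many deeper trees gives a finer staircase inside the training range, but beyond the last split threshold every tree still returns a fixed leaf value, so the sum is constant as well.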
Answered By - Corralien