Saturday, May 28, 2022

[FIXED] SVM classifier n_samples, n_splits problem sklearn Python


Issue

I'm trying to predict volatility one step ahead with an SVM model, based on an example from the O'Reilly book Machine Learning for Financial Risk Management with Python. When I copy the example exactly (with S&P 500 data) it works well, but I'm having trouble with this chunk of code when I use returns data from a particular fund:

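import numpy as np
import pandas as pd
from scipy.stats import norm
# assumed: sp_rand is scipy.stats.uniform, as in the book's example
from scipy.stats import uniform as sp_rand
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVR

# returns (a pandas Series, so that .rolling() and .reset_index() are available)
r = pd.Series([np.nan, 0.0013933, 0.00118874, 0.00076462, 0.00168565,
               -0.00018507, -0.00390753, 0.00307275, -0.00351472])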

# horizon
t = 252

# mean of returns
mu = r.mean()

# critical value
z = norm.ppf(0.95)

# realized volatility
vol = r.rolling(5).std()
vol = pd.DataFrame(vol)
vol.reset_index(drop=True, inplace=True)

# SVM GARCH
r_svm = r ** 2
r_svm = r_svm.reset_index()

# inputs X (returns and realized volatility)
X = pd.concat([vol, r_svm], axis=1, ignore_index=True)
X = X.dropna().copy()
X = X.reset_index()
X.drop([1, 'index'], axis=1, inplace=True)

# labels y realized volatility shifted 1 period onward
vol = vol.dropna().reset_index()
vol.drop('index', axis=1, inplace=True)

# linear kernel
svr_lin = SVR(kernel='linear')

# hyperparameters grid
para_grid = {'gamma': sp_rand(),
             'C': sp_rand(),
             'epsilon': sp_rand()}

# randomized hyperparameter search over the SVR (a regressor, not a classifier)
clf = RandomizedSearchCV(svr_lin, para_grid)
clf.fit(X[:-1].dropna().values,
        vol[1:].values.reshape(-1,))

# prediction
n_vol = clf.predict(X.iloc[-1:])

The raised error is:

ValueError: Cannot have number of splits n_splits=5 greater than the number of samples: n_samples=3.

The code works with longer return series, so I assume the problem is the length of the array, but I can't figure out how to solve it. Can someone help me with that?

Solution

This error is raised because you use RandomizedSearchCV with the default cv parameter. By default, RandomizedSearchCV runs 5-fold cross-validation to find the best hyperparameters for the model.

5-fold cross-validation means splitting your training data into 5 subsets (folds) and training 5 different models, each one fitted on 4 of the folds and validated on the remaining one.

You have fewer than 5 samples in your training set (n_samples=3, according to the error message), so splitting your data into 5 folds isn't possible.
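The failure can be reproduced in isolation with scikit-learn's KFold, which is the splitter applied to regressors such as SVR when cv is an integer or left at its default. This is only a minimal sketch: the arrays X_ok and X_small are made-up stand-ins, not the fund data from the question.

```python
import numpy as np
from sklearn.model_selection import KFold

# Hypothetical stand-in data: 10 samples with 2 features each.
X_ok = np.arange(20).reshape(10, 2)

# 5-fold CV: each of the 5 rounds holds out a different fifth of the rows.
fold_sizes = [(len(train_idx), len(test_idx))
              for train_idx, test_idx in KFold(n_splits=5).split(X_ok)]

# Only 3 samples, as in the question: 5 folds cannot be formed.
X_small = np.arange(6).reshape(3, 2)
try:
    list(KFold(n_splits=5).split(X_small))
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(fold_sizes)     # (train size, validation size) for each round
print(error_message)  # the same ValueError text as in the question
```

With 10 samples every round trains on 8 rows and validates on 2; with 3 samples the splitter refuses to proceed before any model is fitted.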

To fix the issue, you should either add more data or decrease the number of folds used by RandomizedSearchCV by passing the cv parameter:

clf = RandomizedSearchCV(svr_lin, para_grid, cv=2)
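As an alternative to hard-coding the number of folds, it can be capped at the number of training samples. The sketch below rebuilds the question's inputs in condensed form and is not the asker's exact pipeline. The names features, n_folds, search and next_vol are introduced here for illustration, and n_iter and random_state are arbitrary choices. The scorer is switched to negative mean squared error, an assumption made because the default R² score is undefined on a single-sample validation fold.

```python
import numpy as np
import pandas as pd
from scipy.stats import uniform
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVR

# Returns from the question (the first value is missing).
r = pd.Series([np.nan, 0.0013933, 0.00118874, 0.00076462, 0.00168565,
               -0.00018507, -0.00390753, 0.00307275, -0.00351472])

# Features: 5-period realized volatility and squared returns, NaN rows dropped.
features = pd.concat([r.rolling(5).std(), r ** 2], axis=1).dropna()
X_train = features.values[:-1]     # every row except the most recent one
y_train = features.values[1:, 0]   # next-period realized volatility

# Cap the fold count at the number of training samples.
# Assumption: at least 2 samples are available, because cv must be >= 2.
n_folds = min(5, len(X_train))

# gamma is left out of the grid because the linear kernel ignores it.
param_grid = {'C': uniform(), 'epsilon': uniform()}
search = RandomizedSearchCV(SVR(kernel='linear'), param_grid, cv=n_folds,
                            n_iter=5, scoring='neg_mean_squared_error',
                            random_state=0)
search.fit(X_train, y_train)

# One-step-ahead forecast from the most recent feature row.
next_vol = search.predict(features.values[-1:])
print(n_folds, next_vol)
```

With the nine returns from the question only 3 training rows survive, so n_folds becomes 3 and the search runs without the ValueError. The result is still fitted on far too little data to be trusted.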

I'd recommend collecting more data, since so few data points (only 3 training samples here) most likely won't be enough to make the model accurate or predictive.

Answered By - fshabashev

This answer was collected from Stack Overflow and tested by PythonFixing community admins; it is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
