Issue
This is odd. I can successfully run the example grid_search_digits.py. However, I am unable to do a grid search on my own data.
I have the following setup:
import sklearn
from sklearn.svm import SVC
from sklearn.grid_search import GridSearchCV
from sklearn.cross_validation import LeaveOneOut
from sklearn.metrics import auc_score
# ... Build X and y ....
tuned_parameters = [{'kernel': ['rbf'], 'gamma': [1e-3, 1e-4],
                     'C': [1, 10, 100, 1000]},
                    {'kernel': ['linear'], 'C': [1, 10, 100, 1000]}]
loo = LeaveOneOut(len(y))
clf = GridSearchCV(SVC(C=1), tuned_parameters, score_func=auc_score)
clf.fit(X, y, cv=loo)
....
print clf.best_estimator_
....
But I never get past clf.fit (I left it running for ~1 hr).
I have tried also with
clf.fit(X, y, cv=10)
and with
skf = StratifiedKFold(y,2)
clf.fit(X, y, cv=skf)
and had the same problem (it never finishes the clf.fit statement). My data is simple:
> X.shape
(27,26)
> y.shape
27
> numpy.sum(y)
5
> y.dtype
dtype('int64')
>?y
Type: ndarray
String Form:[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1]
Length: 27
File: /home/jacob04/opt/python/numpy/numpy-1.7.1/lib/python2.7/site-packages/numpy/__init__.py
Docstring: <no docstring>
Class Docstring:
ndarray(shape, dtype=float, buffer=None, offset=0,
strides=None, order=None)
> ?X
Type: ndarray
String Form:
[[ -3.61238468e+03 -3.61253920e+03 -3.61290196e+03 -3.61326679e+03
7.84590361e+02 0.0000 <...> 0000e+00 2.22389150e+00 2.53252959e+00
2.11606216e+00 -1.99613432e+05 -1.99564828e+05]]
Length: 27
File: /home/jacob04/opt/python/numpy/numpy-1.7.1/lib/python2.7/site-packages/numpy/__init__.py
Docstring: <no docstring>
Class Docstring:
ndarray(shape, dtype=float, buffer=None, offset=0,
strides=None, order=None)
This is all with the latest version of scikit-learn (0.13.1) and:
$ pip freeze
Cython==0.19.1
PIL==1.1.7
PyXB==1.2.2
PyYAML==3.10
argparse==1.2.1
distribute==0.6.34
epc==0.0.5
ipython==0.13.2
jedi==0.6.0
matplotlib==1.3.x
nltk==2.0.4
nose==1.3.0
numexpr==2.1
numpy==1.7.1
pandas==0.11.0
pyparsing==1.5.7
python-dateutil==2.1
pytz==2013b
rpy2==2.3.1
scikit-learn==0.13.1
scipy==0.12.0
sexpdata==0.0.3
six==1.3.0
stemming==1.0.1
-e git+https://github.com/PyTables/PyTables.git@df7b20444b0737cf34686b5d88b4e674ec85575b#egg=tables-dev
tornado==3.0.1
wsgiref==0.1.2
The odd thing is that fitting a single SVM is extremely fast:
> %timeit clf2 = svm.SVC(); clf2.fit(X,y)
1000 loops, best of 3: 328 us per loop
Update
I have noticed that if I pre-scale the data with:
from sklearn import preprocessing
X = preprocessing.scale(X)
the grid search is extremely fast.
Why? Why is GridSearchCV so sensitive to scaling while a regular svm.SVC().fit is not?
Solution
As noted already, for SVM-based classifiers (y is an integer array, np.int*, so this is a classification task) preprocessing is a must. Otherwise the estimator's predictive capability is lost to the influence that skewed features have on the decision function.
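To make the influence of a skewed feature concrete, here is a minimal sketch that uses only NumPy. The data are synthetic and hypothetical (27 rows and 26 columns to mirror the shapes in the question, with one column deliberately given a huge range), gamma = 1e-3 is taken from the question's grid, and the helper name rbf_gram is an assumption made for this illustration. The scaling step is plain zero-mean, unit-variance standardisation of each column.

```python
import numpy as np

def rbf_gram(X, gamma):
    """Return K with K[i, j] = exp(-gamma * ||x_i - x_j||^2)."""
    diffs = X[:, None, :] - X[None, :, :]      # shape (M, M, N)
    sq_dists = (diffs ** 2).sum(axis=-1)       # shape (M, M)
    return np.exp(-gamma * sq_dists)

rng = np.random.default_rng(0)
X = rng.normal(size=(27, 26))                  # synthetic stand-in data
X[:, 0] = 1e3 * np.arange(27)                  # one feature with a huge range

# Zero-mean, unit-variance scaling of every column.
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

off_diag = ~np.eye(27, dtype=bool)
K_raw = rbf_gram(X, gamma=1e-3)
K_scaled = rbf_gram(X_scaled, gamma=1e-3)

# Unscaled: the wide feature dominates every distance, so each sample
# looks unrelated to all others and the kernel collapses to the identity.
print("raw    off-diagonal max:", K_raw[off_diag].max())
# Scaled: every feature contributes and the similarities are non-zero again.
print("scaled off-diagonal min:", K_scaled[off_diag].min())
```

In the unscaled case the kernel matrix carries no information about which samples resemble each other, which is one way the decision function can lose its predictive value.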
As for the processing times:
- try to get a better view of your model's overfitting/generalisation landscape over [C, gamma]
- try to add verbosity to the initial tuning runs
- try to add n_jobs to the number crunching
- try to move to grid computing if the scale requires it
# aML_GS is presumably an alias for the sklearn.grid_search module
aGrid = aML_GS.GridSearchCV( aClassifierOBJECT,
                             param_grid = aGrid_of_parameters,
                             cv         = cv,
                             n_jobs     = n_JobsOnMultiCpuCores,
                             verbose    = 5 )
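For reference, a comparable call against a current scikit-learn release might look like the sketch below. It assumes the sklearn.model_selection module (which replaced sklearn.grid_search in later releases) and the scoring="roc_auc" string in place of score_func. The data are synthetic stand-ins with the question's shapes, and N_JOBS is a hypothetical name chosen for this example.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in with the question's shapes: 27 samples, 26 features,
# 22 negatives followed by 5 positives.
rng = np.random.default_rng(0)
X = rng.normal(size=(27, 26))
y = np.array([0] * 22 + [1] * 5)

# Scaling lives inside the pipeline, so it is re-fitted on each training fold.
model = make_pipeline(StandardScaler(), SVC())

param_grid = [
    {"svc__kernel": ["rbf"], "svc__gamma": [1e-3, 1e-4],
     "svc__C": [1, 10, 100, 1000]},
    {"svc__kernel": ["linear"], "svc__C": [1, 10, 100, 1000]},
]

N_JOBS = 1  # hypothetical knob: raise it, or use -1 for all CPU cores

search = GridSearchCV(
    model,
    param_grid=param_grid,
    scoring="roc_auc",
    cv=StratifiedKFold(n_splits=5),
    n_jobs=N_JOBS,
    verbose=5,  # prints per-fit progress, so a stalled search is visible
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```

Putting the scaler inside the pipeline is a design choice made here, not something the original snippet shows. It keeps the scaling statistics from leaking out of the held-out folds.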
Sometimes GridSearchCV() can indeed take a huge amount of CPU time and CPU resources, even after all the above-mentioned tips are applied.
So keep calm and do not panic if you are sure that the feature engineering, data sanity checks and feature-domain preprocessing were done correctly.
[GridSearchCV] ................ C=16777216.0, gamma=0.5, score=0.761619 -62.7min
[GridSearchCV] C=16777216.0, gamma=0.5 .........................................
[GridSearchCV] ................ C=16777216.0, gamma=0.5, score=0.792793 -64.4min
[GridSearchCV] C=16777216.0, gamma=1.0 .........................................
[GridSearchCV] ............... C=16777216.0, gamma=1.0, score=0.793103 -116.4min
[GridSearchCV] C=16777216.0, gamma=1.0 .........................................
[GridSearchCV] ............... C=16777216.0, gamma=1.0, score=0.794603 -205.4min
[GridSearchCV] C=16777216.0, gamma=1.0 .........................................
[GridSearchCV] ............... C=16777216.0, gamma=1.0, score=0.771772 -200.9min
[GridSearchCV] C=16777216.0, gamma=2.0 .........................................
[GridSearchCV] ............... C=16777216.0, gamma=2.0, score=0.713643 -446.0min
[GridSearchCV] C=16777216.0, gamma=2.0 .........................................
[GridSearchCV] ............... C=16777216.0, gamma=2.0, score=0.743628 -184.6min
[GridSearchCV] C=16777216.0, gamma=2.0 .........................................
[GridSearchCV] ............... C=16777216.0, gamma=2.0, score=0.761261 -281.2min
[GridSearchCV] C=16777216.0, gamma=4.0 .........................................
[GridSearchCV] ............... C=16777216.0, gamma=4.0, score=0.670165 -138.7min
[GridSearchCV] C=16777216.0, gamma=4.0 .........................................
[GridSearchCV] ................ C=16777216.0, gamma=4.0, score=0.760120 -97.3min
[GridSearchCV] C=16777216.0, gamma=4.0 .........................................
[GridSearchCV] ................ C=16777216.0, gamma=4.0, score=0.732733 -66.3min
[GridSearchCV] C=16777216.0, gamma=8.0 .........................................
[GridSearchCV] ................ C=16777216.0, gamma=8.0, score=0.755622 -13.6min
[GridSearchCV] C=16777216.0, gamma=8.0 .........................................
[GridSearchCV] ................ C=16777216.0, gamma=8.0, score=0.772114 - 4.6min
[GridSearchCV] C=16777216.0, gamma=8.0 .........................................
[GridSearchCV] ................ C=16777216.0, gamma=8.0, score=0.717718 -14.7min
[GridSearchCV] C=16777216.0, gamma=16.0 ........................................
[GridSearchCV] ............... C=16777216.0, gamma=16.0, score=0.763118 - 1.3min
[GridSearchCV] C=16777216.0, gamma=16.0 ........................................
[GridSearchCV] ............... C=16777216.0, gamma=16.0, score=0.746627 - 25.4s
[GridSearchCV] C=16777216.0, gamma=16.0 ........................................
[GridSearchCV] ............... C=16777216.0, gamma=16.0, score=0.738739 - 44.9s
[Parallel(n_jobs=1)]: Done 2700 out of 2700 | elapsed: 5670.8min finished
As for the question above about "... a regular svm.SVC().fit": kindly notice that it uses the default [C, gamma] values and thus has no relevance to the behaviour of your model / problem domain.
Re: Update
Oh yes, indeed: normalisation/scaling of the SVM inputs is a mandatory task for this AI/ML tool.
scikit-learn has good instrumentation to produce and re-use aScalerOBJECT for both a-priori scaling (before aDataSET goes into .fit()) and ex-post ad-hoc scaling, once you need to re-scale a new example and send it to the predictor to work its magic via a request to
anSvmCLASSIFIER.predict( aScalerOBJECT.transform( aNewExampleX ) )
(Yes, aNewExampleX may be a matrix, which asks for "vectorised" processing of several answers at once.)
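The fit-once, re-use-later pattern described above can be sketched as follows. StandardScaler is one of scikit-learn's scaler classes and is assumed here as the concrete aScalerOBJECT. The training data, the new examples, and the chosen C and gamma values are synthetic placeholders, not values from the question's data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
aDataSET_X = rng.normal(size=(27, 26)) * 1e3   # synthetic, badly scaled inputs
aDataSET_y = np.array([0] * 22 + [1] * 5)

# A-priori scaling: learn per-feature mean and variance on the training data.
aScalerOBJECT = StandardScaler().fit(aDataSET_X)
anSvmCLASSIFIER = SVC(kernel="rbf", C=1.0, gamma=1e-3)
anSvmCLASSIFIER.fit(aScalerOBJECT.transform(aDataSET_X), aDataSET_y)

# Ex-post scaling: new examples go through the SAME fitted scaler.
aNewExampleX = rng.normal(size=(3, 26)) * 1e3  # a matrix of three new examples
predictions = anSvmCLASSIFIER.predict(aScalerOBJECT.transform(aNewExampleX))
print(predictions)
```

The scaler is fitted only once, on the training data. Re-fitting it on the new examples would put them on a different scale from the one the classifier was trained on.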
Performance relief of O( M^2 · N^1 ) computational complexity
In contrast to the guess posted below, that the problem "width", measured as N == the number of SVM features in matrix X, is to be blamed for the overall computing time, the SVM classifier with an rbf kernel is by design an O( M^2 · N^1 ) problem.
So there is a quadratic dependence on the overall number of observations (examples) M moved into the training (.fit()) or cross-validation phase. One can hardly claim that a supervised-learning classifier will gain any predictive power if one "reduces" the (linear only) "width" of the features, which are the very inputs its predictive power is built from.
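Taking the O( M^2 · N^1 ) estimate above at face value, a back-of-the-envelope sketch shows why the number of examples M weighs more than the feature count N. The helper name kernel_matrix_ops is hypothetical, and it counts only the multiply-adds needed to fill a dense M x M kernel matrix naively; a real solver such as libsvm does additional work on top of this.

```python
def kernel_matrix_ops(n_samples, n_features):
    """Count multiply-adds needed to fill a dense kernel matrix naively."""
    ops = 0
    for _ in range(n_samples):          # every example ...
        for _ in range(n_samples):      # ... paired with every example
            ops += n_features           # one pass over the feature vector
    return ops

base = kernel_matrix_ops(27, 26)             # the shapes from the question
print(base)                                  # 18954
print(kernel_matrix_ops(54, 26) / base)      # doubling M -> 4.0
print(kernel_matrix_ops(27, 52) / base)      # doubling N -> 2.0
```

Doubling the number of examples quadruples the count, while doubling the number of features only doubles it.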
Answered By - user3666197