Issue
I'm trying to tune hyperparameters for KNN on a quite small datasets ( Kaggle Leaf which has around 990 lines ):
def knnTuning(self, x_train, t_train):
params = {
'n_neighbors': [1, 2, 3, 4, 5, 7, 9],
'weights': ['uniform', 'distance'],
'leaf_size': [5,10, 15, 20]
}
grid = GridSearchCV(KNeighborsClassifier(), params)
grid.fit(x_train, t_train)
print(grid.best_params_)
print(grid.best_score_)
return knn.KNN(neighbors=grid.best_params_["n_neighbors"],
weight = grid.best_params_["weights"],
leafSize = grid.best_params_["leaf_size"])
Prints:
{'leaf_size': 5, 'n_neighbors': 1, 'weights': 'uniform'}
0.9119999999999999
And I return this classifier
class KNN:
def __init__(self, neighbors=1, weight = 'uniform', leafSize = 10):
self.clf = KNeighborsClassifier(n_neighbors = neighbors,
weights = weight, leaf_size = leafSize)
def train(self, X, t):
self.clf.fit(X, t)
def predict(self, x):
return self.clf.predict(x)
def global_accuracy(self, X, t):
predicted = self.predict(X)
accuracy = (predicted == t).mean()
return accuracy
I run this several time using 700 lines for the training and 200 for validation, which are chosen with random permutation.
I then got result for the global accuracy from 0.01 (often) to 0.4 (rarely).
I know that i'm not comparing two same metrics but I still can't understand the huge difference between the results.
Solution
Not very sure how you trained your model or how the preprocessing was done. The leaf dataset has about 100 labels (species) so you have to take care to split your test and train to ensure an even split of your samples. One reason for the weird accuracy could be that your samples are split unevenly.
Also you would need to scale your features:
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.model_selection import GridSearchCV, StratifiedShuffleSplit
df = pd.read_csv("https://raw.githubusercontent.com/WenjinTao/Leaf-Classification--Kaggle/master/train.csv")
le = LabelEncoder()
scaler = StandardScaler()
X = df.drop(['id','species'],axis=1)
X = scaler.fit_transform(X)
y = le.fit_transform(df['species'])
strat = StratifiedShuffleSplit(n_splits=1, test_size=0.3, random_state=0).split(X,y)
x_train, y_train, x_test, y_test = [[X.iloc[train,:],t[train],X.iloc[test,:],t[test]] for train,test in strat][0]
If we do the training, and I would be careful about including n_neighbors = 1 :
params = {
'n_neighbors': [2, 3, 4],
'weights': ['uniform', 'distance'],
'leaf_size': [5,10, 15, 20]
}
sss = StratifiedShuffleSplit(n_splits=10, test_size=0.2, random_state=0)
grid = GridSearchCV(KNeighborsClassifier(), params, cv=sss)
grid.fit(x_train, y_train)
print(grid.best_params_)
print(grid.best_score_)
{'leaf_size': 5, 'n_neighbors': 2, 'weights': 'distance'}
0.9676258992805755
Then you can check on your test:
pred = grid.predict(x_test)
(y_test == pred).mean()
0.9831649831649831
Answered By - StupidWolf
0 comments:
Post a Comment
Note: Only a member of this blog may post a comment.