Issue
I got over 90% accuracy with the Random Forest classifier, but I worry that the other algorithms give much lower results (a table with the results was included). But this is not the main concern. The problem is that when I used user inputs, the predictions were 100 percent wrong. The columns of the user input are in the same order as the columns of the training data set.
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

model = RandomForestClassifier()
model.fit(X_train, y_train)
prediction = model.predict(X_test)
acc = accuracy_score(y_test, prediction)  # output: 0.91

# predictions for the user-provided compounds
X_test_user = df_user_compounds_1.to_numpy()
user_input_predictions_1 = model.predict(X_test_user)
user_input_predictions_1  # output: array([0, 0, 0, 0, 0], dtype=int64), but it should be: array([1, 1, 1, 1, 1], dtype=int64)
Does anyone have any idea why this is happening?
The dataset is preprocessed: no missing values, no duplicates, no negative values; it was balanced with RandomOverSampler and scaled with MinMaxScaler, and it contains 11 features and about 7K rows.
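For context, a minimal sketch of preprocessing along those lines might look as follows; the file name, target column name, exact order of the steps, and split parameters are all assumptions, since the original post does not show them.

import pandas as pd
from imblearn.over_sampling import RandomOverSampler
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

df = pd.read_csv("compounds.csv")               # hypothetical file: 11 features plus a target column
X, y = df.drop(columns="target"), df["target"]  # "target" is an assumed column name

X_scaled = MinMaxScaler().fit_transform(X)      # scale all features to [0, 1]
X_balanced, y_balanced = RandomOverSampler(random_state=42).fit_resample(X_scaled, y)  # balance the classes

X_train, X_test, y_train, y_test = train_test_split(
    X_balanced, y_balanced, test_size=0.2, random_state=42)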
...........
Thank you so much @ElvinJafarov. These are parts of df_user_compounds_1 and X_test after following your suggestion.
Since I had already used MinMaxScaler(), I had to add two more rows to df_user_compounds_1, containing the corresponding min and max values, to simulate scaling identical to the original. I found the min and max values through df.describe(include="all"), concatenated these two rows to the user-input data frame, and scaled it.
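A rough sketch of that workaround, reusing the names from the post; feature_columns is a placeholder for the 11 feature names, and the exact concatenation details are assumptions.

import pandas as pd
from sklearn.preprocessing import MinMaxScaler

stats = df.describe(include="all")
min_row = stats.loc[["min"], feature_columns]   # training minima as a 1-row frame
max_row = stats.loc[["max"], feature_columns]   # training maxima as a 1-row frame

# append the min/max rows so a fresh MinMaxScaler reproduces the original [0, 1] range
padded = pd.concat([df_user_compounds_1, min_row, max_row], ignore_index=True)
padded_scaled = MinMaxScaler().fit_transform(padded)

X_test_user = padded_scaled[:-2]                # drop the two helper rows again
user_input_predictions_1 = model.predict(X_test_user)

Keeping the scaler that was fitted on the training data and calling scaler.transform(df_user_compounds_1) would usually be the cleaner way to apply the identical scaling.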
I am happy with the result: the first 5 predictions should all be 1, and 4 out of 5 now are.
Solution
First of all, it is okay that different algorithms give different accuracy rates.
Secondly, in your case, there might be several reasons:
- You have scaled your training data but not df_user_compounds_1 (see the sketch after this list).
- Your model might be overfitted.
- The dataset was preprocessed differently from df_user_compounds_1.
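A minimal sketch of the first point: fit the scaler on the training data only and reuse that same fitted scaler for the user input, so both end up on the same [0, 1] scale. It assumes X_train and X_test are the unscaled splits and that the model is refitted on the scaled training data.

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)         # learn min/max from the training data only
X_test_scaled = scaler.transform(X_test)               # reuse the fitted scaler, do not refit
X_user_scaled = scaler.transform(df_user_compounds_1)  # user input on the same scale

model.fit(X_train_scaled, y_train)
user_input_predictions_1 = model.predict(X_user_scaled)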
Thirdly, this is not how you should approach choosing a model. You have to try K-fold cross-validation and hyperparameter tuning, for example as sketched below.
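A minimal sketch of both, using scikit-learn; the parameter grid values are only illustrative.

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, cross_val_score

# 5-fold cross-validation gives a more reliable accuracy estimate than a single split
scores = cross_val_score(RandomForestClassifier(), X_train, y_train, cv=5)
print(scores.mean(), scores.std())

# small grid search for hyperparameter tuning
param_grid = {"n_estimators": [100, 300], "max_depth": [None, 10, 20]}
search = GridSearchCV(RandomForestClassifier(), param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)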
Answered By - Elvin Jafarov