Issue
I have a dataframe that looks like this:
_id Points Averages Averages_2 Media Rank
a 324 858.2 NaN 0 Good
b 343 873.2 4.465e+06 1 Good
c 934 113.4 NaN 0 Bad
d 222 424.2 NaN 1 Bad
e 432 234.2 3.605e+06 1 Good
I want to predict the rank. Note that this is just a sample of a dataframe with 2000 rows and roughly 20 columns, but I wanted to point out that there are columns, such as Averages_2, with lots of NaNs, and columns whose values are only 0 or 1.
I did the following:
import xgboost as xgb
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score
import pandas as pd
data = 'C:\\me\\my_table.csv'
df = pd.read_csv(data)
cols_to_drop = ['_id']  # one column here, but my original df is much bigger,
                        # so I use a list to drop multiple columns at once
df.drop(cols_to_drop, axis=1, inplace=True)
X = df.drop('Rank', axis=1)
y = df['Rank']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=5)
lc = LabelEncoder()
lc = lc.fit(y)
lc_y = lc.transform(y)
model = XGBClassifier()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
predictions = [round(int(value)) for value in y_pred]
I get ValueError: invalid literal for int() with base 10: 'Good'.
I thought encoding the classes would work, but what else does one do when the classes are strings?
Solution
It fails because your y_pred contains strings such as "Good" and "Bad", so your last line effectively calls round(int("Good")), which of course cannot work (try calling print(y_pred[:5]) and see what it shows).
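A minimal reproduction of that error, independent of the dataframe, is just passing a class-label string to int():

```python
# int() cannot parse a class-label string, which is exactly
# what happens inside round(int(value)) when value == "Good".
try:
    int("Good")
except ValueError as e:
    print(e)  # invalid literal for int() with base 10: 'Good'
```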
You are actually not using your label encoder on either your training or your test set: you fit it on y but never use it to transform y_train or y_pred. And there is no need to when using XGBoost; it handles the classes automatically.
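If you do want integer labels (for example, to keep the round/int post-processing), the encoder has to be applied consistently. A minimal sketch with toy labels standing in for the real Rank column (the values here are just illustrative):

```python
from sklearn.preprocessing import LabelEncoder

# Toy labels standing in for df["Rank"].
y = ["Good", "Good", "Bad", "Bad", "Good"]

lc = LabelEncoder()
y_encoded = lc.fit_transform(y)  # classes are sorted: "Bad" -> 0, "Good" -> 1
print(list(y_encoded))           # [1, 1, 0, 0, 1]

# After the model predicts integers, map them back to the string labels:
print(list(lc.inverse_transform(y_encoded)))  # ['Good', 'Good', 'Bad', 'Bad', 'Good']
```

The key point is to pass the encoded labels (not the raw strings) to train_test_split and model.fit, and then inverse_transform the integer predictions when you need the original class names.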
Answered By - CutePoison