Issue
I have a dataframe that looks like this:
_id Points Averages Averages_2 Media Rank
a 324 858.2 NaN 0 Good
b 343 873.2 4.465e+06 1 Good
c 934 113.4 NaN 0 Bad
d 222 424.2 NaN 1 Bad
e 432 234.2 3.605e+06 1 Good
I want to predict the rank. Note that this is just a sample of a dataframe with 2000 rows and roughly 20 columns, but I wanted to point out that there are columns, such as Averages_2, with lots of NaNs, and columns whose values are only 0 or 1.
I did the following:
import xgboost as xgb
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score
import pandas as pd
data = 'C:\\me\\my_table.csv'
df = pd.read_csv(data)
cols_to_drop = ['_id']  # one column here, but my original df is much bigger,
                        # so I use a list to drop multiple columns at once
df.drop(cols_to_drop, axis=1, inplace=True)
X = df.drop('Rank', axis=1)
y = df['Rank']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=5)
lc = LabelEncoder()
lc = lc.fit(y)
lc_y = lc.transform(y)
model = XGBClassifier()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
predictions = [round(int(value)) for value in y_pred]
I get ValueError: invalid literal for int() with base 10: 'Good'.
I thought encoding the classes would work, but what else does one do when the classes are strings?
Solution
It fails because your y_pred contains strings such as "Good" and "Bad", so your last line effectively calls round(int("Good")), which of course cannot work (try calling print(y_pred[:5]) and see what it shows).
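A minimal reproduction of that error, independent of the dataframe, is just passing a class-label string to int():

```python
# int() cannot parse a class-label string, which is exactly
# what happens inside round(int(value)) when value == "Good".
try:
    int("Good")
except ValueError as e:
    print(e)  # invalid literal for int() with base 10: 'Good'
```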
You are actually not using your label encoder on either your training or your test set: you fit it on y but never use it to transform y_train or y_pred. And there is no need to when using XGBoost; it handles the classes automatically.
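If you do want integer labels (for example, to keep the round/int post-processing), the encoder has to be applied consistently. A minimal sketch with toy labels standing in for the real Rank column (the values here are just illustrative):

```python
from sklearn.preprocessing import LabelEncoder

# Toy labels standing in for df["Rank"].
y = ["Good", "Good", "Bad", "Bad", "Good"]

lc = LabelEncoder()
y_encoded = lc.fit_transform(y)  # classes are sorted: "Bad" -> 0, "Good" -> 1
print(list(y_encoded))           # [1, 1, 0, 0, 1]

# After the model predicts integers, map them back to the string labels:
print(list(lc.inverse_transform(y_encoded)))  # ['Good', 'Good', 'Bad', 'Bad', 'Good']
```

The key point is to pass the encoded labels (not the raw strings) to train_test_split and model.fit, and then inverse_transform the integer predictions when you need the original class names.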
Answered By - CutePoison