Issue
The DataFrame's NA values have already been filled.
The dataset's schema contains no object dtype, as specified in the documentation.
df.info()
output:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 429 entries, 351 to 559
Data columns (total 11 columns):
 #   Column             Non-Null Count  Dtype
---  ------             --------------  -----
 0   Gender             429 non-null    category
 1   Married            429 non-null    category
 2   Dependents         429 non-null    category
 3   Education          429 non-null    category
 4   Self_Employed      429 non-null    category
 5   ApplicantIncome    429 non-null    int64
 6   CoapplicantIncome  429 non-null    float64
 7   LoanAmount         429 non-null    float64
 8   Loan_Amount_Term   429 non-null    float64
 9   Credit_History     429 non-null    float64
 10  Property_Area      429 non-null    category
dtypes: category(6), float64(4), int64(1)
memory usage: 23.3 KB
I have the following code; I am trying to do classification on this dataset using LightGBM:
import lightgbm as lgb

train_data = lgb.Dataset(x_train, label=y_train, categorical_feature=cat_cols)

# define parameters
params = {'learning_rate': 0.001}
model = lgb.train(params, train_data, 100, categorical_feature=cat_cols)
I am getting the following error:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-178-aaa91a2d8719> in <module>
6
7
----> 8 model= lgb.train(params, train_data, 100,categorical_feature=cat_cols)
~\Anaconda3\lib\site-packages\lightgbm\engine.py in train(params, train_set, num_boost_round, valid_sets, valid_names, fobj, feval, init_model, feature_name, categorical_feature, early_stopping_rounds, evals_result, verbose_eval, learning_rates, keep_training_booster, callbacks)
229 # construct booster
230 try:
--> 231 booster = Booster(params=params, train_set=train_set)
232 if is_valid_contain_train:
233 booster.set_train_data_name(train_data_name)
~\Anaconda3\lib\site-packages\lightgbm\basic.py in __init__(self, params, train_set, model_file, model_str, silent)
1981 break
1982 # construct booster object
-> 1983 train_set.construct()
1984 # copy the parameters from train_set
1985 params.update(train_set.get_params())
~\Anaconda3\lib\site-packages\lightgbm\basic.py in construct(self)
1319 else:
1320 # create train
-> 1321 self._lazy_init(self.data, label=self.label,
1322 weight=self.weight, group=self.group,
1323 init_score=self.init_score, predictor=self._predictor,
~\Anaconda3\lib\site-packages\lightgbm\basic.py in _lazy_init(self, data, label, reference, weight, group, init_score, predictor, silent, feature_name, categorical_feature, params)
1133 raise TypeError('Cannot initialize Dataset from {}'.format(type(data).__name__))
1134 if label is not None:
-> 1135 self.set_label(label)
1136 if self.get_label() is None:
1137 raise ValueError("Label should not be None")
~\Anaconda3\lib\site-packages\lightgbm\basic.py in set_label(self, label)
1648 self.label = label
1649 if self.handle is not None:
-> 1650 label = list_to_1d_numpy(_label_from_pandas(label), name='label')
1651 self.set_field('label', label)
1652 self.label = self.get_field('label') # original values can be modified at cpp side
~\Anaconda3\lib\site-packages\lightgbm\basic.py in list_to_1d_numpy(data, dtype, name)
88 elif isinstance(data, Series):
89 if _get_bad_pandas_dtypes([data.dtypes]):
---> 90 raise ValueError('Series.dtypes must be int, float or bool')
91 return np.array(data, dtype=dtype, copy=False) # SparseArray should be supported as well
92 else:
ValueError: Series.dtypes must be int, float or bool
Solution
Did anyone help you yet? If not: the answer lies in transforming your variables.
Go to this link: GitHub Discussion lightGBM
The creators of LightGBM were confronted with the same question once. In the link above they (STRIKER) tell you that you should transform your variables with astype('category') (pandas/scikit-learn) AND label-encode them, because you need an int value in your feature columns, specifically an int32.
However, label encoding and astype('category') normally amount to the same thing: integer encoding of the categories.
Another useful link is the advanced doc about the categorical feature: Categorical Feature Support on the LightGBM homepage, where they explain that LightGBM cannot deal with object (string) dtypes as in your data.
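Applied to your own column names, a minimal sketch could look like this (df, cat_cols, and y_train are the names from your snippet; pandas' .cat.codes produces the same integer codes as sklearn's LabelEncoder). Note that your traceback is raised from set_label(), so the label needs an int, float, or bool dtype as well:
# Minimal sketch, assuming df and cat_cols from the question;
# .cat.codes yields the same integer encoding as LabelEncoder.
for col in cat_cols:
    df[col] = df[col].astype('category').cat.codes.astype('int32')
# The traceback fires in set_label(), so the label must be numeric too;
# this assumes y_train is a pandas Series holding category/object values:
y_train = y_train.astype('category').cat.codes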
If you are still feeling uncomfortable with this explanation, here is my code snippet from the Kaggle space race dataset. If you are still having problems, just ask away.
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
import lightgbm as lgb

cat_feats = ['Company Name', 'Night_and_Day', 'Rocket Type', 'Rocket Mission Type', 'State', 'Country']

# Integer-encode every categorical feature, then cast explicitly to int
labelencoder = LabelEncoder()
for col in cat_feats:
    train_df[col] = labelencoder.fit_transform(train_df[col])
    train_df[col] = train_df[col].astype('int')

# The label must be int/float/bool as well, hence the same encoding
y = labelencoder.fit_transform(train_df['Status Mission'])
X = train_df.drop(['Status Mission'], axis=1)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

train_data = lgb.Dataset(X_train,
                         label=y_train,
                         categorical_feature=cat_feats,
                         free_raw_data=False)
test_data = lgb.Dataset(X_test,
                        label=y_test,
                        categorical_feature=cat_feats,
                        free_raw_data=False)
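With both Datasets built, training works just like your original call. A sketch with illustrative, untuned parameters (the encoded space-race target is multiclass):
# Illustrative parameters only; num_class is derived from the encoded target
params = {'objective': 'multiclass', 'num_class': len(set(y)), 'learning_rate': 0.05}
model = lgb.train(params, train_data, num_boost_round=100, valid_sets=[test_data])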
Answered By - Patrick Bormann