Issue
I have a training data set that contains both continuous and categorical values. I've used scikit-learn to one-hot encode the categorical features into a training set (x_train_1hot), and I also have a training set with the numerical features (x_train_num).
import numpy
from sklearn.preprocessing import OneHotEncoder

x_train_num = []
x_test_num = []
x_train_1hot = []
x_test_1hot = []
x_train_full = []
x_test_full = []
cat_feats = []
cat_feats_test = []
for instance in x_train:
    num_instance = []
    num_instance.append(instance[0])
    num_instance.append(instance[2])
    num_instance.append(instance[4])
    num_instance.append(instance[10])
    num_instance.append(instance[11])
    num_instance.append(instance[12])
    x_train_num.append(num_instance)
    cat_instance = []
    cat_instance.append(instance[1])
    cat_instance.append(instance[3])
    cat_instance.append(instance[5])
    cat_instance.append(instance[6])
    cat_instance.append(instance[7])
    cat_instance.append(instance[8])
    cat_instance.append(instance[9])
    cat_instance.append(instance[13])
    cat_feats.append(cat_instance)
for instance in x_test:
    num_instance = []
    num_instance.append(int(instance[0]))
    num_instance.append(int(instance[2]))
    num_instance.append(int(instance[4]))
    num_instance.append(int(instance[10]))
    num_instance.append(int(instance[11]))
    num_instance.append(int(instance[12]))
    x_test_num.append(num_instance)
    cat_instance = []
    cat_instance.append(instance[1])
    cat_instance.append(instance[3])
    cat_instance.append(instance[5])
    cat_instance.append(instance[6])
    cat_instance.append(instance[7])
    cat_instance.append(instance[8])
    cat_instance.append(instance[9])
    cat_instance.append(instance[13])
    cat_feats_test.append(cat_instance)
enc = OneHotEncoder(handle_unknown='ignore')
X = numpy.array(cat_feats)
x_train_1hot = enc.fit_transform(X).toarray()
How do I combine these into a full training set (x_train_full)? I've tried adding and concatenating the arrays, but I'm met with a bunch of errors. I think I'm fundamentally misunderstanding something.
I would like to do this with just scikit-learn or pure python, and avoid using pandas.
Edit: Here's a sample of the training data set (x_train):
[['39', ' State-gov', ' 77516', ' Bachelors', ' 13', ' Never-married', ' Adm-clerical', ' Not-in-family', ' White', ' Male', ' 2174', ' 0', ' 40', ' United-States'], ['50', ' Self-emp-not-inc', ' 83311', ' Bachelors', ' 13', ' Married-civ-spouse', ' Exec-managerial', ' Husband', ' White', ' Male', ' 0', ' 0', ' 13', ' United-States'], ['38', ' Private', ' 215646', ' HS-grad', ' 9', ' Divorced', ' Handlers-cleaners', ' Not-in-family', ' White', ' Male', ' 0', ' 0', ' 40', ' United-States'], ['53', ' Private', ' 234721', ' 11th', ' 7', ' Married-civ-spouse', ' Handlers-cleaners', ' Husband', ' Black', ' Male', ' 0', ' 0', ' 40', ' United-States'], ['28', ' Private', ' 338409', ' Bachelors', ' 13', ' Married-civ-spouse', ' Prof-specialty', ' Wife', ' Black', ' Female', ' 0', ' 0', ' 40', ' Cuba'], ['37', ' Private', ' 284582', ' Masters', ' 14', ' Married-civ-spouse', ' Exec-managerial', ' Wife', ' White', ' Female', ' 0', ' 0', ' 40', ' United-States'], ['49', ' Private', ' 160187', ' 9th', ' 5', ' Married-spouse-absent', ' Other-service', ' Not-in-family', ' Black', ' Female', ' 0', ' 0', ' 16', ' Jamaica'], ['52', ' Self-emp-not-inc', ' 209642', ' HS-grad', ' 9', ' Married-civ-spouse', ' Exec-managerial', ' Husband', ' White', ' Male', ' 0', ' 0', ' 45', ' United-States'], ['31', ' Private', ' 45781', ' Masters', ' 14', ' Never-married', ' Prof-specialty', ' Not-in-family', ' White', ' Female', ' 14084', ' 0', ' 50', ' United-States'], ['42', ' Private', ' 159449', ' Bachelors', ' 13', ' Married-civ-spouse', ' Exec-managerial', ' Husband', ' White', ' Male', ' 5178', ' 0', ' 40', ' United-States']]
the full original dataset can be found here: http://archive.ics.uci.edu/ml/datasets/Adult
Solution
I noticed you weren't converting x_train_num to int. But you should be able to concatenate like so:
import numpy as np

x_train_num = np.array(x_train_num, dtype=int)
x_train = np.concatenate([x_train_num, x_train_1hot], axis=1)
print(x_train.shape)
# (10, 33)
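As a side note, the manual column-splitting loops in the question can be replaced with NumPy fancy indexing, which keeps everything in scikit-learn plus plain NumPy as requested. The sketch below (using the first three rows of the question's sample data for illustration) collects the question's hard-coded column choices into two index lists and produces the same kind of combined array:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Three rows of the Adult sample data from the question, for illustration
x_train = [
    ['39', ' State-gov', ' 77516', ' Bachelors', ' 13', ' Never-married',
     ' Adm-clerical', ' Not-in-family', ' White', ' Male', ' 2174', ' 0',
     ' 40', ' United-States'],
    ['50', ' Self-emp-not-inc', ' 83311', ' Bachelors', ' 13',
     ' Married-civ-spouse', ' Exec-managerial', ' Husband', ' White',
     ' Male', ' 0', ' 0', ' 13', ' United-States'],
    ['38', ' Private', ' 215646', ' HS-grad', ' 9', ' Divorced',
     ' Handlers-cleaners', ' Not-in-family', ' White', ' Male', ' 0',
     ' 0', ' 40', ' United-States'],
]

num_cols = [0, 2, 4, 10, 11, 12]      # continuous columns
cat_cols = [1, 3, 5, 6, 7, 8, 9, 13]  # categorical columns

X = np.array(x_train)  # string array, shape (n, 14)

# Strip the stray leading spaces in the raw data, then cast to int
x_train_num = np.char.strip(X[:, num_cols]).astype(int)

enc = OneHotEncoder(handle_unknown='ignore')
x_train_1hot = enc.fit_transform(X[:, cat_cols]).toarray()

# Numeric and one-hot parts side by side in one dense array
x_train_full = np.concatenate([x_train_num, x_train_1hot], axis=1)
print(x_train_full.shape)  # (3, 22) for these three rows
```

For the test set, reuse the fitted encoder with enc.transform rather than fit_transform, e.g. x_test_1hot = enc.transform(X_test[:, cat_cols]).toarray(), so train and test end up with the same one-hot columns.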
Answered By - Aaron Keesing