Issue
I have a training data set that contains both continuous and categorical values. I've used scikit-learn to one-hot encode the categorical features into a training set (x_train_1hot), and I also have a training set with the numerical features (x_train_num).
import numpy
from sklearn.preprocessing import OneHotEncoder

x_train_num = []
x_test_num = []
x_train_1hot = []
x_test_1hot = []
x_train_full = []
x_test_full = []
cat_feats = []
cat_feats_test = []
for instance in x_train:
    num_instance = []
    num_instance.append(instance[0])
    num_instance.append(instance[2])
    num_instance.append(instance[4])
    num_instance.append(instance[10])
    num_instance.append(instance[11])
    num_instance.append(instance[12])
    x_train_num.append(num_instance)
    cat_instance = []
    cat_instance.append(instance[1])
    cat_instance.append(instance[3])
    cat_instance.append(instance[5])
    cat_instance.append(instance[6])
    cat_instance.append(instance[7])
    cat_instance.append(instance[8])
    cat_instance.append(instance[9])
    cat_instance.append(instance[13])
    cat_feats.append(cat_instance)
for instance in x_test:
    num_instance = []
    num_instance.append(int(instance[0]))
    num_instance.append(int(instance[2]))
    num_instance.append(int(instance[4]))
    num_instance.append(int(instance[10]))
    num_instance.append(int(instance[11]))
    num_instance.append(int(instance[12]))
    x_test_num.append(num_instance)
    cat_instance = []
    cat_instance.append(instance[1])
    cat_instance.append(instance[3])
    cat_instance.append(instance[5])
    cat_instance.append(instance[6])
    cat_instance.append(instance[7])
    cat_instance.append(instance[8])
    cat_instance.append(instance[9])
    cat_instance.append(instance[13])
    cat_feats_test.append(cat_instance)
enc = OneHotEncoder(handle_unknown='ignore')
X = numpy.array(cat_feats)
x_train_1hot = enc.fit_transform(X).toarray()
How do I combine these into a full training set (x_train_full)? I've tried adding and concatenating the arrays, but I'm met with a bunch of errors. I think I'm fundamentally misunderstanding something.
I would like to do this with just scikit-learn or pure python, and avoid using pandas.
Edit: Here's a sample of the training data set (x_train):
[['39', ' State-gov', ' 77516', ' Bachelors', ' 13', ' Never-married', ' Adm-clerical', ' Not-in-family', ' White', ' Male', ' 2174', ' 0', ' 40', ' United-States'], ['50', ' Self-emp-not-inc', ' 83311', ' Bachelors', ' 13', ' Married-civ-spouse', ' Exec-managerial', ' Husband', ' White', ' Male', ' 0', ' 0', ' 13', ' United-States'], ['38', ' Private', ' 215646', ' HS-grad', ' 9', ' Divorced', ' Handlers-cleaners', ' Not-in-family', ' White', ' Male', ' 0', ' 0', ' 40', ' United-States'], ['53', ' Private', ' 234721', ' 11th', ' 7', ' Married-civ-spouse', ' Handlers-cleaners', ' Husband', ' Black', ' Male', ' 0', ' 0', ' 40', ' United-States'], ['28', ' Private', ' 338409', ' Bachelors', ' 13', ' Married-civ-spouse', ' Prof-specialty', ' Wife', ' Black', ' Female', ' 0', ' 0', ' 40', ' Cuba'], ['37', ' Private', ' 284582', ' Masters', ' 14', ' Married-civ-spouse', ' Exec-managerial', ' Wife', ' White', ' Female', ' 0', ' 0', ' 40', ' United-States'], ['49', ' Private', ' 160187', ' 9th', ' 5', ' Married-spouse-absent', ' Other-service', ' Not-in-family', ' Black', ' Female', ' 0', ' 0', ' 16', ' Jamaica'], ['52', ' Self-emp-not-inc', ' 209642', ' HS-grad', ' 9', ' Married-civ-spouse', ' Exec-managerial', ' Husband', ' White', ' Male', ' 0', ' 0', ' 45', ' United-States'], ['31', ' Private', ' 45781', ' Masters', ' 14', ' Never-married', ' Prof-specialty', ' Not-in-family', ' White', ' Female', ' 14084', ' 0', ' 50', ' United-States'], ['42', ' Private', ' 159449', ' Bachelors', ' 13', ' Married-civ-spouse', ' Exec-managerial', ' Husband', ' White', ' Male', ' 5178', ' 0', ' 40', ' United-States']]
the full original dataset can be found here: http://archive.ics.uci.edu/ml/datasets/Adult
Solution
I noticed you weren't converting x_train_num to int. But you should be able to concatenate like so:
import numpy as np

x_train_num = np.array(x_train_num, dtype=int)
x_train = np.concatenate([x_train_num, x_train_1hot], axis=1)
print(x_train.shape)
# (10, 33)
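As a side note, the manual column-splitting loops in the question can be replaced with NumPy fancy indexing, which keeps everything in scikit-learn plus plain NumPy as requested. The sketch below (using the first three rows of the question's sample data for illustration) collects the question's hard-coded column choices into two index lists and produces the same kind of combined array:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Three rows of the Adult sample data from the question, for illustration
x_train = [
    ['39', ' State-gov', ' 77516', ' Bachelors', ' 13', ' Never-married',
     ' Adm-clerical', ' Not-in-family', ' White', ' Male', ' 2174', ' 0',
     ' 40', ' United-States'],
    ['50', ' Self-emp-not-inc', ' 83311', ' Bachelors', ' 13',
     ' Married-civ-spouse', ' Exec-managerial', ' Husband', ' White',
     ' Male', ' 0', ' 0', ' 13', ' United-States'],
    ['38', ' Private', ' 215646', ' HS-grad', ' 9', ' Divorced',
     ' Handlers-cleaners', ' Not-in-family', ' White', ' Male', ' 0',
     ' 0', ' 40', ' United-States'],
]

num_cols = [0, 2, 4, 10, 11, 12]      # continuous columns
cat_cols = [1, 3, 5, 6, 7, 8, 9, 13]  # categorical columns

X = np.array(x_train)  # string array, shape (n, 14)

# Strip the stray leading spaces in the raw data, then cast to int
x_train_num = np.char.strip(X[:, num_cols]).astype(int)

enc = OneHotEncoder(handle_unknown='ignore')
x_train_1hot = enc.fit_transform(X[:, cat_cols]).toarray()

# Numeric and one-hot parts side by side in one dense array
x_train_full = np.concatenate([x_train_num, x_train_1hot], axis=1)
print(x_train_full.shape)  # (3, 22) for these three rows
```

For the test set, reuse the fitted encoder with enc.transform rather than fit_transform, e.g. x_test_1hot = enc.transform(X_test[:, cat_cols]).toarray(), so train and test end up with the same one-hot columns.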
Answered By - Aaron Keesing