Saturday, November 13, 2021

[FIXED] How can we predict target values for new data, based on a different dataset? scikit learn / gaussianNB

machine-learning, naivebayes, python, scikit-learn

Issue

I am struggling to understand how training a model connects to making predictions on new data. My situation: I train an algorithm on a labeled dataset. After importing the data, encoding it with fit_transform, and fitting the model, I predict on the data_test split returned by train_test_split and get a really nice result. What stumps me is how to feed a new, unlabeled dataset to the trained model, which has learned from the labeled dataset. I know that, technically, the test split had its labels withheld during prediction, but I do not see how to give the GaussianNB algorithm new feature data so that it predicts the unknown labels.

My code for the training:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import LabelEncoder

df = pd.read_csv(chosen_file, sep=',')
# encode the categorical (object) columns as integers
cat_cols = df.select_dtypes(include=['object'])
cat_cols_filled = cat_cols.fillna('0')
le = LabelEncoder()
cat_cols_fitted = cat_cols_filled.apply(lambda col: le.fit_transform(col))
# encode the remaining columns the same way
non_cat_cols = df.select_dtypes(exclude=['object'])
non_cat_cols_filled = non_cat_cols.fillna('0')
non_cat_cols_fitted = non_cat_cols_filled.apply(lambda col: le.fit_transform(col))
# the label is the last column of df
target_prep = df.iloc[:,-1]
target = le.fit_transform(target_prep.astype(str))
# note: data still includes the target column here (see EDIT below)
data = pd.concat([cat_cols_fitted, non_cat_cols_fitted], axis=1)
data_train, data_test, target_train, target_test = train_test_split(data, target, train_size=0.3)
alg = GaussianNB()
pred = alg.fit(data_train, target_train).predict(data_test)

This is all fine and dandy, but I cannot work out what to pass in place of data_test. Do I need to give the new dataset placeholder values for the label column? In my original DataFrame the label column is the last one.

My attempt:

new_df = pd.read_csv(new_chosen_file, sep=',')
new_cat_cols = new_df.select_dtypes(include=['object'])
new_cat_cols_filled = new_cat_cols.fillna('0')
new_cat_cols_fitted = new_cat_cols_filled.apply(lambda col: le.fit_transform(col))
new_non_cat_cols = new_df.select_dtypes(exclude=['object'])
new_non_cat_cols_filled = new_non_cat_cols.fillna('0')
new_non_cat_cols_fitted = new_non_cat_cols_filled.apply(lambda col: le.fit_transform(col))
new_data = pd.concat([new_cat_cols_fitted, new_non_cat_cols_fitted], axis=1)
print(new_data)
new_pred = alg.predict(new_data)
new_prediction = pd.DataFrame({'NEW ML prediction':new_pred})
print(new_pred)
print(new_prediction)

Notice that I do not provide the target column in the new dataset. However, the program errors out if my column count does not match, so I am forced to add at least the header of the label column to avoid that.

Am I way off in my understanding of how this works? Please let me know.

EDIT:

I found my major screw-up in the code: I had not separated my target column from the data DataFrame, which is why data had 10 columns. I can finally appreciate the simplicity of the code.
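The fix described in the edit can be sketched roughly as follows. This is a minimal illustration on made-up toy data, not the asker's dataset; the column names f1, f2 and label are hypothetical. The point is only that the features passed to fit exclude the label column, so an unlabeled dataset with the same feature columns can be passed to predict without a placeholder label.

```python
import pandas as pd
from sklearn.naive_bayes import GaussianNB

# Hypothetical labeled data: two feature columns, label in the last column.
labeled = pd.DataFrame({
    "f1": [1.0, 1.2, 0.9, 5.0, 5.2, 4.8],
    "f2": [0.1, 0.2, 0.0, 3.0, 3.1, 2.9],
    "label": [0, 0, 0, 1, 1, 1],
})

features = labeled.iloc[:, :-1]  # every column except the last
target = labeled.iloc[:, -1]     # the last column only

alg = GaussianNB().fit(features, target)

# Hypothetical unlabeled data: same feature columns, no label column at all.
unlabeled = pd.DataFrame({"f1": [1.1, 5.1], "f2": [0.1, 3.0]})
new_pred = alg.predict(unlabeled)
print(new_pred)
```

Because the model was trained on two feature columns, the new data needs exactly those two columns and nothing else.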

Solution

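alg = GaussianNB() creates an unfitted model. Calling alg.fit(data_train, target_train) fits that model in place and returns it, so alg stays fitted after the chained call, while pred holds only the predictions for data_test. The same alg can therefore be reused to predict on new data.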

Calling several methods in a row, as in alg.fit(data_train, target_train).predict(data_test), is known as method chaining and can cause confusion.
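It can help to check what state alg is in after the chained call. The following is a minimal sketch on made-up toy data (not the asker's dataset). It relies on the scikit-learn convention that fit returns the estimator itself, and on the classes_ attribute that GaussianNB sets once it has been fitted.

```python
from sklearn.naive_bayes import GaussianNB

# Toy data, invented for illustration only.
X_train = [[1.0, 0.1], [1.2, 0.2], [5.0, 3.0], [5.2, 3.1]]
y_train = [0, 0, 1, 1]
X_test = [[0.9, 0.0], [4.8, 2.9]]

alg = GaussianNB()
pred = alg.fit(X_train, y_train).predict(X_test)  # chained call

# fit() returns the very same object it was called on ...
same_object = alg.fit(X_train, y_train) is alg
# ... and that object now carries fitted attributes such as classes_.
is_fitted = hasattr(alg, "classes_")

# So alg can be used again later, for example on new rows.
later_pred = alg.predict([[5.1, 3.0]])
print(same_object, is_fitted, pred, later_pred)
```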

A cleaner and more readable alternative is:

alg = GaussianNB()                       # instantiate the model
alg = alg.fit(data_train, target_train)  # fit the model on the training data
pred = alg.predict(data_test)            # predict on the test split
new_pred = alg.predict(new_data)         # predict on the new, unlabeled data
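One thing to watch when preparing new_data: the encoding learned from the training data should be reused, not fitted again on the new file, otherwise the same category can map to a different integer. The sketch below is one possible way to do that, not the asker's method. It swaps the per-column LabelEncoder for scikit-learn's OrdinalEncoder (the handle_unknown option assumes scikit-learn 0.24 or newer), and the data and column names are hypothetical.

```python
import pandas as pd
from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical training data with the label in the last column.
train_df = pd.DataFrame({
    "colour": ["red", "red", "blue", "blue"],
    "size": [1.0, 1.2, 5.0, 5.2],
    "label": ["a", "a", "b", "b"],
})
X_train = train_df.iloc[:, :-1].copy()
y_train = train_df.iloc[:, -1]

# Fit the encoder once, on the training features only.
# Categories never seen in training are mapped to -1 at transform time.
enc = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
X_train["colour"] = enc.fit_transform(X_train[["colour"]]).ravel()

alg = GaussianNB().fit(X_train, y_train)

# Hypothetical new, unlabeled data: same feature columns, no label column.
new_df = pd.DataFrame({"colour": ["blue", "red"], "size": [4.9, 1.1]})
X_new = new_df.copy()
# Reuse the fitted encoder: transform only, no second fit.
X_new["colour"] = enc.transform(X_new[["colour"]]).ravel()

new_pred = alg.predict(X_new)
print(new_pred)
```

Keeping the fitted encoder next to the fitted model means both can be applied to any later file that has the same feature columns.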

Answered By - rakidedigama

This answer was collected from Stack Overflow and tested by PythonFixing community admins. It is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
