Tuesday, February 22, 2022

[FIXED] Use 2 columns for training data in machine learning

Tags: pandas, python, scikit-learn

Issue

I have a dataframe named ds with 3 columns of data, as follows:

         text     count      label
0   I have...        12        pos   
1   You sh...         8        neg
2   In thi...         9        neg
.
.

I was given an example that uses only the text column to create the training and test data, with code like this:

import pandas as pd
from sklearn.model_selection import train_test_split

X = ds['text']
y = ds['label']
train_X, test_X, train_Y, test_Y = train_test_split(X, y, test_size=0.25, random_state=0)

df_train75 = pd.DataFrame()
df_train75['text'] = train_X
df_train75['label'] = train_Y

df_test25 = pd.DataFrame()
df_test25['text'] = test_X
df_test25['label'] = test_Y

from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vect_7525 = TfidfVectorizer(ngram_range = (1, 1))
tfidf_vect_7525.fit(ds['text'])
train_X_tfidf_7525 = tfidf_vect_7525.transform(df_train75['text'])
test_X_tfidf_7525 = tfidf_vect_7525.transform(df_test25['text'])

from sklearn.svm import SVC

model = SVC(kernel='linear')
model.fit(train_X_tfidf_7525,train_Y)

==================================================================================

I tried to include both the text and count columns by simply changing the first line to

X = ds[['text', 'count']]

And it gives me an error on

from sklearn.svm import SVC

model = SVC(kernel='linear')
model.fit(train_X_tfidf_7525, train_Y)

==================================================================================

My question is: how should I approach this problem? I looked at other questions but could not find an answer. One "solution" I found is to use

X = ds['text'].astype(str) + ' ' + ds['count'].astype(str)

but I don't think that is the correct way to approach this problem. Thank you in advance!
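
Editorial note: the error message is not shown in the question, so its exact cause is unknown. One common pitfall in this situation, sketched here purely as an assumption, is that a text vectorizer iterates over whatever it is given, and iterating over a pandas DataFrame yields its column labels rather than its rows. The toy data below is hypothetical and only mirrors the column names from the question.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy data with the same column names as in the question.
ds = pd.DataFrame({
    'text': ['This is doc1', 'This is doc2', 'Here is doc3'],
    'count': [12, 8, 9],
})

# Iterating over a DataFrame yields its column labels, not its rows.
column_labels = list(ds[['text', 'count']])
print(column_labels)  # ['text', 'count']

# The vectorizer therefore treats the two column names as two "documents".
vect = TfidfVectorizer()
X_wrong = vect.fit_transform(ds[['text', 'count']])
vocabulary = sorted(vect.vocabulary_)

print(X_wrong.shape)  # (2, 2): two rows, although ds has three
print(vocabulary)     # ['count', 'text']
```

A matrix built this way has a different number of rows than the label vector, which would make a later model.fit call fail with a sample-count mismatch.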

Solution

In cases where you want to transform different columns of the input separately, you should choose ColumnTransformer. You can also choose to not transform specific columns. In any case, the results of each (non-)transformation will be concatenated again into a single array. A small example:

from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
import pandas as pd


df = pd.DataFrame({
    'text': ['This is doc1', 'This is doc2', 'Here is doc3']*3,
    'count': [12, 8, 9]*3,
    'label': ['pos', 'neg', 'neg']*3
})

X = df[['text', 'count']]
y = df['label']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

transformer = ColumnTransformer([
    ('vectorizer', TfidfVectorizer(ngram_range=(1, 1)), 'text')
], remainder='passthrough')

X_vec_train = transformer.fit_transform(X_train)
X_vec_test = transformer.transform(X_test)

model = SVC(kernel='linear')
model.fit(X_vec_train, y_train)
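
To make the concatenation step more concrete, the transformed matrix can be inspected directly. The following is a minimal sketch on the same toy data, not part of the original answer. It sets sparse_threshold=0 (an addition made here) so that ColumnTransformer returns a dense array that is easy to look at, and it skips the train/test split for brevity.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer

df = pd.DataFrame({
    'text': ['This is doc1', 'This is doc2', 'Here is doc3'] * 3,
    'count': [12, 8, 9] * 3,
})

transformer = ColumnTransformer(
    [('vectorizer', TfidfVectorizer(ngram_range=(1, 1)), 'text')],
    remainder='passthrough',
    sparse_threshold=0,  # addition for this sketch: always return a dense array
)
X_vec = transformer.fit_transform(df)

# The fitted vectorizer is reachable by the name given in the tuple.
vocab_size = len(transformer.named_transformers_['vectorizer'].vocabulary_)

print(vocab_size)    # number of TF-IDF columns
print(X_vec.shape)   # (rows, vocab_size + 1 passed-through column)
print(X_vec[:, -1])  # the last column is the untouched 'count' column
```

The TF-IDF columns come first because the vectorizer is listed first, and the passed-through remainder is appended at the end.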

The syntax for the list of transformers is a list of tuples with (name, transformer, columns), where you can specify which transformer to apply to which column. By setting remainder='passthrough', all remaining columns that were not specified previously will be automatically passed through and will just be concatenated with the result. For more on this, see the documentation.
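
As an illustration of the (name, transformer, columns) syntax, the count column can also get its own entry instead of relying on remainder. The sketch below is a variation, not the answer's method: scaling count with StandardScaler, wrapping everything in a Pipeline, and passing stratify=y are all assumptions added here. Note that 'text' is given as a plain string, so the vectorizer receives a 1-D column of strings, while ['count'] is given as a list, so the scaler receives the 2-D input it expects.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

df = pd.DataFrame({
    'text': ['This is doc1', 'This is doc2', 'Here is doc3'] * 3,
    'count': [12, 8, 9] * 3,
    'label': ['pos', 'neg', 'neg'] * 3,
})

X = df[['text', 'count']]
y = df['label']

# stratify=y keeps both labels in the tiny training split (added for this toy data).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=42, stratify=y
)

preprocess = ColumnTransformer([
    # A string column spec passes a 1-D column to the vectorizer.
    ('vectorizer', TfidfVectorizer(ngram_range=(1, 1)), 'text'),
    # A list column spec passes a 2-D frame to the scaler.
    ('scaler', StandardScaler(), ['count']),
])

# The pipeline fits the preprocessing on the training data only,
# then applies the same fitted transformers when predicting.
clf = Pipeline([
    ('preprocess', preprocess),
    ('svc', SVC(kernel='linear')),
])
clf.fit(X_train, y_train)

predictions = clf.predict(X_test)
print(predictions)
```

Keeping the transformer and the model in one Pipeline means fit and predict can be called on the raw two-column dataframe, without managing the intermediate matrices by hand.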

Answered By - afsharov

This answer was collected from Stack Overflow and tested by PythonFixing community admins. It is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
