Issue
I have a dataframe named ds with 3 columns of data as follows:
        text  count label
0  I have...     12   pos
1  You sh...      8   neg
2  In thi...      9   neg
.
.
I was given an example that uses only the text column to create the training and test data, with code like this:
import pandas as pd
from sklearn.model_selection import train_test_split
X = ds['text']
y = ds['label']
train_X, test_X, train_Y, test_Y = train_test_split(X, y, test_size=0.25, random_state=0)
df_train75 = pd.DataFrame()
df_train75['text'] = train_X
df_train75['label'] = train_Y
df_test25 = pd.DataFrame()
df_test25['text'] = test_X
df_test25['label'] = test_Y
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_vect_7525 = TfidfVectorizer(ngram_range = (1, 1))
tfidf_vect_7525.fit(ds['text'])
train_X_tfidf_7525 = tfidf_vect_7525.transform(df_train75['text'])
test_X_tfidf_7525 = tfidf_vect_7525.transform(df_test25['text'])
from sklearn.svm import SVC
model = SVC(kernel='linear')
model.fit(train_X_tfidf_7525,train_Y)
==================================================================================
I tried to include both the text and count columns by simply changing the first line to
X = ds[['text', 'count']]
And it gives me an error on
from sklearn.svm import SVC
model = SVC(kernel='linear')
model.fit(train_X_tfidf_9010,train_Y)
==================================================================================
My question is, how should I approach this problem? I looked through other questions but could not find an answer. One "solution" I found is to use
X = ds['text'].astype(str) + ' ' + ds['count'].astype(str)
but I don't think that is the correct way to approach this problem. Thank you in advance!
Solution
In cases where you want to transform different columns of the input separately, you should choose ColumnTransformer. You can also choose to not transform specific columns. In any case, the results of each (non-)transformation will be concatenated again into a single array. A small example:
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
import pandas as pd
df = pd.DataFrame({
    'text': ['This is doc1', 'This is doc2', 'Here is doc3']*3,
    'count': [12, 8, 9]*3,
    'label': ['pos', 'neg', 'neg']*3
})
X = df[['text', 'count']]
y = df['label']
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
transformer = ColumnTransformer([
    ('vectorizer', TfidfVectorizer(ngram_range=(1, 1)), 'text')
], remainder='passthrough')
X_vec_train = transformer.fit_transform(X_train)
X_vec_test = transformer.transform(X_test)
model = SVC(kernel='linear')
model.fit(X_vec_train, y_train)
The transformers are given as a list of tuples of the form (name, transformer, columns), where you specify which transformer to apply to which column. By setting remainder='passthrough', all remaining columns that were not specified previously are passed through untransformed and concatenated with the result. For more on this, see the scikit-learn documentation for ColumnTransformer.
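As a variation on the (name, transformer, columns) tuples described above, the numeric column could get its own transformer instead of being passed through. The sketch below is not part of the original answer: scaling count with StandardScaler and wrapping everything in a Pipeline are assumptions made for illustration, and the names 'scaler', 'features' and 'svc' are arbitrary. Note that 'text' is selected with a plain string (the vectorizer expects a 1-D column of documents), while 'count' is selected with a list (the scaler expects a 2-D array).

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Same toy data as in the example above.
df = pd.DataFrame({
    'text': ['This is doc1', 'This is doc2', 'Here is doc3'] * 3,
    'count': [12, 8, 9] * 3,
    'label': ['pos', 'neg', 'neg'] * 3,
})
X = df[['text', 'count']]
y = df['label']

transformer = ColumnTransformer([
    # String selector: the vectorizer receives a 1-D column of documents.
    ('vectorizer', TfidfVectorizer(ngram_range=(1, 1)), 'text'),
    # List selector: the scaler receives a 2-D array with one column.
    # Scaling 'count' is an illustrative assumption, not a requirement.
    ('scaler', StandardScaler(), ['count']),
])

# The transformed matrix has one column per TF-IDF term plus one for 'count'.
X_vec = transformer.fit_transform(X)
n_tfidf = len(transformer.named_transformers_['vectorizer'].vocabulary_)
print(X_vec.shape, n_tfidf)

# Chaining the transformer and the classifier keeps fit/predict in one object.
pipeline = Pipeline([
    ('features', transformer),
    ('svc', SVC(kernel='linear')),
])
pipeline.fit(X, y)
predictions = pipeline.predict(X)
```

With a Pipeline, the vectorizer is only ever fitted on whatever data is passed to fit, so calling pipeline.fit on the training split and pipeline.predict on the test split avoids fitting the vocabulary on test documents.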
Answered By - afsharov