Issue
Consider this simple example:
import pandas as pd

data = pd.DataFrame({'text1': ['hello world', 'hello universe'],
                     'text2': ['good morning', 'hello two three']})
data
Out[489]:
            text1            text2
0     hello world     good morning
1  hello universe  hello two three
As you can see, text1 and text2 share exactly one word: hello. I am trying to create n-grams separately for text1 and text2, and I want to combine the results into a single CountVectorizer feature matrix.
The idea is that I want to create n-grams separately for the two variables and use them as features in an ML algorithm. However, I do not want the extra n-grams that would be created by concatenating the strings together, like world good in hello world good morning. This is why I keep the n-gram creation separate.
The issue is that by doing so, the resulting (sparse) matrix will contain a duplicated hello column. See here:
from sklearn.feature_extraction.text import CountVectorizer

vector = CountVectorizer(ngram_range=(1, 2))
v1 = vector.fit_transform(data.text1.values)
# scikit-learn >= 1.0: use get_feature_names_out() instead
print(vector.get_feature_names())
['hello', 'hello universe', 'hello world', 'universe', 'world']
v2 = vector.fit_transform(data.text2.values)
print(vector.get_feature_names())
['good', 'good morning', 'hello', 'hello two', 'morning', 'three', 'two', 'two three']
And now concatenating v1 and v2 gives 13 columns:
from scipy.sparse import hstack
print(hstack((v1, v2)).toarray())
[[1 0 1 0 1 1 1 0 0 1 0 0 0]
[1 1 0 1 0 0 0 1 1 0 1 1 1]]
The proper text features should be these 12: hello, world, hello world, good, morning, good morning, hello universe, universe, two, three, hello two, two three.
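The 13-versus-12 count can be checked without scikit-learn. The following is a minimal sketch, assuming plain whitespace tokenization; word_ngrams is a hypothetical helper written for illustration, not CountVectorizer's actual tokenizer (which also lowercases and drops single-character tokens), although for this data both produce the same n-grams.

```python
def word_ngrams(text, n_min=1, n_max=2):
    """Return the word n-grams of `text` for n in [n_min, n_max]."""
    tokens = text.lower().split()
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(' '.join(tokens[i:i + n]))
    return grams


text1 = ['hello world', 'hello universe']
text2 = ['good morning', 'hello two three']

# Vocabulary of each column, built independently (as in the question).
vocab1 = sorted({g for doc in text1 for g in word_ngrams(doc)})
vocab2 = sorted({g for doc in text2 for g in word_ngrams(doc)})

stacked = vocab1 + vocab2                       # column labels after stacking
unique = sorted(set(stacked))                   # the vocabulary that is wanted
duplicates = sorted(set(vocab1) & set(vocab2))  # labels present in both

print(len(stacked), len(unique), duplicates)    # 13 12 ['hello']
```

The only label shared by both vocabularies is hello, which is exactly the extra column in the stacked matrix.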
What can I do here to have the proper unique words as features? Thanks!
Solution
I think the best way to tackle this problem is to create a custom transformer that wraps a CountVectorizer.
I would do it as follows:
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import CountVectorizer
import numpy as np

class MultiRowsCountVectorizer(BaseEstimator, TransformerMixin):
    def __init__(self):
        self.vectorizer = CountVectorizer(ngram_range=(1, 2))

    def fit(self, X, y=None):
        # flatten all text columns into one 1-D array, one document per field
        X_ = np.reshape(X.values, (-1,))
        self.vectorizer.fit(X_)
        return self

    def transform(self, X, y=None):
        # join the text columns of each row into a single string
        X_ = X.apply(' '.join, axis=1)
        return self.vectorizer.transform(X_)

    def get_feature_names(self):
        # scikit-learn >= 1.0: call get_feature_names_out() instead
        return self.vectorizer.get_feature_names()

transformer = MultiRowsCountVectorizer()
X_ = transformer.fit_transform(data)
transformer.get_feature_names()
The fit() method fits the CountVectorizer by treating the columns independently, while transform() treats the columns of a row as a single line of text.
np.reshape(X.values, (-1,)) turns a matrix of shape (N, n_columns) into a one-dimensional array of size (N*n_columns,). This ensures that each text field is treated as an independent document during fit(), so no n-gram spanning two fields enters the vocabulary. The transformation is then applied to all the text fields of a sample by joining them together.
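As a small illustration of the two views of the data described here, the sketch below rebuilds the example DataFrame from the question and shows what fit() and transform() each pass to the vectorizer. It only uses NumPy and pandas; nothing else is assumed.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({'text1': ['hello world', 'hello universe'],
                     'text2': ['good morning', 'hello two three']})

# View used for fitting: every field becomes its own document.
flat = np.reshape(data.values, (-1,))

# View used for transforming: one document per row, fields joined by a space.
joined = data.apply(' '.join, axis=1)

print(flat.shape)    # (4,)
print(list(flat))
print(list(joined))
```

The flattened array is in row-major order, so the fields of the first row come first, followed by the fields of the second row.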
This custom transformer returns the desired 12 features:
['good', 'good morning', 'hello', 'hello two', 'hello universe', 'hello world', 'morning', 'three', 'two', 'two three', 'universe', 'world']
and the following count matrix:
[[1 1 1 0 0 1 1 0 0 0 0 1]
[0 0 2 1 1 0 0 1 1 1 1 0]]
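One caveat worth keeping in mind: joining the fields before transforming would also count an n-gram that spans two fields whenever that n-gram happens to be in the fitted vocabulary (for example, if world good appeared inside some other field). A possible variant, sketched here as an assumption rather than as part of the original answer, fits one shared vocabulary and then transforms each column separately and adds the sparse results, so cross-field n-grams are never counted. On this data it gives the same matrix.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

data = pd.DataFrame({'text1': ['hello world', 'hello universe'],
                     'text2': ['good morning', 'hello two three']})

# Fit a single vocabulary on every field, each treated as its own document.
vectorizer = CountVectorizer(ngram_range=(1, 2))
vectorizer.fit(data.values.reshape(-1))

# Transform each column with the shared vocabulary and add the counts.
v1 = vectorizer.transform(data['text1'])
v2 = vectorizer.transform(data['text2'])
summed = (v1 + v2).toarray()

# Feature names in column order, read from the fitted vocabulary mapping.
names = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)

print(names)
print(summed)
```

Reading the names from vocabulary_ avoids depending on get_feature_names() or get_feature_names_out(), whichever the installed scikit-learn version provides.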
NOTES: this custom transformer assumes that X is a pd.DataFrame with n text columns.
EDIT: The text fields need to be joined with a space during transform().
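To see why the space matters, here is a tiny stdlib-only sketch using whitespace splitting as a stand-in for the vectorizer's tokenizer (an assumption made for illustration). Without the separator, the last word of one field fuses with the first word of the next, and neither word is counted.

```python
row = ['hello world', 'good morning']

with_space = ' '.join(row).split()
without_space = ''.join(row).split()

print(with_space)     # ['hello', 'world', 'good', 'morning']
print(without_space)  # ['hello', 'worldgood', 'morning']
```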
Answered By - Antoine Dubuis