Issue
I'm trying to build an NLP model that uses XGBoost. Here is my code:
import joblib
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer

loaded_model = joblib.load('fraud.sav')

def clean_data(data):
    '''
    Cleaning data, removing digits, punctuation etc.
    '''
    return data

data_processed = clean_data(data_raw)
input_cleaned = clean_data(user_data)
total_data = pd.concat([data_processed, input_cleaned])

vectorizer = TfidfVectorizer(strip_accents='unicode',
                             analyzer='word',
                             ngram_range=(1, 2),
                             max_features=15000,
                             smooth_idf=True,
                             sublinear_tf=True)
vectorizer.fit(total_data['text'])
X_training_vectorized = vectorizer.transform(total_data['text'])
X_test = vectorizer.transform(input_cleaned['text'])

pca = PCA(n_components=0.95)
pca.fit(X_training_vectorized.toarray())
X_test_pca = pca.transform(X_test.toarray())

y_test = loaded_model.predict(X_test_pca)
What I don't understand is this: I previously trained my model on a dataset of 10,000+ documents and got good results. I then decided to save the model so that I can make predictions on user data. My model detects whether a text document is fraudulent or real, and I have a dataset that is labelled for fraudulent data.
I understand that when transforming data, the vectorizer and PCA should both be fitted to the whole dataset so that the output ends up with the same shape.
What I don't understand is how to transform user input so that it has the same shape as the data my pretrained model expects. What's the proper procedure for this? I would love answers that also consider the performance/time needed to process the data.
Solution
This is done automatically as part of the CountVectorizer (and likewise the TfidfVectorizer used in the question): tokens which appear in a new dataset but not in the data it was fit on are simply ignored, and the shape of the output remains the same.
For example:
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd
data = ['the cat in the hat', 'cats like milk', 'cats and rats']
cv = CountVectorizer()
dtm = cv.fit_transform(data)
pd.DataFrame(dtm.todense(), columns=cv.get_feature_names_out())
| and | cat | cats | hat | in | like | milk | rats | the |
|-----|-----|------|-----|----|------|------|------|-----|
| 0   | 1   | 0    | 1   | 1  | 0    | 0    | 0    | 2   |
| 0   | 0   | 1    | 0   | 0  | 1    | 1    | 0    | 0   |
| 1   | 0   | 1    | 0   | 0  | 0    | 0    | 1    | 0   |
Now if we pass new data to the fitted CountVectorizer that includes tokens it hasn't seen before (birds, dogs), they are ignored, and the dimensionality of the document-term matrix remains the same:
data2 = ['dogs and cats', 'birds and cats', 'dogs and birds']
dtm = cv.transform(data2)
pd.DataFrame(dtm.todense(), columns=cv.get_feature_names_out())
| and | cat | cats | hat | in | like | milk | rats | the |
|-----|-----|------|-----|----|------|------|------|-----|
| 1   | 0   | 1    | 0   | 0  | 0    | 0    | 0    | 0   |
| 1   | 0   | 1    | 0   | 0  | 0    | 0    | 0    | 0   |
| 1   | 0   | 0    | 0   | 0  | 0    | 0    | 0    | 0   |
Since unseen tokens are ignored, this underscores the importance of retraining periodically and/or of keeping the distribution of tokens consistent between your training data and any data the model is applied to.
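Because of this behaviour, the proper procedure for the original question is to fit the vectorizer and PCA once on the training data, persist them alongside the model, and only call transform on new user input at prediction time; there is no need to re-fit on a concatenation of training and user data. A minimal sketch, reusing the question's fraud.sav model but with hypothetical filenames and variables (vectorizer.sav, pca.sav, train_df, user_df):

import joblib
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer

# --- training time: fit the preprocessing once and persist it ---
vectorizer = TfidfVectorizer(strip_accents='unicode', ngram_range=(1, 2), max_features=15000)
X_train = vectorizer.fit_transform(train_df['text'])   # train_df: assumed labelled training data
pca = PCA(n_components=300)                            # fixed integer output dimensionality
X_train_pca = pca.fit_transform(X_train.toarray())
# ... train the XGBoost model on X_train_pca and save it as 'fraud.sav' ...
joblib.dump(vectorizer, 'vectorizer.sav')
joblib.dump(pca, 'pca.sav')

# --- prediction time: load the fitted objects and only transform ---
vectorizer = joblib.load('vectorizer.sav')
pca = joblib.load('pca.sav')
model = joblib.load('fraud.sav')
X_user = vectorizer.transform(user_df['text'])         # same number of columns as at training time
y_pred = model.predict(pca.transform(X_user.toarray()))

This also addresses the performance concern: neither the vectorizer nor the PCA has to be re-fit for every prediction, only the much cheaper transform is run on the user input.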
Additionally, I would avoid using a floating-point value for PCA's n_components and instead pick a set number of components (pass an integer value as opposed to a float), so that the output dimensionality of the preprocessing is consistent.
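For illustration, a minimal sketch of that difference, using a random array as a stand-in for a dense document-term matrix:

import numpy as np
from sklearn.decomposition import PCA

X = np.random.rand(100, 500)              # stand-in data: 100 documents, 500 features

pca_float = PCA(n_components=0.95)        # keep enough components to explain 95% of variance
print(pca_float.fit_transform(X).shape)   # second dimension depends on the data it is fit on

pca_int = PCA(n_components=50)            # always produces exactly 50 components
print(pca_int.fit_transform(X).shape)     # (100, 50)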
Answered By - NLP from scratch