Issue
I want to train a sklearn classifier that labels data entries (yes, no) using a text feature (content), a numerical feature (population) and a categorical feature (location).
The model below uses only the text data to classify each entry. The text is converted with TF-IDF into a sparse matrix before being passed to the classifier.
Is there a way to also use the other features? They are not in sparse matrix format, so I am not sure how to combine them with the sparse text matrix.
#import libraries
import string, re, nltk
import pandas as pd
from pandas import Series, DataFrame
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
# read data and remove empty lines
dataset = (
    pd.read_csv('sample_data.txt',
                sep='\t',
                names=['content', 'location', 'population', 'target'])
    .dropna(how='all')
    .dropna(subset=['target'])
)
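# drop the first row and reset the dataframe index
# (reset_index returns a new frame, so later column assignments do not act on a slice)
df = dataset[1:].reset_index(drop=True)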
#add an extra column which is the length of text
df['length'] = df['content'].apply(len)
#create a dataframe that contains only two columns the text and the target class
df_cont = df.copy()
df_cont = df_cont.drop(
['location','population','length'],axis = 1)
# stemmer and stopword set, created once and reused by text_process
ps = PorterStemmer()
stop_words = set(stopwords.words('english'))
# function that takes in a string of text, removes URLs, punctuation, digits and stopwords, and returns a list of stemmed words
def text_process(mess):
    # lower-case the string
    mess = mess.lower()
    # remove URLs
    nourl = re.sub(r'http\S+', ' ', mess)
    # remove punctuation characters
    nopunc = [char for char in nourl if char not in string.punctuation]
    # join the characters back into a string, dropping digits
    nopunc = ''.join([i for i in nopunc if not i.isdigit()])
    # remove stopwords and stem the remaining words
    return [ps.stem(word) for word in nopunc.split() if word not in stop_words]
#split the data in train and test set and train/test the model
cont_train, cont_test, target_train, target_test = train_test_split(df_cont['content'],df_cont['target'],test_size = 0.2,shuffle = True, random_state = 1)
pipeline = Pipeline([('bag_of_words',CountVectorizer(analyzer=text_process)),
('tfidf',TfidfTransformer()),
('classifier',MultinomialNB())])
pipeline.fit(cont_train,target_train)
predictions = pipeline.predict(cont_test)
# classification_report expects the true labels first, then the predictions
print(classification_report(target_test, predictions))
The model is expected to return the following: accuracy, precision, recall, f1-score
Solution
I believe you need to use one-hot encoding for the 'location' feature. For the given data, the one-hot vectors would be:
London - 100
Manchester - 010
Edinburgh - 001
The vector length equals the number of distinct cities in the data. Note that each bit is a separate feature. Categorical data is usually converted to one-hot vectors before being fed to a machine learning algorithm.
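One way to produce such vectors is sklearn's OneHotEncoder. The snippet below is only a minimal sketch: the city values are made-up sample data, not the asker's file. Note that the encoder orders its columns alphabetically by category, so the bit positions differ from the illustration above.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical location values, one row per data entry.
locations = np.array([["London"], ["Manchester"], ["Edinburgh"], ["London"]])

# handle_unknown="ignore" encodes cities unseen during fit as an all-zero row.
encoder = OneHotEncoder(handle_unknown="ignore")
location_onehot = encoder.fit_transform(locations)  # scipy sparse matrix

# Columns follow the sorted category order.
print(list(encoder.categories_[0]))
# Dense view of the encoding, one row per entry and one column per city.
print(location_onehot.toarray())
```

The result is already a sparse matrix, which makes it straightforward to combine with the TF-IDF output.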
Once this is done, you can concatenate each row's features (the TF-IDF vector, the one-hot location vector and the population value) into a single 1D array and feed that to the classifier.
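The concatenation can be sketched with scipy.sparse.hstack, which joins matrices column-wise while keeping the result sparse. Everything in this example is an assumption for illustration: the toy data, the variable names, the use of TfidfVectorizer in place of the question's CountVectorizer plus TfidfTransformer, and the choice of MinMaxScaler (used here because MultinomialNB needs non-negative inputs).

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Hypothetical toy data standing in for the content, location, population and target columns.
content = ["great city centre", "rainy and grey", "old castle on a hill", "busy financial hub"]
location = np.array([["London"], ["Manchester"], ["Edinburgh"], ["London"]])
population = np.array([[8_900_000], [550_000], [500_000], [8_900_000]], dtype=float)
target = np.array(["yes", "no", "no", "yes"])

tfidf = TfidfVectorizer()
onehot = OneHotEncoder(handle_unknown="ignore")
scaler = MinMaxScaler()

X_text = tfidf.fit_transform(content)                     # sparse TF-IDF features
X_loc = onehot.fit_transform(location)                    # sparse one-hot features
X_pop = csr_matrix(scaler.fit_transform(population))      # numeric column scaled to [0, 1]

# Stack the three blocks side by side: each row is now one combined feature vector.
X = hstack([X_text, X_loc, X_pop]).tocsr()

clf = MultinomialNB().fit(X, target)
print(X.shape)
```

For the test set, the same fitted objects should be applied with transform rather than fit_transform before stacking, so that the columns line up with the training matrix. sklearn's ColumnTransformer is another option that wraps these steps into a single pipeline stage.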
Answered By - Achintha Ihalage