Monday, February 21, 2022

[FIXED] NLP - How to add more features?

Tags: machine-learning, nlp, python, scikit-learn, tf-idf

Issue

I want to train a scikit-learn classifier that labels data entries (yes/no) using a text feature (content), a numerical feature (population), and a categorical feature (location).

The model below uses only the text data to classify each entry. The text is converted with TF-IDF into a sparse matrix before being passed to the classifier.

Is there a way to also use the other features? They are not in sparse matrix format, so I am not sure how to combine them with the sparse text matrix.


    #import libraries
    import string, re, nltk
    import pandas as pd
    from pandas import Series, DataFrame
    from nltk.corpus import stopwords
    from nltk.stem.porter import PorterStemmer
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.feature_extraction.text import TfidfTransformer
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import classification_report
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import Pipeline

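    # read data and remove empty lines
    # (the method chain is wrapped in parentheses so it can span several lines)
    dataset = (pd.read_csv('sample_data.txt',
                           sep='\t',
                           names=['content','location','population','target'])
                 .dropna(how='all')
                 .dropna(subset=['target']))

    # skip the first row (presumably the file's header line); copy to avoid SettingWithCopyWarning
    df = dataset[1:].copy()

    #reset dataframe index (drop=True so the old index is not kept as an extra column)
    df.reset_index(drop=True, inplace=True)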

    #add an extra column which is the length of text
    df['length'] = df['content'].apply(len)

    #create a dataframe that contains only two columns the text and the target class
    df_cont = df.copy()
    df_cont = df_cont.drop(
        ['location','population','length'],axis = 1)

    # function that takes in a string of text, removes URLs, punctuation, digits and stopwords, and returns a list of stemmed words

    # build the stemmer and the stopword set once instead of on every call
    ps = PorterStemmer()
    stop_words = set(stopwords.words('english'))

    def text_process(mess):
        # lower case for string
        mess = mess.lower()

        # remove URLs
        nourl = re.sub(r'http\S+', ' ', mess)

        # remove punctuation characters
        nopunc = [char for char in nourl if char not in string.punctuation]

        # join the characters again to form the string and remove digits
        nopunc = ''.join([i for i in nopunc if not i.isdigit()])

        # remove stopwords and stem the remaining words
        return [ps.stem(word) for word in nopunc.split() if word not in stop_words]

    #split the data in train and test set and train/test the model

    cont_train, cont_test, target_train, target_test = train_test_split(df_cont['content'],df_cont['target'],test_size = 0.2,shuffle = True, random_state = 1)


    pipeline = Pipeline([('bag_of_words',CountVectorizer(analyzer=text_process)),
                         ('tfidf',TfidfTransformer()),
                         ('classifier',MultinomialNB())])

    pipeline.fit(cont_train,target_train)
    predictions = pipeline.predict(cont_test)

    # classification_report expects the true labels first, then the predictions
    print(classification_report(target_test, predictions))

The model is expected to report the following metrics: accuracy, precision, recall, and F1-score.

Solution

I believe you need to use one-hot encoding for the 'location' feature. For example, with three cities the one-hot vectors would be:

London - 100

Manchester - 010

Edinburgh - 001

The vector length equals the number of distinct cities in the data, and each bit becomes a separate feature. Categorical data is usually converted to one-hot vectors before being fed to a machine learning algorithm.
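As a minimal sketch of that encoding step, scikit-learn's OneHotEncoder can produce these vectors. The toy DataFrame below is made up for illustration and is not the asker's data. Note that the encoder orders its columns by sorted category name, so the bit positions may differ from the listing above.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy data, made up for illustration only
locations = pd.DataFrame(
    {'location': ['London', 'Manchester', 'Edinburgh', 'London']}
)

# handle_unknown='ignore' encodes a city not seen during fit as all zeros
encoder = OneHotEncoder(handle_unknown='ignore')

# fit_transform returns a scipy sparse matrix with one column per city
one_hot = encoder.fit_transform(locations)

print(encoder.categories_[0])  # the city behind each column
print(one_hot.toarray())       # dense view, one row per entry
```

Because the result is already a sparse matrix, it can sit next to the TF-IDF matrix without converting either one to a dense array.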

Once this is done, you can concatenate each row's features (the TF-IDF vector, the one-hot location vector, and the population value) into a single 1D array and feed that to the classifier.
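One way to do this concatenation without building the arrays by hand is scikit-learn's ColumnTransformer, which applies a different transformer to each column and stacks the results side by side. This is a sketch under assumptions, not the answerer's exact method: the toy rows are invented, TfidfVectorizer stands in for the question's CountVectorizer plus TfidfTransformer pair, and MinMaxScaler is one possible choice for keeping population non-negative, which MultinomialNB requires.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Toy rows, made up for illustration; real data would come from sample_data.txt
df = pd.DataFrame({
    'content': ['great city to visit', 'rainy and crowded',
                'lovely old town', 'too expensive to live'],
    'location': ['London', 'Manchester', 'Edinburgh', 'London'],
    'population': [8900000, 550000, 520000, 8900000],
    'target': ['yes', 'no', 'yes', 'no'],
})
feature_cols = ['content', 'location', 'population']

features = ColumnTransformer([
    # a plain string selects a 1D column, which TfidfVectorizer expects
    ('text', TfidfVectorizer(), 'content'),
    # a list selects a 2D frame, which the encoder and scaler expect
    ('location', OneHotEncoder(handle_unknown='ignore'), ['location']),
    # assumption: scale population into [0, 1] so it stays non-negative
    ('population', MinMaxScaler(), ['population']),
])

model = Pipeline([
    ('features', features),
    ('classifier', MultinomialNB()),
])

model.fit(df[feature_cols], df['target'])

# predicting on the training rows only to show the call, not to evaluate
predictions = model.predict(df[feature_cols])
print(predictions)
```

To keep the custom tokenizer from the question, passing `analyzer=text_process` to the vectorizer should work the same way it does for CountVectorizer.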

Answered By - Achintha Ihalage

This answer was collected from Stack Overflow and tested by PythonFixing community admins. It is licensed under CC BY-SA 2.5, CC BY-SA 3.0, and CC BY-SA 4.0.
