Wednesday, February 2, 2022

[FIXED] POS tags for train and test sets: ValueError

python, scikit-learn

Issue

I am trying to extract POS tags information from the following dataset

      Sentences                                           Characters  Label
803   A Complete Bibliography of Scientific American...   128         1
1373  Mandated MVNO access would 'likely lead to del...   244         0
1257  What is PANS/PANDAS? And Why Are Cases On The ...   212         0
2405  St Laurence School | Care • Inspire • Succ Hea...   124         1
2589  Study reveals: The 50 most Instagrammed island...   212         0

I am applying the following functions:

(after Arya's suggestions)

import nltk
 
tagged_sentences = nltk.corpus.treebank.tagged_sents()
cutoff = int(.75 * len(tagged_sentences))
    
def features(sentence, index):
    return {
        'word': sentence[index],
        'is_first_word': int(index == 0),
        'is_last_word': int(index == len(sentence) - 1),
        'is_capitalized': sentence[index][0].upper() == sentence[index][0],
        'is_all_upper': int(sentence[index].upper() == sentence[index]),
        'is_all_lower': int(sentence[index].lower() == sentence[index]),
        'prev_word': '' if index == 0 else sentence[index - 1],
        'next_word': '' if index == len(sentence) - 1 else sentence[index + 1],
        'prefix-1': sentence[index][0],
        'prefix-2': sentence[index][:2],
        'prefix-3': sentence[index][:3],
        'suffix-1': sentence[index][-1],
        'suffix-2': sentence[index][-2:],
        'suffix-3': sentence[index][-3:],
    }

Following the steps described in this article: https://medium.com/analytics-vidhya/pos-tagging-using-conditional-random-fields-92077e5eaa31, I would like to apply the same approach to my data, still using X and y (for example, X=df[['Sentences','Characters']] and y=df['Label']).

import pandas as pd
from sklearn.model_selection import train_test_split

X = df[['Sentences', 'Characters']]
y = df['Label']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=40)

train_df = pd.concat([X_train, y_train], axis=1)
test_df = pd.concat([X_test, y_test], axis=1)

This step splits the dataset into train and test sets. However, I already have that split, so I do not need to do it again.

def untag(tagged_sentences):
    return [w for w, t in tagged_sentences]


def prepareData(tagged_sentences):
    X,y=[],[]
    for sentence in tagged_sentences.Sentences:
        X.append([features(untag(sentence), index) for index in range(len(sentence))])
        y.append([tag for word,tag in sentence])
    return X,y
    
X_train,y_train=prepareData(train_df)
X_test,y_test=prepareData(test_df)

Running my dataset, I get the error:

----> 8 X_train,y_train=prepareData(train_df)

ValueError: not enough values to unpack (expected 2, got 1)

I hope you can tell me how to fix the ValueError.

I need to assign tags using the Sentences column. My difficulty is applying the approach from the linked article to my own dataset (train and test).

Solution

Okay, I see the problem. Well, the three problems.

Problem 1. `prepareData` variable names

You didn't copy carefully from the tutorial you used. This is how they define prepareData:

def prepareData(tagged_sentences):
    X,y=[],[]
    for sentences in tagged_sentences:
        X.append([features(untag(sentences), index) for index in range(len(sentences))])
        y.append([tag for word,tag in sentences])
    return X,y

(Incidentally, they called a variable sentences instead of sentence, which is a horribly confusing naming convention because it holds a single sentence.)

But when you copied the tutorial, you changed the name tagged_sentences in two places, each to different things!

For the function parameter, you changed it to sentences, a name that is already in use. For the variable inside the function, you changed it to training_sentences. Both names are going to confuse you, because they have other meanings in your code. More importantly, the two must match! Try renaming it to tagged_sentences in both places.
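A rough sketch of the function with consistent names, run on a hand-written tagged sentence so that it does not need the NLTK corpus. The reduced features function, the clearer loop variable name, and the sample data are illustrative assumptions, not the tutorial's code:

```python
def features(sentence, index):
    # Reduced stand-in for the full feature function in the question.
    word = sentence[index]
    return {
        'word': word,
        'is_first_word': int(index == 0),
        'is_last_word': int(index == len(sentence) - 1),
        'suffix-2': word[-2:],
    }


def untag(tagged_sentence):
    # Drop the tags, keeping only the words.
    return [word for word, tag in tagged_sentence]


def prepareData(tagged_sentences):
    X, y = [], []
    # The loop iterates over the same name the parameter uses.
    for tagged_sentence in tagged_sentences:
        words = untag(tagged_sentence)
        X.append([features(words, index) for index in range(len(words))])
        y.append([tag for word, tag in tagged_sentence])
    return X, y


# Hand-written example in the shape of a list of tagged sentences:
# each sentence is a list of (word, tag) pairs.
sample = [[('The', 'DT'), ('cat', 'NN'), ('sleeps', 'VBZ')]]
X, y = prepareData(sample)
print(y)                # [['DT', 'NN', 'VBZ']]
print(X[0][0]['word'])  # The
```

Each entry of X is a list of per-word feature dicts, and the matching entry of y is the list of per-word tags.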

Problem 2. `prepareData` isn't designed for `DataFrame`s.

Once you fix that problem, you'll have a new one. The tutorial isn't using a DataFrame. It's using a list. You will run into problems on the line:

    for sentences in tagged_sentences:

because iterating over a pandas DataFrame yields its column names, not its rows. (See my answer here.)
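A quick way to see this behaviour, assuming pandas is installed. The toy DataFrame below is hypothetical and only mimics the column layout from the question:

```python
import pandas as pd

# Hypothetical data with the same columns as the question's DataFrame.
df = pd.DataFrame({
    'Sentences': ['first example sentence', 'second example sentence'],
    'Characters': [22, 23],
    'Label': [1, 0],
})

# Iterating over the DataFrame itself yields the column names.
print(list(df))               # ['Sentences', 'Characters', 'Label']

# Iterating over a single column (a Series) yields that column's values.
print(list(df['Sentences']))  # ['first example sentence', 'second example sentence']
```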

Instead, what you want is to look at only the sentences.

Change that line to:

    for sentences in tagged_sentences.Sentences:

This way, you get only the Series you care about. (You weren't using the Characters column anyway!)
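That Series holds plain strings, though, not lists of (word, tag) pairs, which appears to be what triggers the ValueError in the question. A small sketch, assuming the Sentences column contains raw text as in the table in the question; the tags shown are illustrative:

```python
def untag(tagged_sentence):
    # Expects an iterable of (word, tag) pairs.
    return [word for word, tag in tagged_sentence]


# A tagged sentence unpacks cleanly into words and tags.
tagged = [('Study', 'NN'), ('reveals', 'VBZ')]
print(untag(tagged))  # ['Study', 'reveals']

# A raw string is iterated character by character, and a single
# character cannot be unpacked into two names.
raw = 'Study reveals'
try:
    untag(raw)
except ValueError as err:
    message = str(err)
    print(message)  # not enough values to unpack (expected 2, got 1)
```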

Problem 3. You're not doing POS tagging! You don't need a CRF!

I'm trying to be clear here. POS tagging means that for every input word in a sentence, you predict a label for that word. (The label is the word's part of speech.)

You aren't doing that. You are creating one label for the entire sentence. (See your Label column.) For that style of problem, you don't need a CRF. There's no chain of outputs. Logistic regression is the equivalent model that does what you need.
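A minimal sketch of that suggestion, assuming scikit-learn and pandas are available and that TF-IDF word features are an acceptable text representation (the answer does not prescribe one). The toy sentences and labels are hypothetical stand-ins for the existing train and test DataFrames:

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical stand-ins for the train/test DataFrames that already exist.
train_df = pd.DataFrame({
    'Sentences': [
        'a bibliography of science magazines',
        'school newsletter and term dates',
        'new rules for mobile network access',
        'study ranks the most photographed islands',
    ],
    'Label': [1, 1, 0, 0],
})
test_df = pd.DataFrame({
    'Sentences': ['bibliography of school magazines'],
    'Label': [1],
})

# TF-IDF turns each sentence into one feature vector;
# logistic regression then predicts one label per sentence.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(train_df['Sentences'], train_df['Label'])

predictions = model.predict(test_df['Sentences'])
print(predictions)
```

Because the train and test sets already exist, no train_test_split call is needed here.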

Answered By - Arya McCarthy

This answer was collected from Stack Overflow and tested by PythonFixing community admins. It is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
