Issue
I am trying to extract POS tag information from the following dataset:
      Sentences                                            Characters  Label
803   A Complete Bibliography of Scientific American...    128         1
1373  Mandated MVNO access would 'likely lead to del...    244         0
1257  What is PANS/PANDAS? And Why Are Cases On The ...    212         0
2405  St Laurence School | Care • Inspire • Succ Hea...    124         1
2589  Study reveals: The 50 most Instagrammed island...    212         0
I am applying the following functions:
(after Arya's suggestions)
import nltk
tagged_sentences = nltk.corpus.treebank.tagged_sents()
cutoff = int(.75 * len(tagged_sentences))
def features(sentence, index):
    return {
        'word': sentence[index],
        'is_first_word': int(index == 0),
        'is_last_word': int(index == len(sentence) - 1),
        'is_capitalized': sentence[index][0].upper() == sentence[index][0],
        'is_all_upper': int(sentence[index].upper() == sentence[index]),
        'is_all_lower': int(sentence[index].lower() == sentence[index]),
        'prev_word': '' if index == 0 else sentence[index - 1],
        'next_word': '' if index == len(sentence) - 1 else sentence[index + 1],
        'prefix-1': sentence[index][0],
        'prefix-2': sentence[index][:2],
        'prefix-3': sentence[index][:3],
        'suffix-1': sentence[index][-1],
        'suffix-2': sentence[index][-2:],
        'suffix-3': sentence[index][-3:],
    }
Following the steps described in this article: https://medium.com/analytics-vidhya/pos-tagging-using-conditional-random-fields-92077e5eaa31, I would like to apply the same approach to my data, still using X and y (for example, X=df[['Sentences','Characters']] and y=df['Label']).
import pandas as pd
from sklearn.model_selection import train_test_split

X = df[['Sentences', 'Characters']]
y = df['Label']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=40)
train_df = pd.concat([X_train, y_train], axis=1)
test_df = pd.concat([X_test, y_test], axis=1)
This step splits the dataset into train and test. However, I already have this information, so I would not need to split the dataset again.
def untag(tagged_sentences):
    return [w for w, t in tagged_sentences]

def prepareData(tagged_sentences):
    X, y = [], []
    for sentence in tagged_sentences.Sentences:
        X.append([features(untag(sentence), index) for index in range(len(sentence))])
        y.append([tag for word, tag in sentence])
    return X, y

X_train, y_train = prepareData(train_df)
X_test, y_test = prepareData(test_df)
Running this on my dataset, I get the error:
----> 8 X_train,y_train=prepareData(train_df)
ValueError: not enough values to unpack (expected 2, got 1)
I hope you can tell me how to fix the ValueError.
I need to assign tags using Sentences. The difficulty is in using my dataset (train and test) to do the same thing that was done in the link I shared.
Solution
Okay, I see the problem. Well, the three problems.
Problem 1. prepareData variable names
You didn't copy carefully from the tutorial you used.
This is how they define prepareData:
def prepareData(tagged_sentences):
    X, y = [], []
    for sentences in tagged_sentences:
        X.append([features(untag(sentences), index) for index in range(len(sentences))])
        y.append([tag for word, tag in sentences])
    return X, y
(Incidentally, they called a variable sentences instead of sentence, which is a horribly confusing naming convention because it holds a single sentence.)
But when you copied the tutorial, you changed the name tagged_sentences in two places, each to different things!
As the name of the function parameter, you changed it to sentences, which is already used. As the name of a variable inside the function, you changed it to training_sentences. Both of these names are going to confuse you, because they have other meanings in your code. But they should match! Try renaming it to tagged_sentences in both places.
Problem 2. prepareData isn't designed for DataFrames.
Once you fix that problem, you'll have a new one. The tutorial isn't using a DataFrame. It's using a list. You will run into problems on the line:
for sentences in tagged_sentences:
because looping over a Pandas DataFrame loops over the column names. (See my answer here.)
Instead, what you want is to look at only the sentences.
Change that line to:
for sentences in tagged_sentences.Sentences:
This way, you get only the Series you care about. (You weren't using the Characters column anyway!)
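To see the difference between the two loops, here is a minimal sketch. The DataFrame below is a made-up stand-in for the one in the question (the sentence text is invented), and the names columns_seen and sentences_seen are only for illustration.

```python
import pandas as pd

# Toy stand-in for the question's DataFrame (values are invented).
df = pd.DataFrame({
    "Sentences": ["A first headline", "Another short headline"],
    "Characters": [16, 22],
    "Label": [1, 0],
})

# Iterating over the DataFrame itself yields the column names.
columns_seen = [item for item in df]

# Iterating over a single column (a Series) yields that column's values.
sentences_seen = [item for item in df.Sentences]

print(columns_seen)
print(sentences_seen)
```

Note that the values coming out of df.Sentences are still plain strings, not tagged sentences.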
Problem 3. You're not doing POS tagging! You don't need a CRF!
I'm trying to be clear here. POS tagging means that for every input word in a sentence, you predict a label for that word. (The label is the word's part of speech.)
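As a rough illustration of what "a label for every word" looks like, the sketch below uses a hand-written tagged sentence (the example words and tags are my own, not taken from your data). It also shows what happens when a raw string is passed where a tagged sentence is expected, which is most likely where your ValueError comes from: iterating over a string yields single characters, and a single character cannot be unpacked into a (word, tag) pair.

```python
# A POS-tagged sentence is a list of (word, tag) pairs: one label per word.
tagged_sentence = [("The", "DT"), ("cat", "NN"), ("sleeps", "VBZ")]

def untag(tagged):
    """Drop the tags and keep only the words."""
    return [word for word, _tag in tagged]

words = untag(tagged_sentence)
tags = [tag for _word, tag in tagged_sentence]

# A raw string is not tagged. Looping over it yields one character at a
# time, and unpacking one character into two names raises ValueError.
raw_sentence = "The cat sleeps"
try:
    untag(raw_sentence)
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(words, tags)
print(error_message)
```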
You aren't doing that. You are creating one label for the entire sentence. (See your Label column.) For that style of problem, you don't need a CRF. There's no chain of outputs. Logistic regression is the equivalent model that does what you need.
Answered By - Arya McCarthy