Issue
I am new to machine learning, and I am working my way through a tutorial on text summarization using Keras.
I have reached the point of vectorizing the data, but I am getting an error that I have not been able to fix myself. I would really like to get this program working, and I was hoping someone could shed some light on why it raises this error and how I can fix it. I did look at previous posts, but none have helped so far. Thanks. Here is my code:
#vectorise data
input_texts = []
target_texts = []
input_characters = set()
target_characters = set()
for story in stories:
    input_text = story['story']
    for highlight in story['highlights']:
        target_text = highlight
    target_text = '\t' + target_text + '\n'
    input_texts.append(input_text)
    target_texts.append(target_text)
    for char in input_text:
        if char not in input_characters:
            input_characters.add(char)
    for char in target_text:
        if char not in target_characters:
            target_characters.add(char)
input_characters = sorted(list(input_characters))
target_characters = sorted(list(target_characters))
num_encoder_tokens = len(input_characters)
num_decoder_tokens = len(target_characters)
max_encoder_seq_length = max([len(txt) for txt in input_texts])
max_decoder_seq_length = max([len(txt) for txt in target_texts])
print('Number of samples:', len(input_texts))
print('Number of unique input tokens:', num_encoder_tokens)
print('Number of unique output tokens:', num_decoder_tokens)
print('Max sequence length for inputs:', max_encoder_seq_length)
print('Max sequence length for outputs:', max_decoder_seq_length)
This is the line that throws the error:
for highlight in story['highlights']:
This is the code that I used to clean and pickle the data:
# imports this excerpt relies on
import re
from pickle import dump
from nltk.corpus import stopwords
#remove all unneeded features and null values
reviews = reviews.dropna()
reviews = reviews.drop(['Id','ProductId','UserId','ProfileName','HelpfulnessNumerator','HelpfulnessDenominator', 'Score','Time'], axis=1)
reviews = reviews.reset_index(drop=True)
print(reviews.head())
for i in range(5):
    print("Review #",i+1)
    print(reviews.Summary[i])
    print(reviews.Text[i])
    print()
#define contractions eg slang words and their correct spellings
contractions = {
"ain't": "am not",
"aren't": "are not",
"can't": "cannot",
"can't've": "cannot have",
"'cause": "because",
"could've": "could have",
"couldn't": "could not",
"couldn't've": "could not have",
"didn't": "did not",
"doesn't": "does not",
"don't": "do not",
"hadn't": "had not",
"hadn't've": "had not have",
"hasn't": "has not",
"haven't": "have not",
"he'd": "he would",
"he'd've": "he would have"}
#clean the text of contractions and stop words
def clean_text(text, remove_stopwords=True):
    text = text.lower()
    if True:
        text = text.split()
        new_text = []
        for word in text:
            if word in contractions:
                new_text.append(contractions[word])
            else:
                new_text.append(word)
        text = " ".join(new_text)
    text = re.sub(r'https?:\/\/.*[\r\n]*', '', text, flags=re.MULTILINE)
    text = re.sub(r'\<a href', ' ', text)
    text = re.sub(r'&', '', text)
    text = re.sub(r'[_"\-;%()|+&=*%.,!?:#$@\[\]/]', ' ', text)
    text = re.sub(r'<br />', ' ', text)
    text = re.sub(r'\'', ' ', text)
    if remove_stopwords:
        text = text.split()
        stops = set(stopwords.words("english"))
        text = [w for w in text if w not in stops]
        text = " ".join(text)
    return text
#clean summaries and texts
clean_summaries = []
for summary in reviews.Summary:
    clean_summaries.append(clean_text(summary, remove_stopwords=False))
print("Summaries are complete.")
clean_texts = []
for text in reviews.Text:
    clean_texts.append(clean_text(text))
print("Texts are complete.")
stories = list()
for i, text in enumerate(clean_texts):
    stories.append({'story': text, 'highlights': clean_summaries[i]})
# save to file
dump(stories, open('data/review_dataset.pkl', 'wb'))
Solution
It seems that for at least one of your story dictionaries the value stored under the key 'highlights' is None (a key that is missing altogether would raise a KeyError instead). If this is only true for certain stories, you can simply check for None before iterating. If it is true for all stories, there is probably a discrepancy between your code and the data you are working with.
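To tell these cases apart, it may help to count what each story actually holds under 'highlights'. The following is only a rough sketch: `summarize_highlights` and the `sample` list are hypothetical names made up for illustration, and you would pass your own unpickled `stories` list instead. Note that the pickling code in the question stores a single string per story, so a count of `str` would mean the inner loop iterates over characters rather than over separate highlights.

```python
from collections import Counter


def summarize_highlights(stories):
    """Count the kind of value each story holds under 'highlights'."""
    kinds = Counter()
    for story in stories:
        if 'highlights' not in story:
            # The key is absent: story['highlights'] would raise KeyError.
            kinds['missing key'] += 1
        else:
            # Record the type name, e.g. 'str', 'list' or 'NoneType'.
            kinds[type(story['highlights']).__name__] += 1
    return kinds


# Hypothetical sample data standing in for the unpickled stories.
sample = [
    {'story': 'great taffy', 'highlights': 'yummy'},
    {'story': 'bad coffee', 'highlights': None},
    {'story': 'no summary here'},
]
print(summarize_highlights(sample))
```

If every story reports the same unexpected kind, the problem is more likely in how the data was saved than in the vectorizing loop.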
Also, there may be an indentation error (it might just be the formatting on SO): I believe the code after target_text = highlight should be indented one more level to the right, so that it runs inside the inner loop.
for story in stories:
    input_text = story['story']
    # check for None to make sure you are not iterating over NoneType
    if story['highlights'] is not None:
        for highlight in story['highlights']:
            target_text = highlight
            # I believe the following code should be indented as well
            target_text = '\t' + target_text + '\n'
            input_texts.append(input_text)
            target_texts.append(target_text)
            ...
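For reference, here is a self-contained variant of the same idea that can be run on its own. It is a sketch under assumptions, not the tutorial's code: `build_pairs` and the `sample` data are hypothetical, and treating a bare string as a single highlight is my assumption, based on the pickling code in the question storing one cleaned summary string per story.

```python
def build_pairs(stories):
    """Collect (input, target) text pairs, skipping stories without highlights."""
    input_texts, target_texts = [], []
    for story in stories:
        # .get returns None when the key is absent, so both a missing key
        # and an explicit None value are skipped by the check below.
        highlights = story.get('highlights')
        if highlights is None:
            continue
        # Assumption: a bare string is one highlight, not a sequence of characters.
        if isinstance(highlights, str):
            highlights = [highlights]
        for highlight in highlights:
            input_texts.append(story['story'])
            # Tab marks the start of the target sequence, newline marks the end.
            target_texts.append('\t' + highlight + '\n')
    return input_texts, target_texts


# Hypothetical sample data for illustration only.
sample = [
    {'story': 'great taffy', 'highlights': 'yummy'},
    {'story': 'bad coffee', 'highlights': None},
    {'story': 'good tea', 'highlights': ['smooth', 'fragrant']},
]
inputs, targets = build_pairs(sample)
print(len(inputs), len(targets))
```

Skipping stories silently is convenient for a first run, but it is worth logging how many were dropped so that a data problem does not go unnoticed.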
Answered By - Chris Graf