Issue
I tried to train a convolutional neural network to predict the labels (categorical data) given the criteria (text). This should have been a simple classification problem. There are 7 labels, hence my network has 7 output neurons with sigmoid activation functions.
I encoded the training data in a txt file using the following simple format, with text descriptors ('criteria') and categorical label variables ('label'):

'criteria'|'label'
Here's a peek at one entry from the data file:
Headache location: Bilateral (intracranial). Facial pain: Nil. Pain quality: Pulsating. Thunderclap onset: Nil. Pain duration: 11. Pain episodes per month: 26. Chronic pain: No. Remission between episodes: Yes. Remission duration: 25. Pain intensity: Moderate (4-7). Aggravating/triggering factors: Innocuous facial stimuli, Bathing and/or showering, Chocolate, Exertion, Cold stimulus, Emotion, Valsalva maneuvers. Relieving factors: Nil. Headaches worse in the mornings and/or night: Nil. Associated symptoms: Nausea and/or vomiting. Reversible symptoms: Nil. Examination findings: Nil. Aura present: Yes. Reversible aura: Motor, Sensory, Brainstem, Visual. Duration of auras: 47. Aura in relation to headache: Aura proceeds headache. History of CNS disorders: Multiple Sclerosis, Angle-closure glaucoma. Past history: Nil. Temporal association: No. Disease worsening headache: Nil. Improved cause: Nil. Pain ipsilateral: Nil. Medication overuse: Nil. Establish drug overuse: Nil. Investigations: Nil.|Migraine with aura
Here's a snippet of the code from the training algorithm:
import pandas as pd
from sklearn.model_selection import train_test_split
from keras.preprocessing.text import Tokenizer
from keras.utils import pad_sequences
from keras.models import Sequential
from keras import layers
from keras.layers import Dense, Dropout
from keras.optimizers import Adam

'''A. IMPORT DATA'''
dataset = pd.read_csv('Data/ICHD3_Database.txt', names=['criteria', 'label'], sep='|')
features = dataset['criteria'].values
labels = dataset['label'].values
labels = labels.reshape(len(labels), 1)  # Reshape target to be a 2d array

'''B. DATA PRE-PROCESSING: WORD EMBEDDINGS'''
def word_embeddings(features):
    maxlen = 200
    features_train, features_test, labels_train, labels_test = train_test_split(features, labels, test_size=0.33, random_state=42)
    tokenizer = Tokenizer(num_words=5000)
    tokenizer.fit_on_texts(features_train)
    features_train = pad_sequences(tokenizer.texts_to_sequences(features_train), padding='post', maxlen=maxlen)
    features_test = pad_sequences(tokenizer.texts_to_sequences(features_test), padding='post', maxlen=maxlen)
    labels_train = pad_sequences(tokenizer.texts_to_sequences(labels_train), padding='post', maxlen=maxlen)
    labels_test = pad_sequences(tokenizer.texts_to_sequences(labels_test), padding='post', maxlen=maxlen)
    vocab_size = len(tokenizer.word_index) + 1  # Adding 1 because of reserved 0 index
    return features_train, features_test, labels_train, labels_test, vocab_size, maxlen

features_train, features_test, labels_train, labels_test, vocab_size, maxlen = word_embeddings(features)  # Pre-process text using word embeddings

'''C. CREATE THE MODEL'''
def design_model(features, hidden_layers=2, number_neurons=128):
    model = Sequential(name="My_Sequential_Model")
    model.add(layers.Embedding(input_dim=vocab_size, output_dim=50, input_length=maxlen))
    model.add(layers.Conv1D(128, 5, activation='relu'))
    model.add(layers.GlobalMaxPool1D())
    for i in range(hidden_layers):
        model.add(Dense(number_neurons, activation='relu'))
        model.add(Dropout(0.2))
    model.add(Dense(7, activation='sigmoid'))
    opt = Adam(learning_rate=0.01)
    model.compile(loss='binary_crossentropy', metrics=['mae'], optimizer=opt)
    return model
I then pipe the model through a GridSearchCV to find the optimal number of epochs, batch size, etc., roughly as in the sketch below.
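(For reference, that grid-search step would be wired up along these lines. This is a sketch only; the scikeras KerasClassifier wrapper and the parameter grid shown here are assumptions, not code from the original post.)

from scikeras.wrappers import KerasClassifier
from sklearn.model_selection import GridSearchCV

# Sketch only: the wrapper and parameter grid are assumptions, not the
# original code. KerasClassifier makes the Keras model look like a
# scikit-learn estimator so GridSearchCV can clone, fit, and score it.
keras_clf = KerasClassifier(model=lambda: design_model(features_train), verbose=0)
param_grid = {'batch_size': [16, 32, 64], 'epochs': [10, 20, 30]}
grid = GridSearchCV(estimator=keras_clf, param_grid=param_grid, cv=3)
grid_result = grid.fit(features_train, labels_train)
print(grid_result.best_params_)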
However, before it even gets to the GridSearchCV, when I run it, I get the following error:
Traceback (most recent call last):
File "c:\Users\user\Desktop\Deep Learning\deep_learning_headache.py", line 51, in <module>
features_train, features_test, labels_train, labels_test, vocab_size, maxlen = word_embeddings(features) # Pre-process text using word embeddings
File "c:\Users\user\Desktop\Deep Learning\deep_learning_headache.py", line 45, in word_embeddings
labels_train = pad_sequences(tokenizer.texts_to_sequences(labels_train), padding='post', maxlen=maxlen)
File "C:\Users\user\AppData\Local\Programs\Python\Python39\lib\site-packages\keras\src\preprocessing\text.py", line 357, in texts_to_sequences
return list(self.texts_to_sequences_generator(texts))
File "C:\Users\user\AppData\Local\Programs\Python\Python39\lib\site-packages\keras\src\preprocessing\text.py", line 386, in texts_to_sequences_generator
seq = text_to_word_sequence(
File "C:\Users\user\AppData\Local\Programs\Python\Python39\lib\site-packages\keras\src\preprocessing\text.py", line 74, in text_to_word_sequence
input_text = input_text.lower()
AttributeError: 'numpy.ndarray' object has no attribute 'lower'
Where am I going wrong?
Solution
Based on the exception, the tokenizer is expecting a string, not a numpy ndarray:
AttributeError: 'numpy.ndarray' object has no attribute 'lower'
You can use the call stack to find the line of your own code where the wrong type is likely being passed in:
File "c:\Users\user\Desktop\Deep Learning\deep_learning_headache.py", line 45, in word_embeddings labels_train = pad_sequences(tokenizer.texts_to_sequences(labels_train), padding='post', maxlen=maxlen)
I'd take a look at the tokenizer.texts_to_sequences documentation and examine what type of data is in labels_train. Because labels was reshaped into a 2-d array earlier (labels.reshape(len(labels), 1)), each element of labels_train is a one-element ndarray rather than a string, which is why the .lower() call inside the tokenizer fails. More fundamentally, the labels are categories rather than free text, so they probably shouldn't be going through the tokenizer and pad_sequences at all.
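Here's a minimal sketch of that check, plus one common way to encode string labels for a 7-class network (the LabelEncoder/to_categorical approach is a suggestion, not something from the original post):

from sklearn.preprocessing import LabelEncoder
from keras.utils import to_categorical

# Inspect what texts_to_sequences is actually receiving:
print(type(labels_train[0]))  # <class 'numpy.ndarray'>, because labels was reshaped to 2-d
print(labels_train[0])        # a one-element array like ['Migraine with aura'], not a string

# The labels are categories, not text, so encode them directly instead of
# tokenizing and padding them. LabelEncoder + to_categorical is one common
# approach (a suggestion, not from the original post):
encoder = LabelEncoder()
labels_train = to_categorical(encoder.fit_transform(labels_train.ravel()), num_classes=7)
labels_test = to_categorical(encoder.transform(labels_test.ravel()), num_classes=7)
print(labels_train.shape)  # (n_samples, 7), matching the 7 output neurons

With the labels one-hot encoded this way, the tokenizer only ever sees the criteria text, and the 7-column targets line up with the model's 7 output neurons.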
Answered By - Some Body Else