Issue
I have a dummy dataset with two text columns and a label column, created as below.
import tensorflow as tf
from transformers import BertTokenizer, TFAutoModelForSequenceClassification
import numpy as np
from datasets import Dataset, DatasetDict
# Create a synthetic dataset with two text columns and a label column (0 or 1)
data_size = 1000
text_column1 = ["This is sentence {}.".format(i) for i in range(data_size)]
text_column2 = ["Another sentence {} for tokenization.".format(i) for i in range(data_size)]
labels = np.random.choice([0, 1], size=data_size)
I am using the Hugging Face BERT model for classification (TFAutoModelForSequenceClassification).
# Load BERT tokenizer and model
tokenizer = BertTokenizer.from_pretrained('bert-base-cased')
model2 = TFAutoModelForSequenceClassification.from_pretrained("bert-base-cased")
When I use the code below to prepare the dataset and train the model, execution succeeds.
def tokenize_dataset(df):
    # Keys of the returned dictionary will be added to the dataset as columns
    return tokenizer(df['text_column1'], df['text_column2'])
# Convert to a DataFrame
import pandas as pd
df = pd.DataFrame({'text_column1': text_column1, 'text_column2': text_column2, 'label': labels})
df = Dataset.from_pandas(df).map(tokenize_dataset)
tf_train = model2.prepare_tf_dataset(df, batch_size=4, shuffle=True, tokenizer=tokenizer)
from tensorflow.keras.optimizers import Adam
model2.compile(optimizer=Adam(3e-5))  # No loss argument!
model2.fit(tf_train)
The above code works successfully.
However, when I use padding, truncation, max_length, and return_tensors="tf" in the tokenizer call, as below,
def tokenize_dataset(df):
    # Keys of the returned dictionary will be added to the dataset as columns
    return tokenizer(df['text_column1'], df['text_column2'], padding=True, truncation=True, max_length=30, return_tensors="tf")
# Convert to a DataFrame
import pandas as pd
df = pd.DataFrame({'text_column1': text_column1, 'text_column2': text_column2, 'label': labels})
df = Dataset.from_pandas(df).map(tokenize_dataset)
tf_train = model2.prepare_tf_dataset(df, batch_size=4, shuffle=True, tokenizer=tokenizer)
model2.compile(optimizer=Adam(3e-5)) # No loss argument!
model2.fit(tf_train)
This code gave the following error:
ValueError: Exception encountered when calling layer 'bert' (type TFBertMainLayer).
in user code:
File "/usr/local/lib/python3.10/dist-packages/transformers/modeling_tf_utils.py", line 1557, in run_call_with_unpacked_inputs *
return func(self, **unpacked_inputs)
File "/usr/local/lib/python3.10/dist-packages/transformers/models/bert/modeling_tf_bert.py", line 766, in call *
batch_size, seq_length = input_shape
ValueError: too many values to unpack (expected 2)
I do not understand why this happens. Why does it occur, and how do I resolve the error?
Solution
The prepare_tf_dataset function does not require the dataset columns to hold tensor values. When the tokenizer is called with return_tensors="tf" inside map, each example is stored as a batch of size 1, so once prepare_tf_dataset batches those rows again the model receives 3-D input_ids, and the unpacking batch_size, seq_length = input_shape fails with "too many values to unpack". Removing return_tensors="tf" should solve the problem.
def tokenize_dataset(df):
    # Keys of the returned dictionary will be added to the dataset as columns
    return tokenizer(
        df['text_column1'],
        df['text_column2'],
        padding=True,
        truncation=True,
        max_length=30)
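To see where the extra dimension comes from, here is a small illustrative check using the tokenizer from the question (the exact sequence length printed depends on the sentences):
import numpy as np

# With return_tensors="tf", the tokenizer wraps even a single example in a batch of size 1,
# so each row stored by map() has shape (1, seq_len); batching then adds another dimension on top.
enc = tokenizer("This is sentence 0.", "Another sentence 0 for tokenization.",
                padding=True, truncation=True, max_length=30, return_tensors="tf")
print(enc["input_ids"].shape)      # (1, seq_len) - extra leading batch dimension

# Without return_tensors, the tokenizer returns plain Python lists, which
# prepare_tf_dataset can batch and pad itself using the tokenizer you pass to it.
enc = tokenizer("This is sentence 0.", "Another sentence 0 for tokenization.",
                truncation=True, max_length=30)
print(np.shape(enc["input_ids"]))  # (seq_len,) - a flat list of token ids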
Answered By - druskacik