Issue
I'm working with a dataset containing approximately 920,614 rows and multiple columns, including "orig_item_title," "sub_item_title," "is_brand_same," and "is_flavor_same." The goal is to build a model that predicts the similarity or relevance of items, specifically whether a substitute item is similar to the original item. I'm implementing a learning-to-rank (LTR) framework, incorporating features such as brand and flavor matching along with embeddings from the BERT encoder for text columns.
Here's the code snippet for preprocessing the features and creating a TensorFlow dataset:
from transformers import BertTokenizer, TFBertModel
import tensorflow as tf

# Load a pre-trained BERT model and tokenizer
model_name = "bert-base-uncased"
tokenizer = BertTokenizer.from_pretrained(model_name)
bert_model = TFBertModel.from_pretrained(model_name)  # Embedding size is 768

# Define a function for BERT encoding
def bert_encoder(text_column):
    input_ids = tokenizer(text_column, return_tensors="tf", truncation=True, padding=True)["input_ids"]
    outputs = bert_model(input_ids)
    pooled_output = outputs.pooler_output
    return pooled_output
def preprocess_features(df):
    # Extract features and labels
    text_columns = ["orig_item_title", "sub_item_title"]
    numerical_columns = ["is_brand_same", "is_flavor_same"]
    label_column = "acc_rate"

    features = {
        "orig_item_title": df["orig_item_title"],
        "sub_item_title": df["sub_item_title"],
        "is_brand_same": df["is_brand_same"],
        "is_flavor_same": df["is_flavor_same"],
    }

    for col in text_columns:
        features[col] = bert_encoder(df[col].tolist())

    # Numerical columns
    numerical_features = [tf.feature_column.numeric_column(col) for col in numerical_columns]
    features.update({col: df[col] for col in numerical_columns})

    return features, df[label_column]
dataset = tf.data.Dataset.from_tensor_slices(preprocess_features(df))
However, I'm encountering an Out-of-Memory (OOM) error during preprocessing because it tries to materialize a tensor of shape [920614, 55, 768]. I'm looking for advice on reducing the embedding dimension, possibly to 256 or 128, and on alternative approaches that let the preprocessing succeed without exhausting memory. Any suggestions or code guidance would be really helpful.
Also, could someone help with coding the model itself: integrating the BERT embeddings for the text features, concatenating them with the numerical features, and adding a neural network layer with a sigmoid prediction in the output layer?
Thank you.
Solution
What I would advise is to implement batch processing, as shown in the code below. Instead of loading the entire dataset into memory, load and process the data in chunks: a float32 tensor of shape [920614, 55, 768] alone takes roughly 920,614 × 55 × 768 × 4 bytes ≈ 155 GB, which is why encoding everything at once exhausts memory. You can use TensorFlow's tf.data.Dataset API for this, which is designed to handle large datasets efficiently.
from transformers import BertTokenizer, TFBertModel
import tensorflow as tf

model_name = "bert-base-uncased"
tokenizer = BertTokenizer.from_pretrained(model_name)
bert_model = TFBertModel.from_pretrained(model_name)

# Define a function for BERT encoding with dimensionality reduction
def bert_encoder(text_column, batch_size=32, embed_dim=256):
    # Dataset for efficient batch processing
    dataset = tf.data.Dataset.from_tensor_slices(text_column).batch(batch_size)

    # Create the projection layer once, outside the loop, so every batch is
    # reduced with the same weights and the embeddings stay comparable
    dense_layer = tf.keras.layers.Dense(embed_dim, activation="relu")

    embeddings = []
    for batch in dataset:
        # Decode the byte strings produced by the string tensor batch
        texts = [t.decode("utf-8") for t in batch.numpy()]
        input_ids = tokenizer(texts, return_tensors="tf", padding=True, truncation=True)["input_ids"]
        outputs = bert_model(input_ids, training=False)
        pooled_output = outputs.pooler_output  # shape: (batch_size, 768)
        # Dimensionality reduction from 768 to embed_dim
        reduced_output = dense_layer(pooled_output)
        embeddings.append(reduced_output)
    return tf.concat(embeddings, axis=0)
def preprocess_features(df, batch_size=32, embed_dim=256):
    # Process text columns in batches; each call returns a (num_rows, embed_dim)
    # tensor, so keep the results as tensors instead of DataFrame columns
    orig_title_emb = bert_encoder(df["orig_item_title"].tolist(), batch_size, embed_dim)
    sub_title_emb = bert_encoder(df["sub_item_title"].tolist(), batch_size, embed_dim)

    # Combine all features (numerical flags reshaped to (num_rows, 1))
    features = {
        "orig_item_title_emb": orig_title_emb,
        "sub_item_title_emb": sub_title_emb,
        "is_brand_same": df["is_brand_same"].values.reshape(-1, 1).astype("float32"),
        "is_flavor_same": df["is_flavor_same"].values.reshape(-1, 1).astype("float32"),
    }
    return features, df["acc_rate"].values.astype("float32")
# Assuming df is your DataFrame
dataset = tf.data.Dataset.from_tensor_slices(preprocess_features(df)).batch(some_batch_size)
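For the second part of your question, here is a minimal sketch of how the downstream scoring model could look once the embeddings are precomputed: it takes the two reduced title embeddings plus the two numerical match flags, concatenates them, passes the result through a hidden dense layer, and outputs a sigmoid similarity score. The input names, layer sizes, and dropout rate are assumptions chosen to match the feature dict above, so adjust them to your actual pipeline.

import tensorflow as tf

# Sketch only: input names and sizes are assumptions matching preprocess_features above
embed_dim = 256

# One input per feature produced by preprocess_features
orig_emb_in = tf.keras.Input(shape=(embed_dim,), name="orig_item_title_emb")
sub_emb_in = tf.keras.Input(shape=(embed_dim,), name="sub_item_title_emb")
brand_in = tf.keras.Input(shape=(1,), name="is_brand_same")
flavor_in = tf.keras.Input(shape=(1,), name="is_flavor_same")

# Concatenate text embeddings with the numerical match flags
x = tf.keras.layers.Concatenate()([orig_emb_in, sub_emb_in, brand_in, flavor_in])
x = tf.keras.layers.Dense(128, activation="relu")(x)
x = tf.keras.layers.Dropout(0.2)(x)
# Sigmoid output: predicted probability / rate that the substitute is acceptable
output = tf.keras.layers.Dense(1, activation="sigmoid", name="similarity")(x)

model = tf.keras.Model(
    inputs=[orig_emb_in, sub_emb_in, brand_in, flavor_in],
    outputs=output,
)
model.compile(
    optimizer="adam",
    loss="binary_crossentropy",  # also works for a continuous target in [0, 1] like acc_rate
    metrics=[tf.keras.metrics.AUC()],
)

# Train on the dataset built above (batched features dict + labels)
# model.fit(dataset, epochs=3)

Note that with this split, the Dense reduction layer inside bert_encoder is randomly initialized and never trained; if you want the 768-to-256 projection to be learned jointly with the classifier, move that Dense layer into this model and feed it the raw 768-dimensional pooled outputs instead.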
Answered By - Adesoji Alu