Issue
I'm working with a dataset containing approximately 920,614 rows and multiple columns, including "orig_item_title," "sub_item_title," "is_brand_same," and "is_flavor_same." The goal is to build a model that predicts the similarity or relevance of items, specifically whether a substitute item is similar to the original item. I'm implementing a learning-to-rank (LTR) framework, incorporating features such as brand and flavor matching along with embeddings from the BERT encoder for text columns.
Here's the code snippet for preprocessing the features and creating a TensorFlow dataset:
from transformers import BertTokenizer, TFBertModel
import tensorflow as tf

# Load a pre-trained BERT model and tokenizer
model_name = "bert-base-uncased"
tokenizer = BertTokenizer.from_pretrained(model_name)
bert_model = TFBertModel.from_pretrained(model_name)  # Embedding size is 768

# Define a function for BERT encoding
def bert_encoder(text_column):
    input_ids = tokenizer(text_column, return_tensors="tf", truncation=True, padding=True)["input_ids"]
    outputs = bert_model(input_ids)
    pooled_output = outputs.pooler_output
    return pooled_output
def preprocess_features(df):
    # Extract features and labels
    text_columns = ["orig_item_title", "sub_item_title"]
    numerical_columns = ["is_brand_same", "is_flavor_same"]
    label_column = "acc_rate"

    features = {
        "orig_item_title": df["orig_item_title"],
        "sub_item_title": df["sub_item_title"],
        "is_brand_same": df["is_brand_same"],
        "is_flavor_same": df["is_flavor_same"],
    }

    for col in text_columns:
        features[col] = bert_encoder(df[col].tolist())

    # Numerical columns
    numerical_features = [tf.feature_column.numeric_column(col) for col in numerical_columns]
    features.update({col: df[col] for col in numerical_columns})

    return features, df[label_column]
dataset = tf.data.Dataset.from_tensor_slices(preprocess_features(df))
However, I'm encountering an Out-of-Memory (OOM) error during preprocessing because it tries to materialize a tensor of shape [920614, 55, 768]. I'm looking for advice on reducing the embedding dimension, possibly to 256 or 128, and on alternative approaches that let the preprocessing succeed without exhausting memory. Any suggestions or code guidance would be really helpful.
Also, could someone help with coding the model itself: integrating the BERT embeddings for the text features, concatenating them with the numerical features, and adding a neural network layer with a sigmoid prediction in the output layer?
Thank you.
Solution
What I would advise is to implement batch processing, as shown in the code below. Instead of loading the entire dataset into memory, load and process the data in chunks: a float32 tensor of shape [920614, 55, 768] alone takes roughly 920,614 × 55 × 768 × 4 bytes ≈ 155 GB, which is why encoding everything at once exhausts memory. You can use TensorFlow's tf.data.Dataset API for this, which is designed to handle large datasets efficiently.
from transformers import BertTokenizer, TFBertModel
import tensorflow as tf

model_name = "bert-base-uncased"
tokenizer = BertTokenizer.from_pretrained(model_name)
bert_model = TFBertModel.from_pretrained(model_name)

# Define a function for BERT encoding with dimensionality reduction
def bert_encoder(text_column, batch_size=32, embed_dim=256):
    # Dataset for efficient batch processing
    dataset = tf.data.Dataset.from_tensor_slices(text_column).batch(batch_size)

    # Create the projection layer once, outside the loop, so every batch is
    # reduced with the same weights and the embeddings stay comparable
    dense_layer = tf.keras.layers.Dense(embed_dim, activation="relu")

    embeddings = []
    for batch in dataset:
        # Decode the byte strings produced by the string tensor batch
        texts = [t.decode("utf-8") for t in batch.numpy()]
        input_ids = tokenizer(texts, return_tensors="tf", padding=True, truncation=True)["input_ids"]
        outputs = bert_model(input_ids, training=False)
        pooled_output = outputs.pooler_output  # shape: (batch_size, 768)
        # Dimensionality reduction from 768 to embed_dim
        reduced_output = dense_layer(pooled_output)
        embeddings.append(reduced_output)
    return tf.concat(embeddings, axis=0)
def preprocess_features(df, batch_size=32, embed_dim=256):
    # Process text columns in batches; each call returns a (num_rows, embed_dim)
    # tensor, so keep the results as tensors instead of DataFrame columns
    orig_title_emb = bert_encoder(df["orig_item_title"].tolist(), batch_size, embed_dim)
    sub_title_emb = bert_encoder(df["sub_item_title"].tolist(), batch_size, embed_dim)

    # Combine all features (numerical flags reshaped to (num_rows, 1))
    features = {
        "orig_item_title_emb": orig_title_emb,
        "sub_item_title_emb": sub_title_emb,
        "is_brand_same": df["is_brand_same"].values.reshape(-1, 1).astype("float32"),
        "is_flavor_same": df["is_flavor_same"].values.reshape(-1, 1).astype("float32"),
    }
    return features, df["acc_rate"].values.astype("float32")
# Assuming df is your DataFrame
dataset = tf.data.Dataset.from_tensor_slices(preprocess_features(df)).batch(some_batch_size)
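For the second part of your question, here is a minimal sketch of how the downstream scoring model could look once the embeddings are precomputed: it takes the two reduced title embeddings plus the two numerical match flags, concatenates them, passes the result through a hidden dense layer, and outputs a sigmoid similarity score. The input names, layer sizes, and dropout rate are assumptions chosen to match the feature dict above, so adjust them to your actual pipeline.

import tensorflow as tf

# Sketch only: input names and sizes are assumptions matching preprocess_features above
embed_dim = 256

# One input per feature produced by preprocess_features
orig_emb_in = tf.keras.Input(shape=(embed_dim,), name="orig_item_title_emb")
sub_emb_in = tf.keras.Input(shape=(embed_dim,), name="sub_item_title_emb")
brand_in = tf.keras.Input(shape=(1,), name="is_brand_same")
flavor_in = tf.keras.Input(shape=(1,), name="is_flavor_same")

# Concatenate text embeddings with the numerical match flags
x = tf.keras.layers.Concatenate()([orig_emb_in, sub_emb_in, brand_in, flavor_in])
x = tf.keras.layers.Dense(128, activation="relu")(x)
x = tf.keras.layers.Dropout(0.2)(x)
# Sigmoid output: predicted probability / rate that the substitute is acceptable
output = tf.keras.layers.Dense(1, activation="sigmoid", name="similarity")(x)

model = tf.keras.Model(
    inputs=[orig_emb_in, sub_emb_in, brand_in, flavor_in],
    outputs=output,
)
model.compile(
    optimizer="adam",
    loss="binary_crossentropy",  # also works for a continuous target in [0, 1] like acc_rate
    metrics=[tf.keras.metrics.AUC()],
)

# Train on the dataset built above (batched features dict + labels)
# model.fit(dataset, epochs=3)

Note that with this split, the Dense reduction layer inside bert_encoder is randomly initialized and never trained; if you want the 768-to-256 projection to be learned jointly with the classifier, move that Dense layer into this model and feed it the raw 768-dimensional pooled outputs instead.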
Answered By - Adesoji Alu