Issue
Trying to tokenize and encode data to feed to a neural network.
I only have 25GB RAM and everytime I try to run the code below my google colab crashes. Any idea how to prevent his from happening? “Your session crashed after using all available RAM”
I thought tokenize/encoding chunks of 50000 sentences would work but unfortunately not. The code works on a dataset with length 1.3 million. The current dataset has a length of 5 million.
max_q_len = 128
max_a_len = 64
trainq_list = train_q.tolist()
batch_size = 50000
def batch_encode(text, max_seq_len):
for i in range(0, len(trainq_list), batch_size):
encoded_sent = tokenizer.batch_encode_plus(
text,
max_length = max_seq_len,
pad_to_max_length=True,
truncation=True,
return_token_type_ids=False
)
return encoded_sent
# tokenize and encode sequences in the training set
tokensq_train = batch_encode(trainq_list, max_q_len)
The tokenizer comes from HuggingFace:
tokenizer = BertTokenizerFast.from_pretrained('bert-base-multilingual-uncased')
Solution
You should use generators and pass data to tokenizer.batch_encode_plus
, no matter the size.
Conceptually, something like this:
Training list
This one probably holds list of sentences, which is read from some file(s). If this is a single large file, you could follow this answer to lazily read parts of the input (preferably of batch_size
lines at once):
def read_in_chunks(file_object, chunk_size=1024):
"""Lazy function (generator) to read a file piece by piece.
Default chunk size: 1k."""
while True:
data = file_object.read(chunk_size)
if not data:
break
yield data
Otherwise open a single file (much smaller than memory, because it will be way larger after encoding using BERT), something like this:
import pathlib
def read_in_chunks(directory: pathlib.Path):
# Use "*.txt" or any other extension your file might have
for file in directory.glob("*"):
with open(file, "r") as f:
yield f.readlines()
Encoding
Encoder should take this generator and yield
back encoded parts, something like this:
# Generator should create lists useful for encoding
def batch_encode(generator, max_seq_len):
tokenizer = BertTokenizerFast.from_pretrained("bert-base-multilingual-uncased")
for text in generator:
yield tokenizer.batch_encode_plus(
text,
max_length=max_seq_len,
pad_to_max_length=True,
truncation=True,
return_token_type_ids=False,
)
Saving encoded files
As the files will be too large to fit in RAM memory, you should save them to disk (or use somehow as they are generated).
Something along those lines:
import numpy as np
# I assume np.arrays are created, adjust to PyTorch Tensors or anything if needed
def save(encoding_generator):
for i, encoded in enumerate(encoding_generator):
np.save(str(i), encoded)
Answered By - Szymon Maszke
0 comments:
Post a Comment
Note: Only a member of this blog may post a comment.