Issue
Currently I'm trying to implement lazy loading with allennlp, but can't. My code is as follows.
def biencoder_training():
    params = BiEncoderExperiemntParams()
    config = params.opts
    reader = SmallJaWikiReader(config=config)

    # Loading Datasets
    train, dev, test = reader.read('train'), reader.read('dev'), reader.read('test')
    vocab = build_vocab(train)
    vocab.extend_from_instances(dev)

    # TODO: avoid memory consumption and lazy loading
    train, dev, test = list(reader.read('train')), list(reader.read('dev')), list(reader.read('test'))

    train_loader, dev_loader, test_loader = build_data_loaders(config, train, dev, test)
    train_loader.index_with(vocab)
    dev_loader.index_with(vocab)

    embedder = emb_returner()
    mention_encoder, entity_encoder = Pooler_for_mention(word_embedder=embedder), \
                                      Pooler_for_cano_and_def(word_embedder=embedder)
    model = Biencoder(mention_encoder, entity_encoder, vocab)

    trainer = build_trainer(lr=config.lr,
                            num_epochs=config.num_epochs,
                            model=model,
                            train_loader=train_loader,
                            dev_loader=dev_loader)
    trainer.train()
    return model
When I comment out train, dev, test = list(reader.read('train')), list(reader.read('dev')), list(reader.read('test')), the iterator doesn't work and training is conducted with 0 samples:
Building the vocabulary
100it [00:00, 442.15it/s]
building vocab: 100it [00:01, 95.84it/s]
100it [00:00, 413.40it/s]
100it [00:00, 138.38it/s]
You provided a validation dataset but patience was set to None, meaning that early stopping is disabled
0it [00:00, ?it/s]
0it [00:00, ?it/s]
I'd like to know if there is any solution to avoid this. Thanks.
Supplement, added on May 5th.
Currently I am trying to avoid loading all of the sample data into memory before training the model.
So I have implemented the _read method as a generator. My understanding is that by calling this method and wrapping its output with SimpleDataLoader, I can pass the data to the model.
In the DatasetReader, the code for the _read method looks like this. It is my understanding that this is intended to be a generator that avoids memory consumption.
@overrides
def _read(self, train_dev_test_flag: str) -> Iterator[Instance]:
    '''
    :param train_dev_test_flag: 'train', 'dev', 'test'
    :return: iterator of instances
    '''
    if train_dev_test_flag == 'train':
        dataset = self._train_loader()
        random.shuffle(dataset)
    elif train_dev_test_flag == 'dev':
        dataset = self._dev_loader()
    elif train_dev_test_flag == 'test':
        dataset = self._test_loader()
    else:
        raise NotImplementedError(
            "{} is not a valid flag. Choose from train, dev and test".format(train_dev_test_flag))

    if self.config.debug:
        dataset = dataset[:self.config.debug_data_num]

    for data in tqdm(enumerate(dataset)):
        data = self._one_line_parser(data=data, train_dev_test_flag=train_dev_test_flag)
        yield self.text_to_instance(data)
Also, build_data_loaders actually looks like this.
def build_data_loaders(config,
                       train_data: List[Instance],
                       dev_data: List[Instance],
                       test_data: List[Instance]) -> Tuple[DataLoader, DataLoader, DataLoader]:
    train_loader = SimpleDataLoader(train_data, config.batch_size_for_train, shuffle=False)
    dev_loader = SimpleDataLoader(dev_data, config.batch_size_for_eval, shuffle=False)
    test_loader = SimpleDataLoader(test_data, config.batch_size_for_eval, shuffle=False)
    return train_loader, dev_loader, test_loader
But, for some reason I don't understand, this code doesn't work.
def biencoder_training():
    params = BiEncoderExperiemntParams()
    config = params.opts
    reader = SmallJaWikiReader(config=config)

    # Loading Datasets
    train, dev, test = reader.read('train'), reader.read('dev'), reader.read('test')
    vocab = build_vocab(train)
    vocab.extend_from_instances(dev)

    train_loader, dev_loader, test_loader = build_data_loaders(config, train, dev, test)
    train_loader.index_with(vocab)
    dev_loader.index_with(vocab)

    embedder = emb_returner()
    mention_encoder, entity_encoder = Pooler_for_mention(word_embedder=embedder), \
                                      Pooler_for_cano_and_def(word_embedder=embedder)
    model = Biencoder(mention_encoder, entity_encoder, vocab)

    trainer = build_trainer(lr=config.lr,
                            num_epochs=config.num_epochs,
                            model=model,
                            train_loader=train_loader,
                            dev_loader=dev_loader)
    trainer.train()
    return model
In this code, the SimpleDataLoader wraps the generator as it is. I would like to do the kind of lazy loading that allennlp supported in version 0.9.
But this code iterates training over 0 instances, so for now I have added
train, dev, test = list(reader.read('train')), list(reader.read('dev')), list(reader.read('test'))
before
train_loader, dev_loader, test_loader = build_data_loaders(config, train, dev, test)
and it works. But this means that I can't train or evaluate the model until all the instances are in memory. Instead, I want each batch to be loaded into memory only when it is time to train on it.
Solution
The SimpleDataLoader is not capable of lazy loading. You should use the MultiProcessDataLoader instead. Setting max_instances_in_memory to a non-zero integer (usually some multiple of your batch size) will trigger lazy loading.
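For example, here is a minimal sketch of how the build_data_loaders helper from the question might be rewritten around MultiProcessDataLoader. It assumes the AllenNLP 2.x API, hands the reader itself to the loaders instead of pre-read instance lists, and reuses the config fields from the question; the buffer sizes passed to max_instances_in_memory are only illustrative.

from typing import Tuple
from allennlp.data import Vocabulary
from allennlp.data.data_loaders import DataLoader, MultiProcessDataLoader

def build_data_loaders(config, reader) -> Tuple[DataLoader, DataLoader, DataLoader]:
    # Each loader pulls instances from the reader on demand and keeps at most
    # max_instances_in_memory of them buffered, instead of materializing full lists.
    train_loader = MultiProcessDataLoader(reader, data_path='train',
                                          batch_size=config.batch_size_for_train,
                                          shuffle=True,
                                          max_instances_in_memory=config.batch_size_for_train * 10)
    dev_loader = MultiProcessDataLoader(reader, data_path='dev',
                                        batch_size=config.batch_size_for_eval,
                                        shuffle=False,
                                        max_instances_in_memory=config.batch_size_for_eval * 10)
    test_loader = MultiProcessDataLoader(reader, data_path='test',
                                         batch_size=config.batch_size_for_eval,
                                         shuffle=False,
                                         max_instances_in_memory=config.batch_size_for_eval * 10)
    return train_loader, dev_loader, test_loader

# The vocabulary can then be built by streaming instances instead of list()-ing them, e.g.:
# vocab = Vocabulary.from_instances(train_loader.iter_instances())
# train_loader.index_with(vocab)
# dev_loader.index_with(vocab)

Note that when max_instances_in_memory is set, shuffling is, as far as I understand, done within the buffered instances rather than over the whole dataset, which is usually an acceptable trade-off when streaming data.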
Answered By - petew