Issue
I am running a training job using the SageMaker API. The code for configuring the estimator looks as follows (I shortened the full path names a bit):
from sagemaker import get_execution_role
from sagemaker.inputs import TrainingInput
from sagemaker.pytorch import PyTorch

# S3 locations (bucket names masked)
s3_input = "s3://sagemaker-studio-****/training-inputs".format(bucket)
s3_images = "s3://sagemaker-studio-****/dataset"
s3_labels = "s3://sagemaker-studio-****/labels"
s3_output = 's3://sagemaker-studio-****/output'.format(bucket)
cfg = '{}/input/models/'.format(s3_input)
weights = '{}/input/data/weights/'.format(s3_input)
outpath = '{}/'.format(s3_output)
images = '{}/'.format(s3_images)
labels = '{}/'.format(s3_labels)
hyperparameters = {
    "epochs": 1,
    "batch-size": 2
}
# One input channel per S3 prefix
inputs = {
    "cfg": TrainingInput(cfg),
    "images": TrainingInput(images),
    "weights": TrainingInput(weights),
    "labels": TrainingInput(labels)
}
estimator = PyTorch(
    entry_point='train.py',
    source_dir='s3://sagemaker-studio-****/input/input.tar.gz',
    image_uri=container,
    role=get_execution_role(),
    instance_count=1,
    instance_type='ml.g4dn.xlarge',
    input_mode='File',
    output_path=outpath,
    train_output=outpath,
    base_job_name='visualsearch',
    hyperparameters=hyperparameters,
    framework_version='1.9',
    py_version='py38'
)
estimator.fit(inputs)
Everything runs fine and I get the success message:
Results saved to runs/train/exp
2022-07-08 08:38:35,766 sagemaker-training-toolkit INFO Waiting for the process to finish and give a return code.
2022-07-08 08:38:35,766 sagemaker-training-toolkit INFO Done waiting for a return code. Received 0 from exiting process.
2022-07-08 08:38:35,767 sagemaker-training-toolkit INFO Reporting training SUCCESS
2022-07-08 08:39:08 Uploading - Uploading generated training model
2022-07-08 08:39:08 Completed - Training job completed
ProfilerReport-1657268881: IssuesFound
Training seconds: 558
Billable seconds: 558
CPU times: user 1.34 s, sys: 146 ms, total: 1.48 s
Wall time: 11min 20s
When I call estimator.model_data, I get a path pointing to a model.tar.gz file: s3://sagemaker-studio-****/output/.../model.tar.gz
SageMaker generated subfolders in the output folder (which in turn contain a lot of JSON files and other artifacts).
But the file model.tar.gz is missing; it is nowhere to be found. Is there anything I need to change or add in order to obtain my model?
Any help is much appreciated.
Solution
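You need to make sure your training script stores its model output in the right location inside the training container. SageMaker uploads everything that is stored in the directory named by the SM_MODEL_DIR environment variable. Your log shows the results being saved to runs/train/exp, a relative path that is presumably outside that directory, so nothing gets packaged. You can read the location from the environment of the training job:
model_dir = os.environ.get("SM_MODEL_DIR")
Normally it is set to /opt/ml/model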
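Since the log in the question reports the results under runs/train/exp, one way to apply this is to copy that folder into the model directory at the end of train.py. The following is only a minimal sketch under that assumption: the helper name export_to_model_dir is hypothetical (it is not part of the SageMaker SDK), and the fallback path is simply the usual default mentioned above.

```python
import os
import shutil
from pathlib import Path


def export_to_model_dir(results_dir, model_dir=None):
    """Copy everything under results_dir into the SageMaker model directory.

    Hypothetical helper for illustration; not part of the SageMaker SDK.
    """
    # Prefer an explicit argument, then SM_MODEL_DIR from the environment,
    # then the usual default location inside the training container.
    target = Path(model_dir or os.environ.get("SM_MODEL_DIR", "/opt/ml/model"))
    target.mkdir(parents=True, exist_ok=True)
    # dirs_exist_ok needs Python 3.8+, which matches py_version='py38' above.
    shutil.copytree(results_dir, target, dirs_exist_ok=True)
    return target
```

Calling export_to_model_dir("runs/train/exp") as the last step of train.py should put the weights and other outputs where SageMaker looks for them when the job finishes. If the training script lets you choose its output folder directly, pointing it at SM_MODEL_DIR avoids the extra copy.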
Ref:
- https://github.com/aws/sagemaker-training-toolkit/blob/master/ENVIRONMENT_VARIABLES.md#sm_model_dir
- https://docs.aws.amazon.com/sagemaker/latest/dg/your-algorithms-training-algo-output.html
Answered By - Steffenk