Issue
I'm following this example notebook to learn SageMaker's processing jobs API: https://github.com/aws/amazon-sagemaker-examples/blob/master/sagemaker_processing/scikit_learn_data_processing_and_model_evaluation/scikit_learn_data_processing_and_model_evaluation.ipynb
I'm trying to modify their code to avoid using the default S3 bucket, namely: s3://sagemaker-<region>-<account_id>/
For their data processing step, which uses the .run method:
from sagemaker.processing import ProcessingInput, ProcessingOutput

sklearn_processor.run(
    code="preprocessing.py",
    inputs=[ProcessingInput(source=input_data, destination="/opt/ml/processing/input")],
    outputs=[
        ProcessingOutput(output_name="train_data", source="/opt/ml/processing/train"),
        ProcessingOutput(output_name="test_data", source="/opt/ml/processing/test"),
    ],
    arguments=["--train-test-split-ratio", "0.2"],
)
I was able to modify it to use my own S3 bucket via the destination parameter, like this:
sklearn_processor.run(
    code=output_bucket_uri + "preprocessing.py",
    inputs=[
        ProcessingInput(
            source=input_bucket_uri + "census-income.csv",
            destination=path + "input/",
        )
    ],
    outputs=[
        ProcessingOutput(
            output_name="train_data",
            source=path + "train/",
            destination=output_bucket_uri + "train/",
        ),
        ProcessingOutput(
            output_name="test_data",
            source=path + "test/",
            destination=output_bucket_uri + "test/",
        ),
    ],
    arguments=["--train-test-split-ratio", "0.2"],
)
But for the .fit method:
sklearn.fit({"train": preprocessed_training_data})
I have not been able to find a parameter that makes the output artifacts save to an S3 bucket I specify instead of the default bucket s3://sagemaker-<region>-<account_id>/.
Solution
You specify the output artifacts' bucket when you create the SKLearn estimator. SKLearn is a subclass of Framework, which is a subclass of EstimatorBase, which accepts an output_path argument.
Below is a snippet from the SageMaker Examples that uses the PyTorch estimator, but the idea is the same:
est = PyTorch(
    entry_point="train.py",
    source_dir="code",  # directory of your training script
    role=role,
    framework_version="1.5.0",
    py_version="py3",
    instance_type=instance_type,
    instance_count=1,
    output_path=output_path,
    hyperparameters={"batch-size": 128, "epochs": 1, "learning-rate": 1e-3, "log-interval": 100},
)
est.fit(...)
Answered By - Murilo Cunha