I am trying to create an SKLearn processing job in Amazon SageMaker to transform my input data before model training.

I wrote a custom Python script, preprocessing.py, which does the transformation and imports some third-party Python packages. I followed a SageMaker example to set this up.

When I submit the processing job, I get this error:

............................Traceback (most recent call last):
  File "/opt/ml/processing/input/code/preprocessing.py", line 6, in <module>
    import snowflake.connector
ModuleNotFoundError: No module named 'snowflake.connector'

I understand that the processing job cannot find this package and that I need to install it. How can I do this through the SageMaker Processing Job API? Ideally there would be a way to pass a requirements.txt in the API call, but I don't see that in the docs.

I know I can build a custom image with the relevant packages and use it in the processing job, but that seems like too much work for something that should be built in.

Is there an easier or more elegant way to install the packages a SageMaker processing job needs?

iCHAIT

1 Answer

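One way is to call pip from within the script itself, before the import that fails (package is the name of the pip package to install):

import subprocess, sys
subprocess.check_call([sys.executable, "-m", "pip", "install", package])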
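For the script in the question, the pip call could be wrapped so it only runs when the module is actually missing. This is a minimal sketch, not SageMaker functionality: ensure_installed and pip_install_command are hypothetical helper names, snowflake-connector-python is assumed to be the PyPI package that provides snowflake.connector, and the processing container is assumed to have network access to a package index.

```python
import importlib.util
import subprocess
import sys


def pip_install_command(package):
    """Build the pip command for the interpreter running this script."""
    return [sys.executable, "-m", "pip", "install", package]


def ensure_installed(module_name, package, runner=subprocess.check_call):
    """Install `package` only if the top-level `module_name` is not importable.

    Returns True if an install was attempted, False otherwise.
    `runner` is injectable so the logic can be exercised without running pip.
    """
    if importlib.util.find_spec(module_name) is not None:
        return False
    runner(pip_install_command(package))
    # Let the import system see the freshly installed package.
    importlib.invalidate_caches()
    return True


# Assumed usage at the top of preprocessing.py, before the failing import:
# ensure_installed("snowflake", "snowflake-connector-python")
# import snowflake.connector
```

Using sys.executable rather than a bare pip command keeps the install tied to the same interpreter that runs the script.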

Another way is to use an SKLearn Estimator (a training job) to do the same work. You can pass a source_dir that contains a requirements.txt file, and the listed requirements will be installed for you before your entry point runs:

from sagemaker.sklearn.estimator import SKLearn

estimator = SKLearn(
    entry_point="foo.py",
    source_dir="./foo",  # no trailing slash! put requirements.txt here
    framework_version="0.23-1",
    role=...,
    instance_count=1,
    instance_type="ml.m5.large",
)
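The source_dir used above needs a requirements.txt sitting next to the entry point. As a rough sketch of preparing that layout locally, the hypothetical helper below (write_requirements is not part of the SageMaker SDK) writes one package per line; the package name in the usage comment is an assumption based on the import in the question.

```python
from pathlib import Path


def write_requirements(source_dir, packages):
    """Write one package per line to <source_dir>/requirements.txt."""
    directory = Path(source_dir)
    directory.mkdir(parents=True, exist_ok=True)
    path = directory / "requirements.txt"
    path.write_text("\n".join(packages) + "\n")
    return path


# Assumed usage, matching source_dir="./foo" in the estimator:
# write_requirements("./foo", ["snowflake-connector-python"])
```

In practice the file would usually be written by hand and committed alongside the entry-point script; the helper only makes the expected layout explicit.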
Neil McGuigan