My goal is to write a trained MLlib model to S3 from an AWS Glue Studio job. In a separate job, I then want to read the persisted model back from S3 to perform inference.
I understand that Spark MLlib models cannot be serialized with Python's pickle; that was the first approach I investigated (see this discussion: Save Apache Spark mllib model in python).
I have also investigated the `model.save([spark_context], [file_path])` method, as sketched below. I passed the glueContext as the first argument and provided a path, but got `TypeError: save() takes 2 positional arguments but 3 were given`.
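For reference, this is roughly the shape of the call inside the Glue Studio job. The LogisticRegression model, bucket and path below are placeholders standing in for my actual model; the `save()` call at the end is the one producing the error:

```python
from pyspark.context import SparkContext
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.linalg import Vectors
from awsglue.context import GlueContext

# Standard Glue job setup: this is where the glueContext mentioned above comes from.
sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)
spark = glueContext.spark_session

# Placeholder model: a trivially fitted estimator standing in for my real one.
training = spark.createDataFrame(
    [(1.0, Vectors.dense([0.0, 1.1])), (0.0, Vectors.dense([2.0, 1.0]))],
    ["label", "features"],
)
model = LogisticRegression(maxIter=5).fit(training)

# This is the call that fails with:
#   TypeError: save() takes 2 positional arguments but 3 were given
model.save(glueContext, "s3://my-bucket/models/my-model")  # placeholder path
```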
MLlib also provides a JSON persistence format, which I believe is the most promising approach. However, I'm not sure how to access the raw JSON for an existing model. If I can get that JSON string, I can use boto3 to write it to and read it back from S3, as sketched below.
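To make that concrete, this is the boto3 round trip I have in mind, assuming I could obtain the model's JSON representation as a string; the bucket, key and JSON content are placeholders:

```python
import boto3

s3 = boto3.client("s3")
bucket = "my-model-bucket"      # placeholder bucket
key = "models/my-model.json"    # placeholder key

# Placeholder for the JSON string I'm asking how to obtain from the model.
model_json = '{"placeholder": "model JSON would go here"}'

# Training job: write the serialized model to S3.
s3.put_object(Bucket=bucket, Key=key, Body=model_json.encode("utf-8"))

# Inference job: read the JSON back and (somehow) reconstruct the model from it.
restored_json = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")
```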
In summary, I have two alternative questions (the answer to either will suffice):
- How do I write MLlib models to, and read them back from, S3 in an AWS Glue job?
- How do I get the MLlib JSON persistence format from an existing model?