My goal is to write a trained MLlib model to S3 from an AWS Glue Studio job. In a separate job, I then want to read the persisted model back from S3 to perform inference.
I understand that Spark MLlib models cannot be serialized with Python's pickle; that was the first approach I investigated (see this discussion: Save Apache Spark mllib model in python).
I have also investigated the `model.save([spark_context], [file_path])` method, as sketched below. I passed the glueContext as the first argument and provided a path, but got `TypeError: save() takes 2 positional arguments but 3 were given`.
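For reference, this is roughly the shape of the call inside the Glue Studio job. The LogisticRegression model, bucket and path below are placeholders standing in for my actual model; the `save()` call at the end is the one producing the error:

```python
from pyspark.context import SparkContext
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.linalg import Vectors
from awsglue.context import GlueContext

# Standard Glue job setup: this is where the glueContext mentioned above comes from.
sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)
spark = glueContext.spark_session

# Placeholder model: a trivially fitted estimator standing in for my real one.
training = spark.createDataFrame(
    [(1.0, Vectors.dense([0.0, 1.1])), (0.0, Vectors.dense([2.0, 1.0]))],
    ["label", "features"],
)
model = LogisticRegression(maxIter=5).fit(training)

# This is the call that fails with:
#   TypeError: save() takes 2 positional arguments but 3 were given
model.save(glueContext, "s3://my-bucket/models/my-model")  # placeholder path
```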
MLlib also provides a JSON persistence format, which I believe is the most promising approach. However, I'm not sure how to access the raw JSON for an existing model. If I can get that JSON string, I can use boto3 to write it to and read it back from S3, as sketched below.
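To make that concrete, this is the boto3 round trip I have in mind, assuming I could obtain the model's JSON representation as a string; the bucket, key and JSON content are placeholders:

```python
import boto3

s3 = boto3.client("s3")
bucket = "my-model-bucket"      # placeholder bucket
key = "models/my-model.json"    # placeholder key

# Placeholder for the JSON string I'm asking how to obtain from the model.
model_json = '{"placeholder": "model JSON would go here"}'

# Training job: write the serialized model to S3.
s3.put_object(Bucket=bucket, Key=key, Body=model_json.encode("utf-8"))

# Inference job: read the JSON back and (somehow) reconstruct the model from it.
restored_json = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")
```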
In summary, I have two alternative questions (the answer to either will suffice):
- How do I write MLlib models to, and read them back from, S3 in an AWS Glue job?
- How do I get the MLlib JSON persistence format from an existing model?