
I am looking for a batch loader for a Glue job to load into RDS using a PySpark script with the DataFrameWriter. I have this working for Redshift as follows:

df.write \
    .format("com.databricks.spark.redshift") \
    .option("url", jdbcconf.get("url") + '/' + DATABASE + '?user=' + jdbcconf.get('user') + '&password=' + jdbcconf.get('password')) \
    .option("dbtable", TABLE_NAME) \
    .option("tempdir", args["TempDir"]) \
    .option("forward_spark_s3_credentials", "true") \
    .mode("overwrite") \
    .save()

where df is defined above by reading in a file. What is the best approach I could take to do this for RDS instead of Redshift?


2 Answers


If for RDS you only need APPEND / OVERWRITE, you can create an RDS JDBC connection and use something like the following:

postgres_url = "jdbc:postgresql://localhost:portnum/sakila?user=<user>&password=<pwd>"
df.write.jdbc(postgres_url, table="actor1", mode="append")      # for append
df.write.jdbc(postgres_url, table="actor1", mode="overwrite")   # for overwrite
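
In a Glue job you can build the URL from the JDBC connection instead of hard-coding it, the same way the question does for Redshift. A minimal sketch, assuming jdbcconf comes from glueContext.extract_jdbc_conf() on an RDS connection, the target is PostgreSQL, and DATABASE / TABLE_NAME are the same placeholders used in the question:

# Reuse the Glue JDBC connection rather than a hard-coded URL (assumes a PostgreSQL RDS target)
rds_url = jdbcconf.get("url") + '/' + DATABASE

df.write \
    .format("jdbc") \
    .option("url", rds_url) \
    .option("dbtable", TABLE_NAME) \
    .option("user", jdbcconf.get("user")) \
    .option("password", jdbcconf.get("password")) \
    .option("driver", "org.postgresql.Driver") \
    .mode("append") \
    .save()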

If it involves UPSERTs, then you can probably use a MySQL client library as an external Python library and perform INSERT INTO ... ON DUPLICATE KEY UPDATE.
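
A rough sketch of that upsert path with pymysql (one possible MySQL library); the RDS endpoint, credentials and the actor1 columns here are placeholders, not something from the question:

import pymysql

# Collect the batch to the driver; fine for modest loads, not for very large frames
rows = [tuple(r) for r in df.select("actor_id", "first_name", "last_name").collect()]

conn = pymysql.connect(host="<rds-endpoint>", user="<user>",
                       password="<pwd>", database="sakila")
try:
    with conn.cursor() as cur:
        cur.executemany(
            """INSERT INTO actor1 (actor_id, first_name, last_name)
               VALUES (%s, %s, %s)
               ON DUPLICATE KEY UPDATE
                   first_name = VALUES(first_name),
                   last_name  = VALUES(last_name)""",
            rows,
        )
    conn.commit()
finally:
    conn.close()

For larger DataFrames you would run the same logic inside df.foreachPartition() so each executor writes its own slice instead of collecting everything to the driver.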

Please refer to this link: How to use JDBC source to write and read data in (Py)Spark?

regards

Yuva


I learned that this can only be done through JDBC, e.g.:

df.write.format("jdbc") \
    .option("url", jdbcconf.get("url") + '/' + REDSHIFT_DATABASE + '?user=' + jdbcconf.get('user') + '&password=' + jdbcconf.get('password')) \
    .option("dbtable", REDSHIFT_TABLE_NAME) \
    .option("tempdir", args["TempDir"]) \
    .option("forward_spark_s3_credentials", "true") \
    .mode("overwrite") \
    .save()
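
Note that the tempdir and forward_spark_s3_credentials options are specific to the spark-redshift connector; for a plain format("jdbc") write to RDS they should not be needed and can presumably be dropped.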