
I have a requirement in PySpark: the output must be written as CSV files whose filenames contain the value of a column. There is no need to partition the output dataset.

For example:

+-----+----------+
| text|login_date|
+-----+----------+
|text1|2020-01-31|
|text2|2020-02-31|
|text3|2020-03-31|
|text4|2020-04-31|
+-----+----------+

My output should contain one file per distinct value of login_date, for example:

host_2020-01-31.csv
 -text,login_date
 -text1,2020-01-31
  
host_2020-02-31.csv
 -text,login_date
 -text2,2020-02-31
 
host_2020-03-31.csv
 -text,login_date
 -text3,2020-03-31
 
host_2020-04-31.csv
 -text,login_date
 -text4,2020-04-31

I am able to get the desired output using pandas, but that solution is not scalable. Below is the code I have written. Can you please help me with a scalable solution to this problem?

demo_df = sqlContext.createDataFrame(
    [("text1", "2020-01-31"), ("text2", "2020-02-31"),
     ("text3", "2020-03-31"), ("text4", "2020-04-31")],
    ["text", "login_date"])

demo_df.show()

# Collect the distinct dates to the driver, then write one CSV per date
# via pandas -- this round-trip through the driver is what does not scale.
dates_coll = demo_df.select("login_date").distinct().collect()
for date in dates_coll:
    final_df = demo_df.filter(demo_df.login_date == date.login_date)
    final_df.toPandas().to_csv("host_" + date.login_date + ".csv",
                               sep=",", header=True, index=False)
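One scalable pattern is to let Spark itself split the data with `partitionBy` instead of collecting the distinct dates to the driver. A sketch of that idea is below; the `out_dir` path is a placeholder, and `demo_df` is assumed to be the DataFrame from the question. Note that `partitionBy` moves the partition column into the folder name and drops it from the file contents, so the column is duplicated first to keep `login_date` inside each CSV.

```python
def write_partitioned(df, out_dir):
    """Write one folder of CSV data per login_date under out_dir
    (out_dir is a placeholder path)."""
    # Duplicate the column: partitionBy removes its column from the file
    # contents, but the desired output keeps login_date in each CSV.
    (df.withColumn("login_date_part", df["login_date"])
       .repartition("login_date_part")   # one task (and one file) per date
       .write
       .partitionBy("login_date_part")   # creates login_date_part=<value>/ folders
       .option("header", True)
       .mode("overwrite")
       .csv(out_dir))

def partition_folder(date_value):
    # Folder name Spark produces for a given login_date value
    return "login_date_part=" + date_value
```

This does not give files named `host_<date>.csv` directly (Spark controls the part-file names), but it scales, because every date is written in parallel by the executors.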
  • You cannot control file naming out of the box in Spark. Check https://stackoverflow.com/questions/41990086/specifying-the-filename-when-saving-a-dataframe-as-a-csv – Sanket9394 Jul 12 '21 at 06:20
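Since Spark does not let you choose the output file names, one workaround after a write partitioned by the date column (folders like `login_date=2020-01-31/part-...csv`) is to rename the part-files afterwards with Hadoop's FileSystem API through the JVM gateway. This is a hedged sketch: `spark` and `out_dir` are assumptions, and it presumes one part-file per folder.

```python
def host_filename(partition_dir_name):
    """Map a partition folder name like 'login_date=2020-01-31'
    to the requested file name 'host_2020-01-31.csv'."""
    date_value = partition_dir_name.split("=", 1)[1]
    return "host_" + date_value + ".csv"

def rename_outputs(spark, out_dir):
    """Rename each partition folder's part-file to host_<date>.csv
    and remove the now-empty folder."""
    jvm = spark.sparkContext._jvm
    conf = spark.sparkContext._jsc.hadoopConfiguration()
    Path = jvm.org.apache.hadoop.fs.Path
    fs = Path(out_dir).getFileSystem(conf)
    for status in fs.listStatus(Path(out_dir)):
        if not status.isDirectory():
            continue
        part_dir = status.getPath()
        target = Path(out_dir, host_filename(part_dir.getName()))
        for f in fs.listStatus(part_dir):
            if f.getPath().getName().startswith("part-"):
                fs.rename(f.getPath(), target)
        fs.delete(part_dir, True)  # drop the emptied partition folder
```

The renaming runs on the driver, but it only touches file metadata (one rename per date), so it stays cheap even when the data itself is large.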

0 Answers