
I'm trying to save a CSV file as the result of a SQL query sent to Athena via Databricks. The file should be a large table of about 4-6 GB (~40m rows).

I'm taking the following steps:

  1. Creating a PySpark dataframe:

    df = sqlContext.sql("select * from my_table where year = 19")
    
  2. Converting the PySpark dataframe to a Pandas dataframe. I realize this step may be unnecessary, but I've only just started using Databricks and may not know the commands needed to do it more efficiently (a Spark-only alternative is sketched after these steps). So I do it like this:

    ab = df.toPandas()
    
  3. Saving the file somewhere so I can download it locally later:

    ab.to_csv('my_my.csv')
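If the Pandas conversion turns out to be the bottleneck (I'm not sure ~40m rows will fit in the driver's memory), I assume something like the following Spark-only write would work instead; the DBFS path is just a placeholder:

    # df is the dataframe from step 1 above.
    # Write the query result straight from Spark to DBFS as CSV.
    # coalesce(1) asks Spark for a single part file; note the output path is a
    # directory containing a part-*.csv file, not a single CSV file itself.
    (df.coalesce(1)
       .write
       .option("header", "true")
       .mode("overwrite")
       .csv("dbfs:/FileStore/tables/my_my_csv"))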
    

But how do I download it?

Please be very specific, as I don't know many of the tricks and details of working with Databricks.


1 Answer


Using the GUI, you can download the full results (max 1 million rows).


To download the full results, first save the file to DBFS and then copy it to your local machine using the Databricks CLI, as follows.

dbfs cp "dbfs:/FileStore/tables/my_my.csv" "A:\AzureAnalytics"
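Note that ab.to_csv('my_my.csv') as written saves to the driver node's local disk, not to DBFS, so dbfs cp won't find it there. One way to get the file onto a DBFS path is something like the following (the paths here are only examples):

# Option 1: write the pandas DataFrame directly to DBFS,
# assuming the /dbfs FUSE mount is available on your cluster
ab.to_csv("/dbfs/FileStore/tables/my_my.csv")

# Option 2: if the CSV was already written to the driver's local disk,
# copy it into DBFS (dbutils is available inside Databricks notebooks)
dbutils.fs.cp("file:/databricks/driver/my_my.csv", "dbfs:/FileStore/tables/my_my.csv")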

Reference: Databricks file system

The DBFS command-line interface (CLI) uses the DBFS API to expose an easy-to-use command-line interface to DBFS. Using this client, you can interact with DBFS using commands similar to those you would use on a Unix command line. For example:

# List files in DBFS
dbfs ls
# Put local file ./apple.txt to dbfs:/apple.txt
dbfs cp ./apple.txt dbfs:/apple.txt
# Get dbfs:/apple.txt and save to local file ./apple.txt
dbfs cp dbfs:/apple.txt ./apple.txt
# Recursively put local dir ./banana to dbfs:/banana
dbfs cp -r ./banana dbfs:/banana

Reference: Installing and configuring Azure Databricks CLI

Hope this helps.

CHEEKATLAPRADEEP-MSFT
  • But is there any way to incorporate this step within a script? Like for pandas, a line of the script could be df.to_csv("testing.csv") - how can we achieve the same for Databricks? – lagrangian_headache May 12 '21 at 05:08