
I'm trying to save a CSV file as the result of a SQL query sent to Athena via Databricks. The file should be a large table of about 4-6 GB (~40m rows).

I'm taking the following steps:

  1. Creating a PySpark dataframe:

    df = sqlContext.sql("select * from my_table where year = 19")
    
  2. Converting the PySpark dataframe to a Pandas dataframe. I realize this step may be unnecessary, but I've only just started using Databricks and may not know the commands needed to do it more efficiently (a Spark-only alternative is sketched after these steps). So I do it like this:

    ab = df.toPandas()
    
  3. Saving the file somewhere so I can download it locally later:

    ab.to_csv('my_my.csv')
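If the Pandas conversion turns out to be the bottleneck (I'm not sure ~40m rows will fit in the driver's memory), I assume something like the following Spark-only write would work instead; the DBFS path is just a placeholder:

    # df is the dataframe from step 1 above.
    # Write the query result straight from Spark to DBFS as CSV.
    # coalesce(1) asks Spark for a single part file; note the output path is a
    # directory containing a part-*.csv file, not a single CSV file itself.
    (df.coalesce(1)
       .write
       .option("header", "true")
       .mode("overwrite")
       .csv("dbfs:/FileStore/tables/my_my_csv"))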
    

But how do I download it?

Please be very specific, as I don't know many of the tricks and details of working with Databricks.


1 Answer


Using the GUI, you can download the full results (max 1 million rows).


To download the full results, first save the file to DBFS and then copy it to your local machine using the Databricks CLI, as follows.

dbfs cp "dbfs:/FileStore/tables/my_my.csv" "A:\AzureAnalytics"
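Note that ab.to_csv('my_my.csv') as written saves to the driver node's local disk, not to DBFS, so dbfs cp won't find it there. One way to get the file onto a DBFS path is something like the following (the paths here are only examples):

# Option 1: write the pandas DataFrame directly to DBFS,
# assuming the /dbfs FUSE mount is available on your cluster
ab.to_csv("/dbfs/FileStore/tables/my_my.csv")

# Option 2: if the CSV was already written to the driver's local disk,
# copy it into DBFS (dbutils is available inside Databricks notebooks)
dbutils.fs.cp("file:/databricks/driver/my_my.csv", "dbfs:/FileStore/tables/my_my.csv")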

Reference: Databricks file system

The DBFS command-line interface (CLI) uses the DBFS API to expose an easy-to-use command-line interface to DBFS. Using this client, you can interact with DBFS using commands similar to those you would use on a Unix command line. For example:

# List files in DBFS
dbfs ls
# Put local file ./apple.txt to dbfs:/apple.txt
dbfs cp ./apple.txt dbfs:/apple.txt
# Get dbfs:/apple.txt and save to local file ./apple.txt
dbfs cp dbfs:/apple.txt ./apple.txt
# Recursively put local dir ./banana to dbfs:/banana
dbfs cp -r ./banana dbfs:/banana

Reference: Installing and configuring Azure Databricks CLI

Hope this helps.

CHEEKATLAPRADEEP-MSFT
  • But is there any way to incorporate this step within a script? Like for pandas, a line of the script could be df.to_csv("testing.csv") - how can we achieve the same for Databricks? – lagrangian_headache May 12 '21 at 05:08