Is there a link or sample code showing how to write a DataFrame to Azure Blob Storage using Python (without the pyspark module)?
2 Answers
6
Below is a code snippet that writes a DataFrame as CSV data directly to an Azure Blob Storage container from an Azure Databricks notebook.
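# Placeholders: replace these with your own values
storage_name = "YOUR_STORAGE_NAME"
storage_access_key = "YOUR_STORAGE_ACCESS_KEY"
output_container_name = "YOUR_CONTAINER_NAME"
# Configure the blob storage account access key globally.
# Note: this setting expects the storage account access key, not a SAS token.
spark.conf.set(
    "fs.azure.account.key.%s.blob.core.windows.net" % storage_name,
    storage_access_key)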
output_container_path = "wasbs://%s@%s.blob.core.windows.net" % (output_container_name, storage_name)
output_blob_folder = "%s/wrangled_data_folder" % output_container_path
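# Write the dataframe to blob storage as a single part file.
# coalesce(1) forces one partition, so Spark emits exactly one 'part-' file.
(dataframe
    .coalesce(1)
    .write
    .mode("overwrite")
    .option("header", "true")
    .format("csv")  # built-in CSV source in Spark 2.0+; "com.databricks.spark.csv" is the legacy name
    .save(output_blob_folder))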
# Get the name of the wrangled-data CSV file that was just saved to Azure blob storage (it starts with 'part-')
files = dbutils.fs.ls(output_blob_folder)
output_file = [x for x in files if x.name.startswith("part-")]
# Move the wrangled-data CSV file from a sub-folder (wrangled_data_folder) to the root of the blob container
# While simultaneously changing the file name
dbutils.fs.mv(output_file[0].path, "%s/predict-transform-output.csv" % output_container_path)
[Screenshot: example notebook]
[Screenshot: output, showing the DataFrame written to blob storage using Azure Databricks]
Dharman
CHEEKATLAPRADEEP-MSFT
Is there a way to write it simply as a CSV file, without the other files and the move operations? – Anirban Saha, May 19 '21 at 07:28
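For the case raised in the question and in this comment (plain Python, no pyspark, one CSV file and nothing else), one possible approach is to serialize a pandas DataFrame in memory and upload the bytes with the azure-storage-blob (v12) package. The following is only a minimal sketch under those assumptions: the helper names (`dataframe_to_csv_bytes`, `upload_dataframe_as_csv`, `make_blob_client`) are hypothetical, and the connection string, container, and blob name are placeholders you would supply yourself.

```python
def dataframe_to_csv_bytes(df):
    """Serialize a pandas DataFrame to UTF-8 CSV bytes (header row, no index column)."""
    # With no path argument, DataFrame.to_csv returns the CSV text as a string.
    return df.to_csv(index=False).encode("utf-8")


def upload_dataframe_as_csv(df, blob_client):
    """Upload df as a single CSV blob and return the number of bytes sent.

    blob_client is assumed to be an azure.storage.blob.BlobClient
    (anything with a compatible upload_blob method will work).
    """
    data = dataframe_to_csv_bytes(df)
    # overwrite=True replaces an existing blob of the same name.
    blob_client.upload_blob(data, overwrite=True)
    return len(data)


def make_blob_client(connection_string, container_name, blob_name):
    """Build a BlobClient; assumes the azure-storage-blob package is installed."""
    from azure.storage.blob import BlobServiceClient  # imported lazily

    service = BlobServiceClient.from_connection_string(connection_string)
    return service.get_blob_client(container=container_name, blob=blob_name)
```

Usage might look like `upload_dataframe_as_csv(df, make_blob_client("YOUR_CONNECTION_STRING", "YOUR_CONTAINER_NAME", "predict-transform-output.csv"))`. Because the whole file is built in memory, this suits DataFrames that fit comfortably in RAM; it produces no temporary folder and needs no move step.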
0
This answer also deletes the wrangled data folder, leaving you with only the file you need.
storage_name = "YOUR_STORAGE_NAME"
storage_access_key = "YOUR_STORAGE_ACCESS_KEY"
output_container_name = "YOUR_CONTAINER_NAME"
# Configure blob storage account access key globally
spark.conf.set("fs.azure.account.key.%s.blob.core.windows.net" % storage_name, storage_access_key)
output_container_path = "wasbs://%s@%s.blob.core.windows.net" % (output_container_name, storage_name)
output_blob_folder = "%s/wrangled_data_folder" % output_container_path
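# Write the dataframe to blob storage as a single part file.
# coalesce(1) forces one partition, so Spark emits exactly one 'part-' file.
(dataframe
    .coalesce(1)
    .write
    .mode("overwrite")
    .option("header", "true")
    .format("csv")  # built-in CSV source in Spark 2.0+; "com.databricks.spark.csv" is the legacy name
    .save(output_blob_folder))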
# Get the name of the wrangled-data CSV file that was just saved to Azure blob storage (it starts with 'part-')
files = dbutils.fs.ls(output_blob_folder)
output_file = [x for x in files if x.name.startswith("part-")]
# Move the wrangled-data CSV file from a sub-folder (wrangled_data_folder) to the root of the blob container
# While simultaneously changing the file name
dbutils.fs.mv(output_file[0].path, "%s/predict-transform-output.csv" % output_container_path)
# Recursively delete the temporary wrangled_data_folder, leaving only the renamed CSV file
dbutils.fs.rm("%s/wrangled_data_folder" % output_container_path, True)
user7779697
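The list, move, and delete steps used in both answers can be gathered into one small helper. This is a sketch, not part of either answer: the function name `promote_single_part_file` is hypothetical, and `fs` is assumed to be `dbutils.fs` (or any object with compatible `ls`, `mv`, and `rm` methods). Unlike the snippets above, it raises an error when it does not find exactly one 'part-' file, instead of silently taking the first match.

```python
def promote_single_part_file(fs, folder, target_path):
    """Move the single 'part-' file in folder to target_path, then delete folder.

    fs is assumed to behave like dbutils.fs: ls(path) returns entries with
    .name and .path, mv(src, dst) moves a file, rm(path, recurse) deletes.
    """
    part_files = [f for f in fs.ls(folder) if f.name.startswith("part-")]
    if len(part_files) != 1:
        # More than one file means the dataframe was not coalesced to one partition.
        raise ValueError(
            "expected exactly one part file in %s, found %d" % (folder, len(part_files))
        )
    fs.mv(part_files[0].path, target_path)
    # Remove the now-unneeded output folder and its marker files recursively.
    fs.rm(folder, True)
    return target_path
```

In a Databricks notebook, a call might look like `promote_single_part_file(dbutils.fs, output_blob_folder, "%s/predict-transform-output.csv" % output_container_path)`.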