
I'm trying to control the name of the CSV file that my code writes:

from pyspark.sql import *
from IPython.core.display import display, HTML

display(HTML("<style>.container { width:100% !important; }</style>"))

spark = SparkSession \
    .builder \
    .appName("Python Spark SQL basic example") \
    .config("spark.some.config.option") \
    .getOrCreate()
    
df = spark.read.csv("../work/data2/*.csv", inferSchema=True, header=False)

df.createOrReplaceTempView("iris")
result = spark.sql("select * from iris where _c1 =2 order by _c0 ")
summary=result.describe(['_c10'])
summary.show()
summary.coalesce(1).write.csv("202003/data1_0331.csv")

With .write.csv("202003/data1_0331.csv"), Spark creates a folder with that name and puts the data in a part file inside it.

Result

202003/data1_0331.csv/part-00000-3afd3298-a186-4289-8ba3-3bf55d27953f-c000.csv

The result I want is

202003/data1_0331.csv

How do I get the result I want? I saw a similar solution that suggested write.csv(summary,file="data1_0331"), but when I tried it I got this error:

cannot resolve '`0`' given input columns
Mohana B C
powpow
  • Does this answer your question: https://stackoverflow.com/questions/40792434/spark-dataframe-save-in-single-file-on-hdfs-location?rq=1 – Mohana B C Aug 17 '21 at 04:16

2 Answers

Spark uses parallelism to speed up computation, so it is normal for it to write one CSV dataset as a folder of several part files. Spark can also read those part files back in parallel.

So if you only use Spark, keep it that way: it will be faster.

However, if you really want to save your data as a single CSV file, you can use pandas. Note that toPandas() collects all rows onto the driver, which is fine for a small result such as the output of describe():

summary.toPandas().to_csv("202003/data1_0331.csv")

Be Chiller Too

You cannot control the name of the part file that a Spark write operation produces.

However, you can always rename it afterwards:

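from py4j.java_gateway import java_import

# Assumption: CSVPath is the folder Spark wrote to, with a trailing slash
CSVPath = "202003/data1_0331.csv/"

# Make the Hadoop Path class available as spark._jvm.Path
java_import(spark._jvm, 'org.apache.hadoop.fs.Path')

fs = spark._jvm.org.apache.hadoop.fs.FileSystem.get(spark._jsc.hadoopConfiguration())

list_status = fs.listStatus(spark._jvm.Path(CSVPath))

# Take the first file whose name starts with 'part-'
file_name = [file.getPath().getName() for file in list_status if file.getPath().getName().startswith('part-')][0]

print(file_name)

# The renamed file stays inside the output folder
fs.rename(spark._jvm.Path(CSVPath + file_name), spark._jvm.Path(CSVPath + "data1_0331.csv"))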

This code lists the files in your output path, picks the first one whose name starts with part-, and renames it to the desired name. With coalesce(1) there is only one such file.
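If the output is on the local filesystem rather than HDFS, the same idea can be sketched with the Python standard library alone. This is only an illustration: promote_single_part_file is a hypothetical helper name, and it assumes the folder contains exactly one part- file (as after coalesce(1)) and that the target path is not a file inside that folder. Unlike the rename above, it removes the folder afterwards, so the target may have the same path the folder had.

```python
import shutil
import tempfile
from pathlib import Path


def promote_single_part_file(output_dir: str, target_file: str) -> Path:
    """Replace a Spark output folder with its single part-* file.

    Hypothetical helper. Assumes a local filesystem, exactly one part file,
    and a target that is not located inside output_dir.
    """
    out_dir = Path(output_dir)
    # Hidden checksum files (.part-*.crc) and _SUCCESS do not match this prefix.
    parts = sorted(p for p in out_dir.iterdir() if p.name.startswith("part-"))
    if len(parts) != 1:
        raise ValueError(f"expected exactly one part file, found {len(parts)}")

    target = Path(target_file)
    target.parent.mkdir(parents=True, exist_ok=True)

    # Stage the file next to the target first, because the target may have
    # the same path as the folder that is about to be deleted.
    staging = target.with_name(target.name + ".tmp")
    shutil.move(str(parts[0]), str(staging))
    shutil.rmtree(out_dir)  # also drops _SUCCESS and checksum files
    staging.replace(target)
    return target


if __name__ == "__main__":
    # Simulate a Spark output folder in a temporary location.
    base = Path(tempfile.mkdtemp())
    folder = base / "202003" / "data1_0331.csv"
    folder.mkdir(parents=True)
    (folder / "part-00000-example-c000.csv").write_text("summary,_c10\n")
    (folder / "_SUCCESS").write_text("")

    result = promote_single_part_file(str(folder), str(folder))
    print(result, result.is_file())
```

Staging the file under a temporary name is what allows 202003/data1_0331.csv to end up as a plain file even though Spark created a folder with that exact name.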

Haha