
I have 5 Excel files (.xlsx) in a data lake, each containing a sheet with the same name. I need to read all the files into a single dataframe and load it back to the data lake.

File1_Jan.xlsx
File2_Jan.xlsx
File3_Jan.xlsx
File4_Jan.xlsx
File5_Jan.xlsx

All of these files contain the same sheet name, "Sheet1". I tried the code below, but it's not working:

import pandas as pd
from pyspark.sql.types import StructType, StructField, StringType, DateType, DoubleType

parent_path = '/path/to/parent/folder/Y=2022/M=01/'
filepath1 = parent_path + 'File1_Jan.xlsx'
filepath2 = parent_path + 'File2_Jan.xlsx'
filepath3 = parent_path + 'File3_Jan.xlsx'
filepath4 = parent_path + 'File4_Jan.xlsx'
filepath5 = parent_path + 'File5_Jan.xlsx'

# Read the first file into a pandas DataFrame.
file_pd_df = pd.read_excel(filepath1, sheet_name='Sheet1')

file1Schema = StructType([
  StructField("Col1", StringType(), False),
  StructField("Col2", DateType(), False),
  StructField("Col3", StringType(), False),
  StructField("Col4", StringType(), False),
  StructField("Col5", DoubleType(), False)])

# Convert the pandas DataFrame to a Spark DataFrame with an explicit schema.
final_df = spark.createDataFrame(file_pd_df, schema=file1Schema)

How do I loop over the files so that all of them are read? Can I avoid hardcoding the filenames?

user175025
  • Instead of hardcoding the files, you can read all the files of the `parent_path`. See this: https://stackoverflow.com/questions/3207219/how-do-i-list-all-files-of-a-directory – Avión Feb 22 '22 at 07:43
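
Building on that comment, here is a minimal sketch of the loop. It assumes `parent_path` is reachable through the local filesystem (e.g. a mounted data lake path) so `glob` can list it, that every `.xlsx` file in the folder has a "Sheet1" with the same columns, and that `spark` is an existing SparkSession, as in the question's code:

import glob
import pandas as pd
from pyspark.sql.types import StructType, StructField, StringType, DateType, DoubleType

parent_path = '/path/to/parent/folder/Y=2022/M=01/'

schema = StructType([
  StructField("Col1", StringType(), False),
  StructField("Col2", DateType(), False),
  StructField("Col3", StringType(), False),
  StructField("Col4", StringType(), False),
  StructField("Col5", DoubleType(), False)])

# List every .xlsx file in the folder instead of hardcoding names,
# read each one's Sheet1 into a pandas DataFrame, and stack them.
frames = [pd.read_excel(path, sheet_name='Sheet1')
          for path in sorted(glob.glob(parent_path + '*.xlsx'))]
all_files_pd = pd.concat(frames, ignore_index=True)

# One Spark DataFrame covering all five files.
final_df = spark.createDataFrame(all_files_pd, schema=schema)

If the storage is not mounted locally, the `glob` call would need to be replaced with the directory-listing API of your data lake (e.g. `dbutils.fs.ls` on Databricks); the read-and-concat loop stays the same.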

0 Answers