I have 5 Excel files (.xlsx) in a data lake, all containing a sheet with the same name. I need to read all the files into a single DataFrame and load it back to the data lake.
File1_Jan.xlsx
File2_Jan.xlsx
File3_Jan.xlsx
File4_Jan.xlsx
File5_Jan.xlsx
Each file contains the same sheet, "Sheet1". I tried the code below, but it is not working:
import pandas as pd
from pyspark.sql.types import StructType, StructField, StringType, DateType, DoubleType

parent_path = '/path/to/parent/folder/Y=2022/M=01/'
filepath1 = parent_path + 'File1_Jan.xlsx'
filepath2 = parent_path + 'File2_Jan.xlsx'
filepath3 = parent_path + 'File3_Jan.xlsx'
filepath4 = parent_path + 'File4_Jan.xlsx'
filepath5 = parent_path + 'File5_Jan.xlsx'

# Only the first file is read here; the other four paths go unused.
file_pd_df = pd.read_excel(filepath1, sheet_name='Sheet1')

# All five files share this column layout.
file1Schema = StructType([
    StructField("Col1", StringType(), False),
    StructField("Col2", DateType(), False),
    StructField("Col3", StringType(), False),
    StructField("Col4", StringType(), False),
    StructField("Col5", DoubleType(), False)])

final_df = spark.createDataFrame(file_pd_df, schema=file1Schema)
How do I read all the files in a loop, and can I avoid hardcoding the filenames?
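Roughly this shape is what I am after (a minimal sketch, assuming the lake folder is mounted so Python's file APIs can list it, that spark is the notebook's existing session, and that every "Sheet1" matches the schema above; the '*_Jan.xlsx' pattern is just an assumption based on the filenames listed):

import glob
import os

import pandas as pd
from pyspark.sql.types import StructType, StructField, StringType, DateType, DoubleType

parent_path = '/path/to/parent/folder/Y=2022/M=01/'

# Shared schema for all files (same columns in each "Sheet1").
schema = StructType([
    StructField("Col1", StringType(), False),
    StructField("Col2", DateType(), False),
    StructField("Col3", StringType(), False),
    StructField("Col4", StringType(), False),
    StructField("Col5", DoubleType(), False)])

# Discover the files instead of hardcoding five paths.
paths = sorted(glob.glob(os.path.join(parent_path, '*_Jan.xlsx')))

# Read "Sheet1" from each workbook (requires openpyxl for .xlsx)
# and stack the results row-wise into one pandas DataFrame.
frames = [pd.read_excel(p, sheet_name='Sheet1') for p in paths]
all_pd_df = pd.concat(frames, ignore_index=True)

final_df = spark.createDataFrame(all_pd_df, schema=schema)

Is a glob-and-concat loop like this the right way to do it, or is there a better pattern for this in Spark?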