I want to read the last modified datetime of files in a data lake from a Databricks script. Ideally, I could read it efficiently as a column while reading the data from the data lake.
Thank you :)
ARCrow
- Will this help https://stackoverflow.com/questions/61317600/how-to-get-files-metadata-when-retrieving-data-from-hdfs/61423874#61423874 ? – Srinivas Jun 16 '21 at 15:50
- @Srinivas Thank you for your comment. I'm limited to using PySpark, and dbutils.fs.ls, which gives some metadata about files, doesn't include the last modified datetime, only the file size and path. Do you happen to know how I can replicate your logic in PySpark? – ARCrow Jun 16 '21 at 16:58
- See the linked answer. – Alex Ott Jun 17 '21 at 12:59
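For context, a minimal sketch of the dbutils.fs.ls listing mentioned in the comments (the abfss path is a placeholder):

# Each entry returned by dbutils.fs.ls is a FileInfo exposing path, name and size
files = dbutils.fs.ls("abfss://<container-name>@<account-name>.dfs.core.windows.net/<file-path>/")
for f in files:
    print(f.path, f.name, f.size)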
2 Answers
Those details can be retrieved with Python code, as there is no direct method to get the modified time and date of files in the data lake.
Here is the code:
from azure.storage.blob import BlockBlobService
from datetime import datetime

# Uses the legacy azure-storage-blob v2.x SDK, which exposes BlockBlobService
block_blob_service = BlockBlobService(account_name='account-name', account_key='account-key')
container_name = 'container-firstname'
second_container_name = 'container-Second'
#block_blob_service.create_container(container_name)
generator = block_blob_service.list_blobs(container_name, prefix="Recovery/")
report_time = datetime.now().strftime('%Y-%m-%d %H:%M:%S')

for blob in generator:
    # Fetch the blob properties once per blob and read the size and last modified time from them
    properties = block_blob_service.get_blob_properties(container_name, blob.name).properties
    file_size = properties.content_length
    last_modified = properties.last_modified
    line = container_name + '|' + second_container_name + '|' + blob.name + '|' + str(file_size) + '|' + str(last_modified) + '|' + str(report_time)
    print(line)
For more details, refer to the SO thread that addresses a similar issue.
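If the timestamp is needed as a column on the data itself, as the question asks, one option is to turn the blob listing into a small lookup DataFrame and join it to the data on the file path from input_file_name(). This is only a sketch: the account name, the wasbs:// prefix and the parquet read path are placeholders and must match how the data is actually read.

from pyspark.sql.functions import input_file_name

# Build a (file_path, last_modified) lookup from the blob listing;
# the constructed wasbs:// path must match the URIs Spark uses when reading the data
rows = []
for blob in block_blob_service.list_blobs(container_name, prefix="Recovery/"):
    props = block_blob_service.get_blob_properties(container_name, blob.name).properties
    full_path = "wasbs://{}@{}.blob.core.windows.net/{}".format(container_name, 'account-name', blob.name)
    rows.append((full_path, props.last_modified))
meta_df = spark.createDataFrame(rows, ["file_path", "last_modified"])

# Tag each record with its source file, then join in the last modified time
data_df = (spark.read.parquet("wasbs://<container-name>@<account-name>.blob.core.windows.net/Recovery/")
           .withColumn("file_path", input_file_name())
           .join(meta_df, on="file_path", how="left"))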
CHEEKATLAPRADEEP-MSFT
SaiSakethGuduru-MT
Regarding the issue, please refer to the following code
# Access the Hadoop FileSystem API through the JVM gateway that Spark exposes
URI = sc._gateway.jvm.java.net.URI
Path = sc._gateway.jvm.org.apache.hadoop.fs.Path
FileSystem = sc._gateway.jvm.org.apache.hadoop.fs.FileSystem

# Configure the storage account key so the ABFS driver can authenticate
conf = sc._jsc.hadoopConfiguration()
conf.set(
    "fs.azure.account.key.<account-name>.dfs.core.windows.net",
    "<account-access-key>")

# List the directory and print each file's status and modification time (milliseconds since the epoch)
fs = Path('abfss://<container-name>@<account-name>.dfs.core.windows.net/<file-path>/').getFileSystem(conf)
status = fs.listStatus(Path('abfss://<container-name>@<account-name>.dfs.core.windows.net/<file-path>/'))
for file_status in status:
    print(file_status)
    print(file_status.getModificationTime())
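If the goal is to end up with these timestamps as a Spark column, the listing above can be collected into a small DataFrame; getModificationTime() returns milliseconds since the epoch, so it is converted to a datetime below. This is only a sketch built on the status list from the code above.

from datetime import datetime

# Convert each FileStatus into (full path, modification time as a datetime)
files = [(f.getPath().toString(), datetime.fromtimestamp(f.getModificationTime() / 1000.0))
         for f in status]
meta_df = spark.createDataFrame(files, ["file_path", "last_modified"])
meta_df.show(truncate=False)

On recent Spark/Databricks runtimes, the hidden _metadata column (for example _metadata.file_modification_time) may expose this directly when reading file-based sources, depending on the runtime version.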