I want to read the last modified datetime of files in a data lake from a Databricks script. Ideally, I could read it efficiently as a column while reading the data from the data lake.
Thank you :)
ARCrow
- Will this help https://stackoverflow.com/questions/61317600/how-to-get-files-metadata-when-retrieving-data-from-hdfs/61423874#61423874 ? – Srinivas Jun 16 '21 at 15:50
- @Srinivas Thank you for your comment. I'm limited to using PySpark, and dbutils.fs.ls, which gives some metadata about files, doesn't include the last modified datetime, only the file size and path. Do you happen to know how I can replicate your logic in PySpark? – ARCrow Jun 16 '21 at 16:58
- See the linked answer. – Alex Ott Jun 17 '21 at 12:59
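For context, a minimal sketch of the dbutils.fs.ls listing mentioned in the comments (the abfss path is a placeholder):

# Each entry returned by dbutils.fs.ls is a FileInfo exposing path, name and size
files = dbutils.fs.ls("abfss://<container-name>@<account-name>.dfs.core.windows.net/<file-path>/")
for f in files:
    print(f.path, f.name, f.size)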
2 Answers
Those details can be retrieved with Python code, as there is no direct method to get the modified time and date of files in the data lake.
Here is the code:
from azure.storage.blob import BlockBlobService
from datetime import datetime

# Uses the legacy azure-storage-blob v2.x SDK, which exposes BlockBlobService
block_blob_service = BlockBlobService(account_name='account-name', account_key='account-key')
container_name = 'container-firstname'
second_container_name = 'container-Second'
#block_blob_service.create_container(container_name)
generator = block_blob_service.list_blobs(container_name, prefix="Recovery/")
report_time = datetime.now().strftime('%Y-%m-%d %H:%M:%S')

for blob in generator:
    # Fetch the blob properties once per blob and read the size and last modified time from them
    properties = block_blob_service.get_blob_properties(container_name, blob.name).properties
    file_size = properties.content_length
    last_modified = properties.last_modified
    line = container_name + '|' + second_container_name + '|' + blob.name + '|' + str(file_size) + '|' + str(last_modified) + '|' + str(report_time)
    print(line)
For more details, refer to the SO thread that addresses a similar issue.
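If the timestamp is needed as a column on the data itself, as the question asks, one option is to turn the blob listing into a small lookup DataFrame and join it to the data on the file path from input_file_name(). This is only a sketch: the account name, the wasbs:// prefix and the parquet read path are placeholders and must match how the data is actually read.

from pyspark.sql.functions import input_file_name

# Build a (file_path, last_modified) lookup from the blob listing;
# the constructed wasbs:// path must match the URIs Spark uses when reading the data
rows = []
for blob in block_blob_service.list_blobs(container_name, prefix="Recovery/"):
    props = block_blob_service.get_blob_properties(container_name, blob.name).properties
    full_path = "wasbs://{}@{}.blob.core.windows.net/{}".format(container_name, 'account-name', blob.name)
    rows.append((full_path, props.last_modified))
meta_df = spark.createDataFrame(rows, ["file_path", "last_modified"])

# Tag each record with its source file, then join in the last modified time
data_df = (spark.read.parquet("wasbs://<container-name>@<account-name>.blob.core.windows.net/Recovery/")
           .withColumn("file_path", input_file_name())
           .join(meta_df, on="file_path", how="left"))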
CHEEKATLAPRADEEP-MSFT
SaiSakethGuduru-MT
Regarding the issue, please refer to the following code
# Access the Hadoop FileSystem API through the JVM gateway that Spark exposes
URI = sc._gateway.jvm.java.net.URI
Path = sc._gateway.jvm.org.apache.hadoop.fs.Path
FileSystem = sc._gateway.jvm.org.apache.hadoop.fs.FileSystem

# Configure the storage account key so the ABFS driver can authenticate
conf = sc._jsc.hadoopConfiguration()
conf.set(
    "fs.azure.account.key.<account-name>.dfs.core.windows.net",
    "<account-access-key>")

# List the directory and print each file's status and modification time (milliseconds since the epoch)
fs = Path('abfss://<container-name>@<account-name>.dfs.core.windows.net/<file-path>/').getFileSystem(conf)
status = fs.listStatus(Path('abfss://<container-name>@<account-name>.dfs.core.windows.net/<file-path>/'))
for file_status in status:
    print(file_status)
    print(file_status.getModificationTime())
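If the goal is to end up with these timestamps as a Spark column, the listing above can be collected into a small DataFrame; getModificationTime() returns milliseconds since the epoch, so it is converted to a datetime below. This is only a sketch built on the status list from the code above.

from datetime import datetime

# Convert each FileStatus into (full path, modification time as a datetime)
files = [(f.getPath().toString(), datetime.fromtimestamp(f.getModificationTime() / 1000.0))
         for f in status]
meta_df = spark.createDataFrame(files, ["file_path", "last_modified"])
meta_df.show(truncate=False)

On recent Spark/Databricks runtimes, the hidden _metadata column (for example _metadata.file_modification_time) may expose this directly when reading file-based sources, depending on the runtime version.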