
I have tried to find out whether the GCS Python client, and more specifically blob.upload_from_file() and blob.download_to_file(), checks the integrity of the uploaded or downloaded file automatically. If not, how can I check the CRC hash programmatically? Any pointers to documentation or source code would be appreciated.

Kevin

1 Answer


At the moment, the GCS Python package doesn't fully support automatic integrity verification for either uploads or downloads.

Downloads

Verification is available for downloads that are neither chunked nor the result of a compose operation[7]. It is performed by the dependency google-resumable-media-python[4], which verifies only an object's MD5 checksum. One main reason chunked verification isn't supported is that the Google Cloud Storage API doesn't return MD5 or CRC32C checksums for chunks of an object; the checksums are only available for the full object data. Downloads aren't chunked when a blob instance's _chunk_size is None, which is the default value for new instances of Blob[1]. The underlying package google-resumable-media-python[2] verifies integrity[3] for the google-cloud-storage package[4], which is used by blob.download_to_file[5]. At the moment CRC32C verification isn't supported.
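For the cases that aren't verified automatically, one option is to check the downloaded file yourself. The following is only a sketch, not something the client library provides: file_md5_base64 and matches_remote_md5 are hypothetical helper names, and it assumes you pass in the base64 MD5 that GCS reports for the object (the blob's md5_hash property once its metadata has been loaded).

```python
import base64
import hashlib


def file_md5_base64(path, chunk_size=1024 * 1024):
    """Return the base64-encoded MD5 digest of a local file."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # Read in pieces so large files don't have to fit in memory
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return base64.b64encode(digest.digest()).decode("ascii")


def matches_remote_md5(path, remote_md5):
    """Compare a local file against the base64 MD5 reported by GCS."""
    if remote_md5 is None:
        # Hypothetical handling: no MD5 was reported for this object
        raise ValueError("no MD5 available for this object")
    return file_md5_base64(path) == remote_md5
```

After downloading to 'my-file' and loading the blob's metadata (for example with blob.reload()), matches_remote_md5('my-file', blob.md5_hash) would tell you whether the bytes on disk match what GCS stored.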

Uploads

For uploads, the developer has to compute the MD5 or CRC32C checksum before uploading, for example with blob.upload_from_file()[6].

Example, assuming you already know the base64 form of the object's CRC32C or MD5 (these fields are optional and are only used on an upload):

from google.cloud import storage

storage_client = storage.Client()

bucket = storage_client.bucket("bucket-name")
new_blob = bucket.blob("new-blob-name")
# base64 encoded CRC32C
new_blob.crc32c = "EhUJRQ=="
# base64 encoded MD5
new_blob.md5_hash = "DDzeBxm1uuDBNd9hEy8WBA=="
with open('my-file', 'rb') as my_file:
    new_blob.upload_from_file(my_file)

Google Cloud Storage will use these checksums to verify the upload server side and will only complete the upload if they match.
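If you don't have the checksum ahead of time, you can compute it from the file right before uploading. This is a rough sketch under a few assumptions: upload_with_md5 is a hypothetical helper name, blob is a Blob like new_blob above, and the 1 MiB read size is arbitrary.

```python
import base64
import hashlib


def upload_with_md5(blob, path):
    """Compute the file's MD5, attach it to the blob, then upload."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # Hash the file in pieces (1 MiB is an arbitrary choice)
        for chunk in iter(lambda: f.read(1024 * 1024), b""):
            digest.update(chunk)
        # GCS expects the base64 form of the raw digest
        blob.md5_hash = base64.b64encode(digest.digest()).decode("ascii")
        # Go back to the start so the upload sends the whole file
        f.seek(0)
        blob.upload_from_file(f)
    return blob.md5_hash
```

With the earlier example this would be called as upload_with_md5(new_blob, 'my-file'). The file is read twice (once to hash, once to upload), which is the price of sending the checksum along with the upload request.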

Calculating MD5 or CRC32C for an object in Python.

  1. MD5: for checksumming an object in Python I'll defer to the following StackOverflow question: Generating an MD5 checksum of a file

  2. CRC32C: I don't have a specific package that I'd strongly recommend at the moment, but the crcmod and crc32c packages do exist and they can help you checksum data using CRC32C programmatically.

Example of using the crc32c package to generate the expected value for the GCS CRC32C checksum:

from crc32c import crc32
import base64

# Open in binary mode so the checksum covers the exact bytes on disk
with open('file-name', 'rb') as f:
    # Read data and checksum
    checksum = crc32(f.read())
    # Convert into a bytes type that can be base64 encoded
    base64_crc32c = base64.b64encode(checksum.to_bytes(length=4, byteorder='big')).decode('utf-8')
    # Print the Base64 encoded CRC32C
    print(base64_crc32c)
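If installing an extra package isn't an option, CRC32C can also be computed in pure Python. This is only a sketch, much slower than the C-backed packages; crc32c_update and file_crc32c_base64 are hypothetical names. It uses the standard table-driven method with the reflected Castagnoli polynomial 0x82F63B78 and encodes the result the same way as above (4 bytes, big-endian, base64).

```python
import base64

# Lookup table for the reflected Castagnoli polynomial
_TABLE = []
for _i in range(256):
    _c = _i
    for _ in range(8):
        _c = (_c >> 1) ^ 0x82F63B78 if _c & 1 else _c >> 1
    _TABLE.append(_c)


def crc32c_update(data, crc=0):
    """Return the CRC32C of data, continuing from a previous crc value."""
    crc ^= 0xFFFFFFFF
    for byte in data:
        crc = _TABLE[(crc ^ byte) & 0xFF] ^ (crc >> 8)
    return crc ^ 0xFFFFFFFF


def file_crc32c_base64(path, chunk_size=1024 * 1024):
    """Return the base64 CRC32C of a file in the form GCS reports it."""
    crc = 0
    with open(path, "rb") as f:
        # Feed the file through in pieces, carrying the running value
        for chunk in iter(lambda: f.read(chunk_size), b""):
            crc = crc32c_update(chunk, crc)
    return base64.b64encode(crc.to_bytes(4, byteorder="big")).decode("ascii")
```

The returned string can be compared with a blob's crc32c value after a download, or assigned to it before an upload.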

HTH

Frank Natividad