102

I read the filenames in my S3 bucket by doing

import boto3

s3 = boto3.client('s3')
objs = s3.list_objects(Bucket='my_bucket')
if 'Contents' in objs:
    for obj in objs['Contents']:
        filename = obj['Key']

Now, I need to get the actual content of the file, similar to open(filename).readlines(). What is the best way?

mar tin

6 Answers

131

boto3 offers a resource model that makes tasks like iterating through objects easier. Unfortunately, StreamingBody doesn't provide readline or readlines.

s3 = boto3.resource('s3')
bucket = s3.Bucket('test-bucket')
# Iterates through all the objects, doing the pagination for you. Each obj
# is an ObjectSummary, so it doesn't contain the body. You'll need to call
# get to get the whole body.
for obj in bucket.objects.all():
    key = obj.key
    body = obj.get()['Body'].read()
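
Since StreamingBody has no readline()/readlines(), one workaround for text objects is to read the whole body and split it yourself. A minimal sketch building on the s3 resource above (assumes a reasonably small UTF-8 object; the key name is hypothetical):

# read one object and iterate over its lines in memory
obj = s3.Object('test-bucket', 'mykey.txt')  # hypothetical key
for line in obj.get()['Body'].read().decode('utf-8').splitlines():
    print(line)

On newer botocore versions the StreamingBody returned by get() also exposes iter_lines(), which avoids loading the whole object into memory at once.
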
Jordon Phillips
  • I passed through the client because I need to configure it manually within the script itself, as in client = boto3.client( 's3', aws_access_key_id="***", aws_secret_access_key="****" ). Is there a way to give the access keys to the resource without using the client? – mar tin Mar 24 '16 at 19:33
  • You can configure the resource in the same way. – Jordon Phillips Mar 24 '16 at 20:01
  • How do I read a file if it is in folders in S3. So for eg my bucket name is A. Now A has a folder B. B has a folder C. C contains a file Readme.csv. How to read this file. Your solution is good if we have files directly in bucket but in case we have multiple folders then how to go about it. Thanks. – Kshitij Marwah Dec 14 '16 at 16:56
  • S3 is an object store, not a file system. It doesn't actually have the concept of folders, though it is commonly stapled on. When iterating over objects you will get everything unless you specify otherwise. – Jordon Phillips Dec 14 '16 at 17:06
  • we can get the body, how can i read line by line within this body ? – Gabriel Wu Mar 02 '17 at 06:29
  • @GabrielWu did you find a way? – Adi Apr 01 '17 at 19:04
  • Do I really need to get all the objects until I find the one I need? – Iulian Onofrei May 02 '17 at 20:22
  • @IulianOnofrei you can filter in your listing, and there is no requirement that you call get on the object if you don't need to. – Jordon Phillips May 02 '17 at 22:36
  • Isn't `bucket.objects.all()` making requests to AWS while iterating it? – Iulian Onofrei May 03 '17 at 07:22
  • @IulianOnofrei it is making requests yes, but you aren't downloading the objects, just listing them. You can use `.filter()` to make fewer list requests. Or if you know the key you want just get it directly with `bucket.Object('mykey')` – Jordon Phillips May 03 '17 at 15:40
  • The `bucket.Object('mykey')` was exactly what I needed, thanks! – Iulian Onofrei May 03 '17 at 18:27
  • @martin you can configure aws cli or put you both keys in ~/.aws/credentials file, that way you don't have to specify in the script. It will take from the running machine or environment. – Vivek Feb 18 '18 at 14:18
  • @Jordon Phillips Would you know how to use your code to actually put all the read files into data frames? I mean if I have a bucket that has got two csv files I want to combine them into one. – Kalenji Nov 04 '20 at 09:15
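
Regarding the "folders" question in the comments: S3 keys are plain strings, so a prefix filter covers nested "folders", and a known key can be fetched without listing anything. A minimal sketch (bucket and key names follow the comment's example and are hypothetical):

s3 = boto3.resource('s3')
bucket = s3.Bucket('A')
# list only the keys under the B/C/ "folder" (i.e. key prefix)
for obj in bucket.objects.filter(Prefix='B/C/'):
    print(obj.key)
# or fetch a single known key directly, no listing needed
readme = bucket.Object('B/C/Readme.csv').get()['Body'].read()
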
35

You might consider the smart_open module, which supports iterators:

from smart_open import smart_open

# stream lines from an S3 object
for line in smart_open('s3://mybucket/mykey.txt', 'rb'):
    print(line.decode('utf8'))

and context managers:

with smart_open('s3://mybucket/mykey.txt', 'rb') as s3_source:
    for line in s3_source:
         print(line.decode('utf8'))

    s3_source.seek(0)  # seek to the beginning
    b1000 = s3_source.read(1000)  # read 1000 bytes

Find smart_open at https://pypi.org/project/smart_open/
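
Note that recent smart_open releases expose open() and deprecate the old smart_open() entry point; a minimal sketch under that assumption, using the same hypothetical key:

from smart_open import open

with open('s3://mybucket/mykey.txt', 'r', encoding='utf-8') as s3_source:
    for line in s3_source:
        print(line.rstrip())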

caffreyd
32

Using the client instead of resource:

import boto3

s3 = boto3.client('s3')
bucket = 'bucket_name'
result = s3.list_objects(Bucket=bucket, Prefix='/something/')
for o in result.get('Contents', []):
    data = s3.get_object(Bucket=bucket, Key=o.get('Key'))
    contents = data['Body'].read()
    print(contents.decode("utf-8"))
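
Note that list_objects returns at most 1,000 keys per call, so for larger prefixes a paginator is safer. A minimal sketch building on the snippet above (same placeholder bucket and prefix):

# page through all matching keys instead of relying on a single response
paginator = s3.get_paginator('list_objects_v2')
for page in paginator.paginate(Bucket=bucket, Prefix='/something/'):
    for o in page.get('Contents', []):
        data = s3.get_object(Bucket=bucket, Key=o['Key'])
        print(data['Body'].read().decode('utf-8'))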
Alex Waygood
Climbs_lika_Spyder
22

If you want to read a file with a configuration other than the default one, either use mpu.aws.s3_read(s3path) directly or copy the code below:

import boto3
import mpu.aws


def s3_read(source, profile_name=None):
    """
    Read a file from an S3 source.

    Parameters
    ----------
    source : str
        Path starting with s3://, e.g. 's3://bucket-name/key/foo.bar'
    profile_name : str, optional
        AWS profile

    Returns
    -------
    content : bytes

    Raises
    ------
    botocore.exceptions.NoCredentialsError
        Botocore is not able to find your credentials. Either specify
        profile_name or add the environment variables AWS_ACCESS_KEY_ID,
        AWS_SECRET_ACCESS_KEY and AWS_SESSION_TOKEN.
        See https://boto3.readthedocs.io/en/latest/guide/configuration.html
    """
    session = boto3.Session(profile_name=profile_name)
    s3 = session.client('s3')
    bucket_name, key = mpu.aws._s3_path_split(source)
    s3_object = s3.get_object(Bucket=bucket_name, Key=key)
    body = s3_object['Body']
    return body.read()
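
Usage is then a one-liner (the path and profile name are placeholders):

content = s3_read('s3://bucket-name/key/foo.bar', profile_name='default')
print(content.decode('utf-8'))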
Martin Thoma
13

If you already know the filename, you can use boto3's built-in download_fileobj:

import boto3

from io import BytesIO

session = boto3.Session()
s3_client = session.client("s3")

f = BytesIO()
s3_client.download_fileobj("bucket_name", "filename", f)
print(f.getvalue())
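
If you want the readlines()-style behaviour from the question, rewind the buffer and split the decoded bytes (a sketch, assuming UTF-8 text):

f.seek(0)  # read() starts at the current position, so rewind first
for line in f.read().decode('utf-8').splitlines():
    print(line)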
reubano
  • `f.seek(0)` is unnecessary with a BytesIO (or StringIO) object. `read` starts at the current position, but `getvalue` always reads from position 0. – Adam Hoelscher May 26 '20 at 20:01
  • Good point @adam. There's the chance that someone will actually need `read` for their use case. I only used `getvalue` for demonstrative purposes. – reubano May 26 '20 at 20:16
0

The best way for me is this:

import json
from decimal import Decimal

import boto3

s3 = boto3.client('s3')
table = boto3.resource('dynamodb').Table('my_table')  # placeholder table name
s3_bucket = 'my_bucket'    # placeholder bucket name
s3_key = 'some/prefix/'    # placeholder key prefix

result = s3.list_objects(Bucket=s3_bucket, Prefix=s3_key)
for file in result.get('Contents', []):
    data = s3.get_object(Bucket=s3_bucket, Key=file.get('Key'))
    contents = data['Body'].read()
    # Float types are not supported with DynamoDB; use Decimal types instead
    j = json.loads(contents, parse_float=Decimal)
    for item in j:
        timestamp = item['timestamp']
        table.put_item(Item={'timestamp': timestamp})

Once you have the content, you can run it through another loop to write it to a DynamoDB table, for instance; see the batch-writer sketch below.
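
For many items, DynamoDB's batch writer reduces request overhead. A minimal sketch that could replace the inner put_item loop above, reusing table and j:

# buffer writes and send them in batches behind the scenes
with table.batch_writer() as batch:
    for item in j:
        batch.put_item(Item={'timestamp': item['timestamp']})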

aerioeus