
I've just started to experiment with AWS SageMaker and would like to load data from an S3 bucket into a pandas dataframe in my SageMaker Python Jupyter notebook for analysis.

I could use boto to grab the data from S3, but I'm wondering whether there is a more elegant method as part of the SageMaker framework to do this in my Python code?

Thanks in advance for any advice.

A555h55

8 Answers

import boto3
import pandas as pd
from sagemaker import get_execution_role

role = get_execution_role()      # IAM role of the notebook instance
bucket = 'my-bucket'             # your S3 bucket name
data_key = 'train.csv'           # key (path) of the object inside the bucket
data_location = 's3://{}/{}'.format(bucket, data_key)

pd.read_csv(data_location)
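
Note that pandas reads an s3:// path like this through the optional s3fs dependency. If you would rather go through boto3 explicitly (as the question mentions), a minimal sketch using the same bucket and data_key variables:

import boto3
import pandas as pd

s3 = boto3.client('s3')

# get_object returns a streaming body that pandas can read directly
obj = s3.get_object(Bucket=bucket, Key=data_key)
df = pd.read_csv(obj['Body'])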
Chhoser

In the simplest case you don't need boto3 at all, because you are only reading a file.
Then it's even simpler:

import pandas as pd

bucket='my-bucket'
data_key = 'train.csv'
data_location = 's3://{}/{}'.format(bucket, data_key)

pd.read_csv(data_location)

But as Prateek stated, make sure to configure your SageMaker notebook instance to have access to S3. This is done at the configuration step under Permissions > IAM role.

ivankeller
  • With that solution you avoid the credential headache, it's exactly what I was looking for, thank you. – Iakovos Belonias Jan 13 '20 at 18:40
  • I'm getting either a timeout or an Access Denied -- I have a folder between the file and bucket, so added that to end of bucket or begin of file -- I'm using root access, and don't think I have any protection on this bucket ? Does this (execution role) require an IAM? – Zach Oakes Jun 17 '20 at 14:21
  • Got it -- removing execution_role() fixed it -- great call. I was hoping something like this was available : ) – Zach Oakes Jun 17 '20 at 14:27

If you have a look here, it seems you can specify this in the InputDataConfig. Search for "S3DataSource" (ref) in the document; the first hit is even in Python, on page 25/26.
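
For reference, a rough sketch of what such an S3DataSource block can look like when creating a training job through boto3; the bucket, prefix, and channel name below are placeholders, not values from the linked document:

import boto3

sm = boto3.client('sagemaker')

# One input channel pointing the training job at a CSV prefix in S3
input_data_config = [
    {
        'ChannelName': 'train',
        'DataSource': {
            'S3DataSource': {
                'S3DataType': 'S3Prefix',
                'S3Uri': 's3://my-bucket/train/',
                'S3DataDistributionType': 'FullyReplicated',
            }
        },
        'ContentType': 'text/csv',
    }
]
# ...later passed as InputDataConfig=input_data_config to sm.create_training_job(...)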

Jonatan

You could also access your bucket as a file system using s3fs:

import s3fs
from PIL import Image                # Pillow, for opening the image file
from IPython.display import display  # to render it inline in the notebook

fs = s3fs.S3FileSystem()

# List 5 files in your accessible bucket
fs.ls('s3://bucket-name/data/')[:5]

# Open a file directly
with fs.open('s3://bucket-name/data/image.png') as f:
    display(Image.open(f))
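
Since the question is about pandas, the same file-object interface can also feed read_csv; a minimal sketch reusing the fs object above (the path is a placeholder):

import pandas as pd

# Read a CSV from S3 through the s3fs file object
with fs.open('s3://bucket-name/data/train.csv', 'rb') as f:
    df = pd.read_csv(f)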
CircleOnCircles
  • What are the advantages / disadvantages over the other way, I wonder – Hack-R Jun 06 '19 at 15:04
  • @Hack-R The pro is that you are able to use the Python file pointer interface/object throughout the code. The con is that this object operates per file, which might not be performance efficient. – CircleOnCircles Jun 14 '19 at 04:37
  • @Ben Thanks for this answer; however it's not working for me. I'm getting this error: `AttributeError: type object 'Image' has no attribute 'open'`. Can you share what library you're using for `Image` or any other details? Thanks! – Mabyn Jan 23 '20 at 19:38
  • Never mind, I just figured it out: `from IPython.display import display; from PIL import Image`. After that, the above worked great. Thanks! – Mabyn Jan 23 '20 at 19:48

Do make sure the Amazon SageMaker role has a policy attached to it granting access to S3. This can be done in IAM.
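
For example, attaching the AWS managed AmazonS3ReadOnlyAccess policy to the notebook's execution role is enough for reading. A sketch of doing that through boto3, assuming you have IAM permissions and with a hypothetical role name (in practice this is usually done from the IAM console):

import boto3

iam = boto3.client('iam')

# Attach an AWS managed read-only S3 policy to the execution role (role name is a placeholder)
iam.attach_role_policy(
    RoleName='AmazonSageMaker-ExecutionRole-example',
    PolicyArn='arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess',
)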

Prateek Dubey

You can also use AWS Data Wrangler https://github.com/awslabs/aws-data-wrangler:

import awswrangler as wr

df = wr.s3.read_csv(path="s3://...")
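
Writing a dataframe back to S3 works the same way, for example (the path is a placeholder):

# Write the dataframe back to S3 as a CSV
wr.s3.to_csv(df=df, path="s3://my-bucket/output/df.csv", index=False)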
ivankeller

A similar answer, but using an f-string.

import pandas as pd
bucket = 'your-bucket-name'
file = 'file.csv'
df = pd.read_csv(f"s3://{bucket}/{file}")
len(df)  # show the number of rows
Abu Shoeb

This code sample imports a CSV file from S3; it was tested in a SageMaker notebook.

Use pip or conda to install s3fs: `!pip install s3fs`

import pandas as pd
from sagemaker import get_execution_role

role = get_execution_role()  # IAM execution role of the notebook instance

my_bucket = ''          # declare bucket name
my_file = 'aa/bb.csv'   # declare file path

data_location = 's3://{}/{}'.format(my_bucket, my_file)
data = pd.read_csv(data_location)
data.head(2)
Partha Sen