How to create a Spark data frame from an S3 presigned URL using Databricks?

Published in DataDrivenInvestor · 2 min read · Feb 2, 2022

A pre-signed URL is used to grant temporary access to a specific S3 object. This is widely used when the data owner doesn’t want to grant bucket/folder level access to a large number of downstream users.

An example of a pre-signed URL is:

https://gcbucket.s3.amazonaws.com/chummadata/friends.csv.gz?AWSAccessKeyId=AKRDWVY7VZTAUN42OMQ&Expires=1643776416&Signature=kokaM@kk@8Z4pI%2B8BM3CcE29xqH%2FY%3D

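A pre-signed URL is scoped by three values, all of which are visible in the example above:

Bucket: gcbucket

Key: chummadata/friends.csv.gz (the path of the object inside the bucket)

Expires: 1643776416 (a Unix timestamp, in seconds, after which the URL stops working)

The AWSAccessKeyId parameter (AKRDWVY7VZTAUN42OMQ) identifies the credentials that signed the URL; it is not the object key.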
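To see where these values sit in the URL, it can be taken apart with Python's standard library. This is only a sketch: the function name describe_presigned_url is hypothetical, and it assumes the virtual-hosted URL style shown above (bucket name in front of ".s3.") with the older query format that carries AWSAccessKeyId and Expires.

```python
from datetime import datetime, timezone
from urllib.parse import parse_qs, unquote, urlsplit

EXAMPLE_URL = (
    "https://gcbucket.s3.amazonaws.com/chummadata/friends.csv.gz"
    "?AWSAccessKeyId=AKRDWVY7VZTAUN42OMQ&Expires=1643776416"
    "&Signature=kokaM@kk@8Z4pI%2B8BM3CcE29xqH%2FY%3D"
)


def describe_presigned_url(url):
    """Split a pre-signed URL into bucket, object key, signer, and expiry.

    Hypothetical helper; assumes a virtual-hosted style host such as
    <bucket>.s3.amazonaws.com and an 'Expires' query parameter.
    """
    parts = urlsplit(url)
    query = parse_qs(parts.query)
    return {
        # Everything before ".s3." in the host name is the bucket.
        "bucket": parts.hostname.split(".s3.")[0],
        # The URL path, without the leading slash, is the object key.
        "key": unquote(parts.path.lstrip("/")),
        # Identifies the credentials that signed the URL.
        "access_key_id": query["AWSAccessKeyId"][0],
        # 'Expires' is a Unix timestamp in seconds.
        "expires_at": datetime.fromtimestamp(int(query["Expires"][0]), tz=timezone.utc),
    }


info = describe_presigned_url(EXAMPLE_URL)
print(info["bucket"])      # gcbucket
print(info["key"])         # chummadata/friends.csv.gz
print(info["expires_at"])  # 2022-02-02 04:33:36+00:00
```

Comparing expires_at with datetime.now(timezone.utc) is a quick way to tell whether a URL is already stale before trying to download from it.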

Anyone holding a valid pre-signed URL can perform the operation it was created for, and only that operation. For example, a URL generated for GET (read) cannot be used for PUT (write).

The example script is written using PySpark in Databricks.

Step 1: Use the requests library to download the contents of the S3 object. In the example above, that is friends.csv.gz.

import requests

strS3Url = 'https://gcbucket.s3.amazonaws.com/chummadata/friends.csv.gz?AWSAccessKeyId=AKRDWVY7VZTAUN42OMQ&Expires=1643776416&Signature=kokaM@kk@8Z4pI%2B8BM3CcE29xqH%2FY%3D'
res = requests.get(strS3Url)
res.raise_for_status()  # fail early if the URL has expired or is otherwise rejected

Step 2: Save the contents to a temporary location on the driver node, here /tmp.

# The with block closes the file once the bytes are written
with open('/tmp/friends.csv.gz', 'wb') as f:
    f.write(res.content)
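Steps 1 and 2 hold the whole object in memory (res.content) before writing it out. For larger objects a streamed copy may be preferable. The sketch below is an alternative, not the approach used in this post: it swaps requests for the standard library's urllib.request, and the function name download_to_file and its chunk_size parameter are hypothetical.

```python
import shutil
import urllib.request


def download_to_file(url, dest_path, chunk_size=1024 * 1024):
    """Stream the body at `url` into `dest_path` and return the bytes written.

    Hypothetical helper. urlopen raises urllib.error.HTTPError for HTTP
    error responses, which is what S3 sends for an expired or invalid URL.
    """
    with urllib.request.urlopen(url) as response, open(dest_path, "wb") as out_file:
        # Copy in fixed-size chunks so the object is never fully in memory.
        shutil.copyfileobj(response, out_file, chunk_size)
        return out_file.tell()
```

With the variable from Step 1, the call would look like download_to_file(strS3Url, '/tmp/friends.csv.gz').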

Step 3: Verify the download

%sh ls /tmp/friends.csv.gz
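Listing the file only shows that something was saved. When a pre-signed URL has expired, S3 answers with an XML error document, and that body can end up on disk under the .gz name. A slightly stronger check, sketched here with the standard library, confirms the gzip signature and previews the first rows. The function name preview_gzip_csv is hypothetical, and UTF-8 encoding is an assumption about the data.

```python
import csv
import gzip
import os
from itertools import islice
from typing import List


def preview_gzip_csv(path: str, max_rows: int = 5) -> List[List[str]]:
    """Check that `path` is non-empty gzip data and return its first CSV rows.

    Hypothetical helper; assumes the CSV is UTF-8 encoded.
    """
    if os.path.getsize(path) == 0:
        raise ValueError(f"{path} is empty")
    with open(path, "rb") as raw:
        # Every gzip stream starts with the two bytes 0x1f 0x8b.
        if raw.read(2) != b"\x1f\x8b":
            raise ValueError(f"{path} does not look like gzip data")
    with gzip.open(path, "rt", newline="", encoding="utf-8") as text:
        # islice stops after max_rows without reading the rest of the file.
        return list(islice(csv.reader(text), max_rows))
```

In a notebook cell, preview_gzip_csv('/tmp/friends.csv.gz') would return up to five rows as lists of strings, or raise ValueError if the download was not a gzip file.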

Step 4: Move the downloaded file from the driver's local disk to DBFS (the Databricks File System), so that it can be loaded into a Spark data frame.

dbutils.fs.mv("file:/tmp/friends.csv.gz", "dbfs:/FileStore/tables/friends.csv.gz")

Step 5: Create a data frame. Spark decompresses the .gz file automatically, based on its extension.

df = spark.read.format("csv") \
    .load("dbfs:/FileStore/tables/friends.csv.gz", inferSchema=True)

Step 6: Display the data frame

display(df)

Written by Ganesh Chandrasekaran