How to create a Spark data frame from an S3 presigned URL using Databricks?
A pre-signed URL grants temporary access to a specific S3 object. It is commonly used when the data owner doesn’t want to grant bucket-level or folder-level access to a large number of downstream users.
An example of a pre-signed URL is:
https://gcbucket.s3.amazonaws.com/chummadata/friends.csv.gz?AWSAccessKeyId=AKRDWVY7VZTAUN42OMQ&Expires=1643776416&Signature=kokaM@kk@8Z4pI%2B8BM3CcE29xqH%2FY%3D
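Three values in the URL limit what the user can access. The Signature parameter lets S3 verify that none of them has been altered.
Bucket: gcbucket
Access key ID (AWSAccessKeyId): AKRDWVY7VZTAUN42OMQ
Expires: 1643776416 (Unix epoch time, in seconds)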
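As a rough illustration, the values listed above can be read straight out of the URL with Python's standard library. The helper name describe_presigned_url is hypothetical. The sketch assumes the virtual-hosted URL style shown in the example (bucket name as the first label of the host) and the older query-string signing parameters (AWSAccessKeyId, Expires); URLs signed in other ways carry different parameter names.

```python
from datetime import datetime, timezone
from urllib.parse import parse_qs, unquote, urlsplit


def describe_presigned_url(url: str) -> dict:
    """Split a pre-signed S3 URL (hypothetical helper) into its visible parts."""
    parts = urlsplit(url)
    query = parse_qs(parts.query)
    return {
        # Assumption: virtual-hosted style, so the bucket is the first host label.
        "bucket": parts.netloc.split(".")[0],
        # The object key is the URL path without the leading slash.
        "object_key": unquote(parts.path.lstrip("/")),
        "access_key_id": query["AWSAccessKeyId"][0],
        # Expires is a Unix timestamp in seconds; convert it to a UTC datetime.
        "expires_utc": datetime.fromtimestamp(int(query["Expires"][0]), tz=timezone.utc),
    }


example_url = (
    "https://gcbucket.s3.amazonaws.com/chummadata/friends.csv.gz"
    "?AWSAccessKeyId=AKRDWVY7VZTAUN42OMQ&Expires=1643776416"
    "&Signature=kokaM@kk@8Z4pI%2B8BM3CcE29xqH%2FY%3D"
)
info = describe_presigned_url(example_url)
for name, value in info.items():
    print(f"{name}: {value}")
```

Comparing expires_utc with the current time before making the request is a cheap way to tell an expired URL apart from other failures.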
Anyone with a valid pre-signed URL can interact with the object, but only in the way specified when the URL was created. For example, a URL pre-signed for GET (read) cannot be used for PUT (write).
The example script is written using PySpark in Databricks.
Step 1: Use the requests library to download the contents of the S3 object. In the example above, the object is friends.csv.gz.
import requests
strS3Url = 'https://gcbucket.s3.amazonaws.com/chummadata/friends.csv.gz?AWSAccessKeyId=AKRDWVY7VZTAUN42OMQ&Expires=1643776416&Signature=kokaM@kk@8Z4pI%2B8BM3CcE29xqH%2FY%3D'
res = requests.get(strS3Url)
res.raise_for_status()  # stop here if S3 rejects the request, for example because the URL has expired
Step 2: Save the contents to a temporary folder on the driver node. In this example, the file is stored under /tmp.
with open('/tmp/friends.csv.gz', 'wb') as f:
    f.write(res.content)  # the with block closes the file, so all bytes are flushed to disk
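Reading res.content holds the whole object in driver memory before anything is written to disk. For larger files, a streaming copy may be preferable. The sketch below is one possible alternative, not part of the original script: it uses the standard library's urllib instead of requests, and the function name download_to_file and its chunk_size default are hypothetical.

```python
import shutil
import urllib.request


def download_to_file(url: str, dest_path: str, chunk_size: int = 1024 * 1024) -> int:
    """Stream the body at `url` to `dest_path` and return the number of bytes written.

    urlopen raises urllib.error.HTTPError on 4xx/5xx responses, so a rejected
    pre-signed URL fails here instead of producing a file.
    """
    with urllib.request.urlopen(url) as response, open(dest_path, "wb") as out:
        # Copy in fixed-size chunks so only one chunk is in memory at a time.
        shutil.copyfileobj(response, out, chunk_size)
        return out.tell()


# Intended usage on the driver (not run here because it needs network access):
# download_to_file(strS3Url, "/tmp/friends.csv.gz")
```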
Step 3: Verify the download
%sh ls /tmp/friends.csv.gz
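A listing only confirms that a file exists. If the HTTP status was never checked, a rejected request (for example, one made with an expired URL) can leave an S3 error document saved under the .gz name. A slightly stronger check, sketched here with a hypothetical helper named looks_like_gzip, is to try decompressing the first bytes of the file.

```python
import gzip
import os
import zlib


def looks_like_gzip(path: str) -> bool:
    """Return True if `path` is a non-empty file whose start decompresses as gzip."""
    if not os.path.isfile(path) or os.path.getsize(path) == 0:
        return False
    try:
        with gzip.open(path, "rb") as fh:
            fh.read(1024)  # decompress a small prefix; invalid gzip data raises here
        return True
    except (OSError, EOFError, zlib.error):
        return False


# Intended usage on the driver:
# looks_like_gzip("/tmp/friends.csv.gz")
```

This checks only the beginning of the file, so it would not detect a download that was truncated partway through.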
Step 4: Move the downloaded file from the driver's local disk to DBFS so that it can be loaded into a Spark data frame.
dbutils.fs.mv("file:/tmp/friends.csv.gz", "dbfs:/FileStore/tables/friends.csv.gz")
Step 5: Create the data frame
df = spark.read.format("csv") \
    .load("dbfs:/FileStore/tables/friends.csv.gz", inferSchema=True)
Step 6: Display the data frame
display(df)