How to Handle Bad Data in Spark SQL

Quick tutorial using Databricks to explain different ways to mitigate bad data in Spark SQL.

Ganesh Chandrasekaran
Python in Plain English
4 min read · Oct 31, 2021



Apache Spark SQL offers four different ways to mitigate bad data (a quick sketch of each option follows the list):

  1. Move bad data to another folder.
  2. Allow bad data and flag it.
  3. Drop bad data without loading it to the SQL table.
  4. Trigger an error.
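Here is a minimal sketch of how each of the four options maps onto a setting of the Spark CSV reader. The schema and output paths are illustrative, badRecordsPath is a Databricks-specific option, and the parser modes (PERMISSIVE, DROPMALFORMED, FAILFAST) are standard Spark; treat this as a preview of the techniques, not the tutorial steps themselves.

%python
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

# Illustrative schema matching baddata.csv below; age is declared as an
# integer, so the string "thirtytwo" in row 5 becomes a bad record.
schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True),
])
path = "dbfs:/FileStore/gc-dataset/handle-bad-data/baddata.csv"

# 1. Move bad data to another folder (badRecordsPath is Databricks-specific).
df1 = (spark.read.format("csv").option("header", "true").schema(schema)
       .option("badRecordsPath", "dbfs:/FileStore/gc-dataset/bad-records/")
       .load(path))

# 2. Allow bad data and flag it: PERMISSIVE mode keeps the raw bad row
#    in an extra string column that must be present in the schema.
flag_schema = StructType(schema.fields +
                         [StructField("_corrupt_record", StringType(), True)])
df2 = (spark.read.format("csv").option("header", "true").schema(flag_schema)
       .option("mode", "PERMISSIVE")
       .option("columnNameOfCorruptRecord", "_corrupt_record")
       .load(path))

# 3. Drop bad data without loading it.
df3 = (spark.read.format("csv").option("header", "true").schema(schema)
       .option("mode", "DROPMALFORMED")
       .load(path))

# 4. Trigger an error (the exception surfaces when an action such as
#    df4.show() actually reads the file).
df4 = (spark.read.format("csv").option("header", "true").schema(schema)
       .option("mode", "FAILFAST")
       .load(path))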

Sample data

baddata.csv

id,name,age
1,rachel,30
2,monica,30
3,ross,31
4,chandler,31
5,joey,thirtytwo

In the dataset above, the third column of the fifth row has the wrong datatype: instead of an integer, the age is a string (thirtytwo).

Using Databricks, create a folder named handle-bad-data

%fs mkdirs /FileStore/gc-dataset/handle-bad-data/

Upload the baddata.csv file from the local machine to this folder.
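If you prefer to create the file from a notebook instead of the upload UI, dbutils.fs.put can write the sample data straight into DBFS (an alternative sketch; dbutils is available in Databricks notebooks):

%python
# Alternative to the UI upload: write the sample CSV directly into DBFS.
dbutils.fs.put(
    "/FileStore/gc-dataset/handle-bad-data/baddata.csv",
    """id,name,age
1,rachel,30
2,monica,30
3,ross,31
4,chandler,31
5,joey,thirtytwo
""",
    True,  # overwrite if the file already exists
)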

Verify the upload

%fs ls /FileStore/gc-dataset/handle-bad-data/baddata.csv

Read the contents of the CSV file using Python

%python
df = spark.read.text("dbfs:/FileStore/gc-dataset/handle-bad-data/baddata.csv")
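Because spark.read.text loads each line as a single string column named value, every row, including the malformed fifth one, comes through untouched. A quick, illustrative way to inspect what was read:

%python
# Show every raw line of the file, including the bad row, without truncation.
df.show(truncate=False)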

