How to Handle Bad Data in Spark SQL

Quick tutorial using Databricks to explain different ways to mitigate bad data in Spark SQL.

Ganesh Chandrasekaran
Python in Plain English
4 min read · Oct 31, 2021



Apache Spark SQL offers four different ways to mitigate bad data (a quick sketch of each option follows the list):

  1. Move bad data to another folder.
  2. Allow bad data and flag it.
  3. Drop bad data without loading it to the SQL table.
  4. Trigger an error.
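Here is a minimal sketch of how each of the four options maps onto a setting of the Spark CSV reader. The schema and output paths are illustrative, badRecordsPath is a Databricks-specific option, and the parser modes (PERMISSIVE, DROPMALFORMED, FAILFAST) are standard Spark; treat this as a preview of the techniques, not the tutorial steps themselves.

%python
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

# Illustrative schema matching baddata.csv below; age is declared as an
# integer, so the string "thirtytwo" in row 5 becomes a bad record.
schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True),
])
path = "dbfs:/FileStore/gc-dataset/handle-bad-data/baddata.csv"

# 1. Move bad data to another folder (badRecordsPath is Databricks-specific).
df1 = (spark.read.format("csv").option("header", "true").schema(schema)
       .option("badRecordsPath", "dbfs:/FileStore/gc-dataset/bad-records/")
       .load(path))

# 2. Allow bad data and flag it: PERMISSIVE mode keeps the raw bad row
#    in an extra string column that must be present in the schema.
flag_schema = StructType(schema.fields +
                         [StructField("_corrupt_record", StringType(), True)])
df2 = (spark.read.format("csv").option("header", "true").schema(flag_schema)
       .option("mode", "PERMISSIVE")
       .option("columnNameOfCorruptRecord", "_corrupt_record")
       .load(path))

# 3. Drop bad data without loading it.
df3 = (spark.read.format("csv").option("header", "true").schema(schema)
       .option("mode", "DROPMALFORMED")
       .load(path))

# 4. Trigger an error (the exception surfaces when an action such as
#    df4.show() actually reads the file).
df4 = (spark.read.format("csv").option("header", "true").schema(schema)
       .option("mode", "FAILFAST")
       .load(path))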

Sample data

baddata.csv

id,name,age
1,rachel,30
2,monica,30
3,ross,31
4,chandler,31
5,joey,thirtytwo

In the dataset above, the third column of the fifth row has the wrong datatype: instead of an integer, the age is a string (thirtytwo).

Using Databricks, create a folder named handle-bad-data

%fs mkdirs /FileStore/gc-dataset/handle-bad-data/

Upload the baddata.csv file from the local machine to this folder.
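If you prefer to create the file from a notebook instead of the upload UI, dbutils.fs.put can write the sample data straight into DBFS (an alternative sketch; dbutils is available in Databricks notebooks):

%python
# Alternative to the UI upload: write the sample CSV directly into DBFS.
dbutils.fs.put(
    "/FileStore/gc-dataset/handle-bad-data/baddata.csv",
    """id,name,age
1,rachel,30
2,monica,30
3,ross,31
4,chandler,31
5,joey,thirtytwo
""",
    True,  # overwrite if the file already exists
)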

Verify the upload

%fs ls /FileStore/gc-dataset/handle-bad-data/baddata.csv

Read the contents of the CSV file using Python

%python
df = spark.read.text("dbfs:/FileStore/gc-dataset/handle-bad-data/baddata.csv")
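Because spark.read.text loads each line as a single string column named value, every row, including the malformed fifth one, comes through untouched. A quick, illustrative way to inspect what was read:

%python
# Show every raw line of the file, including the bad row, without truncation.
df.show(truncate=False)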

