How to Handle Bad Data in Spark SQL
Quick tutorial using Databricks to explain different ways to mitigate bad data in Spark SQL.
4 min read · Oct 31, 2021
Apache Spark SQL offers four ways to handle bad data:
- Move bad data to another folder (the `badRecordsPath` option).
- Allow bad data and flag it (`PERMISSIVE` mode, the default).
- Drop bad data without loading it into the SQL table (`DROPMALFORMED` mode).
- Trigger an error and abort the load (`FAILFAST` mode).
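Spark applies the last three strategies through the `mode` option of its CSV reader. Since those modes only make sense on a running cluster, here is a minimal pure-Python sketch (not Spark, and the function name is illustrative) of how each mode treats a row whose age column fails to parse:

```python
# Pure-Python sketch of Spark's CSV "mode" behaviors; parse_rows is an
# illustrative name, not a Spark API.
def parse_rows(rows, mode="PERMISSIVE"):
    """Parse 'id,name,age' rows; age must be an integer."""
    parsed = []
    for row in rows:
        id_, name, age = row.split(",")
        try:
            parsed.append((int(id_), name, int(age)))
        except ValueError:
            if mode == "PERMISSIVE":
                # Keep the row but flag it by carrying the raw text
                # (Spark stores it in a _corrupt_record column).
                parsed.append((None, None, None, row))
            elif mode == "DROPMALFORMED":
                continue  # silently drop the bad row
            elif mode == "FAILFAST":
                raise  # re-raise: abort the whole load
    return parsed

rows = ["1,rachel,30", "5,joey,thirtytwo"]
print(parse_rows(rows, "DROPMALFORMED"))  # → [(1, 'rachel', 30)]
```

With `PERMISSIVE` the bad row survives as a flagged record, and with `FAILFAST` the same input raises an error instead.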
Sample data
baddata.csv
id,name,age
1,rachel,30
2,monica,30
3,ross,31
4,chandler,31
5,joey,thirtytwo
In the dataset above, the third column of the fifth row has the wrong datatype: the age is the string thirtytwo instead of an integer.
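The failure is easy to reproduce in plain Python, since that age value simply cannot be cast to an integer:

```python
# The age value from row 5 fails the integer cast that the other rows pass
try:
    age = int("thirtytwo")
except ValueError as e:
    print(e)  # → invalid literal for int() with base 10: 'thirtytwo'
```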
Using Databricks, create a folder handle-bad-data:
%fs mkdirs /FileStore/gc-dataset/handle-bad-data/
Upload the baddata.csv file from your local machine to this folder.
Verify the upload:
%fs ls /FileStore/gc-dataset/handle-bad-data/baddata.csv
Read the contents of the CSV file using Python:
%python
df = spark.read.text("dbfs:/FileStore/gc-dataset/handle-bad-data/baddata.csv")