Data Cleaning and Transformation Using SQL and Apache Spark: A Complete Guide with Scala Examples

Introduction

In today’s data-driven world, raw datasets are rarely ready for analysis. They often contain missing values, duplicates, inconsistent formats, or irrelevant records. This is why data cleaning and data transformation are crucial before performing analytics or building machine learning models.

In this guide, we’ll explore how to use SQL for traditional structured data cleaning and Apache Spark (Scala) for distributed big data processing. You’ll learn key techniques, best practices, and see hands-on code examples.

Why Data Cleaning and Transformation Matter

Data cleaning and transformation ensure that your dataset is:

  • Accurate – Free of errors and outdated records

  • Consistent – Following standard formats across fields

  • Efficient – Reduced in size without losing value

  • Machine-Ready – Prepared for analytics or modeling

Without these steps, your insights could be misleading and your models could underperform.

Part 1 – Data Cleaning and Transformation with SQL

When working with structured data in relational databases, SQL is one of the most powerful and accessible tools.

Common Data Cleaning in SQL

1. Removing Duplicates

-- Keep the first row per email; note that rowid is engine-specific (e.g., SQLite, Oracle),
-- so other databases may need a ROW_NUMBER()-based approach instead
DELETE FROM customers
WHERE rowid NOT IN (
  SELECT MIN(rowid)
  FROM customers
  GROUP BY email
);

2. Handling Missing Values

UPDATE sales
SET revenue = 0
WHERE revenue IS NULL;

3. Standardizing Formats

UPDATE customers
SET phone_number = REPLACE(phone_number, '-', '')
WHERE phone_number LIKE '%-%';

4. Filtering Invalid Data

DELETE FROM orders
WHERE order_date < '2000-01-01';

Common Data Transformation in SQL

Aggregation Example:

SELECT product_id, SUM(quantity) AS total_sold
FROM sales
GROUP BY product_id;

Joining Tables:

SELECT c.customer_name, o.order_id
FROM customers c
JOIN orders o ON c.customer_id = o.customer_id;

Derived Columns:

SELECT order_id, price * quantity AS total_price
FROM orders;

💡 When to use SQL:

  • The dataset is small to medium-sized

  • Stored in relational databases

  • You need quick transformations without distributed computing

Part 2 – Data Cleaning and Transformation with Apache Spark (Scala)

When working with massive datasets stored across multiple machines, Apache Spark is the tool of choice. It supports large-scale distributed processing and integrates with various big data systems like HDFS, Hive, and Kafka.

Here’s how to clean and transform data using Scala in Spark.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object DataCleaningTransformation {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("DataCleaningTransformation")
      .master("local[*]")
      .getOrCreate()

    val df = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("sales.csv")

    // Remove duplicates
    val dedupedDF = df.dropDuplicates()

    // Handle missing values
    val filledDF = dedupedDF.na.fill(Map("revenue" -> 0))

    // Standardize string formats
    val trimmedDF = filledDF.withColumn("product_name", trim(col("product_name")))

    // Filter invalid records
    val validDF = trimmedDF.filter(col("order_date") >= "2000-01-01")

    // Aggregation
    val salesSummary = validDF.groupBy("product_id")
      .agg(sum("quantity").alias("total_sold"))

    // Derived column
    val withTotalPrice = validDF.withColumn("total_price", expr("price * quantity"))

    // Join with customer data
    val customersDF = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("customers.csv")

    val ordersWithCustomers = withTotalPrice.join(customersDF, "customer_id")

    // Show final result
    ordersWithCustomers.show()
  }
}
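
The example above reads a local CSV file for simplicity. On a cluster, the same pipeline would typically read from a distributed store such as HDFS or a Hive table, as mentioned earlier. Below is a minimal sketch of both options; the HDFS path and the Hive database/table names are placeholders, and the Hive variant assumes the SparkSession was built with .enableHiveSupport().

// Reading the same sales data from an HDFS path (placeholder URI)
val hdfsSalesDF = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("hdfs://namenode:8020/data/sales/sales.csv")

// Reading from a Hive table (placeholder names; requires .enableHiveSupport())
val hiveSalesDF = spark.table("analytics.sales")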

SQL vs. Apache Spark – Which Should You Use?

Feature      | SQL (Databases)        | Apache Spark
Data Size    | Small to Medium        | Medium to Very Large
Processing   | Single Machine         | Distributed
Ease of Use  | Easy for analysts      | Requires programming skills
Integration  | Works with RDBMS       | Works with big data tools
Best For     | Quick queries, reports | Heavy processing at scale
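
The two approaches are not mutually exclusive. Spark ships with a SQL engine, so a DataFrame can be registered as a temporary view and queried with the same SQL syntax used in Part 1. Here is a minimal sketch, assuming the validDF DataFrame from the Spark example above:

// Register the cleaned DataFrame as a temporary view
validDF.createOrReplaceTempView("sales_clean")

// Run the Part 1 aggregation with SQL syntax inside Spark
val salesSummarySQL = spark.sql(
  """SELECT product_id, SUM(quantity) AS total_sold
    |FROM sales_clean
    |GROUP BY product_id""".stripMargin)

salesSummarySQL.show()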

Best Practices for Data Cleaning & Transformation

  1. Profile your data first – Identify missing, inconsistent, and duplicate records (see the Spark sketch after this list).

  2. Automate processes – Use scripts or ETL workflows to avoid manual repetition.

  3. Document every change – Keep logs for auditing and reproducibility.

  4. Validate results – Compare outputs before and after transformation.

  5. Leverage the right tool for the right job – SQL for fast ad-hoc queries, Spark for big data.
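
As a starting point for steps 1 and 4, the sketch below profiles a DataFrame (null counts per column, duplicate rows) and validates a transformation by comparing record counts. It is illustrative rather than exhaustive and reuses the df and dedupedDF values from the Spark example above.

// Profile: count null values in every column
val nullCounts = df.select(
  df.columns.map(c => sum(when(col(c).isNull, 1).otherwise(0)).alias(c)): _*
)
nullCounts.show()

// Profile: count fully duplicated rows
val duplicateRows = df.count() - df.dropDuplicates().count()
println(s"Duplicate rows: $duplicateRows")

// Validate: compare record counts before and after deduplication
println(s"Rows before: ${df.count()}, rows after: ${dedupedDF.count()}")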

Conclusion

Both SQL and Apache Spark are powerful for data cleaning and transformation, but they serve different purposes. SQL excels in quick, structured data tasks, while Spark dominates large-scale distributed processing.

By learning both, you’ll be able to handle everything from cleaning a small customer database to processing billions of streaming records in real time.
