Data Cleaning and Transformation Using SQL and Apache Spark: A Complete Guide with Scala Examples

Introduction

In today’s data-driven world, raw datasets are rarely ready for analysis. They often contain missing values, duplicates, inconsistent formats, or irrelevant records. This is why data cleaning and data transformation are crucial before performing analytics or building machine learning models.

In this guide, we’ll explore how to use SQL for traditional structured data cleaning and Apache Spark (Scala) for distributed big data processing. You’ll learn key techniques, best practices, and see hands-on code examples.

Why Data Cleaning and Transformation Matter

Data cleaning and transformation ensure that your dataset is:

  • Accurate – Free of errors and outdated records

  • Consistent – Following standard formats across fields

  • Efficient – Reduced in size without losing value

  • Machine-Ready – Prepared for analytics or modeling

Without these steps, your insights could be misleading and your models could underperform.

Part 1 – Data Cleaning and Transformation with SQL

When working with structured data in relational databases, SQL is one of the most powerful and accessible tools.

Common Data Cleaning in SQL

1. Removing Duplicates

-- Keep the first row per email; note that rowid is engine-specific (e.g., SQLite, Oracle),
-- so other databases may need a ROW_NUMBER()-based approach instead
DELETE FROM customers
WHERE rowid NOT IN (
  SELECT MIN(rowid)
  FROM customers
  GROUP BY email
);

2. Handling Missing Values

UPDATE sales
SET revenue = 0
WHERE revenue IS NULL;

3. Standardizing Formats

UPDATE customers
SET phone_number = REPLACE(phone_number, '-', '')
WHERE phone_number LIKE '%-%';

4. Filtering Invalid Data

DELETE FROM orders
WHERE order_date < '2000-01-01';

Common Data Transformation in SQL

Aggregation Example:

SELECT product_id, SUM(quantity) AS total_sold
FROM sales
GROUP BY product_id;

Joining Tables:

SELECT c.customer_name, o.order_id
FROM customers c
JOIN orders o ON c.customer_id = o.customer_id;

Derived Columns:

SELECT order_id, price * quantity AS total_price
FROM orders;

💡 When to use SQL:

  • The dataset is small to medium-sized

  • Stored in relational databases

  • You need quick transformations without distributed computing

Part 2 – Data Cleaning and Transformation with Apache Spark (Scala)

When working with massive datasets stored across multiple machines, Apache Spark is the tool of choice. It supports large-scale distributed processing and integrates with various big data systems like HDFS, Hive, and Kafka.

Here’s how to clean and transform data using Scala in Spark.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object DataCleaningTransformation {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("DataCleaningTransformation")
      .master("local[*]")
      .getOrCreate()

    val df = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("sales.csv")

    // Remove duplicates
    val dedupedDF = df.dropDuplicates()

    // Handle missing values
    val filledDF = dedupedDF.na.fill(Map("revenue" -> 0))

    // Standardize string formats
    val trimmedDF = filledDF.withColumn("product_name", trim(col("product_name")))

    // Filter invalid records
    val validDF = trimmedDF.filter(col("order_date") >= "2000-01-01")

    // Aggregation
    val salesSummary = validDF.groupBy("product_id")
      .agg(sum("quantity").alias("total_sold"))

    // Derived column
    val withTotalPrice = validDF.withColumn("total_price", expr("price * quantity"))

    // Join with customer data
    val customersDF = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("customers.csv")

    val ordersWithCustomers = withTotalPrice.join(customersDF, "customer_id")

    // Show final result
    ordersWithCustomers.show()
  }
}
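
The example above reads a local CSV file for simplicity. On a cluster, the same pipeline would typically read from a distributed store such as HDFS or a Hive table, as mentioned earlier. Below is a minimal sketch of both options; the HDFS path and the Hive database/table names are placeholders, and the Hive variant assumes the SparkSession was built with .enableHiveSupport().

// Reading the same sales data from an HDFS path (placeholder URI)
val hdfsSalesDF = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("hdfs://namenode:8020/data/sales/sales.csv")

// Reading from a Hive table (placeholder names; requires .enableHiveSupport())
val hiveSalesDF = spark.table("analytics.sales")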

SQL vs. Apache Spark – Which Should You Use?

Feature      | SQL (Databases)        | Apache Spark
Data Size    | Small to Medium        | Medium to Very Large
Processing   | Single Machine         | Distributed
Ease of Use  | Easy for analysts      | Requires programming skills
Integration  | Works with RDBMS       | Works with big data tools
Best For     | Quick queries, reports | Heavy processing at scale
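
The two approaches are not mutually exclusive. Spark ships with a SQL engine, so a DataFrame can be registered as a temporary view and queried with the same SQL syntax used in Part 1. Here is a minimal sketch, assuming the validDF DataFrame from the Spark example above:

// Register the cleaned DataFrame as a temporary view
validDF.createOrReplaceTempView("sales_clean")

// Run the Part 1 aggregation with SQL syntax inside Spark
val salesSummarySQL = spark.sql(
  """SELECT product_id, SUM(quantity) AS total_sold
    |FROM sales_clean
    |GROUP BY product_id""".stripMargin)

salesSummarySQL.show()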

Best Practices for Data Cleaning & Transformation

  1. Profile your data first – Identify missing, inconsistent, and duplicate records (see the Spark sketch after this list).

  2. Automate processes – Use scripts or ETL workflows to avoid manual repetition.

  3. Document every change – Keep logs for auditing and reproducibility.

  4. Validate results – Compare outputs before and after transformation.

  5. Leverage the right tool for the right job – SQL for fast ad-hoc queries, Spark for big data.
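
As a starting point for steps 1 and 4, the sketch below profiles a DataFrame (null counts per column, duplicate rows) and validates a transformation by comparing record counts. It is illustrative rather than exhaustive and reuses the df and dedupedDF values from the Spark example above.

// Profile: count null values in every column
val nullCounts = df.select(
  df.columns.map(c => sum(when(col(c).isNull, 1).otherwise(0)).alias(c)): _*
)
nullCounts.show()

// Profile: count fully duplicated rows
val duplicateRows = df.count() - df.dropDuplicates().count()
println(s"Duplicate rows: $duplicateRows")

// Validate: compare record counts before and after deduplication
println(s"Rows before: ${df.count()}, rows after: ${dedupedDF.count()}")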

Conclusion

Both SQL and Apache Spark are powerful for data cleaning and transformation, but they serve different purposes. SQL excels in quick, structured data tasks, while Spark dominates large-scale distributed processing.

By learning both, you’ll be able to handle everything from cleaning a small customer database to processing billions of streaming records in real time.
