Ilesh Darji


How to Automate ETL Testing Using Python and Pandas

Introduction

ETL testing verifies that data stays accurate, consistent, and complete as it moves through a pipeline. Running these checks by hand is slow and error-prone. Automating them with Python and Pandas makes validation repeatable and surfaces anomalies early.

This post covers:

  • How to use Python and Pandas for ETL testing.
  • Automating common validation checks.
  • Writing reusable test scripts for large datasets.

1️⃣ Setting Up Your Environment

Install Required Libraries

Ensure you have Pandas installed:

pip install pandas

If you are working with SQL databases (the example in this post connects to MySQL through the PyMySQL driver):

pip install sqlalchemy pymysql

Load Source and Target Data

import pandas as pd

source_data = pd.read_csv("source_data.csv")
target_data = pd.read_csv("target_data.csv")

If using a database:

from sqlalchemy import create_engine

# Placeholder connection string: replace user, password, host, and db with your own values.
engine = create_engine("mysql+pymysql://user:password@host/db")
source_data = pd.read_sql("SELECT * FROM source_table", engine)
target_data = pd.read_sql("SELECT * FROM target_table", engine)

2️⃣ Automating Data Completeness Validation

Check Row Counts

assert len(source_data) == len(target_data), "Row count mismatch between source and target"

Check for Missing Values

missing_values = target_data.isnull().sum()
print("Missing values:")
print(missing_values[missing_values > 0])
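The row-count and missing-value checks can also be folded into one helper that reports every problem instead of stopping at the first failed assert. This is a minimal sketch; `completeness_report` and the sample frames are hypothetical names used only for illustration:

```python
import pandas as pd


def completeness_report(source: pd.DataFrame, target: pd.DataFrame) -> dict:
    """Summarize row-count and missing-value findings (hypothetical helper)."""
    missing = target.isnull().sum()
    return {
        "source_rows": len(source),
        "target_rows": len(target),
        "row_count_match": len(source) == len(target),
        # Keep only the columns that actually contain nulls.
        "missing_by_column": missing[missing > 0].to_dict(),
    }


# Illustrative data only, not taken from a real pipeline.
source_demo = pd.DataFrame({"id": [1, 2, 3], "price": [10.0, 20.0, 30.0]})
target_demo = pd.DataFrame({"id": [1, 2, 3], "price": [10.0, None, 30.0]})

report = completeness_report(source_demo, target_demo)
print(report)
```

Returning a dictionary rather than asserting makes it easier to log all findings from a single run.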

3️⃣ Automating Data Integrity Checks

Check for Duplicates

# duplicated() with no arguments flags only rows that match an earlier row in every column.
duplicates = target_data[target_data.duplicated()]
print("Duplicate records found:", len(duplicates))

Check Primary Key Integrity

assert source_data["id"].is_unique, "Source data contains duplicate primary keys"
assert target_data["id"].is_unique, "Target data contains duplicate primary keys"
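When a key check fails, it usually helps to see the offending rows. The sketch below lists rows that share a key and rows whose key is null; a single null key can pass `is_unique`, so it is worth checking separately. The sample frame and its `name` column are made up for illustration:

```python
import pandas as pd

# Illustrative data only; "id" is the key column used throughout this post.
target_demo = pd.DataFrame(
    {"id": [1, 2, 2, None], "name": ["a", "b", "c", "d"]}
)

has_key = target_demo["id"].notna()
# keep=False marks every row that shares a key, not just the later copies.
shares_key = target_demo["id"].duplicated(keep=False)

duplicate_keys = target_demo[has_key & shares_key]
null_keys = target_demo[~has_key]

print("Rows sharing a key:", len(duplicate_keys))
print("Rows with a null key:", len(null_keys))
```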

4️⃣ Automating Data Transformation Validation

Example: Check a Derived Column Calculation

If the target column discounted_price should be price * 0.9, validate it:

import numpy as np
expected_discounted_price = target_data["price"] * 0.9
# Compare with a tolerance: exact == on floats can fail because of rounding differences.
assert np.isclose(target_data["discounted_price"], expected_discounted_price).all(), "Transformation error detected"
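A failed assert only says that something is wrong. To see which rows break the rule, one option is to build a boolean mask with a tolerance and select the mismatches. This is a sketch under assumptions: the sample values and the `1e-6` absolute tolerance are illustrative choices, not requirements of the pipeline:

```python
import numpy as np
import pandas as pd

# Illustrative data only; the last row is deliberately wrong.
target_demo = pd.DataFrame(
    {
        "id": [1, 2, 3],
        "price": [10.0, 19.99, 30.0],
        "discounted_price": [9.0, 17.991, 28.0],
    }
)

expected = target_demo["price"] * 0.9
# Tolerance-based comparison avoids false alarms from floating-point rounding.
matches = np.isclose(target_demo["discounted_price"], expected, rtol=0, atol=1e-6)
mismatches = target_demo.loc[~matches, ["id", "price", "discounted_price"]]

print(mismatches)
```

Choose the tolerance to match the precision the target column is stored with, for example half a cent for currency rounded to two decimals.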

Example: Compare Aggregated Values

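import math
source_total = source_data["revenue"].sum()
target_total = target_data["revenue"].sum()
# Floating-point sums can differ in the last digits, so compare with a tolerance.
assert math.isclose(source_total, target_total, rel_tol=1e-9), "Revenue mismatch detected"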
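A grand total can match even when errors in different groups cancel out, so comparing totals per group narrows down where a mismatch comes from. The sketch below assumes a hypothetical `region` column and an illustrative tolerance of 0.01:

```python
import pandas as pd

# Illustrative data only; "region" is a hypothetical grouping column.
source_demo = pd.DataFrame(
    {"region": ["east", "east", "west"], "revenue": [100.0, 50.0, 75.0]}
)
target_demo = pd.DataFrame(
    {"region": ["east", "west"], "revenue": [150.0, 70.0]}
)

source_by_region = source_demo.groupby("region")["revenue"].sum()
target_by_region = target_demo.groupby("region")["revenue"].sum()

# Align on region; fill_value=0 treats a region missing on one side as zero revenue.
difference = source_by_region.sub(target_by_region, fill_value=0)
off_regions = difference[difference.abs() > 0.01]

print(off_regions)
```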

5️⃣ Automating Incremental Load Validation

Check If New Records Are Loaded

new_records = source_data[~source_data["id"].isin(target_data["id"])]
print("New records that should be inserted:", len(new_records))

Check for Deleted Records

deleted_records = target_data[~target_data["id"].isin(source_data["id"])]
print("Records missing in source but present in target:", len(deleted_records))
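Both directions can also be checked in one pass with an outer merge and pandas' `indicator` option, which labels each key as present in the source only, the target only, or both. A minimal sketch with made-up ids:

```python
import pandas as pd

# Illustrative data only.
source_demo = pd.DataFrame({"id": [1, 2, 3, 4]})
target_demo = pd.DataFrame({"id": [1, 2, 5]})

# indicator=True adds a "_merge" column: "left_only", "right_only", or "both".
merged = source_demo.merge(target_demo, on="id", how="outer", indicator=True)

only_in_source = sorted(merged.loc[merged["_merge"] == "left_only", "id"])
only_in_target = sorted(merged.loc[merged["_merge"] == "right_only", "id"])

print("Not yet loaded into target:", only_in_source)
print("Present in target but not in source:", only_in_target)
```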

Conclusion

Automating ETL testing with Python and Pandas makes validation faster and repeatable, and catches accuracy problems early. With reusable test scripts, you can:

  • Validate data completeness.
  • Ensure data integrity.
  • Test transformations and incremental loads.
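One way to make these checks reusable is to wrap them in a single function that records a pass or fail for each check. The sketch below is only a starting point; `run_checks`, the check names, and the sample data are hypothetical:

```python
import pandas as pd


def run_checks(source: pd.DataFrame, target: pd.DataFrame, key: str = "id") -> dict:
    """Run each check and record pass/fail instead of stopping at the first assert."""
    return {
        "row_count_match": len(source) == len(target),
        "target_keys_unique": bool(target[key].is_unique),
        "no_missing_in_target": not target.isnull().any().any(),
        "all_source_keys_loaded": bool(source[key].isin(target[key]).all()),
    }


# Illustrative data only: the target repeats id 2 and never received id 3.
source_demo = pd.DataFrame({"id": [1, 2, 3], "revenue": [10.0, 20.0, 30.0]})
target_demo = pd.DataFrame({"id": [1, 2, 2], "revenue": [10.0, 20.0, 20.0]})

results = run_checks(source_demo, target_demo)
failed = [name for name, passed in results.items() if not passed]
print("Failed checks:", failed)
```

The same function can then be called for every source and target pair in a pipeline.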

In the next post, we will discuss “Building a Scalable ETL Test Automation Framework.”