Himalayan Peaks of Testing Data Pipelines

Ksenia Tomak, HelloFresh
Pasha Finkelshteyn, JetBrains

Pasha Finkelshteyn

Developer 🥑 for Big Data @ JetBrains

@asm0di0

🐘 @asm0dey@fosstodon.org

@asm0di0   @if_no_then_yes

Data processing

Data lake?

Who needs pipelines

  • Data Scientists
  • Data Analysts
  • Marketing
  • Product Owners

They have to be tested!

Pyramid of testing?

Pyramid of testing. Unit

Bronze→Silver pipeline

Typical pipeline

Unit testing of pipeline

What may we test here?

A pipeline should transform data correctly!

Correctness is a business term

Let's paste fakes!

Fake input data
Reference data at the end of the pipeline
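
A minimal sketch of the idea in plain Python, standing in for a Spark job. The `normalize_gender` step, its mapping, and the field names are hypothetical; the point is only the shape of the test: handwritten fake input on one side, reference data on the other.

```python
# Hypothetical bronze→silver step: normalize a raw gender field.
def normalize_gender(rows):
    mapping = {"male": "m", "female": "f"}
    return [
        {
            "id": row["id"],
            # Anything that is not a known value becomes "u" (unknown).
            "gender": mapping.get(str(row["gender"]).strip().lower(), "u"),
        }
        for row in rows
    ]

# Fake input data: small, handwritten, covers the interesting cases.
fake_input = [
    {"id": 1, "gender": "Male"},
    {"id": 2, "gender": " female "},
    {"id": 3, "gender": None},
]

# Reference data: what the business expects at the end of the pipeline.
reference = [
    {"id": 1, "gender": "m"},
    {"id": 2, "gender": "f"},
    {"id": 3, "gender": "u"},
]

def test_normalize_gender():
    assert normalize_gender(fake_input) == reference
```

Because correctness is a business term, the reference rows are best agreed with the people who consume the data, not derived from the code under test.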

Tools

holdenk/spark-testing-base ← tools to run tests
MrPowers/spark-daria ← tools to easily create test data

Component testing

TestContainers

Supported languages:

  • Java (and compatibles: Scala, Kotlin, etc.)
  • Python
  • Go
  • Node.js
  • Rust
  • .NET

Real systems

Why are component tests not enough?

  • vendor lock-in tools (DB, processing, etc.)
  • external error handling

Real data

Get data samples from prod and anonymize them
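
One possible way to anonymize a sample, sketched with the standard library only. The field names, the secret, and the choice of a keyed hash are assumptions for illustration; real anonymization requirements depend on your data and your legal constraints.

```python
import hashlib
import hmac

# Hypothetical secret; in practice it would come from a secret store, not from code.
SECRET_KEY = b"replace-me"

def pseudonymize(value: str) -> str:
    """Replace a value with a keyed hash: the same input always maps to the
    same token, so joins between anonymized tables still work."""
    return hmac.new(SECRET_KEY, value.encode("utf-8"), hashlib.sha256).hexdigest()[:16]

def anonymize(row: dict, sensitive_fields=("email", "name")) -> dict:
    """Return a copy of the row with the sensitive fields pseudonymized."""
    return {
        key: pseudonymize(str(val)) if key in sensitive_fields and val is not None else val
        for key, val in row.items()
    }

# Hypothetical production sample.
sample = {"id": 42, "email": "jane@example.com", "name": "Jane", "gender": "f"}
anonymized = anonymize(sample)
```

Non-sensitive fields pass through unchanged, so the sample keeps the distributions the pipeline logic depends on.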

Compare to reference

Comparison example

gender  reference  id  match
m       m          1   TRUE
f       c          2   FALSE
u       u          3   TRUE
c       c          4   TRUE
m       f          5   FALSE
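
A comparison like this can be sketched in a few lines of plain Python. The values are the ones from the table above; the function name and the dictionary layout are illustrative assumptions.

```python
# Pipeline output and reference data, keyed by id (values from the table above).
output = {1: "m", 2: "f", 3: "u", 4: "c", 5: "m"}
reference = {1: "m", 2: "c", 3: "u", 4: "c", 5: "f"}

def compare(output, reference):
    """Return one row per id with a `match` flag.
    An id that exists on only one side gets None for the other side."""
    return [
        {
            "id": key,
            "gender": output.get(key),
            "reference": reference.get(key),
            "match": output.get(key) == reference.get(key),
        }
        for key in sorted(output.keys() | reference.keys())
    ]

rows = compare(output, reference)
mismatched_ids = [row["id"] for row in rows if not row["match"]]
match_rate = sum(row["match"] for row in rows) / len(rows)
```

Keeping the mismatched ids, not only the match rate, makes it possible to go back to the source rows and decide whether the code or the reference is wrong.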

Real data

Deploy full data backup on stage env, anonymize it 🤑

In usual testing, you don't trust your code

In pipeline testing, you trust neither your code nor your data

Real data expectations

Test:
✅ no data
✅ valid data
❓ invalid data
❓ illegal data format
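
The four cases can be written down as four assertions. The sketch below assumes a hypothetical ingestion step that reads JSON lines and requires an integer `id`; the function name and the rejection reasons are illustrative.

```python
import json

def parse_events(lines):
    """Split raw JSON lines into valid records and rejected lines."""
    valid, rejected = [], []
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            # The line is not JSON at all.
            rejected.append((line, "illegal format"))
            continue
        if not isinstance(record, dict) or not isinstance(record.get("id"), int):
            # Well-formed JSON, but it breaks the expected schema.
            rejected.append((line, "invalid data"))
            continue
        valid.append(record)
    return valid, rejected

# One assertion per case from the list above.
assert parse_events([]) == ([], [])                              # no data
assert parse_events(['{"id": 1}']) == ([{"id": 1}], [])          # valid data
assert parse_events(['{"id": "x"}'])[1][0][1] == "invalid data"  # invalid data
assert parse_events(['{"id": 1'])[1][0][1] == "illegal format"   # illegal data format
```

The last two cases are the ones that tend to be forgotten: the step should reject the line and keep running, not crash the whole job.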

Real data expectations. Tools:

  • profilers
  • constraint suggestions
  • constraint verification
  • metrics
  • metrics stores

Great expectations

Python Deequ

constraint              constraint_status  constraint_message
CompletenessConstraint  Success
ComplianceConstraint    Failure            Value: 0.5 does not meet the constraint requirement!
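
To show what a completeness and a compliance constraint compute, here is a plain-Python sketch that yields a result table of the same shape. This is not the PyDeequ API: every function name below is hypothetical, and the sample rows are made up so that the compliance metric comes out at 0.5.

```python
def completeness(rows, column):
    """Fraction of rows where the column is not None (assumes non-empty rows)."""
    return sum(row.get(column) is not None for row in rows) / len(rows)

def compliance(rows, predicate):
    """Fraction of rows that satisfy the predicate (assumes non-empty rows)."""
    return sum(bool(predicate(row)) for row in rows) / len(rows)

def verify(rows, constraints):
    """Evaluate (name, metric, assertion) triples into a result table."""
    results = []
    for name, metric, assertion in constraints:
        value = metric(rows)
        ok = assertion(value)
        results.append({
            "constraint": name,
            "constraint_status": "Success" if ok else "Failure",
            "constraint_message": ""
            if ok else f"Value: {value} does not meet the constraint requirement!",
        })
    return results

# Made-up data: every id is present, but only one of two gender values is allowed.
rows = [{"id": 1, "gender": "m"}, {"id": 2, "gender": "x"}]
allowed = {"m", "f", "u", "c"}

results = verify(rows, [
    ("CompletenessConstraint", lambda r: completeness(r, "id"), lambda v: v == 1.0),
    ("ComplianceConstraint",
     lambda r: compliance(r, lambda row: row["gender"] in allowed),
     lambda v: v == 1.0),
])
```

A metric is a number computed from the data, and a constraint is an assertion on that number. That split is what lets the same metrics be stored and reused for monitoring.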

Real data expectations. Use cases

  • pre-ingestion and post-ingestion data validation
  • before pipeline development
  • monitoring and alerting

Monitoring

Why?

  • The only REAL testing is production
  • Data tends to change over time

Monitoring

What?

  • data volumes
  • counters
  • time
  • dead letter queue monitoring
  • service health
  • business metrics

Monitoring

How?

  • use Listeners
  • use data aggregations
  • Airflow (or another orchestrator)
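
The listener idea can be sketched without any framework. The `MetricsListener` interface and the `run_step` helper below are hypothetical; real engines and orchestrators expose their own listener hooks, which would replace this code.

```python
import time

class MetricsListener:
    """Collects simple counters and timings per pipeline step."""
    def __init__(self):
        self.metrics = {}

    def on_step_finished(self, step, rows_in, rows_out, dead_letter, seconds):
        self.metrics[step] = {
            "rows_in": rows_in,          # data volume
            "rows_out": rows_out,        # counter
            "dead_letter": dead_letter,  # dead letter queue size
            "seconds": seconds,          # time
        }

def run_step(name, func, rows, listener):
    """Run one step, route failing rows to a dead letter list, report metrics."""
    started = time.perf_counter()
    out, dead = [], []
    for row in rows:
        try:
            out.append(func(row))
        except Exception:
            dead.append(row)
    listener.on_step_finished(
        name, len(rows), len(out), len(dead), time.perf_counter() - started
    )
    return out, dead

listener = MetricsListener()
out, dead = run_step(
    "parse_amount",
    lambda r: {"amount": float(r["amount"])},
    [{"amount": "1.5"}, {"amount": "oops"}],
    listener,
)
```

Alerting can then be a rule over these numbers, for example a dead letter count above zero or a row count far from yesterday's.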

A data pipeline is always a DAG

Monitoring should visualize it
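
A small sketch of what "visualize the DAG" can mean in practice: keep the dependencies as data, derive an execution order, and render the graph with each step's status. The step names beyond bronze and silver, the status values, and the `to_dot` helper are assumptions for illustration.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical pipeline: each step maps to the set of steps it depends on.
pipeline = {
    "bronze": set(),
    "silver": {"bronze"},
    "gold_report": {"silver"},
    "gold_ml_features": {"silver"},
}

# A valid execution order; this raises graphlib.CycleError if the graph is not a DAG.
order = list(TopologicalSorter(pipeline).static_order())

def to_dot(dag, status):
    """Render the DAG as Graphviz DOT text, coloring each node by its status."""
    colors = {"ok": "green", "failed": "red", "pending": "gray"}
    lines = ["digraph pipeline {"]
    for node in dag:
        lines.append(f'  "{node}" [color={colors[status.get(node, "pending")]}];')
    for node, deps in dag.items():
        for dep in sorted(deps):
            lines.append(f'  "{dep}" -> "{node}";')
    lines.append("}")
    return "\n".join(lines)

dot = to_dot(pipeline, {"bronze": "ok", "silver": "failed"})
```

With the graph in view, a failed `silver` step immediately shows which downstream outputs are stale.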

Monitoring visualization

End-to-End tests

Compare with reports, old DWH

Multiple dimensions:

  • data
  • data latency
  • performance, scalability

Performance Tests

  • start with SLO
  • test your initial data load
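
A back-of-the-envelope sketch of checking an initial load against an SLO. All numbers are hypothetical, and the estimate assumes throughput scales linearly with data size, which is optimistic for joins, shuffles, and skewed data.

```python
def estimate_load_seconds(total_rows, sample_rows, sample_seconds):
    """Extrapolate full-load duration from a measured sample (linear scaling assumed)."""
    throughput = sample_rows / sample_seconds  # rows per second
    return total_rows / throughput

def meets_slo(total_rows, sample_rows, sample_seconds, slo_seconds):
    """True if the extrapolated load time fits within the SLO."""
    return estimate_load_seconds(total_rows, sample_rows, sample_seconds) <= slo_seconds

# Hypothetical measurement: 1M sample rows took 50 s; the initial load is 500M rows.
estimate = estimate_load_seconds(500_000_000, 1_000_000, 50.0)
fits_8h = meets_slo(500_000_000, 1_000_000, 50.0, slo_seconds=8 * 3600)
fits_6h = meets_slo(500_000_000, 1_000_000, 50.0, slo_seconds=6 * 3600)
```

An estimate like this only tells you whether a real load test is worth scheduling; it does not replace running the initial load on production-sized data.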

Real prod

Run a parallel job with a different sink
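
The idea in miniature: the sink is a parameter, so the candidate version can read the same input and write somewhere users never look. The lists below are stand-ins for real tables or topics, and both transformations are hypothetical.

```python
def run_pipeline(rows, transform, sink):
    """Run the pipeline and write to whatever sink is passed in."""
    for row in rows:
        sink.append(transform(row))

def current_version(row):
    return {"id": row["id"], "amount": float(row["amount"])}

def candidate_version(row):
    # The new code under test: rounds amounts to two decimal places.
    return {"id": row["id"], "amount": round(float(row["amount"]), 2)}

prod_sink, shadow_sink = [], []  # stand-ins for real tables or topics
source = [{"id": 1, "amount": "10"}, {"id": 2, "amount": "2.50"}]

run_pipeline(source, current_version, prod_sink)      # what users see
run_pipeline(source, candidate_version, shadow_sink)  # same input, separate sink

# Rows where the candidate disagrees with production.
differences = [(a, b) for a, b in zip(prod_sink, shadow_sink) if a != b]
```

An empty `differences` list on real traffic is stronger evidence than any staged test, at the price of running the job twice.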

Summary

  • Testing pipelines is like testing code
  • Testing pipelines is not like testing code
  • Pipeline quality is not only about testing
  • Sometimes testing outside of production is tricky

Thanks!

Questions? 🙇

@asm0di0
@if_no_then_yes

🐘 @asm0dey@fosstodon.org

Specifics of data pipeline monitoring