Himalayan Peaks of Testing Data Pipelines

Ksenia Tomak, Dodo Engineering
Pasha Finkelshteyn, JetBrains

Ksenia Tomak

  • Tech Lead, Dodo Engineering
  • @if_no_then_yes

Industrial IoT, DE, Storages

@asm0di0   @if_no_then_yes

Pasha Finkelshteyn

Developer 🥑 for Big Data @ JetBrains

@asm0di0


What is Big Data

  • Doesn't fit on a single node (or in Excel)
  • Can be scaled as it grows
  • Enough data to make reliable business decisions

Who are DEs

Plumbers of data

Data is produced by

  • sensors
  • clickstreams
  • etc.

Data sinks

  • HDFS
  • S3, Azure Blob Storage
  • etc.

What is a data pipeline?


Data processing


Data lake?


Who needs pipelines

  • Data Scientists
  • Data Analysts
  • Marketing
  • PO

QA ?= QC


QA ≠ QC

QA is about processes and not only about software quality.


Pyramid of testing. Unit


Bronze→Silver pipeline


Typical pipeline


Unit testing of pipeline

What can we test here?

A pipeline should transform data correctly!

Correctness is a business term


Let's use fakes!

Fake input data
Reference data at the end of the pipeline
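
A minimal sketch of the idea in plain Python, with no Spark involved: the transform is a pure function, the test feeds it fake input rows and compares the result with hand-written reference rows. The function `bronze_to_silver`, the field names and the business rule (drop readings without a device id, convert Fahrenheit to Celsius) are hypothetical and only serve the illustration.

```python
def bronze_to_silver(rows):
    """Hypothetical transform: drop rows without device_id, convert temp F -> C."""
    silver = []
    for row in rows:
        if not row.get("device_id"):
            continue  # assumed business rule: a reading without a device is useless
        silver.append({
            "device_id": row["device_id"],
            "temp_c": round((row["temp_f"] - 32) * 5 / 9, 1),
        })
    return silver


def test_bronze_to_silver():
    # Fake input data: small, hand-written, covers the rule we care about
    fake_input = [
        {"device_id": "a1", "temp_f": 212.0},
        {"device_id": None, "temp_f": 50.0},
        {"device_id": "b2", "temp_f": 32.0},
    ]
    # Reference data: what the business expects at the end of the pipeline
    reference = [
        {"device_id": "a1", "temp_c": 100.0},
        {"device_id": "b2", "temp_c": 0.0},
    ]
    assert bronze_to_silver(fake_input) == reference


test_bronze_to_silver()
```

The same shape carries over to Spark: build a small DataFrame from the fake rows, run the transform, compare with a reference DataFrame.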


Tools

  • holdenk/spark-testing-base ← tools to run tests
  • MrPowers/spark-daria ← tools to easily create test data


Component testing


Testcontainers


Supported languages:

  • Java (and compatibles: Scala, Kotlin, etc.)
  • Python
  • Go
  • Node.js
  • Rust
  • .NET

Real systems

Why are component tests not enough?

  • vendor lock-in tools (DB, processing, etc.)
  • external error handling

Real data


Get data samples from prod,
anonymize them
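
One possible way to do the "anonymize" step is keyed hashing of sensitive fields, sketched below with the standard library only. Strictly speaking this is pseudonymization: equal inputs map to equal tokens, so joins keep working, but it may not be enough for every compliance regime. The field names, the `SECRET_SALT` value and the helper names are assumptions made up for this example.

```python
import hashlib
import hmac

SECRET_SALT = b"rotate-me"  # assumption: in practice this comes from a secret store


def pseudonymize(value: str) -> str:
    """Replace a value with a keyed hash: stable across rows, not reversible without the key."""
    return hmac.new(SECRET_SALT, value.encode("utf-8"), hashlib.sha256).hexdigest()[:16]


def anonymize_row(row: dict, sensitive_fields: set) -> dict:
    """Return a copy of the row with the sensitive fields replaced by tokens."""
    return {
        key: pseudonymize(str(value)) if key in sensitive_fields and value is not None else value
        for key, value in row.items()
    }


# Hypothetical sample taken from prod
sample = [
    {"user_email": "alice@example.com", "order_total": 12.5},
    {"user_email": "alice@example.com", "order_total": 7.0},
]
anonymized = [anonymize_row(row, {"user_email"}) for row in sample]
```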


Compare to reference


Real data

Deploy a full data backup on the stage env,
anonymize it 🤑


In usual testing you don't trust your code


In pipeline testing you trust

neither your code nor your data


Real data expectations

Test:
✅ no data
✅ valid data
❓ invalid data
❓ illegal data format
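
The four cases above can be told apart in code. The sketch below assumes JSON events with a single numeric `amount` field and a rule that it must be non-negative; the event shape, the rule and the function names are hypothetical. The point is only that "invalid" (well-formed, but breaks a rule) and "illegal format" (cannot be parsed at all) are handled as separate outcomes instead of crashing the job.

```python
import json


def parse_event(raw):
    """Return (record, error); exactly one of the two is None."""
    try:
        event = json.loads(raw)
    except (TypeError, ValueError):
        return None, "illegal format"  # not parseable as JSON at all
    if not isinstance(event, dict):
        return None, "illegal format"  # JSON, but not the expected object shape
    amount = event.get("amount")
    if isinstance(amount, bool) or not isinstance(amount, (int, float)) or amount < 0:
        return None, "invalid data"  # well-formed, but breaks the assumed business rule
    return {"amount": float(amount)}, None


def process(raw_events):
    """Split a batch into good records and rejected (raw, reason) pairs."""
    good, rejected = [], []
    for raw in raw_events:
        record, error = parse_event(raw)
        if error is None:
            good.append(record)
        else:
            rejected.append((raw, error))
    return good, rejected
```

A test suite would then have one case per line of the list: an empty batch, a valid event, an event with a negative amount, and a string that is not JSON.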


Real data expectations. Tools:

  • profilers
  • constraint suggestions
  • constraint verification
  • metrics
  • metrics stores

Great Expectations


Python Deequ


constraint              constraint_status  constraint_message
CompletenessConstraint  Success
ComplianceConstraint    Failure            Value: 0.5 does not meet the constraint requirement!
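
What the two constraint types in this result measure can be shown without Deequ. The sketch below is plain Python and is not the PyDeequ API: completeness is the share of non-null values in a column, compliance is the share of rows that satisfy a predicate. The rows, column names and thresholds are made up, chosen so that compliance comes out as 0.5 like in the message above.

```python
def completeness(rows, column):
    """Fraction of rows where `column` is present and not None."""
    if not rows:
        return 1.0  # assumption: an empty dataset counts as trivially complete
    return sum(row.get(column) is not None for row in rows) / len(rows)


def compliance(rows, predicate):
    """Fraction of rows for which `predicate(row)` holds."""
    if not rows:
        return 1.0
    return sum(bool(predicate(row)) for row in rows) / len(rows)


def verify(rows, checks):
    """checks: list of (name, metric_fn, min_value). Returns one result dict per check."""
    results = []
    for name, metric_fn, min_value in checks:
        value = metric_fn(rows)
        ok = value >= min_value
        results.append({
            "constraint": name,
            "constraint_status": "Success" if ok else "Failure",
            "constraint_message": "" if ok else f"Value: {value} does not meet the constraint requirement!",
        })
    return results


# Hypothetical data: every id is present, half of the prices are negative
rows = [
    {"id": 1, "price": 10.0},
    {"id": 2, "price": -5.0},
    {"id": 3, "price": 3.5},
    {"id": 4, "price": -1.0},
]
report = verify(rows, [
    ("CompletenessConstraint", lambda r: completeness(r, "id"), 1.0),
    ("ComplianceConstraint", lambda r: compliance(r, lambda row: row["price"] >= 0), 1.0),
])
```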

Real data expectations. Use cases

  • pre-ingestion and post-ingestion data validation
  • before pipeline development
  • monitoring and alerting

Monitoring

Why?

  • The only REAL testing is production
  • Data tends to change over time

Monitoring

What?

  • data volumes
  • counters
  • time
  • dead letter queue monitoring
  • service health
  • business metrics
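
A few of these signals (counters, time, dead letter queue size) can be collected by wrapping a pipeline stage, as in the sketch below. It is plain Python with an in-memory list standing in for a real dead letter queue; `StageMetrics` and `run_stage` are hypothetical names, and a real job would export the numbers to a metrics system instead of keeping them in an object.

```python
import time
from collections import Counter


class StageMetrics:
    """In-memory stand-in for a metrics backend and a dead letter queue."""

    def __init__(self):
        self.counters = Counter()
        self.dead_letters = []
        self.elapsed_seconds = 0.0


def run_stage(records, transform, metrics):
    """Apply `transform` to each record, counting successes and routing failures to the DLQ."""
    started = time.perf_counter()
    output = []
    for record in records:
        metrics.counters["records_in"] += 1
        try:
            output.append(transform(record))
            metrics.counters["records_out"] += 1
        except Exception as error:  # sketch only: a real job would catch narrower errors
            metrics.counters["records_failed"] += 1
            metrics.dead_letters.append({"record": record, "error": repr(error)})
    metrics.elapsed_seconds += time.perf_counter() - started
    return output


metrics = StageMetrics()
result = run_stage(["1", "2", "oops", "4"], int, metrics)
```

Alerting can then be phrased over these values, for example on the ratio of `records_failed` to `records_in` or on the dead letter queue growing.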

Monitoring

How?

  • use Listeners
  • use data aggregations

A data pipeline is always a DAG

Monitoring should visualize it
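
Treating the pipeline as a DAG also gives monitoring something to compute on. The sketch below uses `graphlib` from the Python standard library (3.9+) to derive a valid execution order and to answer "what is affected if this step fails". The step names and the dependency map are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each step maps to the set of steps it depends on
pipeline = {
    "bronze": {"ingest"},
    "silver": {"bronze"},
    "gold_sales": {"silver"},
    "gold_marketing": {"silver"},
}


def execution_order(dag):
    """Return the steps in an order where every step follows its dependencies."""
    return list(TopologicalSorter(dag).static_order())


def downstream_of(dag, failed_step):
    """Steps that transitively depend on `failed_step`, i.e. what a failure affects."""
    affected = set()
    frontier = {failed_step}
    while frontier:
        # Next layer: steps with at least one dependency in the current frontier
        frontier = {step for step, deps in dag.items() if deps & frontier} - affected
        affected |= frontier
    return affected


order = execution_order(pipeline)
impacted = downstream_of(pipeline, "bronze")
```

A dashboard built on the same structure can highlight the failed node together with everything in `impacted`.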


Monitoring visualization

End-to-End tests

Compare with reports, old DWH

Multiple dimensions:

  • data
  • data latency
  • performance, scalability
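
For the "data" dimension, comparing with reports or an old DWH often comes down to comparing aggregates key by key. A minimal sketch, assuming both sides can be reduced to a mapping from a key (here a date string) to a total; the function name, the keys and the tolerance are assumptions for illustration.

```python
import math


def compare_totals(new_totals, legacy_totals, rel_tol=1e-6):
    """Return {key: (new, legacy)} for keys whose totals disagree or exist on one side only."""
    mismatches = {}
    for key in sorted(set(new_totals) | set(legacy_totals)):
        new, old = new_totals.get(key), legacy_totals.get(key)
        if new is None or old is None or not math.isclose(new, old, rel_tol=rel_tol):
            mismatches[key] = (new, old)
    return mismatches


# Hypothetical daily revenue from the new pipeline and from the legacy report
new_pipeline = {"2024-01-01": 100.0, "2024-01-02": 50.0}
legacy_report = {"2024-01-01": 100.0, "2024-01-02": 55.0, "2024-01-03": 1.0}
diff = compare_totals(new_pipeline, legacy_report)
```

An empty result means the two systems agree within the tolerance; anything else is a list of keys to investigate.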

Performance Tests





  • start with SLO
  • test your initial data load

Real prod


Run a parallel job with a different sink

Summary

  • Testing pipelines is like testing code
  • Testing pipelines is not like testing code
  • Pipeline quality is not only about testing
  • Sometimes testing outside of production is tricky

Thanks!

Questions? 🙇

@asm0di0
@if_no_then_yes

3V: volume, velocity, variety

Specifics of data pipeline monitoring