Ksenia Tomak, HelloFresh
Pasha Finkelshteyn, JetBrains
Developer for Big Data @ JetBrains
@asm0di0
@asm0dey@fosstodon.org
What may we test here?
A pipeline should transform data correctly!
Correctness is a business term
Fake input data
Reference data at the end of the pipeline
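A minimal sketch of this idea in plain Python, without Spark. The step `total_per_customer` and the sample rows are hypothetical and only illustrate the shape of such a test: hand-made input goes in, and the result is compared with reference data.

```python
# Hypothetical pipeline step (an assumption for illustration):
# sums order amounts per customer.
def total_per_customer(rows):
    totals = {}
    for row in rows:
        totals[row["customer"]] = totals.get(row["customer"], 0) + row["amount"]
    return totals


# Fake input data, built by hand for the test.
fake_input = [
    {"customer": "a", "amount": 10},
    {"customer": "b", "amount": 5},
    {"customer": "a", "amount": 7},
]

# Reference data expected at the end of the pipeline.
reference_output = {"a": 17, "b": 5}

actual_output = total_per_customer(fake_input)
assert actual_output == reference_output
```

The reference data is where the business definition of "correct" gets written down, so it is worth reviewing with whoever owns that definition.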
holdenk/spark-testing-base: tools to run tests
MrPowers/spark-daria: tools to easily create test data
Supported languages:
Why are component tests not enough?
Get data samples from prod, anonymize them
Deploy a full data backup on the stage env, anonymize it
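One possible way to anonymize records before they reach a test environment, sketched with the Python standard library. The field names, the key, and the choice of keyed hashing are assumptions. Strictly speaking this is pseudonymization: equal inputs map to equal tokens, which keeps joins working, but whether that is enough depends on the data and on your legal requirements.

```python
import hashlib
import hmac

# Assumption: the key is stored outside the test environment.
SECRET_KEY = b"replace-me"
# Hypothetical list of fields that must not leave prod in clear text.
SENSITIVE_FIELDS = ("email", "name")


def pseudonymize(value: str) -> str:
    """Replace a value with a stable, keyed SHA-256 token (first 16 hex chars)."""
    digest = hmac.new(SECRET_KEY, value.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]


def anonymize(record: dict) -> dict:
    """Return a copy of the record with sensitive fields replaced by tokens."""
    return {
        key: pseudonymize(str(value)) if key in SENSITIVE_FIELDS else value
        for key, value in record.items()
    }


sample = {"email": "user@example.com", "name": "Jane", "order_id": 42}
safe = anonymize(sample)
```

Non-sensitive fields such as `order_id` pass through unchanged, so the anonymized sample still exercises the same pipeline logic.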
Test: no data, valid data, invalid data, illegal data format
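The four cases above can be sketched as plain assertions. The parser `parse_orders` and its rules are hypothetical: here "invalid data" means a well-formed line that breaks a business rule (a negative amount), and "illegal data format" means a line that cannot be parsed at all.

```python
# Hypothetical step: parses "customer,amount" lines.
def parse_orders(lines):
    valid, rejected = [], []
    for line in lines:
        parts = line.split(",")
        if len(parts) != 2:
            raise ValueError(f"illegal format: {line!r}")
        # int() itself raises ValueError if the amount is not a number.
        customer, amount = parts[0], int(parts[1])
        if amount < 0:
            rejected.append((customer, amount))  # parsed, but invalid
        else:
            valid.append((customer, amount))
    return valid, rejected


# No data: the step must not fail on empty input.
assert parse_orders([]) == ([], [])

# Valid data.
assert parse_orders(["a,10"]) == ([("a", 10)], [])

# Invalid data: routed to the rejected list instead of being dropped silently.
assert parse_orders(["a,-1"]) == ([], [("a", -1)])

# Illegal data format: the step fails loudly.
try:
    parse_orders(["a;10"])
except ValueError:
    pass
else:
    raise AssertionError("expected ValueError for illegal format")
```

Whether bad records should be rejected, quarantined, or should stop the run is a design decision; the sketch only shows that each case gets an explicit test.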
Why?
What?
How?
Monitoring should visualize it
Compare with reports, old DWH
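A comparison against an existing report can be sketched as a reconciliation check. The function `reconcile`, the monthly keys, and the 1% tolerance are assumptions for illustration; real thresholds depend on how the old report was computed.

```python
import math


def reconcile(pipeline_totals, report_totals, rel_tol=0.01):
    """Return the keys whose totals are missing on one side or differ
    by more than rel_tol, mapped to (pipeline value, report value)."""
    mismatches = {}
    for key in sorted(set(pipeline_totals) | set(report_totals)):
        ours = pipeline_totals.get(key)
        theirs = report_totals.get(key)
        if ours is None or theirs is None:
            mismatches[key] = (ours, theirs)
        elif not math.isclose(ours, theirs, rel_tol=rel_tol):
            mismatches[key] = (ours, theirs)
    return mismatches


# Hypothetical monthly revenue from the new pipeline and from the old DWH report.
pipeline = {"2024-01": 100.0, "2024-02": 50.0}
report = {"2024-01": 100.5, "2024-02": 60.0, "2024-03": 1.0}

diff = reconcile(pipeline, report)
```

With these sample numbers, January is within tolerance, February differs too much, and March is missing from the pipeline, so `diff` contains the last two keys.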
Multiple dimensions:
@asm0di0 @if_no_then_yes
Specifics of data pipeline monitoring