Ksenia Tomak, Dodo Engineering
Pasha Finkelshteyn, JetBrains
Industrial IoT, DE, Storages
Developer for Big Data @ JetBrains
Plumber of data
Data is produced by
QA is about processes and not only about software quality.
What can we test here?
A pipeline should transform data correctly!
Correctness is a business term
Fake input data
Reference data at the end of the pipeline
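A minimal sketch of such a component test, in plain Python rather than Spark to keep it self-contained. The `transform` step, the column names, and the sample rows are all hypothetical. The point is only the shape of the test: hand-made fake input goes in, and the result is compared with hand-made reference data.

```python
from typing import Dict, List

Row = Dict[str, object]


def transform(rows: List[Row]) -> List[Row]:
    """Hypothetical pipeline step: drop rows without an id, convert cents to units."""
    result = []
    for row in rows:
        if row.get("order_id") is None:
            continue  # rows without an id are dropped by this illustrative step
        result.append({"order_id": row["order_id"], "amount": row["amount_cents"] / 100})
    return result


# Fake input data: small, hand-written, covers the cases we care about.
fake_input = [
    {"order_id": 1, "amount_cents": 1250},
    {"order_id": None, "amount_cents": 300},
    {"order_id": 2, "amount_cents": 0},
]

# Reference data: what the business expects at the end of the pipeline.
reference_output = [
    {"order_id": 1, "amount": 12.5},
    {"order_id": 2, "amount": 0.0},
]


def test_transform_matches_reference() -> None:
    assert transform(fake_input) == reference_output
```

In a real Spark pipeline the same pattern applies, with DataFrames in place of lists and a DataFrame comparison helper in place of `==`.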
holdenk/spark-testing-base ← tools to run tests
MrPowers/spark-daria ← tools to easily create test data
Why are component tests not enough?
Get data samples from prod,
Deploy full data backup on stage env,
illegal data format
Monitoring should visualize it
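One way to make illegal data visible is to count rejected rows per reason and export those counts to whatever monitoring system is in use. The sketch below is an assumption, not the speakers' implementation: the field names (`order_id`, `created_at`), the date format, and the reason labels are hypothetical.

```python
from collections import Counter
from datetime import datetime


def validate(row):
    """Return None if the row is valid, otherwise a short reason label."""
    if not isinstance(row.get("order_id"), int):
        return "bad_order_id"
    try:
        # Assumed format for this sketch: ISO date, e.g. 2021-03-01.
        datetime.strptime(str(row.get("created_at")), "%Y-%m-%d")
    except ValueError:
        return "bad_date"
    return None


def split_valid_invalid(rows):
    """Keep valid rows and count invalid ones per reason."""
    valid, reasons = [], Counter()
    for row in rows:
        reason = validate(row)
        if reason is None:
            valid.append(row)
        else:
            reasons[reason] += 1
    return valid, reasons
```

The `reasons` counter is the part meant for dashboards: a sudden jump in one label points at a changed upstream format rather than a bug in the pipeline itself.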
Compare with reports, old DWH
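A comparison with existing reports or the old DWH can be as simple as reconciling aggregates key by key. The sketch below assumes both sides have already been reduced to a mapping from a key (for example a date) to a total. The function name `reconcile` and the tolerance are hypothetical.

```python
import math


def reconcile(pipeline_totals, report_totals, rel_tol=1e-9):
    """Return keys whose totals differ or that are missing on either side."""
    mismatches = {}
    for key in sorted(set(pipeline_totals) | set(report_totals)):
        ours = pipeline_totals.get(key)
        theirs = report_totals.get(key)
        if ours is None or theirs is None or not math.isclose(ours, theirs, rel_tol=rel_tol):
            mismatches[key] = (ours, theirs)
    return mismatches


new_pipeline = {"2021-03-01": 100.0, "2021-03-02": 50.0}
old_dwh = {"2021-03-01": 100.0, "2021-03-02": 55.0, "2021-03-03": 1.0}
diff = reconcile(new_pipeline, old_dwh)
```

An empty result means the two sources agree. Each entry in a non-empty result maps a key to the pair (pipeline value, report value), with `None` marking a side where the key is missing.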
3V: volume, velocity, variety
Specifics of data pipeline monitoring