The document discusses using declarative data pipelines to write data workflows once and reuse them easily. It describes Flashfood, a marketplace for groceries nearing their expiry date, and the problem of maintaining many pipelines across different file types, clouds, and sources. Three attempts at a solution showed that too little automation led to boilerplate code, while too much automation caused unexpected behavior. The solution was to define YAML configuration files that generic jobs run against, allowing flexibility while enforcing the DRY principle. This approach reduced maintenance overhead and let anyone create similar jobs. Lessons included favoring parameters over inference and reusing extract and load code. Future work may involve programmatically adding new configurations and a Spark YAML grammar.
3. Food Waste
The larger problem
• 160 billion pounds of food in North America end up in the landfill each year
• Food waste makes up at least 6% of all greenhouse gas emissions globally
• If international food waste were a country, it would be the third-largest contributor to GHG emissions, behind the US & China [1]
[1] National Geographic, March 2016
4. Food Waste
The larger problem
• According to usda.gov, in the US, about 30–40% of the food supply ends up in the landfill
• In Canada, about 58% (35.5 million tonnes) of all food produced goes to waste annually
• 10.5 percent (13.7 million) of U.S. households were food insecure at some time during 2019
[1] Second Harvest, 2019
5. Flashfood
• A marketplace for food nearing expiry
• Grocers recover costs on shrink
• Grocers reduce their carbon footprint
• More families are fed fresh food affordably
• In 2020 alone Flashfood
• Diverted 11.2 million pounds of food from
landfills
• Saved shoppers 29 million dollars on groceries
6. Flashfood Data
• Data Science: recommendation system, fraud detection, dynamic pricing
• Product: power our mobile & web platforms
• Analytics: drive data-driven decisions, business intelligence
8. Problem Definition
Many file types, many clouds, many sources: many pipelines
▪ Many file types: partners are key to our business; we are flexible on how we integrate and manage their data
▪ Many clouds: some of our partners have cloud provider restrictions
▪ Many sources: we have several other operational & 3rd-party sources
15. Problem
• Too much automation
▪ Inferred values cause unexpected behavior
▪ Hard to make changes
▪ Difficult to reuse code
▪ Lazy solutions to problems
▪ Hard to debug
• Not enough automation
▪ Difficult to maintain
▪ More room for errors
▪ Time spent on boilerplate logic
▪ Difficult to share code, pass on work
▪ Additions require Spark knowledge
16. The Declarative Data Pipeline
YAML-based Airflow DAGs
Config-based Spark application
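As a hedged illustration of the idea (every field name below is invented, not Flashfood's actual schema), a table-sync config in this style might look like:

```yaml
# Hypothetical config for one table sync; all keys are illustrative,
# not taken from Flashfood's real configuration format.
job: sync_table
source:
  format: csv          # stated explicitly, not inferred (see Lessons Learned)
  path: s3://partner-bucket/orders/
sink:
  table: dw.orders
  mode: append
schedule: "0 6 * * *"  # consumed by the YAML-based Airflow DAG
```

The same generic Spark application can then be pointed at any number of such files.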
17. Attempt 3
The right amount of automation
A single generic SyncTableJob class runs once per configuration against the database: SyncTableJob(config1), SyncTableJob(config2), SyncTableJob(config3), SyncTableJob(config4). Adding another table sync means adding a config, not writing a new job.
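A minimal sketch of the pattern (class and field names are illustrative, not Flashfood's code; a real job would use Spark readers/writers and PyYAML rather than the flat parser used here to stay self-contained):

```python
# Sketch of a config-driven sync job. Names are invented for
# illustration; the real system parses YAML and runs Spark.

def parse_flat_yaml(text):
    """Parse flat `key: value` lines (a stand-in for yaml.safe_load)."""
    return {
        k.strip(): v.strip()
        for k, v in (line.split(":", 1) for line in text.splitlines() if ":" in line)
    }

class SyncTableJob:
    """One generic job; each config instance describes one table sync."""

    def __init__(self, config):
        self.source = config["source_table"]
        self.sink = config["sink_table"]
        self.mode = config.get("mode", "append")  # explicit default, no inference

    def run(self):
        # Real version: spark.read.table(self.source).write.mode(self.mode)...
        return f"sync {self.source} -> {self.sink} ({self.mode})"

configs = [
    "source_table: orders\nsink_table: dw.orders\nmode: overwrite",
    "source_table: stores\nsink_table: dw.stores",
]
for raw in configs:
    print(SyncTableJob(parse_flat_yaml(raw)).run())
```

One class, n configs: the code is written once and every new pipeline is a new config file.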
21. Why configs?
• Creates a contract between source and sink
• Forces DRY principle for similar jobs
• Can manually or programmatically add new jobs
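The last point can be sketched in a few lines; the table names and config keys here are hypothetical, not Flashfood's:

```python
# Hedged sketch of programmatically adding new jobs: generate one
# sync config per table instead of hand-writing each file.

def make_sync_config(table):
    # Key names are illustrative assumptions, not a real schema.
    return {
        "source_table": f"prod.{table}",
        "sink_table": f"dw.{table}",
        "mode": "append",
    }

tables = ["orders", "stores", "items"]
configs = [make_sync_config(t) for t in tables]

print(len(configs))              # 3
print(configs[0]["sink_table"])  # dw.orders
```

Because every job honors the same contract, a new source table only requires one more generated entry.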
30. Results
• Reduced maintenance overhead
• Democratized the ability to create similar jobs
• Improved readability and coding standards
31. Lessons Learned
• Favor parameters over inference
• Reuse code for extract & load
• Instance pools are important
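"Favor parameters over inference" can be shown with a toy case (the store code is an invented example, not Flashfood data): inference silently guesses, while an explicit parameter does exactly what is stated.

```python
# Type inference can silently mangle values; an explicit type keeps
# behavior as stated. "00123" is a hypothetical store code.
raw = "00123"

inferred = int(raw) if raw.isdigit() else raw  # inference guesses int
explicit = str(raw)                            # explicit parameter: stays a string

print(inferred)  # 123 -- leading zeros silently lost
print(explicit)  # 00123
```

The inferred path produces a plausible-looking wrong value, which is exactly the "unexpected behavior" called out on the Problem slide.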
32. Challenges ahead
• How much to generalize config
• Programmatically add new configurations
• Grammar parser for simple function definitions in YAML
• Check YAML validity at source
• Could this be open sourced?
• We have SparkR, PySpark and Spark SQL; could we have Spark YAML?
33. Spark YAML
• Combine orchestration with
execution
• Simplify usage of parameter
heavy functions
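No such grammar exists today; as a purely speculative sketch, a "Spark YAML" job combining orchestration with execution might read:

```yaml
# Speculative "Spark YAML" sketch -- this grammar does not exist;
# every key below is invented to illustrate the idea.
dag:
  schedule: "@daily"
job:
  read:
    format: parquet
    path: s3://bucket/events/
  transform:
    - filter: "event_date >= '2020-01-01'"
    - select: [store_id, event_type, amount]
  write:
    table: dw.events
    mode: overwrite
```

Parameter-heavy calls like a partitioned write collapse into a few declarative keys, in the same way SparkR, PySpark, and Spark SQL each expose the engine in their host language.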
34. "A young writer is easily tempted by the allusive and ethereal and ironic and reflective, but the declarative is at the bottom of most good writing."
– Garrison Keillor
35. Keillor's Principles
• Explicit: settings and variables should be explicit
• Indelicate: system should extend without breaking
• Logical: behavior should do exactly as stated
• Simplistic: jobs should make limited decisions & fail quickly
• Declarative: pipelines should be clear in function & execution