It's difficult to find off-the-shelf, open-source solutions for creating lean, simple, and language-agnostic data-processing pipelines for machine learning (ML). This session shows you how to use Amazon S3, Docker, Amazon EC2, Auto Scaling, and a number of open-source libraries as cornerstones to build one. We also share our experience building elastically scalable and robust ML infrastructure by leveraging the Spot instance market.
2. Lessons we learned from:
• Building a new data-heavy product
• On a tight timeline
• On a budget (a team of just 6 people)
Solution:
• Leverage AWS and Docker to build a no-frills data pipeline
3. AdRoll Prospecting Product
Find new customers based on your existing customers’ behavior
• Hundreds of TB of data
• Billions of cookies
• ~20,000 ML models
9. Queue service (Quentin)
• Finds an instance to run a container on
• Maintains a queue when no instances are available
• Feeds queue metrics to CloudWatch
• Captures container stdout/stderr
• UI to debug failures
[Diagram: Quentin (queue) feeds metrics to CloudWatch, which drives Auto Scaling]
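The deck doesn't show Quentin's internals, but the core loop is easy to picture. Below is a minimal Python sketch under assumed names (the job structure, the queue, and the plain "docker run" invocation are illustrative, not the real Quentin API):

import queue
import subprocess

job_queue = queue.Queue()  # jobs wait here when no instance is available

def run_job(job):
    # Run one job as a Docker container, capturing stdout/stderr so a
    # debugging UI can surface them on failure.
    proc = subprocess.run(
        ["docker", "run", "--rm", job["image"]] + job["args"],
        capture_output=True, text=True,
    )
    if proc.returncode != 0:
        print(f"job {job['name']} failed:\n{proc.stderr}")
    return proc.returncode

while True:
    run_job(job_queue.get())  # blocks until a job arrives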
12. Lessons learned
• Scale based on job backlog size (see the sketch after this list)
• Multiple instance pools / Auto Scaling groups
• Use Elastic Load Balancing for health checks
• Lifecycle hooks
You don’t really need: data-aware scheduling or HA
Nice to have: job profiling
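“Scale based on job backlog size” means the Auto Scaling group reacts to queue depth rather than CPU load. A hedged sketch of the publishing half with boto3 (the Quentin namespace and JobBacklog metric name are assumptions); a step-scaling or target-tracking policy would then act on this metric:

import boto3

cloudwatch = boto3.client("cloudwatch")

def report_backlog(queue_depth):
    # Publish the current queue depth as a custom CloudWatch metric;
    # the Auto Scaling policy grows or shrinks the instance pool from it.
    cloudwatch.put_metric_data(
        Namespace="Quentin",
        MetricData=[{
            "MetricName": "JobBacklog",
            "Value": float(queue_depth),
            "Unit": "Count",
        }],
    )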
18. Problem with time-centric approach
[Timeline diagram: Jobs A, B, and C scheduled at fixed wall-clock times (9am, midnight) over two days]
19. Solution
[Timeline diagram: the same Jobs A, B, and C on the two-day 9am/midnight timeline from the previous slide]
• Basically, make(1)
• Time/date is just another explicit parameter
• Jobs are triggered based on file existence/timestamp (see the sketch after this slide)
[Dependency diagram: Jobs A, B, and C, each parameterized by the same explicit date, D=2015-10-09]
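A minimal Python sketch of the make(1)-style rule, assuming boto3 and hypothetical bucket/key names: a job runs only when its inputs exist in S3 and its own output does not, and the date is an ordinary parameter:

import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")
BUCKET = "pipeline-data"  # hypothetical bucket name

def exists(key):
    try:
        s3.head_object(Bucket=BUCKET, Key=key)
        return True
    except ClientError:
        return False

def maybe_run(run, inputs, output, d="2015-10-09"):
    # Time/date is just another explicit parameter in the keys.
    inputs = [k.format(d=d) for k in inputs]
    output = output.format(d=d)
    if exists(output):
        return          # output already built; jobs are idempotent
    if all(exists(k) for k in inputs):
        run(d)          # all inputs present: launch the job's container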
23. Lessons learned
Not a hard problem, but easily complicated:
• Jobs depend on data (not other jobs)
• Time-based scheduling can be added later
• Idempotent jobs (ideally)
• Transactional success flag (a _SUCCESS file in S3; see the sketch after this list)
• Useful to have: dynamic dependency graphs
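A short sketch of the transactional success flag (bucket and key names are assumptions): write every output file first and a zero-byte _SUCCESS marker last, so downstream jobs that key off the marker never see a half-written dataset:

import boto3

s3 = boto3.client("s3")

def publish(bucket, prefix, parts):
    # parts maps file name -> bytes for one dataset partition
    for name, body in parts.items():
        s3.put_object(Bucket=bucket, Key=f"{prefix}/{name}", Body=body)
    # The marker goes last; its existence signals the dataset is complete.
    s3.put_object(Bucket=bucket, Key=f"{prefix}/_SUCCESS", Body=b"")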
30. Putting it all together
• Dependency management
• Resource management
• Deployment
31. Misc notes
• “Files in S3” is the only abstraction you really need
• No need for a distributed FS; pulling from Amazon S3 scales well (see the sketch after this list)
• Keep jobs small (minutes to hours)
• Storing data efficiently helps a lot
• Using bigger instances helps as well
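In practice, “pulling from Amazon S3 scales well” just means fanning out GETs from each instance; a hedged sketch with a thread pool (bucket, keys, and worker count are assumptions):

import boto3
from concurrent.futures import ThreadPoolExecutor

s3 = boto3.client("s3")  # boto3 clients are safe to share across threads

def fetch(bucket, key):
    return s3.get_object(Bucket=bucket, Key=key)["Body"].read()

def fetch_all(bucket, keys, workers=32):
    # Parallel GETs stand in for a distributed FS's read throughput.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda k: fetch(bucket, k), keys))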
32. Daily numbers
• Hundreds of the biggest Spot instances launched and killed
• 30 TB of RAM in the cluster (at peak)
• Hundreds of containers (1 min to 6 hr per container)
• Hundreds of billions of log lines analyzed
• Using R, C, Erlang, D, Python, Lua, JavaScript, and a custom DSL