[Research] deploying predictive models with the actor framework - Brian Gawalt

Build a better, faster, more efficient predictive API with the Actor model of programming. Latency, logging, and full utilization are all easily handled with this framework. Upwork's (formerly Elance-oDesk) freelancer availability model, which anticipates who's looking for work right now, is now a real-time service, built without a costly or complicated build-out of our stack or our datacenter, thanks to the Actor model.

  1. PAPIs 2015. Akka & Data Science: Making real-time predictions. Brian Gawalt. 2nd International Conference on Predictive APIs and Apps, August 7, 2015.
  2. [A] Sometimes, data scientists need to worry about throughput.
  3. [B] One way to increase throughput is with concurrency.
  4. [C] The Actor Model is an easy way to build a concurrent system.
  5. [D] Scala+Akka provides an easy-to-use Actor Model context.
  6. [A + B + C + D ⇒ E] Data scientists should check out Scala+Akka.
  7. Consider: building a model vs. using a model.
  8. Lots of ways to practice building a model.
  9. The Classic Process: (1) Load your data set's raw materials. (2) Produce feature vectors: training, validation, testing. (3) Build the model with training and validation vectors.
  10. The Classic Process: One-time Testing. Build: load train/valid./test materials → make train/valid./test feature vectors → train model. Use: make test predictions.
  11. The Classic Process: Repeated Testing. Build: load train/valid. materials → make train/valid. feature vectors → train model (saved model). Use (repeat every K minutes): load test/new materials → make test/new feature vectors → make test/new predictions.
  12. Sometimes my tasks work like that, too!
  13. But this talk is about the other kind of tasks.
  14. [A] Sometimes, data scientists need to worry about throughput.
  15. Example: Freelancer availability on
  16. Hiring Freelancers on Upwork: (1) Post a job. (2) Search for freelancers. (3) Find someone you like. (4) Ask them to interview: request accepted, or rejected/ignored. THE TASK: Look at recent freelancer behavior and predict, at Step 2, who's likely to accept an invite at Step 4.
  17. Building this model is business as usual:
  18. Building Availability Model: (1) Load raw materials: examples of accepts/rejects; histories of freelancer site activity (job applications sent or received, hours worked, click logs, profile updates). (2) Produce feature vectors. Diagram labels: Greenplum, Amazon S3, Internal Service.
  19. Using Availability Model: load train/valid. materials → make train/valid. feature vectors → train model (saved model); then, repeating every 60 minutes: load test/new materials → make test/new feature vectors → make test/new predictions.
  20. Using Availability Model (repeat every 60 minutes): load test/new materials → make test/new feature vectors → make test/new predictions (saved model). Loading steps: job app data (4 min.), click log data (30 min.), work hours data (5 min.), profile data (20 ms/profile).
  21. Using Availability Model: load job app data (4 min.), click log data (30 min.), work hours data (5 min.), profile data (20 ms/profile). Left with under 21 minutes to collect profile data: at the rate limit of 20 ms/profile, at most 63K profiles per hour. Six million freelancers need availability predictions: expect ~90 hours between re-scorings of any individual. Still need to spend time actually building vectors and exporting scores!
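
The time budget on slide 21 can be checked with a few lines of arithmetic. The sketch below is plain Scala using only the figures quoted on the slide; the object and value names are made up for illustration. It gives 21 minutes and 63,000 profiles per cycle, and roughly 95 hours for a full pass over six million freelancers, in line with the slide's "~90 hours".

```scala
object SerialBudget {
  // Figures quoted on slide 21.
  val cycleMinutes = 60
  val jobAppMinutes = 4
  val clickLogMinutes = 30
  val workHoursMinutes = 5
  val msPerProfile = 20L        // profile service rate limit
  val freelancers = 6000000L

  // Minutes left in each hourly cycle for fetching profiles.
  val profileMinutes: Int =
    cycleMinutes - jobAppMinutes - clickLogMinutes - workHoursMinutes

  // Profiles that fit into those minutes at the rate limit.
  val profilesPerCycle: Long = profileMinutes * 60L * 1000L / msPerProfile

  // Hours needed to re-score every freelancer once.
  val hoursPerFullPass: Double = freelancers.toDouble / profilesPerCycle

  def main(args: Array[String]): Unit = {
    println(s"minutes left for profiles: $profileMinutes")
    println(s"profiles per hourly cycle: $profilesPerCycle")
    println(s"hours per full pass: $hoursPerFullPass")
  }
}
```

This budget does not yet reserve any time for building vectors or exporting scores, which is the slide's closing caveat.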
  22. [B] One way to increase throughput is with concurrency.
  23. Expensive Option: Major infrastructure overhaul
  24. … but that takes a lot of time, attention, and cooperation…
  25. Simpler Option: The Actor Model
  26. [C] The Actor Model is an easy way to build a concurrent system.
  27. An Actor: Imagine a mailbox with a brain. Computation only begins when/if a message arrives. Keeps its thoughts private: no other actor can actively read this actor's state; other actors have to wait to hear a message from this actor.
  28. The Actor Model of Concurrency. Lots of Actors, and each has: a private message queue; private state, shared only by sending more messages. Execution context: manages threading of each Actor's computation; handles asynchronous message routing; can send prescheduled messages. Each received message's computation is fully completed before the Actor moves on to the next message in its queue.
  29. The Actor Model of Concurrency (diagram: Execution Context).
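
The properties listed on slides 27 and 28 (a private mailbox, private state, one message fully processed before the next) can be illustrated without any actor library. The following is a toy sketch in plain Scala on top of `java.util.concurrent`; it is not Akka and not the talk's code, and `CounterActor`, `Add`, `Report`, and `Stop` are hypothetical names. Four threads send messages concurrently, yet the counter needs no lock because only the actor's own worker thread ever touches it.

```scala
import java.util.concurrent.LinkedBlockingQueue

object ToyActorDemo {
  // The family of messages this toy actor understands.
  sealed trait Msg
  final case class Add(n: Int) extends Msg
  final case class Report(reply: Int => Unit) extends Msg
  case object Stop extends Msg

  // One private mailbox, one private counter, one worker thread.
  final class CounterActor {
    private val mailbox = new LinkedBlockingQueue[Msg]()
    private var total = 0 // read and written only by the worker thread

    private val worker = new Thread(() => {
      var running = true
      while (running) {
        // take() blocks until a message arrives; each message is
        // handled to completion before the next one is taken.
        mailbox.take() match {
          case Add(n)        => total += n
          case Report(reply) => reply(total)
          case Stop          => running = false
        }
      }
    })
    worker.start()

    // Sending a message is the only way to interact with the actor.
    def !(msg: Msg): Unit = mailbox.put(msg)
    def join(): Unit = worker.join()
  }

  // Four senders each post 1000 Add(1) messages; returns the final count.
  def run(): Int = {
    val actor = new CounterActor
    val senders = (1 to 4).map { _ =>
      new Thread(() => (1 to 1000).foreach(_ => actor ! Add(1)))
    }
    senders.foreach(_.start())
    senders.foreach(_.join())
    var answer = -1
    actor ! Report(t => answer = t)
    actor ! Stop
    actor.join()
    answer
  }

  def main(args: Array[String]): Unit = println(run())
}
```

A real execution context, such as Akka's, multiplexes many actors over a small thread pool instead of dedicating a thread to each one; the one-thread-per-actor design here is only to keep the sketch short.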
  30. Parallelizing predictions. Refresh job apps, Refresh click log, and Refresh work hours (each scheduled: fetch once per hour) send raw data to the Vectorizer. Fetch 10 profiles (scheduled: fetch every 300 ms) feeds profiles to the Vectorizer. Vectorizer: keeps copies of raw data; emits a vector for each new profile received. Final stage: apply model; export prediction.
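
One way the layout on slide 30 might be wired up with Akka is sketched below. This is an illustration under stated assumptions, not the talk's implementation: it assumes the Akka 2.6 classic actor API (`scheduleWithFixedDelay` was added in 2.6) with `akka-actor` on the classpath, and every message type, class name, stub loader, and the placeholder "model" is hypothetical.

```scala
import akka.actor.{Actor, ActorRef, ActorSystem, Props}
import scala.concurrent.ExecutionContext
import scala.concurrent.duration._

// Hypothetical message types; the slides do not show the real ones.
case object Tick
final case class RawData(source: String, rows: Map[Long, Double])
final case class Profiles(ids: Seq[Long])
final case class FeatureVector(id: Long, features: Seq[Double])

// Reloads one raw data set on every Tick and forwards it.
class Refresher(source: String, load: () => Map[Long, Double],
                vectorizer: ActorRef) extends Actor {
  def receive: PartialFunction[Any, Unit] = {
    case Tick => vectorizer ! RawData(source, load())
  }
}

// Hands the next batch of profile ids to the vectorizer on every Tick.
class ProfileFetcher(batchSize: Int, vectorizer: ActorRef) extends Actor {
  private var nextId = 0L
  def receive: PartialFunction[Any, Unit] = {
    case Tick =>
      val ids = (nextId until (nextId + batchSize)).toVector
      nextId += batchSize
      vectorizer ! Profiles(ids)
  }
}

// Keeps the latest copy of each raw data set; emits one vector per profile.
class Vectorizer(scorer: ActorRef) extends Actor {
  private val sources = Seq("jobApps", "clickLog", "workHours")
  private var raw = Map.empty[String, Map[Long, Double]]
  def receive: PartialFunction[Any, Unit] = {
    case RawData(source, rows) => raw += (source -> rows)
    case Profiles(ids) =>
      ids.foreach { id =>
        val features = sources.map(s => raw.get(s).flatMap(_.get(id)).getOrElse(0.0))
        scorer ! FeatureVector(id, features)
      }
  }
}

// Applies a placeholder "model" (the mean of the features) and exports the score.
class Scorer(export: (Long, Double) => Unit) extends Actor {
  def receive: PartialFunction[Any, Unit] = {
    case FeatureVector(id, features) => export(id, features.sum / features.size)
  }
}

object PipelineSketch {
  def main(args: Array[String]): Unit = {
    val system = ActorSystem("availability")
    implicit val ec: ExecutionContext = system.dispatcher
    val scorer = system.actorOf(
      Props(new Scorer((id, score) => println(s"freelancer $id -> $score"))), "scorer")
    val vectorizer = system.actorOf(Props(new Vectorizer(scorer)), "vectorizer")
    val stubLoader = () => Map(0L -> 1.0, 1L -> 0.5) // stand-in for a real data load
    Seq("jobApps", "clickLog", "workHours").foreach { source =>
      val refresher = system.actorOf(
        Props(new Refresher(source, stubLoader, vectorizer)), source)
      system.scheduler.scheduleWithFixedDelay(Duration.Zero, 1.hour, refresher, Tick)
    }
    val fetcher = system.actorOf(Props(new ProfileFetcher(10, vectorizer)), "fetcher")
    system.scheduler.scheduleWithFixedDelay(Duration.Zero, 300.millis, fetcher, Tick)
  }
}
```

The program runs until it is killed. The point of the layout is that a slow hourly refresh never blocks the 300 ms profile fetches: each actor works through its own mailbox, and the Vectorizer always uses whichever raw data it received most recently.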
  31. Serial processing (repeat every 60 minutes). Of each cycle, 55 min go to loading data and 5 min to making feature vectors and exporting predictions. Loading: refresh job apps (4 min), refresh work hours (5 min), refresh click log (30 min), leaving 55 - 4 - 5 - 30 = 16 min to fetch ~50K profiles.
  32. Serial processing (same timeline as slide 31). Throughput: 48K users/hr.
  33. Parallel Processing with Actors. [Timeline diagram across four threads, Thr. 1 to Thr. 4; message processing time not to scale. Tasks shown: Refresh job apps, Refresh click log, Refresh work hrs., Fetch pro., Rx data, Store, Vectorize, Export. Rates shown: 1/hr., 3/sec., as rx'ed. Legend: msg. sent, msg. rx'd.]
  34. Parallel Processing with Actors (same diagram as slide 33). Throughput: 180K users/hr.
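
The two throughput figures on slides 32 and 34 can be reproduced from the 20 ms/profile rate limit on slide 21. The slides do not say how the 180K figure was derived; the sketch below assumes it corresponds to fetching profiles at the rate limit for the entire hour, which happens to give exactly 180,000. Names are hypothetical.

```scala
object ThroughputComparison {
  val msPerProfile = 20L      // rate limit quoted on slide 21
  val freelancers = 6000000L  // "Six Million freelancers" on slide 21

  // Profiles that can be fetched in `minutes` at the rate limit.
  def profilesFetched(minutes: Long): Long = minutes * 60L * 1000L / msPerProfile

  // Hours to re-score every freelancer once at a given hourly throughput.
  def hoursPerFullPass(profilesPerHour: Long): Double =
    freelancers.toDouble / profilesPerHour

  // Serial: only 55 - 4 - 5 - 30 = 16 minutes of each hour go to fetching.
  val serialPerHour: Long = profilesFetched(55 - 4 - 5 - 30)

  // Parallel (assumption): fetching runs for the full 60 minutes.
  val parallelPerHour: Long = profilesFetched(60)

  def main(args: Array[String]): Unit = {
    println(s"serial:   $serialPerHour users/hr, ${hoursPerFullPass(serialPerHour)} h per full pass")
    println(s"parallel: $parallelPerHour users/hr, ${hoursPerFullPass(parallelPerHour)} h per full pass")
  }
}
```

Under these assumptions a full pass over six million freelancers drops from 125 hours to about 33 hours.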
  35. [D] Scala+Akka provides an easy-to-use Actor Model context.
  36. Message passing, scheduling, & computation behavior defined in 445 lines.
  37. Scala+Akka Actors: Create a Scala class and mix in the Actor trait. Implement the required partial function, receive: PartialFunction[Any, Unit]. Define the family of message objects this actor is planning to handle. Define behavior for each message case in receive.
  38. Scala+Akka Actors (annotated code listing). Mix in the same code used for export in the non-Actor version. Private, mutable state: stored scores. Private, mutable state: time of last export. If receiving new scores: store them! If storing lots of scores, or if it's been a while: upload what's stored, then erase it. If told to shut down, stop accepting new scores.
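
The code listing annotated on slide 38 is not part of this transcript. A minimal sketch of an actor with the described behavior might look like the following, assuming the Akka classic API (`akka-actor` 2.4 or later, for `system.terminate()`). The names `ScoreExporter`, `NewScores`, and `Shutdown` are hypothetical; the export code is passed in as a function here rather than mixed in as a trait; and flushing on shutdown is an assumption the slide does not state.

```scala
import akka.actor.{Actor, ActorSystem, Props}
import scala.collection.mutable

// The family of messages this actor handles (hypothetical names).
final case class NewScores(scores: Map[Long, Double])
case object Shutdown

// `upload` stands in for the export code shared with the non-actor version.
class ScoreExporter(upload: Map[Long, Double] => Unit,
                    maxStored: Int,
                    maxWaitMillis: Long) extends Actor {
  // Private, mutable state: invisible to every other actor.
  private val stored = mutable.Map.empty[Long, Double]
  private var lastExport = System.currentTimeMillis()
  private var accepting = true

  // Upload what's stored, then erase it.
  private def flush(): Unit = {
    if (stored.nonEmpty) upload(stored.toMap)
    stored.clear()
    lastExport = System.currentTimeMillis()
  }

  def receive: PartialFunction[Any, Unit] = {
    case NewScores(scores) if accepting =>
      stored ++= scores
      val waited = System.currentTimeMillis() - lastExport
      // Lots of scores stored, or it's been a while: export.
      if (stored.size >= maxStored || waited >= maxWaitMillis) flush()
    case NewScores(_) =>
      () // shut down: new scores are no longer accepted
    case Shutdown =>
      accepting = false
      flush() // assumption: export whatever is left before going quiet
  }
}

object ExporterDemo {
  def main(args: Array[String]): Unit = {
    val system = ActorSystem("demo")
    val exporter = system.actorOf(
      Props(new ScoreExporter(
        batch => println(s"uploaded ${batch.size} scores"),
        maxStored = 2,
        maxWaitMillis = 60000L)),
      "exporter")
    exporter ! NewScores(Map(1L -> 0.9))
    exporter ! NewScores(Map(2L -> 0.1)) // second score reaches maxStored
    exporter ! Shutdown
    Thread.sleep(500) // crude wait so the mailbox drains before shutdown
    system.terminate()
  }
}
```

Because the actor processes one message at a time, `stored` and `lastExport` can be plain mutable fields with no locking.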
  39. Scala+Akka Pros: Easy to get productive in the Scala language. SBT dependency management makes it easy to move to any box with a JRE. No global interpreter lock!
  40. Scala+Akka Cons: Moderate Scala learning curve. Object representation on the JVM has pretty lousy memory efficiency. Not a lot of great options for building models in Scala (compared to R, Python, Julia).
  41. [A] Sometimes, data scientists need to worry about throughput.
  42. [B] One way to increase throughput is with concurrency.
  43. [C] The Actor Model is an easy way to build a concurrent system.
  44. [D] Scala+Akka provides an easy-to-use Actor Model context.
  45. [A + B + C + D ⇒ Z] Data scientists should check out Scala+Akka.
  46. Thanks! Questions? bgawalt@{upwork, gmail}.com, twitter.com/bgawalt
