SlideShare a Scribd company logo
1 of 25
Building massive scale,
    fault tolerant,
job processing systems
    with Scala Akka
      framework
     Vignesh Sukumar
        SVCC 2012
About me

• Storage group, Backend Engineering at Box
• Love enterprise software!
• Interested in Big Data and building distributed
  systems in the cloud
About Box

• Leader in enterprise cloud collaboration and
  storage
• Cutting-edge work in backend, frontend,
  platform and engineering services
• A really fun place to work – we have a long
  slide!
Talk outline
• Job processing requirements
• Traditional & new models for job processing

• Akka actors framework
• Achieving and controlling high IO throughput
• Fine-grained fault tolerance
Typical architecture in a cloud storage
             environment
Practical realities

•Storage nodes are usually of varying
configurations (OS, processing power, storage
capacity, etc) mainly because of rapid evolution
in provisioning operations
•Some nodes are more over-worked than the
others (for ex, accepting live uploads)
•Billions of files; petabytes
Job processing requirements

• Iterate over all files (billions, petabyte scale):
  for ex, check consistency of all files

• High throughput

• Fault tolerant

• Secure
Traditional job processing model
Why traditional models fail in cloud
       storage environments
• Not scalable: petabyte scale, billions of files
• Insecure: cannot move files out of storage
  nodes
• No performance control: easy to overwhelm
  any storage node
• No fine grained fault tolerance
Compute on Storage

• Move job computation directly to storage
  nodes
• Utilize abundant CPU on storage nodes
• Metadata store still stays in a highly available
  system like a RDBMS
• Results from operations on a file are
  completely independent
Master – slave architecture
Benefits

• High IO throughput: Direct access; no transfer
  of files over a network
• Secure: files do not leave storage nodes
• Better performance control: compute can
  easily monitor system load and back off
• Better fault tolerance handling: finer grained
  handling of errors
Master node

• Responsible for accepting job submissions and
  splitting them to tasks for slave nodes
• Stateful: keeps durable copy of jobs and tasks
  in Zookeeper
• Horizontally scalable: service can be run on
  multiple nodes
Agent

• Runs directly on the storage nodes on a
  machine-independent JVM container
• Stateless: no task state is maintained
• Monitors system load with back-off
• Reports results directly to master without
  synchronizing with other agents
Implementation with the
  the Scala Akka Actor
       framework
Actors

• Concurrent threads abstraction with no
  shared state
• Exchange messages
• Asynchronous, non-blocking
• Multiple actors can map to a single OS thread
• Parent-children hierarchical relationship
Actors and messages
• Class MyActor extends Actor {
  def receive = {
    case MsgType1 => // do something
  }
}

// instantiation and sending messages
 val actorRef = system.actorOf(Props(new MyActor))
actorRef ! MsgType1
Agent Actor System
Achieving high IO throughput
• Parallel, asynchronous IO through “Futures”
val fileIOResult = Future {
  // issue high latency tasks like file IO
 }
val networkIOResult = Future { // read from network }

Futures.awaitAll(<wait time>, fileIOResult, networkIOResult)
fileIOResult onSuccess { // do something }
networkIOResult onFailure { // retry }
Controlling system throughput

• The problem: agents need to throttle
  themselves as storage nodes serve live traffic

• Adjust number of parallel workers dynamically
  through a monitoring service
Controlling throughput: Examples

•Parallelism parameters can be gotten from a
separate configuration service on a per node
basis
•Some machines can be speeded up and others
slowed down this way
•The configuration can be updated on a cron
schedule to speed up during weekends
Fine grained fault tolerance with
              Supervisors

• Parents of child actors can define specific
  fault-handling strategies for each failure
  scenario in their children
• Components can fail gracefully without
  affecting the entire system
Supervision strategy: Examples


Class TaskActor extends Actor {
  // create child workers
  override val supervisorStrategy = OneForOneStrategy(maxNrOrRetries = 3) {
   case SqlException => Resume // retry the same file
   case FileCorruptionException => Stop // don’t clobber it!
   case IOException => Restart // report and move on
}
Unit testing

• Scalatra test framework: very easy to read!
  TaskActorTest.receive(BadFileMsg) must throw
  FileNotFoundException
• Mocks for network and database calls
val mockHttp = mock[HttpExecutor]
TaskActorTest ! doHttpPost
there was atLeastOne(mockHttp).POST


• Extensive testing of failure injection scenarios
Takeaways
• Keep your architecture simple by modeling
  actor message flow along the same paths as
  parent-child actor hierarchy (i.e., no message
  exchange between peer child actors)
• Design and implement for component failures
• Write unit tests extensively: we did not have
  any fundamental level functionality breakage
• Box Engineering is awesome!

More Related Content

What's hot

Multithreading models.ppt
Multithreading models.pptMultithreading models.ppt
Multithreading models.pptLuis Goldster
 
Introduction to Node.js
Introduction to Node.jsIntroduction to Node.js
Introduction to Node.jsVikash Singh
 
Parallel computing chapter 3
Parallel computing chapter 3Parallel computing chapter 3
Parallel computing chapter 3Md. Mahedi Mahfuj
 
Inter Process Communication Presentation[1]
Inter Process Communication Presentation[1]Inter Process Communication Presentation[1]
Inter Process Communication Presentation[1]Ravindra Raju Kolahalam
 
Microservices Architecture Part 2 Event Sourcing and Saga
Microservices Architecture Part 2 Event Sourcing and SagaMicroservices Architecture Part 2 Event Sourcing and Saga
Microservices Architecture Part 2 Event Sourcing and SagaAraf Karsh Hamid
 
INTER PROCESS COMMUNICATION (IPC).pptx
INTER PROCESS COMMUNICATION (IPC).pptxINTER PROCESS COMMUNICATION (IPC).pptx
INTER PROCESS COMMUNICATION (IPC).pptxLECO9
 
virtualization and hypervisors
virtualization and hypervisorsvirtualization and hypervisors
virtualization and hypervisorsGaurav Suri
 
Azure Interview Questions And Answers | Azure Tutorial For Beginners | Azure ...
Azure Interview Questions And Answers | Azure Tutorial For Beginners | Azure ...Azure Interview Questions And Answers | Azure Tutorial For Beginners | Azure ...
Azure Interview Questions And Answers | Azure Tutorial For Beginners | Azure ...Edureka!
 
Introduction to Parallel and Distributed Computing
Introduction to Parallel and Distributed ComputingIntroduction to Parallel and Distributed Computing
Introduction to Parallel and Distributed ComputingSayed Chhattan Shah
 
Threads (operating System)
Threads (operating System)Threads (operating System)
Threads (operating System)Prakhar Maurya
 
TypeScript: coding JavaScript without the pain
TypeScript: coding JavaScript without the painTypeScript: coding JavaScript without the pain
TypeScript: coding JavaScript without the painSander Mak (@Sander_Mak)
 
Presentation on Segmentation
Presentation on SegmentationPresentation on Segmentation
Presentation on SegmentationPriyanka bisht
 
Introduction to APIs (Application Programming Interface)
Introduction to APIs (Application Programming Interface) Introduction to APIs (Application Programming Interface)
Introduction to APIs (Application Programming Interface) Vibhawa Nirmal
 
Event Sourcing & CQRS, Kafka, Rabbit MQ
Event Sourcing & CQRS, Kafka, Rabbit MQEvent Sourcing & CQRS, Kafka, Rabbit MQ
Event Sourcing & CQRS, Kafka, Rabbit MQAraf Karsh Hamid
 

What's hot (20)

Multithreading models.ppt
Multithreading models.pptMultithreading models.ppt
Multithreading models.ppt
 
Web services SOAP
Web services SOAPWeb services SOAP
Web services SOAP
 
Introduction to Node.js
Introduction to Node.jsIntroduction to Node.js
Introduction to Node.js
 
Parallel computing chapter 3
Parallel computing chapter 3Parallel computing chapter 3
Parallel computing chapter 3
 
Jvm Architecture
Jvm ArchitectureJvm Architecture
Jvm Architecture
 
Inter Process Communication Presentation[1]
Inter Process Communication Presentation[1]Inter Process Communication Presentation[1]
Inter Process Communication Presentation[1]
 
REST & RESTful Web Services
REST & RESTful Web ServicesREST & RESTful Web Services
REST & RESTful Web Services
 
Microservices Architecture Part 2 Event Sourcing and Saga
Microservices Architecture Part 2 Event Sourcing and SagaMicroservices Architecture Part 2 Event Sourcing and Saga
Microservices Architecture Part 2 Event Sourcing and Saga
 
INTER PROCESS COMMUNICATION (IPC).pptx
INTER PROCESS COMMUNICATION (IPC).pptxINTER PROCESS COMMUNICATION (IPC).pptx
INTER PROCESS COMMUNICATION (IPC).pptx
 
virtualization and hypervisors
virtualization and hypervisorsvirtualization and hypervisors
virtualization and hypervisors
 
Azure Interview Questions And Answers | Azure Tutorial For Beginners | Azure ...
Azure Interview Questions And Answers | Azure Tutorial For Beginners | Azure ...Azure Interview Questions And Answers | Azure Tutorial For Beginners | Azure ...
Azure Interview Questions And Answers | Azure Tutorial For Beginners | Azure ...
 
SQLite - Overview
SQLite - OverviewSQLite - Overview
SQLite - Overview
 
Introduction to Parallel and Distributed Computing
Introduction to Parallel and Distributed ComputingIntroduction to Parallel and Distributed Computing
Introduction to Parallel and Distributed Computing
 
Threads (operating System)
Threads (operating System)Threads (operating System)
Threads (operating System)
 
TypeScript: coding JavaScript without the pain
TypeScript: coding JavaScript without the painTypeScript: coding JavaScript without the pain
TypeScript: coding JavaScript without the pain
 
Presentation on Segmentation
Presentation on SegmentationPresentation on Segmentation
Presentation on Segmentation
 
Parallel Algorithms
Parallel AlgorithmsParallel Algorithms
Parallel Algorithms
 
Introduction to APIs (Application Programming Interface)
Introduction to APIs (Application Programming Interface) Introduction to APIs (Application Programming Interface)
Introduction to APIs (Application Programming Interface)
 
Event Sourcing & CQRS, Kafka, Rabbit MQ
Event Sourcing & CQRS, Kafka, Rabbit MQEvent Sourcing & CQRS, Kafka, Rabbit MQ
Event Sourcing & CQRS, Kafka, Rabbit MQ
 
Software Design Patterns
Software Design PatternsSoftware Design Patterns
Software Design Patterns
 

Similar to Massive scale job processing with Scala Akka framework

Stream Computing (The Engineer's Perspective)
Stream Computing (The Engineer's Perspective)Stream Computing (The Engineer's Perspective)
Stream Computing (The Engineer's Perspective)Ilya Ganelin
 
Agile Lab_BigData_Meetup_AKKA
Agile Lab_BigData_Meetup_AKKAAgile Lab_BigData_Meetup_AKKA
Agile Lab_BigData_Meetup_AKKAPaolo Platter
 
Distributed Model Validation with Epsilon
Distributed Model Validation with EpsilonDistributed Model Validation with Epsilon
Distributed Model Validation with EpsilonSina Madani
 
Typesafe stack - Scala, Akka and Play
Typesafe stack - Scala, Akka and PlayTypesafe stack - Scala, Akka and Play
Typesafe stack - Scala, Akka and PlayLuka Zakrajšek
 
Indic threads pune12-typesafe stack software development on the jvm
Indic threads pune12-typesafe stack software development on the jvmIndic threads pune12-typesafe stack software development on the jvm
Indic threads pune12-typesafe stack software development on the jvmIndicThreads
 
Latest (storage IO) patterns for cloud-native applications
Latest (storage IO) patterns for cloud-native applications Latest (storage IO) patterns for cloud-native applications
Latest (storage IO) patterns for cloud-native applications OpenEBS
 
Machine Learning With H2O vs SparkML
Machine Learning With H2O vs SparkMLMachine Learning With H2O vs SparkML
Machine Learning With H2O vs SparkMLArnab Biswas
 
Alluxio - Scalable Filesystem Metadata Services
Alluxio - Scalable Filesystem Metadata ServicesAlluxio - Scalable Filesystem Metadata Services
Alluxio - Scalable Filesystem Metadata ServicesAlluxio, Inc.
 
Graphene – Microsoft SCOPE on Tez
Graphene – Microsoft SCOPE on Tez Graphene – Microsoft SCOPE on Tez
Graphene – Microsoft SCOPE on Tez DataWorks Summit
 
Enhanced Reframework Session_16-07-2022.pptx
Enhanced Reframework Session_16-07-2022.pptxEnhanced Reframework Session_16-07-2022.pptx
Enhanced Reframework Session_16-07-2022.pptxRohit Radhakrishnan
 
MongoDB: How We Did It – Reanimating Identity at AOL
MongoDB: How We Did It – Reanimating Identity at AOLMongoDB: How We Did It – Reanimating Identity at AOL
MongoDB: How We Did It – Reanimating Identity at AOLMongoDB
 
Reactive programming with examples
Reactive programming with examplesReactive programming with examples
Reactive programming with examplesPeter Lawrey
 
Case Study: Migrating Hyperic from EJB to Spring from JBoss to Apache Tomcat
Case Study: Migrating Hyperic from EJB to Spring from JBoss to Apache TomcatCase Study: Migrating Hyperic from EJB to Spring from JBoss to Apache Tomcat
Case Study: Migrating Hyperic from EJB to Spring from JBoss to Apache TomcatVMware Hyperic
 
OracleStore: A Highly Performant RawStore Implementation for Hive Metastore
OracleStore: A Highly Performant RawStore Implementation for Hive MetastoreOracleStore: A Highly Performant RawStore Implementation for Hive Metastore
OracleStore: A Highly Performant RawStore Implementation for Hive MetastoreDataWorks Summit
 
John adams talk cloudy
John adams   talk cloudyJohn adams   talk cloudy
John adams talk cloudyJohn Adams
 

Similar to Massive scale job processing with Scala Akka framework (20)

Stream Computing (The Engineer's Perspective)
Stream Computing (The Engineer's Perspective)Stream Computing (The Engineer's Perspective)
Stream Computing (The Engineer's Perspective)
 
Agile Lab_BigData_Meetup_AKKA
Agile Lab_BigData_Meetup_AKKAAgile Lab_BigData_Meetup_AKKA
Agile Lab_BigData_Meetup_AKKA
 
Distributed Model Validation with Epsilon
Distributed Model Validation with EpsilonDistributed Model Validation with Epsilon
Distributed Model Validation with Epsilon
 
Typesafe stack - Scala, Akka and Play
Typesafe stack - Scala, Akka and PlayTypesafe stack - Scala, Akka and Play
Typesafe stack - Scala, Akka and Play
 
Indic threads pune12-typesafe stack software development on the jvm
Indic threads pune12-typesafe stack software development on the jvmIndic threads pune12-typesafe stack software development on the jvm
Indic threads pune12-typesafe stack software development on the jvm
 
Scaling tappsi
Scaling tappsiScaling tappsi
Scaling tappsi
 
Fastest Servlets in the West
Fastest Servlets in the WestFastest Servlets in the West
Fastest Servlets in the West
 
Fault tolerance
Fault toleranceFault tolerance
Fault tolerance
 
Latest (storage IO) patterns for cloud-native applications
Latest (storage IO) patterns for cloud-native applications Latest (storage IO) patterns for cloud-native applications
Latest (storage IO) patterns for cloud-native applications
 
Machine Learning With H2O vs SparkML
Machine Learning With H2O vs SparkMLMachine Learning With H2O vs SparkML
Machine Learning With H2O vs SparkML
 
Alluxio - Scalable Filesystem Metadata Services
Alluxio - Scalable Filesystem Metadata ServicesAlluxio - Scalable Filesystem Metadata Services
Alluxio - Scalable Filesystem Metadata Services
 
Graphene – Microsoft SCOPE on Tez
Graphene – Microsoft SCOPE on Tez Graphene – Microsoft SCOPE on Tez
Graphene – Microsoft SCOPE on Tez
 
Enhanced Reframework Session_16-07-2022.pptx
Enhanced Reframework Session_16-07-2022.pptxEnhanced Reframework Session_16-07-2022.pptx
Enhanced Reframework Session_16-07-2022.pptx
 
MongoDB: How We Did It – Reanimating Identity at AOL
MongoDB: How We Did It – Reanimating Identity at AOLMongoDB: How We Did It – Reanimating Identity at AOL
MongoDB: How We Did It – Reanimating Identity at AOL
 
Reactive programming with examples
Reactive programming with examplesReactive programming with examples
Reactive programming with examples
 
DataOps with Project Amaterasu
DataOps with Project AmaterasuDataOps with Project Amaterasu
DataOps with Project Amaterasu
 
Case Study: Migrating Hyperic from EJB to Spring from JBoss to Apache Tomcat
Case Study: Migrating Hyperic from EJB to Spring from JBoss to Apache TomcatCase Study: Migrating Hyperic from EJB to Spring from JBoss to Apache Tomcat
Case Study: Migrating Hyperic from EJB to Spring from JBoss to Apache Tomcat
 
Road Trip To Component
Road Trip To ComponentRoad Trip To Component
Road Trip To Component
 
OracleStore: A Highly Performant RawStore Implementation for Hive Metastore
OracleStore: A Highly Performant RawStore Implementation for Hive MetastoreOracleStore: A Highly Performant RawStore Implementation for Hive Metastore
OracleStore: A Highly Performant RawStore Implementation for Hive Metastore
 
John adams talk cloudy
John adams   talk cloudyJohn adams   talk cloudy
John adams talk cloudy
 

Recently uploaded

Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsMemoori
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machinePadma Pradeep
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024The Digital Insurer
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxhariprasad279825
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr LapshynFwdays
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsMiki Katsuragi
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Wonjun Hwang
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 

Recently uploaded (20)

Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptx
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 

Massive scale job processing with Scala Akka framework

  • 1. Building massive scale, fault tolerant, job processing systems with Scala Akka framework Vignesh Sukumar SVCC 2012
  • 2. About me • Storage group, Backend Engineering at Box • Love enterprise software! • Interested in Big Data and building distributed systems in the cloud
  • 3. About Box • Leader in enterprise cloud collaboration and storage • Cutting-edge work in backend, frontend, platform and engineering services • A really fun place to work – we have a long slide!
  • 4. Talk outline • Job processing requirements • Traditional & new models for job processing • Akka actors framework • Achieving and controlling high IO throughput • Fine-grained fault tolerance
  • 5. Typical architecture in a cloud storage environment
  • 6. Practical realities •Storage nodes are usually of varying configurations (OS, processing power, storage capacity, etc) mainly because of rapid evolution in provisioning operations •Some nodes are more over-worked than the others (for ex, accepting live uploads) •Billions of files; petabytes
  • 7. Job processing requirements • Iterate over all files (billions, petabyte scale): for ex, check consistency of all files • High throughput • Fault tolerant • Secure
  • 9. Why traditional models fail in cloud storage environments • Not scalable: petabyte scale, billions of files • Insecure: cannot move files out of storage nodes • No performance control: easy to overwhelm any storage node • No fine grained fault tolerance
  • 10. Compute on Storage • Move job computation directly to storage nodes • Utilize abundant CPU on storage nodes • Metadata store still stays in a highly available system like a RDBMS • Results from operations on a file are completely independent
  • 11. Master – slave architecture
  • 12. Benefits • High IO throughput: Direct access; no transfer of files over a network • Secure: files do not leave storage nodes • Better performance control: compute can easily monitor system load and back off • Better fault tolerance handling: finer grained handling of errors
  • 13. Master node • Responsible for accepting job submissions and splitting them to tasks for slave nodes • Stateful: keeps durable copy of jobs and tasks in Zookeeper • Horizontally scalable: service can be run on multiple nodes
  • 14. Agent • Runs directly on the storage nodes on a machine-independent JVM container • Stateless: no task state is maintained • Monitors system load with back-off • Reports results directly to master without synchronizing with other agents
  • 15. Implementation with the the Scala Akka Actor framework
  • 16. Actors • Concurrent threads abstraction with no shared state • Exchange messages • Asynchronous, non-blocking • Multiple actors can map to a single OS thread • Parent-children hierarchical relationship
  • 17. Actors and messages • Class MyActor extends Actor { def receive = { case MsgType1 => // do something } } // instantiation and sending messages val actorRef = system.actorOf(Props(new MyActor)) actorRef ! MsgType1
  • 19. Achieving high IO throughput • Parallel, asynchronous IO through “Futures” val fileIOResult = Future { // issue high latency tasks like file IO } val networkIOResult = Future { // read from network } Futures.awaitAll(<wait time>, fileIOResult, networkIOResult) fileIOResult onSuccess { // do something } networkIOResult onFailure { // retry }
  • 20. Controlling system throughput • The problem: agents need to throttle themselves as storage nodes serve live traffic • Adjust number of parallel workers dynamically through a monitoring service
  • 21. Controlling throughput: Examples •Parallelism parameters can be gotten from a separate configuration service on a per node basis •Some machines can be speeded up and others slowed down this way •The configuration can be updated on a cron schedule to speed up during weekends
  • 22. Fine grained fault tolerance with Supervisors • Parents of child actors can define specific fault-handling strategies for each failure scenario in their children • Components can fail gracefully without affecting the entire system
  • 23. Supervision strategy: Examples Class TaskActor extends Actor { // create child workers override val supervisorStrategy = OneForOneStrategy(maxNrOrRetries = 3) { case SqlException => Resume // retry the same file case FileCorruptionException => Stop // don’t clobber it! case IOException => Restart // report and move on }
  • 24. Unit testing • Scalatra test framework: very easy to read! TaskActorTest.receive(BadFileMsg) must throw FileNotFoundException • Mocks for network and database calls val mockHttp = mock[HttpExecutor] TaskActorTest ! doHttpPost there was atLeastOne(mockHttp).POST • Extensive testing of failure injection scenarios
  • 25. Takeaways • Keep your architecture simple by modeling actor message flow along the same paths as parent-child actor hierarchy (i.e., no message exchange between peer child actors) • Design and implement for component failures • Write unit tests extensively: we did not have any fundamental level functionality breakage • Box Engineering is awesome!

Editor's Notes

  1. 1. Example of a job is to check consistency of all the files: this will involve iterating over every file on all storage nodes, reading file and verifying content integrity.
  2. Scalability: non-performant because of the IO bottleneck in getting files to the application cluster Insecure: application clusters can store the files locally. It’s easy to melt a single a storage node by reading or writing a lot to it Cannot perform fine grained fault tolerance