Things might go wrong in a
data-intensive application
Petertc Chu | PyConline AU 2021
Scope
Applications deal with huge volumes of data
- Web applications, mobile apps, IoT...
Challenges
- “the quantity of data, the complexity of data, the speed
at which it is changing”
Key factors
- Scalability, Reliability
(dataintensive.net)
About me
Research engineer and Pythonista from Taiwan
Working on data infrastructures for ten years
kiwislife.com
The case
Host and manage UGC (User-generated content) with various usage patterns
- Streaming, IoT data aggregation, file distribution, archiving...
- ~10PiB raw capacity
- Processing several TiBs per day
We can cover a football field if we put all our disks on the ground
Concepts
- Structured data store: sharding / partitioning, RDBMS clusters, NoSQL...
- Cache layer
- Unstructured data store: various kinds of DFSs, heterogeneous storage media
- Application servers
- Job processing systems, other subsystems
- Various usage patterns
Incident #1
What happened?
Thousands of IoT devices push data to our cluster 24/7/365, and they experienced:
- Error rate: ~30%
- Avg RTT: 39.005 s
The build up
DB race condition
- Optimistic locking doesn’t help in this pattern (W >> R)
[Diagram: IoT devices, application servers, databases. Contention occurred! 😱 😡]
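To see why optimistic locking struggles when writes dominate, here is a minimal sketch using a version column and a compare-and-swap style UPDATE. The `items` table, its columns, and the use of SQLite are assumptions for illustration only; the slides do not say which database or schema was involved.

```python
import sqlite3


def optimistic_update(conn, row_id, new_value, read_version):
    """Write only if the row still carries the version we read earlier."""
    cur = conn.execute(
        "UPDATE items SET value = ?, version = version + 1 "
        "WHERE id = ? AND version = ?",
        (new_value, row_id, read_version),
    )
    conn.commit()
    # Zero rows touched means another writer got there first.
    return cur.rowcount == 1


# Hypothetical schema, in memory, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE items (id INTEGER PRIMARY KEY, value TEXT, version INTEGER)"
)
conn.execute("INSERT INTO items VALUES (1, 'init', 0)")
conn.commit()

# Two writers read version 0, then both try to write.
a_ok = optimistic_update(conn, 1, "from A", 0)
b_ok = optimistic_update(conn, 1, "from B", 0)
final_value, final_version = conn.execute(
    "SELECT value, version FROM items WHERE id = 1"
).fetchone()
print(a_ok, b_ok, final_value, final_version)
```

Only one of the two writers succeeds; the other has to re-read and retry. With thousands of devices writing to the same rows, most attempts would end up in that retry path, which is consistent with the error rate and latency reported on the previous slide.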
The build up
Pessimistic locking is too expensive for other usage patterns
[Diagram: IoT devices, application servers, databases. Global locking is implemented; requests queue up (🚘) and other users are unhappy (😡), while the IoT devices are fine (👍).]
The build up
Final: a hybrid / adaptive approach
- Only do pessimistic locking for specific operations
- Lock locally by default
- Switch to global locking for a specific resource automatically when a collision is detected
- (switch back after a certain duration)
- Keep using optimistic locking otherwise
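The local/global switch in the list above could be sketched roughly as follows. This is not the speaker's implementation: `AdaptiveLockManager`, `report_collision`, the injected clock, and the in-process stand-in for a global lock are all hypothetical names and choices. A real global lock would be a distributed lock shared by every application server.

```python
import threading
import time


class AdaptiveLockManager:
    """Local locks by default; per-resource escalation to a global lock."""

    def __init__(self, global_lock_factory, escalation_seconds=60.0,
                 clock=time.monotonic):
        self._global_lock_factory = global_lock_factory
        self._escalation_seconds = escalation_seconds
        self._clock = clock
        self._local_locks = {}
        self._escalated_until = {}
        self._guard = threading.Lock()

    def report_collision(self, resource):
        """Called when a collision on `resource` is detected."""
        with self._guard:
            deadline = self._clock() + self._escalation_seconds
            self._escalated_until[resource] = deadline

    def mode(self, resource):
        """Return 'global' while escalated, otherwise 'local'."""
        with self._guard:
            until = self._escalated_until.get(resource)
            if until is not None and self._clock() < until:
                return "global"
            # Escalation expired (or never happened): switch back.
            self._escalated_until.pop(resource, None)
            return "local"

    def lock_for(self, resource):
        if self.mode(resource) == "global":
            return self._global_lock_factory(resource)
        with self._guard:
            return self._local_locks.setdefault(resource, threading.Lock())


# Stand-in for a distributed lock service (assumption for this sketch).
_fake_global_locks = {}


def fake_global_lock(resource):
    return _fake_global_locks.setdefault(resource, threading.Lock())


fake_now = [0.0]  # controllable clock so the example is deterministic
manager = AdaptiveLockManager(fake_global_lock, escalation_seconds=30.0,
                              clock=lambda: fake_now[0])

mode_before = manager.mode("device-42")
manager.report_collision("device-42")
mode_after_collision = manager.mode("device-42")
with manager.lock_for("device-42"):
    pass  # the critical section would run here
fake_now[0] = 31.0
mode_after_cooldown = manager.mode("device-42")
print(mode_before, mode_after_collision, mode_after_cooldown)
```

The sketch ignores the hand-over moment, when one caller still holds a local lock while another has already escalated to the global one. A production version would need to handle that transition explicitly.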
The build up
Final: a hybrid / adaptive approach
[Diagram: IoT devices and other users, application servers each holding a local lock, an optional (global lock), databases. Everyone: 👍]
Root cause #scalability
We didn’t design for a usage pattern and workload like that
Action taken
- Test concurrency scenarios before each release
- Introduce observability and proactive monitoring systems for quick incident
detection and diagnosis
Incident #2
What
happened?
We have an advanced data management feature
- Not production ready, just a prototype
- No one used it for several years
One day, a user discovered it and made a million
times more requests to this subsystem!!
The build up
We needed some kind of distributed solution to handle this.
- resque: a Redis-backed framework for creating background jobs
https://github.blog/2009-11-03-introducing-resque/ https://gist.github.com/defunkt/225369
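The general enqueue/worker pattern behind a Redis-backed job framework can be sketched in a few lines. This is not resque's API: the `deque` stands in for a Redis list, and `enqueue`, `work_one`, and `rebuild_index` are hypothetical names chosen for illustration.

```python
import json
from collections import deque

# Stand-in for a Redis list shared by producers and workers (assumption).
QUEUE = deque()


def enqueue(job_name, *args):
    """Producer side: serialize the job and push it onto the queue."""
    QUEUE.append(json.dumps({"job": job_name, "args": list(args)}))


def rebuild_index(dataset_id):
    """Hypothetical background job."""
    return f"rebuilt {dataset_id}"


# Registry mapping job names to callables, so payloads stay plain data.
JOBS = {"rebuild_index": rebuild_index}


def work_one():
    """Worker side: pop one job and run it; return None if the queue is empty."""
    if not QUEUE:
        return None
    payload = json.loads(QUEUE.popleft())
    return JOBS[payload["job"]](*payload["args"])


enqueue("rebuild_index", 7)
enqueue("rebuild_index", 8)
results = [work_one(), work_one(), work_one()]
print(results)
```

Because the payload is plain JSON, workers can run on any number of machines that can reach the shared queue, which is what lets the subsystem absorb an unexpected load spike.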
Root cause #scalability
Load exceeds expectations
Action taken
- All batch processing subsystems are now implemented in a distributed way
Incident #3
What
happened?
A supplier built a data protection subsystem for us
...after we deployed it...
Users complained about data corruption!!
The build up
Defective padding in the encryption process
Example 1:
Input data: “DD” * 12
Expected result:
| DD DD DD DD DD DD DD DD | DD DD DD DD 04 04 04 04 |
Example 2:
Input data: “DD” * 16
Expected result:
| DD DD DD DD DD DD DD DD | DD DD DD DD DD DD DD DD |
| 10 10 10 10 10 10 10 10 | 10 10 10 10 10 10 10 10 |
Incorrect result:
| DD DD DD DD DD DD DD DD | DD DD DD DD DD DD DD DD |
(If the length of the original data is an integer multiple of the block size B,
then an extra block of bytes with value B is added. B is 16 in this case, which is 10 in the hexadecimal notation used above.)
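The padding rule quoted above can be written out as a short sketch. `pad_correct` follows the rule as stated; `pad_defective` is only one plausible way to reproduce the incorrect result on the slide, since the supplier's actual code is not shown.

```python
BLOCK_SIZE = 16


def pad_correct(data: bytes, block_size: int = BLOCK_SIZE) -> bytes:
    """Always add 1..block_size padding bytes, each equal to the pad length."""
    pad_len = block_size - (len(data) % block_size)
    return data + bytes([pad_len]) * pad_len


def pad_defective(data: bytes, block_size: int = BLOCK_SIZE) -> bytes:
    """Assumed bug: adds no padding when the input is already block-aligned."""
    pad_len = (-len(data)) % block_size  # 0 for aligned input
    return data + bytes([pad_len]) * pad_len


def unpad(padded: bytes) -> bytes:
    """Strip padding by trusting the last byte as the pad length."""
    pad_len = padded[-1]
    return padded[:-pad_len]


example_1 = pad_correct(b"\xdd" * 12)   # 12 data bytes + 4 bytes of 0x04
example_2 = pad_correct(b"\xdd" * 16)   # 16 data bytes + 16 bytes of 0x10
broken = pad_defective(b"\xdd" * 16)    # 16 data bytes, no padding block

print(example_1.hex(" "))
print(example_2.hex(" "))
print(broken.hex(" "))
# Unpadding the defective output treats a data byte (0xDD) as the pad length.
print(unpad(broken))
```

With the defective variant, the decryption side misreads the last data byte as a padding length, so block-aligned inputs come back truncated. Inputs of any other length are unaffected, which is why the bug is easy to miss.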
The build up
Design a process to fix all affected data
- List all affected records from DBs
- Read corresponding data with an “incorrect” decryption algorithm
- Write data back with a correct encryption algorithm
Id | Size              | Encryption method       | Version number    | Data reference key
1  | 32                | (Not encrypted)         | 0                 | aaa
2  | 6                 | Non-defective algorithm | 0                 | bbb
3  | 5 (not affected)  | Defective algorithm     | 0                 | ccc
4  | 32 (affected)     | Defective algorithm     | 1 (fixed)         | ddd
5  | 64 (affected)     | Defective algorithm     | 0 (not yet fixed) | eee
Only the last one needs a fix (block size = 16)
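The selection rule implied by the table can be sketched as a simple predicate. The `Record` fields and the `"defective"` / `"correct"` labels are hypothetical names for this sketch; the real schema is only known from the column headings above.

```python
from dataclasses import dataclass
from typing import Optional

BLOCK_SIZE = 16


@dataclass
class Record:
    id: int
    size: int
    method: Optional[str]  # None means "not encrypted"
    version: int           # 0 = original write, 1 = already rewritten
    key: str


def needs_fix(record: Record) -> bool:
    """Affected: defective algorithm, block-aligned size, not yet rewritten."""
    return (
        record.method == "defective"
        and record.size % BLOCK_SIZE == 0
        and record.version == 0
    )


# The five rows from the table above.
RECORDS = [
    Record(1, 32, None, 0, "aaa"),
    Record(2, 6, "correct", 0, "bbb"),
    Record(3, 5, "defective", 0, "ccc"),
    Record(4, 32, "defective", 1, "ddd"),
    Record(5, 64, "defective", 0, "eee"),
]

to_fix = [r.id for r in RECORDS if needs_fix(r)]
print(to_fix)
```

Bumping the version number after each rewrite makes the repair job safe to re-run: a record that was already fixed is skipped on the next pass.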
The build up
Just a silly bug, if it didn’t affect…
- Millions of user records
We set up a job processing system to correct all affected data in our system
gearman [Gearman Job Server] https://github.com/Yelp/python-gearman
Root cause #reliability #softwareFaults
1. Unreliable solution provider
2. Less than a 1% chance of finding the bug by testing
Action taken
- Not outsourcing anymore
- More comprehensive tests with various kinds of scenarios
- ~10 TiB test dataset
Incident #4
What
happened?
To maintain reliability, we
- Replicate user data multiple times
- Distribute replicas to different failure domains
(different host/data center)
Data still lost!!
http://dx.doi.org/10.6861/tanet.201810.0398
The build up
Our system balances load by writing data to nodes that have more free resources
- A newly added node generally has more free resources
- As a result, data tends to be placed on new nodes
Data was written to unreliable, newly added nodes and was lost even though the replicas were
distributed across different failure domains.
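A toy placement policy shows how this can happen. The sketch below is hypothetical and is not the system's real algorithm: it picks the nodes with the most free capacity, one per failure domain, and every name and number in it is made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    failure_domain: str
    free_tib: float
    is_new: bool


def place_replicas(nodes, replicas=3):
    """Pick the emptiest nodes, at most one per failure domain."""
    chosen, used_domains = [], set()
    for node in sorted(nodes, key=lambda n: n.free_tib, reverse=True):
        if node.failure_domain not in used_domains:
            chosen.append(node)
            used_domains.add(node.failure_domain)
        if len(chosen) == replicas:
            break
    return chosen


# Hypothetical cluster: one old and one freshly added node per data center.
NODES = [
    Node("old-1", "dc1", 12.0, False),
    Node("old-2", "dc2", 15.0, False),
    Node("old-3", "dc3", 9.0, False),
    Node("new-1", "dc1", 100.0, True),
    Node("new-2", "dc2", 100.0, True),
    Node("new-3", "dc3", 100.0, True),
]

placement = place_replicas(NODES)
print([n.name for n in placement])
```

The replicas do land in three different failure domains, yet all of them sit on nodes that have not been proven in production. If early-life hardware failures are correlated across a batch of new nodes, failure-domain separation alone does not protect the data.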
Topic: Electronic/Electrical Reliability (cmu.edu)
Root cause #reliability #hardwareFaults
It’s hard to prevent data loss completely
- Modeling or simulation cannot truly reflect situations in
the real world
Action taken
- Do more stability tests on newly added nodes
- Add a batch of new nodes each time, so the system is less
likely to write data to an unreliable node
http://dx.doi.org/10.6861/tanet.201810.0398
What do we learn
from these
incidents?🤔
#1 “There is unfortunately no easy fix for
making applications reliable, scalable”
- No way to enumerate all possible reliability causes (hardware faults,
software faults, human errors)
- Usage patterns and load keep changing as your business
expands, so you cannot have an ultimate scalability design beforehand
#2 Before trying to build a faultless
architecture, think twice
- Consider maintainability
- We need a team to sustain a large-scale system, not just a talented engineer
(dataintensive.net)
#3 Service = human beings + machines
Thank you! 🙏🙏🙏
@petertc_chu
