PyConline AU 2021 - Things might go wrong in a data-intensive application

Things might go wrong in a
data-intensive application
Petertc Chu | PyConline AU 2021

Scope
Applications that deal with huge volumes of data
- Web applications, mobile apps, IoT...
Challenges
- “the quantity of data, the complexity of data, the speed
at which it is changing”
Key factors
- Scalability, Reliability
(dataintensive.net)

About me
Research engineer and Pythonista from Taiwan
Working on data infrastructures for ten years
kiwislife.com

The case
Host and manage UGC (User-generated content) with various usage patterns
- Streaming, IoT data aggregation, file distribution, archiving...
- ~10PiB raw capacity
- Processing several TiBs per day
We can cover a football field if we put all our disks on the ground

Concepts
- Structured data store: sharding / partitioning, RDBMS clusters, NoSQL...
- Cache layer
- Unstructured data store: various kinds of DFSs, heterogeneous storage media
- Application servers
- Job processing systems
- Other subsystems
- Various usage patterns

What happened?
Thousands of IoT devices push data to our cluster 24/7/365, and we got
- Error rate: ~30%
- Avg RTT: 39.005s

The build up
DB race condition
- Optimistic locking doesn’t help in this pattern (W >> R)
(Diagram: IoT devices, application servers, databases. Contention occurred! 😱 😡)
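Why optimistic locking struggles when writes dominate can be seen in a minimal sketch. It assumes a version-column style of optimistic locking; the `Row` class and `compare_and_set` helper are hypothetical names for illustration, not the system described in the talk.

```python
from dataclasses import dataclass


@dataclass
class Row:
    value: int = 0
    version: int = 0


def compare_and_set(row: Row, expected_version: int, new_value: int) -> bool:
    """Commit only if nobody else has written since we read the row."""
    if row.version != expected_version:
        return False  # conflict: the caller must re-read and retry
    row.value = new_value
    row.version += 1
    return True


row = Row()

# Two writers read the same snapshot before either one commits.
seen_by_a = row.version
seen_by_b = row.version

a_committed = compare_and_set(row, seen_by_a, new_value=1)
b_committed = compare_and_set(row, seen_by_b, new_value=2)

print(a_committed, b_committed, row)
```

Only the first writer commits; the second has to re-read and retry. With many devices writing to the same rows, most attempts end up in that retry path.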

The build up
Pessimistic locking is too expensive for other usage patterns
(Diagram: IoT devices, other users, application servers, databases. Label: "Implement global locking". A queue of 🚘 builds up. Reactions: 😡 😡 😡 👍)

The build up
Final: a hybrid / adaptive approach
- Only do pessimistic locking for specific operations
- Lock locally by default
- Switch to global locking for a specific resource automatically when a collision is detected
- (switch back after a certain duration)
- Keep using optimistic locking otherwise

The build up
Final: a hybrid / adaptive approach
(Diagram: IoT devices, other users, application servers, databases. Three "local lock" labels and one "(Global lock)" label. Reactions: 👍 👍 👍 👍)

Root cause #scalability
We did not design for a usage pattern and workload like that
Action taken
- Test concurrency scenarios before each release
- Introduce observability and proactive monitoring systems for quick incident detection and diagnosis

What happened?
We have an advanced data management feature
- Not production ready, just a prototype
- No one had used it for several years
One day, a user discovered it and made a million times more requests to this subsystem!!

The build up
We needed some kind of distributed solution to handle this.
- resque: a Redis-backed framework for creating background jobs
https://github.blog/2009-11-03-introducing-resque/ https://gist.github.com/defunkt/225369
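The underlying pattern (producers enqueue jobs, a pool of workers pulls them off) can be sketched with the standard library alone. This is an in-process illustration only: `queue.Queue` stands in for the Redis-backed queue, threads stand in for worker processes on other machines, and squaring a number is a placeholder for the real job.

```python
import queue
import threading

jobs = queue.Queue()          # stand-in for the shared, Redis-backed queue
results = []
results_lock = threading.Lock()


def worker():
    while True:
        job = jobs.get()
        if job is None:       # sentinel: no more work for this worker
            jobs.task_done()
            break
        with results_lock:
            results.append(job * job)   # placeholder for the real processing
        jobs.task_done()


workers = [threading.Thread(target=worker) for _ in range(4)]
for w in workers:
    w.start()

for n in range(10):           # the producer side: enqueue jobs
    jobs.put(n)
for _ in workers:             # one sentinel per worker
    jobs.put(None)

jobs.join()
for w in workers:
    w.join()

print(sorted(results))
```

Because workers only share the queue, more of them can be added when load grows, which is the property a distributed job framework provides across machines.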

Root cause #scalability
Load exceeds expectations
Action taken
- All batch processing subsystems are now implemented in a distributed way

What
happened?
A supplier built a data protection subsystem for us
...after we deployed it...
Users complained of data corruption!!

The build up
Defective padding in the encryption process
Example 1:
Input data: “DD” * 12
Expected result:
| DD DD DD DD DD DD DD DD | DD DD DD DD 04 04 04 04 |
Example 2:
Input data: “DD” * 16
Expected result:
| DD DD DD DD DD DD DD DD | DD DD DD DD DD DD DD DD |
| 10 10 10 10 10 10 10 10 | 10 10 10 10 10 10 10 10 |
Incorrect result:
| DD DD DD DD DD DD DD DD | DD DD DD DD DD DD DD DD |
(If the length of the original data is an integer multiple of the block size B,
then an extra block of bytes with value B is added. B is 16 in this case, which is 10 in hex.)
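The padding rule in the note can be sketched in a few lines. `pad` follows the rule as stated; `defective_pad` is a hypothetical reconstruction of the bug (skipping the extra block for aligned input), and the naive `unpad` is only there to show one way the missing block can corrupt data. How the corruption actually surfaced in production is not described in the slides.

```python
BLOCK_SIZE = 16


def pad(data: bytes, block_size: int = BLOCK_SIZE) -> bytes:
    """Correct rule: always append 1..block_size bytes, each equal to the pad length."""
    pad_len = block_size - len(data) % block_size
    return data + bytes([pad_len]) * pad_len


def defective_pad(data: bytes, block_size: int = BLOCK_SIZE) -> bytes:
    """Assumed bug: aligned input gets no padding block at all."""
    pad_len = (block_size - len(data) % block_size) % block_size
    return data + bytes([pad_len]) * pad_len


def unpad(padded: bytes) -> bytes:
    """Naive unpadding: trust the last byte as the pad length."""
    return padded[:-padded[-1]]


example_1 = b"\xdd" * 12
example_2 = b"\xdd" * 16

print(pad(example_1).hex(" "))
print(pad(example_2).hex(" "))
print(defective_pad(example_2).hex(" "))
print(unpad(pad(example_2)) == example_2)
print(unpad(defective_pad(example_2)) == example_2)
```

With correct padding the round trip returns the original 16 bytes. With the defective variant, the unpadder reads the last data byte (DD) as a pad length and the original data is not recovered.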

The build up
Design a process to fix all affected data
- List all affected records from DBs
- Read corresponding data with an “incorrect” decryption algorithm
- Write data back with a correct encryption algorithm
Id | Size              | Encryption method       | Version number    | Data reference key
1  | 32                | (Not encrypted)         | 0                 | aaa
2  | 6                 | Non-defective algorithm | 0                 | bbb
3  | 5 (not affected)  | Defective algorithm     | 0                 | ccc
4  | 32 (affected)     | Defective algorithm     | 1 (fixed)         | ddd
5  | 64 (affected)     | Defective algorithm     | 0 (not yet fixed) | eee
Only the last one needs a fix (block size = 16)
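The selection rule implied by the table can be written as a small predicate. The `Record` fields and the "none" / "ok" / "defective" labels are hypothetical names for illustration; the real schema is not shown in the slides.

```python
from dataclasses import dataclass

BLOCK_SIZE = 16


@dataclass(frozen=True)
class Record:
    id: int
    size: int
    encryption: str   # "none", "ok" or "defective" (hypothetical labels)
    version: int      # 0 = original, 1 = already rewritten
    data_key: str


def needs_fix(record: Record) -> bool:
    """Affected: defective algorithm, block-aligned size, not yet rewritten."""
    return (
        record.encryption == "defective"
        and record.size % BLOCK_SIZE == 0
        and record.version == 0
    )


records = [
    Record(1, 32, "none", 0, "aaa"),
    Record(2, 6, "ok", 0, "bbb"),
    Record(3, 5, "defective", 0, "ccc"),
    Record(4, 32, "defective", 1, "ddd"),
    Record(5, 64, "defective", 0, "eee"),
]

to_fix = [r.id for r in records if needs_fix(r)]
print(to_fix)
```

Bumping the version number after a rewrite makes the fix job safe to re-run: a record that was already corrected is skipped the next time.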

The build up
Just a silly bug, if it didn’t affect…
- Millions of user records
We set up a job processing system to correct all affected data in our system
gearman [Gearman Job Server] https://github.com/Yelp/python-gearman

Root cause #reliability #softwareFaults
1. Unreliable solution provider
2. Less than a 1% chance of finding the bug by testing
Action taken
- No more outsourcing
- More comprehensive tests with various kinds of scenarios
- ~10 TiB test dataset

What happened?
To maintain reliability, we
- Replicate user data multiple times
- Distribute replicas to different failure domains (different host/data center)
Data was still lost!!
http://dx.doi.org/10.6861/tanet.201810.0398

The build up
Our system balances load by writing data to nodes that have more free resources
- A newly added node generally has more free resources
- As a result, data tends to be placed on new nodes
Data was written to unreliable newly added nodes and lost, even though the replicas were
distributed across different failure domains.
Topic: Electronic/Electrical Reliability (cmu.edu)
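A toy simulation shows how this placement policy concentrates writes. It assumes a simple "most free space within each failure domain" rule, made-up capacities, and a hypothetical "<domain>-<node>" naming scheme; the real balancing algorithm is not described in the slides.

```python
from collections import Counter


def place_replicas(free_space, replicas=3):
    """Assumed policy: in each failure domain, pick the node with the most free space."""
    best_in_domain = {}
    for node, free in free_space.items():
        domain = node.split("-")[0]      # hypothetical naming: "<domain>-<node>"
        current = best_in_domain.get(domain)
        if current is None or free > free_space[current]:
            best_in_domain[domain] = node
    chosen = list(best_in_domain.values())[:replicas]
    for node in chosen:
        free_space[node] -= 1            # each replica consumes one unit
    return chosen


# Three failure domains, each with two nearly full old nodes and one new node.
free_space = {}
for domain in ("dc1", "dc2", "dc3"):
    free_space[f"{domain}-old1"] = 100
    free_space[f"{domain}-old2"] = 100
    free_space[f"{domain}-new"] = 1000

writes = Counter()
for _ in range(300):
    writes.update(place_replicas(free_space))

new_node_writes = sum(count for node, count in writes.items() if node.endswith("-new"))
print(new_node_writes, sum(writes.values()))
```

In this setup every replica of every object lands on a new node. The replicas are in different failure domains, yet all of them sit on hardware that has not been proven in production.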

Root cause #reliability #hardwareFaults
It’s hard to prevent data loss completely
- Modeling or simulation cannot truly reflect situations in the real world
Action taken
- Do more stability tests on newly added nodes
- Add a batch of new nodes each time, so there is less chance of writing data to an unreliable node
http://dx.doi.org/10.6861/tanet.201810.0398

What do we learn from these incidents? 🤔

#1 “There is unfortunately no easy fix for
making applications reliable, scalable”
- No way to enumerate all possible causes of unreliability (hardware faults, software faults, human errors)
- Usage patterns and load keep changing as your business expands, so you cannot have an ultimate scalability design beforehand

#2 Before trying to build a faultless
architecture, think twice
- Consider maintainability
- We need a team to sustain a large-scale system, not just a talented engineer
(dataintensive.net)

#3 Service = human beings + machines

Thank you! 🙏🙏🙏
@petertc_chu
