Video and slides synchronized, mp3 and slide download available at URL http://bit.ly/1Th5Qe3.
Yongsheng Wu talks about how to build highly resilient systems at scale. He covers five highly fault-tolerant, battle-tested systems: dynamic service discovery, real-time configuration management, caching, persistent storage, and an event processing pipeline. Wu also presents the failure cases that prompted engineers at Pinterest to build such systems, and how they actually test these systems. Filmed at qconsf.com.
Yongsheng Wu is an early engineer on the infrastructure team at Pinterest, where he helped make the system available, scalable, reliable, and performant. He currently leads the storage & caching team; previously he led the effort to make it easy to use, build, and run services at Pinterest.
Building Highly-resilient Systems at Pinterest
1.
2. InfoQ.com: News & Community Site
• 750,000 unique visitors/month
• Published in 4 languages (English, Chinese, Japanese and Brazilian
Portuguese)
• Post content from our QCon conferences
• News 15-20 / week
• Articles 3-4 / week
• Presentations (videos) 12-15 / week
• Interviews 2-3 / week
• Books 1 / month
Watch the video with slide synchronization on InfoQ.com!
http://www.infoq.com/presentations/pinterest-resilient-systems
3. Purpose of QCon
- to empower software development by facilitating the spread of
knowledge and innovation
Strategy
- practitioner-driven conference designed for YOU: influencers of
change and innovation in your teams
- speakers and topics driving the evolution and innovation
- connecting and catalyzing the influencers and innovators
Highlights
- attended by more than 12,000 delegates since 2007
- held in 9 cities worldwide
Presented at QCon San Francisco
www.qconsf.com
4. Highly Resilient Systems at Pinterest
Yongsheng Wu
Engineering Manager of Storage & Caching, Pinterest
Email: yongsheng@pinterest.com
Pinterest: www.pinterest.com/yswu
Nov 17, 2015
5. Our mission is to help people
discover and do what they love
53. McRouter
Pros
• No inconsistency caused by node joining/leaving the pool
• No cascading failures in case of excessive load caused by hot keys
Cons
• Cache misses
54. Replicated Pools - Reads
[Diagram: McRouter in front of two replicated cache pools, "ac Pool" (cache 1, cache 2, … cache n) and "de Pool" (cache 1’, cache 2’, … cache n’). Request shown: get foo]
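A minimal Python sketch of a read against replicated pools. The slide shows only McRouter, two pools, and a `get foo` request, so the behavior here is an assumption: the read tries one pool and falls back to the other on a miss or an error. The `Pool` class, its in-memory dict, and `replicated_get` are hypothetical stand-ins and not McRouter's API.

```python
class Pool:
    """In-memory stand-in for one cache pool (hypothetical, not McRouter's API)."""

    def __init__(self, name):
        self.name = name
        self.data = {}
        self.healthy = True

    def get(self, key):
        if not self.healthy:
            raise ConnectionError(f"pool {self.name} is unavailable")
        return self.data.get(key)

    def set(self, key, value):
        if not self.healthy:
            raise ConnectionError(f"pool {self.name} is unavailable")
        self.data[key] = value


def replicated_get(key, pools):
    """Try each pool in order; return the first hit, or None if every pool misses."""
    for pool in pools:
        try:
            value = pool.get(key)
        except ConnectionError:
            continue  # assumed behavior: skip a pool that cannot be reached
        if value is not None:
            return value
    return None


primary, replica = Pool("ac"), Pool("de")
primary.set("foo", "bar")
replica.set("foo", "bar")

primary.healthy = False  # simulate losing the first pool
result = replicated_get("foo", [primary, replica])
```

Because both pools hold the same keys, losing one pool degrades capacity rather than turning every read into a miss.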
56. Replicated Pools - Invalidation
[Diagram: McRouter in front of two replicated cache pools, "ac Pool" (cache 1, cache 2, … cache n) and "de Pool" (cache 1’, cache 2’, … cache n’). Request shown: delete foo]
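57. Replicated Pools - Invalidation
[Diagram: McRouter in front of two replicated cache pools, "ac Pool" (cache 1, cache 2, … cache n) and "de Pool" (cache 1’, cache 2’, … cache n’). Request shown: delete foo. Additional component shown: Log]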
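59. Replicated Pools - Invalidation
[Diagram: McRouter in front of two replicated cache pools, "ac Pool" (cache 1, cache 2, … cache n) and "de Pool" (cache 1’, cache 2’, … cache n’). Request shown: delete foo. Additional components shown: Log, Singer, kafka, tailer, PinLater]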
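The invalidation diagrams place a Log next to McRouter, followed by Singer, kafka, a tailer, and PinLater. The slides do not say what is written to the log, so the Python sketch below rests on an assumption: a delete goes to every pool, any delete that fails is appended to a local log, and a later stage replays it. The `Pool` class, the JSON log format, and both functions are hypothetical; the real pipeline stages are collapsed into one `replay_log` function.

```python
import json
import os
import tempfile


class Pool:
    """In-memory stand-in for one cache pool (hypothetical, not McRouter's API)."""

    def __init__(self, name):
        self.name = name
        self.data = {}
        self.healthy = True

    def delete(self, key):
        if not self.healthy:
            raise ConnectionError(f"pool {self.name} is unavailable")
        self.data.pop(key, None)


def replicated_delete(key, pools, log_path):
    """Delete from every pool; append each failed delete to a log for replay."""
    failed = []
    for pool in pools:
        try:
            pool.delete(key)
        except ConnectionError:
            failed.append(pool.name)
    if failed:
        with open(log_path, "a") as log:
            for name in failed:
                entry = {"op": "delete", "key": key, "pool": name}
                log.write(json.dumps(entry) + "\n")
    return failed


def replay_log(log_path, pools_by_name):
    """Retry logged deletes; keep the ones that still fail. Returns the replay count."""
    with open(log_path) as log:
        entries = [json.loads(line) for line in log if line.strip()]
    remaining = []
    for entry in entries:
        try:
            pools_by_name[entry["pool"]].delete(entry["key"])
        except ConnectionError:
            remaining.append(entry)
    with open(log_path, "w") as log:
        for entry in remaining:
            log.write(json.dumps(entry) + "\n")
    return len(entries) - len(remaining)


ac, de = Pool("ac"), Pool("de")
ac.data["foo"] = "bar"
de.data["foo"] = "bar"
log_path = os.path.join(tempfile.mkdtemp(), "failed_deletes.log")

de.healthy = False  # the second pool misses the invalidation
failed = replicated_delete("foo", [ac, de], log_path)
de.healthy = True   # the pool comes back with a stale "foo"
replayed = replay_log(log_path, {"ac": ac, "de": de})
```

Without the replay step, the recovered pool would keep serving the stale value until it expired.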
60. Challenges
• Build the feedback loop from persistent layer to
caching layer
• Move to multiple geographic regions
65. [Diagram. Components shown: Clients; DataServices 1, 2, … n; Master 1 / Slave 1 … Master m / Slave m. Path shown: Read from Slave. Note: Read from slave after master failing health check over a certain period of time.]
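A minimal Python sketch of the read-failover rule in the note above: reads stay on the master until it has been failing health checks continuously for some period, then move to the slave. The `ReadRouter` class, the `grace_seconds` parameter, and the 30-second value are hypothetical; the slides give neither the threshold nor the implementation.

```python
class ReadRouter:
    """Send reads to the master until it has failed health checks for grace_seconds."""

    def __init__(self, grace_seconds):
        self.grace_seconds = grace_seconds
        self.first_failure_at = None  # start of the current failure streak, if any

    def record_health_check(self, ok, now):
        if ok:
            self.first_failure_at = None  # any success ends the streak
        elif self.first_failure_at is None:
            self.first_failure_at = now

    def read_target(self, now):
        if self.first_failure_at is None:
            return "master"
        if now - self.first_failure_at >= self.grace_seconds:
            return "slave"
        return "master"


router = ReadRouter(grace_seconds=30)

router.record_health_check(ok=False, now=0)
early = router.read_target(now=10)      # failing for 10s: still the master

router.record_health_check(ok=False, now=20)
late = router.read_target(now=35)       # failing for 35s: switch to the slave

router.record_health_check(ok=True, now=40)
recovered = router.read_target(now=41)  # healthy again: back to the master
```

Waiting out a grace period keeps a single dropped health check from flipping read traffic back and forth.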
67. Other Persistence Stores
UMetaStore
• Key value store based on HBase
Zen
• Graph store: nodes and edges
• Flexible schema
• Custom index
• Both HBase and MySQL
Future
• RocksDB
71. Async Processing
Use Cases
• Acknowledge success right away and carry out non-time-sensitive
actions at a later time
• Schedule and execute a large number of jobs
Benefits
• Faster response time
• More resilient to dependent system failures
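A minimal Python sketch of the first use case: the request handler enqueues the non-time-sensitive action and acknowledges immediately, while a background worker performs it later. It uses only the standard `queue` and `threading` modules; `handle_request`, the `send_email` job name, and the in-process queue are hypothetical stand-ins for a real job system.

```python
import queue
import threading

jobs = queue.Queue()
sent = []  # records what the worker has processed


def handle_request(user_id):
    """Enqueue the deferred action and acknowledge without waiting for it."""
    jobs.put(("send_email", user_id))
    return "ok"


def worker():
    """Drain the queue until a None sentinel arrives."""
    while True:
        job = jobs.get()
        if job is None:
            jobs.task_done()
            break
        name, user_id = job
        sent.append((name, user_id))  # stand-in for the real deferred work
        jobs.task_done()


thread = threading.Thread(target=worker, daemon=True)
thread.start()

response = handle_request(42)  # returns before the email work happens
jobs.join()                    # wait here only so the example is deterministic
jobs.put(None)
thread.join()
```

The handler's response time no longer depends on the deferred action, and a failure in that action does not fail the request.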
72. Pyres Limitations
• No mechanism for success acknowledgement
• No visibility into status of individual job types
• No support for scheduled job execution at a
specific time in the future
• Rate limiting and retries are hard to manage
• Redis as only supported storage backend
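To make the missing features concrete, here is a minimal Python sketch of a queue that supports an explicit acknowledgement, per-type status counts, scheduled execution, and bounded retries. `JobQueue` and all of its methods and parameters are hypothetical; this is neither the Pyres API nor Pinterest's replacement, and it leaves out rate limiting and pluggable storage.

```python
import heapq
import itertools
from collections import Counter


class JobQueue:
    """In-memory job queue with scheduling, acks, and retries (illustrative only)."""

    def __init__(self, max_attempts=3, retry_delay=5):
        self.max_attempts = max_attempts
        self.retry_delay = retry_delay
        self._heap = []  # entries are (run_at, job_id)
        self._jobs = {}  # job_id -> {"type", "payload", "attempts", "state"}
        self._ids = itertools.count()

    def enqueue(self, job_type, payload, run_at=0):
        job_id = next(self._ids)
        self._jobs[job_id] = {"type": job_type, "payload": payload,
                              "attempts": 0, "state": "pending"}
        heapq.heappush(self._heap, (run_at, job_id))
        return job_id

    def dequeue(self, now):
        """Return (job_id, job) for the earliest due job, or None if nothing is due."""
        if not self._heap or self._heap[0][0] > now:
            return None
        _, job_id = heapq.heappop(self._heap)
        job = self._jobs[job_id]
        job["attempts"] += 1
        job["state"] = "running"
        return job_id, job

    def ack(self, job_id, success, now):
        """Record the outcome; reschedule a failed job until attempts run out."""
        job = self._jobs[job_id]
        if success:
            job["state"] = "succeeded"
        elif job["attempts"] < self.max_attempts:
            job["state"] = "pending"
            heapq.heappush(self._heap, (now + self.retry_delay, job_id))
        else:
            job["state"] = "failed"

    def status(self, job_type):
        """Count jobs of one type by state."""
        return dict(Counter(job["state"] for job in self._jobs.values()
                            if job["type"] == job_type))


q = JobQueue(max_attempts=2, retry_delay=5)
email_id = q.enqueue("email", {"to": "a"}, run_at=10)

too_early = q.dequeue(now=5)            # not due yet
first = q.dequeue(now=10)
q.ack(first[0], success=False, now=10)  # failed once, retried at t=15
retry = q.dequeue(now=15)
q.ack(retry[0], success=True, now=15)
summary = q.status("email")
```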
86. Learnings
• Avoid complete reliance on any single system, even
if it is a highly reliable distributed system
• Replication and failover are the key ingredients for
building highly resilient storage and caching
systems
• Use async processing as much as possible to
deliver faster response time and make request
handling more robust
87. Failure Testing
• Be explicit with scope
• Failure Modes
• Sandbox testing
• Manual testing
• Automated simulation
• Testing on production
• AWS is doing it for us all the time
• Simian Army
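One way to picture automated simulation of a failure mode is a wrapper that injects errors into a dependency, so the caller's fallback path can be exercised in a sandbox. This Python sketch is an illustration only: `FaultInjector`, `get_with_fallback`, and the choice of `ConnectionError` are hypothetical and do not describe Pinterest's tooling or Simian Army.

```python
import random


class FaultInjector:
    """Wrap a callable so that a fraction of calls raise an injected error."""

    def __init__(self, func, failure_rate, seed=0):
        self.func = func
        self.failure_rate = failure_rate
        self.rng = random.Random(seed)  # seeded so a test run is repeatable

    def __call__(self, *args, **kwargs):
        if self.rng.random() < self.failure_rate:
            raise ConnectionError("injected failure")
        return self.func(*args, **kwargs)


def get_with_fallback(fetch, key, default):
    """Code under test: degrade to a default when the dependency fails."""
    try:
        return fetch(key)
    except ConnectionError:
        return default


store = {"foo": "bar"}
always_failing = FaultInjector(store.get, failure_rate=1.0)
never_failing = FaultInjector(store.get, failure_rate=0.0)

degraded = get_with_fallback(always_failing, "foo", default=None)
normal = get_with_fallback(never_failing, "foo", default=None)
```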