APRICOT 2017: Trafficshifting: Avoiding Disasters & Improving Performance at Scale

Michael Kehoe, Architect of reliable, scalable infrastructure at LinkedIn
Trafficshifting: Avoiding Disasters &
Improving Performance at Scale
Michael Kehoe
Staff Site Reliability Engineer
LinkedIn
2
Overview
• Problem Statement
• Solution – How LinkedIn trafficshifts
• Datacenter shifting
• PoP steering
• Challenges of APAC region
• IPv4 vs IPv6
• Questions
$ whoami
3
Michael Kehoe
• Staff Site Reliability Engineer (SRE) @ LinkedIn
• Production-SRE team
• Funny accent = Australian + 3 years American
$ whatis SRE
4
Michael Kehoe
• Site Reliability Engineering
• Operations for the production application environment
• Responsibilities include
• Architecture design
• Capacity planning
• Operations
• Tooling
• Responsibilities include DNS/CDN management & traffic infrastructure
5
Terminology
• PoP (Point of Presence) – Where LinkedIn terminates incoming requests
• Fabric – Datacenter with the full LinkedIn production stack deployed
• Loadtest – Stress test of a Fabric, to simulate a disaster scenario
Disaster Recovery
6
Problem Statement
• Fail between Fabrics
• Performance of applications is degraded
• Validate disaster recovery (DR) scenario
• Expose bugs and suboptimal configurations via loadtest
• Planned maintenance
• Fail between PoPs
• Mitigate impact of a 3rd party provider maintenance/ failure (e.g. transport links)
• Software/ Configuration Bugs
Performance
7
Problem Statement
• Fabric Assignment
• Assign preferred and secondary fabric to all members based on:
• Member location
• Capacity
• PoP/ CDN steering
• Use GeoDNS to steer users to the ‘best’ PoP
• Use RUM DNS to steer users to the ‘best’ CDN
United States Performance (Global)
8
Problem Statement
APAC Performance (APAC cities)
9
Problem Statement
Delta US & APAC
10
Problem Statement
Site Speed
11
Problem Statement
• Site Speed affects User Engagement
• User Engagement affects page-views & transactions
• Bottom Line: Site Speed has an impact on revenue
LinkedIn’s Traffic Architecture
12
Solution
LinkedIn’s Traffic Architecture
13
Solution
Fabric shifting
14
Solution
• Stickyrouting
• Using a Hadoop job, we calculate a primary and secondary datacenter for each user based on location
• This data is stored in a key-value store (Espresso)
• Stickyrouting serves this information over a RESTful interface to our edge PoPs
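As a rough illustration of the assignment step, the sketch below ranks datacenters by great-circle distance from a member and takes the two closest as primary and secondary. The fabric names, coordinates and the `assign_fabrics` helper are hypothetical; the slides only state that the calculation is location-based and runs as a Hadoop job.

```python
import math

# Hypothetical fabric names and (lat, lon) coordinates, for illustration only.
FABRICS = {
    "fabric-west": (45.5, -122.7),
    "fabric-east": (39.0, -77.5),
    "fabric-apac": (1.35, 103.8),
}

def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))

def assign_fabrics(member_location, fabrics=FABRICS):
    """Return (primary, secondary): the two fabrics closest to the member."""
    ranked = sorted(fabrics, key=lambda name: haversine_km(member_location, fabrics[name]))
    return ranked[0], ranked[1]
```

In a real pipeline the resulting pair would be written to the key-value store and looked up by the edge on each request; a production job would presumably also weigh capacity, which this sketch ignores.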
Fabric shifting
15
Solution
• Different traffic types are partitioned and controlled separately
• Logged-in vs logged-out
• CDNs
• Monitoring
• Microsites
• Logged-in users are placed into ‘buckets’
• Buckets are marked online/ offline to move site traffic
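A minimal sketch of bucket-based shifting, assuming members are mapped to buckets by hashing their ID (the slides do not say how the mapping is done). `NUM_BUCKETS`, `bucket_for` and `route` are hypothetical names.

```python
import hashlib

NUM_BUCKETS = 100  # hypothetical bucket count

def bucket_for(member_id: str) -> int:
    """Map a member ID to a stable bucket number in [0, NUM_BUCKETS)."""
    digest = hashlib.sha256(member_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_BUCKETS

def route(member_id, primary, secondary, offline):
    """Pick a fabric for the member.

    offline maps a fabric name to the set of bucket numbers marked offline there.
    Members whose bucket is offline in their primary fabric go to the secondary.
    """
    bucket = bucket_for(member_id)
    if bucket in offline.get(primary, set()):
        return secondary
    return primary
```

Marking buckets offline a few at a time moves traffic gradually; marking all of them offline drains the fabric.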
Fabric shifting
16
Solution
• Stickyrouting – Benefits
• Ensure we serve the request as close to the user as possible
• Capacity management for datacenters
• We can assign a percentage of users to a datacenter
• Enables personal data routing (PDR)
• Only store data where we need it
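One way to picture "assign a percentage of users to a datacenter" is a deterministic split over hashed member IDs. This is an assumption made for illustration, not a description of LinkedIn's implementation; `fabric_for` and the share values are hypothetical.

```python
import hashlib

def fabric_for(member_id, shares):
    """Choose a fabric for a member.

    shares is a list of (fabric, percent) pairs whose percents sum to 100.
    The same member ID always lands in the same fabric for a given shares list.
    """
    if sum(percent for _, percent in shares) != 100:
        raise ValueError("shares must sum to 100")
    digest = hashlib.sha256(member_id.encode("utf-8")).digest()
    point = int.from_bytes(digest[:8], "big") % 100  # stable value in [0, 100)
    upper = 0
    for fabric, percent in shares:
        upper += percent
        if point < upper:
            return fabric
```

Changing the percentages moves only the members whose hashed value falls between the old and new boundaries.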
Fabric shifting Automation
17
Solution
Fabric shifting Automation
18
Solution
Fabric Shifting
19
Solution
Fabric Shifting Loadtests
20
Solution
Fabric Shifting Loadtests
21
Solution
LinkedIn’s Traffic Architecture
22
Solution
LinkedIn’s PoP Distribution
23
Solution
LinkedIn’s PoP Architecture
24
Solution
• Using IPVS: each PoP announces a unicast address and a regional anycast address
• APAC, EU and NAMER anycast regions
• Use GeoDNS to steer users to the ‘best’ PoP
• DNS will provide users with either an anycast or a unicast address for www.linkedin.com
• US and EU member traffic is nearly all anycast
• APAC is all unicast
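The anycast-or-unicast decision can be sketched as a lookup from the client's country to a steering policy. Everything here is hypothetical: the country table, the PoP names, and the addresses, which come from the IPv4 documentation ranges rather than any real deployment.

```python
# Addresses are from the IPv4 documentation ranges; regions follow the slide,
# PoP names are hypothetical.
ANYCAST = {"NAMER": "192.0.2.1", "EU": "192.0.2.2", "APAC": "192.0.2.3"}
UNICAST = {"pop-sin": "198.51.100.10", "pop-bom": "198.51.100.20"}

# Hypothetical policy: country code -> ("anycast", region) or ("unicast", pop).
POLICY = {
    "US": ("anycast", "NAMER"),
    "DE": ("anycast", "EU"),
    "IN": ("unicast", "pop-bom"),
    "SG": ("unicast", "pop-sin"),
}

def resolve(country, default=("anycast", "NAMER")):
    """Return the address a GeoDNS-style resolver would hand to this country."""
    kind, target = POLICY.get(country, default)
    return ANYCAST[target] if kind == "anycast" else UNICAST[target]
```

With a unicast answer the DNS layer pins the user to one PoP; with an anycast answer, network routing picks the PoP within the region.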
LinkedIn’s PoP DR
25
Solution
• Sometimes we need to fail out of PoPs
• 3rd party provider issues (e.g. transit links going down)
• Infrastructure maintenance
• Withdraw anycast route announcements
• Fail healthchecks on the proxy to drain unicast traffic
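Failing a healthcheck on purpose is commonly done with a drain flag that the health endpoint consults. The sketch below assumes a flag file; the `DRAIN_FILE` path and handler are hypothetical and not taken from the talk. Withdrawing anycast routes is a routing-layer action and is not shown.

```python
import os
from http.server import BaseHTTPRequestHandler

DRAIN_FILE = "/tmp/pop_drain"  # hypothetical flag path; create the file to drain

def healthcheck_status(drain_file=DRAIN_FILE):
    """Return 503 while the drain flag exists, otherwise 200."""
    return 503 if os.path.exists(drain_file) else 200

class HealthHandler(BaseHTTPRequestHandler):
    """Answers every GET with the current health status and an empty body."""

    def do_GET(self):
        self.send_response(healthcheck_status())
        self.end_headers()
```

Whatever probes the proxy would stop sending unicast traffic to the PoP once it sees the failing status, and resume after the flag is removed.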
LinkedIn’s PoP Performance
26
Solution
• PoP DNS Steering
• LinkedIn currently uses GeoDNS for routing
• Piloting RumDNS
• Pick the best PoP based on network, not country
• CDN Steering
• Mix CDNs to get best performance
• Constantly evaluate performance/ availability
• Automatically adjust CDN weighting
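A minimal sketch of automatic CDN weighting, assuming weights are set inversely proportional to each CDN's median request time and that unavailable CDNs get zero weight. The slides do not describe the actual formula; `cdn_weights` and the field names are hypothetical.

```python
def cdn_weights(stats):
    """Compute traffic weights from measured CDN performance.

    stats maps a CDN name to {"p50_ms": float, "available": bool}.
    Returns {cdn: weight}; the weights sum to 1.
    """
    scores = {
        name: (1.0 / s["p50_ms"] if s["available"] and s["p50_ms"] > 0 else 0.0)
        for name, s in stats.items()
    }
    total = sum(scores.values())
    if total == 0:
        raise ValueError("no available CDN")
    return {name: score / total for name, score in scores.items()}
```

For example, two available CDNs at 100 ms and 200 ms would receive two thirds and one third of the traffic respectively under this scheme.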
LinkedIn’s PoP Performance
27
Solution
US CDN request time 50th percentile 24 hours
Working around fiber cuts
28
APAC Challenges
• Case Study: Fail out of India PoP due to fiber cuts
Chart: Connection time for Indian members (90th percentile), by ASN (ASN 15802, ASN 5384)
GeoDNS Suboptimal PoPs
29
APAC Challenges
Source: http://www.submarinecablemap.com/#/submarine-cable/bay-of-bengal-gateway-bbg
Map: Singapore and Mumbai; latencies shown: 45 ms, 220 ms, 70 ms
ASN 15802 RTT to Singapore is 220 + 70 = 290 ms (all at 50th percentile)
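The arithmetic on this slide, and the idea of choosing a PoP per network from measured RTT rather than from geography, can be sketched as follows. Only the 220 ms and 70 ms segments come from the slide; the function names are hypothetical.

```python
from statistics import median

def path_rtt_ms(segments):
    """Total RTT of a path given the RTT of each segment, in ms."""
    return sum(segments)

def best_pop(rtt_samples):
    """Pick the PoP with the lowest median (50th percentile) RTT.

    rtt_samples maps a PoP name to a list of RTT measurements in ms,
    all taken from clients in a single ASN.
    """
    return min(rtt_samples, key=lambda pop: median(rtt_samples[pop]))
```

Here `path_rtt_ms([220, 70])` gives the 290 ms quoted for ASN 15802; feeding per-ASN samples for each candidate PoP into `best_pop` is one simple way to let measurements override a country-based choice.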
GeoDNS Suboptimal PoPs
30
APAC Challenges
Map: London, Dublin, Singapore, Mumbai and Hong Kong, with ASN 15802 and ASN 5384; latencies shown: 160 ms, 45 ms, 70 ms, 35 ms, 350 ms, 160 ms
GeoDNS Suboptimal PoPs
31
APAC Challenges
(Chart: axis ticks from 600 to 1200; series and units not recoverable)
Performance & Adoption
32
IPv4 vs IPv6
• IPv6 performs better for our members
• Fewer request time-outs on IPv6 for mobile users
• Mobile carriers are adopting IPv6 faster
• Win for LinkedIn and our members!
• In July 2014 (IPv6 launch): 3% of traffic was IPv6
• Today: ~12% of traffic is IPv6
Key Takeaways
33
Conclusion
• Application level traffic engineering is extremely important for content providers
• RUM data is extremely useful for finding anomalies
• Route traffic based on performance, not just location
• IPv6 performs better for LinkedIn users
34
Questions?
APRICOT 2017: Trafficshifting: Avoiding Disasters & Improving Performance at Scale
1 sur 35

Recommandé

Couchbase Connect 2016 par
Couchbase Connect 2016Couchbase Connect 2016
Couchbase Connect 2016Michael Kehoe
665 vues28 diapositives
Reducing MTTR and False Escalations: Event Correlation at LinkedIn par
Reducing MTTR and False Escalations: Event Correlation at LinkedInReducing MTTR and False Escalations: Event Correlation at LinkedIn
Reducing MTTR and False Escalations: Event Correlation at LinkedInMichael Kehoe
956 vues34 diapositives
Couchbase Connect 2016: Monitoring Production Deployments The Tools – LinkedIn par
Couchbase Connect 2016: Monitoring Production Deployments The Tools – LinkedInCouchbase Connect 2016: Monitoring Production Deployments The Tools – LinkedIn
Couchbase Connect 2016: Monitoring Production Deployments The Tools – LinkedInMichael Kehoe
720 vues42 diapositives
Using SaltStack to Auto Triage and Remediate Production Systems par
Using SaltStack to Auto Triage and Remediate Production SystemsUsing SaltStack to Auto Triage and Remediate Production Systems
Using SaltStack to Auto Triage and Remediate Production SystemsMichael Kehoe
1.8K vues29 diapositives
SouthBay SRE Meetup Jan 2016 par
SouthBay SRE Meetup Jan 2016SouthBay SRE Meetup Jan 2016
SouthBay SRE Meetup Jan 2016Michael Kehoe
586 vues17 diapositives
Becoming Protocol-Agnostic with Kafka, REST, GraphQL & gRPC | Tyler Mills, Sm... par
Becoming Protocol-Agnostic with Kafka, REST, GraphQL & gRPC | Tyler Mills, Sm...Becoming Protocol-Agnostic with Kafka, REST, GraphQL & gRPC | Tyler Mills, Sm...
Becoming Protocol-Agnostic with Kafka, REST, GraphQL & gRPC | Tyler Mills, Sm...HostedbyConfluent
635 vues10 diapositives

Contenu connexe

Tendances

Integrating Apache Kafka Into Your Environment par
Integrating Apache Kafka Into Your EnvironmentIntegrating Apache Kafka Into Your Environment
Integrating Apache Kafka Into Your Environmentconfluent
3.9K vues29 diapositives
Introducing Tupilak, Snowplow's unified log fabric par
Introducing Tupilak, Snowplow's unified log fabricIntroducing Tupilak, Snowplow's unified log fabric
Introducing Tupilak, Snowplow's unified log fabricAlexander Dean
1.3K vues16 diapositives
Tale of two streaming frameworks (Karthik D - Walmart) par
Tale of two streaming frameworks (Karthik D - Walmart)Tale of two streaming frameworks (Karthik D - Walmart)
Tale of two streaming frameworks (Karthik D - Walmart)KafkaZone
231 vues36 diapositives
Apache Kafka and API Management / API Gateway – Friends, Enemies or Frenemies... par
Apache Kafka and API Management / API Gateway – Friends, Enemies or Frenemies...Apache Kafka and API Management / API Gateway – Friends, Enemies or Frenemies...
Apache Kafka and API Management / API Gateway – Friends, Enemies or Frenemies...HostedbyConfluent
4.7K vues43 diapositives
Kafka Streams: What it is, and how to use it? par
Kafka Streams: What it is, and how to use it?Kafka Streams: What it is, and how to use it?
Kafka Streams: What it is, and how to use it?confluent
1.9K vues34 diapositives
What Crimean War gunboats teach us about the need for schema registries par
What Crimean War gunboats teach us about the need for schema registriesWhat Crimean War gunboats teach us about the need for schema registries
What Crimean War gunboats teach us about the need for schema registriesAlexander Dean
869 vues31 diapositives

Tendances(20)

Integrating Apache Kafka Into Your Environment par confluent
Integrating Apache Kafka Into Your EnvironmentIntegrating Apache Kafka Into Your Environment
Integrating Apache Kafka Into Your Environment
confluent3.9K vues
Introducing Tupilak, Snowplow's unified log fabric par Alexander Dean
Introducing Tupilak, Snowplow's unified log fabricIntroducing Tupilak, Snowplow's unified log fabric
Introducing Tupilak, Snowplow's unified log fabric
Alexander Dean1.3K vues
Tale of two streaming frameworks (Karthik D - Walmart) par KafkaZone
Tale of two streaming frameworks (Karthik D - Walmart)Tale of two streaming frameworks (Karthik D - Walmart)
Tale of two streaming frameworks (Karthik D - Walmart)
KafkaZone231 vues
Apache Kafka and API Management / API Gateway – Friends, Enemies or Frenemies... par HostedbyConfluent
Apache Kafka and API Management / API Gateway – Friends, Enemies or Frenemies...Apache Kafka and API Management / API Gateway – Friends, Enemies or Frenemies...
Apache Kafka and API Management / API Gateway – Friends, Enemies or Frenemies...
HostedbyConfluent4.7K vues
Kafka Streams: What it is, and how to use it? par confluent
Kafka Streams: What it is, and how to use it?Kafka Streams: What it is, and how to use it?
Kafka Streams: What it is, and how to use it?
confluent1.9K vues
What Crimean War gunboats teach us about the need for schema registries par Alexander Dean
What Crimean War gunboats teach us about the need for schema registriesWhat Crimean War gunboats teach us about the need for schema registries
What Crimean War gunboats teach us about the need for schema registries
Alexander Dean869 vues
[Webinar] AWS Monitoring with Site24x7 par Site24x7
[Webinar] AWS Monitoring with Site24x7[Webinar] AWS Monitoring with Site24x7
[Webinar] AWS Monitoring with Site24x7
Site24x71.8K vues
Consolidating services with middleware - NDC London 2017 par Christian Horsdal
Consolidating services with middleware - NDC London 2017Consolidating services with middleware - NDC London 2017
Consolidating services with middleware - NDC London 2017
Building a Self-Service Hadoop Platform at Linkedin with Azkaban par DataWorks Summit
Building a Self-Service Hadoop Platform at Linkedin with AzkabanBuilding a Self-Service Hadoop Platform at Linkedin with Azkaban
Building a Self-Service Hadoop Platform at Linkedin with Azkaban
DataWorks Summit5.5K vues
Stream Processing Live Traffic Data with Kafka Streams par Tom Van den Bulck
Stream Processing Live Traffic Data with Kafka StreamsStream Processing Live Traffic Data with Kafka Streams
Stream Processing Live Traffic Data with Kafka Streams
Event-driven Applications with Kafka, Micronaut, and AWS Lambda | Dave Klein,... par HostedbyConfluent
Event-driven Applications with Kafka, Micronaut, and AWS Lambda | Dave Klein,...Event-driven Applications with Kafka, Micronaut, and AWS Lambda | Dave Klein,...
Event-driven Applications with Kafka, Micronaut, and AWS Lambda | Dave Klein,...
David Max SATURN 2018 - Migrating from Oracle to Espresso par David Max
David Max SATURN 2018 - Migrating from Oracle to EspressoDavid Max SATURN 2018 - Migrating from Oracle to Espresso
David Max SATURN 2018 - Migrating from Oracle to Espresso
David Max55 vues
Event Sourcing, Stream Processing and Serverless (Ben Stopford, Confluent) K... par confluent
Event Sourcing, Stream Processing and Serverless (Ben Stopford, Confluent)  K...Event Sourcing, Stream Processing and Serverless (Ben Stopford, Confluent)  K...
Event Sourcing, Stream Processing and Serverless (Ben Stopford, Confluent) K...
confluent3.7K vues
Scala eXchange: Building robust data pipelines in Scala par Alexander Dean
Scala eXchange: Building robust data pipelines in ScalaScala eXchange: Building robust data pipelines in Scala
Scala eXchange: Building robust data pipelines in Scala
Alexander Dean4.6K vues
Confluent Kafka and KSQL: Streaming Data Pipelines Made Easy par Kairo Tavares
Confluent Kafka and KSQL: Streaming Data Pipelines Made EasyConfluent Kafka and KSQL: Streaming Data Pipelines Made Easy
Confluent Kafka and KSQL: Streaming Data Pipelines Made Easy
Kairo Tavares400 vues
How to use Standard SQL over Kafka: From the basics to advanced use cases | F... par HostedbyConfluent
How to use Standard SQL over Kafka: From the basics to advanced use cases | F...How to use Standard SQL over Kafka: From the basics to advanced use cases | F...
How to use Standard SQL over Kafka: From the basics to advanced use cases | F...
A Practical Guide to Selecting a Stream Processing Technology par confluent
A Practical Guide to Selecting a Stream Processing Technology A Practical Guide to Selecting a Stream Processing Technology
A Practical Guide to Selecting a Stream Processing Technology
confluent2K vues
Span Conference: Why your company needs a unified log par Alexander Dean
Span Conference: Why your company needs a unified logSpan Conference: Why your company needs a unified log
Span Conference: Why your company needs a unified log
Alexander Dean1.8K vues

En vedette

Couchbase Meetup Jan 2016 par
Couchbase Meetup Jan 2016Couchbase Meetup Jan 2016
Couchbase Meetup Jan 2016Michael Kehoe
991 vues11 diapositives
SRECon USA 2016: Growing your Entry Level Talent par
SRECon USA 2016: Growing your Entry Level TalentSRECon USA 2016: Growing your Entry Level Talent
SRECon USA 2016: Growing your Entry Level TalentMichael Kehoe
520 vues19 diapositives
CouchbasetoHadoop_Matt_Michael_Justin v4 par
CouchbasetoHadoop_Matt_Michael_Justin v4CouchbasetoHadoop_Matt_Michael_Justin v4
CouchbasetoHadoop_Matt_Michael_Justin v4Michael Kehoe
590 vues17 diapositives
Feedback loops: How SREs benefit and what is needed to realize their potential par
Feedback loops: How SREs benefit and what is needed to realize their potentialFeedback loops: How SREs benefit and what is needed to realize their potential
Feedback loops: How SREs benefit and what is needed to realize their potentialPooja Tangi
531 vues14 diapositives
CouchDB y el desarrollo de aplicaciones Android par
CouchDB y el desarrollo de aplicaciones AndroidCouchDB y el desarrollo de aplicaciones Android
CouchDB y el desarrollo de aplicaciones AndroidRicardo Monagas Medina
1.9K vues26 diapositives
Couchbase Orchestration and Scaling a Caching Infrastructure At LinkedIn. par
Couchbase Orchestration and Scaling a Caching Infrastructure At LinkedIn.Couchbase Orchestration and Scaling a Caching Infrastructure At LinkedIn.
Couchbase Orchestration and Scaling a Caching Infrastructure At LinkedIn.Issa Fattah
548 vues44 diapositives

En vedette(20)

SRECon USA 2016: Growing your Entry Level Talent par Michael Kehoe
SRECon USA 2016: Growing your Entry Level TalentSRECon USA 2016: Growing your Entry Level Talent
SRECon USA 2016: Growing your Entry Level Talent
Michael Kehoe520 vues
CouchbasetoHadoop_Matt_Michael_Justin v4 par Michael Kehoe
CouchbasetoHadoop_Matt_Michael_Justin v4CouchbasetoHadoop_Matt_Michael_Justin v4
CouchbasetoHadoop_Matt_Michael_Justin v4
Michael Kehoe590 vues
Feedback loops: How SREs benefit and what is needed to realize their potential par Pooja Tangi
Feedback loops: How SREs benefit and what is needed to realize their potentialFeedback loops: How SREs benefit and what is needed to realize their potential
Feedback loops: How SREs benefit and what is needed to realize their potential
Pooja Tangi531 vues
Couchbase Orchestration and Scaling a Caching Infrastructure At LinkedIn. par Issa Fattah
Couchbase Orchestration and Scaling a Caching Infrastructure At LinkedIn.Couchbase Orchestration and Scaling a Caching Infrastructure At LinkedIn.
Couchbase Orchestration and Scaling a Caching Infrastructure At LinkedIn.
Issa Fattah548 vues
Recruitment Analytics workshop - Endouble Antwerp 6-3-2017 par Endouble
Recruitment Analytics workshop  - Endouble Antwerp 6-3-2017Recruitment Analytics workshop  - Endouble Antwerp 6-3-2017
Recruitment Analytics workshop - Endouble Antwerp 6-3-2017
Endouble 277 vues
High Performance Magnolia with Anycast Routing par bkraft
High Performance Magnolia with Anycast RoutingHigh Performance Magnolia with Anycast Routing
High Performance Magnolia with Anycast Routing
bkraft570 vues
Service Redundancy and Traffic Balancing Using Anycast par Sean Jain Ellis
Service Redundancy and Traffic Balancing Using AnycastService Redundancy and Traffic Balancing Using Anycast
Service Redundancy and Traffic Balancing Using Anycast
Sean Jain Ellis5K vues
Software reliability tools and common software errors par Himanshu
Software reliability tools and common software errorsSoftware reliability tools and common software errors
Software reliability tools and common software errors
Himanshu 1.8K vues
Routing for an Anycast CDN par Tom Paseka
Routing for an Anycast CDNRouting for an Anycast CDN
Routing for an Anycast CDN
Tom Paseka4.5K vues
Endouble Advertising Workshop par Endouble
Endouble Advertising WorkshopEndouble Advertising Workshop
Endouble Advertising Workshop
Endouble 207 vues
How TPM saves the day par Pooja Tangi
How TPM saves the dayHow TPM saves the day
How TPM saves the day
Pooja Tangi318 vues
Cloud Native, Microservices and SRE/Chaos Engineering: The new Rules of The G... par Diego Pacheco
Cloud Native, Microservices and SRE/Chaos Engineering: The new Rules of The G...Cloud Native, Microservices and SRE/Chaos Engineering: The new Rules of The G...
Cloud Native, Microservices and SRE/Chaos Engineering: The new Rules of The G...
Diego Pacheco1.2K vues
Software Reliability Engineering par guest90cec6
Software Reliability EngineeringSoftware Reliability Engineering
Software Reliability Engineering
guest90cec61.9K vues
Event Driven Automation Meetup May 14/2015 par Dmitri Zimine
Event Driven Automation Meetup May 14/2015Event Driven Automation Meetup May 14/2015
Event Driven Automation Meetup May 14/2015
Dmitri Zimine2.2K vues
Software reliability growth model par Himanshu
Software reliability growth modelSoftware reliability growth model
Software reliability growth model
Himanshu 9.5K vues
Load balancing in the SRE way par Shawn Zhu
Load balancing in the SRE wayLoad balancing in the SRE way
Load balancing in the SRE way
Shawn Zhu730 vues
Best Practices And Next Gen Formats: Supercharging Web Content Performance par G3 Communications
Best Practices And Next Gen Formats: Supercharging Web Content PerformanceBest Practices And Next Gen Formats: Supercharging Web Content Performance
Best Practices And Next Gen Formats: Supercharging Web Content Performance

Similaire à APRICOT 2017: Trafficshifting: Avoiding Disasters & Improving Performance at Scale

Trafficshifting: Avoiding Disasters & Improving Performance at Scale par
Trafficshifting: Avoiding Disasters & Improving Performance at ScaleTrafficshifting: Avoiding Disasters & Improving Performance at Scale
Trafficshifting: Avoiding Disasters & Improving Performance at ScaleAPNIC
245 vues37 diapositives
Play With Streams par
Play With StreamsPlay With Streams
Play With StreamsTianjian Chen
461 vues78 diapositives
PLNOG 3: John Evans - Best Practices in Network Planning par
PLNOG 3: John Evans - Best Practices in Network PlanningPLNOG 3: John Evans - Best Practices in Network Planning
PLNOG 3: John Evans - Best Practices in Network PlanningPROIDEA
19 vues36 diapositives
AME-1934 : Enable Active-Active Messaging Technology to Extend Workload Balan... par
AME-1934 : Enable Active-Active Messaging Technology to Extend Workload Balan...AME-1934 : Enable Active-Active Messaging Technology to Extend Workload Balan...
AME-1934 : Enable Active-Active Messaging Technology to Extend Workload Balan...wangbo626
1.2K vues42 diapositives
Migration from Oracle to PostgreSQL: NEED vs REALITY par
Migration from Oracle to PostgreSQL: NEED vs REALITYMigration from Oracle to PostgreSQL: NEED vs REALITY
Migration from Oracle to PostgreSQL: NEED vs REALITYAshnikbiz
140 vues24 diapositives
Rzepnicki_thesis_presentation_2003(2) (1) par
Rzepnicki_thesis_presentation_2003(2) (1)Rzepnicki_thesis_presentation_2003(2) (1)
Rzepnicki_thesis_presentation_2003(2) (1)Witold Rzepnicki
132 vues49 diapositives

Similaire à APRICOT 2017: Trafficshifting: Avoiding Disasters & Improving Performance at Scale(20)

Trafficshifting: Avoiding Disasters & Improving Performance at Scale par APNIC
Trafficshifting: Avoiding Disasters & Improving Performance at ScaleTrafficshifting: Avoiding Disasters & Improving Performance at Scale
Trafficshifting: Avoiding Disasters & Improving Performance at Scale
APNIC245 vues
PLNOG 3: John Evans - Best Practices in Network Planning par PROIDEA
PLNOG 3: John Evans - Best Practices in Network PlanningPLNOG 3: John Evans - Best Practices in Network Planning
PLNOG 3: John Evans - Best Practices in Network Planning
PROIDEA19 vues
AME-1934 : Enable Active-Active Messaging Technology to Extend Workload Balan... par wangbo626
AME-1934 : Enable Active-Active Messaging Technology to Extend Workload Balan...AME-1934 : Enable Active-Active Messaging Technology to Extend Workload Balan...
AME-1934 : Enable Active-Active Messaging Technology to Extend Workload Balan...
wangbo6261.2K vues
Migration from Oracle to PostgreSQL: NEED vs REALITY par Ashnikbiz
Migration from Oracle to PostgreSQL: NEED vs REALITYMigration from Oracle to PostgreSQL: NEED vs REALITY
Migration from Oracle to PostgreSQL: NEED vs REALITY
Ashnikbiz140 vues
Rzepnicki_thesis_presentation_2003(2) (1) par Witold Rzepnicki
Rzepnicki_thesis_presentation_2003(2) (1)Rzepnicki_thesis_presentation_2003(2) (1)
Rzepnicki_thesis_presentation_2003(2) (1)
Witold Rzepnicki132 vues
How can Big data accelerate CDN services ? par ANOOP KUMAR P
How can Big data accelerate CDN services ?How can Big data accelerate CDN services ?
How can Big data accelerate CDN services ?
ANOOP KUMAR P1.7K vues
Row #9: An architecture overview of APNIC's RDAP deployment to the cloud par APNIC
Row #9: An architecture overview of APNIC's RDAP deployment to the cloudRow #9: An architecture overview of APNIC's RDAP deployment to the cloud
Row #9: An architecture overview of APNIC's RDAP deployment to the cloud
APNIC385 vues
MongoDB .local London 2019: Migrating a Monolith to MongoDB Atlas – Auto Trad... par MongoDB
MongoDB .local London 2019: Migrating a Monolith to MongoDB Atlas – Auto Trad...MongoDB .local London 2019: Migrating a Monolith to MongoDB Atlas – Auto Trad...
MongoDB .local London 2019: Migrating a Monolith to MongoDB Atlas – Auto Trad...
MongoDB463 vues
Hybrid Cloud Journey - Maximizing Private and Public Cloud par Ryan Lynn
Hybrid Cloud Journey - Maximizing Private and Public CloudHybrid Cloud Journey - Maximizing Private and Public Cloud
Hybrid Cloud Journey - Maximizing Private and Public Cloud
Ryan Lynn396 vues
Velocity San Jose 2017: Traffic shifts: Avoiding disasters at scale par Michael Kehoe
Velocity San Jose 2017: Traffic shifts: Avoiding disasters at scaleVelocity San Jose 2017: Traffic shifts: Avoiding disasters at scale
Velocity San Jose 2017: Traffic shifts: Avoiding disasters at scale
Michael Kehoe247 vues
Motor vehicle emission checker danu-lap par aidsdatahub
Motor vehicle emission checker danu-lapMotor vehicle emission checker danu-lap
Motor vehicle emission checker danu-lap
aidsdatahub567 vues
Freedom of Movement for redisconf19 par Richard Leddy
Freedom of Movement for redisconf19Freedom of Movement for redisconf19
Freedom of Movement for redisconf19
Richard Leddy63 vues
Transform Your Data Integration Platform From Informatica To ODI par Jade Global
Transform Your Data Integration Platform From Informatica To ODI Transform Your Data Integration Platform From Informatica To ODI
Transform Your Data Integration Platform From Informatica To ODI
Jade Global910 vues
Improving Resource Utilization in Cloud using Application Placement Heuristics par Atakan Aral
Improving Resource Utilization in Cloud using Application Placement HeuristicsImproving Resource Utilization in Cloud using Application Placement Heuristics
Improving Resource Utilization in Cloud using Application Placement Heuristics
Atakan Aral536 vues
Migrating Big Data Workloads to the Cloud par Robert Sanders
Migrating Big Data Workloads to the CloudMigrating Big Data Workloads to the Cloud
Migrating Big Data Workloads to the Cloud
Robert Sanders451 vues
Druid Optimizations for Scaling Customer Facing Analytics par Amir Youssefi
Druid Optimizations for Scaling Customer Facing AnalyticsDruid Optimizations for Scaling Customer Facing Analytics
Druid Optimizations for Scaling Customer Facing Analytics
Amir Youssefi20 vues
About VisualDNA Architecture @ Rubyslava 2014 par Michal Harish
About VisualDNA Architecture @ Rubyslava 2014About VisualDNA Architecture @ Rubyslava 2014
About VisualDNA Architecture @ Rubyslava 2014
Michal Harish1.6K vues

Plus de Michael Kehoe

eBPF Workshop par
eBPF WorkshopeBPF Workshop
eBPF WorkshopMichael Kehoe
1.4K vues26 diapositives
eBPF Basics par
eBPF BasicseBPF Basics
eBPF BasicsMichael Kehoe
2.7K vues63 diapositives
Code Yellow: Helping operations top-heavy teams the smart way par
Code Yellow: Helping operations top-heavy teams the smart wayCode Yellow: Helping operations top-heavy teams the smart way
Code Yellow: Helping operations top-heavy teams the smart wayMichael Kehoe
140 vues29 diapositives
QConSF 2018: Building Production-Ready Applications par
QConSF 2018: Building Production-Ready ApplicationsQConSF 2018: Building Production-Ready Applications
QConSF 2018: Building Production-Ready ApplicationsMichael Kehoe
193 vues43 diapositives
Helping operations top-heavy teams the smart way par
Helping operations top-heavy teams the smart wayHelping operations top-heavy teams the smart way
Helping operations top-heavy teams the smart wayMichael Kehoe
420 vues29 diapositives
AllDayDevops: What the NTSB teaches us about incident management & postmortems par
AllDayDevops: What the NTSB teaches us about incident management & postmortemsAllDayDevops: What the NTSB teaches us about incident management & postmortems
AllDayDevops: What the NTSB teaches us about incident management & postmortemsMichael Kehoe
321 vues58 diapositives

Plus de Michael Kehoe(16)

Code Yellow: Helping operations top-heavy teams the smart way par Michael Kehoe
Code Yellow: Helping operations top-heavy teams the smart wayCode Yellow: Helping operations top-heavy teams the smart way
Code Yellow: Helping operations top-heavy teams the smart way
Michael Kehoe140 vues
QConSF 2018: Building Production-Ready Applications par Michael Kehoe
QConSF 2018: Building Production-Ready ApplicationsQConSF 2018: Building Production-Ready Applications
QConSF 2018: Building Production-Ready Applications
Michael Kehoe193 vues
Helping operations top-heavy teams the smart way par Michael Kehoe
Helping operations top-heavy teams the smart wayHelping operations top-heavy teams the smart way
Helping operations top-heavy teams the smart way
Michael Kehoe420 vues
AllDayDevops: What the NTSB teaches us about incident management & postmortems par Michael Kehoe
AllDayDevops: What the NTSB teaches us about incident management & postmortemsAllDayDevops: What the NTSB teaches us about incident management & postmortems
AllDayDevops: What the NTSB teaches us about incident management & postmortems
Michael Kehoe321 vues
Papers We Love Sept. 2018: 007: Democratically Finding The Cause of Packet Drops par Michael Kehoe
Papers We Love Sept. 2018: 007: Democratically Finding The Cause of Packet DropsPapers We Love Sept. 2018: 007: Democratically Finding The Cause of Packet Drops
Papers We Love Sept. 2018: 007: Democratically Finding The Cause of Packet Drops
Michael Kehoe285 vues
What the NTSB teaches us about incident management & postmortems par Michael Kehoe
What the NTSB teaches us about incident management & postmortemsWhat the NTSB teaches us about incident management & postmortems
What the NTSB teaches us about incident management & postmortems
Michael Kehoe489 vues
PyBay 2018: Production-Ready Python Applications par Michael Kehoe
PyBay 2018: Production-Ready Python ApplicationsPyBay 2018: Production-Ready Python Applications
PyBay 2018: Production-Ready Python Applications
Michael Kehoe283 vues
Helping operations top-heavy teams the smart way par Michael Kehoe
Helping operations top-heavy teams the smart wayHelping operations top-heavy teams the smart way
Helping operations top-heavy teams the smart way
Michael Kehoe233 vues
The Next Wave of Reliability Engineering par Michael Kehoe
The Next Wave of Reliability EngineeringThe Next Wave of Reliability Engineering
The Next Wave of Reliability Engineering
Michael Kehoe687 vues
Building Production-Ready Microservices: DevopsExchangeSF par Michael Kehoe
Building Production-Ready Microservices: DevopsExchangeSFBuilding Production-Ready Microservices: DevopsExchangeSF
Building Production-Ready Microservices: DevopsExchangeSF
Michael Kehoe452 vues
SF Chaos Engineering Meetup: Building Disaster Recovery via Resilience Engine... par Michael Kehoe
SF Chaos Engineering Meetup: Building Disaster Recovery via Resilience Engine...SF Chaos Engineering Meetup: Building Disaster Recovery via Resilience Engine...
SF Chaos Engineering Meetup: Building Disaster Recovery via Resilience Engine...
Michael Kehoe321 vues
SRECon-Europe-2017: Reducing MTTR and False Escalations: Event Correlation at... par Michael Kehoe

Editor's notes

  1. Good morning, my name is Michael Kehoe, and in this presentation I’m going to talk about how LinkedIn shifts traffic between its PoPs and datacenters to avoid disasters and improve site performance at scale.
  2. This morning I want to talk about the problem that we’re trying to solve, particularly in the context of APAC, which is extremely challenging for internet companies. Then we’ll deep-dive into how LinkedIn solves these problems to improve our availability and site performance; specifically, we’ll look at datacenter shifting and PoP steering. We’ll look at some of the challenges of operating in the APAC region, briefly talk about IPv6 adoption, and then I’ll take questions.
  3. So who am I? I’m a Staff Site Reliability Engineer (commonly referred to as SRE) at LinkedIn. I am on a team called Production-SRE. Our team charter includes: developing applications to improve MTTD and MTTR; building tools for efficient site issue troubleshooting, issue detection and correlation; and assisting in restoring stability to services during site-critical issues. Yes, I have a slightly strange accent; it’s Australian with three years of American.
  4. Site Reliability Engineering is a term coined by Ben Treynor from Google. You may also find it being called DevOps/AppOps or Production Engineering. The skillset is based on: sysadmin, network engineer, architect, troubleshooter, and software engineer. The role consists of: architecture design; capacity planning; application operations (keeping the site healthy); and writing automation and tooling. The SRE role and philosophy differ between companies. At LinkedIn, SREs are responsible for DNS/CDN management and traffic infrastructure.
  5. Before we deep-dive, let’s go over some terminology. PoP: where LinkedIn terminates incoming requests to its datacenters; PoPs are spread geographically across the world. Fabric: a datacenter where the full LinkedIn application stack is deployed; LinkedIn has three datacenters in the US and one in Singapore. Loadtest: where we stress test a Fabric to simulate a disaster.
  6. What are the use cases for shifting traffic for disaster recovery purposes? For Fabrics: performance of applications is degraded (the site may be slow or users get errors); validating disaster recovery (planning for natural, infrastructure, or code disasters); exposing code bugs and suboptimal configurations via loadtest (when the application infrastructure is under stress, it is easier to expose suboptimal configuration and code); and planned maintenance (intrusive infrastructure maintenance that may cause impact). For PoPs: transport provider maintenance, which is more common in Asia given the large number of submarine cables we utilize, and software bugs.
  7. Let’s look at the performance side of the equation. How can shifting traffic improve performance? Fabric: members use the datacenter closest to them, and we can manage the capacity of a datacenter. PoP: steering users to the best possible PoP gives us a significant performance advantage. By measuring CDN availability and performance using RUM (talk about RUM and how it works), we can speed up page load time by 50%.
  8. NOTE: Move to Excel and remove values. Average linkedin.com page load time for countries using US datacenters (measured by Catchpoint, all major metro nodes around the world).
  9. Average linkedin.com page load time for countries using APAC datacenters (measured by Catchpoint, top 10 APAC metro nodes).
  10. Delta between US and APAC performance. The average is 2.5 s.
  11. LinkedIn has done extensive research on the impact site speed has on user engagement. From this research we know that slow page load times affect engagement and transactions. This in turn affects our revenue. This is important!
  12. So what does LinkedIn’s traffic architecture look like? DNS routes users to the ‘best’ PoP (more on that later). IPVS (IP Virtual Server, a Linux kernel module) announces Unicast and Anycast addresses for www.linkedin.com and terminates TCP connections. ATS (Apache Traffic Server) terminates SSL sessions and proxies requests to datacenters. The Stickyrouting service (which I’ll talk about in a minute) tells the PoP (specifically ATS) which datacenter/fabric to send the request to. ATS in the datacenter then proxies requests to frontend services.
  13. Let’s talk about stickyrouting and Fabric-Shifting
  14. We run an offline Hadoop job to calculate primary and secondary datacenters for users. Hadoop is a distributed computing framework that processes large datasets. We store this data in an in-house key-value store named Espresso. Stickyrouting serves this information over a RESTful interface to our edge PoPs.
  15. At LinkedIn, we partition our traffic into various classes so we can control them independently: logged-in vs logged-out traffic, CDN traffic, monitoring traffic, and microsites. Logged-in users get assigned to a bucket (an arbitrary partition). We then online or offline buckets in a fabric to manipulate the distribution of traffic between fabrics.
  16. Benefits: we serve the request as close to the user as possible; capacity management ensures that datacenters aren’t overloaded; and personal data routing lowers the cost to serve.
  17. My team built the ‘TrafficShift’ app to help automate datacenter routing. We’ve automated fail-outs of datacenters. It also allows us to do automated load-testing of our datacenters.
  18. You can see that LTX1 (the Texas datacenter) is failed out.
  19. Example of failing out of the East Coast datacenter. Top graph: online buckets. Bottom graph: distribution of traffic.
  20. Automation to validate DR. We tell the engine which datacenter to stress, how much traffic to send, and over what time periods, and it executes the loadtest for us. The traffic engine watches our alerting system to ensure we do not negatively impact the member experience.
  21. Let’s talk about how users connect to LinkedIn’s PoPs.
  22. LinkedIn’s PoP locations. Note that the PoP in India is red, which means it’s offline; I’ll talk about that further later.
  23. Sometimes we need to fail out for 3rd-party issues; remember the red dot on the PoP map. We steer users to the next-best PoP, in this case from India to Singapore. Note the slow traffic tail-off in TMU1, caused by DNS TTLs not being honored. For Anycast traffic, we withdraw the prefix announcement. For Unicast traffic, we fail the healthchecks that DNS providers use to check whether we are serving from that site.
  24. Remember that red dot from before? Sometimes, by pure necessity, we need to fail out of PoPs to mitigate impact or potential impact. In this case, we move India traffic from the India PoP to Singapore. This does have an impact on client connect times and also on page load times.
  25. The UAE has two ASNs, and GeoDNS routes both to India. For ASN 5384, that’s OK. For ASN 15802, it’s not OK.
  26. RUM DNS recognizes the optimal PoPs for ASN 15802. There are two better paths: Hong Kong and London/Dublin.
  27. Drop in connect time after the change
  28. IPv6 performs up to 40% better. We’ve grown from 3% IPv6 traffic in July 2014 to over 12% today.
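
The bucket mechanism described in notes 14, 15 and 17 can be sketched roughly as follows. This is only an illustration under stated assumptions, not LinkedIn's implementation: the bucket count, the hashing scheme, the `Fabric` class, the `route` function, and the fabric name `FABRIC_B` are all hypothetical. Only the name LTX1 comes from the notes.

```python
import hashlib
from dataclasses import dataclass, field

NUM_BUCKETS = 100  # assumption: the talk does not state the real bucket count


def bucket_for(member_id: int) -> int:
    """Map a member to a stable, arbitrary bucket (hypothetical hashing scheme)."""
    digest = hashlib.sha256(str(member_id).encode()).digest()
    return int.from_bytes(digest[:8], "big") % NUM_BUCKETS


@dataclass
class Fabric:
    """A datacenter plus the set of buckets it is currently serving."""
    name: str
    online_buckets: set = field(default_factory=lambda: set(range(NUM_BUCKETS)))

    def offline(self, buckets) -> None:
        """Take some buckets offline to move part of the traffic away."""
        self.online_buckets -= set(buckets)

    def fail_out(self) -> None:
        """Take every bucket offline, i.e. fail out of the whole fabric."""
        self.online_buckets.clear()


def route(member_id: int, primary: Fabric, secondary: Fabric) -> str:
    """Pick the fabric for a request: primary if its bucket is online there,
    otherwise the secondary."""
    bucket = bucket_for(member_id)
    if bucket in primary.online_buckets:
        return primary.name
    if bucket in secondary.online_buckets:
        return secondary.name
    raise RuntimeError(f"bucket {bucket} is offline in both fabrics")


if __name__ == "__main__":
    ltx1 = Fabric("LTX1")
    other = Fabric("FABRIC_B")  # hypothetical secondary fabric
    print(route(42, ltx1, other))   # served from the primary
    ltx1.fail_out()
    print(route(42, ltx1, other))   # served from the secondary
```

In this sketch, taking a subset of buckets offline in one fabric moves a proportional share of logged-in traffic to the secondary, which is the lever a loadtest or a gradual fail-out would use.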