Operating Mesos-powered Infrastructures
Pierre Cheynier
@pierrecdn
Operations Engineer, SRE Division
October 27, 2017
Operating 600+ servers on 7 DCs @ Criteo: sharing some insights
Company
2005 – CREATION DATE
2009 – GOING ABROAD
2013 – NASDAQ IPO
2016 – +1B REVENUE !
• 2,700 employees (600 R&D engineers), 30 offices
• 1.2B distinct users/month
• Billions of ads served & transactions analyzed / day
• 7 datacenters + 15 network PoPs
• 20K servers (Linux/Windows mix)
• 3M RPS at peak time
• Real Time Bidding: ~ 10 ms
• Hadoop: 171 PB storage (+600TB per day)
Transitioning…
• Hardware: reducing the Total Cost of Ownership
• Filling racks on premises → fully populated cabinets, repeatable process
• Fully secured (RAID, 2 x power, ...) COTS → commodity hardware
• O/S: maintainability
• Windows → Linux
• Runtime: diversity
• .NET Framework → CoreCLR (.NET Core Runtime) & JVM
• Platform deployment: flexibility, self-service
• IT automation → Tasks/Job Orchestration
Transitioning…
• Stable & Maintainable system => Simple & Modular
Why Mesos ?
• Small and Extensible project
• A highly-available distributed system kernel, abstracting and isolating resources in less than 250k LoC
• Concrete primitives and interfaces, extensibility through Modules
• Implementing industry standards (such as CNI, CSI & OCI soon)
• Self-sufficient
• Mesos Containerizer
• UCR
• Where are we ?
• Started a small PoC in the second half of 2015
• 1.5 years later: 600 agents, 150+ production apps, 250K QPS
• 2 generalist frameworks, ML-oriented & GPU-based workloads coming.
The long journey of setting up production-grade infrastructures
• 1 - Automate everything
• 2 - Configure defensively
• 3 - Discovering services and more
• 4 - Provide visibility to the end-users
• 5 - Networking is hard
1 - Automate everything
• Chef: our all-purpose config management tool
• Automate everything:
• address hardware scale up/down operations in minutes.
• Choregraphie: perform complex ops using lock-based resource protection
• Reliability → CI pipelines:
• perform tests in VMs
• deploy in preproduction environment
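The lock-based protection mentioned above can be illustrated with a minimal sketch. This is not Choregraphie's implementation; it only shows the general idea of capping how many nodes may run a disruptive operation at the same time. `MaintenanceGate` and `run_protected` are hypothetical names, and an in-process semaphore stands in for what would be a distributed lock in a real fleet.

```python
import threading
from contextlib import contextmanager
from typing import Callable, Iterator


class MaintenanceGate:
    """Hypothetical guard: at most `max_concurrent` holders at a time."""

    def __init__(self, max_concurrent: int) -> None:
        self._slots = threading.BoundedSemaphore(max_concurrent)

    @contextmanager
    def hold(self) -> Iterator[bool]:
        # Non-blocking: a node that cannot get a slot skips the operation
        # and is expected to retry on its next run.
        acquired = self._slots.acquire(blocking=False)
        try:
            yield acquired
        finally:
            if acquired:
                self._slots.release()


def run_protected(gate: MaintenanceGate, operation: Callable[[], None]) -> bool:
    """Run `operation` only if a slot is free; return whether it ran."""
    with gate.hold() as acquired:
        if acquired:
            operation()
        return acquired
```

With `MaintenanceGate(1)`, a second node asking for the slot while a reboot or upgrade is in progress is refused, then succeeds once the first holder releases it.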
2 - Configure defensively
• Identify fault-domains
• Placement constraints
• Take care of user secrets
• Authenticate everything
• Encryption channel provided through asymmetric crypto & key distribution
• Mesos Secrets available now (1.4.0) - SecretResolver
• Enforce limits
• CPU: for predictability use --cgroups_enable_cfs
• Mem: turn off swap (hi OOM-killer !)
• Disk: turn on disk quotas (unbounded by default on Marathon) and understand GC.
• User: mandatory (forbid root usage and grant frameworks through Mesos ACL).
• Perform backups
• And try to restore ! (beware of API consistency / versioning)
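Several of the points above (placement constraints, mandatory non-root user, disk limits) can be checked before an application is ever submitted. The following is a sketch under stated assumptions: `validate_app` is a hypothetical helper, `rack_id` is an assumed agent attribute naming the fault domain, and the dictionary mirrors the `user`, `disk`, `instances` and `constraints` fields of a Marathon app definition.

```python
from typing import Any, Dict, List

# Marathon constraint operators that spread instances across attribute values.
SPREAD_OPERATORS = {"UNIQUE", "GROUP_BY", "MAX_PER"}


def validate_app(app: Dict[str, Any], fault_domain_attr: str = "rack_id") -> List[str]:
    """Return a list of policy violations (an empty list means the app passes)."""
    problems: List[str] = []

    # User: mandatory, and never root.
    if app.get("user") in (None, "", "root"):
        problems.append("user must be set and must not be root")

    # Disk: require an explicit size so that a quota can be enforced.
    if app.get("disk", 0) <= 0:
        problems.append("disk must be > 0 so a quota can be enforced")

    # Fault domains: multi-instance apps must spread over the chosen attribute.
    spread = any(
        len(c) >= 2 and c[0] == fault_domain_attr and c[1] in SPREAD_OPERATORS
        for c in app.get("constraints", [])
    )
    if app.get("instances", 1) > 1 and not spread:
        problems.append(f"multi-instance apps must spread over {fault_domain_attr}")

    return problems
```

A definition with `"user": "svc-web"`, `"disk": 512`, `"instances": 3` and `"constraints": [["rack_id", "GROUP_BY"]]` passes; one that omits all three is rejected with three messages.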
3 - Discovering services and more
• Flat Service Discovery model
• Don’t forget legacy !
• Help managing the DC bootstrap case
• Fallback to the nearest DC using “prepared queries”
• Intra-DC communications : 1 network hop
• Consul API (DNS / HTTP)
• CSLB library embedded in Criteo SDK
• Consul as a DC, Services and State reference
• Tags and K/V used to store service metadata
• Consul health-check as a general state reference
• Practical applications: automatically provision LBs, smooth transitions between legacy and Mesos.
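The "fallback to the nearest DC" point relies on Consul prepared queries. As a rough sketch, the definition posted to Consul's `/v1/query` endpoint could be built as below. The service name `billing-api` and the helper names are illustrative assumptions, and the exact definition used at Criteo is not given in the slides.

```python
import json
from typing import Any, Dict


def build_prepared_query(service: str, nearest_n: int = 2) -> Dict[str, Any]:
    """Build a prepared query that fails over to the nearest datacenters."""
    return {
        "Name": service,
        "Service": {
            "Service": service,
            # Only return instances whose health checks are passing.
            "OnlyPassing": True,
            # If the local DC has no healthy instance, try the N nearest DCs.
            "Failover": {"NearestN": nearest_n},
        },
    }


def query_dns_name(query_name: str, domain: str = "consul") -> str:
    """DNS name under which Consul exposes a prepared query."""
    return f"{query_name}.query.{domain}"


if __name__ == "__main__":
    print(json.dumps(build_prepared_query("billing-api"), indent=2))
    print(query_dns_name("billing-api"))
```

Clients that resolve the `.query.` name get local instances when they exist and remote ones otherwise, which is why legacy applications that only speak DNS can benefit from the fallback as well.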
4 - Provide visibility to the end-users
• Cultural changes
• App instances move continuously !
• Metrology & Alerting
• Collectd, prometheus_exporter, etc.
• Lesser-known metrics, from mesos.proto:
• Networking: net_[rx|tx|tcp]*, [TrafficControl|Ip|Tcp|Udp]Statistics,
• Disk I/O: CgroupInfo.Blkio.CFQ.Statistics
• Tracing: PerfStatistics (costly!)
• SLAs
• Transparency about platform footprint
• Report your ability to schedule – chaos monkey involved !
• Debugging / Tracing
• The Mesos I/O Switchboard: remotely attach/exec
• Introducing system tracing components such as LTTng
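The `net_*` counters listed above are cumulative, so they become useful once turned into rates. Here is a minimal sketch, assuming two successive samples of a container's `ResourceStatistics` (for example as returned in JSON by the agent's statistics endpoint) and assuming the agent's isolators actually populate the network fields. `net_rates` is a hypothetical helper.

```python
from typing import Dict


def net_rates(prev: Dict[str, float], curr: Dict[str, float]) -> Dict[str, float]:
    """Derive throughput and drop ratio from two cumulative samples."""
    elapsed = curr["timestamp"] - prev["timestamp"]
    if elapsed <= 0:
        raise ValueError("samples must be in increasing time order")

    rx_packets = curr["net_rx_packets"] - prev["net_rx_packets"]
    rx_dropped = curr["net_rx_dropped"] - prev["net_rx_dropped"]

    return {
        "rx_bytes_per_sec": (curr["net_rx_bytes"] - prev["net_rx_bytes"]) / elapsed,
        "tx_bytes_per_sec": (curr["net_tx_bytes"] - prev["net_tx_bytes"]) / elapsed,
        # Share of received packets dropped during the interval.
        "rx_drop_ratio": rx_dropped / rx_packets if rx_packets > 0 else 0.0,
    }
```

Exporting these derived values per task gives users a view that follows the application even as its instances move between agents.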
5 - Networking is hard
• “The network is reliable”
• The 8 fallacies of distributed computing (L. Peter Deutsch - 1994)
• Load-balancing
• Providing services such as: visibility, timeout profiles, sticky cookie, TLS...
• Use the new “seamless reloads” feature (1.8-dev2).
• net_cls cgroup : the simplest way to introduce basic QoS
• Noisy neighbours → which trade-off will you choose ?
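For the `net_cls` point, the cgroup tags packets with a class ID that traffic-control rules can then match. The value written to `net_cls.classid` packs a major and a minor handle into one 32-bit number (`0xAAAABBBB`). The helpers below are an illustrative sketch of that encoding only; they do not touch the cgroup filesystem, and their names are assumptions.

```python
def net_cls_classid(major: int, minor: int) -> int:
    """Pack a tc major:minor handle into the 0xAAAABBBB net_cls format."""
    if not (0 <= major <= 0xFFFF and 0 <= minor <= 0xFFFF):
        raise ValueError("major and minor must each fit in 16 bits")
    return (major << 16) | minor


def format_classid(classid: int) -> str:
    """Hex string suitable for writing into net_cls.classid."""
    return f"0x{classid:x}"


def tc_handle(classid: int) -> str:
    """Render a class ID as the major:minor handle used by tc (hex digits)."""
    return f"{classid >> 16:x}:{classid & 0xFFFF:x}"
```

For example, major `0x10` and minor `0x1` give `0x100001`, which corresponds to the tc handle `10:1`.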
Incidents…
• DC Outages
• Jul, 2017: “The site has been evacuated and the Fire Department has been notified. Every server basically got shutdown and restarted”.
• Disaster recovery scenarios
• Apr, 2017: “Marathon applications were deleted WW”
• Jun, 2017: “Zookeeper does not accept connections anymore, has been saturated by Aurora, new task deployments are in pending state”
• Noisy neighbours
• “Network latencies on 1 instance increased a lot (average, 95pctl)”
• “In 1 cabinet row, switches backplanes are currently saturated”
What’s left to answer ?
• Isolation, isolation, isolation
• Network and I/O bandwidth as a first-class resource ?
• Latency critical apps: combine with cpu_set ?
• Efficiency
• Revocable resources for non-latency critical tasks (jobs) ?
• Quotas + Oversubscription ?
• Bin packing (= reclaim hardware … & electrical power !)
• Maintenance Primitives
• Anticipate more complex operations by reclaiming resources and not allocating new tasks.
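To make the bin-packing question concrete, here is a small sketch of the classic first-fit-decreasing heuristic applied to CPU requests. It is not how the Mesos allocator works and it ignores memory, disk and constraints; it only illustrates how packing tasks tightly can free whole agents. The function name and the single-resource model are assumptions.

```python
from typing import List


def first_fit_decreasing(task_cpus: List[float], agent_cpus: float) -> List[List[float]]:
    """Pack tasks onto as few identical agents as the heuristic finds."""
    agents: List[List[float]] = []  # tasks placed on each agent
    free: List[float] = []          # remaining CPUs on each agent

    # Placing the largest tasks first tends to leave fewer unusable gaps.
    for cpus in sorted(task_cpus, reverse=True):
        if cpus > agent_cpus:
            raise ValueError(f"task needing {cpus} CPUs fits on no agent")
        for i, remaining in enumerate(free):
            if cpus <= remaining:
                agents[i].append(cpus)
                free[i] -= cpus
                break
        else:
            # No existing agent has room: bring a new one into use.
            agents.append([cpus])
            free.append(agent_cpus - cpus)
    return agents
```

With tasks of 4, 8, 1, 4, 2 and 1 CPUs and 10-CPU agents, the heuristic fills two agents completely; any agent left empty is a candidate for being powered down or reclaimed.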
Happy users !
• Providing support and sharing knowledge leads to great contributions
Do you want to know more ?
We’re hiring !
Thank you.

MesosCon EU 2017 - Criteo - Operating Mesos-based Infrastructures

  • 1. Operating Mesos-powered Infrastructures
      • Operating 600+ servers on 7 DCs @ Criteo: sharing some insights
      • Pierre Cheynier (@pierrecdn), Operations Engineer, SRE Division
      • October 27, 2017
  • 2. Company
      • 2005 - CREATION DATE
      • 2009 - GOING ABROAD
      • 2013 - NASDAQ IPO
      • 2016 - +1B REVENUE!
      • 2,700 employees (600 R&D engineers), 30 offices
      • 1.2B distinct users/month
      • Billions of ads served & transactions analyzed / day
      • 7 datacenters + 15 network PoPs
      • 20K servers (Linux/Windows mix)
      • 3M RPS at peak time
      • Real Time Bidding: ~ 10 ms
      • Hadoop: 171 PB storage (+600 TB per day)
  • 3. Transitioning…
      • Hardware: reducing the Total Cost of Ownership
      • Filling racks on premises → fully populated cabinets, repeatable process
      • Fully secured (RAID, 2 x power, ...) COTS → commodity hardware
      • O/S: maintainability
      • Windows → Linux
      • Runtime: diversity
      • .NET Framework → CoreCLR (.NET Core Runtime) & JVM
      • Platform deployment: flexibility, self-service
      • IT automation → Tasks/Job Orchestration
  • 4. Transitioning…
      • Stable & Maintainable system => Simple & Modular
  • 5. Why Mesos?
      • Small and Extensible project
      • A highly-available distributed system kernel, abstracting and isolating resources in less than 250k LoC
      • Concrete primitives and interfaces, extensibility through Modules
      • Implementing industry standards (such as CNI, with CSI & OCI coming soon)
      • Self-sufficient
      • Mesos Containerizer
      • UCR
      • Where are we?
      • Started a small PoC during the second half of 2015
      • 1.5 years later: 600 agents, 150+ production apps, 250K QPS
      • 2 generalist frameworks, ML-oriented & GPU-based workloads coming.
  • 6. The long journey of setting up production-grade infrastructures
      • 1 - Automate everything
      • 2 - Configure defensively
      • 3 - Discovering services and more
      • 4 - Provide visibility to the end-users
      • 5 - Networking is hard
  • 7. 1 - Automate everything
      • Chef: our all-purpose config management tool
      • Automate everything:
      • address hardware scale up/down operations in minutes.
      • Choregraphie: perform complex ops using lock-based resource protection
      • Reliability > CI pipelines:
      • perform tests in VMs
      • deploy in preproduction environment
  • 8. 2 - Configure defensively
      • Identify fault-domains
      • Placement constraints
      • Take care of user secrets
      • Authenticate everything
      • Encryption channel provided through asymmetric crypto & key distribution
      • Mesos Secrets available now (1.4.0) - SecretResolver
      • Enforce limits
      • CPU: for predictability use --cgroups_enable_cfs
      • Mem: turn off swap (hi, OOM-killer!)
      • Disk: turn on disk quotas (unbounded by default on Marathon) and understand GC.
      • User: mandatory (forbid root usage and grant frameworks through Mesos ACL).
      • Perform backups
      • And try to restore! (beware of API consistency / versioning)
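The CPU advice on this slide comes down to simple arithmetic. The following is a minimal sketch, assuming the usual Linux CFS model in which a hard cap is a quota of CPU time per scheduling period. The 100 ms period, the 1 ms quota floor, the 1024 shares per CPU and the function name `cpu_limits` are assumptions chosen for illustration, not values taken from the slides.

```python
# Assumed defaults (not from the slides): 100 ms CFS period,
# 1 ms minimum quota, 1024 shares per full CPU.
CFS_PERIOD_US = 100_000
MIN_CFS_QUOTA_US = 1_000
SHARES_PER_CPU = 1024
MIN_SHARES = 2


def cpu_limits(cpus: float) -> dict:
    """Translate a fractional CPU allocation into cgroup CPU settings.

    cpu.shares is only a relative weight under contention, while
    cpu.cfs_quota_us is a hard cap per period.
    """
    if cpus <= 0:
        raise ValueError("cpus must be positive")
    return {
        "cpu.shares": max(MIN_SHARES, int(cpus * SHARES_PER_CPU)),
        "cpu.cfs_period_us": CFS_PERIOD_US,
        "cpu.cfs_quota_us": max(MIN_CFS_QUOTA_US, int(cpus * CFS_PERIOD_US)),
    }


if __name__ == "__main__":
    print(cpu_limits(0.5))
```

With shares alone, a task on an idle host can burst well beyond its allocation, so its performance depends on its neighbours. Adding the quota gives the same ceiling whether the host is busy or idle, which is the predictability the slide refers to.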
  • 9. 3 - Discovering services and more
      • Flat Service Discovery model
      • Don't forget legacy!
      • Helps manage the DC bootstrap case
      • Fallback to the nearest DC using "prepared queries"
      • Intra-DC communications: 1 network hop
      • Consul API (DNS / HTTP)
      • CSLB library embedded in Criteo SDK
      • Consul as a DC, Services and State reference
      • Tags and K/V used to store service metadata
      • Consul health-check as a general state reference
      • Practical applications: automatically provision LBs, smooth transitions between legacy and Mesos.
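The "fallback to the nearest DC" item can be illustrated with a Consul prepared-query definition. This is a minimal sketch that only builds the JSON body one would send to Consul's `/v1/query` endpoint; it makes no network call. The service name `my-service`, the default of two fallback datacenters and the helper names are hypothetical, not Criteo's actual configuration.

```python
import json


def nearest_dc_query(service: str, nearest_n: int = 2) -> dict:
    """Build a prepared-query definition that fails over to the
    nearest_n closest datacenters when no healthy local instance exists."""
    if nearest_n < 1:
        raise ValueError("nearest_n must be at least 1")
    return {
        "Name": service,
        "Service": {
            "Service": service,
            "OnlyPassing": True,  # ignore instances with failing or warning checks
            "Failover": {"NearestN": nearest_n},
        },
    }


def query_dns_name(name: str, domain: str = "consul") -> str:
    """DNS name under which a prepared query is resolvable."""
    return f"{name}.query.{domain}"


if __name__ == "__main__":
    print(json.dumps(nearest_dc_query("my-service"), indent=2))
    print(query_dns_name("my-service"))
```

Because the query is resolvable through DNS as well as HTTP, legacy clients that only know how to resolve a hostname get the same cross-DC fallback as clients using an SDK.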
  • 10. 4 - Provide visibility to the end-users
      • Cultural changes
      • App instances move continuously!
      • Metrics & Alerting
      • Collectd, prometheus_exporter, etc.
      • Lesser-known metrics, from mesos.proto:
      • Networking: net_[rx|tx|tcp]*, [TrafficControl|Ip|Tcp|Udp]Statistics
      • Disk I/O: CgroupInfo.Blkio.CFQ.Statistics
      • Tracing: PerfStatistics (costly!)
      • SLAs
      • Transparency about platform footprint
      • Report your ability to schedule (chaos monkey involved!)
      • Debugging / Tracing
      • The Mesos I/O Switchboard: remotely attach/exec
      • Introducing system tracing components such as LTTng
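The networking counters listed on this slide are most useful as rates. Below is a minimal sketch, assuming each sample is a dictionary holding a `timestamp` in seconds and cumulative counters such as `net_rx_bytes`. The sample values and the function name `counter_rate` are made up for illustration.

```python
def counter_rate(prev: dict, curr: dict, field: str = "net_rx_bytes") -> float:
    """Per-second rate of a cumulative counter between two samples."""
    elapsed = curr["timestamp"] - prev["timestamp"]
    if elapsed <= 0:
        raise ValueError("samples must be in increasing time order")
    delta = curr[field] - prev[field]
    if delta < 0:
        # A counter going backwards usually means the container restarted.
        raise ValueError("counter reset detected")
    return delta / elapsed


if __name__ == "__main__":
    # Hypothetical samples taken 10 seconds apart.
    before = {"timestamp": 100.0, "net_rx_bytes": 1_000}
    after = {"timestamp": 110.0, "net_rx_bytes": 6_000}
    print(counter_rate(before, after))
```

Since app instances move continuously, rates like this are better aggregated per application than per host before alerting on them.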
  • 11. 5 - Networking is hard
      • "The network is reliable"
      • The 8 fallacies of distributed computing (L. Peter Deutsch - 1994)
      • Load-balancing
      • Providing services such as: visibility, timeout profiles, sticky cookie, TLS...
      • Use the new "seamless reloads" feature (1.8-dev2).
      • net_cls cgroup: the simplest way to introduce basic QoS
      • Noisy neighbours > which trade-off will you choose?
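For the net_cls item, the cgroup tags packets with a class ID that traffic-control rules can then match. The sketch below only shows how a `major:minor` handle is encoded into the 32-bit value written to `net_cls.classid`. The handle `10:1` and the helper names are illustrative assumptions, and nothing is written to the cgroup filesystem.

```python
def net_cls_classid(major: int, minor: int) -> int:
    """Encode a tc handle major:minor as 0xAAAABBBB (major in the high 16 bits)."""
    for part in (major, minor):
        if not 0 <= part <= 0xFFFF:
            raise ValueError("major and minor must fit in 16 bits")
    return (major << 16) | minor


def format_classid(major: int, minor: int) -> str:
    """Hexadecimal string form suitable for writing to net_cls.classid."""
    return f"0x{net_cls_classid(major, minor):08x}"


if __name__ == "__main__":
    # Handle 10:1 (hexadecimal digits, as tc writes them).
    print(format_classid(0x10, 0x1))
```

Tagging is only half of the setup: a qdisc and a cgroup filter on the host still have to map each class to a bandwidth or priority policy.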
  • 12. Incidents…
      • DC Outages
      • Jul, 2017: "The site has been evacuated and the Fire Department has been notified. Every server basically got shutdown and restarted".
      • Disaster recovery scenarios
      • Apr, 2017: "Marathon applications were deleted WW"
      • Jun, 2017: "Zookeeper does not accept connections anymore, has been saturated by Aurora, new task deployments are in pending state"
      • Noisy neighbours
      • "Network latencies on 1 instance increased a lot (average, 95pctl)"
      • "In 1 cabinet row, switches backplanes are currently saturated"
  • 13. What's left to answer?
      • Isolation, isolation, isolation
      • Network and I/O bandwidth as a first-class resource?
      • Latency critical apps: combine with cpu_set?
      • Efficiency
      • Revocable resources for non-latency critical tasks (jobs)?
      • Quotas + Oversubscription?
      • Bin packing (= reclaim hardware … & electrical power!)
      • Maintenance Primitives
      • Anticipate more complex operations by reclaiming resources and not allocating new tasks.
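To make the bin-packing item concrete, here is a minimal sketch of the classic first-fit-decreasing heuristic over a single resource dimension. It is a textbook technique used for illustration only, not the Mesos allocator or anything described in the talk. The task sizes and host capacity are invented.

```python
def first_fit_decreasing(demands: list[int], capacity: int) -> list[list[int]]:
    """Pack task demands onto as few hosts as the heuristic finds.

    Returns one list of demands per host that ends up in use.
    """
    hosts: list[list[int]] = []
    free: list[int] = []  # remaining capacity per host, same order as hosts
    for demand in sorted(demands, reverse=True):
        if demand > capacity:
            raise ValueError(f"demand {demand} exceeds host capacity {capacity}")
        for i, room in enumerate(free):
            if demand <= room:
                hosts[i].append(demand)
                free[i] -= demand
                break
        else:
            # No existing host has room: bring a new one into use.
            hosts.append([demand])
            free.append(capacity - demand)
    return hosts


if __name__ == "__main__":
    # Hypothetical CPU demands on hosts with 10 CPUs each.
    print(first_fit_decreasing([4, 8, 1, 4, 2, 1], capacity=10))
```

Every host left empty by a tighter packing can be powered down or reclaimed, which is the hardware and electrical saving the slide points to. The trade-off is denser hosts, and therefore more exposure to the noisy-neighbour problems raised earlier.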
  • 14. Happy users!
      • Providing support and sharing knowledge leads to great contributions
  • 15. Do you want to know more? We're hiring! Thank you.