SlideShare une entreprise Scribd logo
1  sur  14
Télécharger pour lire hors ligne
LISA 2011

Practice and Experience Report
Scaling on EC2 in a fast-paced environment
Nicolas Brousse
nicolas@tubemogul.com
December 8th 2011

2011 TubeMogul Incorporated All rights reserved.

1
Introduction - About the speaker
• My name is Nicolas Brousse
• I previously worked for many industry leading company in France
– From Web Hosting to Online Video services

(Lycos, MultiMania, Kewego, MediaPlazza...)

– Heavy traffic environment and large user databases
• I work as a Lead Operations Engineer at TubeMogul.com since 2008
• I help TubeMogul to scale its infrastructure
– From 20 servers to +500 servers
– Using 4 Amazon EC2 Regions + 1 Colo
– Monitoring with Nagios over 6,000 actives services and 1,000 passives services
– Collecting over 80,000 metrics with Ganglia
– Managing over 300 TB of data in Hadoop HDFS
– Billions HTTP queries a day
• Occasionally contribute to OpenSource projects
– Ganglia (PHP and PERL module)
– PHP Judy
2011 TubeMogul Incorporated All rights reserved.

2
Introduction - About TubeMogul
• Created in November 2006 by John Hughes and Brett Wilson
• Formerly a video distribution and analytics platform
• Acquire Illuminex - a flash analytics firm - in October 2008
• New platform call PlayTime™ :
– TubeMogul is a Video Marketing Company
– Built for Branding
– Integrate real-time media buying, ad serving, targeting, optimization and brand
measurement

TubeMogul simplifies the delivery of video ads and maximizes the impact of
every dollar spent by brand marketers
http://www.tubemogul.com/company/about_us

2011 TubeMogul Incorporated All rights reserved.

3
Amazon Clound Environment

2011 TubeMogul Incorporated All rights reserved.

4
Amazon Clound Environment
• We like it because....
– We can quickly start new servers/clusters
– We can quickly start new servers/clusters in many regions
• US East (Virginia)
• US West (North California)
• Europe (Dublin)
• Asia Pacific (Tokyo & Singapore)
– We can use different type of instances (RAM, CPU, Disks, etc.)
– It’s easy to automate with EC2 API
– It’s easy to plug to a configuration management tool

• But...
– It can be hard to troubleshoot some failures or network problems
– Occasionally being notified of hardware failures after the facts
– No Multicast (Though, possible with Amazon VPC)
– Bandwidth cost between regions can get expensive

2011 TubeMogul Incorporated All rights reserved.

5
Auto scale a cloud application
Crawling partners’ API to
aggregate Analytics data:
1. Call AWS API to start
instances
2. AWS Start instances
3. We push our code to the
instances
4. Open SSH tunnel to DB
5. Crawl Partners API and
aggregate the data
6. Push the data to our DB thru
SSH Tunnel
7. Shutdown EC2 instance

2011 TubeMogul Incorporated All rights reserved.

6
First approach to manage “always on” instances
•

Instances provisioning with a home made script called “Cerveza” in Tcl/Tk
–
–

•

Used to configure instance profile at boot
Let us run commands on multiple instances at once

Using LDAP and SQLite to track hostname and instance profiles
–
–

•

DNS Bind plugged to LDAP
SQLite store EBS volume, Instances id, keypair, AMI, etc.

Instance configuration with shell scripts from NFS mount
–

EBS Raid setup

–

Software deployment

–

Configuration changes

2011 TubeMogul Incorporated All rights reserved.

7
First approach to manage “always on” instances
•

Access Control
–
–

SSH Access limited by VPN

–

•

Security Group for each Cluster type (Hadoop, MySQL, etc.)
SSH and VPN plugged to LDAP

Easy way to identify instances
–
–

•

DNS plugged to LDAP
Each instance configured with human readable hostname
example: dev-mysql01, hadoop-namenode01, etc.

Easy working process for devs
–

•

Users’ home directory using NFS auto-mount

Automate Instances Monitoring
–

Trending with Ganglia, Alerting with Nagios

–

Using trending data for most Nagios checks

–

NSCA

2011 TubeMogul Incorporated All rights reserved.

8
Learning the hard way - the story
• SPOF in our infrastructure design lead us to a long outage
– prevent us to login into servers for couple of days
– couldn’t start new instances

1. Main server with NFS/LDAP went down
• corrupted EBS lead to fsck at boot time but no KVM on EC2...
• starting new instance still get stuck on fsck at boot
• end up using old AMI which we rebuild changing fstab
2. New instance took time to get ready to use because of old AMI
• need to recover from ldap backup first
• lost lot of configuration setup, slowing down our ability to manage instances
• need to re-install and configure our instance management tool “Cerveza”
3. /etc/resolv.conf need to be changed on every instances
• ssh backdoor didn’t work all the time because of LDAP/SSH/NFS timeout
4. Security Group rules with private IP made the recovery process more
complex
2011 TubeMogul Incorporated All rights reserved.

9
Learning the hard way - quick fixes

2011 TubeMogul Incorporated All rights reserved.

10
Going worldwide
• Need to handle multiple EC2 region for improved response time
– use of gateway server in each Availability Zone
– DNS Caching, LDAP Syncrepl, NFS with FS-Cache

• “Cerveza” rewrite in Python
– Replace SQLite by SimpleDB
– can be run from our laptop, need to run on a specific server
– handle full provisioning (start/stop/reboot) instances
– easily define instances profiles with YAML
– use EC2 tags to identify instances and EBS volumes

• Stop building our own AMI
– use of public ubuntu AMI server to reduce maintenance and support burden
– use of cloud-init to easy start and preconfigure new instances
– use of Puppet as our configuration management tool

2011 TubeMogul Incorporated All rights reserved.

11
Going worldwide

2011 TubeMogul Incorporated All rights reserved.

12
Lesson Learned
• Cloud environment doesn’t mean you should by-pass basic infrastructure
management rules
• Make sure the evolution of your infrastructure doesn’t introduce SPOF
• Keep it simple, stupid
• Don’t forget about backup and recovery process
• Use a configuration management tool early to prevent headache later

2011 TubeMogul Incorporated All rights reserved.

13
Thank You...
TubeMogul is Hiring !
http://www.tubemogul.com/company/careers
jobs@tubemogul.com
Follow us on Twitter
@tubemogul
2011 TubeMogul Incorporated All rights reserved.

@orieg

14

Contenu connexe

Plus de Nicolas Brousse

SuiteWorld16: Mega Volume - How TubeMogul Leverages NetSuite
SuiteWorld16: Mega Volume - How TubeMogul Leverages NetSuiteSuiteWorld16: Mega Volume - How TubeMogul Leverages NetSuite
SuiteWorld16: Mega Volume - How TubeMogul Leverages NetSuiteNicolas Brousse
 
SRECon16: Moving Large Workloads from a Public Cloud to an OpenStack Private ...
SRECon16: Moving Large Workloads from a Public Cloud to an OpenStack Private ...SRECon16: Moving Large Workloads from a Public Cloud to an OpenStack Private ...
SRECon16: Moving Large Workloads from a Public Cloud to an OpenStack Private ...Nicolas Brousse
 
USENIX LISA15: How TubeMogul Handles over One Trillion HTTP Requests a Month
USENIX LISA15: How TubeMogul Handles over One Trillion HTTP Requests a MonthUSENIX LISA15: How TubeMogul Handles over One Trillion HTTP Requests a Month
USENIX LISA15: How TubeMogul Handles over One Trillion HTTP Requests a MonthNicolas Brousse
 
Puppet Camp Silicon Valley 2015: How TubeMogul reached 10,000 Puppet Deployme...
Puppet Camp Silicon Valley 2015: How TubeMogul reached 10,000 Puppet Deployme...Puppet Camp Silicon Valley 2015: How TubeMogul reached 10,000 Puppet Deployme...
Puppet Camp Silicon Valley 2015: How TubeMogul reached 10,000 Puppet Deployme...Nicolas Brousse
 
Improving Operations Efficiency with Puppet
Improving Operations Efficiency with PuppetImproving Operations Efficiency with Puppet
Improving Operations Efficiency with PuppetNicolas Brousse
 
Scaling Bleeding Edge Technology in a Fast-paced Environment
Scaling Bleeding Edge Technology in a Fast-paced EnvironmentScaling Bleeding Edge Technology in a Fast-paced Environment
Scaling Bleeding Edge Technology in a Fast-paced EnvironmentNicolas Brousse
 
Scaling on EC2 in a fast-paced environment (LISA'11 - Full Paper)
Scaling on EC2 in a fast-paced environment (LISA'11 - Full Paper)Scaling on EC2 in a fast-paced environment (LISA'11 - Full Paper)
Scaling on EC2 in a fast-paced environment (LISA'11 - Full Paper)Nicolas Brousse
 
Bringing Business Awareness to Your Operation Team (Nagios World Conference 2...
Bringing Business Awareness to Your Operation Team (Nagios World Conference 2...Bringing Business Awareness to Your Operation Team (Nagios World Conference 2...
Bringing Business Awareness to Your Operation Team (Nagios World Conference 2...Nicolas Brousse
 
Optimizing your Monitoring and Trending tools for the Cloud (Nagios World Con...
Optimizing your Monitoring and Trending tools for the Cloud (Nagios World Con...Optimizing your Monitoring and Trending tools for the Cloud (Nagios World Con...
Optimizing your Monitoring and Trending tools for the Cloud (Nagios World Con...Nicolas Brousse
 

Plus de Nicolas Brousse (9)

SuiteWorld16: Mega Volume - How TubeMogul Leverages NetSuite
SuiteWorld16: Mega Volume - How TubeMogul Leverages NetSuiteSuiteWorld16: Mega Volume - How TubeMogul Leverages NetSuite
SuiteWorld16: Mega Volume - How TubeMogul Leverages NetSuite
 
SRECon16: Moving Large Workloads from a Public Cloud to an OpenStack Private ...
SRECon16: Moving Large Workloads from a Public Cloud to an OpenStack Private ...SRECon16: Moving Large Workloads from a Public Cloud to an OpenStack Private ...
SRECon16: Moving Large Workloads from a Public Cloud to an OpenStack Private ...
 
USENIX LISA15: How TubeMogul Handles over One Trillion HTTP Requests a Month
USENIX LISA15: How TubeMogul Handles over One Trillion HTTP Requests a MonthUSENIX LISA15: How TubeMogul Handles over One Trillion HTTP Requests a Month
USENIX LISA15: How TubeMogul Handles over One Trillion HTTP Requests a Month
 
Puppet Camp Silicon Valley 2015: How TubeMogul reached 10,000 Puppet Deployme...
Puppet Camp Silicon Valley 2015: How TubeMogul reached 10,000 Puppet Deployme...Puppet Camp Silicon Valley 2015: How TubeMogul reached 10,000 Puppet Deployme...
Puppet Camp Silicon Valley 2015: How TubeMogul reached 10,000 Puppet Deployme...
 
Improving Operations Efficiency with Puppet
Improving Operations Efficiency with PuppetImproving Operations Efficiency with Puppet
Improving Operations Efficiency with Puppet
 
Scaling Bleeding Edge Technology in a Fast-paced Environment
Scaling Bleeding Edge Technology in a Fast-paced EnvironmentScaling Bleeding Edge Technology in a Fast-paced Environment
Scaling Bleeding Edge Technology in a Fast-paced Environment
 
Scaling on EC2 in a fast-paced environment (LISA'11 - Full Paper)
Scaling on EC2 in a fast-paced environment (LISA'11 - Full Paper)Scaling on EC2 in a fast-paced environment (LISA'11 - Full Paper)
Scaling on EC2 in a fast-paced environment (LISA'11 - Full Paper)
 
Bringing Business Awareness to Your Operation Team (Nagios World Conference 2...
Bringing Business Awareness to Your Operation Team (Nagios World Conference 2...Bringing Business Awareness to Your Operation Team (Nagios World Conference 2...
Bringing Business Awareness to Your Operation Team (Nagios World Conference 2...
 
Optimizing your Monitoring and Trending tools for the Cloud (Nagios World Con...
Optimizing your Monitoring and Trending tools for the Cloud (Nagios World Con...Optimizing your Monitoring and Trending tools for the Cloud (Nagios World Con...
Optimizing your Monitoring and Trending tools for the Cloud (Nagios World Con...
 

Dernier

Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfLoriGlavin3
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxLoriGlavin3
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfMounikaPolabathina
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxLoriGlavin3
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfPrecisely
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 

Dernier (20)

Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdf
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdf
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 

Scaling on EC2 in a Fast-Paced Environment (LISA'11)

  • 1. LISA 2011 Practice and Experience Report Scaling on EC2 in a fast-paced environment Nicolas Brousse nicolas@tubemogul.com December 8th 2011 2011 TubeMogul Incorporated All rights reserved. 1
  • 2. Introduction - About the speaker • My name is Nicolas Brousse • I previously worked for many industry leading company in France – From Web Hosting to Online Video services (Lycos, MultiMania, Kewego, MediaPlazza...) – Heavy traffic environment and large user databases • I work as a Lead Operations Engineer at TubeMogul.com since 2008 • I help TubeMogul to scale its infrastructure – From 20 servers to +500 servers – Using 4 Amazon EC2 Regions + 1 Colo – Monitoring with Nagios over 6,000 actives services and 1,000 passives services – Collecting over 80,000 metrics with Ganglia – Managing over 300 TB of data in Hadoop HDFS – Billions HTTP queries a day • Occasionally contribute to OpenSource projects – Ganglia (PHP and PERL module) – PHP Judy 2011 TubeMogul Incorporated All rights reserved. 2
  • 3. Introduction - About TubeMogul • Created in November 2006 by John Hughes and Brett Wilson • Formerly a video distribution and analytics platform • Acquire Illuminex - a flash analytics firm - in October 2008 • New platform call PlayTime™ : – TubeMogul is a Video Marketing Company – Built for Branding – Integrate real-time media buying, ad serving, targeting, optimization and brand measurement TubeMogul simplifies the delivery of video ads and maximizes the impact of every dollar spent by brand marketers http://www.tubemogul.com/company/about_us 2011 TubeMogul Incorporated All rights reserved. 3
  • 4. Amazon Clound Environment 2011 TubeMogul Incorporated All rights reserved. 4
  • 5. Amazon Clound Environment • We like it because.... – We can quickly start new servers/clusters – We can quickly start new servers/clusters in many regions • US East (Virginia) • US West (North California) • Europe (Dublin) • Asia Pacific (Tokyo & Singapore) – We can use different type of instances (RAM, CPU, Disks, etc.) – It’s easy to automate with EC2 API – It’s easy to plug to a configuration management tool • But... – It can be hard to troubleshoot some failures or network problems – Occasionally being notified of hardware failures after the facts – No Multicast (Though, possible with Amazon VPC) – Bandwidth cost between regions can get expensive 2011 TubeMogul Incorporated All rights reserved. 5
  • 6. Auto scale a cloud application Crawling partners’ API to aggregate Analytics data: 1. Call AWS API to start instances 2. AWS Start instances 3. We push our code to the instances 4. Open SSH tunnel to DB 5. Crawl Partners API and aggregate the data 6. Push the data to our DB thru SSH Tunnel 7. Shutdown EC2 instance 2011 TubeMogul Incorporated All rights reserved. 6
  • 7. First approach to manage “always on” instances • Instances provisioning with a home made script called “Cerveza” in Tcl/Tk – – • Used to configure instance profile at boot Let us run commands on multiple instances at once Using LDAP and SQLite to track hostname and instance profiles – – • DNS Bind plugged to LDAP SQLite store EBS volume, Instances id, keypair, AMI, etc. Instance configuration with shell scripts from NFS mount – EBS Raid setup – Software deployment – Configuration changes 2011 TubeMogul Incorporated All rights reserved. 7
  • 8. First approach to manage “always on” instances • Access Control – – SSH Access limited by VPN – • Security Group for each Cluster type (Hadoop, MySQL, etc.) SSH and VPN plugged to LDAP Easy way to identify instances – – • DNS plugged to LDAP Each instance configured with human readable hostname example: dev-mysql01, hadoop-namenode01, etc. Easy working process for devs – • Users’ home directory using NFS auto-mount Automate Instances Monitoring – Trending with Ganglia, Alerting with Nagios – Using trending data for most Nagios checks – NSCA 2011 TubeMogul Incorporated All rights reserved. 8
  • 9. Learning the hard way - the story • SPOF in our infrastructure design lead us to a long outage – prevent us to login into servers for couple of days – couldn’t start new instances 1. Main server with NFS/LDAP went down • corrupted EBS lead to fsck at boot time but no KVM on EC2... • starting new instance still get stuck on fsck at boot • end up using old AMI which we rebuild changing fstab 2. New instance took time to get ready to use because of old AMI • need to recover from ldap backup first • lost lot of configuration setup, slowing down our ability to manage instances • need to re-install and configure our instance management tool “Cerveza” 3. /etc/resolv.conf need to be changed on every instances • ssh backdoor didn’t work all the time because of LDAP/SSH/NFS timeout 4. Security Group rules with private IP made the recovery process more complex 2011 TubeMogul Incorporated All rights reserved. 9
  • 10. Learning the hard way - quick fixes 2011 TubeMogul Incorporated All rights reserved. 10
  • 11. Going worldwide • Need to handle multiple EC2 region for improved response time – use of gateway server in each Availability Zone – DNS Caching, LDAP Syncrepl, NFS with FS-Cache • “Cerveza” rewrite in Python – Replace SQLite by SimpleDB – can be run from our laptop, need to run on a specific server – handle full provisioning (start/stop/reboot) instances – easily define instances profiles with YAML – use EC2 tags to identify instances and EBS volumes • Stop building our own AMI – use of public ubuntu AMI server to reduce maintenance and support burden – use of cloud-init to easy start and preconfigure new instances – use of Puppet as our configuration management tool 2011 TubeMogul Incorporated All rights reserved. 11
  • 12. Going worldwide 2011 TubeMogul Incorporated All rights reserved. 12
  • 13. Lesson Learned • Cloud environment doesn’t mean you should by-pass basic infrastructure management rules • Make sure the evolution of your infrastructure doesn’t introduce SPOF • Keep it simple, stupid • Don’t forget about backup and recovery process • Use a configuration management tool early to prevent headache later 2011 TubeMogul Incorporated All rights reserved. 13
  • 14. Thank You... TubeMogul is Hiring ! http://www.tubemogul.com/company/careers jobs@tubemogul.com Follow us on Twitter @tubemogul 2011 TubeMogul Incorporated All rights reserved. @orieg 14