Artur Bergman
          sky@crucially.net
• Wikia Inc
  – We are hiring
  – Community/Bizdev in Germany
  – Engineers in Poland
  – http://www.wikia.com/wiki/hiring
• O’Reilly Radar
  – http://radar.oreilly.com/artur/
The value of operations
•   Google
•   Orkut
•   Friendster
•   Myspace
Benefits
•   Users trust your brand
•   They rely on you
•   They spend more time on your site
•   Bad operations waste R&D money

• Fixed amount of time + faster site =
  more page views
Stepchild of Engineering
• Product development
• Engineering
• Operations
  – Sysadmins?
• Why?
Operations Engineering
• It is engineering
• Google terminology -
  – Site Reliability Engineer
• Sure there are sysadmins too, people
  managing NOCs and datacenters
• Provide career growth
Good Engineers
•   Detail Oriented
•   Aspire to be operational engineers
•   Stubborn
•   Can steer their inner ADD
    – Interrupt driven
• Not the same as good developers
Danger signs
• Thinks operations is a path to
  development engineering
  – Fire them
• Want people dedicated to the task
• A good operations engineer should
  spend some time in development
• A good development engineer MUST
  spend some time in operations
Debugging
• 9 Rules of debugging
• http://www.debuggingrules.com/Poster_download.html
  – Yes the font is horrible
Rule 1:
       Understand the system
•   Complexity Kills
•   No excuse
•   If you write it, you must know it
•   If you run it, you must know it
•   If you buy it, you must know it
Rule 3:
      Quit thinking and look
• “It is a capital mistake to theorize before
  one has data. Insensibly one begins to
  twist facts to suit theories, instead of
  theories to suit facts.”
Rule 3:
        Quit thinking and look
•   What do you look at?
•   The importance of monitoring
•   Monitoring
•   Monitoring
•   Monitoring
My my, confusing term
• Monitoring
• Alerting
• Trending
Monitoring
•   Collects data
•   Puts into databases
•   Makes it available for you
•   Active collection
•   Passive interaction
Alerting
• Acts on monitoring data
• Severe alerts
  – Active
  – Needs action
• Passive alerts
  – Things that need to be done but not right now
• DO NOT OVER ALERT
• DO NOT CRY WOLF
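A rough sketch of the severe/passive split above, in Python. The thresholds, the `Sample` fields, and the `classify` helper are illustrative assumptions, not Wikia's actual rules; the point is only that few conditions should page a human and the rest should queue up.

```python
from dataclasses import dataclass

# Hypothetical thresholds, chosen only for illustration.
PAGE_IF_RESPONSE_MS_OVER = 2000   # site is slow: someone must act now
TICKET_IF_DISK_PCT_OVER = 80      # needs doing, but not right now


@dataclass
class Sample:
    site_up: bool
    response_ms: float
    disk_used_pct: float


def classify(sample: Sample) -> str:
    """Return 'page' for severe alerts, 'ticket' for passive ones, else 'ok'."""
    if not sample.site_up or sample.response_ms > PAGE_IF_RESPONSE_MS_OVER:
        return "page"
    if sample.disk_used_pct > TICKET_IF_DISK_PCT_OVER:
        return "ticket"
    return "ok"
```

Keeping the "page" branch this narrow is one way to avoid crying wolf: everything else lands in a queue that can wait until morning.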
Wikia alerting strategy
•   When the site is slow
•   Or down
•   We send emails and do phone calls
•   Europe and US West coast
•   Looking to hire in East Asia
•   No night time
Trending
• Long term
• Capacity planning
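One simple way to turn trending data into a capacity estimate is a straight-line fit. The sketch below assumes one usage sample per day and roughly linear growth; `days_until_full` is a made-up helper name, and real growth curves are often not linear.

```python
def days_until_full(daily_usage, capacity):
    """Fit a least-squares line to daily samples and project when it reaches capacity.

    Returns days remaining after the last sample, or None if usage is flat or falling.
    """
    n = len(daily_usage)
    if n < 2:
        raise ValueError("need at least two samples")
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(daily_usage) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, daily_usage))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    if slope <= 0:
        return None  # no growth, so no projected exhaustion date
    intercept = mean_y - slope * mean_x
    day_full = (capacity - intercept) / slope
    return max(0.0, day_full - (n - 1))
```

For example, with samples `[10, 20, 30, 40]` and a capacity of 100, the fitted line grows by 10 per day and reaches 100 six days after the last sample.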
Monitor Tools
•   Nagios
•   Cacti
•   MRTG
•   Hyperic
•   Cricket
•   Ganglia
External Monitoring
• Use one, tells you what your clients see
  every x minutes
• Keynote
• Gomez
• Websitepulse (cheap - easy - I like
  them; no annoying sales force)
Nagios
•   Alerting
•   Hassle
•   C CGI??
•   Doesn’t scale
Hyperic
• Most exciting open source tool
• Agent based - self configured
• Baseline alerting
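Baseline alerting means alerting on deviation from what is normal for a metric rather than on a fixed threshold. The sketch below is one common way to do that (mean plus or minus k standard deviations); it is an assumption for illustration and not a description of how Hyperic computes its baselines.

```python
from statistics import mean, stdev


def outside_baseline(history, current, k=3.0):
    """True if `current` is more than k standard deviations from the historical mean."""
    if len(history) < 2:
        return False  # not enough data to form a baseline yet
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return current != mu  # perfectly flat history: any change is a deviation
    return abs(current - mu) > k * sigma
```

The value of `k` trades sensitivity against noise; a larger `k` alerts less often.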
Cricket MRTG Cacti
• Impossible to configure
• You need to write tools to do it
• Especially Cacti
  – Somewhat more pleasant than clawing out
    your eyes
Ganglia
• We love ganglia
• Automatically graphs everything you
  want - just works
• Large scale clusters
• Multicast
• Zero config
• RRD
http://ganglia.wikimedia.org/
•   270 hosts
•   880 CPU
•   2 clusters
•   1.2 TB of Memory
Custom Ganglia Gmetrics
• Write your own

gmetric --name='Oldest query' --type=int32 \
  --units='sec' --dmax=65 \
  --value="$(echo 'show processlist' | mysql -uroot -ppass |
    grep -v Sleep | grep -v 'system user' | head -2 |
    tail -1 | cut -f 6)"
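The pipeline above reports the Time column of the first non-sleeping, non-replication row. If you want the longest-running query instead, the parsing step can be sketched in Python as below. This is an assumption about what "oldest" should mean, not the slide's method; `oldest_query_seconds` and the `SAMPLE` data are invented for illustration, and the input is assumed to be the tab-separated batch output of `show processlist`.

```python
def oldest_query_seconds(processlist_tsv: str) -> int:
    """Return the largest Time value among active, non-replication threads.

    Expects a header line followed by tab-separated rows with the columns
    Id, User, Host, db, Command, Time, State, Info.
    """
    lines = processlist_tsv.strip().splitlines()
    oldest = 0
    for line in lines[1:]:  # skip the header row
        fields = line.split("\t")
        if len(fields) < 6:
            continue
        user, command, time_s = fields[1], fields[4], fields[5]
        if command == "Sleep" or user == "system user":
            continue
        if time_s.isdigit():
            oldest = max(oldest, int(time_s))
    return oldest


# Made-up sample rows, only to show the expected input shape.
SAMPLE = "\n".join("\t".join(row) for row in [
    ["Id", "User", "Host", "db", "Command", "Time", "State", "Info"],
    ["1", "system user", "", "NULL", "Connect", "5000", "Waiting", "NULL"],
    ["2", "app", "host:1", "wiki", "Sleep", "300", "", "NULL"],
    ["3", "app", "host:2", "wiki", "Query", "12", "Sending data", "SELECT 1"],
    ["4", "root", "localhost", "NULL", "Query", "0", "NULL", "show processlist"],
])
print(oldest_query_seconds(SAMPLE))  # prints 12
```

The result would then be handed to `gmetric --value=...` exactly as in the shell version.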
Custom Ganglia Gmetrics
• Or Learn Unix

Something is wrong

• Don’t worry, data warehouse




tcpdump / Wireshark
•   If you suspect the network
•   Don’t just suspect
•   LOOK AT IT
•   tcpdump / Wireshark will tell you
    – If your packets are lost, delayed or
      corrupted
    – If your windowing is wrong
Rule 4: Divide and Conquer
• Look at the problems in turn
• Split between people
• Go in the order you suspect is the most
  likely
Rule 5:
 Change one thing at a time
• I cannot stress this enough
• IF YOU DO NOT THEN YOU HAVE
  FAILED TO IDENTIFY THE PROBLEM
Rule 6:
        Keep an audit trail
• You might be making things worse
• Good for the root cause analysis
• Have your shell log all commands
  – Good practice anyway
• Version control
Rule 9:
    If you didn’t fix it, it ain’t fixed
•   You must do something to fix a problem
•   Or it will bite you again
•   And again
•   And again
•   They don’t just appear and disappear
•   Except BGP route convergence :)
Process
• You need a little
• Don’t worry
Don’t forget
Complexity kills
•   Design against it
•   Reuse components
•   Define standards
•   Have a few images that all machines
    look like - reimage machines every now
    and then for the heck of it.
    – EC2 forces you to do this
MTBF
Mean Time Between Failures
• Actually mostly irrelevant
• Dealing with failure is more important
• Target the right uptime
  – Complexity scales exponentially with
    required uptime
• Don’t kid yourself, you don’t need 5
  nines
MTTR
  Mean Time To Recovery
• Important
• No one cares if you fail once a minute
  – If you recover in 50 ms
• If you are down 1 minute a week, you
  are still going to hit 4 nines (99.99%)
• Failures happen, plan how to deal with
  them
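The four-nines claim is easy to check with a few lines of arithmetic. The helper names below are made up for this sketch; the only inputs are the number of minutes in a week and the downtime figure from the slide.

```python
MINUTES_PER_WEEK = 7 * 24 * 60  # 10080


def availability(downtime_minutes_per_week: float) -> float:
    """Fraction of the week the site is up."""
    return 1.0 - downtime_minutes_per_week / MINUTES_PER_WEEK


def allowed_downtime_minutes_per_week(nines: int) -> float:
    """Weekly downtime budget for a target such as 4 nines (99.99%)."""
    return MINUTES_PER_WEEK * 10 ** (-nines)


print(f"{availability(1) * 100:.4f}%")                 # 1 minute down per week
print(f"{allowed_downtime_minutes_per_week(4):.3f}")   # budget for 4 nines, in minutes
print(f"{allowed_downtime_minutes_per_week(5):.4f}")   # budget for 5 nines, in minutes
```

One minute of downtime a week sits just inside the four-nines budget of about 1.008 minutes. Five nines leaves roughly six seconds a week, which is why the required complexity climbs so quickly.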
Problem found
• If it is critical, start a phone conversation
• Use IRC to communicate technical data
• One person liaises with non technical
  staff
• One person specifically in command
• Sleep scheduling ( audit log important )
Post crisis
• Root cause analysis
  – Just find out what went wrong
  – And how to avoid it
  – Or fix it faster next time if you can’t
• Keep track of your uptime
Automation
•   All machines are created equal
•   Seriously
•   If you manually make changes
•   You are wrong
    – Unless you know what you are doing
Best practices
•   Version control
•   Gold images
•   Centralised authentication
•   Time Sync ( NTP )
•   Central logging
•   ( All of this applies for virtual machines
    too!)
cfengine
•   Standard automation tool
•   Written in C
•   Not much support
•   Very good
•   Very annoying
control:
  site = ( mysite )
  domain = ( mysite.country )
  sysadm = ( mark )
  netmask = ( 255.255.255.0 )
  actionsequence = (
    mountall
    mountinfo
    addmounts
    mountall
    links
  )
  mountpattern = ( /$(site)/$(host) )
  homepattern = ( u? )
Puppet
•   New hip kid on the block
•   Written in ruby
•   Better support?
•   Much nicer syntax
•   Easier to extend
define yumrepo (enabled = true) {
    configfile { "/etc/yum.repos.d/$name.repo":
        mode => 644,
        source => "/yum/repos/$name.repo",
        ensure => $enabled ? {
            true => file,
            default => absent
        }
    }
}
cobbler
• Automatic PXE Installer
    – Uses kickstart files
•   Red Hat Enterprise Linux
•   CentOS
•   Fedora
•   Some support for Debian
cobbler
cobbler system add \
  --name=xen8 \
  --mac=00:19:B9:EE:6D:0A \
  --ip=10.10.30.208 \
  --profile=Centos-5-x86_64 \
  --kopts='ksdevice=00:19:B9:EE:6D:0A console=ttyS1,57600 console=tty0'
koan
• Client install tool
  – Xen
  – Or OS re-image


koan --server=10.10.30.205 --virt --profile=virt_fc6 --virt-name=otrs
Your datacenter
• Keep it tidy
   – Label things, keep cables as short as possible
   – Have a switch in each rack
• If you are small without dedicated DC staff
  you need
   – Remote control power switches
   – Remote console!
Virtualization
•   Please use it
•   Managing becomes much easier
•   Power consumption
•   Need a new test box
    – The requestor can have it in minutes
Power consumption
• Maybe not as important in Europe
• 8 core machines are more efficient than
  1 core
• But memcache uses 1 core and all RAM
• Get more RAM and virtualise
Our network admin boxes
•   1 Xen CPU for Vyatta
•   1 Xen CPU for LVS
•   1 Xen CPU for Squid - Carp
•   1 Xen CPU for Squid
•   1 Xen CPU for Monitoring
•   1 Xen CPU for network tasks

• We can have more of these and a loss of one
  affects us less
Vyatta
• Open source router
  – Really like it
  – No need to use Cisco
LVS
•   Linux Virtual Server
•   Low level load balancer
•   HA
•   Fast
•   Doesn’t inspire people to put things in
    the only place that is hard to scale
Squid Carp
• Squids configured to hash the URLs and
  send each one to a specific backend
• Very little configuration needed
• Logging over UDP - no disk IO
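The idea of hashing URLs onto backends can be sketched as below, in the spirit of CARP: score every backend against the URL and send the request to the highest score. This is only an illustration. SHA-256 is a stand-in rather than Squid's hash function, and `pick_backend` and the backend names are hypothetical.

```python
import hashlib


def pick_backend(url, backends):
    """Pick the backend with the highest combined hash score for this URL."""
    def score(backend):
        digest = hashlib.sha256(f"{backend}|{url}".encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    return max(backends, key=score)


# Hypothetical backend names.
BACKENDS = ["squid1", "squid2", "squid3"]
chosen = pick_backend("/wiki/Main_Page", BACKENDS)
```

The same URL always lands on the same backend, so each object is cached once across the array. Losing a backend only moves the URLs that were mapped to it.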
Squid
• As a reverse web accelerator
• 90 % of our hits served from RAM in less than
  1 ms
• Same as wikipedia
• We only use RAM cache ( unlike wikipedia)
• Cached per user
• If not cacheable - cache for a second to
  reduce backend load
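Why caching for even one second helps: a hot URL requested hundreds of times a second reaches the backend at most about once a second. The sketch below shows the effect with a tiny TTL cache; it is not how Squid implements it, and `MicroCache` and its parameters are invented for illustration.

```python
import time


class MicroCache:
    """Cache each URL's response for a short, fixed time to live."""

    def __init__(self, fetch, ttl_seconds=1.0, clock=time.monotonic):
        self._fetch = fetch          # function that hits the backend
        self._ttl = ttl_seconds
        self._clock = clock          # injectable so the behaviour can be tested
        self._entries = {}           # url -> (expires_at, body)

    def get(self, url):
        now = self._clock()
        entry = self._entries.get(url)
        if entry is not None and entry[0] > now:
            return entry[1]          # still fresh: backend is not touched
        body = self._fetch(url)
        self._entries[url] = (now + self._ttl, body)
        return body
```

With a one-second TTL, the backend request rate for a URL is bounded by roughly one per second per cache, no matter how many clients ask for it.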
App servers
• 1 xen cpu for memcache ( 5 GB Ram)
• 1 xen cpu for squid ( 5GB Ram )
• 6 xen cpus for apache (6 GB Ram )

• More power efficient, less affected by
  loss
• Applications can’t affect each other
Databases
• Keep developers on short leash
• Report bad queries
• Fear object relational mappers
Outsourcing
• As much as possible
• The younger you are as a company the
  less risk
  – When you have no users, you have no
    value
• VCs don’t like having their money go
  into Capex
What I want from Vendors
• They do what they tell me
• They do what I tell them

• No annoying up sells, no premium
  services
  – I know more about what you are selling
    than you
Services we use
• Amazon EC2 and S3
• Panther-Express
Panther Express
• Fantastic Content Distribution Network
• Cheap, simple price list
  – Take note akamai
• Cut delivery time to Europe by 70%
• We let our images be cached 1 second
  to reduce load
EC2 and S3
•   We save all our binlogs to S3
•   We save database dumps to S3
•   We have monitors running from EC2
•   We plan to build a data warehouse
    cluster on EC2
EC2 Requires Automation
• Machine is blank when you bring it up
• Download database dump from S3 and
  replicate up - automatically
• Use puppet
• Amazon saves you hardware
  headaches
  – But complexity is still a problem
Thank you

Web 2.0 Performance and Reliability: How to Run Large Web Apps

  • 1. Artur Bergman sky@crucially.net • Wikia Inc – We are hiring – Community/Bizdev in Germany – Engineers in Poland – http://www.wikia.com/wiki/hiring • O’Reilly Radar – http://radar.oreilly.com/artur/
  • 2. The value of operations • Google • Orkut • Friendster • Myspace
  • 3. Benefits • Users trust your brand • They rely on you • They spend more time on your site • Bad operations wastes R&D money • Fixed amount of time + faster site = more page views
  • 4. Stepchild of Engineering • Product development • Engineering • Operations – Sysadmins? • Why?
  • 5. Operations Engineering • It is engineering • Google terminology - – Site Reliability Engineer • Sure there are sysadmins too, people mananing NOCs and datacenters • Provide career growth
  • 6. Good Engineers • Detail Oriented • Aspire to be operational engineers • Stubborn • Can steer their inner ADD – Interrupt driven • Not the same as good developers
  • 7. Danger signs • Thinks operations is a path to development engineering – Fire them • Want people dedicated to the task • A good operations engineer should spend some time in development • A good development engineer MUST spend some time in operations
  • 8.
  • 9. Debugging • 9 Rules of debugging • http://www.debuggingrules.com/Poster_download.html – Yes the font is horrible
  • 10. Rule 1: Understand the system • Complexity Kills • No excuse • If you write it, you must know it • If you run it, you must know it • If you buy it, you must know it
  • 11. Rule 3: Quit thinking and look • “It is a capital mistake to theorize before one has data. Insensibly one begins to twist facts to suit theories, instead of theories to suit facts.”
  • 12. Rule 3: Quit thinking and look • What do you look at? • The importance of monitoring • Monitoring • Monitoring • Monitoring
  • 13. My my, confusing term • Monitoring • Alerting • Trending
  • 14. Monitoring • Collects data • Puts into databases • Makes it available for you • Active collection • Passive interaction
  • 15. Alerting • Acts on monitoring data • Severe alerts – Active – Needs action • Passive alerts – Things that need to be done but not right now • DO NOT OVER ALERT • DO NOT CRY WOLF
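One way to avoid crying wolf is to page only after a check has failed several times in a row, and only once per outage. The sketch below is a minimal illustration of that idea, not the Wikia setup; `CheckState`, `evaluate`, and the threshold of 3 are hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class CheckState:
    """Tracks consecutive failures for one monitored check (hypothetical helper)."""
    failures: int = 0
    paged: bool = False


def evaluate(state: CheckState, ok: bool, threshold: int = 3) -> str:
    """Return 'page', 'recovered', or 'quiet' for the latest check result.

    A page is sent once, after `threshold` consecutive failures.
    Further failures in the same outage stay quiet.
    """
    if ok:
        was_paged = state.paged
        state.failures = 0
        state.paged = False
        return "recovered" if was_paged else "quiet"

    state.failures += 1
    if state.failures >= threshold and not state.paged:
        state.paged = True
        return "page"
    return "quiet"


if __name__ == "__main__":
    state = CheckState()
    for result in [False, False, False, False, True]:
        print(evaluate(state, result))
```

A single blip never wakes anyone up, and a long outage produces one page rather than one per check interval.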
  • 16. Wikia alerting strategy • When the site is slow • Or down • We send emails and do phone calls • Europe and US West coast • Looking to hire in East Asia • No night time
  • 17. Trending • Long term • Capacity planning
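Trending data feeds capacity planning by extrapolating growth. The following is a rough sketch that fits a straight line through (day, usage) samples and estimates how many days remain before a capacity limit is hit. The function name `days_until_full` and the sample numbers are made up for illustration, and real growth is rarely this linear.

```python
def days_until_full(samples, capacity):
    """Estimate days left until `capacity` is reached.

    `samples` is a list of (day, used) pairs. Returns None when there is
    too little data or usage is not growing.
    """
    n = len(samples)
    if n < 2:
        return None

    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in samples)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in samples)
    if sxx == 0:
        return None  # all samples taken on the same day

    slope = sxy / sxx  # least-squares growth per day
    if slope <= 0:
        return None  # flat or shrinking usage never fills up

    intercept = mean_y - slope * mean_x
    last_day = max(x for x, _ in samples)
    return (capacity - intercept) / slope - last_day


if __name__ == "__main__":
    # Hypothetical disk usage in GB over four days, on a 200 GB volume.
    usage = [(0, 100), (1, 110), (2, 120), (3, 130)]
    print(days_until_full(usage, capacity=200))
```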
  • 18. Monitor Tools • Nagios • Cacti • MRTG • Hyperic • Cricket • Ganglia
  • 19. External Monitoring • Use one; it tells you what your clients see every x minutes • Keynote • Gomez • Websitepulse (cheap - easy - I like them; no annoying sales force)
  • 20. Nagios • Alerting • Hassle • C CGI?? • Doesn’t scale
  • 21. Hyperic • Most exciting open source tool • Agent based - self configured • Baseline alerting
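Baseline alerting means comparing a new reading against what is normal for that metric instead of against a fixed threshold. The sketch below shows the general idea with a mean and standard deviation over recent history. It is not Hyperic's algorithm; `is_anomalous` and the 3-sigma default are assumptions made for the example.

```python
from statistics import mean, stdev


def is_anomalous(history, value, sigmas=3.0):
    """Return True when `value` is far outside the recent baseline.

    `history` is a list of recent readings for the same metric.
    """
    if len(history) < 2:
        return False  # not enough data to form a baseline

    baseline = mean(history)
    spread = stdev(history)
    if spread == 0:
        return value != baseline
    return abs(value - baseline) > sigmas * spread


if __name__ == "__main__":
    # Hypothetical response times in milliseconds.
    recent = [100, 102, 98, 101, 99]
    print(is_anomalous(recent, 101))
    print(is_anomalous(recent, 120))
```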
  • 22. Cricket MRTG Cacti • Impossible to configure • You need to write tools to do it • Especially Cacti – Somewhat more pleasant than clawing out your eyes
  • 23. Ganglia • We love ganglia • Automatically graphs everything you want - just works • Large scale clusters • Multicast • Zero config • RRD
  • 24. http://ganglia.wikimedia.org/ • 270 hosts • 880 CPU • 2 clusters • 1.2 TB of Memory
  • 26. Custom Ganglia Gmetrics • Write your own gmetric --name='Oldest query' --type=int32 --units='sec' --dmax=65 --value=`echo ' show processlist' | mysql -uroot -ppass | grep -v Sleep | grep -v 'system user' | head -2 | tail -1 | cut -f 6`
  • 27. Custom Ganglia Gmetrics • Or Learn Unix gmetric --name='Oldest query' --type=int32 --units='sec' --dmax=65 --value=`echo ' show processlist' | mysql -uroot -ppass | grep -v Sleep | grep -v 'system user' | head -2 | tail -1 | cut -f 6`
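The shell pipeline above reports the Time column of the first non-sleeping row, which is not necessarily the longest-running query. As a sketch of a stricter variant, the function below scans every row of tab-separated `show processlist` output and returns the largest Time value. It assumes the usual column order (Id, User, Host, db, Command, Time, ...); `oldest_query_seconds` is a hypothetical name, and the result would still be handed to `gmetric --value=...` as in the slide.

```python
def oldest_query_seconds(processlist_output: str) -> int:
    """Return the longest running time, in seconds, among active queries.

    Expects tab-separated `show processlist` output with a header row.
    Sleeping connections and replication threads ('system user') are skipped.
    """
    longest = 0
    lines = processlist_output.strip().splitlines()
    for line in lines[1:]:  # skip the header row
        fields = line.split("\t")
        if len(fields) < 6:
            continue
        user, command, seconds = fields[1], fields[4], fields[5]
        if command == "Sleep" or user == "system user":
            continue
        if seconds.isdigit():
            longest = max(longest, int(seconds))
    return longest


if __name__ == "__main__":
    # Made-up sample output for illustration.
    sample = (
        "Id\tUser\tHost\tdb\tCommand\tTime\tState\tInfo\n"
        "1\tsystem user\t\tNULL\tConnect\t9000\tWaiting\tNULL\n"
        "2\twiki\tlocalhost\twikidb\tSleep\t300\t\tNULL\n"
        "3\twiki\tlocalhost\twikidb\tQuery\t42\tSending data\tSELECT 1\n"
        "4\troot\tlocalhost\tNULL\tQuery\t0\tNULL\tshow processlist\n"
    )
    print(oldest_query_seconds(sample))
```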
  • 31. Something is wrong • Don’t worry, data warehouse
  • 32. tcpdump / Wireshark • If you suspect the network • Don’t just suspect • LOOK AT IT • tcpdump / Wireshark will tell you – If your packets are lost, delayed or corrupted – If your windowing is wrong
  • 33. Rule 4: Divide and Conquer • Look at the problems in turn • Split between people • Go in the order you suspect is the most likely
  • 34. Rule 5: Change one thing at a time • I cannot stress this enough • IF YOU DO NOT THEN YOU HAVE FAILED TO IDENTIFY THE PROBLEM
  • 35. Rule 6: Keep an audit trail • You might be making things worse • Good for the root cause analysis • Have your shell log all commands – Good practice anyway • Version control
  • 36. Rule 9: If you didn’t fix it, it ain’t fixed • You must do something to fix a problem • Or it will bite you again • And again • And again • They don’t just appear and disappear • Except BGP route convergence :)
  • 37. Process • You need a little • Don’t worry
  • 39. Complexity kills • Design against it • Reuse components • Define standards • Have a few images that all machines look like - reimage machines every now and then for the heck of it. – EC2 forces you to do this
  • 40. MTBF Mean Time Between Failures • Actually mostly irrelevant • Dealing with failure is more important • Target the right uptime – Complexity scales exponentially with required uptime • Don’t kid yourself, you don’t need 5 nines
  • 41. MTTR Mean Time To Recovery • Important • No one cares if you fail once a minute – If you recover in 50 ms • If you are down 1 minute a week, you are still going to hit 4 nines (99.99%) • Failures happen, plan how to deal with them
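The four-nines claim is easy to check: a week has 10,080 minutes, so one minute of downtime is roughly 0.0099% of the week. The sketch below works through the arithmetic in both directions; the function names are hypothetical and the period defaults to one week to match the slide.

```python
MINUTES_PER_WEEK = 7 * 24 * 60  # 10080


def availability_percent(downtime_minutes, period_minutes=MINUTES_PER_WEEK):
    """Availability over the period, as a percentage."""
    return 100.0 * (1 - downtime_minutes / period_minutes)


def downtime_budget_minutes(target_percent, period_minutes=MINUTES_PER_WEEK):
    """Minutes of downtime allowed per period for a target availability."""
    return period_minutes * (1 - target_percent / 100.0)


if __name__ == "__main__":
    # One minute down per week still clears the 99.99% bar.
    print(f"{availability_percent(1):.4f}%")
    # Weekly downtime budgets for four and five nines.
    print(f"{downtime_budget_minutes(99.99):.3f} min")
    print(f"{downtime_budget_minutes(99.999):.3f} min")
```

Five nines leaves about six seconds of downtime per week, which is why recovery speed matters more than failure frequency.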
  • 42. Problem found • If it is critical, start a phone conversation • Use IRC to communicate technical data • One person liaises with non technical staff • One person specifically in command • Sleep scheduling ( audit log important )
  • 43. Post crisis • Root cause analysis – Just find out what went wrong – And how to avoid it – Or fix it faster next time if you can’t • Keep track of your uptime
  • 44. Automation • All machines are created equal • Seriously • If you manually make changes • You are wrong – Unless you know what you are doing
  • 45. Best practices • Version control • Gold images • Centralised authentication • Time Sync ( NTP ) • Central logging • ( All of this applies for virtual machines too!)
  • 46. cfengine • Standard automation tool • Written in C • Not much support • Very good • Very annoying
  • 47. control: site = ( mysite ) domain = ( mysite.country ) sysadm = ( mark ) netmask = ( 255.255.255.0 ) actionsequence = ( mountall mountinfo addmounts mountall links ) mountpattern = ( /$(site)/$(host) ) homepattern = ( u? )
  • 48. Puppet • New hip kid on the block • Written in ruby • Better support? • Much nicer syntax • Easier to extend
  • 49. define yumrepo (enabled = true) { configfile { "/etc/yum.repos.d/$name.repo": mode => 644, source => "/yum/repos/$name.repo", ensure => $enabled ? { true => file, default => absent } } }
  • 50. cobbler • Automatic PXE Installer – Uses kickstart files • Red Hat Enterprise • CentOS • Fedora • Some support for Debian
  • 51. cobbler cobbler system add --name=xen8 --mac=00:19:B9:EE:6D:0A --ip=10.10.30.208 --profile=Centos-5-x86_64 --kopts='ksdevice=00:19:B9:EE:6D:0A console=ttyS1,57600 console=tty0'
  • 53. koan • Client install tool – Xen – Or OS re-image koan --server=10.10.30.205 --virt --profile=virt_fc6 --virt-name=otrs
  • 54. Your datacenter • Keep it tidy – Label things, keep cables as short as possible – Have a switch in each rack • If you are small without dedicated DC staff you need – Remote control power switches – Remote console!
  • 55. Virtualization • Please use it • Managing becomes much easier • Power consumption • Need a new test box – The requestor can have it in minutes
  • 56. Power consumption • Maybe not as important in Europe • 8 core machines are more efficient than 1 core • But memcache uses 1 core and all RAM • Get more RAM and virtualise
  • 57. Our network admin boxes • 1 Xen CPU for Vyatta • 1 Xen CPU for LVS • 1 Xen CPU for Squid - Carp • 1 Xen CPU for Squid • 1 Xen CPU for Monitoring • 1 Xen CPU for network tasks • We can have more of these and a loss of one affects us less
  • 58. Vyatta • Open source router – Really like it – No need to use Cisco
  • 59. LVS • Linux Virtual Server • Low level load balancer • HA • Fast • Doesn’t inspire people to put things in the only place that is hard to scale
  • 60. Squid Carp • Squids configured to hash the urls and send them to specific backend • Very little configuration done • Logging of UDP - no disk IO
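Hashing URLs to specific backends means every front-end picks the same backend for a given URL, so each object is cached in one place. The sketch below shows the general highest-score-wins idea that CARP-style routing relies on. It uses SHA-1 for convenience and is not Squid's actual CARP hash function; `pick_backend` and the host names are made up for the example.

```python
import hashlib


def pick_backend(url, backends):
    """Choose a backend for `url` by scoring each (backend, url) pair.

    The backend with the highest score wins. Removing a backend only
    remaps the URLs that were assigned to it.
    """
    def score(backend):
        digest = hashlib.sha1(f"{backend}|{url}".encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    return max(backends, key=score)


if __name__ == "__main__":
    # Hypothetical cache hosts.
    caches = ["squid1", "squid2", "squid3"]
    for path in ["/wiki/Main_Page", "/wiki/Special:Random", "/images/logo.png"]:
        print(path, "->", pick_backend(path, caches))
```

Because each backend is scored independently, losing one cache does not reshuffle the URLs held by the others.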
  • 61. Squid • As a reverse web accelerator • 90 % of our hits served from RAM in less than 1 ms • Same as Wikipedia • We only use RAM cache ( unlike Wikipedia) • Cached per user • If not cacheable - cache for a second to reduce backend effect
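Caching an uncacheable page for a single second caps backend load at about one request per second per URL, however many clients ask. The sketch below shows the idea as a small in-memory cache. It stands in for what Squid does and is not Squid configuration; `MicroCache` and the injectable `clock` parameter are assumptions made so the example can be tested.

```python
import time


class MicroCache:
    """Keep each response for a very short time to shield the backend."""

    def __init__(self, ttl_seconds=1.0, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._store = {}  # key -> (time stored, value)

    def get(self, key, fetch):
        """Return a cached value if it is fresh, otherwise call fetch(key)."""
        now = self.clock()
        entry = self._store.get(key)
        if entry is not None and now - entry[0] < self.ttl:
            return entry[1]
        value = fetch(key)
        self._store[key] = (now, value)
        return value


if __name__ == "__main__":
    backend_hits = []

    def render(path):
        backend_hits.append(path)
        return f"page for {path}"

    cache = MicroCache(ttl_seconds=1.0)
    for _ in range(1000):
        cache.get("/wiki/Recent_changes", render)
    print("backend hits:", len(backend_hits))
```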
  • 62. App servers • 1 xen cpu for memcache ( 5 GB Ram) • 1 xen cpu for squid ( 5GB Ram ) • 6 xen cpus for apache (6 GB Ram ) • More power efficient, less affected by loss • Applications can’t affect each other
  • 63. Databases • Keep developers on short leash • Report bad queries • Fear object relational mappers
  • 64. Outsourcing • As much as possible • The younger you are as a company the less risk – When you have no users, you have no value • VCs don’t like having their money go into Capex
  • 65. What I want from Vendors • They do what they tell me • They do what I tell them • No annoying up sells, no premium services – I know more about what you are selling than you
  • 66. Services we use • Amazon EC2 and S3 • Panther-Express
  • 67. Panther Express • Fantastic Content Distribution Network • Cheap, simple price list – Take note Akamai • Cut delivery time to Europe by 70% • We let our images be cached 1 second to reduce load
  • 68. EC2 and S3 • We save all our binlogs to S3 • We save database dumps to S3 • We have monitors running from EC2 • We plan to build a data warehouse cluster on EC2
  • 69. EC2 Requires Automation • Machine is blank when you bring it up • Download database dump from S3 and replicate up - automatically • Use puppet • Amazon saves you hardware headaches – But complexity is still a problem