Practice and Challenges
               from
Building Infrastructure-as-a-Service
                 朱可
         zhukecdl@cn.ibm.com
Disclaimer
●
    Representing personal opinion only
IaaS in Our Development Lab
●
    Virtual machine
●
    Block storage
●
    Virtual machine template
●
    VLAN
●
    Static IP address
●
    Virtual Desktop


    $ ./iaas-deploy-vms -i centos63 -n 100
The Machinery


                Node: 16 cores, 192 GB RAM, 1.6 TB storage




Rack: 20+ nodes, 2 rack switches
Quick Stats
●
    5,800 VMs provisioned in 2 months
●
    700+ individual visitors per month
●
    50,000+ web service requests per day
    –   Fewer than 40% of requests are sent by humans
Design for Failure
●
    “Failure is not an option, it's a requirement.”
●
    Things will crash
    –   Linux kernel panic
    –   Defunct process
    –   File system becomes read only suddenly
●
    Hardware fails every week
    –   Broken disk
    –   Flaws in CPU
    –   Network adapter link speed fluctuates among 10/100/1000 Mbps
Event In Red: Failure
Flakiness
Nov 14 00:39:27 r007x072 kernel: bonding: bond1: link status definitely down for interface eth3, disabling it
Nov 14 00:39:35 r007x072 kernel: e1000e: eth3 NIC Link is Up 100 Mbps Full Duplex, Flow Control: RX/TX
Nov 14 00:39:35 r007x072 kernel: e1000e 0000:1a:00.1: eth3: 10/100 speed: disabling TSO
Nov 14 00:39:35 r007x072 kernel: bonding: bond1: link status definitely up for interface eth3.
Nov 14 00:39:36 r007x072 kernel: e1000e: eth3 NIC Link is Down
Nov 14 00:39:36 r007x072 kernel: bonding: bond1: link status definitely down for interface eth3, disabling it




                                            Analysis

                        Unqualified Network Cables
[root@r007x072 ~]# cat /proc/net/bonding/bond1
Ethernet Channel Bonding Driver: v3.6.0 (September 26, 2009)

Bonding Mode: adaptive load balancing
Primary Slave: None
Currently Active Slave: eth2
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 0
Down Delay (ms): 0

Slave Interface: eth2
MII Status: up
Link Failure Count: 0
Permanent HW addr: 00:1b:21:98:2a:4c
Slave queue ID: 0

Slave Interface: eth3
MII Status: up
Link Failure Count: 1627
Permanent HW addr: 00:1b:21:98:2a:4d
Slave queue ID: 0
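A check like the one above can be automated. The following is a minimal sketch, assuming the bonding status text has already been read from /proc/net/bonding/bond1; the function name `flaky_slaves` and the `max_failures` threshold are illustrative and do not come from the slides.

```python
import re

# Excerpt mirroring the /proc/net/bonding/bond1 output shown above.
SAMPLE = """\
Slave Interface: eth2
MII Status: up
Link Failure Count: 0

Slave Interface: eth3
MII Status: up
Link Failure Count: 1627
"""


def flaky_slaves(bonding_text, max_failures=10):
    """Return {interface: failure_count} for slaves above max_failures.

    max_failures is an assumed threshold, chosen only for illustration.
    """
    flaky = {}
    current = None
    for line in bonding_text.splitlines():
        match = re.match(r"Slave Interface:\s*(\S+)", line)
        if match:
            current = match.group(1)
            continue
        match = re.match(r"Link Failure Count:\s*(\d+)", line)
        if match and current is not None:
            count = int(match.group(1))
            if count > max_failures:
                flaky[current] = count
    return flaky


if __name__ == "__main__":
    print(flaky_slaves(SAMPLE))
```

Run against the excerpt, the sketch reports only eth3, the interface whose failure count points at the bad cable.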
Keep It Simple and Robust
●
    “I have 4 letters for you: KISS (Keep it simple
    and stupid)”
●
    Complex system === hazardous system
●
    Just enough fault-tolerance
    –   Reboot the machine if something goes wrong
    –   Log out of the iSCSI session and log in again
    –   Mini toolkit to fix a broken DM (device mapper)
        table
Example: Stateless OS
  ●
      Mount the root partition in RAM
      –   Think about how you install Ubuntu or Fedora from a live image
  ●
      Fix problems by rebooting

[root@r009x090 ~]# df   -h
Filesystem              Size   Used Avail Use% Mounted on
/dev/mapper/live-rw     7,9G   1,5G 6,4G 19% /
tmpfs                    71G   4,0K   71G   1% /dev/shm
/dev/sda2               7,9G   1,4G 6,2G 18% /var/log
/dev/sda4               1,6T   183G 1,4T 12% /iaas/local-storage
P2P based Socialized Communication
●
    Bots “talk” to each other
●
    Any bot can be restarted in seconds when things
    go wrong
Robust Application
   ●
        Several roles in the distributed system each do
        their own job
          –   Bot, manager, watchdog, ZooKeeper, agent,
              HBase, Hadoop, etc.
       (http://zookeeper.apache.org/images/zookeeper_small.gif)

[Diagram: regular bot, watchdog, and manager bot; ZooKeeper; HBase with three region servers on HDFS with three data nodes]
Dedicated Network-accessible Services
●
    NTP (controversial in VMs but good enough)
●
    ZooKeeper
    –   Node presence
    –   Configuration data
    –   Leader election
●
    HBase: store schema-less data
●
    Rsyslog: centralize logs
●
    Web Service: accept HTTP requests only
Scale-out Architecture For Growth
●
    Single namespace for global infrastructure
    –   v525400ffffff.region-a.cloud.xx.ibm.com/service-foo
●
    Multi-region for Geo-distribution
●
    Use cache when possible
●
    Shared-nothing design through autonomy
●
    Leader election (elect a new manager if the former
    one dies)
●
    Collect metrics
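The leader election bullet can be illustrated with a small in-memory sketch. It follows the general idea of the common ZooKeeper recipe, where the live candidate with the lowest sequence number leads, but it does not use the ZooKeeper API. The class `ElectionRegistry` and the bot names are hypothetical.

```python
import itertools


class ElectionRegistry:
    """In-memory stand-in for a coordination service (hypothetical, not ZooKeeper)."""

    def __init__(self):
        self._seq = itertools.count()
        self._members = {}  # sequence number -> bot name

    def join(self, name):
        """Register a candidate and return its sequence number."""
        seq = next(self._seq)
        self._members[seq] = name
        return seq

    def leave(self, seq):
        """Remove a candidate, for example when its session dies."""
        self._members.pop(seq, None)

    def leader(self):
        """The live candidate with the lowest sequence number acts as manager."""
        if not self._members:
            return None
        return self._members[min(self._members)]


registry = ElectionRegistry()
a = registry.join("bot-a")
b = registry.join("bot-b")
c = registry.join("bot-c")

first = registry.leader()   # bot-a joined first, so it is the manager
registry.leave(a)           # the manager dies
second = registry.leader()  # the next candidate in line takes over
```

Because each bot only needs to know the current lowest sequence number, no bot has to coordinate directly with every other bot, which fits the shared-nothing goal in the list above.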
Requirements Grow and Shrink
      Faster than Hardware Can Be Purchased
●
    “I need 200 large VMs this afternoon and will
    terminate all of them tomorrow.”
Storage Is Never Enough
●
    Workaround: recycle unused files
    –   Move rarely accessed virtual images out of the hot zone
    –   Set up SLAs to limit availability (provide
        redundancy only when necessary)
Metrics Collection is Critical
●
    “Gathering, storing, and displaying metrics
    should be considered a mission-critical part of
    your infrastructure.”*
●
    Measure performance gains (or
    regressions)




                      (* From chapter 3 of the book “Web Operations”)
Example #1:
   Fix Side Effect of the Leap Second
  ●
      The latest leap second occurred at the end of
      June 2012
/var/log/messages grows too fast
                       Job distribution between bots takes
                       10 times longer



tgtd: work_timer_evt_handler(89) failed to read from timerfd,
Resource temporarily unavailable




  # service ntpd stop; date -s "`date`"; service ntpd start
Example #2:
           Recycle Unused Resources
@zhukecdl Our analysis of your VM instance(s) shows that
CPU utilization and network traffic in the past 48 hours
have dropped below 2% and 10 MB.

Instance ID               CPU Time (s)   CPU Rate (%)   TX (MB)
r007.x072.17897.u51393    337.3          0.20           0

We would strongly urge you to consider recycling your
instance(s) so that others can make use of these resources.

If you do not contact the administrator before 2011-08-16
17:00+0800, the instance r007.x072.17897.u51393 will be recycled.

Regards,
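The selection rule behind such a notice might look like the following minimal sketch. The 2% and 10 MB thresholds come from the message above. The `Usage` record, the function `recycle_candidates`, and the second sample instance are assumptions made for illustration.

```python
from dataclasses import dataclass

# Thresholds quoted in the notice above, measured over the past 48 hours.
CPU_RATE_LIMIT = 2.0   # percent
TX_LIMIT_MB = 10.0     # megabytes


@dataclass
class Usage:
    instance_id: str
    cpu_rate: float  # average CPU utilization in percent
    tx_mb: float     # transmitted traffic in MB


def recycle_candidates(usages):
    """Return IDs of instances that fall below both thresholds."""
    return [
        u.instance_id
        for u in usages
        if u.cpu_rate < CPU_RATE_LIMIT and u.tx_mb < TX_LIMIT_MB
    ]


usages = [
    Usage("r007.x072.17897.u51393", 0.20, 0.0),  # the instance from the notice
    Usage("busy-vm", 35.0, 512.0),               # made-up busy instance
]
candidates = recycle_candidates(usages)
```

Requiring both conditions keeps a VM that is idle on CPU but still serving traffic off the recycle list.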
Automated Operation (and More)
●
    Goals
    –   Upgrade all components daily
    –   One administrator per 1,000 systems
    –   No overtime work
●
    Tools
    –   Chef (Ruby)
    –   SmartCloud portfolio
●
    Process
    –   Benchmark the system every week
    –   Stay in the office until a broken build is fixed
Benchmark the System Often
●
    Measure the effect of your performance tweaks
●
    Tools
    –   Netperf
    –   Apache JMeter



               Benchmarking network infrastructure
     # netserver
     # netperf -H 10.10.1.97 -l 43200 -t TCP_CRR &
     # netperf -H 9.123.127.227 -l 43200 -t TCP_CRR &
Infrastructure as Code
●
    Build network-accessible services
●
    Integrate these services
          [root@beijing-mn03 ~]# virsh list
           Id Name                 State
          ----------------------------------
            1 hbm1                 running
            2 bj-jenkins           running
            3 hjt                  running
            4 webservice-1         running
            5 hnn2                 running
            6 bugzilla             running
            8 hslave07             running
            9 hslave08             running
           10 hslave09             running
           11 hslave10             running
           12 hslave11             running
           13 ScannerSlackware     running
Real-time Feedback by Tracing Logs
●
    Manager X: “I need a daily success-rate report
    on VM deployments from department Y today.”
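A report like the one requested can be derived from trace records. The sketch below assumes each trace has already been reduced to a (date, department, succeeded) tuple. That record shape, the sample data, and the function `daily_success_rate` are hypothetical and are not taken from the actual system.

```python
from collections import defaultdict

# Hypothetical trace records: (date, department, succeeded).
TRACES = [
    ("2012-11-14", "Y", True),
    ("2012-11-14", "Y", True),
    ("2012-11-14", "Y", False),
    ("2012-11-14", "Z", True),
    ("2012-11-15", "Y", True),
]


def daily_success_rate(traces, department):
    """Map each date to the fraction of successful deployments for one department."""
    totals = defaultdict(int)
    successes = defaultdict(int)
    for date, dept, ok in traces:
        if dept != department:
            continue
        totals[date] += 1
        if ok:
            successes[date] += 1
    return {date: successes[date] / totals[date] for date in totals}


report = daily_success_rate(TRACES, "Y")
for date in sorted(report):
    print(f"{date}: {report[date]:.0%}")
```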
Visualize Traces Via Timeline
Summary
●
    Keep it simple and robust
●
    Scale-out architecture
●
    Automated operation

 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Enterprise Knowledge
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxKatpro Technologies
 
Developing An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilDeveloping An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilV3cube
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 

Recently uploaded (20)

Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
Developing An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilDeveloping An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of Brazil
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 

Practice and challenges from building IaaS

  • 1. Practice and Challenges from Building Infrastructure-as-a-Service 朱可 zhukecdl@cn.ibm.com
  • 2. Disclaimer ● Representing personal opinion only
  • 3. IaaS in Our Development Lab ● Virtual machine ● Block storage ● Virtual machine template ● VLAN ● Static IP address ● Virtual Desktop $ ./iaas-deploy-vms -i centos63 -n 100
  • 4. The Machinery ● Node: 16 cores, 192 GB RAM, 1.6 TB ● Rack: 20+ nodes, 2 rack switches
  • 5. Quick Stats ● 5,800 VMs provisioned in 2 months ● 700+ individual visitors per month ● 50,000+ requests to web services per day – Fewer than 40% of requests are sent by humans
  • 6. Design for Failure ● “Failure is not an option, it's a requirement.” ● Things will crash – Linux kernel panic – Defunct process – File system suddenly becomes read-only ● Hardware fails every week – Broken disk – Flaws in CPU – Network adapter link speed fluctuates among 10/100/1000 Mbps
  • 7. Event In Red: Failure
  • 8. Flakiness
      Nov 14 00:39:27 r007x072 kernel: bonding: bond1: link status definitely down for interface eth3, disabling it
      Nov 14 00:39:35 r007x072 kernel: e1000e: eth3 NIC Link is Up 100 Mbps Full Duplex, Flow Control: RX/TX
      Nov 14 00:39:35 r007x072 kernel: e1000e 0000:1a:00.1: eth3: 10/100 speed: disabling TSO
      Nov 14 00:39:35 r007x072 kernel: bonding: bond1: link status definitely up for interface eth3.
      Nov 14 00:39:36 r007x072 kernel: e1000e: eth3 NIC Link is Down
      Nov 14 00:39:36 r007x072 kernel: bonding: bond1: link status definitely down for interface eth3, disabling it
      Analysis: Unqualified Network Cables
  • 9. [root@r007x072 ~]# cat /proc/net/bonding/bond1
      Ethernet Channel Bonding Driver: v3.6.0 (September 26, 2009)
      Bonding Mode: adaptive load balancing
      Primary Slave: None
      Currently Active Slave: eth2
      MII Status: up
      MII Polling Interval (ms): 100
      Up Delay (ms): 0
      Down Delay (ms): 0
      Slave Interface: eth2
      MII Status: up
      Link Failure Count: 0
      Permanent HW addr: 00:1b:21:98:2a:4c
      Slave queue ID: 0
      Slave Interface: eth3
      MII Status: up
      Link Failure Count: 1627
      Permanent HW addr: 00:1b:21:98:2a:4d
      Slave queue ID: 0
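The bonding status above makes the flaky interface easy to spot by eye (eth3 has 1627 link failures, eth2 has none). A minimal Python sketch of automating that check follows. It assumes only the `Slave Interface:` and `Link Failure Count:` lines shown on the slide; the function names and the threshold of 100 are hypothetical and not part of the original tooling.

```python
import re

# Shortened copy of the slide's /proc/net/bonding/bond1 output.
SAMPLE = """\
Slave Interface: eth2
MII Status: up
Link Failure Count: 0
Slave Interface: eth3
MII Status: up
Link Failure Count: 1627
"""


def link_failure_counts(bonding_text):
    """Map each slave interface name to its Link Failure Count."""
    counts = {}
    current = None
    for raw in bonding_text.splitlines():
        line = raw.strip()
        slave = re.match(r"Slave Interface:\s*(\S+)", line)
        if slave:
            current = slave.group(1)
            continue
        failures = re.match(r"Link Failure Count:\s*(\d+)", line)
        if failures and current is not None:
            counts[current] = int(failures.group(1))
    return counts


def flaky_slaves(bonding_text, threshold=100):
    """Return slaves whose failure count reaches the (assumed) threshold."""
    counts = link_failure_counts(bonding_text)
    return sorted(name for name, n in counts.items() if n >= threshold)


print(flaky_slaves(SAMPLE))
```

On a real node the text would come from reading `/proc/net/bonding/bond1` instead of `SAMPLE`.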
  • 10. Keep It Simple and Robust ● “I have 4 letters for you: KISS (Keep it simple and stupid)” ● Complex system === hazardous system ● Just enough fault tolerance – Reboot the machine if it goes wrong – Log out of the iSCSI session and log in again – Mini toolkit to fix a broken DM (device mapper) table
  • 11. Example: Stateless OS ● Mount root partition in RAM – Think about how you install Ubuntu or Fedora ● Fix problems by rebooting only
      [root@r009x090 ~]# df -h
      Filesystem           Size  Used Avail Use% Mounted on
      /dev/mapper/live-rw  7,9G  1,5G  6,4G  19% /
      tmpfs                 71G  4,0K   71G   1% /dev/shm
      /dev/sda2            7,9G  1,4G  6,2G  18% /var/log
      /dev/sda4            1,6T  183G  1,4T  12% /iaas/local-storage
  • 12. P2P based Socialized Communication ● Bots “talk” to each other ● Any of them can be re-run in seconds when things go wrong
  • 13. Robust Application ● A number of roles in the distributed system do their own jobs – Bot, manager, watch dog, zookeeper, agent, hbase, hadoop, etc. [Diagram: HBase region servers and ZooKeeper on top of HDFS data nodes, alongside regular bot, watch dog and manager bot; ZooKeeper logo from http://zookeeper.apache.org/images/zookeeper_small.gif]
  • 14. Dedicated Network-accessible Services ● NTP (controversial in VM but good enough) ● ZooKeeper – Node presence – Configuration data – Leader election ● HBase: store schema-less data ● Rsyslog: centralize logs ● Web Service: accept HTTP requests only
  • 15. Scale-out Architecture For Growth ● Single namespace for global infrastructure – v525400ffffff.region-a.cloud.xx.ibm.com/service-foo ● Multi-region for Geo-distribution ● Use cache when possible ● Share nothing by autonomy ● Leader election (elect new manager if former dies) ● Collect metrics
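The "elect new manager if former dies" bullet can be illustrated with the usual ZooKeeper-style recipe: every candidate registers with an increasing sequence number, the lowest live number is the leader, and when that entry disappears the next lowest takes over. The sketch below is a pure in-memory simulation in Python. It does not talk to ZooKeeper, and the class and bot names are hypothetical.

```python
class ElectionRoom:
    """In-memory stand-in for an election path (assumption, not a real client)."""

    def __init__(self):
        self._next_seq = 0
        self._members = {}  # sequence number -> bot name

    def join(self, bot_name):
        """Register a candidate and return its sequence number."""
        seq = self._next_seq
        self._next_seq += 1
        self._members[seq] = bot_name
        return seq

    def leave(self, seq):
        """Remove a candidate, e.g. because its session died."""
        self._members.pop(seq, None)

    def leader(self):
        """The live candidate with the lowest sequence number, or None."""
        if not self._members:
            return None
        return self._members[min(self._members)]


room = ElectionRoom()
first = room.join("manager-a")
second = room.join("manager-b")
print(room.leader())   # manager-a holds the lowest sequence number
room.leave(first)      # the former manager dies
print(room.leader())   # manager-b takes over
```

In a real deployment the sequence numbers and the "session died" signal would come from the coordination service rather than from explicit `join` and `leave` calls.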
  • 16. Requirements Grow and Shrink Faster than Hardware Can Be Purchased ● “I need 200 large VMs this afternoon and will terminate all of them tomorrow.”
  • 17. Storage Is Never Enough ● Workaround: recycle unused files – Move low-hit virtual images out of the hot zone – Set up SLAs to limit availability (provide redundancy only when necessary)
  • 18. Metrics Collection is Critical ● “Gathering, storing, and displaying metrics should be considered a mission-critical part of your infrastructure.”* ● Measure performance gains (or regressions) (* from chapter 3 of the book “Web Operations”)
  • 19. Example #1: Fix Side Effect of the Leap Second ● The latest leap second occurred at the end of June 2012 ● /var/log/messages grows too fast ● Job distribution between bots takes 10 times longer ● Log: tgtd: work_timer_evt_handler(89) failed to read from timerfd, Resource temporarily unavailable ● Fix: # service ntpd stop; date -s "`date`"; service ntpd start
  • 20. Example #2: Recycle Unused Resources
      @zhukecdl Our analysis of your VM instance(s) shows that CPU utilization and network traffic in the past 48 hours have dropped below 2% and 10 MB.
      Instance ID             CPU Time (s)  CPU Rate (%)  TX (MB)
      r007.x072.17897.u51393  337.3         0.20          0
      We strongly urge you to consider recycling your instance(s) so that others can make use of these resources. If you do not contact the administrator before 2011-08-16 17:00 +0800, the instance r007.x072.17897.u51393 will be recycled.
      Regards,
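The notice above implies a simple rule: an instance becomes a recycling candidate when both its CPU rate and its network traffic over the past 48 hours fall below the stated limits (2% and 10 MB). A minimal Python sketch of that rule follows. The `InstanceUsage` type, the function name and the second instance ID are hypothetical; only the thresholds and the first row come from the slide.

```python
from dataclasses import dataclass

CPU_RATE_LIMIT = 2.0   # percent, from the notice
TX_MB_LIMIT = 10.0     # megabytes, from the notice


@dataclass
class InstanceUsage:
    """Usage of one VM instance over the observation window (assumed 48 h)."""
    instance_id: str
    cpu_rate_percent: float
    tx_mb: float


def recycle_candidates(usages):
    """Return IDs of instances below both limits."""
    return [
        u.instance_id
        for u in usages
        if u.cpu_rate_percent < CPU_RATE_LIMIT and u.tx_mb < TX_MB_LIMIT
    ]


usages = [
    InstanceUsage("r007.x072.17897.u51393", 0.20, 0.0),   # row from the slide
    InstanceUsage("example-busy-instance", 35.0, 512.0),  # made-up busy VM
]
print(recycle_candidates(usages))
```

Requiring both conditions avoids recycling an instance that is idle on CPU but still serving traffic, or the reverse.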
  • 21. Automated Operation (and More) ● Goals – Upgrade all components daily – One administrator for 1k systems – No working overtime ● Tool – Chef (Ruby) – SmartCloud portfolio ● Process – Benchmark the system every week – Stay in the office until a broken build is fixed
  • 22. Benchmark the System Often ● Measure the effect of your performance tweaks ● Tools – Netperf – Apache JMeter ● Benchmarking network infrastructure
      # netserver
      # netperf -H 10.10.1.97 -l 43200 -t TCP_CRR &
      # netperf -H 9.123.127.227 -l 43200 -t TCP_CRR &
  • 23. Infrastructure as Code ● Building network-accessible services ● Integrating these services
      [root@beijing-mn03 ~]# virsh list
       Id Name                 State
      ----------------------------------
        1 hbm1                 running
        2 bj-jenkins           running
        3 hjt                  running
        4 webservice-1         running
        5 hnn2                 running
        6 bugzilla             running
        8 hslave07             running
        9 hslave08             running
       10 hslave09             running
       11 hslave10             running
       12 hslave11             running
       13 ScannerSlackware     running
  • 24. Real-Time Feedback by Tracing Logs ● Manager X: “I need a daily success-rate report on VM deployments from department Y today.”
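A request like Manager X's can be answered straight from centralized logs if each deployment writes one structured line. The Python sketch below assumes a made-up log format, `<date> department=<name> action=deploy_vm status=<success|failure>`. The slides do not show the real format, so every field name here is hypothetical.

```python
from collections import defaultdict


def daily_success_rate(log_lines, department):
    """Return {day: fraction of successful deploy_vm events} for one department."""
    totals = defaultdict(int)
    successes = defaultdict(int)
    for line in log_lines:
        parts = line.split()
        if not parts:
            continue
        day = parts[0]
        # Remaining tokens are key=value pairs in the assumed format.
        fields = dict(p.split("=", 1) for p in parts[1:] if "=" in p)
        if fields.get("department") != department:
            continue
        if fields.get("action") != "deploy_vm":
            continue
        totals[day] += 1
        if fields.get("status") == "success":
            successes[day] += 1
    return {day: successes[day] / totals[day] for day in totals}


LOGS = [
    "2012-11-14 department=Y action=deploy_vm status=success",
    "2012-11-14 department=Y action=deploy_vm status=failure",
    "2012-11-14 department=Z action=deploy_vm status=success",
    "2012-11-15 department=Y action=deploy_vm status=success",
]
print(daily_success_rate(LOGS, "Y"))
```

Because the report is computed from the log stream, it can be refreshed whenever new lines arrive rather than assembled by hand.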
  • 26. Summary ● Keep it simple and robust ● Scale-out architecture ● Automated operation