SlideShare une entreprise Scribd logo
1  sur  66
Télécharger pour lire hors ligne
Scalable
      System Operations


Joshua Hoffman
                          1
About This Talk

• Set of principles        • Software
• Operations Engineering   • Techniques
• Tumblr project           • Example code
• Server management        • Best practices
• Massively automated      • Open source



                                              2
About Me




           3
About Me

    • 1995: CompUSA
      Intro to The Internet
    • 2000: Guru Labs
      Sun, Cisco, Red Hat
    • 2002: Red Hat
      Sys admin courseware




                              4
About Me
    • 2004: Fortress Systems
      Anti-spam/malware
    • 2005: Red Hat
      Virtualization cert
      Remote learning
      Defined “cloud”
    • 2011: Tumblr
      Lead Systems Eng




                               5
The Problem




  Deploying new servers is very repetitive and slow.
                (and we hate that)
http://www.flickr.com/people/mc4army/
                                                       6
The Job We Don’t Want




http://www.flickr.com/people/redjar/
                                      7
The Solution




               8
Install OS
                    Firmware
Configure OS
                    Configure BIOS
Install software
                    Set up BMC
Configure software
                    Inventory
Add to DNS
                    Stress testing
Add to monitoring
                    Network config
Add to trending




                                     9
The Goal Is Clear




                    10
Time To Strategize




http://www.flickr.com/people/irishwildcat/
                                            11
Time To Strategize
                                            Use open source?
                                            Which?
                                            Buy software?
                                            Which?
                                            Write software?
                                            Mix and match?



http://www.flickr.com/people/irishwildcat/
                                                               12
The Choice Principle




The time to make a decision is a function of the possible choices.

  http://www.flickr.com/people/3059349393/
                                                                     13
Rapid Software Research




                          14
Rapid Software Research


                1. Define
                2. Gather
                3. Disqualify
                4. Rank




                                15
Rank




http://www.flickr.com/people/nirak/
                                            16
Rank

                                            • Modularity
                                            • Compliance
                                            • Novelty
                                            • Disruption



http://www.flickr.com/people/nirak/
                                                           17
My Requirements

 • Asset inventory
 • State management
 • Robust API
 • Event triggers



                      18
My Requirements

    • Modular
    • Flexible
    • Extensible
    • Fast



                   19
My Requirements


Manage physical hardware
as easily as virtual machines.




                                 20
The Usual Suspects
     • Cobbler
     • Foreman
     • Satellite
     • Orchestra
     • Racktables
     • Clusto


                     21
But Wait!




            22
Data Entry




   “Just import the data supplied by the hardware vendor...”

http://www.flickr.com/people/mwichary/
                                                               23
Missing Requirements
      Firmware
      Configure BIOS
      Set up BMC
      Inventory
      Stress testing
      Network config
      Add to monitoring
      Add to trending




                          24
We have to write software!




http://www.flickr.com/people/argen/
                                     25
We have to write software!


                                     • Delivery Schedule
                                     • Scope Creep
                                     • Maintenance
                                     • Documentation


http://www.flickr.com/people/argen/
                                                           26
Tumblr Management Stack

       • iPXE
       • Invisible Touch
       • Collins
       • Phil
       • Kickstart
       • Puppet



                           27
The Glue Principle




                          Unix Rule of Parsimony:
                Write a big program only when it is clear by
                demonstration that nothing else will do.
http://www.flickr.com/people/kodomut/
                                                               28
The Standards Principle




The nice thing about standards is that you have so many to choose from.
                          -Andrew Tanenbaum
     http://www.flickr.com/people/usfwssoutheast/
                                                                          29
The Simplicity Principle




                   Unix Rule of Simplicity:
 Design for simplicity; add complexity only where you must.

http://www.flickr.com/people/haldanemartin/
                                                              30
The 3:00 AM Principle




It must be obvious to someone woken up from a sound sleep at 3:00 am.

     http://www.flickr.com/people/joi/
                                                                        31
The Don’t Break The OS Principle




The software should NOT prevent the OS from working as expected.
   http://www.flickr.com/photos/philmanker/
                                                                   32
The Amnesia Principle




      Given enough time, you WILL forget why you did that.

http://www.flickr.com/people/zach_a/
                                                             33
Tumblr Management Stack
       • iPXE
       • Invisible Touch
       • Collins
       • Phil
       • Kickstart
       • Puppet


                           34
Why not pxelinux?




http://www.flickr.com/people/digiart2001/
                                           35
Why not pxelinux?


                                           • TFTP
                                           • Flat files




http://www.flickr.com/people/digiart2001/
                                                         36
iPXE




http://www.flickr.com/people/chiotsrun/
                                                37
iPXE

                                                • HTTP, FTP, iSCSI
                                                • Scriptable
                                                • Variables
                                                • Dynamic



http://www.flickr.com/people/chiotsrun/
                                                                     38
ISC DHCP For iPXE
  # subnet for the provisioning vlan
  subnet <%= subnet %> netmask <%= netmask %> {
    option domain-name          "<%= option_domain_name %>";
    option routers              <%= option_routers %>;
    option domain-name-servers <%= option_dns_servers.map{|i| "#{i}"}.join(", ") -%>;
    option subnet-mask          <%= option_subnet_mask %>;
    default-lease-time          21600;
    max-lease-time              43200;
    range                       <%= range_start %> <%= range_end %>;
    # If a pxe request comes in from ipxe send the config url
    if exists user-class and option user-class = "iPXE" {
        filename "<%= ipxe_config_url %>"; # http://foo.example.com/ipxe/${net0/mac}
    # For all other pxe requests send ipxe
    } else {
        next-server <%= next_server %>; # tftp server
        filename "<%= filename %>";      # path to ipxe binary on tftp server
    }
  }
}

                                                                                        39
Fedora LiveCD Tools
lang en_US.UTF-8
keyboard us
timezone US/Eastern
auth --useshadow --enablemd5
selinux --enforcing
firewall --disabled
repo --name=centos   --baseurl=http://127.0.0.1/pub/repo/centos/os/6.2
repo --name=infra    --baseurl=http://127.0.0.1/pub/repo/infra/6.2
repo --name=epel     --baseurl=http://127.0.0.1/repo/epel/6/x86_64/

%packages --excludedocs
@core
dracut
dracut-kernel
device-mapper
device-mapper-event
%end


                                                                         40
Invisible Touch Kickstart
# Invisible Touch Live OS image
%include centos-6.2-livecd-minimal.ks
%packages --excludedocs
it
%end
%post
cat > /etc/issue <<EoF
Invisible Touch Live OS v0.0.4
Kernel r
EoF
# set ipmi to start at boot up
/sbin/chkconfig ipmi on
# configure rsyslog
cat >> /etc/rsyslog.conf <<EoF
# invisible touch
local0.*                                /var/log/it.log
local0.*                                /dev/tty7
EoF
%end

                                                          41
Invisible Touch Utilities

        • lshw
        • lldpd
        • Breakin
        • ipmitool
        • Bash scripts



                            42
lshw
<node id="disk:1" claimed="true" class="disk" handle="SCSI:04:00:01:00">
  <description>ATA Disk</description>
  <product>ST91000640NS</product>
  <vendor>Seagate</vendor>
  <physid>0.1.0</physid>
  <businfo>scsi@4:0.1.0</businfo>
  <logicalname>/dev/sdf</logicalname>
  <dev>8:80</dev>
  <version>n/a</version>
  <serial>9XG0ETB8</serial>
  <size units="bytes">1000204886016</size>
  <configuration>
    <setting id="ansiversion" value="5" />
    <setting id="signature" value="000e1763" />
  </configuration>
  <capabilities>
    <capability id="partitioned">Partitioned disk</capability>
    <capability id="partitioned:dos">MS-DOS partition table</capability>
  </capabilities>
</node>



lshw generates hardware info XML

                                                                           43
lldpd
<interface label="Interface" name="eth0" via="LLDP" rid="1" age="0 day, 00:01:03">
 <chassis label="Chassis">
  <id label="ChassisID" type="mac">78:19:f7:88:60:c0</id>
  <name label="SysName">core01.dfw01</name>
  <descr label="SysDescr">Juniper Networks, Inc. ex4500-40f</descr>
  <capability label="Capability" type="Bridge" enabled="on" />
  <capability label="Capability" type="Router" enabled="on" />
 </chassis>
 <port label="Port">
  <id label="PortID" type="local">608</id>
  <descr label="PortDescr">ge-0/0/3.0</descr>
  <mfs label="MFS">1514</mfs>
  <auto-negotiation label="PMD autoneg" supported="no" enabled="yes">
   <advertised label="Adv" type="10Base-T" hd="no" fd="yes" />
   <current label="MAU oper type">unknown</current>
  </auto-negotiation>
 </port>
 <vlan label="VLAN" vlan-id="666" pvid="yes">DFW01-PROVISIONING</vlan>
 <lldp-med label="LLDP-MED">
  <device-type label="Device Type">Network Connectivity Device</device-type>
  <capability label="Capability" type="Capabilities" />
 </lldp-med>
</interface>


    lldpctl outputs network info in XML

                                                                                     44
Breakin




Stress testing framework



                           45
Breakin

     • Standard tools
     • LINPACK
     • Extensible
     • Bash scripts



                        46
Invisible Touch

    Firmware
    Configure BIOS
    Set up BMC
    Inventory
    Stress testing
    Network config




                     47
Collins
• Asset management system in Scala
• REST API
• Client libraries in Ruby, Python and Bash
• Shell tool for scripting and automation
• Callback system for hooking into events
• Granular permissions model
• Flexible web and API based provisioning
• Remote power management
• IP Address allocation and management
• Distributed mode for spanning data centers
                                               48
Collins Docs




               49
Collins Search




                 50
Collins Asset Details




                        51
Collins Provisioning




                       52
Phil

• iPXE dispatcher
• Kickstart generator
• Light Ruby app
• Collins API client



                        53
Server Intake Workflow

   1. Rack and stack
   2. Power on
   3. Enter physical data




                            54
Server Intake Process
1. Server boots iPXE via DHCP/PXE
2. iPXE gets config from Phil
3. Phil sends Invisible Touch
4. IT updates firmware (if needed)
5. IT configures BIOS
6. IT configures BMC
7. IT uploads inventory data to Collins
8. IT starts stress tests
9. IT powers down server



                                          55
Provisioning Workflow

1. Search Collins
2. Choose Profile, Role, Pool
3. Click button




                               56
Provisioning Process
1. Server boots iPXE via DHCP/PXE
2. iPXE gets config from Phil
3. Phil sends install image
4. Install image gets Kickstart from Phil
5. Install runs Puppet in %post
6. End of %post calls back to Collins
7. Collins triggers vlan update
8. Collins triggers monitoring/trending
9. Added to production if “all green”


                                            57
Result




          Fast, scalable, no hassle provisioning!

http://www.flickr.com/people/mc4army/
                                                    58
Hurdles




http://www.flickr.com/people/ligynnek/
                                                  59
Hurdles


                                          PXE kickstart w/ multiple NICs
                                          Network set up in %post
                                          Virident SSD set up in %post




http://www.flickr.com/people/ligynnek/
                                                                           60
PXE Kickstart / Multiple NICs
                     Phil iPXE config
     initrd <%= os_install_url %>/images/initrd.img
     kernel <%= os_install_url %>/images/vmlinuz ip=dhcp ksdevice=${mac}




                Phil kickstart snippet
     # network
     network --bootproto=dhcp




                                                                           61
%post Network Set Up
                Phil kickstart snippet
# Bond Interface: <%= bond.name %>

cat > /etc/sysconfig/network-scripts/ifcfg-<%= bond.name %> <<EoF
DEVICE=<%= bond.name %>
BONDING_OPTS="<%= bond.options %>"
BOOTPROTO=static
IPADDR=<%= bond.address %>
NETMASK=<%= bond.netmask %>
GATEWAY=<%= bond.gateway %>
EoF




                                                                    62
%post Virident SSD Set Up
   # Start the virident daemon
   /etc/init.d/vgcd start
   # create a device node
   mknod /dev/vgca0 b 252 0
   # create a mount point
   mkdir -p /var/lib/mysql
   # create partitions
   parted -s /dev/vgca0 mklabel msdos
   parted -s /dev/vgca0 unit s mkpart primary ext2 2048 100%
   # make another device node
   mknod /dev/vgca0p1 b 252 1
   # make the filesystem
   /sbin/mkfs.xfs -f -d su=64k,sw=3 -l size=32m,su=16k /dev/vgca0p1
   # create fstab entry
   echo "/dev/vgca0p1       /var/lib/mysql   xfs    noauto     0 0" >>/etc/fstab
   # create virident config
   cat > /etc/sysconfig/vgcd.conf << EoF
   RESCAN_MD=1
   RESCAN_LVM=1
   MOUNT_POINTS="/var/lib/mysql"
   RESCAN_MOUNT=1
   EoF
   # mount the virident
   mount /var/lib/mysql




                                                                                   63
Lessons Learned

• Modularity is very important
• Hardware always has issues at scale
• Use modern Bash syntax
• 4 hour burn-in is not enough



                                        64
Yes, we’re hiring!

   Joshua Hoffman
  joshua@tumblr.com
    tumblr.com/jobs




                      65
66

Contenu connexe

Similaire à Scalable system operations presentation

Smart Bombs: Mobile Vulnerability and Exploitation
Smart Bombs: Mobile Vulnerability and ExploitationSmart Bombs: Mobile Vulnerability and Exploitation
Smart Bombs: Mobile Vulnerability and ExploitationSecureState
 
Design for Scale / Surge 2010
Design for Scale / Surge 2010Design for Scale / Surge 2010
Design for Scale / Surge 2010Christopher Brown
 
Owning windows 8 with human interface devices
Owning windows 8 with human interface devicesOwning windows 8 with human interface devices
Owning windows 8 with human interface devicesNikhil Mittal
 
The world is not black and white – Impact of decisions over the lifetime of a...
The world is not black and white – Impact of decisions over the lifetime of a...The world is not black and white – Impact of decisions over the lifetime of a...
The world is not black and white – Impact of decisions over the lifetime of a...Eric Reiche
 
DEF CON 27 - workshop - RICHARD GOLD - mind the gap
DEF CON 27 - workshop - RICHARD GOLD - mind the gapDEF CON 27 - workshop - RICHARD GOLD - mind the gap
DEF CON 27 - workshop - RICHARD GOLD - mind the gapFelipe Prado
 
Continuous delivery applied
Continuous delivery appliedContinuous delivery applied
Continuous delivery appliedMike McGarr
 
Demystifying Binary Reverse Engineering - Pixels Camp
Demystifying Binary Reverse Engineering - Pixels CampDemystifying Binary Reverse Engineering - Pixels Camp
Demystifying Binary Reverse Engineering - Pixels CampAndré Baptista
 
Mobile for PHP developers
Mobile for PHP developersMobile for PHP developers
Mobile for PHP developersIvo Jansch
 
5 cro tools that i can't live without
5 cro tools that i can't live without5 cro tools that i can't live without
5 cro tools that i can't live withoutCraig Sullivan
 
Continuous Delivery Applied (AgileDC)
Continuous Delivery Applied (AgileDC)Continuous Delivery Applied (AgileDC)
Continuous Delivery Applied (AgileDC)Mike McGarr
 
Zero Trust And Best Practices for Securing Endpoint Apps on May 24th 2021
Zero Trust And Best Practices for Securing Endpoint Apps on May 24th 2021Zero Trust And Best Practices for Securing Endpoint Apps on May 24th 2021
Zero Trust And Best Practices for Securing Endpoint Apps on May 24th 2021Teemu Tiainen
 
Php Development In The Cloud
Php Development In The CloudPhp Development In The Cloud
Php Development In The CloudIvo Jansch
 
Made for Each Other: Microservices + PaaS
Made for Each Other: Microservices + PaaSMade for Each Other: Microservices + PaaS
Made for Each Other: Microservices + PaaSVMware Tanzu
 
10 Things You Probably Didn't Know About Plone
10 Things You Probably Didn't Know About Plone10 Things You Probably Didn't Know About Plone
10 Things You Probably Didn't Know About PloneJazkarta, Inc.
 
Platform Clouds, Containers, Immutable Infrastructure Oh My!
Platform Clouds, Containers, Immutable Infrastructure Oh My!Platform Clouds, Containers, Immutable Infrastructure Oh My!
Platform Clouds, Containers, Immutable Infrastructure Oh My!Stuart Charlton
 
Metasploitation part-1 (murtuja)
Metasploitation part-1 (murtuja)Metasploitation part-1 (murtuja)
Metasploitation part-1 (murtuja)ClubHack
 
2013: Trends from the Trenches
2013: Trends from the Trenches2013: Trends from the Trenches
2013: Trends from the TrenchesChris Dagdigian
 
Continuous delivery applied (DC CI User Group)
Continuous delivery applied (DC CI User Group)Continuous delivery applied (DC CI User Group)
Continuous delivery applied (DC CI User Group)Mike McGarr
 

Similaire à Scalable system operations presentation (20)

Smart Bombs: Mobile Vulnerability and Exploitation
Smart Bombs: Mobile Vulnerability and ExploitationSmart Bombs: Mobile Vulnerability and Exploitation
Smart Bombs: Mobile Vulnerability and Exploitation
 
Design for Scale / Surge 2010
Design for Scale / Surge 2010Design for Scale / Surge 2010
Design for Scale / Surge 2010
 
Owning windows 8 with human interface devices
Owning windows 8 with human interface devicesOwning windows 8 with human interface devices
Owning windows 8 with human interface devices
 
The world is not black and white – Impact of decisions over the lifetime of a...
The world is not black and white – Impact of decisions over the lifetime of a...The world is not black and white – Impact of decisions over the lifetime of a...
The world is not black and white – Impact of decisions over the lifetime of a...
 
Lrug
LrugLrug
Lrug
 
DEF CON 27 - workshop - RICHARD GOLD - mind the gap
DEF CON 27 - workshop - RICHARD GOLD - mind the gapDEF CON 27 - workshop - RICHARD GOLD - mind the gap
DEF CON 27 - workshop - RICHARD GOLD - mind the gap
 
Continuous delivery applied
Continuous delivery appliedContinuous delivery applied
Continuous delivery applied
 
Demystifying Binary Reverse Engineering - Pixels Camp
Demystifying Binary Reverse Engineering - Pixels CampDemystifying Binary Reverse Engineering - Pixels Camp
Demystifying Binary Reverse Engineering - Pixels Camp
 
Mobile for PHP developers
Mobile for PHP developersMobile for PHP developers
Mobile for PHP developers
 
5 cro tools that i can't live without
5 cro tools that i can't live without5 cro tools that i can't live without
5 cro tools that i can't live without
 
Continuous Delivery Applied (AgileDC)
Continuous Delivery Applied (AgileDC)Continuous Delivery Applied (AgileDC)
Continuous Delivery Applied (AgileDC)
 
Zero Trust And Best Practices for Securing Endpoint Apps on May 24th 2021
Zero Trust And Best Practices for Securing Endpoint Apps on May 24th 2021Zero Trust And Best Practices for Securing Endpoint Apps on May 24th 2021
Zero Trust And Best Practices for Securing Endpoint Apps on May 24th 2021
 
Php Development In The Cloud
Php Development In The CloudPhp Development In The Cloud
Php Development In The Cloud
 
Made for Each Other: Microservices + PaaS
Made for Each Other: Microservices + PaaSMade for Each Other: Microservices + PaaS
Made for Each Other: Microservices + PaaS
 
10 Things You Probably Didn't Know About Plone
10 Things You Probably Didn't Know About Plone10 Things You Probably Didn't Know About Plone
10 Things You Probably Didn't Know About Plone
 
Platform Clouds, Containers, Immutable Infrastructure Oh My!
Platform Clouds, Containers, Immutable Infrastructure Oh My!Platform Clouds, Containers, Immutable Infrastructure Oh My!
Platform Clouds, Containers, Immutable Infrastructure Oh My!
 
What is this cloud thing?
What is this cloud thing?What is this cloud thing?
What is this cloud thing?
 
Metasploitation part-1 (murtuja)
Metasploitation part-1 (murtuja)Metasploitation part-1 (murtuja)
Metasploitation part-1 (murtuja)
 
2013: Trends from the Trenches
2013: Trends from the Trenches2013: Trends from the Trenches
2013: Trends from the Trenches
 
Continuous delivery applied (DC CI User Group)
Continuous delivery applied (DC CI User Group)Continuous delivery applied (DC CI User Group)
Continuous delivery applied (DC CI User Group)
 

Plus de james tong

数据库系统设计漫谈
数据库系统设计漫谈数据库系统设计漫谈
数据库系统设计漫谈james tong
 
Migrating from MySQL to PostgreSQL
Migrating from MySQL to PostgreSQLMigrating from MySQL to PostgreSQL
Migrating from MySQL to PostgreSQLjames tong
 
Oracle 性能优化
Oracle 性能优化Oracle 性能优化
Oracle 性能优化james tong
 
Cap 理论与实践
Cap 理论与实践Cap 理论与实践
Cap 理论与实践james tong
 
Benchmarks, performance, scalability, and capacity what s behind the numbers...
Benchmarks, performance, scalability, and capacity  what s behind the numbers...Benchmarks, performance, scalability, and capacity  what s behind the numbers...
Benchmarks, performance, scalability, and capacity what s behind the numbers...james tong
 
Stability patterns presentation
Stability patterns presentationStability patterns presentation
Stability patterns presentationjames tong
 
The right read optimization is actually write optimization
The right read optimization is actually write optimizationThe right read optimization is actually write optimization
The right read optimization is actually write optimizationjames tong
 
My sql ssd-mysqluc-2012
My sql ssd-mysqluc-2012My sql ssd-mysqluc-2012
My sql ssd-mysqluc-2012james tong
 
Evaluating ha alternatives my sql tutorial2
Evaluating ha alternatives my sql tutorial2Evaluating ha alternatives my sql tutorial2
Evaluating ha alternatives my sql tutorial2james tong
 
Troubleshooting mysql-tutorial
Troubleshooting mysql-tutorialTroubleshooting mysql-tutorial
Troubleshooting mysql-tutorialjames tong
 
Understanding performance through_measurement
Understanding performance through_measurementUnderstanding performance through_measurement
Understanding performance through_measurementjames tong
 
我对后端优化的一点想法 (2012)
我对后端优化的一点想法 (2012)我对后端优化的一点想法 (2012)
我对后端优化的一点想法 (2012)james tong
 
设计可扩展的Oracle应用
设计可扩展的Oracle应用设计可扩展的Oracle应用
设计可扩展的Oracle应用james tong
 
我对后端优化的一点想法.pptx
我对后端优化的一点想法.pptx我对后端优化的一点想法.pptx
我对后端优化的一点想法.pptxjames tong
 
Enqueue Lock介绍.ppt
Enqueue Lock介绍.pptEnqueue Lock介绍.ppt
Enqueue Lock介绍.pptjames tong
 
Oracle数据库体系结构简介.ppt
Oracle数据库体系结构简介.pptOracle数据库体系结构简介.ppt
Oracle数据库体系结构简介.pptjames tong
 
Cassandra简介.ppt
Cassandra简介.pptCassandra简介.ppt
Cassandra简介.pptjames tong
 

Plus de james tong (17)

数据库系统设计漫谈
数据库系统设计漫谈数据库系统设计漫谈
数据库系统设计漫谈
 
Migrating from MySQL to PostgreSQL
Migrating from MySQL to PostgreSQLMigrating from MySQL to PostgreSQL
Migrating from MySQL to PostgreSQL
 
Oracle 性能优化
Oracle 性能优化Oracle 性能优化
Oracle 性能优化
 
Cap 理论与实践
Cap 理论与实践Cap 理论与实践
Cap 理论与实践
 
Benchmarks, performance, scalability, and capacity what s behind the numbers...
Benchmarks, performance, scalability, and capacity  what s behind the numbers...Benchmarks, performance, scalability, and capacity  what s behind the numbers...
Benchmarks, performance, scalability, and capacity what s behind the numbers...
 
Stability patterns presentation
Stability patterns presentationStability patterns presentation
Stability patterns presentation
 
The right read optimization is actually write optimization
The right read optimization is actually write optimizationThe right read optimization is actually write optimization
The right read optimization is actually write optimization
 
My sql ssd-mysqluc-2012
My sql ssd-mysqluc-2012My sql ssd-mysqluc-2012
My sql ssd-mysqluc-2012
 
Evaluating ha alternatives my sql tutorial2
Evaluating ha alternatives my sql tutorial2Evaluating ha alternatives my sql tutorial2
Evaluating ha alternatives my sql tutorial2
 
Troubleshooting mysql-tutorial
Troubleshooting mysql-tutorialTroubleshooting mysql-tutorial
Troubleshooting mysql-tutorial
 
Understanding performance through_measurement
Understanding performance through_measurementUnderstanding performance through_measurement
Understanding performance through_measurement
 
我对后端优化的一点想法 (2012)
我对后端优化的一点想法 (2012)我对后端优化的一点想法 (2012)
我对后端优化的一点想法 (2012)
 
设计可扩展的Oracle应用
设计可扩展的Oracle应用设计可扩展的Oracle应用
设计可扩展的Oracle应用
 
我对后端优化的一点想法.pptx
我对后端优化的一点想法.pptx我对后端优化的一点想法.pptx
我对后端优化的一点想法.pptx
 
Enqueue Lock介绍.ppt
Enqueue Lock介绍.pptEnqueue Lock介绍.ppt
Enqueue Lock介绍.ppt
 
Oracle数据库体系结构简介.ppt
Oracle数据库体系结构简介.pptOracle数据库体系结构简介.ppt
Oracle数据库体系结构简介.ppt
 
Cassandra简介.ppt
Cassandra简介.pptCassandra简介.ppt
Cassandra简介.ppt
 

Dernier

"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024The Digital Insurer
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxhariprasad279825
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Patryk Bandurski
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piececharlottematthew16
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationSafe Software
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Wonjun Hwang
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clashcharlottematthew16
 
Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embeddingZilliz
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 

Dernier (20)

"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptx
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piece
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clash
 
Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embedding
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 

Scalable system operations presentation

  • 1. Scalable System Operations Joshua Hoffman 1
  • 2. About This Talk • Set of principles • Software • Operations Engineering • Techniques • Tumblr project • Example code • Server management • Best practices • Massively automated • Open source 2
  • 4. About Me • 1995: CompUSA Intro to The Internet • 2000: Guru Labs Sun, Cisco, Red Hat • 2002: Red Hat Sys admin courseware 4
  • 5. About Me • 2004: Fortress Systems Anti-spam/malware • 2005: Red Hat Virtualization cert Remote learning Defined “cloud” • 2011: Tumblr Lead Systems Eng 5
  • 6. The Problem Deploying new servers is very repetitive and slow. (and we hate that) http://www.flickr.com/people/mc4army/ 6
  • 7. The Job We Don’t Want http://www.flickr.com/people/redjar/ 7
  • 9. Install OS Firmware Configure OS Configure BIOS Install software Set up BMC Configure software Inventory Add to DNS Stress testing Add to monitoring Network config Add to trending 9
  • 10. The Goal Is Clear 10
  • 12. Time To Strategize Use open source? Which? Buy software? Which? Write software? Mix and match? http://www.flickr.com/people/irishwildcat/ 12
  • 13. The Choice Principle The time to make a decision is a function of the possible choices. http://www.flickr.com/people/3059349393/ 13
  • 15. Rapid Software Research 1. Define 2. Gather 3. Disqualify 4. Rank 15
  • 17. Rank • Modularity • Compliance • Novelty • Disruption http://www.flickr.com/people/nirak/ 17
  • 18. My Requirements • Asset inventory • State management • Robust API • Event triggers 18
  • 19. My Requirements • Modular • Flexible • Extensible • Fast 19
  • 20. My Requirements Manage physical hardware as easily as virtual machines. 20
  • 21. The Usual Suspects • Cobbler • Foreman • Satellite • Orchestra • Racktables • Clusto 21
  • 22. But Wait! 22
  • 23. Data Entry “Just import the data supplied by the hardware vendor...” http://www.flickr.com/people/mwichary/ 23
  • 24. Missing Requirements Firmware Configure BIOS Set up BMC Inventory Stress testing Network config Add to monitoring Add to trending 24
  • 25. We have to write software! http://www.flickr.com/people/argen/ 25
  • 26. We have to write software! • Delivery Schedule • Scope Creep • Maintenance • Documentation http://www.flickr.com/people/argen/ 26
  • 27. Tumblr Management Stack • iPXE • Invisible Touch • Collins • Phil • Kickstart • Puppet 27
  • 28. The Glue Principle Unix Rule of Parsimony: Write a big program only when it is clear by demonstration that nothing else will do. http://www.flickr.com/people/kodomut/ 28
  • 29. The Standards Principle The nice thing about standards is that you have so many to choose from. -Andrew Tanenbaum http://www.flickr.com/people/usfwssoutheast/ 29
  • 30. The Simplicity Principle Unix Rule of Simplicity: Design for simplicity; add complexity only where you must. http://www.flickr.com/people/haldanemartin/ 30
  • 31. The 3:00 AM Principle It must be obvious to someone woken up from a sound sleep at 3:00 am. http://www.flickr.com/people/joi/ 31
  • 32. The Don’t Break The OS Principle The software should NOT prevent the OS from working as expected. http://www.flickr.com/photos/philmanker/ 32
  • 33. The Amnesia Principle Given enough time, you WILL forget why you did that. http://www.flickr.com/people/zach_a/ 33
  • 34. Tumblr Management Stack • iPXE • Invisible Touch • Collins • Phil • Kickstart • Puppet 34
  • 36. Why not pxelinux? • TFTP • Flat files http://www.flickr.com/people/digiart2001/ 36
  • 38. iPXE • HTTP, FTP, iSCSI • Scriptable • Variables • Dynamic http://www.flickr.com/people/chiotsrun/ 38
  • 39. ISC DHCP For iPXE # subnet for the provisioning vlan subnet <%= subnet %> netmask <%= netmask %> {     option domain-name "<%= option_domain_name %>";     option routers <%= option_routers %>;     option domain-name-servers <%= option_dns_servers.map{|i| "#{i}"}.join(", ") -%>;     option subnet-mask <%= option_subnet_mask %>;     default-lease-time 21600;     max-lease-time 43200;     range <%= range_start %> <%= range_end %>;     # If a pxe request comes in from ipxe send the config url     if exists user-class and option user-class = "iPXE" {         filename "<%= ipxe_config_url %>"; # http://foo.example.com/ipxe/${net0/mac}     # For all other pxe requests send ipxe     } else {         next-server <%= next_server %>; # tftp server         filename "<%= filename %>"; # path to ipxe binary on tftp server     } } } 39
  • 40. Fedora LiveCD Tools lang en_US.UTF-8 keyboard us timezone US/Eastern auth --useshadow --enablemd5 selinux --enforcing firewall --disabled repo --name=centos --baseurl=http://127.0.0.1/pub/repo/centos/os/6.2 repo --name=infra --baseurl=http://127.0.0.1/pub/repo/infra/6.2 repo --name=epel --baseurl=http://127.0.0.1/repo/epel/6/x86_64/ %packages --excludedocs @core dracut dracut-kernel device-mapper device-mapper-event %end 40
  • 41. Invisible Touch Kickstart # Invisible Touch Live OS image %include centos-6.2-livecd-minimal.ks %packages --excludedocs it %end %post cat > /etc/issue <<EoF Invisible Touch Live OS v0.0.4 Kernel r EoF # set ipmi to start at boot up /sbin/chkconfig ipmi on # configure rsyslog cat >> /etc/rsyslog.conf <<EoF # invisible touch local0.* /var/log/it.log local0.* /dev/tty7 EoF %end 41
  • 42. Invisible Touch Utilities • lshw • lldpd • Breakin • ipmitool • Bash scripts 42
  • 43. lshw <node id="disk:1" claimed="true" class="disk" handle="SCSI:04:00:01:00"> <description>ATA Disk</description> <product>ST91000640NS</product> <vendor>Seagate</vendor> <physid>0.1.0</physid> <businfo>scsi@4:0.1.0</businfo> <logicalname>/dev/sdf</logicalname> <dev>8:80</dev> <version>n/a</version> <serial>9XG0ETB8</serial> <size units="bytes">1000204886016</size> <configuration> <setting id="ansiversion" value="5" /> <setting id="signature" value="000e1763" /> </configuration> <capabilities> <capability id="partitioned">Partitioned disk</capability> <capability id="partitioned:dos">MS-DOS partition table</capability> </capabilities> </node> lshw generates hardware info XML 43
  • 44. lldpd <interface label="Interface" name="eth0" via="LLDP" rid="1" age="0 day, 00:01:03"> <chassis label="Chassis"> <id label="ChassisID" type="mac">78:19:f7:88:60:c0</id> <name label="SysName">core01.dfw01</name> <descr label="SysDescr">Juniper Networks, Inc. ex4500-40f</descr> <capability label="Capability" type="Bridge" enabled="on" /> <capability label="Capability" type="Router" enabled="on" /> </chassis> <port label="Port"> <id label="PortID" type="local">608</id> <descr label="PortDescr">ge-0/0/3.0</descr> <mfs label="MFS">1514</mfs> <auto-negotiation label="PMD autoneg" supported="no" enabled="yes"> <advertised label="Adv" type="10Base-T" hd="no" fd="yes" /> <current label="MAU oper type">unknown</current> </auto-negotiation> </port> <vlan label="VLAN" vlan-id="666" pvid="yes">DFW01-PROVISIONING</vlan> <lldp-med label="LLDP-MED"> <device-type label="Device Type">Network Connectivity Device</device-type> <capability label="Capability" type="Capabilities" /> </lldp-med> </interface> lldpctl outputs network info in XML 44
  • 46. Breakin • Standard tools • LINPACK • Extensible • Bash scripts 46
  • 47. Invisible Touch Firmware Configure BIOS Set up BMC Inventory Stress testing Network config 47
  • 48. Collins • Asset management system in Scala • REST API • Client libraries in Ruby, Python and Bash • Shell tool for scripting and automation • Callback system for hooking into events • Granular permissions model • Flexible web and API based provisioning • Remote power management • IP Address allocation and management • Distributed mode for spanning data centers 48
  • 53. Phil • iPXE dispatcher • Kickstart generator • Light Ruby app • Collins API client 53
  • 54. Server Intake Workflow 1. Rack and stack 2. Power on 3. Enter physical data 54
  • 55. Server Intake Process 1. Server boots iPXE via DHCP/PXE 2. iPXE gets config from Phil 3. Phil sends Invisible Touch 4. IT updates firmware (if needed) 5. IT configures BIOS 6. IT configures BMC 7. IT uploads inventory data to Collins 8. IT starts stress tests 9. IT powers down server 55
  • 56. Provisioning Workflow 1. Search Collins 2. Choose Profile, Role, Pool 3. Click button 56
  • 57. Provisioning Process 1. Server boots iPXE via DHCP/PXE 2. iPXE gets config from Phil 3. Phil sends install image 4. Install image gets Kickstart from Phil 5. Install runs Puppet in %post 6. End of %post calls back to Collins 7. Collins triggers vlan update 8. Collins triggers monitoring/trending 9. Added to production if “all green” 57
  • 58. Result Fast, scalable, no hassle provisioning! http://www.flickr.com/people/mc4army/ 58
  • 60. Hurdles PXE kickstart w/ multiple NICs Network set up in %post Virident SSD set up in %post http://www.flickr.com/people/ligynnek/ 60
  • 61. PXE Kickstart / Multiple NICs Phil iPXE config initrd <%= os_install_url %>/images/initrd.img kernel <%= os_install_url %>/images/vmlinuz ip=dhcp ksdevice=${mac} Phil kickstart snippet # network network --bootproto=dhcp 61
  • 62. %post Network Set Up Phil kickstart snippet # Bond Interface: <%= bond.name %> cat > /etc/sysconfig/network-scripts/ifcfg-<%= bond.name %> <<EoF DEVICE=<%= bond.name %> BONDING_OPTS="<%= bond.options %>" BOOTPROTO=static IPADDR=<%= bond.address %> NETMASK=<%= bond.netmask %> GATEWAY=<%= bond.gateway %> EoF 62
  • 63. %post Virident SSD Set Up # Start the virident daemon /etc/init.d/vgcd start # create a device node mknod /dev/vgca0 b 252 0 # create a mount point mkdir -p /var/lib/mysql # create partitions parted -s /dev/vgca0 mklabel msdos parted -s /dev/vgca0 unit s mkpart primary ext2 2048 100% # make another device node mknod /dev/vgca0p1 b 252 1 # make the filesystem /sbin/mkfs.xfs -f -d su=64k,sw=3 -l size=32m,su=16k /dev/vgca0p1 # create fstab entry echo "/dev/vgca0p1 /var/lib/mysql xfs noauto 0 0" >>/etc/fstab # create virident config  cat > /etc/sysconfig/vgcd.conf << EoF RESCAN_MD=1 RESCAN_LVM=1 MOUNT_POINTS="/var/lib/mysql" RESCAN_MOUNT=1 EoF # mount the virident mount /var/lib/mysql 63
  • 64. Lessons Learned • Modularity is very important • Hardware always has issues at scale • Use modern Bash syntax • 4 hour burn-in is not enough 64
  • 65. Yes, we’re hiring! Joshua Hoffman joshua@tumblr.com tumblr.com/jobs 65
  • 66. 66