2. 5+ years experience designing, setting
up, testing & running production web systems in
varied deployment environments
Experience setting up colocation IDCs with
Active-Active DR sites for India’s No. 1 OTA
Experience working on public cloud platforms
like AWS and setting up private cloud
infrastructure
…Generation G : Gamification /engineer/
Tags: techie, open source
enthusiast, engineer, geek, DevOps, web
ops, security , Tripper(MMYT),Ex-Nextag-ian :)
3. Scalable
Robust and Always Available
Manageable
Resilient
Operationally Visible (Monitor Everything)
Cost effective
4.
5. Avoid unnecessary change by selecting a
long-term supported distribution on which to
base your platform.
◦ RHEL / CentOS
◦ Ubuntu LTS (Long Term Support)
◦ Debian Stable
My preference:-
RHEL / CentOS (Red Hat Stability & yum wins)
6. Use your capacity model to drive a decision
on how you build infrastructure : Check SLAs
& Cost constraints
◦ 100% dedicated hardware (Self Managed /
Outsourced)
◦ 100% cloud (may consider AWS or Rackspace)
◦ Hybrid
Cloud success relies on “automating” key
service management processes to optimize
the run-time operation of /dynamic
workloads/ in a shared-resource
environment.
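As a rough illustration of letting a capacity model drive the build decision, the
sketch below sizes a server pool from a peak load figure. The function name, the
headroom and spare parameters, and all numbers are hypothetical and not taken
from the talk; real models would also account for SLAs and cost constraints.

```python
import math


def servers_needed(peak_rps: float, rps_per_server: float,
                   headroom: float = 0.3, spare: int = 1) -> int:
    """Servers required to carry peak load with headroom, plus spare nodes.

    headroom=0.3 plans each server to run at no more than 70% of its
    measured capacity; spare adds N+1 style redundancy on top.
    """
    if rps_per_server <= 0 or not 0 <= headroom < 1:
        raise ValueError("rps_per_server must be > 0 and headroom in [0, 1)")
    usable = rps_per_server * (1 - headroom)
    return math.ceil(peak_rps / usable) + spare


def monthly_cost(count: int, cost_per_server: float) -> float:
    """Total monthly cost for a pool of identical servers."""
    return count * cost_per_server


if __name__ == "__main__":
    # Hypothetical inputs: 1900 req/s peak, 250 req/s per server.
    pool = servers_needed(1900, 250, headroom=0.2, spare=1)
    print(pool, monthly_cost(pool, 120.0))
```

Running the same calculation with dedicated and cloud unit costs gives a first,
crude comparison of the three options above.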
7. Split each service(/layer) out across its own
set of servers for easier scale-out and
management.
◦ Traffic Management / (both Global Traffic & Local
traffic management)
◦ Application Servers
◦ Data Store Servers
◦ Email Services
◦ Minimize distribution of state:
Keep the number of services that require storage to a
minimum, for ease of backups and management
(e.g. data services)
8. Use redundant pairs (on devices/appliances),
/HA/ & clustering or failover to ensure
availability of service(s).
◦ Minimum down-time.
◦ Application & services redundancy + Load Balanced
cluster on one site & DR too
◦ DB HA+ Data Store(MySQL) Backup and Recovery
◦ Choose and implement best suited Failover strategy
◦ Redundant Network on each node (+ on Server:
Linux NIC bond)
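One simple failover strategy is priority-ordered selection: always use the
first node that passes a health check. The sketch below shows only that
decision logic; the node names and the health-check callable are hypothetical,
and a production setup would rely on a proven HA tool rather than hand-written
code.

```python
from typing import Callable, Optional, Sequence


def pick_active(nodes: Sequence[str],
                is_healthy: Callable[[str], bool]) -> Optional[str]:
    """Return the first healthy node in priority order, or None if all are down."""
    for node in nodes:
        if is_healthy(node):
            return node
    return None


if __name__ == "__main__":
    # Hypothetical health state: the primary has failed.
    state = {"db-primary": False, "db-standby": True}
    print(pick_active(["db-primary", "db-standby"], lambda n: state[n]))
```

Keeping the health check as a parameter lets the same logic sit behind a TCP
probe, an HTTP check or a replication-lag test.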
9. ◦ Dev, QA and staging platforms (both application &
network platform) to prove application and
configuration changes before they go live in
production.
◦ Most live-site issues are due to the lack of a
similarly configured environment / platform for
Dev / QA / staging testing.
◦ LAB Env:-
Performance/Stress LAB
Experimentation LAB (A/B or Multivariate experiment)
support with Live traffic
10. Virtualization is key here :) ...actually it is
virtualization that is changing the world ...not the cloud !!
+ Selecting the Right Virtualization
Technology
Use network boot and installer tools; or
templated provisioning to build servers
identically
◦ PXE Boot + Kickstart
◦ VMware ESXi template / Citrix XenServer
◦ Amazon AMI (EC2)
◦ OpenNebula
11. Package Management - YUM repositories
(Distribution + Own)
Create your own repository servers for
both packages and code
Use configuration management tools to
deploy configuration automatically from a
central location.
◦ Puppet / Facter
◦ Chef
◦ CFEngine (Nova)
◦ RANCID (N/w Devices)
12. Use a central service for identity and
password management
◦ OpenLDAP
◦ Active Directory
◦ TACACS+ (N/w devices)
Have proper accounting/audit Logging
Inventory Management :
◦ Use facter facts + CMDB based Inventory
Management
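A minimal sketch of turning Facter output into inventory records, assuming the
facts were collected with `facter --json` and shipped to a central place. The
chosen fact names are legacy Facter facts; the sample host and the in-memory
"CMDB" dictionary are hypothetical stand-ins.

```python
import json

# Facts kept in the inventory; adjust to whatever the CMDB needs.
WANTED_FACTS = ("fqdn", "operatingsystem", "operatingsystemrelease",
                "processorcount", "memorysize")


def inventory_record(facts_json: str) -> dict:
    """Extract the wanted facts from one host's `facter --json` output."""
    facts = json.loads(facts_json)
    return {name: facts.get(name) for name in WANTED_FACTS}


def build_inventory(all_facts_json) -> dict:
    """Index inventory records by FQDN (a dict stands in for the CMDB)."""
    inventory = {}
    for facts_json in all_facts_json:
        record = inventory_record(facts_json)
        inventory[record["fqdn"]] = record
    return inventory


if __name__ == "__main__":
    # Hypothetical sample; real input would come from each server.
    sample = json.dumps({"fqdn": "web01.example.com",
                         "operatingsystem": "CentOS",
                         "operatingsystemrelease": "6.3",
                         "processorcount": "8",
                         "memorysize": "15.58 GB",
                         "uptime": "12 days"})
    print(build_inventory([sample]))
```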
13. ◦ Version Control:-
SVN / GIT
◦ Use continuous integration and deployment tools to
test and release software
Jenkins (Hudson) / Go
Capistrano / Fabric
◦ ....Deploy more frequently ...so as to build
confidence in the whole system for change
management
14. Capture as much data as possible, starting
from site availability checks & external
dependency checks and going down to much more
detailed data.
Store time-series data for trend analysis, and
alert when thresholds are breached.
◦ CPU / RAM / IO / Network usage per server
◦ Application metrics
◦ Disk space usage
◦ Network bandwidth
◦ MySQL numbers
◦ ...etc
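To illustrate trend analysis on stored time-series data, the sketch below fits
a straight line to (time, value) samples and estimates how long until a
threshold is reached, for example a disk filling up. The linear model and the
function names are assumptions made for illustration; real metrics are often
not linear.

```python
def linear_trend(samples):
    """Least-squares slope and intercept for a list of (t, value) samples."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        raise ValueError("samples must span more than one point in time")
    slope = sum((t - mean_t) * (v - mean_v) for t, v in samples) / var_t
    return slope, mean_v - slope * mean_t


def time_to_threshold(samples, threshold):
    """Time units from the last sample until the fitted line hits threshold.

    Returns None when the trend is flat or falling.
    """
    slope, intercept = linear_trend(samples)
    if slope <= 0:
        return None
    last_t = samples[-1][0]
    return max(0.0, (threshold - intercept) / slope - last_t)


if __name__ == "__main__":
    # Hypothetical daily disk usage in percent: (day, percent used).
    usage = [(0, 50.0), (1, 52.0), (2, 54.0), (3, 56.0)]
    print(time_to_threshold(usage, 90.0))
```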
15. The data source could be anything: DB,
logs, SNMP, HTTP etc
+ have Real time reporting over it
(Dashboards)
+ Real time data extraction
Tools to consider:
◦ Ganglia / Centreon / Nagios
◦ OpManager for URL monitoring
◦ Selenium RC based checks (Functional tests) etc
Alerting on both Minimum/Maximum
Thresholds (OK, WARN, CRITICAL)!
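Alerting on both minimum and maximum thresholds can be sketched as a small
classifier that maps a metric value to OK, WARN or CRITICAL. The parameter
names and example numbers are hypothetical; monitoring tools such as Nagios
have their own threshold syntax, which this does not reproduce.

```python
def classify(value, warn_low=None, crit_low=None,
             warn_high=None, crit_high=None):
    """Map a metric value to "OK", "WARN" or "CRITICAL".

    Any threshold left as None is ignored. Critical limits are checked
    first so that they take precedence over warning limits.
    """
    if crit_low is not None and value <= crit_low:
        return "CRITICAL"
    if crit_high is not None and value >= crit_high:
        return "CRITICAL"
    if warn_low is not None and value <= warn_low:
        return "WARN"
    if warn_high is not None and value >= warn_high:
        return "WARN"
    return "OK"


if __name__ == "__main__":
    # Hypothetical checks: disk usage (too high is bad) and
    # request rate (a sudden drop is as suspicious as a spike).
    print(classify(85, warn_high=80, crit_high=90))
    print(classify(3, warn_low=10, crit_low=1, warn_high=500, crit_high=800))
```

A minimum threshold catches cases a maximum never will, such as traffic
dropping to near zero because an upstream dependency failed.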
16. Continue to plan your resource requirements
based on growth expectations, new features
and performance targets
Use data from:
◦ Your monitoring system!
◦ Business requirements
Continuously Improve:
◦ Profile applications and reduce resource usage
(DTrace)
◦ Review performance against capacity model
◦ Feed a “Top 10” hitlist back to developers, e.g.
slow queries
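A “Top 10” hitlist can be as simple as summing time spent per query and sorting.
The sketch below assumes the slow-query data has already been parsed into
(query fingerprint, seconds) pairs; that input format and the sample queries
are hypothetical.

```python
from collections import defaultdict


def top_offenders(records, limit=10):
    """Return the `limit` fingerprints with the highest total time.

    records: iterable of (fingerprint, seconds) pairs.
    Result: list of (fingerprint, total_seconds), largest first.
    """
    totals = defaultdict(float)
    for fingerprint, seconds in records:
        totals[fingerprint] += seconds
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:limit]


if __name__ == "__main__":
    # Hypothetical parsed slow-query entries.
    entries = [("SELECT * FROM orders WHERE user_id = ?", 1.0),
               ("SELECT * FROM hotels WHERE city = ?", 5.0),
               ("SELECT * FROM orders WHERE user_id = ?", 2.5)]
    for fingerprint, total in top_offenders(entries):
        print(f"{total:8.2f}s  {fingerprint}")
```

Ranking by total time rather than by the single slowest run surfaces queries
that are individually quick but executed very often.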
17. Varnish cache
◦ Reverse proxy, flexible configuration with inline C
support
Nginx
◦ Event based / Lightweight
◦ Runs more than 8% of the web
PHP-FPM
◦ Best FastCGI implementation available for PHP
MySQL Server tuning / optimization
Caching:- In memory data store -
Memcached / Redis
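The usual way to put an in-memory store in front of a database is the
cache-aside pattern: look in the cache first, and only on a miss load from the
backing store and populate the cache. In this sketch a small in-process class
with expiry stands in for Memcached or Redis; the class, the function names and
the loader are hypothetical, and a real deployment would use a client library
for the chosen store.

```python
import time

_MISSING = object()  # sentinel, so that a cached None still counts as a hit


class TTLCache:
    """Tiny in-process stand-in for an external cache with expiry."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._items = {}

    def get(self, key, default=_MISSING):
        entry = self._items.get(key)
        if entry is None:
            return default
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._items[key]  # expired: drop it and report a miss
            return default
        return value

    def set(self, key, value):
        self._items[key] = (self._clock() + self._ttl, value)


def cached_fetch(cache, key, loader):
    """Cache-aside read: return the cached value, else load and store it."""
    value = cache.get(key)
    if value is _MISSING:
        value = loader(key)
        cache.set(key, value)
    return value


if __name__ == "__main__":
    cache = TTLCache(ttl_seconds=30)
    # Hypothetical loader standing in for a MySQL query.
    print(cached_fetch(cache, "user:42", lambda k: {"id": 42}))
```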
18. As a first exercise, have an IT infrastructure &
application threat model done along with a
risk assessment, then consider having
◦ HIDS (OSSEC) /IPTABLES
◦ WAF (Web Application Firewall)
◦ IPS (Intrusion prevention system)
◦ Linux Hardening
◦ DLP (Data Leakage Prevention)
◦ Data Encryption considerations wrt Data Classification
Security Monitoring & Attack Detection
Key thing is to "Enable continuous compliance"
...maybe PCI-DSS for an e-comm.
19. Diagnosing / Troubleshooting and Fixing
production issues
Change Management and Delivery
Automate as much as possible with centralized
management of Scripting etc
Backup/restore : Always do test drills for them
Don’t re-invent the wheel & try to Go with proven
and solid technologies when you can
Last :) Keep re-architecting the infrastructure
(maybe small things) to optimize efficiency
(every 6 months) ...learn from mistakes (yours and
others' too :))
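Part of the backup/restore test drill mentioned above can be automated by
comparing checksums of the original files against a restored copy. This is a
minimal sketch for file-level backups only; the function names and directory
paths are hypothetical, and database backups need their own restore checks.

```python
import hashlib
from pathlib import Path


def checksums(root: Path) -> dict:
    """Map each file's path relative to root to its SHA-256 hex digest."""
    result = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            result[path.relative_to(root).as_posix()] = digest
    return result


def restore_mismatches(original: Path, restored: Path) -> list:
    """Relative paths that are missing, extra, or differ after a restore."""
    before, after = checksums(original), checksums(restored)
    return sorted(name for name in before.keys() | after.keys()
                  if before.get(name) != after.get(name))


if __name__ == "__main__":
    # Hypothetical locations of the live data and the drill restore.
    problems = restore_mismatches(Path("/srv/data"), Path("/tmp/restore-drill"))
    print("restore OK" if not problems else problems)
```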
20. Questions if Any !!
Ping Me on:-
IRC /freenode/ : PiyushK ##infra-talk
Gtalk: piykumar
Twitter @piykumar