Brief introduction to Umeng.com Operations Infrastructure & Practice.
---
Updated: 03/05/2015
Thanks to @TerryWang (http://www.slideshare.net/terrywang), who helped correct some grammar errors.
Below is the original copy; feel free to comment:
https://docs.google.com/presentation/d/1d1MAR8SClZDf8gjCNPuOeu63Fd83T-mzzqnqcTboAoY/edit?usp=sharing
2. About Me
● Before 2014, the only ops engineer at Umeng
● Now, core member of ops team
● Technical generalist, responsible for the overall reliability and performance of Umeng
● ArchLinux user
@Jasey_Wang | http://JaseyWang.Me
3. Agenda
● About Umeng
● IDC
● Network
● Server
● Product
● On Giants’ Shoulders
● OS
● User Management
● Critical Infrastructure
● Package Management
● Code Deployment
● Configuration Management
● Monitoring
● Tuning
● Documentation
● Outage & Diagnosis
● Security
● With Dev
● What We Are Doing Now
4. About Umeng
● Founded in April 2010
● Incubated by Innovation Works
● $10M raised from Matrix Partners China
● Acquired by Alibaba
● Largest mobile app analytics platform in China
● 400K+ apps
● ~1B mobile devices
5. IDC
● IDC
o 3 + 1
● Rack
o ~90
● Server
o 800+
● Network device
o 100+
6. Network
● Bandwidth
o 4Gbps+
o BGP cost
● Internal Network
o 10G interconnection
o Third network architecture upgrade in Q2 2014
Nexus 7k/5k/2k
Bonding
o OOB issue
8. Server
● Before 2014
o Dell(11G, 12G)
● Now
o Dell, HP, Huawei, Inspur
● 10G NIC, enterprise SSD
● Power supply, hot-plug, redundant
● Hard drive, hot-plug
9. Product
● Real-time analytics (Thunder)
o 150k req/s
o ~5B logs/day
o 100+ shards
● Batch processing system (Iceberg)
o ~300 2U nodes, 2TB/3TB 7200 RPM SAS
o ~3TB/day incremental data
o 4PB/5PB used
● Push, Social
11. On Giants’ Shoulders
● OSS
o Nginx(Tengine)
o Finagle, Thrift
o Redis
o Kafka
o Storm
o MongoDB
o Hadoop & ecosystem
● Enterprise
o Google Apps
o GitHub Enterprise
o Red Hat
o New Relic
o CDN
12. OS
● Before 2013
o Ubuntu 10.04/12.04
● Now
o Red Hat 6.2, kernel 2.6.32-279 (80%)
o professional technical support
● BIOS, RAID
o automatic tools
o done before delivery
http://goo.gl/TyDEVR
13. OS(cont.)
● OS template
o kickstart & preseed (great pain)
o partitioning (ext3/ext4, mount options)
o unnecessary services (irqbalance, cpuspeed, netfilter, etc.)
o sshd, monitoring agent
o handy tools (nmap, tcpdump, htop, iftop, screen, etc.)
o languages (Java/Scala, Python, Ruby)
● Custom init setup via Cobbler
● Added automatically by Zabbix
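A rough sketch of how the template above might land in a ks.cfg %post section (the service list follows the slide; zabbix-agent and the handy tools are assumed to be available from the internal repo):

```
%post
# Turn off services the template considers unnecessary
for svc in irqbalance cpuspeed iptables ip6tables; do
    chkconfig "$svc" off
done
# Baseline tooling plus the monitoring agent
yum -y install nmap tcpdump htop iftop screen zabbix-agent
chkconfig zabbix-agent on
%end
```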
14. User Management
● OpenVPN(multi path)
o Incredibly stable for 3 years, ZERO outages
o TCP vs UDP
● Public key
o OK for a startup, quick & dirty
● IPA(identity, policy, audit(snoopy))
o preferred
● A headache for us, for historical reasons
o engineers enjoy the “free style”
o so, the sooner the better
15. Critical Infrastructure
● DNS
o use IPs, not hostnames, in your code
o retry, timeout
● NTP
● Netfilter
o disabled by default
o conntrack
o NAT server
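The retry/timeout advice above can be sketched in a few lines of Python (a minimal illustration, not Umeng's actual code):

```python
import socket

def resolve(hostname, retries=3):
    """Resolve a hostname with bounded application-level retries.

    The resolver's own timeout is controlled by /etc/resolv.conf
    (options timeout:/attempts:), so here we only cap how many
    times the application retries before giving up.
    """
    last_err = None
    for _ in range(retries):
        try:
            return socket.gethostbyname(hostname)
        except socket.gaierror as err:
            last_err = err
    raise last_err

ip = resolve("localhost")
print(ip)
```

Once resolved, cache or pin the IP rather than resolving on every request, per the slide's "use IPs, not hostnames" point.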
16. Package Management
● Internal repo
o sync periodically
o blocked by the GFW :-(
● Do we really need to compile?
● Package managers
o yum/apt
o rpm/dpkg
o how we use them
● One-package principle
o rpm
o tgz
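A minimal .repo file pointing clients at such an internal mirror might look like this (hostname and paths are hypothetical):

```
# /etc/yum.repos.d/internal.repo
[internal-base]
name=Internal RHEL/CentOS mirror (synced periodically)
baseurl=http://mirror.internal.example/centos/$releasever/os/$basearch/
enabled=1
gpgcheck=1
```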
17. Code Deployment
● Capistrano
o Written in Ruby
o Deploy any language
o Easy to use
● Configuration management
o dev use
o ops
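The release layout Capistrano manages (timestamped releases plus an atomically switched `current` symlink) can be sketched language-neutrally; this illustrates the idea, it is not Capistrano itself:

```python
import os
import tempfile
import time

def deploy(app_root, files):
    """Write a new timestamped release, then flip the `current`
    symlink atomically so a failed deploy never leaves a
    half-updated tree -- the same layout Capistrano maintains."""
    release = os.path.join(app_root, "releases",
                           time.strftime("%Y%m%d%H%M%S"))
    os.makedirs(release)
    for name, content in files.items():
        with open(os.path.join(release, name), "w") as fh:
            fh.write(content)
    tmp = os.path.join(app_root, "current.tmp")
    os.symlink(release, tmp)
    os.replace(tmp, os.path.join(app_root, "current"))  # atomic rename
    return release

root = tempfile.mkdtemp()
release = deploy(root, {"app.conf": "v1"})
print(os.readlink(os.path.join(root, "current")))
```

Rollback is then just pointing `current` at the previous release directory.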
18. Configuration Management
● 2011
o tens of servers
o free style, mainly shell
● 2012 ~ 2013
o just ME
o Puppet is OK, learned some Ruby
o tens of modules written by me
● Now
o prerequisites
team skill tree
learning curve
o Puppet
obsolete in the new IDC
complex syntax, slow
o Saltstack
easy to pick up
flexible & plain
Ansible as backup
o Python/Ruby scripts, production level
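For flavor, a minimal Salt state in plain YAML (names are illustrative, not Umeng's actual states); it is the same package/config/service triple a Puppet module would express in its DSL:

```yaml
# ntp/init.sls -- ensure package, config file, and service
ntp:
  pkg.installed: []
  service.running:
    - enable: True
    - watch:
      - file: /etc/ntp.conf

/etc/ntp.conf:
  file.managed:
    - source: salt://ntp/ntp.conf
    - require:
      - pkg: ntp
```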
19. Monitoring
● Metrics, Metrics, Metrics!!!
● “All monitoring software evolves towards becoming an
implementation of Nagios”
http://goo.gl/PvBYky
20. Monitoring(cont.)
● From top to bottom
o customer perspective
o business level (DAU, etc.), critical & sensitive
o application level (QPS, latency, return codes, exceptions)
o system level (load, NIC, CPU, memory)
fork
swap in/out
nic speed/drops/errors
tcp queue, retransmit
o hardware level
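On Linux, the system-level TCP numbers listed above (retransmits, queue pressure) ultimately come from /proc; a minimal collector sketch (not the actual Zabbix item):

```python
def tcp_counters(path="/proc/net/snmp"):
    """Parse the kernel's TCP counter table into a dict, e.g.
    {'RetransSegs': ..., 'OutSegs': ..., 'InErrs': ...}."""
    with open(path) as fh:
        rows = [line.split() for line in fh if line.startswith("Tcp:")]
    header, values = rows[0][1:], rows[1][1:]
    return dict(zip(header, (int(v) for v in values)))

counters = tcp_counters()
print(counters["RetransSegs"], counters["OutSegs"])
```

A retransmit *rate* is the delta of RetransSegs over OutSegs between two samples; the absolute counters alone say little.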
21. Monitoring(cont.)
● Ideal
o near-real-time
o flexible intervals: 5s, 60s, 300s, 1800s
o comparable by date/time
o active/passive, or just feeds
● Dashboard (core metrics)
● Before
o Nagios/Munin (out of the box)
● Now
o Zabbix/Graphite
o NetworkBench, alibench (user end)
o New Relic
● Logs
o rsyslog
o ELK
o scripts
22. Tuning
● From app level to system level
● App level, not covered here
● System level, takeaways for common use
● Don’t forget hardware(BIOS, RAID)
● Baseline comes first
● One modification one time
● Never over-optimize
o “it works”, then “it runs happily”
o business driven
23. Tuning(cont.)
● Don’t modify kernel parameters unless 100% sure
o timestamp issue
o ECN issue
● TCP related
● Ring buffers, interrupts, open files, etc.
● DBs: watch out
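For reference, the kernel knobs behind those warnings (shown with 2.6.32-era defaults, not recommended values; verify each one against your workload before changing anything):

```
# /etc/sysctl.conf (reference only, defaults shown)
net.ipv4.tcp_timestamps = 1         # the "timestamp issue": combined with
                                    # tcp_tw_recycle this breaks clients behind NAT
net.ipv4.tcp_ecn = 2                # the "ECN issue": some middleboxes mishandle ECN
net.core.netdev_max_backlog = 1000  # input queue length before packet drops
fs.file-max = 1000000               # system-wide open-file limit (example value)
# Ring buffers are sized with `ethtool -G`, not sysctl.
```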
24. Documentation
● Routine
o regular deploys & setups, weekly reports
o online standards, 100+ slides for engineers
o ops sharing session every Thu
● Post-Mortem
o blameless
o timeline & deadline
● Github Wiki & Google Docs
25. Outage & Diagnosis
● This year(2014)
o SLA 99% ~ 99.9%
o issues every week, mostly invisible to customers
● When the site is down
o from bottom to top, or vice versa
o a good bug is one that can be reproduced
o tools are key
system http://goo.gl/wrNLi7
app
o inform support & BD
o technical background shares (http://blog.umeng.com/?cat=4)
● The network is unreliable, and it can break down
26. Security
● IP issues, a long long history
o public & private IPs
o ports restricted, listen()
o OOB
● test IDC
● UDP amplification
● Bash, SSL vulnerability
● DDoS
● whitehats (WooYun, etc.)
http://goo.gl/Q1SkXV
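The "ports restricted, listen()" point boils down to never binding services to 0.0.0.0; a minimal Python illustration (127.0.0.1 stands in for an internal/private address):

```python
import socket

# Bind explicitly to an internal address instead of all interfaces,
# so the service is unreachable from the public side regardless of
# firewall state. Port 0 lets the kernel pick a free port.
srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
srv.bind(("127.0.0.1", 0))
srv.listen(5)
host, port = srv.getsockname()
print(host, port)
srv.close()
```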
27. With Dev
● Tradeoff
o less work for devs usually means a more reliable system
o there will always be conflicts between ops & dev
unless one of them gives in
aggressive or mild, choose one
● Understand business logic
o code talks
o data talks
http://goo.gl/Qwh6Ze
28. What We Are Doing Now
● New IDCs, a new beginning, a great challenge
o active - backup
o active - active
● Transfer data from BJ to SH
● Env setup, stress test, benchmark
● Finally, switchover
http://goo.gl/TMDnnS
29. What We Are Doing Now(cont.)
● Private cloud
o CapEx & OpEx
o resources (hardware, software)
o workforce
push the product, fast growth
mentor other team members, help debug
---
Speaker notes (cleaned up from the transcript):

Things often got out of control in 2011.
10 am: event triggers; 10:30 pm: peak period.
Just like the app Nice; I attended their CTO's speech at ArchSummit.
~6B DAU events/day.
M&A by Alibaba: the IDC migration and related projects.
Team: 7 engineers, 4 PEs, 1 network engineer, 1 IT, 1 director.
Part-time IT before 2012; crossing the GFW.
The startup road is almost always the same.
2011-2012: not hot; from 2013: hot.
Growth was fast; now the acceleration is smaller, but the volume is still large.
Q4 2013: 2B DAU events; Q4 2014: 6B, 300% growth.
Zhihu Daily as a reference: 3M (Android), 2M installations; 1M-1.5M launches.
Roughly one IDC outage per month, so active-active is a must.
Bandwidth is limited.
Three network generations: 1st, 2011, simple; 2nd, 2012, fast; 3rd, Q2 2014.
The Nexus 7k/5k/2k rollout, Q1-Q2 2014: many challenges (covered on my blog); finally, it works.
Multiple uplinks.
Incoming traffic exceeds outgoing, different from the traditional web.
Servers: so many types, which creates operational problems.
10G NICs, LVS, NAT servers.
All DBs shipped with SSDs.
Power must be dual; power outages are frequent.
Few VMs running: first Xen, network problems; running LVS on Xen, WTF.
Thunder: migrated from Ruby to Scala.
SDK resend problem; SDK envelope, etc.
Kafka mirror, 600Mbps: scp-ing data from Finagle was not stable; the Kafka mirror works quite well.
HBase is buggy; Mongo resharding is acceptable and stable.
Front end: New Relic, code-level call traces.
No fancy technology.
GitHub and Red Hat are wonderful; have a try.
Jiankongbao, Jidiao.
Red Hat: nice docs, vmcore analysis.
All RAID 10, no deep reason, just for easy management.
Analyzing vmcore.
DNS is not reliable, even inside Alibaba; for Hadoop, use hosts files, but hosts files are not easy to maintain.
Puppet: a DSL. Salt: still young; the community is active but less ordered, and Google sometimes doesn't help.
Almost had an outage due to lack of metrics.
1.5K NVPS (Zabbix new values per second).
Kids: Redis sharding.
The OSS tools are better than Umeng's in-house equivalents.
The items referred to above must be tuned on high-load nodes.
Networking is critical: TCP retransmits, queue overflow.
Tons of small packets: drops begin around 100B packet size, ~200 kpps.
Disk, for Hadoop: noatime; ext3, not ext4.
4xx: ~0.4% (the mobile network is different); 5xx: ~0.005%; no 3xx.
Kernel panics were a reason for the OS switch; the 208.5-day panic.
The results were embarrassing: iDRAC hacked, DBs reachable from the public internet without auth, weak passwords.
DDoS affects us too; we are actually on the same uplink.
Instagram already migrated from AWS to Facebook's IDCs and spent a year preparing; for us, there are tens of problems to solve.
Push is the typical case; Thunder: some apps (Wannianli) spike traffic at 10 am; resend control.