Brief introduction to Umeng.com Operations Infrastructure & Practice.
---
Updated: 03/05/2015
Thanks to @TerryWang (http://www.slideshare.net/terrywang), who helped correct some grammar errors.
Below is the original copy; feel free to comment:
https://docs.google.com/presentation/d/1d1MAR8SClZDf8gjCNPuOeu63Fd83T-mzzqnqcTboAoY/edit?usp=sharing
2. About Me
● Before 2014, the only ops engineer at Umeng
● Now, core member of ops team
● Technical generalist, responsible for the overall reliability and performance of Umeng
● ArchLinux user
@Jasey_Wang | http://JaseyWang.Me
3. Agenda
● About Umeng
● IDC
● Network
● Server
● Product
● On Giants’ Shoulders
● OS
● User Management
● Critical Infrastructure
● Package Management
● Code Deployment
● Configuration Management
● Monitoring
● Tuning
● Documentation
● Outage & Diagnosis
● Security
● With Dev
● What We Are Doing Now
4. About Umeng
● Founded in April 2010
● Incubated by Innovation Works
● $10M raised from Matrix Partners China
● Acquired by Alibaba
● Largest mobile app analytics platform in China
● 400K+ apps
● ~1B mobile devices
5. IDC
● IDC
o 3 + 1
● Rack
o ~90
● Server
o 800+
● Network device
o 100+
6. Network
● Bandwidth
o 4Gbps+
o BGP cost
● Internal Network
o 10G interconnection
o Third network architecture upgrade in Q2 2014
Nexus 7k/5k/2k
Bonding
o OOB issue
8. Server
● Before 2014
o Dell(11G, 12G)
● Now
o Dell, HP, Huawei, Inspur
● 10G NIC, enterprise SSD
● Power supply, hot-plug, redundant
● Hard drive, hot-plug
9. Product
● Real-time analytics (Thunder)
o 150k req/s
o ~5B logs/day
o 100+ shards
● Batch processing system (Iceberg)
o ~300 2U nodes, 2TB/3TB 7200 RPM SAS
o ~3TB/day incremental data
o 4PB/5PB used
● Push, Social
11. On Giants’ Shoulders
● OSS
o Nginx(Tengine)
o Finagle, Thrift
o Redis
o Kafka
o Storm
o MongoDB
o Hadoop & ecosystem
● Enterprise
o Google Apps
o GitHub Enterprise
o Red Hat
o New Relic
o CDN
12. OS
● Before 2013
o Ubuntu 10.04/12.04
● Now
o Red Hat 6.2, kernel 2.6.32-279 (80%)
o professional technical support
● BIOS, RAID
o automatic tools
o done before delivery
http://goo.gl/TyDEVR
13. OS(cont.)
● OS template
o kickstart & preseed (great pain)
o partitioning (ext3/ext4, mount options)
o unnecessary services (irqbalance, cpuspeed, netfilter, etc.)
o sshd, monitoring agent
o handy tools (nmap, tcpdump, htop, iftop, screen, etc.)
o languages (Java/Scala, Python, Ruby)
● Custom init setup via Cobbler
● Added automatically by Zabbix
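A rough sketch of how the template above might land in a ks.cfg %post section (the service list follows the slide; zabbix-agent and the handy tools are assumed to be available from the internal repo):

```
%post
# Turn off services the template considers unnecessary
for svc in irqbalance cpuspeed iptables ip6tables; do
    chkconfig "$svc" off
done
# Baseline tooling plus the monitoring agent
yum -y install nmap tcpdump htop iftop screen zabbix-agent
chkconfig zabbix-agent on
%end
```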
14. User Management
● OpenVPN(multi path)
o Incredibly stable for 3 years, ZERO outages
o TCP vs UDP
● Public key
o OK for a startup, quick & dirty
● IPA(identity, policy, audit(snoopy))
o preferred
● A headache for us, for historical reasons
o engineers enjoy the “free style”
o so, the sooner the better
15. Critical Infrastructure
● DNS
o use IPs, not hostnames, in your code
o retry, timeout
● NTP
● Netfilter
o disabled by default
o conntrack
o NAT server
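The retry/timeout advice above can be sketched in a few lines of Python (a minimal illustration, not Umeng's actual code):

```python
import socket

def resolve(hostname, retries=3):
    """Resolve a hostname with bounded application-level retries.

    The resolver's own timeout is controlled by /etc/resolv.conf
    (options timeout:/attempts:), so here we only cap how many
    times the application retries before giving up.
    """
    last_err = None
    for _ in range(retries):
        try:
            return socket.gethostbyname(hostname)
        except socket.gaierror as err:
            last_err = err
    raise last_err

ip = resolve("localhost")
print(ip)
```

Once resolved, cache or pin the IP rather than resolving on every request, per the slide's "use IPs, not hostnames" point.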
16. Package Management
● Internal repo
o sync periodically
o blocked by the GFW :-(
● Do we really need to compile?
● Package managers
o yum/apt
o rpm/dpkg
o how we use them
● One-package principle
o rpm
o tgz
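A minimal .repo file pointing clients at such an internal mirror might look like this (hostname and paths are hypothetical):

```
# /etc/yum.repos.d/internal.repo
[internal-base]
name=Internal RHEL/CentOS mirror (synced periodically)
baseurl=http://mirror.internal.example/centos/$releasever/os/$basearch/
enabled=1
gpgcheck=1
```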
17. Code Deployment
● Capistrano
o Written in Ruby
o Deploy any language
o Easy to use
● Configuration management
o dev use
o ops
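The release layout Capistrano manages (timestamped releases plus an atomically switched `current` symlink) can be sketched language-neutrally; this illustrates the idea, it is not Capistrano itself:

```python
import os
import tempfile
import time

def deploy(app_root, files):
    """Write a new timestamped release, then flip the `current`
    symlink atomically so a failed deploy never leaves a
    half-updated tree -- the same layout Capistrano maintains."""
    release = os.path.join(app_root, "releases",
                           time.strftime("%Y%m%d%H%M%S"))
    os.makedirs(release)
    for name, content in files.items():
        with open(os.path.join(release, name), "w") as fh:
            fh.write(content)
    tmp = os.path.join(app_root, "current.tmp")
    os.symlink(release, tmp)
    os.replace(tmp, os.path.join(app_root, "current"))  # atomic rename
    return release

root = tempfile.mkdtemp()
release = deploy(root, {"app.conf": "v1"})
print(os.readlink(os.path.join(root, "current")))
```

Rollback is then just pointing `current` at the previous release directory.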
18. Configuration Management
● 2011
o tens of servers
o free style, mainly shell
● 2012 ~ 2013
o just ME
o Puppet is OK, learned some Ruby
o tens of modules written by me
● Now
o prerequisites
team skill tree
learning curve
o Puppet
obsolete in the new IDC
complex syntax, slow
o Saltstack
easy to pick up
flexible & plain
Ansible as backup
o Python/Ruby scripts, production level
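For flavor, a minimal Salt state in plain YAML (names are illustrative, not Umeng's actual states); it is the same package/config/service triple a Puppet module would express in its DSL:

```yaml
# ntp/init.sls -- ensure package, config file, and service
ntp:
  pkg.installed: []
  service.running:
    - enable: True
    - watch:
      - file: /etc/ntp.conf

/etc/ntp.conf:
  file.managed:
    - source: salt://ntp/ntp.conf
    - require:
      - pkg: ntp
```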
19. Monitoring
● Metrics, Metrics, Metrics!!!
● “All monitoring software evolves towards becoming an
implementation of Nagios”
http://goo.gl/PvBYky
20. Monitoring(cont.)
● From top to bottom
o customer perspective
o business level (DAU, etc.), critical & sensitive
o application level (QPS, latency, return codes, exceptions)
o system level (load, NIC, CPU, memory)
fork
swap in/out
nic speed/drops/errors
tcp queue, retransmit
o hardware level
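On Linux, the system-level TCP numbers listed above (retransmits, queue pressure) ultimately come from /proc; a minimal collector sketch (not the actual Zabbix item):

```python
def tcp_counters(path="/proc/net/snmp"):
    """Parse the kernel's TCP counter table into a dict, e.g.
    {'RetransSegs': ..., 'OutSegs': ..., 'InErrs': ...}."""
    with open(path) as fh:
        rows = [line.split() for line in fh if line.startswith("Tcp:")]
    header, values = rows[0][1:], rows[1][1:]
    return dict(zip(header, (int(v) for v in values)))

counters = tcp_counters()
print(counters["RetransSegs"], counters["OutSegs"])
```

A retransmit *rate* is the delta of RetransSegs over OutSegs between two samples; the absolute counters alone say little.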
21. Monitoring(cont.)
● Ideal
o near-real-time
o flexible intervals: 5s, 60s, 300s, 1800s
o comparable by date/time
o active/passive, or just feeds
● Dashboard (core metrics)
● Before
o Nagios/Munin (out of the box)
● Now
o Zabbix/Graphite
o NetworkBench, alibench (user end)
o New Relic
● Logs
o rsyslog
o ELK
o scripts
22. Tuning
● From app level to system level
● App level, not covered here
● System level, takeaways for common use
● Don’t forget hardware(BIOS, RAID)
● Baseline comes first
● One modification one time
● Never over-optimize
o “it works”, then “it runs happily”
o business driven
23. Tuning(cont.)
● Don’t modify kernel parameters unless 100% sure
o timestamp issue
o ECN issue
● TCP related
● Ring buffers, interrupts, open files, etc.
● DBs: watch out
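For reference, the kernel knobs behind those warnings (shown with 2.6.32-era defaults, not recommended values; verify each one against your workload before changing anything):

```
# /etc/sysctl.conf (reference only, defaults shown)
net.ipv4.tcp_timestamps = 1         # the "timestamp issue": combined with
                                    # tcp_tw_recycle this breaks clients behind NAT
net.ipv4.tcp_ecn = 2                # the "ECN issue": some middleboxes mishandle ECN
net.core.netdev_max_backlog = 1000  # input queue length before packet drops
fs.file-max = 1000000               # system-wide open-file limit (example value)
# Ring buffers are sized with `ethtool -G`, not sysctl.
```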
24. Documentation
● Routine
o regular deploys & setups, weekly reports
o online standards, 100+ slides for engineers
o ops sharing session every Thu
● Post-Mortem
o blameless
o timeline & deadline
● Github Wiki & Google Docs
25. Outage & Diagnosis
● This year(2014)
o SLA 99% ~ 99.9%
o issues every week, mostly invisible to customers
● When the site is down
o from bottom to top, or vice versa
o a good bug is one that can be reproduced
o tools are key
system http://goo.gl/wrNLi7
app
o inform support & BD
o technical background shares (http://blog.umeng.com/?cat=4)
● The network is unreliable, and it can break down
26. Security
● IP issues, a long long history
o public & private IPs
o ports restricted, listen()
o OOB
● test IDC
● UDP amplification
● Bash, SSL vulnerability
● DDoS
● whitehats (WooYun, etc.)
http://goo.gl/Q1SkXV
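The "ports restricted, listen()" point boils down to never binding services to 0.0.0.0; a minimal Python illustration (127.0.0.1 stands in for an internal/private address):

```python
import socket

# Bind explicitly to an internal address instead of all interfaces,
# so the service is unreachable from the public side regardless of
# firewall state. Port 0 lets the kernel pick a free port.
srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
srv.bind(("127.0.0.1", 0))
srv.listen(5)
host, port = srv.getsockname()
print(host, port)
srv.close()
```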
27. With Dev
● Tradeoff
o less work for devs usually means a more reliable system
o there will always be conflicts between ops & dev
unless one of them gives in
aggressive or mild, choose one
● Understand business logic
o code talks
o data talks
http://goo.gl/Qwh6Ze
28. What We Are Doing Now
● New IDCs, a new beginning, a great challenge
o active - backup
o active - active
● Transfer data from BJ to SH
● Env setup, stress test, benchmark
● Finally, switchover
http://goo.gl/TMDnnS
29. What We Are Doing Now(cont.)
● Private cloud
o CapEx & OpEx
o resources (hardware, software)
o workforce
push the product, fast growth
mentor other team members, help debug
---
Speaker notes (cleaned up from the transcript):

Things often got out of control in 2011.
10 am: event triggers; 10:30 pm: peak period.
Just like the app Nice; I attended their CTO's speech at ArchSummit.
~6B DAU events/day.
M&A by Alibaba: the IDC migration and related projects.
Team: 7 engineers, 4 PEs, 1 network engineer, 1 IT, 1 director.
Part-time IT before 2012; crossing the GFW.
The startup road is almost always the same.
2011-2012: not hot; from 2013: hot.
Growth was fast; now the acceleration is smaller, but the volume is still large.
Q4 2013: 2B DAU events; Q4 2014: 6B, 300% growth.
Zhihu Daily as a reference: 3M (Android), 2M installations; 1M-1.5M launches.
Roughly one IDC outage per month, so active-active is a must.
Bandwidth is limited.
Three network generations: 1st, 2011, simple; 2nd, 2012, fast; 3rd, Q2 2014.
The Nexus 7k/5k/2k rollout, Q1-Q2 2014: many challenges (covered on my blog); finally, it works.
Multiple uplinks.
Incoming traffic exceeds outgoing, different from the traditional web.
Servers: so many types, which creates operational problems.
10G NICs, LVS, NAT servers.
All DBs shipped with SSDs.
Power must be dual; power outages are frequent.
Few VMs running: first Xen, network problems; running LVS on Xen, WTF.
Thunder: migrated from Ruby to Scala.
SDK resend problem; SDK envelope, etc.
Kafka mirror, 600Mbps: scp-ing data from Finagle was not stable; the Kafka mirror works quite well.
HBase is buggy; Mongo resharding is acceptable and stable.
Front end: New Relic, code-level call traces.
No fancy technology.
GitHub and Red Hat are wonderful; have a try.
Jiankongbao, Jidiao.
Red Hat: nice docs, vmcore analysis.
All RAID 10, no deep reason, just for easy management.
Analyzing vmcore.
DNS is not reliable, even inside Alibaba; for Hadoop, use hosts files, but hosts files are not easy to maintain.
Puppet: a DSL. Salt: still young; the community is active but less ordered, and Google sometimes doesn't help.
Almost had an outage due to lack of metrics.
1.5K NVPS (Zabbix new values per second).
Kids: Redis sharding.
The OSS tools are better than Umeng's in-house equivalents.
The items referred to above must be tuned on high-load nodes.
Networking is critical: TCP retransmits, queue overflow.
Tons of small packets: drops begin around 100B packet size, ~200 kpps.
Disk, for Hadoop: noatime; ext3, not ext4.
4xx: ~0.4% (the mobile network is different); 5xx: ~0.005%; no 3xx.
Kernel panics were a reason for the OS switch; the 208.5-day panic.
The results were embarrassing: iDRAC hacked, DBs reachable from the public internet without auth, weak passwords.
DDoS affects us too; we are actually on the same uplink.
Instagram already migrated from AWS to Facebook's IDCs and spent a year preparing; for us, there are tens of problems to solve.
Push is the typical case; Thunder: some apps (Wannianli) spike traffic at 10 am; resend control.