8. Problems solved last year
Low Latency
- < 50ms for EVERY request
High Concurrency
- 8k+ QPS, increasing
Linearly Scalable
- High growth rate: 350% YoY
Targeting
- Big Data / Data Mining …
Low (or NO) Budget
- What The F …
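The ad-spend field must be accurate in real time,
otherwise we take a loss.
PS: Based on last year's figures, each second of delay in the
system's cost field loses roughly NT$300.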
9. Architecture shared last year
[Diagram: haproxy, Tomcat Application servers (…), haproxy, Infinispan Dist. Cluster (Node A … Node N), Infinispan Repl. Cluster (Node A … Node N)]
Application Server: 15 nodes
Infinispan: 5 nodes, stores 100M K/V
Capacity: 8k+ QPS
10. New Challenges
Global business
- Buy and Sell in different countries
- Multiple DC / Hybrid cloud infrastructure
More business coverage
- Self-operating Ad network
- + Turnkey solution
Much higher capacity requirements
- Data from all over the world
- 100k+ QPS, latency <50ms
11. Data Analysis
[Diagram: data analysis platform. Components shown:]
- Vpon AD services backend functions: DSP / AdN Platform, Bidding Engine, Pricing Engine, Recommender System, other ongoing topics
- Traffic: Advertisers, Ad Requests (HTTP POST), Ad videos and images (HTTP GET), HA Proxy, Web Proxy + Cache, CDN
- Message Routing / Streaming Processing (Avro, Rsync)
- Data Processing and Archiving: MapReduce/Spark, HBase, Hadoop Distributed File System (HDFS), User Profiles (Data Store)
- Processing tools: Python + pig, hive, Hadoop Streaming, Spark
- AD management: Report UI (Django, RoR, SSH), creatives and videos
- Reporting system: Sales Support System, ad-hoc reporting, scenario modeling
- Operation: Ganglia, Solr, AD Operation, AD Monitoring System
12. New Architecture
Asynchronous in design
Move computing to data
Cache in actors in every node
- Reduce data access
- No more cache consistency problems
- Good for troubleshooting
- Less maintenance cost
Remove hotspots by distributing tasks to actors
Flexible resource management
- Shut down server instances during off-peak hours
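
The "cache in actors" idea can be illustrated with a minimal sketch. It is written in Python with asyncio rather than Scala/Akka, and every name in it (`ProfileActor`, `fake_store`, `demo`) is hypothetical, not taken from the talk. Each actor owns a mailbox and a private cache and handles one message at a time, so no other code can see or invalidate its cached data.

```python
import asyncio


class ProfileActor:
    """Toy actor: one mailbox, one worker loop, one private cache."""

    def __init__(self, load_profile):
        self._mailbox = asyncio.Queue()
        self._cache = {}                    # touched only by run()
        self._load_profile = load_profile   # hypothetical slow data-store lookup
        self.loads = 0                      # how many times the store was hit

    async def run(self):
        # Messages are handled strictly one at a time, so the cache needs no locks.
        while True:
            user_id, reply = await self._mailbox.get()
            if user_id is None:             # stop signal
                break
            if user_id not in self._cache:
                self._cache[user_id] = await self._load_profile(user_id)
                self.loads += 1
            reply.set_result(self._cache[user_id])

    async def ask(self, user_id):
        # Send a message and wait asynchronously for the answer.
        reply = asyncio.get_running_loop().create_future()
        await self._mailbox.put((user_id, reply))
        return await reply

    async def stop(self):
        await self._mailbox.put((None, None))


async def fake_store(user_id):
    # Stand-in for a remote data store (assumption for this sketch).
    await asyncio.sleep(0)
    return {"user": user_id, "segment": "demo"}


async def demo():
    actor = ProfileActor(fake_store)
    worker = asyncio.create_task(actor.run())
    results = await asyncio.gather(*(actor.ask("u1") for _ in range(3)))
    await actor.stop()
    await worker
    return results, actor.loads
```

Running `asyncio.run(demo())` sends three requests for the same user. Only the first reaches the store, and the other two are answered from the actor's own cache.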
13. Move Computing to Data
Moving data takes tremendous time
- Remember we have less than 30ms
Accessing AD decision data takes time
Heavy loading on DataStore / Cache
[Diagram: haproxy, Tomcat App servers (…), haproxy, Infinispan Dist. Cluster (Node A … Node N), Infinispan Repl. Cluster (Node A … Node N)]
User Profile = 2kB
QPS = 100k
At least 1.56Gb/s
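
The bandwidth figure above can be reproduced with a short calculation. The unit conversions are an assumption chosen because they give exactly the slide's 1.56 Gb/s (1024 kb per Mb, then 1000 Mb per Gb). With purely decimal units the result is 1.6 Gb/s, so the conclusion is the same either way. It counts only user-profile payloads and ignores protocol overhead.

```python
PROFILE_KB = 2      # from the slide: User Profile = 2 kB
QPS = 100_000       # from the slide: QPS = 100k


def profile_traffic_gbps(profile_kb, qps):
    """Network traffic needed to ship one profile per request, in Gb/s."""
    kilobits_per_s = profile_kb * 8 * qps      # kB -> kb, times requests per second
    megabits_per_s = kilobits_per_s / 1024     # assumption: 1 Mb = 1024 kb
    return megabits_per_s / 1000               # assumption: 1 Gb = 1000 Mb


traffic = profile_traffic_gbps(PROFILE_KB, QPS)
print(f"{traffic:.2f} Gb/s")
```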
14. Why Scala and Akka
Scala
- Functional
- Great for concurrency
- Reuse legacy Java code
Akka
- Actor Pattern
- Asynchronous and Distributed by design.
- Form cluster
17. Preliminary Benchmark
Agent Cluster (Not yet optimized)
AWS
- Akka Node: r3.xlarge (8 vCPU, 61G RAM) * 3
- Data Store: i2.xlarge (8 vCPU, 61G RAM, 1600GB SSD) * 3
Results
- 15,321 QPS (~10x compared to the old architecture)
- 23ms average processing time
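
One possible reading of the "~10x" figure is throughput per application node. This is an assumption, since the slides do not say how the ratio was computed. The sketch compares the old setup (8k+ QPS on 15 application servers, taken here as 8,000) with the benchmark's 3 Akka nodes, and ignores the data-store nodes and any hardware differences.

```python
OLD_QPS, OLD_APP_NODES = 8_000, 15      # slide 9: 8k+ QPS, 15 application servers
NEW_QPS, NEW_AKKA_NODES = 15_321, 3     # slide 17: 15,321 QPS, 3 Akka nodes


def per_node_speedup(old_qps, old_nodes, new_qps, new_nodes):
    """Ratio of per-node throughput, new architecture over old."""
    return (new_qps / new_nodes) / (old_qps / old_nodes)


speedup = per_node_speedup(OLD_QPS, OLD_APP_NODES, NEW_QPS, NEW_AKKA_NODES)
print(f"per-node speedup: {speedup:.1f}x")
```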
18. Lessons Learned
Maximize server capacity with asynchronous processing
- Asynchronous is never easy
- It takes time to learn the correct practice
Move computing to data
- Accessing data is expensive
Reduce hotspots / bottlenecks by separating tasks
into different actors on different nodes
- Dispatch tasks by consistent hashing
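
Dispatching by consistent hashing can be sketched as a hash ring. This is a generic illustration, not the implementation used in the talk. The class name, the node names, and the choice of SHA-256 with 64 virtual nodes per server are all assumptions.

```python
import bisect
import hashlib


class ConsistentHashRing:
    """Maps keys to nodes so that removing a node only remaps that node's keys."""

    def __init__(self, nodes, vnodes=64):
        self._vnodes = vnodes   # virtual points per node, to smooth the distribution
        self._ring = []         # sorted list of (hash, node)
        for node in nodes:
            self.add(node)

    @staticmethod
    def _hash(key):
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    def add(self, node):
        for i in range(self._vnodes):
            bisect.insort(self._ring, (self._hash(f"{node}#{i}"), node))

    def remove(self, node):
        self._ring = [(h, n) for h, n in self._ring if n != node]

    def node_for(self, key):
        if not self._ring:
            raise ValueError("ring is empty")
        # First ring point clockwise from the key's hash, wrapping around at the end.
        idx = bisect.bisect_left(self._ring, (self._hash(key), ""))
        return self._ring[idx % len(self._ring)][1]


ring = ConsistentHashRing(["node-a", "node-b", "node-c"])
owner = ring.node_for("user-42")   # the same key always lands on the same node
```

Because a given user ID always routes to the same node, the actor holding that user's cached profile receives all of that user's requests.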
19. We need
Scala, Java, Python, RoR, Hadoop, Spark, Docker, Operation experts