Towards Scalable Service Composition on Multicores
1. Towards Scalable Service Composition on Multicores
Daniele Bonetta, Achille Peternier, Cesare Pautasso, Walter Binder
Faculty of Informatics
University of Lugano - USI
Switzerland
http://sosoa.inf.usi.ch
daniele.bonetta@usi.ch
3. Composition Engines
Focus: Service Composition Runtime Execution Environment
[Figure: a Client, a Composite Web Service, the Service Composition Engine, and three Web Services]
12. Scalability on Multicores
On top of today’s heterogeneous hardware
• Different number of cores
• Different type of cores (SMT = n)
• Different chip memory layouts (cache levels, cache size, NUMA)
13. Engine Architecture
Run a large number of concurrent compositions with a limited number of execution threads
[Figure: engine components: Request Handler, Kernel, Invoker]
27. Our Proposal
• Replicate the architecture instead of just increasing the number of threads
• Bind threads to specific affinity groups
• Distribute resources (memory/threads) among replicas proportionally to hardware resources and number of replicas
Topology-Aware deployment
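The proportional distribution in the last bullet can be sketched as follows. This is only an illustration, not the engine's actual code: `ReplicaBudget`, `distribute`, and the choice of "number of cores in the affinity group" as the measure of hardware resources are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class ReplicaBudget:
    cores: tuple      # affinity group assigned to this replica (hypothetical)
    threads: int      # share of the global thread budget
    memory_mb: int    # share of the global memory budget


def distribute(groups, total_threads, total_memory_mb):
    """Split thread and memory budgets across replicas.

    Each replica receives a share proportional to the number of cores
    in its affinity group (an assumed proxy for hardware resources).
    """
    total_cores = sum(len(g) for g in groups)
    budgets = []
    for g in groups:
        threads = max(1, total_threads * len(g) // total_cores)
        memory = total_memory_mb * len(g) // total_cores
        budgets.append(ReplicaBudget(tuple(g), threads, memory))
    return budgets


# Two replicas on a 4-core machine, sharing 8 threads and 4096 MB.
plan = distribute([[0, 1], [2, 3]], total_threads=8, total_memory_mb=4096)
```

With two equally sized groups, each replica gets half of both budgets; with uneven groups, the larger group receives a correspondingly larger share.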
29. Single Instance
This baseline deployment lets the OS thread scheduler map the engine threads onto all cores
[Figure: 4 cores (C1-C4), 4 L1 caches, 2 L2 caches; one Engine Instance (8 threads)]
30. Two instances
The threads of each instance are bound to specific cores
[Figure: 4 cores (C1-C4), 4 L1 caches, 2 L2 caches; Instance #1 (4 threads) and Instance #2 (4 threads)]
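Binding to specific cores can be illustrated with a small sketch. It assumes a Linux host, where Python exposes `os.sched_setaffinity`; the helper name `bind_to_cores` is hypothetical, and the slides do not say which mechanism the engine itself uses.

```python
import os


def bind_to_cores(cores):
    """Try to pin the caller to the given core ids; return True on success.

    Assumption: Linux, where pid 0 in sched_setaffinity refers to the caller.
    On platforms without this call the function simply reports failure.
    """
    if not hasattr(os, "sched_setaffinity"):
        return False
    # Only request cores the process is currently allowed to run on.
    target = set(cores) & os.sched_getaffinity(0)
    if not target:
        return False
    try:
        os.sched_setaffinity(0, target)
    except OSError:
        return False
    return True
```

In a two-instance deployment on four cores, one instance would call this with its own pair of cores (for example `{0, 1}`) before starting its worker threads, and the other with the remaining pair.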
31. Hardware Awareness
Self-configuration at startup:
1. Gather hardware topology information:
• #cores, #caches, #cache-levels, ...
2. Replicate the engine architecture:
• One instance per last-level shared cache
• Configure the thread pool sizes
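A minimal sketch of the two self-configuration steps, under stated assumptions: the topology is passed in as a plain mapping from core id to last-level-cache id (on Linux such information could be derived from sysfs, but that step is left out here), and `threads_per_core` is a hypothetical tuning parameter rather than a value from the slides.

```python
from collections import defaultdict


def instances_per_llc(core_to_llc, threads_per_core=2):
    """Plan one engine instance per last-level cache (LLC).

    core_to_llc: dict mapping core id -> id of the LLC that core shares.
    Returns one entry per LLC with its cores and a thread pool size.
    """
    groups = defaultdict(list)
    for core, llc in sorted(core_to_llc.items()):
        groups[llc].append(core)
    return [
        {"llc": llc, "cores": cores, "pool_size": threads_per_core * len(cores)}
        for llc, cores in sorted(groups.items())
    ]


# Four cores, two shared last-level caches (assumed layout for illustration).
plan = instances_per_llc({0: 0, 1: 0, 2: 1, 3: 1})
```

Here the plan contains two instances, one per shared cache, each covering two cores with a pool of four threads.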
34. Experimental Results
Fixing the total number of threads to the optimal value

# of Replicas   Request Handler   Kernel   Invoker   Total
1               12                12       12        36
2               6                 6        6         36
6               2                 2        2         36
12              1                 1        1         36
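The thread counts on this slide are consistent with splitting a fixed total of 36 threads evenly across replicas and then across the three components (Request Handler, Kernel, Invoker). The check below is only a worked arithmetic sketch; `pool_size` is a hypothetical helper, and the even split is an assumption inferred from the numbers.

```python
def pool_size(total_threads, replicas, components=3):
    """Threads per component per replica, assuming an even split."""
    per_replica, rem = divmod(total_threads, replicas)
    if rem:
        raise ValueError("total_threads must divide evenly among replicas")
    size, rem = divmod(per_replica, components)
    if rem:
        raise ValueError("per-replica threads must divide evenly among components")
    return size


# Replica counts used in the experiment, with a fixed total of 36 threads.
sizes = {r: pool_size(36, r) for r in (1, 2, 6, 12)}
```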
35. Experimental Results
2 x AMD Barcelona 6-core processors with 2 LLC
Scalability (Throughput up to 3300 clients)
[Figure: throughput (instances/sec, axis 300 to 2100) versus number of clients (300 to 3300) for 1, 2, 6, and 12 replicas]
36. Experimental Results
4 x Intel Xeon 4-core processors with 4 LLC
Relative Speedup at saturation
[Figure: relative speedup for 1, 2, 3, and 4 replicas]
37. Conclusion
• Targeting different multicores with the same engine architecture is a challenging issue
• Simply increasing the number of threads is not always the optimal approach
• Scalability depends on how a limited number of threads is mapped to the cores
• Hardware-aware deployment can improve performance by up to 30%