Self-tuning data centers aim to minimize human intervention through machine learning techniques. Current challenges include meeting service level agreements for performance and uptime while maximizing efficiency of resources and minimizing costs. A self-tuning architecture uses monitoring data to detect issues and make recommendations for scaling, migration, or tuning of resources without human input. This approach aims to optimize data centers so they can scale efficiently to support growing workloads and applications.
3. Business Challenge:
Software Defined Business
► Software Defined Transportation, Software Defined Video Streaming, Software Defined Leasing/Hostelling, Software Defined Data Centers
[Figure: tiers shown are the Control, Management and Analytics Tier and the Resource Pool Tier]
R&D
4. Engineering Challenge:
Big Data Problem
[Figure labels: Resource/Supply Providing; Monitoring, Analytics, Management; Prediction/Optimization; Customer/Service User; Control and Management Tier; Resource Pool Tier]
A large amount of data and metadata is generated.
6. Applications Spectrum
Applications: Self-Driving Cars, Robotic/AI Applications, Video Streaming/IoT, …
These applications will be fully or partially supported by Data Center Services (Cloud-Based).
Data center resources and services:
► Computing (CPU, GPU, DSP, FPGA, ...)
► Storage (DRAM, SSD, HDD, ...)
► Network (Wired, Wi-Fi, 4G, …)
► Data Management Systems
7. Typical Data Center Architecture
As a simple rule of thumb, enterprise data center size:
► 100 Hosts
► 1000 VMs
► Logs: ~40 GB/Day
[Figure: Data Center Management sits above Host 1, Host 2, ..., Host n. Apps are running on VMs (VM-1-1 ... VM-1-k on Host 1, VM-2-1 ... VM-2-m on Host 2, VM-n-1 ... VM-n-l on Host n). Logs from each host go to a Storage Pool used for Big Data Engineering and Science.]
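The rule-of-thumb figures above imply some per-VM and yearly log volumes that are easy to derive. The following is a minimal sketch of that arithmetic; the function name `log_volume` and the use of decimal units (1 GB = 1000 MB) are assumptions made for illustration, not part of the original slides.

```python
# Rule-of-thumb figures quoted on the slide.
HOSTS = 100
VMS = 1000
LOGS_GB_PER_DAY = 40


def log_volume(hosts=HOSTS, vms=VMS, gb_per_day=LOGS_GB_PER_DAY):
    """Derive per-VM, per-host, and yearly log volumes (decimal units)."""
    return {
        "vms_per_host": vms / hosts,
        "mb_per_vm_per_day": gb_per_day * 1000 / vms,
        "gb_per_host_per_day": gb_per_day / hosts,
        "tb_per_year": gb_per_day * 365 / 1000,
    }


if __name__ == "__main__":
    for name, value in log_volume().items():
        print(f"{name}: {value:g}")
```

Under these assumptions each VM produces about 40 MB of logs per day, and the whole data center accumulates about 14.6 TB per year.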
9. Challenges in Data Center Management
Service Level Agreement (SLA) :
Throughput/Latency (e-commerce applications):
► US e-commerce sales were $304 billion in 2014, up 15.4% year over year [1],
► Every 100 ms of added latency costs a 1% decrease in sales [3],
► Pages should load in under 2 seconds so as not to lose customers;
slower loading decreases overall sales by 7% [2],
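To make the latency figure concrete, here is a minimal sketch that turns "100 ms costs 1% of sales" into an estimated loss. It assumes the effect scales linearly with added latency, which the slide does not claim; the function name and the example store are hypothetical.

```python
def estimated_sales_loss(annual_sales, added_latency_ms, loss_per_100ms=0.01):
    """Estimate yearly sales lost to added latency.

    Assumption: the 1%-per-100-ms figure scales linearly, which is only
    a rough approximation for small latency increases.
    """
    if added_latency_ms < 0:
        raise ValueError("added_latency_ms must be non-negative")
    return annual_sales * loss_per_100ms * (added_latency_ms / 100)


if __name__ == "__main__":
    # Hypothetical store with $10 million in yearly sales and 300 ms extra latency.
    loss = estimated_sales_loss(10_000_000, 300)
    print(f"Estimated yearly loss: ${loss:,.0f}")
```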
Availability and Fault Tolerance :
► Example: Huawei public cloud, 99.999% availability [4], which allows downtime of:
Daily: 0.9s
Weekly: 6.0s
Monthly: 26.3s
Yearly: 5m 15.6s
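The downtime budget follows directly from the availability percentage: allowed downtime = period length x (1 - availability). The sketch below assumes a 365.25-day year and a month of one twelfth of a year; the function and constant names are illustrative. With these assumptions, 99.999% reproduces the daily, weekly, monthly, and yearly figures listed above, and each additional nine divides the budget by ten.

```python
# Period lengths in seconds (assumes a 365.25-day year).
PERIOD_SECONDS = {
    "daily": 24 * 3600,
    "weekly": 7 * 24 * 3600,
    "monthly": 365.25 * 24 * 3600 / 12,
    "yearly": 365.25 * 24 * 3600,
}


def allowed_downtime(availability_percent):
    """Return the allowed downtime in seconds for each period."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100")
    unavailable = 1 - availability_percent / 100
    return {name: secs * unavailable for name, secs in PERIOD_SECONDS.items()}


if __name__ == "__main__":
    for nines in (99.9, 99.99, 99.999, 99.9999):
        budget = allowed_downtime(nines)
        summary = ", ".join(f"{k}: {v:.2f}s" for k, v in budget.items())
        print(f"{nines}% -> {summary}")
```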
Scalable and Elastic (on Demand) :
► The data center should know when and how to scale to satisfy the SLA dynamically,
Security and Privacy :
► Should guarantee data privacy (e.g., medical data, financial data, …) and
security against attacks, data ownership, …
10. Data Center Energy Efficiency and Resource Utilization :
► By 2020, a 30% reduction in energy cost is called for by
European law (Green DC) [5],
► US data centers consume ~90 billion kilowatt-hours annually, equivalent to
powering New York households for two years,
► They emit over 150 million tons of carbon yearly in the USA [5],
► ~90 percent of VMs utilize < 15% of their assigned cores [9],
► ~90 percent of VMs have < 10 IOPS [9],
► The average server runs at 12% to 18% of its capacity most of the time
while still drawing 30% to 60% of its maximum power [6,7].
► Higher utilization -> lower power consumption -> lower carbon footprint
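The link between utilization and power in the energy bullets can be illustrated with a simple model. The sketch below assumes server power grows linearly from an idle share to full power, a common simplification that the slides do not state, and fits that line to one observed point (for example, 15% utilization drawing 45% of maximum power, both inside the quoted ranges). All function names are hypothetical.

```python
def idle_fraction(utilization, observed_power):
    """Solve the linear power model for the idle share of maximum power.

    Model (an assumption): power(u) = idle + (1 - idle) * u, with u in [0, 1).
    """
    return (observed_power - utilization) / (1 - utilization)


def power_fraction(utilization, idle):
    """Power drawn at a given utilization, as a share of maximum power."""
    return idle + (1 - idle) * utilization


def consolidation_saving(n_servers, utilization, observed_power):
    """Share of power saved by packing n lightly loaded servers onto one."""
    target = n_servers * utilization
    if target > 1:
        raise ValueError("combined load exceeds one server's capacity")
    idle = idle_fraction(utilization, observed_power)
    before = n_servers * observed_power
    after = power_fraction(target, idle)
    return 1 - after / before


if __name__ == "__main__":
    saving = consolidation_saving(4, 0.15, 0.45)
    print(f"Power saved by consolidating 4 servers: {saving:.1%}")
```

Under this model, moving the load of four servers at 15% utilization onto one server at 60% utilization saves roughly 59% of the power, which is the "higher utilization, lower power" effect in numeric form.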
11. Software Compliance and License :
► ~$500,000 is spent on software licensing for an average-size data center,
► Licenses may be priced per User/Device/VM/Core/…
► Different license models and policies exist, such as [8]:
1) Running licensed workload on bare metal (no virtualization),
2) Running licensed workload on dedicated cluster,
3) Migrate licensed workload,
4) …
► Workload and cluster growth make software licensing harder to manage,
► The challenge is how to minimize the cost of software in data
centers without violating license policy,
Dynamic Service Pricing :
► Computing, network and storage are utilities for workloads.
► Pricing should be modeled to find a dynamic policy that stays competitive
in the cloud provider market while increasing revenue.
12. Self-Tuning Data Center :
Simplified Service Architecture
Services: VM; Scheduling and Orchestrating Services and Resources; Real-time Log and Monitoring Service; Alert and Policy Service; Recommendation Service
Initial State
1) Request resource
2) Ask correct size, type and location for resource based on request
3) Correct conf and resource size and place
4) Allocate required resources
Operational and Recovery State
1) Telemetry and log sending
2) Query logs for policy and alert checks
3) Collected Data
4) Check for violation and warnings
5) Alert of Violation
6) Ask for Recommendation
7) Send Recommendations and Recipes
8) Apply Recommendation
Self-Tuning State
1) Ask Recommendation for Self-Tune (for example in low traffic state)
2) Send Tuning Plan and Recommendation (like VM migration or resizing)
3) Apply Self-Tuning recommendation
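The operational flow of this architecture (telemetry, policy check, alert, recommendation, apply) can be sketched as a small loop. This is an illustrative sketch only: the class names, thresholds, alert labels, and recipe strings are all hypothetical and are not taken from any Huawei service or API.

```python
from dataclasses import dataclass


@dataclass
class Telemetry:
    """One monitoring sample sent by a VM (hypothetical fields)."""
    vm: str
    cpu_utilization: float  # share of assigned cores in use, 0.0 to 1.0
    latency_ms: float


@dataclass
class Policy:
    """Hypothetical thresholds held by the Alert and Policy Service."""
    max_latency_ms: float
    min_cpu_utilization: float


def check_policies(sample, policy):
    """Alert and Policy Service: list violations and warnings for a sample."""
    alerts = []
    if sample.latency_ms > policy.max_latency_ms:
        alerts.append("sla_latency_violation")
    if sample.cpu_utilization < policy.min_cpu_utilization:
        alerts.append("underutilized")
    return alerts


def recommend(alert):
    """Recommendation Service: map an alert to a recipe (hypothetical names)."""
    recipes = {
        "sla_latency_violation": "scale_up_or_migrate",
        "underutilized": "resize_down",
    }
    return recipes[alert]


def control_loop(samples, policy):
    """Run telemetry through policy checks and collect actions to apply."""
    actions = []
    for sample in samples:
        for alert in check_policies(sample, policy):
            actions.append((sample.vm, recommend(alert)))
    return actions


if __name__ == "__main__":
    policy = Policy(max_latency_ms=2000, min_cpu_utilization=0.15)
    samples = [
        Telemetry("VM-1-1", cpu_utilization=0.70, latency_ms=2500),
        Telemetry("VM-2-1", cpu_utilization=0.05, latency_ms=120),
    ]
    for vm, action in control_loop(samples, policy):
        print(f"{vm}: {action}")
```

In a real deployment the orchestration service would apply each action; here the loop only returns them, so the decision logic can be tested without touching any resources.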
13. Huawei Position in Self-Tuning Data Center
► Huawei Cloud is growing quickly, with revenue increasing more than 50% year over year.
► Huawei launched its first public cloud outside China in Europe
(announced at CeBIT 2016) with 50,000 hosts.
► The Huawei R&D Storage Lab in the USA is working on an intelligent service to
address self-tuning data centers and provide solutions for Huawei
customers and their needs.
► Using and contributing ideas from/to the open source big data
community.
14. Conclusions and Future Directions
► The cloud-based ecosystem is the future of IT.
► A cloud data center is composed of different resources
to satisfy application requirements.
► Managing these resources is a complicated task that humans
cannot perform manually.
► Machines in data centers generate a large amount of logs, which
describe what happens in the data center.
► Data scientists and engineers are needed to study system
behavior and data center optimization.
► This will lead to next-generation data centers that are self-tuning
and need minimal human effort.
15. References
[1] U.S. Census Bureau News : http://www2.census.gov/retail/releases/historical/ecomm/14q4.pdf
[2] Akamai Newsroom : http://www.akamai.com/html/about/press/releases/2009/press_091409.html
[3] High Scalability Blog : http://highscalability.com/latency-everywhere-and-it-costs-you-sales-how-crush-it
[4] High Availability : https://en.wikipedia.org/wiki/High_availability
[5] European Commission on Renewable Energy : https://ec.europa.eu/energy/en/topics/renewable-energy
[6] NRDC Issue Brief : https://www.nrdc.org/sites/default/files/data-center-efficiency-assessment-IB.pdf
[7] NRDC Issue Paper : https://www.nrdc.org/sites/default/files/data-center-efficiency-assessment-IP.pdf
[8] Turbotonic white paper “Licensing, Compliance & Audits in the Cloud Era”, 2015.
[9] CloudPhysics, Global IT Data Lake Report, Q4 2016