3. BSC – Barcelona Supercomputing Center
23 years of research on computer architecture
• European Center for Parallelism of Barcelona (CEPBA)
• Based at the Polytechnical University of Catalonia (UPC)
Led by Mateo Valero
• Seymour Cray Award 2015, the first European to win it
• ACM Fellow, Eckert-Mauchly Award in 2007, Google award in 2009
Large research staff
• 1000+ publications
4. BSC – Barcelona Supercomputing Center
Many computational life sciences projects
• Computational Genomics
• Molecular modeling and bioinformatics
• Protein interactions and docking
• In place computational capabilities
• MareNostrum supercomputer
Research activity around Hadoop since 2008
• Data-centric research group:
http://www.bsc.es/computer-sciences/data-centric-computing
• SLA-driven scheduling (adaptive scheduler)
• Project ALOJA
6. Automated characterization of cost-effectiveness of Big Data deployments
Seeks to provide knowledge and tools that help users reduce the TCO of their infrastructures
About the project
7. What is the most effective configuration for my needs?
About the project
8. In ALOJA we have acquired extensive knowledge of the behavior of on-premise and IaaS Hadoop deployments
60k+ runs
Public repository
9. What is best for one workload is not best for all
Lessons learnt from IaaS
[Figure: Disks and network impact. Series: HDD-IB, SSD-ETH, HDD-ETH, SSD-IB]
[Figure: Local vs remote disks. Series: Local only; 1, 2 and 3 Remotes; 1, 2 and 3 Remotes with /tmp local]
11. Provides an automated setup of Big Data services (Hadoop, Spark, Hive, ...)
• Optimized for the underlying hardware
• Removes cost of installation
The service provider is in charge of maintenance
• Reduces TCO
• As with any cloud service, you pay as you go
Platform as a Service
12. O'Reilly conducted a survey on data science salaries and estimated an average salary of 140,000 US$ for a data engineer
Running an HDInsight cluster of 16 datanodes on A3 machines for a year costs:
• (16 datanodes + 2 headnodes) * 0.2384 US$/hr = 4.2912 US$/hr => 4.2912 * 24 * 365 = 37,590.912 US$/year
Hence, under ideal conditions we can save up to 102,409.088 US$ per year (140,000 - 37,590.912)
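The arithmetic on this slide can be reproduced with a short sketch. The function names are illustrative and not part of ALOJA; the comparison of one engineer's salary against the yearly cluster price is the slide's own simplification.

```python
HOURS_PER_YEAR = 24 * 365  # the slide uses a non-leap year


def yearly_cluster_cost(datanodes, headnodes, price_per_node_hour):
    """Yearly price of a cluster whose nodes all cost the same per hour."""
    return (datanodes + headnodes) * price_per_node_hour * HOURS_PER_YEAR


def potential_saving(engineer_salary, cluster_cost):
    """Upper bound on the saving, assuming PaaS replaces one full salary."""
    return engineer_salary - cluster_cost


# Figures taken from the slide: 16 datanodes + 2 headnodes at 0.2384 US$/hr.
cost = yearly_cluster_cost(16, 2, 0.2384)
saving = potential_saving(140_000, cost)
print(f"Cluster cost: {cost:,.3f} US$/year")
print(f"Potential saving: {saving:,.3f} US$/year")
```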
How much is spent on maintenance?
13. Some current solutions
• Azure HDInsight
• Rackspace CBD
• Amazon EMR
• Google Cloud Platform
Platform as a Service
14. Linux-based clusters of 4, 8 and 16 datanodes
• Azure HDInsight and Rackspace CBD
• Azure IaaS and Rackspace IaaS clusters as well
Clusters of up to 8 cores per node and 64 GB RAM
HDInsight: Azure Storage as HDFS (remote disks)
Rackspace CBD: nodes’ local disks as HDFS
Evaluation environment
15. Wordcount
• CPU-intensive: useful for analyzing how the nodes scale across VM sizes
Tested workloads
[Figure: CPU utilization breakdown (%user, %system, %steal, %iowait, %nice)]
16. Terasort
• Combined I/O and CPU loads, a de facto benchmark in the community
Tested workloads
Data sizes of 1, 10, 100 and 1000 GB
This is enough to stress the system and characterize its overall behavior
[Figure: CPU utilization breakdown (%user, %system, %steal, %iowait, %nice)]
17. Runs repeated several times
Cloud variability (100GB runs)
Benchmark   Provider        Standard deviation (%)
Terasort    HDInsight       60%
Terasort    Rackspace CBD   28%
Wordcount   HDInsight       55%
Wordcount   Rackspace CBD   47%
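The slides do not say how the percentage was computed. A minimal sketch, assuming it is the sample standard deviation of the execution times of repeated runs expressed relative to their mean, could look like this. The run times below are hypothetical and are not ALOJA measurements.

```python
from statistics import mean, stdev


def relative_std_percent(times):
    """Sample standard deviation as a percentage of the mean run time."""
    return 100.0 * stdev(times) / mean(times)


# Hypothetical execution times, in seconds, of one benchmark repeated 4 times.
example_runs = [100.0, 120.0, 80.0, 100.0]
print(f"Variability: {relative_std_percent(example_runs):.1f}%")
```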
18. Relevant factors tree
ALOJA-ML is a set of machine learning techniques and tools to estimate the behavior of executions in the unexplored search space
Relevant factors tree: a tool that identifies the parameters that most change an execution's behavior
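The slides do not describe how ALOJA-ML builds the tree. As an illustration of the general idea only, the sketch below ranks parameters by how much of the execution-time variance disappears when runs are grouped by that parameter, which is the usual split criterion of a regression tree. The runs, parameter names and functions are all hypothetical.

```python
from collections import defaultdict
from statistics import pvariance

# Hypothetical runs: configuration parameters plus execution time in seconds.
RUNS = [
    {"benchmark": "terasort", "datanodes": 4, "io_file_buffer": 131072, "exec_time": 1000.0},
    {"benchmark": "terasort", "datanodes": 4, "io_file_buffer": 262144, "exec_time": 990.0},
    {"benchmark": "terasort", "datanodes": 8, "io_file_buffer": 131072, "exec_time": 900.0},
    {"benchmark": "terasort", "datanodes": 8, "io_file_buffer": 262144, "exec_time": 890.0},
    {"benchmark": "wordcount", "datanodes": 4, "io_file_buffer": 131072, "exec_time": 400.0},
    {"benchmark": "wordcount", "datanodes": 4, "io_file_buffer": 262144, "exec_time": 390.0},
    {"benchmark": "wordcount", "datanodes": 8, "io_file_buffer": 131072, "exec_time": 300.0},
    {"benchmark": "wordcount", "datanodes": 8, "io_file_buffer": 262144, "exec_time": 290.0},
]


def variance_reduction(runs, param, target="exec_time"):
    """Variance of the target that is removed by grouping runs on one parameter."""
    values = [run[target] for run in runs]
    groups = defaultdict(list)
    for run in runs:
        groups[run[param]].append(run[target])
    # Weighted average of the variance remaining inside each group.
    within = sum(len(g) * pvariance(g) for g in groups.values()) / len(values)
    return pvariance(values) - within


def rank_factors(runs, params, target="exec_time"):
    """Order parameters from most to least influential on the target."""
    return sorted(params, key=lambda p: variance_reduction(runs, p, target), reverse=True)


ranking = rank_factors(RUNS, ["io_file_buffer", "datanodes", "benchmark"])
print(ranking)
```

A real tree would apply this ranking recursively, splitting on the top parameter and repeating inside each branch.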
19. Relevant factors tree
Resulting tree for PaaS executions
[Figure: relevant factors tree. Node labels as extracted: IOFileBuffer=131072, Datasize, Benchmark=Terasort, Replication, Benchmark=wordcount, Datanodes, IOFileBuffer=262144, Datasize]
32. Cost difference between IaaS and PaaS
Provider        VM size             IaaS US$/h   PaaS US$/h
Azure/HDI       4 CPU, 7 GB RAM     $0.176/h     $0.32/h
Azure/HDI       8 CPU, 15 GB RAM    $0.352/h     $0.64/h
Rackspace/CBD   4 vCPU, 15 GB RAM   $0.555/h     $0.7925/h
Rackspace/CBD   8 vCPU, 30 GB RAM   $1.11/h      $2.776/h
Amazon/EMR      4 vCPU, 16 GB RAM   $0.239/h     $0.299/h
Amazon/EMR      8 vCPU, 32 GB RAM   $0.479/h     $0.599/h
IaaS is cheaper, but might increase TCO (maintenance on your own!)
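To compare providers it can help to express the PaaS price as a percentage premium over the IaaS price of the same VM size. The sketch below does so for the prices in the table above; the helper name and the data layout are illustrative.

```python
# (IaaS US$/h, PaaS US$/h) per provider and VM size, copied from the table.
PRICES = {
    ("Azure/HDI", "4 CPU, 7 GB RAM"): (0.176, 0.32),
    ("Azure/HDI", "8 CPU, 15 GB RAM"): (0.352, 0.64),
    ("Rackspace/CBD", "4 vCPU, 15 GB RAM"): (0.555, 0.7925),
    ("Rackspace/CBD", "8 vCPU, 30 GB RAM"): (1.11, 2.776),
    ("Amazon/EMR", "4 vCPU, 16 GB RAM"): (0.239, 0.299),
    ("Amazon/EMR", "8 vCPU, 32 GB RAM"): (0.479, 0.599),
}


def paas_premium_percent(iaas_price, paas_price):
    """Extra hourly cost of PaaS relative to IaaS, as a percentage."""
    return 100.0 * (paas_price - iaas_price) / iaas_price


for (provider, vm_size), (iaas, paas) in PRICES.items():
    premium = paas_premium_percent(iaas, paas)
    print(f"{provider:14s} {vm_size:18s} +{premium:.1f}%")
```

The premium covers only the hourly price; it says nothing about the maintenance effort that IaaS leaves to the user.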
33. Conclusions
The provider is not really a significant factor
In the public cloud, large data sizes or large clusters introduce problems
• A larger cluster may improve performance but be more expensive in the end
PaaS allows you to save on maintenance
• But you still have to take care of tuning a bit
• Not as much as on IaaS
• Whether it is cheaper than IaaS depends on your business