The increasing complexity of available infrastructures (hierarchical, parallel, distributed, etc.) with specific features (caches, hyper-threading, dual core, etc.) makes it extremely difficult to build analytical models that allow for satisfactory predictions. This raises the question of how to validate algorithms and software systems when a realistic analytic study is not possible. As in many other sciences, one answer is experimental validation. However, such experimentation relies on the availability of an instrument able to validate every level of the software stack, offering a range of hardware and software facilities for compute, storage, and network resources.
Almost ten years after its inception, the Grid'5000 testbed has become one of the most complete testbeds for designing and evaluating large-scale distributed systems. Initially dedicated to the study of large HPC facilities, Grid'5000 has evolved to address wider concerns related to Desktop Computing, the Internet of Services and, more recently, the Cloud Computing paradigm. We now target new processor features such as hyperthreading, turbo boost, and power management, as well as large applications managing big data. This keynote addresses both the issue of experiments in HPC and computer science, and the design and usage of the Grid'5000 platform for various kinds of applications.
1. Grid'5000: Running a Large Instrument
for Parallel and Distributed Computing
Experiments
F. Desprez
INRIA Grenoble Rhône-Alpes, LIP ENS Lyon, Avalon Team
Joint work with G. Antoniu, Y. Georgiou, D. Glesser, A. Lebre, L. Lefèvre, M. Liroz, D. Margery,
L. Nussbaum, C. Perez, L. Pouillioux
F. Desprez - Cluster 2014 24/09/2014 - 1
2. “One could determine the different ages of a science by the technique of its
measurement instruments”
Gaston Bachelard
The Formation of the Scientific Mind
3. Agenda
• Experimental Computer Science
• Overview of GRID’5000
• GRID’5000 Experiments
• Related Platforms
• Conclusions and Open Challenges
5. The discipline of computing: an experimental science
The reality of computer science
- Information
- Computers, network, algorithms, programs, etc.
Studied objects (hardware, programs, data, protocols, algorithms, network)
are more and more complex
Modern infrastructures
• Processors have very nice features: caches, hyperthreading, multi-core
• Operating system impacts the performance
(process scheduling, socket implementation, etc.)
• The runtime environment plays a role
(MPICH ≠ OPENMPI)
• Middleware have an impact
• Various parallel architectures that can be
heterogeneous, hierarchical, distributed, dynamic
6. Experimental culture not comparable with other sciences
Different studies
• 1994: 400 papers
- Between 40% and 50% of ACM CS papers requiring experimental validation had none (15% in optical engineering) [Lukowicz et al.]
• 1998: 612 papers
- “Too many articles have no experimental validation” [Zelkowitz and Wallace 98]
• 2009 update
- Situation is improving
• 2007: Survey of simulators used in P2P research
- Most papers use an unspecified or custom simulator
Computer science is not at the same level as some other sciences
• Experiments are rarely redone
• Lack of tools and methodologies
Paul Lukowicz et al. Experimental Evaluation in Computer Science: A Quantitative Study. In: Journal of Systems and Software 28:9-18, 1994
M.V. Zelkowitz and D.R. Wallace. Experimental models for validating technology. Computer, 31(5):23-31, May 1998
Marvin V. Zelkowitz. An update to experimental models for validating computer technology. In: J. Syst. Softw. 82.3:373–376, Mar. 2009
S. Naicken et al. The state of peer-to-peer simulators and simulations. In: SIGCOMM Comput. Commun. Rev. 37.2:95–98, Mar. 2007
7. “Good experiments”
A good experiment should fulfill the following properties
• Reproducibility: must give the same result with the same input
• Extensibility: must allow comparisons with other works and extensions (more/other processors, larger data sets, different architectures)
• Applicability: must define realistic parameters and must allow for easy calibration
• “Revisability”: when an implementation does not perform as expected, must help identify the reasons
8. Analytic modeling
Purely analytical (mathematical) models
• Demonstration of properties (theorem)
• Models need to be tractable: over-simplification?
• Good to understand the basics of the problem
• Most of the time one still performs experiments (at least for comparison)
For a practical impact (especially in distributed computing):
analytic study not always possible or not sufficient
9. Experimental Validation
A good alternative to analytical validation
• Provides a comparison between algorithms and programs
• Provides a validation of the model or helps to define the validity
domain of the model
Several methodologies
• Simulation (SimGrid, NS, …)
• Emulation (MicroGrid, Distem, …)
• Benchmarking (NAS, SPEC, Linpack, ….)
• Real-scale (Grid’5000, FutureGrid, OpenCirrus, PlanetLab, …)
11. Grid’5000 Mission
Support high quality, reproducible experiments on a distributed system
testbed
Two areas of work
• Improve trustworthiness
• Testbed description
• Experiment description
• Control of experimental conditions
• Automate experiments
• Monitoring & measurement
• Improve scope and scale
• Handle large number of nodes
• Automate experiments
• Handle failures
• Monitoring and measurements
Both goals raise similar challenges
12. GRID’5000
• Testbed for research on distributed systems
• Born from the observation that we need a better and larger testbed
• High Performance Computing, Grids, Peer-to-peer systems, Cloud computing
• A complete access to the nodes’ hardware in an exclusive mode (from one
node to the whole infrastructure): Hardware as a service
• RIaaS: Real Infrastructure as a Service!?
• History, a community effort
• 2003: Project started (ACI GRID)
• 2005: Opened to users
• Funding
• INRIA, CNRS, and many local entities (regions, universities)
• One rule: only for research on distributed systems
• → no production usage
• Free nodes during daytime to prepare experiments
• Large-scale experiments during nights and week-ends
13. Current Status (Sept. 2014 data)
• 10 sites (1 outside France)
• 24 clusters
• 1006 nodes
• 8014 cores
• Diverse technologies
• Intel (65%), AMD (35%)
• CPUs from one to 12 cores
• Ethernet 1G, 10G
• Infiniband {S, D, Q}DR
• Two GPU clusters
• 2 Xeon Phi
• 2 data clusters (3-5 disks/node)
• More than 500 users per year
• Hardware renewed regularly
14. A Large Research Applicability
15. Backbone Network
Dedicated 10 Gbps backbone provided by RENATER (the French NREN)
Work in progress
• Packet-level and flow-level monitoring
16. Using GRID’5000: User’s Point of View
• Key tool: SSH
• Private network: connect through access machines
• Data storage:
- NFS (one server per GRID’5000 site)
- Datablock service (one per site)
- 100TB server in Rennes
17. GRID’5000 Software Stack
• Resource management: OAR
• System reconfiguration: Kadeploy
• Network isolation: KaVLAN
• Monitoring: Ganglia, Kaspied, Kwapi
• Putting it all together: the GRID’5000 API
18. Resource Management: OAR
Batch scheduler with specific features
• interactive jobs
• advance reservations
• powerful resource matching
• Resource hierarchy
• cluster / switch / node / cpu / core
• Properties
• memory size, disk type & size, hardware capabilities, network interfaces, …
• Other kind of resources: VLANs, IP ranges for virtualization
I want 1 core on 2 nodes of the same cluster with 4096 MB of memory and
Infiniband 10G, plus 1 CPU on 2 nodes of the same switch with dual-core processors,
for a walltime of 4 hours:
oarsub -I -l "{memnode=4096 and ib10g='YES'}/cluster=1/nodes=2/core=1
+ {cpucore=2}/switch=1/nodes=2/cpu=1,walltime=4:0:0"
19. Resource Management: OAR, Visualization
Resource status Gantt chart
20. Kadeploy – Scalable Cluster Deployment
• Provides a Hardware-as-a-Service Cloud infrastructure
• Built on top of PXE, DHCP, TFTP
• Scalable, efficient, reliable and flexible
• Chain-based and BitTorrent environment broadcast
• 255 nodes deployed in 7 minutes (latest scalability test 4000 nodes)
• Support of a broad range of systems (Linux, Xen, *BSD, etc.)
• Command-line interface & asynchronous interface (REST API)
• Similar to a cloud/virtualization provisioning tool (but on real machines)
• Choose a system stack and deploy it over GRID’5000 !
Deployment workflow:
1. Preparation: update PXE configuration, reboot nodes into the deployment environment, fdisk and mkfs
2. Environment installation: chained broadcast, image writing
3. Prepare boot of the deployed environment: install bootloader, update PXE and VLAN, reboot
kadeploy3.gforge.inria.fr
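The payoff of the chain-based broadcast can be seen with a toy pipeline model (our own sketch, not Kadeploy's implementation): each node forwards image chunks to the next node as soon as it receives them, so broadcast time grows with chunks plus nodes instead of chunks times nodes.

```python
# Toy model of chain-based image broadcast: node i streams each chunk to
# node i+1 as soon as it has received it (a pipeline). Illustrative only;
# the real tool moves data over TCP chains between deployment agents.

def chain_broadcast_steps(n_nodes: int, n_chunks: int) -> int:
    """Time steps for a pipelined chain: fill the pipe, then stream."""
    return (n_nodes - 1) + n_chunks

def naive_broadcast_steps(n_nodes: int, n_chunks: int) -> int:
    """Server sends the full image to each node, one node at a time."""
    return n_nodes * n_chunks

if __name__ == "__main__":
    nodes, chunks = 255, 1000  # e.g. the 255-node deployment above
    print("chain:", chain_broadcast_steps(nodes, chunks))  # 1254 steps
    print("naive:", naive_broadcast_steps(nodes, chunks))  # 255000 steps
```

The pipeline's near-independence from node count is what makes 255-node (and 4000-node) deployments practical.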
21. Kadeploy example – create a data cluster
Out of 8 nodes
• 2 × 10-core CPUs (Intel Xeon E5-2660 v2 @ 2.20 GHz)
• 1 × 10 Gb interface, 1 × 1 Gb interface
• 5 × 600 GB SAS disks
With Kadeploy3 and (shared) Puppet recipes
• 4 min 25 s to deploy a new base image on all nodes
• 5 min 9 s to configure them using Puppet
Any user can get a 17.6 TB Ceph data cluster with
• 32 OSDs
• 1280 MB/s read performance on the Ceph Storage Cluster (Rados bench)
• 517 MB/s write performance with default replication size = 2 (Rados bench)
• 1180 MB/s write performance without replication (Rados bench)
Playing with tweaked Hadoop deployments is popular
Work carried out by P. Morillon (https://github.com/pmorillon/grid5000-xp-ceph)
22. Network Isolation: KaVLAN
• Reconfigures switches for the duration of a user experiment to provide complete level-2 isolation
• Avoid network pollution (broadcast, unsolicited connections)
• Enable users to start their own DHCP servers
• Experiment on ethernet-based protocols
• Interconnect nodes with another testbed without compromising the security of
Grid'5000
• Relies on 802.1q (VLANs)
• Compatible with many network devices
• Can use SNMP, SSH or telnet to connect to switches
• Supports Cisco, HP, 3Com, Extreme Networks, and Brocade
• Controlled with a command-line client or a REST API
• Recent work
- support for several interfaces
25. Monitoring, Energy
Unified way to access energy logs for energy
profiling
• Energy consumption per resource
• Energy consumption per user jobs
• Energy measurement injection in user
26. Putting it all together: GRID’5000 API
• Individual services & command-line interfaces are painful
• REST API for each Grid'5000 service
• Reference API: versioned description of Grid'5000 resources
• Monitoring API: state of Grid'5000 resources
• Metrology API: access to data probes’ output (Ganglia, HDF5, …)
• Jobs API: OAR interface
• Deployments API: Kadeploy interface
• User API: managing the user base
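As a sketch of how such an API is typically consumed: resources are exposed as a tree of versioned JSON documents (sites → clusters → nodes). The base URL and path layout below follow the public Grid'5000 API conventions but should be treated as assumptions, and the JSON fragment is abridged and illustrative.

```python
import json

# Browsing the Reference API: build the URL of a node description and parse
# a fragment shaped like one. Endpoint layout is an assumption here.
BASE = "https://api.grid5000.fr/stable"

def node_url(site: str, cluster: str, node: str) -> str:
    """URL of a node's versioned JSON description."""
    return f"{BASE}/sites/{site}/clusters/{cluster}/nodes/{node}"

# Abridged, illustrative node description (real ones carry far more fields).
sample = json.loads("""
{"uid": "griffon-1",
 "architecture": {"nb_cores": 8},
 "network_adapters": [{"interface": "Ethernet", "rate": 1000000000}]}
""")

if __name__ == "__main__":
    print(node_url("nancy", "griffon", "griffon-1"))
    print("cores:", sample["architecture"]["nb_cores"])
```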
27. Description of an Experiment over Grid’5000
• Description and verification of the environment
• Reconfiguring the testbed to meet experimental needs
• Monitoring experiments, extracting and analyzing data
• Improving control and description of experiments
28. Description and verification of the environment
• Typical needs
• How can I find suitable resources for my experiment?
• How sure can I be that the actual resources will match their description?
• What was the hard drive on the nodes I used six months ago?
29. Description and selection of resources
• Describing resources → understanding results
• Detailed description on the Grid’5000 wiki
• Machine-parsable format (JSON)
• Archived (State of testbed 6 months ago?)
• Selecting resources
• OAR database filled from JSON
oarsub -p "wattmeter='YES' and gpu='YES'"
oarsub -l "{cluster='a'}/nodes=1+{cluster='b' and eth10g='Y'}/nodes=2,walltime=2"
30. Verification of resources
Inaccuracies in resource descriptions → dramatic consequences
• Mislead researchers into making false assumptions
• Generate wrong results → retracted publications!
• Happen frequently: maintenance, broken hardware (e.g. RAM)
• Our solution: g5k-checks
• Runs at node boot (can also be run manually by users)
• Retrieves current description of node in Reference API
• Acquires information on the node using OHAI, ethtool, etc.
• Compares with the Reference API
• Future work (maybe?)
• Verification of performance, not just availability and configuration of hardware
(hard drives, network, etc.)
• Provide tools to capture the state of the testbed → archival with the rest of the experiment's data
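The core of the check is a diff between what the Reference API promises and what the node actually reports. A minimal sketch, with hard-coded facts standing in for what the real tool gathers via OHAI, ethtool, etc. (field names are ours):

```python
# Minimal g5k-check-style verification: compare acquired node facts against
# the node's Reference API description and report every mismatch.

def diff_description(reference: dict, acquired: dict) -> dict:
    """Return {field: (expected, found)} for fields that do not match."""
    return {k: (v, acquired.get(k))
            for k, v in reference.items()
            if acquired.get(k) != v}

# Illustrative data: a node that lost a RAM stick since its last check.
reference = {"ram_gb": 64, "disk_model": "ST9500620NS", "eth_rate_gbps": 10}
acquired  = {"ram_gb": 48, "disk_model": "ST9500620NS", "eth_rate_gbps": 10}

if __name__ == "__main__":
    for key, (want, got) in diff_description(reference, acquired).items():
        print(f"MISMATCH {key}: reference says {want}, node reports {got}")
```

A node failing such a check at boot can be pulled out of the pool before it misleads an experiment.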
31. Reconfiguring the testbed
• Typical needs
• How can I install $SOFTWARE on my nodes?
• How can I add $PATCH to the kernel running on my nodes?
• Can I run a custom MPI to test my fault tolerance work?
• How can I experiment with that Cloud/Grid middleware?
• Likely answer on any production facility: you can’t
• Or: use virtual machines → experimental bias
• On Grid’5000
• Operating System reconfiguration with Kadeploy
• Customize networking environment with KaVLAN
32. Changing experimental conditions
• Reconfigure experimental conditions with Distem
• Introduce heterogeneity in a homogeneous cluster
• Emulate complex network topologies
http://distem.gforge.inria.fr/
• What else can we enable users to change?
• BIOS settings
• Power management settings
• CPU features (Hyperthreading, Turbo mode, etc.)
• Cooling system: temperature in the machine room?
33. Monitoring experiments
Goal: enable users to understand what happens during their experiment
(Screenshots: power consumption; CPU, memory, and disk usage; backbone and internal networks)
34. Exporting and analyzing data
• Unified access to monitoring tools through the Grid’5000 API
• Automatically export data during/after an experiment
• Current work: high resolution monitoring for energy & network
35. Controlling advanced CPU features
Modern processors have advanced options
• Hyperthreading (SMT): share execution resources of a core between 2 logical
processors
• Turbo Boost: stop some cores to increase the frequency of others
• C-states: put cores in different sleep modes
• P-states (aka SpeedStep): Give each core a target performance
These advanced options are set at a very low level (BIOS or kernel options)
• Do they have an impact on the performance of middleware?
• Do publications document this level of experimental setting?
• Does the community understand the possible impact?
• How can this be controlled/measured on a shared infrastructure?
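On a deployed node, some of these low-level toggles are visible through the Linux sysfs interface. The sketch below reads the SMT and Intel turbo controls; the paths exist on recent kernels but depend on hardware and drivers, so treat them as assumptions (and writing them requires root).

```python
# Observing (and, as root, toggling) two of the features discussed above
# via Linux sysfs. Paths depend on kernel version and CPU driver.

SMT_CONTROL = "/sys/devices/system/cpu/smt/control"           # "on"/"off"/...
NO_TURBO    = "/sys/devices/system/cpu/intel_pstate/no_turbo" # "1" = turbo off

def smt_enabled(raw: str) -> bool:
    """Interpret the smt/control file content."""
    return raw.strip() == "on"

def turbo_enabled(no_turbo_raw: str) -> bool:
    """Interpret intel_pstate/no_turbo (note the inverted logic)."""
    return no_turbo_raw.strip() == "0"

def read_sysfs(path: str) -> str:
    try:
        with open(path) as f:
            return f.read()
    except OSError:
        return ""  # not available on this hardware/kernel

if __name__ == "__main__":
    print("SMT enabled:  ", smt_enabled(read_sysfs(SMT_CONTROL)))
    print("Turbo enabled:", turbo_enabled(read_sysfs(NO_TURBO)))
```

Recording these values alongside experiment results is one cheap way to document the experimental setting the slide asks about.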
36. Modern CPU features and Grid’5000
On Grid’5000
• Recent management cards can control low-level features (in short, BIOS options) through XML descriptions (tested on Dell iDRAC7)
• Kadeploy's inner workings support
• Booting a deployment environment that can be used to apply changes to BIOS options
• User control over command-line options passed to kernels
Ongoing work
• Taking BIOS (or UEFI) descriptions as a new parameter at kadeploy3 level
• Restoring and documenting the standard state
39. Energy efficiency around Grid’5000
• IT: 2-5% of CO2 emissions, 10% of electricity
• Green IT: reducing the electrical consumption of IT equipment
• Future exascale platforms/datacenters: systems from 20 to 100 MW
• How to build such systems and make them (more) energy sustainable/responsible? Multi-dimensional approaches: hardware, software, usage
• Several activities around energy management in Grid'5000 since 2007
• 2007: first energy-efficiency considerations
• 2008: launch of Green activities
• 1st challenge: finding a first (real) powermeter
• Omegawatt, a French SME
• Deployment of multiple powermeters: Lyon, Toulouse, Grenoble
• 2010: increasing scalability: the Lyon Grid'5000 site fully energy-monitored, with more than 150 powermeters
40. Aggressive ON/OFF is not always the best solution
• Exploiting the gaps between activities
• Reducing the number of unused powered-on resources
• Only switching off when there are potential energy savings
Anne-Cécile Orgerie, Laurent Lefèvre, and Jean-Patrick Gelas. "Save Watts in your Grid: Green Strategies for Energy-Aware Framework in Large Scale Distributed Systems", ICPADS 2008: The 14th IEEE International Conference on Parallel and Distributed Systems, Melbourne, Australia, December 2008
41. To understand energy measurements: take care of your wattmeters!
Frequency / precision
M. Diouri, M. Dolz, O. Glück, L. Lefèvre, P. Alonso, S. Catalán, R. Mayo, E. Quintana-Ortí. Solving some Mysteries in Power Monitoring of Servers: Take Care of your Wattmeters!, EE-LSDS 2013: Energy Efficiency in Large Scale Distributed Systems conference, Vienna, Austria, April 22-24, 2013
42. Homogeneity (in energy consumption) does not exist!
• Depends on technology
• Same flops but not same flops per watt
• Idle / static cost
• CPU: the main contributor
• Green scheduler designers must incorporate
this issue !
Mohammed el Mehdi Diouri, Olivier Glück, Laurent Lefèvre and Jean-Christophe Mignot. "Your Cluster is not Power Homogeneous: Take Care when Designing Green Schedulers!", IGCC 2013: International Green Computing Conference, Arlington, USA, June 27-29, 2013
43. Improving energy management with application expertise
• Considered services: resilience & data broadcasting
• 4 steps: Service analysis, Measurements, Calibration,
Estimation
• Helping users make the right choices depending on context
and parameters
M. Diouri, Olivier Glück, Laurent Lefèvre, and Franck Cappello. "ECOFIT: A Framework to Estimate Energy Consumption of Fault Tolerance Protocols during HPC executions", CCGrid 2013.
44. Improving energy management without application expertise
• Irregular usage of resources
• Phase detection, characterisation
• Power saving modes deployment
Landry Tsafack, Laurent Lefevre, Jean-Marc Pierson, Patricia Stolf, and Georges Da Costa. "A runtime framework for energy efficient HPC systems
without a priori knowledge of applications", ICPADS 2012.
46. GRID’5000, Virtualization and Clouds
• Virtualization technologies as the building blocks of Cloud
Computing
• Highlighted as a major topic for Grid'5000 as early as 2008
• Two visions
• IaaS providers (investigate low-level concerns)
• Cloud end-users (investigate new algorithms/mechanisms on top of major cloud kits)
• Require additional tools and support from the Grid’5000 technical board
(IP allocations, VM Images management…)
Adding Virtualization Capabilities to the Grid'5000 Testbed, Balouek, D., Carpen-Amarie, A., Charrier, G., Desprez, F., Jeannot, E., Jeanvoine, E., Lèbre, A., Margery, D., Niclausse, N., Nussbaum, L., Richard, O., Perez, C., Quesnel, F., Rohr, C., and Sarzyniec, L., Cloud Computing and Services Science, #367 in Communications in Computer and Information Science series, pp. 3-20, Ivanov, I., Sinderen, M., Leymann, F., and Shan, T., Eds, Springer, 2013
47. GRID’5000, Virtualization and Clouds:
Dedicated tools
• Pre-built VMM images maintained by the technical staff
• Xen, KVM
• Possibility for end-users to run VMs directly on the reference environment (like
traditional IaaS infrastructures)
• Cloud kits: scripts to easily deploy and use Nimbus, OpenNebula, CloudStack, and OpenStack
• Network
• Needs a reservation scheme for VM addresses (both MAC and IP)
• MAC addresses randomly assigned
• Subnet ranges can be booked for IPs (/18, /19, …)
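The two address pools above can be sketched with the standard library: random MAC addresses drawn from a locally-administered prefix, and VM IPs taken in order from a booked subnet (a /18 here, as with g5k-subnet). Function names and the example range are ours.

```python
import ipaddress
import random

def random_vm_mac(rng: random.Random) -> str:
    """Random VM MAC under 52:54:00, the conventional KVM local prefix."""
    tail = [rng.randrange(256) for _ in range(3)]
    return "52:54:00:" + ":".join(f"{b:02x}" for b in tail)

def vm_ips(booked_subnet: str, count: int):
    """First `count` usable addresses of a booked subnet."""
    hosts = ipaddress.ip_network(booked_subnet).hosts()
    return [str(next(hosts)) for _ in range(count)]

if __name__ == "__main__":
    rng = random.Random(42)  # seeded for repeatability
    print(random_vm_mac(rng))
    print(vm_ips("10.8.0.0/18", 3))  # first three usable addresses
```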
48. GRID’5000, Virtualization and Clouds:
Sky computing use-case
• 2010 - A Nimbus federation over three sites
• Deployment process
• Reserve available nodes (using the Grid‘5000 REST API)
• Deploy the Nimbus image on all selected nodes
• Finalize the configuration (using the Chef utility)
• Allocate roles to nodes (service, repository),
• Generate necessary files (Root Certificate Authority, ...)
• 31 min later, simply use the clouds: 280 VMMs, 1600 virtual CPUs
Experiment renewed and extended to the FutureGrid testbed
49. GRID’5000, Virtualization and Clouds:
Sky computing use-case
Experiments between USA and France
• Nimbus (resource management, contextualization)/ViNe (connectivity)/Hadoop (task
distribution, fault-tolerance, dynamicity)
• FutureGrid (3 sites) and Grid’5000 (3 sites) platforms
• Optimization of creation and propagation of VMs
(Diagram: a MapReduce application on Hadoop, running over IaaS software on FutureGrid and Grid'5000 sites, interconnected by ViNe across the Grid'5000 firewall)
Large-Scale Cloud Computing Research: Sky Computing on FutureGrid and Grid’5000, by Pierre Riteau, Maurício Tsugawa, Andréa
Matsunaga, José Fortes and Kate Keahey, ERCIM News 83, Oct. 2010.
Credits: Pierre Riteau
(Diagram sites: UF, UC, Rennes, Lille, Sophia; a white-listed queue and ViNe virtual routers provide all-to-all connectivity)
50. GRID’5000, Virtualization and Clouds:
Dynamic VM placement use-case
2012 - Investigate issues related to preemptive scheduling of VMs
• Can a system handle VMs across a distributed infrastructure the way OSes manipulate processes on local nodes?
• Several proposals in the literature, but
• Few real experiments (simulation-based results)
• Scalability is usually a concern
• Can we perform several migrations between several nodes at the same time? How long do they take, and what is the impact on the VMs and on the network?
51. GRID’5000, Virtualization and Clouds:
Dynamic VM placement use-case
Deploy 10,240 VMs on 512 PMs
• Prepare the experiment
• Book resources
- 512 PMs with hardware virtualization
- A global VLAN
- A /18 IP range
• Deploy KVM images and put PMs in the global VLAN
• Launch/configure VMs
- A dedicated script leveraging the Taktuk utility to interact with each PM
- G5K-subnet to get the booked IPs and assign them to VMs
• Start the experiment and make publications!
(Map: the Grid'5000 sites involved: Rennes, Lille, Orsay, Reims, Luxembourg, Nancy, Lyon, Grenoble, Sophia, Toulouse, Bordeaux)
F. Quesnel, D. Balouek, and A. Lebre. Deploying and Scheduling Thousands of Virtual Machines on Hundreds of Nodes Distributed
Geographically. In IEEE International Scalable Computing Challenge (SCALE 2013) (colocated with CCGRID 2013), Netherlands, May 2013
52. GRID’5000, Virtualization and Clouds
• More than 198 publications
• Three times finalist of the IEEE Scale challenge
(2nd prize winner in 2013)
• Tomorrow: virtualization of network functions (Software-Defined Networking)
- Go one step beyond KaVLAN to guarantee, for instance, bandwidth expectations
- HiperCal/Emulab approaches: network virtualization performed by daemons running on dedicated nodes ⇒ does not bring additional capabilities (close to the Distem project)
- By reconfiguring routers/switches on demand ⇒ requires specific devices (OpenFlow-compliant)
- Which features should be exported to end-users? Are there any security concerns?
54. Riplay: A Tool to Replay HPC Workloads
• RJMS: Resource and Job Management System
• Manages resources and schedules jobs on high-performance clusters
• Most famous ones: Maui/Moab, OAR, PBS, SLURM
• Riplay
• Replays traces on a real RJMS in an emulated environment
• 2 RJMSs supported (OAR and SLURM)
• Jobs replaced by sleep commands
• Can replay a full workload or an interval of one
• On Grid'5000
• 630 emulated cores need 1 physical core to run
• Curie (ranked 26th in the latest Top500, 80640 cores)
• Curie's RJMS can be run on 128 Grid'5000 cores
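The job-to-sleep transformation at the heart of this replay can be sketched in a few lines (trace fields and names are illustrative; a real replay submits through OAR or SLURM rather than printing a schedule):

```python
# Toy version of workload replay: each job in an SWF-like trace is replaced
# by a sleep of its recorded runtime, submitted at its recorded arrival
# time. Optionally restrict the replay to an interval of the workload.

trace = [  # (job_id, submit_time_s, runtime_s, cores) -- illustrative
    (1, 0, 120, 64),
    (2, 30, 600, 512),
    (3, 45, 60, 8),
]

def to_sleep_jobs(trace, interval=None):
    """Turn trace entries into (submit_time, command) pairs."""
    lo, hi = interval if interval else (float("-inf"), float("inf"))
    return [(t, f"sleep {run}")  # payload replaced by an equivalent sleep
            for (_jid, t, run, _cores) in trace if lo <= t <= hi]

if __name__ == "__main__":
    for t, cmd in to_sleep_jobs(trace, interval=(0, 40)):
        print(f"t={t:>4}s submit: {cmd}")
```

Because the sleeps consume no CPU, thousands of emulated cores fit on one physical core, which is what makes replaying Curie-scale workloads on 128 cores feasible.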
55. Riplay: A Tool to Replay HPC Workloads
• Replay of the Curie petaflopic cluster workload on Grid'5000 using emulation techniques
• RJMS evaluation with different scheduling parameters
(Plots: SLURM with FavorSmall set to True vs. False)
56. Riplay: A Tool to Replay HPC Workloads
• Test RJMS scalability
• Without needing the actual cluster
• Test a fully loaded, huge cluster on an RJMS in minutes
(Plots: OAR before vs. after optimizations)
Large Scale Experimentation Methodology for Resource and Job Management Systems
on HPC Clusters, Joseph Emeras, David Glesser, Yiannis Georgiou and Olivier Richard
https://forge.imag.fr/projects/evalys-tools/
57. HPC Component Model: From Grid’5000 to Curie
SuperComputer
Issues
• Improve code re-use of HPC applications
• Simplify application portability
Objective
• Validate L2C, a low level HPC component model
• Jacobi1 and 3D FFT2 kernels
Roadmap
• Low scale validation on Grid’5000
• Overhead wrt “classical” HPC models
(Threads, MPI)
• Study of the impact of various node architecture
• Large scale validation on Curie
• Up to 2000 nodes, 8192 cores
(Diagram: L2C component assemblies combining "Iter + XY" computation components with MPI and Thread connectors)
1: J.Bigot, Z. Hou, C. Pérez, V. Pichon. A Low Level Component Model Easing Performance Portability of HPC Applications. Computing,
page 1–16, 2013, Springer Vienna
2: Ongoing work.
58. HPC Component Model: From Grid’5000 to Curie
SuperComputer
Jacobi: No L2C overhead in any version
3D FFT
(Plots: 256³ FFT, 1D decomposition, Edel+Genepi clusters (G5K); 1024³ FFT, 2D decomposition, Curie (thin nodes))
60. Big Data Management
• Growing demand from our institutes and users
• Grid’5000 has to evolve to be able to cope with such experiments
• Current status
• Home quotas have grown from 2 GB to 25 GB by default (and can be extended up to 200 GB)
• For larger datasets, users should leverage Storage5K (persistent datasets imported into Grid'5000 between experiments)
• DFS5K: deploy CephFS on demand
• What's next
• Current challenge: uploading datasets (from the Internet to G5K) and deploying data from the archive storage system to the working nodes
• New storage devices: SSD / NVRAM / …
• New booking approaches
• Allow end-users to book node partitions for longer periods than the usual reservations
61. Scalable Map-Reduce Processing
Goal: High-performance Map-Reduce processing through
concurrency-optimized data processing
Some results
Versioning-based concurrency management for
increased data throughput (BlobSeer approach)
Efficient intermediate data storage in pipelines
Substantial improvements with respect to Hadoop
Application to efficient VM deployment
Intensive, long-run experiments done on Grid'5000
Up to 300 nodes/500 cores
Plans: validation within the IBM environment with IBM
MapReduce Benchmarks
ANR Project Map-Reduce (ARPEGE, 2010-2014)
Partners: Inria (teams: KerData (leader), AVALON, Grand Large), Argonne National Lab, UIUC, JLPC, IBM, IBCP. mapreduce.inria.fr
62. Big Data in Post-Petascale HPC Systems
• Problem: simulations generate TB/min
• How to store and transfer data?
• How to analyze, visualize, and extract knowledge?
(Figure: 100,000+ cores producing petabytes of data, analyzed on ~10,000 cores)
63. Big Data in Post-Petascale HPC Systems
• Problem: simulations generate TB/min
• How to store and transfer data?
• How to analyze, visualize, and extract knowledge?
What is difficult?
• Too many files (e.g. Blue Waters: 100,000+ files/min)
• Too much data
• Unpredictable I/O performance
(Figure: 100,000+ cores producing petabytes of data; the ~10,000-core analysis side does not scale!)
64. Damaris: A Middleware-Level Approach to I/O
on Multicore HPC Systems
Idea: one dedicated I/O core per multicore node
Originality: shared memory, asynchronous processing
Implementation: software library
Applications: climate simulations (Blue Waters)
Preliminary experiments on Grid’5000
http://damaris.gforge.inria.fr/
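The dedicated-core interaction can be sketched with a thread and a shared queue: compute workers hand data off and return to computing immediately, while a dedicated consumer performs the writes. Damaris itself uses shared memory and a full core per node; this sketch only shows the shape of the hand-off.

```python
import queue
import threading

def io_worker(q: "queue.Queue", written: list):
    """Dedicated 'I/O core': drain the queue and write each item."""
    while True:
        item = q.get()
        if item is None:      # sentinel: no more data
            break
        written.append(item)  # stands in for the actual asynchronous write

if __name__ == "__main__":
    q, written = queue.Queue(), []
    io = threading.Thread(target=io_worker, args=(q, written))
    io.start()
    for step in range(5):          # "simulation" iterations
        q.put(f"snapshot-{step}")  # non-blocking hand-off; compute continues
    q.put(None)
    io.join()
    print(written)  # all snapshots written, in order
```

The key property is that `q.put` returns immediately, so I/O time overlaps computation instead of stalling every iteration.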
65. Damaris: Leveraging dedicated cores
for in situ visualization
(Plots: time partitioning (the traditional approach) vs. space partitioning (the Damaris approach), without and with visualization)
Experiments on Grid’5000 with Nek5000 (912 cores)
Using Damaris completely hides the performance impact of in situ visualization
66. Damaris: Impact of Interactivity
(Plots: time partitioning vs. space partitioning)
Experiments on Grid’5000 with Nek5000 (48 cores)
Using Damaris completely hides the run time impact of in situ visualization, even in
the presence of user interactivity
67. MapReduce framework improvement
Context: MapReduce job execution
Motivation
• Skew in reduce phase execution may cause high execution times
• Base problem: the size of reduce tasks cannot be defined a priori
- FPH approach: a modified MapReduce framework where
• Reduce input data is divided into fixed size splits
• Reduce phase is divided into two parts
Intermediate: multiple-iteration execution of fixed-size tasks
Final: final grouping of results
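The split step above can be sketched directly (sizes and names are illustrative): each reduce partition, however skewed, is cut into chunks of at most a fixed size, which become the intermediate-phase tasks.

```python
# Sketch of the FPH idea: instead of one (possibly skewed) reduce task per
# partition, cut reduce input into fixed-size splits processed over several
# intermediate iterations; a final pass then groups the partial results.

def fixed_size_splits(partition_sizes, split_size):
    """Cut each reduce partition into (partition_id, chunk_size) tasks."""
    tasks = []
    for pid, size in enumerate(partition_sizes):
        while size > 0:
            tasks.append((pid, min(size, split_size)))
            size -= split_size
    return tasks

if __name__ == "__main__":
    # A skewed reduce phase: one partition is 10x the others.
    sizes = [100, 100, 1000, 100]
    tasks = fixed_size_splits(sizes, split_size=100)
    print(len(tasks), "uniform tasks instead of 4 skewed ones")
```

With uniform task sizes, no single straggler reduce task dominates the phase's completion time.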
68. Results
(Plots: response time vs. data size, skew, query, minimum split size, and cluster size)
69. RELATED PLATFORMS
Credit: Chris Coleman, School of Computing, University of Utah
70. Related Platforms
• PlanetLab
• 1074 nodes over 496 sites world-wide, slices allocation: virtual machines.
• Designed for experiments Internet-wide: new protocols for Internet, overlay networks (file-sharing, routing algorithm,
multi-cast, ...)
• Emulab
• Network emulation testbed. Mono-site, mono-cluster. Emulation. Integrated approach
• Open Cloud
• 480 cores distributed in four locations, interoperability across clouds using open API
• Open Cirrus
• Federation of heterogeneous data centers, test-bed for cloud computing
• DAS (generations 1 to 5)
• Federation of 4-6 clusters in the Netherlands, ~200 nodes, a specific target experiment for each generation
• NECTAR
• Federated Australian Research Cloud over 8 sites
• FutureGrid (ended Sept. 30th)
• A 4-year project within XSEDE with a similar approach to Grid'5000
• Federation of data centers, bare hardware reconfiguration
• 2 new projects awarded by NSF last August
• Chameleon: a large-scale, reconfigurable experimental environment for cloud research, co-located at the University of
Chicago and The University of Texas at Austin
• CloudLab: a large-scale distributed infrastructure based at the University of Utah, Clemson University and the
University of Wisconsin
71. Chameleon: A powerful and flexible experimental
instrument
Large-scale instrument
- Targeting Big Data, Big Compute, Big Instrument research
- More than 650 nodes and 5 PB of disk over two sites, 100G network
Reconfigurable instrument
- Bare metal reconfiguration, operated as single instrument, graduated
approach for ease-of-use
Connected instrument
- Workload and Trace Archive
- Partnerships with production clouds: CERN, OSDC, Rackspace, Google,
and others
- Partnerships with users
Complementary instrument
- Complementing GENI, Grid’5000, and other testbeds
Sustainable instrument
- Strong connections to industry
Credits: Kate Keahey
www.chameleoncloud.org
72. Chameleon Hardware
[Architecture diagram] Two sites, Chicago and Austin, joined by the Chameleon Core Network (100 Gbps uplink to the public network at each site), with links to UTSA, GENI, and future partners. Standard Cloud Units (a switch plus 42 compute and 4 storage nodes) connect to the core and are fully connected to each other: x10 at one site and x2 at the other, alongside heterogeneous cloud units with alternate processors and networks. Core services comprise front-end and data-mover nodes and a 3 PB central file system. Overall: 504 x86 compute servers, 48 distributed storage servers, 102 heterogeneous servers, and 16 management and storage nodes.
73. Related Platforms : BonFIRE
• The BonFIRE foundation operates the results of the BonFIRE project
• For Inria: applying some of the lessons learned running Grid’5000 to the cloud paradigm
• A federation of 5 sites, accessed through a central API (based on OCCI)
• An observable cloud
• Access to metrics of the underlying host to understand observed behavior
• Number of VMs running
• Power consumption
• A controllable infrastructure
• One site exposes resources managed by Emulab, enabling control over network parameters
• Ability to be the exclusive user of a host, or to use a specific host
• Higher-level features than when using Grid’5000
www.bonfire-project.eu
75. Conclusion and Open Challenges
• Computer science is also an experimental science
• There are different and complementary approaches to experimentation in computer science
• Computer science is not yet at the same level as other sciences
• But things are improving…
• Grid’5000: a testbed for experimentation on distributed systems with a unique combination of features
• Hardware-as-a-Service cloud
• redeployment of operating system on the bare hardware by users
• Access to various technologies (CPUs, high performance networks, etc.)
• Networking: dedicated backbone, monitoring, isolation
• Programmable through an API
• Energy consumption monitoring
• Useful platform
• More than 750 publications tagged with Grid’5000 in HAL
• Between 500 and 600 users per year since 2006
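The "Programmable through an API" point can be illustrated with a short sketch. Grid'5000's reference API describes the hardware of each node as JSON; the snippet below mimics selecting nodes by hardware feature from such a description. The field names (`uid`, `network`, `cpu.ht`) are illustrative only, not the exact Grid'5000 schema.

```python
import json

# Illustrative, reference-API-style node descriptions; the field names
# here are made up for the example, not the exact Grid'5000 schema.
SAMPLE = json.loads("""
[
  {"uid": "node-1", "network": "10G", "cpu": {"ht": true}},
  {"uid": "node-2", "network": "1G",  "cpu": {"ht": false}},
  {"uid": "node-3", "network": "10G", "cpu": {"ht": false}}
]
""")

def pick_nodes(nodes, network=None, hyperthreading=None):
    """Return the uids of nodes matching the requested hardware properties."""
    selected = []
    for node in nodes:
        if network is not None and node["network"] != network:
            continue
        if hyperthreading is not None and node["cpu"]["ht"] != hyperthreading:
            continue
        selected.append(node["uid"])
    return selected

print(pick_nodes(SAMPLE, network="10G"))                       # ['node-1', 'node-3']
print(pick_nodes(SAMPLE, network="10G", hyperthreading=True))  # ['node-1']
```

In practice the node descriptions would come from the testbed's REST API rather than an inline sample; the point is that experiment setup (pick hardware, deploy, measure) can be driven entirely by scripts.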
76. What Have We Learned?
Building such a platform was a real challenge!
- No off-the-shelf software available
- Need for a team of highly motivated and highly trained engineers and researchers
- Strong support and deep understanding from the involved institutions!
From our experience, experimental platforms should feature
- Experiment isolation
- Capability to reproduce experimental conditions
- Flexibility through high degree of reconfiguration
- Strong control over experiment preparation and execution
- Precise measurement methodology
- Tools to help users prepare and run their experiments
- Deep on-line monitoring (essential for understanding observations)
- Capability to inject real-life (real-time) experimental conditions (real Internet traffic, faults)
77. Conclusion and Open Challenges, cont
• Testbeds optimized for experimental capabilities, not performance
• Access to modern architectures/technologies
• Not necessarily the fastest CPUs
• But funding remains expensive!
• Ability to trust results
• Regular checks of testbed for bugs
• Ability to understand results
• Documentation of the infrastructure
• Instrumentation & monitoring tools
• network, energy consumption
• Evolution of the testbed
• maintenance logs, configuration history
• Empower users to perform complex experiments
• Facilitate access to advanced software tools
• Paving the way to Open Science of HPC and Cloud – long term goals
• Fully automated execution of experiments
• Automated tracking + archiving of experiments and associated data
78. Our Open Access program
An 'Open Access' program enables researchers to get a Grid'5000 account
valid for two months. At the end of this period, the user must submit a
report explaining their use of Grid'5000 to open-access@lists.grid5000.fr,
and (possibly) apply for renewal.
Open Access accounts currently provide unrestricted access to Grid'5000
(this might change in the future). However:
• Usage must follow the Grid'5000 User Charter
• We ask Open Access users not to use OAR's "Advance reservations" for
jobs more than 24 hours in the future
https://www.grid5000.fr/mediawiki/index.php/Special:G5KRequestOpenAccess
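The 24-hour rule above can be sketched as a small helper; this is illustrative only, not part of Grid'5000 or OAR tooling:

```python
from datetime import datetime, timedelta

# Open-access rule stated above: advance reservations should not
# start more than 24 hours in the future.
ADVANCE_LIMIT = timedelta(hours=24)

def reservation_allowed(now, requested_start):
    """True if the requested start time is within the 24-hour window."""
    return requested_start - now <= ADVANCE_LIMIT

now = datetime(2014, 9, 24, 12, 0)
print(reservation_allowed(now, datetime(2014, 9, 25, 10, 0)))  # True
print(reservation_allowed(now, datetime(2014, 9, 26, 12, 0)))  # False
```

In practice the check would be applied against the start time passed to OAR's advance-reservation mechanism before submitting the job.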
79. QUESTIONS ?
Special thanks to
G. Antoniu, Y. Georgiou, D.
Glesser, A. Lebre, L. Lefèvre, M.
Liroz, D. Margery, L. Nussbaum,
C. Perez, L. Pouillioux
and the Grid’5000 technical team
www.grid5000.fr