
Grid'5000: Running a Large Instrument for Parallel and Distributed Computing Experiments


The increasing complexity of available infrastructures (hierarchical, parallel, distributed, etc.) with specific features (caches, hyper-threading, dual core, etc.) makes it extremely difficult to build analytical models that allow for satisfactory predictions. This raises the question of how to validate algorithms and software systems when a realistic analytic study is not possible. As in many other sciences, one answer is experimental validation. Such experiments, however, rely on the availability of an instrument able to validate every level of the software stack and to offer a variety of hardware and software facilities for compute, storage, and network resources.

Almost ten years after its beginnings, the Grid'5000 testbed has become one of the most complete testbeds for designing and evaluating large-scale distributed systems. Initially dedicated to the study of large HPC facilities, Grid'5000 has evolved to address wider concerns related to Desktop Computing, the Internet of Services and, more recently, the Cloud Computing paradigm. We now also target new processor features such as hyperthreading, turbo boost, and power management, as well as large applications managing big data. This keynote addresses both the issue of experiments in HPC and computer science and the design and usage of the Grid'5000 platform for various kinds of applications.


  1. 1. Grid'5000: Running a Large Instrument for Parallel and Distributed Computing Experiments F. Desprez INRIA Grenoble Rhône-Alpes, LIP ENS Lyon, Avalon Team Joint work with G. Antoniu, Y. Georgiou, D. Glesser, A. Lebre, L. Lefèvre, M. Liroz, D. Margery, L. Nussbaum, C. Perez, L. Pouillioux F. Desprez - Cluster 2014 24/09/2014 - 1
  2. 2. “One could determine the different ages of a science by the technique of its measurement instruments” Gaston Bachelard, The Formation of the Scientific Mind F. Desprez - Cluster 2014 24/09/2014 - 2
  3. 3. Agenda • Experimental Computer Science • Overview of GRID’5000 • GRID’5000 Experiments • Related Platforms • Conclusions and Open Challenges F. Desprez - Cluster 2014 24/09/2014 - 3
  4. 4. VALIDATION IN COMPUTER SCIENCE F. Desprez - Cluster 2014 24/09/2014 - 4
  5. 5. The discipline of computing: an experimental science The reality of computer science - Information - Computers, network, algorithms, programs, etc. Studied objects (hardware, programs, data, protocols, algorithms, network) are more and more complex Modern infrastructures • Processors have very nice features: caches, hyperthreading, multi-core • Operating system impacts the performance (process scheduling, socket implementation, etc.) • The runtime environment plays a role (MPICH ≠ OpenMPI) • Middleware has an impact • Various parallel architectures that can be heterogeneous, hierarchical, distributed, dynamic F. Desprez - Cluster 2014 24/09/2014 - 5
  6. 6. Experimental culture not comparable with other sciences Different studies • 1994: 400 papers - Between 40% and 50% of CS ACM papers requiring experimental validation had none (15% in optical engineering) [Lukowicz et al.] • 1998: 612 papers - “Too many articles have no experimental validation” [Zelkowitz and Wallace 98] • 2009 update - Situation is improving • 2007: Survey of simulators used in P2P research - Most papers use an unspecified or custom simulator Computer science is not at the same level as some other sciences • Nobody redoes experiments • Lack of tools and methodologies Paul Lukowicz et al. Experimental Evaluation in Computer Science: A Quantitative Study. In: Journal of Systems and Software 28:9-18, 1994 M.V. Zelkowitz and D.R. Wallace. Experimental models for validating technology. Computer, 31(5):23-31, May 1998 Marvin V. Zelkowitz. An update to experimental models for validating computer technology. In: J. Syst. Softw. 82.3:373–376, Mar. 2009 S. Naicken et al. The state of peer-to-peer simulators and simulations. In: SIGCOMM Comput. Commun. Rev. 37.2:95–98, Mar. 2007 F. Desprez - Cluster 2014 24/09/2014 - 6
  7. 7. “Good experiments” A good experiment should fulfill the following properties • Reproducibility: must give the same result with the same input • Extensibility: must target possible comparisons with other works and extensions (more/other processors, larger data sets, different architectures) • Applicability: must define realistic parameters and must allow for an easy calibration • “Revisability”: when an implementation does not perform as expected, must help to identify the reasons F. Desprez - Cluster 2014 24/09/2014 - 7
  8. 8. Analytic modeling Purely analytical (mathematical) models • Demonstration of properties (theorem) • Models need to be tractable: over-simplification? • Good to understand the basics of the problem • Most of the time one still performs experiments (at least for comparison) For a practical impact (especially in distributed computing): an analytic study is not always possible or not sufficient F. Desprez - Cluster 2014 24/09/2014 - 8
  9. 9. Experimental Validation A good alternative to analytical validation • Provides a comparison between algorithms and programs • Provides a validation of the model or helps to define the validity domain of the model Several methodologies • Simulation (SimGrid, NS, …) • Emulation (MicroGrid, Distem, …) • Benchmarking (NAS, SPEC, Linpack, ….) • Real-scale (Grid’5000, FutureGrid, OpenCirrus, PlanetLab, …) F. Desprez - Cluster 2014 24/09/2014 - 9
  10. 10. GRID’5000 www.grid5000.fr F. Desprez - Cluster 2014 24/09/2014 - 10
  11. 11. Grid’5000 Mission Support high quality, reproducible experiments on a distributed system testbed Two areas of work • Improve trustworthiness • Testbed description • Experiment description • Control of experimental conditions • Automate experiments • Monitoring & measurement • Improve scope and scale • Handle large number of nodes • Automate experiments • Handle failures • Monitoring and measurements Both goals raise similar challenges F. Desprez - Cluster 2014 24/09/2014 - 11
  12. 12. GRID’5000 • Testbed for research on distributed systems • Born from the observation that we need a better and larger testbed • High Performance Computing, Grids, Peer-to-peer systems, Cloud computing • Complete access to the nodes’ hardware in exclusive mode (from one node to the whole infrastructure): Hardware as a Service • RIaaS: Real Infrastructure as a Service!? • History, a community effort • 2003: Project started (ACI GRID) • 2005: Opened to users • Funding • INRIA, CNRS, and many local entities (regions, universities) • One rule: only for research on distributed systems • → no production usage • Free nodes during daytime to prepare experiments • Large-scale experiments during nights and week-ends F. Desprez - Cluster 2014 24/09/2014 - 12
  13. 13. Current Status (Sept. 2014 data) • 10 sites (1 outside France) • 24 clusters • 1006 nodes • 8014 cores • Diverse technologies • Intel (65%), AMD (35%) • CPUs from one to 12 cores • Ethernet 1G, 10G, • Infiniband {S, D, Q}DR • Two GPU clusters • 2 Xeon Phi • 2 data clusters (3-5 disks/node) • More than 500 users per year • Hardware renewed regularly F. Desprez - Cluster 2014 24/09/2014 - 13
  14. 14. A Large Research Applicability F. Desprez - Cluster 2014 24/09/2014 - 14
  15. 15. Backbone Network Dedicated 10 Gbps backbone provided by Renater (the French NREN) Work in progress • Packet-level and flow-level monitoring F. Desprez - Cluster 2014 24/09/2014 - 15
  16. 16. Using GRID’5000: User’s Point of View • Key tool: SSH • Private network: connect through access machines • Data storage: - NFS (one server per GRID’5000 site) - Datablock service (one per site) - 100TB server in Rennes F. Desprez - Cluster 2014 24/09/2014 - 16
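     A minimal connection sketch for the workflow above (the user name, the site frontend name and the access.grid5000.fr gateway host are given for illustration only):
        # log into the Grid'5000 access machine from the outside world
        ssh jdoe@access.grid5000.fr
        # from there, hop to a site frontend inside the private network
        ssh lyon
        # the home directory is then served by the site NFS server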
  17. 17. GRID’5000 Software Stack • Resource management: OAR • System reconfiguration: Kadeploy • Network isolation: KaVLAN • Monitoring: Ganglia, Kaspied, Kwapi • Putting it all together: the GRID’5000 API F. Desprez - Cluster 2014 24/09/2014 - 17
  18. 18. Resource Management: OAR Batch scheduler with specific features • interactive jobs • advance reservations • powerful resource matching • Resources hierarchy • cluster / switch / node / cpu / core • Properties • memory size, disk type & size, hardware capabilities, network interfaces, … • Other kinds of resources: VLANs, IP ranges for virtualization I want 1 core on 2 nodes of the same cluster with 4096 MB of memory and Infiniband 10G + 1 cpu on 2 nodes of the same switch with dual-core processors for a walltime of 4 hours … oarsub -I -l "{memnode=4096 and ib10g='YES'}/cluster=1/nodes=2/core=1 + {cpucore=2}/switch=1/nodes=2/cpu=1,walltime=4:0:0" F. Desprez - Cluster 2014 24/09/2014 - 18
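     Two simpler request shapes for the OAR features listed above, as a hedged sketch (sizes, dates and walltimes are illustrative):
        # interactive job: 2 nodes for one hour, shell opened on the first node
        oarsub -I -l nodes=2,walltime=1:00:00
        # advance reservation: 16 nodes at a given date for 12 hours
        oarsub -r "2014-09-25 19:00:00" -l nodes=16,walltime=12:00:00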
  19. 19. Resource Management: OAR, Visualization Resource status Gantt chart F. Desprez - Cluster 2014 24/09/2014 - 19
  20. 20. Kadeploy – Scalable Cluster Deployment • Provides a Hardware-as-a-Service Cloud infrastructure • Built on top of PXE, DHCP, TFTP • Scalable, efficient, reliable and flexible • Chain-based and BitTorrent environment broadcast • 255 nodes deployed in 7 minutes (latest scalability test 4000 nodes) • Support of a broad range of systems (Linux, Xen, *BSD, etc.) • Command-line interface & asynchronous interface (REST API) • Similar to a cloud/virtualization provisioning tool (but on real machines) • Choose a system stack and deploy it over GRID’5000! (deployment workflow shown on the slide: preparation, update PXE, reboot; deploy environment: fdisk and mkfs, chained broadcast, image writing; prepare boot of deployed environment: install bootloader, update PXE and VLAN, reboot) kadeploy3.gforge.inria.fr F. Desprez - Cluster 2014 24/09/2014 - 20
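     A hedged sketch of a typical deploy-and-reimage session (job type, environment name and options illustrate the usual workflow and are not prescriptive):
        # book nodes in deployment mode for two hours
        oarsub -I -t deploy -l nodes=4,walltime=2:00:00
        # reinstall them with a reference environment and copy the user SSH key to root
        kadeploy3 -e wheezy-x64-base -f $OAR_NODEFILE -k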
  21. 21. Kadeploy example – create a data cluster Out of 8 nodes • 2 × 10-core CPUs per machine (Intel Xeon E5-2660 v2 @ 2.20 GHz) • 1 × 10 Gb interface, 1 × 1 Gb interface • 5 × 600 GB SAS disks With Kadeploy3 and shared Puppet recipes • 4 min 25 s to deploy a new base image on all nodes • 5 min 9 s to configure them using Puppet Any user can get a 17.6 TB Ceph data cluster with • 32 OSDs • 1280 MB/s read performance on the Ceph Storage Cluster (Rados bench) • 517 MB/s write performance with default replication size = 2 (Rados bench) • 1180 MB/s write performance without replication (Rados bench) Playing with tweaked Hadoop deployments is popular Work carried out by P. Morillon (https://github.com/pmorillon/grid5000-xp-ceph) F. Desprez - Cluster 2014 24/09/2014 - 21
  22. 22. Network Isolation: KaVLAN • Reconfigures switches for the duration of a user experiment to provide complete level-2 isolation • Avoid network pollution (broadcast, unsolicited connections) • Enable users to start their own DHCP servers • Experiment on Ethernet-based protocols • Interconnect nodes with another testbed without compromising the security of Grid'5000 • Relies on 802.1q (VLANs) • Compatible with many network devices • Can use SNMP, SSH or telnet to connect to switches • Supports Cisco, HP, 3Com, Extreme Networks, and Brocade • Controlled with a command-line client or a REST API • Recent work: support for several interfaces F. Desprez - Cluster 2014 24/09/2014 - 22
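     A hedged sketch of booking a VLAN together with nodes (the exact OAR property names may differ; shown only to illustrate that KaVLAN resources are reserved like any other resource):
        # reserve one KaVLAN VLAN plus 4 deployable nodes for three hours
        oarsub -I -t deploy -l "{type='kavlan'}/vlan=1+nodes=4,walltime=3:00:00"
        # the allocated VLAN id can then be queried and the nodes moved into it
        # with the kavlan command-line client (or the REST API) mentioned above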
  23. 23. Network Isolation: KaVLAN, cont F. Desprez - Cluster 2014 24/09/2014 - 23
  24. 24. Monitoring, Ganglia F. Desprez - Cluster 2014 24/09/2014 - 24
  25. 25. Monitoring, Energy Unified way to access energy logs for energy profiling • Energy consumption per resource • Energy consumption per user jobs • Energy measurement injection in user F. Desprez - Cluster 2014 24/09/2014 - 25
  26. 26. Putting it all together: GRID’5000 API • Individual services & command-line interfaces are painful • REST API for each Grid'5000 service • Reference API versioned description of Grid'5000 resources • Monitoring API state of Grid'5000 resources • Metrology API access to data probes’ output (ganglia, hdf5, …) • Jobs API OAR interface • Deployments API Kadeploy interface • User API managing the user base F. Desprez - Cluster 2014 24/09/2014 - 26
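     A hedged sketch of querying the REST API with plain HTTP (the api.grid5000.fr host and the /stable path layout are recalled from the public documentation of that period; treat them as illustrative):
        # Reference API: list the sites, then the clusters of one site
        curl -u jdoe https://api.grid5000.fr/stable/sites
        curl -u jdoe https://api.grid5000.fr/stable/sites/rennes/clusters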
  27. 27. Description of an Experiment over Grid’5000 • Description and verification of the environment • Reconfiguring the testbed to meet experimental needs • Monitoring experiments, extracting and analyzing data • Improving control and description of experiments F. Desprez - Cluster 2014 24/09/2014 - 27
  28. 28. Description and verification of the environment • Typical needs • How can I find suitable resources for my experiment? • How sure can I be that the actual resources will match their description? • What was the hard drive on the nodes I used six months ago? F. Desprez - Cluster 2014 24/09/2014 - 28
  29. 29. Description and selection of resources • Describing resources → understand results • Detailed description on the Grid’5000 wiki • Machine-parsable format (JSON) • Archived (state of the testbed 6 months ago?) • Selecting resources • OAR database filled from the JSON description oarsub -p "wattmeter='YES' and gpu='YES'" oarsub -l "{cluster='a'}/nodes=1+{cluster='b' and eth10g='Y'}/nodes=2,walltime=2" F. Desprez - Cluster 2014 24/09/2014 - 29
  30. 30. Verification of resources Inaccuracies in resource descriptions → dramatic consequences • Mislead researchers into making false assumptions • Generate wrong results → retracted publications! • Happen frequently: maintenance, broken hardware (e.g. RAM) • Our solution: g5k-checks • Runs at node boot (can also be run manually by users) • Retrieves the current description of the node in the Reference API • Acquires information on the node using Ohai, ethtool, etc. • Compares with the Reference API • Future work (maybe?) • Verification of performance, not just availability and configuration of hardware (hard drives, network, etc.) • Provide tools to capture the state of the testbed → archival with the rest of the experiment's data F. Desprez - Cluster 2014 24/09/2014 - 30
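     A minimal sketch of the kind of check g5k-checks performs (not its actual code; the URL layout and JSON field are illustrative): compare the RAM size stored in the Reference API with what the node itself reports.
        # memory size according to the Reference API (bytes)
        curl -s -u jdoe \
          https://api.grid5000.fr/stable/sites/$SITE/clusters/$CLUSTER/nodes/$NODE \
          | jq '.main_memory.ram_size'
        # memory size actually seen by the kernel on the node (kB)
        awk '/MemTotal/ {print $2 " kB"}' /proc/meminfo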
  31. 31. Reconfiguring the testbed • Typical needs • How can I install $SOFTWARE on my nodes? • How can I add $PATCH to the kernel running on my nodes? • Can I run a custom MPI to test my fault tolerance work? • How can I experiment with that Cloud/Grid middleware? • Likely answer on any production facility: you can’t • Or: use virtual machines → experimental bias • On Grid’5000 • Operating System reconfiguration with Kadeploy • Customize networking environment with KaVLAN F. Desprez - Cluster 2014 24/09/2014 - 31
  32. 32. Changing experimental conditions • Reconfigure experimental conditions with Distem • Introduce heterogeneity in a homogeneous cluster • Emulate complex network topologies http://distem.gforge.inria.fr/ • What else can we enable users to change? • BIOS settings • Power management settings • CPU features (Hyperthreading, Turbo mode, etc.) • Cooling system: temperature in the machine room? F. Desprez - Cluster 2014 24/09/2014 - 32
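     Distem automates this kind of reconfiguration on top of standard Linux mechanisms; as a raw illustration of the underlying idea, traffic control alone can already degrade a link (values illustrative, run as root on a deployed node):
        # add 50 ms of latency and 1% packet loss on eth0
        tc qdisc add dev eth0 root netem delay 50ms loss 1%
        # remove the emulated conditions afterwards
        tc qdisc del dev eth0 root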
  33. 33. Monitoring experiments Goal: enable users to understand what happens during their experiment Power consumption CPU – memory – disk Network backbone Internal networks F. Desprez - Cluster 2014 24/09/2014 - 33
  34. 34. Exporting and analyzing data • Unified access to monitoring tools through the Grid’5000 API • Automatically export data during/after an experiment • Current work: high resolution monitoring for energy & network F. Desprez - Cluster 2014 24/09/2014 - 34
  35. 35. Controlling advanced CPU features Modern processors have advanced options • Hyperthreading (SMT): share execution resources of a core between 2 logical processors • Turbo Boost: increase the frequency of some cores while others are idle • C-states: put cores in different sleep modes • P-states (aka SpeedStep): give each core a target performance level These advanced options are set at a very low level (BIOS or kernel options) • Do they have an impact on the performance of middleware? • Do publications document this level of experimental setting? • Does the community understand the possible impact? • How can this be controlled/measured on a shared infrastructure? F. Desprez - Cluster 2014 24/09/2014 - 35
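     On a node deployed with root access, some of these knobs are reachable from Linux itself; a hedged sketch (availability depends on the kernel, driver and BIOS, and Hyperthreading itself usually still requires a BIOS change):
        # disable Turbo Boost when the intel_pstate driver is in use
        echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo
        # pin the frequency governor of all cores to 'performance'
        cpupower frequency-set -g performance
        # limit deep C-states with a kernel boot parameter set at deployment time:
        #   intel_idle.max_cstate=1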
  36. 36. Modern CPU features and Grid’5000 On Grid’5000 • Recent management cards can control low-level features (BIOS options, in short); their options are set through XML descriptions (tested on Dell’s iDRAC7) • Kadeploy’s inner workings support • Booting a deployment environment that can be used to apply changes to BIOS options • User control over command-line options passed to kernels Ongoing work • Taking BIOS (or UEFI) descriptions as a new parameter at the kadeploy3 level • Restoring and documenting the standard state F. Desprez - Cluster 2014 24/09/2014 - 36
  37. 37. GRID’5000 EXPERIMENTS F. Desprez - Cluster 2014 24/09/2014 - 37
  38. 38. ENERGY MANAGEMENT F. Desprez - Cluster 2014 24/09/2014 - 38
  39. 39. Energy efficiency around Grid’5000 • IT – 2-5% of CO2 emissions / 10% of electricity • Green IT → reducing the electrical consumption of IT equipment • Future exascale/datacenter platforms → systems from 20 to 100 MW • How to build such systems and make them (more) energy sustainable/responsible? Multi-dimensional approaches: hardware, software, usage • Several activities around energy management in Grid’5000 since 2007 • 2007: first energy efficiency considerations • 2008: launch of Green activities… • 1st challenge: finding a first (real) powermeter • Omegawatt, a French SME • Deployment of multiple powermeters: Lyon, Toulouse, Grenoble • 2010: increasing scalability: the Lyon Grid’5000 site, a fully energy-monitored site with > 150 powermeters in big boxes. F. Desprez - Cluster 2014 24/09/2014 - 39
  40. 40. Aggressive ON/OFF is not always the best solution • Exploiting the gaps between activities • Reducing the number of unused plugged resources • Only switching off if there is a potential energy saving Anne-Cecile Orgerie, Laurent Lefèvre, and Jean-Patrick Gelas. "Save Watts in your Grid: Green Strategies for Energy-Aware Framework in Large Scale Distributed Systems", ICPADS 2008 : The 14th IEEE International Conference on Parallel and Distributed Systems, Melbourne, Australia, December 2008 F. Desprez - Cluster 2014 24/09/2014 - 40
  41. 41. To understand energy measurements: take care of your wattmeters! Frequency / precision M. Diouri, M. Dolz, O. Glück, L. Lefèvre, P. Alonso, S. Catalán, R. Mayo, E. Quintana-Ortí. Solving some Mysteries in Power Monitoring of Servers: Take Care of your Wattmeters!, EE-LSDS 2013: Energy Efficiency in Large Scale Distributed Systems conference, Vienna, Austria, April 22-24, 2013 F. Desprez - Cluster 2014 24/09/2014 - 41
  42. 42. Homogeneity (in energy consumption) does not exist! • Depends on technology • Same flops but not the same flops per watt • Idle/static cost • CPU: the main contributor • Green scheduler designers must incorporate this issue! Mohammed el Mehdi Diouri, Olivier Glück, Laurent Lefèvre and Jean-Christophe Mignot. "Your Cluster is not Power Homogeneous: Take Care when Designing Green Schedulers!", IGCC 2013: International Green Computing Conference, Arlington, USA, June 27-29, 2013 F. Desprez - Cluster 2014 24/09/2014 - 42
  43. 43. Improving energy management with application expertise • Considered services: resilience & data broadcasting • 4 steps: service analysis, measurements, calibration, estimation • Helping users make the right choices depending on context and parameters M. Diouri, Olivier Glück, Laurent Lefèvre, and Franck Cappello. "ECOFIT: A Framework to Estimate Energy Consumption of Fault Tolerance Protocols during HPC Executions", CCGrid 2013. F. Desprez - Cluster 2014 24/09/2014 - 43
  44. 44. Improving energy management without application expertise • Irregular usage of resources • Phase detection, characterisation • Power saving modes deployment Landry Tsafack, Laurent Lefèvre, Jean-Marc Pierson, Patricia Stolf, and Georges Da Costa. "A runtime framework for energy efficient HPC systems without a priori knowledge of applications", ICPADS 2012. F. Desprez - Cluster 2014 24/09/2014 - 44
  45. 45. VIRTUALIZATION AND CLOUDS F. Desprez - Cluster 2014 24/09/2014 - 45
  46. 46. GRID’5000, Virtualization and Clouds • Virtualization technologies as the building blocks of Cloud Computing • Highlighted as a major topic for Grid’5000 as early as 2008 • Two visions • IaaS providers (investigate low-level concerns) • Cloud end-users (investigate new algorithms/mechanisms on top of major cloud kits) • Require additional tools and support from the Grid’5000 technical board (IP allocations, VM image management…) Adding Virtualization Capabilities to the Grid'5000 Testbed, Balouek, D., Carpen-Amarie, A., Charrier, G., Desprez, F., Jeannot, E., Jeanvoine, E., Lèbre, A., Margery, D., Niclausse, N., Nussbaum, L., Richard, O., Perez, C., Quesnel, F., Rohr, C., and Sarzyniec, L., Cloud Computing and Services Science, #367 in Communications in Computer and Information Science series, pp. 3-20, Ivanov, I., Sinderen, M., Leymann, F., and Shan, T. Eds, Springer, 2013 F. Desprez - Cluster 2014 24/09/2014 - 46
  47. 47. GRID’5000, Virtualization and Clouds: Dedicated tools • Pre-built VMM images maintained by the technical staff • Xen, KVM • Possibility for end-users to run VMs directly on the reference environment (like traditional IaaS infrastructures) • Cloud kits: scripts to easily deploy and use Nimbus/OpenNebula/CloudStack and OpenStack • Network • Need a reservation scheme for VM addresses (both MAC and IP) • MAC addresses randomly assigned • Subnet ranges can be booked for IPs (/18, /19, …) F. Desprez - Cluster 2014 24/09/2014 - 47
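     A hedged sketch of booking an IP range together with nodes for VMs (resource type and tool name as recalled from the documentation of that period; treat the details as assumptions):
        # reserve two /22 subnets plus 8 nodes for four hours
        oarsub -I -l "slash_22=2+nodes=8,walltime=4:00:00"
        # inside the job, list the booked ranges so they can be assigned to VMs
        g5k-subnets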
  48. 48. GRID’5000, Virtualization and Clouds: Sky computing use-case • 2010 - A Nimbus federation over three sites • Deployment process • Reserve available nodes (using the Grid’5000 REST API) • Deploy the Nimbus image on all selected nodes • Finalize the configuration (using the Chef utility) • Allocate roles to nodes (service, repository) • Generate necessary files (Root Certificate Authority, ...) • 31 min later, simply use the clouds: 280 VMMs, 1600 virtual CPUs Experiment renewed and extended to the FutureGrid testbed F. Desprez - Cluster 2014 24/09/2014 - 48
  49. 49. GRID’5000, Virtualization and Clouds: Sky computing use-case Experiments between USA and France • Nimbus (resource management, contextualization)/ViNe (connectivity)/Hadoop (task distribution, fault-tolerance, dynamicity) • FutureGrid (3 sites) and Grid’5000 (3 sites) platforms • Optimization of creation and propagation of VMs (slide figure: a distributed Hadoop MapReduce application running over IaaS software and ViNe, spanning the FutureGrid sites UF and UC and the Grid’5000 sites Rennes, Lille and Sophia, with all-to-all connectivity through a white-listed queue at the Grid’5000 firewall) Large-Scale Cloud Computing Research: Sky Computing on FutureGrid and Grid’5000, by Pierre Riteau, Maurício Tsugawa, Andréa Matsunaga, José Fortes and Kate Keahey, ERCIM News 83, Oct. 2010. Credits: Pierre Riteau F. Desprez - Cluster 2014 24/09/2014 - 49
  50. 50. GRID’5000, Virtualization and Clouds: Dynamic VM placement use-case 2012 - Investigate issues related to preemptive scheduling of VMs • Can a system handle VMs across a distributed infrastructure like OSes manipulate processes on local nodes ? • Several proposals in the literature, but • Few real experiments (simulation based results) • Scalability is usually a concern • Can we perform several migrations between several nodes at the same time ? What is the amount of time, the impact on the VMs/on the network ? F. Desprez - Cluster 2014 24/09/2014 - 50
  51. 51. GRID’5000, Virtualization and Clouds: Dynamic VM placement use-case Deploy 10240 VMs upon 512 PMs • Prepare the experiment • Book resources: 512 PMs with hardware virtualization, a global VLAN, a /18 for IP ranges • Deploy KVM images and put the PMs in the global VLAN • Launch/configure VMs • A dedicated script leveraging the Taktuk utility to interact with each PM (see the sketch below) • G5K-subnet to get the booked IPs and assign them to VMs • Start the experiment and make publications! (map on the slide: sites involved include Rennes, Lille, Orsay, Reims, Luxembourg, Nancy, Lyon, Grenoble, Sophia, Toulouse, Bordeaux) F. Quesnel, D. Balouek, and A. Lebre. Deploying and Scheduling Thousands of Virtual Machines on Hundreds of Nodes Distributed Geographically. In IEEE International Scalable Computing Challenge (SCALE 2013) (colocated with CCGRID 2013), Netherlands, May 2013 F. Desprez - Cluster 2014 24/09/2014 - 51
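     The kind of fan-out command such a script relies on, as a sketch (option names recalled from TakTuk tutorials and start_vms.sh is a hypothetical per-node launcher, not part of the experiment's published code):
        # run one VM launcher on every physical machine listed in the OAR node file
        taktuk -l root -f $OAR_NODEFILE broadcast exec [ './start_vms.sh' ]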
  52. 52. GRID’5000, Virtualization and Clouds • More than 198 publications • Three times finalist of the IEEE SCALE challenge (2nd prize winner in 2013) • Tomorrow - Virtualization of network functions (Software Defined Networking) - Go one step beyond KaVLAN to guarantee, for instance, bandwidth expectations - HiperCal, Emulab approaches: network virtualisation is performed by daemons running on dedicated nodes ⇒ does not bring additional capabilities (close to the Distem project) - By reconfiguring routers/switches on demand ⇒ requires specific devices (OpenFlow compliant) • Which features should be exported to the end-users? • Are there any security concerns? F. Desprez - Cluster 2014 24/09/2014 - 52
  53. 53. HIGH PERFORMANCE COMPUTING F. Desprez - Cluster 2014 24/09/2014 - 53
  54. 54. Riplay: A Tool to Replay HPC Workloads • RJMS: Resource and Job Management System • It manages resources and schedules jobs on High-Performance Clusters • Most famous ones: Maui/Moab, OAR, PBS, SLURM • Riplay • Replays traces on a real RJMS in an emulated environment • 2 RJMS supported (OAR and SLURM) • Jobs replaced by sleep commands • Can replay a full workload or an interval of it • On Grid’5000 • 630 emulated cores need 1 physical core to run • Curie (ranked 26th on the last Top500, 80640 cores) • Curie's RJMS can be run on 128 Grid’5000 cores F. Desprez - Cluster 2014 24/09/2014 - 54
  55. 55. Riplay: A Tool to Replay HPC Workloads • Replay of the Curie petaflopic cluster workload on Grid’5000 using emulation techniques • RJMS evaluation with different scheduling parameters (plots: SLURM with FavorSmall set to True vs. set to False) F. Desprez - Cluster 2014 24/09/2014 - 55
  56. 56. Riplay: A Tool to Replay HPC Workloads • Test RJMS scalability • Without the need for the actual cluster • Test a huge, fully loaded cluster on an RJMS in minutes (plots: OAR before and after optimizations) Large Scale Experimentation Methodology for Resource and Job Management Systems on HPC Clusters, Joseph Emeras, David Glesser, Yiannis Georgiou and Olivier Richard https://forge.imag.fr/projects/evalys-tools/ F. Desprez - Cluster 2014 24/09/2014 - 56
  57. 57. HPC Component Model: From Grid’5000 to the Curie Supercomputer Issues • Improve code re-use of HPC applications • Simplify application portability Objective • Validate L2C, a low-level HPC component model • Jacobi [1] and 3D FFT [2] kernels Roadmap • Low-scale validation on Grid’5000 • Overhead w.r.t. “classical” HPC models (threads, MPI) • Study of the impact of various node architectures • Large-scale validation on Curie • Up to 2000 nodes, 8192 cores (slide figure: L2C assembly of Iter+XY components linked by MPI and Thread connectors) 1: J. Bigot, Z. Hou, C. Pérez, V. Pichon. A Low Level Component Model Easing Performance Portability of HPC Applications. Computing, pages 1–16, 2013, Springer Vienna 2: Ongoing work. F. Desprez - Cluster 2014 24/09/2014 - 57
  58. 58. HPC Component Model: From Grid’5000 to the Curie Supercomputer Jacobi: no L2C overhead in any version 3D FFT (plots, y-axis in ns/#iter/elem): 256³ FFT, 1D decomposition, Edel+Genepi clusters (G5K); 1024³ FFT, 2D decomposition, Curie (thin nodes) F. Desprez - Cluster 2014 24/09/2014 - 58
  59. 59. DATA MANAGEMENT F. Desprez - Cluster 2014 24/09/2014 - 59
  60. 60. Big Data Management • Growing demand from our institutes and users • Grid’5000 has to evolve to be able to cope with such experiments • Current status • Home quotas have grown from 2 GB to 25 GB by default (and can be extended up to 200 GB) • For larger datasets, users should leverage storage5K (persistent datasets imported into Grid'5000 between experiments) • DFS5K: deploy CephFS on demand • What’s next • Current challenge: uploading datasets (from the Internet to G5K) / deployment of data from the archive storage system to the working nodes • New storage devices: SSD / NVRAM / … • New booking approaches • Allow end-users to book node partitions for longer periods than the usual reservations F. Desprez - Cluster 2014 24/09/2014 - 60
  61. 61. Scalable Map-Reduce Processing Goal: high-performance Map-Reduce processing through concurrency-optimized data processing • Some results • Versioning-based concurrency management for increased data throughput (BlobSeer approach) • Efficient intermediate data storage in pipelines • Substantial improvements with respect to Hadoop • Application to efficient VM deployment • Intensive, long-run experiments done on Grid'5000 • Up to 300 nodes/500 cores • Plans: validation within the IBM environment with IBM MapReduce benchmarks • ANR Project Map-Reduce (ARPEGE, 2010-2014) • Partners: Inria (teams: KerData - leader, AVALON, Grand Large), Argonne National Lab, UIUC, JLPC, IBM, IBCP mapreduce.inria.fr F. Desprez - Cluster 2014 24/09/2014 - 61
  62. 62. Big Data in Post-Petascale HPC Systems • Problem: simulations generate TB/min • How to store and transfer data? • How to analyze, visualize and extract knowledge? (slide figure: 100,000+ cores produce petabytes of data, analyzed on ~10,000 cores) F. Desprez - Cluster 2014 24/09/2014 - 62
  63. 63. Big Data in Post-Petascale HPC Systems • Problem: simulations generate TB/min • How to store and transfer data? • How to analyze, visualize and extract knowledge? What is difficult? • Too many files (e.g. Blue Waters: 100,000+ files/min) • Too much data • Unpredictable I/O performance (slide figure: 100,000+ cores, petabytes of data, ~10,000 analysis cores: does not scale!) F. Desprez - Cluster 2014 24/09/2014 - 63
  64. 64. Damaris: A Middleware-Level Approach to I/O on Multicore HPC Systems Idea: one dedicated I/O core per multicore node Originality: shared memory, asynchronous processing Implementation: software library Applications: climate simulations (Blue Waters) Preliminary experiments on Grid’5000 http://damaris.gforge.inria.fr/ F. Desprez - Cluster 2014 24/09/2014 - 64
  65. 65. Damaris: Leveraging dedicated cores for in situ visualization (plots: time partitioning (traditional approach) vs. space partitioning (Damaris approach), with and without visualization) Experiments on Grid’5000 with Nek5000 (912 cores) Using Damaris completely hides the performance impact of in situ visualization F. Desprez - Cluster 2014 24/09/2014 - 65
  66. 66. Damaris: Impact of Interactivity (plots: time-partitioning vs. space-partitioning) Experiments on Grid’5000 with Nek5000 (48 cores) Using Damaris completely hides the run-time impact of in situ visualization, even in the presence of user interactivity F. Desprez - Cluster 2014 24/09/2014 - 66
  67. 67. MapReduce framework improvement Context: MapReduce job execution Motivation • Skew in reduce phase execution may cause high execution times • Base problem: the size of reduce tasks cannot be defined a priori - FPH approach: modified MapReduce framework where • Reduce input data is divided into fixed-size splits • The reduce phase is divided into two parts • Intermediate: multiple-iteration execution of fixed-size tasks • Final: final grouping of results F. Desprez - Cluster 2014 24/09/2014 - 67
  68. 68. Results (plots: RT vs. size, RT vs. skew, RT vs. query, RT vs. MinSize, RT vs. cluster size) F. Desprez - Cluster 2014 24/09/2014 - 68
  69. 69. RELATED PLATFORMS Credit: Chris Coleman, School of Computing, University of Utah F. Desprez - Cluster 2014 24/09/2014 - 69
  70. 70. Related Platforms • PlanetLab • 1074 nodes over 496 sites world-wide, slice allocation: virtual machines • Designed for Internet-wide experiments: new protocols for the Internet, overlay networks (file-sharing, routing algorithms, multi-cast, ...) • Emulab • Network emulation testbed. Mono-site, mono-cluster. Emulation. Integrated approach • Open Cloud • 480 cores distributed in four locations, interoperability across clouds using an open API • Open Cirrus • Federation of heterogeneous data centers, test-bed for cloud computing • DAS-1..5 • Federation of 4-6 clusters in the Netherlands, ~200 nodes, a specific target experiment for each generation • NeCTAR • Federated Australian Research Cloud over 8 sites • FutureGrid (ended Sept. 30th) • 4-year project within XSEDE with a similar approach to Grid’5000 • Federation of data centers, bare hardware reconfiguration • 2 new projects awarded by NSF last August • Chameleon: a large-scale, reconfigurable experimental environment for cloud research, co-located at the University of Chicago and The University of Texas at Austin • CloudLab: a large-scale distributed infrastructure based at the University of Utah, Clemson University and the University of Wisconsin F. Desprez - Cluster 2014 24/09/2014 - 70
  71. 71. Chameleon: A powerful and flexible experimental instrument Large-scale instrument - Targeting Big Data, Big Compute, Big Instrument research - More than 650 nodes and 5 PB of disk over two sites, 100G network Reconfigurable instrument - Bare-metal reconfiguration, operated as a single instrument, graduated approach for ease-of-use Connected instrument - Workload and Trace Archive - Partnerships with production clouds: CERN, OSDC, Rackspace, Google, and others - Partnerships with users Complementary instrument - Complementing GENI, Grid’5000, and other testbeds Sustainable instrument - Strong connections to industry Credits: Kate Keahey www.chameleoncloud.org F. Desprez - Cluster 2014 24/09/2014 - 71
  72. 72. Chameleon Hardware (hardware diagram: standard cloud units of 42 compute and 4 storage nodes each, x10 in Chicago and x2 in Austin; SCUs connect to the core network and are fully connected to each other; heterogeneous cloud units with alternate processors and networks; core services with 3 PB central file systems, front-end and data mover nodes; Chameleon core network with a 100 Gbps uplink to the public network at each site and links to UTSA, GENI and future partners; in total 504 x86 compute servers, 48 distributed storage servers, 102 heterogeneous servers, 16 management and storage nodes) F. Desprez - Cluster 2014 24/09/2014 - 72
  73. 73. Related Platforms : BonFIRE • BonFIRE foundation is operating the results of the BonFIRE project • For Inria : applying some of the lessons learned running Grid’5000 to the cloud paradigm • A federation of 5 sites, accessed through a central API (based on OCCI) • An observable cloud • Access to metrics of the underlying host to understand observed behavior • Number of VMs running • Power consumption • A controllable infrastructure • One site exposing resources managed by Emulab, enabling control over network parameters • Ability to be the exclusive user of a host, or to use a specific host • Higher level features than when using Grid’5000 www.bonfire-project.eu F. Desprez - Cluster 2014 24/09/2014 - 73
  74. 74. CONCLUSION F. Desprez - Cluster 2014 24/09/2014 - 74
  75. 75. Conclusion and Open Challenges • Computer science is also an experimental science • There are different and complementary approaches for doing experiments in computer science • Computer science is not at the same level as other sciences • But things are improving… • GRID’5000: a test-bed for experimentation on distributed systems with a unique combination of features • Hardware-as-a-Service cloud • Redeployment of the operating system on the bare hardware by users • Access to various technologies (CPUs, high-performance networks, etc.) • Networking: dedicated backbone, monitoring, isolation • Programmable through an API • Energy consumption monitoring • Useful platform • More than 750 publications with Grid’5000 in their tag (HAL) • Between 500 and 600 users per year since 2006 F. Desprez - Cluster 2014 24/09/2014 - 75
  76. 76. What Have We Learned? Building such a platform was a real challenge! - No off-the-shelf software available - Need to have a team of highly motivated and highly trained engineers and researchers - Strong help and deep understanding from the involved institutions! From our experience, experimental platforms should feature - Experiment isolation - Capability to reproduce experimental conditions - Flexibility through a high degree of reconfiguration - Strong control of experiment preparation and running - A precise measurement methodology - Tools to help users prepare and run their experiments - Deep on-line monitoring (essential to help understand observations) - Capability to inject real-life (real-time) experimental conditions (real Internet traffic, faults) F. Desprez - Cluster 2014 24/09/2014 - 76
  77. 77. Conclusion and Open Challenges, cont. • Testbeds optimized for experimental capabilities, not performance • Access to modern architectures/technologies • Not necessarily the fastest CPUs • But still expensive → funding! • Ability to trust results • Regular checks of the testbed for bugs • Ability to understand results • Documentation of the infrastructure • Instrumentation & monitoring tools • Network, energy consumption • Evolution of the testbed • Maintenance logs, configuration history • Empower users to perform complex experiments • Facilitate access to advanced software tools • Paving the way to Open Science of HPC and Cloud (long-term goals) • Fully automated execution of experiments • Automated tracking + archiving of experiments and associated data F. Desprez - Cluster 2014 24/09/2014 - 77
  78. 78. Our Open Access program An 'Open Access' program enables researchers to get a Grid'5000 account valid for two months. At the end of this period, the user must submit a report explaining their use of Grid'5000 to open-access@lists.grid5000.fr, and (possibly) apply for renewal. Open Access accounts currently provide unrestricted access to Grid'5000 (this might change in the future). However: • Usage must follow the Grid'5000 User Charter • We ask Open Access users not to use OAR's "advance reservations" for jobs more than 24 hours in the future https://www.grid5000.fr/mediawiki/index.php/Special:G5KRequestOpenAccess F. Desprez - Cluster 2014 24/09/2014 - 78
  79. 79. QUESTIONS ? Special thanks to G. Antoniu, Y. Georgiou, D. Glesser, A. Lebre, L. Lefèvre, M. Liroz, D. Margery, L. Nussbaum, C. Perez, L. Pouillioux and the Grid’5000 technical team www.grid5000.fr
