The increasing complexity of available infrastructures (hierarchical, parallel, distributed, etc.) with specific features (caches, hyper-threading, dual core, etc.) makes it extremely difficult to build analytical models that allow for satisfactory predictions. This raises the question of how to validate algorithms and software systems when a realistic analytic study is not possible. As in many other sciences, one answer is experimental validation. However, such experimentation relies on the availability of an instrument able to validate every level of the software stack, offering a range of hardware and software facilities for compute, storage, and network resources.
Almost ten years after its inception, the Grid'5000 testbed has become one of the most complete testbeds for designing and evaluating large-scale distributed systems. Initially dedicated to the study of large HPC facilities, Grid'5000 has evolved to address wider concerns related to Desktop Computing, the Internet of Services and, more recently, the Cloud Computing paradigm. We now target new processor features such as hyperthreading, turbo boost, and power management, as well as large applications managing big data. This keynote addresses both the issue of experiments in HPC and computer science, and the design and usage of the Grid'5000 platform for various kinds of applications.
1. Grid'5000: Running a Large Instrument
for Parallel and Distributed Computing
Experiments
F. Desprez
INRIA Grenoble Rhône-Alpes, LIP ENS Lyon, Avalon Team
Joint work with G. Antoniu, Y. Georgiou, D. Glesser, A. Lebre, L. Lefèvre, M. Liroz, D. Margery,
L. Nussbaum, C. Perez, L. Pouillioux
F. Desprez - Cluster 2014 24/09/2014 - 1
2. “One could determine the different ages of a science by the technique of its
measurement instruments”
Gaston Bachelard
The Formation of the Scientific Mind
3. Agenda
• Experimental Computer Science
• Overview of GRID’5000
• GRID’5000 Experiments
• Related Platforms
• Conclusions and Open Challenges
5. The discipline of computing: an experimental science
The reality of computer science
- Information
- Computers, network, algorithms, programs, etc.
Studied objects (hardware, programs, data, protocols, algorithms, network)
are more and more complex
Modern infrastructures
• Processors have very nice features: caches, hyperthreading, multi-core
• Operating system impacts the performance
(process scheduling, socket implementation, etc.)
• The runtime environment plays a role
(MPICH ≠ OPENMPI)
• Middleware have an impact
• Various parallel architectures that can be
heterogeneous, hierarchical, distributed, dynamic
6. Experimental culture not comparable with other sciences
Different studies
• 1994: 400 papers
- Between 40% and 50% of ACM CS papers requiring experimental validation had none (15% in optical engineering) [Lukowicz et al.]
• 1998: 612 papers
- “Too many articles have no experimental validation” [Zelkowitz and Wallace 98]
• 2009 update
- Situation is improving
• 2007: Survey of simulators used in P2P research
- Most papers use an unspecified or custom simulator
Computer science is not at the same level as some other sciences
• Experiments are rarely redone
• Lack of tools and methodologies
Paul Lukowicz et al. Experimental Evaluation in Computer Science: A Quantitative Study. In: Journal of Systems and Software 28:9-18, 1994
M.V. Zelkowitz and D.R. Wallace. Experimental models for validating technology. Computer, 31(5):23-31, May 1998
Marvin V. Zelkowitz. An update to experimental models for validating computer technology. In: J. Syst. Softw. 82.3:373–376, Mar. 2009
S. Naicken et al. The state of peer-to-peer simulators and simulations. In: SIGCOMM Comput. Commun. Rev. 37.2:95–98, Mar. 2007
7. “Good experiments”
A good experiment should fulfill the following properties
• Reproducibility: must give the same result with the same input
• Extensibility: must allow comparisons with other works and extensions (more/other processors, larger data sets, different architectures)
• Applicability: must define realistic parameters and must allow for easy calibration
• “Revisability”: when an implementation does not perform as expected, must help identify the reasons
8. Analytic modeling
Purely analytical (mathematical) models
• Demonstration of properties (theorem)
• Models need to be tractable: over-simplification?
• Good to understand the basics of the problem
• Most of the time one still performs experiments (at least for comparison)
For a practical impact (especially in distributed computing):
analytic study not always possible or not sufficient
9. Experimental Validation
A good alternative to analytical validation
• Provides a comparison between algorithms and programs
• Provides a validation of the model or helps to define the validity
domain of the model
Several methodologies
• Simulation (SimGrid, NS, …)
• Emulation (MicroGrid, Distem, …)
• Benchmarking (NAS, SPEC, Linpack, ….)
• Real-scale (Grid’5000, FutureGrid, OpenCirrus, PlanetLab, …)
11. Grid’5000 Mission
Support high quality, reproducible experiments on a distributed system
testbed
Two areas of work
• Improve trustworthiness
• Testbed description
• Experiment description
• Control of experimental conditions
• Automate experiments
• Monitoring & measurement
• Improve scope and scale
• Handle large number of nodes
• Automate experiments
• Handle failures
• Monitoring and measurements
Both goals raise similar challenges
12. GRID’5000
• Testbed for research on distributed systems
• Born from the observation that we need a better and larger testbed
• High Performance Computing, Grids, Peer-to-peer systems, Cloud computing
• A complete access to the nodes’ hardware in an exclusive mode (from one
node to the whole infrastructure): Hardware as a service
• RIaaS: Real Infrastructure as a Service!?
• History, a community effort
• 2003: Project started (ACI GRID)
• 2005: Opened to users
• Funding
• INRIA, CNRS, and many local entities (regions, universities)
• One rule: only for research on distributed systems
• → no production usage
• Free nodes during daytime to prepare experiments
• Large-scale experiments during nights and week-ends
13. Current Status (Sept. 2014 data)
• 10 sites (1 outside France)
• 24 clusters
• 1006 nodes
• 8014 cores
• Diverse technologies
• Intel (65%), AMD (35%)
• CPUs from one to 12 cores
• Ethernet 1G, 10G
• Infiniband {S, D, Q}DR
• Two GPU clusters
• 2 Xeon Phi
• 2 data clusters (3-5 disks/node)
• More than 500 users per year
• Hardware renewed regularly
14. A Large Research Applicability
15. Backbone Network
Dedicated 10 Gbps backbone provided by RENATER (the French NREN)
Work in progress
• Packet-level and flow-level monitoring
16. Using GRID’5000: User’s Point of View
• Key tool: SSH
• Private network: connect through access machines
• Data storage:
- NFS (one server per GRID’5000 site)
- Datablock service (one per site)
- 100TB server in Rennes
17. GRID’5000 Software Stack
• Resource management: OAR
• System reconfiguration: Kadeploy
• Network isolation: KaVLAN
• Monitoring: Ganglia, Kaspied, Kwapi
• Putting it all together: the GRID’5000 API
18. Resource Management: OAR
Batch scheduler with specific features
• interactive jobs
• advance reservations
• powerful resource matching
• Resource hierarchy
• cluster / switch / node / cpu / core
• Properties
• memory size, disk type & size, hardware capabilities, network interfaces, …
• Other kind of resources: VLANs, IP ranges for virtualization
I want 1 core on 2 nodes of the same cluster with 4096 MB of memory and
Infiniband 10G, plus 1 CPU on 2 nodes of the same switch with dual-core processors,
for a walltime of 4 hours:
oarsub -I -l "{memnode=4096 and ib10g='YES'}/cluster=1/nodes=2/core=1
+ {cpucore=2}/switch=1/nodes=2/cpu=1,walltime=4:0:0"
19. Resource Management: OAR, Visualization
Resource status Gantt chart
20. Kadeploy – Scalable Cluster Deployment
• Provides a Hardware-as-a-Service Cloud infrastructure
• Built on top of PXE, DHCP, TFTP
• Scalable, efficient, reliable and flexible
• Chain-based and BitTorrent environment broadcast
• 255 nodes deployed in 7 minutes (latest scalability test 4000 nodes)
• Support of a broad range of systems (Linux, Xen, *BSD, etc.)
• Command-line interface & asynchronous interface (REST API)
• Similar to a cloud/virtualization provisioning tool (but on real machines)
• Choose a system stack and deploy it over GRID’5000 !
Deployment workflow:
1. Preparation: update PXE configuration, reboot nodes into the deployment environment, fdisk and mkfs
2. Environment installation: chained broadcast, image writing
3. Prepare boot of the deployed environment: install bootloader, update PXE and VLAN, reboot
kadeploy3.gforge.inria.fr
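The payoff of the chain-based broadcast can be seen with a toy pipeline model (our own sketch, not Kadeploy's implementation): each node forwards image chunks to the next node as soon as it receives them, so broadcast time grows with chunks plus nodes instead of chunks times nodes.

```python
# Toy model of chain-based image broadcast: node i streams each chunk to
# node i+1 as soon as it has received it (a pipeline). Illustrative only;
# the real tool moves data over TCP chains between deployment agents.

def chain_broadcast_steps(n_nodes: int, n_chunks: int) -> int:
    """Time steps for a pipelined chain: fill the pipe, then stream."""
    return (n_nodes - 1) + n_chunks

def naive_broadcast_steps(n_nodes: int, n_chunks: int) -> int:
    """Server sends the full image to each node, one node at a time."""
    return n_nodes * n_chunks

if __name__ == "__main__":
    nodes, chunks = 255, 1000  # e.g. the 255-node deployment above
    print("chain:", chain_broadcast_steps(nodes, chunks))  # 1254 steps
    print("naive:", naive_broadcast_steps(nodes, chunks))  # 255000 steps
```

The pipeline's near-independence from node count is what makes 255-node (and 4000-node) deployments practical.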
21. Kadeploy example – create a data cluster
Out of 8 nodes
• 2 × 10-core CPUs (Intel Xeon E5-2660 v2 @ 2.20 GHz)
• 1 × 10 Gb interface, 1 × 1 Gb interface
• 5 × 600 GB SAS disks
With Kadeploy3 and (shared) Puppet recipes
• 4 min 25 s to deploy a new base image on all nodes
• 5 min 9 s to configure them using Puppet
Any user can get a 17.6 TB Ceph data cluster with
• 32 OSDs
• 1280 MB/s read performance on the Ceph Storage Cluster (Rados bench)
• 517 MB/s write performance with default replication size = 2 (Rados bench)
• 1180 MB/s write performance without replication (Rados bench)
Playing with tweaked Hadoop deployments is popular
Work carried out by P. Morillon (https://github.com/pmorillon/grid5000-xp-ceph)
22. Network Isolation: KaVLAN
• Reconfigures switches for the duration of a user experiment to provide complete level-2 isolation
• Avoid network pollution (broadcast, unsolicited connections)
• Enable users to start their own DHCP servers
• Experiment on ethernet-based protocols
• Interconnect nodes with another testbed without compromising the security of
Grid'5000
• Relies on 802.1q (VLANs)
• Compatible with many network devices
• Can use SNMP, SSH or telnet to connect to switches
• Supports Cisco, HP, 3Com, Extreme Networks, and Brocade
• Controlled with a command-line client or a REST API
• Recent work
- support for several interfaces
25. Monitoring, Energy
Unified way to access energy logs for energy
profiling
• Energy consumption per resource
• Energy consumption per user jobs
• Energy measurement injection in user
26. Putting it all together: GRID’5000 API
• Individual services & command-line interfaces are painful
• REST API for each Grid'5000 service
• Reference API: versioned description of Grid'5000 resources
• Monitoring API: state of Grid'5000 resources
• Metrology API: access to data probes’ output (Ganglia, HDF5, …)
• Jobs API: OAR interface
• Deployments API: Kadeploy interface
• User API: managing the user base
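As a sketch of how such an API is typically consumed: resources are exposed as a tree of versioned JSON documents (sites → clusters → nodes). The base URL and path layout below follow the public Grid'5000 API conventions but should be treated as assumptions, and the JSON fragment is abridged and illustrative.

```python
import json

# Browsing the Reference API: build the URL of a node description and parse
# a fragment shaped like one. Endpoint layout is an assumption here.
BASE = "https://api.grid5000.fr/stable"

def node_url(site: str, cluster: str, node: str) -> str:
    """URL of a node's versioned JSON description."""
    return f"{BASE}/sites/{site}/clusters/{cluster}/nodes/{node}"

# Abridged, illustrative node description (real ones carry far more fields).
sample = json.loads("""
{"uid": "griffon-1",
 "architecture": {"nb_cores": 8},
 "network_adapters": [{"interface": "Ethernet", "rate": 1000000000}]}
""")

if __name__ == "__main__":
    print(node_url("nancy", "griffon", "griffon-1"))
    print("cores:", sample["architecture"]["nb_cores"])
```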
27. Description of an Experiment over Grid’5000
• Description and verification of the environment
• Reconfiguring the testbed to meet experimental needs
• Monitoring experiments, extracting and analyzing data
• Improving control and description of experiments
28. Description and verification of the environment
• Typical needs
• How can I find suitable resources for my experiment?
• How sure can I be that the actual resources will match their description?
• What was the hard drive on the nodes I used six months ago?
29. Description and selection of resources
• Describing resources → understanding results
• Detailed description on the Grid’5000 wiki
• Machine-parsable format (JSON)
• Archived (State of testbed 6 months ago?)
• Selecting resources
• OAR database filled from JSON
oarsub -p "wattmeter='YES' and gpu='YES'"
oarsub -l "{cluster='a'}/nodes=1+{cluster='b' and eth10g='Y'}/nodes=2,walltime=2"
30. Verification of resources
Inaccuracies in resource descriptions → dramatic consequences
• Mislead researchers into making false assumptions
• Generate wrong results → retracted publications!
• Happen frequently: maintenance, broken hardware (e.g. RAM)
• Our solution: g5k-checks
• Runs at node boot (can also be run manually by users)
• Retrieves current description of node in Reference API
• Acquires information on the node using OHAI, ethtool, etc.
• Compares with the Reference API
• Future work (maybe?)
• Verification of performance, not just availability and configuration of hardware
(hard drives, network, etc.)
• Provide tools to capture the state of the testbed → archival with the rest of the experiment's data
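The core of the check is a diff between what the Reference API promises and what the node actually reports. A minimal sketch, with hard-coded facts standing in for what the real tool gathers via OHAI, ethtool, etc. (field names are ours):

```python
# Minimal g5k-check-style verification: compare acquired node facts against
# the node's Reference API description and report every mismatch.

def diff_description(reference: dict, acquired: dict) -> dict:
    """Return {field: (expected, found)} for fields that do not match."""
    return {k: (v, acquired.get(k))
            for k, v in reference.items()
            if acquired.get(k) != v}

# Illustrative data: a node that lost a RAM stick since its last check.
reference = {"ram_gb": 64, "disk_model": "ST9500620NS", "eth_rate_gbps": 10}
acquired  = {"ram_gb": 48, "disk_model": "ST9500620NS", "eth_rate_gbps": 10}

if __name__ == "__main__":
    for key, (want, got) in diff_description(reference, acquired).items():
        print(f"MISMATCH {key}: reference says {want}, node reports {got}")
```

A node failing such a check at boot can be pulled out of the pool before it misleads an experiment.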
31. Reconfiguring the testbed
• Typical needs
• How can I install $SOFTWARE on my nodes?
• How can I add $PATCH to the kernel running on my nodes?
• Can I run a custom MPI to test my fault tolerance work?
• How can I experiment with that Cloud/Grid middleware?
• Likely answer on any production facility: you can’t
• Or: use virtual machines → experimental bias
• On Grid’5000
• Operating System reconfiguration with Kadeploy
• Customize networking environment with KaVLAN
32. Changing experimental conditions
• Reconfigure experimental conditions with Distem
• Introduce heterogeneity in a homogeneous cluster
• Emulate complex network topologies
http://distem.gforge.inria.fr/
• What else can we enable users to change?
• BIOS settings
• Power management settings
• CPU features (Hyperthreading, Turbo mode, etc.)
• Cooling system: temperature in the machine room?
33. Monitoring experiments
Goal: enable users to understand what happens during their experiment
(Screenshots: power consumption; CPU, memory, and disk usage; backbone and internal networks)
34. Exporting and analyzing data
• Unified access to monitoring tools through the Grid’5000 API
• Automatically export data during/after an experiment
• Current work: high resolution monitoring for energy & network
35. Controlling advanced CPU features
Modern processors have advanced options
• Hyperthreading (SMT): share execution resources of a core between 2 logical
processors
• Turbo Boost: stop some cores to increase the frequency of others
• C-states: put cores in different sleep modes
• P-states (aka SpeedStep): Give each core a target performance
These advanced options are set at a very low level (BIOS or kernel options)
• Do they have an impact on the performance of middleware?
• Do publications document this level of experimental setting?
• Does the community understand the possible impact?
• How can this be controlled/measured on a shared infrastructure?
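On a deployed node, some of these low-level toggles are visible through the Linux sysfs interface. The sketch below reads the SMT and Intel turbo controls; the paths exist on recent kernels but depend on hardware and drivers, so treat them as assumptions (and writing them requires root).

```python
# Observing (and, as root, toggling) two of the features discussed above
# via Linux sysfs. Paths depend on kernel version and CPU driver.

SMT_CONTROL = "/sys/devices/system/cpu/smt/control"           # "on"/"off"/...
NO_TURBO    = "/sys/devices/system/cpu/intel_pstate/no_turbo" # "1" = turbo off

def smt_enabled(raw: str) -> bool:
    """Interpret the smt/control file content."""
    return raw.strip() == "on"

def turbo_enabled(no_turbo_raw: str) -> bool:
    """Interpret intel_pstate/no_turbo (note the inverted logic)."""
    return no_turbo_raw.strip() == "0"

def read_sysfs(path: str) -> str:
    try:
        with open(path) as f:
            return f.read()
    except OSError:
        return ""  # not available on this hardware/kernel

if __name__ == "__main__":
    print("SMT enabled:  ", smt_enabled(read_sysfs(SMT_CONTROL)))
    print("Turbo enabled:", turbo_enabled(read_sysfs(NO_TURBO)))
```

Recording these values alongside experiment results is one cheap way to document the experimental setting the slide asks about.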
36. Modern CPU features and Grid’5000
On Grid’5000
• Recent management cards can control low-level features (in short, BIOS options) through XML descriptions (tested on Dell iDRAC7)
• Kadeploy's inner workings support
• Booting a deployment environment that can be used to apply changes to BIOS options
• User control over command-line options passed to kernels
Ongoing work
• Taking BIOS (or UEFI) descriptions as a new parameter at kadeploy3 level
• Restoring and documenting the standard state
39. Energy efficiency around Grid’5000
• IT: 2-5% of CO2 emissions, 10% of electricity
• Green IT: reducing the electrical consumption of IT equipment
• Future exascale platforms/datacenters: systems from 20 to 100 MW
• How to build such systems and make them (more) energy sustainable/responsible? Multi-dimensional approaches: hardware, software, usage
• Several activities around energy management in Grid'5000 since 2007
• 2007: first energy-efficiency considerations
• 2008: launch of Green activities
• 1st challenge: finding a first (real) powermeter
• Omegawatt, a French SME
• Deployment of multiple powermeters: Lyon, Toulouse, Grenoble
• 2010: increasing scalability: the Lyon Grid'5000 site fully energy-monitored, with more than 150 powermeters
40. Aggressive ON/OFF is not always the best solution
• Exploiting the gaps between activities
• Reducing the number of unused powered-on resources
• Only switching off when there are potential energy savings
Anne-Cécile Orgerie, Laurent Lefèvre, and Jean-Patrick Gelas. "Save Watts in your Grid: Green Strategies for Energy-Aware Framework in Large Scale Distributed Systems", ICPADS 2008: The 14th IEEE International Conference on Parallel and Distributed Systems, Melbourne, Australia, December 2008
41. To understand energy measurements: take care of your wattmeters!
Frequency / precision
M. Diouri, M. Dolz, O. Glück, L. Lefèvre, P. Alonso, S. Catalán, R. Mayo, E. Quintana-Ortí. Solving some Mysteries in Power Monitoring of Servers: Take Care of your Wattmeters!, EE-LSDS 2013: Energy Efficiency in Large Scale Distributed Systems conference, Vienna, Austria, April 22-24, 2013
42. Homogeneity (in energy consumption) does not exist!
• Depends on technology
• Same flops but not same flops per watt
• Idle / static cost
• CPU: the main contributor
• Green scheduler designers must incorporate
this issue !
Mohammed el Mehdi Diouri, Olivier Glück, Laurent Lefèvre and Jean-Christophe Mignot. "Your Cluster is not Power Homogeneous: Take Care when Designing Green Schedulers!", IGCC 2013: International Green Computing Conference, Arlington, USA, June 27-29, 2013
43. Improving energy management with application expertise
• Considered services: resilience & data broadcasting
• 4 steps: Service analysis, Measurements, Calibration,
Estimation
• Helping users make the right choices depending on context
and parameters
M. Diouri, Olivier Glück, Laurent Lefèvre, and Franck Cappello. "ECOFIT: A Framework to Estimate Energy Consumption of Fault Tolerance Protocols during HPC executions", CCGrid 2013.
44. Improving energy management without application expertise
• Irregular usage of resources
• Phase detection, characterisation
• Power saving modes deployment
Landry Tsafack, Laurent Lefevre, Jean-Marc Pierson, Patricia Stolf, and Georges Da Costa. "A runtime framework for energy efficient HPC systems
without a priori knowledge of applications", ICPADS 2012.
46. GRID’5000, Virtualization and Clouds
• Virtualization technologies as the building blocks of Cloud
Computing
• Highlighted as a major topic for Grid'5000 as early as 2008
• Two visions
• IaaS providers (investigate low-level concerns)
• Cloud end-users (investigate new algorithms/mechanisms on top of major cloud kits)
• Require additional tools and support from the Grid’5000 technical board
(IP allocations, VM Images management…)
Adding Virtualization Capabilities to the Grid'5000 Testbed, Balouek, D., Carpen-Amarie, A., Charrier, G., Desprez, F., Jeannot, E., Jeanvoine, E., Lèbre, A., Margery, D., Niclausse, N., Nussbaum, L., Richard, O., Perez, C., Quesnel, F., Rohr, C., and Sarzyniec, L., Cloud Computing and Services Science, #367 in Communications in Computer and Information Science series, pp. 3-20, Ivanov, I., Sinderen, M., Leymann, F., and Shan, T., Eds, Springer, 2013
47. GRID’5000, Virtualization and Clouds:
Dedicated tools
• Pre-built VMM images maintained by the technical staff
• Xen, KVM
• Possibility for end-users to run VMs directly on the reference environment (like
traditional IaaS infrastructures)
• Cloud kits: scripts to easily deploy and use Nimbus, OpenNebula, CloudStack, and OpenStack
• Network
• Needs a reservation scheme for VM addresses (both MAC and IP)
• MAC addresses randomly assigned
• Subnet ranges can be booked for IPs (/18, /19, …)
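The two address pools above can be sketched with the standard library: random MAC addresses drawn from a locally-administered prefix, and VM IPs taken in order from a booked subnet (a /18 here, as with g5k-subnet). Function names and the example range are ours.

```python
import ipaddress
import random

def random_vm_mac(rng: random.Random) -> str:
    """Random VM MAC under 52:54:00, the conventional KVM local prefix."""
    tail = [rng.randrange(256) for _ in range(3)]
    return "52:54:00:" + ":".join(f"{b:02x}" for b in tail)

def vm_ips(booked_subnet: str, count: int):
    """First `count` usable addresses of a booked subnet."""
    hosts = ipaddress.ip_network(booked_subnet).hosts()
    return [str(next(hosts)) for _ in range(count)]

if __name__ == "__main__":
    rng = random.Random(42)  # seeded for repeatability
    print(random_vm_mac(rng))
    print(vm_ips("10.8.0.0/18", 3))  # first three usable addresses
```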
48. GRID’5000, Virtualization and Clouds:
Sky computing use-case
• 2010 - A Nimbus federation over three sites
• Deployment process
• Reserve available nodes (using the Grid‘5000 REST API)
• Deploy the Nimbus image on all selected nodes
• Finalize the configuration (using the Chef utility)
• Allocate roles to nodes (service, repository),
• Generate necessary files (Root Certificate Authority, ...)
• 31 min later, simply use the clouds: 280 VMMs, 1600 virtual CPUs
Experiment renewed and extended to the FutureGrid testbed
49. GRID’5000, Virtualization and Clouds:
Sky computing use-case
Experiments between USA and France
• Nimbus (resource management, contextualization)/ViNe (connectivity)/Hadoop (task
distribution, fault-tolerance, dynamicity)
• FutureGrid (3 sites) and Grid’5000 (3 sites) platforms
• Optimization of creation and propagation of VMs
(Diagram: a MapReduce application on Hadoop, running over IaaS software on FutureGrid and Grid'5000 sites, interconnected by ViNe across the Grid'5000 firewall)
Large-Scale Cloud Computing Research: Sky Computing on FutureGrid and Grid’5000, by Pierre Riteau, Maurício Tsugawa, Andréa
Matsunaga, José Fortes and Kate Keahey, ERCIM News 83, Oct. 2010.
Credits: Pierre Riteau
(Diagram sites: UF, UC, Rennes, Lille, Sophia; a white-listed queue and ViNe virtual routers provide all-to-all connectivity)
50. GRID’5000, Virtualization and Clouds:
Dynamic VM placement use-case
2012 - Investigate issues related to preemptive scheduling of VMs
• Can a system handle VMs across a distributed infrastructure the way OSes manipulate processes on local nodes?
• Several proposals in the literature, but
• Few real experiments (simulation-based results)
• Scalability is usually a concern
• Can we perform several migrations between several nodes at the same time? How long do they take, and what is the impact on the VMs and on the network?
51. GRID’5000, Virtualization and Clouds:
Dynamic VM placement use-case
Deploy 10,240 VMs on 512 PMs
• Prepare the experiment
• Book resources
- 512 PMs with hardware virtualization
- A global VLAN
- A /18 IP range
• Deploy KVM images and put PMs in the global VLAN
• Launch/configure VMs
- A dedicated script leveraging the Taktuk utility to interact with each PM
- G5K-subnet to get the booked IPs and assign them to VMs
• Start the experiment and make publications!
(Map: the Grid'5000 sites involved: Rennes, Lille, Orsay, Reims, Luxembourg, Nancy, Lyon, Grenoble, Sophia, Toulouse, Bordeaux)
F. Quesnel, D. Balouek, and A. Lebre. Deploying and Scheduling Thousands of Virtual Machines on Hundreds of Nodes Distributed
Geographically. In IEEE International Scalable Computing Challenge (SCALE 2013) (colocated with CCGRID 2013), Netherlands, May 2013
52. GRID’5000, Virtualization and Clouds
• More than 198 publications
• Three times finalist of the IEEE Scale challenge
(2nd prize winner in 2013)
• Tomorrow: virtualization of network functions (Software-Defined Networking)
- Go one step beyond KaVLAN to guarantee, for instance, bandwidth expectations
- HiperCal/Emulab approaches: network virtualization performed by daemons running on dedicated nodes ⇒ does not bring additional capabilities (close to the Distem project)
- By reconfiguring routers/switches on demand ⇒ requires specific devices (OpenFlow-compliant)
- Which features should be exported to end-users? Are there any security concerns?
54. Riplay: A Tool to Replay HPC Workloads
• RJMS: Resource and Job Management System
• Manages resources and schedules jobs on high-performance clusters
• Most famous ones: Maui/Moab, OAR, PBS, SLURM
• Riplay
• Replays traces on a real RJMS in an emulated environment
• 2 RJMSs supported (OAR and SLURM)
• Jobs replaced by sleep commands
• Can replay a full workload or an interval of one
• On Grid'5000
• 630 emulated cores need 1 physical core to run
• Curie (ranked 26th in the latest Top500, 80640 cores)
• Curie's RJMS can be run on 128 Grid'5000 cores
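The job-to-sleep transformation at the heart of this replay can be sketched in a few lines (trace fields and names are illustrative; a real replay submits through OAR or SLURM rather than printing a schedule):

```python
# Toy version of workload replay: each job in an SWF-like trace is replaced
# by a sleep of its recorded runtime, submitted at its recorded arrival
# time. Optionally restrict the replay to an interval of the workload.

trace = [  # (job_id, submit_time_s, runtime_s, cores) -- illustrative
    (1, 0, 120, 64),
    (2, 30, 600, 512),
    (3, 45, 60, 8),
]

def to_sleep_jobs(trace, interval=None):
    """Turn trace entries into (submit_time, command) pairs."""
    lo, hi = interval if interval else (float("-inf"), float("inf"))
    return [(t, f"sleep {run}")  # payload replaced by an equivalent sleep
            for (_jid, t, run, _cores) in trace if lo <= t <= hi]

if __name__ == "__main__":
    for t, cmd in to_sleep_jobs(trace, interval=(0, 40)):
        print(f"t={t:>4}s submit: {cmd}")
```

Because the sleeps consume no CPU, thousands of emulated cores fit on one physical core, which is what makes replaying Curie-scale workloads on 128 cores feasible.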
55. Riplay: A Tool to Replay HPC Workloads
• Replay of the Curie petaflopic cluster workload on Grid'5000 using emulation techniques
• RJMS evaluation with different scheduling parameters
(Plots: SLURM with FavorSmall set to True vs. False)
56. Riplay: A Tool to Replay HPC Workloads
• Test RJMS scalability
• Without needing the actual cluster
• Test a fully loaded, huge cluster on an RJMS in minutes
(Plots: OAR before vs. after optimizations)
Large Scale Experimentation Methodology for Resource and Job Management Systems
on HPC Clusters, Joseph Emeras, David Glesser, Yiannis Georgiou and Olivier Richard
https://forge.imag.fr/projects/evalys-tools/
57. HPC Component Model: From Grid’5000 to Curie
SuperComputer
Issues
• Improve code re-use of HPC applications
• Simplify application portability
Objective
• Validate L2C, a low level HPC component model
• Jacobi1 and 3D FFT2 kernels
Roadmap
• Low scale validation on Grid’5000
• Overhead wrt “classical” HPC models
(Threads, MPI)
• Study of the impact of various node architecture
• Large scale validation on Curie
• Up to 2000 nodes, 8192 cores
(Diagram: L2C component assemblies combining "Iter + XY" computation components with MPI and Thread connectors)
1: J.Bigot, Z. Hou, C. Pérez, V. Pichon. A Low Level Component Model Easing Performance Portability of HPC Applications. Computing,
page 1–16, 2013, Springer Vienna
2: Ongoing work.
58. HPC Component Model: From Grid’5000 to Curie
SuperComputer
Jacobi: No L2C overhead in any version
3D FFT
(Plots: 256³ FFT, 1D decomposition, Edel+Genepi clusters (G5K); 1024³ FFT, 2D decomposition, Curie (thin nodes))
60. Big Data Management
• Growing demand from our institutes and users
• Grid’5000 has to evolve to be able to cope with such experiments
• Current status
• Home quotas have grown from 2 GB to 25 GB by default (and can be extended up to 200 GB)
• For larger datasets, users should leverage Storage5K (persistent datasets imported into Grid'5000 between experiments)
• DFS5K: deploy CephFS on demand
• What's next
• Current challenge: uploading datasets (from the Internet to G5K) and deploying data from the archive storage system to the working nodes
• New storage devices: SSD / NVRAM / …
• New booking approaches
• Allow end-users to book node partitions for longer periods than the usual reservations
61. Scalable Map-Reduce Processing
Goal: High-performance Map-Reduce processing through
concurrency-optimized data processing
Some results
Versioning-based concurrency management for
increased data throughput (BlobSeer approach)
Efficient intermediate data storage in pipelines
Substantial improvements with respect to Hadoop
Application to efficient VM deployment
Intensive, long-run experiments done on Grid'5000
Up to 300 nodes/500 cores
Plans: validation within the IBM environment with IBM
MapReduce Benchmarks
ANR Project Map-Reduce (ARPEGE, 2010-2014)
Partners: Inria (teams: KerData (leader), AVALON, Grand Large), Argonne National Lab, UIUC, JLPC, IBM, IBCP. mapreduce.inria.fr
62. Big Data in Post-Petascale HPC Systems
• Problem: simulations generate TB/min
• How to store and transfer data?
• How to analyze, visualize, and extract knowledge?
(Figure: 100,000+ cores producing petabytes of data, analyzed on ~10,000 cores)
63. Big Data in Post-Petascale HPC Systems
• Problem: simulations generate TB/min
• How to store and transfer data?
• How to analyze, visualize, and extract knowledge?
What is difficult?
• Too many files (e.g. Blue Waters: 100,000+ files/min)
• Too much data
• Unpredictable I/O performance
(Figure: 100,000+ cores producing petabytes of data; the ~10,000-core analysis side does not scale!)
64. Damaris: A Middleware-Level Approach to I/O
on Multicore HPC Systems
Idea: one dedicated I/O core per multicore node
Originality: shared memory, asynchronous processing
Implementation: software library
Applications: climate simulations (Blue Waters)
Preliminary experiments on Grid’5000
http://damaris.gforge.inria.fr/
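The dedicated-core interaction can be sketched with a thread and a shared queue: compute workers hand data off and return to computing immediately, while a dedicated consumer performs the writes. Damaris itself uses shared memory and a full core per node; this sketch only shows the shape of the hand-off.

```python
import queue
import threading

def io_worker(q: "queue.Queue", written: list):
    """Dedicated 'I/O core': drain the queue and write each item."""
    while True:
        item = q.get()
        if item is None:      # sentinel: no more data
            break
        written.append(item)  # stands in for the actual asynchronous write

if __name__ == "__main__":
    q, written = queue.Queue(), []
    io = threading.Thread(target=io_worker, args=(q, written))
    io.start()
    for step in range(5):          # "simulation" iterations
        q.put(f"snapshot-{step}")  # non-blocking hand-off; compute continues
    q.put(None)
    io.join()
    print(written)  # all snapshots written, in order
```

The key property is that `q.put` returns immediately, so I/O time overlaps computation instead of stalling every iteration.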
65. Damaris: Leveraging dedicated cores
for in situ visualization
(Plots: time partitioning (the traditional approach) vs. space partitioning (the Damaris approach), without and with visualization)
Experiments on Grid’5000 with Nek5000 (912 cores)
Using Damaris completely hides the performance impact of in situ visualization
66. Damaris: Impact of Interactivity
(Plots: time partitioning vs. space partitioning)
Experiments on Grid’5000 with Nek5000 (48 cores)
Using Damaris completely hides the run time impact of in situ visualization, even in
the presence of user interactivity
67. MapReduce framework improvement
Context: MapReduce job execution
Motivation
• Skew in reduce phase execution may cause high execution times
• Base problem: the size of reduce tasks cannot be defined a priori
- FPH approach: a modified MapReduce framework where
• Reduce input data is divided into fixed size splits
• Reduce phase is divided into two parts
Intermediate: multiple-iteration execution of fixed-size tasks
Final: final grouping of results
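The split step above can be sketched directly (sizes and names are illustrative): each reduce partition, however skewed, is cut into chunks of at most a fixed size, which become the intermediate-phase tasks.

```python
# Sketch of the FPH idea: instead of one (possibly skewed) reduce task per
# partition, cut reduce input into fixed-size splits processed over several
# intermediate iterations; a final pass then groups the partial results.

def fixed_size_splits(partition_sizes, split_size):
    """Cut each reduce partition into (partition_id, chunk_size) tasks."""
    tasks = []
    for pid, size in enumerate(partition_sizes):
        while size > 0:
            tasks.append((pid, min(size, split_size)))
            size -= split_size
    return tasks

if __name__ == "__main__":
    # A skewed reduce phase: one partition is 10x the others.
    sizes = [100, 100, 1000, 100]
    tasks = fixed_size_splits(sizes, split_size=100)
    print(len(tasks), "uniform tasks instead of 4 skewed ones")
```

With uniform task sizes, no single straggler reduce task dominates the phase's completion time.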
68. Results
(Plots: response time vs. data size, skew, query, minimum split size, and cluster size)
69. RELATED PLATFORMS
Credit: Chris Coleman, School of Computing, University of Utah
70. Related Platforms
• PlanetLab
• 1074 nodes over 496 sites world-wide, slices allocation: virtual machines.
• Designed for experiments Internet-wide: new protocols for Internet, overlay networks (file-sharing, routing algorithm,
multi-cast, ...)
• Emulab
• Network emulation testbed. Mono-site, mono-cluster. Emulation. Integrated approach
• Open Cloud
• 480 cores distributed in four locations, interoperability across clouds using open API
• Open Cirrus
• Federation of heterogeneous data centers, test-bed for cloud computing
• DAS (generations 1 to 5)
• Federation of 4-6 clusters in the Netherlands, ~200 nodes, a specific target experiment for each generation
• NECTAR
• Federated Australian Research Cloud over 8 sites
• FutureGrid (ended Sept. 30th)
• A 4-year project within XSEDE with a similar approach to Grid'5000
• Federation of data centers, bare hardware reconfiguration
• 2 new projects awarded by NSF last August
• Chameleon: a large-scale, reconfigurable experimental environment for cloud research, co-located at the University of
Chicago and The University of Texas at Austin
• CloudLab: a large-scale distributed infrastructure based at the University of Utah, Clemson University and the
University of Wisconsin
71. Chameleon: A powerful and flexible experimental
instrument
Large-scale instrument
- Targeting Big Data, Big Compute, Big Instrument research
- More than 650 nodes and 5 PB of disk over two sites, 100G network
Reconfigurable instrument
- Bare metal reconfiguration, operated as single instrument, graduated
approach for ease-of-use
Connected instrument
- Workload and Trace Archive
- Partnerships with production clouds: CERN, OSDC, Rackspace, Google,
and others
- Partnerships with users
Complementary instrument
- Complementing GENI, Grid’5000, and other testbeds
Sustainable instrument
- Strong connections to industry
Credits: Kate Keahey
www.chameleoncloud.org
72. Chameleon Hardware
[Architecture diagram] Two sites, Chicago and Austin, joined by the Chameleon Core Network (100 Gbps uplink to the public network at each site), with links to UTSA, GENI, and future partners. Standard Cloud Units (a switch plus 42 compute and 4 storage nodes) connect to the core and are fully connected to each other: x10 at one site and x2 at the other, alongside heterogeneous cloud units with alternate processors and networks. Core services comprise front-end and data-mover nodes and a 3 PB central file system. Overall: 504 x86 compute servers, 48 distributed storage servers, 102 heterogeneous servers, and 16 management and storage nodes.
73. Related Platforms : BonFIRE
• The BonFIRE foundation operates the results of the BonFIRE project
• For Inria: applying some of the lessons learned running Grid’5000 to the cloud paradigm
• A federation of 5 sites, accessed through a central API (based on OCCI)
• An observable cloud
• Access to metrics of the underlying host to understand observed behavior
• Number of VMs running
• Power consumption
• A controllable infrastructure
• One site exposes resources managed by Emulab, enabling control over network parameters
• Ability to be the exclusive user of a host, or to use a specific host
• Higher-level features than when using Grid’5000
www.bonfire-project.eu
75. Conclusion and Open Challenges
• Computer science is also an experimental science
• There are different and complementary approaches to experimentation in computer science
• Computer science is not yet at the same level as other sciences
• But things are improving…
• Grid’5000: a testbed for experimentation on distributed systems with a unique combination of features
• Hardware-as-a-Service cloud
• redeployment of operating system on the bare hardware by users
• Access to various technologies (CPUs, high performance networks, etc.)
• Networking: dedicated backbone, monitoring, isolation
• Programmable through an API
• Energy consumption monitoring
• Useful platform
• More than 750 publications tagged with Grid’5000 in HAL
• Between 500 and 600 users per year since 2006
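The "Programmable through an API" point can be illustrated with a short sketch. Grid'5000's reference API describes the hardware of each node as JSON; the snippet below mimics selecting nodes by hardware feature from such a description. The field names (`uid`, `network`, `cpu.ht`) are illustrative only, not the exact Grid'5000 schema.

```python
import json

# Illustrative, reference-API-style node descriptions; the field names
# here are made up for the example, not the exact Grid'5000 schema.
SAMPLE = json.loads("""
[
  {"uid": "node-1", "network": "10G", "cpu": {"ht": true}},
  {"uid": "node-2", "network": "1G",  "cpu": {"ht": false}},
  {"uid": "node-3", "network": "10G", "cpu": {"ht": false}}
]
""")

def pick_nodes(nodes, network=None, hyperthreading=None):
    """Return the uids of nodes matching the requested hardware properties."""
    selected = []
    for node in nodes:
        if network is not None and node["network"] != network:
            continue
        if hyperthreading is not None and node["cpu"]["ht"] != hyperthreading:
            continue
        selected.append(node["uid"])
    return selected

print(pick_nodes(SAMPLE, network="10G"))                       # ['node-1', 'node-3']
print(pick_nodes(SAMPLE, network="10G", hyperthreading=True))  # ['node-1']
```

In practice the node descriptions would come from the testbed's REST API rather than an inline sample; the point is that experiment setup (pick hardware, deploy, measure) can be driven entirely by scripts.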
76. What Have We Learned?
Building such a platform was a real challenge!
- No off-the-shelf software available
- Need for a team of highly motivated and highly trained engineers and researchers
- Strong support and deep understanding from the involved institutions!
From our experience, experimental platforms should feature
- Experiment isolation
- Capability to reproduce experimental conditions
- Flexibility through high degree of reconfiguration
- Strong control over experiment preparation and execution
- Precise measurement methodology
- Tools to help users prepare and run their experiments
- Deep on-line monitoring (essential for understanding observations)
- Capability to inject real-life (real-time) experimental conditions (real Internet traffic, faults)
77. Conclusion and Open Challenges, cont
• Testbeds optimized for experimental capabilities, not performance
• Access to modern architectures/technologies
• Not necessarily the fastest CPUs
• But funding remains expensive!
• Ability to trust results
• Regular checks of testbed for bugs
• Ability to understand results
• Documentation of the infrastructure
• Instrumentation & monitoring tools
• network, energy consumption
• Evolution of the testbed
• maintenance logs, configuration history
• Empower users to perform complex experiments
• Facilitate access to advanced software tools
• Paving the way to Open Science of HPC and Cloud – long term goals
• Fully automated execution of experiments
• Automated tracking + archiving of experiments and associated data
78. Our Open Access program
An 'Open Access' program enables researchers to get a Grid'5000 account
valid for two months. At the end of this period, the user must submit a
report explaining their use of Grid'5000 to open-access@lists.grid5000.fr,
and (possibly) apply for renewal.
Open Access accounts currently provide unrestricted access to Grid'5000
(this might change in the future). However:
• Usage must follow the Grid'5000 User Charter
• We ask Open Access users not to use OAR's "Advance reservations" for
jobs more than 24 hours in the future
https://www.grid5000.fr/mediawiki/index.php/Special:G5KRequestOpenAccess
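The 24-hour rule above can be sketched as a small helper; this is illustrative only, not part of Grid'5000 or OAR tooling:

```python
from datetime import datetime, timedelta

# Open-access rule stated above: advance reservations should not
# start more than 24 hours in the future.
ADVANCE_LIMIT = timedelta(hours=24)

def reservation_allowed(now, requested_start):
    """True if the requested start time is within the 24-hour window."""
    return requested_start - now <= ADVANCE_LIMIT

now = datetime(2014, 9, 24, 12, 0)
print(reservation_allowed(now, datetime(2014, 9, 25, 10, 0)))  # True
print(reservation_allowed(now, datetime(2014, 9, 26, 12, 0)))  # False
```

In practice the check would be applied against the start time passed to OAR's advance-reservation mechanism before submitting the job.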
79. QUESTIONS ?
Special thanks to
G. Antoniu, Y. Georgiou, D.
Glesser, A. Lebre, L. Lefèvre, M.
Liroz, D. Margery, L. Nussbaum,
C. Perez, L. Pouillioux
and the Grid’5000 technical team
www.grid5000.fr