BioITWorld 2013 presentation - Best practices for building multi-tenant HPC clusters for Pharma/BioTech
Essentially a mini case study of a recent deployment of a multi-petabyte, 1000+ CPU core Linux cluster in the Boston area.
Please email me at: chris@bioteam.net if you would like the actual PDF file itself.
2. I’m Chris.
I’m an infrastructure geek.
I work for the BioTeam.
www.bioteam.net - Twitter: @chris_dag
3. Your substitute host ...
Speaking on behalf of others
‣ Original speaker can’t make it today
‣ Stepping in as substitute due to involvement in the Assessment & Deployment project phases
‣ Just about everything in this presentation is the work of other, smarter people!
5. Pharma Multi-Tenant HPC
Case Study: Sanofi S.A.
‣ Sanofi: multinational pharmaceutical with worldwide research & commercial locations
‣ 7 major therapeutic areas: cardiovascular, central nervous system, diabetes, internal medicine, oncology, thrombosis and vaccines
‣ Other Sanofi S.A. companies: Merial, Chattem, Genzyme & Sanofi Pasteur
6. History
Case Study: Sanofi S.A.
‣ System discussed here is among the first major outcomes of a late-2011 global review effort called “HPC-O” (HPC Optimization)
‣ HPC-O involved:
• Revisiting prior HPC recommendations
• Intensive data-gathering & cataloging of HPC resources
• North America: interviews with 30+ senior scientists, IT & scientific leadership, system operators & all levels of global IT and infrastructure support services
• A similar effort across EU operations
7. HPC-O Recommendations
Case Study: Sanofi S.A.
‣ Build a new shared-services HPC environment
• Model/prototype for future global HPC
• Designed to meet scientific & business requirements
• Multiple concurrent users, groups and business units
• ... use IT building blocks that are globally approved and supported 24x7 by the Global IS (“GIS”) organization
‣ Site the initial system in the Boston area
8. Current Status
‣ Online since November 2012
‣ Approaching the end of the initial round of testing, optimization and user acceptance work
‣ Today: running large-scale workloads but not formally in Production status
9. Why Multi-Tenant Cluster?
Case Study: Sanofi S.A.
‣ HPC Mission & Scope Creep
• 11+ separate HPC systems in North America alone
• Wildly disparate technology, product & support models
• “Islands” of HPC used by single business units
• Almost no detectable cross-system or cross-site usage
• Huge variance in refresh/upgrade cycles
10. Why Multi-Tenant Cluster, cont.
Case Study: Sanofi S.A.
‣ Utilization & Efficiency
• Islands of HPC tended to be underutilized most of the time and oversubscribed by a single business unit during peak demand times
• Hardware age and capability varied widely due to huge differences in maintenance cycles by business unit
• Cost of commercial software licensing (globally) hugely significant; difficult to maximize ROI & use of very expensive software entitlements across “islands” of HPC
11. Why Multi-Tenant Cluster, cont.
Case Study: Sanofi S.A.
‣ Need for “Opportunistic Capacity”
• Difficult to perform exploratory research outside the normal scope of business unit activities
‣ Avoid “Shadow IT” problems
• Frustrated users will find their own DIY solutions
• ... “the cloud” is just a departmental credit card away
‣ “Chaperoned” cloud-bursting
• Centrally managed, “chaperoned” utilization of IaaS cloud resources for specific workloads & data (a sketch follows below)
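The deck doesn’t show an implementation of “chaperoned” bursting, so here is only a minimal sketch of the idea: a central service, not end users, holds the cloud credentials and will only launch pre-vetted images for approved workloads. It uses the era’s boto library for AWS; the workload name, AMI ID and instance type are invented for illustration.

```python
# Hypothetical "chaperoned" cloud-bursting gatekeeper. The central HPC
# team holds the AWS credentials; users can only request approved,
# pre-vetted workload profiles. All identifiers below are illustrative.
import boto.ec2

APPROVED_WORKLOADS = {
    "docking-screen": {"ami": "ami-xxxxxxxx", "type": "c1.xlarge"},  # vetted image
}

def burst(workload, count):
    """Launch `count` worker instances for an approved workload profile."""
    if workload not in APPROVED_WORKLOADS:
        raise ValueError("workload not approved by HPC governance")
    spec = APPROVED_WORKLOADS[workload]
    conn = boto.ec2.connect_to_region("us-east-1")  # credentials held centrally
    return conn.run_instances(spec["ami"], min_count=count,
                              max_count=count, instance_type=spec["type"])
```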
12. In-house vs. Cloud
Case Study: Sanofi S.A.
‣ Cloud options studied intensively; still made the decision to invest in significant regional HPC
‣ Some Reasons
• Baseline “always available” capability
• Ability to obsessively tune performance
• Security
• Cost control (many factors here ...)
• Data size, movement & lifecycle issues
• Agility
13. In-house vs. Cloud
Case Study: Sanofi S.A.
‣ HPC Storage: “Center of Gravity” for Scientific Data
• Compute power is pretty easy and not super expensive
• Storage need, even just for the Boston region, is peta-scale
• Mapping data flows and access patterns reveals a very complex web of researcher, instrument, workstation and pipeline interactions with storage
‣ In a nutshell:
• Engineer the heck out of a robust peta-scale R&D storage platform for the Boston region
• Drop a reasonable amount of HPC capability near this storage
• Bias all engineering/design efforts to facilitate agility/change
• Use the cloud only when best-fit
23. Enabling Technologies: Facility
Case Study: Sanofi S.A.
‣ A Sanofi company has a suitable local colo suite
• ... already under long-term lease
• ... and with a bit of server consolidation, lots of space for HPC compute and storage
• ... plenty of room for “adjunct” systems that will likely be attracted to the storage “center of gravity”
‣ Can’t reveal exact size but this facility can handle double-digit numbers of additional HPC compute, storage and network cabinets
24. Enabling Technologies: WAN
Case Study: Sanofi S.A.
‣ Regional consolidated HPC is not possible without MAN/WAN efforts to connect all sites and users
• ... direct routing required; not optimal to route HPC traffic through Corporate Tier-1 facilities that may be thousands of miles away
• Existing MAN/WAN network links upgraded where there was a business/scientific justification
• All other MAN/WAN links verified to ensure expansion is easy/possible should a business need arise
25. Enabling Technologies: WAN
Case Study: Sanofi S.A.
‣ Regional Networking Result
• Most sites: bonded 1-Gigabit path to regional HPC hub
• A Cambridge building has a direct 10-Gigabit Ethernet link to the HPC hub; used for heavy data movement as well as ingest of data arriving on physical media
• Special routing (HTTP, FTP) in place for satellite locations not yet on the converged Enterprise WAN/MAN
• HPC Hub Facility:
- Dedicated HPC-only internet link for open-data downloads
- Internet2 connection being pursued for EDU collaboration
27. Architecture
Philosophy
‣ Intense desire to keep things simple
‣ Commodity works very well; avoid the expensive and the exotic when we can
‣ Extra commodity capacity compensates for performance lost by not choosing the exotic competition
• Also delivers more agility and easier reuse/repurposing
‣ If we build from globally-blessed IT components we can eventually turn basic operation, maintenance and monitoring over to the Global IS organization
• ... freeing Research IT staff to concentrate on science & users
28. Architecture
HPC Stack
‣ Explicit decision made to source the HPC cluster stack from a commercial provider
• This is actually a radical departure from prior HPC efforts
‣ Many evaluated; one chosen
‣ Primary drivers:
• 24x7 commercial support
• Research IT staff needs to concentrate on apps/users
• “Single SKU” out-of-the-box functionality and features (bare metal provisioning, etc.) that reduce operational burden
29. Architecture
HPC Stack - Bright Computing
‣ Bright Computing selected
• Hardware neutral
• Scheduler neutral
• Full API, CLI and lightweight monitoring stack (scriptable; see the sketch below)
• Web GUIs for non-experts
• Single dashboard for advanced monitoring and management
• Data-aware scheduling & native support for AWS cloud bursting
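The deck stops at the feature list, so the following is only a minimal sketch of what driving the stack from a script might look like, assuming Bright’s cmsh shell is on the PATH and accepts one-shot, semicolon-separated command strings via -c.

```python
# Minimal sketch: driving Bright Cluster Manager's cmsh shell from Python.
# Assumes cmsh accepts one-shot command strings via -c (an assumption
# about the deployed Bright version, not something shown in the deck).
import subprocess

def cmsh(commands):
    """Run a one-shot cmsh command string and return its raw output."""
    return subprocess.check_output(["cmsh", "-c", commands])

print(cmsh("device; list"))    # inventory of managed nodes/devices
print(cmsh("device; status"))  # per-node health at a glance
```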
31. Architecture
Compute Hardware
‣ Key Design Goals
• Use common server config for as many nodes as possible
• Modular & extensible design
• “Blessed” by Global IS (GIS) organization
32. Architecture
Compute Hardware
‣ HP C7000 Blade Enclosures
• Our basic building block
• Very flexible on network, interconnect and blade configuration
• Sanofi GIS approved
• “Lights-out” facility approved
• Pre-negotiated preferential pricing on almost everything we needed
33. Architecture
Compute Hardware
‣ HP C7000 Blade Enclosure becomes the smallest modular unit in the HPC design
‣ Big cluster built from smaller preconfigured “blocks” of C7000s
‣ 4 standard “blocks”:
• M-Block
• C-Block
• G-Block
• X-Block
34. Architecture
Compute Hardware
‣ M-Block (Mgmt)
• HP BL460c Blades
- Dual-socket quad-core with 96GB RAM & 1TB mirrored OS disks
‣ 2x HA Master Node(s)
‣ 1x Mgmt Node
‣ 3x HPC Login Node(s)
‣ ... plenty of room ...
35. Architecture
Compute Hardware
‣ C-Block (Compute)
• HP BL460c Blades
- Dual-socket quad-core with 96GB RAM & 1TB mirrored OS disks
‣ Fully populated with 16 blades per enclosure
‣ Set of 8 C-Blocks = 1024 CPU cores (arithmetic below)
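The 1024-core figure follows directly from the slide’s own numbers:

```python
# Core count per the slides: dual-socket quad-core BL460c blades,
# 16 blades per C7000 enclosure, 8 C-Block enclosures in a set.
cores_per_blade = 2 * 4                # dual-socket quad-core
blades_per_enclosure = 16
enclosures = 8
print(cores_per_blade * blades_per_enclosure * enclosures)  # -> 1024
```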
36. Architecture
Compute Hardware
‣ G-Block (GPU)
• No C7000; HP s6500 enclosure used for G-Block units
‣ HP SL250s Servers
‣ 3x Tesla GPUs per SL250s server
‣ ... 15 Tflop per G-Block
37. Architecture
Compute Hardware
‣ X-Block C7000
• Hosting of “Adjunct Servers”
• X-Block for unique requirements that don’t fit into a standard C-, G- or M-Block configuration, or for servers supplied by business units
‣ Big Memory Nodes
‣ Virtualization Platform(s)
‣ Big SMP Nodes
‣ Graphics/Viz Nodes
‣ Application Servers
‣ Database Servers
38. Architecture
Compute Hardware
‣ Modular design can grow into double-digit numbers of datacenter cabinets
• C-Blocks and G-Blocks for compute; M-Blocks and X-Blocks for Mgmt and special cases
• 8-core, 96GB RAM, 1TB BL460c blade is the standard individual server config; deviation only when required
43. Architecture
Storage Hardware
‣ EMC Isilon Scale-out NAS
• ~1 petabyte raw for active use
• ~1 petabyte raw for backup
‣ Why Isilon?
• Large, single-namespace scaling beyond our most aggressive capacity projections
• Easy to manage / GIS approved
• Aggregate throughput increases with capacity expansion
• Tiering & SSD options
44. Architecture
External Connectivity
‣ Dedicated Internet circuit for the new HPC Hub
• Direct download/ingest of large public datasets without affecting other business users
• Downloads don’t hit MAN/WAN networks & avoid the centrally routed Enterprise internet egress point located hundreds of miles away
• Very handy for Cloud/VPN efforts as well
‣ Internet2
• I2 and other high-speed academic network connectivity planned
45. Architecture
Physical data ingest
‣ Large Scale Data Ingest & Export
• Often overlooked; very important!
‣ Dedicated Data Station
• 10 Gig link to HPC Hub
• Fast CPUs for checksum and integrity operations (a sketch follows below)
• Removable SATA/SAS bays
• Lots of USB & eSATA ports
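The deck doesn’t describe the ingest tooling itself; as a minimal sketch of the checksum/integrity step such a data station performs, assuming media ship with a simple filename-and-MD5 manifest (the manifest format is an assumption, not Sanofi’s actual process):

```python
# Illustrative integrity check for physical-media ingest: hash every file
# listed in a shipped manifest and report mismatches. The manifest format
# (filename<TAB>md5 per line) is an assumption for this sketch.
import hashlib
import os

def md5sum(path, bufsize=1 << 20):
    """Stream a file through MD5 so large files don't exhaust RAM."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(bufsize), b""):
            h.update(chunk)
    return h.hexdigest()

def verify(manifest_path, media_root):
    """Return the files whose on-media checksum disagrees with the manifest."""
    bad = []
    with open(manifest_path) as manifest:
        for line in manifest:
            name, expected = line.rstrip("\n").rsplit("\t", 1)
            if md5sum(os.path.join(media_root, name)) != expected:
                bad.append(name)
    return bad
```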
47. One more thing ...
Not just a single cluster
‣ Single cluster? Nope.
48. One more thing ...
Not just a single cluster
‣ Single cluster? Nope.
‣ The secret sauce is in the facility, storage and network core
‣ Petabytes of scientific data have a “gravitational pull” within an enterprise
‣ ... we expect many new users and use cases to follow
49. One more thing ...
Not just a single cluster
‣ We can support:
• Additional clusters & analytic platforms grafted onto our network and storage core
• Validated server, software and cluster environments collocated in close proximity
• Integration with private cloud and virtualization environments
• Integration with public IaaS clouds
• Dedicated Hadoop / Big Data environments
• On-demand reconfiguration of C-Blocks into HDFS/Hadoop-optimized mini clusters (a sketch follows below)
• And much more ...
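Bright’s bare-metal provisioning is what would make the C-Block-to-Hadoop reconfiguration practical. Purely as a hypothetical sketch (node names and the “hadoop” category are invented, and the cmsh semantics are assumed as in the earlier sketch), switching a block of nodes to a Hadoop-provisioned software image could look like this:

```python
# Hypothetical: repurpose one C7000's worth of blades into a Hadoop mini
# cluster by switching their Bright node category, so they pick up the
# Hadoop software image on their next provisioning cycle. Node names and
# the "hadoop" category are invented for illustration.
import subprocess

def cmsh(commands):
    return subprocess.check_output(["cmsh", "-c", commands])

for i in range(1, 17):  # 16 blades in one enclosure
    cmsh("device; use cnode%03d; set category hadoop; commit" % i)
```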
51. Beyond the hardware ...
Many other critical factors involved
‣ Let’s Discuss:
• Requirements Gathering
• Building Trust
• Governance
• Support Model
52. Requirements Gathering
‣ When seven-figure CapEx amounts are involved you can’t afford to make a mistake
‣ Capturing business & scientific requirements is non-trivial
• ... especially when trying to account for future needs
‣ Not a 1 person / 1 department job
• ... requires significant expertise and insider knowledge spanning science, software, business plans and both research and global IT staff
53. Requirements Gathering
Our approach
1. Keep the core project team small & focused
• Engage niche resources (legal, security, etc.) on demand
2. Promiscuous (“meet with anyone”) data gathering, meeting & discussion philosophy
3. Strong project management / oversight
4. Public support from senior leadership
5. Frequent sync-ups with key leaders & groups
• Global facility/network/storage/support orgs, Research budget & procurement teams, senior scientific leadership, etc.
54. Building Trust
Consolidated HPC requires trust
‣ Previous: Many independent islands of HPC
• ... often built/supported/run by local resources
‣ Moving to a shared-services model requires great trust among users & scientific leadership
• Researchers have low tolerance for BS/incompetence
• Informatics is essential; users need to be reassured that current capabilities will be maintained while new capabilities are gained
• Enterprise IT must be willing to prove it understands & can support the unique needs and operational requirements of research informatics
55. Building Trust
Our approach
‣ Our Approach:
• Strong project team with deep technical & institutional experience. Team members could answer any question coming from researchers or business units professionally and with an aura of expertise & competence
• Explicit vocal support from senior IT and research leadership (“We will make this work. Promise.”)
• Willingness to accept & respond to criticism & feedback
- ... especially when someone smashes a poor assumption or finds a gap in the planned design
56. Governance
‣ Tied for first place among “reasons why centralized HPC deployments fail”
‣ Multi-Tenant HPC Governance is essential
‣ ... and often overlooked
57. Governance
The basic issue
‣ ... in research HPC settings there are certain things that should NEVER be dictated by IT
‣ It is not appropriate for an IT SysAdmin to ...
• Create or alter resource allocation policies & quotas
• Decide which users/groups get special treatment
• Decide what software can and cannot be used
• ... etc.
‣ A governance structure involving scientific leadership and user representation is essential
58. Governance
Our Approach
‣ Two committees: “Ops” and “Overlord”
‣ Ops Committee: Users & HPC IT staff coordinating HPC operations jointly
‣ Overlord Committee: Invoked as needed. Makes tiebreaker decisions, busts through political/organizational walls and approves funding/expansion decisions
59. Governance
Our Approach
‣ Ops Committee communicates frequently and is consulted before any user-affecting changes occur
• Membership is drawn from interested/engaged HPC “power users” from each business unit + the HPC Admin Team
‣ Ops Committee “owns” HPC scheduler & queue policies and approves/denies any requests for special treatment. All scheduler/policy changes are blessed by Ops before implementation (illustrated in the sketch below)
‣ This is the primary ongoing governance group
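The deck never names the scheduler (the stack is scheduler-neutral), so the following is only a hypothetical illustration of the governance rule rather than any real scheduler’s API: queue policy is data anyone can read, but nothing changes without recorded Ops Committee sign-off.

```python
# Hypothetical "policy as data" illustration of Ops Committee ownership.
# Queue names, limits and shares are invented; the point is that every
# change is gated on a recorded committee approval.
QUEUE_POLICY = {
    "short": {"max_walltime_hours": 4,   "fairshare_weight": 1.0},
    "long":  {"max_walltime_hours": 168, "fairshare_weight": 0.5},
}

def apply_policy_change(queue, setting, value, ops_ticket=None):
    """Apply a scheduler policy change only with Ops Committee sign-off."""
    if ops_ticket is None:
        raise RuntimeError("scheduler policy changes require Ops Committee approval")
    QUEUE_POLICY[queue][setting] = value

# e.g. apply_policy_change("short", "max_walltime_hours", 8, ops_ticket="OPS-142")
```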
60. Governance
Our Approach
‣ Overlord Committee meets only as needed
• Membership: the scariest heavy hitters we could recruit from senior scientific and IT leadership
- VP or Director level is not unreasonable
• This group needs the most senior people you can find. Heavy hitters are required when mediating between conflicting business units or busting through political/organizational barriers
- Committee does not need to be large, just powerful
61. Support Model
Our Approach
‣ Often overlooked or under-resourced
‣ We are still working on this ourselves
‣ General model
• Transition server, network and storage maintenance & monitoring over to Global IS as soon as possible
• Free up rare HPC Support FTE resources to concentrate on enabling science & supporting users
• Offer frequent training and local “HPC mentor” attention
• Online/portal tools that facilitate user communication, best-practice advice and collaborative “self-support” for common issues
• Still TBD: Helpdesk, Ticketing & Dashboards