Life Science HPC & Informatics: Trends from the trenches
April 2014
Wednesday, April 9, 14
Who, What, Why ...
BioTeam
‣ Independent consulting shop
‣ Staffed by scientists forced to
learn IT, SW & HPC to get our
own research done
‣ 12+ years bridging the “gap”
between science, IT & high
performance computing
‣ Our wide-ranging work is what
gets us invited to speak at
events like this ...
Active at NIH since 2008
BioTeam & NIH
‣ Our primary goal: make science
easier for researchers at NIH via
scientific computing
‣ Recently involved in many
projects:
• NIH-Wide HPC Assessment
• NIAID HPC Assessment
• NIMH Bioinformatics Assessment
• NCATS IT/Informatics Assessment
• NIH Network Modernization Project
Topic 1: Scariest thing first ...
The biggest meta-issue facing life science informatics
It’s a risky time to be doing Bio-IT
Big Picture / Meta Issue
‣ HUGE revolution in the rate at which
lab platforms are being redesigned,
improved & refreshed
• Example: CCD sensor upgrade on that
confocal microscopy rig just doubled
storage requirements
• Example: The 2D ultrasound imager is
now a 3D imager
• Example: Illumina HiSeq upgrade just
doubled the rate at which you can acquire
genomes. Massive downstream increase
in storage, compute & data movement
needs
‣ For the above examples, do you
think IT was informed in advance?
Science progressing way faster than IT can refresh/change
The Central Problem Is ...
‣ Instrumentation & protocols are changing FAR
FASTER than we can refresh our Research-IT &
Scientific Computing infrastructure
• Bench science is changing month-to-month ...
• ... while our IT infrastructure only gets refreshed every
2-7 years
‣ We have to design systems TODAY that can
support unknown research requirements &
workflows over many years (gulp ...)
The Central Problem Is ...
‣ The easy period is over
‣ 5 years ago we could toss
inexpensive storage and
servers at the problem;
even in a nearby closet or
under a lab bench if
necessary
‣ That does not work any
more; real solutions
required
The new normal for informatics
And a related problem ...
‣ It has never been easier to
acquire vast amounts of data
cheaply and easily
‣ Growth rate of data creation/
ingest exceeds rate at which
the storage industry is
improving disk capacity
‣ Not just a storage lifecycle
problem. This data *moves*
and often needs to be shared
among multiple entities and
providers
• ... ideally without punching holes in
your firewall or consuming all
available internet bandwidth
If we get it wrong ...
‣ Lost opportunity
‣ Missing capability
‣ Frustrated & very vocal scientific staff
‣ Slowed pace of scientific discovery
‣ Problems in recruiting, retention,
publication & product development
Basic Bio/IT Landscape
Compute related design patterns largely static
Core Compute
‣ Linux compute clusters
are still the baseline
compute platform
‣ Even our lab instruments
know how to submit jobs
to common HPC cluster
schedulers
‣ Compute is not hard. It’s a
commodity that is easy to
acquire & deploy in 2014
We have them all
File & Data Types
‣ Massive text files
‣ Massive binary files
‣ Flatfile ‘databases’
‣ Spreadsheets everywhere
‣ Directories w/ 6 million
files
‣ Large files: 600GB+
‣ Small files: 30 KB or smaller
Application characteristics
‣ Mostly SMP/threaded apps
performance bound by IO and/or
RAM
‣ Hundreds of apps, codes & toolkits
‣ 1TB - 2TB RAM “High Memory”
nodes becoming essential
‣ Lots of Perl/Python/R
‣ MPI is rare
• Well written MPI is even rarer
‣ Few MPI apps actually benefit from
expensive low-latency
interconnects*
• *Chemistry, modeling and structure work
is the exception
Storage & Data Management
‣ LifeSci core requirement:
• Shared, simultaneous read/write
access across many instruments,
desktops & HPC silos
‣ NAS = “easiest” option
• Scale Out NAS products are the
mainstream standard
‣ Parallel & Distributed storage
for edge cases and large
organizations with known
performance needs
• Becoming much more common:
GPFS has taken hold in LifeSci
Storage & Data Management
‣ Storage & data mgmt. is the #1
infrastructure headache in life
science environments
‣ Most labs need “peta capable”
storage due to unpredictable
future
• Only a small % will actually hit 1PB
• Often forced to trade away performance
in order to obtain capacity
‣ Object stores, ZFS and commodity
“Nexentastor-style” methods are
making significant inroads
Data Movement & Data Sharing
‣ Peta-scale data movement
needs
• Within an organization
• To/from collaborators
• To/from suppliers
• To/from public data repos
‣ Peta-scale data sharing needs
• Collaborators and partners may be
all over the world
Networking
‣ Major 2014 focus
‣ May surpass storage as our
#1 infrastructure headache
‣ Why?
• Petascale storage meaningless
if you can’t access/move it
• 10-Gig, 40-Gig and 100-Gig
networking will force significant
changes elsewhere in the ‘bio-
IT’ infrastructure
Physical & Network
We Have Both Ingest Problems
‣ Significant physical ingest
occurring in Life Science
• Standard media: naked SATA drives
shipped via Fedex
‣ Cliche example:
• 30 genomes outsourced means 30
drives will soon be sitting in your mail
pile
‣ Organizations often use similar
methods to freight data between
buildings and among geographic
sites
Physical Ingest Just Plain Nasty
‣ Most common high-speed
network: FedEx
‣ Easy to talk about in theory
‣ Seems “easy” to scientists
and even IT at first glance
‣ Really really nasty in practice
• Incredibly time consuming
• Significant operational burden
• Easy to do badly / lose data
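One way to reduce the "easy to do badly / lose data" risk is to have the sender ship a checksum manifest alongside the drives and to verify every file after it is copied off. The sketch below is only an illustration under that assumption: the function names (`sha256_of`, `build_manifest`, `verify_copy`) and the choice of SHA-256 are hypothetical and do not come from the talk.

```python
from __future__ import annotations

import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1024 * 1024) -> str:
    """Hash a file in fixed-size chunks so very large files never load into RAM."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict[str, str]:
    """Map each file path (relative to root) to its SHA-256 digest."""
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def verify_copy(sender_manifest: dict[str, str], received_root: Path) -> list[str]:
    """Return relative paths that are missing or whose content differs after ingest."""
    received = build_manifest(received_root)
    return sorted(
        name
        for name, digest in sender_manifest.items()
        if received.get(name) != digest
    )
```

An empty list from `verify_copy` means every file named in the sender's manifest arrived intact. Anything else is a list of files to request again before the drive is wiped or returned.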
And huge need for fast(er) research networks!
Huge Need For Network Ingest
1. Public data repositories have
petabytes of useful data
2. Collaborators still need to
swap data in serious ways
3. Amazon becoming an
important repo of public and
private sources
4. Many vendors now “deliver”
to the cloud
It all boils down to this ...
Life Science In One Slide:
‣ Huge compute needs but not intractable and generally
solved via Linux HPC farms. Most of our workloads are
serial/batch in nature
‣ Ludicrous rate of innovation in lab drives a similar rate of
change for our software and tool environment
‣ With science changing faster than IT, emphasis is on
agility and flexibility - we’ll trade performance for some
measure of future proofing
‣ Buried in data. Getting worse. Individual scientists can
generate petascale data streams.
‣ We have all of the Information Lifecycle problems: Storing,
Curating, Managing, Sharing, Ingesting and Moving
Trends: DevOps & Org Charts
The social contract between
scientist and IT is changing forever
You can blame “the cloud” for this
DevOps & Scriptable Everything
‣ On (real) clouds,
EVERYTHING has an API
‣ If it’s got an API you can
automate and orchestrate
it
‣ “scriptable infrastructure”
is now a reality
‣ Driving capabilities that
we will need in 2014 and
beyond
DevOps & Scriptable Everything
‣ Incredible innovation in
the past few years
‣ Driven mainly by
companies with
massive internet
‘fleets’ to manage
‣ ... but the benefits
trickle down to us little
people
... and conquer the enterprise
DevOps will enable hybrid HPC
‣ Cloud automation/
orchestration methods
have been trickling down
into our local
infrastructures
‣ Driving significant impact
on careers, job
descriptions and org charts
‣ These methods are
necessary for emerging
hybrid cloud models for
HPC/sharing
2014: Continue to blur the lines between all these roles
Scientist/SysAdmin/Programmer
‣ IT jobs, roles and
responsibilities are going
to change significantly
‣ SysAdmins must learn to
program in order to
harness automation tools
‣ Programmers &
Scientists can now self-
provision and control
sophisticated IT
resources
2014: Continue to blur the lines between all these roles
Scientist/SysAdmin/Programmer
‣ My take on the future ...
• SysAdmins (Windows & Linux) who
can’t code will have career issues
• Far more control is going into the
hands of the research end user
• IT support roles will radically change
-- no longer owners or gatekeepers
‣ IT will “own” policies,
procedures, reference patterns,
identity mgmt, security & best
practices
‣ Research will control the
“what”, “when” and “how big”
Research needing more and more compute
IT Orgs are Changing as well...
‣ 25% of researchers will
need HPC this year
‣ 75% will need high-
volume storage
‣ IT evolved from
administrative need
• Science started grabbing
resources
• IT either adapted or was
replaced
Research needing more and more compute
IT Orgs are Changing as well...
‣ Three types of adaptations
• IT evolved to include research
IT support
• IT split into research IT and
corporate IT
• IT became primarily research
org -> run by CSIO
‣ Orgs with scientific
missions need adaptive IT
with stake in research
projects -> restrictions kill
science
Trends: Compute
Compute:
‣ Kind of boring. Solved
problem in 2014
‣ Compute power is a
commodity
• Inexpensive relative to other
costs
• Far less vendor differentiation
than storage
• Easy to acquire; easy to
deploy
Defensive hedge against Big Data / HDFS
Compute: Local Disk is Back
‣ We’ve started to see organizations move
away from blade servers and 1U pizza box
enclosures for HPC
‣ The “new normal” may be 4U enclosures
with massive local disk spindles - not
occupied, just available
‣ Why? Hadoop & Big Data
‣ This is a defensive hedge against future
HDFS or similar requirements
• Remember the ‘meta’ problem - science is
changing far faster than we can refresh IT. This
is a defensive future-proofing play.
‣ Hardcore Hadoop rigs sometimes operate
at 1:1 ratio between core count and disk
count
New and refreshed HPC systems running many node types
Compute: Huge trend in ‘diversity’
‣ Accelerated trend since at least 2012 ...
• HPC compute resources no longer homogeneous;
many types and flavors now deployed in single
HPC stacks
‣ Newer clusters mix and match node types to
fit the known use cases:
• GPU nodes for compute
• GPU nodes for visualization
• Large memory nodes (512GB +)
• Very Large memory nodes (1TB +)
• ‘Fat’ nodes with many CPU cores
• ‘Thin’ nodes with super-fast CPUs
• Analytic nodes with SSD, FusionIO, flash or large
local disk for ‘big data’ tasks
GPUs, Coprocessors & FPGAs
Compute: Hardware Acceleration
‣ Specialized hardware
acceleration has its place
but will not take over the
world
• “... the activation energy required
for a scientist to use this stuff is
generally quite high ...”
‣ GPU, Phi and FPGA best
used in large scale pipelines
or as specific solution to a
singular pain point
Also known as hybrid clouds
Emerging Trend: Hybrid HPC
‣ Relatively new idea
• small local footprint
• large, dynamic, scalable, orchestrated
public cloud component
‣ DevOps is key to making this work
‣ High-speed network to public
cloud required
‣ Software interface layer acting as
the mediator between local and
public resources
‣ Good for tight budgets, has to be
done right to work
‣ Not many working examples yet
Trends: Network
Network: Speed @ Core and Edge
‣ Huge potential pain point
‣ May surpass storage as our
#1 infrastructure headache
‣ Petascale data is useless if
you can’t move it or access
it fast enough
‣ Don’t be smug about 10
Gigabit - folks need to start
thinking *now* about 40 and
even 100 Gigabit Ethernet
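A quick back-of-the-envelope calculation shows why 10 Gigabit is nothing to be smug about at petascale. This is a minimal sketch that assumes a decimal petabyte (10^15 bytes) and an idealized link running at full line rate. The `efficiency` parameter is a hypothetical knob for protocol overhead and contention, not a measured value.

```python
def transfer_days(data_bytes: float, link_gbps: float, efficiency: float = 1.0) -> float:
    """Days needed to move data_bytes over a link of link_gbps gigabits per second."""
    if link_gbps <= 0 or not 0 < efficiency <= 1:
        raise ValueError("link speed must be positive and efficiency in (0, 1]")
    bits = data_bytes * 8
    seconds = bits / (link_gbps * 1e9 * efficiency)
    return seconds / 86_400  # seconds per day


PETABYTE = 1e15  # decimal petabyte, in bytes (an assumption; the slides do not define it)

for gbps in (10, 40, 100):
    days = transfer_days(PETABYTE, gbps)
    print(f"{gbps:>3} Gigabit: {days:.1f} days per petabyte at full line rate")
```

Even under these best-case assumptions, one petabyte takes roughly 9.3 days at 10 Gigabit, 2.3 days at 40 Gigabit and 0.9 days at 100 Gigabit. Real transfers sharing a link with other traffic will be slower.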
Network: Speed @ Core and Edge
‣ Remember 2004 when
research storage
requirements started to dwarf
what the enterprise was
using?
‣ Same thing is happening now
for networking
‣ Research core, edge and top-
of-rack networking speeds
may exceed what the rest of
the organization has
standardized on
Massive data movement needs are driving innovation
NIH Tackling this now!
‣ Currently installing
100Gb research network
‣ Will tackle the petascale
data movement head on
• NIH gaining ground on
1PB/month
• Collaboration, core
compute, data commons,
external data sources
• Science DMZ!
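To put the 1PB/month figure in network terms, the sketch below converts a monthly data volume into the average rate a link would have to sustain around the clock. It assumes a decimal petabyte and a 30-day month. Both are illustrative assumptions, and `sustained_gbps` is a hypothetical helper name.

```python
def sustained_gbps(bytes_per_month: float, days_in_month: int = 30) -> float:
    """Average gigabits per second needed to move bytes_per_month within one month."""
    if days_in_month <= 0:
        raise ValueError("days_in_month must be positive")
    seconds = days_in_month * 86_400
    return bytes_per_month * 8 / seconds / 1e9


rate = sustained_gbps(1e15)  # 1 PB/month, decimal petabyte assumed
print(f"1 PB/month needs about {rate:.2f} Gbit/s sustained, 24x7")
```

That works out to a little over 3 Gbit/s as a continuous average. Research traffic tends to arrive in bursts rather than as a steady stream, so peak demand can sit well above the average, which is part of the case for headroom on the research network.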
Network: ‘ScienceDMZ’
‣ “ScienceDMZ” concept is real and necessary
‣ BioTeam will be building them in 2014 and
beyond
‣ Central premise:
• Legacy firewall, network and security methods
architected for “many small data flows” use cases
• Not built to handle smaller #s of massive data
flows
• Also very hard to deploy ‘traditional’ security gear on
10Gigabit and faster networks
‣ More details, background & documents at
http://fasterdata.es.net/science-dmz/
[Figure: three 10GE links; DTN traffic with wire-speed bursts alongside background traffic or competing bursts]
Network: ‘ScienceDMZ’
‣ Start thinking/discussing this sooner rather
than later
‣ Just like “the cloud” this may fundamentally
change internal operations and technology
‣ Will also require conscious buy-in and
support from senior network, security and
risk management professionals
• ... these talks take time. Best to plan ahead
Network: ‘ScienceDMZ’
‣ A Science DMZ has 3 required components:
1. Very fast “low-friction” network links and paths with
security policy and enforcement specific to scientific
workflows
2. Dedicated, high performance data transfer nodes
(“DTNs”) highly optimized for high speed data xfer
3. Dedicated network performance/measurement nodes
Simple Science DMZ:
Image source: “The Science DMZ: Introduction & Architecture” -- esnet
More hype than useful reality at the moment
Network: SDN Hype vs. Reality
‣ Software Defined Networking
(“SDN”) is the new buzzword
‣ It will become pervasive and will
change how we build and architect
things
‣ But ...
‣ Not hugely practical at the moment
for most environments
• We need far more than APIs that control
port forwarding behavior on switches
• More time needed for all of the related
technologies and methods to coalesce
into something broadly useful and usable
Trends: Storage
Storage
‣ Still the biggest expense, biggest headache and
scariest systems to design in modern life science
informatics environments
‣ Many of the pain points we’ve talked about for years
are still in place:
• Explosive growth forcing tradeoffs in capacity over performance
• Lots of monolithic single tiers of storage
• Critical need to actively manage data through its full life cycle
(just storing data is not enough ...)
• Need for post-POSIX solutions such as iRODS and other
metadata-aware data repositories
Storage Trends
‣ The large but monolithic storage platforms we’ve
built up over the years are no longer sufficient
• Do you know how many people are running a single large
scale-out NAS or parallel filesystem? Most of us!
‣ Tiered storage is now an absolute requirement
• At a minimum we need an active storage tier plus something
far cheaper/deeper for cold files
‣ Expect the tiers to involve multiple vendors,
products and technologies
• The Tier1 storage vendors tend to have unacceptable
pricing for their “all in one” tiered data management solutions
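As an illustration of the "active tier plus something cheaper/deeper for cold files" idea, the sketch below walks a directory tree and lists files that have not been accessed for a given number of days. These would be candidates for migration to the cold tier. It is only a sketch: `cold_files` is a hypothetical helper, the 90-day threshold in the usage note is arbitrary, and it assumes the filesystem records access times (mounts using `noatime` will not give meaningful results).

```python
import os
import time
from pathlib import Path
from typing import Iterator, Optional


def cold_files(
    root: Path, max_idle_days: float, now: Optional[float] = None
) -> Iterator[Path]:
    """Yield files under root whose last access is older than max_idle_days."""
    now = time.time() if now is None else now
    cutoff = now - max_idle_days * 86_400  # seconds per day
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = Path(dirpath) / name
            # st_atime is the last access time in seconds since the epoch
            if path.stat().st_atime < cutoff:
                yield path
```

Calling `list(cold_files(Path("/some/project"), 90))` on a placeholder path would report everything untouched for 90 days. The generator only reports candidates and never moves or deletes data, which leaves the actual migration policy to whatever tiering product or script is in use.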
The Tier 1 storage vendors may be too expensive ...
Storage: Disruptive stuff ahead
‣ BioTeam has built 1-petabyte ZFS-based storage pools from
commodity whitebox hardware for about $100,000
‣ Infinidat “IZbox” provides 1 petabyte of usable NAS as a turnkey
appliance for roughly $375,000
• Both of these would be a nice, cost-effective archive or “cold” tier for less-
active file and data storage
• Solutions like these cost far, far less than what Tier 1 storage vendors would
charge for a petabyte of usable storage
• ... of course they come with their own risks and operational burden. This is an
area where proper research and due diligence is essential
‣ Companies like Avere Systems are producing boxes that unify
disparate storage tiers and link them to cloud and object stores
• This is a route to unifying “tier 1” storage with the “cheap & deep” storage
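Normalizing the two price points above to dollars per terabyte makes them easier to compare with other quotes. The sketch below assumes 1 PB = 1,000 TB and treats both figures as acquisition cost only. It ignores staff time, power, support contracts and the operational risk the slide warns about. `cost_per_tb` is a hypothetical helper name.

```python
def cost_per_tb(total_cost_usd: float, capacity_pb: float) -> float:
    """Acquisition cost in US dollars per terabyte, assuming 1 PB = 1,000 TB."""
    if capacity_pb <= 0:
        raise ValueError("capacity_pb must be positive")
    return total_cost_usd / (capacity_pb * 1_000)


# Figures quoted on the slide; capacity basis (raw vs usable) may differ between the two
options = {
    "ZFS on commodity whitebox hardware": 100_000,
    "Infinidat IZbox turnkey appliance": 375_000,
}

for name, price in options.items():
    print(f"{name}: ${cost_per_tb(price, 1):,.0f} per TB")
```

On those numbers the commodity build comes to about $100 per TB and the turnkey appliance to about $375 per TB.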
The road ahead ...
Some final thoughts
Future Trends and Patterns
‣ Data generation out-
pacing technology
‣ Cheap/easy laboratory
assays taking over
• Researchers largely don’t
know what to do with it all
• Holding on to the data until
someone figures it out
• This will cause some
interesting headaches for IT
• Huge need for real “Big Data”
applications to be developed
Some final thoughts
Future Trends and Patterns
‣ Unless there’s an investment
in ultra-high speed
networking, we need to change
how we think about analysis
‣ Data commons are becoming
a precedent
• Need to minimize the movement of
data
• Include compute power and
analysis platform with data
commons
‣ Move the analysis to the data,
don’t move the data
• Requires sharing/Large core
institutional resources
Some final thoughts
Future Trends & Patterns
‣ Compute continues to become easier
‣ Data movement (physical & network) gets
harder
‣ Cost of storage will be dwarfed by “cost of
managing stored data”
‣ We can see end-of-life for our current IT
architecture and design patterns; new patterns
will start to appear over next 2-5 years
‣ We’ve got a new headache to worry about ...
A new challenge ...
Future Trends & Patterns
‣ Responsible sharing of clinical and genomic data
will be the grand challenge of the post-Human
Genome Project era
‣ We HAVE to get it right
‣ The ‘Global Alliance’ whitepaper cosigned by 70+
organizations is a must read:
• Short link to whitepaper: http://biote.am/9j
• Long link: https://www.broadinstitute.org/files/news/pdfs/
GAWhitePaperJune3.pdf
• NIH will be critical in making this work for the world
end; Thanks!
Data Engineer's Lunch #85: Designing a Modern Data Stack
 
The key to unlocking the Value in the IoT? Managing the Data!
The key to unlocking the Value in the IoT? Managing the Data!The key to unlocking the Value in the IoT? Managing the Data!
The key to unlocking the Value in the IoT? Managing the Data!
 
Big dataplatform operationalstrategy
Big dataplatform operationalstrategyBig dataplatform operationalstrategy
Big dataplatform operationalstrategy
 
Data-intensive bioinformatics on HPC and Cloud
Data-intensive bioinformatics on HPC and CloudData-intensive bioinformatics on HPC and Cloud
Data-intensive bioinformatics on HPC and Cloud
 
How to select a modern data warehouse and get the most out of it?
How to select a modern data warehouse and get the most out of it?How to select a modern data warehouse and get the most out of it?
How to select a modern data warehouse and get the most out of it?
 
(Big) Data (Science) Skills
(Big) Data (Science) Skills(Big) Data (Science) Skills
(Big) Data (Science) Skills
 
Research Data Management at Imperial College London
Research Data Management at Imperial College LondonResearch Data Management at Imperial College London
Research Data Management at Imperial College London
 

Dernier

Isotopic evidence of long-lived volcanism on Io
Isotopic evidence of long-lived volcanism on IoIsotopic evidence of long-lived volcanism on Io
Isotopic evidence of long-lived volcanism on IoSérgio Sacani
 
PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...
PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...
PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...Sérgio Sacani
 
GBSN - Microbiology (Unit 1)
GBSN - Microbiology (Unit 1)GBSN - Microbiology (Unit 1)
GBSN - Microbiology (Unit 1)Areesha Ahmad
 
All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...
All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...
All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...Sérgio Sacani
 
DIFFERENCE IN BACK CROSS AND TEST CROSS
DIFFERENCE IN  BACK CROSS AND TEST CROSSDIFFERENCE IN  BACK CROSS AND TEST CROSS
DIFFERENCE IN BACK CROSS AND TEST CROSSLeenakshiTyagi
 
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdfPests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdfPirithiRaju
 
Green chemistry and Sustainable development.pptx
Green chemistry  and Sustainable development.pptxGreen chemistry  and Sustainable development.pptx
Green chemistry and Sustainable development.pptxRajatChauhan518211
 
Raman spectroscopy.pptx M Pharm, M Sc, Advanced Spectral Analysis
Raman spectroscopy.pptx M Pharm, M Sc, Advanced Spectral AnalysisRaman spectroscopy.pptx M Pharm, M Sc, Advanced Spectral Analysis
Raman spectroscopy.pptx M Pharm, M Sc, Advanced Spectral AnalysisDiwakar Mishra
 
Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...
Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...
Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...anilsa9823
 
VIRUSES structure and classification ppt by Dr.Prince C P
VIRUSES structure and classification ppt by Dr.Prince C PVIRUSES structure and classification ppt by Dr.Prince C P
VIRUSES structure and classification ppt by Dr.Prince C PPRINCE C P
 
GBSN - Biochemistry (Unit 1)
GBSN - Biochemistry (Unit 1)GBSN - Biochemistry (Unit 1)
GBSN - Biochemistry (Unit 1)Areesha Ahmad
 
Stunning ➥8448380779▻ Call Girls In Panchshil Enclave Delhi NCR
Stunning ➥8448380779▻ Call Girls In Panchshil Enclave Delhi NCRStunning ➥8448380779▻ Call Girls In Panchshil Enclave Delhi NCR
Stunning ➥8448380779▻ Call Girls In Panchshil Enclave Delhi NCRDelhi Call girls
 
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...ssifa0344
 
Biological Classification BioHack (3).pdf
Biological Classification BioHack (3).pdfBiological Classification BioHack (3).pdf
Biological Classification BioHack (3).pdfmuntazimhurra
 
Zoology 4th semester series (krishna).pdf
Zoology 4th semester series (krishna).pdfZoology 4th semester series (krishna).pdf
Zoology 4th semester series (krishna).pdfSumit Kumar yadav
 
Formation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disksFormation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disksSérgio Sacani
 
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...Sérgio Sacani
 
Botany 4th semester series (krishna).pdf
Botany 4th semester series (krishna).pdfBotany 4th semester series (krishna).pdf
Botany 4th semester series (krishna).pdfSumit Kumar yadav
 
Spermiogenesis or Spermateleosis or metamorphosis of spermatid
Spermiogenesis or Spermateleosis or metamorphosis of spermatidSpermiogenesis or Spermateleosis or metamorphosis of spermatid
Spermiogenesis or Spermateleosis or metamorphosis of spermatidSarthak Sekhar Mondal
 

Dernier (20)

Isotopic evidence of long-lived volcanism on Io
Isotopic evidence of long-lived volcanism on IoIsotopic evidence of long-lived volcanism on Io
Isotopic evidence of long-lived volcanism on Io
 
9953056974 Young Call Girls In Mahavir enclave Indian Quality Escort service
9953056974 Young Call Girls In Mahavir enclave Indian Quality Escort service9953056974 Young Call Girls In Mahavir enclave Indian Quality Escort service
9953056974 Young Call Girls In Mahavir enclave Indian Quality Escort service
 
PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...
PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...
PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...
 
GBSN - Microbiology (Unit 1)
GBSN - Microbiology (Unit 1)GBSN - Microbiology (Unit 1)
GBSN - Microbiology (Unit 1)
 
All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...
All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...
All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...
 
DIFFERENCE IN BACK CROSS AND TEST CROSS
DIFFERENCE IN  BACK CROSS AND TEST CROSSDIFFERENCE IN  BACK CROSS AND TEST CROSS
DIFFERENCE IN BACK CROSS AND TEST CROSS
 
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdfPests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
 
Green chemistry and Sustainable development.pptx
Green chemistry  and Sustainable development.pptxGreen chemistry  and Sustainable development.pptx
Green chemistry and Sustainable development.pptx
 
Raman spectroscopy.pptx M Pharm, M Sc, Advanced Spectral Analysis
Raman spectroscopy.pptx M Pharm, M Sc, Advanced Spectral AnalysisRaman spectroscopy.pptx M Pharm, M Sc, Advanced Spectral Analysis
Raman spectroscopy.pptx M Pharm, M Sc, Advanced Spectral Analysis
 
Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...
Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...
Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...
 
VIRUSES structure and classification ppt by Dr.Prince C P
VIRUSES structure and classification ppt by Dr.Prince C PVIRUSES structure and classification ppt by Dr.Prince C P
VIRUSES structure and classification ppt by Dr.Prince C P
 
GBSN - Biochemistry (Unit 1)
GBSN - Biochemistry (Unit 1)GBSN - Biochemistry (Unit 1)
GBSN - Biochemistry (Unit 1)
 
Stunning ➥8448380779▻ Call Girls In Panchshil Enclave Delhi NCR
Stunning ➥8448380779▻ Call Girls In Panchshil Enclave Delhi NCRStunning ➥8448380779▻ Call Girls In Panchshil Enclave Delhi NCR
Stunning ➥8448380779▻ Call Girls In Panchshil Enclave Delhi NCR
 
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...
 
Biological Classification BioHack (3).pdf
Biological Classification BioHack (3).pdfBiological Classification BioHack (3).pdf
Biological Classification BioHack (3).pdf
 
Zoology 4th semester series (krishna).pdf
Zoology 4th semester series (krishna).pdfZoology 4th semester series (krishna).pdf
Zoology 4th semester series (krishna).pdf
 
Formation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disksFormation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disks
 
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
 
Botany 4th semester series (krishna).pdf
Botany 4th semester series (krishna).pdfBotany 4th semester series (krishna).pdf
Botany 4th semester series (krishna).pdf
 
Spermiogenesis or Spermateleosis or metamorphosis of spermatid
Spermiogenesis or Spermateleosis or metamorphosis of spermatidSpermiogenesis or Spermateleosis or metamorphosis of spermatid
Spermiogenesis or Spermateleosis or metamorphosis of spermatid
 

BioTeam Trends from the Trenches - NIH, April 2014

  • 1. Life Science HPC & Informatics: Trends from the trenches April 2014 Wednesday, April 9, 14
  • 2. BioTeam - Who, What, Why ... ‣ Independent consulting shop ‣ Staffed by scientists forced to learn IT, SW & HPC to get our own research done ‣ 12+ years bridging the “gap” between science, IT & high performance computing ‣ Our wide-ranging work is what gets us invited to speak at events like this ...
  • 3. BioTeam & NIH - Active at NIH since 2008 ‣ Our primary goal: make science easier for researchers at NIH via scientific computing ‣ Recently involved in many projects: • NIH-Wide HPC Assessment • NIAID HPC Assessment • NIMH Bioinformatics Assessment • NCATS IT/Informatics Assessment • NIH Network Modernization Project
  • 4. Topic 1: Scariest thing first ... The biggest meta-issue facing life science informatics
  • 5. It’s a risky time to be doing Bio-IT
  • 6. Big Picture / Meta Issue ‣ HUGE revolution in the rate at which lab platforms are being redesigned, improved & refreshed • Example: CCD sensor upgrade on that confocal microscopy rig just doubled storage requirements • Example: The 2D ultrasound imager is now a 3D imager • Example: Illumina HiSeq upgrade just doubled the rate at which you can acquire genomes. Massive downstream increase in storage, compute & data movement needs ‣ For the above examples, do you think IT was informed in advance?
  • 7. The Central Problem Is ... - Science progressing way faster than IT can refresh/change ‣ Instrumentation & protocols are changing FAR FASTER than we can refresh our Research-IT & Scientific Computing infrastructure • Bench science is changing month-to-month ... • ... while our IT infrastructure only gets refreshed every 2-7 years ‣ We have to design systems TODAY that can support unknown research requirements & workflows over many years (gulp ...)
  • 8. The Central Problem Is ... ‣ The easy period is over ‣ 5 years ago we could toss inexpensive storage and servers at the problem; even in a nearby closet or under a lab bench if necessary ‣ That does not work any more; real solutions required
  • 9. The new normal for informatics
  • 10. And a related problem ... ‣ It has never been easier to acquire vast amounts of data cheaply and easily ‣ Growth rate of data creation/ingest exceeds rate at which the storage industry is improving disk capacity ‣ Not just a storage lifecycle problem. This data *moves* and often needs to be shared among multiple entities and providers • ... ideally without punching holes in your firewall or consuming all available internet bandwidth
  • 11. If we get it wrong ... ‣ Lost opportunity ‣ Missing capability ‣ Frustrated & very vocal scientific staff ‣ Slowed pace of scientific discovery ‣ Problems in recruiting, retention, publication & product development
  • 12. Basic Bio/IT Landscape
  • 13. Core Compute - Compute related design patterns largely static ‣ Linux compute clusters are still the baseline compute platform ‣ Even our lab instruments know how to submit jobs to common HPC cluster schedulers ‣ Compute is not hard. It’s a commodity that is easy to acquire & deploy in 2014
  • 14. File & Data Types - We have them all ‣ Massive text files ‣ Massive binary files ‣ Flatfile ‘databases’ ‣ Spreadsheets everywhere ‣ Directories w/ 6 million files ‣ Large files: 600GB+ ‣ Small files: 30KB or smaller
  • 15. Application characteristics ‣ Mostly SMP/threaded apps performance bound by IO and/or RAM ‣ Hundreds of apps, codes & toolkits ‣ 1TB - 2TB RAM “High Memory” nodes becoming essential ‣ Lots of Perl/Python/R ‣ MPI is rare • Well written MPI is even rarer ‣ Few MPI apps actually benefit from expensive low-latency interconnects* • *Chemistry, modeling and structure work is the exception
  • 16. Storage & Data Management ‣ LifeSci core requirement: • Shared, simultaneous read/write access across many instruments, desktops & HPC silos ‣ NAS = “easiest” option • Scale Out NAS products are the mainstream standard ‣ Parallel & Distributed storage for edge cases and large organizations with known performance needs • Becoming much more common: GPFS has taken hold in LifeSci
  • 17. Storage & Data Management ‣ Storage & data mgmt. is the #1 infrastructure headache in life science environments ‣ Most labs need “peta capable” storage due to unpredictable future • Only a small % will actually hit 1PB • Often forced to trade away performance in order to obtain capacity ‣ Object stores, ZFS and commodity “Nexentastor-style” methods are making significant inroads
  • 18. Data Movement & Data Sharing ‣ Peta-scale data movement needs • Within an organization • To/from collaborators • To/from suppliers • To/from public data repos ‣ Peta-scale data sharing needs • Collaborators and partners may be all over the world
  • 19. Networking ‣ Major 2014 focus ‣ May surpass storage as our #1 infrastructure headache ‣ Why? • Petascale storage meaningless if you can’t access/move it • 10-Gig, 40-Gig and 100-Gig networking will force significant changes elsewhere in the ‘bio-IT’ infrastructure
  • 20. We Have Both Ingest Problems - Physical & Network ‣ Significant physical ingest occurring in Life Science • Standard media: naked SATA drives shipped via FedEx ‣ Cliche example: • 30 genomes outsourced means 30 drives will soon be sitting in your mail pile ‣ Organizations often use similar methods to freight data between buildings and among geographic sites
  • 21. Physical Ingest Just Plain Nasty ‣ Most common high-speed network: FedEx ‣ Easy to talk about in theory ‣ Seems “easy” to scientists and even IT at first glance ‣ Really really nasty in practice • Incredibly time consuming • Significant operational burden • Easy to do badly / lose data
  • 22. Huge Need For Network Ingest - And huge need for fast(er) research networks! 1. Public data repositories have petabytes of useful data 2. Collaborators still need to swap data in serious ways 3. Amazon becoming an important repo of public and private sources 4. Many vendors now “deliver” to the cloud
  • 23. It all boils down to this ...
  • 24. Life Science In One Slide: ‣ Huge compute needs but not intractable and generally solved via Linux HPC farms. Most of our workloads are serial/batch in nature ‣ Ludicrous rate of innovation in lab drives a similar rate of change for our software and tool environment ‣ With science changing faster than IT, emphasis is on agility and flexibility - we’ll trade performance for some measure of future proofing ‣ Buried in data. Getting worse. Individual scientists can generate petascale data streams. ‣ We have all of the Information Lifecycle problems: Storing, Curating, Managing, Sharing, Ingesting and Moving
  • 25. Trends: DevOps & Org Charts
  • 26. The social contract between scientist and IT is changing forever
  • 27. You can blame “the cloud” for this
  • 28. DevOps & Scriptable Everything ‣ On (real) clouds, EVERYTHING has an API ‣ If it’s got an API you can automate and orchestrate it ‣ “scriptable infrastructure” is now a reality ‣ Driving capabilities that we will need in 2014 and beyond
  • 29. DevOps & Scriptable Everything ‣ Incredible innovation in the past few years ‣ Driven mainly by companies with massive internet ‘fleets’ to manage ‣ ... but the benefits trickle down to us little people
  • 30. DevOps will enable hybrid HPC ... and conquer the enterprise ‣ Cloud automation/orchestration methods have been trickling down into our local infrastructures ‣ Driving significant impact on careers, job descriptions and org charts ‣ These methods are necessary for emerging hybrid cloud models for HPC/sharing
  • 31. Scientist/SysAdmin/Programmer - 2014: Continue to blur the lines between all these roles ‣ IT jobs, roles and responsibilities are going to change significantly ‣ SysAdmins must learn to program in order to harness automation tools ‣ Programmers & Scientists can now self-provision and control sophisticated IT resources
  • 32. Scientist/SysAdmin/Programmer - 2014: Continue to blur the lines between all these roles ‣ My take on the future ... • SysAdmins (Windows & Linux) who can’t code will have career issues • Far more control is going into the hands of the research end user • IT support roles will radically change -- no longer owners or gatekeepers ‣ IT will “own” policies, procedures, reference patterns, identity mgmt, security & best practices ‣ Research will control the “what”, “when” and “how big”
  • 33. IT Orgs are Changing as well... - Research needing more and more compute ‣ 25% of researchers will need HPC this year ‣ 75% will need high-volume storage ‣ IT evolved from administrative need • Science started grabbing resources • IT either adapted or was replaced
  • 34. IT Orgs are Changing as well... - Research needing more and more compute ‣ Three types of adaptations • IT evolved to include research IT support • IT split into research IT and corporate IT • IT became primarily research org -> run by CSIO ‣ Orgs with scientific missions need adaptive IT with stake in research projects -> restrictions kill science
  • 36. Compute: ‣ Kind of boring. Solved problem in 2014 ‣ Compute power is a commodity • Inexpensive relative to other costs • Far less vendor differentiation than storage • Easy to acquire; easy to deploy
  • 37. Compute: Local Disk is Back - Defensive hedge against Big Data / HDFS ‣ We’ve started to see organizations move away from blade servers and 1U pizza box enclosures for HPC ‣ The “new normal” may be 4U enclosures with massive local disk spindles - not occupied, just available ‣ Why? Hadoop & Big Data ‣ This is a defensive hedge against future HDFS or similar requirements • Remember the ‘meta’ problem - science is changing far faster than we can refresh IT. This is a defensive future-proofing play. ‣ Hardcore Hadoop rigs sometimes operate at 1:1 ratio between core count and disk count
  • 38. Compute: Huge trend in ‘diversity’ - New and refreshed HPC systems running many node types ‣ Accelerated trend since at least 2012 ... • HPC compute resources no longer homogeneous; many types and flavors now deployed in single HPC stacks ‣ Newer clusters mix-and-match to match the known use cases: • GPU nodes for compute • GPU nodes for visualization • Large memory nodes (512GB +) • Very Large memory nodes (1TB +) • ‘Fat’ nodes with many CPU cores • ‘Thin’ nodes with super-fast CPUs • Analytic nodes with SSD, FusionIO, flash or large local disk for ‘big data’ tasks
  • 39. Compute: Hardware Acceleration - GPUs, Coprocessors & FPGAs ‣ Specialized hardware acceleration has its place but will not take over the world • “... the activation energy required for a scientist to use this stuff is generally quite high ...” ‣ GPU, Phi and FPGA best used in large scale pipelines or as specific solution to a singular pain point
  • 40. Emerging Trend: Hybrid HPC - Also known as hybrid clouds ‣ Relatively new idea • small local footprint • large, dynamic, scalable, orchestrated public cloud component ‣ DevOps is key to making this work ‣ High-speed network to public cloud required ‣ Software interface layer acting as the mediator between local and public resources ‣ Good for tight budgets, has to be done right to work ‣ Not many working examples yet
  • 42. Network: Speed @ Core and Edge ‣ Huge potential pain point ‣ May surpass storage as our #1 infrastructure headache ‣ Petascale data is useless if you can’t move it or access it fast enough ‣ Don’t be smug about 10 Gigabit - folks need to start thinking *now* about 40 and even 100 Gigabit Ethernet
  • 43. Network: Speed @ Core and Edge ‣ Remember 2004 when research storage requirements started to dwarf what the enterprise was using? ‣ Same thing is happening now for networking ‣ Research core, edge and top-of-rack networking speeds may exceed what the rest of the organization has standardized on
  • 44. NIH Tackling this now! - Massive data movement needs are driving innovation ‣ Currently installing 100Gb research network ‣ Will tackle the petascale data movement head on • NIH gaining ground on 1PB/month • Collaboration, core compute, data commons, external data sources • Science DMZ!
  • 45. Network: ‘ScienceDMZ’ ‣ “ScienceDMZ” concept is real and necessary ‣ BioTeam will be building them in 2014 and beyond ‣ Central premise: • Legacy firewall, network and security methods architected for “many small data flows” use cases • Not built to handle smaller #s of massive data flows • Also very hard to deploy ‘traditional’ security gear on 10 Gigabit and faster networks ‣ More details, background & documents at http://fasterdata.es.net/science-dmz/ [Figure: background traffic or competing bursts vs. DTN traffic with wire-speed bursts on 10GE links]
  • 46. Network: ‘ScienceDMZ’ ‣ Start thinking/discussing this sooner rather than later ‣ Just like “the cloud” this may fundamentally change internal operations and technology ‣ Will also require conscious buy-in and support from senior network, security and risk management professionals • ... these talks take time. Best to plan ahead
  • 47. Network: ‘ScienceDMZ’ ‣ A Science DMZ has 3 required components: 1. Very fast “low-friction” network links and paths with security policy and enforcement specific to scientific workflows 2. Dedicated, high performance data transfer nodes (“DTNs”) highly optimized for high speed data transfer 3. Dedicated network performance/measurement nodes
  • 48. Simple Science DMZ (image source: “The Science DMZ: Introduction & Architecture”, ESnet)
  • 49. Network: SDN Hype vs. Reality - More hype than useful reality at the moment ‣ Software Defined Networking (“SDN”) is the new buzzword ‣ It will become pervasive and will change how we build and architect things ‣ But ... ‣ Not hugely practical at the moment for most environments • We need far more than APIs that control port forwarding behavior on switches • More time needed for all of the related technologies and methods to coalesce into something broadly useful and usable
  • 51. Storage ‣ Still the biggest expense, biggest headache and scariest systems to design in modern life science informatics environments ‣ Many of the pain points we’ve talked about for years are still in place: • Explosive growth forcing tradeoffs in capacity over performance • Lots of monolithic single tiers of storage • Critical need to actively manage data through its full life cycle (just storing data is not enough ...) • Need for post-POSIX solutions such as iRODS and other metadata-aware data repositories
  • 52. Storage Trends ‣ The large but monolithic storage platforms we’ve built up over the years are no longer sufficient • Do you know how many people are running a single large scale-out NAS or parallel filesystem? Most of us! ‣ Tiered storage is now an absolute requirement • At a minimum we need an active storage tier plus something far cheaper/deeper for cold files ‣ Expect the tiers to involve multiple vendors, products and technologies • The Tier 1 storage vendors tend to have unacceptable pricing for their “all in one” tiered data management solutions
  • 53. Storage: Disruptive stuff ahead - The Tier 1 storage vendors may be too expensive ... ‣ BioTeam has built 1 petabyte ZFS-based storage pools from commodity whitebox hardware for about $100,000 ‣ Infinidat “IZbox” provides 1 petabyte of usable NAS as a turnkey appliance for roughly $375,000 • Both of these would be a nice, cost-effective archive or “cold” tier for less-active file and data storage • Solutions like these cost far, far less than what Tier 1 storage vendors would charge for a petabyte of usable storage • ... of course they come with their own risks and operational burden. This is an area where proper research and due diligence is essential ‣ Companies like Avere Systems are producing boxes that unify disparate storage tiers and link them to cloud and object stores • This is a route to unifying “tier 1” storage with the “cheap & deep” storage
  • 54. The road ahead ...
  • 55. Future Trends and Patterns - Some final thoughts ‣ Data generation outpacing technology ‣ Cheap/easy laboratory assays taking over • Researchers largely don’t know what to do with it all • Holding on to the data until someone figures it out • This will cause some interesting headaches for IT • Huge need for real “Big Data” applications to be developed
  • 56. Future Trends and Patterns - Some final thoughts ‣ Unless there’s an investment in ultra-high speed networking, we need to change how we think about analysis ‣ Data commons are becoming a precedent • Need to minimize the movement of data • Include compute power and analysis platform with data commons ‣ Move the analysis to the data, don’t move the data • Requires sharing/large core institutional resources
  • 57. Future Trends & Patterns - Some final thoughts ‣ Compute continues to become easier ‣ Data movement (physical & network) gets harder ‣ Cost of storage will be dwarfed by “cost of managing stored data” ‣ We can see end-of-life for our current IT architecture and design patterns; new patterns will start to appear over next 2-5 years ‣ We’ve got a new headache to worry about ...
  • 58. Future Trends & Patterns - A new challenge ... ‣ Responsible sharing of clinical and genomic data will be the grand challenge of the post human genome project era ‣ We HAVE to get it right ‣ The ‘Global Alliance’ whitepaper cosigned by 70+ organizations is a must read: • Short link to whitepaper: http://biote.am/9j • Long link: https://www.broadinstitute.org/files/news/pdfs/GAWhitePaperJune3.pdf • NIH will be critical in making this work for the world
  • 59. end; Thanks!
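
The networking slides mention a target of roughly 1PB/month and link speeds of 10, 40 and 100 Gigabit. Those figures can be put into rough numbers with a little arithmetic. What follows is a minimal sketch, not something from the talk: it assumes decimal petabytes (10^15 bytes), a 30-day month and an idealized link with no protocol overhead. The function names and the `efficiency` parameter are illustrative assumptions.

```python
SECONDS_PER_DAY = 86_400
BITS_PER_PETABYTE = 8 * 10**15  # decimal petabyte (10**15 bytes), an assumption


def required_gbps(petabytes: float, days: float) -> float:
    """Sustained rate in Gbit/s needed to move `petabytes` within `days`."""
    bits = petabytes * BITS_PER_PETABYTE
    return bits / (days * SECONDS_PER_DAY) / 1e9


def transfer_days(petabytes: float, link_gbps: float, efficiency: float = 1.0) -> float:
    """Days needed to move `petabytes` over a link of `link_gbps` Gbit/s.

    `efficiency` is the fraction of nominal link speed actually achieved
    (1.0 means an idealized, fully utilized link).
    """
    bits = petabytes * BITS_PER_PETABYTE
    seconds = bits / (link_gbps * 1e9 * efficiency)
    return seconds / SECONDS_PER_DAY


if __name__ == "__main__":
    print(f"1 PB in 30 days needs about {required_gbps(1, 30):.2f} Gbit/s sustained")
    for gbps in (1, 10, 40, 100):
        print(f"1 PB over {gbps:>3} Gbit/s: about {transfer_days(1, gbps):.1f} days")
```

Under these assumptions, 1PB/month works out to a bit over 3 Gbit/s sustained around the clock, which is why a shared 10 Gigabit link leaves little headroom once real-world efficiency and competing traffic are taken into account.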