Note: Contact me directly at dag@bioteam.net if you would like a PDF download of these slides
This is Chris Dagdigian’s 10th year delivering his no-holds-barred, candid state-of-the-industry address at Bio-IT World, and we are not going to let a pandemic stop him.
Instead of his typical talk, five distinguished panelists will join Chris for a spirited discussion of current events, scientific computing, and the impacts of the COVID-19 pandemic:
6. Trends from the Trenches (digital)
Twitter: chris_dag
<dag@bioteam.net>
About today’s event
● 10th Anniversary:
“Trends From The Trenches @ Bio-IT World”
● Postponed conference? No problem!
○ We can’t be together physically
○ But we can bring our friends and colleagues
together virtually
11. What is driving Bio-IT requirements in 2020?

Science Driver: Genomics and Bioinformatics
Context: Historically dominant consumer of both storage and compute resources. This will continue as sequencing becomes less expensive and more widely used in both lab and clinic settings.
IT Impact:
● Storage capacity
● Non-GPU computing
● Large-memory computing
● Data ingest & movement

Science Driver: Image-based data acquisition and analysis
Context: The fastest-growing IT driver BioTeam observes “in the trenches” continues to be image capture and image-based storage, driven by the increasing importance of light microscopy (confocal and lattice light-sheet), 3D microscopy, CryoEM, and MRI/fMRI image analysis. A number of BioTeam clients deployed CryoEM in 2019-2020.
IT Impact:
● Storage capacity
● Storage performance
● GPU computing
● Large-scale data ingest & movement
12. What is driving Bio-IT requirements in 2020?

Science Driver: ML and AI
Context: ML and AI techniques are expected to make significant future contributions to Bio-IT requirements and platforms. These approaches may need hardware beyond the general-purpose GPU, including advanced GPUs, FPGAs, and neural processors.
IT Impact:
● Storage performance
● Storage capacity
● GPU computing
● Cloud workload migration

Science Driver: Chemistry and Molecular Dynamics
Context: Computational chemistry and MD simulation requirements differ significantly from bioinformatics/genomics requirements. It is worth noting that chemists are capable of consuming nearly infinite amounts of compute capacity -- if more power is available they simply run longer or more complex simulations.
IT Impact:
● GPU computing
● Scratch storage performance
● Cloud workload migration
13. What is driving Bio-IT requirements in 2020?
● Notice anything in the prior science/IT driver list?
○ Storage (and changing requirements for storage in 2020 ...) plays a big part
● Fortunate that this event is being sponsored by a very interesting storage player
● Since VAST won’t talk tech today I’ll leave this URL for the nerds who want a deeper dive.
This was the original write-up by @glennklockwood that drove our interest:
https://glennklockwood.blogspot.com/2019/02/vast-datas-storage-system-architecture.html
15. What are vendors most likely to lie about in 2020? ML & AI, of course!
● Every IT purchase cycle includes breathlessly overhyped tech, heavily stage-managed & subsidized reference projects, and aggressive-to-the-point-of-misleading sales techniques
● ML and AI have been like this for a while now and this stuff is creeping into *every* product pitch
● In the hype zone you are also more likely to encounter people who say creepy sexualized stuff like “... open the Kimono ...” in purportedly professional meetings
16. What are vendors most likely to lie about in 2020? ML & AI, of course!
● Fact 1: These methods are real, beneficial, and currently driving significant transformation in life science & healthcare
● Fact 2: Within the hype zone of new tech it is essential to approach things cautiously, carefully, and a bit cynically (test all claims!)
○ Be cautious when buying big into stuff your org may not be able to fully exploit, as this market innovates extremely fast
● Fact 3: Many end-users are still getting up to speed; the market knows this and sometimes relies on customer naivety or C-suite bandwagon pressure
17. Today’s focus: Scientific Data
● Why 1: Petascale data storage has been easy for many years now; managing and understanding the billions of files we store at petascale is still a gnarly problem
● Why 2: Data Management, Data Movement and Data Federation/Access still make up a significant percentage of BioTeam’s informatics-focused consulting
● Why 3: Some folks are nearing the limits of what can be sensibly done with standard scale-out NAS
● Why 4: Image-based acquisition/analysis and ML/AI workloads are changing our baseline requirements for scientific data storage system capabilities in 2020
23. { Big } data ...
Me in years past:
● “Far easier to acquire or generate data faster than it can be effectively stored over its full lifecycle”
● “Storage pricing not decreasing fast enough to match our increase in consumption”
May 2020:
● CryoEM: “Hold my sensor ...”
● This disconnect was OK when most of us were at the low end of peta-capable storage system limits
○ Rudderless expansion requires only ${money}; no leadership, no ownership, and no difficult conversations with scientists
○ ... now seeing the limits of this style
24. { Big } data ...
Me in years past:
● “At the 1-petabyte level your scientific storage platform needs a human data curator and real governance”
May 2020:
● BioTeam has seen multiple 10+ petabyte orgs in 2019-2020 with little to no governance, standards, or human-led curation
● “Data awareness” is a competitive differentiator; organizations can succeed or fail on this capability alone
● Human data wranglers now need sophisticated storage reporting and metadata-aware tooling to perform their role (a minimal sketch follows below)
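A minimal sketch of the kind of per-project storage report a data wrangler needs, in Python. The /data/projects layout, and keying the report off top-level project directories, are illustrative assumptions, not any specific product:

```python
#!/usr/bin/env python3
"""Per-project storage report -- a sketch, not production tooling.

Assumes a layout like /data/projects/<project>/... ; PROJECTS_ROOT is a
hypothetical path, adjust for your site.
"""
import os
import time

PROJECTS_ROOT = "/data/projects"   # hypothetical project-based storage root

def project_report(root):
    """Yield (project, bytes, file count, idle days) per top-level dir."""
    for entry in os.scandir(root):
        if not entry.is_dir(follow_symlinks=False):
            continue
        total_bytes, newest_atime, nfiles = 0, 0.0, 0
        for dirpath, _dirs, files in os.walk(entry.path):
            for name in files:
                try:
                    st = os.stat(os.path.join(dirpath, name), follow_symlinks=False)
                except OSError:
                    continue   # file vanished or permission denied; skip
                total_bytes += st.st_size
                newest_atime = max(newest_atime, st.st_atime)
                nfiles += 1
        idle_days = (time.time() - newest_atime) / 86400 if nfiles else float("nan")
        yield entry.name, total_bytes, nfiles, idle_days

if __name__ == "__main__":
    print(f"{'project':<24}{'TiB':>8}{'files':>12}{'idle_days':>12}")
    for name, nbytes, nfiles, idle in sorted(project_report(PROJECTS_ROOT),
                                             key=lambda r: -r[1]):
        print(f"{name:<24}{nbytes / 2**40:>8.2f}{nfiles:>12}{idle:>12.0f}")
```

Real metadata-aware tooling would pull from an index or database rather than re-walking a petascale filesystem on every report, but the output is the point: size, file count, and idleness per project, not per user.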
25. { Big } data ...
Me in years past:
● “Data triage is required; OK to delete some raw data if it is cheaper to go back to the -40F freezer and rerun the experiment”
● “IT can’t make deletion decisions; triage is always led by Science/Research”
May 2020:
● Collectively we’ve kinda failed at data management
● New and sterner methods are required (one such method is sketched below)
○ Unconstrained growth of un-managed data will come back to haunt you in painful ways
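One “sterner method” sketched in Python, assuming a hypothetical raw-data layout and a three-year cutoff: IT only generates the triage candidate list; Science/Research reviews it and signs off on any deletions:

```python
#!/usr/bin/env python3
"""Data-triage candidate list -- a sketch. IT *reports*; deletion
decisions stay with Science/Research, per the slide above.

RAW_DIRS and the 3-year cutoff are illustrative assumptions for a site
where raw instrument output can be regenerated by rerunning experiments.
"""
import csv
import os
import time

RAW_DIRS = ["/data/projects/seqcore/raw"]   # hypothetical raw-data locations
CUTOFF_DAYS = 3 * 365                        # assumed triage threshold

def triage_candidates(roots, cutoff_days):
    """Yield files neither modified nor read inside the cutoff window."""
    cutoff = time.time() - cutoff_days * 86400
    for root in roots:
        for dirpath, _dirs, files in os.walk(root):
            for name in files:
                path = os.path.join(dirpath, name)
                try:
                    st = os.stat(path)
                except OSError:
                    continue
                if st.st_mtime < cutoff and st.st_atime < cutoff:
                    yield path, st.st_size, int(st.st_mtime)

with open("triage_review.csv", "w", newline="") as fh:
    out = csv.writer(fh)
    out.writerow(["path", "bytes", "mtime_epoch"])   # reviewed by scientists, not IT
    for row in triage_candidates(RAW_DIRS, CUTOFF_DAYS):
        out.writerow(row)
```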
26. { Big } data ...
Trends I’m Trying To Create In 2020
1. Culture Change -- Storage Is a Lab/Group Consumable
a. View, manage, and treat data storage systems the same way we handle laboratory consumables -- have a plan, defend your request, and budget accordingly
b. Scientists and IT must both actively manage this consumable
2. No scientific data in $HOME and no more giant $HOME folders (a minimal audit is sketched below)
a. ALL storage allocations are now via Project or Group
b. Few or no exceptions
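A minimal audit sketch in Python for the $HOME rule above. The /home mount and the extension list are assumptions; sites like NERSC (next slide) enforce this with actual filesystem quotas and purge policies rather than a scan:

```python
#!/usr/bin/env python3
"""Spot scientific data hiding in $HOME -- a sketch of the 'no scientific
data in $HOME' rule above. HOME_ROOT and the extension list are
illustrative assumptions for a typical Bio-IT shop.
"""
import os

HOME_ROOT = "/home"   # hypothetical home filesystem mount
SCI_EXTS = (".fastq", ".fastq.gz", ".bam", ".cram", ".mrc", ".tiff", ".h5")

def offenders(home):
    """Yield paths under one home directory that look like scientific data."""
    for dirpath, _dirs, files in os.walk(home, onerror=lambda e: None):
        for name in files:
            if name.lower().endswith(SCI_EXTS):
                yield os.path.join(dirpath, name)

for user in os.scandir(HOME_ROOT):
    if user.is_dir(follow_symlinks=False):
        hits = list(offenders(user.path))
        if hits:
            print(f"{user.name}: {len(hits)} scientific data files -- "
                  "this belongs in a Project/Group allocation")
```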
27. { Big } data ...
NERSC $HOME policy! See the “NERSC File System Quotas & Purging” overview:
https://docs.nersc.gov/
29. { siloed } data ...
More and more silos & sources
● ‘Data rich’ environments at the network edge are increasing
● Data sources and types are increasing
● Not just images, instrument data, and genomes
○ Events, documents, IoT, time-series sensor streams, etc.
● Collaborative research efforts are increasing
● Petabytes of open-access data available for access/download
30. { siloed } data ...
What we see ...
● Slow networks & lack of Science DMZs can strand data @ the edge
● Compute|transform happening at ingest/edge (emerging ...)
● Data Lakes are effective but still *many* failures
● Data Commons methods increasingly attracting attention
○ Gen3 Commons from CTDS is our jam: https://gen3.org/ and https://ctds.uchicago.edu/gen3
○ Gen3 COVID-19 Data Commons: https://chicagoland.pandemicresponsecommons.org/
32. { dirty } data ...
We are not great at data hygiene
● In one career generation we’ve gone from handwritten lab notebooks to petascale data wrangling
● Generally we’ve scaled capacity but ignored or underinvested in governance, curation, metadata, data cleaning, SOPs, and standards (a minimal hygiene check is sketched below)
○ Blame lies with scientists and scientific leadership
○ IT can’t MAKE you clean up after yourself
○ Years (or decades) of data neglect are PAINFUL to handle
● Annoying in 2019-2020
● But this is gonna mess up a ton of ML & AI work in coming years
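A minimal hygiene check sketched in Python, assuming a hypothetical convention where every dataset directory carries a metadata.json sidecar with a few required keys; a real SOP would enforce a fuller schema:

```python
#!/usr/bin/env python3
"""Metadata-hygiene check -- a sketch. The metadata.json convention,
DATA_ROOT path, and required-key list are illustrative assumptions.
"""
import json
import os

DATA_ROOT = "/data/projects"   # assumed project-based layout
REQUIRED = {"owner", "project", "instrument", "created", "retention"}

for entry in os.scandir(DATA_ROOT):
    if not entry.is_dir(follow_symlinks=False):
        continue
    sidecar = os.path.join(entry.path, "metadata.json")
    try:
        with open(sidecar) as fh:
            meta = json.load(fh)   # assumes a JSON object (dict) at top level
        missing = REQUIRED - set(meta)
        status = f"missing keys: {sorted(missing)}" if missing else "ok"
    except FileNotFoundError:
        status = "NO metadata.json -- undocumented dataset"
    except json.JSONDecodeError:
        status = "unparseable metadata.json"
    print(f"{entry.name}: {status}")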
34. { Biased } data ...
Model & Data Bias -- Risks for the ML/AI Era
● It is our responsibility to ensure ML and AI are handled responsibly
● Especially given our prior failures at ‘data hygiene’
● We need clean data from diverse and equitable sources to have any hope of applying machine methods broadly across our many disciplines (a minimal audit is sketched below)
● { our panelists may have a lot to say on this topic ... }
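A minimal cohort-composition audit sketched in Python. The samples.csv manifest and its column names are illustrative assumptions; the point is simply to look at source balance before training anything:

```python
#!/usr/bin/env python3
"""Source-balance audit before training -- a sketch. The samples.csv
manifest and the collection_site/ancestry columns are hypothetical.
"""
import csv
from collections import Counter

counts = Counter()
with open("samples.csv") as fh:                 # hypothetical sample manifest
    for row in csv.DictReader(fh):
        counts[(row["collection_site"], row["ancestry"])] += 1

# Print cohort composition; a wildly skewed table here is a bias red flag.
total = sum(counts.values())
for (site, ancestry), n in counts.most_common():
    print(f"{site:<20}{ancestry:<16}{n:>8}  ({100 * n / total:.1f}%)")
```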
37. Stuff I got wrong
Failed predictions from past trends talks ...
1. “Compilers Matter Again!”
2. “CPU benchmarking is back!”
3. “We need policy-driven auto-tiering storage”
4. “Single global storage namespace should be the goal”
38. Stuff I got wrong
Failed predictions from past trends talks ...
1. “Compilers Matter Again!”
● BioTeam observed performance differences between GNU compilers and optimized commercial compilers in ‘19
● Significant differences for ‘hot’ tools like the Relion CryoEM suite
● Expected more interest and more work building scientific tools with different compilers in 2019-2020
● I was wrong
39. Stuff I got wrong
Failed predictions from past trends talks ...
2. “CPU benchmarking is back!”
● Intel has serious CPU competition for the first time in a while
● We expected a ton of AMD vs. Intel benchmarking
● ... especially as the exascale supercomputers announced in 2019-2020 seem to have clearly made their architecture choices
● I was wrong. It never materialized in our enterprise work
40. Stuff I got wrong
Failed predictions from past trends talks ...
3. “We need policy-driven auto-tiering storage”
● Changed my mind on this
● IT-managed auto-tiering based on “policy” is not ideal for our world
○ Let me tell you about that time the policy engine on a 12-petabyte filesystem decided to archive & stub all .bashrc files to tape :)
○ Making IT own or control the tiering process is wrong. Active partnership is needed.
○ Scientists manage large data sets in ways that are not easily translated to generic “policy” like “last access time” or “file age” -- great for corporate, bad for science workflows
● What we ACTUALLY need (a self-service sketch follows below)
○ User self-service for tiering, movement, and archive decisions
○ Let researchers tier/move/archive based on Project or Group paths|tags
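A minimal self-service sketch in Python, assuming hypothetical hot and archive tier paths. The scientist decides what moves and when; the tool just executes and leaves a breadcrumb:

```python
#!/usr/bin/env python3
"""Researcher-driven archive move -- a sketch of 'user self-service
tiering'. Tier paths are assumptions; a real tool would also check
group ownership, free space, and log the move for the data curator.
"""
import argparse
import os
import shutil

HOT_TIER = "/data/projects"         # hypothetical fast tier
ARCHIVE_TIER = "/archive/projects"  # hypothetical cheap/deep tier

def archive_project(project):
    src = os.path.join(HOT_TIER, project)
    dst = os.path.join(ARCHIVE_TIER, project)
    if not os.path.isdir(src):
        raise SystemExit(f"no such project: {src}")
    shutil.move(src, dst)   # scientist decides *when*; tool just executes
    os.symlink(dst, src)    # breadcrumb so old paths still resolve

if __name__ == "__main__":
    p = argparse.ArgumentParser(description="Self-service project archive")
    p.add_argument("project", help="project directory name to archive")
    archive_project(p.parse_args().project)
```

Note the design choice: tiering is keyed on a Project path the researcher names, not on generic file-age policy the IT side invents.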
41. Stuff I got wrong
Failed predictions from past trends talks ...
4. “Single global storage namespace should be the goal”
● Awesome idea in theory
● But ...
○ Scientific leadership has failed to drive data management as a priority
○ Years of data show that scientific end-users are not responsible stewards when the storage environment is charitably described as “wild west”
● Tired of scientists who build careers on “data intensive science” WHINING about having to ... um ... actively manage the data that drives their careers
● I’m done coddling scientists as an IT person in this particular space
○ If big data is part of your research mission then do your darn job and take ownership
○ Have to move data between systems? Tiers? Archive? Tough cookies. Do your job.