Note: Contact me directly at dag@bioteam.net if you would like a PDF download of these slides
This is Chris Dagdigian’s 10th year delivering his no-holds-barred, candid state-of-the-industry address at Bio-IT World, and we are not going to let a pandemic stop him.
Instead of his typical talk, five distinguished panelists will join Chris for a spirited discussion of current events, scientific computing, and the impacts of the COVID-19 pandemic:
6. Trends from the Trenches (digital)
Twitter: chris_dag
<dag@bioteam.net>
About today’s event
● 10th Anniversary:
“Trends From The Trenches @ Bio-IT World”
● Postponed conference? No problem!
○ We can’t be together physically
○ But we can bring our friends and colleagues
together virtually
11. What is driving Bio-IT requirements in 2020?

Science Driver: Genomics and Bioinformatics
Context: Historically dominant consumer of both storage and compute resources. This will continue as sequencing becomes less expensive and more widely used in both lab and clinic settings.
IT Impact:
● Storage capacity
● Non-GPU computing
● Large-memory computing
● Data ingest & movement

Science Driver: Image-based data acquisition and analysis
Context: The fastest-growing IT driver BioTeam observes “in the trenches” continues to be image capture and image-based storage, driven by the increasing importance of light microscopy (confocal and lattice light-sheet), 3D microscopy, CryoEM, and MRI/fMRI image analysis. A number of BioTeam clients deployed CryoEM in 2019-2020.
IT Impact:
● Storage capacity
● Storage performance
● GPU computing
● Large-scale data ingest & movement
12. What is driving Bio-IT requirements in 2020?

Science Driver: ML and AI
Context: ML and AI techniques are expected to make significant future contributions to Bio-IT requirements and platforms. These approaches may need hardware beyond the general-purpose GPU, including advanced GPUs, FPGAs, and neural processors.
IT Impact:
● Storage performance
● Storage capacity
● GPU computing
● Cloud workload migration

Science Driver: Chemistry and Molecular Dynamics
Context: Computational chemistry and MD simulation requirements differ significantly from bioinformatics/genomics requirements. It is worth noting that chemists are capable of consuming nearly infinite amounts of compute capacity -- if more power is available they simply run longer or more complex simulations.
IT Impact:
● GPU computing
● Scratch storage performance
● Cloud workload migration
13. What is driving Bio-IT requirements in 2020?
● Notice anything in the prior science/IT driver list?
○ Storage (and changing requirements for storage in 2020 ...) plays a big part
● Fortunate that this event is being sponsored by a very interesting storage player
● Since VAST won’t talk tech today I’ll leave this URL for the nerds who want a deeper dive.
This was the original write-up by @glennklockwood that drove our interest:
https://glennklockwood.blogspot.com/2019/02/vast-datas-storage-system-architecture.html
15. What are vendors most likely to lie about in 2020? ML & AI, of course!
● Every IT purchase cycle includes breathlessly overhyped tech, heavily stage-managed & subsidized reference projects, and aggressive-to-the-point-of-misleading sales techniques
● ML and AI have been like this for a while now and this stuff is creeping into *every* product pitch
● In the hype zone you are also more likely to encounter people who say creepy sexualized stuff like “... open the Kimono ...” in purportedly professional meetings
16. What are vendors most likely to lie about in 2020? ML & AI, of course!
● Fact 1: These methods are real, beneficial, and currently driving significant transformation in life science & healthcare
● Fact 2: Within the hype zone of new tech it is essential to approach things cautiously, carefully, and a bit cynically (test all claims!)
○ Be cautious when buying big into stuff your org may not be able to fully exploit, as this market innovates extremely fast
● Fact 3: Many end-users are still getting up to speed; the market knows this and sometimes relies on customer naivety or C-suite bandwagon pressure
17. Today’s focus: Scientific Data
● Why 1: Petascale data storage has been easy for many years now; managing and understanding the billions of files we store at petascale is still a gnarly problem
● Why 2: Data Management, Data Movement and Data Federation/Access still make up a significant percentage of BioTeam’s informatics-focused consulting
● Why 3: Some folks are nearing the limits of what can be sensibly done with standard scale-out NAS
● Why 4: Image-based acquisition/analysis and ML/AI workloads are changing our baseline requirements for scientific data storage system capabilities in 2020
23. { Big } data ...
Me in years past:
● “Far easier to acquire or generate data faster than it can be effectively stored over its full lifecycle”
● “Storage pricing not decreasing fast enough to match our increase in consumption”
May 2020:
● CryoEM: “Hold my sensor ...”
● This disconnect was OK when most of us were at the low end of peta-capable storage system limits
○ Rudderless expansion requires only ${money}; no leadership, no ownership, and no difficult conversations with scientists
○ ... now seeing the limits of this style
24. { Big } data ...
Me in years past:
● “At the 1-petabyte level your scientific storage platform needs a human data curator and real governance”
May 2020:
● BioTeam has seen multiple 10+ petabyte orgs in 2019-2020 with little to no governance, standards, or human-led curation
● “Data awareness” is a competitive differentiator; organizations can succeed or fail on this capability alone
● Human data wranglers now need sophisticated storage reporting and metadata-aware tooling to perform their role (a minimal sketch follows below)
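A minimal sketch of the kind of per-project storage report a data wrangler needs, in Python. The /data/projects layout, and keying the report off top-level project directories, are illustrative assumptions, not any specific product:

```python
#!/usr/bin/env python3
"""Per-project storage report -- a sketch, not production tooling.

Assumes a layout like /data/projects/<project>/... ; PROJECTS_ROOT is a
hypothetical path, adjust for your site.
"""
import os
import time

PROJECTS_ROOT = "/data/projects"   # hypothetical project-based storage root

def project_report(root):
    """Yield (project, bytes, file count, idle days) per top-level dir."""
    for entry in os.scandir(root):
        if not entry.is_dir(follow_symlinks=False):
            continue
        total_bytes, newest_atime, nfiles = 0, 0.0, 0
        for dirpath, _dirs, files in os.walk(entry.path):
            for name in files:
                try:
                    st = os.stat(os.path.join(dirpath, name), follow_symlinks=False)
                except OSError:
                    continue   # file vanished or permission denied; skip
                total_bytes += st.st_size
                newest_atime = max(newest_atime, st.st_atime)
                nfiles += 1
        idle_days = (time.time() - newest_atime) / 86400 if nfiles else float("nan")
        yield entry.name, total_bytes, nfiles, idle_days

if __name__ == "__main__":
    print(f"{'project':<24}{'TiB':>8}{'files':>12}{'idle_days':>12}")
    for name, nbytes, nfiles, idle in sorted(project_report(PROJECTS_ROOT),
                                             key=lambda r: -r[1]):
        print(f"{name:<24}{nbytes / 2**40:>8.2f}{nfiles:>12}{idle:>12.0f}")
```

Real metadata-aware tooling would pull from an index or database rather than re-walking a petascale filesystem on every report, but the output is the point: size, file count, and idleness per project, not per user.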
25. { Big } data ...
Me in years past:
● “Data triage is required; OK to delete some raw data if it is cheaper to go back to the -40F freezer and rerun the experiment”
● “IT can’t make deletion decisions; triage is always led by Science/Research”
May 2020:
● Collectively we’ve kinda failed at data management
● New and sterner methods are required (one such method is sketched below)
○ Unconstrained growth of un-managed data will come back to haunt you in painful ways
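One “sterner method” sketched in Python, assuming a hypothetical raw-data layout and a three-year cutoff: IT only generates the triage candidate list; Science/Research reviews it and signs off on any deletions:

```python
#!/usr/bin/env python3
"""Data-triage candidate list -- a sketch. IT *reports*; deletion
decisions stay with Science/Research, per the slide above.

RAW_DIRS and the 3-year cutoff are illustrative assumptions for a site
where raw instrument output can be regenerated by rerunning experiments.
"""
import csv
import os
import time

RAW_DIRS = ["/data/projects/seqcore/raw"]   # hypothetical raw-data locations
CUTOFF_DAYS = 3 * 365                        # assumed triage threshold

def triage_candidates(roots, cutoff_days):
    """Yield files neither modified nor read inside the cutoff window."""
    cutoff = time.time() - cutoff_days * 86400
    for root in roots:
        for dirpath, _dirs, files in os.walk(root):
            for name in files:
                path = os.path.join(dirpath, name)
                try:
                    st = os.stat(path)
                except OSError:
                    continue
                if st.st_mtime < cutoff and st.st_atime < cutoff:
                    yield path, st.st_size, int(st.st_mtime)

with open("triage_review.csv", "w", newline="") as fh:
    out = csv.writer(fh)
    out.writerow(["path", "bytes", "mtime_epoch"])   # reviewed by scientists, not IT
    for row in triage_candidates(RAW_DIRS, CUTOFF_DAYS):
        out.writerow(row)
```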
26. { Big } data ...
Trends I’m Trying To Create In 2020
1. Culture Change -- Storage Is a Lab/Group Consumable
a. View, manage, and treat data storage systems the same way we handle laboratory consumables -- have a plan, defend your request, and budget accordingly
b. Scientists and IT must both actively manage this consumable
2. No scientific data in $HOME and no more giant $HOME folders (a minimal audit is sketched below)
a. ALL storage allocations are now via Project or Group
b. Few or no exceptions
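A minimal audit sketch in Python for the $HOME rule above. The /home mount and the extension list are assumptions; sites like NERSC (next slide) enforce this with actual filesystem quotas and purge policies rather than a scan:

```python
#!/usr/bin/env python3
"""Spot scientific data hiding in $HOME -- a sketch of the 'no scientific
data in $HOME' rule above. HOME_ROOT and the extension list are
illustrative assumptions for a typical Bio-IT shop.
"""
import os

HOME_ROOT = "/home"   # hypothetical home filesystem mount
SCI_EXTS = (".fastq", ".fastq.gz", ".bam", ".cram", ".mrc", ".tiff", ".h5")

def offenders(home):
    """Yield paths under one home directory that look like scientific data."""
    for dirpath, _dirs, files in os.walk(home, onerror=lambda e: None):
        for name in files:
            if name.lower().endswith(SCI_EXTS):
                yield os.path.join(dirpath, name)

for user in os.scandir(HOME_ROOT):
    if user.is_dir(follow_symlinks=False):
        hits = list(offenders(user.path))
        if hits:
            print(f"{user.name}: {len(hits)} scientific data files -- "
                  "this belongs in a Project/Group allocation")
```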
27. { Big } data ...
NERSC $HOME policy! See the “NERSC File System Quotas & Purging” overview:
https://docs.nersc.gov/
29. { siloed } data ...
More and more silos & sources
● ‘Data rich’ environments at the network edge are increasing
● Data sources and types are increasing
● Not just images, instrument data, and genomes
○ Events, documents, IoT, time-series sensor streams, etc.
● Collaborative research efforts are increasing
● Petabytes of open-access data available for access/download
30. { siloed } data ...
What we see ...
● Slow networks & lack of Science DMZs can strand data @ the edge
● Compute|transform happening at ingest/edge (emerging ...)
● Data Lakes are effective but still *many* failures
● Data Commons methods increasingly attracting attention
○ Gen3 Commons from CTDS is our jam: https://gen3.org/ and https://ctds.uchicago.edu/gen3
○ Gen3 COVID-19 Data Commons: https://chicagoland.pandemicresponsecommons.org/
32. { dirty } data ...
We are not great at data hygiene
● In one career generation we’ve gone from handwritten lab notebooks to petascale data wrangling
● Generally we’ve scaled capacity but ignored or underinvested in governance, curation, metadata, data cleaning, SOPs, and standards (a minimal hygiene check is sketched below)
○ Blame lies with scientists and scientific leadership
○ IT can’t MAKE you clean up after yourself
○ Years (or decades) of data neglect are PAINFUL to handle
● Annoying in 2019-2020
● But this is gonna mess up a ton of ML & AI work in coming years
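A minimal hygiene check sketched in Python, assuming a hypothetical convention where every dataset directory carries a metadata.json sidecar with a few required keys; a real SOP would enforce a fuller schema:

```python
#!/usr/bin/env python3
"""Metadata-hygiene check -- a sketch. The metadata.json convention,
DATA_ROOT path, and required-key list are illustrative assumptions.
"""
import json
import os

DATA_ROOT = "/data/projects"   # assumed project-based layout
REQUIRED = {"owner", "project", "instrument", "created", "retention"}

for entry in os.scandir(DATA_ROOT):
    if not entry.is_dir(follow_symlinks=False):
        continue
    sidecar = os.path.join(entry.path, "metadata.json")
    try:
        with open(sidecar) as fh:
            meta = json.load(fh)   # assumes a JSON object (dict) at top level
        missing = REQUIRED - set(meta)
        status = f"missing keys: {sorted(missing)}" if missing else "ok"
    except FileNotFoundError:
        status = "NO metadata.json -- undocumented dataset"
    except json.JSONDecodeError:
        status = "unparseable metadata.json"
    print(f"{entry.name}: {status}")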
34. { Biased } data ...
Model & Data Bias -- Risks for the ML/AI Era
● It is our responsibility to ensure ML and AI are handled responsibly
● Especially given our prior failures at ‘data hygiene’
● We need clean data from diverse and equitable sources to have any hope of applying machine methods broadly across our many disciplines (a minimal audit is sketched below)
● { our panelists may have a lot to say on this topic ... }
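A minimal cohort-composition audit sketched in Python. The samples.csv manifest and its column names are illustrative assumptions; the point is simply to look at source balance before training anything:

```python
#!/usr/bin/env python3
"""Source-balance audit before training -- a sketch. The samples.csv
manifest and the collection_site/ancestry columns are hypothetical.
"""
import csv
from collections import Counter

counts = Counter()
with open("samples.csv") as fh:                 # hypothetical sample manifest
    for row in csv.DictReader(fh):
        counts[(row["collection_site"], row["ancestry"])] += 1

# Print cohort composition; a wildly skewed table here is a bias red flag.
total = sum(counts.values())
for (site, ancestry), n in counts.most_common():
    print(f"{site:<20}{ancestry:<16}{n:>8}  ({100 * n / total:.1f}%)")
```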
37. Stuff I got wrong
Failed predictions from past trends talks ...
1. “Compilers Matter Again!”
2. “CPU benchmarking is back!”
3. “We need policy-driven auto-tiering storage”
4. “Single global storage namespace should be the goal”
38. Stuff I got wrong
Failed predictions from past trends talks ...
1. “Compilers Matter Again!”
● BioTeam observed performance differences between GNU compilers and optimized commercial compilers in ‘19
● Significant differences for ‘hot’ tools like the Relion CryoEM suite
● Expected more interest and more work building scientific tools with different compilers in 2019-2020
● I was wrong
39. Stuff I got wrong
Failed predictions from past trends talks ...
2. “CPU benchmarking is back!”
● Intel has serious CPU competition for the first time in a while
● We expected a ton of AMD vs. Intel benchmarking
● ... especially as the exascale supercomputers announced in 2019-2020 seem to have clearly made their architecture choices
● I was wrong. It never materialized in our enterprise work
40. Stuff I got wrong
Failed predictions from past trends talks ...
3. “We need policy-driven auto-tiering storage”
● Changed my mind on this
● IT-managed auto-tiering based on “policy” is not ideal for our world
○ Let me tell you about that time the policy engine on a 12-petabyte filesystem decided to archive & stub all .bashrc files to tape :)
○ Making IT own or control the tiering process is wrong. Active partnership is needed.
○ Scientists manage large data sets in ways that are not easily translated to generic “policy” like “last access time” or “file age” -- great for corporate, bad for science workflows
● What we ACTUALLY need (a self-service sketch follows below)
○ User self-service for tiering, movement, and archive decisions
○ Let researchers tier/move/archive based on Project or Group paths|tags
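A minimal self-service sketch in Python, assuming hypothetical hot and archive tier paths. The scientist decides what moves and when; the tool just executes and leaves a breadcrumb:

```python
#!/usr/bin/env python3
"""Researcher-driven archive move -- a sketch of 'user self-service
tiering'. Tier paths are assumptions; a real tool would also check
group ownership, free space, and log the move for the data curator.
"""
import argparse
import os
import shutil

HOT_TIER = "/data/projects"         # hypothetical fast tier
ARCHIVE_TIER = "/archive/projects"  # hypothetical cheap/deep tier

def archive_project(project):
    src = os.path.join(HOT_TIER, project)
    dst = os.path.join(ARCHIVE_TIER, project)
    if not os.path.isdir(src):
        raise SystemExit(f"no such project: {src}")
    shutil.move(src, dst)   # scientist decides *when*; tool just executes
    os.symlink(dst, src)    # breadcrumb so old paths still resolve

if __name__ == "__main__":
    p = argparse.ArgumentParser(description="Self-service project archive")
    p.add_argument("project", help="project directory name to archive")
    archive_project(p.parse_args().project)
```

Note the design choice: tiering is keyed on a Project path the researcher names, not on generic file-age policy the IT side invents.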
41. Stuff I got wrong
Failed predictions from past trends talks ...
4. “Single global storage namespace should be the goal”
● Awesome idea in theory
● But ...
○ Scientific leadership has failed to drive data management as a priority
○ Years of data show that scientific end-users are not responsible stewards when the storage environment is charitably described as “wild west”
● Tired of scientists who build careers on “data intensive science” WHINING about having to ... um ... actively manage the data that drives their careers
● I’m done coddling scientists as an IT person in this particular space
○ If big data is part of your research mission then do your darn job and take ownership
○ Have to move data between systems? Tiers? Archive? Tough cookies. Do your job.