Trends from the Socially Distanced Trenches
Twitter: chris_dag
Email: <dag@bioteam.net>
LinkedIn: chrisdag
Slides: slideshare.net/chrisdag
Live event only!
Visit: slido.com
Enter Event Code: TFT
● Live Q&A
● Polls
● Panelist Bios & URLs
About today’s event
10 Years Ago ...
By accident we created an enduring thing ...
● 10th Anniversary: “Trends From The Trenches @ Bio-IT World”
● Postponed conference? No problem!
○ We can’t be together physically
○ But we can bring our friends and colleagues together virtually
And we have some amazing friends with us...
Bio-IT World 2020 is postponed; not cancelled!
Intro over …
{mini} Trends time
2020 scientific drivers for research IT
What is driving Bio-IT requirements in 2020?
Science Driver: Genomics and Bioinformatics
Context: Historically the dominant consumer of both storage and compute resources. This will continue as sequencing becomes less expensive and more widely used in both lab and clinic settings.
IT Impact:
● Storage capacity
● Non-GPU computing
● Large-memory computing
● Data ingest & movement
Science Driver: Image-based data acquisition and analysis
Context: The fastest-growing IT driver BioTeam observes “in the trenches” continues to be image capture and image-based storage, driven by the increasing importance of light (confocal and lattice light-sheet) and 3D microscopy, CryoEM, MRI and fMRI image analysis. A number of BioTeam clients deployed CryoEM in 2019-2020.
IT Impact:
● Storage capacity
● Storage performance
● GPU computing
● Large-scale data ingest & movement
Science Driver: ML and AI
Context: ML and AI techniques are expected to make significant future contributions to Bio-IT requirements and platforms. These approaches may need hardware beyond the general-purpose GPU and may require advanced GPUs, FPGAs, and neural processors.
IT Impact:
● Storage performance
● Storage capacity
● GPU computing
● Cloud workload migration
Science Driver: Chemistry and Molecular Dynamics
Context: Computational chemistry and MD simulation requirements differ significantly from bioinformatics/genomics requirements. It is worth noting that chemists are capable of consuming nearly infinite amounts of compute capacity -- if more power is available they simply run longer or more complex simulations.
IT Impact:
● GPU computing
● Scratch storage performance
● Cloud workload migration
What is driving Bio-IT requirements in 2020?
● Notice anything in the prior science/IT driver list?
○ Storage (and changing requirements for storage in 2020 ..) plays a big part
● Fortunate that this event is being sponsored by a very interesting storage player
● Since VAST won’t talk tech today I’ll leave this URL for the nerds who want a deeper dive.
This was the original write up by @glennklockwood that drove our interest:
https://glennklockwood.blogspot.com/2019/02/vast-datas-storage-system-architecture.html
What are vendors most likely to lie about in 2020?
● Every IT purchase cycle includes breathlessly
overhyped tech, heavily stage-managed &
subsidized reference projects and
aggressive-to-the-point-of-misleading sales
techniques
● ML and AI have been like this for a while now and
this stuff is creeping into *every* product pitch
● In the hype zone you are also more likely to
encounter people who say creepy sexualized
stuff like “... open the Kimono… ” in purportedly
professional meetings
ML & AI Of Course!
● Fact 1: These methods are real, beneficial and
currently driving significant transformation in life
science & healthcare
● Fact 2: Within the hype zone of new tech it is essential
to approach things cautiously, carefully and a bit
cynically (test all claims!)
○ Be cautious when buying big into stuff your org
may not be able to fully exploit as this market
innovates extremely fast
● Fact 3: Many end-users are still getting up to speed;
the market knows this and sometimes relies on
customer naivety or C-Suite bandwagon pressure
Today's focus:
Scientific Data
● Why 1: Petascale data storage has been easy for many
years now; managing and understanding the billions of
files we store at petascale is still a gnarly problem
● Why 2: Data Management, Data Movement and Data
Federation/Access still make up a significant
percentage of BioTeam informatics-focused consulting
● Why 3: Some folk are nearing limits of what can be
sensibly done with standard scale-out NAS
● Why 4: Image-based acquisition/analysis and ML/AI
workloads are changing our baseline requirements for
scientific data storage system capabilities in 2020
Today's focus:
Scientific Data
BIG
SILOED
DIRTY
BIASED
Big data ...
{ Big } data ...
Me in years past:
● “Far easier to acquire or generate data faster than it can be effectively stored over its full lifecycle”
● “Storage pricing not decreasing fast enough to match our increase in consumption”
May 2020:
● CryoEM: “Hold my sensor …”
● This disconnect was OK when most of us were at the low end of peta-capable storage system limits
○ Rudderless expansion requires only ${money}; no leadership, no ownership and no difficult conversations with scientists
○ … now seeing limits of this style
{ Big } data ...
Me in years past:
● “At 1-petabyte level your scientific storage platform needs a human data curator and real governance”
May 2020:
● BioTeam has seen multiple 10+ petabyte orgs in 2019-2020 with little to no governance, standards or human-led curation
● “Data awareness” is a competitive differentiator; organizations can succeed or fail on this capability alone
● Human data wranglers now need sophisticated storage reporting and metadata-aware tooling to perform their role
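As a rough illustration of the kind of storage reporting a data wrangler might start from, here is a minimal Python sketch. It assumes each top-level directory under a storage root is one project; the function name `summarize_projects` and that layout are assumptions for illustration, not a description of any particular storage product.

```python
import os
import time
from collections import defaultdict
from pathlib import Path


def summarize_projects(root, now=None):
    """Summarize each top-level directory under `root`.

    Returns {project_name: {"files": int, "bytes": int, "oldest_days": float}}.
    """
    now = time.time() if now is None else now
    report = defaultdict(lambda: {"files": 0, "bytes": 0, "oldest_days": 0.0})
    for project in sorted(p for p in Path(root).iterdir() if p.is_dir()):
        stats = report[project.name]  # creates an entry even for empty projects
        for dirpath, _dirnames, filenames in os.walk(project):
            for name in filenames:
                try:
                    st = os.lstat(os.path.join(dirpath, name))
                except OSError:
                    continue  # file vanished or is unreadable; skip it
                stats["files"] += 1
                stats["bytes"] += st.st_size
                # Track the age (by modification time) of the stalest file.
                age_days = (now - st.st_mtime) / 86400
                stats["oldest_days"] = max(stats["oldest_days"], age_days)
    return dict(report)
```

A walk like this is slow at billions of files, so at real petascale the same summary would more likely come from the storage system's own metadata catalog; the sketch only shows the shape of the report.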
{ Big } data ...
Me in years past:
● “Data triage is required; OK to delete some raw data if it is cheaper to go back to the -40F Freezer and rerun the experiment”
● “IT can’t make deletion decisions; triage is always led by Science/Research”
May 2020:
● Collectively we’ve kinda failed at data management
● New and sterner methods are required
○ Unconstrained growth of un-managed data will come back to haunt you in painful ways
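The “cheaper to rerun the experiment” comparison can be written down as simple arithmetic. The Python sketch below is only an aid for the science-led triage conversation; every parameter, including the probability that the data is ever needed again, is a hypothetical input the research group would have to supply.

```python
def cheaper_to_regenerate(size_tb, years, storage_cost_per_tb_year,
                          rerun_cost, rerun_probability):
    """Return True when the expected cost of re-running the experiment
    is lower than the cost of keeping the raw data for `years`."""
    keep_cost = size_tb * years * storage_cost_per_tb_year
    # Expected cost: we only pay for a re-run if the data is needed again.
    expected_rerun_cost = rerun_cost * rerun_probability
    return expected_rerun_cost < keep_cost


# Hypothetical numbers: 50 TB kept 5 years at 100 per TB-year costs 25,000;
# an 8,000 re-run that is 25% likely has an expected cost of 2,000.
print(cheaper_to_regenerate(50, 5, 100.0, 8000.0, 0.25))
```

The function informs the decision but does not make it; deletion stays with Science/Research, as the slide says.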
{ Big } data ...
Trends I’m Trying To Create In 2020
1. Culture Change -- Storage Is a Lab/Group Consumable
a. View, manage and treat data storage systems in the same way we handle laboratory consumables -- have a plan, defend your request and budget accordingly
b. Scientists and IT must both actively manage this consumable
2. No scientific data in $HOME and no more giant $HOME folders
a. ALL storage allocations are now via Project or Group
b. Few or no exceptions
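One way to check the “no giant $HOME folders” rule is to measure home directories and flag the large ones. This is a minimal Python sketch, assuming all homes live under a single root directory; the function names and the size limit are hypothetical and not taken from any site's actual policy.

```python
import os
from pathlib import Path


def directory_bytes(path):
    """Total size in bytes of the files under `path`.

    Symlinks are measured themselves, not followed.
    """
    total = 0
    for dirpath, _dirnames, filenames in os.walk(path):
        for name in filenames:
            try:
                total += os.lstat(os.path.join(dirpath, name)).st_size
            except OSError:
                continue  # unreadable or vanished file; skip it
    return total


def oversized_homes(home_root, limit_bytes):
    """Return (user, bytes) pairs for homes above `limit_bytes`, largest first."""
    results = []
    for home in Path(home_root).iterdir():
        if home.is_dir():
            size = directory_bytes(home)
            if size > limit_bytes:
                results.append((home.name, size))
    return sorted(results, key=lambda item: item[1], reverse=True)
```

The output is a list for a conversation with the owning group, not an automatic purge list.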
{ Big } data ...
NERSC $HOME policy!
NERSC File System Quotas & Purging Overview: https://docs.nersc.gov/
siloed data ...
{ siloed } data ...
More and more silos & sources
● ‘Data rich’ environments at the network edge are increasing
● Data sources and types are increasing
● Not just images, instrument data and genomes
○ Events, documents, IoT, time-series sensor streams, etc.
● Collaborative research efforts are increasing
● Petabytes of open access data available for access/download
{ siloed } data ...
What we see ...
● Slow networks & lack of science DMZs can strand data @ edge
● Compute|transform happening at ingest/edge (emerging …)
● Data Lakes are effective but still *many* failures
● Data Commons methods increasingly attracting attention
○ Gen3 Commons from CTDS is our jam - https://gen3.org/ and https://ctds.uchicago.edu/gen3
○ Gen3 COVID19 Data Commons: https://chicagoland.pandemicresponsecommons.org/
dirty data ...
{ dirty } data ...
We are not great at data hygiene
● In one career generation we’ve gone from handwritten lab notebooks to petascale data wrangling
● Generally we’ve scaled capacity but ignored or underinvested in governance, curation, metadata, data cleaning, SOPs and standards
○ Blame lies with scientists and scientific leadership
○ IT can’t MAKE you clean up after yourself
○ Years (or decades) of data neglect are PAINFUL to handle
● Annoying in 2019 - 2020
● But this is gonna mess up a ton of ML & AI work in coming years
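A small part of data hygiene can be automated: checking that every dataset carries the minimum metadata a standard demands. The Python sketch below assumes metadata is already available as one dictionary per dataset; the required field names are hypothetical placeholders for whatever a group's SOP specifies.

```python
# Hypothetical minimum metadata; a real SOP would define its own list.
REQUIRED_FIELDS = ("owner", "project", "instrument", "created")


def missing_metadata(records):
    """Map dataset name -> required fields that are absent or blank.

    Fields are reported in REQUIRED_FIELDS order; complete datasets are omitted.
    """
    problems = {}
    for name, meta in records.items():
        missing = []
        for field in REQUIRED_FIELDS:
            value = meta.get(field)
            if value is None or not str(value).strip():
                missing.append(field)
        if missing:
            problems[name] = missing
    return problems
```

A check like this only finds missing fields; whether the values are correct still needs a human curator.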
Biased data ...
{ Biased } data ...
Model & Data Bias -- Risks for the ML/AI Era
● Our responsibility to ensure ML and AI are handled responsibly
● Especially with our prior failures at ‘data hygiene’
● Need clean data from diverse and equitable sources to have any hope of applying machine methods broadly and across our many disciplines
● { our panelists may have a lot to say on this topic … }
Audience Q&A Session: join at slido.com #TFT
Stuff I got wrong
Failed predictions from past trends talks ...
1. “Compilers Matter Again!”
2. “CPU benchmarking is back!”
3. “We need policy driven auto-tiering storage”
4. “Single global storage namespace should be the goal”
1. “Compilers Matter Again!”
● BioTeam observed performance differences between GNU compilers
and optimized commercial compilers in ‘19
● Significant difference for ‘hot’ tools like Relion CryoEM suite
● Expected more interest and more work building scientific tools for
different compilers in 2019 - 2020
● I was wrong
2. “CPU benchmarking is back!”
● Intel has serious CPU competition for the first time in a while
● We expected a ton of AMD vs Intel benchmarking
● … especially as the exascale supercomputers announced in
2019-2020 seem to have clearly made their architecture choices
● I was wrong. Never materialized in our enterprise work
3. “We need policy driven auto-tiering storage”
● Changed my mind on this
● IT-managed auto-tiering based on “policy” is not ideal for our world
○ Let me tell you about that time the policy engine on a 12-petabyte filesystem decided to archive & stub all .bashrc files to tape :)
○ Making IT own or control the tiering process is wrong. Active partnership needed.
○ Scientists manage large data sets in ways that are not easily translated to generic “policy” like “last access time” or “file age” -- great for corporate, bad for science workflows
● What we ACTUALLY need
○ User self-service for tiering, movement and archive decisions
○ Let researchers tier/move/archive based on Project or Group paths|tags
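Self-service tiering by project path could look something like the Python sketch below. A researcher names one project directory and gets a dry-run list of moves to review, rather than a filesystem-wide policy engine choosing files by age. The function name, the archive layout, and the choice to skip dotfiles are all assumptions for illustration.

```python
import os
from pathlib import Path


def plan_archive(project_dir, archive_root):
    """List (source, destination) pairs for archiving one project directory.

    Dry run only: nothing is moved. Hidden files and hidden directories are
    skipped (an assumption, so shell dotfiles never end up in a plan).
    """
    project_dir = Path(project_dir)
    archive_root = Path(archive_root)
    plan = []
    for dirpath, dirnames, filenames in os.walk(project_dir):
        # Prune hidden directories in place so os.walk does not descend into them.
        dirnames[:] = sorted(d for d in dirnames if not d.startswith("."))
        for name in sorted(filenames):
            if name.startswith("."):
                continue
            src = Path(dirpath) / name
            dst = archive_root / project_dir.name / src.relative_to(project_dir)
            plan.append((src, dst))
    return plan
```

Because the scope is a path the researcher chose, the decision stays with the people who understand the data, and IT supplies the tooling.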
4. “Single global storage namespace should be the goal”
● Awesome idea in theory
● But ...
○ Scientific leadership has failed to drive data management as a priority
○ Years of data show that scientific end-users are not responsible stewards when the storage environment is charitably described as “wild west”
● Tired of scientists who build careers on “data intensive science” WHINING about having to … um … actively manage data that drives their career
● I’m done coddling scientists as an IT person in this particular space
○ If big data is part of your research mission then do your darn job and take ownership
○ Have to move data between systems? Tiers? Archive? Tough cookies. Do your job.
thanks;
Panel time!
Acknowledgements:
○ Stan Gloss
○ twitter.com/CircuitSwan
○ twitter.com/melrom
○ https://desertedislanddevops.com/

Introduction to Multilingual Retrieval Augmented Generation (RAG)Zilliz
 
Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​Bhuvaneswari Subramani
 
Exploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusExploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusZilliz
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Victor Rentea
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherRemote DBA Services
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024The Digital Insurer
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Victor Rentea
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MIND CTI
 
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfOrbitshub
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyKhushali Kathiriya
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native ApplicationsWSO2
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FMESafe Software
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDropbox
 

Dernier (20)

Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)
 
Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​
 
Exploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusExploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with Milvus
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 

Bio-IT Trends From The Trenches (digital edition)

  • 1. Trends from The Socially Distanced Trenches Twitter: chris_dag Email: <dag@bioteam.net> LinkedIn: chrisdag Slides: slideshare.net/chrisdag
  • 2. Live event only! Visit: slido.com Enter Event Code: TFT ● Live Q&A ● Polls ● Panelist Bios & URLs Or scan QR -->
  • 3. Trends from the trenches digital Twitter: chris_dag <dag@bioteam.net> About today’s event
  • 4. About today’s event: 10 Years Ago ...
  • 5. By accident we created an enduring thing ...
  • 6. About today’s event ● 10th Anniversary: “Trends From The Trenches @ Bio-IT World” ● Postponed conference? No problem! ○ We can’t be together physically ○ But we can bring our friends and colleagues together virtually
  • 7. And we have some amazing friends with us...
  • 8. Bio-IT World 2020 is postponed; not cancelled!
  • 9. Intro over … {mini} Trends time
  • 10. 2020 scientific drivers for research IT
  • 11. What is driving Bio-IT requirements in 2020?
      Science Driver: Genomics and Bioinformatics
      Context: Historically dominant consumer of both storage and compute resources. This will continue as sequencing becomes less expensive and more widely used in both the lab and clinic settings.
      IT Impact: ● Storage capacity ● Non-GPU computing ● Large Memory computing ● Data Ingest & movement
      Science Driver: Image-based data acquisition and analysis
      Context: The fastest-growing IT driver BioTeam observes “in the trenches” continues to be image capture and image-based storage driven by the increasing importance of both light (confocal and lattice light-sheet), 3D microscopy, CryoEM, MRI and fMRI image analysis. A number of BioTeam clients deployed CryoEM in 2019-2020.
      IT Impact: ● Storage capacity ● Storage performance ● GPU computing ● Large scale data ingest & movement
  • 12. What is driving Bio-IT requirements in 2020?
      Science Driver: ML and AI
      Context: ML and AI techniques are expected to make significant future contributions to Bio-IT requirements and platforms. These approaches may need hardware beyond the general purpose GPU and may require advanced GPUs, FPGAs, and neural processors.
      IT Impact: ● Storage performance ● Storage capacity ● GPU computing ● Cloud workload migration
      Science Driver: Chemistry and Molecular Dynamics
      Context: Computational chemistry and MD simulation requirements differ significantly from bioinformatics/genomics requirements. It is worth noting that chemists are capable of consuming nearly infinite amounts of compute capacity -- if more power is available they simply run longer or more complex simulations.
      IT Impact: ● GPU computing ● Scratch storage performance ● Cloud workload migration
  • 13. What is driving Bio-IT requirements in 2020? ● Notice anything in the prior science/IT driver list? ○ Storage (and changing requirements for storage in 2020 ..) plays a big part ● Fortunate that this event is being sponsored by a very interesting storage player ● Since VAST won’t talk tech today I’ll leave this URL for the nerds who want a deeper dive. This was the original write-up by @glennklockwood that drove our interest: https://glennklockwood.blogspot.com/2019/02/vast-datas-storage-system-architecture.html
  • 14. What are vendors most likely to lie about in 2020?
  • 15. What are vendors most likely to lie about in 2020? ● Every IT purchase cycle includes breathlessly overhyped tech, heavily stage-managed & subsidized reference projects and aggressive-to-the-point-of-misleading sales techniques ● ML and AI have been like this for a while now and this stuff is creeping into *every* product pitch ● In the hype zone you are also more likely to encounter people who say creepy sexualized stuff like “... open the Kimono… ” in purportedly professional meetings ML & AI Of Course!
  • 16. What are vendors most likely to lie about in 2020? ● Fact 1: These methods are real, beneficial and currently driving significant transformation in life science & healthcare ● Fact 2: Within the hype zone of new tech it is essential to approach things cautiously, carefully and a bit cynically (test all claims!) ○ Be cautious when buying big into stuff your org may not be able to fully exploit as this market innovates extremely fast ● Fact 3: Many end-users are still getting up to speed; the market knows this and sometimes relies on customer naivety or C-Suite bandwagon pressure ML & AI Of Course!
  • 17. Today's focus: Scientific Data ● Why 1: Petascale data storage has been easy for many years now; managing and understanding the billions of files we store at petascale is still a gnarly problem ● Why 2: Data Management, Data Movement and Data Federation/Access still make up a significant percentage of BioTeam informatics-focused consulting ● Why 3: Some folk are nearing limits of what can be sensibly done with standard scale-out NAS ● Why 4: Image-based acquisition/analysis and ML/AI workloads are changing our baseline requirements for scientific data storage system capabilities in 2020
  • 23. { Big } data ... ● “Far easier to acquire or generate data faster than it can be effectively stored over its full lifecycle” ● “Storage pricing not decreasing fast enough to match our increase in consumption” ● CryoEM: “Hold my sensor …” ● This disconnect was OK when most of us were at the low-end of peta-capable storage system limits ○ Rudderless expansion requires only ${money}; no leadership, no ownership and no difficult conversations with scientists ○ … now seeing limits of this style Me in years past: May 2020:
  • 24. { Big } data ... ● “At 1-petabyte level your scientific storage platform needs a human data curator and real governance” ● BioTeam has seen multiple 10+ petabyte orgs in 2019-2020 with little to no governance, standards or human-led curation ● “Data awareness” is a competitive differentiator; organizations can succeed or fail on this capability alone ● Human data wranglers now need sophisticated storage reporting and metadata-aware tooling to perform their role Me in years past: May 2020:
  • 25. { Big } data ... ● “Data triage is required; OK to delete some raw data if it is cheaper to go back to the -40F Freezer and rerun the experiment” ● “IT can’t make deletion decisions; triage is always led by Science/Research” ● Collectively we’ve kinda failed at data management ● New and sterner methods are required ○ Unconstrained growth of un-managed data will come back to haunt you in painful ways Me in years past: May 2020:
  • 26. { Big } data ... 1. Culture Change -- Storage Is a Lab/Group Consumable a. View, manage and treat data storage systems in the same way we handle laboratory consumables -- have a plan, defend your request and budget accordingly b. Scientists and IT must both actively manage this consumable 2. No scientific data in $HOME and no more giant $HOME folders a. ALL storage allocations are now via Project or Group b. Few or no exceptions Trends I’m Trying To Create In 2020
  • 27. { Big } data ... NERSC $HOME policy! NERSC File System Quotas & Purging Overview https://docs.nersc.gov/
  • 29. { siloed } data ... ● ‘Data rich’ environments at network edge are increasing ● Data sources and types are increasing ● Not just images, instrument data and genomes ○ Events, Documents, IoT, time-series sensor streams, etc. ● Collaborative research efforts are increasing ● Petabytes of open access data available for access/download More and more silos & sources
  • 30. { siloed } data ... ● Slow networks & lack of science DMZs can strand data @ edge ● Compute|transform happening at ingest/edge (emerging …) ● Data Lakes are effective but still *many* failures ● Data Commons methods increasingly attracting attention ○ Gen3 Commons from CTDS is our jam - https://gen3.org/ https://ctds.uchicago.edu/gen3 ○ Gen3 COVID19 Data Commons: https://chicagoland.pandemicresponsecommons.org/ What we see ...
  • 32. { dirty } data ... ● In one career generation we’ve gone from handwritten lab notebooks to petascale data wrangling ● Generally we’ve scaled capacity but ignored or underinvested in governance, curation, metadata, data cleaning, SOPs and standards ○ Blame lies with scientists and scientific leadership ○ IT can’t MAKE you clean up after yourself ○ Years (or decades) of data neglect are PAINFUL to handle ● Annoying in 2019 - 2020 ● But this is gonna mess up a ton of ML & AI work in coming years We are not great at data hygiene
  • 34. ● Our responsibility to ensure ML and AI are handled responsibly ● Especially with our prior failures at ‘data hygiene’ ● Need clean data from diverse and equitable sources to have any hope of applying machine methods broadly and across our many disciplines ● { our panelists may have a lot to say on this topic … } Model & Data Bias -- Risks for the ML/AI Era { Biased } data ...
  • 37. Stuff I got wrong 1. “Compilers Matter Again!” 2. “CPU benchmarking is back!” 3. “We need policy driven auto-tiering storage” 4. “Single global storage namespace should be the goal” Failed predictions from past trends talks ...
  • 38. Stuff I got wrong 1. “Compilers Matter Again!” Failed predictions from past trends talks ... ● BioTeam observed performance differences between GNU compilers and optimized commercial compilers in ‘19 ● Significant difference for ‘hot’ tools like Relion CryoEM suite ● Expected more interest and more work building scientific tools for different compilers in 2019 - 2020 ● I was wrong
  • 39. Stuff I got wrong 2. “CPU benchmarking is back!” Failed predictions from past trends talks ... ● Intel has serious CPU competition for the first time in a while ● We expected a ton of AMD vs Intel benchmarking ● … especially as the exascale supercomputers announced in 2019-2020 seem to have clearly made their architecture choices ● I was wrong. Never materialized in our enterprise work
  • 40. Stuff I got wrong 3. “We need policy driven auto-tiering storage” Failed predictions from past trends talks ... ● Changed my mind on this ● IT-managed auto-tiering based on “policy” is not ideal for our world ○ Let me tell you about that time the policy engine on a 12 petabyte filesystem decided to archive & stub all .bashrc files to tape :) ○ Making IT own or control the tiering process is wrong. Active partnership needed. ○ Scientists manage large data sets in ways that are not easily translated to generic “policy” like “last access time”, “file age” -- great for corporate, bad for science workflows ● What we ACTUALLY need ○ User self-service for tiering, movement and archive decisions ○ Let researchers tier/move/archive based on Project or Group paths|tags
  • 41. Stuff I got wrong 4. “Single global storage namespace should be the goal” Failed predictions from past trends talks ... ● Awesome idea in theory ● But .. ○ Scientific leadership has failed to drive data management as a priority ○ Years of data show that scientific end-users are not responsible stewards when the storage environment is charitably described as “wild west” ● Tired of scientists who build careers on “data intensive science” WHINING about having to … um … actively manage data that drives their career ● I’m done coddling scientists as an IT person in this particular space ○ If big data is part of your research mission then do your darn job and take ownership ○ Have to move data between systems? Tiers? Archive? Tough cookies. Do your job.
  • 42. thanks; Panel time! Acknowledgements: ○ Stan Gloss ○ twitter.com/CircuitSwan ○ twitter.com/melrom ○ https://desertedislanddevops.com/ Twitter: chris_dag Email: <dag@bioteam.net> LinkedIn: chrisdag Slides: slideshare.net/chrisdag
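
The "no giant $HOME folders" rule from slide 26 only works if someone can see which home directories have grown too large. A minimal sketch of that kind of report follows. It assumes one subdirectory per user under a home root; the function names and the quota value in the usage note are illustrative and are not taken from the talk or from any site's actual policy.

```python
import os
from pathlib import Path


def dir_size_bytes(path):
    """Sum the sizes of regular files under path (symlinks are not followed)."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(path):
        for name in filenames:
            full = os.path.join(dirpath, name)
            if not os.path.islink(full):
                total += os.path.getsize(full)
    return total


def homes_over_quota(home_root, quota_bytes):
    """Return {directory_name: size_in_bytes} for each home above the quota.

    home_root is assumed to contain one directory per user; quota_bytes is a
    hypothetical per-home limit chosen by the site.
    """
    over = {}
    for entry in sorted(Path(home_root).iterdir()):
        if entry.is_dir() and not entry.is_symlink():
            size = dir_size_bytes(entry)
            if size > quota_bytes:
                over[entry.name] = size
    return over
```

Calling `homes_over_quota("/home", 40 * 1024**3)` would list every home directory above a hypothetical 40 GiB limit. On a petascale filesystem a full walk like this is slow, so a real deployment would more likely read the storage system's own quota or reporting interface.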
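
Slide 40 argues for researcher self-service tiering keyed on Project or Group paths rather than IT-owned policies such as "last access time". One way that could look is sketched below. The tier names, the `TierRequest` type and the membership mapping are all hypothetical; the sketch only decides which requests are acceptable and moves no data.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath

# Hypothetical tier names; a real site would define its own.
TIERS = ("scratch", "active", "archive")


@dataclass(frozen=True)
class TierRequest:
    user: str
    path: str
    target_tier: str


def plan_moves(requests, project_members):
    """Split requests into (approved, rejected).

    project_members maps a project root path to the set of users allowed to
    make tiering decisions for it. A request is approved only when the tier
    is known, the path lies inside a project root, and the requesting user
    belongs to that project.
    """
    approved, rejected = [], []
    for req in requests:
        p = PurePosixPath(req.path)
        owner_root = next(
            (root for root in project_members
             if p == PurePosixPath(root) or PurePosixPath(root) in p.parents),
            None,
        )
        if req.target_tier not in TIERS:
            rejected.append((req, "unknown tier"))
        elif owner_root is None:
            rejected.append((req, "path is outside any project"))
        elif req.user not in project_members[owner_root]:
            rejected.append((req, "user is not a project member"))
        else:
            approved.append(req)
    return approved, rejected
```

Because a path outside every project root is rejected, data kept in $HOME cannot be tiered this way, which matches the slide 26 rule that all allocations go through a Project or Group.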