Novel biotechnologies make it possible to create data at exascale with relatively little human, laboratory and thus monetary effort compared with the capabilities of only a decade ago. While the availability of this data helps to answer research questions that would not have been feasible to answer, or perhaps even to ask, before, the amount of data creates new challenges, which call for new software and data management systems. Such solutions have to take integrative approaches that consider not only the effectiveness and efficiency of data processing but also improve usability, reusability and reproducibility, tailored especially to the target user communities of biological data. Science gateways address these challenges: they are intuitive graphical user interfaces offering a single point of entry to distributed job, workflow and/or data management across organizational boundaries. Their overall goal is to increase the usability of applications, allowing users to focus on their specific research question instead of becoming acquainted with command-line tools and diverse access mechanisms to infrastructures. The talk will give an overview of existing technologies, of current issues regarding reusability and reproducibility, and of the results of two user surveys. It will especially highlight key challenges and the characteristics that cutting-edge developments should possess to fulfil the needs of the user communities and allow for seamless data analysis on a large scale.
Usability, Reusability and Reproducibility of Bioinformatic Applications
1. Sandra Gesing
Center for Research Computing
sandra.gesing@nd.edu
12 February 2016
Usability, Reusability and Reproducibility of Bioinformatic Applications
2. University of Notre Dame
http://chartsbin.com/view/1124
• In the middle of nowhere of northern Indiana (1.5 h from here)
• 4 undergraduate colleges
• ~35 research institutes and centers
• ~12,000 students
3. Center for Research Computing
• Software development and profiling
• Cyberinfrastructure/science gateway development
• Geographical Information Systems
• Visualization Support
• Computational Scientist support
• Collaborative research/grant development
• System administration/design and acquisition
• ~40 researchers, research programmers, HPC specialists
CRC and OIT building
http://crc.nd.edu
4. Center for Research Computing
• Computational resources: 25,000 cores+
• Storage resources: 3 PB
• Visualization systems
• Systems for virtual hosting
• Prototype architectures, e.g., Docker, OpenStack
• Access and interface to
  • XSEDE
  • Open Science Grid
  • Blue Waters
CRC HPC Center (old Union Station)
6. The Genomics Boom
February 16, 2001: biotech company Celera
February 15, 2001: The Human Genome Project
7. The Genomics Boom
Craig Venter (left) and Francis Collins (right)
8. Big Data
• Explosion in the quantity, variety and complexity of data
• Questions can be answered that were impossible to even ask about 10 years ago
• Costs far reduced (e.g., Human Genome project, 15 years, ~$2 billion; today ~3 days, $1000)
9. Big Data
http://www.genome.gov/images/content/cost_per_genome_oct2015.jpg
10. State of the Art
Data and compute-intensive problems
High-speed networks
Users generally not IT specialists
Tools and workflow engines
Web-based agile frameworks
Distributed data and computing infrastructures
11. Challenge for Developers
Data and compute-intensive problems
High-speed networks
Tools and workflow engines
Web-based agile frameworks
Distributed data and computing infrastructures
Users generally not IT specialists
Need for intuitive and self-explanatory user interfaces!
14. Usability
“After all, usability really just means making sure that something works well: that a person … can use the thing - whether it's a Web site, a fighter jet, or a revolving door - for its intended purpose without getting hopelessly frustrated.”
(Steve Krug in “Don't make me think!: A Common Sense Approach to Web Usability”, 2005)
15. Reusability
“The key to productivity is reusability. The easiest way to produce code is obviously to have it already!”
(John R. Bourne in “Object-oriented Engineering: Building Engineering Systems Using Smalltalk-80”, 1992)
16. Reproducibility
“The closeness of agreement between independent results obtained with the same method on identical test material but under different conditions (different operators, different apparatus, different laboratories and/or after different intervals of time) …”
(IUPAC (International Union of Pure and Applied Chemistry, iupac.org) Gold Book)
18. Science Gateways
“A Science Gateway is a community-developed set of tools, applications, and data that is integrated via a portal or a suite of applications, usually in a graphical user interface, that is further customized to meet the needs of a specific community.”
TeraGrid/XSEDE
20. Science Gateways
It’s a Science Gateway
It’s a Research Portal
It’s a Collaboratory
It’s a Cyberinfrastructure
It’s e-Science
eResearch
It’s a Virtual Lab
21. Frameworks and APIs
Re-inventing is not always necessary...
22. Frameworks and APIs
... and users should get more features easily...
23. Frameworks and APIs
... but the model should fit the demands of the community
26. Development of Science Gateways
Crucial Topics
• Close collaboration with user communities
• Knowledge about available technical solutions
Sounds easy but…
• Requirements of user communities often not so clear
• Technologies sometimes still under development for certain building blocks
→ Slow uptake of solutions
→ Larger effort for creating science gateways
28. New Science Gateways - Checklist
Domain-specific aspects:
• Goal, target area and target users
• Visions/demands on the layout
• Priorities of features and options, e.g., a list from must-have to great-to-have options
• Integration of existing applications or development of applications
• Technologies of the applications
• Visualization
• Security demands
• Workflows
29. New Science Gateways - Checklist
Organizational aspects:
• Time constraints for the development, agreement on a (maybe even rough) project plan with milestones
• Agreement on alpha- or beta-tester
• Regular meetings
30. New Science Gateways - Checklist
Technical aspects:
• Experience with existing frameworks and programming languages
• Available infrastructure including security infrastructure and resources
• Available support of suitable technologies
• Scalability of suitable technologies
• Effort for extending existing technologies compared to novel developments
• Synergy effects with other science gateway projects
31. Science Gateways
A new era…
• Novel developments of web-based agile frameworks
• Infrastructure providers report that science gateways are more used than command lines
http://www.iplantcollaborative.org
32. Science Gateways
A new era…
• Novel developments of web-based agile frameworks
• Infrastructure providers report that science gateways are more used than command lines
But also always new challenges…
• Novel infrastructures
• Novel data sources such as the next Next-Gen Sequencing
→ Support of developers necessary
33. Science Gateway Institute
2012 NSF Software Institute conceptualization award
2015 NSF Software Institute implementation proposal ($15M)
Services
• Incubator
• Developer support team
• Gateway framework directory
• Workforce development
http://sciencegateways.org
34. Science Gateway Survey 2014
• 29,000-person survey
• 4957 responses from across domains
35. Science Gateway Survey 2014
What services would be helpful?
36. Bioinformatic Infrastructure Survey
• Nick Loman (Birmingham, UK)
• Thomas Connor (Cardiff, UK)
• October 2015
• 272 answers
https://drive.google.com/drive/folders/0B7KZv1TRi06fLUJCU1BYM3JScjg
38. Bioinformatic Infrastructure Survey
[Bar chart: Where do bioinformaticians do most of their work. Categories: Cloud, Institution-wide resource, Local resource, Personal computer; axis from 0 to 120]
39. Bioinformatic Infrastructure Survey
[Bar chart: Where do bioinformaticians do most of their work. Categories: Cloud, Institution-wide resource, Local resource, Personal computer; axis from 0 to 120]
[Bar chart: Why do bioinformaticians use the software they use. Categories: Best for job, Good documentation, Word of mouth recommendation, Used in similar analysis, Quickest, Already installed on server, Other, Graphical interface; axis from 0% to 90%]
41. Bioinformatic Infrastructure Survey
Questions around frustration and limitations of using
• Bioinformatic software
• Bioinformatic resources
• HPC and Cloud infrastructures
and about challenges to train students in bioinformatics
Answers often address
• Hurdles to use bioinformatic resources because of commandline access or not available software
• Quality of documentation of software
• Need for parsers and converters for diverse data formats
• Long waiting time for support or even lack of support
42. Challenges
A world-wide research computing infrastructure
• Transparent service selection
  • e.g., Docker could be part of the solution
• Access to data irrespective of location
• Options to share data efficiently
• Appropriate privacy and security measures
• Optimized usage of resources
  • e.g., optimized usage of cloud computing and their business models
45. Challenges
Integration of data sources and instruments
• Different data formats
• Different interfaces
• Different hardware and technologies
… from small ones to the big ones…
46. Challenges
Software searchability, reproducibility and reusability
• Science gateways step in the right direction but … much more work necessary on searchability…
Not only finding any data for a research area but finding the right data
• Metadata approaches
• Dictionaries
• More involvement of librarians
47. Challenges
Software searchability, reproducibility and reusability
• Science gateways step in the right direction but … much more work necessary on reproducibility and reusability…
• studies in medicine and pharmacology: 11% or 6% of the analysed research was reproducible
• myExperiment: only 20% of workflows reusable because of dependencies on hardware, local or distributed data, software versions
48. Challenges
Software searchability, reproducibility and reusability
• Science gateways and workflow systems step in the right direction but … much more work necessary on reproducibility and reusability…
• Containerization approaches
• Migration approaches
• Combination of both
50. Projects - OSF
• Big Data
• Reproducibility
Open Access to Data and Projects could solve parts of the problems…
51. Projects - WSSSPE
Need of foundational building blocks and a reward system for software engineering!
https://github.com/wssspe
[Diagram labels: Early adopters, Publicity, Wider adoption, Funding ends, Scientists disillusioned, New project prototype]
52. Projects – B3 Book
Biology, Bioinformatics and Big Data
arXiv:1511.02689 [cs.DC]
53. EU COST Action cHiPSet (IC1406)
cHiPSet – High Performance Modeling and Simulation for Big Data Applications
• April 2015 – April 2019
• 15 countries - 12 COST, 3 non-COST (US, China, Australia)
• 37 research organizations/companies (31 COST, 6 non-COST)
http://www.cost.eu/COST_Actions/ict/Actions/IC1406
54. EU COST Action cHiPSet
55. cHiPSet - Collaborations
Projects declared interest for collaboration
• NESUS (Network for Sustainable Ultrascale Computing) http://www.nesus.eu/
• KEYSTONE (Semantic keyword-based search on structured data sources) http://www.keystone-cost.eu/
• AAPELE (Algorithms, Architectures and Platforms for Enhanced Living Environment) http://aapele.eu/
And maybe YOU?