4. - Keiichiro Ono (大野 圭一朗)
- University of California, San Diego, School of Medicine
- Trey Ideker Lab
- Software Engineer / Research Associate
- National Resource for Network Biology (NRNB)
- nrnb.org
5. Keiichiro Ono
- Background
  - Bioinformatics
  - Computer Science
- Work
  - Research
    - Bioinformatics workflow
    - Visualization pipeline
  - Data
    - Visualization
      - Networks
      - Other Biological Data
    - Integration
      - Molecular Interactions
      - Pathways
      - Annotations
  - Software Development
    - Cytoscape
    - NeXO
    - Cyberinfrastructure
    - All kinds of small tools
- Like
  - Art
    - Kandinsky
    - Mondrian
  - Music
    - Electronica
    - Techno
      - Minimal
      - Detroit
    - Jazz
  - Sci-fi
    - Movie
    - Novel
- Life
  - US
    - San Diego
    - San Francisco Bay Area
    - Los Angeles
    - Orange County
  - Japan
    - Gifu
    - Tokyo
37. Chart Editor
- Visualize multiple data points in a single view
- Time series data
- Multiple GO terms
- Chart types: Bar, Box, Pie, Heat Map, Ring
- Part of the standard Visual Style Editor
- Everything is saved in session files
47. User Type I
- So-called “bench biologists”
- Heavy Excel users
- Domain experts who produce the data
- But in many cases analysis and visualization are not yet their strength…
48. User Type II
- Bioinformaticians
- Use tools such as Python + SciPy/NumPy, R + Bioconductor, and MATLAB on a daily basis
- Write their own libraries when needed
- Make heavy use of large-scale computing resources
- “Manual work is evil!”
49. Both types of users are important!
- Type I: “Bench Biologists”
- Domain experts
- Data producers
- Type II: Computational Biologists
- Experts in large-scale data analysis
- Especially important for genome-scale data analysis
Cytoscape has few features focused on this group…
50. User Type II
- Bioinformaticians
- Use tools such as Python + SciPy/NumPy, R + Bioconductor, and MATLAB on a daily basis
- Write their own code when needed
- Make heavy use of large-scale computing resources
- “Manual work is evil!”
51. Requests from Type II Users
- I have 200 networks in my session and I need to create
one PDF per view. How can I do it with Cytoscape?
- I need to use igraph for network analysis, but its
visualization feature is limited. I want to use Cytoscape
as an external visualization engine for R.
- Usually I use IPython Notebook to record my work.
How can I integrate Cytoscape into my workflow?
- I want to generate a Style for each time point and create
small multiples of networks.
72. Changes in software development style
- An application is a collection of smaller services
- JavaScript is a first-class citizen in the world of
programming languages
- Design applications with cloud services in mind
74. In the modern era, software is commonly delivered as a
service: called web apps, or software-as-a-service. The twelve-factor
app is a methodology for building software-as-a-service apps that:
• Use declarative formats for setup automation, to minimize time and
cost for new developers joining the project
• Have a clean contract with the underlying operating system, offering
maximum portability between execution environments
• Are suitable for deployment on modern cloud platforms, obviating
the need for servers and systems administration
• Minimize divergence between development and production,
enabling continuous deployment for maximum agility
• And can scale up without significant changes to tooling,
architecture, or development practices.
77. This MANIFESTO counters
current trends in
bioinformatics where
institutes and companies
are creating monolithic
software solutions aimed
mostly at end-users.
78. –THE SMALL TOOLS MANIFESTO FOR BIOINFORMATICS
“Every single tool should do the smallest possible
task really well”
85. Trends in data analysis tools
- Python is becoming the standard language for “Data Scientists”
- Python itself is a very slow language, but it is a perfect glue language
- Lots of tools are made by scientists (e.g. Anaconda by Continuum)
- They understand the current problems in modern scientific computing and are trying to solve them
89. - Visualization needs vary, especially for complex data sets such as those from the life sciences
- For that purpose, Java is not the best language for implementing applications
- Even large-scale data visualization applications are moving to web browsers
- Canvas (Cytoscape.js), WebGL (Three.js), SVG (D3.js)
- Most of the talented hackers are working on web browsers, i.e., in JavaScript
92. Challenges in scientific computing environments
- No more free lunch
- Even if you buy expensive machines, you no longer get performance gains for free. You have to design your code for massively distributed environments (from scale-up to scale-out).
- Complex data analysis pipelines
- Need for complex, customized data visualization
- Reproducibility
- Building the pipeline is itself complex, which makes reproducibility hard to ensure
107. Srivas, Rohith et al. “Assembling Global Maps of Cellular Function through
Integrative Analysis of Physical and Genetic Networks.” Nature Protocols
6.9 (2011): 1308–1323. PMC. Web. 1 Dec. 2014.
108. History of PanGIA Application
[Diagram. Boxes: Biological Problem; Core algorithm 1 … n as Python; Java Implementation of Algorithms; Cytoscape 2.x Plugin; Cytoscape 3.x App; PanGIA Service (Implement in Python again…?). Contributor labels: by Sourav; by Greg, Rohith; by Greg, Rohith and Cytoscape Team; by David.]
112. NeXO Web
- Term Enrichment Analysis
- From a list of genes, perform a hypergeometric test over the set of machine-generated ontology (NeXO) terms and display the terms with their p-values
- It is independent of all other parts of the NeXO Web application
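The enrichment test described on this slide can be sketched in a few lines. This is only an illustration under stated assumptions, not the NeXO implementation: the function names `hypergeom_sf` and `enrich`, the term IDs, and the gene IDs are all hypothetical, and the tail probability is computed with the standard library so that the sketch has no dependencies. With SciPy installed, `scipy.stats.hypergeom.sf(k - 1, M, n, N)` should give the same value.

```python
from math import comb


def hypergeom_sf(k, M, n, N):
    """P(X >= k): chance of seeing at least k annotated genes when drawing
    N genes from a background of M genes, n of which carry the term."""
    upper = min(n, N)
    favorable = sum(comb(n, i) * comb(M - n, N - i) for i in range(k, upper + 1))
    return favorable / comb(M, N)


def enrich(query_genes, term_to_genes, background):
    """Return (term, overlap, p_value) tuples sorted by ascending p-value."""
    background = set(background)
    query = set(query_genes) & background
    results = []
    for term, genes in term_to_genes.items():
        annotated = set(genes) & background
        overlap = len(query & annotated)
        if overlap == 0:
            continue  # nothing to report for terms with no hits
        p = hypergeom_sf(overlap, len(background), len(annotated), len(query))
        results.append((term, overlap, p))
    return sorted(results, key=lambda r: r[2])


# Toy data: all identifiers below are made up for illustration.
background = [f"g{i}" for i in range(10)]
terms = {
    "TERM:A": ["g0", "g1", "g2"],
    "TERM:B": ["g5", "g6", "g7", "g8"],
}
results = enrich(["g0", "g1", "g2", "g5"], terms, background)
for term, overlap, p in results:
    print(f"{term}\toverlap={overlap}\tp={p:.4f}")
```

In practice a multiple-testing correction would usually follow, since one test is run per ontology term; that step is left out here.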
113. Overview of NeXO Term Enrichment Service
- NeXO Web
- RESTful API
- Term Enrichment Service
- API by Flask
- Python Core
- SciPy
- NumPy
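The layering on this slide (a thin web API in front of an independent Python core) can be sketched as follows. The slide names Flask for the API layer; to keep the example dependency-free, this sketch swaps in a plain WSGI callable from the standard library instead, and Flask would replace the routing and parsing boilerplate. The `/enrich` path, the JSON shape, the helper `call`, and the placeholder core `run_enrichment` are all assumptions made for illustration, and the gene IDs are placeholders.

```python
import io
import json


def run_enrichment(genes):
    """Hypothetical stand-in for the Python core; it only echoes the query."""
    unique = sorted(set(genes))
    return {"query_size": len(unique), "genes": unique, "terms": []}


def _json_response(start_response, status, payload):
    body = json.dumps(payload).encode("utf-8")
    start_response(status, [("Content-Type", "application/json"),
                            ("Content-Length", str(len(body)))])
    return [body]


def app(environ, start_response):
    """WSGI application exposing a single endpoint: POST /enrich."""
    if (environ.get("REQUEST_METHOD") != "POST"
            or environ.get("PATH_INFO") != "/enrich"):
        return _json_response(start_response, "404 Not Found",
                              {"error": "not found"})
    length = int(environ.get("CONTENT_LENGTH") or 0)
    try:
        payload = json.loads(environ["wsgi.input"].read(length) or b"{}")
        genes = payload["genes"]
        if not isinstance(genes, list):
            raise TypeError("genes must be a list")
    except (ValueError, KeyError, TypeError):
        return _json_response(start_response, "400 Bad Request",
                              {"error": "expected JSON like {\"genes\": [...]}"})
    return _json_response(start_response, "200 OK", run_enrichment(genes))


def call(wsgi_app, path, payload):
    """Invoke the WSGI app in-process, without opening a network socket."""
    raw = json.dumps(payload).encode("utf-8")
    environ = {
        "REQUEST_METHOD": "POST",
        "PATH_INFO": path,
        "CONTENT_LENGTH": str(len(raw)),
        "wsgi.input": io.BytesIO(raw),
    }
    captured = {}

    def start_response(status, headers):
        captured["status"] = status

    chunks = wsgi_app(environ, start_response)
    return captured["status"], json.loads(b"".join(chunks))


status, result = call(app, "/enrich", {"genes": ["YAL001C", "YBR002W", "YAL001C"]})
print(status, result)
```

To serve it over HTTP for a quick local check, `wsgiref.simple_server.make_server("", 8000, app).serve_forever()` from the standard library is enough. Because the core is an ordinary function, any client that can send JSON can call it, which is the point of keeping the service independent of the rest of the web application.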
115. Option 1: As a Cytoscape App
- Re-implement this algorithm as a Cytoscape App
(Java Application)
- Pros:
- Easy to install
- Cons:
- A lot of work…
- Must be written in Java
- Does not scale out!
116. Option 2: As a Service
- Wrap existing applications and deploy them to the platform of the user's choice:
- Laptops, private servers, and commercial cloud services (AWS, Google Compute Engine, etc.)
- Pros:
- Scales out
- Client-independent
- Workflow-friendly
- Cons:
- Need to adapt to a new way of designing software
- Relatively more complex deployment
128. Software Distribution Problem
- “It-worked-on-my-machine” syndrome
- This is a serious problem especially when
you want to share your workflow with
collaborators.
131. What is Docker?
- Container to run applications in an isolated
environment
- Application = Layer of images
- Sharable Environments
- Environments as code
148. We (the NIH) Are Working On, But As
Yet Do Not Have Good Answers To:
1. Today, how much are we actually
spending on data and software related
activities?
2. How much should we be spending to
achieve the maximum benefit to
biomedical science relative to what we
spend in other areas?
Biomedical Research as an Open Digital Enterprise by Philip E. Bourne Ph.D.
Associate Director for Data Science (NIH)
149. Reproducibility
• Most of the 27 Institutes and Centers of the NIH are currently reviewing the ability to reproduce research they are funding
• The NIH recently convened a meeting with publishers to discuss the issue – a set of guiding principles arose
Biomedical Research as an Open Digital Enterprise by Philip E. Bourne Ph.D.
Associate Director for Data Science (NIH)
150. NIH The Commons
(Definition by Dr. Bourne)
• Is Not:
• A database
• Confined to one physical location
• A new large infrastructure
• Owned by any one group
• Is:
• A conceptual framework
• Analogous to the Internet
• A collaboratory
• A few shared rules
• All research objects have unique identifiers
• All research objects have limited provenance