Presenter: Chen Li, PhD. Professor, Department of Computer Science, University of California Irvine
Abstract
Many data analytics projects have collaborators with complementary backgrounds, including biologists, bioinformaticians, computer scientists, and AI/ML experts. Many of them have limited experience with coding, setting up computing infrastructure, and using ML models. Existing tools and services, such as email attachments, GitHub, and Google Drive, are inefficient for sharing data and analyses. In this talk, we present an open source system called Texera that provides a cloud computing platform for collaborators to share data and analyses as workflows. After seven years of development, the system has a rich set of powerful features, such as shared editing, shared execution, version control, commenting, debugging, user-defined functions in multiple languages (e.g., Python, R, Java), and support for state-of-the-art AI/ML techniques. Its backend parallel engine enables scalable computation on large data sets using computing clusters. We will show a demo of the system and present our vision, supported by a recent NIH award, dkNET (NIDDK Information Network, https://dknet.org), to serve the diabetes, endocrinology, and metabolic diseases research communities through the FAIR sharing of data and knowledge.
Resource link: https://github.com/Texera/texera
Upcoming webinars schedule: https://dknet.org/about/webinar
4. Coding challenges
- Coding is hard!
- Version control of libraries
- Needs servers
- Slow on large data
- Not every lab can afford a bioinformatician
[Figure labels: Data preparation, Data analytics, Visualization; Sally: Bioinformatician]
Chen Li, UCI
5. Collaboration challenges
● Collaborators of different backgrounds:
○ Biologists
○ Bioinformaticians
○ Computer scientists
● Collaborators from different organizations
○ Same lab: senior students vs new students
○ Other labs
6. Collaboration: existing tools
Limitations
● Only file management, no run-time environment
● Inefficient!
7. AI/ML opportunities
● How to utilize state-of-the-art AI/ML technologies?
● Require advanced coding skills
● Not easily available
8. Our solution
Cloud-computing services for sharing data and workflow-based analyses
Benefits:
- Cloud services (no installation, software patches)
- Version control
- Shared editing/execution
- Sharing data and workflows
- Parallel engine, scalable
- …
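To make the idea of a workflow-based analysis concrete, here is a minimal Python sketch of a workflow as a directed graph of operators that run in dependency order. It is an illustration only and does not use Texera's API: the operator names, the sample rows, and the `run_workflow` helper are all hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical operators. Each receives a list of upstream outputs.
def load(_inputs):
    # Stand-in for a data source operator (made-up sample rows).
    return [
        {"gene": "A", "count": 12},
        {"gene": "B", "count": 0},
        {"gene": "C", "count": 7},
    ]

def drop_zero_counts(inputs):
    (rows,) = inputs
    return [row for row in rows if row["count"] > 0]

def total_count(inputs):
    (rows,) = inputs
    return sum(row["count"] for row in rows)

def run_workflow(operators, upstream):
    """Run each operator after all of its upstream operators have finished."""
    results = {}
    for name in TopologicalSorter(upstream).static_order():
        inputs = [results[dep] for dep in upstream.get(name, ())]
        results[name] = operators[name](inputs)
    return results

# The workflow: load -> filter -> sum.
operators = {"load": load, "filter": drop_zero_counts, "sum": total_count}
upstream = {"load": (), "filter": ("load",), "sum": ("filter",)}

results = run_workflow(operators, upstream)
print(results["sum"])
```

Because the graph is plain data, it is the kind of artifact that could be versioned, shared, and edited by several collaborators, which is the motivation the slide gives for workflows over ad hoc scripts.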
12. Figures on the entire dataset
[Figure panels: Quality Control, Elbow plot, Clustered UMAP, Annotated UMAP]
13. Texera Statistics
- User accounts: 332
- Projects: 86
- Workflows: 2,257
- Executions: 31,000
- Workflow versions: 357,000
- Publications: 23
- Deployed servers: 7
- CPU cores in the largest deployment: 400
- Files on GitHub: 1,291
- Lines of code on GitHub: 101,690
- Pull requests on GitHub: 2,096
- Current PhD students: 7
- Collaborating professors: 17
- Involved undergraduates: 80+
- Completed PhD theses: 3
- Development years: 7
17. New NIH award (dkNET)
Mission: to serve the diabetes, endocrinology, and metabolic diseases research communities through the FAIR sharing of data and knowledge.
19. Open research problems
- Support a ChatGPT-like interface
- Provide more operators and workflows related to sequencing
- Make analysis parameters configurable
- Parallelize bottleneck steps
- Make more AI/ML techniques available
- Migrate existing programs to workflows
- Support public clouds (e.g., AWS, GCP)
- …
20. Summary
- Cloud-computing platform
- GUI-based workflows (no coding needed)
- Collaboration and sharing of data/analyses
- Parallel computing: for big data
- Supporting multiple languages: Python, R, Java, …
- Supporting AI/ML (training, inference, …)
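The "parallel computing for big data" point can be illustrated with a small data-parallel sketch: split the input into partitions, apply the same operator to each partition concurrently, and merge the partial results. This is an assumption-laden toy, not Texera's engine. The function names and read lengths are made up, and local threads stand in for the workers of a computing cluster.

```python
from concurrent.futures import ThreadPoolExecutor

def partition(rows, n_parts):
    """Split rows into n_parts round-robin partitions."""
    return [rows[i::n_parts] for i in range(n_parts)]

def count_long_reads(part, min_length=100):
    """Hypothetical operator: count reads at least min_length long."""
    return sum(1 for length in part if length >= min_length)

def parallel_count(rows, n_workers=4):
    parts = partition(rows, n_workers)
    # Threads are only a stand-in here; a real engine would place
    # partitions on separate processes or machines.
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partial_counts = list(pool.map(count_long_reads, parts))
    # Merge step: partial counts combine by simple addition.
    return sum(partial_counts)

read_lengths = [50, 150, 99, 100, 250, 75, 300, 120]  # made-up sample data
total = parallel_count(read_lengths)
print(total)
```

The pattern works because counting is associative: the merged result equals what a single worker would compute over the whole data set, so adding workers changes speed but not the answer.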
21. Texera: A Scalable Cloud Computing Platform for Sharing Data and Workflow-Based Analyses
Prof. Chen Li
Computer Science, UC Irvine
Acknowledgements: Yicong Huang, Sally Lee, Xinyuan Lin, Xiaozhen Liu, Kun Woo (Chris) Park, Kevin Wu, and the Texera team