The primary goal of my trip to Seattle was to establish a collaboration with a world-leading group on data integration. Because Seattle is a hub for technology companies, I also learned about synergies between business and research: Ilya Shmulevich from the Institute for Systems Biology uses Amazon's "Random Forest" implementation and Google's 600,000-CPU cluster for cancer genomic association discovery. I also met with experts from the University of Washington and Microsoft Research to learn about technological advances for tackling BigData and commoditizing parallelization. Finally, I observed a government-funded research agency invest in solutions geared towards its enterprise structure rather than adopt solutions designed for research institutes without an active computational community. In conclusion, CSIRO has unique properties and skill sets that many collaborators would be interested in benefiting from. In return, such collaborations would propel CSIRO instantly to the forefront of technology, which could be very rewarding, particularly for the analysis of big, unstructured datasets.
Trip Report Seattle
1. Seattle Trip Report
Data Integration – Company Engagement – BigData
Denis C. Bauer | Research Scientist
19 November 2012
CMIS
2. About me
• BSc (Germany) Bioinformatics + Hons (ITEE, UQ) “In Silico Protein Design” Machine Learning
• PhD (IMB, UQ) “Quantitative models of Transcriptional regulation” Optimization
• PostDoc (IMB, UQ) “Sorting the intranuclear proteome” Bayesian Networks
• PostDoc (QBI, UQ) Bioinformatics for the Sequencing Facility Operation
• Research Scientist (CSIRO)
“Data integration of ‘Omics data in CRC”
• Develop protocols for data generation
• Develop pipelines for analysis
• Research ways for data integration
pHealth (Garry Hannan)
3. Seattle: Future hub for life sciences?
Seattle Trip Report | Denis C. Bauer | Page 3
4. Primary Goal: Collaboration with
William Noble
Bayesian network for automatic
grouping of genomic functional elements
(TSS, gene) by learning simultaneously from
measured genomic features (histone
modifications)
Bill Noble
Michael Hoffman
6. Institute for Systems Biology: case study
for BigData
TCGA has 20 different cancer
types with up to 900 samples
each.
• Faster computers
• Better approaches
Amazon: machine learning method for uncovering
multivariate associations from large and diverse data sets.
Google: Use 10,000 – 600,000 cores and benefit from
Google expertise in compute and storage.
Ilya Shmulevich
7. ISB App Engine Presentation at Google IO 2012
http://popcorn.webmadecontent.org/4d3
8. Focusing on large-scale and tactile interactive experiences that engross and
envelop the visitor, Philip Worthington (1977-) created Shadow Monsters, a
digital version of the traditional shadow puppet.
9. Can CSIRO use outline detection to do cool stuff?
10. Road Trip to Pacific Northwest National Laboratory
11. Road Trip to PNNL
12. Road Trip to PNNL
13. Road Trip to PNNL
14. Road Trip to PNNL
15. Road Trip to PNNL
16. Enterprise-wide multidisciplinary
collaborations
PNNL predicts from sensor data if and when
radioactive material hits ground water.
Mathematical and visual prediction methods of
compute-intensive expert systems
Ian’s team develops a framework that allows
enterprise-wide collaboration
• Data sharing/annotation/provenance
• Computational expert pipelines -> graphical
programming -> domain experts
• Developed for computer-grid infrastructure
Ian Gorton
17. Commoditize parallelization
Computer Science & Engineering
University of Washington
Currently: Expert-system if !(embarrassingly parallel)
• Deciding how to most efficiently bundle for parallel
execution and how to resolve
• The appropriate method can change with the actual load
at runtime
Parallelization needs to become something the
compiler at run time works out for us
(just like we don’t write assembly code anymore)
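To illustrate why bundling matters, here is a minimal sketch, not taken from the talk, that compares two ways of distributing tasks of uneven cost over a fixed number of workers. The task costs, the function names and the greedy longest-task-first heuristic are all assumptions made for illustration; this is not how any of the systems named on this slide is implemented.

```python
import heapq


def round_robin_makespan(costs, workers):
    """Assign task i to worker i % workers; return the busiest worker's load."""
    loads = [0] * workers
    for i, cost in enumerate(costs):
        loads[i % workers] += cost
    return max(loads)


def greedy_makespan(costs, workers):
    """Hand each task, largest first, to the currently least-loaded worker."""
    heap = [0] * workers  # min-heap of worker loads (all zeros is a valid heap)
    for cost in sorted(costs, reverse=True):
        lightest = heapq.heappop(heap)
        heapq.heappush(heap, lightest + cost)
    return max(heap)


# Hypothetical, skewed task costs in arbitrary time units.
costs = [50, 1, 1, 1, 40, 1, 1, 1, 30, 1, 1, 1]

print(round_robin_makespan(costs, 4))  # 120: all expensive tasks land on one worker
print(greedy_makespan(costs, 4))       # 50: bounded by the single largest task
```

With these made-up costs, round-robin bundling puts all three expensive tasks on the same worker, while the greedy heuristic spreads them out. The point is only that the best bundling depends on the actual load, which is rarely known before run time.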
• SciDB
• SkewTune (better load balancing for Hadoop)
• HaLoop (iterative parallel data processing)
Magdalena Balazinska