Data science takes many skills, people, and a lot of time before results become accessible. Moreover, these results can no longer be static. And finally, Big Data steps up to the plate, and the whole tool chain needs to change.
In this talk, Data Fellas introduces Shar3, a toolkit that aims to bridge the gaps in building an interactive distributed data processing pipeline, or loop!
The talk then covers the problems genomics faces today, including data types, processing, and discovery, by introducing the GA4GH initiative and its implementation using Shar3.
Data Enthusiasts London: Scalable and Interoperable data services. Applied to Genomics
1. by Data Fellas,
Data Enthusiasts v 4.0 (July, 13th ‘15)
Scalable and Interoperable data services
Applied to Genomics
2. Young Belgian Startup
The Data Fellas Startup
Data Science
Xavier Tordoir
@xtordoir
Andy Petrella
@noootsab
Data Processing
Scalable Machine Learning
Micro Services oriented
4. Data Fellas: Evangelizing
Training
Scala
Apache Spark (BE, in September)
http://spark4devs.data-fellas.guru/
Distributed Machine Learning
Pipeline (Oakland, August)
http://bigdatascala.bythebay.io/training.html
Apache Spark
(SFO with BoldRadius, August)
Talks
Scala IO, Devoxx Belgium,
Devoxx France, Scala Days, KTH,
KUL, Spark Meetup London, …
more to come (Italy, …)
PMC Member at Strata NY
PMC member at Devoxx
PMC Member at Foss4G
13. Next: Applied TO Genomics
Genomics data is pretty big
● 100,000s of genomes in 2015
● 1,000,000s …
● 100,000,000s …
● …
14. Next: Applied TO Genomics
Genomics data is pretty big and high-dimensional
One genome:
○ a sequence of 3 billion bases (the basic DNA component)
○ 30 - 60 x coverage for quality
○ tens to hundreds of millions of variants (bases that vary
from one individual to the next)
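To give a feel for the numbers on this slide, here is a rough back-of-the-envelope sketch in Python. The genome length and coverage range come from the slide; the one-byte-per-base figure and the function names are assumptions made for illustration only, and real sequencing files (with quality scores and compression) will differ.

```python
# Figures taken from the slide.
GENOME_BASES = 3_000_000_000   # bases in one genome
COVERAGE_RANGE = (30, 60)      # sequencing depth ("30 - 60 x coverage")

# Assumption for illustration: one byte per sequenced base,
# ignoring quality scores, read names, and compression.
BYTES_PER_BASE = 1


def raw_bases(coverage: int, genome_bases: int = GENOME_BASES) -> int:
    """Total number of bases read when sequencing one genome at this coverage."""
    return genome_bases * coverage


def raw_gigabytes(coverage: int, bytes_per_base: int = BYTES_PER_BASE) -> float:
    """Approximate raw size in GB (10^9 bytes) under the assumption above."""
    return raw_bases(coverage) * bytes_per_base / 1e9


if __name__ == "__main__":
    for coverage in COVERAGE_RANGE:
        print(f"{coverage}x coverage: {raw_bases(coverage):,} bases, "
              f"roughly {raw_gigabytes(coverage):.0f} GB")
```

Under these assumptions, a single genome already amounts to on the order of a hundred gigabytes of raw bases before any variant calling.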
15. Next: Applied TO Genomics
e.g. 1000genomes project:
● 200TB compressed data
● organised in files/directories
● data formatted following specs in a … PDF
Data and services schemas are required
16. What do we do with genomics data?
Lots of Querying and Learning:
E.G.
● Population structure is a fundamental basis
● Querying relationships between genomes and other
biological features
Hey… no one has all the data!
Metadata
17. What do we do with genomics data?
Lots of Querying and Learning:
E.G.
● We do some specific Modelling on some data…
Hey… no two serve the same computations!
Service Discovery
21. Wrap-UP
Follow us @DataFellas and get notified about our
+ sharing platform at scale: Shar3
+ Google Genomics At Home (^.^): Med@Scale
+ future plans: modules for Trading, Geospatial,
other medical data, …