Weaving Dataflows with Silk 
Taro L. Saito 
Treasure Data, Inc. 
leo@xerial.org 
 
September 6th, 2014  
ScalaMatsuri @ Tokyo 
xerial.org/silk
About Me 
Treasure Data Console 
Processing Job Table 
Functional Style Writing 
Need an Optimization? 
Procedural Style Writing 
• Describes how to process data
Declarative Style Writing 
• Less programming
• The system decides how to optimize the code
• Hash joins, Bloom filters, and various other optimization techniques are now available
Weaving Silk 
• Making data processing code independent of the execution method!
[Diagram: Silk[A] (operation DAG) → Weave (Execute) → Result. Weavers shown: In-memory weaver, Cluster weaver (Spark?), MapReduce weaver, Your own weaver (using TD?). Label: Silk Product]
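The separation between a dataflow description and its execution can be illustrated with a minimal sketch. Everything here is hypothetical and simplified for illustration (`WeaverSketch`, `FromSeq`, `FilterOp`, `ChunkedWeaver` and the `Weaver` trait are assumed names, not Silk's actual API): a `Silk[A]` value only records operations, and each weaver chooses how to run them.

```scala
// Hypothetical, simplified model for illustration; not Silk's real API.
object WeaverSketch {
  // A Silk[A] value only describes an operation; nothing runs when it is built.
  sealed trait Silk[A]
  final case class FromSeq[A](data: Seq[A]) extends Silk[A]
  final case class MapOp[A, B](input: Silk[A], f: A => B) extends Silk[B]
  final case class FilterOp[A](input: Silk[A], p: A => Boolean) extends Silk[A]

  // A weaver decides how a description is executed.
  trait Weaver {
    def weave[A](silk: Silk[A]): Seq[A]
  }

  // Evaluates the operation DAG directly in the local JVM.
  object InMemoryWeaver extends Weaver {
    def weave[A](silk: Silk[A]): Seq[A] = silk match {
      case FromSeq(data)      => data
      case MapOp(input, f)    => weave(input).map(f)
      case FilterOp(input, p) => weave(input).filter(p)
    }
  }

  // Processes data chunk by chunk; chunks stand in for cluster partitions.
  // chunkSize must be positive.
  final class ChunkedWeaver(chunkSize: Int) extends Weaver {
    def weave[A](silk: Silk[A]): Seq[A] = silk match {
      case FromSeq(data)      => data
      case MapOp(input, f)    => weave(input).grouped(chunkSize).flatMap(_.map(f)).toVector
      case FilterOp(input, p) => weave(input).grouped(chunkSize).flatMap(_.filter(p)).toVector
    }
  }
}
```

The same `Silk` value can be handed to either weaver without changing the code that built it, which is the property the slide emphasizes.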
Cluster Weaver: Logical Plan to Physical Plan on Cluster 
• Physical plan on cluster
[Diagram: Silk[people] → Scatter into inputs I1 to I3 → Partition (hashing) into P1 to P3 → serialization → shuffle → deserialization → merge sort → R1 to R3; intermediate blocks are labeled S1 to S3]
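The stages named in the diagram (partition by hashing, shuffle, merge sort) can be sketched on local collections. This is an assumption-laden illustration, not Silk's cluster weaver: `ShuffleSketch`, `partitionOf`, `scatter` and `shuffle` are hypothetical names, serialization is omitted, and a plain `sortBy` stands in for merging pre-sorted runs.

```scala
// Hypothetical local simulation of partition -> shuffle -> merge sort.
object ShuffleSketch {
  // Choose a partition for a key by hashing; floorMod keeps the index non-negative.
  def partitionOf[K](key: K, n: Int): Int = Math.floorMod(key.hashCode, n)

  // Each input split sorts its records and cuts them into n partitions.
  def scatter[K: Ordering, V](split: Seq[(K, V)], n: Int): Vector[Seq[(K, V)]] = {
    val sorted = split.sortBy(_._1)
    Vector.tabulate(n)(p => sorted.filter { case (k, _) => partitionOf(k, n) == p })
  }

  // Partition p of every split goes to reducer p.
  // A real engine would merge the pre-sorted runs; sortBy stands in for that here.
  def shuffle[K: Ordering, V](splits: Seq[Seq[(K, V)]], n: Int): Vector[Seq[(K, V)]] = {
    val scattered = splits.map(s => scatter(s, n))
    Vector.tabulate(n)(p => scattered.flatMap(parts => parts(p)).sortBy(_._1))
  }
}
```

Because the partition index depends only on the key, all records with the same key end up at the same reducer regardless of which input split they came from.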
DAG-based Data Processing Engines 
• Spark
  • Creates a task schedule for distributed processing
• Summingbird
  • Integrates stream and batch data processing
  • e.g., running Scalding and Storm at the same time
• Apache Tez
  • Creates a DAG schedule for optimizing MapReduce pipelines
• GNU Make (Makefile)
  • Describes a pipeline of UNIX commands
Why do we need another framework?
Challenge: Isolate Code Writing and Its Execution 
• Why can't we run the program until we finish writing it?
• How can we depart from the compile-then-run paradigm?
[Diagram: Silk[A] (operation DAG) → Weave (Execute) → Result, through a weaver]
Genome Science is A Big Data Science 
• By sequencing, we can find 3 million SNPs for each person
• To find the cause of a disease (one or a few SNPs), we need to sequence as many samples as possible to narrow down the candidate SNPs
• Input: FASTQ file(s), 500 GB (50x coverage, 200 million entries)
  • DNA sequencer (Illumina, PacBio, etc.)
• f: an alignment program
• Output: alignment results, 750 GB (sequence + alignment data)
• Total storage space required: 1.2 TB
[Diagram: Input → f → Output. Screenshot: University of Tokyo Genome Browser (UTGB)]
Human Genome Data Processing Workflows in Silk 
• c"(UNIX command)"
Human Genome Data Processing Workflows 
• Makefile: the result ($@) is stored in a file
• Silk: the result is stored in a variable
• Each command may take an hour or more to compute
SBT: A Good Hint 
• SBT
  • Supports incremental compilation and testing
  • sbt ~test-only
    • Monitors source code changes
    • Runs specific tests
  • sbt ~test-quick
    • Runs failed tests only
[Diagram: dataflow graph over nodes A to G with functions f and g]
• How do we compute the not-yet-started part of a Scala program?
• We need to know:
  • A-B and D-E are running
  • If B is finished, we can start B-C
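The bookkeeping described in the bullets above (which steps are running, which are finished, and which can start next) can be sketched as a small function. The names `ReadySketch`, `Step` and `ready` are hypothetical and chosen only for this illustration; Silk's actual scheduler is not shown in the slides.

```scala
// Hypothetical sketch: decide which steps of a dataflow may start now.
object ReadySketch {
  // One edge of the dataflow: `output` is computed from `input`.
  final case class Step(input: String, output: String)

  // A step is ready when its input is finished and its own output
  // is neither finished nor currently being computed.
  def ready(steps: Seq[Step], finished: Set[String], running: Set[String]): Seq[Step] =
    steps.filter(s => finished(s.input) && !finished(s.output) && !running(s.output))
}
```

With steps A-B, B-C and D-E, nothing new is ready while B and E are still running; once B moves into the finished set, `ready` returns the B-C step.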
Writing A Dataflow 
• Apply function f to the input A to produce the output B
• This step may take more than an hour in big data analysis
val B = A.map(f)
[Diagram: A → f → B (Program v1)]
Distribution and Recovery 
• Resume only B2 = A2.map(f)
[Diagram: Program v1 (A → f → B) with A split into A0, A1, A2 and B into B0, B1, B2; labels: Failure!, Retry]
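Partition-level recovery can be sketched as follows: each partition is mapped independently, and only the partitions whose attempt failed are run again. This is a local illustration under assumed names (`RecoverySketch`, `mapPartitions`, `retryFailed`), not the recovery mechanism Silk implements on a cluster.

```scala
import scala.util.{Failure, Success, Try}

// Hypothetical sketch of retrying only the failed partitions.
object RecoverySketch {
  // Apply f to every partition independently and keep one outcome per partition.
  def mapPartitions[A, B](parts: Vector[Vector[A]])(f: A => B): Vector[Try[Vector[B]]] =
    parts.map(p => Try(p.map(f)))

  // Keep successful results as they are; re-run f only where the last attempt failed.
  def retryFailed[A, B](parts: Vector[Vector[A]], results: Vector[Try[Vector[B]]])(
      f: A => B): Vector[Try[Vector[B]]] =
    parts.zip(results).map {
      case (_, ok @ Success(_)) => ok
      case (p, Failure(_))      => Try(p.map(f))
    }
}
```

If A2 fails on the first pass, a call to `retryFailed` recomputes only B2 = A2.map(f) and leaves B0 and B1 untouched.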
Extending Dataflows 
• While program v1 is running, we may want to add more code (program v2)
• We need to know that variable B is currently being processed
[Diagram: Program v1: A → f → B. Program v2 adds g and C]
Labeling Program with Variable Names 
• Storing intermediate results using variable names
  • variable names := program markers
• But variable names are lost after compilation
• Solution: extract variable names from the AST at compile time
  • Using Scala macros (since Scala 2.10)
val B = A.map(f)
val C = B.map(g)
[Diagram: Program v1: A → f → B. Program v2 adds g and C]
Scala Program (AST) to DAG Schedule (Logical Plan) 
• Translate a program into a set of Silk operation objects
  • val B = MapOp(input:A, output:"B", function:f)
  • val C = MapOp(input:B, output:"C", function:g)
• Operations in Silk form a DAG
  • val C = MapOp(input:MapOp(input:A, output:"B", function:f), output:"C", function:g)
[Diagram: Program v1: A → f → B. Program v2 adds g and C]
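The nested `MapOp` structure can be written as ordinary Scala case classes. This sketch is an assumption for illustration: `DagSketch`, `Op`, `Source` and `schedule` are hypothetical names, and the output name is passed by hand here, whereas the slides say Silk obtains it from the AST with a macro.

```scala
// Hypothetical sketch of operation objects that form a DAG (a chain here).
object DagSketch {
  sealed trait Op[A] { def output: String }
  final case class Source[A](output: String, data: Seq[A]) extends Op[A]
  final case class MapOp[A, B](input: Op[A], output: String, function: A => B) extends Op[B]

  // Walk from the final operation back to its source and
  // list the output names in the order they must be computed.
  def schedule(op: Op[_]): List[String] = op match {
    case Source(name, _)       => List(name)
    case MapOp(input, name, _) => schedule(input) :+ name
  }
}
```

Building `C` from `B` from `A` this way yields an object graph in which `C.input` is `B`, so a scheduler can recover the execution order A, B, C from the value `C` alone.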
Using Scala Macros 
• Produce operation objects with Scala macros
  • map(f: A => B) produces MapOp[A, B](…)
• Why do we need to use a macro here?
  • To extract FContext (target variable name, enclosing method, class, etc.) from the AST
Extract target variable name and enclosing method 
Finding Target Variable 
• Translate a program into a set of Silk operation objects
  • val B = MapOp(input:A, output:"B", function:f)
  • val C = MapOp(input:B, output:"C", function:g)
• Silk uses these variable names to store the intermediate data
[Diagram: Program v1: A → f → B. Program v2 adds g and C]
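Storing intermediate data under variable names can be sketched with a small in-memory store. `ResumeSketch`, `NamedStore` and `getOrCompute` are hypothetical names for this illustration only; the slides do not show how Silk actually persists intermediate results.

```scala
import scala.collection.mutable

// Hypothetical sketch: intermediate results keyed by the target variable name.
object ResumeSketch {
  final class NamedStore {
    private val results = mutable.Map.empty[String, Any]

    // Return the stored result for `name`, computing and storing it only on first use.
    // The cast assumes each name is always used with the same result type.
    def getOrCompute[T](name: String)(body: => T): T =
      results.getOrElseUpdate(name, body).asInstanceOf[T]

    def contains(name: String): Boolean = results.contains(name)
  }
}
```

When program v2 asks for `"B"` again, the stored result is reused and only the new step `"C"` is computed, which mirrors extending program v1 without redoing finished work.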
• Silk defines various types of operations
Object-Oriented Dataflow Programming 
• Reusing and overriding dataflows
Summary 
• Declarative-style coding is necessary for creating a DAG schedule
• DAG schedules are labeled with variable names using Scala macros
• Weaver: an abstraction of how to execute the code
  • The weaver manages the running and finished parts of the code
[Diagram: Silk[A] (operation DAG) → Weave (Execute) → Result, through a weaver such as the cluster weaver]
http://xerial.org/silk 
Copyright ©2014 Treasure Data. All Rights Reserved.
WE ARE HIRING! www.treasuredata.com
[Architecture diagram; labels as extracted]
• User program builds workflows as Silk[A]; static optimization; DAG Schedule; weaving Silk materializes objects
• DAG Schedule operations: read file, toSilk, map, reduce, join, groupBy, UNIX commands, etc.
• Components: Silk Master (dispatch), Silk Client (shown twice), Task Scheduler, Task Executor, Resource Monitor, Data Server, ZooKeeper in ensemble mode (at least 3 ZK instances)
• Tables: Resource Table (CPU, memory), Node Table, Slice Table, Task Status, ClassBox Table
• Local ClassBox: classpaths and local jar files
• Responsibilities shown:
  • Register ClassBox
  • Submit schedule
  • Leader election
  • Collects locations of slices and ClassBox jars
  • Watches active nodes
  • Watches available resources
  • Submits tasks
  • Run-time optimization
  • Resource allocation
  • Monitoring resource usage
  • Launches Web UI
  • Manages assigned task status
  • Object serialization/deserialization
  • Serves slice data
  • Dispatches tasks to clients
  • Manages master resource table
  • Authorizes resource allocation
  • Automatic recovery by leader election in ZK
• weave: SilkSingle[A] yields A (a single object); SilkSeq[A] yields Seq[A] (a sequence of objects); runs on a local machine or a cluster
Integrating Varieties of Data Sources 
• WormTSS: http://wormtss.utgenome.org/
• Integrating various data sources
Varieties of Data Analysis 
Using R, JFreeChart, etc.
Need an automated pipeline to redo the entire analysis in order to answer the paper review within a month.
Makefile 
• Describes dependencies between commands through files
• Good: we can resume and update the dataflow processing
• Bad: the Makefile of the WormTSS analysis exceeds 1,000 lines
Splitting Data Analysis Into Command Modules 
• We added a new command whenever we needed a new analysis or data processing step
• The result:
  • Hundreds of commands!
  • The number of files limits the parallelism

Weaving Dataflows with Silk - ScalaMatsuri 2014, Tokyo

  • 1. Weaving Dataflows with Silk Taro L. Saito Treasure Data, Inc. leo@xerial.org September 6th, 2014 ScalaMatsuri @ Tokyo 1xerial.org/silk
  • 2. About Me Weaving Dataflows with Silk xerial.org/silk2
  • 3. Treasure Data Console Weaving Dataflows with Silk xerial.org/silk3
  • 4. Processing Job Table Weaving Dataflows with Silk xerial.org/silk4
  • 5. Functional Style Writing Weaving Dataflows with Silk xerial.org/silk5
  • 6. Need an Optimization? Weaving Dataflows with Silk xerial.org/silk6
  • 7. Procedural Style Writing Weaving Dataflows with Silk l Describes How to Process Data. xerial.org/silk7
  • 8. Declarative Style Writing Weaving Dataflows with Silk l Less programming l System decides how to optimize the code l Hash joins, bloom filters and various optimization techniques are now available. xerial.org/silk8
  • 9. Weaving Silk Weaving Dataflows with Silk In-memory weaver Cluster weaver (Spark?) MapReduce weaver Result Your own weaver (using TD?) l Making data processing code independent from the execution method! xerial.org/silk9 Silk[A] (operation DAG) Weave (Execute)Silk Product
  • 10. Cluster Weaver: Logical Plan to Physical Plan on Cluster Weaving Dataflows with Silk l Physical plan on cluster xerial.org/silk10 I1 I2 I3 P1 P2 P3 P1 P2 P3 P1 P2 P3 S1 S2 S3 S1 S2 S3 S1 S2 S3 R1 S1 S1 S1 S2 S2 S2 S3 S3 S3 P1 P1 P1 P2 P2 P2 P3 P3 P3 R2 R3 Partition (hashing) serializationshuffledeserializationmerge sort Silk[people] Scatter
  • 11. DAG-based Data Processing Engines Weaving Dataflows with Silk l Spark l Creates a task schedule for distributed processing l Summingbird l Integrates stream and batch data processing l e.g. Running Scalding and Storm at the same time l Apache Tez l Creates a dag schedule for optimizing MapReduce pipelines l GNU Makefile l Describes a pipeline of UNIX commands Why do we need another framework? xerial.org/silk11
  • 12. Challenge: Isolate Code Writing and Its Execution Weaving Dataflows with Silk weaver Result Result Result l Why canʼ’t we run the program until finish writing? l How can we departure from compile-‐‑‒then-‐‑‒run paradigm? xerial.org/silk12 Silk[A] (operation DAG) Weave (Execute)Silk Product
  • 13. Weaving Dataflows with Silk l W xerial.org/silk13
  • 14. Genome Science is A Big Data Science Weaving Dataflows with Silk l By sequencing, we can find 3 millions of SNPs for each person l To find the cause of disease (one or a few SNPs), we need to sequence as many samples as possible for narrowing down the candidate SNPs l Input: FASTQ file(s) 500GB (50x coverage, 200 million entries) l DNA Sequencer (Illumina, PacBio, etc.) l f: An alignment program l Output: Alignment results 750GB (sequence + alignment data) l Total storage space required: 1.2TB Output f Input University of Tokyo Genome Browser (UTGB) xerial.org/silk14
  • 15. Human Genome Data Processing Workflows in Silk Weaving Dataflows with Silk l c”(UNIX Command)” xerial.org/silk15
  • 16. Human Genome Data Processing Workflows. Makefile: the result ($@) is stored in a file. Silk: the result is stored in a variable. The computation of each command may take one or more hours.
  • 17. SBT: A Good Hint. SBT supports incremental compilation and testing. sbt ~test-only: monitors source code changes and runs specific tests. sbt ~test-quick: runs failed tests only. How do we compute the not-yet-started part of a Scala program? We need to know: A-B and D-E are running; if B is finished, we can start B-C. [Figure: dataflow graph with nodes A to G and functions f and g.]
  • 18. Writing A Dataflow. Apply function f to the input A to produce the output B: val B = A.map(f). This step may take more than an hour in big data analysis. Diagram labels: A, f, B (Program v1).
  • 19. Distribution and Recovery. Resume only B2 = A2.map(f). Diagram labels: A, f, B (Program v1); A0, A1, A2; B0, B1, B2; Failure!; Retry.
  • 20. Extending Dataflows. While program v1 is running, we may want to add more code (program v2). We need to know that variable B is now being processed. Diagram labels: A, f, B (Program v1); g, C (Program v2).
  • 21. Labeling Program with Variable Names. Store intermediate results using variable names (variable names := program markers). But we lose the variable names after compilation. Solution: extract variable names from the AST at compile time, using Scala macros (available since Scala 2.10). val B = A.map(f); val C = B.map(g). Diagram labels: A, f, B (Program v1); g, C (Program v2).
  • 22. Scala Program (AST) to DAG Schedule (Logical Plan). Translate a program into a set of Silk operation objects: val B = MapOp(input:A, output:"B", function:f); val C = MapOp(input:B, output:"C", function:g). Operations in Silk form a DAG: val C = MapOp(input:MapOp(input:A, output:"B", function:f), output:"C", function:g). Diagram labels: A, f, B (Program v1); g, C (Program v2).
  • 23. Using Scala Macros. Produce operation objects with Scala macros: map(f: A => B) produces MapOp[A, B](…). Why do we need a macro here? To extract FContext (target variable name, enclosing method, class, etc.) from the AST.
  • 24. Weaving Dataflows with Silk. s
  • 25. Extract target variable name and enclosing method
  • 26. Finding Target Variable
  • 27. Translate a program into a set of Silk operation objects: val B = MapOp(input:A, output:"B", function:f); val C = MapOp(input:B, output:"C", function:g). Silk uses these variable names to store the intermediate data. Diagram labels: A, f, B (Program v1); g, C (Program v2).
  • 28. Silk defines various types of operations.
  • 29. Object-Oriented Dataflow Programming. Reusing and overriding dataflows.
  • 30. Summary. Declarative-style coding is necessary for creating a DAG schedule. DAG schedules are labeled with variable names using Scala macros. Weaver: an abstraction of how to execute the code. The weaver manages the running and finished parts of the code. Diagram labels: Silk[A] (operation DAG), Silk Product, Weave (Execute), weaver, cluster weaver, Result.
  • 31. http://xerial.org/silk
  • 32. Copyright ©2014 Treasure Data. All Rights Reserved. WE ARE HIRING! www.treasuredata.com
  • 33. Weaving Silk materializes objects. [Architecture diagram.] Types: Silk[A] is either SilkSingle[A], woven into a single object A, or SilkSeq[A], woven into Seq[A], a sequence of objects. User program (local machine): builds workflows (read file, toSilk; map, reduce, join, groupBy; UNIX commands; etc.); static optimization; DAG schedule; registers ClassBox; submits schedule; Local ClassBox (classpaths, local jar files). Cluster: Silk Master, Silk Clients, and ZooKeeper. Silk Master: dispatches tasks to clients; manages the master resource table (CPU, memory); authorizes resource allocation; automatic recovery by leader election in ZK. ZooKeeper (ensemble mode, at least 3 ZK instances): Node Table, Slice Table, Task Status, ClassBox Table; leader election; collects locations of slices and ClassBox jars; watches active nodes; watches available resources. Silk Client (Task Scheduler, Task Executor, Resource Monitor, Data Server): submits tasks; run-time optimization; resource allocation; monitoring resource usage; launches Web UI; manages assigned task status; object serialization/deserialization; serves slice data.
  • 34. Integrating Varieties of Data Sources. WormTSS: http://wormtss.utgenome.org/. Integrating various data sources.
  • 35. Varieties of Data Analysis. Using R, JFreeChart, etc. We need an automated pipeline to redo the entire analysis when answering a paper review within a month.
  • 36. Makefile. Describes dependencies of commands through files. Good: we can resume and update the data flow processing. Bad: the Makefile of the WormTSS analysis exceeds 1,000 lines.
  • 37. Splitting Data Analysis Into Command Modules. We added a new command whenever we needed a new analysis or data processing step. The result: hundreds of commands! The number of files limits the parallelism.
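
The operation objects on slides 22 and 27 and the weaver idea on slides 9 and 30 can be sketched in plain Scala. This is a minimal illustration under stated assumptions, not Silk's implementation: `Op`, `Source`, `InMemoryWeaver`, and the `run` method are hypothetical names, `MapOp` only mirrors the slide's notation, and variable names are passed by hand as strings, whereas the slides describe extracting them from the AST with Scala macros.

```scala
import scala.collection.mutable

// All names below are hypothetical stand-ins, not Silk's actual classes.
sealed trait Op[A] {
  def name: String                          // variable name used as a program marker
  def run(weaver: InMemoryWeaver): Seq[A]   // how this node computes its output
}

// A named input data set.
final case class Source[A](name: String, data: Seq[A]) extends Op[A] {
  def run(weaver: InMemoryWeaver): Seq[A] = data
}

// A map operation labeled with the variable name that holds its result.
final case class MapOp[A, B](input: Op[A], name: String, f: A => B) extends Op[B] {
  def run(weaver: InMemoryWeaver): Seq[B] = weaver.weave(input).map(f)
}

// A toy in-memory weaver: it evaluates the operation DAG and stores each
// intermediate result under its variable name, so finished parts of the
// program are not recomputed when the program is extended.
final class InMemoryWeaver {
  private val results = mutable.Map.empty[String, Seq[Any]]
  val computed = mutable.ArrayBuffer.empty[String]  // names evaluated, in order

  def weave[A](op: Op[A]): Seq[A] =
    results.get(op.name) match {
      case Some(cached) => cached.asInstanceOf[Seq[A]]  // already finished
      case None =>
        val out = op.run(this)
        results(op.name) = out
        computed += op.name
        out
    }
}

object WeaveDemo {
  def main(args: Array[String]): Unit = {
    val weaver = new InMemoryWeaver
    // Program v1: val B = A.map(f)
    val A = Source("A", Seq(1, 2, 3))
    val B = MapOp(A, "B", (x: Int) => x + 1)
    println(weaver.weave(B))                  // List(2, 3, 4)
    // Program v2: val C = B.map(g), added after B was woven
    val C = MapOp(B, "C", (x: Int) => x * 2)
    println(weaver.weave(C))                  // List(4, 6, 8)
    println(weaver.computed.mkString(", "))   // A, B, C
  }
}
```

Weaving C after B reuses the result stored under the name "B" instead of recomputing it, which is the behavior slide 20 asks for when program v2 extends a running program v1. A cluster weaver would replace the in-memory map with distributed storage, but the operation DAG itself would stay unchanged.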