Cluster Computing with DryadLINQ
Mihai Budiu
Microsoft Research, Silicon Valley

Cloud computing: Infrastructure, Services, and Applications
UC Berkeley, March 4, 2009

Goal

Design Space
[Figure: design space on two axes. Horizontal axis: Latency to Throughput. Vertical axis: Private data center to Internet. Labeled regions: Data-parallel, Shared memory.]

Data-Parallel Computation

Application
Language     SQL                  Sawzall         ≈SQL (Pig, Hive)   LINQ, SQL (DryadLINQ, Scope)
Execution    Parallel Databases   Map-Reduce      Hadoop             Dryad
Storage                           GFS, BigTable   HDFS, S3           Cosmos, Azure, SQL Server

Software Stack
[Figure: layered software stack, listed top to bottom]
• Applications: log parsing, machine learning, graphs, data mining
• SQL, C#, legacy code, PSQL, Scope, .Net Distributed Data Structures, SSIS, SQL Server
• Distributed Shell, DryadLINQ, C++, queueing
• Dryad
• Distributed FS (Cosmos), Azure XStore, SQL Server, NTFS
• Cluster Services, Azure XCompute, Windows HPC
• Windows Server

•   Introduction
•   Dryad
•   DryadLINQ
•   Conclusions




Dryad
•   Continuously deployed since 2006
•   Running on >> 10^4 machines
•   Sifting through > 10 PB of data daily
•   Runs on clusters of > 3000 machines
•   Handles jobs with > 10^5 processes each
•   Platform for a rich software ecosystem
•   Used by >> 100 developers
•   Written at Microsoft Research, Silicon Valley

Dryad = Execution Layer

Job (application)   ≈   Pipeline
Dryad               ≈   Shell
Cluster             ≈   Machine

2-D Piping
• Unix Pipes: 1-D
     grep | sed | sort | awk | perl

• Dryad: 2-D
     grep^1000 | sed^500 | sort^1000 | awk^500 | perl^50

Virtualized 2-D Pipelines
     • 2D DAG
     • multi-machine
     • virtualized

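Dryad Job Structure
[Figure: the job as a dataflow graph. Input files feed grep vertices, followed by sed, sort, awk, and perl vertices that write the output files. Labels: vertices (processes), channels (the edges between vertices), stage (the group of sort vertices).]
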
Channels
Finite streams of items
  • distributed filesystem files (persistent)
  • SMB/NTFS files (temporary)
  • TCP pipes (inter-machine)
  • memory FIFOs (intra-machine)
[Figure: items flowing over a channel from vertex X to vertex M.]

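Dryad System Architecture
[Figure: the job manager holds the job schedule and uses the control plane to reach the cluster's NS and PD nodes. The vertices (V) run on the PD nodes and exchange data over the data plane: files, TCP, FIFO, network.]
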
Fault Tolerance

Policy Managers
[Figure: stage R and stage X each contain four vertices, joined by the connection R-X. Inside the job manager sit an R manager, an X manager, and an R-X manager.]

Dynamic Graph Rewriting
[Figure: X[0], X[1], and X[3] are completed vertices; X[2] is a slow vertex, and X'[2] is its duplicate vertex.]

Duplication Policy = f(running times, data volumes)

Cluster network topology
[Figure: machines grouped into racks; each rack has a top-of-rack switch, and the top-of-rack switches connect to a top-level switch.]

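Dynamic Aggregation
[Figure: static plan: six S vertices feed a single T vertex. Dynamic plan: the S vertices are tagged with their rack number (#1, #2, #1, #3, #3, #2), and one aggregation vertex per rack (#1A, #2A, #3A) is inserted between the S vertices and T.]
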
Policy vs. Mechanism

Policy
• Application-level
• Most complex in C++ code
• Invoked with upcalls
• Need good default implementations
• DryadLINQ provides a comprehensive set

Mechanism
• Built-in
  – Scheduling
  – Graph rewriting
  – Fault tolerance
  – Statistics and reporting

LINQ => DryadLINQ
[Figure label: Dryad]

LINQ = .Net + Queries

Collection<T> collection;
bool IsLegal(Key k);
string Hash(Key k);

var results = from c in collection
              where IsLegal(c.key)
              select new { hash = Hash(c.key), c.value };

Collections and Iterators
class Collection<T> : IEnumerable<T>;

public interface IEnumerable<T> {
    IEnumerator<T> GetEnumerator();
}

public interface IEnumerator<T> {
    T Current { get; }
    bool MoveNext();
    void Reset();
}

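A minimal sketch of how a consumer drives these two interfaces, using the standard `IEnumerable<T>` and `IEnumerator<T>` from `System.Collections.Generic`. The class and method names (`IteratorSketch`, `SumManually`, `SumWithForeach`) are illustrative and do not come from the talk.

```csharp
using System;
using System.Collections.Generic;

public static class IteratorSketch
{
    // Sums a sequence by driving the enumerator by hand; this is roughly
    // what a foreach loop over an IEnumerable<int> expands to.
    public static int SumManually(IEnumerable<int> items)
    {
        int total = 0;
        using (IEnumerator<int> e = items.GetEnumerator())
        {
            while (e.MoveNext())      // advance; false once the sequence is exhausted
            {
                total += e.Current;   // read the element at the current position
            }
        }
        return total;
    }

    // The same computation written with foreach.
    public static int SumWithForeach(IEnumerable<int> items)
    {
        int total = 0;
        foreach (int item in items)
        {
            total += item;
        }
        return total;
    }

    public static void Main()
    {
        var data = new List<int> { 1, 2, 3, 4 };
        Console.WriteLine(SumManually(data));    // 10
        Console.WriteLine(SumWithForeach(data)); // 10
    }
}
```

Because every LINQ operator consumes and produces `IEnumerable<T>`, a partitioned collection only has to expose this one interface to take part in a query.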
DryadLINQ Data Model
[Figure: a collection is divided into partitions; each partition holds .Net objects.]

DryadLINQ = LINQ + Dryad

Collection<T> collection;
bool IsLegal(Key k);
string Hash(Key k);

var results = from c in collection
              where IsLegal(c.key)
              select new { hash = Hash(c.key), c.value };

[Figure: the query becomes a query plan (a Dryad job). The C# fragments become vertex code; data flows from collection through the C# vertices to results.]

Demo




Example: Histogram
public static IQueryable<Pair> Histogram(
   IQueryable<LineRecord> input, int k)
{
  var words = input.SelectMany(x => x.line.Split(' '));
  var groups = words.GroupBy(x => x);
  var counts = groups.Select(x => new Pair(x.Key, x.Count()));
  var ordered = counts.OrderByDescending(x => x.count);
  var top = ordered.Take(k);
  return top;
}
Intermediate values for one input line (k = 3):
  input     “A line of words of wisdom”
  words     [“A”, “line”, “of”, “words”, “of”, “wisdom”]
  groups    [[“A”], [“line”], [“of”, “of”], [“words”], [“wisdom”]]
  counts    [{“A”, 1}, {“line”, 1}, {“of”, 2}, {“words”, 1}, {“wisdom”, 1}]
  ordered   [{“of”, 2}, {“A”, 1}, {“line”, 1}, {“words”, 1}, {“wisdom”, 1}]
  top       [{“of”, 2}, {“A”, 1}, {“line”, 1}]
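The Histogram query can be tried on a single machine with LINQ-to-objects. The sketch below is an assumption-laden variant, not the talk's code: it swaps `IQueryable` for `IEnumerable`, and it defines minimal `LineRecord` and `Pair` types because the slide does not show them.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class HistogramSketch
{
    // Hypothetical stand-ins for the slide's LineRecord and Pair types.
    public record LineRecord(string line);
    public record Pair(string word, int count);

    // Same operator chain as the slide, run in memory.
    public static IEnumerable<Pair> Histogram(IEnumerable<LineRecord> input, int k)
    {
        var words = input.SelectMany(x => x.line.Split(' '));         // lines -> words
        var groups = words.GroupBy(x => x);                           // one group per distinct word
        var counts = groups.Select(x => new Pair(x.Key, x.Count()));  // word -> (word, count)
        var ordered = counts.OrderByDescending(x => x.count);         // most frequent first
        return ordered.Take(k);                                       // keep the top k
    }

    public static void Main()
    {
        var input = new[] { new LineRecord("A line of words of wisdom") };
        foreach (var p in Histogram(input, 3))
        {
            Console.WriteLine($"{p.word}: {p.count}");
        }
    }
}
```

With k = 3 this prints `of: 2`, `A: 1`, `line: 1`. LINQ-to-objects yields groups in order of first appearance and sorts stably, which is why the ties come out in input order; a distributed run need not preserve that tie order.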
Histogram Plan
[Figure: query plan stages, listed top to bottom]
  SelectMany
  Sort
  GroupBy+Select
  HashDistribute
  MergeSort
  GroupBy
  Select
  Sort
  Take
  MergeSort
  Take

Map-Reduce in DryadLINQ

public static IQueryable<S> MapReduce<T,M,K,S>(
  this IQueryable<T> input,
  Expression<Func<T, IEnumerable<M>>> mapper,
  Expression<Func<M,K>> keySelector,
  Expression<Func<IGrouping<K,M>,S>> reducer)
{
  var map = input.SelectMany(mapper);
  var group = map.GroupBy(keySelector);
  var result = group.Select(reducer);
  return result;
}
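A sketch of how such an operator is called, written against LINQ-to-objects so it runs without a cluster. It assumes an in-memory analogue that takes plain `Func` delegates instead of `Expression<Func<...>>`; the `MapReduceSketch` class and the `WordCount` helper are illustrative names, not part of DryadLINQ.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class MapReduceSketch
{
    // In-memory analogue of the slide's operator: the same three steps, but
    // over IEnumerable<T> with delegates instead of expression trees.
    public static IEnumerable<S> MapReduce<T, M, K, S>(
        this IEnumerable<T> input,
        Func<T, IEnumerable<M>> mapper,
        Func<M, K> keySelector,
        Func<IGrouping<K, M>, S> reducer)
    {
        var map = input.SelectMany(mapper);
        var group = map.GroupBy(keySelector);
        return group.Select(reducer);
    }

    // Word count: map lines to words, key by the word, reduce by counting.
    public static Dictionary<string, int> WordCount(IEnumerable<string> lines)
    {
        return lines
            .MapReduce(
                line => line.Split(' '),
                word => word,
                g => new KeyValuePair<string, int>(g.Key, g.Count()))
            .ToDictionary(kv => kv.Key, kv => kv.Value);
    }

    public static void Main()
    {
        var counts = WordCount(new[] { "a b a", "b c" });
        foreach (var kv in counts.OrderBy(kv => kv.Key))
        {
            Console.WriteLine($"{kv.Key}: {kv.Value}");
        }
    }
}
```

The reducer receives a whole `IGrouping<K, M>`, so it can compute any per-key aggregate, not only a count.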



Map-Reduce Plan
[Figure: the abstract plan M, G, R, X is expanded, first statically and then dynamically, into the stages below.]
  map:                  M (map), Q (sort), G1 (groupby), R (reduce), D (distribute)
  partial aggregation:  MS (mergesort), G2 (groupby), R (reduce)
  reduce:               MS (mergesort), G2 (groupby), R (reduce), X (consumer)

Distributed Sorting Plan
[Figure: the single O vertex is expanded, first statically and then dynamically, into stages of DS, H, D, M, and S vertices.]

Expectation Maximization
• 160 lines
• 3 iterations shown

Probabilistic Index Maps
[Figure labels: images, features.]

Language Summary


Where
Select
GroupBy
OrderBy
Aggregate
Join
Apply
Materialize
LINQ System Architecture
[Figure: on the local machine, a .Net program (C#, VB, F#, etc.) passes a query to a LINQ provider and gets objects back. The provider targets one of these execution engines:]
• LINQ-to-obj
• PLINQ
• LINQ-to-SQL
• LINQ-to-WS
• DryadLINQ
• Flickr
• Oracle
• LINQ-to-XML
• Your own

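The DryadLINQ Provider
[Figure: on the client machine, a .Net program builds a query expression (ToCollection). DryadLINQ turns it into a distributed query plan, vertex code, and context, and invokes the query in the data center. There the Dryad JM drives the Dryad execution, which reads the input tables and writes the output tables. The results come back to the client as an output DryadTable, which a foreach loop reads as .Net objects.]
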
Combining Query Providers
[Figure: on the local machine, a .Net program (C#, VB, F#, etc.) exchanges queries and objects with several LINQ providers, one per execution engine: PLINQ, SQL Server, DryadLINQ, LINQ-to-obj.]

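Using PLINQ
[Figure: Query -> DryadLINQ -> local query -> PLINQ.]

Using LINQ to SQL Server
[Figure: Query -> DryadLINQ -> several queries, some of which go to LINQ to SQL, which issues its own queries.]

Using LINQ-to-objects
[Figure: the same query runs through LINQ to obj on the local machine for debugging, and through DryadLINQ on the cluster for production.]
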
Lessons Learned (1)
• What worked well?
  – Complete separation of
    storage / execution / language
  – Using LINQ +.Net (language integration)
  – Strong typing for data
  – Allowing flexible and powerful policies
  – Centralized job manager: no replication, no
    consensus, no checkpointing
  – Porting (HPC, Cosmos, Azure, SQL Server)
  – Technology transfer (done at the right time)
Lessons Learned (2)
• What worked less well
  – Error handling and propagation
  – Distributed (randomized) resource allocation
  – TCP pipe channels
  – Hierarchical dataflow graphs
    (each vertex = small graph)
  – Forking the source tree



Lessons Learned (3)
• Tricks of the trade
  – Asynchronous operations hide latency
  – Management through distributed state machines
  – Logging state transitions for debugging
  – Complete separation of data and control
  – Leases clean up after themselves
  – Understand scaling factors
     O(machines) < O(vertices) < O(edges)
  – Don’t fix a broken API, re-design it
  – Compression trades off bandwidth for CPU
  – Managed code increases productivity by 10x10

Ongoing Dryad/DryadLINQ Research
•   Performance modeling
•   Scheduling and resource allocation
•   Profiling and performance debugging
•   Incremental computation
•   Hardware acceleration
•   High-level programming abstractions
•   Many domain-specific applications

Sample applications written using DryadLINQ           Class
Distributed linear algebra                            Numerical
Accelerated Page-Rank computation                     Web graph
Privacy-preserving query language                     Data mining
Expectation maximization for a mixture of Gaussians   Clustering
K-means                                               Clustering
Linear regression                                     Statistics
Probabilistic Index Maps                              Image processing
Principal component analysis                          Data mining
Probabilistic Latent Semantic Indexing                Data mining
Performance analysis and visualization                Debugging
Road network shortest-path preprocessing              Graph
Botnet detection                                      Data mining
Epitome computation                                   Image processing
Neural network training                               Statistics
Parallel machine learning framework Infer.NET         Machine learning
Distributed query caching                             Optimization
Image indexing                                        Image processing
Web indexing structure                                Web graph

Conclusions




“What’s the point if I can’t have it?”

• Glad you asked
• We’re offering Dryad+DryadLINQ to
  academic partners
• Dryad is in binary form, DryadLINQ in source
• Requires signing a 3-page licensing agreement



Backup Slides




DryadLINQ
• Declarative programming
• Integration with Visual Studio
• Integration with .Net
• Type safety
• Automatic serialization
• Job graph optimizations
  – static
  – dynamic
• Conciseness

What does DryadLINQ do?

User code:
 public struct Data { …
   public static int Compare(Data left, Data right);
 }

 Data g = new Data();
 var result = table.Where(s => Data.Compare(s, g) < 0);

Generated code:
 // Data serialization
 public static void Read(this DryadBinaryReader reader, out Data obj);
 public static int Write(this DryadBinaryWriter writer, Data obj);

 // Data factory
 public class DryadFactoryType__0 : LinqToDryad.DryadFactory<Data>

 // Channel writer and channel reader
 DryadVertexEnv denv = new DryadVertexEnv(args);
 var dwriter__2 = denv.MakeWriter(FactoryType__0);
 var dreader__3 = denv.MakeReader(FactoryType__0);

 // LINQ code, with context serialization
 var source__4 = DryadLinqVertex.Where(dreader__3,
     s => (Data.Compare(s, ((Data)DryadLinqObjectStore.Get(0))) <
          ((System.Int32)(0))), false);
 dwriter__2.WriteItemSequence(source__4);

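Range-Distribution Manager
[Figure: static plan: three S vertices feed a T vertex. Dynamic plan: the S vertices cover the range [0-100); a Hist vertex produces the split [0-30), [30-100); D vertices distribute the data to two T vertices, whose placeholder ranges [0-?) and [?-100) become [0-30) and [30-100).]
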
Staging
1. Build
2. Send .exe
3. Start JM
4. Query cluster resources
5. Generate graph
6. Initialize vertices
7. Serialize vertices
8. Monitor vertex execution
[Figure labels: JM code, vertex code, cluster services.]

Data Partitioning
[Figure labels: DATA, RAM.]

Linear Algebra & Machine Learning in DryadLINQ
[Figure: software layers, top to bottom: data analysis and machine learning; Large Vector; DryadLINQ; Dryad.]

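Operations on Large Vectors

Map 1
  $f : T \to U$
  [Figure: f is applied to a vector of T and yields a vector of U; f preserves partitioning.]

Map 2 (Pairwise)
  $f : T, U \to V$
  [Figure: f combines a vector of T with a vector of U and yields a vector of V.]

Map 3 (Vector-Scalar)
  $f : T, U \to V$
  [Figure: f combines a vector of T with a single U and yields a vector of V.]

Reduce (Fold)
  $f : U, U \to U$
  [Figure: f folds each part of a vector of U into one U, then folds those partial results into a single U.]
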
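The four operations above can be sketched on one machine. This is a hypothetical in-memory model, not the talk's Large Vector library: `PartitionedVector<T>`, `MapPairwise`, `MapScalar`, and `VectorDemo` are names made up for illustration, and a partition is simply an array.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical single-machine model of a large vector: a list of partitions,
// each held as an array. None of these names come from the talk.
public sealed class PartitionedVector<T>
{
    public IReadOnlyList<T[]> Partitions { get; }

    public PartitionedVector(IEnumerable<T[]> partitions)
    {
        Partitions = partitions.ToList();
    }

    // Map 1: f : T -> U, applied element by element; partitioning is preserved.
    public PartitionedVector<U> Map<U>(Func<T, U> f)
    {
        return new PartitionedVector<U>(Partitions.Select(p => p.Select(f).ToArray()));
    }

    // Map 2 (pairwise): f : T, U -> V over two identically partitioned vectors.
    public PartitionedVector<V> MapPairwise<U, V>(PartitionedVector<U> other, Func<T, U, V> f)
    {
        return new PartitionedVector<V>(
            Partitions.Zip(other.Partitions, (p, q) => p.Zip(q, f).ToArray()));
    }

    // Map 3 (vector-scalar): f : T, U -> V with one U shared by all elements.
    public PartitionedVector<V> MapScalar<U, V>(U scalar, Func<T, U, V> f)
    {
        return new PartitionedVector<V>(
            Partitions.Select(p => p.Select(x => f(x, scalar)).ToArray()));
    }

    // Reduce (fold): f : T, T -> T. Each partition is folded locally, then the
    // partial results are folded again. Assumes f is associative and the
    // vector has at least one element.
    public T Reduce(Func<T, T, T> f)
    {
        return Partitions.Where(p => p.Length > 0)
                         .Select(p => p.Aggregate(f))
                         .Aggregate(f);
    }
}

public static class VectorDemo
{
    // Dot product as a pairwise map followed by a fold.
    public static int DotProduct(PartitionedVector<int> x, PartitionedVector<int> y)
    {
        return x.MapPairwise(y, (a, b) => a * b).Reduce((a, b) => a + b);
    }

    public static void Main()
    {
        var x = new PartitionedVector<int>(new[] { new[] { 1, 2 }, new[] { 3 } });
        var y = new PartitionedVector<int>(new[] { new[] { 10, 20 }, new[] { 30 } });
        Console.WriteLine(DotProduct(x, y)); // 1*10 + 2*20 + 3*30 = 140
    }
}
```

The two-level fold in `Reduce` mirrors the figure: partial results are computed per partition, so only one value per partition has to move before the final combination.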
Linear Algebra

$T, U, V = \mathbb{R},\ \mathbb{R}^{m},\ \mathbb{R}^{m \times n}$

Linear Regression
• Data
     $x_t \in \mathbb{R}^{n},\ y_t \in \mathbb{R}^{m},\ t \in \{1, \dots, n\}$
• Find
     $A \in \mathbb{R}^{m \times n}$
• S.t.
     $A x_t \approx y_t$

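Analytic Solution

$A = \left(\sum_t y_t x_t^{T}\right)\left(\sum_t x_t x_t^{T}\right)^{-1}$

[Figure: dataflow for the solution. Map: the partitions X[0], X[1], X[2] yield the products X×X^T, and the partitions Y[0], Y[1], Y[2] paired with X yield Y×X^T. Reduce: each set of products is summed (Σ). The X×X^T sum is inverted ([ ]^-1) and multiplied (*) with the Y×X^T sum to give A.]
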
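The slide gives the closed form without a derivation. One standard route, assuming the fit is meant in the least-squares sense (the slide does not state the objective) and that $\sum_t x_t x_t^{T}$ is invertible, is to set the gradient of the squared error to zero:

```latex
\begin{aligned}
J(A) &= \sum_t \lVert A x_t - y_t \rVert^2 \\
\frac{\partial J}{\partial A} &= 2 \sum_t (A x_t - y_t)\, x_t^{T} = 0 \\
A \sum_t x_t x_t^{T} &= \sum_t y_t x_t^{T} \\
A &= \Big(\sum_t y_t x_t^{T}\Big)\Big(\sum_t x_t x_t^{T}\Big)^{-1}
\end{aligned}
```

Both sums are over independent per-sample outer products, which is what makes the computation a map followed by a reduce.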
Linear Regression Code

$A = \left(\sum_t y_t x_t^{T}\right)\left(\sum_t x_t x_t^{T}\right)^{-1}$

Vectors x = input(0), y = input(1);
Matrices xx = x.Map(x, (a,b) => a.OuterProd(b));
OneMatrix xxs = xx.Sum();
Matrices yx = y.Map(x, (a,b) => a.OuterProd(b));
OneMatrix yxs = yx.Sum();
OneMatrix xxinv = xxs.Map(a => a.Inverse());
OneMatrix A = yxs.Map(xxinv, (a, b) => a.Mult(b));

More Related Content

What's hot

Kerry osborne hadoop meets exadata
Kerry osborne hadoop meets exadataKerry osborne hadoop meets exadata
Kerry osborne hadoop meets exadata
Enkitec
 
How to Design Scalable HPC, Deep Learning, and Cloud Middleware for Exascale ...
How to Design Scalable HPC, Deep Learning, and Cloud Middleware for Exascale ...How to Design Scalable HPC, Deep Learning, and Cloud Middleware for Exascale ...
How to Design Scalable HPC, Deep Learning, and Cloud Middleware for Exascale ...
inside-BigData.com
 
A PCIe Congestion-Aware Performance Model for Densely Populated Accelerator S...
A PCIe Congestion-Aware Performance Model for Densely Populated Accelerator S...A PCIe Congestion-Aware Performance Model for Densely Populated Accelerator S...
A PCIe Congestion-Aware Performance Model for Densely Populated Accelerator S...
inside-BigData.com
 
Mirantis Folsom Meetup Intro
Mirantis Folsom Meetup IntroMirantis Folsom Meetup Intro
Mirantis Folsom Meetup Intro
Mirantis
 
Ceph Day Beijing: Big Data Analytics on Ceph Object Store
Ceph Day Beijing: Big Data Analytics on Ceph Object Store Ceph Day Beijing: Big Data Analytics on Ceph Object Store
Ceph Day Beijing: Big Data Analytics on Ceph Object Store
Ceph Community
 
Hw09 Production Deep Dive With High Availability
Hw09   Production Deep Dive With High AvailabilityHw09   Production Deep Dive With High Availability
Hw09 Production Deep Dive With High Availability
Cloudera, Inc.
 
Jma hr gsm_data_gr_ads_20130529
Jma hr gsm_data_gr_ads_20130529Jma hr gsm_data_gr_ads_20130529
Jma hr gsm_data_gr_ads_20130529
JMA_447
 
AIST Super Green Cloud: lessons learned from the operation and the performanc...
AIST Super Green Cloud: lessons learned from the operation and the performanc...AIST Super Green Cloud: lessons learned from the operation and the performanc...
AIST Super Green Cloud: lessons learned from the operation and the performanc...
Ryousei Takano
 
70a monitoring & troubleshooting
70a monitoring & troubleshooting70a monitoring & troubleshooting
70a monitoring & troubleshooting
mapr-academy
 

What's hot (20)

Kerry osborne hadoop meets exadata
Kerry osborne hadoop meets exadataKerry osborne hadoop meets exadata
Kerry osborne hadoop meets exadata
 
No sql & dq2 tracer service
No sql & dq2 tracer serviceNo sql & dq2 tracer service
No sql & dq2 tracer service
 
DRP (Stretch Cluster) for HDP - Future of Data : Paris
DRP (Stretch Cluster) for HDP - Future of Data : Paris DRP (Stretch Cluster) for HDP - Future of Data : Paris
DRP (Stretch Cluster) for HDP - Future of Data : Paris
 
How to Design Scalable HPC, Deep Learning, and Cloud Middleware for Exascale ...
How to Design Scalable HPC, Deep Learning, and Cloud Middleware for Exascale ...How to Design Scalable HPC, Deep Learning, and Cloud Middleware for Exascale ...
How to Design Scalable HPC, Deep Learning, and Cloud Middleware for Exascale ...
 
A PCIe Congestion-Aware Performance Model for Densely Populated Accelerator S...
A PCIe Congestion-Aware Performance Model for Densely Populated Accelerator S...A PCIe Congestion-Aware Performance Model for Densely Populated Accelerator S...
A PCIe Congestion-Aware Performance Model for Densely Populated Accelerator S...
 
Mirantis Folsom Meetup Intro
Mirantis Folsom Meetup IntroMirantis Folsom Meetup Intro
Mirantis Folsom Meetup Intro
 
gcov和clang中的实现
gcov和clang中的实现gcov和clang中的实现
gcov和clang中的实现
 
The Convergence of HPC and Deep Learning
The Convergence of HPC and Deep LearningThe Convergence of HPC and Deep Learning
The Convergence of HPC and Deep Learning
 
Ceph Day Beijing: Big Data Analytics on Ceph Object Store
Ceph Day Beijing: Big Data Analytics on Ceph Object Store Ceph Day Beijing: Big Data Analytics on Ceph Object Store
Ceph Day Beijing: Big Data Analytics on Ceph Object Store
 
Hw09 Production Deep Dive With High Availability
Hw09   Production Deep Dive With High AvailabilityHw09   Production Deep Dive With High Availability
Hw09 Production Deep Dive With High Availability
 
The Sierra Supercomputer: Science and Technology on a Mission
The Sierra Supercomputer: Science and Technology on a MissionThe Sierra Supercomputer: Science and Technology on a Mission
The Sierra Supercomputer: Science and Technology on a Mission
 
Jma hr gsm_data_gr_ads_20130529
Jma hr gsm_data_gr_ads_20130529Jma hr gsm_data_gr_ads_20130529
Jma hr gsm_data_gr_ads_20130529
 
Presentazione laurea 1.2 matteo concas
Presentazione laurea 1.2   matteo concasPresentazione laurea 1.2   matteo concas
Presentazione laurea 1.2 matteo concas
 
産総研におけるプライベートクラウドへの取り組み
産総研におけるプライベートクラウドへの取り組み産総研におけるプライベートクラウドへの取り組み
産総研におけるプライベートクラウドへの取り組み
 
AIST Super Green Cloud: lessons learned from the operation and the performanc...
AIST Super Green Cloud: lessons learned from the operation and the performanc...AIST Super Green Cloud: lessons learned from the operation and the performanc...
AIST Super Green Cloud: lessons learned from the operation and the performanc...
 
dCUDA: Distributed GPU Computing with Hardware Overlap
 dCUDA: Distributed GPU Computing with Hardware Overlap dCUDA: Distributed GPU Computing with Hardware Overlap
dCUDA: Distributed GPU Computing with Hardware Overlap
 
Sierra Supercomputer: Science Unleashed
Sierra Supercomputer: Science UnleashedSierra Supercomputer: Science Unleashed
Sierra Supercomputer: Science Unleashed
 
Using Ceph for Large Hadron Collider Data
Using Ceph for Large Hadron Collider DataUsing Ceph for Large Hadron Collider Data
Using Ceph for Large Hadron Collider Data
 
70a monitoring & troubleshooting
70a monitoring & troubleshooting70a monitoring & troubleshooting
70a monitoring & troubleshooting
 
LCA13: Hadoop DFS Performance
LCA13: Hadoop DFS PerformanceLCA13: Hadoop DFS Performance
LCA13: Hadoop DFS Performance
 

Viewers also liked (9)

What's Available in Assistive Technology for Students with ...
What's Available in Assistive Technology for Students with ...What's Available in Assistive Technology for Students with ...
What's Available in Assistive Technology for Students with ...
 
1 elena topoleva o zelyah i zadachah konferenzii
1 elena topoleva o zelyah i zadachah konferenzii1 elena topoleva o zelyah i zadachah konferenzii
1 elena topoleva o zelyah i zadachah konferenzii
 
GoOpen 2010: Olav Torvund
GoOpen 2010: Olav TorvundGoOpen 2010: Olav Torvund
GoOpen 2010: Olav Torvund
 
REQUEST FOR PROPOSAL PROCEDURES
REQUEST FOR PROPOSAL PROCEDURESREQUEST FOR PROPOSAL PROCEDURES
REQUEST FOR PROPOSAL PROCEDURES
 
Analysis of LDPC Codes under Wi-Max IEEE 802.16e
Analysis of LDPC Codes under Wi-Max IEEE 802.16eAnalysis of LDPC Codes under Wi-Max IEEE 802.16e
Analysis of LDPC Codes under Wi-Max IEEE 802.16e
 
Timeline: The Life of Michael Jackson
Timeline: The Life of Michael JacksonTimeline: The Life of Michael Jackson
Timeline: The Life of Michael Jackson
 
LESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIALLESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIAL
 
1. MPEG I.B.P frame之不同
1. MPEG I.B.P frame之不同1. MPEG I.B.P frame之不同
1. MPEG I.B.P frame之不同
 
EL MODELO DE NEGOCIO DE YOUTUBE
EL MODELO DE NEGOCIO DE YOUTUBEEL MODELO DE NEGOCIO DE YOUTUBE
EL MODELO DE NEGOCIO DE YOUTUBE
 

Similar to Cluster Computing with Dryad

What CloudStackers Need To Know About LINSTOR/DRBD
What CloudStackers Need To Know About LINSTOR/DRBDWhat CloudStackers Need To Know About LINSTOR/DRBD
What CloudStackers Need To Know About LINSTOR/DRBD
ShapeBlue
 
RAPIDS cuGraph – Accelerating all your Graph needs
RAPIDS cuGraph – Accelerating all your Graph needsRAPIDS cuGraph – Accelerating all your Graph needs
RAPIDS cuGraph – Accelerating all your Graph needs
Connected Data World
 


Cluster Computing with Dryad

  • 1. Cluster Computing with DryadLINQ Mihai Budiu Microsoft Research, Silicon Valley Cloud computing: Infrastructure, Services, and Applications UC Berkeley, March 4 2009
  • 2. Goal 2
  • 3. Design Space (figure): axes Latency vs. Throughput and Internet vs. Private data center; regions shown: Data-parallel, Shared memory
  • 4. Data-Parallel Computation (four stacks; layers Application, Language, Execution, Storage): Language: SQL | Sawzall | ≈SQL (Pig, Hive) | LINQ, SQL (DryadLINQ, Scope). Execution: Parallel Databases | Map-Reduce | Hadoop | Dryad. Storage: (none shown) | GFS, BigTable | HDFS, S3 | Cosmos, Azure, SQL Server
  • 5. Software Stack (figure, top to bottom): Applications (log parsing, machine learning, graphs, data mining); languages and libraries (legacy code, SQL, C#, PSQL, Scope, .Net Distributed Data Structures, SSIS, SQL Server, queueing); Distributed Shell, DryadLINQ, C++; Dryad; storage (Distributed FS (Cosmos), Azure XStore, SQL Server, NTFS); cluster layer (Cluster Services, Azure XCompute, Windows HPC); Windows Server on each machine
  • 6. Introduction • Dryad • DryadLINQ • Conclusions 6
  • 7. Dryad • Continuously deployed since 2006 • Running on >> 10^4 machines • Sifting through > 10Pb data daily • Runs on clusters > 3000 machines • Handles jobs with > 10^5 processes each • Platform for rich software ecosystem • Used by >> 100 developers • Written at Microsoft Research, Silicon Valley
  • 8. Dryad = Execution Layer Job (application) Pipeline Dryad ≈ Shell Cluster Machine 8
  • 9. 2-D Piping • Unix Pipes: 1-D grep | sed | sort | awk | perl • Dryad: 2-D grep^1000 | sed^500 | sort^1000 | awk^500 | perl^50
  • 14. Virtualized 2-D Pipelines • 2D DAG • multi-machine • virtualized 14
  • 15. Dryad Job Structure (figure): input files feed stages of vertices (processes such as grep, sed, sort, awk, perl), connected by channels, producing output files
  • 16. Channels: finite streams of items • distributed filesystem files (persistent) • SMB/NTFS files (temporary) • TCP pipes (inter-machine) • memory FIFOs (intra-machine)
  • 17. Dryad System Architecture (figure): control plane: job manager with the job schedule, NS, and a PD on each cluster machine; data plane: vertices (V) exchanging data over files, TCP, FIFO, network
  • 19. Policy Managers R R R R Stage R Connection R-X X X X X Stage X R-X X Manager R manager Manager Job Manager 19
  • 20. Dynamic Graph Rewriting X[0] X[1] X[3] X[2] X’[2] Slow Duplicate Completed vertices vertex vertex Duplication Policy = f(running times, data volumes)
  • 21. Cluster network topology top-level switch top-of-rack switch rack
  • 22. Dynamic Aggregation S S S S S S T static #1S #2S #1S #3S #3S #2S rack # # 1A # 2A # 3A dynamic T 22
  • 23. Policy vs. Mechanism. Policy: • Application-level • Most complex in C++ code • Invoked with upcalls • Need good default implementations • DryadLINQ provides a comprehensive set. Mechanism: • Built-in • Scheduling • Graph rewriting • Fault tolerance • Statistics and reporting
  • 24. Introduction • Dryad • DryadLINQ • Conclusions 24
  • 25. LINQ => DryadLINQ Dryad 25
  • 26. LINQ = .Net + Queries Collection<T> collection; bool IsLegal(Key k); string Hash(Key k); var results = from c in collection where IsLegal(c.key) select new { hash = Hash(c.key), c.value };
  • 27. Collections and Iterators class Collection<T> : IEnumerable<T>; public interface IEnumerable<T> { IEnumerator<T> GetEnumerator(); } public interface IEnumerator <T> { T Current { get; } bool MoveNext(); void Reset(); } 27
  • 28. DryadLINQ Data Model Partition .Net objects Collection 28
  • 29. DryadLINQ = LINQ + Dryad. Vertex code: Collection<T> collection; bool IsLegal(Key k); string Hash(Key k); Query plan (Dryad job): var results = from c in collection where IsLegal(c.key) select new { hash = Hash(c.key), c.value }; Data: the partitioned collection flows through parallel C# vertices into results
  • 30. Demo 30
  • 31. Example: Histogram public static IQueryable<Pair> Histogram( IQueryable<LineRecord> input, int k) { var words = input.SelectMany(x => x.line.Split(' ')); var groups = words.GroupBy(x => x); var counts = groups.Select(x => new Pair(x.Key, x.Count())); var ordered = counts.OrderByDescending(x => x.count); var top = ordered.Take(k); return top; } Trace for k = 3: input: “A line of words of wisdom”; words: [“A”, “line”, “of”, “words”, “of”, “wisdom”]; groups: [[“A”], [“line”], [“of”, “of”], [“words”], [“wisdom”]]; counts: [{“A”, 1}, {“line”, 1}, {“of”, 2}, {“words”, 1}, {“wisdom”, 1}]; ordered: [{“of”, 2}, {“A”, 1}, {“line”, 1}, {“words”, 1}, {“wisdom”, 1}]; top: [{“of”, 2}, {“A”, 1}, {“line”, 1}]
  • 32. Histogram Plan: SelectMany → Sort → GroupBy+Select → HashDistribute → MergeSort → GroupBy → Select → Sort → Take → MergeSort → Take
  • 33. Map-Reduce in DryadLINQ public static IQueryable<S> MapReduce<T,M,K,S>( this IQueryable<T> input, Expression<Func<T, IEnumerable<M>>> mapper, Expression<Func<M,K>> keySelector, Expression<Func<IGrouping<K,M>,S>> reducer) { var map = input.SelectMany(mapper); var group = map.GroupBy(keySelector); var result = group.Select(reducer); return result; } 33
  • 34. Map-Reduce Plan (figure): vertex legend: M = map, Q = sort, G1 = groupby, R = reduce, D = distribute, MS = mergesort, G2 = groupby, X = consumer. Stages: map (M, Q, G1, R, D), partial aggregation (MS, G2, R), reduce (MS, G2, R), consumer (X). The plan is shown in one static and two dynamic forms (vertices S, A, T)
  • 35. Distributed Sorting Plan (figure): vertices DS, H, O, D, M, S; shown in one static and two dynamic forms
  • 36. Expectation Maximization • 160 lines • 3 iterations shown 36
  • 39. LINQ System Architecture (figure): on the local machine, a .Net program (C#, VB, F#, etc) sends a Query to a LINQ Provider and receives Objects; execution engines: LINQ-to-obj, PLINQ, LINQ-to-SQL, LINQ-to-WS, DryadLINQ, Flickr, Oracle, LINQ-to-XML, your own
  • 40. The DryadLINQ Provider (figure): on the client machine, a .Net ToCollection call passes the query expression to DryadLINQ, which produces a distributed query plan, vertex code, and context; Invoke starts the Dryad JM in the data center; Dryad execution reads the input tables and writes the output tables; the results come back as a DryadTable of .Net objects consumed by foreach
  • 41. Combining Query Providers (figure): on the local machine, a .Net program (C#, VB, F#, etc) sends a Query through LINQ Providers to several execution engines (PLINQ, SQL Server, DryadLINQ, LINQ-to-obj) and receives Objects
  • 42. Using PLINQ Query DryadLINQ Local query PLINQ 42
  • 43. Using LINQ to SQL Server Query DryadLINQ Query Query Query LINQ to SQL LINQ to SQL Query Query 43
  • 44. Using LINQ-to-objects Local machine LINQ to obj debug Query production DryadLINQ Cluster 44
  • 45. Introduction • Dryad • DryadLINQ • Conclusions 45
  • 46. Lessons Learned (1) • What worked well? – Complete separation of storage / execution / language – Using LINQ +.Net (language integration) – Strong typing for data – Allowing flexible and powerful policies – Centralized job manager: no replication, no consensus, no checkpointing – Porting (HPC, Cosmos, Azure, SQL Server) – Technology transfer (done at the right time) 46
  • 47. Lessons Learned (2) • What worked less well – Error handling and propagation – Distributed (randomized) resource allocation – TCP pipe channels – Hierarchical dataflow graphs (each vertex = small graph) – Forking the source tree 47
  • 48. Lessons Learned (3) • Tricks of the trade – Asynchronous operations hide latency – Management through distributed state machines – Logging state transitions for debugging – Complete separation of data and control – Leases clean up after themselves – Understand scaling factors O(machines) < O(vertices) < O(edges) – Don’t fix a broken API, re-design it – Compression trades off bandwidth for CPU – Managed code increases productivity by 10x10
  • 49. Ongoing Dryad/DryadLINQ Research • Performance modeling • Scheduling and resource allocation • Profiling and performance debugging • Incremental computation • Hardware acceleration • High-level programming abstractions • Many domain-specific applications 49
  • 50. Sample applications written using DryadLINQ (class in parentheses): Distributed linear algebra (Numerical); Accelerated Page-Rank computation (Web graph); Privacy-preserving query language (Data mining); Expectation maximization for a mixture of Gaussians (Clustering); K-means (Clustering); Linear regression (Statistics); Probabilistic Index Maps (Image processing); Principal component analysis (Data mining); Probabilistic Latent Semantic Indexing (Data mining); Performance analysis and visualization (Debugging); Road network shortest-path preprocessing (Graph); Botnet detection (Data mining); Epitome computation (Image processing); Neural network training (Statistics); Parallel machine learning framework infer.net (Machine learning); Distributed query caching (Optimization); Image indexing (Image processing); Web indexing structure (Web graph)
  • 51. Conclusions = 51 51
  • 52. “What’s the point if I can’t have it?” • Glad you asked • We’re offering Dryad+DryadLINQ to academic partners • Dryad is in binary form, DryadLINQ in source • Requires signing a 3-page licensing agreement 52
  • 54. DryadLINQ • Declarative programming • Integration with Visual Studio • Integration with .Net • Type safety • Automatic serialization • Job graph optimizations (static, dynamic) • Conciseness
  • 55. What does DryadLINQ do? User code: public struct Data { … public static int Compare(Data left, Data right); } Data g = new Data(); var result = table.Where(s => Data.Compare(s, g) < 0); Data serialization: public static void Read(this DryadBinaryReader reader, out Data obj); public static int Write(this DryadBinaryWriter writer, Data obj); Data factory: public class DryadFactoryType__0 : LinqToDryad.DryadFactory<Data> DryadVertexEnv denv = new DryadVertexEnv(args); Channel writer: var dwriter__2 = denv.MakeWriter(FactoryType__0); Channel reader: var dreader__3 = denv.MakeReader(FactoryType__0); LINQ code with context serialization: var source__4 = DryadLinqVertex.Where(dreader__3, s => (Data.Compare(s, ((Data)DryadLinqObjectStore.Get(0))) < ((System.Int32)(0))), false); dwriter__2.WriteItemSequence(source__4);
  • 56. Range-Distribution Manager S S S [0-100) S S S Hist [0-30),[30-100) static T D D D T T [0-30) [0-?) [30-100) [?-100) dynamic 56
  • 57. Staging: 1. Build 2. Send .exe 3. Start JM 4. Query cluster resources 5. Generate graph 6. Initialize vertices 7. Serialize vertices 8. Monitor vertex execution (figure labels: JM code, vertex code, cluster services)
  • 58. Bibliography: Michael Isard, Mihai Budiu, Yuan Yu, Andrew Birrell, and Dennis Fetterly. Dryad: Distributed Data-Parallel Programs from Sequential Building Blocks. European Conference on Computer Systems (EuroSys), Lisbon, Portugal, March 21-23, 2007; Yuan Yu, Michael Isard, Dennis Fetterly, Mihai Budiu, Úlfar Erlingsson, Pradeep Kumar Gunda, and Jon Currey. DryadLINQ: A System for General-Purpose Distributed Data-Parallel Computing Using a High-Level Language. Symposium on Operating System Design and Implementation (OSDI), San Diego, CA, December 8-10, 2008; Ronnie Chaiken, Bob Jenkins, Per-Åke Larson, Bill Ramsey, Darren Shakib, Simon Weaver, and Jingren Zhou. SCOPE: Easy and Efficient Parallel Processing of Massive Data Sets. Very Large Databases Conference (VLDB), Auckland, New Zealand, August 23-28, 2008; Gabriela F. Creţu-Ciocârlie, Mihai Budiu, and Moises Goldszmidt. Hunting for problems with Artemis. USENIX Workshop on the Analysis of System Logs (WASL), San Diego, CA, December 7, 2008
  • 59. Data Partitioning DATA RAM DATA 59
  • 60. Linear Algebra & Machine Learning in DryadLINQ Data analysis Machine learning Large Vector DryadLINQ Dryad 60
  • 61. Operations on Large Vectors: Map 1 f T U f preserves partitioning T f U 61
  • 62. Map 2 (Pairwise) f T U V T U f V 62
  • 63. Map 3 (Vector-Scalar) f T U V T U f V 63
  • 64. Reduce (Fold) f U U U U f f f U U U f U 64
  • 65. Linear Algebra: $T, U, V = \mathbb{R},\ \mathbb{R}^{m},\ \mathbb{R}^{m \times n}$
  • 66. Linear Regression • Data: $x_t \in \mathbb{R}^{n},\ y_t \in \mathbb{R}^{m},\ t \in \{1,\dots,n\}$ • Find $A \in \mathbb{R}^{m \times n}$ • S.t. $A x_t \approx y_t$
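  • 67. Analytic Solution: $A = \left(\sum_t y_t x_t^T\right)\left(\sum_t x_t x_t^T\right)^{-1}$. Query plan (figure): Map computes X[i]×X[i]^T and Y[i]×X[i]^T for partitions i = 0, 1, 2; Reduce sums each set (Σ); the X sum is inverted ([ ]^-1) and multiplied (*) with the Y sum to give A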
  • 68. Linear Regression Code: $A = \left(\sum_t y_t x_t^T\right)\left(\sum_t x_t x_t^T\right)^{-1}$ Vectors x = input(0), y = input(1); Matrices xx = x.Map(x, (a,b) => a.OuterProd(b)); OneMatrix xxs = xx.Sum(); Matrices yx = y.Map(x, (a,b) => a.OuterProd(b)); OneMatrix yxs = yx.Sum(); OneMatrix xxinv = xxs.Map(a => a.Inverse()); OneMatrix A = yxs.Map(xxinv, (a, b) => a.Mult(b));
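
The Histogram and Map-Reduce slides (31 and 33) can be tried on a single machine with ordinary LINQ-to-objects. The following is a minimal sketch, not the DryadLINQ implementation: it assumes in-memory `IEnumerable<T>` sequences instead of `IQueryable<T>`, plain `Func` delegates instead of `Expression<Func<...>>`, and `KeyValuePair<string, int>` in place of the slide's `Pair` type. The class name `HistogramSketch` is hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class HistogramSketch
{
    // Same operator chain as slide 33: SelectMany -> GroupBy -> Select.
    // Assumption: Func delegates replace the Expression<Func<...>> parameters
    // that a query provider such as DryadLINQ would inspect.
    public static IEnumerable<S> MapReduce<T, M, K, S>(
        this IEnumerable<T> input,
        Func<T, IEnumerable<M>> mapper,
        Func<M, K> keySelector,
        Func<IGrouping<K, M>, S> reducer)
    {
        var map = input.SelectMany(mapper);
        var group = map.GroupBy(keySelector);
        return group.Select(reducer);
    }

    // Top-k word counts, following the steps of slide 31.
    public static List<KeyValuePair<string, int>> Histogram(
        IEnumerable<string> lines, int k)
    {
        return lines
            .MapReduce(
                line => line.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries),
                word => word,
                g => new KeyValuePair<string, int>(g.Key, g.Count()))
            .OrderByDescending(pair => pair.Value) // stable sort in LINQ-to-objects
            .Take(k)
            .ToList();
    }

    public static void Main()
    {
        var top = Histogram(new[] { "A line of words of wisdom" }, 3);
        foreach (var pair in top)
            Console.WriteLine($"{pair.Key}: {pair.Value}");
    }
}
```

On the slide's sample sentence with k = 3 this reproduces the trace shown on slide 31: "of" with count 2, followed by "A" and "line" with count 1 each. Expressing the histogram through `MapReduce` shows that map-reduce is just one composition of the LINQ operators.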

Editor's Notes

  1. Enable any programmer to write and run applications on small and large computer clusters.
  2. Dryad is optimized for: throughput, data-parallel computation, in a private data-center.
  3. Just as the Unix shell does not understand the pipeline running on top of it but still manages its execution (e.g., killing processes when one exits), Dryad does not understand the job running on top of it.
  4. Dryad is a generalization of the Unix piping mechanism: instead of uni-dimensional (chain) pipelines, it provides two-dimensional pipelines. The unit is still a process connected by a point-to-point channel, but the processes are replicated.
  5. This is a possible schedule of a Dryad job using 2 machines.
  6. The Unix pipeline is generalized in three ways: it is 2D instead of 1D; it spans multiple machines; and resources are virtualized, so you can run the same large job on many or few machines.
  7. This is the basic Dryad terminology.
  8. Channels are very abstract, enabling a variety of transport mechanisms. The performance and fault-tolerance of these mechanisms vary widely.
  9. The brain of a Dryad job is a centralized Job Manager, which maintains the complete state of the job. The JM controls the processes running on a cluster, but never exchanges data with them. (The data plane is completely separated from the control plane.)
  10. Vertex failures and channel failures are handled differently.
  11. A stage manager handles apparently very slow computations by duplicating vertices.
  12. Aggregating data with associative operators can be done in a bandwidth-preserving fashion if the intermediate aggregations are placed close to the source data.
  13. DryadLINQ adds a wealth of features on top of plain Dryad.
  14. Language Integrated Query is an extension of .Net which allows one to write declarative computations on collections (green part).
  15. DryadLINQ translates LINQ programs into Dryad computations: C# and LINQ data objects become distributed partitioned files; LINQ queries become distributed Dryad jobs; C# methods become code running on the vertices of a Dryad job.
  16. More complicated algorithms, even iterative ones, can be implemented.
  17. At the bottom, DryadLINQ uses PLINQ to run the computation in parallel on multiple cores.
  18. We believe that Dryad and DryadLINQ are a great foundation for cluster computing.
  19. DryadLINQ adds a wealth of features on top of plain Dryad.
  20. Using a connection manager one can load-balance the data distribution at run-time, based on data statistics obtained from sampling the data stream. In this case the number of destination vertices and the ranges for each vertex are decided dynamically.
  21. Computation Staging
  22. A common scenario: too much data to process. Instead of trying to be clever, just use more machines and a brute-force algorithm.
  23. I will now focus on a library for machine-learning algorithms we have built on top of DryadLINQ.
  24. One can apply an arbitrary C# side-effect free function f to all objects in a vector.
  25. Or one can do it to a pair of vectors.
  26. Or one can use a vector and a scalar, replicating the scalar for each element of the vector.
  27. Finally, one can fold a vector to a scalar.
  28. Vectors of vectors or of matrices build up to a nice linear algebra library.
  29. We will show how to compute linear regression parameters.
  30. This expression uses a query plan composed of 2 (pairwise) maps and 2 reduces.
  31. The complete source code for linear regression has 6 lines of code.
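
The map and reduce steps described in notes 29 to 31 can be illustrated without a cluster. The sketch below is an assumption-laden, single-machine stand-in for the large-vector library: it fixes the inputs to x in R^2 and scalar y, uses plain arrays and LINQ-to-objects, and inverts the 2×2 sum by hand. The names `LinearRegressionSketch` and `Fit` are hypothetical and not part of DryadLINQ.

```csharp
using System;
using System.Linq;

public static class LinearRegressionSketch
{
    // Computes A = (sum_t y_t x_t^T) (sum_t x_t x_t^T)^{-1}
    // for x_t in R^2 and scalar y_t, so A is a 1x2 row vector.
    public static double[] Fit(double[][] xs, double[] ys)
    {
        // Map + Reduce for x x^T: entries of the symmetric 2x2 sum.
        double sxx00 = xs.Sum(x => x[0] * x[0]);
        double sxx01 = xs.Sum(x => x[0] * x[1]);
        double sxx11 = xs.Sum(x => x[1] * x[1]);

        // Pairwise Map + Reduce for y x^T: entries of the 1x2 sum.
        double syx0 = xs.Zip(ys, (x, y) => y * x[0]).Sum();
        double syx1 = xs.Zip(ys, (x, y) => y * x[1]).Sum();

        // Invert the 2x2 sum: [[a, b], [b, d]]^{-1} = 1/det * [[d, -b], [-b, a]].
        double det = sxx00 * sxx11 - sxx01 * sxx01;
        if (det == 0.0)
            throw new InvalidOperationException("sum of x x^T is singular");

        // Multiply the 1x2 sum by the inverse.
        double a0 = (syx0 * sxx11 - syx1 * sxx01) / det;
        double a1 = (syx1 * sxx00 - syx0 * sxx01) / det;
        return new[] { a0, a1 };
    }

    public static void Main()
    {
        // Samples generated from y = 2*x0 + 3*x1.
        var xs = new[] { new[] { 1.0, 0.0 }, new[] { 0.0, 1.0 }, new[] { 1.0, 1.0 } };
        var ys = new[] { 2.0, 3.0, 5.0 };
        var a = Fit(xs, ys);
        Console.WriteLine($"{a[0]}, {a[1]}");
    }
}
```

With the three noise-free samples in `Main`, the fit should recover the coefficients 2 and 3. The two `Sum` groups correspond to the two reduces in the query plan, and the per-sample products correspond to the two pairwise maps.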