Cascading
Alex Su
2011/02/11

             Copyright 2010 TCloud Computing Inc.
Agenda

•Introduction
• How it works
•Data Processing
•Advanced Processing
•Monitoring
•Testing
•Best Practices
•Cascading GUI




                       Trend Micro Confidential
Introduction

•Hadoop coding is non-trivial
•Hadoop expects one class for the Map step and one
 class for the Reduce step
•What if your application needs multiple Map and
 Reduce steps? Who coordinates what can run in parallel?
•What if you need non-Hadoop logic between
 Hadoop steps?
•Chain the Operations into data processing
 workflows




Introduction

•Operations are chained together to define a Pipe
assembly or a reusable sub-assembly




Introduction
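import java.util.Properties;

import cascading.flow.Flow;
import cascading.flow.FlowConnector;
import cascading.pipe.CoGroup;
import cascading.pipe.Each;
import cascading.pipe.Every;
import cascading.pipe.GroupBy;
import cascading.pipe.Pipe;

// SomeFunction, SomeFilter and SomeAggregator are placeholders.
// source and sink are defined elsewhere; because this assembly has two
// heads, source would be a Map of head name ("lhs", "rhs") to source Tap.

// the "left hand side" assembly head
Pipe lhs = new Pipe( "lhs" );
lhs = new Each( lhs, new SomeFunction() );
lhs = new Each( lhs, new SomeFilter() );

// the "right hand side" assembly head
Pipe rhs = new Pipe( "rhs" );
rhs = new Each( rhs, new SomeFunction() );

// joins the lhs and rhs
Pipe join = new CoGroup( lhs, rhs );
join = new Every( join, new SomeAggregator() );
join = new GroupBy( join );
join = new Every( join, new SomeAggregator() );

// the tail of the assembly
join = new Each( join, new SomeFunction() );

Properties properties = new Properties();
FlowConnector.setApplicationJarClass( properties, Main.class );

FlowConnector flowConnector = new FlowConnector( properties );
Flow flow = flowConnector.connect( "join", source, sink, join );
// execute the flow, block until complete
flow.complete();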

How it works

•Pipe Assemblies become Flows
•Translates a DAG of operations to a DAG of
 MapReduce jobs
•All MapReduce jobs in a Flow are scheduled in
 dependency order
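
Scheduling jobs in dependency order can be sketched in plain Java as a topological sort. This is only an illustration of the idea, not Cascading's planner: the class name, the method name and the two job names are hypothetical, and every job is assumed to appear as a key in the map.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative model only: plain Java, not Cascading's planner or scheduler.
class JobOrderSketch {

    // deps maps each job name to the jobs that must finish before it can start.
    // Assumption: every job appears as a key, even if it has no dependencies.
    static List<String> dependencyOrder(Map<String, List<String>> deps) {
        Map<String, Integer> remaining = new LinkedHashMap<>();
        Map<String, List<String>> dependents = new LinkedHashMap<>();
        for (String job : deps.keySet()) {
            remaining.put(job, deps.get(job).size());
            dependents.put(job, new ArrayList<String>());
        }
        for (Map.Entry<String, List<String>> e : deps.entrySet()) {
            for (String upstream : e.getValue()) {
                dependents.get(upstream).add(e.getKey());
            }
        }
        // Jobs with no unfinished dependencies are ready to run.
        Deque<String> ready = new ArrayDeque<>();
        for (Map.Entry<String, Integer> e : remaining.entrySet()) {
            if (e.getValue() == 0) {
                ready.add(e.getKey());
            }
        }
        List<String> order = new ArrayList<>();
        while (!ready.isEmpty()) {
            String job = ready.remove();
            order.add(job);
            // Finishing a job may unblock the jobs that depend on it.
            for (String next : dependents.get(job)) {
                int left = remaining.merge(next, -1, Integer::sum);
                if (left == 0) {
                    ready.add(next);
                }
            }
        }
        if (order.size() != deps.size()) {
            throw new IllegalStateException("cycle detected: not a DAG");
        }
        return order;
    }

    public static void main(String[] args) {
        // Hypothetical job names for a two-step flow.
        Map<String, List<String>> deps = new LinkedHashMap<>();
        deps.put("group-and-count", Arrays.asList("join-and-filter"));
        deps.put("join-and-filter", new ArrayList<String>());
        System.out.println(dependencyOrder(deps));
    }
}
```

Jobs that become ready at the same time have no dependency on each other, which is what makes it safe to run them in parallel.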




How it works
digraph G {
  1 [label = "Every('akamaiPipe*whiteListPipe')[Count[decl:'count']]"];
  2 [label = "Hfs['TextLine[['host', 'count']->[ALL]]']['/user/alex/output']']"];
  3 [label = "GroupBy('akamaiPipe*whiteListPipe')[by:['host']]"];
  4 [label = "Each('akamaiPipe*whiteListPipe')[NotMatchedFilter[decl:'host', 'offset', 'line']]"];
  5 [label = "CoGroup('akamaiPipe*whiteListPipe')[by:whiteListPipe:['line']akamaiPipe:['host']]"];
  6 [label = "Hfs['TextLine[['offset', 'line']->[ALL]]']['/user/alex/whitelist/whitelist.txt']']"];
  7 [label = "Each('akamaiPipe')[RegexParser[decl:'host'][args:1]]"];
  8 [label = "Hfs['TextLine[['line']->[ALL]]']['/user/alex/input/akamai.log']']"];
  9 [label = "[head]"];
  10 [label = "[tail]"];
  11 [label = "TempHfs['SequenceFile[['host', 'offset', 'line']]'][akamaiPipe_whiteListPipe/52729/]"];
  12 [label = "Hfs['TextLine[['offset', 'line']->[ALL]]']['/user/alex/trap']']"];
  1 -> 2 [label = "[{2}:'host', 'count']\n[{3}:'host', 'offset', 'line']"];
  7 -> 5 [label = "[{1}:'host']\n[{1}:'host']"];
  5 -> 4 [label = "whiteListPipe[{1}:'line'],akamaiPipe[{1}:'host']\n[{3}:'host', 'offset', 'line']"];
  3 -> 1 [label = "akamaiPipe*whiteListPipe[{1}:'host']\n[{3}:'host', 'offset', 'line']"];
  9 -> 6 [label = ""];
  9 -> 8 [label = ""];
  2 -> 10 [label = "[{?}:ALL]\n[{?}:ALL]"];
  4 -> 11 [label = "[{3}:'host', 'offset', 'line']\n[{3}:'host', 'offset', 'line']"];
  11 -> 3 [label = "[{3}:'host', 'offset', 'line']\n[{3}:'host', 'offset', 'line']"];
  8 -> 7 [label = "[{1}:'line']\n[{1}:'line']"];
  6 -> 5 [label = "[{2}:'offset', 'line']\n[{2}:'offset', 'line']"];
  7 -> 12 [label = ""];
}

Data Processing

•Tuple
   •A single ‘row’ of data being processed
   •Each column is named
   •Can access data by name or position
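
A minimal plain-Java sketch of a row whose values can be read by field name or by position. It only models the idea; Cascading's own Tuple and Fields classes are not used, and the class name and sample values are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

// Illustrative model only: not Cascading's Tuple or Fields classes.
class NamedRowSketch {
    private final List<String> names;
    private final List<Object> values;

    NamedRowSketch(List<String> names, List<Object> values) {
        if (names.size() != values.size()) {
            throw new IllegalArgumentException("one value is required per field name");
        }
        this.names = names;
        this.values = values;
    }

    // Access by position, counting from zero.
    Object get(int position) {
        return values.get(position);
    }

    // Access by field name, resolved to a position first.
    Object get(String name) {
        int position = names.indexOf(name);
        if (position < 0) {
            throw new IllegalArgumentException("unknown field: " + name);
        }
        return values.get(position);
    }

    public static void main(String[] args) {
        NamedRowSketch row = new NamedRowSketch(
            Arrays.asList("host", "count"),
            Arrays.<Object>asList("example.com", 3L));
        // The same row, read once by name and once by position.
        System.out.println(row.get("host") + " " + row.get(1));
    }
}
```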




Data Processing

•Tap
   •Abstraction on top of Hadoop files
   •Allows you to define your own parser for files
   •Scheme
      •TextLine
      •TextDelimited
      •SequenceFile
      •WritableSequenceFile
   •Example:

Hfs input = new Hfs(new TextLine(), a_hdfsDirectory + "/" + name);



Data Processing

• Tap
    •Lfs
    •Dfs
    •Hfs
    •MultiSourceTap
    •MultiSinkTap
    •TemplateTap
    •GlobHfs
    •S3fs (deprecated)




Data Processing

• TemplateTap
    TemplateTap can be used to write tuple streams
     out to subdirectories based on the values in
     the Tuple instance.




Data Processing

• TemplateTap

// tab-delimited text with three named fields
TextDelimited scheme = new TextDelimited( new Fields( "year",
  "month", "entry" ), "\t" );
Hfs tap = new Hfs( scheme, path );
String template = "%s-%s"; // dirs named "year-month"
Tap months = new TemplateTap( tap, template, SinkMode.REPLACE );
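
The "year-month" directory naming can be sketched in plain Java with String.format. This is a sketch under assumptions: it does not call TemplateTap, the parent path "/out" is hypothetical, and it assumes the year and month values fill the two placeholders, as the comment in the snippet above suggests.

```java
// Illustrative model only: mimics the "%s-%s" template with String.format;
// it does not call Cascading's TemplateTap.
class TemplatePathSketch {

    // Builds the sub-directory for one record from the values that fill the template.
    static String subDirectory(String parentPath, String template, Object... values) {
        return parentPath + "/" + String.format(template, values);
    }

    public static void main(String[] args) {
        // "/out" is a hypothetical parent path.
        System.out.println(subDirectory("/out", "%s-%s", "2011", "02"));
    }
}
```

Records with the same year and month resolve to the same sub-directory, so each distinct pair gets its own output location.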




Data Processing

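•Tap sink modes
   •SinkMode.KEEP (default; writing fails if the resource exists)
   •SinkMode.REPLACE (an existing resource is deleted first)
   •SinkMode.UPDATE (an existing resource is re-used, if supported)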




Data Processing

•Integration
   •Cascading.Avro
   •Cascading.Hbase
   •Cascading.JDBC
   •Cascading.Memcached
   •Cascading.SimpleDB




Data Processing
• Pipe
     • a base class for core processing model types
• Each
     • for each “tuple” in data do this to it
• GroupBy
     • similar to a ‘group by’ in SQL
• CoGroup
     • joins tuple streams together
• Every
     • applies an Aggregator (like count, or sum) or Buffer (a sliding
       window) Operation to every group of Tuples that pass through
       it.
• SubAssembly
     • allows for nesting reusable pipe assemblies into a Pipe class




Data Processing
 • CoGroup
     • InnerJoin
     • OuterJoin
     • LeftJoin
     • RightJoin
     • MixedJoin



// fields carried by each incoming stream:
//   lhs: "url", "word", "count"
//   rhs: "url", "sentence", "count"
Fields common = new Fields( "url" );
// "url" and "count" occur on both sides, so the joined fields are renamed
Fields declared = new Fields( "url1", "word", "wd_count", "url2", "sentence", "snt_count" );
Pipe join = new CoGroup( lhs, common, rhs, common, declared, new InnerJoin() );




Data Processing

•Operation
  •Define what to do on the data
  •Each operations apply logic to a single row, such as
    parsing dates or creating new attributes.
  •Every operations allow you to iterate over the
    ‘group’ of rows to do non-trivial operations.




Data Processing

•Function
   •Identity Function
   •Debug Function
   •Sample and Limit Functions
   •Insert Function
   •Text Functions
   •Regular Expression Operations
   •Java Expression Operations
       •"first-name" is a valid Cascading field name,
        but the expression first-name.trim() will fail
        because it is not a valid Java variable name.


Data Processing

•Filter
    •And
    •Or
    •Not
    •Xor
    •NotNull
    •Null
    •RegexFilter




Data Processing

•Aggregator
   •Average
   •Count
   •First
   •Last
   •Max
   •Min
   •Sum




Data Processing

•Buffer
   •It is very similar to the typical Reducer
    interface
   •It is very useful when header or footer values
    need to be inserted into a grouping, or if values
    need to be inserted into the middle of the
    group values




Data Processing

•Flow
   •A Flow is created by planning it through a
    FlowConnector object. The connect() method
    creates new Flow instances from a set of
    source Taps, sink Taps, and a pipe assembly.

Flow flow = new FlowConnector(new Properties()).connect( "flow-name",
   source, sink, pipe );
flow.complete();




Data Processing

•MapReduceFlow
  •a Flow subclass that supports custom
   MapReduce jobs pre-configured via the
   JobConf object.

• ProcessFlow
    • a Flow subclass that supports custom Riffle
      jobs.




Data Processing

•Cascades
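   •A group of Flows is called a Cascade
   •Custom MapReduce jobs can participate in a
    Cascade

// flow1, flow2 and flow3 are Flow instances created elsewhere
CascadeConnector cascadeConnector = new CascadeConnector();
Cascade cascade = cascadeConnector.connect(flow1, flow2, flow3);
cascade.complete();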




Advanced Processing

•Stream Assertions
   •Unit and Regression tests for Flows
   •Planner can remove ‘strict’, ‘validating’, or all
     assertions




Advanced Processing

•Failure Traps
   •Catch data causing Operations or Assertions to
     fail
   •Allows processes to continue without data loss
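
The trap idea can be sketched in plain Java: records that make an operation fail are set aside rather than aborting the run. This is an illustration only. The trap is modelled as a list rather than a Cascading Tap, and the class name, method name and sample lines are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative model only: a trap is modelled as a plain list, not a Cascading Tap.
class TrapSketch {

    // Parses each line as a long; lines that fail are diverted to the trap
    // instead of aborting the whole run.
    static List<Long> parseAll(List<String> lines, List<String> trap) {
        List<Long> parsed = new ArrayList<>();
        for (String line : lines) {
            try {
                parsed.add(Long.parseLong(line.trim()));
            } catch (NumberFormatException e) {
                trap.add(line); // keep the offending record for later inspection
            }
        }
        return parsed;
    }

    public static void main(String[] args) {
        List<String> trap = new ArrayList<>();
        List<Long> parsed = parseAll(Arrays.asList("10", "oops", "32"), trap);
        System.out.println(parsed + " trapped=" + trap);
    }
}
```

Because the failing record is stored unchanged, it can be inspected or reprocessed later, which is the sense in which no data is lost.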




Advanced Processing

•Partial Aggregation instead of Combiners
   •trade Memory for IO gains by caching values

Fields groupingFields = new Fields( "date" );
Fields valueField = new Fields( "size" );
Fields sumField = new Fields( "total-size" );
assembly = new SumBy( assembly, groupingFields, valueField,
  sumField, long.class );
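
The memory-for-IO trade can be sketched in plain Java with a bounded cache of partial sums on the map side. This is not the SumBy implementation, only a simplified model of the caching idea: the class name, the eviction rule (oldest entry first) and the sample keys are assumptions.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative model only: not the SumBy implementation, just the caching idea.
class PartialSumSketch {
    private final int capacity;
    private final LinkedHashMap<String, Long> cache = new LinkedHashMap<>();
    private final List<Map.Entry<String, Long>> emitted = new ArrayList<>();

    PartialSumSketch(int capacity) {
        if (capacity < 1) {
            throw new IllegalArgumentException("capacity must be at least 1");
        }
        this.capacity = capacity;
    }

    // Map side: fold the value into the cached partial sum for its key.
    void add(String key, long value) {
        if (!cache.containsKey(key) && cache.size() >= capacity) {
            // Cache is full: emit the oldest partial sum to make room.
            Iterator<Map.Entry<String, Long>> it = cache.entrySet().iterator();
            Map.Entry<String, Long> oldest = it.next();
            emitted.add(new AbstractMap.SimpleEntry<>(oldest.getKey(), oldest.getValue()));
            it.remove();
        }
        cache.merge(key, value, Long::sum);
    }

    // End of the map task: emit whatever is still cached.
    List<Map.Entry<String, Long>> flush() {
        for (Map.Entry<String, Long> e : cache.entrySet()) {
            emitted.add(new AbstractMap.SimpleEntry<>(e.getKey(), e.getValue()));
        }
        cache.clear();
        return emitted;
    }

    // Reduce side: partial sums for the same key are added together.
    static Map<String, Long> reduce(List<Map.Entry<String, Long>> partials) {
        Map<String, Long> totals = new LinkedHashMap<>();
        for (Map.Entry<String, Long> p : partials) {
            totals.merge(p.getKey(), p.getValue(), Long::sum);
        }
        return totals;
    }

    public static void main(String[] args) {
        PartialSumSketch mapSide = new PartialSumSketch(2);
        mapSide.add("2011-02-10", 1L);
        mapSide.add("2011-02-11", 2L);
        mapSide.add("2011-02-10", 3L);
        System.out.println(reduce(mapSide.flush()));
    }
}
```

A larger cache means fewer partial records leave the map side, at the cost of more memory; the totals after the reduce step are the same either way.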




Monitoring

•Implement FlowListener interface
   •onStarting
   •onStopping
   •onCompleted
   •onThrowable




Monitoring

• Polling FlowStats
Flow ID: 756271765aa375773f9bbb5570de4d2a
StepStats Count: 2

cascading.flow.FlowStepJob$1: 1, Step{status=RUNNING, startTime=1297344994624}
Name: (1/2) ...SequenceFile[['host', 'offset', 'line']]"][akamaiPipe_whiteListPipe/52729/]
Status: RUNNING
Num Mappers: 2
Num Reducers: 1
  Task ID: task_201102101702_0002_m_000003
  Task ID: task_201102101702_0002_m_000000
  Task ID: task_201102101702_0002_m_000001
  Task ID: task_201102101702_0002_r_000000
  Task ID: task_201102101702_0002_m_000002

cascading.flow.FlowStepJob$1: 2, Step{status=PENDING, startTime=0}
Name: (2/2) Hfs["TextLine[['host', 'count']->[ALL]]"]["/user/alex/output"]"]
Status: PENDING
Num Mappers: 0
Num Reducers: 0




Testing

•Use ClusterTestCase if you want to launch an
 embedded Hadoop cluster inside your TestCase
•A few validation and Hadoop helper functions are
 provided
•Doesn’t support the Hadoop 0.21 testing library




Cascading GUI

•Yahoo Pipes
Pipes is a powerful composition tool to aggregate,
 manipulate, and mashup content from around the
 web.




Cascading GUI

•WireIt
WireIt is an open-source javascript library to create
 web wirable interfaces for dataflow applications,
 visual programming languages, graphical modeling,
 or graph editors.




Live Demo




THANK YOU!





Axa Assurance Maroc - Insurer Innovation Award 2024
 
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsTop 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Developing An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilDeveloping An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of Brazil
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 

Cascading introduction

  • 1. Cascading
      Alex Su, 2011/02/11
      Copyright 2010 TCloud Computing Inc.
  • 2. Agenda
      • Introduction
      • How it works
      • Data Processing
      • Advanced Processing
      • Monitoring
      • Testing
      • Best Practices
      • Cascading GUI
  • 3. Introduction
      • Hadoop coding is non-trivial
      • Hadoop looks for one class to do the Map steps and one class to do the Reduce steps
      • What if you need multiple Map and Reduce steps in your application? Who coordinates what can be run in parallel?
      • What if you need to do non-Hadoop logic between Hadoop steps?
      • Chain the Operations into data processing workflows
  • 4. Introduction
      • Operations are chained together to define a Pipe assembly or a reusable sub-assembly
  • 5. Introduction
        // the "left hand side" assembly head
        Pipe lhs = new Pipe( "lhs" );
        lhs = new Each( lhs, new SomeFunction() );
        lhs = new Each( lhs, new SomeFilter() );

        // the "right hand side" assembly head
        Pipe rhs = new Pipe( "rhs" );
        rhs = new Each( rhs, new SomeFunction() );

        // joins the lhs and rhs
        Pipe join = new CoGroup( lhs, rhs );
        join = new Every( join, new SomeAggregator() );
        join = new GroupBy( join );
        join = new Every( join, new SomeAggregator() );

        // the tail of the assembly
        join = new Each( join, new SomeFunction() );

        Properties properties = new Properties();
        FlowConnector.setApplicationJarClass( properties, Main.class );

        FlowConnector flowConnector = new FlowConnector( properties );
        Flow flow = flowConnector.connect( "join", source, sink, join );

        // execute the flow, block until complete
        flow.complete();
  • 6. How it works
      • Pipe Assemblies become Flows
      • Cascading translates a DAG of operations into a DAG of MapReduce jobs
      • All MapReduce jobs in a Flow are scheduled in dependency order
  • 7. How it works
        digraph G {
          1 [label = "Every('akamaiPipe*whiteListPipe')[Count[decl:'count']]"];
          2 [label = "Hfs['TextLine[['host', 'count']->[ALL]]']['/user/alex/output']']"];
          3 [label = "GroupBy('akamaiPipe*whiteListPipe')[by:['host']]"];
          4 [label = "Each('akamaiPipe*whiteListPipe')[NotMatchedFilter[decl:'host', 'offset', 'line']]"];
          5 [label = "CoGroup('akamaiPipe*whiteListPipe')[by:whiteListPipe:['line']akamaiPipe:['host']]"];
          6 [label = "Hfs['TextLine[['offset', 'line']->[ALL]]']['/user/alex/whitelist/whitelist.txt']']"];
          7 [label = "Each('akamaiPipe')[RegexParser[decl:'host'][args:1]]"];
          8 [label = "Hfs['TextLine[['line']->[ALL]]']['/user/alex/input/akamai.log']']"];
          9 [label = "[head]"];
          10 [label = "[tail]"];
          11 [label = "TempHfs['SequenceFile[['host', 'offset', 'line']]'][akamaiPipe_whiteListPipe/52729/]"];
          12 [label = "Hfs['TextLine[['offset', 'line']->[ALL]]']['/user/alex/trap']']"];
          1 -> 2 [label = "[{2}:'host', 'count']\n[{3}:'host', 'offset', 'line']"];
          7 -> 5 [label = "[{1}:'host']\n[{1}:'host']"];
          5 -> 4 [label = "whiteListPipe[{1}:'line'],akamaiPipe[{1}:'host']\n[{3}:'host', 'offset', 'line']"];
          3 -> 1 [label = "akamaiPipe*whiteListPipe[{1}:'host']\n[{3}:'host', 'offset', 'line']"];
          9 -> 6 [label = ""];
          9 -> 8 [label = ""];
          2 -> 10 [label = "[{?}:ALL]\n[{?}:ALL]"];
          4 -> 11 [label = "[{3}:'host', 'offset', 'line']\n[{3}:'host', 'offset', 'line']"];
          11 -> 3 [label = "[{3}:'host', 'offset', 'line']\n[{3}:'host', 'offset', 'line']"];
          8 -> 7 [label = "[{1}:'line']\n[{1}:'line']"];
          6 -> 5 [label = "[{2}:'offset', 'line']\n[{2}:'offset', 'line']"];
          7 -> 12 [label = ""];
        }
  • 8. Data Processing
      • Tuple
        • A single ‘row’ of data being processed
        • Each column is named
        • Can access data by name or position
  • 9. Data Processing
      • Tap
        • Abstraction on top of Hadoop files
        • Allows you to define your own parser for files
      • Scheme
        • TextLine
        • TextDelimited
        • SequenceFile
        • WritableSequenceFile
      • Example:
          Hfs input = new Hfs( new TextLine(), a_hdfsDirectory + "/" + name );
  • 10. Data Processing
      • Tap
        • Lfs
        • Dfs
        • Hfs
        • MultiSourceTap
        • MultiSinkTap
        • TemplateTap
        • GlobHfs
        • S3fs (deprecated)
  • 11. Data Processing
      • TemplateTap
        TemplateTap can be used to write tuple streams out to subdirectories based on the values in the Tuple instance.
  • 12. Data Processing
      • TemplateTap
          TextDelimited scheme = new TextDelimited( new Fields( "year", "month", "entry" ), "\t" );
          Hfs tap = new Hfs( scheme, path );
          String template = "%s-%s"; // dirs named "year-month"
          Tap months = new TemplateTap( tap, template, SinkMode.REPLACE );
  • 13. Data Processing
      • Tap types
        • SinkMode.KEEP
        • SinkMode.REPLACE
        • SinkMode.UPDATE
  • 14. Data Processing
      • Integration
        • Cascading.Avro
        • Cascading.HBase
        • Cascading.JDBC
        • Cascading.Memcached
        • Cascading.SimpleDB
  • 15. Data Processing
      • Pipe
  • 16. Data Processing
      • Pipe: a base class for the core processing model types
      • Each: for each “tuple” in the data, do this to it
      • GroupBy: similar to a ‘group by’ in SQL
      • CoGroup: joins tuple streams together
      • Every: applies an Aggregator (like count or sum) or Buffer (a sliding window) Operation to every group of Tuples that passes through it
      • SubAssembly: allows for nesting reusable pipe assemblies into a Pipe class
  • 17. Data Processing
      • CoGroup
        • InnerJoin
        • OuterJoin
        • LeftJoin
        • RightJoin
        • MixedJoin
          // fields carried by the two incoming streams:
          //   lhsFields = new Fields( "url", "word", "count" );
          //   rhsFields = new Fields( "url", "sentence", "count" );
          Fields common = new Fields( "url" );
          Fields declared = new Fields( "url1", "word", "wd_count", "url2", "sentence", "snt_count" );
          Pipe join = new CoGroup( lhs, common, rhs, common, declared, new InnerJoin() );
  • 18. Data Processing
      • Operation
        • Defines what to do with the data
        • Each operations allow logic on the row, such as parsing dates, creating new attributes, etc.
        • Every operations allow you to iterate over the ‘group’ of rows to do non-trivial operations.
  • 19. Data Processing
      • Function
        • Identity Function
        • Debug Function
        • Sample and Limit Functions
        • Insert Function
        • Text Functions
        • Regular Expression Operations
        • Java Expression Operations
          "first-name" is a valid field name for use with Cascading, but the expression first-name.trim() will fail, because Java parses the hyphen as a minus sign.
  • 20. Data Processing
      • Filter
        • And
        • Or
        • Not
        • Xor
        • NotNull
        • Null
        • RegexFilter
  • 21. Data Processing
      • Aggregator
        • Average
        • Count
        • First
        • Last
        • Max
        • Min
        • Sum
  • 22. Data Processing
      • Buffer
        • It is very similar to the typical Reducer interface
        • It is very useful when header or footer values need to be inserted into a grouping, or when values need to be inserted into the middle of the group values
  • 23. Data Processing
      • Buffer
  • 24. Data Processing
  • 25. Data Processing
      • Flow
        • To create a Flow, it must be planned through the FlowConnector object. The connect() method is used to create new Flow instances based on a set of sink Taps, source Taps, and a pipe assembly.
          Flow flow = new FlowConnector( new Properties() ).connect( "flow-name", source, sink, pipe );
          flow.complete();
  • 26. Data Processing
      • MapReduceFlow
        • a Flow subclass that supports custom MapReduce jobs pre-configured via the JobConf object
      • ProcessFlow
        • a Flow subclass that supports custom Riffle jobs
  • 27. Data Processing
      • Cascades
        • Groups of Flows are called Cascades
        • Custom MapReduce jobs can participate in a Cascade
          Cascade cascade = cascadeConnector.connect( flow1, flow2, flow3 );
          cascade.complete();
  • 28. Advanced Processing
      • Stream Assertions
        • Unit and regression tests for Flows
        • The planner can remove ‘strict’, ‘validating’, or all assertions
  • 29. Advanced Processing
      • Failure Traps
        • Catch data causing Operations or Assertions to fail
        • Allow processes to continue without data loss
  • 30. Advanced Processing
      • Partial Aggregation instead of Combiners
        • Trades memory for IO gains by caching values
          Fields groupingFields = new Fields( "date" );
          Fields valueField = new Fields( "size" );
          Fields sumField = new Fields( "total-size" );
          assembly = new SumBy( assembly, groupingFields, valueField, sumField, long.class );
  • 31. Monitoring
      • Implement the FlowListener interface
        • onStarting
        • onStopping
        • onCompleted
        • onThrowable
  • 32. Monitoring
      • Polling FlowStats
          Flow ID: 756271765aa375773f9bbb5570de4d2a
          StepStats Count: 2
          cascading.flow.FlowStepJob$1: 1, Step{status=RUNNING, startTime=1297344994624}
          Name: (1/2) ...SequenceFile[['host', 'offset', 'line']]"][akamaiPipe_whiteListPipe/52729/]
          Status: RUNNING
          Num Mappers: 2
          Num Reducers: 1
          Task ID: task_201102101702_0002_m_000003
          Task ID: task_201102101702_0002_m_000000
          Task ID: task_201102101702_0002_m_000001
          Task ID: task_201102101702_0002_r_000000
          Task ID: task_201102101702_0002_m_000002
          cascading.flow.FlowStepJob$1: 2, Step{status=PENDING, startTime=0}
          Name: (2/2) Hfs["TextLine[['host', 'count']->[ALL]]"]["/user/alex/output"]"]
          Status: PENDING
          Num Mappers: 0
          Num Reducers: 0
  • 33. Testing
      • Use ClusterTestCase if you want to launch an embedded Hadoop cluster inside your TestCase
      • A few validation and Hadoop functions are provided
      • Doesn’t support the Hadoop 0.21 testing library
  • 34. Cascading GUI
      • Yahoo Pipes
        Pipes is a powerful composition tool to aggregate, manipulate, and mash up content from around the web.
  • 35. Cascading GUI
      • WireIt
        WireIt is an open-source JavaScript library to create web wirable interfaces for dataflow applications, visual programming languages, graphical modeling, or graph editors.
  • 36. Cascading GUI
  • 37. Live Demo
  • 38. THANK YOU!

Editor's notes

  1. Cascading and its extensions have their own Maven/Ivy Jar repository. This 1.2 release will run against Hadoop 0.19.x and 0.20.x, including Amazon Elastic MapReduce, and also against 0.21. Hive is a data warehouse infrastructure built on top of Hadoop that provides tools to enable easy data summarization, ad hoc querying, and analysis of large datasets stored in Hadoop files. At one level Cascading is a MapReduce query planner, just like Pig, except that the Cascading API is for public consumption and fully extensible; in Pig you typically interact with the Pig Latin text syntax. With Cascading, you can layer your own syntax on top of the API. Cascading's primary programming model is similar to Pig's, but with a Java API. Given a data set on which you want to run a number of group-bys (group by key1, generate value1, ... group by keyN, generate valueN), Pig would optimize from N to a smaller number (e.g. 1) of reduce runs. Oozie workflows are actions arranged in a control dependency DAG (Directed Acyclic Graph). Cascading runs as a client from the command line; Oozie is a server system (like the Hadoop JobTracker) to which you submit workflow jobs and later check the status.
  2. By providing a clean API to the core Cascading model, tools like Jython, Groovy, and JRuby can be used instead to define complex processing flows.
  3. The MapReduce Job Planner is an internal feature of Cascading. Every job is delimited by a temporary file that is the sink of the first job and then the source of the next job. The temporary file is deleted whether the flow succeeds or fails; however, this is configurable. If two or more Flow instances have no dependencies, they will be submitted together so they can execute in parallel. DAG: directed acyclic graph. Scheduling uses an internal graph that makes each Flow a 'vertex' and each file an 'edge'. When a vertex has all its incoming edges (files) available, it is scheduled on the cluster (topological order). By default, if any outputs from a Flow are newer than the inputs, the Flow is skipped. I can't customize the combiner and partitioner.
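The scheduling rule in this note (a Flow becomes runnable once every file it reads has been produced, and Flows with no mutual dependencies are submitted together) can be sketched in plain Java. This is only an illustration under stated assumptions, not Cascading's planner: the class `FlowScheduleSketch`, the method `waves`, and the flow names in the usage note are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, not Cascading code: flows are vertices, and an entry
// "b -> [a]" means flow b reads a file that flow a writes.
class FlowScheduleSketch {

    // Groups flows into waves in topological order. Every flow in a wave has
    // all of its inputs available, so the flows of one wave could run in parallel.
    static List<List<String>> waves(Map<String, List<String>> dependsOn) {
        Map<String, Integer> pending = new HashMap<>();        // unmet inputs per flow
        Map<String, List<String>> dependents = new HashMap<>(); // upstream -> downstream flows
        for (Map.Entry<String, List<String>> entry : dependsOn.entrySet()) {
            pending.put(entry.getKey(), entry.getValue().size());
            for (String upstream : entry.getValue()) {
                pending.putIfAbsent(upstream, 0);
                dependents.computeIfAbsent(upstream, k -> new ArrayList<>()).add(entry.getKey());
            }
        }

        List<String> ready = new ArrayList<>();
        for (Map.Entry<String, Integer> entry : pending.entrySet()) {
            if (entry.getValue() == 0) ready.add(entry.getKey());
        }

        List<List<String>> result = new ArrayList<>();
        int scheduled = 0;
        while (!ready.isEmpty()) {
            Collections.sort(ready); // deterministic order within a wave
            result.add(ready);
            scheduled += ready.size();
            List<String> next = new ArrayList<>();
            for (String finished : ready) {
                for (String waiting : dependents.getOrDefault(finished, Collections.<String>emptyList())) {
                    // one more input of 'waiting' is now available
                    if (pending.merge(waiting, -1, Integer::sum) == 0) next.add(waiting);
                }
            }
            ready = next;
        }
        if (scheduled != pending.size()) throw new IllegalStateException("dependency cycle");
        return result;
    }
}
```

For example, with two independent source flows `parse` and `whitelist`, a `join` flow that reads both outputs, and a `count` flow that reads the join output, `waves` returns three waves: the two source flows together, then `join`, then `count`.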
  4. 7 tools can parse the DOT file. DOT is a plain text graph description language. To see how your Flows are partitioned, call the Flow#writeDOT() method, which writes a DOT file. The writeDOT API isn't useful for logging.
  5. All Taps must have a Scheme associated with them. If the Tap is about where the data is and how to get it, the Scheme is about what the data is.
      • TextLine: reads and writes raw text files and returns Tuples with two field names by default, "offset" and "line".
      • TextDelimited: delimited text (csv, tsv, etc.).
      • SequenceFile: based on the Hadoop Sequence file, which is a binary format.
      • WritableSequenceFile: like the SequenceFile Scheme, except it was designed to read and write key and/or value Hadoop Writable objects directly.
  6. Tap types:
      • MultiSourceTap: the cascading.tap.MultiSourceTap is used to tie multiple Tap instances into a single Tap for use as an input source. The only restriction is that all the Tap instances passed to a new MultiSourceTap share the same Scheme classes (not necessarily the same Scheme instance).
      • MultiSinkTap: the cascading.tap.MultiSinkTap is used to tie multiple Tap instances into a single Tap for use as an output sink. During runtime, for every Tuple output by the pipe assembly, each child tap of the MultiSinkTap will sink the Tuple.
      • TemplateTap: can be used to write tuple streams out to subdirectories based on the values in the Tuple instance. The constructor takes an Hfs Tap and a Formatter format syntax String. This allows Tuple values at given positions to be used as directory names. Note that Hadoop can only sink to directories, and all files in those directories are "part-xxxxx" files. openTapsThreshold limits the number of open files to be output to. This value defaults to 300 files. Each time the threshold is exceeded, 10% of the least recently used open files will be closed.
          TextDelimited scheme = new TextDelimited( new Fields( "year", "month", "entry" ), "\\t" );
          Hfs tap = new Hfs( scheme, path );
          String template = "%s-%s"; // dirs named "year-month"
          Tap months = new TemplateTap( tap, template, SinkMode.REPLACE );
      • GlobHfs (extends MultiSourceTap): the cascading.tap.GlobHfs Tap accepts Hadoop style 'file globbing' expression patterns. This allows for multiple paths to be used as a single source, where all paths match the given pattern. Hadoop changed the semantics of file globbing with a PathFilter (using the globStatus method of FileSystem). Previously, the filtering was too restrictive, so that a glob of /*/* and a filter that only accepts /a/b would not have matched /a/b. With this change /a/b does match.
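The TemplateTap template in this note is a java.util.Formatter pattern applied to the leading Tuple values. A minimal plain-Java sketch of how "%s-%s" turns the "year" and "month" values into a directory name follows; the class `TemplatePathSketch` and its method are hypothetical and only mimic the formatting step, not the tap itself.

```java
// Hypothetical sketch, not Cascading code: shows only the Formatter step that
// maps tuple values to a subdirectory name.
class TemplatePathSketch {

    // Formats the tuple values with the template; values beyond the number of
    // format specifiers (here "entry") are ignored by String.format.
    static String subDirectory(String template, Object... tupleValues) {
        return String.format(template, tupleValues);
    }

    // Joins the parent path and the formatted subdirectory.
    static String sinkPath(String parent, String template, Object... tupleValues) {
        return parent + "/" + subDirectory(template, tupleValues);
    }
}
```

With the template "%s-%s", a tuple whose first two values are "2011" and "02" would land under a directory named "2011-02" below the parent path.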
  7. Sink modes:
      • SinkMode.KEEP: this is the default behavior. If the resource exists, attempting to write to it will fail.
      • SinkMode.REPLACE: this allows Cascading to delete the file immediately after the Flow is started.
      • SinkMode.UPDATE: allows for new Tap types that have the concept of update or append, for example updating records in a database. It is up to the Tap to decide how to implement its "update" semantics. When Cascading sees the update mode, it knows not to attempt to delete the resource first and not to fail because it already exists.
  8. Avro is a data serialization system. Avro provides functionality similar to systems such as Thrift and Protocol Buffers. Cascading.SimpleDB: integration with Amazon SimpleDB.
  9. It is not required that an Every follow either GroupBy or CoGroup; an Each may follow immediately after (for example: DISTINCT). But an Every may not follow an Each. The Each pipe may only apply Functions and Filters to the tuple stream, as these operations may only operate on one Tuple at a time. The Every pipe may only apply Aggregators and Buffers to the tuple stream, as these operations may only operate on groups of tuples, one grouping at a time. GroupBy supports ordering.
  10. Self joins are supported. Without declared result fields, the join on the slide would fail in practice, since the result Tuple would have duplicate field names. A Mixed join is where 3 or more tuple streams are joined, and each pair must be joined differently. See the cascading.pipe.cogroup.MixedJoin class for more details. When joining two streams via a CoGroup Pipe, attempt to place the largest of the streams in the leftmost argument to the CoGroup. Joining multiple streams requires some accumulation of values before the join operator can begin, but the leftmost stream will not be accumulated. This should improve the performance of most joins.
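The advice to put the largest stream leftmost follows from which side must be held in memory. The plain-Java sketch below is a deliberately simplified inner join, not CoGroup's actual reduce-side implementation: it accumulates only the right-hand rows per key and streams the left-hand rows past them. The class `InnerJoinSketch` and the two-column row layout are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, not Cascading code. Each row is {key, value}.
class InnerJoinSketch {

    static List<String> innerJoin(List<String[]> lhs, List<String[]> rhs) {
        // The right-hand side is accumulated per key before joining starts.
        Map<String, List<String>> rhsByKey = new HashMap<>();
        for (String[] row : rhs) {
            rhsByKey.computeIfAbsent(row[0], k -> new ArrayList<>()).add(row[1]);
        }

        // The left-hand side is only iterated once and never stored, which is
        // why the largest stream is best placed on the left.
        List<String> joined = new ArrayList<>();
        for (String[] row : lhs) {
            List<String> matches = rhsByKey.get(row[0]);
            if (matches == null) continue; // inner join: drop unmatched keys
            for (String match : matches) {
                joined.add(row[0] + "\t" + row[1] + "\t" + match);
            }
        }
        return joined;
    }
}
```

Here the output rows carry the key once; in the CoGroup example on the slide both "url" columns survive, which is why the declared fields rename them to "url1" and "url2".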
  11. Operation is a superclass of Function, Filter, Aggregator, Buffer, and Assertion. Function and Filter are Each operations. Aggregator and Buffer are Every operations. A custom operation usually extends the BaseOperation class.
  12. Built-in functions:
      • Identity Function: discard unused fields, rename all fields, rename a single field.
      • Debug: DebugLevel enum values are NONE, DEFAULT, or VERBOSE, e.g. FlowConnector.setDebugLevel( properties, DebugLevel.NONE );
      • Sample: the cascading.operation.filter.Sample filter allows a percentage of tuples to pass.
      • Limit: the cascading.operation.filter.Limit filter allows a set number of Tuples to pass.
      • Insert: used when some missing parameter or value, like a date String for the current date, needs to be inserted.
      • Text Functions: DateParser, DateFormatter.
      • Regular Expression Operations: RegexParser, RegexSplitter.
      • Java Expression Operations: ExpressionFunction, ExpressionFilter, e.g. ExpressionFilter filter = new ExpressionFilter( "status != 200", Integer.TYPE ); some characters in field names will cause compilation errors.
  13. Operations (Function, Filter, Aggregator, or Buffer) should not store operation state in class fields. For example, if implementing a custom 'counter' Aggregator, do not create a field named 'count' and increment it on every Aggregator.aggregate() call. There is no guarantee your Operation will be called from a single thread in a JVM. There is a context in which you can record the aggregation value. It's the same as in Hadoop.
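The difference between keeping a counter in a class field and keeping it in a per-group context can be sketched in plain Java. The class below is hypothetical and only mimics the start / aggregate / complete lifecycle of an aggregator; it does not use Cascading's own interfaces, whose exact signatures are not shown in these notes.

```java
// Hypothetical sketch, not Cascading code: the operation object holds no
// mutable state, so one instance can safely serve several groups or threads.
class CountContextSketch {

    // Mutable per-group state, created at the start of each grouping.
    static final class Context {
        long count;
    }

    // Called once per grouping: hand out a fresh context instead of
    // resetting a shared field.
    Context start() {
        return new Context();
    }

    // Called once per tuple in the grouping: update only the given context.
    void aggregate(Context context, Object value) {
        context.count++;
    }

    // Called at the end of the grouping: read the result from the context.
    long complete(Context context) {
        return context.count;
    }
}
```

Because the count lives in the context, two groupings processed in an interleaved order by the same `CountContextSketch` instance still produce independent results, which a shared `count` field would not guarantee.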
  14. A Buffer may only be used with an Every pipe, and it may only follow a GroupBy or CoGroup pipe type. It differs from an Aggregator in that an Iterator is provided, and it is the responsibility of the operate(cascading.flow.FlowProcess, BufferCall) method to iterate over all the input arguments returned by this Iterator, if any. Header, footer. document_id, term, term_count_in_document, total_terms_in_document.
  15. A Buffer may only be used with an Every pipe, and it may only follow a GroupBy or CoGroup pipe type. AggregateBy is a SubAssembly.
  16. Input and output schemas are verified before the flow runs. The start() method is an asynchronous call. A Properties object can be passed to FlowConnector, just as you would set a Hadoop JobConf.
  17. Riffle is a lightweight Java library for executing collections of dependent processes as a single process. This library provides Java Annotations for tagging classes and methods supporting the required life-cycle stages:
          import riffle.process.DependencyIncoming;
          import riffle.process.DependencyOutgoing;
          import riffle.process.ProcessCleanup;
          import riffle.process.ProcessComplete;
          import riffle.process.ProcessPrepare;
          import riffle.process.ProcessStart;
          import riffle.process.ProcessStop;
  18. Assertions aren’t pipes. When running tests against regression data, it makes sense to use strict assertions. This regression data should be small and represent many of the edge cases the processing assembly must support robustly. When running tests in staging, or with data that may vary in quality since it is from an unmanaged source, using validating assertions makes sense. Then there are obvious cases where assertions just get in the way and slow down processing, and it would be nice to just bypass them.
  19. Traps were not designed as a filtering mechanism.
  20. Available since version 1.2. Cascading does not support the so-called MapReduce Combiners. Combiners are very powerful in that they reduce the IO between the Mappers and Reducers: why send all your Mapper data to Reducers when you can compute some values Map side and combine them in the Reducer? But Combiners are limited to Associative and Commutative functions only, like 'sum' and 'max'. And in order to work, values emitted from the Map task must be serialized, sorted (deserialized and compared), deserialized again and operated on, where again the results are serialized and sorted. Combiners trade CPU for gains in IO. Cascading takes a different approach by providing a mechanism to perform partial aggregations Map side and also combine them Reduce side. But Cascading chooses to trade Memory for IO gains by caching values (up to a threshold). This approach bypasses the unnecessary serialization, deserialization, and sorting steps. It also allows for any aggregate function to be implemented, not just Associative and Commutative ones. Class AggregateBy is a SubAssembly that serves two roles for handling aggregate operations. Subclasses: AverageBy, CountBy, SumBy.
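The idea of trading memory for IO can be illustrated with a small plain-Java sketch: keep running sums in a bounded cache on the Map side, emit a partial sum whenever the cache overflows, and add the partials together on the Reduce side. This is an assumption-laden simplification, not the AggregateBy implementation; the class `PartialSumSketch`, its least-recently-used eviction policy, and the threshold value in the usage note are chosen only for illustration.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch, not Cascading code.
class PartialSumSketch {
    private final int threshold;
    // accessOrder = true keeps the least recently used key first
    private final LinkedHashMap<String, Long> cache = new LinkedHashMap<>(16, 0.75f, true);
    private final List<Map.Entry<String, Long>> emitted = new ArrayList<>();

    PartialSumSketch(int threshold) {
        this.threshold = threshold;
    }

    // "Map side": fold the value into the cached sum for its key; when the
    // cache grows past the threshold, emit and drop the least recently used sum.
    void add(String key, long value) {
        cache.merge(key, value, Long::sum);
        if (cache.size() > threshold) {
            Iterator<Map.Entry<String, Long>> it = cache.entrySet().iterator();
            Map.Entry<String, Long> eldest = it.next();
            emitted.add(new AbstractMap.SimpleEntry<>(eldest.getKey(), eldest.getValue()));
            it.remove();
        }
    }

    // End of the map task: emit whatever is still cached.
    List<Map.Entry<String, Long>> flush() {
        for (Map.Entry<String, Long> entry : cache.entrySet()) {
            emitted.add(new AbstractMap.SimpleEntry<>(entry.getKey(), entry.getValue()));
        }
        cache.clear();
        return emitted;
    }

    // "Reduce side": combine the partial sums per key.
    static Map<String, Long> reduce(List<Map.Entry<String, Long>> partials) {
        Map<String, Long> totals = new TreeMap<>();
        for (Map.Entry<String, Long> partial : partials) {
            totals.merge(partial.getKey(), partial.getValue(), Long::sum);
        }
        return totals;
    }
}
```

With a threshold of 2 and five input values spread over three keys, the sketch emits four partial sums instead of five raw values, and the reduce step still arrives at the exact totals. A larger threshold emits fewer partials at the cost of more memory.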
  21. FlowStats status values: FAILED, PENDING, RUNNING, SKIPPED, STOPPED, SUCCESSFUL.
  22. ClusterTestCase: MiniDFSCluster, MiniMRCluster, FileSystem. Functions: copyFromLocal, getFileSystem, getJobConf, getProperties. Limit will get half the records in version 1.1.
  23. WireIt supports Firefox 3.5 and above; it doesn’t work on Firefox 3.0. WireIt is released under the MIT License.