SlideShare une entreprise Scribd logo
1  sur  21
Télécharger pour lire hors ligne
The Cascading 
(big) data 
application framework 
André Kelpe | HUG France | Paris | 25. November 2014
Who am I? 
André Kelpe 
Senior Software Engineer at Concurrent 
company behind Cascading, Lingual and 
Driven 
http://concurrentinc.com / @concurrent 
andre@concurrentinc.com / @fs111
http://cascading.org 
Apache licensed Java framework for writing data 
oriented applications 
production ready, stable and battle proven 
(soundcloud, twitter, etsy, climate corp + many 
more)
Cascading goals 
developer productivity 
focus on business problems, not distributed 
systems knowledge 
useful abstractions over underlying „fabrics“
Cascading goals 
Testability & robustness 
production quality applications rather than a 
collection of scripts 
(hooks into the core for experts)
https://www.flickr.com/photos/theilr/4283377543/sizes/l
Cascading terminology 
Taps are sources and sinks for data 
Schemes represent the format of the data 
Pipes are connecting Taps
Cascading terminology 
● Tuples flow through Pipes 
● Fields describe the Tuples 
● Operations are executed on Tuples in 
TupleStreams 
● FlowConnector uses QueryPlanner to 
translate FlowDef into Flow to run on 
computational fabric
Compiler 
QueryPlanner 
FlowDef 
FlowDef 
FlowDef 
Hadoop 
FlowDef Tez 
Spark 
User Code Translation 
Optimization 
Assembly 
CPU Architecture
User-APIs 
● Fluid - A Fluent API for Cascading 
– Targeted at application writers 
– https://github.com/Cascading/fluid 
● „Raw“ Cascading API 
– Targeted for library writers, code generators, 
integration layers 
– https://github.com/Cascading/cascading
Counting words 
// configuration 
String docPath = args[ 0 ]; 
String wcPath = args[ 1 ]; 
Properties properties = new Properties(); 
AppProps.setApplicationJarClass( properties, Main.class ); 
FlowConnector flowConnector = new Hadoop2MR1FlowConnector( properties ); 
// create source and sink taps 
Tap docTap = new Hfs( new TextDelimited( true, "t" ), docPath ); 
Tap wcTap = new Hfs( new TextDelimited( true, "t" ), wcPath ); 
...
Counting words (cont.) 
// specify a regex operation to split the "document" text lines into a 
token stream 
Fields token = new Fields( "token" ); 
Fields text = new Fields( "text" ); 
RegexSplitGenerator splitter = 
new RegexSplitGenerator( token, "[ [](),.]" ); 
// only returns "token" 
Pipe docPipe = new Each( "token", text, splitter, Fields.RESULTS ); 
// determine the word counts 
Pipe wcPipe = new Pipe( "wc", docPipe ); 
wcPipe = new GroupBy( wcPipe, token ); 
wcPipe = new Every( wcPipe, Fields.ALL, new Count(), Fields.ALL ); 
...
Counting words (cont.) 
// connect the taps, pipes, etc., into a flow 
FlowDef flowDef = FlowDef.flowDef() 
.setName( "wc" ) 
.addSource( docPipe, docTap ) 
.addTailSink( wcPipe, wcTap ); 
Flow wcFlow = flowConnector.connect( flowDef ) 
wcFlow.complete(); // ← runs the code 
}
https://driven.cascading.io/driven/871A2C66DA1D 
4841B229CDD2B04B9FDA
Impatient 
Cascading for the Impatient 
http://docs.cascading.org/impatient/index.html
● Operations 
A full toolbox 
– Function 
– Filter 
– Regex/Scripts 
– Boolean operators 
– Count/Limit/Last/First 
– Scripts 
– Unique 
– Asserts 
– Min/Max 
– … 
● Splices 
– GroupBy 
– CoGroup 
– HashJoin 
– Merge 
● Joins 
Left, right, outer, inner, 
mixed...
A full toolbox 
data access: JDBC, HBase, elasticsearch, 
redshift, HDFS, S3, Cassandra... 
data formats: avro, thrift, protobuf, CSV, TSV... 
integration points: Cascading Lingual (SQL), 
Apache Hive, classical M/R apps.. 
not Java?: Scalding (Scala), Cascalog (clojure)
Status quo 
● Cascading 2.6 
– Production release 
● Hadoop 2.x 
● Hadoop 1.x 
● Local mode 
● Cascading 3.0 
– public wip builds 
● Tez 
● Hadoop 2.x 
● Hadoop 1.x 
● Local mode 
● Others (Spark...)
Questions? 
andre@concurrentinc.com
Link Collection 
http://www.cascading.org/ 
https://github.com/Cascading/ 
http://concurrentinc.com 
http://cascading.io/driven/ 
https://groups.google.com/forum/#!forum/cascading-user 
http://docs.cascading.org/impatient/ 
http://docs.cascading.org/cascading/2.6/userguide/html/
fin.

Contenu connexe

Tendances

Simplifying Big Data Analytics with Apache Spark
Simplifying Big Data Analytics with Apache SparkSimplifying Big Data Analytics with Apache Spark
Simplifying Big Data Analytics with Apache Spark
Databricks
 

Tendances (20)

Dive into Spark Streaming
Dive into Spark StreamingDive into Spark Streaming
Dive into Spark Streaming
 
Hugfr SPARK & RIAK -20160114_hug_france
Hugfr  SPARK & RIAK -20160114_hug_franceHugfr  SPARK & RIAK -20160114_hug_france
Hugfr SPARK & RIAK -20160114_hug_france
 
Creating data centric microservices
Creating data centric microservicesCreating data centric microservices
Creating data centric microservices
 
PSUG #52 Dataflow and simplified reactive programming with Akka-streams
PSUG #52 Dataflow and simplified reactive programming with Akka-streamsPSUG #52 Dataflow and simplified reactive programming with Akka-streams
PSUG #52 Dataflow and simplified reactive programming with Akka-streams
 
Data Pipelines with Apache Airflow
Data Pipelines with Apache AirflowData Pipelines with Apache Airflow
Data Pipelines with Apache Airflow
 
Are you a Tortoise or a Hare?
Are you a Tortoise or a Hare?Are you a Tortoise or a Hare?
Are you a Tortoise or a Hare?
 
Data Analytics with Apache Spark and Cassandra
Data Analytics with Apache Spark and CassandraData Analytics with Apache Spark and Cassandra
Data Analytics with Apache Spark and Cassandra
 
Sparkly Notebook: Interactive Analysis and Visualization with Spark
Sparkly Notebook: Interactive Analysis and Visualization with SparkSparkly Notebook: Interactive Analysis and Visualization with Spark
Sparkly Notebook: Interactive Analysis and Visualization with Spark
 
Understanding Spark Structured Streaming
Understanding Spark Structured StreamingUnderstanding Spark Structured Streaming
Understanding Spark Structured Streaming
 
Apache Airflow in Production
Apache Airflow in ProductionApache Airflow in Production
Apache Airflow in Production
 
DataEngConf SF16 - Spark SQL Workshop
DataEngConf SF16 - Spark SQL WorkshopDataEngConf SF16 - Spark SQL Workshop
DataEngConf SF16 - Spark SQL Workshop
 
Introduction to Akka Streams [Part-II]
Introduction to Akka Streams [Part-II]Introduction to Akka Streams [Part-II]
Introduction to Akka Streams [Part-II]
 
Introduction to Akka Streams [Part-I]
Introduction to Akka Streams [Part-I]Introduction to Akka Streams [Part-I]
Introduction to Akka Streams [Part-I]
 
Simplifying Big Data Analytics with Apache Spark
Simplifying Big Data Analytics with Apache SparkSimplifying Big Data Analytics with Apache Spark
Simplifying Big Data Analytics with Apache Spark
 
presto-at-netflix-hadoop-summit-15
presto-at-netflix-hadoop-summit-15presto-at-netflix-hadoop-summit-15
presto-at-netflix-hadoop-summit-15
 
Bellevue Big Data meetup: Dive Deep into Spark Streaming
Bellevue Big Data meetup: Dive Deep into Spark StreamingBellevue Big Data meetup: Dive Deep into Spark Streaming
Bellevue Big Data meetup: Dive Deep into Spark Streaming
 
Fluentd and Docker - running fluentd within a docker container
Fluentd and Docker - running fluentd within a docker containerFluentd and Docker - running fluentd within a docker container
Fluentd and Docker - running fluentd within a docker container
 
Airstream: Spark Streaming At Airbnb
Airstream: Spark Streaming At AirbnbAirstream: Spark Streaming At Airbnb
Airstream: Spark Streaming At Airbnb
 
Complex queries in a distributed multi-model database
Complex queries in a distributed multi-model databaseComplex queries in a distributed multi-model database
Complex queries in a distributed multi-model database
 
Lightning Fast Analytics with Cassandra and Spark
Lightning Fast Analytics with Cassandra and SparkLightning Fast Analytics with Cassandra and Spark
Lightning Fast Analytics with Cassandra and Spark
 

Similaire à The Cascading (big) data application framework

Data Processing with Cascading Java API on Apache Hadoop
Data Processing with Cascading Java API on Apache HadoopData Processing with Cascading Java API on Apache Hadoop
Data Processing with Cascading Java API on Apache Hadoop
Hikmat Dhamee
 
Sparkling Water 5 28-14
Sparkling Water 5 28-14Sparkling Water 5 28-14
Sparkling Water 5 28-14
Sri Ambati
 

Similaire à The Cascading (big) data application framework (20)

Boulder/Denver BigData: Cluster Computing with Apache Mesos and Cascading
Boulder/Denver BigData: Cluster Computing with Apache Mesos and CascadingBoulder/Denver BigData: Cluster Computing with Apache Mesos and Cascading
Boulder/Denver BigData: Cluster Computing with Apache Mesos and Cascading
 
Data Processing with Cascading Java API on Apache Hadoop
Data Processing with Cascading Java API on Apache HadoopData Processing with Cascading Java API on Apache Hadoop
Data Processing with Cascading Java API on Apache Hadoop
 
Enterprise Data Workflows with Cascading and Windows Azure HDInsight
Enterprise Data Workflows with Cascading and Windows Azure HDInsightEnterprise Data Workflows with Cascading and Windows Azure HDInsight
Enterprise Data Workflows with Cascading and Windows Azure HDInsight
 
Pattern - an open source project for migrating predictive models from SAS, et...
Pattern - an open source project for migrating predictive models from SAS, et...Pattern - an open source project for migrating predictive models from SAS, et...
Pattern - an open source project for migrating predictive models from SAS, et...
 
Kafka streams - From pub/sub to a complete stream processing platform
Kafka streams - From pub/sub to a complete stream processing platformKafka streams - From pub/sub to a complete stream processing platform
Kafka streams - From pub/sub to a complete stream processing platform
 
Clogeny Hadoop ecosystem - an overview
Clogeny Hadoop ecosystem - an overviewClogeny Hadoop ecosystem - an overview
Clogeny Hadoop ecosystem - an overview
 
Reducing Development Time for Production-Grade Hadoop Applications
Reducing Development Time for Production-Grade Hadoop ApplicationsReducing Development Time for Production-Grade Hadoop Applications
Reducing Development Time for Production-Grade Hadoop Applications
 
Hadoop ecosystem
Hadoop ecosystemHadoop ecosystem
Hadoop ecosystem
 
Functional programming
 for optimization problems 
in Big Data
Functional programming
  for optimization problems 
in Big DataFunctional programming
  for optimization problems 
in Big Data
Functional programming
 for optimization problems 
in Big Data
 
Building Scalable Data Pipelines - 2016 DataPalooza Seattle
Building Scalable Data Pipelines - 2016 DataPalooza SeattleBuilding Scalable Data Pipelines - 2016 DataPalooza Seattle
Building Scalable Data Pipelines - 2016 DataPalooza Seattle
 
Spark (Structured) Streaming vs. Kafka Streams
Spark (Structured) Streaming vs. Kafka StreamsSpark (Structured) Streaming vs. Kafka Streams
Spark (Structured) Streaming vs. Kafka Streams
 
Hadoop Summit: Pattern – an open source project for migrating predictive mode...
Hadoop Summit: Pattern – an open source project for migrating predictive mode...Hadoop Summit: Pattern – an open source project for migrating predictive mode...
Hadoop Summit: Pattern – an open source project for migrating predictive mode...
 
Sparkling Water 5 28-14
Sparkling Water 5 28-14Sparkling Water 5 28-14
Sparkling Water 5 28-14
 
Real Time Data Processing Using Spark Streaming
Real Time Data Processing Using Spark StreamingReal Time Data Processing Using Spark Streaming
Real Time Data Processing Using Spark Streaming
 
Real Time Data Processing using Spark Streaming | Data Day Texas 2015
Real Time Data Processing using Spark Streaming | Data Day Texas 2015Real Time Data Processing using Spark Streaming | Data Day Texas 2015
Real Time Data Processing using Spark Streaming | Data Day Texas 2015
 
OSCON 2013: Using Cascalog to build an app with City of Palo Alto Open Data
OSCON 2013: Using Cascalog to build an app with City of Palo Alto Open DataOSCON 2013: Using Cascalog to build an app with City of Palo Alto Open Data
OSCON 2013: Using Cascalog to build an app with City of Palo Alto Open Data
 
Using Cascalog to build an app with City of Palo Alto Open Data
Using Cascalog to build an app with City of Palo Alto Open DataUsing Cascalog to build an app with City of Palo Alto Open Data
Using Cascalog to build an app with City of Palo Alto Open Data
 
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
 
Hadoop ecosystem
Hadoop ecosystemHadoop ecosystem
Hadoop ecosystem
 
Apache Big Data EU 2016: Building Streaming Applications with Apache Apex
Apache Big Data EU 2016: Building Streaming Applications with Apache ApexApache Big Data EU 2016: Building Streaming Applications with Apache Apex
Apache Big Data EU 2016: Building Streaming Applications with Apache Apex
 

Plus de Modern Data Stack France

Plus de Modern Data Stack France (20)

Stash - Data FinOPS
Stash - Data FinOPSStash - Data FinOPS
Stash - Data FinOPS
 
Vue d'ensemble Dremio
Vue d'ensemble DremioVue d'ensemble Dremio
Vue d'ensemble Dremio
 
From Data Warehouse to Lakehouse
From Data Warehouse to LakehouseFrom Data Warehouse to Lakehouse
From Data Warehouse to Lakehouse
 
Talend spark meetup 03042017 - Paris Spark Meetup
Talend spark meetup 03042017 - Paris Spark MeetupTalend spark meetup 03042017 - Paris Spark Meetup
Talend spark meetup 03042017 - Paris Spark Meetup
 
Paris Spark Meetup - Trifacta - 03_04_2017
Paris Spark Meetup - Trifacta - 03_04_2017Paris Spark Meetup - Trifacta - 03_04_2017
Paris Spark Meetup - Trifacta - 03_04_2017
 
Hadoop meetup : HUGFR Construire le cluster le plus rapide pour l'analyse des...
Hadoop meetup : HUGFR Construire le cluster le plus rapide pour l'analyse des...Hadoop meetup : HUGFR Construire le cluster le plus rapide pour l'analyse des...
Hadoop meetup : HUGFR Construire le cluster le plus rapide pour l'analyse des...
 
HUG France Feb 2016 - Migration de données structurées entre Hadoop et RDBMS ...
HUG France Feb 2016 - Migration de données structurées entre Hadoop et RDBMS ...HUG France Feb 2016 - Migration de données structurées entre Hadoop et RDBMS ...
HUG France Feb 2016 - Migration de données structurées entre Hadoop et RDBMS ...
 
Hadoop France meetup Feb2016 : recommendations with spark
Hadoop France meetup  Feb2016 : recommendations with sparkHadoop France meetup  Feb2016 : recommendations with spark
Hadoop France meetup Feb2016 : recommendations with spark
 
Hug janvier 2016 -EDF
Hug   janvier 2016 -EDFHug   janvier 2016 -EDF
Hug janvier 2016 -EDF
 
HUG France - 20160114 industrialisation_process_big_data CanalPlus
HUG France -  20160114 industrialisation_process_big_data CanalPlusHUG France -  20160114 industrialisation_process_big_data CanalPlus
HUG France - 20160114 industrialisation_process_big_data CanalPlus
 
HUG France : HBase in Financial Industry par Pierre Bittner (Scaled Risk CTO)
HUG France : HBase in Financial Industry par Pierre Bittner (Scaled Risk CTO)HUG France : HBase in Financial Industry par Pierre Bittner (Scaled Risk CTO)
HUG France : HBase in Financial Industry par Pierre Bittner (Scaled Risk CTO)
 
Apache Flink par Bilal Baltagi Paris Spark Meetup Dec 2015
Apache Flink par Bilal Baltagi Paris Spark Meetup Dec 2015Apache Flink par Bilal Baltagi Paris Spark Meetup Dec 2015
Apache Flink par Bilal Baltagi Paris Spark Meetup Dec 2015
 
Datalab 101 (Hadoop, Spark, ElasticSearch) par Jonathan Winandy - Paris Spark...
Datalab 101 (Hadoop, Spark, ElasticSearch) par Jonathan Winandy - Paris Spark...Datalab 101 (Hadoop, Spark, ElasticSearch) par Jonathan Winandy - Paris Spark...
Datalab 101 (Hadoop, Spark, ElasticSearch) par Jonathan Winandy - Paris Spark...
 
Record linkage, a real use case with spark ml - Paris Spark meetup Dec 2015
Record linkage, a real use case with spark ml  - Paris Spark meetup Dec 2015Record linkage, a real use case with spark ml  - Paris Spark meetup Dec 2015
Record linkage, a real use case with spark ml - Paris Spark meetup Dec 2015
 
Spark dataframe
Spark dataframeSpark dataframe
Spark dataframe
 
June Spark meetup : search as recommandation
June Spark meetup : search as recommandationJune Spark meetup : search as recommandation
June Spark meetup : search as recommandation
 
Spark ML par Xebia (Spark Meetup du 11/06/2015)
Spark ML par Xebia (Spark Meetup du 11/06/2015)Spark ML par Xebia (Spark Meetup du 11/06/2015)
Spark ML par Xebia (Spark Meetup du 11/06/2015)
 
Spark meetup at viadeo
Spark meetup at viadeoSpark meetup at viadeo
Spark meetup at viadeo
 
Paris Spark meetup : Extension de Spark (Tachyon / Spark JobServer) par jlamiel
Paris Spark meetup : Extension de Spark (Tachyon / Spark JobServer) par jlamielParis Spark meetup : Extension de Spark (Tachyon / Spark JobServer) par jlamiel
Paris Spark meetup : Extension de Spark (Tachyon / Spark JobServer) par jlamiel
 
Hadoop User Group 29Jan2015 Apache Flink / Haven / CapGemnini REX
Hadoop User Group 29Jan2015 Apache Flink / Haven / CapGemnini REXHadoop User Group 29Jan2015 Apache Flink / Haven / CapGemnini REX
Hadoop User Group 29Jan2015 Apache Flink / Haven / CapGemnini REX
 

Dernier

Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
WSO2
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Safe Software
 

Dernier (20)

Six Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal OntologySix Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal Ontology
 
CNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In PakistanCNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In Pakistan
 
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
WSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering DevelopersWSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering Developers
 
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024
 
Exploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusExploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with Milvus
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectors
 
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challenges
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
 
Vector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptxVector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptx
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
 

The Cascading (big) data application framework

  • 1. The Cascading (big) data application framework André Kelpe | HUG France | Paris | 25. November 2014
  • 2. Who am I? André Kelpe Senior Software Engineer at Concurrent company behind Cascading, Lingual and Driven http://concurrentinc.com / @concurrent andre@concurrentinc.com / @fs111
  • 3. http://cascading.org Apache licensed Java framework for writing data oriented applications production ready, stable and battle proven (soundcloud, twitter, etsy, climate corp + many more)
  • 4. Cascading goals developer productivity focus on business problems, not distributed systems knowledge useful abstractions over underlying „fabrics“
  • 5. Cascading goals Testability & robustness production quality applications rather than a collection of scripts (hooks into the core for experts)
  • 7. Cascading terminology Taps are sources and sinks for data Schemes represent the format of the data Pipes are connecting Taps
  • 8. Cascading terminology ● Tuples flow through Pipes ● Fields describe the Tuples ● Operations are executed on Tuples in TupleStreams ● FlowConnector uses QueryPlanner to translate FlowDef into Flow to run on computational fabric
  • 9. Compiler QueryPlanner FlowDef FlowDef FlowDef Hadoop FlowDef Tez Spark User Code Translation Optimization Assembly CPU Architecture
  • 10. User-APIs ● Fluid - A Fluent API for Cascading – Targeted at application writers – https://github.com/Cascading/fluid ● „Raw“ Cascading API – Targeted for library writers, code generators, integration layers – https://github.com/Cascading/cascading
  • 11. Counting words // configuration String docPath = args[ 0 ]; String wcPath = args[ 1 ]; Properties properties = new Properties(); AppProps.setApplicationJarClass( properties, Main.class ); FlowConnector flowConnector = new Hadoop2MR1FlowConnector( properties ); // create source and sink taps Tap docTap = new Hfs( new TextDelimited( true, "t" ), docPath ); Tap wcTap = new Hfs( new TextDelimited( true, "t" ), wcPath ); ...
  • 12. Counting words (cont.) // specify a regex operation to split the "document" text lines into a token stream Fields token = new Fields( "token" ); Fields text = new Fields( "text" ); RegexSplitGenerator splitter = new RegexSplitGenerator( token, "[ [](),.]" ); // only returns "token" Pipe docPipe = new Each( "token", text, splitter, Fields.RESULTS ); // determine the word counts Pipe wcPipe = new Pipe( "wc", docPipe ); wcPipe = new GroupBy( wcPipe, token ); wcPipe = new Every( wcPipe, Fields.ALL, new Count(), Fields.ALL ); ...
  • 13. Counting words (cont.) // connect the taps, pipes, etc., into a flow FlowDef flowDef = FlowDef.flowDef() .setName( "wc" ) .addSource( docPipe, docTap ) .addTailSink( wcPipe, wcTap ); Flow wcFlow = flowConnector.connect( flowDef ) wcFlow.complete(); // ← runs the code }
  • 15. Impatient Cascading for the Impatient http://docs.cascading.org/impatient/index.html
  • 16. ● Operations A full toolbox – Function – Filter – Regex/Scripts – Boolean operators – Count/Limit/Last/First – Scripts – Unique – Asserts – Min/Max – … ● Splices – GroupBy – CoGroup – HashJoin – Merge ● Joins Left, right, outer, inner, mixed...
  • 17. A full toolbox data access: JDBC, HBase, elasticsearch, redshift, HDFS, S3, Cassandra... data formats: avro, thrift, protobuf, CSV, TSV... integration points: Cascading Lingual (SQL), Apache Hive, classical M/R apps.. not Java?: Scalding (Scala), Cascalog (clojure)
  • 18. Status quo ● Cascading 2.6 – Production release ● Hadoop 2.x ● Hadoop 1.x ● Local mode ● Cascading 3.0 – public wip builds ● Tez ● Hadoop 2.x ● Hadoop 1.x ● Local mode ● Others (Spark...)
  • 20. Link Collection http://www.cascading.org/ https://github.com/Cascading/ http://concurrentinc.com http://cascading.io/driven/ https://groups.google.com/forum/#!forum/cascading-user http://docs.cascading.org/impatient/ http://docs.cascading.org/cascading/2.6/userguide/html/
  • 21. fin.