SlideShare une entreprise Scribd logo
1  sur  55
MapReduce
with Scalding
Antonios Chalkiopoulos
24th Big Data London Meetup
Scalding.io
$ whoami
Scalding.io
http://scalding.io
http://github.com/scalding-io
@chalkiopoulos
My recent achievement..
Scalding.io
What are we gonna talk about..?
Scalding.io
Scalding.io
A Scala API on top of Cascading
Scalding.io
But what is ?
Scalding.io
A few years ago I started on a
fresh Big Data team…
Scalding.io
How do we efficiently develop
MapReduce jobs for our new
hadoop cluster ?
Scalding.io
MapReduce Techs
Scalding.io
Java MapReduce
Hadoop
abstraction
ws
Java MapReduce
Word count example
MapReduce Techs
Scalding.io
Java MapReduce
Pig Hive
Hadoop
Cascading Others
abstraction
The promise of Cascading
Scalding.io
[1]
A simple, high level java API for
MapReduce easy to understand
and work with.
Scalding.io
[2]
Extensions to
MANY platforms
Scalding.io
Scalding.io
Cascading
NoSQL
Databases
SQL
Databases
Hadoop
Filesystem
Local
Filesystem
In memory
systems
Search
Platforms
 MongoDB
 Cassandra
 HBASE
 Accumulo
…
 ElasticSearch
 Solr
…
 Redis
 Memcached
…
How it works?
Scalding.io
A pipeline architecture
Scalding.io
Scalding.io
data
data
data
Source tap
data
data
Sinktap
Scalding.io
Log files
Customer Data
Log & Customer
Final
Results
Log files
Log files
Customer Data
Results
Results
Cascading Example
Scalding.io
Word count in Cascading
1. public class WordCount {
2. public static void main(String[] args) {
3. Properties properties = new Properties();
4. FlowConnector.setApplicationJarClass (properties, WordCount.class);
5. Scheme sourceScheme = new TextLine (new Fields(“line”));
6. Scheme sinkScheme = new TextLine (new Fields(“word”,”count”));
7. Tap source = new Hfs( sourceScheme, args[0]);
8. Tap sink = new Hfs( sinkScheme, args[1], SinkMode.REPLACE );
9. Pipe assembly = new Pipe(“ Word Count “);
10. String regex = “(?>!pL)(?=pL)[^ ]*(?<=pL)(?!pL)”;
11. Function function = new RegexGenerator( new Fields(“word”), regex);
12. assembly = new Each( assembly, new Fields(“line”), function );
13. assembly = new GroupBy( assembly, new Fields(“word”) );
14. Aggregator count = new Count(new Fields(“count”) );
15. assembly = new Every( assembly, count );
16. FlowConnector flowConnector = new FlowConnector( properties );
17. Flow flow = flowConnector.connect(“word-count”, source, sink, assembly);
18. flow.complete();
19. }
20. }
Scalding.io
70% less boilerplate code
But still some infrastructure code
Scalding.io
Scalding.io
No boilerplate code at all
Functional
Robust & Scalable
Run on JVM
Here it comes 
Scalding.io
Java MapReduce
Pig Hive
Hadoop
Cascading Others
abstraction
Scalding
The power of Scala on top of
Cascading
Scalding.io
Scala fits naturally with data
Scalding.io
Word count in Scalding
Scalding.io
1. import com.twitter.scalding._
2. class WordCountJob(args : Args) extends Job(args) {
3. TextLine("input.txt”).read
4. .flatMap('line -> 'word) { line : String => line.split("s+") }
5. .groupBy('word) { _.size }
6. .write( Tsv(”results.tsv”) )
7. }
Map
phase
Reduce
phase
4
Who is using it?
Scalding.io
Many many others…
Scalding…
…open sourced by twitter at 2011
…has more than 100 open source contributors
…exposes the right abstractions
…maximizes expressiveness
…promotes extensibility
…adds new capabilities to Cascading
Scalding.io
Core Concepts
Scalding.io
Sources & Sinks
1. Tsv("data.tsv", ('productID,'price,'quantity))
2. .read
3. .write(UnpackedAvroSource("data.avro”))
Scalding.io
Tsv
Csv
Osv
Avro
Parquet
…
Map Operations
Scalding.io
1. pipe1.filter ('age) { age:Int => age > 18 }
2. pipe1.map ('price -> ’withVAT) { price:Double => price * 1.2 }
3. pipe1.project('name, 'surname)
15 map
operations
translated into
map phases
Join operations
1. pipe1.joinWithSmaller('productId -> 'productId, pipe2)
2. pipe1.joinWithLarger ('productId -> 'productId, pipe2)
3. pipe1.joinWithTiny ('productId -> 'productId, pipe2)
Scalding.io
Optimize by hinting the relative
sizes
Supports Left, Right, Inner,
Outer Joins
1. pipe1
2. .joinWithSmaller('productId -> 'productId, pipe2,
3. joiner=new LeftJoin)
Group operations
1. val pipe = Tsv(“input”, ('shopId, 'itemId, 'quantity))
2. .groupBy('shopId) {
3. _.sum[Long]('quantity-> 'totalSoldItems)
4. }
5. .write(Tsv(“results.tsv”))
Scalding.io
Group by particular
fields
.groupBy
.groupAll Group all data
Pipe operations
1. val p = (pipe1 ++ pipe2) // Concatenate 2 pipes
2. .debug // Print sample data to screen
3. .addTrap(Tsv(“bogus_lines”) // dirty data are recorded
Scalding.io
Simple pipe operations
Connect with external systems
Scalding.io
Scalding + Hive
1. class HiveExample (args: Args) extends Job(args) {
2. val USER_SCHEMA = List('userId, 'username, 'photo)
3. HiveSource("myHiveTable", SinkMode.KEEP)
4. .withHCatScheme(osvInputScheme(fields = USER_SCHEMA))
5. .write(Tsv("outputFromHive"))
6. }
Scalding.io
Define the schemaQuery Hcatalog
Read directly from
HDFS
Scalding + ElasticSearch
1. val schema = List('number, 'product, 'description)
2. val readES = ElasticSearchTap("localhost", 9200,"index firstType","",
schema).read.write(Tsv("data/es-out.tsv"))
3. val writeES = Tsv("data.tsv”).read.write(ElasticSearchTap
("localhost”, 9200,"index/secondType","", schema))
Scalding.io
Read from
ElasticSearch in one
line!Also index new data
in ES
Design patterns
Scalding.io
Dependency Injection
Late bound
External Operations
How about defining external operations?
Scalding.io
1. val pipe1 = Tsv(“omniture.tsv”,OMNITURE_SCHEMA)
2. .read
3. .ETLOmnitureData
4. .calculateOmnitureUserStats
5. .joinWithCustomerDB('userId->'userId, customerPipe)
6. .write(Tsv(“omniture-results.tsv”))
Custom operations:
 Re-usable modular code
 Single responsibility
 TestabilityFull-code
http://bit.ly/1pNSUKf
Scalding Testing
Scalding.io
Testing challenges in the context of MR
Scalding.io
Acceptance Tests
Unit – Component Tests
System Tests
Integration Tests Scalding enables
testing in every layer
&
TDD
example
Scalding.io
1. class TsvWordCountJobTest extends FlatSpec
2. with ShouldMatchers with TuppleConversions {
3. “WordCountJob” should “count words” in {
4. JobTest(new WordCountJob(_))
5. .args(“input”,”inFile”)
6. .args(“output”,”outFile”)
7. .source(TextLine(“inFile”), List(“0”) -> “cool Scala cool”))
8. .sink[(String,Int)](Tsv(“outFile”)) { out =>
9. out.toList should contain (“cool” -> 2)
10. }
11. .run
12. .finish
13. }
14. }
Replaces taps
with in-memory
collections
and asserts the
expected output
Monitoring
Scalding.io
“Driven takes Cascading
application development to the
next level with management and
monitoring capabilities for your
apps”
Scalding.io
http://driven.cascading.io
Scalding.io
Collects telemetry data and expose through a Web UI
Advanced Concepts
Scalding.io
Scalding adds
 Typed API
 Matrix API
 Graphs
 Machine Learning Algorithm
Scalding.io
What the future like?
Scalding.io
So far…
Scalding.io
abstraction
Real TimeBatch Hybrid
Scalding.io
abstraction
Summingbird
A unified API for everything
StormTEZ Spark
Enables the Lambda architecture
Scalding.io
Questions?

Contenu connexe

Tendances

Spark schema for free with David Szakallas
Spark schema for free with David SzakallasSpark schema for free with David Szakallas
Spark schema for free with David SzakallasDatabricks
 
Sparkling Water Meetup
Sparkling Water MeetupSparkling Water Meetup
Sparkling Water MeetupSri Ambati
 
WattGo: Analyses temps-réél de series temporelles avec Spark et Solr (Français)
WattGo: Analyses temps-réél de series temporelles avec Spark et Solr (Français)WattGo: Analyses temps-réél de series temporelles avec Spark et Solr (Français)
WattGo: Analyses temps-réél de series temporelles avec Spark et Solr (Français)DataStax Academy
 
Wprowadzenie do technologi Big Data i Apache Hadoop
Wprowadzenie do technologi Big Data i Apache HadoopWprowadzenie do technologi Big Data i Apache Hadoop
Wprowadzenie do technologi Big Data i Apache HadoopSages
 
Spark Summit EU talk by Ted Malaska
Spark Summit EU talk by Ted MalaskaSpark Summit EU talk by Ted Malaska
Spark Summit EU talk by Ted MalaskaSpark Summit
 
Works with persistent graphs using OrientDB
Works with persistent graphs using OrientDB Works with persistent graphs using OrientDB
Works with persistent graphs using OrientDB graphdevroom
 
Spark with Elasticsearch
Spark with ElasticsearchSpark with Elasticsearch
Spark with ElasticsearchHolden Karau
 
Hadoop meetup : HUGFR Construire le cluster le plus rapide pour l'analyse des...
Hadoop meetup : HUGFR Construire le cluster le plus rapide pour l'analyse des...Hadoop meetup : HUGFR Construire le cluster le plus rapide pour l'analyse des...
Hadoop meetup : HUGFR Construire le cluster le plus rapide pour l'analyse des...Modern Data Stack France
 
Interactive Session on Sparkling Water
Interactive Session on Sparkling WaterInteractive Session on Sparkling Water
Interactive Session on Sparkling WaterSri Ambati
 
Building a Scalable Distributed Stats Infrastructure with Storm and KairosDB
Building a Scalable Distributed Stats Infrastructure with Storm and KairosDBBuilding a Scalable Distributed Stats Infrastructure with Storm and KairosDB
Building a Scalable Distributed Stats Infrastructure with Storm and KairosDBCody Ray
 
Spark Schema For Free with David Szakallas
 Spark Schema For Free with David Szakallas Spark Schema For Free with David Szakallas
Spark Schema For Free with David SzakallasDatabricks
 
Codestrong 2012 breakout session hacking titanium
Codestrong 2012 breakout session   hacking titaniumCodestrong 2012 breakout session   hacking titanium
Codestrong 2012 breakout session hacking titaniumAxway Appcelerator
 
Jupyter Notebooks for machine learning on Kubernetes & OpenShift | DevNation ...
Jupyter Notebooks for machine learning on Kubernetes & OpenShift | DevNation ...Jupyter Notebooks for machine learning on Kubernetes & OpenShift | DevNation ...
Jupyter Notebooks for machine learning on Kubernetes & OpenShift | DevNation ...Red Hat Developers
 
Barbara Nelson [InfluxData] | How Can I Put That Dashboard in My App? | Influ...
Barbara Nelson [InfluxData] | How Can I Put That Dashboard in My App? | Influ...Barbara Nelson [InfluxData] | How Can I Put That Dashboard in My App? | Influ...
Barbara Nelson [InfluxData] | How Can I Put That Dashboard in My App? | Influ...InfluxData
 
Cassandra + Spark (You’ve got the lighter, let’s start a fire)
Cassandra + Spark (You’ve got the lighter, let’s start a fire)Cassandra + Spark (You’ve got the lighter, let’s start a fire)
Cassandra + Spark (You’ve got the lighter, let’s start a fire)Robert Stupp
 
Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Data Provenance Support in...
Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Data Provenance Support in...Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Data Provenance Support in...
Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Data Provenance Support in...Data Con LA
 
Webinar: Building Your First App in Node.js
Webinar: Building Your First App in Node.jsWebinar: Building Your First App in Node.js
Webinar: Building Your First App in Node.jsMongoDB
 
Joining the Club: Using Spark to Accelerate Big Data at Dollar Shave Club
Joining the Club: Using Spark to Accelerate Big Data at Dollar Shave ClubJoining the Club: Using Spark to Accelerate Big Data at Dollar Shave Club
Joining the Club: Using Spark to Accelerate Big Data at Dollar Shave ClubData Con LA
 

Tendances (20)

Spark schema for free with David Szakallas
Spark schema for free with David SzakallasSpark schema for free with David Szakallas
Spark schema for free with David Szakallas
 
Sparkling Water Meetup
Sparkling Water MeetupSparkling Water Meetup
Sparkling Water Meetup
 
WattGo: Analyses temps-réél de series temporelles avec Spark et Solr (Français)
WattGo: Analyses temps-réél de series temporelles avec Spark et Solr (Français)WattGo: Analyses temps-réél de series temporelles avec Spark et Solr (Français)
WattGo: Analyses temps-réél de series temporelles avec Spark et Solr (Français)
 
Wprowadzenie do technologi Big Data i Apache Hadoop
Wprowadzenie do technologi Big Data i Apache HadoopWprowadzenie do technologi Big Data i Apache Hadoop
Wprowadzenie do technologi Big Data i Apache Hadoop
 
R and cpp
R and cppR and cpp
R and cpp
 
R and C++
R and C++R and C++
R and C++
 
Spark Summit EU talk by Ted Malaska
Spark Summit EU talk by Ted MalaskaSpark Summit EU talk by Ted Malaska
Spark Summit EU talk by Ted Malaska
 
Works with persistent graphs using OrientDB
Works with persistent graphs using OrientDB Works with persistent graphs using OrientDB
Works with persistent graphs using OrientDB
 
Spark with Elasticsearch
Spark with ElasticsearchSpark with Elasticsearch
Spark with Elasticsearch
 
Hadoop meetup : HUGFR Construire le cluster le plus rapide pour l'analyse des...
Hadoop meetup : HUGFR Construire le cluster le plus rapide pour l'analyse des...Hadoop meetup : HUGFR Construire le cluster le plus rapide pour l'analyse des...
Hadoop meetup : HUGFR Construire le cluster le plus rapide pour l'analyse des...
 
Interactive Session on Sparkling Water
Interactive Session on Sparkling WaterInteractive Session on Sparkling Water
Interactive Session on Sparkling Water
 
Building a Scalable Distributed Stats Infrastructure with Storm and KairosDB
Building a Scalable Distributed Stats Infrastructure with Storm and KairosDBBuilding a Scalable Distributed Stats Infrastructure with Storm and KairosDB
Building a Scalable Distributed Stats Infrastructure with Storm and KairosDB
 
Spark Schema For Free with David Szakallas
 Spark Schema For Free with David Szakallas Spark Schema For Free with David Szakallas
Spark Schema For Free with David Szakallas
 
Codestrong 2012 breakout session hacking titanium
Codestrong 2012 breakout session   hacking titaniumCodestrong 2012 breakout session   hacking titanium
Codestrong 2012 breakout session hacking titanium
 
Jupyter Notebooks for machine learning on Kubernetes & OpenShift | DevNation ...
Jupyter Notebooks for machine learning on Kubernetes & OpenShift | DevNation ...Jupyter Notebooks for machine learning on Kubernetes & OpenShift | DevNation ...
Jupyter Notebooks for machine learning on Kubernetes & OpenShift | DevNation ...
 
Barbara Nelson [InfluxData] | How Can I Put That Dashboard in My App? | Influ...
Barbara Nelson [InfluxData] | How Can I Put That Dashboard in My App? | Influ...Barbara Nelson [InfluxData] | How Can I Put That Dashboard in My App? | Influ...
Barbara Nelson [InfluxData] | How Can I Put That Dashboard in My App? | Influ...
 
Cassandra + Spark (You’ve got the lighter, let’s start a fire)
Cassandra + Spark (You’ve got the lighter, let’s start a fire)Cassandra + Spark (You’ve got the lighter, let’s start a fire)
Cassandra + Spark (You’ve got the lighter, let’s start a fire)
 
Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Data Provenance Support in...
Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Data Provenance Support in...Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Data Provenance Support in...
Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Data Provenance Support in...
 
Webinar: Building Your First App in Node.js
Webinar: Building Your First App in Node.jsWebinar: Building Your First App in Node.js
Webinar: Building Your First App in Node.js
 
Joining the Club: Using Spark to Accelerate Big Data at Dollar Shave Club
Joining the Club: Using Spark to Accelerate Big Data at Dollar Shave ClubJoining the Club: Using Spark to Accelerate Big Data at Dollar Shave Club
Joining the Club: Using Spark to Accelerate Big Data at Dollar Shave Club
 

Similaire à Scalding Presentation

Intro to Spark and Spark SQL
Intro to Spark and Spark SQLIntro to Spark and Spark SQL
Intro to Spark and Spark SQLjeykottalam
 
4Developers 2018: Pyt(h)on vs słoń: aktualny stan przetwarzania dużych danych...
4Developers 2018: Pyt(h)on vs słoń: aktualny stan przetwarzania dużych danych...4Developers 2018: Pyt(h)on vs słoń: aktualny stan przetwarzania dużych danych...
4Developers 2018: Pyt(h)on vs słoń: aktualny stan przetwarzania dużych danych...PROIDEA
 
Introduction to Scalding and Monoids
Introduction to Scalding and MonoidsIntroduction to Scalding and Monoids
Introduction to Scalding and MonoidsHugo Gävert
 
Monitoring Spark Applications
Monitoring Spark ApplicationsMonitoring Spark Applications
Monitoring Spark ApplicationsTzach Zohar
 
Writing Hadoop Jobs in Scala using Scalding
Writing Hadoop Jobs in Scala using ScaldingWriting Hadoop Jobs in Scala using Scalding
Writing Hadoop Jobs in Scala using ScaldingToni Cebrián
 
Machine Learning with H2O, Spark, and Python at Strata 2015
Machine Learning with H2O, Spark, and Python at Strata 2015Machine Learning with H2O, Spark, and Python at Strata 2015
Machine Learning with H2O, Spark, and Python at Strata 2015Sri Ambati
 
Big Data Processing with .NET and Spark (SQLBits 2020)
Big Data Processing with .NET and Spark (SQLBits 2020)Big Data Processing with .NET and Spark (SQLBits 2020)
Big Data Processing with .NET and Spark (SQLBits 2020)Michael Rys
 
Building a Unified Data Pipline in Spark / Apache Sparkを用いたBig Dataパイプラインの統一
Building a Unified Data Pipline in Spark / Apache Sparkを用いたBig Dataパイプラインの統一Building a Unified Data Pipline in Spark / Apache Sparkを用いたBig Dataパイプラインの統一
Building a Unified Data Pipline in Spark / Apache Sparkを用いたBig Dataパイプラインの統一scalaconfjp
 
Data Summer Conf 2018, “Mist – Serverless proxy for Apache Spark (RUS)” — Vad...
Data Summer Conf 2018, “Mist – Serverless proxy for Apache Spark (RUS)” — Vad...Data Summer Conf 2018, “Mist – Serverless proxy for Apache Spark (RUS)” — Vad...
Data Summer Conf 2018, “Mist – Serverless proxy for Apache Spark (RUS)” — Vad...Provectus
 
Lightning fast analytics with Spark and Cassandra
Lightning fast analytics with Spark and CassandraLightning fast analytics with Spark and Cassandra
Lightning fast analytics with Spark and Cassandranickmbailey
 
Lightning fast analytics with Cassandra and Spark
Lightning fast analytics with Cassandra and SparkLightning fast analytics with Cassandra and Spark
Lightning fast analytics with Cassandra and SparkVictor Coustenoble
 
ScalaTo July 2019 - No more struggles with Apache Spark workloads in production
ScalaTo July 2019 - No more struggles with Apache Spark workloads in productionScalaTo July 2019 - No more struggles with Apache Spark workloads in production
ScalaTo July 2019 - No more struggles with Apache Spark workloads in productionChetan Khatri
 
Jump Start into Apache® Spark™ and Databricks
Jump Start into Apache® Spark™ and DatabricksJump Start into Apache® Spark™ and Databricks
Jump Start into Apache® Spark™ and DatabricksDatabricks
 
JavaScript Growing Up
JavaScript Growing UpJavaScript Growing Up
JavaScript Growing UpDavid Padbury
 
Apache Spark, the Next Generation Cluster Computing
Apache Spark, the Next Generation Cluster ComputingApache Spark, the Next Generation Cluster Computing
Apache Spark, the Next Generation Cluster ComputingGerger
 

Similaire à Scalding Presentation (20)

Intro to Spark and Spark SQL
Intro to Spark and Spark SQLIntro to Spark and Spark SQL
Intro to Spark and Spark SQL
 
4Developers 2018: Pyt(h)on vs słoń: aktualny stan przetwarzania dużych danych...
4Developers 2018: Pyt(h)on vs słoń: aktualny stan przetwarzania dużych danych...4Developers 2018: Pyt(h)on vs słoń: aktualny stan przetwarzania dużych danych...
4Developers 2018: Pyt(h)on vs słoń: aktualny stan przetwarzania dużych danych...
 
Introduction to Scalding and Monoids
Introduction to Scalding and MonoidsIntroduction to Scalding and Monoids
Introduction to Scalding and Monoids
 
Monitoring Spark Applications
Monitoring Spark ApplicationsMonitoring Spark Applications
Monitoring Spark Applications
 
Writing Hadoop Jobs in Scala using Scalding
Writing Hadoop Jobs in Scala using ScaldingWriting Hadoop Jobs in Scala using Scalding
Writing Hadoop Jobs in Scala using Scalding
 
Machine Learning with H2O, Spark, and Python at Strata 2015
Machine Learning with H2O, Spark, and Python at Strata 2015Machine Learning with H2O, Spark, and Python at Strata 2015
Machine Learning with H2O, Spark, and Python at Strata 2015
 
Big Data Processing with .NET and Spark (SQLBits 2020)
Big Data Processing with .NET and Spark (SQLBits 2020)Big Data Processing with .NET and Spark (SQLBits 2020)
Big Data Processing with .NET and Spark (SQLBits 2020)
 
Building a Unified Data Pipline in Spark / Apache Sparkを用いたBig Dataパイプラインの統一
Building a Unified Data Pipline in Spark / Apache Sparkを用いたBig Dataパイプラインの統一Building a Unified Data Pipline in Spark / Apache Sparkを用いたBig Dataパイプラインの統一
Building a Unified Data Pipline in Spark / Apache Sparkを用いたBig Dataパイプラインの統一
 
Jet presentation
Jet presentationJet presentation
Jet presentation
 
Scala+data
Scala+dataScala+data
Scala+data
 
Meetup spark structured streaming
Meetup spark structured streamingMeetup spark structured streaming
Meetup spark structured streaming
 
Data Summer Conf 2018, “Mist – Serverless proxy for Apache Spark (RUS)” — Vad...
Data Summer Conf 2018, “Mist – Serverless proxy for Apache Spark (RUS)” — Vad...Data Summer Conf 2018, “Mist – Serverless proxy for Apache Spark (RUS)” — Vad...
Data Summer Conf 2018, “Mist – Serverless proxy for Apache Spark (RUS)” — Vad...
 
Lightning fast analytics with Spark and Cassandra
Lightning fast analytics with Spark and CassandraLightning fast analytics with Spark and Cassandra
Lightning fast analytics with Spark and Cassandra
 
Lightning fast analytics with Cassandra and Spark
Lightning fast analytics with Cassandra and SparkLightning fast analytics with Cassandra and Spark
Lightning fast analytics with Cassandra and Spark
 
ScalaTo July 2019 - No more struggles with Apache Spark workloads in production
ScalaTo July 2019 - No more struggles with Apache Spark workloads in productionScalaTo July 2019 - No more struggles with Apache Spark workloads in production
ScalaTo July 2019 - No more struggles with Apache Spark workloads in production
 
Jump Start into Apache® Spark™ and Databricks
Jump Start into Apache® Spark™ and DatabricksJump Start into Apache® Spark™ and Databricks
Jump Start into Apache® Spark™ and Databricks
 
JavaScript Growing Up
JavaScript Growing UpJavaScript Growing Up
JavaScript Growing Up
 
Spark devoxx2014
Spark devoxx2014Spark devoxx2014
Spark devoxx2014
 
Osd ctw spark
Osd ctw sparkOsd ctw spark
Osd ctw spark
 
Apache Spark, the Next Generation Cluster Computing
Apache Spark, the Next Generation Cluster ComputingApache Spark, the Next Generation Cluster Computing
Apache Spark, the Next Generation Cluster Computing
 

Dernier

🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUK Journal
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century educationjfdjdjcjdnsjd
 
Tech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfTech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfhans926745
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherRemote DBA Services
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsJoaquim Jorge
 

Dernier (20)

🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Tech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfTech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdf
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 

Scalding Presentation

Notes de l'éditeur

  1. Monday, June 30, 2014 6:30 PM to 9:30 PM Barclays Accelerator 69-89 Mile End Road, E1 4UJ, London http://www.meetup.com/big-data-london/events/188925412/
  2. Here are my contact details. Find a number of open-source projects related to Hadoop & MapReduce that I have been contributing on GitHub And also the technical blog http://scalding.io
  3. First book ever available on Scala + MapReduce + Hadoop Comes with hundreds of ready to run examples Book @ Amazon = http://amazon.co.uk/dp/1783287012 Book @ PACKT = http://packtpub.com/programming-mapreduce-with-scalding/book GitHub repository with examples = https://github.com/scalding-io/ProgrammingWithScalding
  4. http://github.com/twitter/scalding
  5. Once upon a time..
  6. Hadoop provides HDFS for the distributed storage of large files and services for coordinated execution of MapReduce tasks The Java MapReduce API is very verbose and 70 lines of code for a simple WordCount example
  7. Hadoop provides HDFS for the distributed storage of large files and services for coordinated execution of MapReduce tasks The Java MapReduce API is very verbose and 70 lines of code for a simple WordCount example
  8. In-memory systems i.e. memcached , redis etc Document Databases Search systems
  9. Explain: Taps, Tuples, pipes
  10. Show parallelism
  11. Cascading word count requires 20 lines Java MapReduce API word count requires 70 lines = We manage to remove 50 lines of code (70%) by using Cascading
  12. Is this what adds on top of cascading ?
  13. Parquet => Efficient columnar storage For a Scalding application to execute all defined input and output taps must participate in the pipeline. Reading & Writing files
  14. // 15 map operations – that are translated into map phases
  15. How many items does each of our shops sell ?
  16. Code reference - https://github.com/scalding-io/ProgrammingWithScalding/tree/master/chapter5/src/main/scala/externaloperations
  17. Testing is challeging in the context of MapReduce and its challenges
  18. Driven takes Cascading application development to the next level with management and monitoring capabilities for your Cascading apps
  19. Driven takes Cascading application development to the next level with management and monitoring capabilities for your Cascading apps
  20. This is the stack presented so far