Scalding Presentation

MapReduce
with Scalding
Antonios Chalkiopoulos
24th Big Data London Meetup
Scalding.io

$ whoami
Scalding.io
http://scalding.io
http://github.com/scalding-io
@chalkiopoulos

My recent achievement..

What are we gonna talk about..?

A Scala API on top of Cascading

A few years ago I started on a
fresh Big Data team…

How do we efficiently develop
MapReduce jobs for our new
Hadoop cluster?

MapReduce Techs
Java MapReduce
Hadoop
abstraction

Java MapReduce
Word count example

MapReduce Techs
Java MapReduce
Pig Hive
Hadoop
Cascading Others
abstraction

The promise of Cascading

[1]
A simple, high-level Java API for
MapReduce that is easy to understand
and work with.

[2]
Extensions to
MANY platforms

Cascading
NoSQL databases: MongoDB, Cassandra, HBase, Accumulo, …
SQL databases
Hadoop filesystem
Local filesystem
In-memory systems: Redis, Memcached, …
Search platforms: ElasticSearch, Solr, …

A pipeline architecture

[Diagram: data enters through a source tap, flows through the pipeline, and leaves through a sink tap]

[Diagram: log files and customer data are combined ("Log & Customer") and processed into final results]

Word count in Cascading
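// Imports assume the Cascading 1.x package layout.
import java.util.Properties;
import cascading.flow.Flow;
import cascading.flow.FlowConnector;
import cascading.operation.Aggregator;
import cascading.operation.Function;
import cascading.operation.aggregator.Count;
import cascading.operation.regex.RegexGenerator;
import cascading.pipe.Each;
import cascading.pipe.Every;
import cascading.pipe.GroupBy;
import cascading.pipe.Pipe;
import cascading.scheme.Scheme;
import cascading.scheme.TextLine;
import cascading.tap.Hfs;
import cascading.tap.SinkMode;
import cascading.tap.Tap;
import cascading.tuple.Fields;

public class WordCount {
  public static void main(String[] args) {
    Properties properties = new Properties();
    FlowConnector.setApplicationJarClass(properties, WordCount.class);
    // Source reads lines of text; sink writes (word, count) pairs.
    Scheme sourceScheme = new TextLine(new Fields("line"));
    Scheme sinkScheme = new TextLine(new Fields("word", "count"));
    Tap source = new Hfs(sourceScheme, args[0]);
    Tap sink = new Hfs(sinkScheme, args[1], SinkMode.REPLACE);
    Pipe assembly = new Pipe("Word Count");
    // Split each line into words, then group by word and count.
    String regex = "(?<!\\pL)(?=\\pL)[^ ]*(?<=\\pL)(?!\\pL)";
    Function function = new RegexGenerator(new Fields("word"), regex);
    assembly = new Each(assembly, new Fields("line"), function);
    assembly = new GroupBy(assembly, new Fields("word"));
    Aggregator count = new Count(new Fields("count"));
    assembly = new Every(assembly, count);
    FlowConnector flowConnector = new FlowConnector(properties);
    Flow flow = flowConnector.connect("word-count", source, sink, assembly);
    flow.complete();
  }
}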
70% less boilerplate code
But still some infrastructure code

No boilerplate code at all
Functional
Robust & Scalable
Runs on the JVM

Here it comes
Java MapReduce
Pig Hive
Hadoop
Cascading Others
abstraction
Scalding

The power of Scala on top of
Cascading

Scala fits naturally with data

Word count in Scalding
import com.twitter.scalding._

class WordCountJob(args: Args) extends Job(args) {
  TextLine("input.txt").read
    // Map phase: split each line into words
    .flatMap('line -> 'word) { line: String => line.split("\\s+") }
    // Reduce phase: group by word and count
    .groupBy('word) { _.size }
    .write(Tsv("results.tsv"))
}
4
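The same map and reduce steps can be tried without a cluster. The following is a minimal sketch using plain Scala collections only, not the Scalding API; the object name WordCountSketch and its method are made up for illustration.

```scala
// Plain Scala collections only; no Scalding or Hadoop involved.
object WordCountSketch {
  // "Map phase": split every line into words (mirrors flatMap).
  // "Reduce phase": group identical words and count them (mirrors groupBy + size).
  def countWords(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(line => line.split("\\s+").toSeq)
      .filter(_.nonEmpty)
      .groupBy(identity)
      .map { case (word, occurrences) => word -> occurrences.size }
}
```

Calling WordCountSketch.countWords(Seq("cool Scala cool")) groups the two occurrences of "cool" under one key, which is the work the reduce phase performs per word.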

Who is using it?
Many many others…

Scalding…
…open sourced by Twitter in 2011
…has more than 100 open source contributors
…exposes the right abstractions
…maximizes expressiveness
…promotes extensibility
…adds new capabilities to Cascading

Sources & Sinks
Tsv("data.tsv", ('productID, 'price, 'quantity))
  .read
  .write(UnpackedAvroSource("data.avro"))
Tsv
Csv
Osv
Avro
Parquet
…

Map Operations
pipe1.filter('age) { age: Int => age > 18 }
pipe1.map('price -> 'withVAT) { price: Double => price * 1.2 }
pipe1.project('name, 'surname)
15 map operations, translated into map phases
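To make the three operations above concrete, here is a small sketch on plain Scala collections. It is not Scalding code: the Person case class and the MapOpsSketch object are hypothetical stand-ins for a pipe with named fields.

```scala
object MapOpsSketch {
  // Hypothetical record type standing in for a pipe with named fields.
  final case class Person(name: String, surname: String, age: Int, price: Double)

  // filter: keep only rows whose age is above 18.
  def adults(rows: Seq[Person]): Seq[Person] = rows.filter(_.age > 18)

  // map: derive a new value (price with 20% VAT) for every row.
  def withVat(rows: Seq[Person]): Seq[(Person, Double)] =
    rows.map(p => (p, p.price * 1.2))

  // project: keep only the name and surname columns.
  def names(rows: Seq[Person]): Seq[(String, String)] =
    rows.map(p => (p.name, p.surname))
}
```

Each function looks at one row at a time and needs no grouping, which is why operations of this kind can run entirely in map phases.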

Join operations
pipe1.joinWithSmaller('productId -> 'productId, pipe2)
pipe1.joinWithLarger('productId -> 'productId, pipe2)
pipe1.joinWithTiny('productId -> 'productId, pipe2)
Optimize by hinting the relative sizes
Supports Left, Right, Inner, Outer Joins
// LeftJoin comes from cascading.pipe.joiner
pipe1
  .joinWithSmaller('productId -> 'productId, pipe2,
    joiner = new LeftJoin)
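Why the size hint matters can be sketched with plain Scala collections. This is a simplification and not how Scalding or Cascading implement joins internally: the smaller side is held in a lookup table while the larger side is streamed past it, which is closest in spirit to joinWithTiny. All names below (JoinSketch, innerJoin, leftJoin) are hypothetical.

```scala
object JoinSketch {
  // Inner join: build a lookup table from the smaller side, then stream
  // the larger side past it and emit one row per match.
  def innerJoin[K, A, B](larger: Seq[(K, A)], smaller: Seq[(K, B)]): Seq[(K, A, B)] = {
    val lookup: Map[K, Seq[B]] =
      smaller.groupBy(_._1).map { case (k, rows) => k -> rows.map(_._2) }
    larger.flatMap { case (k, a) =>
      lookup.getOrElse(k, Seq.empty[B]).map(b => (k, a, b))
    }
  }

  // Left join: every row of the left side survives; a missing match becomes None.
  def leftJoin[K, A, B](left: Seq[(K, A)], right: Seq[(K, B)]): Seq[(K, A, Option[B])] = {
    val lookup: Map[K, Seq[B]] =
      right.groupBy(_._1).map { case (k, rows) => k -> rows.map(_._2) }
    left.flatMap { case (k, a) =>
      lookup.get(k) match {
        case Some(bs) => bs.map(b => (k, a, Option(b)))
        case None     => Seq((k, a, Option.empty[B]))
      }
    }
  }
}
```

The cost of building the lookup table grows with the side that is buffered, so telling the framework which side is smaller lets it buffer the cheaper one.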

Group operations
val pipe = Tsv("input", ('shopId, 'itemId, 'quantity))
  .groupBy('shopId) {
    _.sum[Long]('quantity -> 'totalSoldItems)
  }
  .write(Tsv("results.tsv"))
.groupBy: group by particular fields
.groupAll: group all data
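The difference between grouping by a field and grouping all data can be sketched on plain Scala collections. This is an illustration only, not the Scalding API; the Sale case class and GroupSketch object are hypothetical.

```scala
object GroupSketch {
  // Hypothetical row type mirroring the ('shopId, 'itemId, 'quantity) fields.
  final case class Sale(shopId: String, itemId: String, quantity: Long)

  // groupBy('shopId) + sum: one total per shop.
  def totalPerShop(sales: Seq[Sale]): Map[String, Long] =
    sales.groupBy(_.shopId).map { case (shop, rows) => shop -> rows.map(_.quantity).sum }

  // groupAll + sum: a single total over the whole data set.
  def totalOverall(sales: Seq[Sale]): Long = sales.map(_.quantity).sum
}
```

Grouping all data funnels every row into one group, so on a cluster it is best kept for small or already aggregated inputs.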

Pipe operations
val p = (pipe1 ++ pipe2)          // Concatenate 2 pipes
  .debug                          // Print sample data to screen
  .addTrap(Tsv("bogus_lines"))    // Dirty data is recorded
Simple pipe operations
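The idea behind concatenating pipes and trapping dirty data can be sketched with plain Scala collections. This is not how addTrap is implemented; it only shows the behaviour of routing unparseable rows to a side output instead of failing the job. The "name,quantity" line format and all names below are assumptions made for the example.

```scala
import scala.util.Try

object TrapSketch {
  // Parse one "name,quantity" line; a line that does not fit is returned as Left.
  def parseLine(line: String): Either[String, (String, Int)] =
    line.split(",").map(_.trim) match {
      case Array(name, qty) => Try(qty.toInt).toOption.map(n => name -> n).toRight(line)
      case _                => Left(line)
    }

  // Concatenate two inputs (mirrors ++), then split good rows from trapped rows.
  def run(a: Seq[String], b: Seq[String]): (Seq[(String, Int)], Seq[String]) = {
    val parsed = (a ++ b).map(parseLine)
    (parsed.collect { case Right(row) => row }, parsed.collect { case Left(bad) => bad })
  }
}
```

The second element of the result plays the role of the trap: it can be written out and inspected later while the good rows continue through the pipeline.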

Connect with external systems

Scalding + Hive
class HiveExample(args: Args) extends Job(args) {
  val USER_SCHEMA = List('userId, 'username, 'photo)
  HiveSource("myHiveTable", SinkMode.KEEP)
    .withHCatScheme(osvInputScheme(fields = USER_SCHEMA))
    .write(Tsv("outputFromHive"))
}
Define the schema
Query HCatalog
Read directly from HDFS

Scalding + ElasticSearch
val schema = List('number, 'product, 'description)
val readES = ElasticSearchTap("localhost", 9200, "index/firstType", "", schema).read.write(Tsv("data/es-out.tsv"))
val writeES = Tsv("data.tsv").read.write(ElasticSearchTap("localhost", 9200, "index/secondType", "", schema))
Read from ElasticSearch in one line!
Also index new data in ES

Dependency Injection
Late bound
External Operations

How about defining external operations?
val pipe1 = Tsv("omniture.tsv", OMNITURE_SCHEMA)
  .read
  .ETLOmnitureData
  .calculateOmnitureUserStats
  .joinWithCustomerDB('userId -> 'userId, customerPipe)
  .write(Tsv("omniture-results.tsv"))
Custom operations:
• Re-usable modular code
• Single responsibility
• Testability
Full code: http://bit.ly/1pNSUKf
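One common Scala way to make custom steps chain like built-in operations is an implicit wrapper class. The sketch below applies that pattern to a plain Seq rather than a Cascading pipe, so it is an analogy and not the code behind the link above; Visit, VisitOps, cleanVisits and pagesPerUser are hypothetical names.

```scala
object ExternalOpsSketch {
  // Hypothetical record type; the real job works on pipes with named fields.
  final case class Visit(userId: String, pages: Int)

  // The implicit class adds custom operations to any Seq[Visit] in scope,
  // so they read like built-in steps when chained.
  implicit class VisitOps(visits: Seq[Visit]) {
    // Hypothetical ETL step: drop rows without a user id.
    def cleanVisits: Seq[Visit] = visits.filter(_.userId.nonEmpty)

    // Hypothetical stats step: total pages per user.
    def pagesPerUser: Map[String, Int] =
      visits.groupBy(_.userId).map { case (user, rows) => user -> rows.map(_.pages).sum }
  }
}
```

After import ExternalOpsSketch._, a call such as visits.cleanVisits.pagesPerUser chains the two steps, and each step can be unit tested on its own.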

Testing challenges in the context of MR
Acceptance Tests
Unit – Component Tests
System Tests
Integration Tests
Scalding enables testing in every layer & TDD

example
import com.twitter.scalding._
import org.scalatest.FlatSpec
import org.scalatest.matchers.ShouldMatchers

class TsvWordCountJobTest extends FlatSpec
    with ShouldMatchers with TupleConversions {
  "WordCountJob" should "count words" in {
    JobTest(new WordCountJob(_))
      .arg("input", "inFile")
      .arg("output", "outFile")
      // The source tap is replaced by an in-memory list of (offset, line) pairs
      .source(TextLine("inFile"), List("0" -> "cool Scala cool"))
      // The sink tap is replaced by a buffer that the assertion inspects
      .sink[(String, Int)](Tsv("outFile")) { out =>
        out.toList should contain ("cool" -> 2)
      }
      .run
      .finish
  }
}
Replaces taps with in-memory collections and asserts the expected output
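The principle of swapping taps for in-memory collections can be shown without any framework. This is a sketch of the idea only, not of how JobTest works internally: the hypothetical job below receives its input and output as functions, so a test can pass an in-memory iterator and collect the results in a list.

```scala
object InMemoryTestSketch {
  // Hypothetical job: it only sees a way to read lines and a way to emit rows,
  // so file-backed input and output can be swapped for in-memory stand-ins.
  def wordCountJob(read: () => Iterator[String], write: ((String, Int)) => Unit): Unit =
    read()
      .flatMap(line => line.split("\\s+").toSeq)
      .filter(_.nonEmpty)
      .toList
      .groupBy(identity)
      .foreach { case (word, hits) => write(word -> hits.size) }
}
```

In production the two functions would wrap real files; in a test they wrap collections, which keeps the test fast and free of cluster dependencies.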

“Driven takes Cascading
application development to the
next level with management and
monitoring capabilities for your
apps”
http://driven.cascading.io

Collects telemetry data and exposes it through a web UI

Scalding adds
• Typed API
• Matrix API
• Graphs
• Machine learning algorithms

What does the future look like?

So far…
abstraction

Real Time, Batch, Hybrid
abstraction
Summingbird
A unified API for everything
Storm, Tez, Spark
Enables the Lambda architecture
