5. Pre-aggregation
• Start from generator variables
• Resolve as many variables as possible using:
  • Joins
  • Functions
• Use as many filters as possible
• Join all sources into one set of tuples
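The pre-aggregation steps above can be sketched in plain Python. This is only an illustration, not Cascalog's planner: the `age` and `follows` relations, the sample data, and the filter `age1 < double_age2` are assumptions inferred from the variable names on the later slides.

```python
# Hypothetical sample data; relation names are assumptions, not from the slides.
AGE = [("alice", 28), ("bob", 33), ("carol", 25), ("dave", 70)]
FOLLOWS = [("alice", "bob"), ("bob", "carol"), ("carol", "alice"), ("dave", "carol")]


def pre_aggregate(age, follows):
    """Join all sources into one set of tuples, applying functions and filters."""
    age_of = dict(age)  # in-memory lookup standing in for a distributed join
    rows = []
    for person1, person2 in follows:          # start from a generator
        if person1 not in age_of or person2 not in age_of:
            continue                          # inner join drops unmatched tuples
        age1 = age_of[person1]                # join resolves ?age1
        age2 = age_of[person2]                # join resolves ?age2
        double_age2 = 2 * age2                # function resolves ?double-age2
        if not age1 < double_age2:            # assumed filter, applied as early as possible
            continue
        rows.append((person1, age1, person2, age2, double_age2))
    return rows


if __name__ == "__main__":
    for row in pre_aggregate(AGE, FOLLOWS):
        print(row)
```

Each resulting tuple has the fields [?person1 ?age1 ?person2 ?age2 ?double-age2]; the hypothetical `dave` tuple is removed by the assumed filter before any aggregation would run.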
19. Cascading pipes
• Each: can occur in Map or Reduce
• GroupBy: Causes a Reduce step
• Every: One or more follow a GroupBy
• CoGroup: Join implementation, causes a Reduce step
24. To Cascading
[?person2 ?age2 ?double-age2]
[?person1 ?person2 ?age2 ?double-age2]
[?person1 ?age1 ?person2 ?age2 ?double-age2]
Each
Each
[?person1 ?age1 ?person2 ?age2 ?double-age2 ?delta]
25. To Cascading
[?person2 ?age2 ?double-age2]
[?person1 ?person2 ?age2 ?double-age2]
[?person1 ?age1 ?person2 ?age2 ?double-age2]
[?person1 ?age1 ?person2 ?age2 ?double-age2 ?delta]
GroupBy: Group by ?delta
26. To Cascading
[?person2 ?age2 ?double-age2]
[?person1 ?person2 ?age2 ?double-age2]
[?person1 ?age1 ?person2 ?age2 ?double-age2]
[?person1 ?age1 ?person2 ?age2 ?double-age2 ?delta]
Group by ?delta
Every: Execute aggregators on each group
[?delta ?count]
27. To Cascading
[?person2 ?age2 ?double-age2]
[?person1 ?person2 ?age2 ?double-age2]
[?person1 ?age1 ?person2 ?age2 ?double-age2]
[?person1 ?age1 ?person2 ?age2 ?double-age2 ?delta]
Group by ?delta
[?delta ?count]
Each
28. To Cascading
[?person2 ?age2 ?double-age2]
[?person1 ?person2 ?age2 ?double-age2]
[?person1 ?age1 ?person2 ?age2 ?double-age2]
[?person1 ?age1 ?person2 ?age2 ?double-age2 ?delta]
Group by ?delta
[?delta ?count]
Each: Project fields to [?delta ?count]
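The Each, GroupBy, Every, Each sequence built up on these slides can be mimicked in plain Python. This is a rough sketch, not Cascading code: the sample tuples are hypothetical, and computing ?delta as `age2 - age1` with a count aggregator is an assumption based on the field names.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical joined tuples: (person1, age1, person2, age2, double_age2)
JOINED = [
    ("alice", 28, "bob", 33, 66),
    ("carol", 25, "alice", 28, 56),
    ("erin", 20, "carol", 25, 50),
    ("bob", 33, "carol", 25, 50),
]


def each_add_delta(rows):
    """Each: append ?delta to every tuple (assumed to be age2 - age1)."""
    return [row + (row[3] - row[1],) for row in rows]


def group_by_delta(rows):
    """GroupBy: sort on ?delta (field 5) and collect the tuples of each group."""
    keyed = sorted(rows, key=itemgetter(5))
    return [(delta, list(group)) for delta, group in groupby(keyed, key=itemgetter(5))]


def every_count(groups):
    """Every: execute a count aggregator on each group."""
    return [(delta, group, len(group)) for delta, group in groups]


def each_project(aggregated):
    """Each: project fields to (delta, count)."""
    return [(delta, count) for delta, _group, count in aggregated]


def run(rows):
    return each_project(every_count(group_by_delta(each_add_delta(rows))))


if __name__ == "__main__":
    print(run(JOINED))  # [(-8, 1), (3, 1), (5, 2)]
```

The sort inside `group_by_delta` stands in for the shuffle that makes GroupBy require a Reduce step; the two Each helpers are per-tuple operations and need no grouping.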
29. To MapReduce
[?person2 ?age2 ?double-age2]
Job 1
[?person1 ?person2 ?age2 ?double-age2]
[?person1 ?age1 ?person2 ?age2 ?double-age2]
[?person1 ?age1 ?person2 ?age2 ?double-age2 ?delta]
Group by ?delta
[?delta ?count]
Project fields to [?delta ?count]
30. To MapReduce
[?person2 ?age2 ?double-age2]
Job 2 [?person1 ?person2 ?age2 ?double-age2]
[?person1 ?age1 ?person2 ?age2 ?double-age2]
[?person1 ?age1 ?person2 ?age2 ?double-age2 ?delta]
Group by ?delta
[?delta ?count]
Project fields to [?delta ?count]
31. To MapReduce
[?person2 ?age2 ?double-age2]
[?person1 ?person2 ?age2 ?double-age2]
[?person1 ?age1 ?person2 ?age2 ?double-age2]
[?person1 ?age1 ?person2 ?age2 ?double-age2 ?delta]
Group by ?delta
[?delta ?count]
Job 3
Project fields to [?delta ?count]
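One way to see why this query ends up as three jobs is to cut the pipe sequence at every pipe that needs a Reduce step (GroupBy and CoGroup, per the "Cascading pipes" slide). The sketch below is a simplification, not the real Cascading planner, and the exact pipe sequence in `PIPES` is an assumption (two CoGroup joins followed by the Each, GroupBy, Every, Each steps).

```python
REDUCE_PIPES = {"GroupBy", "CoGroup"}  # pipes that cause a Reduce step

# Assumed pipe sequence for the example query; not taken verbatim from the slides.
PIPES = ["Each", "CoGroup", "CoGroup", "Each", "Each", "GroupBy", "Every", "Each"]


def split_into_jobs(pipes):
    """Cut the sequence so that each job contains at most one Reduce step."""
    jobs = []
    current = []
    has_reduce = False
    for pipe in pipes:
        if pipe in REDUCE_PIPES and has_reduce:
            # A second Reduce cannot fit in the same job: close it, start a new one.
            jobs.append(current)
            current = []
            has_reduce = False
        current.append(pipe)
        if pipe in REDUCE_PIPES:
            has_reduce = True
    if current:
        jobs.append(current)
    return jobs


if __name__ == "__main__":
    for number, job in enumerate(split_into_jobs(PIPES), start=1):
        print(f"Job {number}: {job}")
```

Under these assumptions the Each pipes that follow a Reduce pipe stay in that job's Reduce phase, which matches the slide's note that Each can occur in Map or Reduce.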
39. ElephantDB
• Key/Value pairs
• Pre-shard and index in MapReduce
• Distributed Filesystem: Shard 0, Shard 1, Shard 2, Shard 3, Shard 4, Shard 5
Generation of domain of data
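The pre-sharding step can be illustrated with a small in-memory sketch. It assumes keys are assigned to shards by a stable hash modulo the shard count; the MD5-based `shard_for` helper and the use of plain dicts in place of on-disk indexes are illustrative choices, not ElephantDB's actual implementation.

```python
import hashlib

NUM_SHARDS = 6  # the slide shows Shard 0 through Shard 5


def shard_for(key, num_shards=NUM_SHARDS):
    """Assumed rule: stable hash of the key modulo the number of shards."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def build_shards(pairs, num_shards=NUM_SHARDS):
    """Partition key/value pairs into shards; each dict stands in for an index."""
    shards = [dict() for _ in range(num_shards)]
    for key, value in pairs:
        shards[shard_for(key, num_shards)][key] = value
    return shards


if __name__ == "__main__":
    shards = build_shards([("alice", 28), ("bob", 33), ("carol", 25)])
    for shard_id, shard in enumerate(shards):
        print(f"Shard {shard_id}: {shard}")
```

In a MapReduce job the same rule would typically act as the partitioner, so that each reducer writes exactly one shard to the distributed filesystem.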
40. ElephantDB
• DFS: Shard 0, Shard 1, Shard 2, Shard 3, Shard 4, Shard 5
• Three ElephantDB Servers
Serving domain of data
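A minimal sketch of the serving side follows. The slide shows six shards and three servers but not which server holds which shard, so the two-shards-per-server mapping in `SERVER_FOR_SHARD`, the server names, and the hash-based `shard_for` rule are all assumptions made for illustration.

```python
import hashlib

NUM_SHARDS = 6

# Assumed assignment of shards to servers; the slide does not state the mapping.
SERVER_FOR_SHARD = {0: "s0", 1: "s0", 2: "s1", 3: "s1", 4: "s2", 5: "s2"}


def shard_for(key):
    """Assumed routing rule: stable hash of the key modulo the shard count."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS


class Server:
    """Stand-in for one ElephantDB server holding local copies of its shards."""

    def __init__(self):
        self.shards = {}

    def load(self, shard_id, data):
        self.shards[shard_id] = dict(data)  # copy the shard from the DFS

    def get(self, key):
        return self.shards[shard_for(key)].get(key)


def start_cluster(dfs):
    """Have every server load the shards assigned to it from the DFS."""
    servers = {name: Server() for name in set(SERVER_FOR_SHARD.values())}
    for shard_id, name in SERVER_FOR_SHARD.items():
        servers[name].load(shard_id, dfs.get(shard_id, {}))
    return servers


def lookup(servers, key):
    """Route a read to the server that holds the key's shard."""
    return servers[SERVER_FOR_SHARD[shard_for(key)]].get(key)
```

A read only needs the key: the client recomputes the shard, finds the server assigned to it, and that server answers from its local copy.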