Talk Abstract
Aggregation has long been a use case for Accumulo Iterators. Iterators' ability to reduce data during compaction and scanning can greatly simplify an aggregation system built on Accumulo. This talk will first review how Accumulo's Iterators/Combiners work in the context of aggregating values. I'll then step back and look at the abstraction of aggregation functions as commutative operations and the several benefits that result from this abstraction. We will see how it becomes no harder to introduce powerful operations such as cardinality estimation and approximate top-k than it is to sum integers. I will show how to integrate these ideas into Accumulo with an example schema and Iterator. Finally, a practical aggregation use case will be discussed to highlight the concepts from the talk.
Speakers
Gadalia O'Bryan
Senior Solutions Architect, Koverse
Gadalia O'Bryan is a Sr. Solutions Architect at Koverse, where she leads customer projects and contributes to key feature and algorithm design, such as Koverse's Aggregation Framework. Prior to Koverse, Gadalia was a mathematician for the National Security Agency. She has an M.A. in mathematics from UCLA and has been working with Accumulo for the past 6 years.
Bill Slacum
Software Engineer, Koverse
Bill is an Accumulo committer and PMC member who has been working on large scale query and analytic frameworks since 2010. He holds BS's in computer science and financial economics from UMBC. Having never used his passport to leave the United States, he is currently a national man of mystery.
9. Associative + Commutative Operations
• Associative: 1 + (2 + 3) = (1 + 2) + 3
• Commutative: 1 + 2 = 2 + 1
• Allows us to parallelize our reduce (for instance, locally in combiners)
• Applies to many operations, not just integer addition.
• Spoiler: key to incremental aggregations
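As a quick standalone illustration (a Java sketch, not code from the talk): because addition is associative and commutative, a list of numbers can be split, partially summed, and merged in any grouping or order with the same result, which is exactly what lets a combiner reduce local subsets before a final merge.

```java
import java.util.Arrays;
import java.util.List;

public class ParallelReduce {
    // Sum a sub-range sequentially: models one worker's local combine.
    static int sumRange(List<Integer> xs, int from, int to) {
        int acc = 0; // identity element for addition
        for (int i = from; i < to; i++) acc += xs.get(i);
        return acc;
    }

    public static void main(String[] args) {
        List<Integer> xs = Arrays.asList(1, 2, 3, 4, 5, 6);

        // One sequential pass...
        int sequential = sumRange(xs, 0, xs.size());

        // ...equals two partial sums combined afterwards (associativity),
        // in either order (commutativity).
        int left = sumRange(xs, 0, 3);
        int right = sumRange(xs, 3, 6);

        System.out.println(sequential == left + right);   // true
        System.out.println(sequential == right + left);   // true
    }
}
```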
10. (Figure: parallel "addition" of Sets: {a, b} + {b, c} = {a, b, c}; {a, c} + {a} = {a, c}; then {a, b, c} + {a, c} = {a, b, c})
• We can also parallelize the "addition" of other types, like Sets, as Set Union is associative
11. Monoid Interface
• Abstract Algebra provides a formal foundation for what we can casually observe.
• Don't be thrown off by the name; just think of it as another trait/interface.
• Monoids provide a critical abstraction to treat aggregations of different types in the same way
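To make the interface concrete, here is a minimal sketch in Java (the names are illustrative, not Algebird's or Koverse's actual API): a Monoid bundles an identity element (zero) with an associative plus, and one generic reduce then works identically for Longs, Sets, or any other aggregatable type.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical interface; libraries like Algebird define the same shape.
interface Monoid<T> {
    T zero();            // identity: plus(zero(), x) == x
    T plus(T a, T b);    // associative combine

    // One generic reduce covers every aggregation type.
    default T reduceAll(List<T> items) {
        T acc = zero();
        for (T item : items) acc = plus(acc, item);
        return acc;
    }
}

// Summing Longs is a monoid...
class LongSum implements Monoid<Long> {
    public Long zero() { return 0L; }
    public Long plus(Long a, Long b) { return a + b; }
}

// ...and so is Set union, with no change to the reduce logic.
class SetUnion<E> implements Monoid<Set<E>> {
    public Set<E> zero() { return new HashSet<>(); }
    public Set<E> plus(Set<E> a, Set<E> b) {
        Set<E> out = new HashSet<>(a);
        out.addAll(b);
        return out;
    }
}
```

Adding a new aggregation then means writing one new Monoid implementation, not new plumbing.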
12. Many Monoid Implementations Already Exist
• https://github.com/twitter/algebird/
• Long, String, Set, Seq, Map, etc.
• HyperLogLog – Cardinality Estimates
• QTree – Quantile Estimates
• SpaceSaver/HeavyHitters – Approx Top-K
• Also easy to add your own with libraries like stream-lib [C3]
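These sketches are monoids too. As a toy illustration of why (this is not a real HyperLogLog, whose estimation math is more involved): an HLL value is an array of registers, and merging two sketches is just an element-wise max, which is associative and commutative.

```java
public class RegisterMerge {
    // Element-wise max is the heart of HyperLogLog union: because max is
    // associative and commutative, the merged sketch is the same no matter
    // how the inputs were partitioned or ordered.
    static int[] merge(int[] a, int[] b) {
        int[] out = new int[a.length];
        for (int i = 0; i < a.length; i++) out[i] = Math.max(a[i], b[i]);
        return out;
    }
}
```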
13. Serialization
• One additional trait we need our "aggregatable" types to have is that we can serialize/deserialize them.
(Figure: a distributed sum of 1 + 2 = 3 and 3 + 4 = 7. Each partial aggregator runs 1) zero(), 2) plus(), 3) plus(), then 4) serialize(); the final stage runs 5) zero(), 6) deserialize(), 7) plus(), 8) deserialize(), 9) plus() to produce the total of 10.)
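In code, that lifecycle looks roughly like this (a self-contained Java sketch with an illustrative Long aggregator, not a real library API): each worker starts from zero(), plus()es values in, and serialize()s its partial result; a later stage deserialize()s the partials and plus()es them together.

```java
import java.nio.ByteBuffer;

public class LongAggregator {
    long zero() { return 0L; }
    long plus(long a, long b) { return a + b; }

    // Fixed-width encoding of a Long partial aggregate.
    byte[] serialize(long v) {
        return ByteBuffer.allocate(Long.BYTES).putLong(v).array();
    }
    long deserialize(byte[] bytes) {
        return ByteBuffer.wrap(bytes).getLong();
    }

    public static void main(String[] args) {
        LongAggregator agg = new LongAggregator();

        // Worker 1: zero() -> plus(1) -> plus(2) -> serialize()
        byte[] partial1 = agg.serialize(agg.plus(agg.plus(agg.zero(), 1), 2));
        // Worker 2: zero() -> plus(3) -> plus(4) -> serialize()
        byte[] partial2 = agg.serialize(agg.plus(agg.plus(agg.zero(), 3), 4));

        // Final stage: deserialize() both partials and plus() them together.
        long total = agg.plus(agg.deserialize(partial1), agg.deserialize(partial2));
        System.out.println(total); // 10
    }
}
```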
14. These abstractions enable a small library of reusable code to aggregate data in many parts of your system.
16. SQL on Hadoop
• Impala, Hive, SparkSQL
(Figure: spectra for comparing systems)
  Query Latency: milliseconds / seconds / minutes
  # of Users:    large / many / few
  Freshness:     seconds / minutes / hours
  Data Size:     billions / millions / thousands
17. Online Incremental Systems
• Twitter's Summingbird [PA1, C4], Google's Mesa [PA2], Koverse's Aggregation Framework
(Figure: the same spectra as the previous slide, with markers S (Summingbird), M (Mesa), and K (Koverse) placed along them)
  Query Latency: milliseconds / seconds / minutes
  # of Users:    large / many / few
  Freshness:     seconds / minutes / hours
  Data Size:     billions / millions / thousands
18. Online Incremental Systems: Common Components
• Aggregations are computed/reduced incrementally via associative operations
• Results are mostly pre-computed, so queries are inexpensive
• Aggregations, keyed by dimensions, are stored in a low-latency, scalable key-value store
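A minimal sketch of that pattern (Java, with illustrative names): each pre-reduced batch of aggregates is merged into the store with the same associative operation, so new and existing keys are handled uniformly and a query is a single key lookup.

```java
import java.util.HashMap;
import java.util.Map;

public class IncrementalStore {
    // Aggregates keyed by dimension, e.g. "client:iPhone" -> running count.
    private final Map<String, Long> store = new HashMap<>();

    // Merge a pre-reduced batch into the store using the same associative
    // plus (here Long addition); absent keys start from the identity.
    void mergeBatch(Map<String, Long> batch) {
        for (Map.Entry<String, Long> e : batch.entrySet()) {
            store.merge(e.getKey(), e.getValue(), Long::sum);
        }
    }

    // A query is just a point lookup of a pre-computed value.
    long query(String key) {
        return store.getOrDefault(key, 0L);
    }
}
```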
23. Ingest (1/2)
• We bulk import RFiles rather than writing via a BatchWriter
• The failure case is simpler: we can retry the whole batch if an aggregation job or a bulk import fails
• BatchWriters can be used, but code needs to be written to handle uncommitted Mutations, and there is no rollback for successful commits
24. Ingest (2/2)
• As a consequence of importing (usually small) RFiles, we will be compacting more
• In testing (20 nodes, 200+ jobs/day), we have not had to tweak compaction thresholds or strategies
• This can possibly be attributed to the relatively small amount of data being held at any given time due to reduction
25. Accumulo Iterator
• Combiner Iterator: a SortedKeyValueIterator that combines the Values for different versions (timestamp) of a Key within a row into a single Value. Combiner will replace one or more versions of a Key and their Values with the most recent Key and a Value which is the result of the reduce method.
26. Our Combiner
• We can re-use Accumulo's Combiner type here:
  override def reduce(key: Key, values: Iterator[Value]): Value = {
    val sum = agg.reduceAll(values.map(v => agg.deserialize(v)))
    agg.serialize(sum)
  }
• Our function has to be commutative because major compactions will often pick smaller sets of files to combine, which means we only see discrete subsets of the data in a single iterator invocation.
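This property is easy to check concretely (a Java sketch, independent of Accumulo): if each compaction reduces only the subset of values it happens to see, the final result must not depend on how the versions were partitioned or ordered.

```java
import java.util.Arrays;
import java.util.List;

public class PartialCompaction {
    // One iterator invocation: reduce whatever subset of values it sees.
    static long reduce(List<Long> subset) {
        long acc = 0L;
        for (long v : subset) acc += v;
        return acc;
    }

    public static void main(String[] args) {
        List<Long> allVersions = Arrays.asList(5L, 1L, 3L);

        // A major compaction over all files at once...
        long full = reduce(allVersions);

        // ...vs. a smaller compaction over two files first, whose output
        // is later combined with the remaining file in a different order.
        long partial = reduce(Arrays.asList(reduce(Arrays.asList(1L, 3L)), 5L));

        System.out.println(full == partial); // true: safe under compactions
    }
}
```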
28. Visibilities (1/2)
• Easy to store, a bit tougher to query
• Data can be stored at separate visibilities
• Combiner logic has no concept of visibility; it only loops over a given PartialKey.ROW_COLFAM_COLQUAL
• We know how to combine values (Longs, CountMinSketches), but how do we combine visibilities?
29. Visibilities (2/2)
• Say we have some data on Facebook photo albums:
  – facebookx1falbum_size count: [public] 800
  – facebookx1falbum_size count: [private] 100
• The combined value would be 900
• But what should we return for the visibility of public + private? We need more context to properly interpret this value.
• Alternatively, we can just drop it
30. Queries
• This schema is geared towards point queries.
• Order of fields matters.
• GOOD: "What are the top-k destinations from BWI?"
• NOT GOOD: "What are all the dimensions and aggregations I have for BWI?"
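A sketch of why field order matters (Java, with an illustrative key scheme modeled on the example rows later in the deck): dimensions are concatenated into the RowId in a fixed order, so a point query must know the leading fields to form an exact key.

```java
public class AggregateKey {
    // Build a RowId like "hour:2014_08_24_09|client:Web" from ordered
    // dimension name/value pairs. The order is part of the schema: you
    // can look up a known (hour, client) pair directly, but you cannot
    // enumerate all dimensions for a client unless "client" sorts first.
    static String rowId(String... nameValuePairs) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < nameValuePairs.length; i += 2) {
            if (i > 0) sb.append('|');
            sb.append(nameValuePairs[i]).append(':').append(nameValuePairs[i + 1]);
        }
        return sb.toString();
    }
}
```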
36. Aggregation Flow
• New Records from Import Jobs land in the kv_records table, e.g.:
  – client: iPhone, timestamp: 1408935773, ...
  – client: Android, timestamp: 1408935871, ...
  – client: Web, timestamp: 1408935792, ...
• Periodic, incremental MapReduce jobs (like the current Stats Job) read Records and emit Aggregate KVs, based on the Aggregate configuration for the Collection, into the kv_aggregates table, e.g.:
  – RowId: hour:2014_08_24_09|client:Web, CF: Count, CQ: (empty), Value: 3
  – RowId: client:Android, CF: Count, CQ: (empty), Value: 1
  – RowId: client:Android, CF: Count, CQ: (empty), Value: 5
  – RowId: client:iPhone, CF: Count, CQ: (empty), Value: 6
• The Aggregate Configuration is a type-safe Scala object:
  Aggregate(
    onKey("client",
          "hour", "client")
    produce(Count)
    prepare(("timestamp", "hour", BinByHour()))
  )
• The code is sent to the server as a String, where it is compiled (not executed). The serialized object is passed to the MR job to generate KVs from Records. It contains the dimensions (onKey), the aggregation operation (produce), and optional projections (prepare), which can be built-in functions or custom Scala closures. We envision a UI building these objects in the future.
• Map: emit KVs where Key = dimension + operation and Value = a serialized Monoid Aggregator
• Combine: Aggregation Reduction
• Reduce: Aggregation Reduction, writing RFiles with KVs such as:
  – RowId: client:iPhone, CF: Count, CQ: (empty), Value: 3
  – RowId: client:Android, CF: Count, CQ: (empty), Value: 5
  – RowId: hour:2014_08_24_09|client:Android, CF: Count, CQ: (empty), Value: 2
• MinC and MajC: Aggregation Reduction again as the imported RFiles are compacted
• User Query: a Scan Iterator applies Aggregation Reduction one last time, e.g.:
  – Query: { key: "client:iPhone", produce: Count }
  – Result: { key: "client:iPhone", produce: Count, value: 9 }
Aggregation Reduction is the same common code in all 5 places. For Aggregates with the same Key, the Values are reduced based on the operation (Sum, Set, Cardinality Est., etc.). The Values are always serialized objects that implement the MonoidAggregator interface. Adding a new aggregation operation will impact a single class only: no new Iterators or MR code.
• The final stored Aggregate KV, e.g.:
  – RowId: hour:2014_08_24_09|client:Web, CF: Count, CQ: (empty), Value: 8
• Aggregate Queries are simple point queries for a single KV. If the user wants something like an enumeration of "client" values, they will use a Set or Top-K operation, and the single value will contain the answer with no range scans required.
• The API may support batching multiple keys per request to efficiently support queries that build timeseries (e.g., counts for each hour in the day)