OpenDremel's Metaxa Architecture

Metaxa Architecture
June 22nd
By Camuel, OpenDremel

Meet Metaxa
• Implements Dremel using LAPHROAIG as both the execution engine and the storage
backend.
• No distribution: METAXA is a single jar file executed in a single JVM. It
produces and executes a single-threaded MAP job.
• All input data resides inside a single LAPHROAIG object.
• Output is one of the following:
– A new LAPHROAIG object.
– Streamed back to the caller.
• Convert-type commands convert a single LAPHROAIG object from popular
object serialization formats to the nested columnar Dremel format, or vice versa.
• Query-type commands process LAPHROAIG objects in the nested columnar
Dremel format and can store the result in another object, or convert it to a
popular object serialization format and stream it back to the user.
• A LAPHROAIG object is a container of other objects, either “serialized objects” or
“columnar encoded objects”. The two types of objects are not to be confused.
• Just five use cases:
– Convert “serialized objects” into “columnar encoded objects”.
– Convert “columnar encoded objects” into “serialized objects”.
– Query “columnar encoded objects” with BQL, producing “serialized objects”
and streaming them back to the caller.
– Query “columnar encoded objects” with BQL, producing “serialized objects”
and saving them as a new LAPHROAIG “container” object.
– Query “columnar encoded objects” with BQL, producing “columnar
encoded objects” and saving them as a new LAPHROAIG “container” object.

Use case #1: Convert serialized objects into columnar-encoded
objects
[Diagram. Surviving labels: Convert command, hierarchical schema, serialized objects (Protobuf, Avro, Thrift), Metaxa.jar, LAPHROAIG, columnar-encoded objects (Tablet).]

Use case #2: Convert columnar-encoded objects into serialized
objects
[Diagram. Surviving labels: Convert command, columnar-encoded objects (Tablet), Hierarchical.]

Use case #3: Query “columnar encoded objects” with BQL
producing “serialized objects” and streaming them back to the
caller.
[Diagram. Surviving labels: BQL query, columnar-encoded objects (Tablet), Hierarchical.]

Use case #4: Query “columnar encoded objects” with BQL producing
“serialized objects” and saving it
[Diagram. Surviving labels: Query, objects (Tablet), Hierarchical.]

Use case #5: Query “columnar encoded objects” with BQL
producing “columnar encoded objects” and saving it
[Diagram. Surviving labels: Query, objects (Tablet), columnar-encoded objects (Tablet).]

SerObjs – Serialized Objects
• Data produced by serializing objects with
Protobuf, Avro or Thrift.
• Hierarchical data.
• Flat data such as CSV.
• RDBMS-originated data.
• Data from KV-stores and document stores.
• Logs.
• The schema may be embedded or provided
separately.

Tablet – Columnar-encoded objects
• Immutable chunk of data.
• Logically composed of Slices and can be turned into a series of Slices.
• Columnar and Dremel-encoded.
• Consists of a header (called the Tablet Schema) and multiple {byte, word, dword or
qword}-streams.
• The Tablet Schema describes:
– Tablet columns (multi-dimensional arrays), including general metadata, compression and encoding metadata,
as well as references to associated dictionaries, rep & def levels, etc.
– The original SerObjs schema and its mapping to tablet columns.
– Future: additional SerObjs schemas and mappings.
• Tablet data is a set of multidimensional arrays of 8-, 16-, 32- or 64-bit elements,
denoted byte (b), word (w), double word (dw) and quad word (qw). Each
array represents a column and can be accessed independently without incurring access
costs for neighboring arrays. Every element is a bit field whose bits carry
different pieces of information, for example (multiple) column values, counts (RLE), and rep and
def levels.
• A tablet scanner can mask some of the details of column encoding and provide a
higher-level interface to the tablet, automatically decoding RLE, dictionaries and rep & def
levels. However, the tablet binary format is a stable interface between Metaxa
modules and between different versions of the OpenDremel system.
• Tablets are horizontal partitions of a larger columnar dataset.
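The bit-field idea above can be sketched in Java. The layout below is purely hypothetical (the slides do not give the actual bit assignment, which would be described by the Tablet Schema): a 16-bit word holds a 10-bit value, a 3-bit repetition level and a 3-bit definition level. The class and method names are illustrative only.

```java
// Hypothetical layout of one 16-bit "word" element (NOT taken from the Metaxa format):
//   bits 0-9   column value (e.g. a dictionary index)
//   bits 10-12 repetition level
//   bits 13-15 definition level
final class WordElement {
    static final int VALUE_BITS = 10;
    static final int LEVEL_BITS = 3;
    static final int VALUE_MASK = (1 << VALUE_BITS) - 1;
    static final int LEVEL_MASK = (1 << LEVEL_BITS) - 1;

    // Packs the three fields into one 16-bit element, rejecting out-of-range input.
    static short pack(int value, int rep, int def) {
        if ((value & ~VALUE_MASK) != 0 || (rep & ~LEVEL_MASK) != 0 || (def & ~LEVEL_MASK) != 0) {
            throw new IllegalArgumentException("field out of range");
        }
        return (short) (value | (rep << VALUE_BITS) | (def << (VALUE_BITS + LEVEL_BITS)));
    }

    // The masks discard the sign-extension bits that appear when a short is widened to int.
    static int value(short w) { return w & VALUE_MASK; }
    static int rep(short w)   { return (w >>> VALUE_BITS) & LEVEL_MASK; }
    static int def(short w)   { return (w >>> (VALUE_BITS + LEVEL_BITS)) & LEVEL_MASK; }
}
```

A tablet scanner of the kind described above would hide exactly this kind of masking and shifting behind a higher-level interface.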

Slice – Columnar-encoded object fraction
• A Slice is a vector (ordered list of scalars) where each scalar corresponds to the current
value of a different tablet column that is being scanned / iterated.
• A Tablet can be broken down into an ordered list of slices and reassembled from a
series of slices.
• A Slice in Metaxa contains plain integer values (not bit fields) of type b, w, dw or qw.
• A Slice may contain fewer values than there are columns in the tablet. In this case the columns
represented in the slice are called “projected columns”.
• A Slice also contains an additional integer field called Level. This Level is aliased as
FetchLevel or SelectLevel, depending on whether the Tablet is being sliced into a
series of slices or being reconstructed from a series of slices.
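A minimal sketch of the Slice structure, under stated assumptions: the class names, the use of `long` for all element widths, and the integer column ids are all hypothetical. The slicer only handles the simplest case of non-repeated columns, where every slice gets level 0; real tablets also carry rep & def levels.

```java
// Sketch of a slice: one plain integer per projected column plus a Level field.
final class Slice {
    final int[] columnIds;  // which tablet columns are projected into this slice
    final long[] values;    // current value of each projected column, widened to long
    int level;              // FetchLevel when slicing, SelectLevel when reconstructing

    Slice(int[] columnIds) {
        this.columnIds = columnIds.clone();
        this.values = new long[columnIds.length];
    }

    // Returns the current value of the given tablet column, if it is projected.
    long valueOf(int columnId) {
        for (int i = 0; i < columnIds.length; i++) {
            if (columnIds[i] == columnId) return values[i];
        }
        throw new IllegalArgumentException("column " + columnId + " is not projected");
    }
}

final class FlatSlicer {
    // Simplification: all columns are non-repeated, so the arrays have equal length
    // and each row becomes exactly one slice with level 0.
    static Slice[] slice(long[][] columns, int[] projected) {
        int rows = columns[projected[0]].length;
        Slice[] out = new Slice[rows];
        for (int r = 0; r < rows; r++) {
            Slice s = new Slice(projected);
            for (int i = 0; i < projected.length; i++) {
                s.values[i] = columns[projected[i]][r];
            }
            s.level = 0;
            out[r] = s;
        }
        return out;
    }
}
```

Projecting columns 0 and 2 of a three-column tablet yields slices with two values each; asking such a slice for column 1 fails because that column is not projected.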

Query Plan (QP)
• A QP is a descriptor of a source tablet, a result tablet, a set of scalar
transformations, and a DAG of their dataflow interconnections.
• Scalar transformations are of one of the following types:
– Plain transformations => also called expressions; many inputs but one output.
– Predicates => boolean expressions which, when evaluating to false, cancel the issuance of
the result slice.
– Aggregates => Count, Sum and Distinct functions; they aggregate slices and, when the
last slice in an aggregation group is detected, issue multiple result slices.
• QP input and output is always a slice. Because of predicates, it is
possible that for some input slices no output slice will be issued.
Because of aggregates, it is also possible that for one input slice
multiple output slices will be issued.
• Input slices contain FetchLevel and output slices contain SelectLevel
(according to appendix D in the paper).

Conceptual View of Tablet
[Diagram. Surviving labels: Levels (dimensions) 0, 1, 2; Record [0] through Record [5].]

Conceptual View of Tablet Slicing
[Diagram. Surviving labels: Levels (dimensions) 0, 1, 2; Record [0]; slices [0][0][0], [0][0][1], [0][0][2], [0][1][0], [0][1][1], [0][2][2].]

Conceptual View of QP
[Diagram. Surviving labels: Levels (dimensions) 0, 1, 2; Record [0] and Record [1]; slices [0][0][0] and [0][1][1]; Expr (rep=0), Expr (rep=1), Expr (rep=2).]

Compiler
Translates BQL into a Query Plan
Requirements:
– Must parse and compile valid BQL as defined by BigQuery.
– Must reject invalid BQL and supply user-friendly error messages.
– Must produce an executable QP object with the following features:
• It is Serializable => without circular references and without references to “system”
objects such as file handles; a pure object model.
• getProcessSliceSource => returns the text of processSlice in Java source-code form
• getSourceTablets => returns the tablets to run the QP on
• setResultTablet => sets the result tablet
• setExecutionStatusCode => indicates the status of QP execution
• log => allows logging important events during QP execution
• getDiagram => returns a graphic image of the QP diagram (for debugging)
– Must provide basic command-line argument functionality as well as
simple shell functionality.
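The QP object requirements can be sketched as a Java interface. Only the method names come from the slide; the parameter and return types, the `SimpleQueryPlan` and `QueryPlanIO` classes, and the use of built-in Java serialization are assumptions made for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.ArrayList;
import java.util.List;

// Hypothetical shape of the QP object; types are guesses, names are from the slide.
interface QueryPlan extends Serializable {
    String getProcessSliceSource();   // Java source text of processSlice
    List<String> getSourceTablets();  // tablets to run the QP on (assumed to be names)
    void setResultTablet(String tablet);
    void setExecutionStatusCode(int code);
    void log(String event);
    byte[] getDiagram();              // rendered diagram image, for debugging
}

// Minimal in-memory implementation: plain fields only, no file handles, no cycles.
final class SimpleQueryPlan implements QueryPlan {
    private static final long serialVersionUID = 1L;
    private final String processSliceSource;
    private final ArrayList<String> sourceTablets;
    private final ArrayList<String> events = new ArrayList<>();
    private String resultTablet;
    private int statusCode;

    SimpleQueryPlan(String processSliceSource, List<String> sourceTablets) {
        this.processSliceSource = processSliceSource;
        this.sourceTablets = new ArrayList<>(sourceTablets);
    }

    public String getProcessSliceSource() { return processSliceSource; }
    public List<String> getSourceTablets() { return new ArrayList<>(sourceTablets); }
    public void setResultTablet(String tablet) { this.resultTablet = tablet; }
    public void setExecutionStatusCode(int code) { this.statusCode = code; }
    public void log(String event) { events.add(event); }
    public byte[] getDiagram() { return new byte[0]; } // rendering is out of scope here
}

final class QueryPlanIO {
    // Serializes and deserializes a QP in memory to check that it is a pure object model.
    static QueryPlan roundTrip(QueryPlan qp) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(qp);
            }
            try (ObjectInputStream in =
                     new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                return (QueryPlan) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException("QP is not serializable", e);
        }
    }
}
```

A round trip through `QueryPlanIO.roundTrip` is a cheap way to catch a stray reference to a non-serializable "system" object early.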

Vocabulary Compiler

• Token - lexeme
• Parse tree – token tree
• AST – Abstract Syntax Tree
• SM – Semantic Model
• ASM – Annotated Semantic Model
• QP – Query Plan
• DAG – Directed Acyclic Graph
• Schema – Metadata about dataset.

Compiler
Prerequisite Materials
– http://code.google.com/apis/bigquery/docs/query-reference.html
– http://www.antlr.org/
– http://en.wikipedia.org/wiki/Parsing
– http://en.wikipedia.org/wiki/Query_plan
– http://en.wikipedia.org/wiki/Compiler_construction
– http://www.amazon.com/Terence-Parr/e/B001JS3O0U

Compiler
High-Level Design (verbose)
[Diagram. Surviving labels, roughly in pipeline order:
– Inputs: command-line arguments / shell input (BQL), SerObjs schema.
– Antlr parser, producing the AST.
– Semantic parser, producing the SM.
– Semantic analyzer (validation, resolving references, result schema inference, optimization), producing the Annotated Semantic Model (Java object model implemented via Java collections).
– QP generator, producing the Query Plan (includes ResultTablet metadata).
– Result SerObjs schema generator, producing the result SerObjs schema.
– Side inputs: metadata (file locations and statistics), optimization rules, validation rules, C / asm template.]

Compiler
[Annotated] Semantic Model

• Comprehensively describes the query in every detail.
• Java objects (packed into collections, without
spaghetti cyclic references).
• Must be serializable to a file with the SerObjLib
framework, and restorable from it.
• Must be printable in a form a human can
understand.
• Must be renderable on request as a clear graphic
diagram with a legend.

QP: Scalar Transformation functions (Expr) Compiler

• A set of primitive predefined scalar operations and functions applied to
xfunc arguments in a particular prescribed order.
• Expressed in valid C or assembly with some restrictions.
• Purely functional => side-effect free, meaning no static/global
variables and no memory allocations. However, for performance and
brevity they are inlined into a single processSlice function.
• Some functions have a context object where they can store their
externalized state between calls. One regular array and one associative array
are provided as context for these functions.
– Context-free transformation functions
• One value in, one value out, e.g. a+b
– Scalar context transformation functions
• Many values in, many values out, e.g. sum(a) within links
– Map context transformation functions
• Many values in, many values out (out of sync), e.g. sum(a) group by date
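The three kinds of transformation functions can be sketched in Java as follows. The `ExprContext` class, its field names and the slot-based accumulator are hypothetical; the slide only states that one regular array and one associative array are provided as context. The sketch also uses a `HashMap`, which allocates, so it illustrates the state model rather than the no-allocation constraint.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical context object: one regular array and one associative array.
final class ExprContext {
    final long[] slots = new long[8];
    final Map<Long, Long> map = new HashMap<>();
}

final class Xfuncs {
    // Context-free: no state, e.g. a + b.
    static long add(long a, long b) {
        return a + b;
    }

    // Scalar context: a fixed-size accumulator kept in one slot of the regular array,
    // e.g. a running sum(a).
    static long runningSum(ExprContext ctx, int slot, long a) {
        ctx.slots[slot] += a;
        return ctx.slots[slot];
    }

    // Map context: one accumulator per group key in the associative array,
    // e.g. sum(a) group by date. Returns the running total for that key.
    static long groupSum(ExprContext ctx, long key, long a) {
        return ctx.map.merge(key, a, Long::sum);
    }
}
```

Because all state lives in the context object, the functions themselves stay free of static or global variables, which is what allows them to be inlined into one processSlice function.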

Compiler
QP in C Form
• Generated processSlice(..){..} function.
– Input: inSlice
– Output: outSlice
– Context object for state externalization
• inSlice contains scalar values for every source function and
also fetchLevel.
• outSlice must have correct scalar values for every result
function and also the correct selectLevel.
– outSlice is guaranteed to preserve its content between calls, so it can be
used as a cache for result functions that haven’t changed, and also as a cache for
selectLevel if it has not changed.

– outSlice values can also be read (they contain the results of the previous outSlice).

– On the first call, all values in outSlice are guaranteed to be zeros.

QP template Compiler
(according to appendix D)
void processSlice(inSlice, outSlice, Context) {

Evaluate the where clause. If it evaluates to false, then do:
    outSlice.setSkip;
    outSlice.selectLevel = min(outSlice.selectLevel, inSlice.fetchLevel);
    return;

If the where clause evaluates to true, then:
    switch(inSlice.fetchLevel) {
    case 0:
        Evaluate expressions (xfuncs) with repetition level = 0
        ……..
        ……..
    case n:
        Evaluate expressions (xfuncs) with repetition level = n
        If it is the last slice in the aggregation group then:
        // the line below will cause additional calls to processSlice
        outSlice.setAdditionalSliceCount( Number of slices in aggregation
    }
}
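A runnable Java sketch of the skip-and-cache pattern in the template, for a hypothetical query `SELECT docId, len * 2 WHERE len > 0` with `docId` at repetition level 0 and `len` at level 1. This is a simplified variant, not the template itself: it tracks the shallowest level seen since the last emitted slice in a context field (`pendingLevel`) rather than in `outSlice.selectLevel`, it assumes the switch is meant to fall through to deeper levels, and it leaves out aggregation. All class and field names are illustrative.

```java
final class InSlice {
    long docId;      // source column with repetition level 0
    long len;        // source column with repetition level 1
    int fetchLevel;  // shallowest level whose value changed in this slice
}

final class OutSlice {
    long docId;      // result expression at level 0
    long doubled;    // result expression at level 1
    int selectLevel;
    boolean skip;    // true when no result slice should be issued
}

final class SliceContext {
    // Shallowest fetchLevel seen since the last emitted slice (assumption of this sketch).
    int pendingLevel = Integer.MAX_VALUE;
}

final class ProcessSliceSketch {
    static void processSlice(InSlice in, OutSlice out, SliceContext ctx) {
        ctx.pendingLevel = Math.min(ctx.pendingLevel, in.fetchLevel);

        // WHERE len > 0: on false, cancel the result slice but remember the level.
        if (!(in.len > 0)) {
            out.skip = true;
            return;
        }
        out.skip = false;

        // Re-evaluate only expressions at or below the pending level; shallower
        // results stay cached in outSlice from the previous call.
        switch (ctx.pendingLevel) {
            case 0:
                out.docId = in.docId;
                // fall through: deeper levels changed as well
            case 1:
                out.doubled = in.len * 2;
                break;
            default:
                throw new IllegalArgumentException("unsupported level " + ctx.pendingLevel);
        }
        out.selectLevel = ctx.pendingLevel;
        ctx.pendingLevel = Integer.MAX_VALUE; // reset after emitting a slice
    }
}
```

When a level-0 slice is skipped and the next level-1 slice passes, the pending level is still 0, so `docId` is refreshed rather than served stale from the cache.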

Columnar Abstraction
• Tableton is a set of sequentially accessed multidimensional scalar arrays.
• A Tablet is a serialized, Dremel-encoded columnar dataset of fixed size. Each array in a
tablet can be independently and serially accessed without incurring the cost of buffering
neighboring arrays.
• Four types of arrays: bytes, words (16b), dwords (32b), qwords (64b).
• The following operations are defined:

– Parsing Tablet Schema => reading and parsing the tablet header/metadata (also called the tablet
schema) and providing an object model for it.

– Reading => converting a Tablet to SerObjs using an FSM for better performance, as
described in the Dremel paper (calling callback functions to let them construct
SerObjs in various formats).

– Slicing => synchronized multi-array scalar iteration over a Tablet.

– Building Tablet Schema => creating the tablet header/metadata (also called the tablet schema) with a
convenient builder API. Also called the TabletSchema Editor.

– Construction => re-creating a Tablet from slices; this interface is also used for dissecting
SerObjs into a tablet.

– Compaction => the constructed Tablet is compressed and a hash key is generated for it;
from that point on it becomes immutable.
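The compaction step can be sketched in Java. The slides do not name a compression scheme or a hash function, so DEFLATE (`java.util.zip`) and SHA-256 are stand-ins chosen for illustration, as is hashing the compressed bytes rather than the raw ones. The `CompactedTablet` class is hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Immutable result of compaction: compressed bytes plus a content-derived key.
final class CompactedTablet {
    private final byte[] compressed;
    final String hashKey; // hex SHA-256 of the compressed bytes (assumption)

    private CompactedTablet(byte[] compressed, String hashKey) {
        this.compressed = compressed;
        this.hashKey = hashKey;
    }

    static CompactedTablet compact(byte[] raw) {
        Deflater deflater = new Deflater();
        deflater.setInput(raw);
        deflater.finish();
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        while (!deflater.finished()) {
            bos.write(buf, 0, deflater.deflate(buf));
        }
        deflater.end();
        byte[] compressed = bos.toByteArray();
        return new CompactedTablet(compressed, sha256Hex(compressed));
    }

    byte[] decompress() {
        Inflater inflater = new Inflater();
        inflater.setInput(compressed);
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buf);
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    throw new IllegalStateException("truncated tablet data");
                }
                bos.write(buf, 0, n);
            }
        } catch (DataFormatException e) {
            throw new IllegalStateException("corrupt tablet data", e);
        } finally {
            inflater.end();
        }
        return bos.toByteArray();
    }

    private static String sha256Hex(byte[] data) {
        try {
            StringBuilder sb = new StringBuilder();
            for (byte b : MessageDigest.getInstance("SHA-256").digest(data)) {
                sb.append(String.format("%02x", b & 0xff));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Deriving the key from the content means two compactions of identical data yield the same key, which fits the requirement that a tablet is immutable once compacted.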

Tableton

What about other datatypes?
• They are mapped into yet another dimension of the
scalar array.
• It is strongly recommended not to use Java strings.
They are impossible to work with without incurring the
full cost of object lifecycle management.
• It is OK not to support them at all at first, and then
gradually add support for them.
• The Java String class goodies will be
impossible to support in Metaxa anyway, because of
performance.
• The same goes for BLOBs, images and any other
complex data type. All are mapped to yet another
dimension of the scalar array.
Tableton

Hierarchical vs. Columnar
• Different abstractions / domains / contexts.
• Different schemas.
• Most confusion stems from not differentiating between them!
• Always keep the context in mind when you are developing.
• Don’t think about both at the same time unless you are
willing to develop schizophrenia.
• Columnar is not an implementation artifact of hierarchical.
Columnar is a whole new model in its own right.
• We must adopt two different vocabularies for these domains.
Confusion is notoriously common here.

Tableton

Hierarchical vs. Columnar
Hierarchical                        Columnar

A SerObjs in our lingo              A Tablet in our lingo
Protobuf, Avro, Thrift files        Dremel-generated tablets
Serialized objects                  Multi-dimensional arrays
The only user-level abstraction     User never knows what it is
BQL queries written against it      Query plans executed against it
More frontend-related               More backend-related
More logical / external format      More physical / internal format
Hierarchical is queried             Columns are scanned
SerObjLib component                 Tableton component
Tableton

Hierarchical Example

Tableton

Executes QP against tablets
• Requirements
– Must convert the QP into executable bytecode and execute it (not interpret it).
– Must work with the QP as an object model, but initially compiling and running
the QP in Java form will suffice.
– Must not mask data and task parallelism.
• Data parallelism at the tablet level and also at the column level within a tablet.
• Task parallelism across separate QP transformation functions.
– Must deliver ultra-high performance.
• Latency overhead within a few milliseconds (assuming data in RAM).
• Throughput of multiple GB/sec.

Executor

Vocabulary

• QP – Query Plan
• DAG – Directed Acyclic Graph
• Slot – Like thread (todo)
• Expression – operator tree on scalar arguments and
scalar constants
• CF – Context Free (stateless scalar expression)
• FC – Fixed-size Context (scalar expr with
accumulator)
• VC – variable-size Context (scalar expr with
growing list of accumulators)
Executor

Code generation
• [todo] Janino!
• [todo] Explain dynamic Java code generation
and compilation.
• [todo] Use code templates! No classes or functions,
just a code listing with labels and jumps.
Generated code is different every time and no one is
going to study it. Put the static portions in a library
and pre-compile it regularly. The dynamic portion
is just a code snippet.
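A minimal sketch of dynamic Java code generation and compilation from a template. It uses the JDK's built-in `javax.tools` compiler rather than Janino, which the slide lists as a todo; the template text, the `GeneratedExpr` class name and the `compileAndRun` helper are all hypothetical. The static part of the template stays fixed and only the expression snippet varies.

```java
import java.io.IOException;
import java.lang.reflect.Method;
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import javax.tools.JavaCompiler;
import javax.tools.ToolProvider;

final class DynamicCodegenSketch {
    // Static portion of the generated code; %s is replaced by the dynamic snippet.
    static final String TEMPLATE =
        "public class GeneratedExpr {\n"
        + "    public static long eval(long a, long b) {\n"
        + "        return %s;\n"
        + "    }\n"
        + "}\n";

    // Generates source for the expression, compiles it to bytecode, loads it and runs it.
    static long compileAndRun(String expr, long a, long b) {
        JavaCompiler compiler = ToolProvider.getSystemJavaCompiler();
        if (compiler == null) {
            throw new IllegalStateException("a JDK (not only a JRE) is required");
        }
        try {
            Path dir = Files.createTempDirectory("qp-gen"); // not cleaned up in this sketch
            Path src = dir.resolve("GeneratedExpr.java");
            Files.write(src, String.format(TEMPLATE, expr).getBytes(StandardCharsets.UTF_8));
            int rc = compiler.run(null, null, null, "-d", dir.toString(), src.toString());
            if (rc != 0) {
                throw new IllegalArgumentException("generated code failed to compile");
            }
            try (URLClassLoader loader = new URLClassLoader(new URL[] { dir.toUri().toURL() })) {
                Method eval = loader.loadClass("GeneratedExpr")
                                    .getMethod("eval", long.class, long.class);
                return (Long) eval.invoke(null, a, b);
            }
        } catch (IOException | ReflectiveOperationException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

For example, `DynamicCodegenSketch.compileAndRun("a + b * 2", 3, 4)` compiles the expression into bytecode and executes it instead of interpreting it. An in-memory compiler such as Janino would avoid the temporary files.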

Executor

Thanks
(sneak preview of future versions in next slides)

The overall vision for OpenDremel
• An interactive data cloud platform for managing
high volumes of static data in the form of
serialized objects.
• Compatible with Google tools such as BigQuery,
Prediction API, Fusion Tables, Google
Storage, etc.
• Aggressively use existing open-source
software, preferably Apache-licensed, to
quickly “implement” the desired functionality.

Features Backlog
• Processing compressed data directly without decompressing it.
• Macro parallelism: 1) multithreading 2) multi-process 3) multi-node 4)
massive clustering.
• Micro parallelism: 1) SSE & AVX 2) OpenCL 3) better machine code to
leverage ILP 4) light threads for parallel processing of a single tablet 5)
LLVM 6) special hardware (GPU & Tilera).
• Interactive joins and indexing support, zone maps and global system-
recognized dimensions such as time, geography and IP.
• Advanced analytics, statistics and machine learning capabilities.
• Richer SerObjLib, more formats.
• Advanced visualization and streaming.
• Batch data-crunching and map-reduce support.
• Multi-tenancy, resource control, metering and accounting.
• CEP capabilities, fast lookups, and querying of data that is not yet packed
into tablets.
• User-defined functions.
• Scratch tables and rolling queries.
