John Yates has been a programmer since 1970. In 2010, he was working at Netezza where they needed a new data analytics architecture. He proposed using a family of statically typed acyclic data flow graph intermediate representations (IRs). This involved representing the query plan as a graph of operators with data flowing along edges. The graph could then be optimized and executed in parallel across cluster nodes. Yates provided an initial implementation of this approach to prove the concept, which helped convince Netezza to adopt it. He then led the development of the new Marlin architecture at Netezza based on this IR approach.
2. Order of presentation
• Who am I and why am I here?
• 2010: Netezza needs a new architecture
• A family of statically typed acyclic DFG IRs
• (Time permitting: Some engineering details)
• Q&A
3. “Who am I and why am I here?”
(with apologies to Adm. Stockdale)
4. 1970: Maybe I’ll be a programmer
• NYC hippie, ponytail, curled handlebar mustache
• Liberal arts high school, lousy student
• Wanted to build things, real things
• Computers seemed interesting and intuitive
• Luckily in 1970 programmers were scarce
5. 40 years…
– 1970: learning the craft, various jobs (all in assembler)
– 1978: Digital Equipment Corp
• Pascal frontend, dynamic programming code selector
– 1983: Apollo Computer
• Designed RISC ISP w/ explicit parallel dispatch (pre-VLIW)
• Lead architect for RISC backend optimizer; built team
• 1st commercial: SSA IR, SW pipeliner, lattice const prop
– 1992: Binary translation: DEC (sw), Chromatic (hw-support)
• More SSA IR, lowering; built teams; lot of patents (many hw)
– 1999: Everfile - NFS-like Win32 internet file system
– 2002: Netezza, badge #26
• Storage: compression, indices, access methods, txns, CBTs
7. Data parallel analytics engine
• Data partitioned across a cluster of nodes
– Multiple “slices” per node to exploit multi-core
• Execution model:
– Leader accepts query, produces an execution plan
– Leader broadcasts plan’s parallel components
– Cluster performs data parallel work
– Leader performs work requiring a single locus
• Competition: Teradata, Greenplum, DB2, …
10. Garth’s incomplete Marlin vision
• What is the real input to the interpreter?
• How do we get from query plan to that form?
[Diagram: PG plan → Split → Bcast → Interpret (faster?) on the leader and on N workers; the step from query plan to interpretable form is labeled an “unspecified miracle”, with a “multi-core?” question mark.]
11. A family of statically typed
acyclic data flow graph IRs
13. Graph
• Operators
– Label names a function
– Edge connections in and out
• Edges
– Directed (“dataflow”)
14. Dataflow
• Dataflow machines
– Apply history, wisdom, insights to the interpreter
• Value semantics
– All edges carry data
– No other kinds of edges (i.e. no anti-dependence)
– No updatable shared state (i.e. no store)
• Expose all opportunities for concurrency
15. Acyclic
• No backedges ≡ no cycles ☺
• Can exploit topological ordering
– Fact propagation: rDFS (forward) or DFS (reverse)
– No iteration, guaranteed termination
– Linear algorithms, O(graph)
16. Statically typed
• Edges initially have unknown type
• A well-formed graph can be statically typed
– Linear pass over topologically ordered Operators
– Assign edge types per Operator descriptors
– Inconsistencies can be diagnosed and reported
17. A family of … IRs
• Well-nested subsets of edge type vocabularies
• Constraining edge types constrains operators
[Diagram: PG plan → Split → Bcast → Interpret on the leader and on N workers. Three “Lower and Opt” stages, driven by Tree, Graph1, and Graph2 patterns written in a common pattern notation, carry the IR from high-level tree (tuples) through high-level graph (tuples) and mid-level graph (nullable values) down to low-level graph (values); a “Topo expand, insert CLONEs” step precedes each Interpret.]
18. Nothing convinces like working code
• First delivery
– Table-driven operator semantics
– Utilities: build, edit & expand
– Topological sort
– Type check & report errors
[Diagram: a graph assembler consumes a graph assembly program; “Topo expand, insert CLONEs” steps feed Split → Bcast → Interpret on the leader and on the N workers.]
19. Sold!
• Working code rendered my successive-lowerings idea credible
• Overall Marlin added ~10 engineers; I got 3
• My team got its first end-to-end test case working
[Diagram: the successive-lowering pipeline of slide 17: PG plan → Split → Bcast, three “Lower and Opt” stages driven by Tree/Graph1/Graph2 patterns, “Topo expand, insert CLONEs”, and Interpret on the leader and on the N workers.]
20. IBM killed the Marlin program…
• Marlin was a clean-up project promising…
– Performance and shorter development cycles
– But no new features or functionality
• It is always hard to fund significant clean-up
– Especially if not legitimately tied to a coveted feature
• Harder if your company is under duress
• Harder still if DB2 is gunning for your headcount
23. Why clone?
• After expansion all edges are point-to-point
– No output is multiply-consumed
• Chunk handoff along an edge becomes trivial
– Think C++11’s new move semantics
• So only clones implement reference counting
24. Broadcast
• Serialize / deserialize
• On the network, size matters
• Graph object
– Small number of scalar members
– Handful of C++ vectors (some ephemeral)
– Position independent (no pointers in vectors)
25. No pointers
• Pointers index the linear address space
– Implicit context (there is only one address space)
• Unsigned as vector index
– User must provide explicit context (vector base)
– 32 bit indices are ½ the size of 64 bit pointers
– Position independence simplifies serialization
26. The graph object
• Exposed read-only data
– Vector of Operator objects
– Vector of EdgeIn objects
– Vector of EdgeOut objects
– Literal table and pool
• Private data (may be missing or elided)
– Vector of EdgeIn next links
– Vector of Operator BreadCrumbs
33. Thin graph construction
• graph.add(BreadCrumb, Op, Locus, Expansion, unsigned nVarIn = 0, unsigned nVarOut = 0);
– Add an Operator and its Edge resources
• graph.connect(OperatorIndex srcOp, unsigned srcPos, OperatorIndex dstOp, unsigned dstPos);
– Guarantee a srcOp[srcPos] to dstOp[dstPos] edge exists
34. Whole graph operations
• Graph();
– Construct an empty Graph
• void done();
– Topo sort and type check
• Graph(Graph const& thinGraph, bool forSpu);
– Partitioning constructor
• BinStream& operator << (BinStream&, Graph const&);
– Put to a BinStream (cheap)
• BinStream& operator >> (BinStream&, Graph&);
– Get from a BinStream (cheap)
• void expand(bool forSpu, Environment const& env);
– Expand, insert clones, etc.
35. Graph states and conversions
• Start with a “thin” graph
– Leader plus one representative node and dataslice
– Operators tagged with a locus and expansion rule
– Outputs can have multiple consumers
• Partition into leader-side & node-side subsets
• Expand based on loci and system topology
– Duplicate operators, adjust in and out arities, add sites
– Expand edges: fan-in, fan-out, parallel
– Introduce clones as needed
36. Graph overlay
• Template object publicly derived from Graph
• Macro hides lots of template boilerplate
• User-supplied types for parallel vectors
– MyOperator ovOp[OperatorIndex]
– MyEdgeIn ovIn[EdgeInIndex]
– MyEdgeOut ovOut[EdgeOutIndex]
• Constructor shares vectors and LiteralTable