John Yates has been a programmer since 1970. In 2010, he was working at Netezza where they needed a new data analytics architecture. He proposed using a family of statically typed acyclic data flow graph intermediate representations (IRs). This involved representing the query plan as a graph of operators with data flowing along edges. The graph could then be optimized and executed in parallel across cluster nodes. Yates provided an initial implementation of this approach to prove the concept, which helped convince Netezza to adopt it. He then led the development of the new Marlin architecture at Netezza based on this IR approach.
2. Order of presentation
• Who am I and why am I here?
• 2010: Netezza needs a new architecture
• A family of statically typed acyclic DFG IRs
• (Time permitting: Some engineering details)
• Q&A
3. “Who am I and why am I here?”
(with apologies to Adm. Stockdale)
4. 1970: Maybe I’ll be a programmer
• NYC hippie, ponytail, curled handlebar mustache
• Liberal arts high school, lousy student
• Wanted to build things, real things
• Computers seemed interesting and intuitive
• Luckily in 1970 programmers were scarce
5. 40 years…
– 1970: learning the craft, various jobs (all in assembler)
– 1978: Digital Equipment Corp
• Pascal frontend, dynamic programming code selector
– 1983: Apollo Computer
• Designed RISC ISP w/ explicit parallel dispatch (pre-VLIW)
• Lead architect for RISC backend optimizer; built team
• 1st commercial: SSA IR, SW pipeliner, lattice const prop
– 1992: Binary translation: DEC (sw), Chromatic (hw-support)
• More SSA IR, lowering; built teams; lot of patents (many hw)
– 1999: Everfile - NFS-like Win32 internet file system
– 2002: Netezza, badge #26
• Storage: compression, indices, access methods, txns, CBTs
7. Data parallel analytics engine
• Data partitioned across a cluster of nodes
– Multiple “slices” per node to exploit multi-core
• Execution model:
– Leader accepts query, produces an execution plan
– Leader broadcasts plan’s parallel components
– Cluster performs data parallel work
– Leader performs work requiring a single locus
• Competition: Teradata, Greenplum, DB2, …
10. Garth’s incomplete Marlin vision
• What is the real input to the interpreter?
• How do we get from query plan to that form?
[Diagram: PG plan → Split → Bcast → Interpret (faster?) on the leader and on N workers; the step from query plan to interpretable form is labeled an “unspecified miracle”, with a “multi-core?” question mark.]
11. A family of statically typed
acyclic data flow graph IRs
13. Graph
• Operators
– Label names a function
– Edge connections in and out
• Edges
– Directed (“dataflow”)
14. Dataflow
• Dataflow machines
– Apply history, wisdom, insights to the interpreter
• Value semantics
– All edges carry data
– No other kinds of edges (i.e. no anti-dependence)
– No updatable shared state (i.e. no store)
• Expose all opportunities for concurrency
15. Acyclic
• No backedges ≡ no cycles ☺
• Can exploit topological ordering
– Fact propagation: rDFS (forward) or DFS (reverse)
– No iteration, guaranteed termination
– Linear algorithms, O(graph)
16. Statically typed
• Edges initially have unknown type
• A well-formed graph can be statically typed
– Linear pass over topologically ordered Operators
– Assign edge types per Operator descriptors
– Inconsistencies can be diagnosed and reported
17. A family of … IRs
• Well-nested subsets of edge type vocabularies
• Constraining edge types constrains operators
[Diagram: PG plan → Split → Bcast → Interpret on the leader and on N workers. Three “Lower and Opt” stages, driven by Tree, Graph1, and Graph2 patterns written in a common pattern notation, carry the IR from high-level tree (tuples) through high-level graph (tuples) and mid-level graph (nullable values) down to low-level graph (values); a “Topo expand, insert CLONEs” step precedes each Interpret.]
18. Nothing convinces like working code
• First delivery
– Table-driven operator semantics
– Utilities: build, edit & expand
– Topological sort
– Type check & report errors
[Diagram: a graph assembler consumes a graph assembly program; “Topo expand, insert CLONEs” steps feed Split → Bcast → Interpret on the leader and on the N workers.]
19. Sold!
• Working code rendered my successive-lowerings idea credible
• Overall Marlin added ~10 engineers; I got 3
• My team got its first end-to-end test case working
[Diagram: the successive-lowering pipeline of slide 17: PG plan → Split → Bcast, three “Lower and Opt” stages driven by Tree/Graph1/Graph2 patterns, “Topo expand, insert CLONEs”, and Interpret on the leader and on the N workers.]
20. IBM killed the Marlin program…
• Marlin was a clean-up project promising…
– Performance and shorter development cycles
– But no new features or functionality
• It is always hard to fund significant clean-up
– Especially if not legitimately tied to a coveted feature
• Harder if your company is under duress
• Harder still if DB2 is gunning for your headcount
23. Why clone?
• After expansion all edges are point-to-point
– No output is multiply-consumed
• Chunk handoff along an edge becomes trivial
– Think C++11’s new move semantics
• So only clones implement reference counting
24. Broadcast
• Serialize / deserialize
• On the network, size matters
• Graph object
– Small number of scalar members
– Handful of C++ vectors (some ephemeral)
– Position independent (no pointers in vectors)
25. No pointers
• Pointers index the linear address space
– Implicit context (there is only one address space)
• Unsigned as vector index
– User must provide explicit context (vector base)
– 32 bit indices are ½ the size of 64 bit pointers
– Position independence simplifies serialization
26. The graph object
• Exposed read-only data
– Vector of Operator objects
– Vector of EdgeIn objects
– Vector of EdgeOut objects
– Literal table and pool
• Private data (may be missing or elided)
– Vector of EdgeIn next links
– Vector of Operator BreadCrumbs
33. Thin graph construction
• graph.add(BreadCrumb, Op, Locus, Expansion, unsigned nVarIn = 0, unsigned nVarOut = 0);
– Add an Operator and its Edge resources
• graph.connect(OperatorIndex srcOp, unsigned srcPos, OperatorIndex dstOp, unsigned dstPos);
– Guarantee a srcOp[srcPos] to dstOp[dstPos] edge exists
34. Whole graph operations
• Graph();
– Construct an empty Graph
• void done();
– Topo sort and type check
• Graph(Graph const& thinGraph, bool forSpu);
– Partitioning constructor
• BinStream& operator << (BinStream&, Graph const&);
– Put to a BinStream (cheap)
• BinStream& operator >> (BinStream&, Graph&);
– Get from a BinStream (cheap)
• void expand(bool forSpu, Environment const& env);
– Expand, insert clones, etc.
35. Graph states and conversions
• Start with a “thin” graph
– Leader plus one representative node and dataslice
– Operators tagged with a locus and expansion rule
– Outputs can have multiple consumers
• Partition into leader-side & node-side subsets
• Expand based on loci and system topology
– Duplicate operators, adjust in and out arities, add sites
– Expand edges: fan-in, fan-out, parallel
– Introduce clones as needed
36. Graph overlay
• Template object publicly derived from Graph
• Macro hides lots of template boilerplate
• User-supplied types for parallel vectors
– MyOperator ovOp[OperatorIndex]
– MyEdgeIn ovIn[EdgeInIndex]
– MyEdgeOut ovOut[EdgeOutIndex]
• Constructor shares vectors and LiteralTable