3. First Principles
we are taught to think of computing resources
in terms of the von Neumann architecture –
in other words, we characterize computing
resources by CPU, RAM, and I/O
Saturday, 13 July 13
7. First Principles
back in the day, all the tables required for a
given database could fit onto one computer,
with one memory space and one file space
• okay, maybe the CPU was multi-core…
• okay, maybe RAM paged out to virtual memory…
• okay, maybe the disks were in a RAID config…
or there were extra caches, separate busses, etc. –
but essentially those were incremental extensions
to a von Neumann architecture…
a machine created in his image, if you will
NB: credit should go to Eckert and Mauchly, inventors of the ENIAC
11. First Principles
a generation of computer scientists has been
taught to think “relational” – data on a DB server
RDBMS made sense, with their indexes, b-trees,
normal forms, etc.
Q: need to query bigger data?
A: simple, buy or lease a bigger DB server
12. issues confronted:
“Data becomes too complex for
ONE computer, ONE model,
ONE expert…”
trends observed:
“Historical arc: 1996 - 2013,
rise of machine data, scale-out,
and algorithmic modeling…”
“The management problem is
about multi-disciplinary teams
and learning curves…”
13. Q3 1997: inflection point
Four independent teams were working toward horizontal
scale-out of workflows based on commodity hardware
This effort prepared the way for huge Internet successes
in the 1997 holiday season… AMZN, EBAY, Inktomi
(YHOO Search), then GOOG
MapReduce and the Apache Hadoop open source stack
emerged from this
15. RDBMS
Circa 1996: pre-inflection point – “throw it over the wall”
[diagram: Stakeholders (Product → strategy; Engineering → requirements;
BI Analysts) → SQL Query → result sets → Excel pivot tables → PowerPoint
slide decks; optimized code → Web App ↔ Customers → transactions → RDBMS]
17. RDBMS
Circa 2001: post- big ecommerce successes – “data products”
[diagram: Customers → customer transactions → RDBMS → SQL Query →
result sets; Logs → event history → Algorithmic Modeling → recommenders
+ classifiers → Web Apps; DW/ETL; Middleware; servlets, models;
aggregation → dashboards for Stakeholders: Product, Engineering, UX]
18. Primary Sources
Amazon
“Early Amazon: Splitting the website” – Greg Linden
glinden.blogspot.com/2006/02/early-amazon-splitting-website.html
eBay
“The eBay Architecture” – Randy Shoup, Dan Pritchett
addsimplicity.com/adding_simplicity_an_engi/2006/11/you_scaled_your.html
addsimplicity.com.nyud.net:8080/downloads/eBaySDForum2006-11-29.pdf
Inktomi (YHOO Search)
“Inktomi’s Wild Ride” – Eric Brewer (0:05:31 ff)
youtu.be/E91oEn1bnXM
Google
“Underneath the Covers at Google” – Jeff Dean (0:06:54 ff)
youtu.be/qsan-GQaeyk
perspectives.mvdirona.com/2008/06/11/JeffDeanOnGoogleInfrastructure.aspx
MIT Media Lab
“Social Information Filtering for Music Recommendation” – Pattie Maes
pubs.media.mit.edu/pubs/papers/32paper.ps
ted.com/speakers/pattie_maes.html
19. Three broad categories of data
Curt Monash, 2010
dbms2.com/2010/01/17/three-broad-categories-of-data
• Human/Tabular data – human-generated data which fits
well into tables/arrays
• Human/Nontabular data – all other data generated by
humans
• Machine-Generated data
Now let’s add IoT:
• A/D conversion for sensors
Machine Data
20. Data Products
Data Jujitsu
DJ Patil
O’Reilly, 2012
amazon.com/dp/B008HMN5BE
Building Data Science Teams
DJ Patil
O’Reilly, 2011
amazon.com/dp/B005O4U3ZE
22. Workflow
Circa 2013: clusters everywhere – “optimize topologies”
[diagram: Customers ↔ Web Apps, Mobile, etc. → Log Events → In-Memory
Data Grid (near time) and Hadoop, etc. (batch) under a Cluster Scheduler;
RDBMS; DW; History → Data Products; roles: Data Scientist, App Dev, Ops,
Domain Expert; activities: discovery + modeling, dashboard metrics,
business process, optimized capacity, taps; s/w dev, data science; Planner;
introduced capability vs. existing SDLC; Prod, Eng; “Use Cases Across
Topologies”]
24. Modeling
back in the day, we worked with practices based on
data modeling
1. sample the data
2. fit the sample to a known distribution
3. ignore the rest of the data
4. infer, based on that fitted distribution
that served well with ONE computer, ONE analyst,
ONE model… just throw away annoying “extra” data
circa late 1990s: machine data, aggregation, clusters, etc.
algorithmic modeling displaced data modeling
because the data won’t fit on one computer anymore
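the four-step data-modeling recipe above can be sketched in a few lines of Python (a hypothetical example using only the standard library; the population and the query threshold are invented for illustration):

```python
import random
import statistics

random.seed(42)

# pretend this is "all the data": a population of observations
population = [random.gauss(100.0, 15.0) for _ in range(100_000)]

# 1. sample the data
sample = random.sample(population, 1_000)

# 2. fit the sample to a known distribution (here, a normal)
fitted = statistics.NormalDist(statistics.mean(sample), statistics.stdev(sample))

# 3. ignore the rest of the data (the other 99,000 rows are never read again)

# 4. infer, based on that fitted distribution
p_over_130 = 1.0 - fitted.cdf(130.0)
print(f"estimated P(x > 130) = {p_over_130:.4f}")
```

the point of the slide is exactly step 3: this workflow only works when throwing away most of the data is acceptable.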
25. Two Cultures
“A new research community using these tools sprang up. Their goal
was predictive accuracy. The community consisted of young computer
scientists, physicists and engineers plus a few aging statisticians.
They began using the new tools in working on complex prediction
problems where it was obvious that data models were not applicable:
speech recognition, image recognition, nonlinear time series prediction,
handwriting recognition, prediction in financial markets.”
Statistical Modeling: The Two Cultures
Leo Breiman, 2001
bit.ly/eUTh9L
this paper chronicled a sea change from data modeling practices
(silos, manual process) to the rising use of algorithmic modeling
(machine data for automation/optimization)
26. issues confronted:
“Data becomes too complex for
ONE computer, ONE model,
ONE expert…”
trends observed:
“Historical arc: 1996 - 2013,
rise of machine data, scale-out,
and algorithmic modeling…”
“The management problem is
about multi-disciplinary teams
and learning curves…”
27. Algorithmic Modeling
“The trick to being a scientist is to be open to using
a wide variety of tools.” – Breiman
circa 2001: Random Forest, bootstrap aggregation, etc.,
yield dramatic increases in predictive power over earlier
modeling such as Logistic Regression
major learnings from the Netflix Prize: the power of
ensembles, model chaining, etc.
the problems at hand have become simply too big and too
complex for ONE distribution, ONE model, ONE team…
an overall history of data science:
forbes.com/sites/gilpress/2013/05/28/a-very-short-history-of-data-science/
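bootstrap aggregation, mentioned above, can be sketched in pure Python – this is not Random Forest itself, just the resampling-and-voting idea, on an invented toy task (1-D decision stumps, made-up noise rate):

```python
import random
random.seed(1)

def make_data(n):
    """Toy 1-D task: true label is 1 when x > 0.5, with 20% label noise."""
    data = []
    for _ in range(n):
        x = random.random()
        y = 1 if x > 0.5 else 0
        if random.random() < 0.2:
            y = 1 - y
        data.append((x, y))
    return data

def fit_stump(data):
    """Pick the threshold minimizing training error: predict 1 when x > t."""
    best_t, best_err = 0.0, float("inf")
    for t, _ in data:
        err = sum((1 if x > t else 0) != y for x, y in data)
        if err < best_err:
            best_t, best_err = t, err
    return best_t

def bagged_predict(stumps, x):
    """Majority vote across the ensemble."""
    votes = sum(1 if x > t else 0 for t in stumps)
    return 1 if votes * 2 > len(stumps) else 0

train = make_data(200)
# bootstrap aggregation: each stump sees a resample drawn with replacement
stumps = [fit_stump([random.choice(train) for _ in train]) for _ in range(25)]

test = make_data(2000)
acc = sum(bagged_predict(stumps, x) == y for x, y in test) / len(test)
print(f"bagged-stump accuracy: {acc:.3f}")
```

with 20% label noise the best achievable accuracy here is about 0.80; averaging over bootstrap resamples stabilizes the learned threshold.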
28. Why Do Ensembles Matter?
[illustration: “The World…” vs. “The World… per Data Modeling”]
29. Ensemblers of Fortune
Breiman: “a multiplicity of data models”
BellKor team: 100+ individual models in 2007 Progress Prize
while the process of combining models adds complexity
(making it more difficult to anticipate or explain predictions)
accuracy may increase substantially
Ensemble Learning: Better Predictions Through Diversity
Todd Holloway
ETech (2008)
abeautifulwww.com/EnsembleLearningETech.pdf
The Story of the Netflix Prize: An Ensemblers Tale
Lester Mackey
National Academies Seminar, Washington, DC (2011)
stanford.edu/~lmackey/papers/
31. issues confronted:
“Data becomes too complex for
ONE computer, ONE model,
ONE expert…”
trends observed:
“Historical arc: 1996 - 2013,
rise of machine data, scale-out,
and algorithmic modeling…”
“The management problem is
about multi-disciplinary teams
and learning curves…”
32. Q: Can I simply hire one rockstar
data scientist to cover all this kind
of work?
33. A: No, multi-disciplinary work
requires teams.
A: Hire leads who speak the lingo
of each domain.
A: Hire people who cover 2+ roles,
when possible.
34. What is needed most?
approximately 80% of the costs for data-related projects
gets spent on data preparation – mostly on cleaning up
data quality issues: ETL, log files, etc., generally by socializing
the problem
unfortunately, data-related budgets tend to go into
frameworks which can only be used after clean-up
most valuable skills:
‣ learn to use programmable tools that prepare data
‣ learn to understand the audience and their priorities
‣ learn to generate compelling data visualizations (e.g., d3js.org)
‣ learn to estimate the confidence for reported results
‣ learn to automate work, making analysis repeatable
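as a small illustration of “programmable tools that prepare data”, here is a sketch in Python’s standard library – the log format, field names, and cleaning rules are all hypothetical:

```python
import csv
import io
import re

# messy, invented event log: inconsistent case, stray spaces, malformed rows
raw = """\
user_id,event,ts
42, Login ,2013-07-13T10:00:00
43,PURCHASE,2013-07-13T10:05:00
,login,2013-07-13T10:06:00
44,purchase,not-a-timestamp
45,logout,2013-07-13T10:09:00
"""

TS = re.compile(r"^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}$")

def clean(rows):
    """Normalize fields and drop rows that fail basic quality checks."""
    for row in rows:
        uid = row["user_id"].strip()
        event = row["event"].strip().lower()
        ts = row["ts"].strip()
        if uid and event and TS.match(ts):   # drop malformed rows
            yield {"user_id": int(uid), "event": event, "ts": ts}

cleaned = list(clean(csv.DictReader(io.StringIO(raw))))
print(cleaned)   # 3 well-formed rows survive out of 5
```

scripted clean-up like this is what makes the analysis repeatable – the same rules run again tomorrow, unlike manual spreadsheet fixes.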
35. Statistical Thinking
employing a mode of thought which includes both logical and analytical reasoning:
evaluating the whole of a problem, as well as its component parts; attempting
to assess the effects of changing one or more variables
this approach attempts to understand not just problems and solutions,
but also the processes involved and their variances
particularly valuable in Big Data work when combined with hands-on experience in
physics – roughly 50% of my peers come from physics or physical engineering…
programmers typically don’t think this way…
however, both systems engineers and data scientists must
[diagram: Process → Variation → Data → Tools]
37. issues confronted:
“Data becomes too complex for
ONE computer, ONE model,
ONE expert…”
trends observed:
“Historical arc: 1996 - 2013,
rise of machine data, scale-out,
and algorithmic modeling…”
“The management problem is
about multi-disciplinary teams
and learning curves…”
38. Culture
Notes from the Mystery Machine Bus
Steve Yegge, Google
goo.gl/SeRZa
consider these perspectives in light of Conway’s Law…
“conservatism” vs. “liberalism”
(mostly) Enterprise vs. (mostly) Start-Up
risk management vs. customer experiments
assurance vs. flexibility
well-defined schema vs. schema follows code
explicit configuration vs. convention
type-checking compiler vs. interpreted scripts
wants no surprises vs. wants no impediments
Java, Scala, Clojure, etc. vs. PHP, Ruby, Python, etc.
Cascading, Scalding, Cascalog, etc. vs. Hive, Pig, Hadoop Streaming, etc.
39. Two Avenues to the App Layer…
[chart axes: scale ➞, complexity ➞]
Enterprise: must contend with complexity at scale everyday…
incumbents extend current practices and infrastructure
investments – using J2EE, ANSI SQL, SAS, etc. – to migrate
workflows onto Apache Hadoop while leveraging existing staff
Start-ups: crave complexity and scale to become viable…
new ventures move into Enterprise space to compete using
relatively lean staff, while leveraging sophisticated
engineering practices, e.g., Cascalog and Scalding
40. Learning Curves
difficulties in the commercial use of distributed systems
often get represented as issues of managing complexity
much of the risk in managing a data science team is about
budgeting for the learning curve: some orgs practice a kind of
engineering “conservatism”, with highly structured process
and strictly codified practices – people learn a few things
well, then avoid having to struggle with learning many new
things perpetually…
that approach leads to enormous teams and low ROI
ultimately, the challenge is about managing learning curves
within a social context
[chart axes: scale ➞, complexity ➞]
41. Learning Curves vs. Technology Selections
ultimately, the challenge is about managing
learning curves within a social context
[chart axes: est. cost of individual learning, initial impl ➞;
est. cost of team re-learning, lifecycle ➞]
some technologies constrain the need to learn, others accelerate
re-learning prior business logic… choose the latter, FTW!
42. issues confronted:
“Orders of magnitude increase, more
complexity and variety, widespread
disruption…”
?
43. Big Data?
we’re just getting started:
• ~12 exabytes/day, jet turbines on commercial flights
• Google self-driving cars, ~1 Gb/s per vehicle
• National Instruments initiative: Big Analog Data™
• 1m resolution satellites skyboximaging.com
• open resource monitoring reddmetrics.com
• Sensing XChallenge nokiasensingxchallenge.org
consider the implications of Nike, Jawbone, etc.,
plus the secondary/tertiary effects of Google Glass
7+ billion people, instrumented better than … how we
have Nagios instrumenting our web servers right now
technologyreview.com/...
45. Business Disruption
Geoffrey Moore
Mohr Davidow Ventures, author Crossing The Chasm / Hadoop Summit, 2012:
what Amazon did to the retail sector… has put the entire Global 1000
on notice over the next decade… data as the major force… mostly
through apps – verticals, leveraging domain expertise
Michael Stonebraker
INGRES, PostgreSQL, Vertica, VoltDB, Paradigm4, etc. / XLDB, 2012:
complex analytics workloads are now displacing SQL as the basis
for Enterprise apps
Larry Page
CEO, Google / Wired, 2013:
create products and services that are 10 times better than the
competition… thousand-percent improvement requires rethinking
problems entirely, exploring the edges of what’s technically possible,
and having a lot more fun in the process
46. A Thought Exercise
consider that when a company like Caterpillar moves
into data science, they won’t be building the world’s
next search engine or social network
they will most likely be optimizing supply chain,
optimizing fuel costs, automating data feedback
loops integrated into their equipment…
that’s a $50B company,
in a market segment worth $250B
upcoming: tractors as drones –
guided by complex, distributed data apps
Operations Research –
crunching amazing amounts of data
48. issues confronted:
“Orders of magnitude increase, more
complexity and variety, widespread
disruption…”
trends observed:
“Functional programming for Big Data”
“Just enough math, but not calculus”
“Enterprise Data Workflow design pattern”
“Cluster computing, smarter scheduling”
49. Languages
JVM-based languages became popular for Big Data open source
technologies:
• partly because YHOO adopted Hadoop, etc.
• partly because Enterprise IT shops have J2EE expertise
• partly because of functional languages: Clojure, Scala
JVM has its drawbacks, especially for low-latency use cases
ample use of languages such as Python and Erlang in Big Data
practices, plus keep in mind that Google uses lots of C++
Functional Thinking
Neal Ford
youtu.be/plSZIkLodDM
50. Architecture
Rich Hickey, Nathan Marz, Stuart Sierra, et al.:
functional programming to help reduce
costs over time
technical debt? this is how an organization
builds a culture to avoid it
Conway's Law corollary: model teams and
communication based on properties of the
desired architecture
“Out of the Tar Pit”
Moseley & Marks, 2006
goo.gl/SKspn
“A relational model of data for
large shared data banks”
Edgar Codd, 1970
dl.acm.org/citation.cfm?id=362685
Rich Hickey, infoq.com/presentations/Simple-Made-Easy
51. Pattern Language
structured method for solving large, complex design
problems, where the syntax of the language ensures
the use of best practices – i.e., conveying expertise
[diagram: example Cascading flow – “employee” and “quarterly sales”
sources → Join → Count → “leads”; PMML classifier; bonus allocation;
Failure Traps]
A Pattern Language
Christopher Alexander, et al.
amazon.com/dp/0195019199
55. WordCount – Cascalog / Clojure
(ns impatient.core
  (:use [cascalog.api]
        [cascalog.more-taps :only (hfs-delimited)])
  (:require [clojure.string :as s]
            [cascalog.ops :as c])
  (:gen-class))
(defmapcatop split [line]
  "reads in a line of string and splits it by regex"
  (s/split line #"[\[\](),.)\s]+"))
(defn -main [in out & args]
  (?<- (hfs-delimited out)
       [?word ?count]
       ((hfs-delimited in :skip-header? true) _ ?line)
       (split ?line :> ?word)
       (c/count ?count)))
; Paul Lam
; github.com/Quantisan/Impatient
[flow diagram: Document Collection → Tokenize → GroupBy token → Count (M/R) → Word Count]
56. WordCount – Cascalog / Clojure
github.com/nathanmarz/cascalog/wiki
• implements Datalog in Clojure, with predicates backed
by Cascading – for a highly declarative language
• run ad-hoc queries from the Clojure REPL –
approx. 10:1 code reduction compared with SQL
• composable subqueries, used for test-driven development
(TDD) practices at scale
• Leiningen build: simple, no surprises, in Clojure itself
• more new deployments than other Cascading DSLs –
Climate Corp is largest use case: 90% Clojure/Cascalog
• has a learning curve, limited number of Clojure developers
• aggregators are the magic, and those take effort to learn
57. WordCount – Scalding / Scala
import com.twitter.scalding._

class WordCount(args : Args) extends Job(args) {
  Tsv(args("doc"),
      ('doc_id, 'text),
      skipHeader = true)
    .read
    .flatMap('text -> 'token) {
      text : String => text.split("[ \\[\\](),.]")
    }
    .groupBy('token) { _.size('count) }
    .write(Tsv(args("wc"), writeHeader = true))
}
58. WordCount – Scalding / Scala
github.com/twitter/scalding/wiki
• extends the Scala collections API so that distributed lists
become “pipes” backed by Cascading
• code is compact, easy to understand
• nearly 1:1 between elements of conceptual flow diagram
and function calls
• extensive libraries are available for linear algebra, abstract
algebra, machine learning – e.g., Matrix API, Algebird, etc.
• significant investments by Twitter, Etsy, eBay, etc.
• great for data services at scale
• less learning curve than Cascalog
59. Functional Programming for Big Data
WordCount with token scrubbing…
Apache Hive: 52 lines HQL + 8 lines Python (UDF)
compared to
Scalding: 18 lines Scala/Cascading
functional programming languages help reduce
software engineering costs at scale, over time
60. Case Studies: LinkedIn and eBay
“Scalable and Flexible Machine Learning With Scala @ LinkedIn”
Vitaly Gordon, LinkedIn
Chris Severs, eBay
slideshare.net/VitalyGordon/scalable-and-flexible-machine-learning-with-scala-linkedin
…be sure to read slides 8-16 !!
61. Lambda Architecture
Big Data
Nathan Marz, James Warren
Manning, 2013
manning.com/marz
• batch layer (immutable data, idempotent ops)
• serving layer (to query batch)
• speed layer (transient, cached “real-time”)
• combining results
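the combining step of the Lambda pattern can be sketched in miniature in Python – a toy, invented example (page counts and keys are made up), not code from the book:

```python
# batch layer: a view recomputed periodically over the immutable history
batch_view = {"page/a": 1200, "page/b": 300}

# speed layer: transient, cached counts for events not yet absorbed by batch
speed_view = {"page/a": 17, "page/c": 4}

def query(key):
    """Serving layer: combine batch totals with real-time increments."""
    return batch_view.get(key, 0) + speed_view.get(key, 0)

print(query("page/a"))   # batch 1200 + speed 17 = 1217
```

when the batch layer catches up, the corresponding speed-layer entries are simply discarded – the batch view is always recomputable from the raw events.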
62. issues confronted:
“Orders of magnitude increase, more
complexity and variety, widespread
disruption…”
trends observed:
“Functional programming for Big Data”
“Just enough math, but not calculus”
“Enterprise Data Workflow design pattern”
“Cluster computing, smarter scheduling”
63. Where To Start?
having a solid background in statistics becomes vital,
because it provides formalisms for what we’re trying
to accomplish at scale
along with that, some areas of math help – regardless
of the “calculus threshold” invoked at many universities…
linear algebra e.g., crunching algorithms efficiently for large-scale apps
graph theory e.g., representation of problems in a calculable language
abstract algebra e.g., probabilistic data structures in streaming analytics
topology e.g., determining the underlying structure of the data
operations research e.g., techniques for optimization … in other words, ROI
64. in a nutshell, most of what we do is to…
‣ estimate probability
‣ calculate analytic variance
‣ manipulate dimension and complexity
‣ make use of learning theory
+ collaborate with DevOps, Stakeholders
+ reduce our work into cron entries
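the first two bullets above can be illustrated with a binomial proportion – the counts are hypothetical, and the interval is the standard normal approximation:

```python
import math

# hypothetical: 1,342 conversions observed in 48,000 sessions
k, n = 1342, 48000

p_hat = k / n                          # estimate probability
var = p_hat * (1 - p_hat) / n          # analytic variance of the estimator
se = math.sqrt(var)
lo, hi = p_hat - 1.96 * se, p_hat + 1.96 * se   # ~95% interval

print(f"p = {p_hat:.4f} ± {1.96 * se:.4f}  (95% CI: {lo:.4f}–{hi:.4f})")
```

reporting the interval, not just the point estimate, is what “estimate the confidence for reported results” means in practice.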
Where is the Science in Data Science?
[background: a wall of raw event-log names, e.g. UniqueRegistration,
Launchedgameslobby, NUI:TutorialMode, BuyanItem:web, FeedPet,
Addressspaceremaining:512M, WebsiteLogin, NUI:DressUpMode, …]
65. Dimension and Complexity
techniques for manipulating order complexity:
dimensional reduction…
with clustering as a common case
e.g., you may have 100 million HTML docs,
but only ~10K useful keywords within them
low-dimensional structure, PCA, etc.
linear algebra techniques: eigenvalues,
matrix factorization, etc.
this is an area ripe for much advancement
in algorithms research, near-term
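the eigenvalue angle can be sketched without any library: power iteration finding the first principal component of synthetic 2-D data with nearly 1-D structure (the data and iteration counts are invented for illustration):

```python
import random
random.seed(7)

# synthetic 2-D data lying almost on the line y = 2x
pts = []
for _ in range(500):
    t = random.gauss(0, 1)
    pts.append((t + random.gauss(0, 0.05), 2 * t + random.gauss(0, 0.05)))

# center, then form the 2x2 covariance matrix
n = len(pts)
mx = sum(x for x, _ in pts) / n
my = sum(y for _, y in pts) / n
cxx = sum((x - mx) ** 2 for x, _ in pts) / n
cyy = sum((y - my) ** 2 for _, y in pts) / n
cxy = sum((x - mx) * (y - my) for x, y in pts) / n

# power iteration: repeatedly applying the covariance matrix converges
# on its leading eigenvector, i.e., the first principal component
vx, vy = 1.0, 0.0
for _ in range(100):
    wx = cxx * vx + cxy * vy
    wy = cxy * vx + cyy * vy
    norm = (wx * wx + wy * wy) ** 0.5
    vx, vy = wx / norm, wy / norm

print(f"first principal component ≈ ({vx:.3f}, {vy:.3f})")
```

the recovered direction is close to (1, 2)/√5: the 2-D cloud is essentially one-dimensional, which is the whole point of dimensional reduction.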
66. Learning Theory
in general, apps alternate between learning patterns/rules
and retrieving similar things…
statistical learning theory – rigorous, prevents you
from making billion dollar mistakes, probably our future
machine learning – scalable, enables you to make
billion dollar mistakes, much commercial emphasis
supervised vs. unsupervised
arguably, optimization is a parent category
once Big Data projects get beyond merely digesting
log files, optimization will likely become the next
overused buzzword :)
67. Algorithms
many algorithm libraries used today are based on implementations
back when people used DO loops in FORTRAN, 30+ years ago
MapReduce is Good Enough?
Jimmy Lin, U Maryland
umiacs.umd.edu/~jimmylin/publications/Lin_BigData2013.pdf
astrophysics and genomics are light years ahead in sophisticated
algorithms work – as Breiman suggested in 2001 – which may take
a few years to percolate into industry
other game-changers:
• streaming algorithms, sketches, probabilistic data structures
• significant “Big O” complexity reduction (e.g., skytree.net)
• better architectures and topologies (e.g., GPUs and CUDA)
• partial aggregates – parallelizing workflows
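for a flavor of the “streaming algorithms, sketches, probabilistic data structures” bullet, here is a minimal count-min sketch in Python – the width, depth, and event stream are arbitrary choices for illustration:

```python
import hashlib

class CountMin:
    """Minimal count-min sketch: approximate frequencies in bounded memory.
    Estimates never undercount; they may overcount due to hash collisions."""

    def __init__(self, width=256, depth=4):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _cells(self, item):
        # one independent-ish hash per row, via a per-row salt
        for row in range(self.depth):
            h = hashlib.blake2b(item.encode(), salt=bytes([row])).digest()
            yield row, int.from_bytes(h[:8], "big") % self.width

    def add(self, item):
        for row, col in self._cells(item):
            self.table[row][col] += 1

    def estimate(self, item):
        # the minimum over rows is the least-collided count
        return min(self.table[row][col] for row, col in self._cells(item))

cms = CountMin()
stream = ["login"] * 500 + ["purchase"] * 40 + ["logout"] * 3
for event in stream:
    cms.add(event)

print(cms.estimate("login"), cms.estimate("purchase"), cms.estimate("logout"))
```

memory is fixed (depth × width counters) no matter how long the stream runs – exactly the trade worth making on machine data.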
68. Make It Sparse…
also, take a moment to check this out…
(IMHO most interesting algorithm work recently)
QR factorization of a “tall-and-skinny” matrix
• used to solve many data problems at scale,
e.g., PCA, SVD, etc.
• numerically stable with efficient implementation
on large-scale Hadoop clusters
suppose that you have a sparse matrix of customer
interactions where there are 100MM customers,
with a limited set of outcomes…
cs.purdue.edu/homes/dgleich
stanford.edu/~arbenson
github.com/ccsevers/scalding-linalg
David Gleich, slideshare.net/dgleich
Tristan Jehan
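the tall-and-skinny QR idea can be sketched in pure Python as a simplified, single-process analogue of the MapReduce TSQR algorithm (matrix sizes and block counts are arbitrary; a production version would use a numerically stronger QR, per the Gleich/Benson references above):

```python
import random
random.seed(0)

def qr(a):
    """Thin QR of an m x n matrix (m >= n) via modified Gram-Schmidt."""
    m, n = len(a), len(a[0])
    q = [row[:] for row in a]
    r = [[0.0] * n for _ in range(n)]
    for j in range(n):
        for i in range(j):
            r[i][j] = sum(q[k][i] * q[k][j] for k in range(m))
            for k in range(m):
                q[k][j] -= r[i][j] * q[k][i]
        r[j][j] = sum(q[k][j] ** 2 for k in range(m)) ** 0.5
        for k in range(m):
            q[k][j] /= r[j][j]
    return q, r

# a "tall-and-skinny" matrix: many rows, few columns
A = [[random.gauss(0, 1) for _ in range(3)] for _ in range(1000)]

# TSQR idea: factor independent row-blocks (map), stack the small R
# factors, then factor that stack once more (reduce) -- only R is kept here
blocks = [A[i:i + 250] for i in range(0, 1000, 250)]
stacked = [row for blk in blocks for row in qr(blk)[1]]
_, R = qr(stacked)

# sanity check: R'R must equal A'A, since both equal the Gram matrix
def gram(M):
    n = len(M[0])
    return [[sum(row[i] * row[j] for row in M) for j in range(n)]
            for i in range(n)]

ga, gr = gram(A), gram(R)
err = max(abs(ga[i][j] - gr[i][j]) for i in range(3) for j in range(3))
print(f"max |A'A - R'R| = {err:.2e}")
```

each block's QR is independent, which is what makes the map step embarrassingly parallel on a Hadoop cluster.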
69. Sparse Matrix Collection
for when you really need a wide variety of sparse matrix examples…
University of Florida Sparse Matrix Collection
cise.ufl.edu/research/sparse/matrices/
Tim Davis, U Florida
cise.ufl.edu/~davis/welcome.html
Yifan Hu, AT&T Research
www2.research.att.com/~yifanhu/
70. A Winning Approach…
consider that if you know priors about a system, then
you may be able to leverage low dimensional structure
within high dimensional data… that works much, much
better than sampling!
1. real-world data
2. graph theory for representation
3. sparse matrix factorization for production work
4. cost-effective parallel processing
for a machine learning app at scale
71. Suggested Reading
A Few Useful Things to Know about Machine Learning
Pedro Domingos, U Washington
homes.cs.washington.edu/~pedrod/papers/cacm12.pdf
Probabilistic Data Structures for Web Analytics and Data Mining
Ilya Katsov, Grid Dynamics
highlyscalable.wordpress.com/2012/05/01/probabilistic-structures-web-analytics-data-mining/
72. issues confronted:
“Orders of magnitude increase, more
complexity and variety, widespread
disruption…”
trends observed:
“Functional programming for Big Data”
“Just enough math, but not calculus”
“Enterprise Data Workflow design pattern”
“Cluster computing, smarter scheduling”
73. Anatomy of an Enterprise app
Definition: a typical Enterprise workflow which crosses through
multiple departments, languages, and technologies…
[diagram: data sources → ETL → data prep → predictive model → end uses]
• ANSI SQL for ETL, SAS for predictive models – most of the licensing costs…
• J2EE for business logic – most of the project costs…
79. Anatomy of an Enterprise app
Cascading allows multiple departments to combine their workflow components
into an integrated app – one among many, typically – based on 100% open source
a compiler sees it all…
[diagram: data sources (source taps for Cassandra, JDBC, Splunk, etc.) →
ETL (Lingual: DW → ANSI SQL) → data prep (business logic in Java, Clojure,
Scala, etc.) → predictive model (Pattern: SAS, R, etc. → PMML) → end uses
(sink taps for Memcached, HBase, MongoDB, etc.)]
cascading.org

80. Anatomy of an Enterprise app
ETL via Lingual – ANSI SQL embedded in a Cascading flow:
FlowDef flowDef = FlowDef.flowDef()
  .setName( "etl" )
  .addSource( "example.employee", emplTap )
  .addSource( "example.sales", salesTap )
  .addSink( "results", resultsTap );
SQLPlanner sqlPlanner = new SQLPlanner()
  .setSql( sqlStatement );
flowDef.addAssemblyPlanner( sqlPlanner );

81. Anatomy of an Enterprise app
scoring a predictive model via Pattern – PMML embedded in a Cascading flow:
FlowDef flowDef = FlowDef.flowDef()
  .setName( "classifier" )
  .addSource( "input", inputTap )
  .addSink( "classify", classifyTap );
PMMLPlanner pmmlPlanner = new PMMLPlanner()
  .setPMMLInput( new File( pmmlModel ) )
  .retainOnlyActiveIncomingFields();
flowDef.addAssemblyPlanner( pmmlPlanner );

82. Anatomy of an Enterprise app
visual collaboration for the business logic is a great
way to improve how teams work together
[diagram: example flow – “employee” and “quarterly sales” sources → Join →
Count → “leads”; PMML classifier; bonus allocation; Failure Traps]

83. Anatomy of an Enterprise app
multiple departments, working in their respective
frameworks, integrate results into a combined app,
which runs at scale on a cluster… business process
combined in a common space (DAG) for flow
planners, compiler, optimization, troubleshooting,
exception handling, notifications, security audit,
performance monitoring, etc.
cascading.org
84. issues confronted:
“Orders of magnitude increase, more
complexity and variety, widespread
disruption…”
trends observed:
“Functional programming for Big Data”
“Just enough math, but not calculus”
“Enterprise Data Workflow design pattern”
“Cluster computing, smarter scheduling”
85. Clusters
a little secret: people like me make a good living by
leveraging high ROI apps based on clusters, and so
the execs agree to build out more data centers…
clusters for Hadoop/Hive/HBase, clusters for Memcached,
for Cassandra, for MySQL, for Storm, for Nginx, etc.
this becomes expensive!
a single class of workloads on a given cluster is simpler
to manage, but terrible for utilization
leveraging VMs and various notions of “cloud” helps
Cloudera, Hortonworks, and probably EMC soon, sell a notion
of “Hadoop as OS” – All your workloads are belong to us
regardless of how architectures change, death and taxes
will endure: servers fail, and data must move
Google Data Center, Fox News
~2002
86. Three Laws, or more?
meanwhile, architectures evolve toward much, much larger data…
pistoncloud.com/ ...
Rich Freitas, IBM Research
Q: what kinds of evolution in topologies could this imply?
87. Topologies
Hadoop and other topologies arose from a need for fault-
tolerant workloads, leveraging horizontal scale-out based
on commodity hardware
because the data won’t fit on one computer anymore
a variety of Big Data technologies has since emerged,
which can be categorized in terms of topologies and
the CAP Theorem
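That CAP-based categorization can be sketched as a lookup table. The placements below follow the conventional folk taxonomy – under a partition, “CP” systems sacrifice availability to stay consistent, “AP” systems stay available but may serve stale data, and “CA” marks classic non-partition-tolerant designs – and are approximate and debatable, since many of these systems are tunable.

```java
import java.util.Map;

public class CapSketch {
    // Conventional, approximate CAP placements – a mnemonic, not a spec
    static final Map<String, String> CATEGORY = Map.of(
        "HBase",     "CP",  // one region server owns each row range
        "MongoDB",   "CP",  // primary-based replication
        "Cassandra", "AP",  // eventually consistent by default
        "Riak",      "AP",  // Dynamo-style, eventually consistent
        "CouchDB",   "AP",  // multi-master with conflict resolution
        "MySQL",     "CA"   // classic single-master RDBMS
    );

    public static void main(String[] args) {
        CATEGORY.forEach((system, cap) ->
            System.out.println(system + " -> " + cap));
    }
}
```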
89. issues confronted:
“Orders of magnitude increase, more
complexity and variety, widespread
disruption…”
trends observed:
“Functional programming for Big Data”
“Just enough math, but not calculus”
“Enterprise Data Workflow design pattern”
“Cluster computing, smarter scheduling”
90. Operating Systems, redux
meanwhile, GOOG is 3+ generations ahead,
with much improved ROI on data centers
John Wilkes, et al.
Borg/Omega:“10x” secret sauce
youtu.be/0ZFMlO98Jkc
[charts: Rails CPU load, Memcached CPU load, and Hadoop CPU load
over time t, each on a 0–100% axis, alongside the combined CPU
load (Rails + Memcached + Hadoop)]
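The utilization argument behind those charts can be sketched numerically. The load traces below are invented for illustration, chosen only to mimic the shapes above (Rails and Memcached peak during the day, Hadoop batch jobs off-peak): dedicated clusters must each be sized for their own peak, while a shared cluster only needs to cover the peak of the combined load.

```java
public class UtilizationSketch {
    // peak of a single load trace (fraction of one cluster's capacity)
    static double peak(double[] load) {
        double max = 0.0;
        for (double x : load) max = Math.max(max, x);
        return max;
    }

    // peak of the sum of several traces, sampled at the same instants
    static double combinedPeak(double[]... traces) {
        double max = 0.0;
        for (int t = 0; t < traces[0].length; t++) {
            double sum = 0.0;
            for (double[] trace : traces) sum += trace[t];
            max = Math.max(max, sum);
        }
        return max;
    }

    public static void main(String[] args) {
        double[] rails     = {0.9, 0.7, 0.3, 0.2, 0.8, 0.9};
        double[] memcached = {0.8, 0.6, 0.2, 0.2, 0.7, 0.8};
        double[] hadoop    = {0.1, 0.2, 0.9, 0.9, 0.2, 0.1};

        // sizing for three dedicated clusters vs. one shared cluster
        double dedicated = peak(rails) + peak(memcached) + peak(hadoop);
        double shared    = combinedPeak(rails, memcached, hadoop);

        System.out.printf("dedicated clusters: %.1f units of capacity%n", dedicated);
        System.out.printf("shared cluster:     %.1f units of capacity%n", shared);
    }
}
```

With these invented traces the shared cluster needs roughly a third less capacity – the “10x” claims for Borg/Mesos-style scheduling come from exactly this effect, compounded across many anti-correlated workloads.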
Florian Leibert, Chronos/Mesos @ Airbnb
Mesos, open source cloud OS – like Borg
goo.gl/jPtTP
91. Workflow
[diagram: “Use Cases Across Topologies” – Web Apps, Mobile, etc.;
transactions, content; social interactions; services (near time);
History (batch); RDBMS; Log Events; In-Memory Data Grid; Hadoop,
etc.; Cluster Scheduler; Planner; Data Products; Customers; Prod,
Eng, DW; s/w dev; data science (discovery + modeling); roles: Data
Scientist, App Dev, Ops, Domain Expert; Ops dashboard, metrics,
business process, optimized capacity taps; introduced capability
vs. existing SDLC]
Circa 2013: clusters everywhere – Four-Part Harmony
92. Workflow
[same “Use Cases Across Topologies” diagram as slide 91]
Circa 2013: clusters everywhere – Four-Part Harmony
1. End Use Cases, the drivers
93. Workflow
[same “Use Cases Across Topologies” diagram as slide 91]
Circa 2013: clusters everywhere – Four-Part Harmony
2. A new kind of team process
94. Workflow
[same “Use Cases Across Topologies” diagram as slide 91]
Circa 2013: clusters everywhere – Four-Part Harmony
3. Abstraction layer as optimizing
middleware, e.g., Cascading
95. Workflow
[same “Use Cases Across Topologies” diagram as slide 91]
Circa 2013: clusters everywhere – Four-Part Harmony
4. Distributed OS, e.g., Mesos