This talk was given by Sean Owen at the 10th meeting on February 3rd, 2014.
Having collected Big Data, organizations are now keen on data science and “Big Learning”. Much of the focus has been on data science as exploratory analytics: offline, in the lab. However, turning that work into a production-ready, large-scale operational analytics system remains a difficult and ad hoc endeavor, especially when real-time answers are required. Design patterns for effective implementations are emerging: they take advantage of relaxed assumptions, adopt a new tiered "lambda" architecture, and pick the right scale-friendly algorithms. Drawing on experience from customer problems and the open source Oryx project at Cloudera, this session will provide examples of operational analytics projects in the field, and present a reference architecture and algorithm design choices for a successful implementation.
6. Data Science Is Exploratory Analytics?
www.tc.umn.edu/~zief0002/Comparing-Groups/blog.html
thenextweb.com/microsoft/2013/07/08/microsoft-brings-the-office-store-to-22-new-markets-adds-power-bi-an-intelligence-tool-to-office-365/
8. Example:
• Search, ML over Patient Data
• MapReduce for indexing, learning
• HBase for storage and fast access
• Also: Storm for incremental update
• And: relational DB for most recent derived data
• API façade for input; API for querying learning
[Diagram labels: Engineering, Machine Learning]
engineering.cerner.com/2013/02/near-real-time-processing-over-hadoop-and-hbase/
17. Gaps to fill, and Goals
• Model Building
  • Large-scale
  • Continuous
  • Apache Hadoop™-based
  • Few, good algorithms
• Model Serving
  • Real-time query
  • Real-time update
• Algorithms
  • Parallelizable
  • Updateable
  • Works on diverse input
• Interoperable
  • PMML model format
  • Simple REST API
  • Open source
21. Two Layers
• Computation Layer
  • Java-based server process
  • Client of Hadoop 2.x
  • Periodically builds “generation” from recent data and past model
  • Baby-sits MapReduce* jobs (or, locally in-core)
  • Publishes models
• Serving Layer
  • Apache Tomcat™-based server process
  • Consumes models from HDFS (or local FS)
  • Serves queries from model in memory
  • Updates from new input
  • Also writes input to HDFS
  • Replicas for scale
* Apache Spark later
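The split between the two layers can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the class names `Storage`, `ComputationLayer` and `ServingLayer` are hypothetical, an in-memory object stands in for HDFS, and a simple popularity count stands in for a real model. It is not Oryx code; it only shows the flow of input, generations and in-memory updates described on the slide.

```python
from collections import Counter


class Storage:
    """Stand-in for HDFS: an append-only input log plus the latest published model."""

    def __init__(self):
        self.input_log = []
        self.model = None
        self.generation = 0


class ComputationLayer:
    """Periodically rebuilds a model 'generation' from all stored input."""

    def __init__(self, storage):
        self.storage = storage

    def build_generation(self):
        # Batch rebuild over everything collected so far (the slow, exact path).
        counts = Counter(item for _user, item in self.storage.input_log)
        self.storage.generation += 1
        self.storage.model = dict(counts)  # "publish" the new model
        return self.storage.generation


class ServingLayer:
    """Answers queries from a model held in memory and updates it from new input."""

    def __init__(self, storage):
        self.storage = storage
        self.generation = 0
        self.model = Counter()

    def load_latest(self):
        # Consume the most recently published model, if it is newer than ours.
        if self.storage.model is not None and self.storage.generation > self.generation:
            self.model = Counter(self.storage.model)
            self.generation = self.storage.generation

    def add_input(self, user, item):
        self.storage.input_log.append((user, item))  # persisted for the next batch build
        self.model[item] += 1  # cheap in-memory update between generations

    def top_items(self, n):
        return [item for item, _count in self.model.most_common(n)]
```

In this sketch the serving layer stays responsive between batch builds because `add_input` touches only the in-memory model, while the computation layer periodically replaces that approximation with a model rebuilt from the full log.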
22. Collaborative Filtering : ALS
• Alternating Least Squares
• Latent-factor model
• Accepts implicit or explicit feedback
• Real-time update via fold-in of input
• No cold-start
• Parallelizable
[Figure: matrix factors X and Yᵀ]
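A minimal sketch of the idea, assuming explicit ratings only and a plain regularized least-squares formulation: each user row of X is solved with Y held fixed, then each item row of Y with X held fixed. The function names, the regularization weight `lam` and the `fold_in` helper are assumptions for illustration; the fold-in shown is a generic least-squares solve against the current item factors and is not necessarily the approximation used in Oryx.

```python
import random


def solve(a, b):
    """Solve the small dense system a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def solve_factor(ratings, fixed, k, lam):
    """Regularized least-squares factor for one row; ratings maps a key of `fixed` to a value."""
    a = [[lam if i == j else 0.0 for j in range(k)] for i in range(k)]
    b = [0.0] * k
    for key, r in ratings.items():
        y = fixed[key]
        for i in range(k):
            b[i] += r * y[i]
            for j in range(k):
                a[i][j] += y[i] * y[j]
    return solve(a, b)


def als(by_user, by_item, k=2, lam=0.1, iterations=20, seed=0):
    rng = random.Random(seed)
    X = {u: [rng.random() for _ in range(k)] for u in by_user}
    Y = {i: [rng.random() for _ in range(k)] for i in by_item}
    for _ in range(iterations):
        for u, r in by_user.items():  # fix Y; every user row is an independent solve
            X[u] = solve_factor(r, Y, k, lam)
        for i, r in by_item.items():  # fix X; every item row is an independent solve
            Y[i] = solve_factor(r, X, k, lam)
    return X, Y


def fold_in(new_ratings, Y, k=2, lam=0.1):
    """Approximate factor for new input against the current item factors, without a rebuild."""
    return solve_factor(new_ratings, Y, k, lam)


def predict(x, y):
    return sum(a * b for a, b in zip(x, y))
```

Because each row solve depends only on the fixed factor matrix, the inner loops are independent of one another, which is what makes the method easy to parallelize. The same per-row solve is what lets a serving process update a single user from new input without waiting for the next full build.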
24. Classification / Regression : RDF
• Random Decision Forests
• Ensemble method
• Numeric, categorical features and target
• Very parallel
• Nodes updateable
• Works well on many problems
[Figure: decision tree with tests "age > 30", "female?", "income > 20000" and leaves Yes, Yes, Yes, No]
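A compact sketch of the ensemble idea, under stated assumptions: numeric features only, a categorical target, Gini impurity as the split criterion, a bootstrap sample per tree and a random feature subset per split. All function names and parameters are hypothetical, and class labels must not be tuples, because tuples represent internal nodes here.

```python
import random
from collections import Counter


def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def majority(labels):
    return Counter(labels).most_common(1)[0][0]


def build_tree(rows, labels, rng, depth=0, max_depth=3):
    # Leaf: pure node or depth limit reached; predict the majority label.
    if depth == max_depth or len(set(labels)) == 1:
        return majority(labels)
    n_features = len(rows[0])
    # Each split considers only a random subset of the features.
    candidates = rng.sample(range(n_features), max(1, int(n_features ** 0.5)))
    best = None
    for f in candidates:
        for threshold in sorted({row[f] for row in rows}):
            left = [i for i, row in enumerate(rows) if row[f] <= threshold]
            right = [i for i, row in enumerate(rows) if row[f] > threshold]
            if not left or not right:
                continue
            score = (len(left) * gini([labels[i] for i in left])
                     + len(right) * gini([labels[i] for i in right]))
            if best is None or score < best[0]:
                best = (score, f, threshold, left, right)
    if best is None:  # no usable split on the sampled features
        return majority(labels)
    _, f, threshold, left, right = best
    return (f, threshold,
            build_tree([rows[i] for i in left], [labels[i] for i in left],
                       rng, depth + 1, max_depth),
            build_tree([rows[i] for i in right], [labels[i] for i in right],
                       rng, depth + 1, max_depth))


def predict_tree(tree, row):
    while isinstance(tree, tuple):  # internal node: (feature, threshold, left, right)
        f, threshold, left, right = tree
        tree = left if row[f] <= threshold else right
    return tree


def build_forest(rows, labels, n_trees=25, seed=0):
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]  # bootstrap sample
        forest.append(build_tree([rows[i] for i in idx], [labels[i] for i in idx], rng))
    return forest


def predict_forest(forest, row):
    # The ensemble answer is a majority vote over the individual trees.
    return majority([predict_tree(tree, row) for tree in forest])
```

Each tree is built from its own sample and never consults the others, so the trees can be built in parallel; that independence is the property the slide calls "very parallel".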
25. PMML
• Predictive Modeling Markup Language
• XML-based format for predictive models
• Standardized by Data Mining Group (www.dmg.org)
• Wide tool support
<PMML xmlns="http://www.dmg.org/PMML-4_1"
      version="4.1">
  <Header copyright="www.dmg.org"/>
  <DataDictionary numberOfFields="5">
    <DataField name="temperature"
               optype="continuous"
               dataType="double"/>
    …
  </DataDictionary>
  <TreeModel modelName="golfing"
             functionName="classification">
    <MiningSchema>
      <MiningField name="temperature"/>
      …
    </MiningSchema>
    <Node score="will play">
      <Node score="will play">
        <SimplePredicate field="outlook"
                         operator="equal"
                         value="sunny"/>
        …
      </Node>
    </Node>
  </TreeModel>
</PMML>
www.dmg.org/v4-1/TreeModel.html
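Because PMML is plain XML, a consumer can read a model with a standard XML parser. The sketch below uses Python's `xml.etree.ElementTree` on a small hand-written document in the style of the slide; that document, its field values and the `matches` and `score` helpers are assumptions for illustration, and the document has not been validated against the PMML schema. Only `True` and `SimplePredicate` with `operator="equal"` are handled, and PMML's missing-value and no-true-child strategies are ignored.

```python
import xml.etree.ElementTree as ET

NS = {"p": "http://www.dmg.org/PMML-4_1"}

# Hypothetical, hand-written model; not the elided document from the slide.
PMML_DOC = """<PMML xmlns="http://www.dmg.org/PMML-4_1" version="4.1">
  <Header/>
  <DataDictionary numberOfFields="2">
    <DataField name="outlook" optype="categorical" dataType="string"/>
    <DataField name="play" optype="categorical" dataType="string"/>
  </DataDictionary>
  <TreeModel modelName="golfing" functionName="classification">
    <MiningSchema>
      <MiningField name="outlook"/>
      <MiningField name="play" usageType="predicted"/>
    </MiningSchema>
    <Node score="will play">
      <True/>
      <Node score="no play">
        <SimplePredicate field="outlook" operator="equal" value="rain"/>
      </Node>
      <Node score="will play">
        <SimplePredicate field="outlook" operator="equal" value="sunny"/>
      </Node>
    </Node>
  </TreeModel>
</PMML>"""


def matches(node, record):
    """Return True if the node's predicate holds for the record."""
    if node.find("p:True", NS) is not None:
        return True
    pred = node.find("p:SimplePredicate", NS)
    if pred is not None and pred.get("operator") == "equal":
        return str(record.get(pred.get("field"))) == pred.get("value")
    return False  # other predicates and operators are not handled in this sketch


def score(node, record):
    """Follow the first matching child at each level; return the deepest matching node's score."""
    for child in node.findall("p:Node", NS):
        if matches(child, record):
            return score(child, record)
    return node.get("score")


root = ET.fromstring(PMML_DOC)
tree_model = root.find("p:TreeModel", NS)
top_node = tree_model.find("p:Node", NS)
```

Calling `score(top_node, {"outlook": "rain"})` walks the tree and returns the `score` attribute of the deepest node whose predicate matched.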
26. HTTP REST API
• Convention for RPC-like request / response
• HTTP verbs, transport
• GET : query
• POST : add input
• Easy from browser, CLI, Java, Python, Scala, etc.
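A client for this convention needs little more than an HTTP library. The sketch below uses only Python's standard library and assumes a `text/plain` response with one quoted item and score per line, as in the sample exchange on this slide. The function names and the `base_url` parameter are hypothetical, and since the slide gives no path for POST, the caller supplies the full URL.

```python
import csv
import io
import urllib.parse
import urllib.request


def parse_recommendations(body):
    """Parse lines such as '"Fleet Foxes",0.7905' into (item, score) pairs."""
    return [(row[0], float(row[1])) for row in csv.reader(io.StringIO(body)) if row]


def recommend(base_url, user_id):
    """GET : query. Fetch recommendations for one user."""
    url = base_url.rstrip("/") + "/recommend/" + urllib.parse.quote(user_id, safe="")
    with urllib.request.urlopen(url) as response:
        return parse_recommendations(response.read().decode("utf-8"))


def post_input(url, body):
    """POST : add input. The endpoint URL is supplied by the caller."""
    request = urllib.request.Request(
        url,
        data=body.encode("utf-8"),
        method="POST",
        headers={"Content-Type": "text/plain"},
    )
    with urllib.request.urlopen(request) as response:
        return response.status
```

Keeping the parsing separate from the network call means the response handling can be exercised without a running server.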
GET /recommend/jwills
HTTP/1.1 200 OK
Content-Type: text/plain
"Ray LaMontagne",0.951
"Fleet Foxes",0.7905
"The National",0.688
"Shearwater",0.3017
27. Wish List
• Revamp workflow
  • Oozie?
  • Spark / Crunch-like API, not raw M/R
• De-emphasize model building
  • Well-solved
  • Bring your own
  • PMML, etc.
• Emphasize integration
  • More component-ized
  • Less black-box service
  • More “push” options
    • Flume?
    • Kafka?
  • “Pull” options
    • Hive / Impala?