Presented at Splunk .conf 2012 in Las Vegas. Includes an overview of the Cascading app based on City of Palo Alto open data. PS: email me if you need a different format than Keynote: @pacoid or pnathan AT concurrentinc DOT com
A Data Scientist And A Log File Walk Into A Bar...
1. “A Data Scientist And A Log File Walk Into A Bar…”
Paco Nathan
Concurrent, Inc.
pnathan@concurrentinc.com
@pacoid
Copyright ©2012, Concurrent, Inc.
[diagram: the Word Count flow in Cascading — Document Collection → Tokenize (Regex token) → Scrub token → HashJoin (Left) with Stop Word List (RHS) → GroupBy token → Count → Word Count; M and R mark map and reduce phases]
2. opportunity
Unstructured Data meets Enterprise Scale
1. backstory: how we got here
2. overview: typical use cases
3. example: a Cascading app
3. Intro to Data Science
1. backstory: how we got here
4. inflection point
• huge Internet successes after the 1997 holiday season:
AMZN, EBAY, then GOOG, Inktomi (YHOO Search)
• consider this metric:
annual revenue per customer / amount of data stored
dropped 100x within a few years after 1997
• storage and processing costs plummeted; now we must
work much smarter to extract ROI from Big Data…
our methods must adapt
• the “conventional wisdom” of RDBMS and BI tools became
less viable; the business cadre stayed focused on pivot tables
and pie charts… which tends toward inertia!
• MapReduce and the Hadoop open source stack grew
directly out of that contention… but they only solve portions
massive disruption in retail, advertising, etc.
“All of Fortune 500 is now on notice over the next 10-year period.”
– Geoffrey Moore, 2012 (Mohr Davidow Ventures)
10. data innovation: circa 2013
[diagram: roles and data flow — Domain Experts drive business process and workflows via dashboards and metrics; Data Scientists work discovery, modeling, optimized capacity planning over history and social interaction data; App Devs and Eng deliver data apps (web, mobile, services) and transaction endpoints to customers; Ops run the data access patterns — batch (Hadoop, log events, DW) vs. “real time” (in-memory data grid) — over cluster schedulers and RDBMS]
12. statistical thinking
[diagram labels: Process, Variation, Data, Tools]
employing a mode of thought which includes both logical and analytical reasoning:
evaluating the whole of a problem, as well as its component parts; attempting
to assess the effects of changing one or more variables
this approach attempts to understand not just problems and solutions,
but also the processes involved and their variances
particularly valuable in Big Data work when combined with hands-on experience in
physics – roughly 50% of my peers come from physics or physical engineering…
programmers typically don’t think this way…
however, both systems engineers and data scientists must!
13. references
Leo Breiman, “Statistical Modeling: The Two Cultures”,
Statistical Science, 2001
http://bit.ly/eUTh9L
also check out RStudio:
http://rstudio.org/
http://rpubs.com/
14. most valuable skills
• approximately 80% of the costs for data-related projects
get spent on data preparation – mostly on cleaning up
data quality issues: ETL, log file analysis, etc.
• unfortunately, data-related budgets for many companies tend
to go into frameworks which can only be used after clean up
• most valuable skills:
‣ learn to use programmable tools that prepare data
‣ learn to generate compelling data visualizations
‣ learn to estimate the confidence for reported results
‣ learn to automate work, making analysis repeatable
(visualization example: D3)
the rest of the skills – modeling,
algorithms, etc. – are secondary
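One of those most-valuable skills — estimating the confidence of reported results — can start as small as a percentile bootstrap. A minimal sketch in pure Python (the function name and defaults are illustrative, not from the talk):

```python
import random
import statistics

def bootstrap_ci(sample, stat=statistics.mean, n_boot=2000, alpha=0.05, seed=1):
    """Percentile bootstrap confidence interval for any statistic."""
    rng = random.Random(seed)
    boots = sorted(
        stat([rng.choice(sample) for _ in sample])  # resample with replacement
        for _ in range(n_boot)
    )
    lo = boots[int((alpha / 2) * n_boot)]
    hi = boots[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi
```

Reporting “mean, with a 95% interval” instead of a bare average is exactly the habit the bullet recommends, and it is cheap to automate.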
15. team process
discovery: help people ask the right questions
modeling: allow automation to place informed bets
integration: deliver products at scale to customers
apps: leverage smarts in product features
systems: keep infrastructure running, cost-effective
16. matrix: usage
[matrix: rows = stakeholder, scientist, developer, ops; columns = discovery, modeling, integration, apps, systems]
conceptual tool for managing Data Science teams:
overlay your project requirements (needs)
with your team’s strengths (roles)
that will show very quickly where to focus
NB: bring in individuals who cover 2-3 needs,
particularly for team leads
17. building teams
[matrix: rows = stakeholder, scientist, developer, ops; columns = discovery, modeling, integration, apps, systems]
18. references
DJ Patil, Data Jujitsu, O’Reilly, 2012
http://www.amazon.com/dp/B008HMN5BE
DJ Patil, Building Data Science Teams, O’Reilly, 2011
http://www.amazon.com/dp/B005O4U3ZE
19. Intro to Data Science
2. overview: typical use cases
20. using science in data science
in a nutshell, what we do…
[screenshot: mirrored Splunk console listing game-event log types — e.g. NUI:DressUpMode, Website Login, Unique Registration, Launched games lobby, Chat PublicRoom voice, Customer Made Purchase Cart Page Step 2]
• estimate probability
• calculate analytic variance
• manipulate order complexity
• make use of learning theory
• collab with DevOps, Stakeholders
21. use case: marketing funnel
• must optimize a very large ad spend
• different vendors report different metrics
• seasonal variation distorts performance
• some campaigns are much smaller than others
• hard to predict ROI for incremental spend
approach:
• log aggregation, followed with cohort analysis
• Bayesian point estimates compare different-sized ad tests
• customer lifetime value quantifies ROI of new leads
• time series analysis normalizes for seasonal variation
• geolocation adjusts for regional cost/benefit
• linear programming models estimate elasticity of demand
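The “Bayesian point estimates” bullet can be illustrated with conjugate Beta-Binomial smoothing, which makes small and large ad tests comparable. A sketch in Python — the prior parameters here encode a hypothetical ~2% historical conversion rate, an assumption for illustration:

```python
def smoothed_rate(conversions, impressions, prior_a=2.0, prior_b=98.0):
    """Posterior mean conversion rate under a Beta(prior_a, prior_b) prior.

    Small tests get shrunk toward the prior; large campaigns barely move.
    """
    return (conversions + prior_a) / (impressions + prior_a + prior_b)

small = smoothed_rate(3, 10)     # raw 0.30, smoothed to ~0.045
big = smoothed_rate(400, 5000)   # raw 0.08, smoothed to ~0.079
```

The smoothed estimates rank the big campaign above the lucky 3-for-10 test, which is the point: different-sized tests end up on a common footing.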
22. use case: ecommerce fraud
• sparse data means lots of missing values
• “needle in a haystack” lack of training cases
• answers are available in large-scale batch, but results
are needed in real-time event processing
• not just one pattern to detect – many, ever-changing
approach:
• random forest (RF) classifiers predict likely fraud
• subsample the data to re-balance training sets
• impute missing values based on density functions
• train on massive log files, run on in-memory grid
• adjust metrics to minimize customer support costs
• detect novelty – report anomalies via notifications
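The re-balancing bullet, sketched in pure Python (record layout and field name are hypothetical): down-sample the abundant legitimate class so a classifier sees fraud often enough to learn it.

```python
import random

def rebalance(records, label="fraud", seed=42):
    """Down-sample the majority class to match the rare class 1:1."""
    pos = [r for r in records if r[label]]
    neg = [r for r in records if not r[label]]
    rng = random.Random(seed)
    balanced = pos + rng.sample(neg, len(pos))  # keep all rare cases
    rng.shuffle(balanced)
    return balanced
```

A fixed seed keeps the subsample reproducible, which matters when the training run itself is part of an automated batch workflow.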
23. use case: customer segmentation
• many millions of customers, hard to determine
which features resonate
• multi-modal distributions get obscured by the
practice of calculating an “average”
• not much is known about individual customers
approach:
• connected components for sessionization, determining
uniques from logs
• estimates for age, gender, income, geo, etc.
• clustering algorithms to group customers into market segments
• social graph infers “unknown” relationships
• covariance/heat maps visualize segments vs. feature sets
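“Connected components for sessionization” can be sketched with a union-find over the identifiers each log event carries (cookie, device, account — hypothetical names): any shared identifier collapses two events into one unique visitor.

```python
def count_uniques(events):
    """Count unique visitors; each event is a tuple of identifiers,
    and any shared identifier links events to the same visitor."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for ids in events:
        for other in ids[1:]:
            parent[find(ids[0])] = find(other)  # union into one component

    return len({find(i) for ids in events for i in ids})
```

At production scale the same idea runs as an iterative joins-and-group-bys job over the logs; this in-memory version just shows the invariant.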
24. use case: monetizing content
• need to suggest relevant content which would
otherwise get buried in the back catalog
• big disconnect between inventory and a limited
performance ad market
• enormous amounts of text, hard to categorize
approach:
• text analytics glean key phrases from documents
• hierarchical clustering of char frequencies detects language
• Latent Dirichlet Allocation (LDA) reduces dimensionality to topic models
• recommenders suggest similar topics to customers
• collaborative filters connect known users with less known ones
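The language-detection bullet clusters character frequencies; a simplified stand-in (nearest reference profile by cosine similarity, rather than full hierarchical clustering) shows the core signal:

```python
import math
from collections import Counter

def char_profile(text):
    """Normalized letter-frequency vector for a text."""
    counts = Counter(c for c in text.lower() if c.isalpha())
    total = sum(counts.values()) or 1
    return {c: n / total for c, n in counts.items()}

def cosine(p, q):
    norm = math.sqrt(sum(v * v for v in p.values())) \
         * math.sqrt(sum(v * v for v in q.values()))
    return sum(v * q.get(c, 0.0) for c, v in p.items()) / norm if norm else 0.0

def detect_language(text, profiles):
    """Pick the reference profile most similar to the text's profile."""
    probe = char_profile(text)
    return max(profiles, key=lambda lang: cosine(probe, profiles[lang]))
```

Reference profiles would be built from known-language corpora; letter frequencies alone separate most European languages surprisingly well on longer texts.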
26. Intro to Data Science
3. example: a Cascading app
27. getting started
cascading.org/category/impatient/
28. composition of a workflow
• business process: domain expertise, business trade-offs,
market position, operating parameters, etc.
• API language: Scala, Clojure, Python, Ruby, Java, etc.
…envision whatever else runs in a JVM
• optimize / schedule: major changes in technology now
• physical plan: [diagram: the Word Count flow, as on slide 1]
• compute substrate (“assembler” code): Apache Hadoop, in-memory local mode
…envision GPUs, other frameworks, etc.
• machine data: Splunk, Nagios, Collectd, etc.
29. 1: copy
import java.util.Properties;
import cascading.flow.FlowDef;
import cascading.flow.hadoop.HadoopFlowConnector;
import cascading.pipe.Pipe;
import cascading.property.AppProps;
import cascading.scheme.hadoop.TextDelimited;
import cascading.tap.Tap;
import cascading.tap.hadoop.Hfs;

public class Main
  {
  public static void main( String[] args )
    {
    String inPath = args[ 0 ];
    String outPath = args[ 1 ];

    Properties props = new Properties();
    AppProps.setApplicationJarClass( props, Main.class );
    HadoopFlowConnector flowConnector = new HadoopFlowConnector( props );

    // create the source tap
    Tap inTap = new Hfs( new TextDelimited( true, "\t" ), inPath );

    // create the sink tap
    Tap outTap = new Hfs( new TextDelimited( true, "\t" ), outPath );

    // specify a pipe to connect the taps
    Pipe copyPipe = new Pipe( "copy" );

    // connect the taps, pipes, etc., into a flow
    FlowDef flowDef = FlowDef.flowDef().setName( "copy" )
      .addSource( copyPipe, inTap )
      .addTailSink( copyPipe, outTap );

    // run the flow
    flowConnector.connect( flowDef ).complete();
    }
  }

[diagram: Source → M → Sink]
1 mapper, 0 reducers, ~10 lines of code
30. wait!
ten lines of code
for a file copy…
seems like a lot.
31. same JAR, any scale…
MegaCorp Enterprise IT:
PBs of data
1000+ node private cluster
EVP calls you when the app fails
runtime: days+
Production Cluster:
TBs of data
EMR w/ 50 HPC Instances
Ops monitors results
runtime: hours – days
Staging Cluster:
GBs of data
EMR + 4 Spot Instances
CI shows red or green lights
runtime: minutes – hours
Your Laptop:
MBs of data
Hadoop standalone mode
passes unit tests, or not
runtime: seconds – minutes
33. 3: City of Palo Alto open data
[diagram: CoPA workflow — the CoPA GIS export is split by regex parsers and filters into tree, road, and park branches; the tree branch scrubs species names, HashJoins (Left) with curated Tree Metadata (RHS), and geohashes locations; the road branch joins Road Metadata (RHS) to estimate albedo per road segment, geohashes, and CoGroups with iPhone GPS logs (failure traps catch bad records); checkpoints sit after the regex stages; a final CoGroup on geohash combines tree distance/name and shade into the park recommendation output]
github.com/Cascading/CoPA/wiki
• GIS export for parks, roads, trees (unstructured / open data)
• log files of personalized/frequented locations in Palo Alto via iPhone GPS tracks
• curated metadata, used to enrich the dataset
• could extend via mash-up with many available public data APIs
Enterprise-scale app: road albedo + tree species metadata + geospatial indexing
“Find a shady spot on a summer day to walk near downtown and take a call…”
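Both the tree and road branches geohash their coordinates so the final CoGroup can join on shared cell prefixes. A minimal pure-Python encoder shows what that indexing step produces (the CoPA app itself does this inside the Cascading flow; this sketch is just the algorithm):

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"

def geohash(lat, lon, precision=7):
    """Encode lat/lon as a geohash: interleave longitude and latitude
    bisection bits, then pack each 5 bits into one base-32 character.
    Nearby points share a common prefix, which is what makes the join work."""
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    bits, even = [], True  # a geohash starts with a longitude bit
    while len(bits) < precision * 5:
        if even:
            mid = (lon_lo + lon_hi) / 2
            bits.append(1 if lon >= mid else 0)
            lon_lo, lon_hi = (mid, lon_hi) if lon >= mid else (lon_lo, mid)
        else:
            mid = (lat_lo + lat_hi) / 2
            bits.append(1 if lat >= mid else 0)
            lat_lo, lat_hi = (mid, lat_hi) if lat >= mid else (lat_lo, mid)
        even = not even
    return "".join(
        BASE32[int("".join(map(str, bits[i:i + 5])), 2)]
        for i in range(0, precision * 5, 5)
    )
```

Coordinates near downtown Palo Alto (around 37.445, -122.163) fall into cells beginning with "9q9", so trees and road segments in the same block land in the same hash bucket.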