How does using Hadoop in the cloud for data analytics fit into the practice of continuous deployment? This talk also looks at how the CAP theorem can guide matching data access patterns with appropriate data frameworks.
1. Hidden Gems found with Hadoop
Paco Nathan
Lead, Analytics team @ IMVU.com
2. Ask Questions Early…
‣ How do Hadoop and “Big Data” fit into the practice
of Continuous Deployment?
‣ Why don’t we simply load all our data into Oracle,
then generate reports and spreadsheets as needed?
‣ Given all the conflicting “NoSQL” options, how
does an engineer design an effective data store?
‣ Is there one framework we can just buy to resolve
all these annoying data issues?
‣ What kinds of analytics work can be performed
using Hadoop in the cloud?
‣ Is IMVU currently hiring? ☺
3. Continuous Deployment
• IMVU: ~50 engineers work in parallel, builds push live every ~8 minutes
• depends on “immune system” regression checks, progressive roll-outs
• dedication to transparency and metrics: data-intensive company culture
• extensive use of customer experiments (A/B testing) on millions of users
• instrumentation, alerting, strict discipline on config and resource usage
• Ops excellence, plus big investment in a finely tuned production environment
http://www.quora.com/What-are-best-examples-of-companies-using-continuous-deployment
http://www.slideshare.net/bgdurrett/3-reasons-you-should-use-continuous-deployment
http://www.startuplessonslearned.com/2009/06/why-continuous-deployment.html
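To make the “immune system” idea concrete, here is a minimal sketch in Python of a post-deploy regression check; the metric source and rollback hook (get_metric, rollback) and the thresholds are hypothetical placeholders, not IMVU’s actual tooling.
```python
# A minimal sketch of an "immune system" regression check, assuming a
# hypothetical metrics source and rollback hook (get_metric and rollback
# are placeholders, not IMVU's actual tooling).
import time

ERROR_RATE_CEILING = 0.01    # assumed threshold: >1% errors trips the check
CHECK_WINDOW_SEC = 300       # watch each push for five minutes

def get_metric(name):
    """Placeholder: fetch a live production metric (e.g. from graphite)."""
    raise NotImplementedError

def rollback(build_id):
    """Placeholder: revert production to the previous known-good build."""
    raise NotImplementedError

def watch_deploy(build_id):
    """Hold the pipeline while the new build proves healthy; else roll back."""
    deadline = time.time() + CHECK_WINDOW_SEC
    while time.time() < deadline:
        if get_metric("http.error_rate") > ERROR_RATE_CEILING:
            rollback(build_id)   # regression detected: revert immediately
            return False
        time.sleep(10)           # poll on a coarse interval
    return True                  # build survived the window; keep it live
```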
5. Data Analytics
• data usage downstream from production cluster is a lower priority
• industry truism: data usage downstream almost never trumps
the priority of direct revenue transactions
• even so, business strategy depends on data analytics – which in
practice, at scale, must live downstream from transactions
• however, data analytics jobs tend to break the extensive
testing/monitoring work which makes continuous deployment possible:
- mission critical code which can’t be verified readily by unit tests
- “slow queries” trip immune system, signaling regressions
- likewise for large data transfers within production cluster
- tightly configured environment vs. elastic resource needs
6. How Did We Get Here?
• big Internet successes after 1997 holiday season…
AMZN, EBAY, then GOOG, Inktomi (YHOO Search)
• consider how, among tech firms, this metric:
annual revenue per customer / operational data store size
dropped more than 100x within a few years after 1997
• “conventional wisdom” of RDBMS and BI tools became
much less viable; however, the business cadre which came of
age when “spreadsheets were new” tends to carry along
too much inertia to confront these issues proactively
• on the one hand, storage and processing costs plummeted…
on the other hand, we must now work much smarter
to extract ROI from “Big Data”, so methods must adapt
• MapReduce and the Hadoop open source stack grew
directly out of this context… but they only solve part
of these problems
7. CAP Theorem
• Eric Brewer, 2000: “You can have at most two of these properties for
any shared-data system … the choice of which feature to discard
determines the nature of your system.”
• direct revenue apps in consumer Internet require consistency and
partition tolerance
• data analytics jobs for business uses generally require availability and
eventual consistency, but tend not to tolerate highly partitioned data
• ETL becomes an Achilles’ heel for “Lean Startup™”:
‣ agile/experiment-driven/scale-out, which leads to…
‣ provably-hard-to-detect metadata drift, which leads to…
‣ high-risk technical debt
[CAP triangle diagram: vertices C (strong consistency), A (high availability), P (partition tolerance); RDBMS sits on the C–A edge, eventual consistency on the A–P edge]
https://www.cs.berkeley.edu/~brewer/cs262b-2004/PODC-keynote.pdf
http://www.julianbrowne.com/article/viewer/brewers-cap-theorem
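To make the trade-off concrete (an illustration added here, not from the deck): a toy quorum-replicated store. With N replicas, write quorum W, and read quorum R, choosing W + R > N forces read and write sets to overlap, so reads always see the latest write (strong consistency); smaller quorums need fewer replicas to respond, which favors availability but lets reads go stale (eventual consistency). All names are invented for the sketch.
```python
# Toy quorum-replicated store illustrating the CAP trade-off.
# All class and method names are illustrative, not any real framework's API.

class Replica:
    def __init__(self):
        self.data = {}   # key -> (version, value)

class QuorumStore:
    def __init__(self, n=3, w=2, r=2):
        # w + r > n: read and write quorums overlap -> strong consistency.
        # w + r <= n (e.g. w=1, r=1): a read may miss the latest write ->
        # eventual consistency, but fewer replicas must respond (availability).
        self.replicas = [Replica() for _ in range(n)]
        self.w, self.r = w, r
        self.version = 0

    def put(self, key, value):
        self.version += 1
        for replica in self.replicas[: self.w]:     # ack from W replicas
            replica.data[key] = (self.version, value)

    def get(self, key):
        # Return the newest value observed among R replicas.
        answers = [rep.data.get(key, (0, None)) for rep in self.replicas[-self.r:]]
        return max(answers)[1]

store = QuorumStore(n=3, w=2, r=2)   # overlapping quorums
store.put("balance", 100)
assert store.get("balance") == 100   # guaranteed overlap sees the write
```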
8. Data Access Patterns
• design patterns: originated in consensus negotiation
for architecture, then software engineering
• consider the corollaries in large scale data wrangling…
• essential advice:
select data frameworks based on your data access patterns
• in other words, decouple usage based on need –
to avoid “one size fits all” blockers
• let’s review some examples…
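A hedged sketch of what “decouple usage based on need” can look like in code: each access pattern gets its own backing framework behind one thin facade. The client classes stand in for real drivers (e.g. a MySQL driver, redis-py, an S3 client) and are placeholders, not real APIs.
```python
# Illustrative facade: route each data access pattern to the framework
# that fits it, rather than forcing one store to serve every pattern.

class LedgerClient:        # CA-leaning RDBMS: financial transactions
    def record(self, txn): ...

class CounterClient:       # AP-leaning store (e.g. Redis): counters, sets
    def incr(self, name): ...

class ArchiveClient:       # AP-leaning durable storage (e.g. S3): archives
    def put(self, key, blob): ...

class DataServices:
    """One entry point, many frameworks, chosen per access pattern."""
    def __init__(self):
        self.ledger = LedgerClient()     # needs consistency above all
        self.counters = CounterClient()  # needs availability above all
        self.archive = ArchiveClient()   # needs durability, partition tolerance

    def charge_customer(self, txn):
        self.ledger.record(txn)          # revenue path: strong consistency

    def count_event(self, name):
        self.counters.incr(name)         # metrics path: eventual is fine

    def archive_log(self, key, blob):
        self.archive.put(key, blob)      # archival path: cheap and durable
```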
9. Access Patterns ↔ Frameworks
(x marks the CAP property forfeited for each pattern)
financial transactions              general ledger in RDBMS            CAx
ad-hoc queries                      RDS (hosted MySQL)                 CAx
reporting, dashboards               like Pentaho                       CAx
log rotation/persistence            like Riak                          xxP
search indexes                      like Lucene, Solr                  xAP
static content, archives            S3 (durable storage)               xAP
customer facts                      like Redis, Membase                xAP
distributed counters, locks, sets   like Redis                         xAP*
data objects CRUD                   key/value – like NoSQL on MySQL    CxP
authoritative metadata              like Zookeeper                     CxP
data prep, modeling at scale        like Hadoop/Hive/Cascading + R     CxP
graph analysis                      like Hadoop + Redis + Gephi        CxP
data marts                          like Hadoop/Hive/HBase             CxP
* Redis’s rich atomic operations make it a special case (see editor’s notes)
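For the “distributed counters, locks, sets” row, a brief redis-py sketch of the atomic operations involved; the key names are made up for illustration, and it assumes a Redis server on localhost.
```python
# Sketch of the "distributed counters, locks, sets" pattern with redis-py.
import redis

r = redis.Redis(host="localhost", port=6379)

# Atomic counter: INCR is safe under concurrent writers.
r.incr("events:page_views")

# Simple lock: SETNX takes the lock only if no one else holds it.
if r.setnx("lock:nightly_rollup", "worker-1"):
    try:
        pass  # ... do the exclusive work ...
    finally:
        r.delete("lock:nightly_rollup")  # release the lock

# Set membership: track distinct visitors without double counting.
r.sadd("visitors:2011-06-01", "user:42")
print(r.scard("visitors:2011-06-01"))    # distinct visitor count
```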
11. Data Prep → Modeling at Scale
Analytics jobs performed in the cloud with Hadoop, R, etc.:
• log clean-up, sessionization
• roll-ups, slices, sampling, data cubes, visualizations
• language identification, key phrase extraction
• co-occurrence analysis, topic trending
• custom search indexes
• random forests and other classifiers
• connected components, effects across social graph
• virtual economy metrics
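As one concrete example from the list above, a minimal sessionization reducer in the Hadoop Streaming style: it assumes input lines of “user_id<TAB>epoch_seconds”, already sorted by user and then time (Hadoop’s shuffle sorts by user_id as key; the secondary sort on timestamp would be configured in the streaming job). The field layout is assumed for illustration.
```python
#!/usr/bin/env python
# Minimal sessionization reducer for Hadoop Streaming.
# Assumes input lines "user_id<TAB>epoch_seconds", sorted by user then time.
import sys

SESSION_GAP = 30 * 60          # a 30-minute silence starts a new session

prev_user, prev_ts, session_id = None, None, 0

for line in sys.stdin:
    user, ts = line.rstrip("\n").split("\t")
    ts = int(ts)
    if user != prev_user:
        session_id = 0          # first event for this user
    elif ts - prev_ts > SESSION_GAP:
        session_id += 1         # gap too long: start a new session
    print("%s\t%d\t%d" % (user, session_id, ts))
    prev_user, prev_ts = user, ts
```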
Business use cases:
• customer segmentation
• retention models
• anti-fraud
• content recommendation
• ad optimization
[figure: sample visualizations of client event streams (event labels such as NUI:DressUpMode, Add Buddy, Website Login), with a few details stripped out]
12. Finding Hidden Gems…
[architecture diagram: business transactions land in MySQL partitions; ETL stages data objects into S3; Hadoop builds cloud-based data marts feeding Hive, RDS, Lucene/Solr, Redis, Gephi, and R; downstream business use cases include reporting, ad-hoc queries, search, cache/recommenders, graph analysis, sessionization, predictive modeling, social graph and factor analysis, time series, data visualization, and data services]
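A hedged sketch of the ETL hand-off shown in the diagram: dump one MySQL partition and stage it in S3 for downstream Hadoop jobs. The database, table, and bucket names are made up, and it uses mysqldump plus a boto3-style S3 upload purely as an illustration.
```python
# Illustrative ETL hand-off: dump one MySQL partition, stage it in S3,
# where downstream Hadoop jobs can read it. Names are invented for the sketch;
# credentials/host flags for mysqldump are omitted.
import subprocess
import boto3

DB, TABLE, BUCKET = "imvu_prod", "transactions_p42", "example-etl-staging"
dump_path = "/tmp/%s.sql.gz" % TABLE

# 1. Extract: dump the partition and compress it.
with open(dump_path, "wb") as out:
    dump = subprocess.Popen(["mysqldump", DB, TABLE], stdout=subprocess.PIPE)
    subprocess.check_call(["gzip", "-c"], stdin=dump.stdout, stdout=out)
    dump.wait()

# 2. Load: stage the dump in S3 for the Hadoop cluster to consume.
boto3.client("s3").upload_file(dump_path, BUCKET, "raw/%s.sql.gz" % TABLE)
```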
14. Analytics Team, IMVU.com
• IMVU: 90 employees in Bay Area, $40MM annual rev
• largest virtual goods catalog: 6MM+ UGC items
- Best Places to Work in Bay Area, 2011 & 2010
- Red Herring Global 100 Tech Startup, 2010
- Inc. 500, 2010
http://www.imvu.com/jobs/
@pacoid
Editor’s notes
• prior teams: Jive, ShareThis, Adknowledge, HeadCase
• worked with Ray while at ShareThis on our DW and recommender systems
• 5 years experience with AWS, some at firms 100% in the cloud

• I’m a big believer in asking many questions up-front…
• this talk examines how Hadoop fits into what IMVU is famous for: continuous deployment
• we do some critical work with large data sets which makes an RDBMS a poor fit

• CD allows many developers to respond to immediate needs, to experiment frequently
• transparency, measurement, and consistent data-driven decisions are absolutely requisite

• in short, we can handle in minutes or hours what other firms might take days, weeks, or months to do
• decisions and actions are highly distributed, and the engineering process is well disciplined

• my team works in Analytics, and our data usage is at a different priority than our production cluster
• this is generally true throughout the industry
• business strategy depends on analytics – which, at scale, must live downstream from transactions
• however, analytics work tends to break what we’ve so carefully instrumented

• how did we reach this condition?
• 1997Q4 through 1998Q1, AMZN/EBAY/GOOG/YHOO redefined data use
• revenue/data size, as a metric, fell through the floor
• previous practices in relational DBs and BI no longer worked so well

• the CAP theorem explains an inherent conflict there…
• Internet transactions tend to need different kinds of data management than analytics
• partitioned databases are a solution for one aspect, but in turn cause ETL to become a huge problem

• fortunately, there are patterns we can use to engineer around those conflicts…
• provided that you don’t buy into “one size fits all” sales rhetoric from DB vendors
• design patterns help here: choose data frameworks which fit your data access patterns

• hopefully, this table states the CAP forfeits correctly – email me corrections, please :)
• some of these patterns migrate well to the cloud; you may miss a big opportunity if you don’t

• Redis is notable; its rich, flexible atomic operations lend themselves to non-shared cases
• let’s drill down into the Hadoop use cases…

• here are a variety of kinds of data preparation, discovery, modeling, and visualization for which my teams have used Hadoop and AWS
• generally the goal is to automate almost all of the work, as “pipelines”, and deliver data products/data services
• these visualizations are actually some recent products from my team (less a few details stripped out)…
• geolocation, topic trending from text analytics, measuring effects across the social graph, and comparing features vs. retention

• BTW, Redis provides an excellent “left brain” to pair with the Hadoop “right brain”
• this is not strictly “real-time” analytics, but it is cost-effective and follows guidance from CAP
• in other words, scalable data frameworks based on prevalent data access patterns

• here is some further reading, which I will post online…