How does using Hadoop in the cloud for data analytics fit into the practice of continuous deployment? This talk also looks at how the CAP theorem can guide matching data access patterns with appropriate data frameworks.
1. Hidden Gems found with Hadoop
Paco Nathan
Lead, Analytics team @ IMVU.com
2. Ask Questions Early…
‣ How do Hadoop and “Big Data” fit into the practice
of Continuous Deployment?
‣ Why don’t we simply load all our data into Oracle,
then generate reports and spreadsheets as needed?
‣ Given all the conflicting “NoSQL” options, how
does an engineer design an effective data store?
‣ Is there one framework we can just buy to resolve
all these annoying data issues?
‣ What kinds of analytics work can be performed
using Hadoop in the cloud?
‣ Is IMVU currently hiring? ☺
3. Continuous Deployment
• IMVU: ~50 engineers work in parallel, builds push live every ~8 minutes
• depends on “immune system” regression checks, progressive roll-outs
• dedication to transparency and metrics: data-intensive company culture
• extensive use of customer experiments (A/B testing) on millions of users
• instrumentation, alerting, strict discipline on config and resource usage
• Ops excellence, plus big investment in a finely tuned production environment
http://www.quora.com/What-are-best-examples-of-companies-using-continuous-deployment
http://www.slideshare.net/bgdurrett/3-reasons-you-should-use-continuous-deployment
http://www.startuplessonslearned.com/2009/06/why-continuous-deployment.html
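To make the “immune system” idea concrete, here is a minimal sketch in Python of a post-deploy regression check; the metric source and rollback hook (get_metric, rollback) and the thresholds are hypothetical placeholders, not IMVU’s actual tooling.
```python
# A minimal sketch of an "immune system" regression check, assuming a
# hypothetical metrics source and rollback hook (get_metric and rollback
# are placeholders, not IMVU's actual tooling).
import time

ERROR_RATE_CEILING = 0.01    # assumed threshold: >1% errors trips the check
CHECK_WINDOW_SEC = 300       # watch each push for five minutes

def get_metric(name):
    """Placeholder: fetch a live production metric (e.g. from graphite)."""
    raise NotImplementedError

def rollback(build_id):
    """Placeholder: revert production to the previous known-good build."""
    raise NotImplementedError

def watch_deploy(build_id):
    """Hold the pipeline while the new build proves healthy; else roll back."""
    deadline = time.time() + CHECK_WINDOW_SEC
    while time.time() < deadline:
        if get_metric("http.error_rate") > ERROR_RATE_CEILING:
            rollback(build_id)   # regression detected: revert immediately
            return False
        time.sleep(10)           # poll on a coarse interval
    return True                  # build survived the window; keep it live
```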
5. Data Analytics
• data usage downstream from production cluster is a lower priority
• industry truism: data usage downstream almost never trumps
the priority of direct revenue transactions
• even so, business strategy depends on data analytics – which in
practice, at scale, must live downstream from transactions
• however, data analytics jobs tend to break the extensive
testing/monitoring work which makes continuous deployment possible:
- mission critical code which can’t be verified readily by unit tests
- “slow queries” trip immune system, signaling regressions
- likewise for large data transfers within production cluster
- tightly configured environment vs. elastic resource needs
6. How Did We Get Here?
• big Internet successes after 1997 holiday season…
AMZN, EBAY, then GOOG, Inktomi (YHOO Search)
• consider how, among tech firms, this metric:
annual revenue per customer / operational data store size
dropped more than 100x within a few years after 1997
• “conventional wisdom” of RDBMS and BI tools became
much less viable; however, the business cadre which came of
age when “spreadsheets were new” tends to carry along
too much inertia to confront these issues proactively
• on the one hand, storage and processing costs plummeted…
on the other hand, we must now work much smarter
to extract ROI from “Big Data”, so methods must adapt
• MapReduce and the Hadoop open source stack grew
directly out of this context… but they only solve part
of these problems
7. CAP Theorem
• Eric Brewer, 2000: “You can have at most two of these properties for
any shared-data system … the choice of which feature to discard
determines the nature of your system.”
• direct revenue apps in consumer Internet require consistency and
partition tolerance
• data analytics jobs for business uses generally require availability and
eventual consistency, but tend not to tolerate highly partitioned data
• ETL becomes an Achilles’ heel for “Lean Startup™”:
‣ agile/experiment-driven/scale-out, which leads to…
‣ provably-hard-to-detect metadata drift, which leads to…
‣ high-risk technical debt
[CAP triangle diagram: vertices C (strong consistency), A (high availability), P (partition tolerance); RDBMS sits on the C–A edge, eventual consistency on the A–P edge]
https://www.cs.berkeley.edu/~brewer/cs262b-2004/PODC-keynote.pdf
http://www.julianbrowne.com/article/viewer/brewers-cap-theorem
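To make the trade-off concrete (an illustration added here, not from the deck): a toy quorum-replicated store. With N replicas, write quorum W, and read quorum R, choosing W + R > N forces read and write sets to overlap, so reads always see the latest write (strong consistency); smaller quorums need fewer replicas to respond, which favors availability but lets reads go stale (eventual consistency). All names are invented for the sketch.
```python
# Toy quorum-replicated store illustrating the CAP trade-off.
# All class and method names are illustrative, not any real framework's API.

class Replica:
    def __init__(self):
        self.data = {}   # key -> (version, value)

class QuorumStore:
    def __init__(self, n=3, w=2, r=2):
        # w + r > n: read and write quorums overlap -> strong consistency.
        # w + r <= n (e.g. w=1, r=1): a read may miss the latest write ->
        # eventual consistency, but fewer replicas must respond (availability).
        self.replicas = [Replica() for _ in range(n)]
        self.w, self.r = w, r
        self.version = 0

    def put(self, key, value):
        self.version += 1
        for replica in self.replicas[: self.w]:     # ack from W replicas
            replica.data[key] = (self.version, value)

    def get(self, key):
        # Return the newest value observed among R replicas.
        answers = [rep.data.get(key, (0, None)) for rep in self.replicas[-self.r:]]
        return max(answers)[1]

store = QuorumStore(n=3, w=2, r=2)   # overlapping quorums
store.put("balance", 100)
assert store.get("balance") == 100   # guaranteed overlap sees the write
```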
8. Data Access Patterns
• design patterns: originated in consensus negotiation
for architecture, then software engineering
• consider the corollaries in large scale data wrangling…
• essential advice:
select data frameworks based on your data access patterns
• in other words, decouple usage based on need –
to avoid “one size fits all” blockers
• let’s review some examples…
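A hedged sketch of what “decouple usage based on need” can look like in code: each access pattern gets its own backing framework behind one thin facade. The client classes stand in for real drivers (e.g. a MySQL driver, redis-py, an S3 client) and are placeholders, not real APIs.
```python
# Illustrative facade: route each data access pattern to the framework
# that fits it, rather than forcing one store to serve every pattern.

class LedgerClient:        # CA-leaning RDBMS: financial transactions
    def record(self, txn): ...

class CounterClient:       # AP-leaning store (e.g. Redis): counters, sets
    def incr(self, name): ...

class ArchiveClient:       # AP-leaning durable storage (e.g. S3): archives
    def put(self, key, blob): ...

class DataServices:
    """One entry point, many frameworks, chosen per access pattern."""
    def __init__(self):
        self.ledger = LedgerClient()     # needs consistency above all
        self.counters = CounterClient()  # needs availability above all
        self.archive = ArchiveClient()   # needs durability, partition tolerance

    def charge_customer(self, txn):
        self.ledger.record(txn)          # revenue path: strong consistency

    def count_event(self, name):
        self.counters.incr(name)         # metrics path: eventual is fine

    def archive_log(self, key, blob):
        self.archive.put(key, blob)      # archival path: cheap and durable
```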
9. Access Patterns ↔ Frameworks
(x marks the CAP property forfeited for each pattern)
financial transactions              general ledger in RDBMS            CAx
ad-hoc queries                      RDS (hosted MySQL)                 CAx
reporting, dashboards               like Pentaho                       CAx
log rotation/persistence            like Riak                          xxP
search indexes                      like Lucene, Solr                  xAP
static content, archives            S3 (durable storage)               xAP
customer facts                      like Redis, Membase                xAP
distributed counters, locks, sets   like Redis                         xAP*
data objects CRUD                   key/value – like NoSQL on MySQL    CxP
authoritative metadata              like Zookeeper                     CxP
data prep, modeling at scale        like Hadoop/Hive/Cascading + R     CxP
graph analysis                      like Hadoop + Redis + Gephi        CxP
data marts                          like Hadoop/Hive/HBase             CxP
* Redis’s rich atomic operations make it a special case (see editor’s notes)
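For the “distributed counters, locks, sets” row, a brief redis-py sketch of the atomic operations involved; the key names are made up for illustration, and it assumes a Redis server on localhost.
```python
# Sketch of the "distributed counters, locks, sets" pattern with redis-py.
import redis

r = redis.Redis(host="localhost", port=6379)

# Atomic counter: INCR is safe under concurrent writers.
r.incr("events:page_views")

# Simple lock: SETNX takes the lock only if no one else holds it.
if r.setnx("lock:nightly_rollup", "worker-1"):
    try:
        pass  # ... do the exclusive work ...
    finally:
        r.delete("lock:nightly_rollup")  # release the lock

# Set membership: track distinct visitors without double counting.
r.sadd("visitors:2011-06-01", "user:42")
print(r.scard("visitors:2011-06-01"))    # distinct visitor count
```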
11. Data Prep → Modeling at Scale
Analytics jobs performed in the cloud with Hadoop, R, etc.:
• log clean-up, sessionization
• roll-ups, slices, sampling, data cubes, visualizations
• language identification, key phrase extraction
• co-occurrence analysis, topic trending
• custom search indexes
• random forests and other classifiers
• connected components, effects across social graph
• virtual economy metrics
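As one concrete example from the list above, a minimal sessionization reducer in the Hadoop Streaming style: it assumes input lines of “user_id<TAB>epoch_seconds”, already sorted by user and then time (Hadoop’s shuffle sorts by user_id as key; the secondary sort on timestamp would be configured in the streaming job). The field layout is assumed for illustration.
```python
#!/usr/bin/env python
# Minimal sessionization reducer for Hadoop Streaming.
# Assumes input lines "user_id<TAB>epoch_seconds", sorted by user then time.
import sys

SESSION_GAP = 30 * 60          # a 30-minute silence starts a new session

prev_user, prev_ts, session_id = None, None, 0

for line in sys.stdin:
    user, ts = line.rstrip("\n").split("\t")
    ts = int(ts)
    if user != prev_user:
        session_id = 0          # first event for this user
    elif ts - prev_ts > SESSION_GAP:
        session_id += 1         # gap too long: start a new session
    print("%s\t%d\t%d" % (user, session_id, ts))
    prev_user, prev_ts = user, ts
```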
Business use cases:
• customer segmentation
• retention models
• anti-fraud
• content recommendation
• ad optimization
[figure: sample visualizations of client event streams (event labels such as NUI:DressUpMode, Add Buddy, Website Login), with a few details stripped out]
12. Finding Hidden Gems…
[architecture diagram: business transactions land in MySQL partitions; ETL stages data objects into S3; Hadoop builds cloud-based data marts feeding Hive, RDS, Lucene/Solr, Redis, Gephi, and R; downstream business use cases include reporting, ad-hoc queries, search, cache/recommenders, graph analysis, sessionization, predictive modeling, social graph and factor analysis, time series, data visualization, and data services]
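A hedged sketch of the ETL hand-off shown in the diagram: dump one MySQL partition and stage it in S3 for downstream Hadoop jobs. The database, table, and bucket names are made up, and it uses mysqldump plus a boto3-style S3 upload purely as an illustration.
```python
# Illustrative ETL hand-off: dump one MySQL partition, stage it in S3,
# where downstream Hadoop jobs can read it. Names are invented for the sketch;
# credentials/host flags for mysqldump are omitted.
import subprocess
import boto3

DB, TABLE, BUCKET = "imvu_prod", "transactions_p42", "example-etl-staging"
dump_path = "/tmp/%s.sql.gz" % TABLE

# 1. Extract: dump the partition and compress it.
with open(dump_path, "wb") as out:
    dump = subprocess.Popen(["mysqldump", DB, TABLE], stdout=subprocess.PIPE)
    subprocess.check_call(["gzip", "-c"], stdin=dump.stdout, stdout=out)
    dump.wait()

# 2. Load: stage the dump in S3 for the Hadoop cluster to consume.
boto3.client("s3").upload_file(dump_path, BUCKET, "raw/%s.sql.gz" % TABLE)
```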
14. Analytics Team, IMVU.com
• IMVU: 90 employees in Bay Area, $40MM annual rev
• largest virtual goods catalog: 6MM+ UGC items
- Best Places to Work in Bay Area, 2011 & 2010
- Red Herring Global 100 Tech Startup, 2010
- Inc. 500, 2010
http://www.imvu.com/jobs/
@pacoid
Editor’s notes
• prior teams: Jive, ShareThis, Adknowledge, HeadCase
• worked with Ray while at ShareThis on our DW and recommender systems
• 5 years experience with AWS, some at firms 100% in the cloud

• I’m a big believer in asking many questions up-front…
• this talk examines how Hadoop fits into what IMVU is famous for: continuous deployment
• we do some critical work with large data sets which makes an RDBMS a poor fit

• CD allows many developers to respond to immediate needs, to experiment frequently
• transparency, measurement, and consistent data-driven decisions are absolutely requisite

• in short, we can handle in minutes or hours what other firms might take days, weeks, or months to do
• decisions and actions are highly distributed, and the engineering process is well disciplined

• my team works in Analytics, and our data usage is at a different priority than our production cluster
• this is generally true throughout the industry
• business strategy depends on analytics – which, at scale, must live downstream from transactions
• however, analytics work tends to break what we’ve so carefully instrumented

• how did we reach this condition?
• 1997Q4 through 1998Q1, AMZN/EBAY/GOOG/YHOO redefined data use
• revenue/data size, as a metric, fell through the floor
• previous practices in relational DBs and BI no longer worked so well

• the CAP theorem explains an inherent conflict there…
• Internet transactions tend to need different kinds of data management than analytics
• partitioned databases are a solution for one aspect, but in turn cause ETL to become a huge problem

• fortunately, there are patterns we can use to engineer around those conflicts…
• provided that you don’t buy into “one size fits all” sales rhetoric from DB vendors
• design patterns help here: choose data frameworks which fit your data access patterns

• hopefully, this table states the CAP forfeits correctly – email me corrections, please :)
• some of these patterns migrate well to the cloud; you may miss a big opportunity if you don’t

• Redis is notable; its rich, flexible atomic operations lend themselves to non-shared cases
• let’s drill down into the Hadoop use cases…

• here are a variety of kinds of data preparation, discovery, modeling, and visualization for which my teams have used Hadoop and AWS
• generally the goal is to automate almost all of the work, as “pipelines”, and deliver data products/data services
• these visualizations are actually some recent products from my team (less a few details stripped out)…
• geolocation, topic trending from text analytics, measuring effects across the social graph, and comparing features vs. retention

• BTW, Redis provides an excellent “left brain” to pair with the Hadoop “right brain”
• this is not strictly “real-time” analytics, but it is cost-effective and follows guidance from CAP
• in other words, scalable data frameworks based on prevalent data access patterns

• here is some further reading, which I will post online…