Our client helps advertisers target publishers/networks and improve ad results by analyzing millions of web pages every day. They have been able to cut monthly costs by more than 50%, improve response time by 4x, and quickly add new features by switching from a traditional DB-centric approach to one based on Hadoop & Solr. This analysis is handled by a complex Hadoop-based workflow, where the end result is a set of unique, highly optimized Solr indexes. The data processing platform provided by Hadoop also enables scalable machine learning using Mahout. This presentation covers some of the unique challenges in switching the web site from relying on slow, expensive real-time analytics using database queries to fast, affordable batch analytics and search using Hadoop and Solr.
Faster, Cheaper, Better - Replacing Oracle with Hadoop & Solr
1. 1
Faster, cheaper, better
Replacing Oracle with
Hadoop and Solr
Ken Krugler
Scale Unlimited
Copyright (c) 2012 Scale Unlimited. All Rights Reserved.
Monday, June 11, 12
2. 2
Obligatory Background
Ken Krugler - direct from Nevada City, California
Krugle startup (2005-2008) used Nutch, Hadoop, Solr
Now running Scale Unlimited
big data + search
consulting + training
3. 3
The 50,000ft View
We helped our client kick the RDBMS habit
It’s an analytics web site for display advertising
Got rid of DBs handling queries for their web site
Now uses Hadoop + Solr to...
cut costs
add features
improve performance
increase scalability
4. 4
What’s an Analytics Web Site?
Let the user ask questions about data
5. 5
Including Sexy Dashboards
All driven by slices of the data
6. 6
Behind the web site curtain
Each view or constraint change triggers queries
“sum ad impact for all advertisers on all networks, sort by sum, limit 10”
“sum ad impact by ad type for advertiser ‘oracle.com’”
For millions of records, you have to choose...
Fast, accurate, inexpensive - pick any two
7. 7
Combinatorial Explosion
Too many possibilities to pre-calculate everything
more than 10^5 publishers
more than 10^6 advertisers
30 ad networks, 3 day ranges, etc
So many trillions of possible combinations
Caching of DB query results isn’t very useful
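The "trillions" figure follows from multiplying the slide's lower bounds together. A quick check in Python, assuming each dimension can be combined freely with every other (the variable names are illustrative, not from the original system):

```python
# Lower bounds taken from the slide; each is "more than" the value shown.
publishers = 10**5
advertisers = 10**6
ad_networks = 30
day_ranges = 3

# Assumption: every dimension combines independently with every other.
combinations = publishers * advertisers * ad_networks * day_ranges
print(f"{combinations:.1e} possible combinations")
```

That is roughly nine trillion before any of the "etc" dimensions are counted, so a given cached query result is unlikely to be requested again.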
9. 8
Trouble in UI Land
UI refresh took 10-30 seconds
Well outside of target range of “about a second or so”
0.1 second: instantaneous
1.0 second: I’m still in the flow
10 seconds: I’m bored
10. 9
Trouble in the back office
Beefy hardware for multiple DBs was expensive
AWS monthly cost approaching 5 figures
And the data sets needed to grow significantly
Constant schema changes meant painful data reloading
Extract, load, transform (inside of DB)
Re-indexing of DB fields
11. 10
A New Approach
Do analytics off-line using Hadoop
Pre-generate as much as possible
Use Solr as a NoSQL database
And leverage search, faceting
12. 11
Obligatory Architectural Slide
Two search servers
8 shards per index
Optimize response time
Additional indexes
autocompletion, etc.
200M total documents
13. 12
What Solr Gives Us
Fast, memory-efficient queries
Count the number of documents that match a query
Sort results by fields
And search - “Find all Flash ads with the word ‘diet’”
Fast faceting
Count # of results from query that have different values for a field
“How many different image ad sizes (w/counts) are used by google?”
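A faceting question like the one above maps onto a single Solr request. The sketch below only builds the request URL. The host, path and field names (`network`, `ad_type`, `image_size`) are assumptions made for illustration; the parameter names (`q`, `fq`, `rows`, `facet`, `facet.field`, `wt`) are standard Solr query parameters.

```python
from urllib.parse import urlencode

# Hypothetical endpoint; the real deployment's host and path are not given.
SOLR_SELECT_URL = "http://search.example.com:8983/solr/select"


def facet_query_url(network, facet_field):
    """Build a Solr URL that counts matching docs per value of facet_field."""
    params = [
        ("q", f"network:{network}"),   # assumed field name
        ("fq", "ad_type:image"),       # assumed field name and value
        ("rows", "0"),                 # only the facet counts are needed
        ("facet", "true"),
        ("facet.field", facet_field),
        ("wt", "json"),
    ]
    return SOLR_SELECT_URL + "?" + urlencode(params)


url = facet_query_url("google", "image_size")
print(url)
```

Setting `rows` to 0 asks Solr for the counts alone, which keeps the response small.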
14. 13
How to Connect the Dots
We have web crawl data - ads, advertisers, publishers, networks
http://www.michiguide.com/some-page.html text google
DIRECTV® For Businesses Save $13/mo ww.directv.com/business
We have target Solr schemas with the fields defined
<field name="network" type="string" indexed="true" stored="false" required="true" />
<field name="publisher" type="string" indexed="true" stored="false" required="true" />
How do we get from A to B?
Data Sources -> f(data)??? -> Index
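The simplest possible f(data) is a per-record mapping from a crawl line to the schema's field names. The sketch below assumes a tab-separated layout of page URL, ad type and network, and assumes the publisher is the page's host name. Both are guesses at the format, and `ad_type` is a hypothetical field not shown in the schema excerpt.

```python
from urllib.parse import urlparse


def crawl_line_to_doc(line):
    """Turn one crawl record into a dict keyed by Solr field name."""
    # Assumed layout: page URL, ad type, ad network, separated by tabs.
    page_url, ad_type, network = line.rstrip("\n").split("\t")
    return {
        # Assumption: the publisher is identified by the page's host name.
        "publisher": urlparse(page_url).hostname,
        "ad_type": ad_type,
        "network": network,
    }


doc = crawl_line_to_doc("http://www.michiguide.com/some-page.html\ttext\tgoogle\n")
print(doc)
```

The real transform is far larger than this, since it also has to aggregate across records.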
15. 14
Hadoop ETL
Implement appropriate Extract, Transform, Load
Extract is just parsing text files that are stored in Amazon’s S3
Load is building the Solr index and deploying it to the search servers
What about that pesky “Transform” part?
16. 15
Simplicity Itself
25 Hadoop Jobs
Developed with Cascading
Daily run is $25
17. 16
Workflow Essentials
“Do analytics offline” means anything that involves aggregation
Solr is fine for first/last/count
Pre-calculate anything that does math on each record
Essentially the index is pre-calculated answers to 200M questions
“what is trendline for ad impact of this advertiser on that publisher?”
“which ads use 300x250 images?”
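A plain-Python sketch of the "pre-calculate anything that does math" idea: the sum is computed once, offline, and each result becomes one index document. This stands in for the Hadoop job, and the record and field names (`impact`, `impact_sum`) are illustrative assumptions.

```python
from collections import defaultdict


def precalculate_impact(records):
    """Sum ad impact per (advertiser, publisher) pair."""
    totals = defaultdict(float)
    for rec in records:
        totals[(rec["advertiser"], rec["publisher"])] += rec["impact"]
    # One document per pair: the answer exists before anyone asks for it.
    return [
        {"advertiser": adv, "publisher": pub, "impact_sum": total}
        for (adv, pub), total in sorted(totals.items())
    ]


records = [
    {"advertiser": "oracle.com", "publisher": "a.example", "impact": 2.0},
    {"advertiser": "oracle.com", "publisher": "a.example", "impact": 3.0},
    {"advertiser": "oracle.com", "publisher": "b.example", "impact": 1.0},
]
docs = precalculate_impact(records)
print(docs)
```

At query time the search server only has to find and return the matching document; no arithmetic is left to do.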
18. 17
Combinatorial Explosion
Limit questions that can be asked
E.g. no arbitrary date ranges
Requires tricky “biggest bang for buck” decisions
Collapse entries when “all” would cover only one other value
Leverage Solr multi-value field support
network:all and network:doubleclick are one entry
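One way to picture the collapse, as a sketch under assumed field names: when an advertiser appears on a single network, the "all" document and the per-network document would hold identical numbers, so a multi-valued `network` field lets one document answer both queries.

```python
def docs_for_advertiser(advertiser, impact_by_network):
    """Build index documents for one advertiser; `network` is multi-valued."""
    total = sum(impact_by_network.values())
    if len(impact_by_network) == 1:
        # "all" and the single network would carry the same numbers,
        # so one document holds both values.
        (network,) = impact_by_network
        return [{"advertiser": advertiser, "network": ["all", network], "impact": total}]
    docs = [{"advertiser": advertiser, "network": ["all"], "impact": total}]
    for network, impact in sorted(impact_by_network.items()):
        docs.append({"advertiser": advertiser, "network": [network], "impact": impact})
    return docs


single = docs_for_advertiser("small.example", {"doubleclick": 7})
multi = docs_for_advertiser("big.example", {"doubleclick": 7, "google": 5})
print(len(single), len(multi))
```

The saving depends on how many advertisers use only one network, which the slides do not quantify.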
19. 18
Reduce Duplicated Data
De-normalized schema means multiple records with similar data
“ad X on network Y”, “ad X on network Z”
We couldn’t use Solr’s “join” support (not in 3.6, issues with shards)
Non-indexed duplicated data goes into “special” records
e.g. the records that have “all” for a field value
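A minimal sketch of the "special record" idea, assuming the shared payload is the ad's text and that it lives only on the record whose network is "all". The function and field names are hypothetical.

```python
def build_ad_records(ad_id, ad_text, networks):
    """Emit one record per network plus an "all" record holding shared data."""
    # Only the "all" record carries the stored (non-indexed) payload.
    records = [{"ad_id": ad_id, "network": "all", "ad_text": ad_text}]
    for network in networks:
        records.append({"ad_id": ad_id, "network": network})
    return records


def lookup_ad_text(records, ad_id):
    """Fetch the shared payload from the ad's "all" record."""
    for rec in records:
        if rec["ad_id"] == ad_id and rec["network"] == "all":
            return rec["ad_text"]
    return None


records = build_ad_records("X", "Save $13/mo", ["Y", "Z"])
print(lookup_ad_text(records, "X"))
```

The trade-off is a second lookup whenever a per-network result needs the shared data.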
20. 19
Defer Workflow Optimizations
Frequently tempted to get tricky
But helicopter stunts lead to pain and suffering
Often complex ETL means running multiple jobs in parallel
So job timing/prioritization is more important
21. 20
Analyzing Workflows
Sadly, hand analysis is currently required
Key is no dead time
map/reduce slots
New solutions
Ambrose
Driven
22. 21
Useful Optimizations
“Cache” results - HDFS storage is cheap
Daily processing
Daily state + delta from today
Throw away data ASAP - avoid data baggage
Analytics data sets often have many, many fields
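The "daily state + delta" pattern can be sketched as a merge of yesterday's cached totals with today's counts, so older days are never reprocessed. This is plain Python standing in for a Hadoop job, and the keys and counts are made up for illustration.

```python
from collections import Counter


def apply_daily_delta(previous_state, todays_delta):
    """Return the new running totals without reprocessing older days."""
    new_state = Counter(previous_state)
    new_state.update(todays_delta)  # adds counts key by key
    return new_state


state = Counter({"oracle.com": 120, "directv.com": 40})
delta = Counter({"oracle.com": 5, "example.com": 2})
state = apply_daily_delta(state, delta)
print(state)
```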
23. 22
Map-side Reduction
Reduce the amount of data being sent from map to reduce
Often the bottleneck for jobs, due to network overhead
Examples include aggregation, group-level filtering
Hadoop has “combiners”, which are post-map reducers
Do incremental reduce on map side before sending to reducers
Cascading has “AggregateBy”, which is an in-map reducer
Keeps some number of results in memory using LRU queue
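The in-map approach can be sketched in plain Python as a bounded cache of partial sums that evicts its least recently used key when full. This illustrates the general technique only; it is not Cascading's implementation, and the function names and `capacity` parameter are assumptions (with `capacity` expected to be at least 1).

```python
from collections import OrderedDict, defaultdict


def map_side_sum(pairs, capacity):
    """Yield partial sums, holding at most `capacity` keys in memory."""
    cache = OrderedDict()
    for key, value in pairs:
        if key in cache:
            cache[key] += value
            cache.move_to_end(key)  # mark as most recently used
        else:
            if len(cache) >= capacity:
                # Evict the least recently used key as a partial result.
                yield cache.popitem(last=False)
            cache[key] = value
    # Flush whatever is still cached at the end of the map task.
    yield from cache.items()


def reduce_sum(partials):
    """Reduce side: combine partial sums into final totals."""
    totals = defaultdict(int)
    for key, value in partials:
        totals[key] += value
    return dict(totals)


pairs = [("a", 1), ("b", 2), ("a", 3), ("c", 4), ("b", 5), ("a", 6)]
partials = list(map_side_sum(pairs, capacity=2))
totals = reduce_sum(partials)
print(len(pairs), len(partials), totals)
```

The final totals match a full reduce, but fewer tuples cross the network. A larger cache or more repetitive keys increases the saving.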
24. 23
Avoid Heuristics in Hadoop
What’s easy to describe (and implement) in a function...
is often painful and slow in map-reduce
Conditional/branching logic is common example
If this join result matches X, use it; otherwise join with Y and do Z
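For contrast, here is how small the conditional join is as an ordinary function over in-memory lookups. The tables, the `do_z` callback and the function name are hypothetical stand-ins for the X, Y and Z on the slide.

```python
def resolve(key, table_x, table_y, do_z):
    """If the join against X matches, use it; otherwise join with Y and do Z."""
    if key in table_x:
        return table_x[key]
    if key in table_y:
        return do_z(table_y[key])
    return None


table_x = {"ad1": "from-x"}
table_y = {"ad2": "from-y"}
print(resolve("ad1", table_x, table_y, str.upper))
print(resolve("ad2", table_x, table_y, str.upper))
```

In a map-reduce workflow each of those dictionary lookups is typically a join over a full data set, and the fallback branch means routing the unmatched records into a further join, which is where the pain comes from.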
29. 24
The Net-Net
If you have a web site that provides analytics
And it’s currently using an RDBMS like Oracle
You should be able to make it faster, cheaper, better (and scalable)
Using Hadoop & Solr
30. 25
Questions?
Feel free to contact me
http://www.scaleunlimited.com/contact/
Check out Lucid’s “Big Data & Solr” class
http://www.lucidimagination.com/services/training/
Check out Cascading
http://www.cascading.org/