



              Faster, cheaper, better
                  Replacing Oracle with
                  Hadoop and Solr

                  Ken Krugler
                  Scale Unlimited


         Copyright (c) 2012 Scale Unlimited. All Rights Reserved.

Monday, June 11, 2012




      Obligatory Background


             Ken Krugler - direct from Nevada City, California
             Krugle startup (2005-2008) used Nutch, Hadoop, Solr
             Now running Scale Unlimited
                     big data + search
                     consulting + training








      The 50,000ft View

             We helped our client kick the RDBMS habit
                     It’s an analytics web site for display advertising
                     Got rid of DBs handling queries for their web site
                     Now uses Hadoop + Solr to...
                             cut costs
                             add features
                             improve performance
                             increase scalability






      What’s an Analytics Web Site?

               Let the user ask questions about data








      Including Sexy Dashboards

               All driven by slices of the data








      Behind the web site curtain

             Each view or constraint change triggers queries
                     “sum ad impact for all advertisers on all networks, sort by sum, limit 10”
                     “sum ad impact by ad type for advertiser ‘oracle.com’”


             For millions of records, you have to choose...
                     Fast, accurate, inexpensive - pick any two
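
To make the cost of such queries concrete, here is a minimal Python sketch of what the first query above amounts to: a group-by, a sum, a sort, and a limit over every matching record. The record layout and values are hypothetical; the talk does not show the client's actual schema or SQL.

```python
from collections import defaultdict

# Hypothetical records: (advertiser, network, ad_impact). Values are made up.
records = [
    ("oracle.com", "google", 5.0),
    ("oracle.com", "doubleclick", 2.5),
    ("directv.com", "google", 4.0),
    ("example.com", "doubleclick", 1.0),
]

def top_advertisers(rows, limit=10):
    """Sum ad impact per advertiser, sort by the sum (descending), keep `limit`."""
    totals = defaultdict(float)
    for advertiser, _network, impact in rows:
        totals[advertiser] += impact          # touches every record on every request
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:limit]

top = top_advertisers(records, limit=2)
print(top)
```

Done at request time, this scan repeats for every view or constraint change, which is why it is hard to be fast, accurate, and inexpensive at once.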








      Combinatorial Explosion

             Too many possibilities to pre-calculate everything
                     more than 10^5 publishers
                     more than 10^6 advertisers
                     30 ad networks, 3 day ranges, etc


             So many trillions of possible combinations
                     Caching of DB query results isn’t very useful
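
As a rough check on "trillions", the slide's own figures can be multiplied out. This is a lower bound only, since the slide says "more than" for two of the counts and ends with "etc".

```python
# Figures taken from the slide; each is a lower bound.
publishers = 10**5      # "more than 10^5 publishers"
advertisers = 10**6     # "more than 10^6 advertisers"
ad_networks = 30
day_ranges = 3

combinations = publishers * advertisers * ad_networks * day_ranges
print(f"{combinations:,}")  # 9,000,000,000,000
```

With roughly 9 trillion possible combinations, a cache of query results would rarely see the same query twice.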







      Trouble in UI Land


             UI refresh took 10-30 seconds
             Well outside of target range of “about a second or so”




                     0.1 second: instantaneous
                     1.0 second: I’m still in the flow
                     10 seconds: I’m bored








      Trouble in the back office

             Beefy hardware for multiple DBs was expensive
                     AWS monthly cost approaching 5 figures
                     And the data sets needed to grow significantly


             Constant schema changes meant painful data reloading
                     Extract, load, transform (inside of DB)
                     Re-indexing of DB fields







      A New Approach

             Do analytics off-line using Hadoop
                     Pre-generate as much as possible
             Use Solr as a NoSQL database
                     And leverage search, faceting








      Obligatory Architectural Slide

             Two search servers
             8 shards per index
                     Optimize response time
             Additional indexes
                     autocompletion, etc.
             200M total documents








      What Solr Gives Us

             Fast, memory-efficient queries
                     Count the number of documents that match a query
                     Sort results by fields
                     And search - “Find all Flash ads with the word ‘diet’”


             Fast faceting
                     Count # of results from query that have different values for a field
                     “How many different image ad sizes (w/counts) are used by google?”
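
A faceting request like the one above can be sketched as a plain HTTP query. The parameter names (`q`, `rows`, `facet`, `facet.field`, `facet.mincount`, `wt`) are standard Solr request parameters; the host, the `/solr/select` path, and the field names `network`, `ad_type`, and `image_size` are assumptions for illustration, not the client's real schema.

```python
from urllib.parse import urlencode

# Assumed endpoint; adjust to the actual Solr host and core.
SOLR_SELECT = "http://localhost:8983/solr/select"

def facet_query_url(query, facet_field):
    """Build a URL that asks Solr for facet counts only (no documents)."""
    params = [
        ("q", query),
        ("rows", 0),                 # counts are enough, skip the documents
        ("facet", "true"),
        ("facet.field", facet_field),
        ("facet.mincount", 1),       # omit values with zero matches
        ("wt", "json"),
    ]
    return SOLR_SELECT + "?" + urlencode(params)

# Hypothetical field names: image ad sizes (with counts) for one network.
url = facet_query_url("network:google AND ad_type:image", "image_size")
print(url)
```

The response would carry one count per distinct `image_size` value, which is the "different values for a field" count described above.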






      How to Connect the Dots
             We have web crawl data - ads, advertisers, publishers, networks
                     http://www.michiguide.com/some-page.html text google
                     DIRECTV® For Businesses Save $13/mo ww.directv.com/business

             We have target Solr schemas with the fields defined

            <field name="network" type="string" indexed="true" stored="false" required="true" />
            <field name="publisher" type="string" indexed="true" stored="false" required="true" />


             How do we get from A to B?


             Data Sources  ->  f(data)???  ->  Index





      Hadoop ETL


             Implement appropriate Extract, Transform, Load
                     Extract is just parsing text files that are stored in Amazon’s S3
                     Load is building the Solr index and deploying it to the search servers
                     What about that pesky “Transform” part?








      Simplicity Itself

           25 Hadoop Jobs
           Developed with Cascading
           Daily run is $25








      Workflow Essentials

             “Do analytics offline” means anything that involves aggregation
             Solr is fine for first/last/count
             Pre-calculate anything that does math on each record
             Essentially, the index is pre-calculated answers to 200M questions
                     “what is trendline for ad impact of this advertiser on that publisher?”
                     “which ads use 300x250 images?”
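
The idea of pre-calculating answers can be sketched in a few lines of Python. This is an illustration of the principle only, not the actual Cascading workflow; the field names `trend_days` and `trend_impact` and the sample observations are hypothetical.

```python
from collections import defaultdict

# Hypothetical raw observations: (day, advertiser, publisher, ad_impact)
observations = [
    ("2012-06-09", "oracle.com", "michiguide.com", 1.0),
    ("2012-06-09", "oracle.com", "michiguide.com", 2.0),
    ("2012-06-10", "oracle.com", "michiguide.com", 4.0),
]

def build_trendline_docs(rows):
    """Do the per-record math offline; emit one ready-to-index doc per question."""
    daily = defaultdict(lambda: defaultdict(float))
    for day, advertiser, publisher, impact in rows:
        daily[(advertiser, publisher)][day] += impact

    docs = []
    for (advertiser, publisher), per_day in sorted(daily.items()):
        days = sorted(per_day)
        docs.append({
            "advertiser": advertiser,
            "publisher": publisher,
            "trend_days": days,                           # assumed field name
            "trend_impact": [per_day[d] for d in days],   # assumed field name
        })
    return docs

docs = build_trendline_docs(observations)
print(docs[0]["trend_impact"])
```

At query time the search server only has to find the document for "this advertiser on that publisher" and return the stored trendline; no summing happens per request.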








      Combinatorial Explosion

             Limit questions that can be asked
                     E.g. no arbitrary date ranges
                     Requires tricky “biggest bang for buck” decisions


             Collapse the “all” entry into the specific entry when there is only one other value
                     Leverage Solr multi-value field support
                     network:all and network:doubleclick are one entry
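
One way to read the collapsing rule is sketched below, under the assumption that the "all" entry and the single specific entry would hold identical data when an advertiser appears on only one network. The document layout is hypothetical; the multi-valued `network` field stands in for a Solr multi-value field.

```python
from collections import defaultdict

def collapse_network_entries(rows):
    """rows: iterable of (advertiser, network, impact). Returns index docs."""
    per_network = defaultdict(dict)
    for advertiser, network, impact in rows:
        per_network[advertiser][network] = per_network[advertiser].get(network, 0.0) + impact

    docs = []
    for advertiser, networks in sorted(per_network.items()):
        total = sum(networks.values())
        if len(networks) == 1:
            # "all" and the single network carry the same numbers: one doc, two values.
            (only_network,) = networks
            docs.append({"advertiser": advertiser,
                         "network": ["all", only_network],
                         "impact": total})
        else:
            docs.append({"advertiser": advertiser, "network": ["all"], "impact": total})
            for network, impact in sorted(networks.items()):
                docs.append({"advertiser": advertiser, "network": [network], "impact": impact})
    return docs

rows = [
    ("a.com", "doubleclick", 2.0),
    ("b.com", "google", 1.0),
    ("b.com", "doubleclick", 3.0),
]
docs = collapse_network_entries(rows)
print(len(docs))
```

A query for `network:all` or `network:doubleclick` matches the same collapsed document for `a.com`, so the index holds one entry where the naive approach would hold two.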







      Reduce Duplicated Data

             De-normalized schema means multiple records with similar data
                     “ad X on network Y”, “ad X on network Z”
                     We couldn’t use Solr’s “join” support (not in 3.6, issues with shards)


             Non-indexed duplicated data goes into “special” records
                     e.g. the records that have “all” for a field value








      Defer Workflow Optimizations


             Frequently tempted to get tricky
                     But helicopter stunts lead to pain and suffering


             Often complex ETL means running multiple jobs in parallel
                     So job timing/prioritization is more important








      Analyzing Workflows

             Sadly, hand analysis is currently required

             Key is no dead time
                     map/reduce slots


             New solutions
                     Ambrose
                     Driven







      Useful Optimizations

             “Cache” results - HDFS storage is cheap
                     Daily processing
                     Daily state + delta from today


             Throw away data ASAP - avoid data baggage
                     Analytics data sets often have many, many fields
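
The "daily state + delta" idea can be sketched as follows, assuming the cached state is a set of running totals keyed by some tuple. The keys and values are hypothetical, and in the real workflow the state would live in files on HDFS rather than in an in-memory dict.

```python
def apply_delta(state, delta):
    """Return a new running-total state: the cached state plus today's delta."""
    merged = dict(state)                     # leave the cached state untouched
    for key, value in delta.items():
        merged[key] = merged.get(key, 0.0) + value
    return merged

# Hypothetical totals keyed by (advertiser, network).
yesterday_state = {("oracle.com", "google"): 10.0}
today_delta = {("oracle.com", "google"): 2.0, ("directv.com", "google"): 1.5}

today_state = apply_delta(yesterday_state, today_delta)
print(today_state)
```

Each daily run then reads one cached state plus one day of new data, instead of reprocessing the full history.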








      Map-side Reduction
             Reduce the amount of data being sent from map to reduce
                     Often the bottleneck for jobs, due to network overhead
                     Examples include aggregation, group-level filtering


             Hadoop has “combiners”, which are post-map reducers
                     Do incremental reduce on map side before sending to reducers


             Cascading has “AggregateBy”, which are in-map reducers
                     Keeps some number of results in memory using LRU queue
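
The in-map approach can be illustrated in plain Python. This is a sketch of the general technique (partial sums held in a bounded LRU cache, with evicted entries sent on to the reducer), not Cascading's or Hadoop's actual API; the function names and the `capacity` parameter are made up for the example.

```python
from collections import OrderedDict

def map_side_sum(pairs, capacity):
    """Yield partial (key, sum) pairs, keeping at most `capacity` keys in memory."""
    cache = OrderedDict()
    for key, value in pairs:
        if key in cache:
            cache[key] += value
            cache.move_to_end(key)            # mark as most recently used
        else:
            cache[key] = value
            if len(cache) > capacity:
                # Evict the least recently used partial sum and emit it.
                yield cache.popitem(last=False)
    yield from cache.items()                  # flush what is left at end of input

def reduce_sum(partials):
    """Reducer side: combine partial sums into final totals."""
    totals = {}
    for key, value in partials:
        totals[key] = totals.get(key, 0) + value
    return totals

pairs = [("a", 1), ("b", 1), ("a", 1), ("c", 1), ("b", 1), ("a", 1)]
partials = list(map_side_sum(pairs, capacity=2))
print(len(pairs), len(partials), reduce_sum(partials))
```

Because a key can be evicted and reappear, the reducer still has to sum the partials; the gain is that fewer pairs cross the network, and the saving grows with the cache size and the repetition of keys.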





      Avoid Heuristics in Hadoop


             What’s easy to describe (and implement) in a function...
                     is often painful and slow in map-reduce


             Conditional/branching logic is a common example
                     If this join result matches X, use it; otherwise join with Y and do Z








      The Net-Net




             If you have a web site that provides analytics
             And it’s currently using an RDBMS like Oracle
             You should be able to make it faster, cheaper, better (and scalable)
             Using Hadoop & Solr








      Questions?

             Feel free to contact me
                     http://www.scaleunlimited.com/contact/


             Check out Lucid’s “Big Data & Solr” class
                     http://www.lucidimagination.com/services/training/


             Check out Cascading
                     http://www.cascading.org/


Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 

Faster, Cheaper, Better - Replacing Oracle with Hadoop & Solr

  • 1. Faster, cheaper, better: Replacing Oracle with Hadoop and Solr. Ken Krugler, Scale Unlimited. Copyright (c) 2012 Scale Unlimited. All Rights Reserved. Monday, June 11, 12
  • 2. Obligatory Background: Ken Krugler, direct from Nevada City, California. Krugle startup (2005-2008) used Nutch, Hadoop, Solr. Now running Scale Unlimited (big data + search; consulting + training).
  • 3. The 50,000ft View: We helped our client kick the RDBMS habit. It’s an analytics web site for display advertising. Got rid of the DBs handling queries for their web site. Now uses Hadoop + Solr to cut costs, add features, improve performance, and increase scalability.
  • 4. What’s an Analytics Web Site? Let the user ask questions about data.
  • 5. Including Sexy Dashboards: All driven by slices of the data.
  • 6. Behind the web site curtain: Each view or constraint change triggers queries, e.g. “sum ad impact for all advertisers on all networks, sort by sum, limit 10” or “sum ad impact by ad type for advertiser ‘oracle.com’”. For millions of records, you have to choose: fast, accurate, inexpensive - pick any two.
  • 7. Combinatorial Explosion: Too many possibilities to pre-calculate everything: more than 10^5 publishers, more than 10^6 advertisers, 30 ad networks, 3 day ranges, etc. So many trillions of possible combinations (10^5 × 10^6 × 30 × 3 is already 9 × 10^12). Caching of DB query results isn’t very useful.
  • 9. Trouble in UI Land: UI refresh took 10-30 seconds, well outside the target range of “about a second or so”. 0.1 second: instantaneous. 1.0 second: I’m still in the flow. 10 seconds: I’m bored.
  • 10. Trouble in the back office: Beefy hardware for multiple DBs was expensive: AWS monthly cost approaching 5 figures, and the data sets needed to grow significantly. Constant schema changes meant painful data reloading: extract, load, transform (inside of the DB), and re-indexing of DB fields.
  • 11. A New Approach: Do analytics off-line using Hadoop; pre-generate as much as possible. Use Solr as a NoSQL database, and leverage search and faceting.
  • 12. Obligatory Architectural Slide: Two search servers; 8 shards per index (optimize response time); additional indexes (autocompletion, etc.); 200M total documents.
  • 13. What Solr Gives Us: Fast, memory-efficient queries: count the number of documents that match a query; sort results by fields; and search, e.g. “Find all Flash ads with the word ‘diet’”. Fast faceting: count the number of results from a query that have different values for a field, e.g. “How many different image ad sizes (w/counts) are used by google?”
  • 14. How to Connect the Dots: We have web crawl data (ads, advertisers, publishers, networks), for example: http://www.michiguide.com/some-page.html, text, google, “DIRECTV® For Businesses Save $13/mo ww.directv.com/business”. We have target Solr schemas with the fields defined, for example: <field name="network" type="string" indexed="true" stored="false" required="true" /> <field name="publisher" type="string" indexed="true" stored="false" required="true" /> How do we get from A to B? [Diagram: Data Sources -> f(data)??? -> Index]
  • 15. Hadoop ETL: Implement appropriate Extract, Transform, Load. Extract is just parsing text files that are stored in Amazon’s S3. Load is building the Solr index and deploying it to the search servers. What about that pesky “Transform” part?
  • 16. Simplicity Itself: 25 Hadoop jobs, developed with Cascading. Daily run is $25.
  • 17. Workflow Essentials: “Do analytics offline” means anything that involves aggregation. Solr is fine for first/last/count; pre-calculate anything that does math on each record. Essentially, the index is pre-calculated answers to 200M questions, e.g. “what is trendline for ad impact of this advertiser on that publisher?” or “which ads use 300x250 images?”
  • 18. Combinatorial Explosion: Limit the questions that can be asked, e.g. no arbitrary date ranges. This requires tricky “biggest bang for buck” decisions. Collapse entries that are “all” and only one other: leverage Solr multi-value field support, so network:all and network:doubleclick are one entry.
  • 19. Reduce Duplicated Data: De-normalized schema means multiple records with similar data, e.g. “ad X on network Y”, “ad X on network Z”. We couldn’t use Solr’s “join” support (not in 3.6, issues with shards). Non-indexed duplicated data goes into “special” records, e.g. the records that have “all” for a field value.
  • 20. Defer Workflow Optimizations: Frequently tempted to get tricky, but helicopter stunts lead to pain and suffering. Often complex ETL means running multiple jobs in parallel, so job timing/prioritization is more important.
  • 21. Analyzing Workflows: Sadly, hand analysis is currently required. Key is no dead time (map/reduce slots). New solutions: Ambrose, Driven.
  • 22. Useful Optimizations: “Cache” results, since HDFS storage is cheap. Daily processing: daily state + delta from today. Throw away data ASAP to avoid data baggage; analytics data sets often have many, many fields.
  • 23. Map-side Reduction: Reduce the amount of data being sent from map to reduce; this is often the bottleneck for jobs, due to network overhead. Examples include aggregation and group-level filtering. Hadoop has “combiners”, which are post-map reducers: do an incremental reduce on the map side before sending to reducers. Cascading has “AggregateBy”, which are in-map reducers: keeps some number of results in memory using an LRU queue.
  • 24. Avoid Heuristics in Hadoop: What’s easy to describe (and implement) in a function... is often painful and slow in map-reduce. Conditional/branching logic is a common example: “If this join result matches X, use it; otherwise join with Y and do Z”.
  • 29. The Net-Net: If you have a web site that provides analytics, and it’s currently using an RDBMS like Oracle, you should be able to make it faster, cheaper, better (and scalable) using Hadoop & Solr.
  • 30. Questions? Feel free to contact me (http://www.scaleunlimited.com/contact/). Check out Lucid’s “Big Data & Solr” class (http://www.lucidimagination.com/services/training/). Check out Cascading (http://www.cascading.org/).
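
The “What Solr Gives Us” slide asks “How many different image ad sizes (w/counts) are used by google?”. A faceting request of that kind might be built roughly as follows. This is only a sketch: the host, the core name `ads`, and the field names `ad_type` and `image_size` are assumptions made for illustration (only `network` appears in the schema shown in the deck). The request parameters `q`, `rows`, `facet`, `facet.field`, `facet.mincount` and `wt` are standard Solr query parameters.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def facet_query_url(base_url, query, facet_field):
    """Build a Solr select URL that returns only facet counts.

    rows=0 suppresses the matching documents themselves, so the response
    carries just the per-value counts for facet_field.
    """
    params = [
        ("q", query),
        ("rows", 0),
        ("facet", "true"),
        ("facet.field", facet_field),
        ("facet.mincount", 1),  # skip field values that have no matches
        ("wt", "json"),
    ]
    return base_url + "/select?" + urlencode(params)


def query_params(url):
    """Decode the query string of a URL back into a dict of value lists."""
    return parse_qs(urlsplit(url).query)


# Hypothetical host, core, and field names, for illustration only.
url = facet_query_url(
    "http://localhost:8983/solr/ads",
    "network:google AND ad_type:image",
    "image_size",
)
```

Because the documents are pre-aggregated offline, a request like this only counts and groups; it does not do any per-record math at query time.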
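
One reading of the second “Combinatorial Explosion” slide (“network:all and network:doubleclick are one entry”) is that when an advertiser appears on only one network, the “all networks” rollup is identical to the single per-network record, so the two can share one document with a multi-valued `network` field. The sketch below illustrates that reading in Python rather than in the Cascading workflow the deck describes. The record layout, the `advertiser` and `impact` names, and the network name `adsense` are hypothetical.

```python
from collections import defaultdict


def build_docs(rows):
    """Turn (advertiser, network, impact) rows into index documents.

    Each advertiser gets an "all" rollup plus one document per network,
    except that a lone network is merged with the rollup into a single
    document whose network field holds both values.
    """
    per_network = defaultdict(lambda: defaultdict(int))
    for advertiser, network, impact in rows:
        per_network[advertiser][network] += impact

    docs = []
    for advertiser, networks in per_network.items():
        if len(networks) == 1:
            # The rollup equals the single network's total: emit one entry.
            (network, impact), = networks.items()
            docs.append({"advertiser": advertiser,
                         "network": ["all", network],
                         "impact": impact})
        else:
            docs.append({"advertiser": advertiser,
                         "network": ["all"],
                         "impact": sum(networks.values())})
            for network, impact in networks.items():
                docs.append({"advertiser": advertiser,
                             "network": [network],
                             "impact": impact})
    return docs


# Hypothetical sample rows, for illustration only.
sample_rows = [
    ("oracle.com", "doubleclick", 5),
    ("oracle.com", "doubleclick", 2),
    ("acme.com", "doubleclick", 1),
    ("acme.com", "adsense", 3),
]
sample_docs = build_docs(sample_rows)
```

In this sample, `oracle.com` produces one document instead of two, while `acme.com` still needs its rollup plus two per-network documents.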
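
The “Map-side Reduction” slide describes in-map reducers that keep some number of results in memory using an LRU queue. The general idea can be sketched in a few lines of Python. This is not Cascading's `AggregateBy` API or its implementation; `map_side_sum`, `reduce_sum`, and the sample records are hypothetical names chosen for illustration.

```python
from collections import OrderedDict


def map_side_sum(records, capacity):
    """Partially sum (key, value) pairs in a bounded LRU cache.

    Yields (key, partial_sum) whenever an entry is evicted, then flushes
    whatever is left. A key may be emitted more than once, so the reduce
    side still has to add the partial sums together.
    """
    if capacity < 1:
        raise ValueError("capacity must be at least 1")
    cache = OrderedDict()
    for key, value in records:
        if key in cache:
            cache[key] += value
            cache.move_to_end(key)  # mark as most recently used
        else:
            if len(cache) >= capacity:
                # Evict the least recently used key and send it downstream.
                yield cache.popitem(last=False)
            cache[key] = value
    # End of the map task: flush the remaining partial sums.
    yield from cache.items()


def reduce_sum(partials):
    """Reduce side: combine the partial sums for each key."""
    totals = {}
    for key, value in partials:
        totals[key] = totals.get(key, 0) + value
    return totals


# Hypothetical sample input, for illustration only.
sample_records = [("a", 1), ("b", 2), ("a", 3), ("c", 4), ("b", 5), ("a", 6)]
```

The cache size trades memory for shuffle volume: with room for two keys, the six sample records shrink to five partial sums, and with room for three they shrink to three. The final totals are the same either way.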