Indexing and Searching
               Massive Data Sets

                           PAUL ALLEN
                              CEO




                                        May 13, 2009
Enterprise Search Summit
About

 WorldVitalRecords.com
   Provides access to genealogy databases and family history
    tools, including birth, death, military, census, and parish
    records
 WorldHistory.com
   Provides historical and biographical content

 We’re Related Application on Facebook
   Designed to help users find their family members, build a
    family tree and share news and photos with family.
   There are 15,636,941 active users as of March 2009.




 Founded in 2006 and includes several key members
  of the original Ancestry.com team
 Aims to be the #2 genealogy company on the web
 Currently has
    12,000+ databases
    1.2 billion names
    25,000 subscribers


The Challenge

 Rapidly expanding data set to grow into the billions of
  records
 Mixture of structured and unstructured data
 Indexing and search costs to handle this massive content
  repository quickly escalate
 Increased customer traffic placing additional load on
  query servers requiring additional servers and costs

        How do I provide an affordable search
         solution to handle the explosive data
                       growth?

My Experience With Massive Data Sets

                            Saw content repositories grow to
                            billions of records

                            Saw millions spent on capital
                            expenditures on servers and data
                            centers

                            Saw millions spent on annual server
                            maintenance and energy costs


My Options

 Build proprietary search engine


 Buy a solution from an Enterprise Search Vendor


 Build a Lucene open source search platform




Traditional Solution to Handle Massive Data Sets




Lots of servers!
Lots of money!
Cluster Architecture

 Cluster made up of rows and columns of servers
 Determinants of cluster size
    Size of index
        Determines the number of columns
    Amount of peak traffic (queries per second)
        Determines the number of rows
       




Index Size – Number of Columns

Index size = 80 GB = 5 columns
Query servers needed = 5

                              Collation Server

  Query Server   Query Server   Query Server   Query Server   Query Server
      8 GB           8 GB           8 GB           8 GB           8 GB

Assumptions:
       Up to 50% of index (16 GB for each cluster) resides mostly in cache
       QPS per server = 20 qps
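
The column count above can be reproduced with a little arithmetic. This is a minimal sketch under one reading of the assumptions: each query server has 8 GB of memory and should be able to cache 50% of its share of the index, so one column holds 16 GB of index. The function name `columns_needed` and its parameters are hypothetical, not taken from the deck.

```python
import math

def columns_needed(index_gb, server_memory_gb=8, cached_fraction=0.5):
    """Number of index partitions (columns) for a given index size."""
    # Index a single server can hold if `cached_fraction` of it must fit in memory
    # (assumed reading of the slide: 8 GB / 0.5 = 16 GB per column).
    slice_gb = server_memory_gb / cached_fraction
    return math.ceil(index_gb / slice_gb)

print(columns_needed(80))  # 80 GB index at 16 GB per column
```

With the slide's numbers this gives 5 columns; any growth past 80 GB adds a column.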
Peak Traffic Load – Number of Rows

Index size = 80 GB – 5 columns needed
Max traffic rate = 100 queries per second (QPS) – 10 rows needed
Query servers needed = 5 columns x 10 rows = 50 servers

                              Collation Server

  Query Server   Query Server   Query Server   Query Server   Query Server

Assumptions:
       Server memory = 8 GB
       Up to 50% of index resides in cache
       QPS per server = 10 qps
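
The row count and the total on this slide follow from the stated assumptions: each row is a full copy of the index, so rows scale with peak traffic. A minimal sketch, with hypothetical function names `rows_needed` and `query_servers`:

```python
import math

def rows_needed(peak_qps, qps_per_server=10):
    """Replicated rows required to absorb the peak query rate."""
    return math.ceil(peak_qps / qps_per_server)

def query_servers(columns, peak_qps, qps_per_server=10):
    """Total query servers in a columns x rows cluster."""
    return columns * rows_needed(peak_qps, qps_per_server)

print(rows_needed(100))       # rows for 100 QPS at 10 qps per server
print(query_servers(5, 100))  # 5 columns times that many rows
```

For the slide's inputs this yields 10 rows and 50 query servers. The collation server is not counted, matching the slide.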
WVR Server Configuration


                                          Collation
                                           Server

  8 gb              8 gb           8 gb     8 gb      8 gb   8 gb      16 gb




    800,000,000 records
    Maximum Query volume system capacity = 7 queries per second

    64 bit Windows based servers
    Dual core CPUs




Future WVR Cluster Needs

 Projections show
    1+ terabyte of data (3.5 billion records)
    200 queries per second at peak load

Would require a cluster architecture of:
    29 columns
    40 rows

WVR would need 1,160 query servers!
    About $3.5 million in initial capital expenditure
    Over $2.3 million in recurring yearly costs
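
The server count above is simply columns times rows. The sketch below reproduces it and also works backward to the per-server query rate the projection implies; that rate is derived here and is not stated on the slide.

```python
columns, rows = 29, 40   # cluster shape from the slide
peak_qps = 200           # projected peak load from the slide

query_servers = columns * rows
# Derived, not stated in the deck: the load each row's servers must sustain.
implied_qps_per_server = peak_qps / rows

print(query_servers)
print(implied_qps_per_server)
```

This gives 1,160 servers and an implied 5 qps per server, which is lower than the 10 qps assumed on the earlier sizing slide. The deck does not explain the difference.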



Search Challenges

   Many solutions work well with
     Low traffic
     Small initial data sets
   As data and traffic grow, however, so do the costs
     and associated problems
         Slow indexing times
         Low queries-per-second capacity
         Ranked search would have required a significant expansion of
         servers to handle the increased search load
         Skilled staff required to modify and optimize the system to handle growth
     




Enterprise Search Solution




Perfect Search Approach
   Replace existing Lucene
     Utilize PS Indexing
     Utilize PS Search Engine
   Match business rules
   Incorporate Near Exact modules
     Soundex, Metaphone
   Match or improve results
   Provide query results back to WVR for display
   Disk-based index
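
For readers unfamiliar with the near-exact matching named above: Soundex reduces a surname to a letter plus three digits so that similar-sounding spellings collide, which is useful for genealogy lookups. The sketch below is the standard American Soundex, written from the published rules. The deck does not say which variant Perfect Search uses, so treat this as an illustration only, not its implementation.

```python
# Consonant groups of American Soundex; vowels, Y, H and W carry no code.
CODES = {}
for digit, group in enumerate(["BFPV", "CGJKQSXZ", "DT", "L", "MN", "R"], start=1):
    for letter in group:
        CODES[letter] = str(digit)

def soundex(name):
    """Return the 4-character American Soundex code for a name."""
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0]
    prev = CODES.get(letters[0], "")
    for c in letters[1:]:
        if c in "HW":
            continue  # H and W do not separate letters that share a code
        code = CODES.get(c, "")
        if code and code != prev:
            result += code
        prev = code  # a vowel resets prev, so a repeated code after it is kept
    return (result + "000")[:4]  # pad with zeros, keep four characters

print(soundex("Smith"), soundex("Smyth"))
print(soundex("Allen"), soundex("Allan"))
```

Both spellings in each pair map to the same code, so a search on one spelling can also return records filed under the other.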



Data Growth Past Year



                              August 2008   May 2009        % Growth

    Number of Records         800,000,000   1,200,000,000   50%
    Number of Databases       9,000         12,000          33%




Current WVR Perfect Search Server Configuration



                                           Collation
                                            Server

    8 gb              8 gb          8 gb     8 gb      8 gb   8 gb       16 gb




     1.2 billion records
     12,000 databases

     Maximum Query volume system capacity = 40 queries per second
                                                  5x faster!
     64 bit Windows based servers
     Dual core CPUs

Benefits to Worldvitalrecords.com

  Reduces indexing time to 1/100 of Lucene times

  Reduces query servers from 7 to 1

  Provides sub-second query response times

  Allows for continued dramatic customer growth
    without significant server expansion

  Allows World Vital Records to compete with market
    leaders at a fraction of the server capitalization and
    maintenance costs

Future Growth Projections

 World Vital Records Growth Plans
   1+ terabyte (3 times growth in data)

   200 queries per second (20 times growth in customers)




                                         Perfect Search    Lucene
    Servers                              20                1,160
    Server Capital Expenditure           $60,000           $3,480,000
    Recurring Power/Maintenance Costs    $40,000           $2,320,000
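
The table's figures can be checked against each other. The sketch below uses only the numbers shown; the per-server costs are derived by division and are not stated anywhere in the deck.

```python
projection = {
    "Perfect Search": {"servers": 20, "capex": 60_000, "yearly": 40_000},
    "Lucene": {"servers": 1160, "capex": 3_480_000, "yearly": 2_320_000},
}

for name, p in projection.items():
    # Derived per-server costs (not stated on the slide).
    capex_each = p["capex"] / p["servers"]
    yearly_each = p["yearly"] / p["servers"]
    print(name, capex_each, yearly_each)

ps, lucene = projection["Perfect Search"], projection["Lucene"]
server_ratio = lucene["servers"] / ps["servers"]
capex_saved = lucene["capex"] - ps["capex"]
yearly_saved = lucene["yearly"] - ps["yearly"]
print(server_ratio, capex_saved, yearly_saved)
```

Both columns work out to the same $3,000 purchase cost and $2,000 yearly cost per server, so the projected savings ($3,420,000 up front and $2,280,000 per year) come entirely from needing 58 times fewer servers.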




Questions?




                     Paul Allen, CEO, FamilyLink.com
                            paul@familylink.com



