A case study presented at the 2009 Enterprise Search Summit by World Vital Records CEO Paul Allen.
Mr. Allen discussed the challenges of scaling cost-effectively, indexing and querying large data sets, and staying competitive in the marketplace.
World Vital Records Case Study
1. Indexing and Searching Massive Data Sets
PAUL ALLEN
CEO
May 13, 2009
Enterprise Search Summit
2. About
WorldVitalRecords.com
Provides access to genealogy databases and family history
tools, including birth, death, military, census, and parish
records
WorldHistory.com
Provides historical and biographical content
We’re Related Application on Facebook
Designed to help users find their family members, build a
family tree and share news and photos with family.
There are 15,636,941 active users as of March 2009.
3. Founded in 2006 and includes several key members
of the original Ancestry.com team
Goal: to become the #2 genealogy company on the web
Currently has
12,000+ databases
1.2 billion names
25,000 subscribers
4. The Challenge
Rapidly expanding data set to grow into the billions of
records
Mixture of structured and unstructured data
Indexing and search costs to handle this massive content
repository quickly escalate
Increased customer traffic placing additional load on
query servers requiring additional servers and costs
How do I provide an affordable search
solution to handle the explosive data
growth?
5. My Experience With Massive Data Sets
Saw content repositories grow to
billions of records
Saw millions spent on capital
expenditures on servers and data
centers
Saw millions spent on annual server
maintenance and energy costs
6. My Options
Build proprietary search engine
Buy a solution from an Enterprise Search Vendor
Build a Lucene open source search platform
7. Traditional Solution to Handle Massive Data Sets
Lots of servers!
Lots of money!
8. Cluster Architecture
Cluster made up of rows and columns of servers
Determinants of cluster size
Size of index
Determines number of columns
Amount of peak traffic (queries per second)
Determines the number of rows
9. Index Size – Number of Columns
Index size = 80 GB = 5 columns
Query Servers needed = 5
Collation Server
5 Query Servers, 8 GB each
Assumptions:
Up to 50% of the index (16 GB for each column) resides in cache
QPS per server = 20 qps
10. Peak Traffic Load – Number of Rows
Index size = 20 GB – 5 columns needed
Max Traffic Rate = 100 queries per second (QPS) – 10 rows needed
Query Servers needed = 5 columns x 10 rows = 50 servers
Collation Server
Query Servers: 5 per row
Assumptions:
Server memory = 8 GB
Up to 50% of index resides in cache
QPS per server = 10 qps
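The sizing rules on the two slides above can be sketched as a short calculation. This is only an illustration under the slides' stated assumptions (8 GB of memory per server, up to 50% of each column's index in cache, 10 QPS per row); the function and parameter names are hypothetical, and the 80 GB index size is the figure from slide 9.

```python
import math

def cluster_size(index_gb, peak_qps, server_mem_gb=8.0,
                 cached_fraction=0.5, qps_per_row=10.0):
    """Return (columns, rows, query_servers) for a row/column search cluster."""
    # If half of a column's index must fit in 8 GB of memory,
    # one column can carry 8 / 0.5 = 16 GB of index.
    index_per_column_gb = server_mem_gb / cached_fraction
    columns = math.ceil(index_gb / index_per_column_gb)
    # Each row is a full copy of the index, so rows scale with traffic.
    rows = math.ceil(peak_qps / qps_per_row)
    return columns, rows, columns * rows

columns, rows, servers = cluster_size(index_gb=80, peak_qps=100)
print(columns, rows, servers)  # 5 10 50
```

Index size drives only the column count and traffic drives only the row count, which is why growth in both multiplies the number of servers.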
11. WVR Server Configuration
Collation Server
Servers: 8 GB, 8 GB, 8 GB, 8 GB, 8 GB, 8 GB, 16 GB
800,000,000 records
Maximum system query capacity = 7 queries per second
64-bit Windows-based servers
Dual-core CPUs
12. Future WVR Cluster Needs
Projections show
1+ terabyte of data (3.5 billion records)
200 queries per second at peak load
Would require a cluster architecture of:
29 columns
40 rows
WVR would need 1,160 query servers!
Nearly $3.5 million in initial capital expenditure
Over $2.3 million in recurring yearly costs
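The server count follows directly from the grid dimensions. The sketch below reproduces it and, as an assumption, applies per-server costs of $3,000 (capital) and $2,000 (yearly). Those per-server figures are not stated on the slides; they are only implied by dividing the totals on slide 19 by 1,160 servers.

```python
COLUMNS = 29               # from the projection above
ROWS = 40                  # from the projection above
CAPEX_PER_SERVER = 3_000   # assumption: $3,480,000 / 1,160 servers
YEARLY_PER_SERVER = 2_000  # assumption: $2,320,000 / 1,160 servers

servers = COLUMNS * ROWS
capex = servers * CAPEX_PER_SERVER
yearly = servers * YEARLY_PER_SERVER

print(f"{servers:,} servers, ${capex:,} capital, ${yearly:,} per year")
# 1,160 servers, $3,480,000 capital, $2,320,000 per year
```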
13. Search Challenges
Many solutions work well with
Low traffic
Initial small data sets
As data and traffic grow, however, so do the costs
and associated problems
Slow indexing times
Low queries per second capacity
Ranked search would have required a significant expansion of
servers to handle the increased search load
Required skilled staff to modify and optimize to handle growth
15. Perfect Search Approach
Replace Existing Lucene
Utilize PS Indexing
Utilize PS Search Engine
Match Business Rules
Incorporate Near Exact modules
Soundex, Metaphone
Match or improve results
Provide query results back to WVR for display
Disk-based index
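For readers unfamiliar with the "near exact" matching named above, here is a minimal sketch of standard American Soundex, a phonetic code widely used in genealogy to match surnames that sound alike. It illustrates the general technique only; it is not Perfect Search's module, and the function name is hypothetical.

```python
# Letter groups that share a Soundex digit.
_GROUPS = (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
           ("L", "4"), ("MN", "5"), ("R", "6"))
_CODES = {ch: digit for letters, digit in _GROUPS for ch in letters}

def soundex(name):
    """Return the 4-character American Soundex code for a name."""
    letters = [c for c in name.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    result = [letters[0]]
    prev = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue          # H and W do not separate equal codes
        code = _CODES.get(ch, "")
        if code and code != prev:
            result.append(code)
        prev = code           # a vowel resets prev, so repeats are kept
    return ("".join(result) + "000")[:4]

print(soundex("Allen"), soundex("Allan"))    # A450 A450
print(soundex("Robert"), soundex("Rupert"))  # R163 R163
```

Because spelling variants such as "Allen" and "Allan" share a code, a search on one can also return records indexed under the other.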
16. Data Growth Past Year
                      August 2008    May 2009         % Growth
Number of Records     800,000,000    1,200,000,000    50%
Number of Databases   9,000          12,000           33%
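The growth percentages can be checked with a one-line formula. The helper name below is hypothetical; the inputs are the figures from this slide.

```python
def percent_growth(before, after):
    """Percentage increase from `before` to `after`."""
    return (after - before) / before * 100

records = percent_growth(800_000_000, 1_200_000_000)
databases = percent_growth(9_000, 12_000)
print(f"{records:.0f}% {databases:.0f}%")  # 50% 33%
```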
17. Current WVR Perfect Search Server Configuration
Collation Server
Servers: 8 GB, 8 GB, 8 GB, 8 GB, 8 GB, 8 GB, 16 GB
1.2 billion records
12,000 databases
Maximum system query capacity = 40 queries per second
5x faster!
64-bit Windows-based servers
Dual-core CPUs
18. Benefits to WorldVitalRecords.com
Reduced indexing time to 1/100 of Lucene times
Reduced query servers from 7 to 1
Provided sub-second query response times
Allows for continued dramatic customer growth
without significant server expansion
Allows World Vital Records to compete with market
leaders at a fraction of the server capitalization and
maintenance costs
19. Future Growth Projections
World Vital Records Growth Plans
1+ terabyte (3 times growth in data)
200 queries per second (20 times growth in customers)
                                     Perfect Search    Lucene
Servers                              20                1,160
Server Capital Expenditure           $60,000           $3,480,000
Recurring Power/Maintenance Costs    $40,000           $2,320,000
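To make the gap between the two columns concrete, the sketch below computes the server ratio and the dollar differences from the table's figures. The dictionary layout and variable names are illustrative only.

```python
projection = {
    "Perfect Search": {"servers": 20, "capex": 60_000, "yearly": 40_000},
    "Lucene": {"servers": 1_160, "capex": 3_480_000, "yearly": 2_320_000},
}

ps = projection["Perfect Search"]
lucene = projection["Lucene"]

server_ratio = lucene["servers"] / ps["servers"]  # how many times more servers
capex_saved = lucene["capex"] - ps["capex"]       # one-time difference
yearly_saved = lucene["yearly"] - ps["yearly"]    # recurring difference

print(f"{server_ratio:.0f}x fewer servers")       # 58x fewer servers
print(f"${capex_saved:,} less capital")           # $3,420,000 less capital
print(f"${yearly_saved:,} less per year")         # $2,280,000 less per year
```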
20. Questions?
Paul Allen, CEO, FamilyLink.com
paul@familylink.com