Thousands of Indexes in the Cloud




Greplin searches:





- Greplin helps you search all your personal information, wherever it is.

- As Michael Arrington of TechCrunch said, we’ve “attacked the other half of search.”

- Greplin supports over a dozen services today, with more added constantly.
Requirements

          • Many inserts
          • Fewer searches
          • Low per-user cost


- We insert up to 5,000 documents/second

- Average document size of 2KB-4KB

- A fully loaded server is an Amazon c1.medium machine responsible for up to 80,000,000
3KB documents

- Each machine has just 1.7GB of RAM!

- Overall, we handle about 50M documents per GB of RAM with median search latencies
around 200ms.
Memory

          • Per doc: 2 longs + 1 int + 1 String (avg 5 letters) in the FieldCache, plus an average of 10 norm’d fields/doc
             • 27 bytes/doc * 50M docs = 1.3GB



- Ranking requires pulling a few field values and norms into memory.

- For 50M documents, this would require well over 1.3GB of memory.

- Assuming an optimized index, searching the number of docs we have per machine with
1GB of RAM is impossible without swapping.

- We benchmarked a single index with swapping: search times were multi-second.
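
The memory figures above can be checked with a few lines of arithmetic. This is only a back-of-the-envelope sketch: the 27 bytes/doc and 1.7GB values come from the slides, while the function and constant names are made up for illustration and decimal gigabytes (10^9 bytes) are assumed.

```python
# Back-of-the-envelope check of the slide's figures (decimal GB: 1 GB = 10**9 bytes).
BYTES_PER_DOC = 27             # from the slide: FieldCache entries plus norms
C1_MEDIUM_RAM_BYTES = 1.7e9    # from the notes: RAM on one c1.medium


def ranking_memory_bytes(num_docs: int) -> int:
    """Memory needed to hold the per-document ranking data for num_docs documents."""
    return BYTES_PER_DOC * num_docs


if __name__ == "__main__":
    for docs in (50_000_000, 80_000_000):
        needed = ranking_memory_bytes(docs)
        share = needed / C1_MEDIUM_RAM_BYTES
        print(f"{docs:,} docs -> {needed / 1e9:.2f} GB ({share:.0%} of one machine's RAM)")
```

At 50M documents this gives 1.35 GB. At the 80M documents a fully loaded server holds, it gives 2.16 GB, more than the machine's 1.7GB of RAM, which is why a single always-resident index does not fit.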
“Virtual memory was meant to make it easier to program when data was larger than the physical memory, but people have still not caught on.”
Poul-Henning Kamp, Varnish architect and coder, “What’s Wrong With 1975 Programming”
http://www.varnish-cache.org/trac/wiki/ArchitectNotes





- Over the last decade, the trend has been to stop manually managing what goes on disk and
what goes in RAM, instead trusting the operating system’s virtual memory and paging
systems to swap data in/out appropriately.

- For example, the caching HTTP proxy Varnish trusts the OS’s virtual memory, and is thus
significantly simpler and faster than Squid, which tries to manage what belongs in memory
versus what belongs on disk itself.

- This philosophy has been jokingly summarized as “You’re not smarter than Linus, so don’t
try to be.”
We’re Smarter than Linus!*

* When we cheat

- Many signals (such as user logins) let us predict, better than the OS can, which users are
likely to search.

- By keeping each user’s data in a separate index, we save memory and improve
performance.

- We only keep open IndexSearchers for users who are likely to do searches.
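
One way to picture "only keep searchers open for likely users" is a small cache keyed by user, warmed by a signal such as a login and bounded by a least-recently-used policy. The sketch below is an assumption about how such a cache could look, not Greplin's implementation: `SearcherCache`, `hint`, and the `open_searcher` factory are hypothetical names, the LRU policy is a stand-in for whatever prediction they actually use, and plain Python objects stand in for Lucene IndexSearchers.

```python
from collections import OrderedDict


class SearcherCache:
    """Keeps at most `capacity` (>= 1) per-user searchers open, evicting the least recently used."""

    def __init__(self, capacity, open_searcher):
        self.capacity = capacity
        self.open_searcher = open_searcher  # hypothetical factory: user_id -> searcher object
        self._open = OrderedDict()          # user_id -> searcher, oldest use first

    def hint(self, user_id):
        """Call on a signal such as a login to open the searcher before the first query."""
        return self.get(user_id)

    def get(self, user_id):
        if user_id in self._open:
            self._open.move_to_end(user_id)          # mark as most recently used
            return self._open[user_id]
        searcher = self.open_searcher(user_id)
        self._open[user_id] = searcher
        while len(self._open) > self.capacity:
            _, evicted = self._open.popitem(last=False)  # drop the least recently used
            close = getattr(evicted, "close", None)
            if callable(close):
                close()                              # release file handles and memory
        return searcher

    def open_users(self):
        """User ids with an open searcher, least recently used first."""
        return list(self._open)
```

With a capacity of 2, hinting `alice` and `bob`, querying `alice` again, and then hinting `carol` leaves `alice` and `carol` open and closes `bob`.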
Other Benefits

           • tar -cvzf user.tar.gz user && mv user.tar.gz
           • du -h
           • Smaller ‘corruption domain’



By keeping each user’s index separate, we can:

- more easily move users between servers

- figure out their space usage

- ensure index corruption affects only one user
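
Because each user's index is just a directory, the `tar` and `du` steps from the slide are easy to script. The following is a minimal sketch assuming one directory per user; the function names and paths are illustrative. Note that it sums apparent file sizes, which is close to but not identical to the block usage `du` reports.

```python
import os
import tarfile


def index_size_bytes(user_dir):
    """Sum of file sizes under one user's index directory (apparent size, similar to du)."""
    total = 0
    for root, _dirs, files in os.walk(user_dir):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total


def archive_user_index(user_dir, archive_path):
    """Pack one user's index into a gzip-compressed tarball, ready to copy to another server."""
    top_level = os.path.basename(os.path.normpath(user_dir))
    with tarfile.open(archive_path, "w:gz") as tar:
        tar.add(user_dir, arcname=top_level)  # keep only the user's directory name in the archive
    return archive_path
```

The archive should be written outside the directory being packed, otherwise the tarball would try to include itself.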
RAM Index

                                           • Deletion Filters
                                           • MultiSearcher
                                           • Flush planning




- Inspired by Zoie (http://sna-projects.com/zoie/)

- All incoming documents are first added to a RAM Index.

- A user search encompasses a ‘filtered’ view of the RAM Index, the currently flushing index,
plus their disk index.

- When the RAM index is ‘full’ we create a new RAM index.

- We open IndexWriters for each user in turn and flush documents from RAM to disk.

- Interesting cases, including updates and deletions, are handled with temporary filters on the
disk index.
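
The flow in these notes can be modelled in a few lines. This is a toy sketch under heavy assumptions, not the real system: dictionaries stand in for the RAM index and the per-user disk indexes, substring matching stands in for search, a `None` entry plays the role of a deletion filter, and `BufferedUserIndex` and its method names are hypothetical.

```python
class BufferedUserIndex:
    """Toy model: writes land in a shared RAM buffer and are flushed per user to 'disk'."""

    def __init__(self, ram_capacity):
        self.ram_capacity = ram_capacity
        self.ram = {}        # (user, doc_id) -> text or None; newest writes
        self.flushing = {}   # the buffer currently being written out
        self.disk = {}       # user -> {doc_id: text}; stands in for per-user disk indexes

    def add(self, user, doc_id, text):
        self.ram[(user, doc_id)] = text
        if len(self.ram) >= self.ram_capacity:   # RAM index is 'full'
            self.start_flush()
            self.finish_flush()

    def delete(self, user, doc_id):
        # A None entry masks older copies until the deletion reaches disk.
        self.ram[(user, doc_id)] = None

    def start_flush(self):
        # Swap in a fresh RAM buffer; the old one stays searchable while it is flushed.
        self.flushing, self.ram = self.ram, {}

    def finish_flush(self):
        # Visit entries grouped by user, mirroring "open a writer for each user in turn".
        for (user, doc_id), text in sorted(self.flushing.items(), key=lambda kv: kv[0]):
            user_docs = self.disk.setdefault(user, {})
            if text is None:
                user_docs.pop(doc_id, None)
            else:
                user_docs[doc_id] = text
        self.flushing = {}

    def search(self, user, term):
        # Combine disk, flushing buffer, and RAM buffer; newer layers override older ones.
        merged = dict(self.disk.get(user, {}))
        for layer in (self.flushing, self.ram):
            for (owner, doc_id), text in layer.items():
                if owner == user:
                    merged[doc_id] = text
        return sorted(d for d, text in merged.items() if text is not None and term in text)
```

A search issued between `start_flush` and `finish_flush` still sees every document, because it reads all three layers; a deletion issued after a flush hides the on-disk copy immediately.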
Amazon Cloud
          • Script everything
          • XFS+LVM expandability and snapshots are helpful
          • Some pain is unavoidable
[Figure: EBS Performance. Throughput in KB/sec (axis from 0 to 150,000) for Seq. Write, Seq. Read, Random Read, and Random Write, comparing Single EBS, RAID10 EBS, Instance Store, and RAID 0 EBS.]



More info at: http://tech.blog.greplin.com/aws-best-practices-and-benchmarks
Other Cool Stuff
          •   ‘kill -9’ any time with no data-loss via a Protocol
              Buffer Write Ahead Log

          •   Detect duplicate documents with Bloom Filter

          •   Dynamically sized SoftReference Cache

          •   Custom MergeScheduler

          •   Custom FieldCache for multi-valued or sparse
              fields

          •   Efficient result clustering and faceting
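
For readers unfamiliar with the Bloom filter mentioned in the list above, here is a minimal sketch of duplicate detection with one. The sizing defaults and the use of SHA-256 with double hashing are assumptions chosen for brevity, not details from the talk. A Bloom filter can report a false positive but never a false negative, so a "might contain" answer would still need a definitive check before a document is dropped.

```python
import hashlib


class BloomFilter:
    """Fixed-size Bloom filter over byte strings."""

    def __init__(self, num_bits=1 << 20, num_hashes=5):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: bytes):
        # Derive all bit positions from two 64-bit slices of one digest (double hashing).
        digest = hashlib.sha256(item).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1   # force odd so the step is never zero
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item: bytes):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: bytes) -> bool:
        # False means definitely never added; True means probably added.
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))
```

In use, the filter would be fed a stable fingerprint of each document (for example its normalized body) and consulted before indexing a new one.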



Some of this is open source: https://github.com/Greplin
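
The write-ahead log bullet on this slide is what makes ‘kill -9’ safe: every change is appended and synced to a log before it is acknowledged, and the log is replayed on restart. The sketch below illustrates the idea only. It uses length-prefixed JSON records as a stand-in for Protocol Buffers, and `wal_append` and `wal_replay` are hypothetical names.

```python
import json
import os
import struct


def wal_append(path, record):
    """Append one length-prefixed record and force it to disk before returning."""
    payload = json.dumps(record).encode("utf-8")
    with open(path, "ab") as f:
        f.write(struct.pack(">I", len(payload)) + payload)  # 4-byte big-endian length, then body
        f.flush()
        os.fsync(f.fileno())


def wal_replay(path):
    """Return every complete record; a torn final record (killed mid-write) is ignored."""
    records = []
    if not os.path.exists(path):
        return records
    with open(path, "rb") as f:
        while True:
            header = f.read(4)
            if len(header) < 4:
                break                      # clean end of log, or a torn length prefix
            (length,) = struct.unpack(">I", header)
            payload = f.read(length)
            if len(payload) < length:
                break                      # torn record body: stop replaying here
            records.append(json.loads(payload.decode("utf-8")))
    return records
```

On startup, the records returned by `wal_replay` would be re-applied to the index before new writes are accepted; once a flush to disk completes, the corresponding part of the log can be discarded.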
Questions?
                          Suggestions?


            Robby Walker                    Shaneal Manek

                shaneal@greplin.com
                     @smanek

We’re hiring: http://www.greplin.com/jobs


Greplin at Lucene Revolution 2011
