The document discusses how building a general purpose search engine is an extremely expensive endeavor that would cost a minimum of $100 million. It breaks down the various costs that contribute to this total, including storage and crawling of web data, ensuring result relevance, performance requirements, and hiring necessary personnel over several years. Through examples of past search startups, the author argues that any entrepreneur claiming they can create a full search engine for under $100 million is underestimating the challenges and expenses involved.
Search Startups are Dead

Entrepreneurs tend to think that there's always a way to innovate out of a problem. In this case, however, I'm going to show you that there are systematic reasons why there cannot be a general purpose search engine that competes with Google and Bing.
I've worked for three search startups – SideStep, Kosmix, and Powerset – and I still don't have a Gulfstream. This is sort of an exercise in apologetics: it's really not my fault that I don't have mountains of cash from my stock options.
There are good reasons involving switching costs and marketing why a new search engine can't just pop up, but that's not what I'll focus on. It's all about the mighty greenback: building a search engine is a really expensive proposition.
It goes without saying that the numbers herein are not the opinion of my employer and are speculative, but they are informed by experience. I've made a lot of estimations in Excel to come up with these numbers, and I'm pretty confident that I'm in the right ballpark.
The equation has two major components: hardware and people. In the following slides, I'll explain the components going into hardware and people and, in the process, show you how complicated and expensive a search engine is to build.
Last year, Google estimated that the Web contains over 1T documents. That's really expensive to store.
It's not just the Web page you have to store. There are links, anchor text, and, since you're a smarty-pants startup, you'll probably be extracting all kinds of smart metadata from every page.
Keep in mind that the Web is constantly changing. New pages are being added, pages already crawled are changing, and making sure you have the latest copy of the Web on hand is really important.
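To get a feel for the scale involved, here is a back-of-envelope estimate of the raw storage a Web-scale crawl implies. Every figure besides the 1T document count is an assumption I've picked for illustration, not a measured number:

```python
# Back-of-envelope storage estimate for a Web-scale crawl.
# Only DOCS comes from the text (Google's ~1T estimate); the
# rest are illustrative assumptions.

DOCS = 1e12            # ~1 trillion documents
AVG_PAGE_KB = 50       # assumed average compressed page size
METADATA_FACTOR = 2.0  # assumed overhead for links, anchor text, metadata
REPLICATION = 3        # assumed copies for durability and serving

# decimal units: 1 TB = 1e9 KB
total_tb = DOCS * AVG_PAGE_KB * METADATA_FACTOR * REPLICATION / 1e9
print(f"{total_tb / 1e3:.0f} PB of storage")  # 300 PB
```

Even with these conservative guesses you land in the hundreds of petabytes, before you account for recrawling changed pages or keeping historical snapshots.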
At a bare minimum, you need results that are as relevant as Bing's or Google's. To do that, you'll need lots of servers to run relevance experiments, and lots of storage for huge amounts of clickstream data.
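A toy sketch of what those relevance experiments look like: clickstream logs feed an A/B comparison between two ranking variants, and you measure which one users click more. The log records here are fabricated for illustration:

```python
# Toy A/B relevance experiment over a fabricated clickstream:
# each record is (ranking variant shown, whether the user clicked).
from collections import defaultdict

clicklog = [
    ("A", True), ("A", False), ("A", True),
    ("B", True), ("B", True), ("B", True), ("B", False),
]

shown = defaultdict(int)
clicked = defaultdict(int)
for variant, was_clicked in clicklog:
    shown[variant] += 1
    clicked[variant] += was_clicked

# click-through rate per variant
ctr = {v: clicked[v] / shown[v] for v in shown}
print(ctr)
```

At real scale the same computation runs over billions of log records per day, which is where the storage and server costs come from.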
I know there aren't any black hat SEO folks in this crowd, but there's a constant battle with site owners who don't have users' best interests at heart and are willing to game search results.
No search engine is complete without lots of ancillary data: weather, stock quotes, images, maps, Twitter, Facebook. Licensing the content or building the vertical is very expensive and you’re not a true replacement without it.
One of the most expensive components of a search engine is runtime. When you do a search in Bing, results come back from thousands, or possibly billions, of Web pages in less than a second. How does that happen? Lots, and lots, and lots of servers.
All search engines use some kind of divide-and-conquer algorithm that federates your search across thousands of machines. That means that for any query, there are thousands of machines involved. When you have millions of users, serving search results gets very expensive.
At Powerset, we estimated that our index was 10-20 times the size of a typical keyword index. The Johnson coefficient represents the tax on storage, relevance, and runtime that you'd pay at an innovative search engine.
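The 10-20x multiplier compounds directly into cost. Here's the arithmetic, with an assumed baseline index footprint and an assumed fully loaded dollar cost per TB-year (both hypothetical figures, not from the talk):

```python
# Effect of the 10-20x index-size multiplier on yearly hardware cost.
# baseline_index_tb and cost_per_tb_year are assumed figures.
baseline_index_tb = 50_000   # assumed keyword-index footprint, in TB
cost_per_tb_year = 100       # assumed $/TB-year, fully loaded

for multiplier in (1, 10, 20):
    cost = baseline_index_tb * multiplier * cost_per_tb_year
    print(f"{multiplier:>2}x index -> ${cost / 1e6:.0f}M/yr")
```

Under these assumptions, a 20x index turns a $5M/yr storage bill into $100M/yr, and the same multiplier hits relevance experiments and runtime serving too.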