- I’m from San Francisco.
- Glassdoor is the world’s largest user-generated content community site for jobs, companies and salaries, with over 80 million pages of jobs and user-generated content.
- About.com and Ask.com are both top 50 web properties in the US by traffic, and two of the largest online publishers in the world. Ask.com has question-and-answer content written by users and editors. About.com has articles written by experts. Each site has about 10 million pages indexed in Google.
It’s harder to control quality:
- 100 pages: You know what’s on each page.
- 100,000 pages: No one is checking them all.
- 100,000,000 pages: Would you know if 1 million of them were junk?
- PageRank: Overall site PR helps new and existing pages.
- Interlinking: If you have more pages, you can get more relevant links between pages.
- Economies of scale: Managing larger sites is more efficient.
- Brand: Users prefer familiar brands in search results.
- “No results” pages: When your site has faceted navigation, some pages have no data (e.g., no products in this category, no reviews for this restaurant, no salaries for this company).
- URL-based duplicates: Multiple URLs return the same content.
- Content-based duplicates: If you have lots of content, sometimes the same topic comes up again.
- Multiple versions of the site, multiple countries: Duplication between versions? Empty pages in some versions?
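One way to look for URL-based duplicates is to reduce every URL to a normalized form and see which ones collide. The sketch below is illustrative only: the `TRACKING_PARAMS` set and the normalization rules (ignore parameter order, trailing slashes, and fragments) are assumptions, and a real site would need rules that match its own URL scheme.

```python
from collections import defaultdict
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical parameters that do not change the content of a page.
TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "sessionid", "ref"}


def normalize_url(url: str) -> str:
    """Reduce a URL to a form that ignores tracking noise and ordering."""
    parts = urlsplit(url)
    # Drop tracking parameters and sort the rest so order does not matter.
    query = sorted(
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in TRACKING_PARAMS
    )
    # Treat "/jobs" and "/jobs/" as the same path.
    path = parts.path.rstrip("/") or "/"
    # The fragment is dropped because it is never sent to the server.
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), path, urlencode(query), "")
    )


def group_duplicates(urls):
    """Return {normalized_url: [original urls]} for URLs that collide."""
    groups = defaultdict(list)
    for url in urls:
        groups[normalize_url(url)].append(url)
    return {key: members for key, members in groups.items() if len(members) > 1}


if __name__ == "__main__":
    crawl = [
        "http://example.com/jobs/",
        "http://example.com/jobs?ref=home",
        "http://example.com/salaries",
    ]
    for canonical, members in group_duplicates(crawl).items():
        print(canonical, "<-", members)
```

Each group that comes back is a candidate for a single canonical URL; whether the pages really serve the same content still has to be checked.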
Every company on Glassdoor, times every city they’re located in, times salaries, reviews, or interviews, times job titles: tens of millions of pages with no results.
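To see how quickly these combinations multiply, here is a small sketch that enumerates every faceted page and separates the ones that have data from the ones that don't. The company names, cities, job titles, and the `result_counts` lookup are all made up for illustration; on a real site the counts would come from the database or search index.

```python
from itertools import product

CONTENT_TYPES = ("salaries", "reviews", "interviews")


def candidate_pages(companies, job_titles):
    """Yield every (company, city, content_type, job_title) combination."""
    for company, cities in companies.items():
        yield from product([company], cities, CONTENT_TYPES, job_titles)


def split_by_results(pages, result_counts):
    """Separate pages that have at least one result from empty ones."""
    indexable, empty = [], []
    for page in pages:
        if result_counts.get(page, 0) > 0:
            indexable.append(page)
        else:
            empty.append(page)
    return indexable, empty


if __name__ == "__main__":
    # Hypothetical data: two companies, three locations, two job titles.
    companies = {"Acme": ["San Francisco", "Austin"], "Globex": ["Boston"]}
    job_titles = ["Engineer", "Analyst"]
    # Hypothetical result counts: only one combination has any content.
    result_counts = {("Acme", "San Francisco", "salaries", "Engineer"): 12}

    pages = list(candidate_pages(companies, job_titles))
    indexable, empty = split_by_results(pages, result_counts)
    print(len(pages), "candidate pages,", len(empty), "with no results")
```

Even this toy example produces 18 candidate pages, 17 of them empty. The empty ones are the pages that could be kept out of the index, for example with a `noindex` robots meta tag.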
At one of the companies I worked at, we found the worst-performing 5% of pages, and we hired a team of editors to fix them.
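A rough sketch of how a "worst 5%" list might be pulled: rank pages by some performance score and take the bottom slice. The notes don't say which metric was used, so the score here (search visits per page) is an assumption, as are the function and variable names.

```python
import math


def worst_performing(scores, fraction=0.05):
    """Return the URLs in the lowest-scoring `fraction` of pages, worst first.

    `scores` maps URL -> performance score (higher is better). The metric
    itself is an assumption; any per-page number would work the same way.
    """
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    ranked = sorted(scores.items(), key=lambda item: item[1])
    if not ranked:
        return []
    # Always return at least one page so small sites still get a result.
    cutoff = max(1, math.ceil(len(ranked) * fraction))
    return [url for url, _ in ranked[:cutoff]]


if __name__ == "__main__":
    # Hypothetical data: 100 pages whose score equals their page number.
    visits = {f"/page/{i}": i for i in range(100)}
    for url in worst_performing(visits):
        print(url)
```

The resulting list is what would be handed to the editors.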
Eliminate Duplicate Titles
- Find pages with the same title (Webmaster Tools).
- Same/overlapping content? Canonicalize the worse one to the better one.
- Different content? Merge them into one content page.
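If you have your own crawl data rather than a Webmaster Tools report, the same check can be sketched in a few lines: group pages by title, then point each weaker page at the strongest one in its group. The `score` field used to decide which page is "better" is hypothetical, and so are the sample records; deciding whether two pages really overlap still takes a human or a content-similarity check.

```python
from collections import defaultdict


def duplicate_title_groups(pages):
    """Group pages whose titles match after case and whitespace cleanup."""
    groups = defaultdict(list)
    for page in pages:
        key = " ".join(page["title"].lower().split())
        groups[key].append(page)
    return {title: group for title, group in groups.items() if len(group) > 1}


def canonical_targets(groups):
    """Map each weaker URL to the best-scoring URL in its title group."""
    mapping = {}
    for group in groups.values():
        best = max(group, key=lambda page: page["score"])
        for page in group:
            if page is not best:
                mapping[page["url"]] = best["url"]
    return mapping


if __name__ == "__main__":
    # Hypothetical crawl records; "score" stands in for any quality signal.
    pages = [
        {"url": "/a", "title": "How to Tie a Tie", "score": 40},
        {"url": "/b", "title": "how to  tie a tie", "score": 90},
        {"url": "/c", "title": "How to Boil an Egg", "score": 10},
    ]
    print(canonical_targets(duplicate_title_groups(pages)))
```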
We created a search engine index of all our pages using Solr, an open source search engine platform.