The Internet as a Single Database
1. The Internet as a Single Database Technologies Used & Lessons Learned Houston Code Camp, August 2011 Shion Deysarkar CEO, Datafiniti
2. What does that mean? All web data in one, unified format Places, people, news, URLs, products, etc., etc. Accessible as if you were querying a database
3. Why build such a thing? Our users needed a better way of getting web data Web crawling is kludgy and unintuitive Developers deserve something better than current APIs
6. The Challenges There’s a lot of data on the web 100 million registered domains Maybe only 100,000 have interesting stuff? (Which ones?) Some sites have millions or billions of data points
7. The Challenges It’s all structured differently! Do we have to write web crawls for each website? Writing 100,000 web crawlers seems... not fun
10. Data Collection Building a scalable web crawler Cloud or local data center? Neither. Grid computing (think SETI@home) 1000s of home PCs that exchange time & bandwidth for $ Crawl very fast for relatively little $
15. Data Collection Building a scalable web crawler Coding 1000s of extraction apps Abstract away everything but pattern matching and link generation
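A minimal sketch of what "abstract away everything but pattern matching and link generation" could look like, assuming each site is described only by a set of regex patterns and a link generator. The `SiteSpec` class, the field names, and the `example.com` URLs are hypothetical and not taken from the talk; fetching, scheduling, and retries are assumed to live in the shared crawler.

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List


@dataclass
class SiteSpec:
    """Hypothetical per-site definition: only patterns and link generation."""
    patterns: Dict[str, str]                      # field name -> regex with one capture group
    generate_links: Callable[[], Iterable[str]]   # yields the URLs to crawl


def extract(spec: SiteSpec, html: str) -> Dict[str, List[str]]:
    """Apply every field pattern to a page and collect the captured values."""
    return {
        field: re.findall(pattern, html)
        for field, pattern in spec.patterns.items()
    }


# Spec for an imaginary listings site (example.com is a placeholder).
spec = SiteSpec(
    patterns={
        "name": r'<h1 class="name">(.*?)</h1>',
        "phone": r'<span class="phone">(.*?)</span>',
    },
    generate_links=lambda: (
        f"https://example.com/listing/{i}" for i in range(1, 4)
    ),
)

page = '<h1 class="name">Joe\'s Diner</h1><span class="phone">555-0100</span>'
record = extract(spec, page)
links = list(spec.generate_links())
```

With this split, adding a new website means writing a new `SiteSpec` rather than a new crawler.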
16. Data Collection Building a scalable web crawler Current peak performance: 4.32 billion URLs per month Deploying 20 new website crawls every month Easy to scale crawling performance (just add grid nodes) Easy to scale deployment (just add contractors)
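To put the peak figure in perspective, the arithmetic below converts 4.32 billion URLs per month into daily and per-second rates. The 30-day month is an assumption; the slide does not say how the month was counted.

```python
SECONDS_PER_DAY = 24 * 60 * 60
DAYS_PER_MONTH = 30  # assumption: the slide does not define "month"

urls_per_month = 4_320_000_000  # peak figure from the slide

urls_per_day = urls_per_month / DAYS_PER_MONTH
urls_per_second = urls_per_day / SECONDS_PER_DAY

print(f"{urls_per_day:,.0f} URLs/day")        # 144,000,000 URLs/day
print(f"{urls_per_second:,.0f} URLs/second")  # 1,667 URLs/second
```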
17. Now for step 2! (step 1 took us 3 years >_<) Data Storage
18. Data Storage Building a scalable data store What we’re dealing with: TBs (eventually PBs) of data Billions of rows, Thousands of columns (maybe more) Don’t want to deal with sharding Don’t actually care about ACID Do care about high-throughput and fault-tolerance
22. Data Storage Building a unified database of everything Normalizing separate data points that represent the same thing Co-occurrence: most popular choice wins
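The co-occurrence rule can be sketched as a simple majority vote over the values different sites report for the same field. The function name and the sample phone numbers are illustrative assumptions, not the talk's actual implementation.

```python
from collections import Counter
from typing import Iterable


def most_common_value(values: Iterable[str]) -> str:
    """Return the value reported most often; ties go to the first one seen."""
    counts = Counter(values)
    value, _count = counts.most_common(1)[0]
    return value


# Four crawled pages report a phone number for the same place.
observed = ["555-0100", "555-0100", "555-0199", "555-0100"]
winner = most_common_value(observed)
print(winner)  # 555-0100
```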
23. Data Storage Building a unified database of everything Normalizing separate data points that represent the same thing Trusted sources: put more weight on sources that tend to be right
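One way to read "put more weight on sources that tend to be right" is a weighted vote, where each source contributes its trust score instead of a single count. The source names, the weights, and the default weight of 1.0 below are made up for illustration; the talk does not say how trust is measured.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def weighted_winner(
    observations: Iterable[Tuple[str, str]],
    trust: Dict[str, float],
    default_weight: float = 1.0,
) -> str:
    """Sum the trust weight behind each value and return the heaviest one."""
    scores: Dict[str, float] = defaultdict(float)
    for source, value in observations:
        scores[value] += trust.get(source, default_weight)
    return max(scores, key=scores.get)


# Hypothetical trust scores: the official site counts three times as much.
trust = {"official-site": 3.0, "directory-a": 1.0, "directory-b": 1.0}
observations = [
    ("official-site", "555-0100"),
    ("directory-a", "555-0199"),
    ("directory-b", "555-0199"),
]
winner = weighted_winner(observations, trust)
print(winner)  # 555-0100
```

Here the trusted source wins even though two of the three sources disagree with it, which is the case plain co-occurrence would get wrong.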
24. Data Storage Building a unified database of everything Identifying interesting data on a random web page
25. Yay, step 3! (step 2 took us 3 months :D) Data Retrieval
28. JSON default output, but will also support CSV and XML
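A rough sketch of serving the same records in three formats, with JSON as the default. The `render` function, the record fields, and the XML element names are assumptions for illustration; field names are assumed to be valid XML tag names.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET
from typing import Dict, List

Records = List[Dict[str, str]]


def to_json(records: Records) -> str:
    return json.dumps(records)


def to_csv(records: Records) -> str:
    # Use the union of all keys as the header so sparse records still fit.
    fieldnames = sorted({key for record in records for key in record})
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


def to_xml(records: Records) -> str:
    root = ET.Element("records")
    for record in records:
        item = ET.SubElement(root, "record")
        for key, value in record.items():
            ET.SubElement(item, key).text = str(value)
    return ET.tostring(root, encoding="unicode")


FORMATTERS = {"json": to_json, "csv": to_csv, "xml": to_xml}


def render(records: Records, fmt: str = "json") -> str:
    """Serialize records in the requested format; JSON is the default."""
    return FORMATTERS[fmt](records)


records = [{"name": "Joe's Diner", "city": "Houston"}]
print(render(records))
print(render(records, "csv"))
print(render(records, "xml"))
```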
29. SSL authentication with token. Briefly considered using a 3rd-party service like Mashery
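As a sketch of token authentication over SSL from the client side: the request below only goes out over HTTPS and carries the token in a header. The endpoint URL, the `q` parameter, and the `Authorization: Bearer` header are all assumptions; the slides do not show how the real API accepts its token.

```python
from urllib.parse import urlencode, urlsplit
from urllib.request import Request

# Hypothetical endpoint; the real API URL is not given in the slides.
API_BASE = "https://api.example.com/v1/query"


def build_request(query: str, token: str, base: str = API_BASE) -> Request:
    """Build an authenticated request, refusing to send a token without TLS."""
    if urlsplit(base).scheme != "https":
        raise ValueError("refusing to send a token over a non-TLS connection")
    url = f"{base}?{urlencode({'q': query})}"
    # Assumed header scheme; a real API might use a query parameter instead.
    return Request(url, headers={"Authorization": f"Bearer {token}"})


req = build_request("category:restaurants", token="YOUR_TOKEN")
print(req.full_url)
```

Passing the resulting `Request` to `urllib.request.urlopen` would send it; that call is left out here so the sketch runs without network access.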
30. Put it all together… (step 3 took 3 weeks!!!) Sneak Peek
31. Launching Soon Sign up for the beta at http://www.datafiniti.net Follow us @Datafiniti