The Internet as a Single Database: Technologies Used & Lessons Learned. Houston Code Camp, August 2011. Shion Deysarkar, CEO, Datafiniti.
What does that mean? All web data in one, unified format: places, people, news, URLs, products, etc. Accessible as if you were querying a database.
Why build such a thing? Our users needed a better way of getting web data. Web crawling is kludgy and unintuitive. Developers deserve something better than current APIs.
Why build such a thing? Because it would be awesome!
Not an easy task… The Challenges
The Challenges: There’s a lot of data on the web. 100 million registered domains. Maybe only 100,000 have interesting stuff? (Which ones?) Some sites have millions or billions of data points.
The Challenges: It’s all structured differently! Do we have to write web crawls for each website? Writing 100,000 web crawlers seems... not fun.
The Challenges: Data can conflict. How do we know which data is correct?
So let’s start at the beginning: Data Collection
Data Collection: Building a scalable web crawler. Cloud or local data center? Neither. Grid computing (think SETI@home): 1000s of home PCs that exchange time & bandwidth for $. Crawl very fast for relatively little $.
Data Collection: Building a scalable web crawler. Coding 1000s of extraction apps. Build a framework that handles all the kludgy work:
  Result formatting & storage
  Throttle rates & crawling behavior
  Any other crawling activity not specific to a website’s structure
Abstract away everything but pattern matching and link generation.
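One of the framework duties listed above, throttle rates, could be handled per host along these lines. This is a minimal sketch, not Datafiniti's implementation: `HostThrottle`, `min_interval_s`, and the injected `now` clock are assumptions made up for illustration.

```python
from urllib.parse import urlparse


class HostThrottle:
    """Hypothetical per-host throttle: at most one request per interval per host."""

    def __init__(self, min_interval_s: float):
        self.min_interval_s = min_interval_s
        self._next_allowed = {}  # host -> earliest time the next request may start

    def wait_time(self, url: str, now: float) -> float:
        """Return how long the caller should sleep before fetching `url`."""
        host = urlparse(url).netloc
        start = max(now, self._next_allowed.get(host, now))
        # Reserve the slot so the following request to this host is pushed back.
        self._next_allowed[host] = start + self.min_interval_s
        return start - now


throttle = HostThrottle(min_interval_s=2.0)
first = throttle.wait_time("http://a.test/1", now=0.0)   # nothing to wait for
second = throttle.wait_time("http://a.test/2", now=0.5)  # same host, must wait
other = throttle.wait_time("http://b.test/1", now=0.5)   # different host, no wait
```

Passing the clock in as `now` keeps the logic testable; a real crawler would presumably call `time.monotonic()` and sleep for the returned duration.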
Data Collection: Building a scalable web crawler. Coding 1000s of extraction apps: abstract away everything but pattern matching and link generation.
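To make the idea concrete, here is a rough sketch of what "everything but pattern matching and link generation" might look like in code. The slides do not show the actual framework, so `SiteExtractor`, `crawl`, the regex-based patterns, and the in-memory `PAGES` stand-in for the network are all hypothetical.

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict, List
from urllib.parse import urljoin


@dataclass
class SiteExtractor:
    """Hypothetical per-site definition: only patterns and link generation."""
    name: str
    field_patterns: Dict[str, str]  # field name -> regex with one capture group
    link_pattern: str               # regex whose single group captures an href

    def extract(self, html: str) -> Dict[str, str]:
        record = {}
        for field, pattern in self.field_patterns.items():
            match = re.search(pattern, html)
            if match:
                record[field] = match.group(1).strip()
        return record

    def links(self, base_url: str, html: str) -> List[str]:
        return [urljoin(base_url, href) for href in re.findall(self.link_pattern, html)]


def crawl(extractor: SiteExtractor, start_url: str,
          fetch: Callable[[str], str], max_pages: int = 10) -> List[dict]:
    """Generic loop the framework would own: queueing, dedup, result collection."""
    queue, seen, results = [start_url], set(), []
    while queue and len(seen) < max_pages:
        url = queue.pop(0)
        if url in seen:
            continue
        seen.add(url)
        html = fetch(url)
        record = extractor.extract(html)
        if record:
            results.append({"url": url, **record})
        queue.extend(link for link in extractor.links(url, html) if link not in seen)
    return results


# Fake pages so the sketch runs without network access.
PAGES = {
    "http://example.test/": '<a href="/p/1">one</a> <a href="/p/2">two</a>',
    "http://example.test/p/1": '<h1>Widget</h1><span class="price">$5</span>',
    "http://example.test/p/2": '<h1>Gadget</h1><span class="price">$7</span>',
}
shop = SiteExtractor(
    name="example-shop",
    field_patterns={"title": r"<h1>(.*?)</h1>", "price": r'class="price">(.*?)<'},
    link_pattern=r'href="(.*?)"',
)
results = crawl(shop, "http://example.test/", lambda url: PAGES.get(url, ""))
```

Under this split, adding a new website means writing only a new `SiteExtractor`; the crawl loop stays untouched.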
Data Collection: Building a scalable web crawler. Current peak performance: 4.32 billion URLs per month. Deploying 20 new website crawls every month. Easy to scale crawling performance (just add grid nodes). Easy to scale deployment (just add contractors).
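For a sense of scale, the monthly peak figure can be broken down into shorter intervals. The 30-day month is an assumption; the slide does not say how the figure was derived.

```python
URLS_PER_MONTH = 4_320_000_000  # peak figure quoted on the slide
DAYS_PER_MONTH = 30             # assumption: a 30-day month

per_day = URLS_PER_MONTH / DAYS_PER_MONTH
per_hour = per_day / 24
per_minute = per_hour / 60
per_second = per_minute / 60

print(f"{per_day:,.0f} URLs/day")
print(f"{per_hour:,.0f} URLs/hour")
print(f"{per_minute:,.0f} URLs/minute")
print(f"{per_second:,.0f} URLs/second")
```

Under that assumption the peak works out to 100,000 URLs per minute, or roughly 1,667 per second.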
Now for step 2! (step 1 took us 3 years >_<) Data Storage
Data Storage: Building a scalable data store. What we’re dealing with: TBs (eventually PBs) of data; billions of rows, thousands of columns (maybe more). Don’t want to deal with sharding. Don’t actually care about ACID. Do care about high throughput and fault tolerance.
Data Storage: Building a scalable data store. NoSQL (Cassandra) >> MySQL (for us). Can increase throughput and storage linearly by adding nodes. Virtually unlimited and variable # of columns. Much faster read/write. Some challenges:
  Not a mature technology yet, expect frequent updates
Hadoop for batch-style processing. Impressive production-scale examples. Backed by corporations (DataStax) and some really smart people.
Data Storage: Building a unified database of everything. Normalizing separate data points that represent the same thing. Co-occurrence: most popular choice wins.
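The co-occurrence rule amounts to a majority vote over the values seen for one field. A minimal sketch follows, assuming the values are already gathered per field; the function name, the whitespace and lowercase normalization, and the sample phone numbers are illustrative choices, not taken from the talk.

```python
from collections import Counter
from typing import Iterable, Optional


def most_common_value(values: Iterable[str]) -> Optional[str]:
    """Pick the value reported most often; ties go to the value seen first."""
    counts = Counter(v.strip().lower() for v in values if v and v.strip())
    if not counts:
        return None
    return counts.most_common(1)[0][0]


# Three crawled sources disagree on a phone number for the same place.
phone = most_common_value(["713-555-0100", "713-555-0100 ", "713-555-0199"])
```

Here `phone` resolves to the number reported by two of the three sources.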
Data Storage: Building a unified database of everything. Normalizing separate data points that represent the same thing. Trusted sources: put more weight on sources that tend to be right.
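Weighting by source trust can be sketched as a vote in which each source contributes its trust score instead of a count of one. The trust values, the `default_trust` fallback, and the source names below are invented for illustration; the talk does not say how trust is measured.

```python
from collections import defaultdict
from typing import Dict, Iterable, Optional, Tuple


def weighted_choice(observations: Iterable[Tuple[str, str]],
                    trust: Dict[str, float],
                    default_trust: float = 0.5) -> Optional[str]:
    """Sum trust per candidate value and return the highest-scoring one."""
    scores = defaultdict(float)
    for source, value in observations:
        scores[value] += trust.get(source, default_trust)
    return max(scores, key=scores.get) if scores else None


# Hypothetical trust scores: the official site has tended to be right.
trust = {"site-a": 0.2, "site-b": 0.2, "official-site": 0.9}
observations = [
    ("site-a", "100 Main St"),
    ("site-b", "100 Main St"),
    ("official-site", "100 Main Street"),
]
address = weighted_choice(observations, trust)
```

A plain majority vote would pick "100 Main St" here; with these weights the single trusted source wins (0.9 against 0.4).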
Data Storage: Building a unified database of everything. Identifying interesting data on a random web page.
Yay, step 3! (step 2 took us 3 months :D) Data Retrieval

JSON default output, but will also support CSV and XML.
SSL authentication with token. Briefly considered using a 3rd-party service like Mashery.
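Serving the same records as JSON, CSV, or XML might look roughly like the following. This is only a sketch using the Python standard library; the `render` function, the `records` and `record` element names, and the sample record are assumptions, since the slides do not describe the actual API.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET
from typing import Dict, List


def render(records: List[Dict[str, str]], fmt: str = "json") -> str:
    """Serialize flat records; JSON is the default, as on the slide."""
    if fmt == "json":
        return json.dumps(records)
    if fmt == "csv":
        fields = sorted({key for record in records for key in record})
        buffer = io.StringIO()
        writer = csv.DictWriter(buffer, fieldnames=fields, lineterminator="\n")
        writer.writeheader()
        writer.writerows(records)  # missing fields are written as empty cells
        return buffer.getvalue()
    if fmt == "xml":
        root = ET.Element("records")
        for record in records:
            item = ET.SubElement(root, "record")
            for key, value in record.items():
                ET.SubElement(item, key).text = str(value)
        return ET.tostring(root, encoding="unicode")
    raise ValueError(f"unsupported format: {fmt}")


sample = [{"name": "Cafe", "city": "Houston"}]
as_json = render(sample)
as_csv = render(sample, "csv")
as_xml = render(sample, "xml")
```

The XML branch assumes every key is a valid element name, which holds for simple field names like the ones shown.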
Put it all together… (step 3 took 3 weeks!!!) Sneak Peek
Launching Soon. Sign up for the beta at http://www.datafiniti.net and follow us @Datafiniti.