Common Crawl for Start-ups
July 20, 2015
Data Science Summit & Dato Conference

It's a non-profit that makes web data freely accessible to anyone

Each crawl archive is billions of pages:
May crawl archive is 2.05 billion web pages, ~159 terabytes uncompressed
Released totally free (lives on Amazon Public Data Sets)
without additional intellectual property restrictions

Origins of Common Crawl
Common Crawl founded in 2007 by Gil Elbaz (Applied Semantics / Factual)
Google and Microsoft were the powerhouses
Data powers the algorithms in our field
Goal: Democratize and simplify access to "the web as a dataset"

Benefits for start-ups
Performing your own large-scale crawling is expensive and challenging
Innovation occurs by using the data, rarely through novel collection methods
+ Lower the barrier to entry for new start-ups
+ Create a communal pool of knowledge & data

Common Crawl File Formats
WARC (as downloaded)
+ Raw HTTP response headers
+ Raw HTTP responses
WAT (metadata)
+ HTML head data
+ HTTP header fields
+ Extracted links / script tags
WET (only text)
+ Extracted text
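
To make the WARC layout concrete, here is a minimal sketch of reading one record: WARC headers, a blank line, then the raw HTTP response. It assumes an uncompressed stream and a made-up sample record; the function name `read_warc_records` is hypothetical. Real crawl files are gzip-compressed, so a dedicated library such as warcio is the more practical choice there.

```python
import io

def read_warc_records(stream):
    """Yield (warc_headers, payload_bytes) for each record in an uncompressed WARC stream."""
    while True:
        line = stream.readline()
        if not line:
            return                      # end of stream
        if not line.strip():
            continue                    # blank separator lines between records
        if not line.startswith(b"WARC/"):
            raise ValueError("expected a WARC version line, got %r" % line)
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break                   # blank line ends the WARC header block
            name, _, value = line.decode("utf-8").partition(":")
            headers[name.strip()] = value.strip()
        # The payload is exactly Content-Length bytes long.
        payload = stream.read(int(headers["Content-Length"]))
        yield headers, payload

# Made-up sample record (not taken from a real crawl archive).
http_response = (
    b"HTTP/1.1 200 OK\r\n"
    b"Server: nginx\r\n"
    b"Content-Type: text/html\r\n"
    b"\r\n"
    b"<html><title>Hi</title></html>"
)
sample = (
    b"WARC/1.0\r\n"
    b"WARC-Type: response\r\n"
    b"WARC-Target-URI: http://example.com/\r\n"
    b"Content-Length: " + str(len(http_response)).encode("ascii") + b"\r\n"
    b"\r\n" + http_response + b"\r\n\r\n"
)

records = list(read_warc_records(io.BytesIO(sample)))
headers, payload = records[0]
# Split the raw HTTP response into its header block and its body.
http_head, _, body = payload.partition(b"\r\n\r\n")
```

Here `http_head` holds the raw HTTP response headers and `body` the raw response, the two items listed under WARC above.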

Web Data at Scale
Analytics
+ Usage of servers, libraries, and metadata
Machine Learning
+ Language models based upon billions of tokens
Filtering and Aggregation
+ Analyzing tables, Wikipedia, phone numbers

Analytics at Scale
Imagine you are interested in ...
+ JavaScript library usage
+ HTML / HTML5 usage
+ Web server types and age
+ RDFa, Microdata, and Microformat Data Sets
With Common Crawl you can analyse billions of pages in an afternoon!
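
As a rough illustration of the "web server types" question, the sketch below tallies the `Server` header across a few WAT-style JSON records. The nesting `Envelope > Payload-Metadata > HTTP-Response-Metadata > Headers` is an assumption about the WAT layout, and the sample records and the helper name `server_family` are invented for the example.

```python
import json
from collections import Counter

def server_family(wat_record):
    """Return a normalised server name such as 'nginx' or 'apache'."""
    headers = (wat_record.get("Envelope", {})
                         .get("Payload-Metadata", {})
                         .get("HTTP-Response-Metadata", {})
                         .get("Headers", {}))
    server = headers.get("Server", "")
    # "nginx/1.6.2" -> "nginx"; "Apache/2.4.10 (Debian)" -> "apache"
    return server.split("/")[0].split(" ")[0].strip().lower() or "(unknown)"

def wrap(headers):
    """Build an invented WAT-style JSON line around a dict of HTTP headers."""
    return json.dumps({"Envelope": {"Payload-Metadata": {
        "HTTP-Response-Metadata": {"Headers": headers}}}})

# Invented sample lines; a real job would stream these from WAT files.
sample_lines = [
    wrap({"Server": "nginx/1.6.2"}),
    wrap({"Server": "Apache/2.4.10 (Debian)"}),
    wrap({"Server": "nginx"}),
    wrap({"Content-Type": "text/html"}),   # no Server header at all
]

counts = Counter(server_family(json.loads(line)) for line in sample_lines)
```

The same counting pattern would apply to script URLs for JavaScript library usage, with a different extraction function.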

Analyzing Web Domain Vulns
Sietse T. Au and Wing Lung Ngai

WDC Hyperlink Graph
Largest freely available real world graph dataset:
3.6 billion pages, 128 billion links
http://webdatacommons.org/hyperlinkgraph/
Fast and easy analysis using Dato GraphLab on a single EC2 r3.8xlarge instance
(under 10 minutes per PageRank iteration)
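
For readers unfamiliar with what a "PageRank iteration" involves, here is a plain-Python sketch of one on a toy four-page graph. It is not the GraphLab API; the function name, the damping factor of 0.85, and the toy graph are assumptions for illustration.

```python
def pagerank_iteration(ranks, out_links, damping=0.85):
    """Run one PageRank update and return the new rank of every page."""
    n = len(ranks)
    # Every page starts with the "random jump" share.
    new_ranks = {page: (1.0 - damping) / n for page in ranks}
    for page, rank in ranks.items():
        targets = out_links.get(page, [])
        if targets:
            # Split this page's rank evenly among the pages it links to.
            share = damping * rank / len(targets)
            for target in targets:
                new_ranks[target] += share
        else:
            # Dangling page: spread its rank evenly over all pages.
            share = damping * rank / n
            for target in new_ranks:
                new_ranks[target] += share
    return new_ranks

# Toy graph: page -> pages it links to.
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = {page: 1.0 / len(links) for page in links}
for _ in range(50):
    ranks = pagerank_iteration(ranks, links)

top_page = max(ranks, key=ranks.get)
```

Each iteration touches every edge once, which is why the time per iteration is the natural unit to quote for a 128-billion-link graph.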

N-gram Counts & Language Models from the Common Crawl
Christian Buck, Kenneth Heafield, Bas van Ooyen
Processed all the text of Common Crawl to produce 975 billion deduplicated tokens
(similar size to the Google N-gram Dataset)
Project data was released at http://statmt.org/ngrams
+ Deduped text split by language
+ Resulting language models
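
The idea of deduplicating text and then counting n-grams can be sketched in a few lines. This is a toy version under stated assumptions: deduplication is done per exact line, tokens are split on whitespace, and the function name `ngram_counts` is hypothetical. It is not the authors' pipeline.

```python
from collections import Counter

def ngram_counts(lines, n=2):
    """Count n-grams over lines, skipping exact duplicate lines."""
    seen = set()
    counts = Counter()
    for line in lines:
        if line in seen:
            continue                    # duplicate line: already counted once
        seen.add(line)
        tokens = line.split()           # naive whitespace tokenisation
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

# Invented sample text with one repeated line.
corpus = ["the web is big", "the web is open", "the web is big"]
bigrams = ngram_counts(corpus, n=2)
```

At web scale the `seen` set would not fit in memory, so a real job would deduplicate with hashing and a distributed sort or similar.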

GloVe: Global Vectors for Word Representation
Jeffrey Pennington, Richard Socher, Christopher D. Manning
Word vector representations:
king - queen = man - woman
king - man + woman = queen
(produces dimensions of meaning)
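
The arithmetic above can be tried on hand-made vectors. The two-dimensional vectors below are invented so that one axis acts as "royalty" and the other as "gender"; real GloVe vectors have hundreds of dimensions and only approximately satisfy these equalities. The helper names are hypothetical.

```python
import math

# Invented toy vectors: [royalty, gender].
vectors = {
    "king":  [1.0, 1.0],
    "queen": [1.0, -1.0],
    "man":   [0.0, 1.0],
    "woman": [0.0, -1.0],
    "apple": [-1.0, 0.0],
}

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def analogy(a, b, c):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))

result = analogy("king", "man", "woman")   # "queen" for these toy vectors
```

Ranking by cosine similarity rather than exact equality is what makes the trick usable on learned vectors, where the sum rarely lands exactly on a word.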

GloVe On Various Corpora
Semantic: "Athens is to Greece as Berlin is to _?"
Syntactic: "Dance is to dancing as fly is to _?"

GloVe over Big Data
GloVe and word2vec (competing algorithm) can scale to hundreds of billions of tokens
Trained on the Common Crawl n-gram data: http://statmt.org/ngrams
Source code and pre-trained models at http://www-nlp.stanford.edu/projects/glove/

Web-Scale Parallel Text
Dirt Cheap Web-Scale Parallel Text from the Common Crawl (Smith et al.)
Processed all text from URLs of the style:
website.com/[langcode]/
[w.com/en/tesla | w.com/fr/tesla]
"...nothing more than a set of common two-letter language codes ... [we] mined 32 terabytes ... in just under a day"
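
The URL-matching idea can be sketched as follows: replace the language-code path segment with a wildcard and group URLs that then look identical. This covers only candidate pairing; the paper's pipeline does more than this. The set of language codes, the function names, and the sample URLs are assumptions for illustration.

```python
from collections import defaultdict
from urllib.parse import urlsplit

LANG_CODES = {"en", "fr", "de", "es"}   # illustrative subset of two-letter codes

def language_neutral_key(url):
    """Return (lang, key) where key is the URL with its language segment wildcarded."""
    parts = urlsplit(url)
    segments = parts.path.split("/")
    for i, segment in enumerate(segments):
        if segment.lower() in LANG_CODES:
            neutral = segments[:i] + ["*"] + segments[i + 1:]
            return segment.lower(), parts.netloc + "/".join(neutral)
    return None, None                   # no language code in the path

def candidate_pairs(urls, target_lang="en"):
    """Pair each foreign-language URL with its English counterpart, if any."""
    groups = defaultdict(dict)
    for url in urls:
        lang, key = language_neutral_key(url)
        if lang is not None:
            groups[key][lang] = url
    pairs = []
    for by_lang in groups.values():
        if target_lang in by_lang:
            for lang, url in sorted(by_lang.items()):
                if lang != target_lang:
                    pairs.append((url, by_lang[target_lang]))
    return pairs

urls = [
    "http://w.com/en/tesla",
    "http://w.com/fr/tesla",
    "http://w.com/en/about",    # no counterpart in another language
    "http://w.com/contact",     # no language code at all
]
pairs = candidate_pairs(urls)
```

Each pair is (source, target) with the foreign-language page first and the English page second.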

Web-Scale Parallel Text
Manual inspection across three languages:
80% of the data contained good translations
(source = foreign language, target = English)

Web Data Commons Web Tables
Extracted 11.2 billion tables from WARC files,
filtered to keep relational tables via trained classifier
Only 1.3% of the original data was kept,
yet it still remains hugely valuable
Resulting dataset:
11.2 billion tables => 147 million relational web tables

Web Data Commons Web Tables
Popular column headers: name, title, artist, location, model, manufacturer, country ...
Released at webdatacommons.org/webtables/

Extracting US Phone Numbers
"Let's use Common Crawl to help match businesses from Yelp's database to the possible web pages for those businesses on the Internet."
Yelp extracted ~748 million US phone numbers from the Common Crawl December 2014 dataset
Regular expression over extracted text (WET files)
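
A regular expression over plain text might look like the sketch below. The pattern is an illustrative guess at common US formats, not the expression Yelp used (their blog post has the real code); the function name and sample text are invented.

```python
import re

US_PHONE = re.compile(r"""
    (?<!\d)                 # not preceded by another digit
    (?:\+?1[\s.-]?)?        # optional country code
    \(?(\d{3})\)?           # area code, parentheses optional
    [\s.-]?
    (\d{3})                 # exchange
    [\s.-]?
    (\d{4})                 # subscriber number
    (?!\d)                  # not followed by another digit
""", re.VERBOSE)

def extract_phone_numbers(text):
    """Return every matched number normalised to ten digits."""
    return ["".join(match.groups()) for match in US_PHONE.finditer(text)]

# Invented sample text; the long order number must not match.
text = ("Call (415) 555-0132 or 415.555.0199, intl +1 650-555-0100. "
        "Order #12345678901234.")
numbers = extract_phone_numbers(text)
```

Normalising to bare digits makes the extracted numbers directly comparable with numbers stored in a business database.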

Extracting US Phone Numbers
Total complexity: 134 lines of Python
Total time: 1 hour (20 × c3.8xlarge)
Total cost: $10.60 USD (Python using EMR)
Matched against Yelp's database:
48% had exact URL matches
61% had matching domains
More details (and full code) on Yelp's blog post:
Analyzing the Web For the Price of a Sandwich

Common Crawl's Derived Datasets
Natural language processing:
+ Parallel text for machine translation
+ N-gram & language models (975 bln tokens)
+ WDC Web tables (3.5 bln)
Large scale web analysis:
+ WDC Hyperlink Graphs (128 bln edges)
+ WikiReverse.org - Wikipedia in-links analysis
and a million more use cases!

Why am I so excited..?
Open data is catching on!
Even playing field for academia and start-ups
Google Web 1T => Buck et al.'s N-grams
Google's Wikilinks => WikiReverse
Google's Sets => WDC Web Tables
Common Crawl releases their dataset and brilliant people build on top of it

Read more at commoncrawl.org
Stephen Merity
stephen@commoncrawl.org

Contenu connexe

Tendances

A Data Ecosystem to Support Machine Learning in Materials Science
A Data Ecosystem to Support Machine Learning in Materials ScienceA Data Ecosystem to Support Machine Learning in Materials Science
A Data Ecosystem to Support Machine Learning in Materials ScienceGlobus
 
Try It The Google Way .
Try It The Google Way .Try It The Google Way .
Try It The Google Way .abhinavbom
 
nanopub-java: A Java Library for Nanopublications
nanopub-java: A Java Library for Nanopublicationsnanopub-java: A Java Library for Nanopublications
nanopub-java: A Java Library for NanopublicationsTobias Kuhn
 
The RDF Report Card: Beyond the Triple Count
The RDF Report Card: Beyond the Triple CountThe RDF Report Card: Beyond the Triple Count
The RDF Report Card: Beyond the Triple CountLeigh Dodds
 
Dan Sullivan - Data Analytics and Text Mining with MongoDB - NoSQL matters Du...
Dan Sullivan - Data Analytics and Text Mining with MongoDB - NoSQL matters Du...Dan Sullivan - Data Analytics and Text Mining with MongoDB - NoSQL matters Du...
Dan Sullivan - Data Analytics and Text Mining with MongoDB - NoSQL matters Du...NoSQLmatters
 
DH11: Browsing Highly Interconnected Humanities Databases Through Multi-Resul...
DH11: Browsing Highly Interconnected Humanities Databases Through Multi-Resul...DH11: Browsing Highly Interconnected Humanities Databases Through Multi-Resul...
DH11: Browsing Highly Interconnected Humanities Databases Through Multi-Resul...Michele Pasin
 
Do it on your own - From 3 to 5 Star Linked Open Data with RMLio
Do it on your own - From 3 to 5 Star Linked Open Data with RMLioDo it on your own - From 3 to 5 Star Linked Open Data with RMLio
Do it on your own - From 3 to 5 Star Linked Open Data with RMLioOpen Knowledge Belgium
 
The Nature.com ontologies portal - Linked Science 2015
The Nature.com ontologies portal - Linked Science 2015The Nature.com ontologies portal - Linked Science 2015
The Nature.com ontologies portal - Linked Science 2015Michele Pasin
 
The nature.com ontologies portal: nature.com/ontologies
The nature.com ontologies portal: nature.com/ontologiesThe nature.com ontologies portal: nature.com/ontologies
The nature.com ontologies portal: nature.com/ontologiesTony Hammond
 
The Real-time Web in the Age of Agents
The Real-time Web in the Age of AgentsThe Real-time Web in the Age of Agents
The Real-time Web in the Age of AgentsJoshua Shinavier
 
Big data analysis in python @ PyCon.tw 2013
Big data analysis in python @ PyCon.tw 2013Big data analysis in python @ PyCon.tw 2013
Big data analysis in python @ PyCon.tw 2013Jimmy Lai
 
SPARQL Query Forms
SPARQL Query FormsSPARQL Query Forms
SPARQL Query FormsLeigh Dodds
 
The Power of Semantic Technologies to Explore Linked Open Data
The Power of Semantic Technologies to Explore Linked Open DataThe Power of Semantic Technologies to Explore Linked Open Data
The Power of Semantic Technologies to Explore Linked Open DataOntotext
 
Benchmarking graph databases on the problem of community detection
Benchmarking graph databases on the problem of community detectionBenchmarking graph databases on the problem of community detection
Benchmarking graph databases on the problem of community detectionSymeon Papadopoulos
 
Research Automation for Data-Driven Discovery
Research Automationfor Data-Driven DiscoveryResearch Automationfor Data-Driven Discovery
Research Automation for Data-Driven DiscoveryGlobus
 
Drupal and the Semantic Web - ESIP Webinar
Drupal and the Semantic Web - ESIP WebinarDrupal and the Semantic Web - ESIP Webinar
Drupal and the Semantic Web - ESIP Webinarscorlosquet
 
RDTF Metadata Guidelines: an update
RDTF Metadata Guidelines: an updateRDTF Metadata Guidelines: an update
RDTF Metadata Guidelines: an updateAndy Powell
 
Illuminating DSpace's Linked Data Support
Illuminating DSpace's Linked Data SupportIlluminating DSpace's Linked Data Support
Illuminating DSpace's Linked Data SupportPascal-Nicolas Becker
 

Tendances (20)

A Data Ecosystem to Support Machine Learning in Materials Science
A Data Ecosystem to Support Machine Learning in Materials ScienceA Data Ecosystem to Support Machine Learning in Materials Science
A Data Ecosystem to Support Machine Learning in Materials Science
 
Try It The Google Way .
Try It The Google Way .Try It The Google Way .
Try It The Google Way .
 
nanopub-java: A Java Library for Nanopublications
nanopub-java: A Java Library for Nanopublicationsnanopub-java: A Java Library for Nanopublications
nanopub-java: A Java Library for Nanopublications
 
The RDF Report Card: Beyond the Triple Count
The RDF Report Card: Beyond the Triple CountThe RDF Report Card: Beyond the Triple Count
The RDF Report Card: Beyond the Triple Count
 
Dan Sullivan - Data Analytics and Text Mining with MongoDB - NoSQL matters Du...
Dan Sullivan - Data Analytics and Text Mining with MongoDB - NoSQL matters Du...Dan Sullivan - Data Analytics and Text Mining with MongoDB - NoSQL matters Du...
Dan Sullivan - Data Analytics and Text Mining with MongoDB - NoSQL matters Du...
 
DH11: Browsing Highly Interconnected Humanities Databases Through Multi-Resul...
DH11: Browsing Highly Interconnected Humanities Databases Through Multi-Resul...DH11: Browsing Highly Interconnected Humanities Databases Through Multi-Resul...
DH11: Browsing Highly Interconnected Humanities Databases Through Multi-Resul...
 
Do it on your own - From 3 to 5 Star Linked Open Data with RMLio
Do it on your own - From 3 to 5 Star Linked Open Data with RMLioDo it on your own - From 3 to 5 Star Linked Open Data with RMLio
Do it on your own - From 3 to 5 Star Linked Open Data with RMLio
 
The Nature.com ontologies portal - Linked Science 2015
The Nature.com ontologies portal - Linked Science 2015The Nature.com ontologies portal - Linked Science 2015
The Nature.com ontologies portal - Linked Science 2015
 
The nature.com ontologies portal: nature.com/ontologies
The nature.com ontologies portal: nature.com/ontologiesThe nature.com ontologies portal: nature.com/ontologies
The nature.com ontologies portal: nature.com/ontologies
 
The Real-time Web in the Age of Agents
The Real-time Web in the Age of AgentsThe Real-time Web in the Age of Agents
The Real-time Web in the Age of Agents
 
Big data analysis in python @ PyCon.tw 2013
Big data analysis in python @ PyCon.tw 2013Big data analysis in python @ PyCon.tw 2013
Big data analysis in python @ PyCon.tw 2013
 
SPARQL Query Forms
SPARQL Query FormsSPARQL Query Forms
SPARQL Query Forms
 
The Power of Semantic Technologies to Explore Linked Open Data
The Power of Semantic Technologies to Explore Linked Open DataThe Power of Semantic Technologies to Explore Linked Open Data
The Power of Semantic Technologies to Explore Linked Open Data
 
2014 moore-ddd
2014 moore-ddd2014 moore-ddd
2014 moore-ddd
 
Benchmarking graph databases on the problem of community detection
Benchmarking graph databases on the problem of community detectionBenchmarking graph databases on the problem of community detection
Benchmarking graph databases on the problem of community detection
 
Research Automation for Data-Driven Discovery
Research Automationfor Data-Driven DiscoveryResearch Automationfor Data-Driven Discovery
Research Automation for Data-Driven Discovery
 
Drupal and the Semantic Web - ESIP Webinar
Drupal and the Semantic Web - ESIP WebinarDrupal and the Semantic Web - ESIP Webinar
Drupal and the Semantic Web - ESIP Webinar
 
RDTF Metadata Guidelines: an update
RDTF Metadata Guidelines: an updateRDTF Metadata Guidelines: an update
RDTF Metadata Guidelines: an update
 
Graph database
Graph database Graph database
Graph database
 
Illuminating DSpace's Linked Data Support
Illuminating DSpace's Linked Data SupportIlluminating DSpace's Linked Data Support
Illuminating DSpace's Linked Data Support
 

En vedette

Common Crawl: An Open Repository of Web Data
Common Crawl: An Open Repository of Web DataCommon Crawl: An Open Repository of Web Data
Common Crawl: An Open Repository of Web Datahuguk
 
Is Crawling Legal? Web Crawling legal Policies
Is Crawling Legal? Web Crawling legal PoliciesIs Crawling Legal? Web Crawling legal Policies
Is Crawling Legal? Web Crawling legal PoliciesPromptCloud
 
Gephi Consortium Presentation
Gephi Consortium PresentationGephi Consortium Presentation
Gephi Consortium PresentationGephi Consortium
 
Enterprise Data World 2016 and CDO Vision Mural Summary
Enterprise Data World 2016 and CDO Vision Mural SummaryEnterprise Data World 2016 and CDO Vision Mural Summary
Enterprise Data World 2016 and CDO Vision Mural SummaryDATAVERSITY
 
Building a Scalable Web Crawler with Hadoop
Building a Scalable Web Crawler with HadoopBuilding a Scalable Web Crawler with Hadoop
Building a Scalable Web Crawler with HadoopHadoop User Group
 

En vedette (6)

Common Crawl: An Open Repository of Web Data
Common Crawl: An Open Repository of Web DataCommon Crawl: An Open Repository of Web Data
Common Crawl: An Open Repository of Web Data
 
Is Crawling Legal? Web Crawling legal Policies
Is Crawling Legal? Web Crawling legal PoliciesIs Crawling Legal? Web Crawling legal Policies
Is Crawling Legal? Web Crawling legal Policies
 
Gephi Consortium Presentation
Gephi Consortium PresentationGephi Consortium Presentation
Gephi Consortium Presentation
 
Enterprise Data World 2016 and CDO Vision Mural Summary
Enterprise Data World 2016 and CDO Vision Mural SummaryEnterprise Data World 2016 and CDO Vision Mural Summary
Enterprise Data World 2016 and CDO Vision Mural Summary
 
Gephi Quick Start
Gephi Quick StartGephi Quick Start
Gephi Quick Start
 
Building a Scalable Web Crawler with Hadoop
Building a Scalable Web Crawler with HadoopBuilding a Scalable Web Crawler with Hadoop
Building a Scalable Web Crawler with Hadoop
 

Similaire à Using the whole web as your dataset

Text Analytics Online Knowledge Base / Database
Text Analytics Online Knowledge Base / DatabaseText Analytics Online Knowledge Base / Database
Text Analytics Online Knowledge Base / DatabaseNaveen Kumar
 
Big data and APIs for PHP developers - SXSW 2011
Big data and APIs for PHP developers - SXSW 2011Big data and APIs for PHP developers - SXSW 2011
Big data and APIs for PHP developers - SXSW 2011Eli White
 
COSCUP - Open Source Engines Providing Big Data in the Cloud, Markku Lepisto
COSCUP - Open Source Engines Providing Big Data in the Cloud, Markku LepistoCOSCUP - Open Source Engines Providing Big Data in the Cloud, Markku Lepisto
COSCUP - Open Source Engines Providing Big Data in the Cloud, Markku LepistoAmazon Web Services
 
Distributed and Fault Tolerant Realtime Computation with Apache Storm, Apache...
Distributed and Fault Tolerant Realtime Computation with Apache Storm, Apache...Distributed and Fault Tolerant Realtime Computation with Apache Storm, Apache...
Distributed and Fault Tolerant Realtime Computation with Apache Storm, Apache...Folio3 Software
 
Graphs & Big Data - Philip Rathle and Andreas Kollegger @ Big Data Science Me...
Graphs & Big Data - Philip Rathle and Andreas Kollegger @ Big Data Science Me...Graphs & Big Data - Philip Rathle and Andreas Kollegger @ Big Data Science Me...
Graphs & Big Data - Philip Rathle and Andreas Kollegger @ Big Data Science Me...Neo4j
 
Introduction to question answering for linked data & big data
Introduction to question answering for linked data & big dataIntroduction to question answering for linked data & big data
Introduction to question answering for linked data & big dataAndre Freitas
 
Web Search And Mining (Ntuim)
Web Search And Mining (Ntuim)Web Search And Mining (Ntuim)
Web Search And Mining (Ntuim)Hector Lin
 
Small, Medium and Big Data
Small, Medium and Big DataSmall, Medium and Big Data
Small, Medium and Big DataPierre De Wilde
 
AWS Enterprise Day | Closing Keynote, Singapore - Dr Werner Vogels
AWS Enterprise Day | Closing Keynote, Singapore - Dr Werner VogelsAWS Enterprise Day | Closing Keynote, Singapore - Dr Werner Vogels
AWS Enterprise Day | Closing Keynote, Singapore - Dr Werner VogelsAmazon Web Services
 
Beyond the Fridge, The World of Connected Data - Dr Werner Vogels
Beyond the Fridge, The World of Connected Data - Dr Werner VogelsBeyond the Fridge, The World of Connected Data - Dr Werner Vogels
Beyond the Fridge, The World of Connected Data - Dr Werner VogelsAmazon Web Services
 
Fishing Graphs in a Hadoop Data Lake
Fishing Graphs in a Hadoop Data LakeFishing Graphs in a Hadoop Data Lake
Fishing Graphs in a Hadoop Data LakeArangoDB Database
 
Schema.org Structured data the What, Why, & How
Schema.org Structured data the What, Why, & HowSchema.org Structured data the What, Why, & How
Schema.org Structured data the What, Why, & HowRichard Wallis
 
Basic Sentiment Analysis using Hive
Basic Sentiment Analysis using HiveBasic Sentiment Analysis using Hive
Basic Sentiment Analysis using HiveQubole
 
Contextual Computing: Laying a Global Data Foundation
Contextual Computing: Laying a Global Data FoundationContextual Computing: Laying a Global Data Foundation
Contextual Computing: Laying a Global Data FoundationRichard Wallis
 
A fresh new look into Information Gathering - OWASP Spain
A fresh new look into Information Gathering - OWASP SpainA fresh new look into Information Gathering - OWASP Spain
A fresh new look into Information Gathering - OWASP SpainChristian Martorella
 
TypeScript와 Flow: 
자바스크립트 개발에 정적 타이핑 도입하기
TypeScript와 Flow: 
자바스크립트 개발에 정적 타이핑 도입하기TypeScript와 Flow: 
자바스크립트 개발에 정적 타이핑 도입하기
TypeScript와 Flow: 
자바스크립트 개발에 정적 타이핑 도입하기Heejong Ahn
 
Fyp ideas
Fyp ideasFyp ideas
Fyp ideasMr SMAK
 
Linked data for Enterprise Data Integration
Linked data for Enterprise Data IntegrationLinked data for Enterprise Data Integration
Linked data for Enterprise Data IntegrationSören Auer
 
Searching the Internet
Searching the InternetSearching the Internet
Searching the Internetvanalery
 

Similaire à Using the whole web as your dataset (20)

Text Analytics Online Knowledge Base / Database
Text Analytics Online Knowledge Base / DatabaseText Analytics Online Knowledge Base / Database
Text Analytics Online Knowledge Base / Database
 
Big data and APIs for PHP developers - SXSW 2011
Big data and APIs for PHP developers - SXSW 2011Big data and APIs for PHP developers - SXSW 2011
Big data and APIs for PHP developers - SXSW 2011
 
COSCUP - Open Source Engines Providing Big Data in the Cloud, Markku Lepisto
COSCUP - Open Source Engines Providing Big Data in the Cloud, Markku LepistoCOSCUP - Open Source Engines Providing Big Data in the Cloud, Markku Lepisto
COSCUP - Open Source Engines Providing Big Data in the Cloud, Markku Lepisto
 
Distributed and Fault Tolerant Realtime Computation with Apache Storm, Apache...
Distributed and Fault Tolerant Realtime Computation with Apache Storm, Apache...Distributed and Fault Tolerant Realtime Computation with Apache Storm, Apache...
Distributed and Fault Tolerant Realtime Computation with Apache Storm, Apache...
 
Graphs & Big Data - Philip Rathle and Andreas Kollegger @ Big Data Science Me...
Graphs & Big Data - Philip Rathle and Andreas Kollegger @ Big Data Science Me...Graphs & Big Data - Philip Rathle and Andreas Kollegger @ Big Data Science Me...
Graphs & Big Data - Philip Rathle and Andreas Kollegger @ Big Data Science Me...
 
Introduction to question answering for linked data & big data
Introduction to question answering for linked data & big dataIntroduction to question answering for linked data & big data
Introduction to question answering for linked data & big data
 
Semantic Web, e-commerce
Semantic Web, e-commerceSemantic Web, e-commerce
Semantic Web, e-commerce
 
Web Search And Mining (Ntuim)
Web Search And Mining (Ntuim)Web Search And Mining (Ntuim)
Web Search And Mining (Ntuim)
 
Small, Medium and Big Data
Small, Medium and Big DataSmall, Medium and Big Data
Small, Medium and Big Data
 
AWS Enterprise Day | Closing Keynote, Singapore - Dr Werner Vogels
AWS Enterprise Day | Closing Keynote, Singapore - Dr Werner VogelsAWS Enterprise Day | Closing Keynote, Singapore - Dr Werner Vogels
AWS Enterprise Day | Closing Keynote, Singapore - Dr Werner Vogels
 
Beyond the Fridge, The World of Connected Data - Dr Werner Vogels
Beyond the Fridge, The World of Connected Data - Dr Werner VogelsBeyond the Fridge, The World of Connected Data - Dr Werner Vogels
Beyond the Fridge, The World of Connected Data - Dr Werner Vogels
 
Fishing Graphs in a Hadoop Data Lake
Fishing Graphs in a Hadoop Data LakeFishing Graphs in a Hadoop Data Lake
Fishing Graphs in a Hadoop Data Lake
 
Schema.org Structured data the What, Why, & How
Schema.org Structured data the What, Why, & HowSchema.org Structured data the What, Why, & How
Schema.org Structured data the What, Why, & How
 
Basic Sentiment Analysis using Hive
Basic Sentiment Analysis using HiveBasic Sentiment Analysis using Hive
Basic Sentiment Analysis using Hive
 
Contextual Computing: Laying a Global Data Foundation
Contextual Computing: Laying a Global Data FoundationContextual Computing: Laying a Global Data Foundation
Contextual Computing: Laying a Global Data Foundation
 
A fresh new look into Information Gathering - OWASP Spain
A fresh new look into Information Gathering - OWASP SpainA fresh new look into Information Gathering - OWASP Spain
A fresh new look into Information Gathering - OWASP Spain
 
TypeScript와 Flow: 
자바스크립트 개발에 정적 타이핑 도입하기
TypeScript와 Flow: 
자바스크립트 개발에 정적 타이핑 도입하기TypeScript와 Flow: 
자바스크립트 개발에 정적 타이핑 도입하기
TypeScript와 Flow: 
자바스크립트 개발에 정적 타이핑 도입하기
 
Fyp ideas
Fyp ideasFyp ideas
Fyp ideas
 
Linked data for Enterprise Data Integration
Linked data for Enterprise Data IntegrationLinked data for Enterprise Data Integration
Linked data for Enterprise Data Integration
 
Searching the Internet
Searching the InternetSearching the Internet
Searching the Internet
 

Plus de Turi, Inc.

Webinar - Analyzing Video
Webinar - Analyzing VideoWebinar - Analyzing Video
Webinar - Analyzing VideoTuri, Inc.
 
Webinar - Patient Readmission Risk
Webinar - Patient Readmission RiskWebinar - Patient Readmission Risk
Webinar - Patient Readmission RiskTuri, Inc.
 
Webinar - Know Your Customer - Arya (20160526)
Webinar - Know Your Customer - Arya (20160526)Webinar - Know Your Customer - Arya (20160526)
Webinar - Know Your Customer - Arya (20160526)Turi, Inc.
 
Webinar - Product Matching - Palombo (20160428)
Webinar - Product Matching - Palombo (20160428)Webinar - Product Matching - Palombo (20160428)
Webinar - Product Matching - Palombo (20160428)Turi, Inc.
 
Webinar - Pattern Mining Log Data - Vega (20160426)
Webinar - Pattern Mining Log Data - Vega (20160426)Webinar - Pattern Mining Log Data - Vega (20160426)
Webinar - Pattern Mining Log Data - Vega (20160426)Turi, Inc.
 
Webinar - Fraud Detection - Palombo (20160428)
Webinar - Fraud Detection - Palombo (20160428)Webinar - Fraud Detection - Palombo (20160428)
Webinar - Fraud Detection - Palombo (20160428)Turi, Inc.
 
Scaling Up Machine Learning: How to Benchmark GraphLab Create on Huge Datasets
Scaling Up Machine Learning: How to Benchmark GraphLab Create on Huge DatasetsScaling Up Machine Learning: How to Benchmark GraphLab Create on Huge Datasets
Scaling Up Machine Learning: How to Benchmark GraphLab Create on Huge DatasetsTuri, Inc.
 
Pattern Mining: Extracting Value from Log Data
Pattern Mining: Extracting Value from Log DataPattern Mining: Extracting Value from Log Data
Using the whole web as your dataset

  • 1. Common Crawl for Start-ups. July 20, 2015. Data Science Summit & Dato Conference.
  • 2. It's a non-profit that makes web data freely accessible to anyone.
  • 3. Each crawl archive is billions of pages: the May crawl archive is 2.05 billion web pages, ~159 terabytes uncompressed.
  • 4. Released totally free (lives on Amazon Public Data Sets), without additional intellectual property restrictions.
  • 5. Origins of Common Crawl. Common Crawl founded in 2007 by Gil Elbaz (Applied Semantics / Factual). Google and Microsoft were the powerhouses. Data powers the algorithms in our field. Goal: Democratize and simplify access to "the web as a dataset".
  • 6. Benefits for start-ups. Performing your own large-scale crawling is expensive and challenging. Innovation occurs by using the data, rarely through novel collection methods. + Lower the barrier of entry for new start-ups + Create a communal pool of knowledge & data
  • 7. Common Crawl File Formats. WARC (as downloaded): + Raw HTTP response headers + Raw HTTP responses. WAT (metadata): + HTML head data + HTTP header fields + Extracted links / script tags. WET (only text): + Extracted text
  • 8. Web Data at Scale. Analytics: + Usage of servers, libraries, and metadata. Machine Learning: + Language models based upon billions of tokens. Filtering and Aggregation: + Analyzing tables, Wikipedia, phone numbers
  • 9. Analytics at Scale. Imagine you are interested in ... + JavaScript library usage + HTML / HTML5 usage + Web server types and age + RDFa, Microdata, and Microformat Data Sets. With Common Crawl you can analyse billions of pages in an afternoon!
  • 10. Analyzing Web Domain Vulns. Sietse T. Au and Wing Lung Ngai.
  • 11. WDC Hyperlink Graph. Largest freely available real world graph dataset: 3.6 billion pages, 128 billion links. http://webdatacommons.org/hyperlinkgraph/ Fast and easy analysis using Dato GraphLab on a single EC2 r3.8xlarge instance (under 10 minutes per PageRank iteration).
  • 12.
  • 13. Web Data at Scale. Analytics: + Usage of servers, libraries, and metadata. Machine Learning: + Language models based upon billions of tokens. Filtering and Aggregation: + Analyzing tables, Wikipedia, phone numbers
  • 14. N-gram Counts & Language Models from the Common Crawl. Christian Buck, Kenneth Heafield, Bas van Ooyen. Processed all the text of Common Crawl to produce 975 billion deduplicated tokens (similar size to the Google N-gram Dataset). Project data was released at http://statmt.org/ngrams: + Deduped text split by language + Resulting language models
  • 15. GloVe: Global Vectors for Word Representation. Jeffrey Pennington, Richard Socher, Christopher D. Manning. Word vector representations: king - queen = man - woman; king - man + woman = queen (produces dimensions of meaning).
  • 16.
  • 17.
  • 18. GloVe On Various Corpora. Semantic: "Athens is to Greece as Berlin is to _?" Syntactic: "Dance is to dancing as fly is to _?"
  • 19. GloVe over Big Data. GloVe and word2vec (competing algorithm) can scale to hundreds of billions of tokens. Trained on the Common Crawl n-gram data: http://statmt.org/ngrams Source code and pre-trained models at http://www-nlp.stanford.edu/projects/glove/
  • 20. Web-Scale Parallel Text. "Dirt Cheap Web-Scale Parallel Text from the Common Crawl" (Smith et al.). Processed all text from URLs of the style: website.com/[langcode]/ [w.com/en/tesla | w.com/fr/tesla]. "...nothing more than a set of common two-letter language codes ... [we] mined 32 terabytes ... in just under a day"
  • 21. Web-Scale Parallel Text. Manual inspection across three languages: 80% of the data contained good translations (source = foreign language, target = English).
  • 22. Web Data at Scale. Analytics: + Usage of servers, libraries, and metadata. Machine Learning: + Language models based upon billions of tokens. Filtering and Aggregation: + Analyzing tables, Wikipedia, phone numbers
  • 23. Web Data Commons Web Tables. Extracted 11.2 billion tables from WARC files, filtered to keep relational tables via trained classifier. Only 1.3% of the original data was kept, yet it still remains hugely valuable. Resulting dataset: 11.2 billion tables => 147 million relational web tables.
  • 24. Web Data Commons Web Tables. Popular column headers: name, title, artist, location, model, manufacturer, country ... Released at webdatacommons.org/webtables/
  • 25. Extracting US Phone Numbers. "Let's use Common Crawl to help match businesses from Yelp's database to the possible web pages for those businesses on the Internet." Yelp extracted ~748 million US phone numbers from the Common Crawl December 2014 dataset. Regular expression over extracted text (WET files).
  • 26. Extracting US Phone Numbers. Total complexity: 134 lines of Python. Total time: 1 hour (20 × c3.8xlarge). Total cost: $10.60 USD (Python using EMR). Matched against Yelp's database: 48% had exact URL matches, 61% had matching domains. More details (and full code) on Yelp's blog post: Analyzing the Web For the Price of a Sandwich.
  • 27. Common Crawl's Derived Datasets. Natural language processing: + Parallel text for machine translation + N-gram & language models (975 bln tokens) + WDC Web tables (3.5 bln). Large scale web analysis: + WDC Hyperlink Graphs (128 bln edges) + WikiReverse.org - Wikipedia in-links analysis. And a million more use cases!
  • 28. Why am I so excited..? Open data is catching on! Even playing field for academia and start-ups: Google Web 1T => Buck et al.'s N-grams; Google's Wikilinks => WikiReverse; Google's Sets => WDC Web Tables. Common Crawl releases their dataset and brilliant people build on top of it.
  • 29. Read more at commoncrawl.org. Stephen Merity, stephen@commoncrawl.org
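
The phone-number slides describe running a regular expression over extracted text (WET files). A minimal sketch of that idea follows. It is not Yelp's code (their blog post has the full version): the pattern, the function name `extract_us_phone_numbers`, and the normalisation to ten digits are all assumptions made for illustration, and the sample numbers are made up.

```python
import re

# Hypothetical pattern, not the one Yelp used: an optional "+1" or "1" prefix,
# then 3-3-4 digits separated by an optional space, dot, or hyphen, with
# optional parentheses around the area code. The lookarounds stop the pattern
# from matching inside a longer run of digits.
US_PHONE = re.compile(
    r"(?<!\d)(?:\+?1[\s.-]?)?\(?(\d{3})\)?[\s.-]?(\d{3})[\s.-]?(\d{4})(?!\d)"
)


def extract_us_phone_numbers(text):
    """Return every US-style phone number in `text` as a 10-digit string."""
    return ["".join(match.groups()) for match in US_PHONE.finditer(text)]


if __name__ == "__main__":
    # Stand-in for the plain text of one WET record.
    sample = "Call (415) 555-0132 or 415.555.0199 today"
    print(extract_us_phone_numbers(sample))
```

In a real job the function would be applied to the text of each WET record and the results grouped by the record's URL, which is what makes the URL and domain matching on the slide possible.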