SlideShare une entreprise Scribd logo
1  sur  17
Crawl the entire web
in 10 minutes...
Copyright ©: 2015 OnPage.org GmbH
Using AWS-EMR, AWS-S3, PIG, CommonCrawl
...and just 100 €
Since 2011 in Munich
Work at OnPage.org
Interested in Webcrawling and BigData Frameworks
Build low cost scalable BigData solutions
About Me
Twitter: @danny_munich
Facebook: https://www.facebook.com/danny.linden2
E-mail: danny@onpage.org
Do you want to build your own Search-
Engine?
- High Hardware / Cloud Costs
- Nutch needs ~ 1 Hour for 1 million URLs
- You want to crawl > 1 Billion URLs
Solution ?
Don‘t Crawl!
- Use Common-Crawl : https://commoncrawl.org
- Non-Profit-Organisation
- ~Monthly over 2 Billions Crawled URLs
- Over 1.000 TB total since 2009
- URL seeding list from Blekko: https://blekko.com
Don‘t Crawl! – Use Common Crawl!
- Scalably stored on Amazon AWS S3
- Hadoop compatible format powered by Archive.org (Wayback Machine)
- Partitionable with S3 Object Prefix possibility
- 100MB-1GB file Sizes (gzip) -> Hadoop size
Nice Data Format
Store the raw crawl data.
Format 1:
WARC
Store only the
Meta-Information
as JSON
Format 2:
WAT
Store only the
Plain Text Content
Format 3:
WET
Choose the right format
- WARC (Raw HTML): 1.000 MB
- WAT (Meta data as JSON) : 450 MB
- WET (Plain Text): 150 MB
Processing
- Pure Hadoop with MapReduce
- Input Classes: http://commoncrawl.org/the-data/get-started/
Processing
- High Level ETL-Layer like PIG: http://pig.apache.org
- Example Stuff :
- https://github.com/norvigaward/warcexamples
- https://github.com/mortardata/mortar-examples
- https://github.com/matpalm/common-crawl
PIG Example
REGISTER file:/home/hadoop/lib/pig/piggybank.jar
DEFINE FileLoaderClass org.commoncrawl.pig.ArcLoader();
%default INPUT_PATH "s3://aws-publicdatasets/common-crawl/crawl-002/2010/09/25/0/1285398*.arc.gz";
-- %default INPUT_PATH "s3://aws-publicdatasets/common-crawl/crawl-002/2010/09/";
%default OUTPUT_PATH "s3://example-bucket/out";
pages = LOAD '$INPUT_PATH'
USING FileLoaderClass
AS (url, html);
meta_titles = FOREACH pages GENERATE url, REGEX_EXTRACT(html, '<title>(.*)</title>', 1) AS meta_title;
filtered = FILTER meta_titles BY meta_title IS NOT NULL;
STORE filtered INTO '$OUTPUT_PATH' USING PigStorage('t');
Hadoop & PIG on AWS
- Support new Hadoop releases
- PIG Integration
- Replace HDFS with S3
- Easy UI to start quickly
- Pay per Hour to scale as much as posible
It‘s Demo Time!
Let's cross fingers now
That‘s it!
Customer:
Twitter: @danny_munich
Facebook: https://www.facebook.com/danny.linden2
E-mail: danny@onpage.org
And: We are hiring!
https://de.onpage.org/about/jobs/

Contenu connexe

Tendances

William slawski-google-patents- how-do-they-influence-search
William slawski-google-patents- how-do-they-influence-searchWilliam slawski-google-patents- how-do-they-influence-search
William slawski-google-patents- how-do-they-influence-searchBill Slawski
 
SEO for Ecommerce: A Comprehensive Guide
SEO for Ecommerce: A Comprehensive GuideSEO for Ecommerce: A Comprehensive Guide
SEO for Ecommerce: A Comprehensive GuideAdam Audette
 
NikolauSEO: Karlsruher Erfolgsgeschichte Home&Smart
NikolauSEO: Karlsruher Erfolgsgeschichte Home&SmartNikolauSEO: Karlsruher Erfolgsgeschichte Home&Smart
NikolauSEO: Karlsruher Erfolgsgeschichte Home&SmartAlexander Keller
 
Migration Best Practices - Search Y 2019, Paris
Migration Best Practices - Search Y 2019, ParisMigration Best Practices - Search Y 2019, Paris
Migration Best Practices - Search Y 2019, ParisBastian Grimm
 
Entity Seo Mastery
Entity Seo MasteryEntity Seo Mastery
Entity Seo MasteryDixon Jones
 
Actionable Tips to Increase Your Website Authority - Lily Ray
Actionable Tips to Increase Your Website Authority - Lily RayActionable Tips to Increase Your Website Authority - Lily Ray
Actionable Tips to Increase Your Website Authority - Lily RayLily Ray
 
Linked Data in Libraries
Linked Data in LibrariesLinked Data in Libraries
Linked Data in LibrariesCarl Hess
 
Entity-based SEO - Wordlift webinar - Studio Makoto Agenzia di Marketing.pptx
Entity-based SEO - Wordlift webinar - Studio Makoto Agenzia di Marketing.pptxEntity-based SEO - Wordlift webinar - Studio Makoto Agenzia di Marketing.pptx
Entity-based SEO - Wordlift webinar - Studio Makoto Agenzia di Marketing.pptxMassimiliano Geraci
 
Internet of Things for Libraries
Internet of Things for LibrariesInternet of Things for Libraries
Internet of Things for LibrariesNicole Baratta
 
Knowledge Graph Reasoning Techniques through Studies on Mystery Stories - Rep...
Knowledge Graph Reasoning Techniques through Studies on Mystery Stories - Rep...Knowledge Graph Reasoning Techniques through Studies on Mystery Stories - Rep...
Knowledge Graph Reasoning Techniques through Studies on Mystery Stories - Rep...KnowledgeGraph
 

Tendances (20)

William slawski-google-patents- how-do-they-influence-search
William slawski-google-patents- how-do-they-influence-searchWilliam slawski-google-patents- how-do-they-influence-search
William slawski-google-patents- how-do-they-influence-search
 
Semantic web
Semantic webSemantic web
Semantic web
 
SEO for Ecommerce: A Comprehensive Guide
SEO for Ecommerce: A Comprehensive GuideSEO for Ecommerce: A Comprehensive Guide
SEO for Ecommerce: A Comprehensive Guide
 
Web 3.0
Web 3.0Web 3.0
Web 3.0
 
Search engine
Search engineSearch engine
Search engine
 
NikolauSEO: Karlsruher Erfolgsgeschichte Home&Smart
NikolauSEO: Karlsruher Erfolgsgeschichte Home&SmartNikolauSEO: Karlsruher Erfolgsgeschichte Home&Smart
NikolauSEO: Karlsruher Erfolgsgeschichte Home&Smart
 
The future internet web 3.0
The future internet  web 3.0The future internet  web 3.0
The future internet web 3.0
 
Migration Best Practices - Search Y 2019, Paris
Migration Best Practices - Search Y 2019, ParisMigration Best Practices - Search Y 2019, Paris
Migration Best Practices - Search Y 2019, Paris
 
Semantic Web
Semantic WebSemantic Web
Semantic Web
 
Entity Seo Mastery
Entity Seo MasteryEntity Seo Mastery
Entity Seo Mastery
 
Deep web
Deep webDeep web
Deep web
 
Actionable Tips to Increase Your Website Authority - Lily Ray
Actionable Tips to Increase Your Website Authority - Lily RayActionable Tips to Increase Your Website Authority - Lily Ray
Actionable Tips to Increase Your Website Authority - Lily Ray
 
Linked Data in Libraries
Linked Data in LibrariesLinked Data in Libraries
Linked Data in Libraries
 
Python for SEO
Python for SEOPython for SEO
Python for SEO
 
Entity-based SEO - Wordlift webinar - Studio Makoto Agenzia di Marketing.pptx
Entity-based SEO - Wordlift webinar - Studio Makoto Agenzia di Marketing.pptxEntity-based SEO - Wordlift webinar - Studio Makoto Agenzia di Marketing.pptx
Entity-based SEO - Wordlift webinar - Studio Makoto Agenzia di Marketing.pptx
 
Google Dorks
Google DorksGoogle Dorks
Google Dorks
 
Internet of Things for Libraries
Internet of Things for LibrariesInternet of Things for Libraries
Internet of Things for Libraries
 
Knowledge Graph Reasoning Techniques through Studies on Mystery Stories - Rep...
Knowledge Graph Reasoning Techniques through Studies on Mystery Stories - Rep...Knowledge Graph Reasoning Techniques through Studies on Mystery Stories - Rep...
Knowledge Graph Reasoning Techniques through Studies on Mystery Stories - Rep...
 
AI and Smarter Libraries
AI and Smarter LibrariesAI and Smarter Libraries
AI and Smarter Libraries
 
Web 3.0 Intro
Web 3.0 IntroWeb 3.0 Intro
Web 3.0 Intro
 

Similaire à Crawl the entire web in 10 minutes...and just 100€

AWS (Hadoop) Meetup 30.04.09
AWS (Hadoop) Meetup 30.04.09AWS (Hadoop) Meetup 30.04.09
AWS (Hadoop) Meetup 30.04.09Chris Purrington
 
Web Crawling and Data Gathering with Apache Nutch
Web Crawling and Data Gathering with Apache NutchWeb Crawling and Data Gathering with Apache Nutch
Web Crawling and Data Gathering with Apache NutchSteve Watt
 
Mongo db and hadoop driving business insights - final
Mongo db and hadoop   driving business insights - finalMongo db and hadoop   driving business insights - final
Mongo db and hadoop driving business insights - finalMongoDB
 
thinking in key value stores
thinking in key value storesthinking in key value stores
thinking in key value storesBhasker Kode
 
Unified Batch and Real-Time Stream Processing Using Apache Flink
Unified Batch and Real-Time Stream Processing Using Apache FlinkUnified Batch and Real-Time Stream Processing Using Apache Flink
Unified Batch and Real-Time Stream Processing Using Apache FlinkSlim Baltagi
 
Web scraping with BeautifulSoup, LXML, RegEx and Scrapy
Web scraping with BeautifulSoup, LXML, RegEx and ScrapyWeb scraping with BeautifulSoup, LXML, RegEx and Scrapy
Web scraping with BeautifulSoup, LXML, RegEx and ScrapyLITTINRAJAN
 
Web Development in Perl
Web Development in PerlWeb Development in Perl
Web Development in PerlNaveen Gupta
 
Seravia in the Cloud
Seravia in the CloudSeravia in the Cloud
Seravia in the Cloudkidrane
 
Insight Data Engineering: Open source data ingestion
Insight Data Engineering: Open source data ingestionInsight Data Engineering: Open source data ingestion
Insight Data Engineering: Open source data ingestionTreasure Data, Inc.
 
Nosql-columbia-feb2011
Nosql-columbia-feb2011Nosql-columbia-feb2011
Nosql-columbia-feb2011siculars
 
(BDT302) Big Data Beyond Hadoop: Running Mahout, Giraph, and R on Amazon EMR ...
(BDT302) Big Data Beyond Hadoop: Running Mahout, Giraph, and R on Amazon EMR ...(BDT302) Big Data Beyond Hadoop: Running Mahout, Giraph, and R on Amazon EMR ...
(BDT302) Big Data Beyond Hadoop: Running Mahout, Giraph, and R on Amazon EMR ...Amazon Web Services
 
AWS Pop-up Loft Berlin: Cache is King - Running Lean Architectures: Optimizin...
AWS Pop-up Loft Berlin: Cache is King - Running Lean Architectures: Optimizin...AWS Pop-up Loft Berlin: Cache is King - Running Lean Architectures: Optimizin...
AWS Pop-up Loft Berlin: Cache is King - Running Lean Architectures: Optimizin...AWS Germany
 
Semantic technologies in practice - KULeuven 2016
Semantic technologies in practice - KULeuven 2016Semantic technologies in practice - KULeuven 2016
Semantic technologies in practice - KULeuven 2016Aad Versteden
 
Reducing latency on the web with the Azure CDN - DevSum - SWAG
Reducing latency on the web with the Azure CDN - DevSum - SWAGReducing latency on the web with the Azure CDN - DevSum - SWAG
Reducing latency on the web with the Azure CDN - DevSum - SWAGMaarten Balliauw
 
StartPad Countdown 8 - Amazon Web Services and You
StartPad Countdown 8 - Amazon Web Services and YouStartPad Countdown 8 - Amazon Web Services and You
StartPad Countdown 8 - Amazon Web Services and YouStart Pad
 
(BDT322) How Redfin & Twitter Leverage Amazon S3 For Big Data
(BDT322) How Redfin & Twitter Leverage Amazon S3 For Big Data(BDT322) How Redfin & Twitter Leverage Amazon S3 For Big Data
(BDT322) How Redfin & Twitter Leverage Amazon S3 For Big DataAmazon Web Services
 
Terraform Q&A - HashiCorp User Group Oslo
Terraform Q&A - HashiCorp User Group OsloTerraform Q&A - HashiCorp User Group Oslo
Terraform Q&A - HashiCorp User Group OsloAnton Babenko
 

Similaire à Crawl the entire web in 10 minutes...and just 100€ (20)

AWS (Hadoop) Meetup 30.04.09
AWS (Hadoop) Meetup 30.04.09AWS (Hadoop) Meetup 30.04.09
AWS (Hadoop) Meetup 30.04.09
 
Web Crawling and Data Gathering with Apache Nutch
Web Crawling and Data Gathering with Apache NutchWeb Crawling and Data Gathering with Apache Nutch
Web Crawling and Data Gathering with Apache Nutch
 
Mongo db and hadoop driving business insights - final
Mongo db and hadoop   driving business insights - finalMongo db and hadoop   driving business insights - final
Mongo db and hadoop driving business insights - final
 
thinking in key value stores
thinking in key value storesthinking in key value stores
thinking in key value stores
 
Unified Batch and Real-Time Stream Processing Using Apache Flink
Unified Batch and Real-Time Stream Processing Using Apache FlinkUnified Batch and Real-Time Stream Processing Using Apache Flink
Unified Batch and Real-Time Stream Processing Using Apache Flink
 
Web scraping with BeautifulSoup, LXML, RegEx and Scrapy
Web scraping with BeautifulSoup, LXML, RegEx and ScrapyWeb scraping with BeautifulSoup, LXML, RegEx and Scrapy
Web scraping with BeautifulSoup, LXML, RegEx and Scrapy
 
Web Development in Perl
Web Development in PerlWeb Development in Perl
Web Development in Perl
 
Seravia in the Cloud
Seravia in the CloudSeravia in the Cloud
Seravia in the Cloud
 
Open source data ingestion
Open source data ingestionOpen source data ingestion
Open source data ingestion
 
Insight Data Engineering: Open source data ingestion
Insight Data Engineering: Open source data ingestionInsight Data Engineering: Open source data ingestion
Insight Data Engineering: Open source data ingestion
 
Scaling PHP apps
Scaling PHP appsScaling PHP apps
Scaling PHP apps
 
Nosql-columbia-feb2011
Nosql-columbia-feb2011Nosql-columbia-feb2011
Nosql-columbia-feb2011
 
JahiaOne - Semantic Web with Jahia
JahiaOne - Semantic Web with JahiaJahiaOne - Semantic Web with Jahia
JahiaOne - Semantic Web with Jahia
 
(BDT302) Big Data Beyond Hadoop: Running Mahout, Giraph, and R on Amazon EMR ...
(BDT302) Big Data Beyond Hadoop: Running Mahout, Giraph, and R on Amazon EMR ...(BDT302) Big Data Beyond Hadoop: Running Mahout, Giraph, and R on Amazon EMR ...
(BDT302) Big Data Beyond Hadoop: Running Mahout, Giraph, and R on Amazon EMR ...
 
AWS Pop-up Loft Berlin: Cache is King - Running Lean Architectures: Optimizin...
AWS Pop-up Loft Berlin: Cache is King - Running Lean Architectures: Optimizin...AWS Pop-up Loft Berlin: Cache is King - Running Lean Architectures: Optimizin...
AWS Pop-up Loft Berlin: Cache is King - Running Lean Architectures: Optimizin...
 
Semantic technologies in practice - KULeuven 2016
Semantic technologies in practice - KULeuven 2016Semantic technologies in practice - KULeuven 2016
Semantic technologies in practice - KULeuven 2016
 
Reducing latency on the web with the Azure CDN - DevSum - SWAG
Reducing latency on the web with the Azure CDN - DevSum - SWAGReducing latency on the web with the Azure CDN - DevSum - SWAG
Reducing latency on the web with the Azure CDN - DevSum - SWAG
 
StartPad Countdown 8 - Amazon Web Services and You
StartPad Countdown 8 - Amazon Web Services and YouStartPad Countdown 8 - Amazon Web Services and You
StartPad Countdown 8 - Amazon Web Services and You
 
(BDT322) How Redfin & Twitter Leverage Amazon S3 For Big Data
(BDT322) How Redfin & Twitter Leverage Amazon S3 For Big Data(BDT322) How Redfin & Twitter Leverage Amazon S3 For Big Data
(BDT322) How Redfin & Twitter Leverage Amazon S3 For Big Data
 
Terraform Q&A - HashiCorp User Group Oslo
Terraform Q&A - HashiCorp User Group OsloTerraform Q&A - HashiCorp User Group Oslo
Terraform Q&A - HashiCorp User Group Oslo
 

Dernier

Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfEnterprise Knowledge
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxKatpro Technologies
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure servicePooja Nehwal
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024Results
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 

Dernier (20)

Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 

Crawl the entire web in 10 minutes...and just 100€

  • 1. Crawl the entire web in 10 minutes... Copyright ©: 2015 OnPage.org GmbH Using AWS-EMR, AWS-S3, PIG, CommonCrawl ...and just 100 €
  • 2. Since 2011 in Munich Work at OnPage.org Interested in Webcrawling and BigData Frameworks Build low cost scalable BigData solutions About Me Twitter: @danny_munich Facebook: https://www.facebook.com/danny.linden2 E-mail: danny@onpage.org
  • 3. Do you want to build your own Search- Engine? - High Hardware / Cloud Costs - Nutch needs ~ 1 Hour for 1 million URLs - You want to crawl > 1 Billion URLs
  • 5. Don‘t Crawl! - Use Common-Crawl : https://commoncrawl.org - Non-Profit-Organisation - ~Monthly over 2 Billions Crawled URLs - Over 1.000 TB total since 2009 - URL seeding list from Blekko: https://blekko.com
  • 6. Don‘t Crawl! – Use Common Crawl! - Scalably stored on Amazon AWS S3 - Hadoop compatible format powered by Archive.org (Wayback Machine) - Partitionable with S3 Object Prefix possibility - 100MB-1GB file Sizes (gzip) -> Hadoop size
  • 8. Store the raw crawl data. Format 1: WARC
  • 10. Store only the Plain Text Content Format 3: WET
  • 11. Choose the right format - WARC (Raw HTML): 1.000 MB - WAT (Meta data as JSON) : 450 MB - WET (Plain Text): 150 MB
  • 12. Processing - Pure Hadoop with MapReduce - Input Classes: http://commoncrawl.org/the-data/get-started/
  • 13. Processing - High Level ETL-Layer like PIG: http://pig.apache.org - Example Stuff : - https://github.com/norvigaward/warcexamples - https://github.com/mortardata/mortar-examples - https://github.com/matpalm/common-crawl
  • 14. PIG Example REGISTER file:/home/hadoop/lib/pig/piggybank.jar DEFINE FileLoaderClass org.commoncrawl.pig.ArcLoader(); %default INPUT_PATH "s3://aws-publicdatasets/common-crawl/crawl-002/2010/09/25/0/1285398*.arc.gz"; -- %default INPUT_PATH "s3://aws-publicdatasets/common-crawl/crawl-002/2010/09/"; %default OUTPUT_PATH "s3://example-bucket/out"; pages = LOAD '$INPUT_PATH' USING FileLoaderClass AS (url, html); meta_titles = FOREACH pages GENERATE url, REGEX_EXTRACT(html, '<title>(.*)</title>', 1) AS meta_title; filtered = FILTER meta_titles BY meta_title IS NOT NULL; STORE filtered INTO '$OUTPUT_PATH' USING PigStorage('t');
  • 15. Hadoop & PIG on AWS - Support new Hadoop releases - PIG Integration - Replace HDFS with S3 - Easy UI to start quickly - Pay per Hour to scale as much as posible
  • 16. It‘s Demo Time! Let's cross fingers now
  • 17. That‘s it! Customer: Twitter: @danny_munich Facebook: https://www.facebook.com/danny.linden2 E-mail: danny@onpage.org And: We are hiring! https://de.onpage.org/about/jobs/

Notes de l'éditeur

  1. Screenshot austauschen + shclecht lesbar