SlideShare une entreprise Scribd logo
1  sur  17
Télécharger pour lire hors ligne
Centipede: Analyzing Web
Crawl data for context of a
location
Vikas Bansal
Primal Pappachan
Abhishek Sethi
Introduction
Introduction
Description
A web service that presents the context
associated with a location
Context of a location
1. Weather
2. Healthcare
3. Crime
4. Employment
5. ……
Customers
1. Moving/Travelling into a new place
2. Policy Makers
3. Journalists
4. Researchers
Scenario
Related Services
● Yelp
● Google news
● http://bestplaces.net/
● http://www.nycgo.com/events/
● http://www.stubhub.com/
Technical Description of Service
● Analyze the web crawl data
● Create a list of locations
● Filter top 100 words from the files that
mention a location from the list
● Build an index of location against list of
words corresponding to that location
System Architecture
Data Sources
•Common Crawl Data from Amazon S3
–Contains information on billions of web pages
–Search through the contents
–Use ARC and Text files
Technologies and Resources
● Hadoop Cluster on Bluegrit System
● Apache Pig
○ Python for UDF’s
● Java/PHP for front end development
○ Use a Jboss container for Java, Xampp for PHP
● Elastic Search
● Map Reduce
● SQL/NoSQL database
● REST
● WSDL 2.0
● AWS - RDS, R53, EC2
MapReduce Job
Splitter
● Sentence
● Paragraph
● Article
Elastic Search
● Distributed restful search and analytics.
● Has near real-time search.
● Resilient clusters - detect and remove failed
nodes.
Challenges and Limitations
•Amount of HDD space available.
•Learning new technologies such as Apache
Pig, WSDL etc.
•Creating special UDF’s in Python.
Timeline
References
● Data set
● Common Crawl Web data
● Elastic Search
● Apache Pig
● Elastic Search for Term Filter lookup
● Hadoop Tutorial
● Dean, Jeffrey, and Sanjay Ghemawat. "MapReduce: simplified data
processing on large clusters." Communications of the ACM 51.1 (2008):
107-113.
● Blei, David M., Andrew Y. Ng, and Michael I. Jordan. "Latent dirichlet
allocation." the Journal of machine Learning research 3 (2003): 993-1022.

Contenu connexe

Tendances

Get Your Insecure PostgreSQL Passwords to SCRAM
Get Your Insecure PostgreSQL Passwords to SCRAMGet Your Insecure PostgreSQL Passwords to SCRAM
Get Your Insecure PostgreSQL Passwords to SCRAM
Jonathan Katz
 
MongoDB Performance Debugging
MongoDB Performance DebuggingMongoDB Performance Debugging
MongoDB Performance Debugging
MongoDB
 
Graph Structure in the Web - Revisited. WWW2014 Web Science Track
Graph Structure in the Web - Revisited. WWW2014 Web Science TrackGraph Structure in the Web - Revisited. WWW2014 Web Science Track
Graph Structure in the Web - Revisited. WWW2014 Web Science Track
Chris Bizer
 

Tendances (20)

Get Your Insecure PostgreSQL Passwords to SCRAM
Get Your Insecure PostgreSQL Passwords to SCRAMGet Your Insecure PostgreSQL Passwords to SCRAM
Get Your Insecure PostgreSQL Passwords to SCRAM
 
Database change management with Liquibase
Database change management with LiquibaseDatabase change management with Liquibase
Database change management with Liquibase
 
Ter Web Service Intro
Ter Web Service IntroTer Web Service Intro
Ter Web Service Intro
 
MongoDB Performance Debugging
MongoDB Performance DebuggingMongoDB Performance Debugging
MongoDB Performance Debugging
 
Building an open data platform with apache iceberg
Building an open data platform with apache icebergBuilding an open data platform with apache iceberg
Building an open data platform with apache iceberg
 
[215]네이버콘텐츠통계서비스소개 김기영
[215]네이버콘텐츠통계서비스소개 김기영[215]네이버콘텐츠통계서비스소개 김기영
[215]네이버콘텐츠통계서비스소개 김기영
 
Aggregated queries with Druid on terrabytes and petabytes of data
Aggregated queries with Druid on terrabytes and petabytes of dataAggregated queries with Druid on terrabytes and petabytes of data
Aggregated queries with Druid on terrabytes and petabytes of data
 
Understanding the architecture of MariaDB ColumnStore
Understanding the architecture of MariaDB ColumnStoreUnderstanding the architecture of MariaDB ColumnStore
Understanding the architecture of MariaDB ColumnStore
 
Apache Arrow: Open Source Standard Becomes an Enterprise Necessity
Apache Arrow: Open Source Standard Becomes an Enterprise NecessityApache Arrow: Open Source Standard Becomes an Enterprise Necessity
Apache Arrow: Open Source Standard Becomes an Enterprise Necessity
 
IoT databases - review and challenges - IoT, Hardware & Robotics meetup - onl...
IoT databases - review and challenges - IoT, Hardware & Robotics meetup - onl...IoT databases - review and challenges - IoT, Hardware & Robotics meetup - onl...
IoT databases - review and challenges - IoT, Hardware & Robotics meetup - onl...
 
Naver속도의, 속도에 의한, 속도를 위한 몽고DB (네이버 컨텐츠검색과 몽고DB) [Naver]
Naver속도의, 속도에 의한, 속도를 위한 몽고DB (네이버 컨텐츠검색과 몽고DB) [Naver]Naver속도의, 속도에 의한, 속도를 위한 몽고DB (네이버 컨텐츠검색과 몽고DB) [Naver]
Naver속도의, 속도에 의한, 속도를 위한 몽고DB (네이버 컨텐츠검색과 몽고DB) [Naver]
 
Graph Structure in the Web - Revisited. WWW2014 Web Science Track
Graph Structure in the Web - Revisited. WWW2014 Web Science TrackGraph Structure in the Web - Revisited. WWW2014 Web Science Track
Graph Structure in the Web - Revisited. WWW2014 Web Science Track
 
On-boarding with JanusGraph Performance
On-boarding with JanusGraph PerformanceOn-boarding with JanusGraph Performance
On-boarding with JanusGraph Performance
 
A Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiA Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and Hudi
 
Transactional writes to cloud storage with Eric Liang
Transactional writes to cloud storage with Eric LiangTransactional writes to cloud storage with Eric Liang
Transactional writes to cloud storage with Eric Liang
 
Node.js API 서버 성능 개선기
Node.js API 서버 성능 개선기Node.js API 서버 성능 개선기
Node.js API 서버 성능 개선기
 
A Technical Introduction to WiredTiger
A Technical Introduction to WiredTigerA Technical Introduction to WiredTiger
A Technical Introduction to WiredTiger
 
[NDC 발표] 모바일 게임데이터분석 및 실전 활용
[NDC 발표] 모바일 게임데이터분석 및 실전 활용[NDC 발표] 모바일 게임데이터분석 및 실전 활용
[NDC 발표] 모바일 게임데이터분석 및 실전 활용
 
SSL certificates in the Oracle Database without surprises
SSL certificates in the Oracle Database without surprisesSSL certificates in the Oracle Database without surprises
SSL certificates in the Oracle Database without surprises
 
AWS를 활용하여 Daily Report 만들기 : 로그 수집부터 자동화된 분석까지
AWS를 활용하여 Daily Report 만들기 : 로그 수집부터 자동화된 분석까지AWS를 활용하여 Daily Report 만들기 : 로그 수집부터 자동화된 분석까지
AWS를 활용하여 Daily Report 만들기 : 로그 수집부터 자동화된 분석까지
 

Similaire à Cenitpede: Analyzing Webcrawl

Similaire à Cenitpede: Analyzing Webcrawl (20)

E017624043
E017624043E017624043
E017624043
 
Smart Crawler: A Two Stage Crawler for Concept Based Semantic Search Engine.
Smart Crawler: A Two Stage Crawler for Concept Based Semantic Search Engine.Smart Crawler: A Two Stage Crawler for Concept Based Semantic Search Engine.
Smart Crawler: A Two Stage Crawler for Concept Based Semantic Search Engine.
 
Configuring elasticsearch for performance and scale
Configuring elasticsearch for performance and scaleConfiguring elasticsearch for performance and scale
Configuring elasticsearch for performance and scale
 
Search Engine working, Crawlers working, Search Engine mechanism
Search Engine working, Crawlers working, Search Engine mechanismSearch Engine working, Crawlers working, Search Engine mechanism
Search Engine working, Crawlers working, Search Engine mechanism
 
SharePoint Search Topology and Optimization
SharePoint Search Topology and OptimizationSharePoint Search Topology and Optimization
SharePoint Search Topology and Optimization
 
IRJET-Deep Web Crawling Efficiently using Dynamic Focused Web Crawler
IRJET-Deep Web Crawling Efficiently using Dynamic Focused Web CrawlerIRJET-Deep Web Crawling Efficiently using Dynamic Focused Web Crawler
IRJET-Deep Web Crawling Efficiently using Dynamic Focused Web Crawler
 
Jeremy cabral search marketing summit - scraping data-driven content (1)
Jeremy cabral   search marketing summit - scraping data-driven content (1)Jeremy cabral   search marketing summit - scraping data-driven content (1)
Jeremy cabral search marketing summit - scraping data-driven content (1)
 
Elsevier - Smart Data and Algorithms for the Publishing Industry
Elsevier - Smart Data and Algorithms for the Publishing IndustryElsevier - Smart Data and Algorithms for the Publishing Industry
Elsevier - Smart Data and Algorithms for the Publishing Industry
 
ElasticSearch - Suche im Zeitalter der Clouds
ElasticSearch - Suche im Zeitalter der CloudsElasticSearch - Suche im Zeitalter der Clouds
ElasticSearch - Suche im Zeitalter der Clouds
 
Reflected intelligence evolving self-learning data systems
Reflected intelligence  evolving self-learning data systemsReflected intelligence  evolving self-learning data systems
Reflected intelligence evolving self-learning data systems
 
An Intro to Elasticsearch and Kibana
An Intro to Elasticsearch and KibanaAn Intro to Elasticsearch and Kibana
An Intro to Elasticsearch and Kibana
 
Modern web search: Web Information Systems
Modern web search: Web Information SystemsModern web search: Web Information Systems
Modern web search: Web Information Systems
 
Modern web search: Lecture 11
Modern web search: Lecture 11Modern web search: Lecture 11
Modern web search: Lecture 11
 
Week10
Week10Week10
Week10
 
How Data Science can boost your SEO ?
How Data Science can boost your SEO ?How Data Science can boost your SEO ?
How Data Science can boost your SEO ?
 
CS6007 information retrieval - 5 units notes
CS6007   information retrieval - 5 units notesCS6007   information retrieval - 5 units notes
CS6007 information retrieval - 5 units notes
 
Bioschemas Workshop
Bioschemas WorkshopBioschemas Workshop
Bioschemas Workshop
 
Linked Energy Data Generation
Linked Energy Data GenerationLinked Energy Data Generation
Linked Energy Data Generation
 
Search engine and web crawler
Search engine and web crawlerSearch engine and web crawler
Search engine and web crawler
 
Data catalog
Data catalogData catalog
Data catalog
 

Plus de Primal Pappachan

An ontology based sensor selection engine
An ontology based sensor selection engineAn ontology based sensor selection engine
An ontology based sensor selection engine
Primal Pappachan
 
Pythonizing the Indian Engineering Education
Pythonizing the Indian Engineering EducationPythonizing the Indian Engineering Education
Pythonizing the Indian Engineering Education
Primal Pappachan
 

Plus de Primal Pappachan (6)

Mobipedia presentation
Mobipedia presentationMobipedia presentation
Mobipedia presentation
 
A Semantic Context-aware Privacy Model for FaceBlock
A Semantic Context-aware Privacy Model for FaceBlockA Semantic Context-aware Privacy Model for FaceBlock
A Semantic Context-aware Privacy Model for FaceBlock
 
An ontology based sensor selection engine
An ontology based sensor selection engineAn ontology based sensor selection engine
An ontology based sensor selection engine
 
Droidcon India 2011 Talk
Droidcon India 2011 TalkDroidcon India 2011 Talk
Droidcon India 2011 Talk
 
Pythonizing the Indian Engineering Education
Pythonizing the Indian Engineering EducationPythonizing the Indian Engineering Education
Pythonizing the Indian Engineering Education
 
FOSSEE
FOSSEEFOSSEE
FOSSEE
 

Dernier

Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Safe Software
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
panagenda
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
WSO2
 

Dernier (20)

Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptx
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
CNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In PakistanCNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In Pakistan
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Platformless Horizons for Digital Adaptability
Platformless Horizons for Digital AdaptabilityPlatformless Horizons for Digital Adaptability
Platformless Horizons for Digital Adaptability
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
 
Six Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal OntologySix Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal Ontology
 
Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)
 
Vector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptxVector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptx
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
 
WSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering DevelopersWSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering Developers
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectors
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 

Cenitpede: Analyzing Webcrawl

  • 1. Centipede: Analyzing Web Crawl data for context of a location Vikas Bansal Primal Pappachan Abhishek Sethi
  • 4. Description A web service that presents the context associated with a location
  • 5. Context of a location 1. Weather 2. Healthcare 3. Crime 4. Employment 5. ……
  • 6. Customers 1. Moving/Travelling into a new place 2. Policy Makers 3. Journalists 4. Researchers
  • 8. Related Services ● Yelp ● Google news ● http://bestplaces.net/ ● http://www.nycgo.com/events/ ● http://www.stubhub.com/
  • 9. Technical Description of Service ● Analyze the web crawl data ● Create a list of locations ● Filter top 100 words from the files that mention a location from the list ● Build an index of location against list of words corresponding to that location
  • 11. Data Sources •Common Crawl Data from Amazon S3 –Contains information on billions of web pages –Search through the contents –Use ARC and Text files
  • 12. Technologies and Resources ● Hadoop Cluster on Bluegrit System ● Apache Pig ○ Python for UDF’s ● Java/PHP for front end development ○ Use a Jboss container for Java, Xampp for PHP ● Elastic Search ● Map Reduce ● SQL/NoSQL database ● REST ● WSDL 2.0 ● AWS - RDS, R53, EC2
  • 14. Elastic Search ● Distributed restful search and analytics. ● Has near real-time search. ● Resilient clusters - detect and remove failed nodes.
  • 15. Challenges and Limitations •Amount of HDD space available. •Learning new technologies such as Apache Pig, WSDL etc. •Creating special UDF’s in Python.
  • 17. References ● Data set ● Common Crawl Web data ● Elastic Search ● Apache Pig ● Elastic Search for Term Filter lookup ● Hadoop Tutorial ● Dean, Jeffrey, and Sanjay Ghemawat. "MapReduce: simplified data processing on large clusters." Communications of the ACM 51.1 (2008): 107-113. ● Blei, David M., Andrew Y. Ng, and Michael I. Jordan. "Latent dirichlet allocation." the Journal of machine Learning research 3 (2003): 993-1022.