SlideShare une entreprise Scribd logo
1  sur  29
Efficient Query Processing in
Geographic Web Search Engines
Yen-Yu Chen (yenyu@yahoo-inc.com)
Torsten Suel (suel@poly.edu)
Alex Markowetz (alexmar@cs.ust.hk)
Web Search Engines
• used by millions of people every day
• a few large and many small engines
• cover billions of pages
• are completely changing the way
people find information
Search Engine Architecture
• Crawling pages
• Parsing for link information
• Building Index
• Computing ranking scores
• Processing Queries
WEBWEBWEBWEB
Crawler
Link
Parser
disks
Indexing
Query
Text Index
Standard Search Does Not Work Well
• try various terms with
geographic meaning:
– Yoga Brooklyn
– Yoga “New York”
– Yoga “Park Slope”
– Yoga Queens
• inconvenient &
inefficient
• will still miss many
results
Example: looking for a Yoga school around Brooklyn
Geographic Search Engines
• Geographic data mining
– Geo extraction, matching, propagation
• Geographic footprints
• Geographic ranking function
disks
Query
Geo Data
Mining
Footprints Geo Query
Processor
Text Index
Ranking Function
Geographic Footprints
• how to represent geo. locations of documents?
• “document footprint”
represented by some kind of
spatial structure (bitmaps, polygons)
• how to use footprints to answer queries?
• look for documents that
- contain the textual search terms
- and whose geographic footprint overlaps the query footprint
- score according to textual occurrences and degree of overlap
• we claim: this gives general and flexible approach
• how to extend a standard SE to answer such queries?
``
Geographic Footprints
• Raster data model
• Representing geographic
footprint of a page as a
bitmap on an underlying
1024x1024 grid of
Germany
• Each point on the grid has
an integer amplitude
• Bitmaps are kept as quad
tree structures
Geographic Footprints of Web
Pages
• Two advantages:
1.Aggregation and other
operations are
efficient
2.Highly compressed
– less than 100 bytes on
average after
simplification 0-badewanne.baby--shop.de
Geographic Query Processing
Ranking according to
subject-relevance and
intersection between
footprints
Ranking according to
subject-relevance
Boolean operations on
inverted index and
Footprints
Boolean operations on
inverted index.
User enters key words and
interested geographic
positions
User enters key words
Geographic SearchText Search
Geographic Query Processing
• each doc consists of textual terms and a geographic footprint
• each query consists of textual terms and a query footprint
• standard engine: combination of textual (cosine) + links etc
• geo engine: combine with geographic score function gs(d, q)
• s(d, q) = ts(d, q) + pr(d) + gs(d, q)
• properties of geographic score
• - only depends on footprints
• - is zero if intersection empty
• - is “partitionable”
• examples:
• - intersection volume
• - better: dot product
• - with suitable weights
• claim: fairly flexible
Challenges
• How can spatial index be integrated into a
search engine query processor ?
– Optimizing spatial index towards inverted
index structure.
• How should the data flow during query
execution?
– Most selective filter, textual or spatial?
– Interlacing within DAAT text query processor
Standard vs. Geographic SEs
Build additional spatial
data structures
Build inverted index
Additional data mining
for geo extraction
Link analysis, spam
and duplicate
detection, etc
Maybe restricted to
particular area
Standard crawling
Geographic SearchStandard Search
Traverse inverted lists Traverse inverted lists
plus
spatial structures
data
aquisition
data
mining
index
construction
query
execution
Architecture View
• if you can cook it into a footprint, we have an
efficient way to execute it
• separation of concerns
data mining &
inverted index
construction
footprint
computation
standard SE
query
processor
augmented
with spatial
data structures
preprocessing online query processing
Experiment Setup
• seven machines
– 512 MB memory each
• partition 31,000,000 documents from .de
domain
– 70GB of web pages (uncompressed) per machine
– 6.5 GB compressed inverted index per machine
• 50% to 90% of pages have footprints of average
size 130-500 bytes
• Inverted index and footprints need to reside on
disk
– main memory is too small
– mostly used for caching
• footprints smaller than index but larger than
Goals and Roadmap
• Minimizing footprint
data to be retrieved
• Replacing random I/O
by sequential scans
• Minimizing memory
usage
• Early results
• Text First baseline
algorithm
• Geo First baseline
algorithm
• Toeprints
• Spatial disk layout
• K-sweep algorithm
• Tile-index algorithm
• Spatially ordered
Inverted index
Naïve Algorithms
• Text-First Baseline
1. Go through inverted list with all keywords.
2. Retrieve all footprints.
3. Compute scores.
• Lots of random I/Os (footprints)
• DAAT (both steps can be
interleaved per document)
Naïve Algorithms (2)
• Geo-First Baseline
1.Get doc IDs from R*-tree.
2.Filter doc IDs with inverted index.
3.Retrieve all footprints.
4.Compute scores.
• A smarter (non-dynamic) spatial
index could have been used
• Blocking (sorting by docID after
spatial index)
Toeprints
• MBRs of footprints can be
quite large
– Page on Hamburg and
Munich spans all of
Germany
• Degrade the spatial index
• Idea: toeprints
• Split up MBRs recursively
• Partitioning stops when
1) No MBR has a side length >
S0
2) No MBR larger than S1 is
more than X% empty
Data Layout on Disk
• We have to keep the
toeprints on disk
• We have to retrieve some
• We want to avoid random
I/Os and prefer sequential
scans
• Space-filling curves are
curves whose ranges
contain the entire two
dimensional unit square
• Arrange data on disk
according to a space filling
curve
K-Sweep Algorithm
• IDEA: Retrieve
toeprints through a
fixed number of
contiguous scans
1. Retrieve intervals from
in-memory spatial
structure.
2. Perform up to k sweeps
to fetch all toeprints.
3. Filter doc IDs with
inverted index.
4. Compute the geo score.
Tile
ID
Interval
one
Interval
two
0 3476 3500 23400 31000
1 4688 5505 13202 20300
2 13274 15000 28408 33098
:
:
Disadvantage:
• fetches complete toeprint data
without first filtering by query terms
• Blocking (sorting)
Tile Index Algorithm
• Uses a grid
• Stores docIDs for every
tile
– Instead of intervals in k-
sweep
– Size reduction by
coarser grid (256 * 256)
1. Retrieve doc IDs in tiles
intersecting the query
footprint.
2. Filter Doc IDs with inverted
index.
3. Fetch the toeprints.
4. Compute and combine the
geo score.
Tile ID Toeprint IDs
0 354, 4562, 7821, ….
1 134, 592, 8211, ….
2 51, 62, 78, 101, ….
:
:
65536 351, 362, 478, ….
Decreases amount of toeprints to be retrieved
But not dramatically
(they are clustered and k-sweep performs scans)
Space-Filling Inverted Index
• Idea: order documents in Inverted index along
space filling curve
• Assign doc IDs along a space-filling curve
• Pros
– Apply to either k-Sweep or Tile Index algo.
– Fed into inverted index seamlessly
– Speed up reading inverted index
– Good trade-off if textual part is bottleneck
• Cons
– Increase the size of inverted index
Average Execution time for various
methods
K-Sweep
Tile Index vs.
Space-Filling Inverted Index
Comparison
Caching
Related Work
• Y. Zhou, et al. “Hybrid index structures for
location-based web search”, In Proc of 14th
CIKM 2005
• M. van Kreveld, et al. “Distributed ranking
methods for geographic information
retrieval”, In European Workshop on
Computational Geometry
• B. Martins, et al. “Indexing and ranking in geo-
ir systems”, In Proc of the Workshop on Geo-IR
2005
• S. Vaid, et al. “Spatio-textual indexing for
geographical search on the web”, In Proc of 9th
SSTD, 2005
Conclusion and Future Work
• Geographic query processing can be
performed at about the same level of
efficiency as text-only queries
• Pruning techniques for geographic search
engines
• Parallel geographic query processing

Contenu connexe

Tendances

Spark summit europe 2015 magellan
Spark summit europe 2015 magellanSpark summit europe 2015 magellan
Spark summit europe 2015 magellanRam Sriharsha
 
Gis and Ruby 101 at Ruby Conf Kenya 2017 by Kamal Ogudah
Gis and Ruby 101 at Ruby Conf Kenya 2017 by Kamal OgudahGis and Ruby 101 at Ruby Conf Kenya 2017 by Kamal Ogudah
Gis and Ruby 101 at Ruby Conf Kenya 2017 by Kamal OgudahMichael Kimathi
 
R programming language in spatial analysis
R programming language in spatial analysisR programming language in spatial analysis
R programming language in spatial analysisAbhiram Kanigolla
 
Geographic Information Retrieval (GIR)
Geographic Information Retrieval (GIR)Geographic Information Retrieval (GIR)
Geographic Information Retrieval (GIR)Behrooz Rasuli
 
An Empirical Evaluation of RDF Graph Partitioning Techniques
An Empirical Evaluation of RDF Graph Partitioning TechniquesAn Empirical Evaluation of RDF Graph Partitioning Techniques
An Empirical Evaluation of RDF Graph Partitioning TechniquesAdnan Akhter
 
Geographic Information Retrieval Systems.
Geographic Information Retrieval Systems.Geographic Information Retrieval Systems.
Geographic Information Retrieval Systems.Dinesh Manajipet
 
R Spatial Analysis using SP
R Spatial Analysis using SPR Spatial Analysis using SP
R Spatial Analysis using SPtjagger
 
Dremel: Interactive Analysis of Web-Scale Datasets
Dremel: Interactive Analysis of Web-Scale Datasets Dremel: Interactive Analysis of Web-Scale Datasets
Dremel: Interactive Analysis of Web-Scale Datasets robertlz
 
Beyond Kaggle: Solving Data Science Challenges at Scale
Beyond Kaggle: Solving Data Science Challenges at ScaleBeyond Kaggle: Solving Data Science Challenges at Scale
Beyond Kaggle: Solving Data Science Challenges at ScaleTuri, Inc.
 
AgensGraph Presentation at PGConf.us 2017
AgensGraph Presentation at PGConf.us 2017AgensGraph Presentation at PGConf.us 2017
AgensGraph Presentation at PGConf.us 2017Kisung Kim
 

Tendances (13)

Spark summit europe 2015 magellan
Spark summit europe 2015 magellanSpark summit europe 2015 magellan
Spark summit europe 2015 magellan
 
search engine
search enginesearch engine
search engine
 
Gis and Ruby 101 at Ruby Conf Kenya 2017 by Kamal Ogudah
Gis and Ruby 101 at Ruby Conf Kenya 2017 by Kamal OgudahGis and Ruby 101 at Ruby Conf Kenya 2017 by Kamal Ogudah
Gis and Ruby 101 at Ruby Conf Kenya 2017 by Kamal Ogudah
 
R programming language in spatial analysis
R programming language in spatial analysisR programming language in spatial analysis
R programming language in spatial analysis
 
Geo data analytics
Geo data analyticsGeo data analytics
Geo data analytics
 
Geographic Information Retrieval (GIR)
Geographic Information Retrieval (GIR)Geographic Information Retrieval (GIR)
Geographic Information Retrieval (GIR)
 
An Empirical Evaluation of RDF Graph Partitioning Techniques
An Empirical Evaluation of RDF Graph Partitioning TechniquesAn Empirical Evaluation of RDF Graph Partitioning Techniques
An Empirical Evaluation of RDF Graph Partitioning Techniques
 
H1076875
H1076875H1076875
H1076875
 
Geographic Information Retrieval Systems.
Geographic Information Retrieval Systems.Geographic Information Retrieval Systems.
Geographic Information Retrieval Systems.
 
R Spatial Analysis using SP
R Spatial Analysis using SPR Spatial Analysis using SP
R Spatial Analysis using SP
 
Dremel: Interactive Analysis of Web-Scale Datasets
Dremel: Interactive Analysis of Web-Scale Datasets Dremel: Interactive Analysis of Web-Scale Datasets
Dremel: Interactive Analysis of Web-Scale Datasets
 
Beyond Kaggle: Solving Data Science Challenges at Scale
Beyond Kaggle: Solving Data Science Challenges at ScaleBeyond Kaggle: Solving Data Science Challenges at Scale
Beyond Kaggle: Solving Data Science Challenges at Scale
 
AgensGraph Presentation at PGConf.us 2017
AgensGraph Presentation at PGConf.us 2017AgensGraph Presentation at PGConf.us 2017
AgensGraph Presentation at PGConf.us 2017
 

En vedette

Efficient Query Processing in Distributed Search Engines
Efficient Query Processing in Distributed Search EnginesEfficient Query Processing in Distributed Search Engines
Efficient Query Processing in Distributed Search EnginesSimon Lia-Jonassen
 
Modern Workplace 2016 - Susanna Eerola, Microsoft
Modern Workplace 2016 - Susanna Eerola, MicrosoftModern Workplace 2016 - Susanna Eerola, Microsoft
Modern Workplace 2016 - Susanna Eerola, MicrosoftKnowit Oy
 
Challenges Distributed Information Retrieval [RBY] (ICDE 2007 Turkey)
Challenges Distributed Information Retrieval [RBY] (ICDE 2007 Turkey)Challenges Distributed Information Retrieval [RBY] (ICDE 2007 Turkey)
Challenges Distributed Information Retrieval [RBY] (ICDE 2007 Turkey)Carlos Castillo (ChaTo)
 
Office 365 - Your Modern Workplace
Office 365 - Your Modern WorkplaceOffice 365 - Your Modern Workplace
Office 365 - Your Modern WorkplaceTarek El Jammal
 

En vedette (6)

p723-zukowski
p723-zukowskip723-zukowski
p723-zukowski
 
Efficient Query Processing in Distributed Search Engines
Efficient Query Processing in Distributed Search EnginesEfficient Query Processing in Distributed Search Engines
Efficient Query Processing in Distributed Search Engines
 
Modern Workplace: Office 2016
 Modern Workplace: Office 2016 Modern Workplace: Office 2016
Modern Workplace: Office 2016
 
Modern Workplace 2016 - Susanna Eerola, Microsoft
Modern Workplace 2016 - Susanna Eerola, MicrosoftModern Workplace 2016 - Susanna Eerola, Microsoft
Modern Workplace 2016 - Susanna Eerola, Microsoft
 
Challenges Distributed Information Retrieval [RBY] (ICDE 2007 Turkey)
Challenges Distributed Information Retrieval [RBY] (ICDE 2007 Turkey)Challenges Distributed Information Retrieval [RBY] (ICDE 2007 Turkey)
Challenges Distributed Information Retrieval [RBY] (ICDE 2007 Turkey)
 
Office 365 - Your Modern Workplace
Office 365 - Your Modern WorkplaceOffice 365 - Your Modern Workplace
Office 365 - Your Modern Workplace
 

Similaire à Efficient Query Processing in Geographic Web Search Engines

Indexing Complex PostgreSQL Data Types
Indexing Complex PostgreSQL Data TypesIndexing Complex PostgreSQL Data Types
Indexing Complex PostgreSQL Data TypesJonathan Katz
 
Building a Spatial Database in PostgreSQL
Building a Spatial Database in PostgreSQLBuilding a Spatial Database in PostgreSQL
Building a Spatial Database in PostgreSQLSohail Akbar Goheer
 
2017 GIS in Education Track: Sharing Historical Maps and Atlases in Web Apps
2017 GIS in Education Track: Sharing Historical Maps and Atlases in Web Apps2017 GIS in Education Track: Sharing Historical Maps and Atlases in Web Apps
2017 GIS in Education Track: Sharing Historical Maps and Atlases in Web AppsGIS in the Rockies
 
Presentation MIG-T GeoPackage.pdf
Presentation MIG-T GeoPackage.pdfPresentation MIG-T GeoPackage.pdf
Presentation MIG-T GeoPackage.pdfssusera932351
 
Review presentation for Orientation 2014
Review presentation for Orientation 2014Review presentation for Orientation 2014
Review presentation for Orientation 2014DUSPviz
 
Enhancing Parallel Coordinates with Curves
Enhancing Parallel Coordinates with CurvesEnhancing Parallel Coordinates with Curves
Enhancing Parallel Coordinates with Curvesmartinjgraham
 
Sharing a Startup’s Big Data Lessons
Sharing a Startup’s Big Data LessonsSharing a Startup’s Big Data Lessons
Sharing a Startup’s Big Data LessonsGeorge Stathis
 
Sql query performance analysis
Sql query performance analysisSql query performance analysis
Sql query performance analysisRiteshkiit
 
Elasticsearch - Scalability and Multitenancy
Elasticsearch - Scalability and MultitenancyElasticsearch - Scalability and Multitenancy
Elasticsearch - Scalability and MultitenancyBozhidar Bozhanov
 
MongoDB and Indexes - MUG Denver - 20160329
MongoDB and Indexes - MUG Denver - 20160329MongoDB and Indexes - MUG Denver - 20160329
MongoDB and Indexes - MUG Denver - 20160329Douglas Duncan
 
Top-k Exploration of Query Candidates for Efficient Keyword Search on Graph-S...
Top-k Exploration of Query Candidates for Efficient Keyword Search on Graph-S...Top-k Exploration of Query Candidates for Efficient Keyword Search on Graph-S...
Top-k Exploration of Query Candidates for Efficient Keyword Search on Graph-S...Thanh Tran
 
Sql query performance analysis
Sql query performance analysisSql query performance analysis
Sql query performance analysisRiteshkiit
 
12 ipt 0203 Storage and Retrieval
12 ipt 0203   Storage and Retrieval12 ipt 0203   Storage and Retrieval
12 ipt 0203 Storage and Retrievalctedds
 
Search Analytics Component: Presented by Steven Bower, Bloomberg L.P.
Search Analytics Component: Presented by Steven Bower, Bloomberg L.P.Search Analytics Component: Presented by Steven Bower, Bloomberg L.P.
Search Analytics Component: Presented by Steven Bower, Bloomberg L.P.Lucidworks
 
Conceptos básicos. Seminario web 1: Introducción a NoSQL
Conceptos básicos. Seminario web 1: Introducción a NoSQLConceptos básicos. Seminario web 1: Introducción a NoSQL
Conceptos básicos. Seminario web 1: Introducción a NoSQLMongoDB
 
Skillwise - Enhancing dotnet app
Skillwise - Enhancing dotnet appSkillwise - Enhancing dotnet app
Skillwise - Enhancing dotnet appSkillwise Group
 

Similaire à Efficient Query Processing in Geographic Web Search Engines (20)

Indexing Complex PostgreSQL Data Types
Indexing Complex PostgreSQL Data TypesIndexing Complex PostgreSQL Data Types
Indexing Complex PostgreSQL Data Types
 
Building a Spatial Database in PostgreSQL
Building a Spatial Database in PostgreSQLBuilding a Spatial Database in PostgreSQL
Building a Spatial Database in PostgreSQL
 
2017 GIS in Education Track: Sharing Historical Maps and Atlases in Web Apps
2017 GIS in Education Track: Sharing Historical Maps and Atlases in Web Apps2017 GIS in Education Track: Sharing Historical Maps and Atlases in Web Apps
2017 GIS in Education Track: Sharing Historical Maps and Atlases in Web Apps
 
Geoservices Activities at EDINA
Geoservices Activities at EDINAGeoservices Activities at EDINA
Geoservices Activities at EDINA
 
Web search engines
Web search enginesWeb search engines
Web search engines
 
Presentation MIG-T GeoPackage.pdf
Presentation MIG-T GeoPackage.pdfPresentation MIG-T GeoPackage.pdf
Presentation MIG-T GeoPackage.pdf
 
Day 6 - PostGIS
Day 6 - PostGISDay 6 - PostGIS
Day 6 - PostGIS
 
Review presentation for Orientation 2014
Review presentation for Orientation 2014Review presentation for Orientation 2014
Review presentation for Orientation 2014
 
Enhancing Parallel Coordinates with Curves
Enhancing Parallel Coordinates with CurvesEnhancing Parallel Coordinates with Curves
Enhancing Parallel Coordinates with Curves
 
Sharing a Startup’s Big Data Lessons
Sharing a Startup’s Big Data LessonsSharing a Startup’s Big Data Lessons
Sharing a Startup’s Big Data Lessons
 
Sql query performance analysis
Sql query performance analysisSql query performance analysis
Sql query performance analysis
 
Elasticsearch - Scalability and Multitenancy
Elasticsearch - Scalability and MultitenancyElasticsearch - Scalability and Multitenancy
Elasticsearch - Scalability and Multitenancy
 
MongoDB and Indexes - MUG Denver - 20160329
MongoDB and Indexes - MUG Denver - 20160329MongoDB and Indexes - MUG Denver - 20160329
MongoDB and Indexes - MUG Denver - 20160329
 
Top-k Exploration of Query Candidates for Efficient Keyword Search on Graph-S...
Top-k Exploration of Query Candidates for Efficient Keyword Search on Graph-S...Top-k Exploration of Query Candidates for Efficient Keyword Search on Graph-S...
Top-k Exploration of Query Candidates for Efficient Keyword Search on Graph-S...
 
Sql query performance analysis
Sql query performance analysisSql query performance analysis
Sql query performance analysis
 
12 ipt 0203 Storage and Retrieval
12 ipt 0203   Storage and Retrieval12 ipt 0203   Storage and Retrieval
12 ipt 0203 Storage and Retrieval
 
Search Analytics Component: Presented by Steven Bower, Bloomberg L.P.
Search Analytics Component: Presented by Steven Bower, Bloomberg L.P.Search Analytics Component: Presented by Steven Bower, Bloomberg L.P.
Search Analytics Component: Presented by Steven Bower, Bloomberg L.P.
 
Exploratory Spatial Analytics (ESA)
Exploratory Spatial Analytics (ESA)Exploratory Spatial Analytics (ESA)
Exploratory Spatial Analytics (ESA)
 
Conceptos básicos. Seminario web 1: Introducción a NoSQL
Conceptos básicos. Seminario web 1: Introducción a NoSQLConceptos básicos. Seminario web 1: Introducción a NoSQL
Conceptos básicos. Seminario web 1: Introducción a NoSQL
 
Skillwise - Enhancing dotnet app
Skillwise - Enhancing dotnet appSkillwise - Enhancing dotnet app
Skillwise - Enhancing dotnet app
 

Dernier

Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Alkin Tezuysal
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxLoriGlavin3
 
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Mark Goldstein
 
Testing tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesTesting tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesKari Kakkonen
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
Manual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance AuditManual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance AuditSkynet Technologies
 
Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rick Flair
 
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...Wes McKinney
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxLoriGlavin3
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersNicole Novielli
 
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...AliaaTarek5
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxLoriGlavin3
 
Data governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationData governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationKnoldus Inc.
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxLoriGlavin3
 
Generative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfGenerative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfIngrid Airi González
 
Sample pptx for embedding into website for demo
Sample pptx for embedding into website for demoSample pptx for embedding into website for demo
Sample pptx for embedding into website for demoHarshalMandlekar2
 
2024 April Patch Tuesday
2024 April Patch Tuesday2024 April Patch Tuesday
2024 April Patch TuesdayIvanti
 
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better StrongerModern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better Strongerpanagenda
 
Potential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsPotential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsRavi Sanghani
 
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality AssuranceInflectra
 

Dernier (20)

Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
 
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
 
Testing tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesTesting tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examples
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
Manual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance AuditManual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance Audit
 
Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...
 
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software Developers
 
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptx
 
Data governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationData governance with Unity Catalog Presentation
Data governance with Unity Catalog Presentation
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
 
Generative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfGenerative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdf
 
Sample pptx for embedding into website for demo
Sample pptx for embedding into website for demoSample pptx for embedding into website for demo
Sample pptx for embedding into website for demo
 
2024 April Patch Tuesday
2024 April Patch Tuesday2024 April Patch Tuesday
2024 April Patch Tuesday
 
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better StrongerModern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
 
Potential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsPotential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and Insights
 
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
 

Efficient Query Processing in Geographic Web Search Engines

  • 1. Efficient Query Processing in Geographic Web Search Engines Yen-Yu Chen (yenyu@yahoo-inc.com) Torsten Suel (suel@poly.edu) Alex Markowetz (alexmar@cs.ust.hk)
  • 2. Web Search Engines • used by millions of people every day • a few large and many small engines • cover billions of pages • are completely changing the way people find information
  • 3. Search Engine Architecture • Crawling pages • Parsing for link information • Building Index • Computing ranking scores • Processing Queries WEBWEBWEBWEB Crawler Link Parser disks Indexing Query Text Index
  • 4. Standard Search Does Not Work Well • try various terms with geographic meaning: – Yoga Brooklyn – Yoga “New York” – Yoga “Park Slope” – Yoga Queens • inconvenient & inefficient • will still miss many results Example: looking for a Yoga school around Brooklyn
  • 5. Geographic Search Engines • Geographic data mining – Geo extraction, matching, propagation • Geographic footprints • Geographic ranking function disks Query Geo Data Mining Footprints Geo Query Processor Text Index Ranking Function
  • 6. Geographic Footprints • how to represent geo. locations of documents? • “document footprint” represented by some kind of spatial structure (bitmaps, polygons) • how to use footprints to answer queries? • look for documents that - contain the textual search terms - and whose geographic footprint overlaps the query footprint - score according to textual occurrences and degree of overlap • we claim: this gives general and flexible approach • how to extend a standard SE to answer such queries? ``
  • 7. Geographic Footprints • Raster data model • Representing geographic footprint of a page as a bitmap on an underlying 1024x1024 grid of Germany • Each point on the grid has an integer amplitude • Bitmaps are kept as quad tree structures
  • 8. Geographic Footprints of Web Pages • Two advantages: 1.Aggregation and other operations are efficient 2.Highly compressed – less than 100 bytes on average after simplification 0-badewanne.baby--shop.de
  • 9. Geographic Query Processing Ranking according to subject-relevance and intersection between footprints Ranking according to subject-relevance Boolean operations on inverted index and Footprints Boolean operations on inverted index. User enters key words and interested geographic positions User enters key words Geographic SearchText Search
  • 10. Geographic Query Processing • each doc consists of textual terms and a geographic footprint • each query consists of textual terms and a query footprint • standard engine: combination of textual (cosine) + links etc • geo engine: combine with geographic score function gs(d, q) • s(d, q) = ts(d, q) + pr(d) + gs(d, q) • properties of geographic score • - only depends on footprints • - is zero if intersection empty • - is “partitionable” • examples: • - intersection volume • - better: dot product • - with suitable weights • claim: fairly flexible
  • 11. Challenges • How can spatial index be integrated into a search engine query processor ? – Optimizing spatial index towards inverted index structure. • How should the data flow during query execution? – Most selective filter, textual or spatial? – Interlacing within DAAT text query processor
  • 12. Standard vs. Geographic SEs Build additional spatial data structures Build inverted index Additional data mining for geo extraction Link analysis, spam and duplicate detection, etc Maybe restricted to particular area Standard crawling Geographic SearchStandard Search Traverse inverted lists Traverse inverted lists plus spatial structures data aquisition data mining index construction query execution
  • 13. Architecture View • if you can cook it into a footprint, we have an efficient way to execute it • separation of concerns data mining & inverted index construction footprint computation standard SE query processor augmented with spatial data structures preprocessing online query processing
  • 14. Experiment Setup • seven machines – 512 MB memory each • partition 31,000,000 documents from .de domain – 70GB of web pages (uncompressed) per machine – 6.5 GB compressed inverted index per machine • 50% to 90% of pages have footprints of average size 130-500 bytes • Inverted index and footprints need to reside on disk – main memory is too small – mostly used for caching • footprints smaller than index but larger than
  • 15. Goals and Roadmap • Minimizing footprint data to be retrieved • Replacing random I/O by sequential scans • Minimizing memory usage • Early results • Text First baseline algorithm • Geo First baseline algorithm • Toeprints • Spatial disk layout • K-sweep algorithm • Tile-index algorithm • Spatially ordered Inverted index
  • 16. Naïve Algorithms • Text-First Baseline 1. Go through inverted list with all keywords. 2. Retrieve all footprints. 3. Compute scores. • Lots of random I/Os (footprints) • DAAT (both steps can be interleaved per document)
  • 17. Naïve Algorithms (2) • Geo-First Baseline 1.Get doc IDs from R*-tree. 2.Filter doc IDs with inverted index. 3.Retrieve all footprints. 4.Compute scores. • A smarter (non-dynamic) spatial index could have been used • Blocking (sorting by docID after spatial index)
  • 18. Toeprints • MBRs of footprints can be quite large – Page on Hamburg and Munich spans all of Germany • Degrade the spatial index • Idea: toeprints • Split up MBRs recursively • Partitioning stops when 1) No MBR has a side length > S0 2) No MBR larger than S1 is more than X% empty
  • 19. Data Layout on Disk • We have to keep the toeprints on disk • We have to retrieve some • We want to avoid random I/Os and prefer sequential scans • Space-filling curves are curves whose ranges contain the entire two dimensional unit square • Arrange data on disk according to a space filling curve
  • 20. K-Sweep Algorithm • IDEA: Retrieve toeprints through a fixed number of contiguous scans 1. Retrieve intervals from in-memory spatial structure. 2. Perform up to k sweeps to fetch all toeprints. 3. Filter doc IDs with inverted index. 4. Compute the geo score. Tile ID Interval one Interval two 0 3476 3500 23400 31000 1 4688 5505 13202 20300 2 13274 15000 28408 33098 : : Disadvantage: • fetches complete toeprint data without first filtering by query terms • Blocking (sorting)
  • 21. Tile Index Algorithm • Uses a grid • Stores docIDs for every tile – Instead of intervals in k- sweep – Size reduction by coarser grid (256 * 256) 1. Retrieve doc IDs in tiles intersecting the query footprint. 2. Filter Doc IDs with inverted index. 3. Fetch the toeprints. 4. Compute and combine the geo score. Tile ID Toeprint IDs 0 354, 4562, 7821, …. 1 134, 592, 8211, …. 2 51, 62, 78, 101, …. : : 65536 351, 362, 478, …. Decreases amount of toeprints to be retrieved But not dramatically (they are clustered and k-sweep performs scans)
  • 22. Space-Filling Inverted Index • Idea: order documents in Inverted index along space filling curve • Assign doc IDs along a space-filling curve • Pros – Apply to either k-Sweep or Tile Index algo. – Fed into inverted index seamlessly – Speed up reading inverted index – Good trade-off if textual part is bottleneck • Cons – Increase the size of inverted index
  • 23. Average Execution time for various methods
  • 28. Related Work • Y. Zhou, et al. “Hybrid index structures for location-based web search”, In Proc of 14th CIKM 2005 • M. van Kreveld, et al. “Distributed ranking methods for geographic information retrieval”, In European Workshop on Computational Geometry • B. Martins, et al. “Indexing and ranking in geo- ir systems”, In Proc of the Workshop on Geo-IR 2005 • S. Vaid, et al. “Spatio-textual indexing for geographical search on the web”, In Proc of 9th SSTD, 2005
  • 29. Conclusion and Future Work • Geographic query processing can be performed at about the same level of efficiency as text-only queries • Pruning techniques for geographic search engines • Parallel geographic query processing