
PrivatePond: Outsourced Management of Web Corpuses

  • 1. PrivatePond: Outsourced Management of Web Corpuses. Daniel Fabbri, Arnab Nandi, Kristen LeFevre, H.V. Jagadish (University of Michigan)
  • 2. Outsourcing Data to the Cloud
      - Increase in cloud computing
      - Outsource document management to service providers
      - Search and retrieve documents from the cloud
      - Leverage existing search infrastructure
      - High-quality search results
  • 3. Outsourcing Challenge: Confidentiality
      - Documents may contain private information
      - The service provider/public should not have access to the contents
      - How can we balance confidentiality and search quality?
      - [Diagram labels: Web, Intranet, Search Engines]
  • 4. PrivatePond
      - Create and store a corpus of confidential hyperlinked documents
      - Search confidential documents using an unmodified search engine
      - Balance privacy and searchability with a secure indexable representation
      - [Diagram labels: Web, Intranet, Search Engines]
  • 5. PrivatePond Design Goals
      - User experience: document confidentiality, search quality, transparency
      - Search system: minimal overhead, leverage existing search infrastructure
      - Previous work requires modification to the search engine [Song 2000, Bawa 2003, Zerr 2008]
  • 6. Outsourcing Architecture
      - Outsource the original corpus
      - Does not maintain confidentiality
      - [Diagram labels: D, Service, (Unmodified) Search Engine, Ranked Result, Document(s) D, Q, User Search]
  • 7. Outsourcing Architecture
      - Outsource encrypted documents
      - Local proxy encrypts and decrypts
      - Local proxy performs the searches
      - High search overhead
      - [Diagram labels: E(D), Service, (Unmodified) Search Engine, Local Proxy, Ranked Result, Document(s) D, Q, User Search]
  • 8. PrivatePond Architecture
      - Secure indexable representation: attached to the encrypted document, indexable, searchable
      - [Diagram labels: Secure Indexable Representation, E(D), Service, (Unmodified) Search Engine, E(D), Q’, Local Proxy, Ranked Result, Document(s) D, Q, User Search]
  • 9. Outsourcing Search: Practical Tradeoffs
      - Outsource original corpus: searchable, not confidential
      - Outsource encrypted corpus: confidential, not easily searched
      - Indexable representation
      - [Diagram axes: Search Quality, Confidentiality]
  • 10. Sample Indexable Representation
      - First, consider encrypting each word in a document
      - Maintain links between indexable representations
      - Vulnerable to attacks on:
          - Language structure (e.g., <noun> <verb> <noun>)
          - Frequency of words (e.g., "twinkle" is most frequent) [Kumar 2007]
      - Example: the document "Twinkle, twinkle little star" has the indexable representation "AAA AAA BBB CCC"
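A minimal sketch of the word-by-word scheme on this slide, and of why it leaks word frequency. The slides do not say how words are encrypted; a truncated keyed hash (HMAC-SHA256) is used here purely as a stand-in for any deterministic per-word encryption, and `SECRET_KEY`, `encrypt_word`, and `word_by_word` are hypothetical names.

```python
import hashlib
import hmac
import re
from collections import Counter

# Hypothetical key; in the architecture described, it would stay with the local proxy.
SECRET_KEY = b"demo-key-held-by-the-local-proxy"


def encrypt_word(word, key=SECRET_KEY):
    """Map a word to a deterministic opaque token (stand-in for encryption)."""
    digest = hmac.new(key, word.lower().encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:8]


def word_by_word(text):
    """Replace every word with its token, preserving order and repetition."""
    return [encrypt_word(w) for w in re.findall(r"[A-Za-z']+", text)]


tokens = word_by_word("Twinkle, twinkle little star")
counts = Counter(tokens)

# The same word always yields the same token, so the most frequent token
# still identifies the most frequent word ("twinkle") to an attacker.
print(tokens)
print(counts.most_common(1))
```

Because the mapping is deterministic, both word order and repetition survive, which is what the language-structure and frequency attacks on this slide exploit.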
  • 11. Sample Indexable Representation
      - Second, represent documents as an encrypted set-of-words
      - Prevents attacks on a single indexable representation
      - Vulnerable to attacks that aggregate word frequencies across all indexable representations in the corpus
      - [Diagram labels: Doc 1, Doc 2, Doc 3; tokens AAA, BBB, CCC; Corpus of Indexable Representations; Aggregate Document Frequency]
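The aggregation attack on this slide can be illustrated with a short sketch. The three-document corpus below is an illustrative assumption (the tokens AAA, BBB, CCC follow the slide's notation, but the exact document contents are not recoverable from the transcript), and `set_of_words` and `document_frequency` are hypothetical helper names.

```python
from collections import Counter


def set_of_words(tokens):
    """Drop order and repetition: keep each token at most once per document."""
    return sorted(set(tokens))


def document_frequency(corpus):
    """Count, for each token, the number of documents that contain it."""
    df = Counter()
    for representation in corpus:
        df.update(set(representation))
    return df


# Assumed example corpus of encrypted set-of-words representations.
corpus = [
    set_of_words(["AAA", "AAA", "BBB", "CCC"]),
    set_of_words(["BBB", "CCC"]),
    set_of_words(["CCC"]),
]

df = document_frequency(corpus)
ranked = [token for token, _ in df.most_common()]

# Within one document every token now appears once, but across the corpus
# the document-frequency ranking is still visible to the service provider.
print(dict(df))
print(ranked)
```

An attacker who knows the typical document-frequency ranking of words in the underlying language could try to align it with `ranked`, which is the weakness the slide points out.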
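  • 12. Sample Indexable Representation
      - Third, set-of-words representation + padding (BW = 3)
      - [Diagram labels: AAA BBB CCC, BBB CCC, CCC; Aggregate Document Frequency; Corpus of Indexable Representations]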
  • 13. PrivatePond Indexable Representation
      - Set-of-words representation + padding (BW = 3)
      - [Diagram labels: AAA BBB CCC, AAA BBB CCC, AAA BBB CCC; Aggregate Document Frequency; Corpus of Indexable Representations]
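The transcript does not define BW or spell out the padding procedure. The sketch below is one plausible reading, not the authors' method: it assumes BW is a bucket width, groups tokens into buckets of BW tokens by descending document frequency, and adds tokens to documents until every token in a bucket matches the bucket's highest document frequency. The function name `pad_corpus`, the example corpus, and the rule for choosing which documents receive padding (simply the first ones lacking the token) are all assumptions.

```python
from collections import Counter


def pad_corpus(corpus, bucket_width):
    """Pad set-of-words representations so tokens in a bucket share one DF.

    Assumed scheme: rank tokens by descending document frequency, cut the
    ranking into buckets of `bucket_width` tokens, and raise every token in
    a bucket to the bucket's highest document frequency.
    """
    docs = [set(representation) for representation in corpus]
    df = Counter(token for doc in docs for token in doc)
    ranked = sorted(df, key=lambda token: (-df[token], token))

    for start in range(0, len(ranked), bucket_width):
        bucket = ranked[start:start + bucket_width]
        target = df[bucket[0]]  # highest document frequency in this bucket
        for token in bucket:
            missing = target - df[token]
            for doc in docs:
                if missing == 0:
                    break
                if token not in doc:
                    doc.add(token)  # padding: a false occurrence of the token
                    missing -= 1

    return [sorted(doc) for doc in docs]


corpus = [["AAA", "BBB", "CCC"], ["BBB", "CCC"], ["CCC"]]
padded = pad_corpus(corpus, bucket_width=3)
unpadded = pad_corpus(corpus, bucket_width=1)

print(padded)
print(unpadded)
```

Under this reading, a width of 3 flattens the document frequencies of all three tokens, while a width of 1 adds no padding at all. The added tokens are also the source of the false positives mentioned later in the deck: a padded document matches queries for words it never contained.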
  • 14. PrivatePond Indexable Representation: Impact on Search Quality
      - Lose proximity-based search
      - Lose term frequency
      - Padding of tokens introduces false positives
      - What is the effect of the indexable representation on search quality?
  • 17. Evaluation
      - Data: sample of Simple Wikipedia (small corpus); full Simple Wikipedia (large corpus)
      - Query workload of 10K queries
      - Evaluation performed with MySQL
  • 18. Ranking Models
      - TF-IDF (as implemented in MySQL FULLTEXT)
      - PageRank
      - Combination of ranking models
      - Measure change in search quality due to the indexable representation
  • 19. Search Quality Metrics
      - [Diagram: a search engine over the original corpus returns the ranked results called the gold list; a search engine over the indexable representation returns the ranked results called the pond list]
  • 20. Example: Search Quality Metrics
      - N: consider the documents ranked from 1 to N
      - P(N) = |gold list ∩ pond list| / N
      - P(3) = 2/3
      - Two additional metrics (included in the paper):
  • 26. Rank Perturbation
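The precision metric P(N) from the search quality example can be sketched as follows. The document identifiers and the two ranked lists are made up for illustration, chosen so that the result matches the slide's P(3) = 2/3; `precision_at_n` is a hypothetical name.

```python
def precision_at_n(gold_list, pond_list, n):
    """Fraction of the top-n gold results that also appear in the top-n pond results."""
    if n <= 0:
        raise ValueError("n must be positive")
    overlap = set(gold_list[:n]) & set(pond_list[:n])
    return len(overlap) / n


# Hypothetical rankings: gold comes from the original corpus,
# pond from the indexable representation.
gold = ["d1", "d2", "d3", "d4"]
pond = ["d1", "d5", "d2", "d3"]

p3 = precision_at_n(gold, pond, 3)
print(p3)
```

Here the top three gold results are d1, d2, d3 and the top three pond results are d1, d5, d2, so two of three documents overlap. Note that this metric ignores the order of documents within the top N.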
  • 28. PageRank is unaffected by the set-of-words representation
  • 30. Padding in documents with high PageRank or low document frequency
  • 32. Conclusion
      - Present the PrivatePond architecture for outsourcing search
      - Goal of balancing searchability and confidentiality
      - Leverages existing search engine infrastructure
      - Future work: alternative indexable representations
  • 33. More info at www.eecs.umich.edu/db

Editor's notes

  1. Consider a small company’s intranet. Offload management responsibilities.
  2. Secure Boolean search on encrypted documents; secure inverted indexes for document retrieval. Transparency: seamless interaction for the user. Query run time.
  3. Traditional search architecture: a query returns a ranked list of documents
  4. Download each encrypted document to search
  5. So not confidential?
  6. One example to strike a balance between searchability and confidentiality
  7. Impact on search quality: lose proximity-based search; lose term frequency; padding of tokens introduces false positives
  8. Given a ranking model, examine the change in search quality; we do not determine the best ranking model. N: the N highest-ranked documents.
  9. Meaning of N
  10. BW = 1
  11. Varying confidentiality and search quality characteristics