1. Addressing scalability challenges in peer-to-peer search
PhD seminar
4 Feb, 2014
Harisankar H,
PhD scholar,
DOS lab, Dept. of CSE
Advisor: Prof. D. Janakiram
http://harisankarh.wordpress.com
2. Outline
• Issues with centralized search
– Can peer-to-peer search help?
• Scalability challenges in peer-to-peer search
• Proposed architectural extensions
– Two-layered architecture for peer-to-peer concept search
– Cloud-assisted approach to handle query spikes
3. Centralized search scenario
• Scenario
– Search engines crawl available content, index and
maintain it in data centers
– User queries directed to data centers, processed
internally and results sent back
– Centrally managed by single company
Content
End users
Datacenters
4. Some issues with centralized search
– Privacy concerns
• All user queries accessible from a single location
– Centralized control
• Individual companies decide what to index (or not), how to
rank, etc.
– Transparency
• Complete details of ranking, pre-processing etc. not
made available publicly
• Concerns of censorship and doctoring of results
5. Some issues with centralized search (contd.)
• Uses mostly syntactic search techniques
– Based on word or multi-word phrases
– Low quality of results due to ambiguity of natural language
• Issues with centralized semantic search
– Difficult to capture long tail of niche interests of users
• Requires rich human generated knowledge bases in numerous
niche areas
6. Peer-to-peer search approach
• Edge nodes in the internet participate in
providing and using the search service
• Search as a collaborative service
• Crawling, indexing and search distributed
across the peers
7. How could peer-to-peer search help?
• Each user query can be sent to a different peer among
millions
– Obtaining query logs in a single location difficult
– Reduced privacy concerns
• Distributed control across numerous peers
– Avoids centralized control
• Search application available with all peers
– Better transparency in ranking etc.
• Background knowledge of peers can be utilized for
effective semantic search
– Can help improve quality of results
• Led to a lot of academic research in the area as well as
real-world p2p search engines*
* e.g., faroo.com, yacy.net; YacyPi Kickstarter project
8. Realizing peer-to-peer search
• Distribution of search
index
– Term partitioning
• Responsibility of individual
terms assigned to different
peers
– E.g., peer1 is currently
responsible for term
“computer”
• Term-to-peer mapping
achieved through a
structured overlay (e.g., a DHT)
Image src: http://wwarodomfr.blogspot.in/2008/09/chord-routing.html
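The term-to-peer mapping above can be sketched with Chord-style consistent hashing; the ring size, peer ids, and SHA-1 key derivation below are illustrative assumptions, not the deck's exact scheme:

```python
import hashlib

def peer_for_term(term, peer_ids, ring_bits=16):
    """Chord-style mapping: hash the term onto the id ring and assign
    it to the first peer clockwise from the key (its successor)."""
    key = int(hashlib.sha1(term.encode()).hexdigest(), 16) % (2 ** ring_bits)
    candidates = sorted(peer_ids)
    for pid in candidates:
        if pid >= key:
            return pid
    return candidates[0]  # wrap around the ring

# Hypothetical peer ids on a 16-bit ring.
peers = [1024, 20000, 45000, 61000]
owner = peer_for_term("computer", peers)  # the peer responsible for "computer"
```

When peers join or leave, only the keys between a peer and its predecessor move, which is what makes the term-to-peer mapping dynamic.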
9. Scalability challenges in peer-to-peer search
• Peers share only idle resources
• Peers join/leave autonomously
• Limited individual resources
• No SLA
These lead to:
– Peer bandwidth bottleneck during query processing
• Particularly queries involving multiple terms (index transfer
between multiple peers)
– Instability during query spikes
• Knowledge management issues at large scale
– Difficult to reach consensus at large scale
– Need wide understanding and have to meet requirements of a
large, diverse group
10. Two-layered architecture for peer-to-peer concept search*
• Peers organized as communities based on common
interest
• Each community maintains its own background
knowledge to use in semantic search
– Maintained in a distributed manner
• A global layer with aggregate information to facilitate
search across communities
• Background knowledge bases extend from minimal
universally accepted knowledge in upper layer
• Search, indexing and knowledge management
proceeds independently in each community
*Joint work with Prof. Fausto Giunchiglia and Uladzimir Kharkevich, Univ. of Trento
11. Two-layered architecture for peer-to-peer concept search
[Figure: a global layer holding the universal knowledge (UK) and a
community index, above three communities (Community-1, -2, -3), each
with its own background knowledge (BK-1, BK-2, BK-3) and document
index (doc index 1-3)]
12. Two-layered architecture
• Global layer
– retrieves relevant communities for query based on
universal knowledge
• Community layer
– retrieves relevant documents for query based on
background knowledge of community
13. Overcoming the shortcomings of single-layered approaches
• Search can be scoped only to the relevant
communities for a query
– Results in less bandwidth-related issues
• Two layers make knowledge management scalable
and interoperable
– Niche interests supported at community-level background
knowledge bases
– Minimal universal knowledge for interoperability
• Search within community based on community’s
background knowledge
– Focused interest of community helps in better term-to-concept mapping
14. Two-layered approach
• Index partitioning
– Uses partition-by-term
• Posting list for each term stored on a different peer
– Uses a Distributed Hash Table (DHT) to realize dynamic term-to-peer mapping
• O(log N) hops for each lookup
• Overlay network
– Communities and global layer maintained using a two-layered overlay
• Based on our earlier work on computational grids*
• O(log N) hops for lookup even with two layers
*M.V. Reddy, A.V. Srinivas, T. Gopinath, and D. Janakiram, “Vishwa: A reconfigurable P2P
middleware for Grid Computations,” in ICPP'06
15. Two-layered approach
• Community management
– Similar to public communities in Flickr, Facebook,
etc.
• Search within community
– Uses Concept Search* as underlying semantic
search scheme
• Extends syntactic search with available knowledge to
realize semantic search
• Falls back to syntactic search when no knowledge is
available
*Fausto Giunchiglia, Uladzimir Kharkevich, Ilya Zaihrayeu, “Concept search”, ESWC 2009
16. Two-layered approach
• Knowledge representation
– Term -> concept mapping
– Concept hierarchy
• Concept relations expressed as subsumption relations
• Concepts in documents/queries extracted
– by analyzing words and natural language phrases
– Noun phrases translated into conjunctions of atomic
concepts (complex concepts)
• small-1 ∧ dog-2
– Documents/queries represented as enumerated
sequences of complex concepts
• E.g., 1: small-1 ∧ dog-2, 2: big-1 ∧ animal-3
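The representation above can be made concrete in a small sketch; the `lexicon` mapping below is a hypothetical stand-in for the real term-to-concept (word-sense) mapping:

```python
def to_complex_concepts(noun_phrases, term_to_concept):
    """Translate each noun phrase into a conjunction (set) of atomic
    concepts; the document body becomes an enumerated sequence of
    these complex concepts."""
    body = {}
    for i, phrase in enumerate(noun_phrases, start=1):
        atoms = frozenset(term_to_concept[w]
                          for w in phrase.split() if w in term_to_concept)
        if atoms:
            body[i] = atoms
    return body

# Hypothetical term -> concept lexicon (sense-tagged terms).
lexicon = {"small": "small-1", "dog": "dog-2",
           "big": "big-1", "animal": "animal-3"}
doc = to_complex_concepts(["small dog", "big animal"], lexicon)
# doc enumerates {1: small-1 ∧ dog-2, 2: big-1 ∧ animal-3}
```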
17. Two-layered approach
• Relevance model
– Documents having more specific concepts than query
concepts considered relevant
• E.g., poodle-1 is relevant when searching for dog-2
– Ranking done by extending tf-idf relevance model
• Incorporates term-concept and concept-concept similarities also
• Distributed knowledge maintenance
– Each atomic concept indexed on DHT with id
– Node responsible for each atomic concept id also stores
ids of
• All immediate more specific atomic concepts
• All concepts in the path to root of the atomic concept
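The subsumption test behind this relevance model can be sketched by walking the concept hierarchy upward; the tiny `parent` map below is a made-up fragment (poodle-1 below dog-2 below animal-3):

```python
def subsumed_by(doc_concept, query_concept, parent):
    """True if doc_concept equals the query concept or is more
    specific, i.e., lies below it in the subsumption hierarchy."""
    c = doc_concept
    while c is not None:
        if c == query_concept:
            return True
        c = parent.get(c)  # step to the immediately more general concept
    return False

# Hypothetical hierarchy fragment: poodle-1 < dog-2 < animal-3.
parent = {"poodle-1": "dog-2", "dog-2": "animal-3"}
```

In the deck's distributed setting, the peer responsible for an atomic concept also stores its more specific children and its path to the root, so this walk can be resolved with DHT lookups.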
18. Two-layered approach
• Document indexing and search
– Concepts mapped to peer using DHT
– Query routed to peers responsible for the query concepts
and related concepts
– Results from multiple peers combined to give final results
• Global search
– The popularity (document frequency) of each concept
indexed in the upper layer
– Tf-idf extended with universal knowledge to search for
communities
– Combined score of doc = (score of community)*(score of
doc within community)
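The combined score formula above can be exercised end to end in a toy sketch; the popularity tables and overlap-count scoring below are simplifications of the tf-idf extensions the deck describes:

```python
def two_layer_search(query_concepts, comm_popularity, comm_indexes,
                     n_comms=2, k=10):
    # Stage 1 (global layer): score communities by the popularity of
    # the query concepts they advertised upward.
    comm_score = {c: sum(pop.get(q, 0) for q in query_concepts)
                  for c, pop in comm_popularity.items()}
    top = sorted(comm_score, key=comm_score.get, reverse=True)[:n_comms]
    # Stage 2 (community layer): search only the selected communities;
    # combined score = (score of community) * (score of doc within it).
    hits = []
    for c in top:
        for doc, concepts in comm_indexes[c].items():
            s = len(set(query_concepts) & set(concepts))
            if s:
                hits.append((doc, comm_score[c] * s))
    return sorted(hits, key=lambda h: -h[1])[:k]

# Hypothetical global-layer popularity and per-community indexes.
popularity = {"pets": {"dog-2": 5, "cat-1": 3}, "autos": {"car-1": 4}}
indexes = {"pets": {"d1": ["dog-2", "small-1"], "d2": ["cat-1"]},
           "autos": {"a1": ["car-1"]}}
```

Scoping stage 2 to the top communities is what saves bandwidth relative to querying every peer's posting lists.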
19. Experiments
• Single layer syntactic vs semantic: TREC ad-hoc,TREC8 (
simulated with 10,000 peers)
– Wordnet as knowledge base
• Single vs 2 layer
– 18 communities (doc: categories in dMoz*)
• 18*1000 = 18,000 peers simulated
–
–
–
–
–
UK = domain-independent concepts and relations from wordnet
BK = UK + wordnet domains + YAGO
BK mapped to communities
Queries selected as directory path to a specific subdirectory
Standard result: documents in that subdirectory
*http://www.dmoz.org/
20. Experiments
• Tools
– GATE (NLP), Lucene (search library), PeerSim (peer-to-peer system simulator)
• Performance metrics
– Quality
• Precision @10, precision @20
• Mean average precision, MAP
– Network bandwidth
• Average number of postings transferred
– Response time
• s-postings, s-hops
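The quality metrics listed above have standard definitions, which a short sketch makes precise (mean average precision, MAP, averages the per-query value below over all queries):

```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k ranked documents that are relevant."""
    return sum(1 for d in ranked[:k] if d in relevant) / k

def average_precision(ranked, relevant):
    """Average of precision@i over the ranks i at which a relevant
    document appears."""
    hits, total = 0, 0.0
    for i, d in enumerate(ranked, start=1):
        if d in relevant:
            hits += 1
            total += hits / i
    return total / len(relevant) if relevant else 0.0

# Toy example: relevant docs at ranks 1 and 3.
ranked = ["d1", "d2", "d3", "d4"]
relevant = {"d1", "d3"}
```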
21. Results (1 layer syntactic vs semantic)
• Quality improved
• But, cost also increased
22. Results (1 layer vs 2 layer)
• Quality improved
• Cost decreased
– 94% decrease in posting transfers for the optimized case
23. Two-layered approach results
• Proposed approach gives better quality and
performance over single-layered approaches
– Performance can be further improved using
optimizations like early termination
• But the issue of query spikes remains
24. Query spikes in peer-to-peer search
• Query spikes can lead to instability
– Replication/caching insufficient due to high
document creation rate*
* The rate of queries related to "Bin Laden" increased
10,000-fold within one hour on Google on May 1, 2011,
after Operation Geronimo.
25. Some background
• Term-partitioned search
– Term/popular query responsibility assigned to individual peers
• Updates and queries are sent to the responsible peer, which processes them
– Term-to-peer mapping done using a Distributed Hash
Table (DHT)
27. Issues in realizing CAPS (cloud-assisted peer-to-peer search)
• Maintaining full index copy in cloud is very
expensive
– Storage alone will cost more than 5 million dollars per
month*
• Approach: transfer only relevant index portion to
cloud
– Needs to be performed fast, considering the effect on user
experience (result quality, response time)
• Effect on the desirable properties of peer-to-peer
search
– Privacy, transparency, decentralized control etc.
28. CAPS components
• Switching decision maker
– Decide when to switch
– Simple e.g., “switch when query rate increases by
X% within last Y seconds”
• Switching implementor
– Switching algorithm to seamlessly transfer index
partition
– Dynamic creation of cloud instances
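The switching decision rule quoted above ("switch when the query rate increases by X% within the last Y seconds") can be sketched as a closure that compares the latest window against the one before it; the parameter values are placeholders:

```python
from collections import deque

def make_spike_detector(threshold_pct=200, window_secs=60):
    """Return a detector implementing the X%-within-Y-seconds rule:
    report a spike when the query count in the latest window exceeds
    the previous window's count by threshold_pct percent."""
    events = deque()

    def on_query(now):
        events.append(now)
        while events and events[0] < now - 2 * window_secs:
            events.popleft()  # keep only the last two windows
        recent = sum(1 for t in events if t >= now - window_secs)
        previous = len(events) - recent
        if previous == 0:
            return False  # no baseline to compare against
        return (recent - previous) / previous * 100 >= threshold_pct

    return on_query
```

A real decision maker would smooth over noise and hysteresis before triggering the (comparatively expensive) switch to cloud instances.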
29. CAPS Switching algorithm
• Ensures that result quality is not affected
• Controlled bandwidth usage at peer
30. Addressing additional concerns
• Transparency
– Index resides both among peers and cloud
• Centralized control
– Queries can be switched back to peers or other clouds
• Privacy
– Only spiking queries (less revealing) are forwarded to the
cloud
• Cost
– Cloud used only transiently for spiking queries
• Cloud payment model
– Anonymous keyword-based advertising model*
31. CAPS Evaluation
• Experimental setup
– Target system consists of millions of peers
– Implemented the relevant components in a
realistic network
• Responsible peer, preceding peers, cloud instance
• Datasets
– Real datasets on queries and corresponding update
rates not publicly available
– Used synthetic queries and updates with the expected
query/update rates and ratio
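Such a synthetic workload can be sketched as a per-second query-count trace with a multiplicative spike; the rates, spike factor, and Gaussian approximation of Poisson arrivals below are illustrative assumptions:

```python
import random

def synthetic_query_trace(base_rate, spike_factor, spike_start,
                          spike_len, duration, seed=42):
    """Per-second query counts: noisy baseline plus a multiplicative
    spike, standing in for the unavailable real spike traces."""
    rng = random.Random(seed)
    trace = []
    for t in range(duration):
        in_spike = spike_start <= t < spike_start + spike_len
        rate = base_rate * (spike_factor if in_spike else 1)
        # Gaussian approximation of Poisson arrivals at this rate.
        trace.append(max(0, round(rng.gauss(rate, rate ** 0.5))))
    return trace

# 100 s of traffic: 10 qps baseline, a 100x spike from t=50 to t=70.
trace = synthetic_query_trace(10, 100, spike_start=50, spike_len=20,
                              duration=100)
```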
33. Experiments
• Two sets of experiments
1. Demonstrate the effect of a query spike with and
without cloud assistance
2. Effect of switching on user experience
• Response time and result quality
• Switching time
37. Conclusions
• Peer-to-peer search has many advantages by
design compared to centralized search
• But, peer-to-peer search approaches have
scalability issues
• Two-layered approach to peer-to-peer search can
improve efficiency and result quality of peer-to-peer search
• Offloading queries to cloud can be an effective
method to handle query spikes
– Desirable properties of p2p systems not lost
38. Publications
• Janakiram Dharanipragada and Harisankar Haridas, “Stabilizing
peer-to-peer systems using public cloud: A case study of peer-to-peer
search”, in the 11th International Symposium on Parallel and
Distributed Computing (ISPDC 2012), Munich, Germany.
• Janakiram Dharanipragada, Fausto Giunchiglia, Harisankar Haridas
and Uladzimir Kharkevich, “Two-layered architecture for peer-to-peer
concept search”, in the 4th International Semantic Search Workshop,
co-located with the 20th Int. World Wide Web Conference (WWW 2011),
Hyderabad, India.
• Harisankar Haridas, Sriram Kailasam, Prateek Dhawalia, Prateek
Shrivastava, Santosh Kumar and Janakiram Dharanipragada, “Vcloud: A
Peer-to-peer Video Storage-Compute Cloud”, in the 21st International
ACM Symposium on High-Performance Parallel and Distributed
Computing (HPDC 2012), Delft, The Netherlands [Poster].