1. seekda's Web Service Search Engine
Nathalie Steinmetz
seekda GmbH
© Copyright 2012 SEEKDA GmbH – www.seekda.com
3. Motivation
“Web of services”
Growing amount of public services & data on the Web
Problem: How do I find the service I need?
General search engine: services hard to identify, not much information
on results page
Specific portals: access to restricted sets of registered and editorially
maintained services
Use semantic technologies for better search experience
No to heavy-weight, expressive semantic web service languages
such as OWL-S or WSML
Yes to simple light-weight semantic annotations in RDF
Scalability!
4. Outline
Web Service search engine - basics
Focused Crawling
WSDL-based services
Web APIs
seekda's search engine & experimental prototype
Crowdsourcing Web Service annotations
Web Service Annotation wizard
Amazon Mechanical Turk crowdsourcing
Service ontologies
5. Service Location
Locating Web Services on the Web (Approach adopted by
European projects Service-Finder & SOA4All)
Crawling the Web for services
Aggregate information
Annotate services
Supported services:
WSDL descriptions
Web APIs (a.k.a. RESTful services)
6. Service Crawler Architecture
[Architecture diagram; labels: Crawl Operator, Collecting Seeds, Configuration & Monitoring, Crawling, Post-Processing, RDF meta-data, Data, ARCs, Index]
7. Crawling the Web for Services
Basic crawling process:
Start with a set of seed URLs
Check whether a page should be fetched or not
Fetch the document the URL points to
Extract links from the fetched document
Decide whether or not to store fetched documents
Feed crawler queues with newly extracted links
Assign costs/priorities to single URLs and queues
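The steps above can be sketched as a small loop. This is a minimal illustration, not seekda's crawler: the names `crawl`, `fetch`, `should_fetch` and `should_store` are hypothetical, the queue is plain FIFO (cost and priority assignment is left out), and the fetch function is injected so the sketch runs without network access.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag


class LinkExtractor(HTMLParser):
    """Collects the href values of all <a> tags in a document."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seeds, fetch, should_fetch, should_store, max_pages=100):
    """Breadth-first crawl; `fetch(url)` returns HTML text or None."""
    frontier = deque(seeds)          # start with a set of seed URLs
    seen = set(seeds)
    stored = {}
    fetched = 0
    while frontier and fetched < max_pages:
        url = frontier.popleft()
        if not should_fetch(url):    # check whether to fetch the page
            continue
        html = fetch(url)            # fetch the document
        fetched += 1
        if html is None:
            continue
        if should_store(url, html):  # decide whether to store it
            stored[url] = html
        parser = LinkExtractor()     # extract links
        parser.feed(html)
        for href in parser.links:
            link, _fragment = urldefrag(urljoin(url, href))
            if link not in seen:     # feed the queue with new links
                seen.add(link)
                frontier.append(link)
    return stored


# Usage with an in-memory "Web" (hypothetical pages on example.org).
pages = {
    "http://example.org/": '<a href="/a.wsdl">svc</a> <a href="/about">about</a>',
    "http://example.org/a.wsdl": "<definitions/>",
    "http://example.org/about": "<p>About us</p>",
}
stored = crawl(
    seeds=["http://example.org/"],
    fetch=pages.get,
    should_fetch=lambda url: url.startswith("http://example.org/"),
    should_store=lambda url, html: url.endswith(".wsdl"),
)
```

Only the page whose URL ends in `.wsdl` ends up in `stored`; the other two are fetched for their links and then dropped.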
8. Focused Crawling Techniques
Seed Collection
Collecting seeds from specialized portals
Reuse known Web Service descriptions and related documents
URL Scheduling
Prioritize URLs so that the crawl focuses on the relevant part of the Web
Assign costs that influence the priority of a URL in a queue
Based on:
Building term vectors of pages to assess similarity to WS domain
URL characteristics
Queue Scheduling
One queue per host
Prioritize queues with low-cost URLs
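A possible reading of the scheduling rules, as a sketch: one heap per host, and the next URL is taken from the host whose cheapest URL has the lowest cost. The class `HostQueueScheduler`, the function `url_cost`, the keyword hints and the 0.5 bonus are all assumptions made for illustration; the slides do not give the actual cost function.

```python
import heapq
from collections import defaultdict
from urllib.parse import urlparse


class HostQueueScheduler:
    """One priority queue per host; lower cost means higher priority."""

    def __init__(self):
        self._queues = defaultdict(list)  # host -> heap of (cost, url)

    def add(self, url, cost):
        host = urlparse(url).netloc
        heapq.heappush(self._queues[host], (cost, url))

    def next_url(self):
        """Pop the cheapest URL from the queue with the lowest head cost."""
        heads = [(q[0][0], host) for host, q in self._queues.items() if q]
        if not heads:
            return None
        _cost, host = min(heads)
        return heapq.heappop(self._queues[host])[1]


def url_cost(url, page_similarity):
    """Hypothetical cost in [0, 1] built from the two inputs on the slide.

    page_similarity: similarity of the linking page's term vector to the
    Web Service domain, in [0, 1]. URL characteristics are approximated
    by a keyword check that lowers the cost.
    """
    cost = 1.0 - page_similarity
    hints = ("wsdl", "api", "service", "soap")
    if any(hint in url.lower() for hint in hints):
        cost -= 0.5
    return max(cost, 0.0)


scheduler = HostQueueScheduler()
for url, similarity in [
    ("http://a.example/page", 0.1),
    ("http://b.example/service?wsdl", 0.6),
    ("http://a.example/api/docs", 0.2),
]:
    scheduler.add(url, url_cost(url, similarity))
order = [scheduler.next_url() for _ in range(3)]
```

With these inputs the WSDL-looking URL on `b.example` is served first, then the API documentation page, then the generic page.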
9. Identify WSDLs and Related Information
WSDL identification
Check whether a fetched page is XML and valid WSDL
Related documents identification
Definition of related document
Inlink to the WSDL
Outlink from the WSDL
Associated by term vector similarity
Task split between crawl run-time and post-processing of the
crawl data
Task implies the deeper crawling of service provider domains
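A sketch of both checks, under stated assumptions. `looks_like_wsdl` only tests that a document is well-formed XML whose root is a WSDL 1.1 `definitions` or WSDL 2.0 `description` element; it does not validate against the WSDL schema. `term_vector` and `cosine_similarity` use raw term counts, which is one simple way to compare a candidate page with a WSDL-related text; the weighting actually used is not stated on the slides. All function names are hypothetical.

```python
import math
import re
import xml.etree.ElementTree as ET
from collections import Counter

WSDL_ROOTS = {
    "{http://schemas.xmlsoap.org/wsdl/}definitions",  # WSDL 1.1
    "{http://www.w3.org/ns/wsdl}description",         # WSDL 2.0
}


def looks_like_wsdl(document: bytes) -> bool:
    """True if the bytes parse as XML and the root is a WSDL root element."""
    try:
        root = ET.fromstring(document)
    except ET.ParseError:
        return False
    return root.tag in WSDL_ROOTS


def term_vector(text: str) -> Counter:
    """Bag of lower-cased alphanumeric terms with their counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two term vectors; 0.0 if either is empty."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)
```

A page would then count as related if it links to the WSDL, is linked from it, or its similarity to the WSDL's text exceeds some threshold chosen by the operator.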
10. Unique Service Objects
Building unique service objects
Collect all similar WSDLs (deduplication)
One service = all WSDLs with same provider and service
Example:
Unique Service: http://seekda.com/providers/cdyne.com/IP2Geo
Endpoint: http://ws.cdyne.com/ip2geo/ip2geo.asmx
Provider: cdyne.com
Service: IP2Geo
WSDLs:
http://ws.cdyne.com/ip2geo/ip2geo.asmx?wsdl
http://miki2005.uda.ad/p1net/Web%20References/com.cdyne.ws/ip2geo.wsdl
...
Create unique service identifiers:
http://seekda.com/providers/<providerName>/<serviceName>
Assemble related information
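The grouping can be sketched as follows. The identifier pattern is the one shown above; everything else is an assumption: the provider name is taken as the last two labels of the endpoint host (which breaks for hosts under suffixes such as `co.uk`), and the functions `provider_name`, `service_id` and `group_wsdls` are hypothetical.

```python
from collections import defaultdict
from urllib.parse import urlparse

BASE = "http://seekda.com/providers"


def provider_name(endpoint_url):
    """Assumption: provider = last two labels of the endpoint host."""
    host = urlparse(endpoint_url).hostname or ""
    return ".".join(host.split(".")[-2:])


def service_id(endpoint_url, service_name):
    """Build http://seekda.com/providers/<providerName>/<serviceName>."""
    return f"{BASE}/{provider_name(endpoint_url)}/{service_name}"


def group_wsdls(records):
    """Group WSDL URLs under one identifier per (provider, service).

    records: iterable of (wsdl_url, endpoint_url, service_name), where
    endpoint and service name were read from the WSDL itself.
    """
    services = defaultdict(list)
    for wsdl_url, endpoint_url, service_name in records:
        services[service_id(endpoint_url, service_name)].append(wsdl_url)
    return dict(services)


# The IP2Geo example: two copies of the WSDL, assumed to declare the same endpoint.
endpoint = "http://ws.cdyne.com/ip2geo/ip2geo.asmx"
services = group_wsdls([
    ("http://ws.cdyne.com/ip2geo/ip2geo.asmx?wsdl", endpoint, "IP2Geo"),
    ("http://miki2005.uda.ad/p1net/Web%20References/com.cdyne.ws/ip2geo.wsdl",
     endpoint, "IP2Geo"),
])
```

Both WSDL URLs end up under the single key `http://seekda.com/providers/cdyne.com/IP2Geo`.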
14. Why crawl for Web APIs?
Significant growth of Web APIs
> 5,400 Web APIs on ProgrammableWeb (including SOAP and
REST APIs) [end of 2009: ca. 1,500 Web APIs]
> 6,500 Mashups on ProgrammableWeb (combining Web APIs
from one or more sources)
SOAP services are only a small part of the overall available
public services
15. Web API – Example (1/3)
16. Web API – Example (2/3)
17. Web API – Example (3/3)
Problem:
Web APIs are described by regular HTML pages
No standardized structure that helps with the identification
18. Web API Identification
Solution: Crawl for Web APIs
Approach 1: Manual Feature Identification Approach
Taking into account HTML structure (e.g., title, mark-up), syntactical
properties of used language (e.g., camel-cased words), and link
properties of pages (ratio external links / internal links)
Approach 2: Automatic Classification Approach
Text Classification, supervised learning (Support Vector Machine
model)
Training set: APIs from ProgrammableWeb
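A sketch of the kind of features Approach 1 names: a title check, a count of camel-cased words, and the ratio of external to internal links. The feature names, the camel-case pattern and the "title mentions API" test are illustrative assumptions, not the feature set actually used. The resulting values could also serve as inputs to a classifier such as the one in Approach 2.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

# lowerCamelCase tokens such as getUserInfo (assumed pattern)
CAMEL_CASE = re.compile(r"\b[a-z]+(?:[A-Z][a-z0-9]+)+\b")


class PageFeatures(HTMLParser):
    """Collects the title, visible text and link counts of one page."""

    def __init__(self, page_url):
        super().__init__()
        self.page_url = page_url
        self.host = urlparse(page_url).netloc
        self.internal_links = 0
        self.external_links = 0
        self.title = ""
        self.text = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                target = urlparse(urljoin(self.page_url, href)).netloc
                if target == self.host:
                    self.internal_links += 1
                else:
                    self.external_links += 1

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        self.text.append(data)


def extract_features(page_url, html):
    parser = PageFeatures(page_url)
    parser.feed(html)
    parser.close()
    body = " ".join(parser.text)
    return {
        "title_mentions_api": "api" in parser.title.lower(),
        "camel_case_words": len(CAMEL_CASE.findall(body)),
        # max(..., 1) avoids division by zero on pages without internal links
        "external_to_internal_ratio":
            parser.external_links / max(parser.internal_links, 1),
    }


html = (
    "<html><head><title>Photo API Documentation</title></head><body>"
    "<p>Call getUserInfo or listPhotos to start.</p>"
    '<a href="/docs">Docs</a><a href="http://other.example/x">Partner</a>'
    "</body></html>"
)
features = extract_features("http://api.example.com/", html)
```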
19. Unique Service Objects – Web APIs
Create unique identifiers:
Again using the provider name (from the Web API homepage)
We do not know the service name, so a hash value of the URL is used instead
http://seekda.com/providers/<providerName>/<hashValueOfURL>
But: human confirmation is still needed to be sure
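A minimal sketch of such an identifier. The slides do not say which hash function is used or how the provider name is read from the homepage; SHA-1 over the URL and "last two labels of the host" are assumptions here, and `web_api_id` is a hypothetical name.

```python
import hashlib
from urllib.parse import urlparse


def web_api_id(homepage_url):
    """Build http://seekda.com/providers/<providerName>/<hashValueOfURL>."""
    host = urlparse(homepage_url).hostname or ""
    provider = ".".join(host.split(".")[-2:])  # assumed provider rule
    digest = hashlib.sha1(homepage_url.encode("utf-8")).hexdigest()  # assumed hash
    return f"http://seekda.com/providers/{provider}/{digest}"


api_id = web_api_id("http://developer.example.com/api/docs")
```

The same URL always maps to the same identifier, so a page that is crawled again does not create a second service object.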
21. Prototype – User Contributions
Web API – yes/no: confirmation from human needed!
Other annotations that help improve the search for Web Services
Categories
Tags
Natural Language descriptions
Cost: Free or paid service
22. Problem - User Contribution
Problem:
Users/developers don’t contribute enough
Hard to motivate them to provide annotations
Community recognition or peer respect not enough
Solution: crowdsourcing the annotations, pay people to
provide annotations
Use Amazon Mechanical Turk
Bootstrap annotations quickly and cheaply
27. Amazon Mechanical Turk – Iteration 1
Number of submissions: 70
Reward per task: $0.10
Restrictions: none
Annotation Wizard
Web API Yes/No
Assign a category
Assign tags
Provide a natural language description
Determine whether page is documentation, pricing or listing
Rate the service
28. Amazon Mechanical Turk – Iteration 1
Results
21 APIs correctly identified as APIs
28 Web documents (non APIs) identified correctly as non APIs
49/70 correctly identified (70% accuracy)
Average task completion time: 2:20 min
But, only:
4 well done & complete annotations
8 acceptable annotations (non complete)
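The figures above can be checked with a few lines of arithmetic. Treating the "well done" and "acceptable" annotations together as usable is an interpretation made here, not a figure from the slides.

```python
submissions = 70
correct_apis = 21       # APIs correctly identified as APIs
correct_non_apis = 28   # non-API pages correctly identified as non-APIs

accuracy = (correct_apis + correct_non_apis) / submissions  # 49 / 70

usable = 4 + 8          # complete plus acceptable annotations
usable_share = usable / submissions
```

`accuracy` comes out at 0.70, matching the slide, while `usable_share` is roughly 0.17: only about one submission in six carried an annotation worth keeping.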
29. Amazon Mechanical Turk – Iterations 2 & 3
Number of submissions: 100 (Iteration 2), 150 (Iteration 3)
Reward per task: $0.20 (both iterations)
Restrictions: yes (both iterations)
Annotation Wizard
Removed page type identification & service rating
For a task to be accepted:
At least one category must be assigned
At least 2 tags must be provided
A meaningful description must be provided
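The acceptance rules can be expressed as a simple check. "Meaningful" is a human judgement; the minimum word count below is only a stand-in for it, and the `Submission` fields and the function `is_acceptable` are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Submission:
    is_web_api: bool
    categories: list = field(default_factory=list)
    tags: list = field(default_factory=list)
    description: str = ""


def is_acceptable(sub, min_description_words=5):
    """Apply the three acceptance rules to one submission."""
    has_category = len(sub.categories) >= 1
    # count distinct, non-empty tags, ignoring case
    distinct_tags = {t.strip().lower() for t in sub.tags if t.strip()}
    has_tags = len(distinct_tags) >= 2
    # assumption: a word-count threshold approximates "meaningful"
    has_description = len(sub.description.split()) >= min_description_words
    return has_category and has_tags and has_description


good = Submission(True, ["Mapping"], ["geo", "ip"],
                  "Looks up the location of an IP address")
bad = Submission(True, ["Mapping"], ["geo", "Geo "], "IP lookup")
```

Here `good` passes all three rules, while `bad` fails on both the tag rule (one distinct tag) and the description rule.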
30. Amazon Mechanical Turk – Iterations 2 & 3
Results of Iterations 2 & 3:
Ca. 80% of documents correctly identified
Very satisfying annotations
Average completion time: 2:36 min
31. Amazon Mechanical Turk – Survey
48 survey submissions
Female 18, Male 30
Most popular origins: India (27) and USA (9)
Popular age groups:
15-22 (12)
23-30 (18)
31-50 (16)
Most of them worked in some IT profession
These workers provided the best-quality annotations
32. Amazon Mechanical Turk
Recommendations for further improvement:
Improve task description, especially ‘what is a Web API’
Better examples (e.g., hinting what makes a false page false)
Allow assignment of multiple categories
Restrict to workers in IT professions?
Conclusion:
Very positive results: a good way to get quality annotations
Results will help provide better search experience to users
Results can be used as positive set for automatic classification
34. Service Ontologies (2/2)
http://www.service-finder.eu/ontologies/ServiceCategories
35. Questions?