seekda's Web Service search engine

Nathalie Steinmetz
seekda GmbH

© Copyright 2012 SEEKDA GmbH – www.seekda.com

seekda Web Service Search Engine


Motivation

 “Web of services”
 Growing number of public services & data on the Web
 Problem: How do I find the service I need?
 General search engine: services hard to identify, not much information
on results page
 Specific portals: access to restricted sets of registered and editorially
maintained services
 Use semantic technologies for better search experience
 No to heavy-weight, expressive semantic web service languages
such as OWL-S or WSML
 Yes to simple light-weight semantic annotations in RDF
 → Scalability!

Outline

 Web Service search engine - basics
 Focused Crawling
 WSDL-based services
 Web APIs

 seekda's search engine & experimental prototype

 Crowdsourcing Web Service annotations
 Web Service Annotation wizard
 Amazon Mechanical Turk crowdsourcing

 Service ontologies


Service Location

 Locating Web Services on the Web (Approach adopted by
European projects Service-Finder & SOA4All)
 Crawling the Web for services
 Aggregate information
 Annotate services

 Supported services:
 WSDL descriptions
 Web APIs (a.k.a. RESTful services)


Service Crawler Architecture

Crawl Operator
Collecting Seeds
Configuration & Monitoring

Crawling

RDF
meta-data

Data
Post-Processing

ARCs Index


Crawling the Web for Services

 Basic crawling process:
 Start with a set of seed URLs
 Check whether a page should be fetched or not
 Fetch the document the URL points to
 Extract links from the fetched document
 Decide whether or not to store fetched documents
 Feed crawler queues with newly extracted links
 Assign costs/priorities to single URLs and queues
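
The steps above can be sketched as a small crawl loop. This is a minimal illustration, not seekda's crawler: `crawl`, `fetch`, `should_fetch`, and `should_store` are hypothetical names, fetching is passed in as a function so the loop can run without network access, and the frontier is a plain FIFO queue, so cost assignment is left out.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkExtractor(HTMLParser):
    """Collects the href targets of all <a> tags in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seeds, fetch, should_fetch, should_store, max_pages=100):
    """Run the basic crawl loop; `fetch(url)` returns page text or None."""
    frontier = deque(seeds)          # start with a set of seed URLs
    seen = set(seeds)
    stored = {}
    fetched = 0
    while frontier and fetched < max_pages:
        url = frontier.popleft()
        if not should_fetch(url):    # check whether the page should be fetched
            continue
        body = fetch(url)            # fetch the document the URL points to
        fetched += 1
        if body is None:
            continue
        parser = LinkExtractor()
        parser.feed(body)            # extract links from the fetched document
        if should_store(url, body):  # decide whether to store the document
            stored[url] = body
        for href in parser.links:    # feed the queue with newly extracted links
            link = urljoin(url, href)
            if urlparse(link).scheme in ("http", "https") and link not in seen:
                seen.add(link)
                frontier.append(link)
    return stored
```

In a real crawl, `fetch` would wrap an HTTP client and honor robots.txt and politeness delays; those concerns are deliberately omitted here.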


Focused Crawling Techniques

 Seed Collection
 Collecting seeds from specialized portals
 Reuse known Web Service descriptions and related documents
 URL Scheduling
 Prioritize URLs so that crawls focus on the relevant part of the Web
 Assign costs that influence the priority of a URL in a queue
 Based on:
 Building term vectors of pages to assess similarity to WS domain
 URL characteristics
 Queue Scheduling
 One queue per host
 Prioritize queues with low-cost URLs
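
A rough sketch of how cost-based URL scheduling and per-host queues could fit together. Everything here is an assumption for illustration: the domain vocabulary in `DOMAIN_VECTOR`, the weights in `url_cost`, the URL pattern, and the class name `HostQueues` are not taken from the slides.

```python
import heapq
import math
import re
from collections import Counter, defaultdict
from urllib.parse import urlparse

# Hypothetical vocabulary describing the Web Service domain.
DOMAIN_VECTOR = Counter("web service wsdl soap api endpoint operation".split())


def term_vector(text):
    """Bag-of-words term vector of lower-cased alphabetic tokens."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity of two sparse term vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def url_cost(url, source_page_text):
    """Lower cost means higher priority. The weights are illustrative only."""
    cost = 1.0 - cosine(term_vector(source_page_text), DOMAIN_VECTOR)
    if re.search(r"wsdl|asmx|service", url.lower()):
        cost -= 0.5  # URL characteristics hinting at a service
    return cost


class HostQueues:
    """One priority queue per host; the host with the cheapest head URL is served first."""

    def __init__(self):
        self.queues = defaultdict(list)

    def add(self, url, cost):
        heapq.heappush(self.queues[urlparse(url).netloc], (cost, url))

    def pop(self):
        if not self.queues:
            return None
        host = min(self.queues, key=lambda h: self.queues[h][0])
        _cost, url = heapq.heappop(self.queues[host])
        if not self.queues[host]:
            del self.queues[host]  # drop exhausted host queues
        return url
```

Keeping one queue per host also makes it straightforward to add per-host politeness delays later, which this sketch does not do.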


Identify WSDLs and Related Information

 WSDL identification
 Check whether a fetched page is XML and valid WSDL

 Related documents identification
 Definition of related document
 Inlink to the WSDL
 Outlink from the WSDL
 Associated by term vector similarity
 Task split between crawl run-time and post-processing of the
crawl data
 Task implies the deeper crawling of service provider domains


Unique Service Objects

 Building unique service objects
 Collect all similar WSDLs → deduplication
 One service = all WSDLs with same provider and service
 Example:
 Unique Service: http://seekda.com/providers/cdyne.com/IP2Geo
 Endpoint: http://ws.cdyne.com/ip2geo/ip2geo.asmx
 Provider: cdyne.com
 Service: IP2Geo
 WSDLs:
http://ws.cdyne.com/ip2geo/ip2geo.asmx?wsdl
http://miki2005.uda.ad/p1net/Web%20References/com.cdyne.ws/ip2geo.wsdl
...

 Create unique service identifiers:
 http://seekda.com/providers/<providerName>/<serviceName>
 Assemble related information
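
A minimal sketch of the deduplication idea, using the IP2Geo example from this slide. It assumes the provider and service names have already been extracted for each WSDL (the slides do not say how); `ServiceObject` and `build_service_objects` are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class ServiceObject:
    provider: str
    service: str
    wsdls: list = field(default_factory=list)

    @property
    def identifier(self):
        # Identifier pattern as given on the slide.
        return f"http://seekda.com/providers/{self.provider}/{self.service}"


def build_service_objects(records):
    """Group (provider, service, wsdl_url) records into unique services."""
    services = {}
    for provider, service, wsdl_url in records:
        obj = services.setdefault((provider, service), ServiceObject(provider, service))
        if wsdl_url not in obj.wsdls:  # skip exact duplicate WSDL URLs
            obj.wsdls.append(wsdl_url)
    return list(services.values())


records = [
    ("cdyne.com", "IP2Geo", "http://ws.cdyne.com/ip2geo/ip2geo.asmx?wsdl"),
    ("cdyne.com", "IP2Geo", "http://miki2005.uda.ad/p1net/Web%20References/com.cdyne.ws/ip2geo.wsdl"),
]
for svc in build_service_objects(records):
    print(svc.identifier, len(svc.wsdls))
```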

Search Results


Service Overview


seekda Web Service Search Engine


Why crawl for Web APIs?

 Significant growth of Web APIs
 > 5,400 Web APIs on ProgrammableWeb (including SOAP and
REST APIs) [end of 2009: ca. 1,500 Web APIs]
 > 6,500 Mashups on ProgrammableWeb (combining Web APIs
from one or more sources)
 SOAP services are only a small part of the overall available
public services


Web API – Example (1/3)

 Problem:
 Web APIs are described by regular HTML pages
 No standardized structure that helps with identification


Web API Identification

 Solution: Crawl for Web APIs

 Approach 1: Manual Feature Identification Approach
 Taking into account HTML structure (e.g., title, mark-up), syntactical
properties of used language (e.g., camel-cased words), and link
properties of pages (ratio external links / internal links)

 Approach 2: Automatic Classification Approach
 Text Classification, supervised learning (Support Vector Machine
model)
 Training set: APIs from ProgrammableWeb
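
To make Approach 1 more concrete, here is a sketch of extracting the kinds of features listed above from a page: a title hint, the share of camel-cased words, and the ratio of external to internal links. The specific features, the regular expression, and the names `PageFeatures` and `extract_features` are assumptions for illustration. Feature dictionaries like these could also serve as input to a supervised classifier as in Approach 2, which is not shown.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class PageFeatures(HTMLParser):
    """Collects title text, words, and link targets from an HTML page."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""
        self.words = []
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True
        elif tag == "a":
            self.hrefs += [v for k, v in attrs if k == "href" and v]

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data
        self.words += data.split()


# lowerCamelCase tokens such as getLocation or ipAddress
CAMEL_CASE = re.compile(r"^[a-z]+(?:[A-Z][a-z0-9]+)+$")


def extract_features(page_url, html):
    parser = PageFeatures()
    parser.feed(html)
    host = urlparse(page_url).netloc
    external = sum(1 for h in parser.hrefs if urlparse(urljoin(page_url, h)).netloc != host)
    internal = len(parser.hrefs) - external
    camel = sum(1 for w in parser.words if CAMEL_CASE.match(w.strip(".,;:()")))
    return {
        "title_mentions_api": "api" in parser.title.lower(),
        "camel_case_ratio": camel / len(parser.words) if parser.words else 0.0,
        "external_internal_ratio": external / internal if internal else float(external),
    }
```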


Unique Service Objects – Web APIs

 Create unique identifiers:
 Again using the provider name (from the Web API homepage)
 We do not know the service name → hash value of URL instead
 http://seekda.com/providers/<providerName>/<hashValueOfURL>

 But: human confirmation is still needed to be sure
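
A small sketch of building such an identifier. The slides do not name the hash function or say how the provider name is derived from the homepage, so SHA-1 and "host name without a leading www." are assumptions, as is the function name `web_api_identifier`.

```python
import hashlib
from urllib.parse import urlparse


def web_api_identifier(homepage_url):
    """Build a seekda-style identifier from a Web API homepage URL."""
    host = urlparse(homepage_url).netloc.lower()
    # Assumption: the provider name is the host without a leading "www."
    provider = host[4:] if host.startswith("www.") else host
    # Assumption: SHA-1 hex digest of the full URL stands in for the service name
    url_hash = hashlib.sha1(homepage_url.encode("utf-8")).hexdigest()
    return f"http://seekda.com/providers/{provider}/{url_hash}"
```

Because the hash depends on the exact URL string, two URLs that differ only in a trailing slash or query parameter would get different identifiers unless URLs are normalized first.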


New Search Engine Prototype


Prototype – User Contributions

 Web API – yes/no: confirmation from
human needed!
 Other annotations that help improve
the search for Web Services
 Categories
 Tags
 Natural Language descriptions
 Cost: Free or paid service


Problem - User Contribution

 Problem:
 Users/developers don’t contribute enough
 Hard to motivate them to provide annotations
 Community recognition or peer respect not enough
 Solution: crowdsourcing the annotations, pay people to
provide annotations
 Use Amazon Mechanical Turk
 Bootstrap annotations quickly and cheaply


Service Annotation Wizard (1/4)


Amazon Mechanical Turk – Iteration 1

Number of Submissions   70
Reward per task         $0.10
Restrictions            none

 Annotation Wizard
 Web API Yes/No
 Assign a category
 Assign tags
 Provide a natural language description
 Determine whether page is documentation, pricing or listing
 Rate the service


Amazon Mechanical Turk – Iteration 1

 Results
 21 APIs correctly identified as APIs
 28 Web documents (non APIs) identified correctly as non APIs
 49/70 correctly identified (70% accuracy)
 Average task completion time: 2:20 min
 But, only:
 4 well done & complete annotations
 8 acceptable annotations (incomplete)


Amazon Mechanical Turk – Iterations 2 & 3

                        Iteration 2   Iteration 3
Number of Submissions   100           150
Reward per task         $0.20         $0.20
Restrictions            yes           yes

 Annotation Wizard
 Removed page type identification & service rating
 For a task to be accepted:
 At least one category must be assigned
 At least 2 tags must be provided
 A meaningful description must be provided
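
The acceptance rules above could be checked automatically along these lines. This is a sketch under assumptions: the submission is modeled as a dictionary with hypothetical keys, and "meaningful description" is approximated by a minimum word count, which is only a crude stand-in for human review.

```python
def accept_submission(submission, min_tags=2, min_description_words=5):
    """Apply the acceptance rules to one worker submission (a dict)."""
    categories = submission.get("categories", [])
    # Normalize tags so that "ip" and "IP " count as one tag.
    tags = {t.strip().lower() for t in submission.get("tags", []) if t.strip()}
    description = submission.get("description", "").strip()
    return (
        len(categories) >= 1                                   # at least one category
        and len(tags) >= min_tags                              # at least 2 distinct tags
        and len(description.split()) >= min_description_words  # crude "meaningful" proxy
    )
```

For example, a submission with one category, the tags "ip" and "location", and a one-sentence description passes, while one whose description is the single word "api" is rejected.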


Amazon Mechanical Turk – Iterations 2 & 3

 Results Iteration 2 & 3:
 Ca. 80% of documents correctly identified
 Very satisfying annotations
 Average completion time: 2:36 min


Amazon Mechanical Turk – Survey

 48 survey submissions
 Female 18, Male 30
 Most popular origins: India (27) and USA (9)
 Popular age groups:
 15-22 (12)
 23-30 (18)
 31-50 (16)
 Most of them worked in some IT profession
 Provided best quality annotations


Amazon Mechanical Turk

 Recommendations for further improvement:
 Improve task description, especially ‘what is a Web API’
 Better examples (e.g., hinting what makes a false page false)
 Allow assignment of multiple categories
 Restrict to workers in IT professions?

 Conclusion:
 Very positive results → a good way to get quality annotations
 Results will help provide better search experience to users
 Results can be used as positive set for automatic classification


Service Ontologies (1/2)


Service Ontologies (2/2)

http://www.service-finder.eu/ontologies/ServiceCategories


Questions?
