seekda's Web Service Search Engine




                                                         Nathalie Steinmetz
                                                                 seekda GmbH




© Copyright 2012 SEEKDA GmbH – www.seekda.com
seekda Web Service Search Engine




Motivation

       “Web of services”
           Growing number of public services & volume of data on the Web
           Problem: How do I find the service I need?
              General search engine: services hard to identify, not much information
               on results page
              Specific portals: access to restricted sets of registered and editorially
               maintained services
       Use semantic technologies for better search experience
           No to heavy-weight, expressive semantic web service languages
            such as OWL-S or WSML
           Yes to simple light-weight semantic annotations in RDF
            Scalability!
Outline

       Web Service search engine - basics
           Focused Crawling
           WSDL-based services
           Web APIs

       seekda's search engine & experimental prototype

       Crowdsourcing Web Service annotations
           Web Service Annotation wizard
           Amazon Mechanical Turk crowdsourcing

       Service ontologies

Service Location

       Locating Web Services on the Web (approach adopted by the
        European projects Service-Finder & SOA4All)
           Crawl the Web for services
           Aggregate information
           Annotate services

       Supported services:
           WSDL descriptions
           Web APIs (a.k.a. RESTful services)




Service Crawler Architecture


       [Figure: service crawler architecture. Labels in the diagram: Crawl Operator
        (Configuration & Monitoring), Collecting Seeds, Crawling, Data Post-Processing,
        RDF meta-data, ARCs, Index]


Crawling the Web for Services

       Basic crawling process:
             Start with a set of seed URLs
             Check whether a page should be fetched or not
             Fetch the document the URL points to
             Extract links from the fetched document
             Decide whether or not to store fetched documents
             Feed crawler queues with newly extracted links
             Assign costs/priorities to single URLs and queues
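
The basic crawling process above can be sketched as a small loop over a priority queue. This is a minimal illustration, not seekda's crawler: the function names (`crawl`, `should_fetch`, `should_store`, `cost`), the regex-based link extraction, and the in-memory `TOY_WEB` are all assumptions made for the example.

```python
import heapq
import re
from typing import Callable, Dict, List, Optional, Set, Tuple

# Simplistic link extraction; a real crawler would use an HTML parser.
LINK_RE = re.compile(r'href="([^"]+)"')


def crawl(
    seeds: List[str],
    fetch: Callable[[str], Optional[str]],
    should_fetch: Callable[[str], bool],
    should_store: Callable[[str, str], bool],
    cost: Callable[[str], float],
    max_pages: int = 100,
) -> Dict[str, str]:
    """Run the basic crawl loop and return the stored documents keyed by URL."""
    # Start with a set of seed URLs, each with an assigned cost.
    frontier: List[Tuple[float, str]] = [(cost(u), u) for u in seeds]
    heapq.heapify(frontier)
    seen: Set[str] = set(seeds)
    stored: Dict[str, str] = {}
    fetched = 0
    while frontier and fetched < max_pages:
        _, url = heapq.heappop(frontier)      # cheapest URL first
        if not should_fetch(url):             # check whether to fetch
            continue
        doc = fetch(url)                      # fetch the document
        if doc is None:
            continue
        fetched += 1
        if should_store(url, doc):            # decide whether to store
            stored[url] = doc
        for link in LINK_RE.findall(doc):     # extract links, feed the queue
            if link not in seen:
                seen.add(link)
                heapq.heappush(frontier, (cost(link), link))
    return stored


# Hypothetical in-memory "web" so the sketch runs without network access.
TOY_WEB = {
    "http://a.example/": '<a href="http://a.example/service?wsdl">x</a> '
                         '<a href="http://b.example/">y</a>',
    "http://a.example/service?wsdl": "<definitions/>",
    "http://b.example/": "nothing here",
}

result = crawl(
    seeds=["http://a.example/"],
    fetch=TOY_WEB.get,
    should_fetch=lambda u: u.startswith("http://"),
    should_store=lambda u, d: "definitions" in d,
    cost=lambda u: 0.0 if "wsdl" in u.lower() else 1.0,
)
```

Passing the fetch, filter, and cost decisions in as functions keeps the loop itself independent of any particular focusing strategy.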




Focused Crawling Techniques

       Seed Collection
           Collecting seeds from specialized portals
           Reuse known Web Service descriptions and related documents
       URL Scheduling
           Prioritize URLs so that crawls focus on the relevant part of the Web
           Assign costs that influence the priority of a URL in a queue
           Based on:
              Building term vectors of pages to assess similarity to WS domain
              URL characteristics
       Queue Scheduling
           One queue per host
           Prioritize queues with low-cost URLs
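
As an illustration of URL and queue scheduling, the sketch below assigns a cost from term-vector similarity plus URL characteristics and then groups URLs into one queue per host. The reference vocabulary `WS_DOMAIN_TEXT`, the weights, and the URL pattern are assumptions chosen for the example; the slides do not specify them.

```python
import math
import re
from collections import Counter, defaultdict
from urllib.parse import urlparse

# Hypothetical reference vocabulary standing in for the "WS domain".
WS_DOMAIN_TEXT = "web service wsdl soap api endpoint operation request response"


def term_vector(text):
    """Bag-of-words term vector (term -> frequency)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two term vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def url_cost(url, source_page_text):
    """Lower cost means higher priority. The weights are illustrative only."""
    similarity = cosine(term_vector(source_page_text), term_vector(WS_DOMAIN_TEXT))
    cost = 1.0 - similarity
    if re.search(r"wsdl|asmx|/api|/services?", url.lower()):
        cost -= 0.5  # URL characteristics hinting at a service
    return max(cost, 0.0)


def schedule(urls_with_cost):
    """One queue per host; hosts are ordered by their cheapest pending URL."""
    queues = defaultdict(list)
    for url, cost in urls_with_cost:
        queues[urlparse(url).netloc].append((cost, url))
    for queue in queues.values():
        queue.sort()
    return sorted(queues.items(), key=lambda item: item[1][0][0])
```

Here the similarity is computed on the page a link was found on, which is one plausible reading of "term vectors of pages"; scoring the target page after fetching would be another.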

Identify WSDLs and Related Information

       WSDL identification
           Check whether a fetched page is XML and valid WSDL

       Related documents identification
           Definition of related document
              Inlink to the WSDL
              Outlink from the WSDL
              Associated by term vector similarity
           Task split between crawl run-time and post-processing of the
            crawl data
           Task implies the deeper crawling of service provider domains
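
A rough sketch of the WSDL identification step, assuming the check is done on the raw bytes of a fetched page. It only tests that the page parses as XML and that the root element is a WSDL 1.1 `definitions` or WSDL 2.0 `description` element; full schema validation, which "valid WSDL" may imply, is not attempted. The function name `looks_like_wsdl` is made up for this example.

```python
import xml.etree.ElementTree as ET

# Root elements of WSDL documents, in ElementTree's {namespace}tag notation.
WSDL_ROOTS = {
    "{http://schemas.xmlsoap.org/wsdl/}definitions",  # WSDL 1.1
    "{http://www.w3.org/ns/wsdl}description",         # WSDL 2.0
}


def looks_like_wsdl(document: bytes) -> bool:
    """Return True if the page is well-formed XML with a WSDL root element."""
    try:
        root = ET.fromstring(document)
    except ET.ParseError:
        return False  # not well-formed XML
    return root.tag in WSDL_ROOTS
```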

Unique Service Objects

       Building unique service objects
           Collect all similar WSDLs → deduplication
              One service = all WSDLs with same provider and service
              Example:
                   Unique Service: http://seekda.com/providers/cdyne.com/IP2Geo
                   Endpoint: http://ws.cdyne.com/ip2geo/ip2geo.asmx
                   Provider: cdyne.com
                   Service: IP2Geo
                   WSDLs:
                    http://ws.cdyne.com/ip2geo/ip2geo.asmx?wsdl
                    http://miki2005.uda.ad/p1net/Web%20References/com.cdyne.ws/ip2geo.wsdl
                    ...

       Create unique service identifiers:
           http://seekda.com/providers/<providerName>/<serviceName>
       Assemble related information
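
The deduplication rule (one service = all WSDLs with the same provider and service) can be sketched as a grouping step keyed by the service identifier. The sketch assumes provider and service names have already been extracted for each WSDL; how seekda derives them is not described here, and the function names are hypothetical.

```python
from collections import defaultdict
from urllib.parse import quote

BASE = "http://seekda.com/providers"


def service_id(provider: str, service: str) -> str:
    """Build an identifier of the form <BASE>/<providerName>/<serviceName>."""
    return f"{BASE}/{quote(provider, safe='')}/{quote(service, safe='')}"


def build_service_objects(wsdl_records):
    """Group WSDL URLs by service identifier.

    wsdl_records: iterable of (wsdl_url, provider, service) tuples.
    """
    grouped = defaultdict(list)
    for wsdl_url, provider, service in wsdl_records:
        key = service_id(provider.lower(), service)
        if wsdl_url not in grouped[key]:  # drop exact duplicate URLs
            grouped[key].append(wsdl_url)
    return dict(grouped)


# The two WSDL copies from the example above collapse into one service object.
objects = build_service_objects([
    ("http://ws.cdyne.com/ip2geo/ip2geo.asmx?wsdl", "cdyne.com", "IP2Geo"),
    ("http://miki2005.uda.ad/p1net/Web%20References/com.cdyne.ws/ip2geo.wsdl",
     "cdyne.com", "IP2Geo"),
])
```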
Search Results




Service Overview




Why crawl for Web APIs?

       Significant growth of Web APIs
           > 5,400 Web APIs on ProgrammableWeb (including SOAP and
            REST APIs) [end of 2009: ca. 1,500 Web APIs]
           > 6,500 Mashups on ProgrammableWeb (combining Web APIs
            from one or more sources)
       SOAP services are only a small part of the overall available
        public services




Web API – Example (1/3)




Web API – Example (2/3)




Web API – Example (3/3)

       Problem:
           Web APIs are described by regular HTML pages
           No standardized structure that helps with identification




Web API Identification

       Solution: Crawl for Web APIs

           Approach 1: Manual Feature Identification
              Takes into account HTML structure (e.g., title, mark-up), syntactic
               properties of the language used (e.g., camel-cased words), and link
               properties of pages (ratio of external to internal links)


           Approach 2: Automatic Classification
              Text classification with supervised learning (Support Vector Machine
               model)
              Training set: APIs from ProgrammableWeb
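
To make Approach 1 concrete, the sketch below extracts three of the feature types named above from an HTML page: a title cue, the number of camel-cased words, and the ratio of external to internal links. The feature names, the camel-case regex, and the "api in title" cue are assumptions for illustration; the actual feature set and thresholds are not given in the slides. Approach 2 (the SVM classifier) is not covered here.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urlparse

# Matches identifiers such as getLocation or resolveIpAddress.
CAMEL_CASE = re.compile(r"\b[a-z]+(?:[A-Z][a-z0-9]+)+\b")


class _PageScanner(HTMLParser):
    """Collects the title, visible text, and link targets of a page."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""
        self.text = []
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data
        self.text.append(data)


def extract_features(html: str, page_url: str) -> dict:
    """Compute a few hand-picked features for Web API page detection."""
    scanner = _PageScanner()
    scanner.feed(html)
    host = urlparse(page_url).netloc
    # Relative links have an empty netloc and count as internal.
    external = sum(1 for link in scanner.links
                   if urlparse(link).netloc not in ("", host))
    internal = len(scanner.links) - external
    text = " ".join(scanner.text)
    return {
        "title_mentions_api": "api" in scanner.title.lower(),
        "camel_case_words": len(CAMEL_CASE.findall(text)),
        "external_to_internal_links": external / internal if internal else float(external),
    }
```

A rule-based detector would then combine such features with manually chosen thresholds, which is where the manual effort of this approach lies.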


Unique Service Objects – Web APIs

       Create unique identifiers:
           Again using the provider name (from the Web API homepage)
           We do not know the service name → hash value of URL instead
           http://seekda.com/providers/<providerName>/<hashValueOfURL>

       But: human confirmation is still needed to be sure
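
A minimal sketch of the identifier scheme for Web APIs. The slides do not say which hash function is used or how the provider name is read from the homepage, so SHA-1 and the host-name heuristic below are assumptions, as is the function name `web_api_id`.

```python
import hashlib
from urllib.parse import urlparse


def web_api_id(api_homepage_url: str) -> str:
    """Build <base>/<providerName>/<hashValueOfURL> for a Web API page."""
    # Assumption: the provider name is the host without a leading "www.".
    provider = urlparse(api_homepage_url).netloc.lower()
    if provider.startswith("www."):
        provider = provider[4:]
    # Assumption: SHA-1 hex digest of the full URL as the hash value.
    digest = hashlib.sha1(api_homepage_url.encode("utf-8")).hexdigest()
    return f"http://seekda.com/providers/{provider}/{digest}"
```

Because the hash depends on the exact URL string, two URLs that differ only in a trailing slash or query parameter would yield different identifiers unless URLs are normalized first.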




New Search Engine Prototype




Prototype – User Contributions

       Web API – yes/no: human confirmation needed!
       Other annotations that help improve
        the search for Web Services
             Categories
             Tags
             Natural Language descriptions
             Cost: Free or paid service




Problem - User Contribution

       Problem:
           Users/developers don’t contribute enough
           Hard to motivate them to provide annotations
           Community recognition or peer respect not enough
       Solution: crowdsourcing the annotations, pay people to
        provide annotations
           Use Amazon Mechanical Turk
           Bootstrap annotations quickly and cheaply




Service Annotation Wizard (1/4)




Service Annotation Wizard (2/4)




Service Annotation Wizard (3/4)




Service Annotation Wizard (4/4)




Amazon Mechanical Turk – Iteration 1

                        Number of Submissions               70
                        Reward per task                    $0.10
                        Restrictions                        none

       Annotation Wizard
             Web API Yes/No
             Assign a category
             Assign tags
             Provide a natural language description
             Determine whether page is documentation, pricing or listing
             Rate the service


Amazon Mechanical Turk – Iteration 1

       Results
             21 APIs correctly identified as APIs
             28 Web documents (non-APIs) correctly identified as non-APIs
             49/70 correctly identified (70% accuracy)
             Average task completion time: 2:20 min
       But only:
           4 well-done, complete annotations
           8 acceptable annotations (incomplete)
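
The figures above can be rechecked with a few lines of arithmetic. The share of usable annotations assumes that the 4 complete and 8 acceptable annotations are counted out of the same 70 submissions, which the slide does not state explicitly.

```python
true_positives = 21   # APIs correctly identified as APIs
true_negatives = 28   # non-APIs correctly identified as non-APIs
submissions = 70

correct = true_positives + true_negatives
accuracy = correct / submissions

# Assumption: complete and acceptable annotations are out of all 70 submissions.
usable = 4 + 8
usable_share = usable / submissions

print(f"{correct}/{submissions} correctly identified = {accuracy:.0%}")
print(f"{usable}/{submissions} usable annotations = {usable_share:.0%}")
```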




Amazon Mechanical Turk – Iterations 2 & 3

                                                Iteration 2   Iteration 3
           Number of Submissions                   100           150
           Reward per task                        $0.20         $0.20
           Restrictions                            yes           yes


       Annotation Wizard
           Removed page type identification & service rating
           For a task to be accepted:
              At least one category must be assigned
              At least 2 tags must be provided
              A meaningful description must be provided
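
The acceptance rules for iterations 2 and 3 could be checked automatically along the following lines. This is a sketch: the `Submission` type and `rejection_reasons` function are hypothetical, and "meaningful description" is approximated by a minimum word count, which is an assumption and a crude stand-in for human judgment. Whether the rules also applied to pages marked as non-APIs is not stated; the sketch applies them to every submission.

```python
from dataclasses import dataclass, field
from typing import List

MIN_DESCRIPTION_WORDS = 5  # assumed proxy for a "meaningful" description


@dataclass
class Submission:
    is_web_api: bool
    categories: List[str] = field(default_factory=list)
    tags: List[str] = field(default_factory=list)
    description: str = ""


def rejection_reasons(s: Submission) -> List[str]:
    """Return the reasons a submission fails; an empty list means accepted."""
    reasons = []
    if len(s.categories) < 1:
        reasons.append("no category assigned")
    # Count distinct, non-empty tags so that repeated tags do not pass.
    if len({t.strip().lower() for t in s.tags if t.strip()}) < 2:
        reasons.append("fewer than 2 tags")
    if len(s.description.split()) < MIN_DESCRIPTION_WORDS:
        reasons.append("description too short")
    return reasons
```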


Amazon Mechanical Turk – Iterations 2 & 3

       Results Iterations 2 & 3:
           Ca. 80% of documents correctly identified
           Very satisfying annotations
           Average completion time: 2:36 min




Amazon Mechanical Turk – Survey

       48 survey submissions
           Female 18, Male 30
           Most popular origins: India (27) and USA (9)
           Popular age groups:
              15-22 (12)
              23-30 (18)
              31-50 (16)
           Most respondents worked in an IT profession
              They provided the best-quality annotations




Amazon Mechanical Turk

       Recommendations for further improvement:
             Improve task description, especially ‘what is a Web API’
             Better examples (e.g., hinting at why a non-API page is not a Web API)
             Allow assignment of multiple categories
             Restrict to workers in IT professions?

       Conclusion:
           Very positive results → good way to get quality annotations
           Results will help provide better search experience to users
           Results can be used as positive set for automatic classification


Service Ontologies (1/2)




Service Ontologies (2/2)




                                                http://www.service-finder.eu/ontologies/ServiceCategories

Questions?





Contenu connexe

Tendances

Tripit Ajaxworld V5
Tripit Ajaxworld V5Tripit Ajaxworld V5
Tripit Ajaxworld V5
rajivmordani
 
Golam Md. Enamul Haque
Golam Md. Enamul HaqueGolam Md. Enamul Haque
Golam Md. Enamul Haque
memasum13
 
Web 2 0 Data Visualization With Jsf
Web 2 0 Data Visualization With JsfWeb 2 0 Data Visualization With Jsf
Web 2 0 Data Visualization With Jsf
rajivmordani
 
Supporting architecture for office 365 spo
Supporting architecture for office 365 spoSupporting architecture for office 365 spo
Supporting architecture for office 365 spo
Jethro Seghers
 
Talk IT_ Oracle_임기성_110907
Talk IT_ Oracle_임기성_110907Talk IT_ Oracle_임기성_110907
Talk IT_ Oracle_임기성_110907
Cana Ko
 
Frank Mantek Google G Data
Frank Mantek Google G DataFrank Mantek Google G Data
Frank Mantek Google G Data
deimos
 

Tendances (19)

Tripit Ajaxworld V5
Tripit Ajaxworld V5Tripit Ajaxworld V5
Tripit Ajaxworld V5
 
Introducing SQL Server Data Services
Introducing SQL Server Data ServicesIntroducing SQL Server Data Services
Introducing SQL Server Data Services
 
Golam Md. Enamul Haque
Golam Md. Enamul HaqueGolam Md. Enamul Haque
Golam Md. Enamul Haque
 
Design a share point 2013 architecture – the basics
Design a share point 2013 architecture – the basicsDesign a share point 2013 architecture – the basics
Design a share point 2013 architecture – the basics
 
Web 2 0 Data Visualization With Jsf
Web 2 0 Data Visualization With JsfWeb 2 0 Data Visualization With Jsf
Web 2 0 Data Visualization With Jsf
 
Multiorg Collaboration Using Salesforce S2S
Multiorg Collaboration Using Salesforce S2SMultiorg Collaboration Using Salesforce S2S
Multiorg Collaboration Using Salesforce S2S
 
SharePoint 2010: ECM-ready?
SharePoint 2010: ECM-ready?SharePoint 2010: ECM-ready?
SharePoint 2010: ECM-ready?
 
Blaze Ds Slides
Blaze Ds SlidesBlaze Ds Slides
Blaze Ds Slides
 
LinkedIn Data Infrastructure (QCon London 2012)
LinkedIn Data Infrastructure (QCon London 2012)LinkedIn Data Infrastructure (QCon London 2012)
LinkedIn Data Infrastructure (QCon London 2012)
 
Supporting architecture for office 365 spo
Supporting architecture for office 365 spoSupporting architecture for office 365 spo
Supporting architecture for office 365 spo
 
Talk IT_ Oracle_임기성_110907
Talk IT_ Oracle_임기성_110907Talk IT_ Oracle_임기성_110907
Talk IT_ Oracle_임기성_110907
 
SQL Azure Federation and Scalability
SQL Azure Federation and ScalabilitySQL Azure Federation and Scalability
SQL Azure Federation and Scalability
 
Architectural changes in SharePoint 2013
Architectural changes in SharePoint 2013Architectural changes in SharePoint 2013
Architectural changes in SharePoint 2013
 
List of Top Local Databases used for react native app developement in 2022
List of Top Local Databases used for react native app developement in 2022					List of Top Local Databases used for react native app developement in 2022
List of Top Local Databases used for react native app developement in 2022
 
A Succesful WebCenter Upgrade: What You Need to Know
A Succesful WebCenter Upgrade: What You Need to KnowA Succesful WebCenter Upgrade: What You Need to Know
A Succesful WebCenter Upgrade: What You Need to Know
 
Sql azure database under the hood
Sql azure database under the hoodSql azure database under the hood
Sql azure database under the hood
 
Session 2 Integrating SharePoint 2010 and Windows Azure
Session 2   Integrating SharePoint 2010 and Windows AzureSession 2   Integrating SharePoint 2010 and Windows Azure
Session 2 Integrating SharePoint 2010 and Windows Azure
 
Frank Mantek Google G Data
Frank Mantek Google G DataFrank Mantek Google G Data
Frank Mantek Google G Data
 
All-inclusive insights on Building JavaScript microservices with Node!.pdf
All-inclusive insights on Building JavaScript microservices with Node!.pdfAll-inclusive insights on Building JavaScript microservices with Node!.pdf
All-inclusive insights on Building JavaScript microservices with Node!.pdf
 

Similaire à seekda's Web Service search engine

AAAI2012 - Crowd Sourcing Web Service Annotations
AAAI2012 - Crowd Sourcing Web Service AnnotationsAAAI2012 - Crowd Sourcing Web Service Annotations
AAAI2012 - Crowd Sourcing Web Service Annotations
INSEMTIVES project
 
Con8439 fusion apps customs to ebs
Con8439 fusion apps customs to ebsCon8439 fusion apps customs to ebs
Con8439 fusion apps customs to ebs
Berry Clemens
 
Fusion app integration_con8685_pdf_8685_0001
Fusion app integration_con8685_pdf_8685_0001Fusion app integration_con8685_pdf_8685_0001
Fusion app integration_con8685_pdf_8685_0001
jucaab
 
Understanding the WSO2 Platform and Technology
Understanding the WSO2 Platform and TechnologyUnderstanding the WSO2 Platform and Technology
Understanding the WSO2 Platform and Technology
WSO2
 

Similaire à seekda's Web Service search engine (20)

Crowd Sourcing Web Service Annotations
Crowd Sourcing Web Service AnnotationsCrowd Sourcing Web Service Annotations
Crowd Sourcing Web Service Annotations
 
AAAI2012 - Crowd Sourcing Web Service Annotations
AAAI2012 - Crowd Sourcing Web Service AnnotationsAAAI2012 - Crowd Sourcing Web Service Annotations
AAAI2012 - Crowd Sourcing Web Service Annotations
 
Oracle ADF Architecture TV - Design - Service Integration Architectures
Oracle ADF Architecture TV - Design - Service Integration ArchitecturesOracle ADF Architecture TV - Design - Service Integration Architectures
Oracle ADF Architecture TV - Design - Service Integration Architectures
 
W8/WP8 App Dev for SAP, Part 1B: Service Generation with NetWeaver Gateway Fr...
W8/WP8 App Dev for SAP, Part 1B: Service Generation with NetWeaver Gateway Fr...W8/WP8 App Dev for SAP, Part 1B: Service Generation with NetWeaver Gateway Fr...
W8/WP8 App Dev for SAP, Part 1B: Service Generation with NetWeaver Gateway Fr...
 
Autodesk Technical Webinar: SAP NetWeaver Gateway Part 1
Autodesk Technical Webinar: SAP NetWeaver Gateway Part 1Autodesk Technical Webinar: SAP NetWeaver Gateway Part 1
Autodesk Technical Webinar: SAP NetWeaver Gateway Part 1
 
OOW 2012: Integrate Cloud Applications with Oracle SOA Suite
OOW 2012: Integrate Cloud Applications with Oracle SOA SuiteOOW 2012: Integrate Cloud Applications with Oracle SOA Suite
OOW 2012: Integrate Cloud Applications with Oracle SOA Suite
 
W8/WP8 App Dev for SAP, Part 1A: Service Development with NetWeaver Gateway S...
W8/WP8 App Dev for SAP, Part 1A: Service Development with NetWeaver Gateway S...W8/WP8 App Dev for SAP, Part 1A: Service Development with NetWeaver Gateway S...
W8/WP8 App Dev for SAP, Part 1A: Service Development with NetWeaver Gateway S...
 
Standard Issue: Preparing for the Future of Data Management
Standard Issue: Preparing for the Future of Data ManagementStandard Issue: Preparing for the Future of Data Management
Standard Issue: Preparing for the Future of Data Management
 
Oracle ADF Architecture TV - Design - ADF Service Architectures
Oracle ADF Architecture TV - Design - ADF Service ArchitecturesOracle ADF Architecture TV - Design - ADF Service Architectures
Oracle ADF Architecture TV - Design - ADF Service Architectures
 
Seamless Integrations between WebCenter Content, Site Studio, and WebCenter S...
Seamless Integrations between WebCenter Content, Site Studio, and WebCenter S...Seamless Integrations between WebCenter Content, Site Studio, and WebCenter S...
Seamless Integrations between WebCenter Content, Site Studio, and WebCenter S...
 
Elevate MongoDB with ODBC/JDBC
Elevate MongoDB with ODBC/JDBCElevate MongoDB with ODBC/JDBC
Elevate MongoDB with ODBC/JDBC
 
Con8439 fusion apps customs to ebs
Con8439 fusion apps customs to ebsCon8439 fusion apps customs to ebs
Con8439 fusion apps customs to ebs
 
黑豹 ch4 ddd pattern practice (2)
黑豹 ch4 ddd pattern practice (2)黑豹 ch4 ddd pattern practice (2)
黑豹 ch4 ddd pattern practice (2)
 
Application development using Zend Framework
Application development using Zend FrameworkApplication development using Zend Framework
Application development using Zend Framework
 
Fusion app integration_con8685_pdf_8685_0001
Fusion app integration_con8685_pdf_8685_0001Fusion app integration_con8685_pdf_8685_0001
Fusion app integration_con8685_pdf_8685_0001
 
From Requirements Management to Release with Git for Android System
From Requirements Management to Release with Git for Android System From Requirements Management to Release with Git for Android System
From Requirements Management to Release with Git for Android System
 
No SQL at The Guardian
No SQL at The GuardianNo SQL at The Guardian
No SQL at The Guardian
 
Administration for Oracle ADF Applications
Administration for Oracle ADF ApplicationsAdministration for Oracle ADF Applications
Administration for Oracle ADF Applications
 
Administration von ADF Anwendungen
Administration von ADF AnwendungenAdministration von ADF Anwendungen
Administration von ADF Anwendungen
 
Understanding the WSO2 Platform and Technology
Understanding the WSO2 Platform and TechnologyUnderstanding the WSO2 Platform and Technology
Understanding the WSO2 Platform and Technology
 

Dernier

Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
WSO2
 

Dernier (20)

Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot ModelNavi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot Model
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectors
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
 
Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024
 

seekda's Web Service search engine

  • 1. seekda‘s Web Service Search Engine Nathalie Steinmetz seekda GmbH 1 © Copyright 2012 SEEKDA GmbH – www.seekda.com
  • 2. seekda Web Service Search Engine 2 © Copyright 2012 SEEKDA GmbH – www.seekda.com
  • 3. Motivation  “Web of services”  Growing amount of public services & data on the Web  Problem: How do I find the service I need?  General search engine: services hard to identify, not much information on results page  Specific portals: access to restricted sets of registered and editorially maintained services  Use semantic technologies for better search experience  No to heavy-weight, expressive semantic web service languages such as OWL-S or WSML  Yes to simple light-weight semantic annotations in RDF   Scalability! 3 © Copyright 2012 SEEKDA GmbH – www.seekda.com
  • 4. Outline  Web Service search engine - basics  Focused Crawling  WSDL-based services  Web APIs  Seekda‘s search engine & experimental prototype  Crowdsourcing Web Service annotations  Web Service Annotation wizard  Amazon Mechanical Turk crowdsourcing  Service ontologies © Copyright 2012 SEEKDA GmbH – www.seekda.com
  • 5. Service Location  Locating Web Services on the Web (Approach adopted by European projects Service-Finder & SOA4All)  Crawling the Web for services  Aggregate information  Annotate services  Supported services:  WSDL descriptions  Web APIs (a.k.a. RESTful services) 5 © Copyright 2012 SEEKDA GmbH – www.seekda.com
  • 6. Service Crawler Architecture Crawl Operator Collecting Seeds Configuration & Monitoring Crawling RDF meta-data Data Post-Processing ARCs Index 6 © Copyright 2012 SEEKDA GmbH – www.seekda.com
  • 7. Crawling the Web for Services  Basic crawling process:  Start with a set of seed URLs  Check whether a page should be fetched or not  Fetch the document the URL points to  Extract links from the fetched document  Decide whether or not to store fetched documents  Feed crawler queues with newly extracted links  Assign costs/priorities to single URLs and queues 7 © Copyright 2012 SEEKDA GmbH – www.seekda.com
  • 8. Focused Crawling Techniques  Seed Collection  Collecting seeds from specialized portals  Reuse known Web Service descriptions and related documents  URL Scheduling  Use clever means to prioritize URLs to focus the crawls to the relevant part of the Web  Assign costs that influence the priority of a URL in a queue  Based on:  Building term vectors of pages to assess similarity to WS domain  URL characteristics  Queue Scheduling  One queue per host  Prioritize queues with low-cost URLs 8 © Copyright 2012 SEEKDA GmbH – www.seekda.com
  • 9. Identify WSDLs and Related Information  WSDL identification  Check whether a fetched page is XML and valid WSDL  Related documents identification  Definition of related document  Inlink to the WSDL  Outlink from the WSDL  Associated by term vector similarity  Task split between crawl run-time and post-processing of the crawl data  Task implies the deeper crawling of service provider domains 9 © Copyright 2012 SEEKDA GmbH – www.seekda.com
  • 10. Unique Service Objects  Building unique service objects  Collect all similar WSDLs  deduplication  One service = all WSDLs with same provider and service  Example:  Unique Service: http://seekda.com/providers/cdyne.com/IP2Geo  Endpoint: http://ws.cdyne.com/ip2geo/ip2geo.asmx  Provider: cdyne.com  Service: IP2Geo  WSDLs: http://ws.cdyne.com/ip2geo/ip2geo.asmx?wsdl http://miki2005.uda.ad/p1net/Web%20References/com.cdyne.ws/ip2geo.wsdl ...  Create uniqe service identifiers:  http://seekda.com/providers/<providerName>/<serviceName>  Assemble related information 10 © Copyright 2012 SEEKDA GmbH – www.seekda.com
  • 11. Search Results
  • 12. Service Overview
  • 13. seekda Web Service Search Engine
  • 14. Why crawl for Web APIs?
      • Significant growth of Web APIs
          • > 5,400 Web APIs on ProgrammableWeb (including SOAP and REST APIs) [end of 2009: ca. 1,500 Web APIs]
          • > 6,500 Mashups on ProgrammableWeb (combining Web APIs from one or more sources)
      • SOAP services are only a small part of the overall available public services
  • 15. Web API – Example (1/3)
  • 16. Web API – Example (2/3)
  • 17. Web API – Example (3/3)
      • Problem:
          • Web APIs are described by regular HTML pages
          • There is no standardized structure that would help identify them
  • 18. Web API Identification
      • Solution: Crawl for Web APIs
      • Approach 1: Manual Feature Identification
          • Takes into account the HTML structure (e.g., title, mark-up), syntactical properties of the language used (e.g., camel-cased words), and link properties of pages (ratio of external to internal links)
      • Approach 2: Automatic Classification
          • Text classification with supervised learning (Support Vector Machine model)
          • Training set: APIs from ProgrammableWeb
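The features named under Approach 1 on slide 18 could be extracted roughly as below. This is a sketch under assumptions: the concrete feature names, the camel-case regular expression, and the "title mentions api" test are illustrative choices, not the features seekda used.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

# Lower camel case such as getLocation or resolveIpAddress.
CAMEL_CASE = re.compile(r"\b[a-z]+(?:[A-Z][a-z0-9]+)+\b")


class PageFeatures(HTMLParser):
    """Collects title text, body text, and internal/external link counts."""

    def __init__(self, page_url):
        super().__init__()
        self.page_url = page_url
        self.host = urlparse(page_url).netloc
        self.in_title = False
        self.title = ""
        self.text = []
        self.internal = 0
        self.external = 0

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                target = urlparse(urljoin(self.page_url, href)).netloc
                if target == self.host:
                    self.internal += 1
                else:
                    self.external += 1

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data
        self.text.append(data)


def extract_features(page_url, html):
    """Return a small feature dictionary for one HTML page."""
    parser = PageFeatures(page_url)
    parser.feed(html)
    parser.close()
    body = " ".join(parser.text)
    return {
        "title_mentions_api": "api" in parser.title.lower(),
        "camel_case_words": len(CAMEL_CASE.findall(body)),
        "external_to_internal_links": parser.external / max(parser.internal, 1),
    }
```

A feature dictionary like this could feed either hand-written rules (Approach 1) or a trained classifier (Approach 2).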
  • 19. Unique Service Objects – Web APIs
      • Create unique identifiers:
          • Again using the provider name (from the Web API homepage)
          • The service name is not known, so a hash value of the URL is used instead
          • http://seekda.com/providers/<providerName>/<hashValueOfURL>
      • But: human confirmation is still needed to be sure
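The identifier scheme on slide 19 might look like this in code. The hash function (SHA-1) and the derivation of the provider name from the URL host are assumptions; the slides only state that a hash value of the URL replaces the unknown service name.

```python
import hashlib
from urllib.parse import urlparse


def web_api_id(api_url):
    """Build <base>/<providerName>/<hashValueOfURL> for a Web API page."""
    provider = urlparse(api_url).netloc.lower()
    if provider.startswith("www."):  # assumed normalization of the host
        provider = provider[4:]
    # SHA-1 is an illustrative choice; any stable hash of the URL would do.
    digest = hashlib.sha1(api_url.encode("utf-8")).hexdigest()
    return f"http://seekda.com/providers/{provider}/{digest}"
```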
  • 20. New Search Engine Prototype
  • 21. Prototype – User Contributions
      • Web API – yes/no: confirmation from human needed!
      • Other annotations that help improve the search for Web Services
          • Categories
          • Tags
          • Natural Language descriptions
          • Cost: Free or paid service
  • 22. Problem - User Contribution
      • Problem:
          • Users/developers don’t contribute enough
          • Hard to motivate them to provide annotations
          • Community recognition or peer respect is not enough
      • Solution: crowdsource the annotations and pay people to provide them
          • Use Amazon Mechanical Turk
          • Bootstrap annotations quickly and cheaply
  • 23. Service Annotation Wizard (1/4)
  • 24. Service Annotation Wizard (2/4)
  • 25. Service Annotation Wizard (3/4)
  • 26. Service Annotation Wizard (4/4)
  • 27. Amazon Mechanical Turk – Iteration 1
      • Setup:
          • Number of Submissions: 70
          • Reward per task: $0.10
          • Restrictions: none
      • Annotation Wizard
          • Web API Yes/No
          • Assign a category
          • Assign tags
          • Provide a natural language description
          • Determine whether page is documentation, pricing or listing
          • Rate the service
  • 28. Amazon Mechanical Turk – Iteration 1
      • Results
          • 21 APIs correctly identified as APIs
          • 28 Web documents (non-APIs) correctly identified as non-APIs
          • 49/70 correctly identified (70% accuracy)
          • Average task completion time: 2:20 min
      • But, only:
          • 4 well-done and complete annotations
          • 8 acceptable (incomplete) annotations
  • 29. Amazon Mechanical Turk – Iterations 2 & 3
      • Setup:
                                   Iteration 2    Iteration 3
          Number of Submissions    100            150
          Reward per task          $0.20          $0.20
          Restrictions             yes            yes
      • Annotation Wizard
          • Removed page type identification & service rating
      • For a task to be accepted:
          • At least one category must be assigned
          • At least 2 tags must be provided
          • A meaningful description must be provided
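The acceptance rules on slide 29 can be expressed as a simple check. The slides do not define what counts as a "meaningful" description, so the minimum word count used here (`min_words`) is purely an assumed stand-in, and `accept_submission` is a hypothetical function name.

```python
def accept_submission(categories, tags, description, min_words=5):
    """Apply the three acceptance rules to one worker submission."""
    has_category = any(c.strip() for c in categories)  # at least one category
    distinct_tags = {t.strip().lower() for t in tags if t.strip()}
    enough_tags = len(distinct_tags) >= 2  # at least 2 tags
    # Assumed proxy for "meaningful": a minimum number of words.
    meaningful = len(description.split()) >= min_words
    return has_category and enough_tags and meaningful
```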
  • 30. Amazon Mechanical Turk – Iterations 2 & 3
      • Results of iterations 2 & 3:
          • Ca. 80% of documents correctly identified
          • Very satisfying annotations
          • Average completion time: 2:36 min
  • 31. Amazon Mechanical Turk – Survey
      • 48 survey submissions
          • Female 18, Male 30
          • Most popular origins: India (27) and USA (9)
          • Popular age groups:
              • 15-22 (12)
              • 23-30 (18)
              • 31-50 (16)
          • Most of them worked in some IT profession
              • These workers provided the best-quality annotations
  • 32. Amazon Mechanical Turk
      • Recommendations for further improvement:
          • Improve task description, especially ‘what is a Web API’
          • Better examples (e.g., hinting what makes a false page false)
          • Allow assignment of multiple categories
          • Restrict to workers in IT professions?
      • Conclusion:
          • Very positive results: crowdsourcing is a good way to get quality annotations
          • Results will help provide a better search experience to users
          • Results can be used as a positive set for automatic classification
  • 33. Service Ontologies (1/2)
  • 34. Service Ontologies (2/2) http://www.service-finder.eu/ontologies/ServiceCategories
  • 35. Questions?