CLEF-IP 2009: Retrieval Experiments in the
      Intellectual Property Domain

                 Giovanna Roda
                   Matrixware
                 Vienna, Austria



   CLEF 2009 / 30 September – 2 October 2009
Previous work on patent retrieval

  CLEF-IP 2009 is the first track on patent retrieval at CLEF1
  (Cross-Language Evaluation Forum).

  Previous work on patent retrieval:
         ACM SIGIR 2000 Workshop
         NTCIR workshop series since 2001, primarily targeting
         Japanese patents:
             ad-hoc task (goal: find patents on a given topic)
             invalidity search (goal: find patents invalidating a given claim)
             patent classification according to the F-term system

    1
        http://www.clef-campaign.org
Legal and economic implications of patent search

      patents are legal documents
      patent portfolios are assets for enterprises
      a single patent search can be worth several days of work

  High-recall searches
  Missing even a single relevant document can have severe financial
  and economic consequences, for example when a granted patent is
  invalidated because of a document that was overlooked at
  application time.
CLEF-IP 2009: the task

  The main task in the CLEF-IP track was to find prior art for a
  given patent.

  Prior art search
  Prior art search consists of identifying all information (including
  non-patent literature) that might be relevant to a patent’s claim of
  novelty.
Prior art search

  The most common type of patent search. It is performed at
  various stages of the patent life-cycle and with different intentions:

      before filing a patent application (novelty search or
      patentability search, to determine whether the invention fulfills
      the requirements of
           novelty
           inventive step)
      before grant: the results of this search constitute the search
      report attached to the patent document
      invalidity search: a post-grant search used to unveil prior art
      that invalidates a patent’s claims of originality
The patent search problem

  Some noteworthy facts about patent search:
      patentese: the language used in patents is not natural language
      patents are linked (by citations, applicants, inventors,
      priorities, ...)
      classification information is available (IPC, ECLA)
Outline

  1   Introduction
         Previous work on patent retrieval
         The patent search problem
         CLEF-IP: the task

  2   The CLEF-IP Patent Test Collection
        Target data
        Topics
        Relevance assessments

  3   Participants

  4   Results

  5   Lessons Learned and Plans for 2010

  6   Epilogue
The CLEF-IP Patent Test Collection

  The CLEF-IP collection comprises
      target data: 1.9 million patent documents pertaining to 1
      million patents (75 GB)
      10,000 topics
      relevance assessments (with an average of 6.23 relevant
      documents per topic)

  Target data and topics are multilingual: they contain fields in
  English, German, and French.
Patent documents

  The data was provided by Matrixware in a standardized XML
  format for patent data (the Alexandria XML schema).
Looking at a patent document

  [Image-only slides showing a sample patent document: the
  description field and the claims field, each available in German,
  English, and French.]
Topics

  The task for the CLEF-IP track was to find prior art for a given
  patent.
  But:
      patents come in several versions corresponding to the different
      stages of the patent’s life-cycle
      not all versions of a patent contain all fields
Topics




         How to represent a patent topic?
Topics

  We assembled a “virtual patent topic” file by
      taking the B1 document (the granted patent)
      adding missing fields from the most recent document in which
      they appeared
  (A sketch of this merge is shown below.)
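
  This is a minimal sketch only: each patent version is assumed to be
  a dict with a `kind` code (e.g. "B1"), a publication `date`, and a
  `fields` mapping, which is an illustrative data model rather than
  the track's actual one.

```python
# Sketch of the "virtual patent topic" assembly described above.
# Data shapes are assumptions made for illustration only.

def build_virtual_topic(versions):
    """Merge a patent's versions into a single topic document:
    start from the B1 (granted) document, then fill each missing
    field from the most recent version in which it appears."""
    b1 = next(v for v in versions if v["kind"] == "B1")
    topic = dict(b1["fields"])
    # Newest first, so the first version carrying a field wins.
    for version in sorted(versions, key=lambda v: v["date"], reverse=True):
        for field, content in version["fields"].items():
            topic.setdefault(field, content)
    return topic

versions = [
    {"kind": "A1", "date": "1999-03-01",
     "fields": {"abstract": "...", "description": "..."}},
    {"kind": "B1", "date": "2001-07-15",
     "fields": {"title": "...", "claims": "..."}},
]
topic = build_virtual_topic(versions)  # title/claims from B1, rest from A1
```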
Criteria for topic selection

  Patents to be used as topics were selected according to the
  following criteria (see the filter sketch after this list):
    1   availability of a granted patent
    2   full-text description available
    3   at least three citations
    4   at least one highly relevant citation
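
  Hypothetically, the selection could be expressed as the filter
  below; `collection` and the predicate names are assumptions, not
  the actual tooling used for the track.

```python
# Hypothetical filter implementing the four criteria above.

def is_topic_candidate(patent):
    return (
        patent.has_granted_version()                  # 1. granted patent available
        and patent.has_full_text_description()        # 2. full-text description
        and len(patent.citations) >= 3                # 3. at least three citations
        and any(c.highly_relevant for c in patent.citations)  # 4. >= 1 highly relevant
    )

topic_pool = [p for p in collection if is_topic_candidate(p)]
```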
Relevance assessments

  We used patents cited as prior art as relevance assessments.

  Sources of citations:
    1   applicant’s disclosure: the USPTO requires applicants to
        disclose all known relevant publications
    2   patent office search report: each patent office does a search
        for prior art to judge the novelty of a patent
    3   opposition procedures: patents cited to prove that a granted
        patent is not novel
Extended citations as relevance assessments

  [Figure: a seed patent with its direct citations P1, P2, P3 and
  the family members of each cited patent.]

  direct citations and their families
Extended citations as relevance assessments

  [Figure: the seed patent’s family members Q1–Q4 and the patents
  they cite.]

  direct citations of family members ...
Extended citations as relevance assessments

  [Figure: the citations of the seed’s family members (Q11, Q12,
  Q13, ...) expanded with their own family members.]

  ... and their families
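
  The expansion illustrated by these three figures can be summarized
  in a short sketch; `citations_of` and `family_of` are assumed
  lookup functions, not part of the released collection.

```python
# Sketch of the extended-citation expansion: direct citations of the
# seed and of its family members, plus the families of all cited
# patents. The lookup functions are assumptions for illustration.

def extended_citations(seed, citations_of, family_of):
    cited = set(citations_of(seed))
    for member in family_of(seed):       # citations of family members ...
        cited.update(citations_of(member))
    relevant = set(cited)
    for patent in cited:                 # ... and their families
        relevant.update(family_of(patent))
    relevant.discard(seed)
    return relevant
```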
Patent families

  A patent family consists of patents granted by different patent
  authorities but related to the same invention.
  simple family: all family members share the same priority number
  extended family: there are several definitions; in the INPADOC
              database, all documents which are directly or
              indirectly linked via a priority number belong to the
              same family
  (A sketch contrasting the two definitions follows.)
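
  This is an illustrative sketch only, assuming each document comes
  with its list of priority numbers; the INPADOC-style extended
  family is computed as a transitive closure with a small union-find.

```python
# Simple families: one reading of "share the same priority number" is
# grouping documents with identical priority sets. Extended families:
# transitive closure of priority links, via union-find. Data shapes
# are assumptions for illustration.

from collections import defaultdict

def simple_families(docs):
    """docs: iterable of (doc_id, priority_numbers)."""
    groups = defaultdict(set)
    for doc_id, priorities in docs:
        groups[frozenset(priorities)].add(doc_id)
    return list(groups.values())

def extended_families(docs):
    docs = list(docs)
    parent = {}
    def find(x):
        while parent.setdefault(x, x) != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x
    def union(a, b):
        parent[find(a)] = find(b)
    for doc_id, priorities in docs:
        for prio in priorities:
            union(doc_id, ("prio", prio))  # link each doc to its priorities
    groups = defaultdict(set)
    for doc_id, _ in docs:
        groups[find(doc_id)].add(doc_id)
    return list(groups.values())
```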
Patent families

  [Figure: patent documents linked by priorities; the INPADOC
  family groups everything connected by these links.]

  CLEF-IP uses simple families.
Participants

  [Map of participating countries]
      15 participants: CH (3), DE (3), NL (2), ES (2), UK, SE, FI,
      IE, RO
      48 runs for the main task
      10 runs for the language tasks
Participants

    1 Tech. Univ. Darmstadt, Dept. of CS,
      Ubiquitous Knowledge Processing Lab (DE)
    2 Univ. Neuchatel - Computer Science (CH)
    3 Santiago de Compostela Univ. - Dept.
      Electronica y Computacion (ES)
    4 University of Tampere - Info Studies (FI)
    5 Interactive Media and Swedish Institute of
      Computer Science (SE)
    6 Geneva Univ. - Centre Universitaire
      d’Informatique (CH)
    7 Glasgow Univ. - IR Group Keith (UK)
    8 Centrum Wiskunde & Informatica - Interactive
      Information Access (NL)
Participants

    9 Geneva Univ. Hospitals - Service of Medical
      Informatics (CH)
   10 Humboldt Univ. - Dept. of German Language
      and Linguistics (DE)
   11 Dublin City Univ. - School of Computing (IE)
   12 Radboud Univ. Nijmegen - Centre for Language
      Studies & Speech Technologies (NL)
   13 Hildesheim Univ. - Information Systems &
      Machine Learning Lab (DE)
   14 Technical Univ. Valencia - Natural Language
      Engineering (ES)
   15 Al. I. Cuza University of Iasi - Natural Language
      Processing (RO)
Upload of experiments

  A system based on Alfresco2 together with a Docasu3 web
  interface was developed.
  The main features of this system are:
         user authentication
         format checks on run files (a hypothetical check is
         sketched below)
         revision control

    2
        http://www.alfresco.com/
    3
        http://docasu.sourceforge.net/
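
  Such a check might resemble the validator below for TREC-style run
  files; the six-column layout is an assumption for illustration, not
  the track's actual submission specification.

```python
# Hypothetical run-file format check, assuming lines of the form:
#   topic_id Q0 doc_id rank score run_tag

def check_run_file(path, max_rank=1000):
    errors = []
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            cols = line.split()
            if len(cols) != 6:
                errors.append(f"line {lineno}: expected 6 columns, got {len(cols)}")
                continue
            topic, q0, doc, rank, score, tag = cols
            if q0 != "Q0":
                errors.append(f"line {lineno}: second column must be 'Q0'")
            if not rank.isdigit() or not (1 <= int(rank) <= max_rank):
                errors.append(f"line {lineno}: bad rank {rank!r}")
            try:
                float(score)
            except ValueError:
                errors.append(f"line {lineno}: score {score!r} is not a number")
    return errors
```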
Who contributed

  These are the people who contributed to the CLEF-IP track:

      the CLEF-IP steering committee: Gianni Amati, Kalervo
      Järvelin, Noriko Kando, Mark Sanderson, Henk Thomas,
      Christa Womser-Hacker
      Helmut Berger, who invented the name CLEF-IP
      Florina Piroi and Veronika Zenz, who walked the walk
      the patent experts who helped with advice and with the
      assessment of results
      the Soire team
      Evangelos Kanoulas and Emine Yilmaz for their advice on
      statistics
      John Tait
Measures used for evaluation

  We evaluated all runs according to standard IR measures:
      Precision, Precision@5, Precision@10, Precision@100
      Recall, Recall@5, Recall@10, Recall@100
      MAP
      nDCG (with a reduction factor given by a logarithm in base 10)
  (Minimal reference implementations are sketched below.)
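
  These are binary-gain sketches; the track's exact nDCG variant may
  differ in detail, but the log-base-10 discount from the slide is
  used here.

```python
import math

def precision_at(ranked, relevant, k):
    """Fraction of the top-k retrieved documents that are relevant."""
    return sum(d in relevant for d in ranked[:k]) / k

def recall_at(ranked, relevant, k):
    """Fraction of the relevant documents retrieved in the top k."""
    return sum(d in relevant for d in ranked[:k]) / len(relevant)

def average_precision(ranked, relevant):
    """MAP is the mean of this value over all topics."""
    hits, total = 0, 0.0
    for i, d in enumerate(ranked, start=1):
        if d in relevant:
            hits += 1
            total += hits / i
    return total / len(relevant)

def ndcg(ranked, relevant, base=10):
    """Binary-gain nDCG with a logarithmic discount in the given base."""
    def gain(i):  # rank 1 is left undiscounted
        return 1.0 if i == 1 else 1.0 / math.log(i, base)
    dcg = sum(gain(i) for i, d in enumerate(ranked, start=1) if d in relevant)
    idcg = sum(gain(i) for i in range(1, len(relevant) + 1))
    return dcg / idcg
```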
How to interpret the results

  Some participants were disappointed by their poor evaluation
  results as compared to other tracks:

                      MAP = 0.02 ?
How to interpret the results

  There are two main reasons why evaluation at CLEF-IP yields lower
  values than other tracks:
    1   citations are incomplete sets of relevance assessments
    2   the target data set is fragmentary: some patents are
        represented by a single document containing just the title and
        bibliographic references (making them practically unfindable)
How to interpret the results

  Still, one can sensibly use evaluation results for comparing runs,
  assuming that
    1   the incompleteness of citations is distributed uniformly
    2   the same holds for unfindable documents in the collection

  The incompleteness of citations is difficult to check without a
  large enough gold standard to refer to.

  For the second issue, we are thinking about re-evaluating all runs
  after removing unfindable patents from the collection.
MAP: best run per participant

   Group-ID       Run-ID                      MAP    R@100   P@100
   humb           1                           0.27   0.58    0.03
   hcuge          BiTeM                       0.11   0.40    0.02
   uscom          BM25bt                      0.11   0.36    0.02
   UTASICS        all-ratf-ipcr               0.11   0.37    0.02
   UniNE          strat3                      0.10   0.34    0.02
   TUD            800noTitle                  0.11   0.42    0.02
   clefip-dcu     Filtered2                   0.09   0.35    0.02
   clefip-unige   RUN3                        0.09   0.30    0.02
   clefip-ug      infdocfreqCosEnglishTerms   0.07   0.24    0.01
   cwi            categorybm25                0.07   0.29    0.02
   clefip-run     ClaimsBOW                   0.05   0.22    0.01
   NLEL           MethodA                     0.03   0.12    0.01
   UAIC           MethodAnew                  0.01   0.03    0.00
   Hildesheim     MethodAnew                  0.00   0.02    0.00

           Table: MAP, R@100, P@100 of the best run per participant (S)
Manual assessments

  We managed to have 12 topics assessed up to rank 20 for all runs.
      7 patent search professionals
      judged on average 264 documents per topic
      not surprisingly, the rankings of systems obtained with this
      small collection do not agree with the rankings obtained with
      the large collection

  Investigations on this smaller collection are ongoing.
Correlation analysis

  The rankings of runs obtained with the three sets of topics
  (S=500, M=1,000, XL=10,000) are highly correlated (Kendall’s
  τ > 0.9), suggesting that the three collections are equivalent.
  (A sketch of this check follows.)
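
  Kendall's τ can be computed with SciPy; the numbers below are
  illustrative, not the track's actual scores.

```python
from scipy.stats import kendalltau

# MAP of the same runs evaluated on two topic sets (illustrative values).
map_on_S  = [0.27, 0.11, 0.11, 0.10, 0.09]
map_on_XL = [0.26, 0.12, 0.10, 0.10, 0.08]

tau, p_value = kendalltau(map_on_S, map_on_XL)
print(f"Kendall's tau = {tau:.2f} (p = {p_value:.3f})")
```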
Correlation analysis

  As expected, correlation drops when comparing the ranking
  obtained with the 12 manually assessed topics and the ones
  obtained with the ≥ 500-topic sets.
Working notes




       I didn’t have time to read the working notes ...
... so I collected all the notes and generated a Wordle




                 They’re about patent retrieval.
Refining the Wordle




  I ran an Information Extraction algorithm in order to get a more
  meaningful picture.
Humboldt University’s working notes

  [Image-only slides showing excerpts from Humboldt University’s
  working notes.]
Future plans

  Some plans and ideas for future tracks:
      a layered evaluation model is needed in order to measure the
      impact of each single factor on retrieval effectiveness
      provide images (they are essential elements in chemical or
      mechanical patents, for instance)
      investigate query reformulations rather than a single
      query-result set
      extend the collection to include other languages
      include an annotation task
      include a categorization task
Epilogue

     we have created a large integrated test collection for
     experimentation in patent retrieval
     the CLEF-IP track had a more than satisfactory participation
     rate for its first year
     the right combination of techniques and the exploitation of
     patent-specific know-how yields the best results
Thank you for your attention.

Contenu connexe

Similaire à CLEF-IP 2009: Patent Retrieval Experiments

Patent Search: An important new test bed for IR
Patent Search: An important new test bed for IRPatent Search: An important new test bed for IR
Patent Search: An important new test bed for IRGiovanna Roda
 
2010 Patent Information
2010 Patent Information2010 Patent Information
2010 Patent Informationsmallhead
 
Patent search analysis and report
Patent search analysis and reportPatent search analysis and report
Patent search analysis and reportYash Patel
 
Patent search analysis and report
Patent search analysis and reportPatent search analysis and report
Patent search analysis and reportYash Patel
 
COER 13-2-2008.pdf
COER 13-2-2008.pdfCOER 13-2-2008.pdf
COER 13-2-2008.pdfssuser996f71
 
Prior Art Search - An Overview
Prior Art Search - An OverviewPrior Art Search - An Overview
Prior Art Search - An OverviewManoj Prajapati
 
IPR in Academic Research:Dr G Geetha
IPR in Academic Research:Dr G GeethaIPR in Academic Research:Dr G Geetha
IPR in Academic Research:Dr G GeethaDr Geetha Mohan
 
Patents and Intellectual Property
Patents and Intellectual PropertyPatents and Intellectual Property
Patents and Intellectual PropertyEdwardBennett122
 
Presentation on patent search
Presentation on patent searchPresentation on patent search
Presentation on patent searchtinasingh30
 
Patents 101: How to Do a Patent Search
Patents 101: How to Do a Patent SearchPatents 101: How to Do a Patent Search
Patents 101: How to Do a Patent SearchKristina Gomez
 
The EPO document collection: A technical treasure chest
The EPO document collection:A technical treasure chestThe EPO document collection:A technical treasure chest
The EPO document collection: A technical treasure chestGO opleidingen
 
The Peculiar Searching Habits of the North American Patent Searcher
The Peculiar Searching Habits of the North American Patent SearcherThe Peculiar Searching Habits of the North American Patent Searcher
The Peculiar Searching Habits of the North American Patent Searcheratripper
 
Advanced patents
Advanced patentsAdvanced patents
Advanced patentsJohn Meier
 

Similaire à CLEF-IP 2009: Patent Retrieval Experiments (20)

Patent Search: An important new test bed for IR
Patent Search: An important new test bed for IRPatent Search: An important new test bed for IR
Patent Search: An important new test bed for IR
 
2010 Patent Information
2010 Patent Information2010 Patent Information
2010 Patent Information
 
Patent searching training doc
Patent searching training docPatent searching training doc
Patent searching training doc
 
Patent search analysis and report
Patent search analysis and reportPatent search analysis and report
Patent search analysis and report
 
Patent search analysis and report
Patent search analysis and reportPatent search analysis and report
Patent search analysis and report
 
UNH Law/WIPO Summer School: 2017 Patent Information and its Usefulness
UNH Law/WIPO Summer School: 2017 Patent Information and its Usefulness UNH Law/WIPO Summer School: 2017 Patent Information and its Usefulness
UNH Law/WIPO Summer School: 2017 Patent Information and its Usefulness
 
Introduction to Global Patent Searching & Analysis
Introduction to Global Patent Searching & AnalysisIntroduction to Global Patent Searching & Analysis
Introduction to Global Patent Searching & Analysis
 
COER 13-2-2008.pdf
COER 13-2-2008.pdfCOER 13-2-2008.pdf
COER 13-2-2008.pdf
 
Prior Art Search - An Overview
Prior Art Search - An OverviewPrior Art Search - An Overview
Prior Art Search - An Overview
 
Patent search
Patent searchPatent search
Patent search
 
Patent Searching 101
Patent Searching 101Patent Searching 101
Patent Searching 101
 
IPR in Academic Research:Dr G Geetha
IPR in Academic Research:Dr G GeethaIPR in Academic Research:Dr G Geetha
IPR in Academic Research:Dr G Geetha
 
Patents and Intellectual Property
Patents and Intellectual PropertyPatents and Intellectual Property
Patents and Intellectual Property
 
Patent basics
Patent basicsPatent basics
Patent basics
 
Presentation on patent search
Presentation on patent searchPresentation on patent search
Presentation on patent search
 
Patents 101: How to Do a Patent Search
Patents 101: How to Do a Patent SearchPatents 101: How to Do a Patent Search
Patents 101: How to Do a Patent Search
 
The EPO document collection: A technical treasure chest
The EPO document collection:A technical treasure chestThe EPO document collection:A technical treasure chest
The EPO document collection: A technical treasure chest
 
The Peculiar Searching Habits of the North American Patent Searcher
The Peculiar Searching Habits of the North American Patent SearcherThe Peculiar Searching Habits of the North American Patent Searcher
The Peculiar Searching Habits of the North American Patent Searcher
 
Advanced patents
Advanced patentsAdvanced patents
Advanced patents
 
Patent analysis
Patent analysisPatent analysis
Patent analysis
 

Plus de Giovanna Roda

Distributed Computing for Everyone
Distributed Computing for EveryoneDistributed Computing for Everyone
Distributed Computing for EveryoneGiovanna Roda
 
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to HadoopGiovanna Roda
 
Introduction to Hadoop part 2
Introduction to Hadoop part 2Introduction to Hadoop part 2
Introduction to Hadoop part 2Giovanna Roda
 
Introduction to Hadoop part1
Introduction to Hadoop part1Introduction to Hadoop part1
Introduction to Hadoop part1Giovanna Roda
 
The need for new paradigms in IT services provisioning
The need for new paradigms in IT services provisioningThe need for new paradigms in IT services provisioning
The need for new paradigms in IT services provisioningGiovanna Roda
 
Apache Spark™ is here to stay
Apache Spark™ is here to stayApache Spark™ is here to stay
Apache Spark™ is here to stayGiovanna Roda
 
Chances and Challenges in Comparing Cross-Language Retrieval Tools
Chances and Challenges in Comparing Cross-Language Retrieval ToolsChances and Challenges in Comparing Cross-Language Retrieval Tools
Chances and Challenges in Comparing Cross-Language Retrieval ToolsGiovanna Roda
 

Plus de Giovanna Roda (7)

Distributed Computing for Everyone
Distributed Computing for EveryoneDistributed Computing for Everyone
Distributed Computing for Everyone
 
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to Hadoop
 
Introduction to Hadoop part 2
Introduction to Hadoop part 2Introduction to Hadoop part 2
Introduction to Hadoop part 2
 
Introduction to Hadoop part1
Introduction to Hadoop part1Introduction to Hadoop part1
Introduction to Hadoop part1
 
The need for new paradigms in IT services provisioning
The need for new paradigms in IT services provisioningThe need for new paradigms in IT services provisioning
The need for new paradigms in IT services provisioning
 
Apache Spark™ is here to stay
Apache Spark™ is here to stayApache Spark™ is here to stay
Apache Spark™ is here to stay
 
Chances and Challenges in Comparing Cross-Language Retrieval Tools
Chances and Challenges in Comparing Cross-Language Retrieval ToolsChances and Challenges in Comparing Cross-Language Retrieval Tools
Chances and Challenges in Comparing Cross-Language Retrieval Tools
 

Dernier

A Framework for Development in the AI Age
A Framework for Development in the AI AgeA Framework for Development in the AI Age
A Framework for Development in the AI AgeCprime
 
QCon London: Mastering long-running processes in modern architectures
QCon London: Mastering long-running processes in modern architecturesQCon London: Mastering long-running processes in modern architectures
QCon London: Mastering long-running processes in modern architecturesBernd Ruecker
 
Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Farhan Tariq
 
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality AssuranceInflectra
 
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better StrongerModern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better Strongerpanagenda
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxLoriGlavin3
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Alkin Tezuysal
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
Potential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsPotential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsRavi Sanghani
 
2024 April Patch Tuesday
2024 April Patch Tuesday2024 April Patch Tuesday
2024 April Patch TuesdayIvanti
 
Top 10 Hubspot Development Companies in 2024
Top 10 Hubspot Development Companies in 2024Top 10 Hubspot Development Companies in 2024
Top 10 Hubspot Development Companies in 2024TopCSSGallery
 
Decarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityDecarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityIES VE
 
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Mark Goldstein
 
Zeshan Sattar- Assessing the skill requirements and industry expectations for...
Zeshan Sattar- Assessing the skill requirements and industry expectations for...Zeshan Sattar- Assessing the skill requirements and industry expectations for...
Zeshan Sattar- Assessing the skill requirements and industry expectations for...itnewsafrica
 
Design pattern talk by Kaya Weers - 2024 (v2)
Design pattern talk by Kaya Weers - 2024 (v2)Design pattern talk by Kaya Weers - 2024 (v2)
Design pattern talk by Kaya Weers - 2024 (v2)Kaya Weers
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfNeo4j
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
Varsha Sewlal- Cyber Attacks on Critical Critical Infrastructure
Varsha Sewlal- Cyber Attacks on Critical Critical InfrastructureVarsha Sewlal- Cyber Attacks on Critical Critical Infrastructure
Varsha Sewlal- Cyber Attacks on Critical Critical Infrastructureitnewsafrica
 
Generative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfGenerative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfIngrid Airi González
 

Dernier (20)

A Framework for Development in the AI Age
A Framework for Development in the AI AgeA Framework for Development in the AI Age
A Framework for Development in the AI Age
 
QCon London: Mastering long-running processes in modern architectures
QCon London: Mastering long-running processes in modern architecturesQCon London: Mastering long-running processes in modern architectures
QCon London: Mastering long-running processes in modern architectures
 
Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...
 
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
 
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better StrongerModern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptx
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
Potential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsPotential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and Insights
 
2024 April Patch Tuesday
2024 April Patch Tuesday2024 April Patch Tuesday
2024 April Patch Tuesday
 
Top 10 Hubspot Development Companies in 2024
Top 10 Hubspot Development Companies in 2024Top 10 Hubspot Development Companies in 2024
Top 10 Hubspot Development Companies in 2024
 
Decarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityDecarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a reality
 
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
 
Zeshan Sattar- Assessing the skill requirements and industry expectations for...
Zeshan Sattar- Assessing the skill requirements and industry expectations for...Zeshan Sattar- Assessing the skill requirements and industry expectations for...
Zeshan Sattar- Assessing the skill requirements and industry expectations for...
 
Design pattern talk by Kaya Weers - 2024 (v2)
Design pattern talk by Kaya Weers - 2024 (v2)Design pattern talk by Kaya Weers - 2024 (v2)
Design pattern talk by Kaya Weers - 2024 (v2)
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdf
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
Varsha Sewlal- Cyber Attacks on Critical Critical Infrastructure
Varsha Sewlal- Cyber Attacks on Critical Critical InfrastructureVarsha Sewlal- Cyber Attacks on Critical Critical Infrastructure
Varsha Sewlal- Cyber Attacks on Critical Critical Infrastructure
 
Generative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfGenerative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdf
 

CLEF-IP 2009: Patent Retrieval Experiments

  • 1. Clef–Ip 2009: retrieval experiments in the Intellectual Property domain Giovanna Roda Matrixware Vienna, Austria Clef 2009 / 30 September - 2 October, 2009
  • 2. Previous work on patent retrieval CLEF-IP 2009 is the first track on patent retrieval at Clef1 (Cross Language Evaluation Forum). 1 http://www.clef-campaign.org
  • 3. Previous work on patent retrieval CLEF-IP 2009 is the first track on patent retrieval at Clef1 (Cross Language Evaluation Forum). Previous work on patent retrieval: 1 http://www.clef-campaign.org
  • 4. Previous work on patent retrieval CLEF-IP 2009 is the first track on patent retrieval at Clef1 (Cross Language Evaluation Forum). Previous work on patent retrieval: Acm Sigir 2000 Workshop 1 http://www.clef-campaign.org
  • 5. Previous work on patent retrieval CLEF-IP 2009 is the first track on patent retrieval at Clef1 (Cross Language Evaluation Forum). Previous work on patent retrieval: Acm Sigir 2000 Workshop Ntcir workshop series since 2001 1 http://www.clef-campaign.org
  • 6. Previous work on patent retrieval CLEF-IP 2009 is the first track on patent retrieval at Clef1 (Cross Language Evaluation Forum). Previous work on patent retrieval: Acm Sigir 2000 Workshop Ntcir workshop series since 2001 Primarily targeting Japanese patents. 1 http://www.clef-campaign.org
  • 7. Previous work on patent retrieval CLEF-IP 2009 is the first track on patent retrieval at Clef1 (Cross Language Evaluation Forum). Previous work on patent retrieval: Acm Sigir 2000 Workshop Ntcir workshop series since 2001 Primarily targeting Japanese patents. ad-hoc task (goal: find patents on a given topic) invalidity search (goal: find patents invalidating a given claim) patent classification according to the F-term system 1 http://www.clef-campaign.org
  • 8. Legal and economic implications of patent search. patents are legal documents patent portfolios are assets for enterprises a single patent search can be worth several days of work High recall searches Missing even a single relevant document can have severe financial and economic impact. For example, when a granted patent becomes invalidated because of a document omitted at application time.
  • 9. Clef–Ip 2009: the task The main task in the Clef–Ip track was to find prior art for a given patent.
  • 10. Clef–Ip 2009: the task The main task in the Clef–Ip track was to find prior art for a given patent. Prior art search Prior art search consists in identifying all information (including non-patent literature) that might be relevant to a patent’s claim of novelty.
  • 11. Prior art search. The most common type of patent search. It is performed at various stages of the patent life-cycle and with different intentions.
  • 12. Prior art search. The most common type of patent search. It is performed at various stages of the patent life-cycle and with different intentions. before filing a patent application (novelty search or patentability search to determine whether the invention fulfills the requirements of
  • 13. Prior art search. The most common type of patent search. It is performed at various stages of the patent life-cycle and with different intentions. before filing a patent application (novelty search or patentability search to determine whether the invention fulfills the requirements of novelty
  • 14. Prior art search. The most common type of patent search. It is performed at various stages of the patent life-cycle and with different intentions. before filing a patent application (novelty search or patentability search to determine whether the invention fulfills the requirements of novelty inventive step
  • 15. Prior art search. The most common type of patent search. It is performed at various stages of the patent life-cycle and with different intentions. before filing a patent application (novelty search or patentability search to determine whether the invention fulfills the requirements of novelty inventive step before grant - results of search constitute the search report attached to patent document
  • 16. Prior art search. The most common type of patent search. It is performed at various stages of the patent life-cycle and with different intentions. before filing a patent application (novelty search or patentability search to determine whether the invention fulfills the requirements of novelty inventive step before grant - results of search constitute the search report attached to patent document invalidity search: post-grant search used to unveil prior art that invalidates a patent’s claims of originality
  • 17. The patent search problem Some noteworthy facts about patent search:
  • 18. The patent search problem Some noteworthy facts about patent search: patentese: language used in patents is not natural
  • 19. The patent search problem Some noteworthy facts about patent search: patentese: language used in patents is not natural patents are linked (by citations, applicants, inventors, priorities, ...)
  • 20. The patent search problem Some noteworthy facts about patent search: patentese: language used in patents is not natural patents are linked (by citations, applicants, inventors, priorities, ...) available classification information (Ipc, Ecla)
  • 21. The patent search problem Some noteworthy facts about patent search: patentese: language used in patents is not natural patents are linked (by citations, applicants, inventors, priorities, ...) available classification information (Ipc, Ecla)
  • 22. Outline 1 Introduction Previous work on patent retrieval The patent search problem Clef–Ip the task 2 The Clef–Ip Patent Test Collection Target data Topics Relevance assessments 3 Participants 4 Results 5 Lessons Learned and Plans for 2010 6 Epilogue
  • 23. Outline 1 Introduction Previous work on patent retrieval The patent search problem Clef–Ip the task 2 The Clef–Ip Patent Test Collection Target data Topics Relevance assessments 3 Participants 4 Results 5 Lessons Learned and Plans for 2010 6 Epilogue
  • 24. The Clef–Ip Patent Test Collection The Clef–Ip collection comprises target data: 1.9 million patent documents pertaining to 1 million patents (75Gb)
  • 25. The Clef–Ip Patent Test Collection The Clef–Ip collection comprises target data: 1.9 million patent documents pertaining to 1 million patents (75Gb) 10, 000 topics
  • 26. The Clef–Ip Patent Test Collection The Clef–Ip collection comprises target data: 1.9 million patent documents pertaining to 1 million patents (75Gb) 10, 000 topics relevance assessments (with an average of 6.23 relevant documents per topic)
  • 27. The Clef–Ip Patent Test Collection The Clef–Ip collection comprises target data: 1.9 million patent documents pertaining to 1 million patents (75Gb) 10, 000 topics relevance assessments (with an average of 6.23 relevant documents per topic) Target data and topics are multi-lingual: they contain fields in English, German, and French.
  • 28. Patent documents The data was provided by Matrixware in a standardized Xml format for patent data (the Alexandria Xml scheme).
  • 29. Looking at a patent document Field: description Language: German English French
  • 30. Looking at a patent document Field: claims Language: German English French
  • 31. Looking at a patent document Field: claims Language: German English French
  • 32. Looking at a patent document Field: claims Language: German English French
  • 33. Topics The task for the Clef–Ip track was to find prior art for a given patent.
  • 34. Topics The task for the Clef–Ip track was to find prior art for a given patent. But:
  • 35. Topics The task for the Clef–Ip track was to find prior art for a given patent. But: patents come in several versions corresponding to the different stages of the patent’s life-cycle
  • 36. Topics The task for the Clef–Ip track was to find prior art for a given patent. But: patents come in several versions corresponding to the different stages of the patent’s life-cycle not all versions of a patent contain all fields
  • 37. Topics How to represent a patent topic?
  • 38. Topics We assembled a “virtual patent topic” file by
  • 39. Topics We assembled a “virtual patent topic” file by taking the B1 document (granted patent)
  • 40. Topics We assembled a “virtual patent topic” file by taking the B1 document (granted patent) adding missing fields from the most current document where they appeared
  • 41. Criteria for topics selection Patents to be used as topics were selected according to the following criteria: 1 availability of granted patent 2 full text description available 3 at least three citations 4 at least one highly relevant citation
• 47. Relevance assessments We used patents cited as prior art as relevance assessments. Sources of citations:
  1 applicant’s disclosure: the Uspto requires applicants to disclose all known relevant publications
  2 patent office search report: each patent office performs a search for prior art to judge the novelty of a patent
  3 opposition procedures: patents cited to prove that a granted patent is not novel
• 48–50. Extended citations as relevance assessments [Diagrams: starting from the seed patent, the extended citation set is built from (i) direct citations and their families, (ii) direct citations of the seed’s family members, and (iii) the families of those citations. A sketch of this construction follows below.]
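The construction in the diagrams can be written down as a small set computation. This is a sketch assuming hypothetical helpers `citations(p)` and `family(p)` that return sets of patent IDs; the track's exact construction may differ in detail.

```python
# Illustrative extended-citation set: direct citations, citations of
# family members, and the families of all those citations.
def extended_citations(seed, citations, family):
    relevant = set()
    # direct citations of the seed and their families
    for c in citations(seed):
        relevant.add(c)
        relevant |= family(c)
    # citations of the seed's family members, and their families
    for member in family(seed):
        for c in citations(member):
            relevant.add(c)
            relevant |= family(c)
    return relevant
```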
• 53. Patent families A patent family consists of patents granted by different patent authorities but related to the same invention.
  - simple family: all family members share the same priority number
  - extended family: there are several definitions; in the INPADOC database, all documents that are directly or indirectly linked via a priority number belong to the same family
• 54–56. Patent families Patent documents are linked by priorities; the INPADOC family groups all documents linked directly or indirectly via priorities. Clef–Ip uses simple families (a grouping sketch follows below).
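A minimal sketch of simple-family grouping, assuming each patent record carries the set of priority numbers it claims and that members of a simple family share the identical priority set; names and data layout below are illustrative assumptions.

```python
from collections import defaultdict

def simple_families(patents):
    """patents: iterable of (patent_id, frozenset_of_priority_numbers)."""
    families = defaultdict(set)
    for patent_id, priorities in patents:
        # all members of a simple family share the same priority set
        families[priorities].add(patent_id)
    return list(families.values())

# Usage with hypothetical IDs:
# simple_families([("EP1A", frozenset({"P1"})),
#                  ("US1A", frozenset({"P1"}))])  # -> [{"EP1A", "US1A"}]
```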
• 58. Participants 15 participants, from CH (3), DE (3), NL (2), ES (2), UK, SE, FI, IE, and RO; 48 runs for the main task; 10 runs for the language tasks.
• 59. Participants
  1 Tech. Univ. Darmstadt, Dept. of CS, Ubiquitous Knowledge Processing Lab (DE)
  2 Univ. Neuchatel - Computer Science (CH)
  3 Santiago de Compostela Univ. - Dept. Electronica y Computacion (ES)
  4 University of Tampere - Info Studies (FI)
  5 Interactive Media and Swedish Institute of Computer Science (SE)
  6 Geneva Univ. - Centre Universitaire d’Informatique (CH)
  7 Glasgow Univ. - IR Group Keith (UK)
  8 Centrum Wiskunde & Informatica - Interactive Information Access (NL)
• 60. Participants
  9 Geneva Univ. Hospitals - Service of Medical Informatics (CH)
  10 Humboldt Univ. - Dept. of German Language and Linguistics (DE)
  11 Dublin City Univ. - School of Computing (IE)
  12 Radboud Univ. Nijmegen - Centre for Language Studies & Speech Technologies (NL)
  13 Hildesheim Univ. - Information Systems & Machine Learning Lab (DE)
  14 Technical Univ. Valencia - Natural Language Engineering (ES)
  15 Al. I. Cuza University of Iasi - Natural Language Processing (RO)
• 64. Upload of experiments A system based on Alfresco (http://www.alfresco.com/) together with a Docasu (http://docasu.sourceforge.net/) web interface was developed. Its main features are:
  - user authentication
  - run file format checks
  - revision control
• 73. Who contributed These are the people who contributed to the Clef–Ip track:
  - the Clef–Ip steering committee: Gianni Amati, Kalervo Järvelin, Noriko Kando, Mark Sanderson, Henk Thomas, Christa Womser-Hacker
  - Helmut Berger, who invented the name Clef–Ip
  - Florina Piroi and Veronika Zenz, who walked the walk
  - the patent experts, who helped with advice and with the assessment of results
  - the Soire team
  - Evangelos Kanoulas and Emine Yilmaz, for their advice on statistics
  - John Tait
• 79. Measures used for evaluation We evaluated all runs according to standard IR measures:
  - Precision, Precision@5, Precision@10, Precision@100
  - Recall, Recall@5, Recall@10, Recall@100
  - MAP
  - nDCG (with the discount given by a logarithm in base 10)
(a sketch of these measures follows below)
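To make the measures concrete, here is a minimal sketch of average precision and nDCG for one topic with binary relevance. It assumes the base-10 discount is applied in the Järvelin–Kekäläinen style (no discount up to rank 10, then division by log10 of the rank); that reading of the slide is an assumption, and the track's actual evaluation code may differ.

```python
import math

def average_precision(ranked_ids, relevant_ids):
    """AP for one topic; MAP is the mean of this over all topics."""
    hits, precision_sum = 0, 0.0
    for rank, doc in enumerate(ranked_ids, start=1):
        if doc in relevant_ids:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant_ids) if relevant_ids else 0.0

def ndcg(ranked_ids, relevant_ids, base=10):
    """nDCG with discount max(1, log_base(rank)): ranks below the
    log base (here, the first 9) are not discounted."""
    def dcg(gains):
        return sum(g / max(1.0, math.log(rank, base))
                   for rank, g in enumerate(gains, start=1))
    gains = [1.0 if doc in relevant_ids else 0.0 for doc in ranked_ids]
    ideal = sorted(gains, reverse=True)
    return dcg(gains) / dcg(ideal) if any(ideal) else 0.0
```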
  • 80. How to interpret the results Some participants were disappointed by their poor evaluation results as compared to other tracks
  • 81. How to interpret the results MAP = 0.02 ?
• 84. How to interpret the results There are two main reasons why evaluation at Clef–Ip yields lower values than other tracks:
  1 citations are incomplete sets of relevance assessments
  2 the target data set is fragmentary: some patents are represented by one single document containing just the title and bibliographic references (thus making them practically unfindable)
• 89. How to interpret the results Still, one can sensibly use evaluation results for comparing runs, assuming that
  1 the incompleteness of citations is distributed uniformly
  2 the same holds for unfindable documents in the collection
The incompleteness of citations is difficult to check without a large enough gold standard to refer to. As to the second issue: we are thinking about re-evaluating all runs after removing unfindable patents from the collection.
• 91. MAP: best run per participant

  Group-ID      Run-ID                      MAP    R@100   P@100
  humb          1                           0.27   0.58    0.03
  hcuge         BiTeM                       0.11   0.40    0.02
  uscom         BM25bt                      0.11   0.36    0.02
  UTASICS       all-ratf-ipcr               0.11   0.37    0.02
  UniNE         strat3                      0.10   0.34    0.02
  TUD           800noTitle                  0.11   0.42    0.02
  clefip-dcu    Filtered2                   0.09   0.35    0.02
  clefip-unige  RUN3                        0.09   0.30    0.02
  clefip-ug     infdocfreqCosEnglishTerms   0.07   0.24    0.01
  cwi           categorybm25                0.07   0.29    0.02
  clefip-run    ClaimsBOW                   0.05   0.22    0.01
  NLEL          MethodA                     0.03   0.12    0.01
  UAIC          MethodAnew                  0.01   0.03    0.00
  Hildesheim    MethodAnew                  0.00   0.02    0.00

  Table: MAP, P@100, R@100 of best run/participant (S)
• 96. Manual assessments We managed to have 12 topics assessed up to rank 20 for all runs:
  - 7 patent search professionals
  - judged on average 264 documents per topic
  - not surprisingly, the rankings of systems obtained with this small collection do not agree with the rankings obtained with the large collection
Investigations on this smaller collection are ongoing.
• 97. Correlation analysis The rankings of runs obtained with the three sets of topics (S = 500, M = 1,000, XL = 10,000) are highly correlated (Kendall’s τ > 0.9), suggesting that the three collections are equivalent.
• 98. Correlation analysis As expected, the correlation drops when comparing the ranking obtained with the 12 manually assessed topics to the one obtained with the ≥ 500-topic sets (see the sketch below for how such a τ is computed).
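A minimal sketch of the rank-correlation check between two system rankings, using scipy.stats.kendalltau; the run names and MAP values below are hypothetical, purely to show the computation.

```python
from scipy.stats import kendalltau

# MAP of each run under two topic sets (hypothetical values)
maps_S  = {"runA": 0.27, "runB": 0.11, "runC": 0.09, "runD": 0.05}
maps_XL = {"runA": 0.26, "runB": 0.12, "runC": 0.08, "runD": 0.05}

runs = sorted(maps_S)
tau, p_value = kendalltau([maps_S[r] for r in runs],
                          [maps_XL[r] for r in runs])
print(f"Kendall's tau = {tau:.2f} (p = {p_value:.3f})")
```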
  • 99. Working notes I didn’t have time to read the working notes ...
• 101. ... so I collected all the notes and generated a Wordle. They’re about patent retrieval.
  • 102. Refining the Wordle I ran an Information Extraction algorithm in order to get a more meaningful picture
• 116. Future plans Some plans and ideas for future tracks:
  - a layered evaluation model is needed in order to measure the impact of each single factor on retrieval effectiveness
  - provide images (they are essential elements in chemical or mechanical patents, for instance)
  - investigate query reformulations rather than a single query-result set
  - extend the collection to include other languages
  - include an annotation task
  - include a categorization task
• 119. Epilogue
  - we have created a large integrated test collection for experimentation in patent retrieval
  - the Clef–Ip track had a more than satisfactory participation rate for its first year
  - the right combination of techniques and the exploitation of patent-specific know-how yields the best results
  • 120. Thank you for your attention.