Since 2003, the World Intellectual Property Organization has been developing through its Simpleshift Partner a system (IPCCAT) for assisting users in categorizing patent documents in the International Patent Classification (IPC). IPCCAT supports the classification of documents in several languages and aims to assist users in locating relevant IPC symbols by providing them with a convenient web-based service. The approach taken between 2004 and 2009 for developing such a system relied on powerful machine learning algorithms that are trained on manually classified documents to recognize IPC topics and was essentially limited by the computing power and limited coverage of the deepest level of the IPC. In 2017, WIPO decided to resume IPCCAT research and development, targeting categorization across the full IPC which includes 72,981 subgroups in its 2017.01 version.
We now detail in-house results of this research and development building on the increase in computing power, as well as of larger training collections and improved extraction algorithms, applying a custom-built computer-assisted categorizer to English and French-language patent documents. The use of n-gram further improved IPCCAT accuracy thus justifying in 2017 to keep on building on legacy algorithms rather than exploring other like Deep Learning ones. We find that reliable automated categorization in the full IPC is now achievable for the statistical methods employed here. With a combination of around 10,000 neural networks and an increased coverage of the IPC, IPCCAT suggesting three IPC symbols for each document can predict the most relevant IPC class correctly for around 90% of documents, the most relevant IPC subclass for about 85% of documents and the most relevant IPC subgroup for 80% of documents. IPCCAT opens new horizons to the IPC community e.g. to speed-up completion of the reclassification of patent documents after a revision of the IPC.
Nanded City ( Call Girls ) Pune 6297143586 Hot Model With Sexy Bhabi Ready ...
IC-SDV 2018: Patrick Fievet (WIPO) Automatic Categorization of Patent Documents in the International Patent Classification (IPCCAT)
1. Nice
April 23, 2018
Patrick FIÉVET & Jacques GUYOT
IC-SDV 2018
Automatic Text categorization in the
International Patent Classification
IPCCAT-Neural
2. IPCCAT-neural : automatic text
categorization in the IPC
What is it about?
Patent Classifications : IPC (and CPC)
Automatic text CATegorization in the specific context
of patent documents
Artificial Intelligence (AI) to mimic patent classification
practices by human beings
2
4. IPCCAT-neural : automatic text
categorization in the IPC
Initial problems to be solved:
IPCs allotment in small Patent Offices
Automatic routing of patent/technical documents
according to their technical domains
4
5. IPCCAT-neural : Principles
Large collection (millions) of patent documents already
classified, preferably with good-reputation practices
Minimum number of documents for each IPC symbol
Use of the CPC (converted into IPC through
concordance)
Complement with IPC symbols
Training /Testing phase : 80% / 20%
Precision measure: Three-guesses evaluation on
millions of test cases
5
6. IPCCAT-neural : Principles
Production use: 100% of the collection
Web service: returns IPCs with a numerical
confidence for each
User interface through IPC publication platform
(IPCPUB) 5 confidence levels
6
8. IPCCAT-neural : automatic text
categorization in the IPC
Challenges?
IPC coverage Vs. availability of training collections
Precision Vs. Recall (e.g. for Prior art Search)
Absolute Vs. relative quality of IPCCAT-Neural
8
9. IPCCAT-neural : recall challenges
Recall: “First” IPC does not mean “Main” IPC
One IPC is usually not enough for patent document
classification but is enough for their routing
Automatic Prediction (or guess) of the most
appropriate IPC symbols on the basis on a text
input (e.g. patent abstract) with a confidence level
associated to each of these predictions
9
10. IPCCAT-neural : precision challenges
Precision: “Mycat” based on classic neural networks
Trained system based on (many) neural networks
No evidence that more recent technology e.g. Deep
Learning would perform better than “classic” neural
networks (because patent classification is not hidden
information in the training collection)
Use of N-grams (terms with several words)
10
11. IPCCAT-neural : quality challenges
IPCCAT quality is relative to IPC quality in its
training collection:
IPCCAT Imitates human practices (good and bad
ones)
Limited by patent documents fragments used during
its training
IPCCAT offers consistent and repeatable predictions
That human beings are usually not be able to achieve
11
13. IPCCAT-neural 2018: text categorization
in the IPC at subgroup level !
• Automatic prediction in 99% of the
IPC i.e. among 72,137 categories
• Top-three guess precision > 80%
13
14. IPCCAT-neural 2018: text categorization
in the IPC at subgroup level
Training collection:
27.7 million in EN
IPC coverage(using IPC and CPC through concordance):
99% at subgroup level (EN)
Top three guess evaluation of its Precision:
82.5% based on 1.5 million of test cases (EN)
14
15. IPCCAT-neural text categorization in
the IPC at subgroup level
Why was It actually doable?
Recent evolution of the IPCCAT classifier available
on-demand as open source by the Olanto
foundation see http://olanto.org/foundation
Added value in data processing:
Training based on patent documents computed
from DOCDB XML excerpts (title + abstract)
Computation of both IPC and CPC classifications
Progress in computing power opens new R&D
horizons e.g. GPU, text processing,…
15
16. Evolution of IPCCAT R&D over years
2017
2003-2008:
IPC Main Group level
(~7,000 categories)
16
2018: IPC
Group level
~73,000
categories
18. IPCCAT-neural practical use
18
What it could be for?
Improvement of the consistency in patent
classification
Reduction of the backlog of IPC reclassification
through automation of the residual IPC
reclassification of patent documents after some years:
Potential alternative to IPC reclassification Default
transfer
19. IPCCAT-neural for IPC reclassification
19
Additional Challenges:
non-EN language:
Large training collection, with good IPC coverage
Consistency in legacy classification practices
Preserve possibility of language-dedicated solution
Costs containment for WIPO
Streamline production of training collection(s)
Streamline IPCCAT yearly retraining (new
vocabulary)
21. Cross-lingual text categorization to
assist IPC reclassification
Chronology:
1. Evidence that text categorization works at IPC subgroup level
with an acceptable level of precision: Done
2. Integration of IPCCAT neural at sub-group level into IPCPUB
v 7.6 Done
3. Confirmation that Cross-lingual text categorization can assist
in other languages than EN, even in absence of large training
collections: Done
21
22. IPCCAT-neural cross lingual prototype
60,0%
65,0%
70,0%
75,0%
80,0%
85,0%
90,0%
95,0%
100,0%
100 800 6400 51200
Precision(%)
Test with 1000 randomly selected patents in AR, DE, ES, FR,
JA, RU, ZH
Difficult to compare, not the same distribution of patents
Number of Categories
EN-NATIVE
AVERAGE-
TRANSLATION
AR-TRANSLATION
DE-TRANSLATION
ES-TRANSLATION
FR-TRANSLATION
JA-TRANSLATION
RU-TRANSLATION
ZH-TRANSLATION
23. IPCCAT-neural cross lingual prototype
Test with 500 randomly selected patents from subclass G06F
Losses due to translation are more visible for each language
Needs to be evaluated for all languages
0,0%
20,0%
40,0%
60,0%
80,0%
100,0%
100 1600 25600
Precision(%)
Number of categories
EN-NATIVE Average Dist
DE-TRANS-G06F2
JA-TRANS-G06F2
ZH-TRANS-G06F2
24. Cross-lingual text categorization to
assist IPC reclassification
Chronology: (Still a long way to go)
4. Incentives for R&D in automated text categorization: WIPO
DELTA training collection: Done
5. Propose alternatives to Default Transfer e.g. more than one
symbol based on IPCCAT guesses and confidence levels for
decisions, resource planning, etc…: 2019
6. Development of the production-scale solution integrating
cross-lingual text categorization and WIPO translate: 2019-
2020?
7. Integration in IPC reclassification system (IPCWLMS)
2020?
24
25. Incentive to R&D in text categorization:
WIPO-Delta training collection
Incentives for research and development institutes interested
in automatic text categorization :
WIPO DELTA 2018 EN dataset available upon request
Fully specified XML format
~50 million excepts of patent documents classified in the
IPC
Complement the public WIPO-ALPHA & GAMMA datasets
See
http://www.wipo.int/classifications/ipc/en/ITsupport/Categori
zation/dataset/index.html
25