II-SDV 2014 The Challenges of Managing “Big Data” in the Patent Field: Patents for business (Olivier Huc, Minesoft, UK)

The Challenges of Managing “Big Data” in the Patent Field
14-15 April 2014, Nice
Olivier Huc

• Specialists in Patent Information: Building Intelligent Patent Information Solutions since 1996
• Trusted by IP experts Worldwide: Corporations, National Patent Offices, Patent Attorneys and Patent Search Firms worldwide
• International Customer Support: Global client base, with Offices and Support across Europe, North America, and Asia
What we do

Patent Families
Analytics
Quality Control
Fast Search
Legal Status
Review
Alerts
• 23 Full Text Collections
• 48 Million Families
• 103 Issuing Authorities
• IPC, CPC, US and JP classes
• Quality Controlled content
• Normalised data

3 Patent Data Myths
• Myth #1: Patent data is just another type of
“Big Data”
• Myth #2: Patent Data is handled automatically
• Myth #3: Patent Data is consistent worldwide

• Patent data volume might be smaller, but the data is
more complex (languages, text, fields)
• Patent data is not retrieved on the fly, it is
hosted, indexed and optimized
• There are multiple sources with overlap
• Data quality is a major issue
• Users have a low tolerance for errors
The reality

• Total data volume exceeds 35 TB
• 49 million families and 103 publishing bodies
• 95 million publications
• 47 million full texts, including over 23 million machine translations
from non-Latin languages into English
• 54 million clipped images and 45 million complete sets of
drawings
Database Facts

• Minesoft and RWS host their own data center, located just
outside of London
• Control
• Confidentiality
• Reactivity
• Speed
• Distributed search engine
• Continuous data update and indexing: no need to interrupt
or restart the online services, and new data is immediately searchable
Hardware & Search Engine

• Multiple data sources:
• DOCDB weekly feeds (EPO)
• National Patent Offices
• Commercial collections
• External information (such as National Registers)
• Despite the complexity, having multiple sources for
the same country is a great advantage:
• Complementarity
• Improved quality
• Security
• Speed
Sources

• We perform stringent quality checks
• Human
• Programmatic
• Manual checks on some source data collections as they arrive: e.g.
Indian (IN), Thai (TH) and The Philippines (PH)
• Errors in data are identified programmatically against strict pre-set
parameters and are then corrected manually by our data team
• e.g. IC8=AO1G1/00 (the letter O has been keyed in place of the digit 0 in A01G1/00)
• Although we follow EPO’s INPADOC rules for families (extended),
we recreate all our families to ensure consistency
Data Quality
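A programmatic check that would catch an error such as IC8=AO1G1/00 can be sketched as a simple format test. This is only an illustration, not the rule set used for PatBase: the regular expression and the function names are assumptions, and they cover just the basic shape of an IPC symbol (section letter, two-digit class, subclass letter, main group, subgroup).

```python
import re

# Assumed shape of an IPC symbol: section A-H, two-digit class, subclass letter,
# 1-4 digit main group, "/", 2-6 digit subgroup (for example A01G1/00).
IPC_PATTERN = re.compile(r"[A-H][0-9]{2}[A-Z][0-9]{1,4}/[0-9]{2,6}")


def is_valid_ipc(symbol: str) -> bool:
    """Return True if the symbol matches the assumed IPC shape."""
    return IPC_PATTERN.fullmatch(symbol.replace(" ", "")) is not None


def flag_for_manual_review(symbols):
    """Return the symbols that fail the format test, in their original order."""
    return [s for s in symbols if not is_valid_ipc(s)]


if __name__ == "__main__":
    # "AO1G1/00" contains the letter O where the class digits should be.
    print(flag_for_manual_review(["A01G1/00", "AO1G1/00"]))
```

A format test of this kind only finds malformed symbols; a symbol that is well formed but wrong for the invention still needs the human check.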

Adding extra value to PatBase data:
• Families are automatically reviewed and then, if necessary, rebuilt
when we receive new and/or corrected information (e.g. priority)
• Tagging of examples, paragraphs and claims is done in order to
facilitate searching specific sections of text
• Machine translation: when a family gets new text, the family is
reassessed to see if a machine translation needs to be
added/replaced/deleted.
Data Quality
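Rebuilding families can be pictured as a connected-components problem: under an extended (INPADOC-style) rule, publications that share a priority directly or indirectly fall into one family. The sketch below assumes each record is just a publication number with a set of priority numbers; the function name and the sample numbers are invented for illustration, and this is not the PatBase implementation.

```python
from collections import defaultdict


def build_families(priorities_by_pub):
    """Group publications into families linked by shared priorities (union-find)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps the trees shallow
            x = parent[x]
        return x

    def union(a, b):
        parent[find(a)] = find(b)

    # Link every publication to each of its priorities.
    for pub, priorities in priorities_by_pub.items():
        find(("pub", pub))
        for priority in priorities:
            union(("pub", pub), ("pr", priority))

    # Collect publications that ended up under the same root.
    families = defaultdict(set)
    for pub in priorities_by_pub:
        families[find(("pub", pub))].add(pub)
    return sorted(sorted(members) for members in families.values())


if __name__ == "__main__":
    records = {"EP1": {"P1"}, "US1": {"P1", "P2"}, "JP1": {"P2"}, "CN9": {"P3"}}
    print(build_families(records))
    # A corrected priority for CN9 merges it into the larger family on rebuild.
    records["CN9"] = {"P3", "P2"}
    print(build_families(records))
```

Recomputing the components from scratch after each correction is the simplest way to stay consistent, at the cost of extra work on large collections.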

TW (Emperor year conversion & Type of application)
  AN/PR input              AN/PR output
  083303675                TW19940303675F
  092128911                TW20030128911
  092128911                TW20040201682U
US (Type of application & Year)
  AN/PR input              AN/PR output
  US29/356,858 20100303    US20100356858F
  1301618611 A             US20110016186
AT (Type of application & Year)
  AN/PR input              AN/PR output
  A 709/95                 AT19950000709
  GM647/96                 AT19960000647U
Standardisation of patent data
Formatting application and priority information
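The TW and AT conversions on this slide can be sketched as follows. The rules are inferred from the examples shown and are assumptions rather than the actual PatBase logic: the TW year is treated as a Republic of China (Minguo) year, so 1911 is added; the digit after the year is assumed to encode the application type; and two-digit AT years are assumed to fall in the 1900s.

```python
import re

# Assumption: the digit after the 3-digit year gives the application type
# (1 = no suffix, 2 = utility model "U", 3 = design "F").
TW_TYPE_SUFFIX = {"1": "", "2": "U", "3": "F"}


def normalise_tw(raw: str) -> str:
    """Convert a TW number such as 083303675 to the TWyyyynnnnnnn[suffix] form."""
    match = re.fullmatch(r"(\d{3})(\d)(\d{5})", raw.strip())
    if match is None:
        raise ValueError(f"unexpected TW number: {raw!r}")
    local_year, type_digit, serial = match.groups()
    year = int(local_year) + 1911  # Minguo year 1 corresponds to 1912
    suffix = TW_TYPE_SUFFIX.get(type_digit, "")
    return f"TW{year}0{type_digit}{serial}{suffix}"


def normalise_at(raw: str) -> str:
    """Convert an AT number such as 'A 709/95' or 'GM647/96'."""
    match = re.fullmatch(r"(A|GM)\s*(\d+)/(\d{2})", raw.strip())
    if match is None:
        raise ValueError(f"unexpected AT number: {raw!r}")
    prefix, serial, two_digit_year = match.groups()
    year = 1900 + int(two_digit_year)  # assumption: all years are 19xx
    suffix = "U" if prefix == "GM" else ""  # GM marks a utility model
    return f"AT{year}{int(serial):07d}{suffix}"


if __name__ == "__main__":
    print(normalise_tw("083303675"))
    print(normalise_at("GM647/96"))
```

Inputs that do not match the expected shape raise an error instead of being guessed at, which fits the low tolerance for errors described earlier.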

• Formatting patent numbers and kind codes
• Formatting dates
Thailand uses Buddhist Era years (Gregorian calendar year plus 543)
US date format: 09/02/2011 (2 September 2011)
European date format: 09/02/2011 (9 February 2011)
Standardisation of patent data
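Both date conversions can be sketched in a few lines. The function names are hypothetical, and the sample string 09/02/2011 is an assumption chosen to show that the same digits give two different dates depending on whether the source writes the month or the day first.

```python
from datetime import datetime

BUDDHIST_ERA_OFFSET = 543  # Buddhist Era year = Gregorian year + 543


def buddhist_to_gregorian_year(be_year: int) -> int:
    """Convert a Thai Buddhist Era year to a Gregorian year."""
    return be_year - BUDDHIST_ERA_OFFSET


def to_iso(raw: str, day_first: bool) -> str:
    """Parse a slash-separated date and return it as unambiguous ISO 8601."""
    fmt = "%d/%m/%Y" if day_first else "%m/%d/%Y"
    return datetime.strptime(raw, fmt).date().isoformat()


if __name__ == "__main__":
    print(buddhist_to_gregorian_year(2550))
    print(to_iso("09/02/2011", day_first=False))  # month-first reading
    print(to_iso("09/02/2011", day_first=True))   # day-first reading
```

The convention has to be known per source; it cannot be recovered reliably from the digits alone.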

The EPO standardises names to assist searching.
PatBase contains both standard and non-standard names.
• Standard name: assigned by the EPO
• Non-standard name: whatever is filed or published on the patent
Standard       Non-standard
PIRELLI IND    PIRELI SPA
PIRELLI IND    PIRELLA SPA
PIRELLI IND    PIRELLE S P A
PIRELLI IND    PIRELLI DPA
PIRELLI IND    PIRELLI S p A
PIRELLI IND    PIRELLI S A
PIRELLI IND    PIRELLI S P A
PIRELLI IND    PIRELLI S P A FIRMA
PIRELLI IND    PIRELLI S P A IT
PIRELLI IND    PIRELLI S P CA
PIRELLI IND    PIRELLI SPA IT
PIRELLI IND    PIRELLI SPP
PIRELLI IND    PIRELLU SPA
PIRELLI IND    PIRELLY SPA
This is a small sample of the non-standard names to which the EPO
assigns the standard name ‘Pirelli’.
There are currently 188 non-standard names for the
standard name ‘Pirelli’.
Standardisation of patent data
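Holding both name types side by side amounts to a lookup from each non-standard spelling to its standard name. A minimal sketch, using a handful of the pairs listed above; the dictionary layout and function names are assumptions, and the real table is assigned by the EPO rather than computed.

```python
# A few of the non-standard spellings from the slide, keyed in upper case.
STANDARD_NAMES = {
    "PIRELI SPA": "PIRELLI IND",
    "PIRELLA SPA": "PIRELLI IND",
    "PIRELLI DPA": "PIRELLI IND",
    "PIRELLI S P A": "PIRELLI IND",
    "PIRELLY SPA": "PIRELLI IND",
}


def normalise_key(name: str) -> str:
    """Upper-case the name and collapse runs of whitespace."""
    return " ".join(name.upper().split())


def standard_name(raw: str):
    """Return the standard name for a filed name, or None if it is unknown."""
    return STANDARD_NAMES.get(normalise_key(raw))


def variants_of(standard: str):
    """Return every known non-standard spelling of a standard name."""
    return sorted(k for k, v in STANDARD_NAMES.items() if v == standard)


if __name__ == "__main__":
    print(standard_name("Pirelli S p A"))
    print(variants_of("PIRELLI IND"))
```

Searching on the standard name then retrieves records filed under any of the listed spellings, while the non-standard name stays available for exact matching.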

• Date Formats
• All fields, e.g. patent classifications, assignees, text etc. have set
parameters. Where these are not matched data errors are
identified for manual editing.
• If a text is illegible (we have programmatic systems in place
measuring this) it is not allowed into the database and is
flagged as requiring manual attention (often manual typing).
• Character conversions
We have thousands of symbol / letter conversions in our
programs:
• & is replaced by and
• œ is replaced by oe
• ß is replaced by ss
Data Improvements
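The character conversions can be sketched with a translation table. Only the three rules listed on the slide are included, the last one is read here as the German ß (an assumption about the intended character), and the function name is hypothetical; a production table would hold thousands of entries.

```python
# Symbol / letter conversions from the slide, applied in a single pass.
CONVERSIONS = str.maketrans({
    "&": "and",
    "œ": "oe",
    "ß": "ss",
})


def convert_characters(text: str) -> str:
    """Replace each listed symbol with its plain-letter equivalent."""
    return text.translate(CONVERSIONS)


if __name__ == "__main__":
    print(convert_characters("Straße & cœur"))
```

Applying the same table at indexing time and at query time means a search for either spelling finds the same documents.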

• Insertion of paragraph breaks and paragraph numbers
[Figure: source text alongside the output in PatBase]
Data Improvements

• Errors appear in source data so manual checks are essential
• Example – Granted patent information from the Indian Patent Office
Journal. Three different inventions have incorrectly been given the same
publication number
Manual checks
IN000008

Data quality issues
On the Thai patent office website - the same publication number is used for two
different applications
[Figure: patent copy for TH48405 A]
In PatBase:
• Application number: TH19981004295, Publication number: TH48406 A
• Application number: TH19981002185, Publication number: TH48405 A
[Figure callouts: “Wrong number”, “Correct number”]
Manual checks

• Acquiring data from multiple sources enables us to supplement records,
but also alerts us to errors thus ensuring accuracy
KR20010012826 A – Glial Cell Line-
Derived Neurotrophic Factor
Receptors
KR20010112826 A – Single phase six
pole DC brushless axial fan motor of
transistor type
Source: EPO (error in information). This EPO record is a
combination of two inventions: the publication number
does not match the invention.
Identifying data errors
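Using a second source as an error alert can be sketched as a comparison of records that share a publication number. The record layout (publication number mapped to title), the function names and the sample data are all assumptions for illustration; the two titles are reused from the slide, but which source supplied which title is made up here.

```python
def _normalise(title: str) -> str:
    """Lower-case the title and collapse whitespace so trivial differences are ignored."""
    return " ".join(title.lower().split())


def find_conflicts(source_a, source_b):
    """Return publication numbers whose titles disagree between two sources."""
    shared = source_a.keys() & source_b.keys()
    return [
        pub for pub in sorted(shared)
        if _normalise(source_a[pub]) != _normalise(source_b[pub])
    ]


if __name__ == "__main__":
    # Illustrative records only.
    first_source = {
        "KR20010012826 A": "Single phase six pole DC brushless axial fan motor of transistor type",
    }
    second_source = {
        "KR20010012826 A": "Glial Cell Line-Derived Neurotrophic Factor Receptors",
    }
    print(find_conflicts(first_source, second_source))
```

A disagreement does not say which source is right, so each flagged number still goes to a person for review.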

Incorrect data received from source
• NULL values were supplied as Applicants in the EPO’s DOCDB file
• In cases such as these we correct the error in PatBase and inform the EPO

Example of an incorrect assignment from the USPTO
PatBase family 41683901
Excerpt from USPTO assignments database

Translations
• Principle: the English text of an equivalent is
always better than the Machine Translation
• All non-Latin texts are machine translated into
English and indexed when added to PatBase
• On a rolling basis we re-translate texts to
benefit from the continuous improvements of
translation engines

Machine translation
• Machine translations are created or updated as data is added, removed or
rebuilt. This is all done before indexing.
• We run a rolling re-translate and re-index program to optimize the quality
of our machine translated full-text
[Figures: two examples comparing the original translation with the re-translation, Thai into English]
Translations

[Figure: original translation compared with the re-translation, Korean into English]
Translations

Assignee translations
• Non-Latin assignees are indexed
• Non-Latin assignees are also translated
• First 10,000 CN and JP assignees have been
manually translated by RWS
• All others are machine translated until an “official”
Latin name appears in the family

Cross-lingual Tool
• Initially developed by WIPO, CLIR (Cross Lingual
Information Retrieval) allows our users to generate
multilingual searches
• Using an advanced statistical text analysis system
based on the PCT corpus, the cross-lingual search tool
identifies variants in multiple languages for search
terms entered by the user.
=> Better translation – translated words originate from
PCT applications

• Source: INPADOC
• All legal status events are categorised with a PRS
code
• Challenge: 2628 different PRS codes, some no longer
in use
• Solution: Grouping similar legal events together:
Legal Status
Reassignment
Deemed Withdrawn/Abandoned
Examined
Renewal Fees Paid
Granted
Lapsed/Expired/Ceased/Dead
Licence
Non-Entry into National Phase
National Phase Entry
Opposition Filed/Request for revocation
Published
Restored/Reinstated/Amended
Revoked/Rejected/Annulled/Invalid
Withdrawn/Abandoned/Terminated/Void
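Grouping legal events comes down to a many-to-one mapping from PRS codes to the categories listed above. In this sketch the code strings are placeholders, not real PRS codes, and the function name and the "Ungrouped" fallback are assumptions; a full table would cover all 2628 codes.

```python
from collections import Counter

# Placeholder codes for illustration only; these are NOT real PRS codes.
GROUP_BY_CODE = {
    "XX-GRANT": "Granted",
    "XX-FEE": "Renewal Fees Paid",
    "XX-LAPSE": "Lapsed/Expired/Ceased/Dead",
    "XX-ASSIGN": "Reassignment",
}


def group_events(codes):
    """Count legal status events per group; unknown codes are kept visible."""
    return Counter(GROUP_BY_CODE.get(code, "Ungrouped") for code in codes)


if __name__ == "__main__":
    events = ["XX-FEE", "XX-FEE", "XX-GRANT", "ZZ-UNKNOWN"]
    print(group_events(events))
```

Keeping unknown codes in their own bucket makes gaps in the mapping easy to spot when new or retired codes appear.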

• Most patent databases are structured and optimized for
Patent Searching, not for Analytics
• At Minesoft, we developed a special database with
proprietary meta tags dedicated to analytics
• Coverage is important – Beware of data gaps
• Importance of a web service (API)
• Importance of incorporating your own custom data or
legal status information in your analysis
Analytics

Thank you
PatBase celebrates its 10th anniversary
Olivier Huc – olivier@minesoft.com
