This was a presentation given at the European Patent Office's annual Patent Information Conference in Madrid, Spain on November 10th, 2016.
In it, we give an overview of how machine translation works, latest advances in neural MT, and how this can be applied to patents and intellectual property content, not only for translations but also information extraction and other NLP applications.
Past, Present, and Future: Machine Translation & Natural Language Processing for Patent Information
1. ‘Past, Present, and Future’
Machine Translation & Natural Language
Processing for Patent Information
Dr. John Tinsley
CEO, Iconic Translation Machines Ltd.
EPOPIC. Madrid. 10th November 2016
2. BSc in Computational Linguistics
PhD in Machine Translation
Language Technology consultant
Founder of Iconic Translation Machines
Why listen to me?
Machine Translation is what I do!
The world’s first and only patent specific machine translation platform
3. Machine Translation: The Basics
Machine Translation = automatic translation
The use of computers to translate from one language into another
The use of computers to automate some, or all, of the translation process
Statistical Machine Translation (SMT)
An approach to Machine Translation, where translations for an input are
estimated based on previously seen translation examples and associated
(inferred) probabilities.
e.g. IPTranslator, Google Translate
Other approaches
Rule-based (or transfer-based): based on linguistic rules
• e.g. Systran; Altavista’s Babelfish
Example-based: based on translation examples and inferred linguistic
patterns
SMT is now by far the predominant approach*
4. Bilingual Corpora
A corpus (pl. corpora) is a collection of texts, in electronic format, in a
single language
document(s)
book(s)
A bilingual corpus is a collection of corresponding texts, in multiple
languages
a document & its translation
a book in multiple languages
European Parliament proceedings
Note: source language = original language or language we’re translating from
target language = language we’re translating into
5. Aligned Bilingual Corpora
A document-aligned bilingual corpus corresponds on a document level
For translation, we require sentence-aligned bilingual corpora
The sentence on line 1 in the source language text corresponds
to (i.e. is a translation of) the sentence on line 1 in the target
language text etc.
Often referred to as parallel aligned corpora
Sentence aligned bilingual parallel corpora
are essential for statistical machine translation
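As a rough illustration of what "sentence-aligned" means in practice, the sketch below pairs line i of a source text with line i of a target text. The two texts are held in memory as strings, and the function name `read_sentence_pairs` is hypothetical; a real corpus would normally be two large files.

```python
# Two tiny, made-up texts standing in for a source file and a target file.
# Line i of the source is assumed to be translated by line i of the target.
source_text = "I have a cat\nI have a dog\n"
target_text = "Tengo un gato\nTengo un perro\n"


def read_sentence_pairs(source, target):
    """Pair up line i of the source with line i of the target."""
    source_lines = source.splitlines()
    target_lines = target.splitlines()
    # A mismatch in line counts means the texts cannot be sentence-aligned.
    if len(source_lines) != len(target_lines):
        raise ValueError("corpus is not sentence-aligned: line counts differ")
    return list(zip(source_lines, target_lines))


pairs = read_sentence_pairs(source_text, target_text)
for src, tgt in pairs:
    print(src, "<->", tgt)
```

Equal line counts are only a necessary condition: a corpus can have matching line counts and still contain wrong alignments, which is the kind of noise the talk warns about.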
6. Learning from Previous Translations
Suppose we already know
(from a sentence-aligned bilingual
corpus) that:
“dog” is translated as “perro”
“I have a cat” is translated as
“Tengo un gato”
We can theoretically translate:
“I have a dog” “Tengo un perro”
Even though we have never seen “I
have a dog” before
Statistical machine translation induces information about unseen input, based on
previously known translations:
Primarily co-occurrence statistics
Takes contextual information into account
9. Statistical Machine Translation
From the corpus we can infer possible target (French)
translations for various source (English) words
We can then select the most probable translations
based on simple frequencies (co-occurrence statistics)
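The idea of picking translations from simple co-occurrence frequencies can be sketched as follows. The four-sentence English-French corpus is invented for illustration, and `most_frequent_translation` is a hypothetical helper, not part of any MT toolkit.

```python
from collections import Counter, defaultdict

# Invented sentence-aligned corpus (English source, French target).
corpus = [
    ("the house", "la maison"),
    ("the cheese", "le fromage"),
    ("the big house", "la grande maison"),
    ("a house", "une maison"),
]

# cooc[s][t] = number of sentence pairs in which source word s
# appears alongside target word t.
cooc = defaultdict(Counter)
for src, tgt in corpus:
    for s in src.split():
        for t in tgt.split():
            cooc[s][t] += 1


def most_frequent_translation(word):
    """Return the target word seen most often with `word`, plus its relative frequency."""
    counts = cooc[word]
    total = sum(counts.values())
    target, count = counts.most_common(1)[0]
    return target, count / total


print(most_frequent_translation("house"))
```

With so little data the counts are easily tied or misleading (here "the" co-occurs equally often with "la" and "maison"), which is why real systems need far more sentence pairs for reliable estimates.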
11. Advanced MT
All modern approaches are based on building translations for complete
sentences by putting together smaller pieces of translation
Previous example is very simplistic
In reality SMT systems calculate much more complex statistical models
over millions of sentence pairs for a pair of languages
Upwards of 2M sentence pairs on average for large-scale systems
Other statistics calculated include
Word-to-word translation probabilities
Phrase-to-phrase translation probabilities
Word order probabilities
Linguistic information (are the words nouns, verbs?)
Fluency of the final output
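One common way to combine several such statistics in phrase-based SMT is a weighted sum of log-probabilities, with the highest-scoring candidate chosen as the translation. The sketch below assumes that scheme; the talk does not say how its own systems combine them, and all probabilities, weights, and feature names here are made up for illustration.

```python
import math

# Made-up feature probabilities for two candidate translations of "I have a dog".
candidates = {
    "Tengo un perro": {"translation": 0.6, "reordering": 0.9, "fluency": 0.5},
    "Tengo perro un": {"translation": 0.6, "reordering": 0.4, "fluency": 0.01},
}

# Made-up weights saying how much each statistic should count.
weights = {"translation": 1.0, "reordering": 0.5, "fluency": 1.0}


def score(features):
    """Weighted sum of log-probabilities; higher (closer to zero) is better."""
    return sum(weights[name] * math.log(p) for name, p in features.items())


best = max(candidates, key=lambda c: score(candidates[c]))
print(best)
```

Both candidates contain the same words, so the translation feature cannot separate them; the word order and fluency features do.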
12. Data is Key
For SMT data is key
Information (word/phrase correspondences and associated statistics) is only based
on what we have seen before in the data
Important that data used to train SMT systems is:
Of sufficient size
avoid sparseness/skewed statistics
Representative and relevant
contains the right type of language
High-quality
absence of misspellings,
incorrect alignments etc.
Proofed by human
translators
training data
13. Why is MT Difficult?
A word or a phrase can have more than one meaning (ambiguity – lexical or
structural)
e.g. “bank”, “dive”, “I saw the man with the telescope”
People use language creatively
New words are cropping up all the time
Linguistic differences between languages
e.g. structure of Irish sentences vs. structure of English sentences:
“Tá (Is) ocras (hunger) orm (on me)” <-> “I am hungry”
There can be more than one way to express the same meaning.
“New York”, “The Big Apple”, “NYC”
14. Why is MT Difficult?
Israeli officials are responsible for airport security.
Israel is in charge of the security at this airport.
The security work for this airport is the responsibility of the Israel government.
Israeli side was in charge of the security of this airport.
Israel is responsible for the airport’s security.
Israel is responsible for safety work at this airport.
Israel presides over the security of the airport.
Israel took charge of the airport security.
The safety of this airport is taken charge of by Israel.
This airport’s security is the responsibility of the Israeli security officials.
15. No single solution for all languages
Number agreement: the house / the houses vs. la maison / les maisons
Gender agreement: the house / the cheese vs. la maison / le fromage
English - Spanish
English - French
16. No single solution for all languages
English - German
English - Chinese
种水果的农民
The farmer who grows fruit
[Lit: “grow fruit (particle) farmer”]
17. Not all languages are created equal
French German Turkish Finnish
Spanish Chinese Korean Hungarian
Portuguese Japanese Thai Basque
18. The Challenge of Patents
L is an organic group selected from -CH2-
(OCH2CH2)n-, -CO-NR'-, with R'=H or
C1-C4 alkyl group; n=0-8; Y=F, CF3 …
maximum stress of 1.2 to 3.5 N/mm<2>
and a maximum elongation of 700 to
1,300% at 0[deg.] C.
Long Sentences
Technical constructions
Largest single document: 249,322 words
Longest Sentence: 1,417 words
19. The Challenge of Patents
Very long sentences as standard
Grammatically incomplete using
nominal and telegraphic style (!)
Passive forms are frequent
Frequent use of subordinate clauses,
participles, implicit constructs
Inconsistent and incorrect spelling
High use of neologisms
Instances of synonymy and polysemy
Spurious use of punctuation
Authoring guide
for “to be
translated” text
Patents break
almost all of the
rules!
20. Evaluating Machine Translation Quality
Automatic Evaluation
Judge the quality of an MT system by comparing its output against a
human-produced “reference” translation
Pros: Quick, cheap, consistent
Cons: Inflexible, cannot be used on ‘new’ input
Human Evaluation
Pros: Reliable, flexible, multi-faceted (fluency, error analyses,
benchmarking)
Cons: Slow, expensive, subjective
Task-Based Evaluation
Fluency vs. Adequacy
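Automatic evaluation against a reference can be sketched with clipped n-gram precision, a simplified ingredient of metrics such as BLEU rather than a full metric. The reference "The big red house" is an assumed human translation of "La gran casa roja"; the two outputs are the ones used in the talk's fluency vs. adequacy example.

```python
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-grams in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_precision(hypothesis, reference, n=1):
    """Fraction of hypothesis n-grams that also occur in the reference.

    Each n-gram is credited at most as often as it appears in the reference.
    """
    hyp = Counter(ngrams(hypothesis.lower().split(), n))
    ref = Counter(ngrams(reference.lower().split(), n))
    total = sum(hyp.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref[gram]) for gram, count in hyp.items())
    return matched / total


reference = "The big red house"  # assumed reference translation
for output in ("The big blue house", "The big house red"):
    print(output, clipped_precision(output, reference, 1), clipped_precision(output, reference, 2))
```

On single words the second output scores higher even though its word order is wrong, while on word pairs the two outputs score the same. This is one reason automatic scores are usually read alongside human judgements.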
21. Evaluating Machine Translation Quality
Task-Based Evaluation
Standalone evaluation of MT systems is necessary to get a sense of the
overall quality of a system
To determine the ultimate usability of an MT system, extrinsic task-based
evaluation is required
Why? Fluency vs. Adequacy
Fluency: how fluent and grammatically correct the translation
output is
Adequacy: how accurately the translation conveys the meaning of the
source
Source: La gran casa roja
Output 1: The big blue house
Output 2: The big house red
22. Practical uses of Machine Translation
Understand its limitations and you’ll understand
its capabilities!
No
Translate a patent for filing
Translate literature for
publication
Translate marketing
materials
Anything mission critical
without review
Yes
Productivity tool for
professional translation
Understand foreign patents
Localisation processes and
“controlled” content
High volume, e.g. eDiscovery
23. Use cases in practice
Product descriptions
to open new markets
MT for post-editing
productivity across
industries
Developer, and user
for web content
Tens of thousands of
people using online
tools daily
24. Neural Networks
Using artificial intelligence and deep learning to develop a
completely new way of doing machine translation!
Quality Estimation
Functionality through which machine translation can “self-
assess” the quality of the translations it produces.
Online Adaptive Translation
Machine translations that can automatically learn and improve
based on feedback, particularly from revisions.
Use-case specific MT
Just like patent MT, but for countless other areas.
Current Hot Topics
25. About Iconic
We are a Machine Translation and Natural
Language Processing software and
services provider, delivering expert
solutions with Subject Matter Expertise
30. Speed, Cost, and Quality
What is the difference between machine translation vs. manual translation when
translating a 10 page patent document from Chinese into English?
Machine Translation is not
designed to replace
professional translation but
there are many cases
where costly and time-
consuming manual
translation is simply not
necessary.
31. - Data confidentiality
- File formats
- Potential for customisation,
enhancements, and
improvement for specific
domains
32. More than just translation
DATA PROCESSING
E.G. OPTICAL CHARACTER
RECOGNITION, DIGITISATION
DATABASE BUILDING
E.G. COMBINING THE ABOVE, WITH
TRANSLATION, FOR EXPORT
DATA UNDERSTANDING
E.G. SUMMARISATION, CONCEPT &
KEY TERM IDENTIFICATION
INFORMATION EXTRACTION
E.G. CITATION ANALYSIS, CROSS-
LINGUAL SEARCH
34. Citation Analysis
Assessment of record and reference patterns Application for record extraction
Tracking variations across years
Application for bibliographic data fielding
Second point is important. It has different uses and usability. The concept of FAHQMT is no more. Focus is now on HAMT and PEMT.
The problem with rule-based systems is that they didn’t scale
You need bilingual experts for each language pair
SMT is the predominant approach
Starting point for all systems is data.
The most important aspect is the quality of the data…
They are essential and the quality is crucial.
The translations must be accurate and the alignment must be correct, otherwise we infer the wrong things and introduce “noise” into our systems.
How do we use these corpora? It’s all about learning and remembering things we’ve seen before, the same way you might go about translating something.
Ok, so the translation isn’t exactly right here. It should be “Je parle à la fille” but we haven’t seen enough examples (don’t have enough data) for reliable estimates; we’re just going on the counts of the words.
How likely a word is to translate to another word – as you have seen
How likely the different phrases are to translate as one another
What’s the likelihood a certain word will have a different position in the target sentence
Sometimes we take into account linguistic information about the words: if it is a verb, then it should go here, articles should precede nouns, etc.
Look at models of the target language and see if what we have produced makes sense (can these words go together in this order?)
Google Translate aims to be a general system, but what happens when you’re translating a sports website? Quality issues can be caused by the fact that there’s a lot of data in their models other than sports news.
Similarly, if I have a translation system for car manuals, it won’t be any good at translating sports websites.
This is reflected in our systems at IPTranslator too where all of our models are built using patents which have been filed in multiple languages to ensure we get the style correct
(patents are a bigger fish than this though)
The simple answer is that language is complex! Which is what makes it difficult to learn but also so interesting at the same time!
Who has the telescope, him or me?
New words, especially in patents. And new usage of words. The verb “to tweet” didn’t exist so long ago…
The last piece in the puzzle is understanding the languages you’re developing MT systems for. And that’s not understanding them in isolation: that’s understanding, for each language pair, what the differences are between them. For example, many of the things we need to look out for when developing English-Spanish translation engines we don’t need to do for French-Spanish translation.
With certain language pairs, things get more complex. The processes that we need to develop are harder to develop, less studied, require smarter people!
Chinese, need to identify these DE constructions so we know to move the head noun
No tense, going into English, how do we know what tense?
There’s no article! We have to generate it!
DE particle has many translations, which one!
FIRST THINGS FIRST, which ones are the words!? We need to segment the Chinese!
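To make the segmentation step concrete, here is a minimal sketch of one simple technique, forward maximum matching against a word list. This is a swapped-in textbook method, not necessarily what any production system uses, and the four-entry vocabulary is hypothetical, chosen to cover the example phrase from the slides.

```python
def segment(text, vocabulary, max_len=4):
    """Greedy forward maximum matching.

    At each position, take the longest vocabulary entry that matches;
    fall back to a single character when nothing matches.
    """
    tokens = []
    i = 0
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + length]
            if length == 1 or piece in vocabulary:
                tokens.append(piece)
                i += length
                break
    return tokens


# Hypothetical word list covering "the farmer who grows fruit".
vocabulary = {"种", "水果", "的", "农民"}
tokens = segment("种水果的农民", vocabulary)
print(tokens)  # ['种', '水果', '的', '农民']
```

Once the words are identified, later steps can look for the DE particle (的) and decide how to reorder around the head noun.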
ONLY WITH THESE SKILLS CAN YOU EXPLOIT THE TECHNOLOGY TO ITS FULLEST – AND WHAT DO WE GET IN DOING THIS? MT WITH SUBJECT MATTER EXPERTISE
**EFFECT ON FEASIBILITY**
Basically, some languages are easier for MT than others.
As a general rule, the closer two languages are to one another in terms of word order and grammatical structure, the easier it is.
Here are some rules of thumb (with English)
But of course it’s not just that easy.
Patents for example have a range of highly complex linguistic characteristics that make this challenging, both for PROFESSIONAL translators as well as for Translation Software.
Let’s look, for example, at this patent: what’s highlighted in blue is a SINGLE sentence (which is an individual legal claim).
Additionally, we have to deal with complex technical constructions such as chemical formulae, alphanumeric sequences, even genomic and amino acid sequences.
And then we have patents which introduce a whole new level of complexity on top of the language issues…
Patents are hard to read, never mind translate, never mind try to teach a computer how to translate them!
Sometimes it’s hard to tell whether the translation is bad or that’s simply how the original patent was written
Commercial machine translation is plagued with misleading marketing with unrealistic claims and promises - Need to manage expectations
When I say NO, I mean no in a fully-automatic manner with no human intervention
Filing – not when meaning is CRUCIAL
Publication – no, there will be errors
Marketing – no, not with subtleties, idioms, etc.
MT solutions and services provider, specializing in providing customised solutions with subject matter expertise for specific technical sectors, such as Patents/IP, life sciences, and financial.
We are the MT partner of choice for some of the world’s largest translation companies, information providers, and government and enterprise organisations.
For Translation Companies: We help translation companies to translate more content, more accurately for faster project turnaround, resulting in significant cost savings and increased revenue.
For Enterprise Clients: We help enterprises to translate more content in less time, resulting in faster products to market and enhanced global reach.
For Information Providers: We help information providers to translate knowledge, literature and documentary information faster and more accurately, resulting in broader knowledge offerings and faster time to market.
THERE’S VALUE TO BE ADDED, HOW CAN WE HARNESS?
We literally already have the perfect environment to allow NMT to be another string in the bow and let us use the most appropriate MT for the job
WHETHER IT BE NEURAL FOR KOREAN, FOR CHAT TEXT, OR WHATEVER THE CASE MAY BE
It’s not a one-size-fits-all solution and who knows when it will be, but we have developed a framework that allows us to leverage its strengths on a case-by-case basis to deliver the best possible translation for a given task.
Over time we fully expect the “brain to grow” and become the best MT on offer for various language pairs and content types, and when it is, WE’RE PERFECTLY POSITIONED FROM A TECHNOLOGY AND EXPERTISE PERSPECTIVE to capitalise on this wave.
We’ve launched a new product this year which is essentially repurposing the technology that we have and focusing on very particular use cases…
Firstly, let’s just look at the stark motivation for using MT for patent information in the first place…
The “standard” solution to the problem of foreign language documents is translations.
But translation is costly, not that quick, and often it is complete overkill for what is required!!
This is where MT comes in as a much more cost-effective, rapid solution that allows you to make a QUICK determination as to whether something is relevant or not before you invest in a professional translation.
And, while we all know that MT isn’t perfect, the reality of the situation is that the quality is often “good enough” or fit for the purpose of making this determination.
SO IT’S A NO-BRAINER
So going back to IPTranslator, the elephant in the room for us for a long time has been Google Translate. The first question we get asked always is “is it better than Google Translate?”
The answer is yes, the majority of the time for most of the languages that we cover. However, is that increase in quality enough to justify the cost of our service over Google, which is a free service? It’s hard to beat free! The reality now is that the “fit for purpose / good enough” quality level is something that Google can often achieve, especially since it started working with the EPO.
So where does IPTranslator fit?
Confidential Data
File formats incl. pdf
Potential for customisation, enhancements, and improvement for specific domains
Not just for patents, but for journals and other non-patent literature
Why was it challenging?
Exceptions to patterns
OCR errors
Lack of formatting information
The record extraction example is from Pattern B
The bib data example is from Pattern 5