‘Past, Present, and Future’
Machine Translation & Natural Language
Processing for Patent Information
Dr. John Tinsley
CEO, Iconic Translation Machines Ltd.
EPOPIC. Madrid. 10th November 2016
Why listen to me?
BSc in Computational Linguistics
PhD in Machine Translation
Language Technology consultant
Founder of Iconic Translation Machines
Machine Translation is what I do!
The world’s first and only patent-specific machine translation platform
Machine Translation: The Basics
Machine Translation = automatic translation
• The use of computers to translate from one language into another
• The use of computers to automate some, or all, of the translation process
Statistical Machine Translation (SMT)
• An approach to Machine Translation in which translations for an input are estimated based on previously seen translation examples and associated (inferred) probabilities
• e.g. IPTranslator, Google Translate
Other approaches
• Rule-based (or transfer-based): based on linguistic rules
  • e.g. Systran; Altavista’s Babelfish
• Example-based: based on translation examples and inferred linguistic patterns
SMT is now by far the predominant approach*
Bilingual Corpora
A corpus (pl. corpora) is a collection of texts, in electronic format, in a single language
• document(s)
• book(s)
A bilingual corpus is a collection of corresponding texts, in multiple languages
• a document & its translation
• a book in multiple languages
• European Parliament proceedings
Note: source language = original language, or the language we’re translating from; target language = the language we’re translating into
Aligned Bilingual Corpora
A document-aligned bilingual corpus corresponds on a document level
For translation, we require sentence-aligned bilingual corpora
• The sentence on line 1 in the source language text corresponds to (i.e. is a translation of) the sentence on line 1 in the target language text, and so on
• Often referred to as parallel aligned corpora
Sentence-aligned bilingual parallel corpora are essential for statistical machine translation
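The line-by-line correspondence described above can be sketched in a few lines of Python. This is only an illustration: the function name pair_sentences and the two in-memory texts are assumptions made for the example, and a real corpus would normally be read from a pair of files.

```python
from typing import List, Tuple


def pair_sentences(source_text: str, target_text: str) -> List[Tuple[str, str]]:
    """Pair line i of the source text with line i of the target text."""
    source_lines = source_text.splitlines()
    target_lines = target_text.splitlines()
    # A sentence-aligned corpus must have the same number of lines on both sides.
    if len(source_lines) != len(target_lines):
        raise ValueError("source and target have different numbers of lines")
    return [(s.strip(), t.strip()) for s, t in zip(source_lines, target_lines)]


# Hypothetical two-line English-Spanish corpus.
english = "I have a cat\nI have a dog\n"
spanish = "Tengo un gato\nTengo un perro\n"

pairs = pair_sentences(english, spanish)
for source_sentence, target_sentence in pairs:
    print(source_sentence, "<->", target_sentence)
```

Checking the line counts up front is a cheap way to catch a misaligned corpus before any statistics are computed from it.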
Learning from Previous Translations
Suppose we already know (from a sentence-aligned bilingual corpus) that:
• “dog” is translated as “perro”
• “I have a cat” is translated as “Tengo un gato”
We can theoretically translate:
• “I have a dog” → “Tengo un perro”
• Even though we have never seen “I have a dog” before
Statistical machine translation induces information about unseen input, based on previously known translations:
• Primarily co-occurrence statistics
• Takes contextual information into account
Statistical Machine Translation
• Example of a small sentence-aligned bilingual corpus for English-French
• We take some new sentence to translate
• From the corpus we can infer possible target (French) translations for various source (English) words
• We can then select the most probable translations based on simple frequencies (co-occurrence statistics)
Given a previously unseen input sentence, and our collated statistics, we can estimate a translation
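A toy sketch of the idea, in Python. The four-sentence English-French corpus is invented for illustration, and the Dice coefficient used to score word pairs is one simple choice of co-occurrence statistic assumed here, not necessarily the one behind the original slides. The translation is word by word, so it does not handle word order.

```python
from collections import Counter
from itertools import product

# Hypothetical sentence-aligned English-French corpus.
corpus = [
    ("the house", "la maison"),
    ("the blue house", "la maison bleue"),
    ("the flower", "la fleur"),
    ("the blue flower", "la fleur bleue"),
]

source_count = Counter()  # number of sentence pairs containing each English word
target_count = Counter()  # number of sentence pairs containing each French word
pair_count = Counter()    # number of sentence pairs containing both words

for english, french in corpus:
    english_words, french_words = set(english.split()), set(french.split())
    source_count.update(english_words)
    target_count.update(french_words)
    pair_count.update(product(english_words, french_words))


def dice(e: str, f: str) -> float:
    """Dice coefficient: high when e and f mostly appear together."""
    return 2 * pair_count[(e, f)] / (source_count[e] + target_count[f])


def best_translation(e: str) -> str:
    """Pick the French word with the highest score; keep unknown words as-is."""
    candidates = [f for (src, f) in pair_count if src == e]
    if not candidates:
        return e
    return max(candidates, key=lambda f: dice(e, f))


def translate(sentence: str) -> str:
    """Word-by-word translation; word order is left untouched."""
    return " ".join(best_translation(word) for word in sentence.split())


print(best_translation("house"))     # maison
print(translate("the blue flower"))  # la bleue fleur
```

The second output picks the right words but leaves them in English order ("la bleue fleur" rather than "la fleur bleue"), which is why real systems also model word order and fluency.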
Advanced MT
All modern approaches are based on building translations for complete sentences by putting together smaller pieces of translation
Previous example is very simplistic
• In reality, SMT systems calculate much more complex statistical models over millions of sentence pairs for a pair of languages
• Upwards of 2M sentence pairs on average for large-scale systems
Other statistics calculated include
• Word-to-word translation probabilities
• Phrase-to-phrase translation probabilities
• Word order probabilities
• Linguistic information (are the words nouns, verbs?)
• Fluency of the final output
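As one illustration of how "fluency of the final output" can be scored, statistical systems commonly use an n-gram language model trained on target-language text. The sketch below is a minimal bigram model with add-one smoothing; the tiny French corpus and all names are assumptions made for the example, not details from the talk.

```python
import math
from collections import Counter

# Hypothetical target-language (French) training text.
target_corpus = ["la maison bleue", "la fleur bleue", "la maison", "la fleur"]


def pad(sentence: str) -> list:
    """Add sentence-boundary markers so that first and last words are modelled."""
    return ["<s>"] + sentence.split() + ["</s>"]


history_count = Counter()  # how often each token is followed by another token
bigram_count = Counter()   # how often each (previous, word) pair occurs
vocab = set()

for sentence in target_corpus:
    tokens = pad(sentence)
    vocab.update(tokens)
    for prev, word in zip(tokens, tokens[1:]):
        history_count[prev] += 1
        bigram_count[(prev, word)] += 1


def log_prob(sentence: str) -> float:
    """Log-probability of a sentence under the add-one smoothed bigram model."""
    tokens = pad(sentence)
    total = 0.0
    for prev, word in zip(tokens, tokens[1:]):
        p = (bigram_count[(prev, word)] + 1) / (history_count[prev] + len(vocab))
        total += math.log(p)
    return total


fluent = log_prob("la fleur bleue")
disfluent = log_prob("la bleue fleur")
print(fluent > disfluent)  # True: the natural word order scores higher
```

A decoder can combine such a score with the translation probabilities to prefer candidate outputs that read naturally in the target language.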
Data is Key
For SMT, data is key
• Information (word/phrase correspondences and associated statistics) is only based on what we have seen before in the training data
It is important that the data used to train SMT systems is:
• Of sufficient size
  • avoid sparseness/skewed statistics
• Representative and relevant
  • contains the right type of language
• High-quality
  • absence of misspellings, incorrect alignments, etc.
  • Proofed by human translators
Why is MT Difficult?
A word or a phrase can have more than one meaning (ambiguity – lexical or structural)
• e.g. “bank”, “dive”, “I saw the man with the telescope”
People use language creatively
• New words are cropping up all the time
Linguistic differences between languages
• e.g. structure of Irish sentences vs. structure of English sentences:
  • “Tá (Is) ocras (hunger) orm (on me)” <-> “I am hungry”
There can be more than one way to express the same meaning
• “New York”, “The Big Apple”, “NYC”
• Israeli officials are responsible for airport security.
• Israel is in charge of the security at this airport.
• The security work for this airport is the responsibility of the Israel government.
• Israeli side was in charge of the security of this airport.
• Israel is responsible for the airport’s security.
• Israel is responsible for safety work at this airport.
• Israel presides over the security of the airport.
• Israel took charge of the airport security.
• The safety of this airport is taken charge of by Israel.
• This airport’s security is the responsibility of the Israeli security officials.
No single solution for all languages
English - Spanish
English - French
Number agreement: the house / the houses vs. la maison / les maisons
Gender agreement: the house / the cheese vs. la maison / le fromage
English - German
English - Chinese
种水果的农民
The farmer who grows fruit
[Lit: “grow fruit (particle) farmer”]
Not all languages are created equal
French German Turkish Finnish
Spanish Chinese Korean Hungarian
Portuguese Japanese Thai Basque
The Challenge of Patents
Long sentences
• Largest single document: 249,322 words
• Longest sentence: 1,417 words
Technical constructions
• “L is an organic group selected from -CH2-(OCH2CH2)n-, -CO-NR'-, with R'=H or C1-C4 alkyl group; n=0-8; Y=F, CF3 …”
• “maximum stress of 1.2 to 3.5 N/mm<2> and a maximum elongation of 700 to 1,300% at 0[deg.] C.”
Typical characteristics of patent text
• Very long sentences as standard
• Grammatically incomplete, using nominal and telegraphic style (!)
• Passive forms are frequent
• Frequent use of subordinate clauses, participles, implicit constructs
• Inconsistent and incorrect spelling
• High use of neologisms
• Instances of synonymy and polysemy
• Spurious use of punctuation
Authoring guide for “to be translated” text: patents break almost all of the rules!
Evaluating Machine Translation Quality
Automatic Evaluation
• Judge the quality of an MT system by comparing its output against a human-produced “reference” translation
• Pros: Quick, cheap, consistent
• Cons: Inflexible, cannot be used on ‘new’ input
Human Evaluation
• Pros: Reliable, flexible, multi-faceted (fluency, error analyses, benchmarking)
• Cons: Slow, expensive, subjective
Task-Based Evaluation
• Fluency vs. Adequacy
Task-Based Evaluation
• Standalone evaluation of MT systems is necessary to get a sense of the overall quality of a system
• To determine the ultimate usability of an MT system, intrinsic task-based evaluation is required
• Why? Fluency vs. Adequacy
Fluency: how fluent and grammatically correct the translation output is
Adequacy: how accurately the translation conveys the meaning of the source
Source: La gran casa roja
Output 1: The big blue house (fluent, but the colour is wrong, so not adequate)
Output 2: The big house red (conveys the meaning, but is not fluent)
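To make the automatic-evaluation idea concrete, here is a minimal Python sketch that compares MT output against a reference using clipped n-gram precision. This is a simplified building block of metrics such as BLEU, not a full metric, and the reference translation "The big red house" is an assumption added for the example.

```python
from collections import Counter


def ngrams(tokens: list, n: int) -> list:
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_precision(hypothesis: str, reference: str, n: int = 1) -> float:
    """Share of hypothesis n-grams that also occur in the reference.

    Each n-gram is credited at most as often as it appears in the reference.
    """
    hyp_counts = Counter(ngrams(hypothesis.lower().split(), n))
    ref_counts = Counter(ngrams(reference.lower().split(), n))
    total = sum(hyp_counts.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref_counts[gram]) for gram, count in hyp_counts.items())
    return matched / total


reference = "The big red house"  # assumed human translation of "La gran casa roja"
output_1 = "The big blue house"
output_2 = "The big house red"

print(clipped_precision(output_1, reference, n=1))  # 0.75
print(clipped_precision(output_2, reference, n=1))  # 1.0
print(clipped_precision(output_1, reference, n=2))  # one of three bigrams matches
print(clipped_precision(output_2, reference, n=2))  # one of three bigrams matches
```

On single words, Output 2 scores higher because every word appears in the reference, while on word pairs both outputs score the same. A single automatic score can therefore hide the difference between a fluency problem and an adequacy problem.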
Practical uses of Machine Translation
Understand its limitations and you’ll understand its capabilities!
No
• Translate a patent for filing
• Translate literature for publication
• Translate marketing materials
• Anything mission critical without review
Yes
• Productivity tool for professional translation
• Understand foreign patents
• Localisation processes and “controlled” content
• High volume, e.g. eDiscovery
Use cases in practice
• Product descriptions to open new markets
• MT for post-editing productivity across industries
• Developer, and user for web content
• Tens of thousands of people using online tools daily
Current Hot Topics
Neural Networks
• Using artificial intelligence and deep learning to develop a completely new way of doing machine translation!
Quality Estimation
• Functionality through which machine translation can “self-assess” the quality of the translations it produces.
Online Adaptive Translation
• Machine translations that can automatically learn and improve based on feedback, particularly from revisions.
Use-case specific MT
• Just like patent MT, but for countless other areas.
About Iconic
We are a Machine Translation and Natural
Language Processing software and
services provider, delivering expert
solutions with Subject Matter Expertise
Iconic Ensemble Architecture…
…enhanced with Neural MT
Speed, Cost, and Quality
What is the difference between machine translation and manual translation when translating a 10-page patent document from Chinese into English?
Machine Translation is not designed to replace professional translation, but there are many cases where costly and time-consuming manual translation is simply not necessary.
- Data confidentiality
- File formats
- Potential for customisation, enhancements, and improvement for specific domains
More than just translation
Data processing: e.g. optical character recognition, digitisation
Database building: e.g. combining the above, with translation, for export
Data understanding: e.g. summarisation, concept & key term identification
Information extraction: e.g. citation analysis, cross-lingual search
Record Extraction
Extraction algorithms work on cleaned
OCR output, using patterns, keywords,
and formatting information.
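A minimal Python sketch of pattern- and keyword-based extraction from cleaned OCR text. The record layout ("Application No. … Filed … Title …"), the field names, and the sample text are all invented for illustration; they are not the formats or algorithms used by Iconic.

```python
import re

# Hypothetical record layout: keywords mark each field, punctuation may be lost in OCR.
RECORD_PATTERN = re.compile(
    r"Application No\.?\s*:?\s*(?P<number>[A-Z]{2}\s?\d+)\s+"
    r"Filed\s*:?\s*(?P<filed>\d{1,2} \w+ \d{4})\s+"
    r"Title\s*:?\s*(?P<title>.+)"
)


def extract_records(text: str) -> list:
    """Return one dict per line that matches the assumed record layout."""
    records = []
    for line in text.splitlines():
        match = RECORD_PATTERN.search(line)
        if match:
            records.append({key: value.strip() for key, value in match.groupdict().items()})
    return records


# Invented sample of cleaned OCR output; the first line is a page header to be ignored.
ocr_text = """\
OFFICIAL GAZETTE  page 12
Application No.: EP 1234567 Filed: 12 March 2015 Title: Method for drying fruit
Application No EP7654321  Filed 3 June 2014   Title: Improved telescope mount
"""

records = extract_records(ocr_text)
for record in records:
    print(record["number"], "|", record["filed"], "|", record["title"])
```

The optional punctuation and flexible whitespace in the pattern are there because OCR output often drops or adds such characters, even after cleaning.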
Citation Analysis
Assessment of record and reference patterns
Application for record extraction
Tracking variations across years
Application for bibliographic data fielding
Reference extraction + fielding
Visit […].com and use the promo code epo2016 to get 20 free pages of translation
Thank You!
john@iptranslator.com
@IconicTrans

Past, Present, and Future: Machine Translation & Natural Language Processing for Patent Information

  • 1. ‘Past, Present, and Future’ Machine Translation & Natural Language Processing for Patent Information Dr. John Tinsley CEO, Iconic Translation Machines Ltd. EPOPIC. Madrid. 10th November 2016
  • 2. BSc in Computational Linguistics PhD in Machine Translation Language Technology consultant Founder of Iconic Translation Machines Why listen to me? Machine Translation is what I do! The world’s first and only patent specific machine translation platform
  • 3.  The use of computers to translate from one language into another  The use of computers to automate some, or all, of the translation process  An approach to Machine Translation, where translations for an input are estimated based on previous seen translation examples and associated (inferred) probabilities.  e.g. IPTranslator, Google Translate  Rule-based (or transfer-based): based on linguistic rules • e.g. Systran; Altavista’s Babelfish  Example-based: based on translation examples and inferred linguistic patterns Machine Translation: The Basics Machine Translation = automatic translation Statistical Machine Translation (SMT) Other approaches SMT is now by far the predominant approach*
  • 4. A corpus (pl. corpora) is a collection of texts, in electronic format, in a single language  document(s)  book(s) Bilingual Corpora a bilingual corpus Note source language = original language or language we’re translating from target language = language we’re translating into A bilingual corpus is a collection of corresponding texts, in multiple languages  a document & its translation  a book in multiple languages  European Parliament proceedings
  • 5. Aligned Bilingual Corpora A document-aligned bilingual corpus corresponds on a document level For translation, we required sentence-aligned bilingual corpora  The sentence on line 1 in the source language text corresponds to (i.e. is a translation of) the sentence on line 1 in the target language text etc.  Often referred to as parallel aligned corpora Sentence aligned bilingual parallel corpora are essential for statistical machine translation
  • 6. Learning from Previous Translations Suppose we already know (from a sentence-aligned bilingual corpus) that:  “dog” is translated as “perro”  “I have a cat” is translated as “Tengo un gato” We can theoretically translate:  “I have a dog”  “Tengo un perro”  Even though we have never seen “I have a dog” before Statistical machine translation induces information about unseen input, based on previously known translations:  Primarily co-occurrence statistics  Takes contextual information into account
  • 7. Statistical Machine Translation  Example of a small sentence-aligned bilingual corpus for English-French
  • 8. Statistical Machine Translation  We take some new sentence to translate
  • 9. Statistical Machine Translation  From the corpus we can infer possible target (French) translations for various source (English) words  We can then select the most probable translations based on simple frequencies (co-occurrence statistics)
  • 10. Statistical Machine Translation Given a previously unseen input sentence, and our collated statistics, we can estimate translation
  • 11. Advanced MT All modern approaches are based on building translations for complete sentences by putting together smaller pieces of translation Previous example is very simplistic  In reality SMT systems calculate much more complex statistical models over millions of sentence pairs for a pair of languages  Upwards of 2M sentence pairs on average for large-scale systems  Word-to-word translation probabilities  Phrase-to-phrase translation probabilities  Word order probabilities  Linguistic information (are the words nouns, verbs?)  Fluency of the final output Previous example is very simplistic Other statistics calculated include
  • 12. Data is Key For SMT data is key  Information (word/phrase correspondences and associated statistics) is only based on what we have seen before in the data Important that data used to train SMT systems is:  Of sufficient size  avoid sparseness/skewed statistics  Representative and relevant  contains the right type of language  High-quality  absence of misspellings, incorrect alignments etc.  Proofed by human translators training data
  • 13. Why is MT Difficult? A word or a phrase can have more than one meaning (ambiguity – lexical or structural)  e.g. “bank”, “dive”, “I saw the man with the telescope” People use language creatively  New words are cropping up all the time Linguistic differences between languages  e.g. structure of Irish sentences vs. structure of English sentences:  “Tá (Is) ocras (hunger) orm (on me)” <-> “I am hungry” There can be more than one way to express the same meaning.  “New York”, “The Big Apple”, “NYC”
  • 14. Why is MT Difficult?  Israeli officials are responsible for airport security.  Israel is in charge of the security at this airport.  The security work for this airport is the responsibility of the Israel government.  Israeli side was in charge of the security of this airport.  Israel is responsible for the airport’s security.  Israel is responsible for safety work at this airport.  Israel presides over the security of the airport.  Israel took charge of the airport security.  The safety of this airport is taken charge of by Israel.  This airport’s security is the responsibility of the Israeli security officials.
  • 15. No single solution for all languages Number agreement: the house / the houses vs. la maison / les maisons Gender agreement: the house / the cheese vs. la maison / le frommage English - Spanish English - French
  • 16. No single solution for all languages English - German English - Chinese 种水果的农民 The farmer who grows fruit [Lit: “grow fruit (particle) farmer”]
  • 17. Not all languages are created equal French German Turkish Finnish Spanish Chinese Korean Hungarian Portuguese Japanese Thai Basque
  • 18. The Challenge of Patents L is an organic group selected from -CH2- (OCH2CH2)n-, -CO-NR'-, with R'=H or C1-C4 alkyl group; n=0-8; Y=F, CF3 … maximum stress of 1.2 to 3.5 N/mm<2> and a maximum elongation of 700 to 1,300% at 0[deg.] C. Long Sentences Technical constructions Largest single document: 249,322 words Longest Sentence: 1,417 words
  • 19. The Challenge of Patents Very long sentences as standard Grammatically incomplete using nominal and telegraphic style (!) Passive forms are frequent Frequent use of subordinate clauses, participles, implicit constructs Inconsistent and incorrect spelling High use of neologisms Instances of synonymy and polysemy Spurious use of punctuation Authoring guide for “to be translated” text Patents break almost all of the rules!
  • 20. Evaluating Machine Translation Quality Automatic Evaluation: judge the quality of an MT system by comparing its output against a human-produced “reference” translation  Pros: Quick, cheap, consistent  Cons: Inflexible, cannot be used on ‘new’ input Human Evaluation  Pros: Reliable, flexible, multi-faceted (fluency, error analyses, benchmarking)  Cons: Slow, expensive, subjective  Fluency vs. Adequacy Task-Based Evaluation
  • 21. Evaluating Machine Translation Quality Task Based Evaluation  Standalone evaluation of MT systems is necessary to get a sense of the overall quality of a system  To determine the ultimate usability of an MT system, intrinsic task-based evaluation is required  Why? Fluency vs. Adequacy Fluency how fluent and grammatically correct the translation output is Adequacy how accurately the translation conveys the meaning of the source Output 1 The big blue house Output 2 The big house red Source La gran casa roja Task-Based Evaluation
  • 22. Practical uses of Machine Translation Understand its limitations and you’ll understand its capabilities! No  Translate a patent for filing  Translate literature for publication  Translate marketing materials  Anything mission critical without review Yes  Productivity tool for professional translation  Understand foreign patents  Localisation processes and “controlled” content  High volume, e.g. eDiscovery
  • 23. Use cases in practice Product descriptions to open new markets MT for post-editing productivity across industries Developer, and user for web content Tens of thousands of people using online tools daily
  • 24. Neural Networks  Using artificial intelligence and deep learning to develop a completely new way of doing machine translation! Quality Estimation  Functionality through which machine translation can “self-assess” the quality of the translations it produces. Online Adaptive Translation  Machine translations that can automatically learn and improve based on feedback, particularly from revisions. Use-case specific MT  Just like patent MT, but for countless other areas. Current Hot Topics
  • 25. About Iconic We are a Machine Translation and Natural Language Processing software and services provider, delivering expert solutions with Subject Matter Expertise
  • 30. Speed, Cost, and Quality What is the difference between machine translation and manual translation when translating a 10-page patent document from Chinese into English? Machine Translation is not designed to replace professional translation but there are many cases where costly and time-consuming manual translation is simply not necessary.
  • 31. - Data confidentiality - File formats - Potential for customisation, enhancements, and improvement for specific domains
  • 32. More than just translation DATA PROCESSING E.G. OPTICAL CHARACTER RECOGNITION, DIGITISATION DATABASE BUILDING E.G. COMBINING THE ABOVE, WITH TRANSLATION, FOR EXPORT DATA UNDERSTANDING E.G. SUMMARISATION, CONCEPT & KEY TERM IDENTIFICATION INFORMATION EXTRACTION E.G. CITATION ANALYSIS, CROSS-LINGUAL SEARCH
  • 33. Record Extraction Extraction algorithms work on cleaned OCR output, using patterns, keywords, and formatting information.
  • 34. Citation Analysis Assessment of record and reference patterns Application for record extraction Tracking variations across years Application for bibliographic data fielding
  • 36. .com Visit and use the promo code epo2016 to get 20 free pages of translation
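
A minimal sketch of the automatic evaluation idea from slide 20: compare MT output against a human-produced reference by counting the word sequences they share. The reference translation "The big red house" and the function names are assumptions made for illustration, not taken from the slides; real metrics typically combine several n-gram lengths and a length penalty.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams in a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_precision(candidate, reference, n):
    """Fraction of candidate n-grams that also occur in the reference.

    Each candidate n-gram is credited at most as many times as it
    appears in the reference ("clipping").
    """
    cand = Counter(ngrams(candidate.lower().split(), n))
    ref = Counter(ngrams(reference.lower().split(), n))
    total = sum(cand.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref[gram]) for gram, count in cand.items())
    return matched / total


# Assumed reference for the source "La gran casa roja" (not from the slides).
reference = "The big red house"
output_1 = "The big blue house"   # fluent, but the colour is wrong
output_2 = "The big house red"    # right words, wrong order

for name, output in [("Output 1", output_1), ("Output 2", output_2)]:
    unigram = clipped_precision(output, reference, 1)
    bigram = clipped_precision(output, reference, 2)
    print(f"{name}: unigram={unigram:.2f} bigram={bigram:.2f}")
```

On this toy data, Output 2 scores higher on single words even though its word order is wrong, and both outputs tie on word pairs. That is one reason automatic scores are usually read alongside human judgements of fluency and adequacy.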

Editor's notes

  1. Second point is important. It has different uses and usability. The concept of FAHQMT is no more. Focus is now on HAMT and PEMT. The problem with rule-based systems is that they didn’t scale: you need bilingual experts for each language pair. SMT is the predominant approach.
  2. Starting point for all systems is data. The most important aspect is the quality of the data…
  3. They are essential and the quality is crucial. The translations must be accurate and the alignment must be correct, otherwise we infer the wrong things and introduce “noise” into our systems.
  4. How do we use these corpora? It’s all about learning and remembering things we’ve seen before, the same way you might go about translating something
  5. Ok, so the translation isn’t exactly right here. It should be “Je parle à la fille” but we haven’t seen enough examples (don’t have enough data) for reliable estimates; we’re just going on the counts of the words.
  6. How likely a word is to translate to another word, as you have seen. How likely the different phrases are to translate as one another. What’s the likelihood a certain word will have a different position in the target sentence? Sometimes we take into account linguistic information about the words: if it is a verb, then it should go here; articles should precede nouns, etc. We look at models of the target language and see if what we have produced makes sense (can these words go together in this order?).
  7. Google Translate aims to be a general system, but what happens when you’re translating a sports website? Quality issues can be caused by the fact that there’s a lot of other data in their models than sports news. Similarly, if I have a translation system for car manuals, it won’t be any good at translating sports websites. This is reflected in our systems at IPTranslator too, where all of our models are built using patents which have been filed in multiple languages to ensure we get the style correct (patents are a bigger fish than this though).
  8. The simple answer is that language is complex! Which is what makes it difficult to learn but also so interesting at the same time! Who has the telescope, him or me? New words, especially in patents. And new usage of words. The verb “to tweet” didn’t exist so long ago…
  9. The last piece in the puzzle is understanding the languages you’re developing MT systems for. And that’s not understanding them in isolation – that’s understanding, for each language pair, what the differences are between them, e.g. many of the things we need to look out for when developing English-Spanish translation engines we don’t need to do for French-Spanish translation
  10. With certain language pairs, things get more complex. The processes that we need to develop are harder to develop, less studied, and require smarter people! For Chinese, we need to identify these DE constructions so we know to move the head noun. There’s no tense, so going into English, how do we know what tense to use? There’s no article! We have to generate it! The DE particle has many translations: which one? FIRST THINGS FIRST, which ones are the words!? We need to segment the Chinese! ONLY WITH THESE SKILLS CAN YOU EXPLOIT THE TECHNOLOGY TO ITS FULLEST – AND WHAT DO WE GET IN DOING THIS? MT WITH SUBJECT MATTER EXPERTISE
  11. **EFFECT ON FEASIBILITY** Basically, some languages are easier for MT than others. As a general rule, the closer two languages are to one another in terms of word order and grammatical structure, the easier. Here are some rules of thumb (with English).
  12. But of course it’s not just that easy. Patents, for example, have a range of highly complex linguistic characteristics that make this challenging, both for PROFESSIONAL translators as well as for Translation Software. Let’s look, for example, at this patent: what’s highlighted in blue is a SINGLE sentence (which is an individual legal claim). Additionally, we have to deal with complex technical constructions such as chemical formulae, alphanumeric sequences, even genomic and amino acid sequences. And then we have patents which introduce a whole new level of complexity on top of the language issues… Patents are hard to read, never mind translate, never mind try to teach a computer how to translate them!
  13. Sometimes it’s hard to tell whether the translation is bad or that’s simply how the original patent was written
  14. Commercial machine translation is plagued with misleading marketing, with unrealistic claims and promises. We need to manage expectations. When I say NO, I mean no in a fully-automatic manner with no human intervention. Filing: not when meaning is CRUCIAL. Publication: no, there will be errors. Marketing: no, not with subtleties, idioms, etc.
  15. MT solutions and services provider, specializing in providing customised solutions with subject matter expertise for specific technical sectors, such as Patents/IP, life sciences, and financial. We are the MT partner of choice for some of the world’s largest translation companies, information providers, and government and enterprise organisations. For Translation Companies: We help translation companies to translate more content, more accurately for faster project turnaround, resulting in significant cost savings and increased revenue. For Enterprise Clients: We help enterprises to translate more content in less time, resulting in faster products to market and enhanced global reach. For Information Providers: We help information providers to translate knowledge, literature and documentary information faster and more accurately, resulting in broader knowledge offerings and faster time to market.
  16. THERE’S VALUE TO BE ADDED, HOW CAN WE HARNESS? We literally already have the perfect environment to allow NMT to be another string in the bow and let us use the most appropriate MT for the job WHETHER IT BE NEURAL FOR KOREAN, FOR CHAT TEXT, OR WHATEVER THE CASE MAY BE
  17. It’s not a one-size-fits-all solution and who knows when it will be, but we have developed a framework that allows us to leverage its strength on a case-by-case basis to deliver the best possible translation for a given task. Over time we fully expect the “brain to grow” and become the best MT on offer for various language pairs and content types, and when it is, WE’RE PERFECTLY POSITIONED FROM A TECHNOLOGY AND EXPERTISE PERSPECTIVE to capitalise on this wave.
  18. We’ve launched a new product this year which is essentially repurposing the technology that we have and focusing on very particular use cases… Firstly, let’s just look at the stark motivation for using MT for patent information in the first place…
  19. The “standard” solution to the problem of foreign language documents is translation. But translation is costly, not that quick, and often it is complete overkill for what is required!! This is where MT comes in as a much more cost-effective, rapid solution that allows you to make a QUICK determination as to whether something is relevant or not before you invest in a professional translation. And, while we all know that MT isn’t perfect, the reality of the situation is that the quality is often “good enough” or fit for the purpose of making this determination. SO IT’S A NO-BRAINER
  20. So going back to IPTranslator, the elephant in the room for us for a long time has been Google Translate. The first question we always get asked is “is it better than Google Translate?” The answer is yes, the majority of the time for most of the languages that we cover. However, is that increase in quality enough to justify the cost of our service over Google, which is a free service? It’s hard to beat free! The reality now is that the “fit for purpose / good enough” quality level is something that Google can often achieve, especially since it started working with the EPO. So where does IPTranslator fit? Confidential data; file formats incl. PDF; potential for customisation, enhancements, and improvement for specific domains.
  21. Not just for patents, but for journals and other non-patent literature
  22. Why was it challenging? Exceptions to patterns, OCR errors, and lack of formatting information.
  23. The record extraction example is from Pattern B. The bib data example is from Pattern 5.
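
Notes 5 and 6 describe estimating how likely one word is to translate as another from counts over sentence-aligned data. The sketch below is a deliberately naive illustration of that idea: it counts every source/target word pair that co-occurs in an aligned sentence pair and normalises the counts. The toy corpus and the function name are hypothetical, and production SMT systems use iterative word-alignment models rather than raw co-occurrence counts.

```python
from collections import Counter, defaultdict


def cooccurrence_probabilities(sentence_pairs):
    """Estimate p(target word | source word) from raw co-occurrence counts.

    sentence_pairs: list of (source_sentence, target_sentence) strings,
    assumed to be sentence-aligned (line i translates line i).
    """
    counts = defaultdict(Counter)
    for source, target in sentence_pairs:
        for s in source.lower().split():
            for t in target.lower().split():
                counts[s][t] += 1

    probabilities = {}
    for s, target_counts in counts.items():
        total = sum(target_counts.values())
        probabilities[s] = {t: c / total for t, c in target_counts.items()}
    return probabilities


# Hypothetical three-line English-French corpus, far too small to be reliable.
toy_corpus = [
    ("the girl", "la fille"),
    ("the house", "la maison"),
    ("the girl speaks", "la fille parle"),
]

probs = cooccurrence_probabilities(toy_corpus)
for source_word in ("the", "girl"):
    ranked = sorted(probs[source_word].items(), key=lambda kv: -kv[1])
    print(source_word, [(t, round(p, 2)) for t, p in ranked])
```

With so little data, "girl" is judged equally likely to translate as "la" or "fille", which illustrates the point in note 5: without enough examples the counts do not give reliable estimates.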