Google BERT and Family and the Natural Language Understanding Leaderboard Race

Natural language understanding and word sense disambiguation remain among the prevailing challenges for both the conversational and the written word. Natural language understanding attempts to untangle the 'hot mess' of words that sits between the more structured data in content, but the challenge is not trivial, since there is so much polysemy in language. Recent developments in machine learning have produced significant leaps forward in understanding a word's context (and therefore user intent and informational need at the time of a query). Here we explore these developments and some of their implementations, and seek to understand what this means for search strategists and the brands they support, both now and into the future.

Published in: Marketing


  1. 1. #pubcon Google BERT & Family & The Natural Language Understanding Leaderboard Race Presented by: Dawn Anderson @BeBertey
  2. 2. #pubcon The Problem with Words
  3. 3. #pubcon
  4. 4. #pubcon Words are problematic. Ambiguous… polysemous… synonymous
  5. 5. #pubcon Ambiguity and Polysemy Almost every other word in the English language has multiple meanings
  6. 6. #pubcon In spoken word it is even worse because of homophones and prosody
  7. 7. #pubcon Like “four candles” and “fork handles”
  8. 8. #pubcon Which does not bode well for conversational search into the future
  9. 9. #pubcon Today’s Topic: Current Search Engine Solutions For Dealing with the Problem of Words
  10. 10. #pubcon MS MARCO
  11. 11. #pubcon Meet Bertey & Tedward
  12. 12. #pubcon Word’s Context • ”The meaning of a word is its use in a language” (Ludwig Wittgenstein, Philosopher, 1953) • Image attribution: Moritz Nähr [Public domain]
  13. 13. #pubcon Word’s Context Changes As A Sentence Evolves • The meaning of a word changes (literally) as a sentence develops • Due to the multiple parts of speech a word could take in a given context
  14. 14. #pubcon Like “like” We can see in just this short sentence alone, using the Stanford Part of Speech Tagger online, that the word like is tagged as two separate parts of speech http://nlp.stanford.edu:8080/parser/index.jsp
  15. 15. #pubcon Like “like” For example: The word ”like” has several possible parts of speech (including ‘verb’, ‘noun’, ‘adjective’) POS = Part of Speech
  16. 16. #pubcon An important part of this is ‘Part of Speech’ (POS) tagging
  17. 17. #pubcon Chunking and Tokenization
  18. 18. #pubcon Natural language understanding is NOT structured data
  19. 19. #pubcon Structured data helps to disambiguate but what about the ‘hot mess’ in between?
  20. 20. #pubcon Part of Speech Tagging (POS)
  21. 21. #pubcon Example Part of Speech Tagging (POS) • Pubcon → NNP (proper noun, singular) • is → VBZ (verb, 3rd person singular, present) • a → DT (determiner) • great → JJ (adjective) • conference → NN (noun)
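As a rough illustration of POS tagging, here is a minimal sketch using NLTK's Penn Treebank-style tagger. NLTK is an assumption for illustration; the deck itself points to the Stanford Part of Speech Tagger.

    import nltk

    # One-off downloads of tokenizer and tagger data (resource names may differ slightly between NLTK versions)
    nltk.download("punkt", quiet=True)
    nltk.download("averaged_perceptron_tagger", quiet=True)

    # Penn Treebank-style tags, as on the slide: NNP, VBZ, DT, JJ, NN
    print(nltk.pos_tag(nltk.word_tokenize("Pubcon is a great conference")))

    # "like" shifts part of speech depending on its context in the sentence
    print(nltk.pos_tag(nltk.word_tokenize("I like conferences like Pubcon")))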
  22. 22. #pubcon Popular POS (Part of Speech) Taggers • Penn Treebank Tagger -> 36 different part of speech tags • CLAWS 7 (C7) Tagset -> 146 different part of speech tags • Brown Corpus Tagger -> 81 different part of speech tags
  23. 23. #pubcon Pronouns are problematic too
  24. 24. #pubcon Computer programs lose track of who is who easily I’m confused… Here… Have some flowers instead
  25. 25. #pubcon Named Entity Recognition is NOT Named Entity Disambiguation
  26. 26. #pubcon
  27. 27. #pubcon
  28. 28. #pubcon Ontology Driven Natural Language Processing Image credit: IBM https://www.ibm.com/developerworks/community/blogs/nlp/entry/ontology_driven_nlp
  29. 29. #pubcon But even named entities can be polysemic
  30. 30. #pubcon Did you mean? • Amadeus Mozart (composer) • Mozart Street • Mozart Cafe
  31. 31. #pubcon AND VERBALLY…WHO (WHAT) ARE YOU TALKING ABOUT? ”LYNDSEY DOYLE” OR ”LINSEED OIL”?
  32. 32. #pubcon AND NOT EVERYONE OR THING IS MAPPED TO THE KNOWLEDGE GRAPH
  33. 33. #pubcon
  34. 34. #pubcon
  35. 35. #pubcon EVEN IF WE UNDERSTAND THE ENTITY (THING) ITSELF WE NEED TO UNDERSTAND WORD’S CONTEXT
  36. 36. #pubcon Semantic context matters • He kicked the bucket • I have yet to cross that off my bucket list • The bucket was filled with water
  37. 37. #pubcon How can search engines fill in the gaps between named entities?
  38. 38. #pubcon When they can’t even tell the difference between Pomeranians and pancakes
  39. 39. #pubcon They need ‘Text cohesion’ Cohesion is the grammatical and lexical linking within a text or sentence that holds a text together and gives it meaning. Without surrounding words the word bucket could mean anything in a sentence
  40. 40. #pubcon Word’s Company “You shall know a word by the company it keeps” (John Rupert Firth, Linguist,1957) Image Attribution: Wikimedia Commons Public Domain
  41. 41. #pubcon Words That Live Together Are Strongly Connected • Co-occurrence • Co-occurrence provides context • Co-occurrence changes word’s meaning • Words that share similar neighbours are also strongly connected • Similarity & relatedness
  42. 42. #pubcon Natural Language Disambiguation
  43. 43. #pubcon Natural Language Recognition is NOT Understanding • Natural language understanding requires understanding of context and common sense reasoning. VERY challenging for machines, but largely straightforward for humans.
  44. 44. #pubcon Language models are trained on very large text corpora or collections (loads of words) to learn distributional similarity
  45. 45. #pubcon Vector representations of words (Word Vectors)
  46. 46. #pubcon And build vector space models for word embeddings king - man + woman = queen
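A quick sketch of that analogy with pre-trained vectors. The gensim library and the downloadable "glove-wiki-gigaword-100" vectors are assumptions for illustration; the deck itself only names Word2Vec, GloVe and TensorFlow.

    import gensim.downloader as api

    # Load pre-trained 100-dimensional GloVe vectors (downloads on first use)
    vectors = api.load("glove-wiki-gigaword-100")

    # king - man + woman ≈ queen
    print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1))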
  47. 47. #pubcon A Moving Word ‘Context Window’
  48. 48. #pubcon Typical window size might be 5. Source text: “Writing a list of random sentences is harder than I initially thought it would be”, with the context window sliding across the sentence: 11 words in view (the target word plus 5 to its left and 5 to its right)
  49. 49. #pubcon Example context window size 3. Source text: “The quick brown fox jumps over the lazy dog”. Training samples: target “the” → (the, quick), (the, brown), (the, fox); target “quick” → (quick, the), (quick, brown), (quick, fox), (quick, jumps); and so on for the remaining words
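A minimal sketch of how those (target, context) training pairs can be generated from a moving window. Plain Python; the helper name skipgram_pairs is just illustrative, not from the deck.

    def skipgram_pairs(tokens, window=3):
        """Generate (target, context) training pairs from a moving context window."""
        pairs = []
        for i, target in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            pairs.extend((target, tokens[j]) for j in range(lo, hi) if j != i)
        return pairs

    sentence = "the quick brown fox jumps over the lazy dog".split()
    print(skipgram_pairs(sentence)[:7])
    # [('the', 'quick'), ('the', 'brown'), ('the', 'fox'), ('quick', 'the'), ...]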
  50. 50. #pubcon A Moving Word ‘Context Window’
  51. 51. #pubcon TensorFlow (tool) & e.g. Word2Vec or GloVe (language models)
  52. 52. #pubcon Continuous Bag of Words (CBoW) (method) or Skip-gram (the opposite of CBoW). Continuous Bag of Words: take a continuous bag of words (with no word order) inside a context window of size n and predict the target word; Skip-gram predicts the context words from the target. Words which are similar or related end up close together (e.g. by Euclidean distance) in the resulting vector models and word embeddings
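A small sketch of training such a model. The gensim Word2Vec implementation is an assumption for illustration; the deck names Word2Vec/GloVe and TensorFlow as the tooling.

    from gensim.models import Word2Vec

    sentences = [
        "writing a list of random sentences is harder than i initially thought".split(),
        "the quick brown fox jumps over the lazy dog".split(),
    ]

    # sg=0 -> CBoW (predict the target word from its context window)
    # sg=1 -> skip-gram (predict the context words from the target word)
    model = Word2Vec(sentences, vector_size=50, window=5, min_count=1, sg=1, epochs=50)
    print(model.wv.most_similar("fox", topn=3))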
  53. 53. #pubcon Models learn the weights of the similarity and relatedness distances
  54. 54. #pubcon Layers Everywhere
  55. 55. Concept2Vec
  56. 56. #pubcon Google’s Topic Layer is a new Layer in the Knowledge Graph
  57. 57. #pubcon EXAMPLE MICROSOFT CONCEPT DISTRIBUTION LAYER
  58. 58. #pubcon PAST LANGUAGE MODELS (E.G. WORD2VEC & GLOVE) BUILT CONTEXT-FREE WORD EMBEDDINGS
  59. 59. #pubcon Most language modellers are uni-directional. Source text: “Writing a list of random sentences is harder than I initially thought it would be”, with the context window sliding across the sentence. They can traverse over the word’s context window from only left to right or right to left. Only in one direction, but not both at the same time
  60. 60. #pubcon They can only look at the words in the context window before the target, and not at the words in the rest of the sentence, nor at the sentence that follows
  61. 61. #pubcon OFTEN THE NEXT SENTENCE REALLY MATTERS
  62. 62. #pubcon I Remember When My Grandad Kicked The Bucket BERT is able to understand the NEXT sentence The NEXT sentence here provides the context
  63. 63. #pubcon “How far do you reckon I could kick this bucket?”
  64. 64. #pubcon Did you mean “bank”? Or did you mean “bank”?
  65. 65. #pubcon NER Example • E.g. Sentence: “Taylor Swift will launch her new album in Apple Music.” • NER result: “Taylor[B-PER] Swift[I-PER] will[O] launch[O] her[O] new[O] album[O] in[O] Apple[B-ORG] Music[I-ORG].[O]” • PS: [O] means outside any entity, [B-PER]/[I-PER] mark a person name, [B-ORG]/[I-ORG] mark an organization name Source: https://medium.com/@yingbiao/ner-with-bert-in-action-936ff275bc73
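A rough sketch of running that example with the HuggingFace transformers library. The specific "dslim/bert-base-NER" checkpoint is an assumption for illustration, not something the deck or the cited article prescribes.

    from transformers import pipeline

    ner = pipeline("token-classification", model="dslim/bert-base-NER", aggregation_strategy="simple")
    for entity in ner("Taylor Swift will launch her new album in Apple Music."):
        print(entity["word"], entity["entity_group"], round(float(entity["score"]), 3))
    # Expected roughly: "Taylor Swift" -> PER, "Apple Music" -> ORG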
  66. 66. #pubcon Meet BERT
  67. 67. #pubcon Not the pomeranian BERT
  68. 68. #pubcon BERT (Bidirectional Encoder Representations from Transformers)
  69. 69. #pubcon Transformers (attention over all positions simultaneously)
  70. 70. #pubcon 11 NLP Tasks • BERT advances the state of the art (SOTA) on 11 NLP tasks
  71. 71. #pubcon BERT is different. BERT uses bi-directional language modelling. The FIRST to do this. Source text: “Writing a list of random sentences is harder than I initially thought it would be”. BERT can see both the left and the right hand side of the target word
  72. 72. #pubcon BERT HAS BEEN OPEN SOURCED BY GOOGLE AI
  73. 73. #pubcon Google’s move to open source BERT may change natural language processing forever
  74. 74. #pubcon Bert uses ‘Transformers’ & ’Masked Language Modelling’
  75. 75. #pubcon Masked Language Modelling Stops The Target Word From Seeing Itself
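A minimal sketch of masked language modelling in action, using the HuggingFace fill-mask pipeline with the public bert-base-uncased checkpoint. The library call is an assumption for illustration, not part of the deck.

    from transformers import pipeline

    # The [MASK] token hides the target word from itself; BERT predicts it from both sides
    fill = pipeline("fill-mask", model="bert-base-uncased")
    for prediction in fill("He kicked the [MASK]."):
        print(prediction["token_str"], round(prediction["score"], 3))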
  76. 76. #pubcon BERT can see the WHOLE sentence on either side of a word (contextual language modelling) and all of the words almost at once
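To make "contextual" concrete, here is a sketch comparing BERT's embeddings for the two senses of “bank” from the earlier slide. The transformers and torch libraries are assumed, and the word_vector helper is purely illustrative.

    import torch
    from transformers import BertModel, BertTokenizer

    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    model = BertModel.from_pretrained("bert-base-uncased")

    def word_vector(sentence, word):
        """Return the contextual embedding of `word` within `sentence`."""
        inputs = tokenizer(sentence, return_tensors="pt")
        with torch.no_grad():
            hidden = model(**inputs).last_hidden_state[0]
        tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
        return hidden[tokens.index(word)]

    river = word_vector("he sat on the bank of the river", "bank")
    money = word_vector("she deposited cash at the bank", "bank")
    # Context-free embeddings (Word2Vec/GloVe) give one vector per word; BERT's vectors differ by context
    print(torch.cosine_similarity(river.unsqueeze(0), money.unsqueeze(0)))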
  77. 77. #pubcon BERT has been pre-trained on a lot of words … on the whole of the English Wikipedia (2,500 million words)
  78. 78. #pubcon Previously Uni-Directional Previously all language models were uni-directional, so they could only move the context window in one direction: a moving window of ‘n’ words (either left or right of a target word) to understand a word’s context
  79. 79. #pubcon Google BERT Paper • Devlin, J., Chang, M.W., Lee, K. and Toutanova, K., 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
  80. 80. #pubcon BERT can identify which sentence likely comes next from two choices
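A sketch of that next-sentence prediction head using BertForNextSentencePrediction from the transformers library. The first sentence is the grandad/bucket example from earlier; the candidate second sentence is an illustrative invention, not from the deck.

    import torch
    from transformers import BertForNextSentencePrediction, BertTokenizer

    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    model = BertForNextSentencePrediction.from_pretrained("bert-base-uncased")

    first = "I remember when my grandad kicked the bucket."
    candidate = "The whole family went to his funeral."  # illustrative candidate next sentence

    inputs = tokenizer(first, candidate, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits

    # Index 0 = "candidate is the next sentence", index 1 = "it is not"
    print(torch.softmax(logits, dim=-1))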
  81. 81. #pubcon THE ML & NLP COMMUNITY ARE VERY EXCITED ABOUT BERT
  82. 82. #pubcon EVERYBODY WANTS TO ‘BUILD-A-BERT’. NOW THERE ARE LOADS OF ALGORITHMS BUILT ON BERT
  83. 83. #pubcon VANILLA BERT PROVIDES A PRE-TRAINED STARTING POINT LAYER FOR NEURAL NETWORKS IN MACHINE LEARNING & DIVERSE NATURAL LANGUAGE TASKS
  84. 84. #pubcon Whilst BERT has been pre-trained on Wikipedia, it is fine-tuned on ‘question and answer datasets’
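A sketch of what such fine-tuned question answering looks like in practice. The SQuAD-fine-tuned "distilbert-base-cased-distilled-squad" checkpoint and the transformers pipeline are assumptions for illustration.

    from transformers import pipeline

    qa = pipeline("question-answering", model="distilbert-base-cased-distilled-squad")
    result = qa(
        question="What has BERT been pre-trained on?",
        context="BERT has been pre-trained on the whole of the English Wikipedia, around 2,500 million words.",
    )
    print(result["answer"], round(result["score"], 3))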
  85. 85. #pubcon Andrei Broder’s Call to Arms in Assistive AI
  86. 86. #pubcon Researchers compete over Natural Language Understanding with e.g. SQuAD (Stanford Question Answering Dataset)
  87. 87. #pubcon BERT Has Dramatically Accelerated NLU
  88. 88. #pubcon BERT now even beats the human reasoning benchmark on SQuAD
  89. 89. #pubcon Not to be outdone – Microsoft also extends on BERT with MT-DNN
  90. 90. #pubcon RoBERTa from Facebook
  91. 91. #pubcon In GLUE – It’s Humans, MT-DNN, then BERT
  92. 92. #pubcon GLUE Benchmark Leaderboard
  93. 93. #pubcon SuperGLUE Benchmark
  94. 94. #pubcon Stanford Question Answering Dataset (SQuAD)
  95. 95. #pubcon Includes Adversarial Questions: Making Sure Machines Know What They Don’t Know
  96. 96. #pubcon MS MARCO
  97. 97. #pubcon MS MARCO: A Human Generated MAchine Reading Comprehension Dataset • Nguyen, T., Rosenberg, M., Song, X., Gao, J., Tiwary, S., Majumder, R. and Deng, L., 2016. MS MARCO: A human generated machine reading comprehension dataset. arXiv preprint arXiv:1611.09268.
  98. 98. #pubcon Real Bing Questions Feed MS MARCO From real Bing anonymized queries
  99. 99. #pubcon Teaching Machines Commonsense Zellers, R., Bisk, Y., Schwartz, R. and Choi, Y., 2018. Swag: A large- scale adversarial dataset for grounded commonsense inference. arXiv preprint arXiv:1808.05326.
  100. 100. #pubcon BERT Has Grown • Further iterations have grown in size so that the models are arguably so large they are inefficient and unscalable
  101. 101. #pubcon FastBERT
  102. 102. #pubcon ALBERT BERT’s successor from Google Joint work between Google Research & Toyota Technological Institute
  103. 103. #pubcon HuggingFace
  104. 104. #pubcon DistilBERT (Distilled BERT)
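A tiny sketch of the size difference. The AutoModel API from HuggingFace transformers and the approximate parameter counts in the comment are assumptions based on the public checkpoints, not figures from the deck.

    from transformers import AutoModel

    bert = AutoModel.from_pretrained("bert-base-uncased")
    distilbert = AutoModel.from_pretrained("distilbert-base-uncased")

    # Roughly ~110M parameters for bert-base vs ~66M for distilbert-base
    print(sum(p.numel() for p in bert.parameters()))
    print(sum(p.numel() for p in distilbert.parameters()))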
  105. 105. #pubcon VideoBERT
  106. 106. #pubcon BLACK BOX ALGORITHMS
  107. 107. #pubcon Algorithmic Bias Concerns • Ricardo Baeza-Yates’ work - Bias on the Web • NoBIAS Project • IBM initiatives to prevent bias • BERT does not know why it makes decisions • BERT is considered a ‘black box algorithm’ • Programmatic bias is a concern • The Algorithmic Justice League is active
  108. 108. #pubcon Keep in Touch •@dawnieando •@BeBertey
  109. 109. #pubcon And Remember…
  110. 110. #pubcon
  111. 111. #pubcon References • Rajpurkar, P., Zhang, J., Lopyrev, K. and Liang, P., 2016. SQuAD: 100,000+ questions for machine comprehension of text. arXiv preprint arXiv:1606.05250. • Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł. and Polosukhin, I., 2017. Attention is all you need. In Advances in Neural Information Processing Systems (pp. 5998-6008).
