Our journey with semantic embedding
Rob Koopman, Shenghui Wang
OCLC
Fourth Annual KnowEscape Conference, 22-24 Feb 2017
Agenda
● What is semantic embedding
● Applications:
○ Context explorer
○ Topic delineation
○ Information retrieval
○ Concept drift
An example by Stefan Evert: what’s the meaning of bardiwac?
•He handed her her glass of bardiwac.
•Beef dishes are made to complement the bardiwacs.
•Nigel staggered to his feet, face flushed from too much bardiwac.
•Malbec, one of the lesser-known bardiwac grapes, responds well to
Australia’s sunshine.
•I dined on bread and cheese and this excellent bardiwac.
•The drinks were delicious: blood-red bardiwac as well as light, sweet
Rhenish.
⇒ ‘bardiwac’ is a heavy red alcoholic beverage made from grapes
How can we calculate the similarity/relatedness?
● Discrete encoding does not help to automatically process
the underlying semantics
● Statistical Semantics [furnas1983, weaver1955] based on
the assumption of “a word is characterized by the
company it keeps” [firth1957]
● Distributional Hypothesis [harris1954, sahlgren2008]:
words that occur in similar contexts tend to have similar
meanings.
Let’s embed words in a vector space
● Words are represented in a continuous vector space
where semantically similar words are mapped to nearby
points (they 'are embedded near each other').
● A desirable property: relatedness can be measured as the
cosine similarity between two vectors
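As a minimal sketch of the cosine measure named on this slide: the three-dimensional vectors below are invented for illustration and do not come from any of the models discussed in the talk.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)

# Hypothetical toy embeddings (assumed values, for illustration only).
toy = {
    "bardiwac": [0.9, 0.1, 0.3],
    "wine": [0.8, 0.2, 0.4],
    "knee": [0.1, 0.9, 0.0],
}

sim_wine = cosine_similarity(toy["bardiwac"], toy["wine"])
sim_knee = cosine_similarity(toy["bardiwac"], toy["knee"])
print(f"bardiwac~wine: {sim_wine:.3f}, bardiwac~knee: {sim_knee:.3f}")
```

Cosine similarity depends only on the direction of the vectors, not their length, so frequent and rare words can be compared on the same scale.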
What can we do with the similarity?
● Context explorer
● Topic delineation
● Information retrieval
● Concept drift
Context explorer
What can we do with the similarity?
● Context exploration
http://thoth.pica.nl/astro/relate?input=supernovae&type=7
Document clustering
Topic delineation based on clustering
● Generate vectors for entities
● Generate vectors for articles based on weighted average
of entity vectors
● Use standard clustering methods to cluster articles
● In the end, this approach proved to be remarkably
compatible with methods based on citation networks.
Wang, S., & Koopman, R. (2017). Clustering articles based on semantic similarity. In J. Gläser, A.
Scharnhorst, & W. Glänzel (Eds.), Same data—Different results? (pp. 234–556). Towards a comparative
approach to the identification of thematic structures in science. Special Issue of Scientometrics
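The second step of the clustering recipe, building an article vector as a weighted average of entity vectors, can be sketched as follows. The entity names, vectors and weights are hypothetical, and the slides do not say which weighting scheme is used, so the weights here are plain assumed numbers.

```python
def weighted_average(vectors, weights):
    """Weighted mean of equal-length vectors; weights need not sum to 1."""
    if not vectors:
        raise ValueError("need at least one vector")
    total = sum(weights)
    if total == 0:
        raise ValueError("weights must not sum to zero")
    dim = len(vectors[0])
    return [
        sum(w * vec[i] for vec, w in zip(vectors, weights)) / total
        for i in range(dim)
    ]

def article_vector(entity_weights, entity_vectors):
    """Combine the vectors of the entities that occur in one article."""
    names = [name for name in entity_weights if name in entity_vectors]
    return weighted_average(
        [entity_vectors[name] for name in names],
        [entity_weights[name] for name in names],
    )

# Hypothetical 2-dimensional entity vectors and per-article weights.
entity_vectors = {
    "diabetes": [1.0, 0.0],
    "glycation": [0.0, 1.0],
    "nail": [1.0, 1.0],
}
article_entities = {"diabetes": 1.0, "glycation": 2.0, "nail": 1.0}

doc_vec = article_vector(article_entities, entity_vectors)
print(doc_vec)
```

The resulting article vectors live in the same space as the entity vectors, so they can be handed to any standard clustering method.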
Information Retrieval
1. 2014 glycated nail proteins a new approach for detecting diabetes in
developing countries
2. 2015 glycation of nail proteins from basic biochemical findings to a
representative marker for diabetic glycation associated target organ damage
3. 2005 glycation products as markers and predictors of the progression of
diabetic complications
4. 2015 glycated nail proteins as a new biomarker in management of the
south kivu congolese diabetics
5. 2005 advanced glycosylation end products in skin serum saliva and urine and
its association with complications of patients with type 2 diabetes mellitus
6. 1993 review of diabetes identification of markers for early detection glycemic
control and monitoring clinical complications
7. 2012 glycation and biomarkers of vascular complications of diabetes
8. 2005 the nail under fungal siege in patients with type ii diabetes mellitus
9. 2003 improvement in quality of diabetes control and concentrations of age
products in patients with type 1 and insulin treated type 2 diabetes mellitus
studied over a period of 10 years jevin
10. 2005 a novel advanced glycation index and its association with diabetes and
microangiopathy
Now let’s evaluate and compare
Word embedding techniques
Two main categories of approaches:
● global co-occurrence count-based methods, such as
Latent Semantic Analysis and Random Projection:
these tend to perform poorly on word analogy tasks
● local context predictive methods, such as neural
probabilistic language models: these do not leverage
global corpus statistics
Word embedding techniques
● Ariadne (OCLC): based on Random Projection of the
global co-occurrence matrix
● Word2Vec (Google): shallow, two-layer neural networks
that are trained to reconstruct linguistic contexts of words
● GloVe (Stanford): a global log-bilinear regression model to
learn word vectors based on the ratio of the co-occurrence
probabilities of two words
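To make the count-based family concrete, here is a rough sketch of a random projection of a co-occurrence matrix. It is not Ariadne's implementation: the sentence-wide context window, the Gaussian index vectors, the dimension and the toy corpus are all assumptions made for illustration.

```python
import math
import random

def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def random_projection_embeddings(sentences, dim=64, seed=0):
    """Embed each word as the sum of random vectors of its co-occurring words.

    This equals multiplying the word-by-word co-occurrence count matrix
    by a random (vocabulary x dim) matrix, without storing the counts.
    """
    rng = random.Random(seed)
    vocab = sorted({word for sentence in sentences for word in sentence})
    # One fixed random "index vector" per vocabulary word.
    index_vectors = {w: [rng.gauss(0.0, 1.0) for _ in range(dim)] for w in vocab}
    embeddings = {w: [0.0] * dim for w in vocab}
    for sentence in sentences:
        for i, word in enumerate(sentence):
            for j, context in enumerate(sentence):
                if i != j:  # every other word in the sentence is context
                    row = embeddings[word]
                    for k, value in enumerate(index_vectors[context]):
                        row[k] += value
    return embeddings

# Hypothetical toy corpus: "cat" and "dog" share contexts, "car" does not.
corpus = [
    ["the", "cat", "eats", "food"],
    ["the", "dog", "eats", "food"],
    ["the", "car", "needs", "fuel"],
]
vectors = random_projection_embeddings(corpus)
same_context = cosine(vectors["cat"], vectors["dog"])
different_context = cosine(vectors["cat"], vectors["car"])
print(f"cat~dog: {same_context:.3f}, cat~car: {different_context:.3f}")
```

Words with identical contexts end up with identical vectors, while words that share only some contexts are pulled apart by the unrelated random components.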
Different models lead to different embeddings
knee
  Word2Vec: ankle, hip, elbow, knees, shoulder, patellofemoral, joint, wrist, tka, patellar
  GloVe: ankle, hip, joint, knees, arthroplasty, osteoarthritis, elbow, flexion, cruciate, joints
  Ariadne: knees, knee joint, contralateral knee, tibiofemoral, knee pain, knee motion, medial compartment, lateral compartment, operated knees, right knee
frog
  Word2Vec: toad, bullfrog, amphibian, rana, turtle, salamander, caudiverbera, frogs, leptodactylid, pleurodema
  GloVe: rana, toad, amphibian, bullfrog, frogs, temporaria, laevis, xenopus, anuran, catesbeiana
  Ariadne: frogs, isolated frog, frog muscle, rana pipiens, anurans, hyla, anuran, tree frog, anuran species, hylid
Word analogy evaluation
Which word is the most similar to Italy in the same sense as
Paris is similar to France?
X = vector("Paris") - vector("France") + vector("Italy")
Method    Accuracy (%)  Runtime (seconds)  Threads
Word2Vec  61.4          32,432             16
GloVe     53.6          22,680             16
Ariadne   1.6           15,020             1
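The analogy query used in this evaluation can be sketched as a nearest-neighbour search around the offset vector X. The two-dimensional vectors below are hand-made so that the arithmetic is easy to follow; they are not output from Word2Vec, GloVe or Ariadne.

```python
import math

def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def analogy(a, b, c, vectors):
    """Return the word closest to vector(a) - vector(b) + vector(c).

    The three query words are excluded from the candidates, as is usual
    in analogy benchmarks.
    """
    x = [pa - pb + pc for pa, pb, pc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in {a, b, c}]
    return max(candidates, key=lambda w: cosine(x, vectors[w]))

# Hypothetical toy vectors: first axis ~ "country", second ~ "is a capital".
toy = {
    "France": [1.0, 0.0],
    "Paris": [1.0, 1.0],
    "Italy": [2.0, 0.0],
    "Rome": [2.0, 1.0],
    "Berlin": [3.0, 1.0],
}

answer = analogy("Paris", "France", "Italy", toy)
print(answer)
```

Accuracy on an analogy benchmark is then the share of questions for which the returned word matches the expected one.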
Information retrieval evaluation
Use case: evidence-based medical guideline
Statement: There are no indications to suggest that a skin-sparing mastectomy followed by immediate reconstruction leads to a higher risk of local or systemic recurrence of breast cancer.
Old references (PMID): 9142378, 1985335
New references (PMID): 9142378, 9694613, 18210199
From word embedding to document distance
● Doc2Vec: an extension of Word2Vec, that learns to
correlate documents and words, rather than words with
other words
● Ariadne: weighted average of word vectors
A tiny gold set
● 29 statements (16 breast cancer, 4 hepatitis C, 4 lung
cancer, 5 ovarian cancer)
● 103 (96 unique) source articles, 156 (145 unique) target
articles, in total 180 unique articles
● 66 articles are in both source and target lists, so the
baseline total recall is 42.3% (the average baseline recall
is 45.8%)
● These articles were published between 1984 and 2012.
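The two baseline figures differ because one pools all references and the other averages per statement. The sketch below reproduces the 42.3% total from the counts on the slide (66 overlapping articles out of 156 target references); the per-statement numbers are hypothetical and only illustrate why an average of per-statement recalls can differ from the pooled value.

```python
def total_recall(pairs):
    """Pooled recall: all hits divided by all target references."""
    hits = sum(h for h, _ in pairs)
    targets = sum(t for _, t in pairs)
    return hits / targets

def average_recall(pairs):
    """Mean of the per-statement recall values."""
    return sum(h / t for h, t in pairs) / len(pairs)

# From the slide: 66 source articles reappear among 156 target references.
baseline_total = 66 / 156
print(f"baseline total recall: {baseline_total:.1%}")

# Hypothetical (hits, targets) per statement, for illustration only.
toy_statements = [(1, 2), (3, 3), (0, 5)]
print(total_recall(toy_statements), average_recall(toy_statements))
```

Statements with few target references weigh as much as long ones in the average, which is why the two numbers need not agree.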
Average recall
Average precision
Concept drift
Now let’s talk about concept drift
● 20 million Medline articles published since 1977
● 1.5 million entities (subjects, authors, journals, words)
● 8 five-year periods
● Each subject is embedded in 8 chronological vector
spaces
● Is there concept drift and can we detect it?
Jaccard similarity based on important subjects
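A minimal sketch of a Jaccard comparison between the sets of most related subjects in different periods. The lists are three of the neighbour lists the deck reports for "trauma nervous system"; how the talk selects "important subjects" is not specified, so using these top-neighbour lists directly is an assumption.

```python
def jaccard(a, b):
    """Size of the intersection divided by size of the union."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # convention chosen here: two empty sets are identical
    return len(a & b) / len(a | b)

neighbours = {
    "1977-1982": [
        "anatomy regional", "fracture fixation internal", "bulgaria",
        "piedra", "surgery plastic", "germany west", "wound infection",
        "carbuncle", "burns",
    ],
    "2002-2007": [
        "peripheral nervous system diseases", "peripheral nerve injuries",
        "neurologic examination", "male", "recovery of function",
        "peripheral nerves", "elbow", "comorbidity", "mother child relations",
    ],
    "2007-2012": [
        "peripheral nerve injuries", "sciatic neuropathy", "papilledema",
        "sciatic nerve", "peripheral nerves", "nerve crush", "neuroma",
        "nerve regeneration", "acute disease",
    ],
}

adjacent = jaccard(neighbours["2002-2007"], neighbours["2007-2012"])
distant = jaccard(neighbours["1977-1982"], neighbours["2007-2012"])
print(adjacent, distant)
```

A low Jaccard value between consecutive periods suggests that the subject's context has changed; a value near 1 suggests a stable subject.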
Most and least stable subjects
Most stable subjects: history 15th century, history 18th century, history 17th century, history 16th century, history 19th century, thymoma, history ancient, history medieval, rabies, history
Least stable subjects: diagnostic techniques surgical, chromium isotopes, shock surgical, iodine isotopes, diagnostic techniques and procedures, blood circulation time, trauma nervous system, cesium isotopes, liver extracts, macroglobulins
Subjects most related to “trauma nervous system”
1977-1982: anatomy regional, fracture fixation internal, bulgaria, piedra, surgery plastic, germany west, wound infection, carbuncle, burns
1982-1987: legionellosis, povidone, tropocollagen, attention deficit disorder with hyperactivity, legionnaires disease, transfer psychology
1987-1992: leg injuries, neurosurgical procedures, arm injuries, wound infection, orthopedic equipment, dermatomycoses, multiple trauma, candidiasis cutaneous, fractures closed
1992-1997: piperacillin, tazobactam, microbiology, diagnostic errors, sorption detoxification, arthroplasty, hsp40 heat shock proteins, emaciation, professional patient relations
1997-2002: defensive medicine, insurance liability, diagnostic errors, expert testimony, birth injuries, maleic anhydrides, dimethyl sulfate, medical errors, p protein hepatitis b virus
2002-2007: peripheral nervous system diseases, peripheral nerve injuries, neurologic examination, male, recovery of function, peripheral nerves, elbow, comorbidity, mother child relations
2007-2012: peripheral nerve injuries, sciatic neuropathy, papilledema, sciatic nerve, peripheral nerves, nerve crush, neuroma, nerve regeneration, acute disease
2012-2017: mitochondrial dynamics, dental records, park7 protein human, persistent vegetative state, dnm1l protein human, platelet derived growth factor bb, dual specificity phosphatases, lingual nerve injuries, dental care
Global drift based on self-organizing maps
- Create document vectors
- Place the documents in a self-organizing map
- For each point in the map, count the documents in each year range
- Make a sub-map for each year range
- Color-code counts that are lower than expected in blue and counts
that are higher than expected in red
- The result shows global drift
Note that this shows how the content of the Medline database drifts
over time, not necessarily how science drifts over time.
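The counting and colouring steps can be sketched as below, assuming the self-organizing map has already been trained and every document has been assigned to a map cell. The slides do not define "expected"; this sketch assumes the count a cell would receive if documents were spread over periods independently of their cell. The cell and period labels are hypothetical.

```python
from collections import Counter

def drift_colours(assignments):
    """Colour each (cell, period) by observed versus expected document count.

    assignments: iterable of (cell, period) pairs, one per document.
    Expected count (an assumption of this sketch):
        cell_total * period_total / number_of_documents
    """
    assignments = list(assignments)
    n = len(assignments)
    observed = Counter(assignments)
    cell_totals = Counter(cell for cell, _ in assignments)
    period_totals = Counter(period for _, period in assignments)
    colours = {}
    for cell in cell_totals:
        for period in period_totals:
            expected = cell_totals[cell] * period_totals[period] / n
            count = observed[(cell, period)]
            if count > expected:
                colours[(cell, period)] = "red"
            elif count < expected:
                colours[(cell, period)] = "blue"
            else:
                colours[(cell, period)] = "neutral"
    return colours

# Hypothetical assignments of six documents to two cells and two periods.
docs = [
    ("cell_a", "1977-1982"), ("cell_a", "1977-1982"), ("cell_a", "2012-2017"),
    ("cell_b", "2012-2017"), ("cell_b", "2012-2017"), ("cell_b", "1977-1982"),
]
colours = drift_colours(docs)
print(colours)
```

Each period's sub-map is then the set of cells coloured for that period; regions that turn from blue to red over successive periods are where the collection grows.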
Summary
● Semantic indexing enables operations directly on the
underlying semantics
● It helps to explore the context of a subject, cluster and
retrieve related documents, and study concept drift
● Different methods have their own limitations
● The choice of method depends on the application

Contenu connexe

Similaire à Our journey with semantic embedding

Tactical Formalization of Linked Open Data (Ontology Summit 2014)
Tactical Formalization of Linked Open Data (Ontology Summit 2014)Tactical Formalization of Linked Open Data (Ontology Summit 2014)
Tactical Formalization of Linked Open Data (Ontology Summit 2014)
Michel Dumontier
 
Curriculum Vitae
Curriculum VitaeCurriculum Vitae
Curriculum Vitae
Dino Celeda
 
Turning Data into Knowledge - Semantic Technologies in Healthcare
Turning Data into Knowledge - Semantic Technologies in HealthcareTurning Data into Knowledge - Semantic Technologies in Healthcare
Turning Data into Knowledge - Semantic Technologies in Healthcare
SvetlaBoytcheva
 

Similaire à Our journey with semantic embedding (20)

David S.Cohen-Curriculum Vitae
David S.Cohen-Curriculum VitaeDavid S.Cohen-Curriculum Vitae
David S.Cohen-Curriculum Vitae
 
Lancet.docx comorbidity of fetal alcohol spectrum disorder a systematic revi...
Lancet.docx comorbidity of fetal alcohol spectrum disorder  a systematic revi...Lancet.docx comorbidity of fetal alcohol spectrum disorder  a systematic revi...
Lancet.docx comorbidity of fetal alcohol spectrum disorder a systematic revi...
 
Examining gene expression and methylation with next gen sequencing
Examining gene expression and methylation with next gen sequencingExamining gene expression and methylation with next gen sequencing
Examining gene expression and methylation with next gen sequencing
 
NAACCR June 2020
NAACCR June 2020NAACCR June 2020
NAACCR June 2020
 
Introduction to bioinformatics
Introduction to bioinformaticsIntroduction to bioinformatics
Introduction to bioinformatics
 
Adriana Maggi - Italian EU Presidency Conference on Dementia
Adriana Maggi - Italian EU Presidency Conference on DementiaAdriana Maggi - Italian EU Presidency Conference on Dementia
Adriana Maggi - Italian EU Presidency Conference on Dementia
 
Next generation sequencing
Next generation sequencingNext generation sequencing
Next generation sequencing
 
Tactical Formalization of Linked Open Data (Ontology Summit 2014)
Tactical Formalization of Linked Open Data (Ontology Summit 2014)Tactical Formalization of Linked Open Data (Ontology Summit 2014)
Tactical Formalization of Linked Open Data (Ontology Summit 2014)
 
Pipeline or pipe dream - Midlands Micro Meeting UK - mon 15 sep 2014
Pipeline or pipe dream - Midlands Micro Meeting UK - mon 15 sep 2014Pipeline or pipe dream - Midlands Micro Meeting UK - mon 15 sep 2014
Pipeline or pipe dream - Midlands Micro Meeting UK - mon 15 sep 2014
 
Nova Publishers Catalog - Just Out! 2015
Nova Publishers Catalog - Just Out!  2015Nova Publishers Catalog - Just Out!  2015
Nova Publishers Catalog - Just Out! 2015
 
Catalog 2015
Catalog 2015Catalog 2015
Catalog 2015
 
dipak kalra
dipak kalradipak kalra
dipak kalra
 
Annovis Bio (ANVS) Presentation - April 9, 2020
Annovis Bio (ANVS) Presentation - April 9, 2020Annovis Bio (ANVS) Presentation - April 9, 2020
Annovis Bio (ANVS) Presentation - April 9, 2020
 
6C Skrøvseth Data-driven analytics for decision support EHiN 2014
6C Skrøvseth Data-driven analytics for decision support EHiN 20146C Skrøvseth Data-driven analytics for decision support EHiN 2014
6C Skrøvseth Data-driven analytics for decision support EHiN 2014
 
Curriculum Vitae
Curriculum VitaeCurriculum Vitae
Curriculum Vitae
 
Leverage machine learning and new technologies to enhance rwe generation and ...
Leverage machine learning and new technologies to enhance rwe generation and ...Leverage machine learning and new technologies to enhance rwe generation and ...
Leverage machine learning and new technologies to enhance rwe generation and ...
 
Supportive Periodontal Therapy Part 2
Supportive Periodontal Therapy Part 2Supportive Periodontal Therapy Part 2
Supportive Periodontal Therapy Part 2
 
Turning Data into Knowledge - Semantic Technologies in Healthcare
Turning Data into Knowledge - Semantic Technologies in HealthcareTurning Data into Knowledge - Semantic Technologies in Healthcare
Turning Data into Knowledge - Semantic Technologies in Healthcare
 
IRDiRC: progress and expectations
IRDiRC: progress and expectationsIRDiRC: progress and expectations
IRDiRC: progress and expectations
 
Genomic Technologies for Biomarker Discovery
Genomic Technologies for Biomarker DiscoveryGenomic Technologies for Biomarker Discovery
Genomic Technologies for Biomarker Discovery
 

Plus de Shenghui Wang

Ariadne's Thread -- Exploring a world of networked information built from fre...
Ariadne's Thread -- Exploring a world of networked information built from fre...Ariadne's Thread -- Exploring a world of networked information built from fre...
Ariadne's Thread -- Exploring a world of networked information built from fre...
Shenghui Wang
 
Learning Concept Mappings from Instance Similarity
Learning Concept Mappings from Instance SimilarityLearning Concept Mappings from Instance Similarity
Learning Concept Mappings from Instance Similarity
Shenghui Wang
 
Measuring the dynamic bi-directional influence between content and social ne...
Measuring the dynamic bi-directional influence between content and social ne...Measuring the dynamic bi-directional influence between content and social ne...
Measuring the dynamic bi-directional influence between content and social ne...
Shenghui Wang
 
Similarity Features, and their Role in Concept Alignment Learning
Similarity Features, and their Role in Concept Alignment Learning Similarity Features, and their Role in Concept Alignment Learning
Similarity Features, and their Role in Concept Alignment Learning
Shenghui Wang
 
What is concept dirft and how to measure it?
What is concept dirft and how to measure it?What is concept dirft and how to measure it?
What is concept dirft and how to measure it?
Shenghui Wang
 
Study concept drift in political ontologies
Study concept drift in political ontologiesStudy concept drift in political ontologies
Study concept drift in political ontologies
Shenghui Wang
 

Plus de Shenghui Wang (12)

Non-parametric Subject Prediction
Non-parametric Subject PredictionNon-parametric Subject Prediction
Non-parametric Subject Prediction
 
Semantic indexing for KOS
Semantic indexing for KOSSemantic indexing for KOS
Semantic indexing for KOS
 
Contextualization of topics - browsing through terms, authors, journals and c...
Contextualization of topics - browsing through terms, authors, journals and c...Contextualization of topics - browsing through terms, authors, journals and c...
Contextualization of topics - browsing through terms, authors, journals and c...
 
Exploring a world of networked information built from free-text metadata
Exploring a world of networked information built from free-text metadataExploring a world of networked information built from free-text metadata
Exploring a world of networked information built from free-text metadata
 
Ariadne's Thread -- Exploring a world of networked information built from fre...
Ariadne's Thread -- Exploring a world of networked information built from fre...Ariadne's Thread -- Exploring a world of networked information built from fre...
Ariadne's Thread -- Exploring a world of networked information built from fre...
 
Learning Concept Mappings from Instance Similarity
Learning Concept Mappings from Instance SimilarityLearning Concept Mappings from Instance Similarity
Learning Concept Mappings from Instance Similarity
 
Measuring the dynamic bi-directional influence between content and social ne...
Measuring the dynamic bi-directional influence between content and social ne...Measuring the dynamic bi-directional influence between content and social ne...
Measuring the dynamic bi-directional influence between content and social ne...
 
Similarity Features, and their Role in Concept Alignment Learning
Similarity Features, and their Role in Concept Alignment Learning Similarity Features, and their Role in Concept Alignment Learning
Similarity Features, and their Role in Concept Alignment Learning
 
What is concept dirft and how to measure it?
What is concept dirft and how to measure it?What is concept dirft and how to measure it?
What is concept dirft and how to measure it?
 
ICA Slides
ICA SlidesICA Slides
ICA Slides
 
ECCS 2010
ECCS 2010ECCS 2010
ECCS 2010
 
Study concept drift in political ontologies
Study concept drift in political ontologiesStudy concept drift in political ontologies
Study concept drift in political ontologies
 

Dernier

Gartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptxGartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptx
chadhar227
 
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
nirzagarg
 
Lecture_2_Deep_Learning_Overview-newone1
Lecture_2_Deep_Learning_Overview-newone1Lecture_2_Deep_Learning_Overview-newone1
Lecture_2_Deep_Learning_Overview-newone1
ranjankumarbehera14
 
怎样办理圣路易斯大学毕业证(SLU毕业证书)成绩单学校原版复制
怎样办理圣路易斯大学毕业证(SLU毕业证书)成绩单学校原版复制怎样办理圣路易斯大学毕业证(SLU毕业证书)成绩单学校原版复制
怎样办理圣路易斯大学毕业证(SLU毕业证书)成绩单学校原版复制
vexqp
 
一比一原版(UCD毕业证书)加州大学戴维斯分校毕业证成绩单原件一模一样
一比一原版(UCD毕业证书)加州大学戴维斯分校毕业证成绩单原件一模一样一比一原版(UCD毕业证书)加州大学戴维斯分校毕业证成绩单原件一模一样
一比一原版(UCD毕业证书)加州大学戴维斯分校毕业证成绩单原件一模一样
wsppdmt
 
怎样办理伦敦大学城市学院毕业证(CITY毕业证书)成绩单学校原版复制
怎样办理伦敦大学城市学院毕业证(CITY毕业证书)成绩单学校原版复制怎样办理伦敦大学城市学院毕业证(CITY毕业证书)成绩单学校原版复制
怎样办理伦敦大学城市学院毕业证(CITY毕业证书)成绩单学校原版复制
vexqp
 
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
Health
 
怎样办理旧金山城市学院毕业证(CCSF毕业证书)成绩单学校原版复制
怎样办理旧金山城市学院毕业证(CCSF毕业证书)成绩单学校原版复制怎样办理旧金山城市学院毕业证(CCSF毕业证书)成绩单学校原版复制
怎样办理旧金山城市学院毕业证(CCSF毕业证书)成绩单学校原版复制
vexqp
 
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
nirzagarg
 
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get CytotecAbortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Riyadh +966572737505 get cytotec
 
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
gajnagarg
 
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Klinik kandungan
 
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
nirzagarg
 

Dernier (20)

Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Research
 
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
 
Ranking and Scoring Exercises for Research
Ranking and Scoring Exercises for ResearchRanking and Scoring Exercises for Research
Ranking and Scoring Exercises for Research
 
Harnessing the Power of GenAI for BI and Reporting.pptx
Harnessing the Power of GenAI for BI and Reporting.pptxHarnessing the Power of GenAI for BI and Reporting.pptx
Harnessing the Power of GenAI for BI and Reporting.pptx
 
Gartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptxGartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptx
 
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
 
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24  Building Real-Time Pipelines With FLaNKDATA SUMMIT 24  Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
 
Lecture_2_Deep_Learning_Overview-newone1
Lecture_2_Deep_Learning_Overview-newone1Lecture_2_Deep_Learning_Overview-newone1
Lecture_2_Deep_Learning_Overview-newone1
 
怎样办理圣路易斯大学毕业证(SLU毕业证书)成绩单学校原版复制
怎样办理圣路易斯大学毕业证(SLU毕业证书)成绩单学校原版复制怎样办理圣路易斯大学毕业证(SLU毕业证书)成绩单学校原版复制
怎样办理圣路易斯大学毕业证(SLU毕业证书)成绩单学校原版复制
 
一比一原版(UCD毕业证书)加州大学戴维斯分校毕业证成绩单原件一模一样
一比一原版(UCD毕业证书)加州大学戴维斯分校毕业证成绩单原件一模一样一比一原版(UCD毕业证书)加州大学戴维斯分校毕业证成绩单原件一模一样
一比一原版(UCD毕业证书)加州大学戴维斯分校毕业证成绩单原件一模一样
 
Capstone in Interprofessional Informatic // IMPACT OF COVID 19 ON EDUCATION
Capstone in Interprofessional Informatic  // IMPACT OF COVID 19 ON EDUCATIONCapstone in Interprofessional Informatic  // IMPACT OF COVID 19 ON EDUCATION
Capstone in Interprofessional Informatic // IMPACT OF COVID 19 ON EDUCATION
 
怎样办理伦敦大学城市学院毕业证(CITY毕业证书)成绩单学校原版复制
怎样办理伦敦大学城市学院毕业证(CITY毕业证书)成绩单学校原版复制怎样办理伦敦大学城市学院毕业证(CITY毕业证书)成绩单学校原版复制
怎样办理伦敦大学城市学院毕业证(CITY毕业证书)成绩单学校原版复制
 
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
 
怎样办理旧金山城市学院毕业证(CCSF毕业证书)成绩单学校原版复制
怎样办理旧金山城市学院毕业证(CCSF毕业证书)成绩单学校原版复制怎样办理旧金山城市学院毕业证(CCSF毕业证书)成绩单学校原版复制
怎样办理旧金山城市学院毕业证(CCSF毕业证书)成绩单学校原版复制
 
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
 
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get CytotecAbortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
 
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
 
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
 
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
 
Aspirational Block Program Block Syaldey District - Almora
Aspirational Block Program Block Syaldey District - AlmoraAspirational Block Program Block Syaldey District - Almora
Aspirational Block Program Block Syaldey District - Almora
 

Our journey with semantic embedding

  • 1. Our journey with semantic embedding Rob Koopman, Shenghui Wang OCLC Fourth Annual KnoweScape Conference, 22-24 Feb 2017
  • 2. Agenda ● What is semantic embedding ● Applications: ○ Context explorer ○ Topic delineation ○ Information retrieval ○ Concept drift
  • 3. An example by Stefan Evert: what’s the meaning of bardiwac? •He handed her her glass of bardiwac. •Beef dishes are made to complement the bardiwacs. •Nigel staggered to his feet, face flushed from too much bardiwac. •Malbec, one of the lesser-known bardiwac grapes, responds well to Australia’s sunshine. •I dined on bread and cheese and this excellent bardiwac. •The drinks were delicious: blood-red bardiwac as well as light, sweet Rhenish. • ⇒ ‘bardiwac’ is a heavy red alcoholic beverage made from grapes
  • 4. How can we calculate the similarity/relatedness? ● Discrete encoding does not help to automatically process the underlying semantics ● Statistical Semantics [furnas1983, weaver1955] based on the assumption of “a word is characterized by the company it keeps” [firth1957] ● Distributional Hypothesis [harris1954, sahlgren2008]: words that occur in similar contexts tend to have similar meanings.
  • 5. Let’s embed words in a vector space ● Words are represented in a continuous vector space where semantically similar words are mapped to nearby points ('are embedded nearby each other'). ● A desirable property: cosine similarity
  • 6. What can we do with the similarity? ● Context explorer ● Topic delineation ● Information retrieval ● Concept drift
  • 8. What can we do with the similarity? ● Context exploration http://thoth.pica.nl/astro/relate?input=supernovae&type=7
  • 10. Topic delineation based on clustering ● Generate vectors for entities ● Generate vectors for articles based on weighted average of entity vectors ● Use standard clustering methods to cluster articles ● At the end this approach has proven to be remarkably compatible with methods based on citation networks. Wang, S., & Koopman, R. (2017). Clustering articles based on semantic similarity. In J. Gläser, A. Scharnhorst, & W. Glänzel (Eds.), Same data—Different results? (pp. 234–556). Towards a comparative approach to the identification of thematic structures in science. Special Issue of Scientometrics
  • 12.
  • 13.
  • 14.
  • 15. 1. 2014 glycated nail proteins a new approach for detecting diabetes in developing countries 2. 2015 glycation of nail proteins from basic biochemical findings to a representative marker for diabetic glycation associated target organ damage 3. 2005 glycation products as markers and predictors of the progression of diabetic complications 4. 2015 glycated nail proteins as a new biomarker in management of the south kivu congolese diabetics 5. 2005 advanced glycosylation end products in skin serum saliva and urine and its association with complications of patients with type 2 diabetes mellitus 6. 1993 review of diabetes identification of markers for early detection glycemic control and monitoring clinical complications 7. 2012 glycation and biomarkers of vascular complications of diabetes 8. 2005 the nail under fungal siege in patients with type ii diabetes mellitus 9. 2003 improvement in quality of diabetes control and concentrations of age products in patients with type 1 and insulin treated type 2 diabetes mellitus studied over a period of 10 years jevin 10. 2005 a novel advanced glycation index and its association with diabetes and microangiopathy
  • 16. Now let’s evaluate and compare
  • 17. Word embedding techniques Two main categories of approaches: ● global co-occurrence count-based methods, such as Latent Semantic Analysis and Random Projection ● local context predictive methods, such as neural probabilistic language models
  • 18. Word embedding techniques Two main categories of approaches: ● global co-occurrence count-based methods, such as Latent Semantic Analysis and Random Projection --- suffer in word analogy tasks ● local context predictive methods, such as neural probabilistic language models --- do not leverage the global statistics
  • 19. Word embedding techniques ● Ariadne (OCLC): based on Random Projection of the global co-occurrence matrix ● Word2Vec (Google): shallow, two-layer neural networks that are trained to reconstruct linguistic contexts of words ● GloVe (Stanford): a global log-bilinear regression model to learn word vectors based on the ratio of the co-occurrence probabilities of two words
  • 20. Different models lead to different embeddings knee Word2Vec ankle, hip, elbow, knees, shoulder, patellofemoral, joint, wrist, tka, patellar GloVe ankle, hip, joint, knees, arthroplasty, osteoarthritis, elbow, flexion, cruciate, joints Ariadne knees, knee joint, contralateral knee, tibiofemoral, knee pain, knee motion, medial compartment, lateral compartment, operated knees, right knee frog Word2Vec toad, bullfrog, amphibian, rana, turtle, salamander, caudiverbera, frogs, leptodactylid, pleurodema GloVe rana, toad, amphibian, bullfrog, frogs, temporaria, laevis, xenopus, anuran, catesbeiana Ariadne frogs, isolated frog, frog muscle, rana pipiens, anurans, hyla, anuran, tree frog, anuran species, hylid
  • 21. Word analogy evaluation Which word is the most similar to Italy in the same sense as Paris is similar to France? X=vector(``Paris'')-vector(``France'')+vector(``Italy'')
  • 22. Word analogy evaluation Which word is the most similar to Italy in the same sense as Paris is similar to France? X=vector(``Paris'')-vector(``France'')+vector(``Italy'') Method Accuracy (%) Runtime (seconds) #Thread Word2Vec 61.4 32,432 16 GloVe 53.6 22,680 16 Ariadne 1.6 15,020 1
  • 23. Information retrieval evaluation Use case: evidence-based medical guideline Statement There are no indications to suggest that a skin-sparing mastectomy followed by immediate reconstruction leads to a higher risk of local or systemic recurrence of breast cancer. Old references (pmid) 9142378, 1985335 New references (pmid) 9142378, 9694613, 18210199
  • 24. From word embedding to document distance
    ● Doc2Vec: an extension of Word2Vec that learns to associate documents with words, rather than words with other words
    ● Ariadne: weighted average of word vectors
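    A minimal sketch of the weighted-average approach, assuming each word has a vector and a scalar weight. The slides do not say which weighting Ariadne uses, so the weights below (a low weight for a frequent word) are hypothetical, as are the toy vectors and function names.

```python
import numpy as np

def document_vector(tokens, word_vectors, weights):
    """Weighted average of the vectors of the known words in `tokens`."""
    known = [t for t in tokens if t in word_vectors]
    if not known:
        raise ValueError("document contains no known words")
    w = np.array([weights.get(t, 1.0) for t in known])
    matrix = np.stack([word_vectors[t] for t in known])
    return (w[:, None] * matrix).sum(axis=0) / w.sum()

def cosine_distance(a, b):
    """1 - cosine similarity; 0 means the vectors point the same way."""
    return 1.0 - float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical two-dimensional word vectors and weights.
word_vectors = {
    "mastectomy":     np.array([1.0, 0.0]),
    "reconstruction": np.array([0.8, 0.2]),
    "recurrence":     np.array([0.9, 0.1]),
    "hepatitis":      np.array([0.0, 1.0]),
    "the":            np.array([0.5, 0.5]),
}
weights = {"the": 0.1}  # assumed: frequent words count less; others default to 1.0

doc_a = document_vector(["mastectomy", "reconstruction"], word_vectors, weights)
doc_b = document_vector(["mastectomy", "recurrence"], word_vectors, weights)
doc_c = document_vector(["the", "hepatitis"], word_vectors, weights)

print(cosine_distance(doc_a, doc_b), cosine_distance(doc_a, doc_c))
```

    Ranking candidate articles by their distance to a guideline statement's vector then gives a simple retrieval procedure.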
  • 25. A tiny gold set
    ● 29 statements (16 breast cancer, 4 hepatitis C, 4 lung cancer, 5 ovarian cancer)
    ● 103 (96 unique) source articles and 156 (145 unique) target articles, 180 unique articles in total
    ● 66 articles are in both the source and target lists, so the baseline total recall is 66/156 = 42.3% (the average baseline recall is 45.8%)
    ● These articles were published between 1984 and 2012.
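    The baseline figures can be read as two ways of aggregating recall: pooled over all statements (total) and averaged per statement. That reading is an assumption, since the slide does not define the two numbers. The sketch below checks the 66/156 arithmetic and computes recall for the single statement shown on slide 23; the second statement in the toy list uses placeholder identifiers, not real PMIDs.

```python
def recall(source, target):
    """Share of target articles that already appear in the source list."""
    target = set(target)
    return len(set(source) & target) / len(target)

def pooled_recall(statements):
    """Total hits divided by total targets, over all statements."""
    hits = sum(len(set(s) & set(t)) for s, t in statements)
    return hits / sum(len(set(t)) for _, t in statements)

def mean_recall(statements):
    """Average of the per-statement recall values."""
    return sum(recall(s, t) for s, t in statements) / len(statements)

# Baseline total recall reported on the slide: 66 shared out of 156 targets.
baseline_total = round(100 * 66 / 156, 1)

# Statement from slide 23 (old references as source, new references as target).
old_refs = [9142378, 1985335]
new_refs = [9142378, 9694613, 18210199]

# Second statement is a hypothetical placeholder to show the two aggregations.
statements = [(old_refs, new_refs), (["x1", "x2"], ["x1", "x2"])]

print(baseline_total, recall(old_refs, new_refs))
print(pooled_recall(statements), mean_recall(statements))
```

    The two aggregations differ whenever statements have different numbers of target articles, which is one way the total and average figures can diverge.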
  • 29. Now let’s talk about concept drift
    ● 20 million Medline articles published since 1977
    ● 1.5 million entities (subjects, authors, journals, words)
    ● 8 five-year periods
    ● Each subject is embedded in 8 chronological vector spaces
    ● Is there concept drift and can we detect it?
  • 30. Jaccard similarity based on important subjects
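    The slide gives only the title, so the following is a guess at the kind of measure involved: the Jaccard similarity between the sets of subjects most related to a given subject in two periods. The example uses the "trauma nervous system" lists for 2002-2007 and 2007-2012 from slide 32; the function name is an assumption.

```python
def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two collections."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

# Subjects most related to "trauma nervous system" (slide 32).
period_2002_2007 = [
    "peripheral nervous system diseases", "peripheral nerve injuries",
    "neurologic examination", "male", "recovery of function",
    "peripheral nerves", "elbow", "comorbidity", "mother child relations",
]
period_2007_2012 = [
    "peripheral nerve injuries", "sciatic neuropathy", "papilledema",
    "sciatic nerve", "peripheral nerves", "nerve crush", "neuroma",
    "nerve regeneration", "acute disease",
]

print(jaccard(period_2002_2007, period_2007_2012))
```

    Two of the sixteen distinct subjects are shared, so the similarity is low; a stable subject would keep most of its related subjects from one period to the next.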
  • 31. Most and least stable subjects
    ● Most stable subjects: history 15th century, history 18th century, history 17th century, history 16th century, history 19th century, thymoma, history ancient, history medieval, rabies, history
    ● Least stable subjects: diagnostic techniques surgical, chromium isotopes, shock surgical, iodine isotopes, diagnostic techniques and procedures, blood circulation time, trauma nervous system, cesium isotopes, liver extracts, macroglobulins
  • 32. Subjects most related to “trauma nervous system”
    ● 1977-1982: anatomy regional, fracture fixation internal, bulgaria, piedra, surgery plastic, germany west, wound infection, carbuncle, burns
    ● 1982-1987: legionellosis, povidone, tropocollagen, attention deficit disorder with hyperactivity, legionnaires disease, transfer psychology
    ● 1987-1992: leg injuries, neurosurgical procedures, arm injuries, wound infection, orthopedic equipment, dermatomycoses, multiple trauma, candidiasis cutaneous, fractures closed
    ● 1992-1997: piperacillin, tazobactam, microbiology, diagnostic errors, sorption detoxification, arthroplasty, hsp40 heat shock proteins, emaciation, professional patient relations
    ● 1997-2002: defensive medicine, insurance liability, diagnostic errors, expert testimony, birth injuries, maleic anhydrides, dimethyl sulfate, medical errors, p protein hepatitis b virus
    ● 2002-2007: peripheral nervous system diseases, peripheral nerve injuries, neurologic examination, male, recovery of function, peripheral nerves, elbow, comorbidity, mother child relations
    ● 2007-2012: peripheral nerve injuries, sciatic neuropathy, papilledema, sciatic nerve, peripheral nerves, nerve crush, neuroma, nerve regeneration, acute disease
    ● 2012-2017: mitochondrial dynamics, dental records, park7 protein human, persistent vegetative state, dnm1l protein human, platelet derived growth factor bb, dual specificity phosphatases, lingual nerve injuries, dental care
  • 33. Global drift based on Self Organising Maps
    - Create document vectors
    - Put the documents in a self-organising map
    - For each point in the map, count the documents in each year range
    - Make sub-maps for each year range
    - Colour-code counts that are lower than expected in blue and counts that are higher than expected in red
    - The result shows global drift
    Note: this shows how the content of the Medline database drifts over time, not necessarily how science drifts over time.
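    The counting and colour-coding steps can be sketched as below, assuming the self-organising map has already been trained and each document has been assigned a map cell and a year range. The slide does not define "expected"; here it is assumed to be the cell's total count multiplied by the year range's share of all documents. The function name and the toy assignments are hypothetical.

```python
import numpy as np

def drift_maps(cells, periods, n_cells, n_periods):
    """Observed minus expected document counts per (year range, map cell).

    Positive values would be drawn in red (more than expected),
    negative values in blue (fewer than expected).
    """
    observed = np.zeros((n_periods, n_cells))
    np.add.at(observed, (periods, cells), 1)      # count documents per pair
    cell_totals = observed.sum(axis=0)            # documents per map cell
    period_share = observed.sum(axis=1) / observed.sum()
    expected = np.outer(period_share, cell_totals)
    return observed - expected

# Hypothetical assignments: four documents, two map cells, two year ranges.
cells = np.array([0, 0, 0, 1])
periods = np.array([0, 0, 1, 1])

difference = drift_maps(cells, periods, n_cells=2, n_periods=2)
print(difference)
```

    Each row of the result is one sub-map; reshaping a row to the grid dimensions of the map gives the image to colour.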
  • 34. Summary
    ● Semantic indexing enables operations directly on the underlying semantics
    ● It helps to explore the context of a subject, to cluster and retrieve related documents, and to study drift
    ● Different methods have their own limitations
    ● The choice of method depends on the application