This document discusses using natural language processing (NLP) to accelerate materials design. It describes how NLP techniques are being used to analyze over 4 million materials science papers to extract entities like materials, characterization methods, and properties. Word embedding algorithms represent words as vectors to capture relationships between words. NLP models are then trained on labeled text to recognize these entities. This allows automated searching of literature and predicting promising new materials for applications like thermoelectrics based on co-occurrence patterns in text. Future work includes developing structured materials databases from literature and learning embeddings to describe arbitrary materials.
Accelerating materials design through natural language processing
1. Accelerating materials design through natural
language processing
Anubhav Jain
Energy Technologies Area
Lawrence Berkeley National Laboratory
Berkeley, CA
KSU Virtual Seminar, Feb 23 2021
Slides (already) posted to hackingmaterials.lbl.gov
2. • Often, materials are known for several decades
before their functional applications are known
– MgB2 sitting on lab shelves for 50 years before its
identification as a superconductor in 2001
– LiFePO4 known since 1938, only identified as a Li-ion
battery cathode in 1997
• Even after discovery, optimization and
commercialization still take decades
• How is this typically done?
Typically, both new materials discovery and optimization
take decades
3. What constrains traditional approaches to materials design?
“[The Chevrel] discovery resulted from a lot of
unsuccessful experiments of Mg ions insertion
into well-known hosts for Li+ ions insertion, as
well as from the thorough literature analysis
concerning the possibility of divalent ions
intercalation into inorganic materials.”
-Aurbach group, on discovery of Chevrel cathode
for multivalent (e.g., Mg2+) batteries
Levi, Levi, Chasid, Aurbach
J. Electroceramics (2009)
4. Researchers are starting to fundamentally re-think how we invent the materials that make up our devices
Next-generation materials design:
– Computer-aided materials design
– Natural language processing
– “Self-driving laboratories”
6. Can ML help us work through the backlog of information we need to assimilate from text sources?
papers to read “someday”
NLP algorithms
7. • It is difficult to look up all information on any given material
due to the many different ways chemical compositions
are written
– a search for “TiNiSn” will give different results than “NiTiSn”
– a search for “GaSb” won’t match text that reads “Ga0.5Sb0.5”
– a search for “SnBi4Te7” won’t match text that reads “we studied
SnBi4X7 (X=S, Se, Te)”.
– a search for “AgCrSe2”, if it doesn’t have any hits, won’t suggest
“CuCrSe2” as a similar result
• It is difficult to ask questions or compile summaries, e.g.:
– What is the band gap of “Si”?
– What are all the known dopants into GaAs?
– What are all materials studied as thermoelectrics?
Traditional search doesn’t answer the questions we want
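One way to soften the first two mismatches listed on this slide is to normalize every composition to a canonical form before indexing and searching. The sketch below is an illustration only, not Matscholar's implementation: `normalize_formula` is a hypothetical helper that assumes flat formulas (no parentheses, no placeholders such as X), reduces the stoichiometry to the smallest integers, and sorts the elements alphabetically.

```python
import re
from fractions import Fraction
from functools import reduce
from math import gcd

# An element symbol followed by an optional integer or decimal amount.
ELEMENT_AMOUNT = re.compile(r"([A-Z][a-z]?)(\d+(?:\.\d+)?)?")


def normalize_formula(formula):
    """Return a canonical string for a flat formula (hypothetical helper)."""
    counts = {}
    for element, amount in ELEMENT_AMOUNT.findall(formula):
        value = Fraction(amount) if amount else Fraction(1)
        counts[element] = counts.get(element, Fraction(0)) + value
    if not counts:
        raise ValueError(f"no elements found in {formula!r}")

    # Scale fractional amounts to integers, then divide out the common factor.
    scale = reduce(lambda a, b: a * b // gcd(a, b),
                   (v.denominator for v in counts.values()), 1)
    integers = {el: int(v * scale) for el, v in counts.items()}
    divisor = reduce(gcd, integers.values())

    parts = []
    for element in sorted(integers):
        n = integers[element] // divisor
        parts.append(element if n == 1 else f"{element}{n}")
    return "".join(parts)


print(normalize_formula("TiNiSn") == normalize_formula("NiTiSn"))  # True
print(normalize_formula("Ga0.5Sb0.5"))  # GaSb
```

String normalization of this kind does not help with the last two cases (expanding “SnBi4X7 (X=S, Se, Te)” or suggesting “CuCrSe2” for “AgCrSe2”); those need parsing of the surrounding text and a notion of chemical similarity.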
8. What is Matscholar?
• Matscholar is an attempt to organize the world’s
information on materials science, connecting
together topics of study, synthesis and
characterization methods, and specific materials
compositions
• It is also an effort to use state-of-the-art natural
language processing to make collective use of
the information in millions of articles
9. One of our main projects concerns named entity
recognition, or automatically labeling text
This allows for search
and is crucial to
downstream tasks
10. • > 4 million papers collected
• 31 million properties
• 19 million materials mentions
• 8.8 million characterization methods
• 7.5 million applications
• 5 million synthesis methods
• Data collection: over 4 million papers collected from more than 2100 journals.
Note – entities are currently extracted only from the abstracts of the papers
13. • The publication data set is not complete
• Currently analyzing abstracts only
• The algorithms are not perfect
• The search interface could be improved further
• We would like to hear from you if you try this!
Limitations (it is not perfect)
14. How does this work? High-level view
Weston, L. et al Named Entity
Recognition and Normalization
Applied to Large-Scale
Information Extraction from
the Materials Science
Literature. J. Chem. Inf. Model.
(2019).
15. Extracted 4 million abstracts
of relevant scientific articles
using various APIs from
journal publishers
Some are more difficult than
others to obtain.
Data cleaning is often
needed (e.g., stray HTML
tags, copyright statements)
Abstract collection
continues …
Step 1 – data collection
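The cleaning step mentioned on this slide can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the project's actual pipeline: `clean_abstract` and both regular expressions are hypothetical, and the copyright pattern is deliberately crude (it drops everything from the first copyright marker to the end of the text).

```python
import html
import re

TAG = re.compile(r"<[^>]+>")  # any HTML/XML tag
# Crude assumption: a copyright notice runs from its marker to the end.
COPYRIGHT = re.compile(r"\s*(?:©|\(c\)|copyright)\s.*$", re.IGNORECASE)


def clean_abstract(raw):
    """Strip tags, decode HTML entities, drop a trailing copyright notice."""
    text = html.unescape(TAG.sub("", raw))
    text = COPYRIGHT.sub("", text)
    return re.sub(r"\s+", " ", text).strip()


raw = "<p>The band gap of <b>GaAs</b> is 1.4 eV.</p> Copyright 2019 Elsevier B.V."
print(clean_abstract(raw))  # The band gap of GaAs is 1.4 eV.
```

A real corpus would need publisher-specific rules, since an abstract that legitimately uses the word “copyright” would be truncated by this pattern.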
17. • First split the text into sentences
– Seems simple, but remember edge cases: the period in “et al.” or
“etc.” does not necessarily signify the end of a sentence
• Then split the sentences into words
– Tricky things are detecting and normalizing chemical
formulas, selective lowercasing (“Battery” vs “battery” or
“BaS” vs “BAs”), homogenizing numbers, etc.
• Done largely with ChemDataExtractor*, with
some custom improvements
– We may move to a fully custom tokenizer soon
Step 2 - tokenization
*http://chemdataextractor.org
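The two edge cases on this slide can be illustrated with a small standalone sketch. It does not use or reproduce ChemDataExtractor; `split_sentences`, `normalize_token`, the abbreviation list, and the formula pattern are all assumptions made for illustration. The formula pattern only recognizes tokens built from two or more element-like units, so a single-element token such as “Si” would still be lowercased.

```python
import re

# Abbreviations whose trailing period should not end a sentence (assumed list).
ABBREVIATIONS = ("et al.", "etc.", "e.g.", "i.e.")

# Two or more element-like units, each with an optional amount, e.g. BaS, BAs.
FORMULA = re.compile(r"(?:[A-Z][a-z]?(?:\d+(?:\.\d+)?)?){2,}")


def split_sentences(text):
    """Split on a period followed by whitespace and a capital letter."""
    sentences, start = [], 0
    for match in re.finditer(r"\.\s+(?=[A-Z])", text):
        candidate = text[start:match.start() + 1]
        if candidate.endswith(ABBREVIATIONS):
            continue  # the period belongs to an abbreviation
        sentences.append(candidate.strip())
        start = match.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


def normalize_token(token):
    """Lowercase ordinary words but keep formula-like tokens unchanged."""
    return token if FORMULA.fullmatch(token) else token.lower()


text = "Some dopants, e.g. Sb and Bi, raise zT. Battery tests followed."
print(split_sentences(text))
print([normalize_token(t) for t in ["Battery", "BaS", "BAs"]])
```

With this rule “Battery” and “battery” collapse to the same token, while “BaS” (barium sulfide) and “BAs” (boron arsenide) stay distinct.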
19. • Part A is marking abstracts
as relevant / non-relevant
to inorganic materials
science
• Part B is tediously labeling
~600 abstracts
– Largely done by one person
– Spot-check of 25 abstracts
by a second person gave
87.4% agreement
Step 3 – hand label abstracts
21. • We use the word2vec
algorithm (Google) to turn
each unique word in our
corpus into a 200-
dimensional vector
• These vectors encode the
meaning of each word,
learned by trying to
predict the context words
around the target word
Step 4a: the word2vec algorithm is used to “featurize” words
Barazza, L. How does Word2Vec’s Skip-Gram work? Becominghuman.ai. 2017
22. Step 4a: the word2vec algorithm is used to “featurize” words
“You shall know a word by
the company it keeps”
- John Rupert Firth (1957)
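To make “predict context words around the target” concrete, the sketch below generates the (target, context) training pairs that a skip-gram model is trained on. It only shows the pair generation, not the training itself; `skipgram_pairs` and the window size are hypothetical choices for illustration, and the window actually used for the Matscholar embeddings is not stated in these slides.

```python
def skipgram_pairs(tokens, window=2):
    """Return (target, context) pairs for every token within the window."""
    pairs = []
    for i, target in enumerate(tokens):
        first = max(0, i - window)
        last = min(len(tokens), i + window + 1)
        for j in range(first, last):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs


tokens = ["the", "band", "gap", "of", "GaAs"]
for target, context in skipgram_pairs(tokens, window=1):
    print(target, "->", context)
```

During training, each word's vector is adjusted so that it becomes better at predicting the context words it is paired with, which is how words that keep similar company end up with similar vectors.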
23. • The classic example is:
– “king” - “man” + “woman” = ? → “queen”
Word embeddings trained on “normal” text learn
relationships between words
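The analogy arithmetic can be sketched with hand-made toy vectors. Everything here is an assumption for illustration: the two-dimensional vectors are invented (real embeddings in this work have 200 dimensions and are learned from text), and `analogy` is a hypothetical helper that returns the vocabulary word whose vector is closest, by cosine similarity, to vec(a) - vec(b) + vec(c).

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def analogy(vectors, a, b, c):
    """Word closest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    query = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], query))


# Invented toy vectors: axis 0 is "royalty", axis 1 is "gender".
toy = {
    "king": [1.0, 1.0],
    "queen": [1.0, -1.0],
    "man": [0.0, 1.0],
    "woman": [0.0, -1.0],
    "apple": [-1.0, 0.2],
}
print(analogy(toy, "king", "man", "woman"))  # queen
```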
25. • If you read this sentence:
“The band gap of ___ is 4.5 eV”
It is clear that the blank should be filled in with a
material word (not a synthesis method, characterization
method, etc.)
How do we get a neural network to take into account
context (as well as properties of the word itself)?
Step 4b: How do we train a model to recognize context?
26. Step 4b: An LSTM neural net classifies words by reading word sequences
Weston, L. et al Named Entity
Recognition and Normalization
Applied to Large-Scale
Information Extraction from
the Materials Science
Literature. J. Chem. Inf. Model.
(2019).
28. Step 5. Sit back and let the model label things for you!
Named Entity Recognition
• Custom machine learning models to
extract the most valuable materials-related
information.
• Utilizes a long short-term memory (LSTM)
network trained on ~1000 hand-annotated
abstracts.
• F1 scores of ~0.9 overall; the F1 score for inorganic
materials extraction is >0.9.
Weston, L. et al Named Entity
Recognition and Normalization
Applied to Large-Scale
Information Extraction from
the Materials Science
Literature. J. Chem. Inf. Model.
(2019).
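For readers unfamiliar with the F1 score quoted on this slide, it is the harmonic mean of precision and recall. The sketch below computes it for a set of predicted entities against hand labels; `entity_f1`, the (position, label) representation, and the example labels are assumptions for illustration and may differ from how the cited paper scores its model.

```python
def entity_f1(predicted, gold):
    """F1 = 2 * precision * recall / (precision + recall) over entity sets."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# Hypothetical example: entities as (token position, label) pairs.
gold = {(0, "material"), (3, "property"), (5, "characterization")}
predicted = {(0, "material"), (3, "property"), (6, "application")}
print(round(entity_f1(predicted, gold), 3))  # 0.667
```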
29. Could these techniques also be used to predict which materials we might want to screen for an application?
papers to read “someday”
NLP algorithms
30. • The classic example is:
– “king” - “man” + “woman” = ? → “queen”
Remember that word embeddings seem to learn
relationships in text
31. For scientific text, the embeddings learn scientific concepts as well
crystal structures of the elements
Tshitoyan, V. et al. Unsupervised word embeddings capture latent
knowledge from materials science literature. Nature 571, 95–98 (2019).
32. There seems to be materials knowledge encoded in the word vectors
Tshitoyan, V. et al. Unsupervised word embeddings capture latent
knowledge from materials science literature. Nature 571, 95–98 (2019).
33. Note that more data is not always better! We want relevance
Tshitoyan, V. et al. Unsupervised word embeddings capture latent
knowledge from materials science literature. Nature 571, 95–98 (2019).
34. Word embeddings also have the periodic table encoded in them, with no prior knowledge
“word embedding”
periodic table
Tshitoyan, V. et al. Unsupervised word embeddings capture latent
knowledge from materials science literature. Nature 571, 95–98 (2019).
35. • Dot product of a composition word with
the word “thermoelectric” essentially
predicts how likely that word is to appear
in an abstract with the word
thermoelectric
• Compositions with high dot products are
typically known thermoelectrics
• Sometimes, compositions have a high dot
product with “thermoelectric” but have
never been studied as a thermoelectric
• These compositions usually have high
computed power factors!
(DFT+BoltzTraP)
Making predictions: dot products measure likelihood for
words to co-occur
Tshitoyan, V. et al. Unsupervised word embeddings capture latent knowledge from
materials science literature. Nature 571, 95–98 (2019).
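The ranking idea on this slide can be sketched as follows. This is a toy illustration, not the published workflow: the two-dimensional vectors and the material names are invented placeholders, and `rank_candidates` is a hypothetical helper that scores each composition by its dot product with the application word and drops compositions already studied for that application.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def rank_candidates(embeddings, application, already_studied):
    """Rank unstudied words by dot product with the application word."""
    target = embeddings[application]
    scores = {
        word: dot(vector, target)
        for word, vector in embeddings.items()
        if word != application and word not in already_studied
    }
    return sorted(scores, key=scores.get, reverse=True)


# Invented placeholder embeddings (real ones are 200-dimensional).
embeddings = {
    "thermoelectric": [1.0, 0.5],
    "material_known": [0.9, 0.6],
    "material_new_1": [0.8, 0.4],
    "material_new_2": [-0.3, 0.2],
}
print(rank_candidates(embeddings, "thermoelectric", {"material_known"}))
# ['material_new_1', 'material_new_2']
```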
36. Try “going back in time” and ranking materials, and follow what happens in later years
Tshitoyan, V. et al.
Unsupervised word
embeddings capture latent
knowledge from materials
science literature. Nature
571, 95–98 (2019).
37. – For every year since 2001, see which compounds we would have predicted using only literature data up to that point in time
– Make predictions of which materials are the most promising thermoelectrics based on data up to that year
– See if those materials were actually studied as thermoelectrics in subsequent years
A more comprehensive “back in time” test
Tshitoyan, V. et al. Unsupervised word embeddings capture
latent knowledge from materials science literature. Nature
571, 95–98 (2019).
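The bookkeeping behind such a “back in time” test can be sketched as below. The data are invented placeholders and `later_studied_fraction` is a hypothetical helper; it assumes we already have, for a given cutoff year, a list of predicted materials and a lookup of the first year each material was reported as a thermoelectric.

```python
def later_studied_fraction(predictions, first_studied, cutoff_year, horizon):
    """Fraction of predictions first studied within `horizon` years after cutoff."""
    if not predictions:
        return 0.0
    last_year = cutoff_year + horizon
    hits = [
        material for material in predictions
        if cutoff_year < first_studied.get(material, float("inf")) <= last_year
    ]
    return len(hits) / len(predictions)


# Invented placeholder data for a model trained on literature up to 2001.
predictions = ["mat_a", "mat_b", "mat_c", "mat_d"]
first_studied = {"mat_a": 2004, "mat_c": 2012}
print(later_studied_fraction(predictions, first_studied, 2001, horizon=5))   # 0.25
print(later_studied_fraction(predictions, first_studied, 2001, horizon=15))  # 0.5
```

Repeating this for each cutoff year, and comparing against a baseline such as randomly chosen materials, indicates whether the ranking carries predictive signal.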
38. We also published a list of potential new thermoelectrics
Tshitoyan, V. et al. Unsupervised word embeddings capture
latent knowledge from materials science literature. Nature
571, 95–98 (2019).
It is one thing to
retroactively test, but
perhaps another to see
how things go after
publication
39. Two were studied between submission and publication of the manuscript
Tshitoyan, V. et al. Unsupervised word embeddings capture
latent knowledge from materials science literature. Nature
571, 95–98 (2019).
40. More were studied since then (mainly computationally)
Tshitoyan, V. et al. Unsupervised word embeddings capture
latent knowledge from materials science literature. Nature
571, 95–98 (2019).
42. More were studied since then (mainly computationally)
Tshitoyan, V. et al. Unsupervised word embeddings capture
latent knowledge from materials science literature. Nature
571, 95–98 (2019).
https://arxiv.org/abs/2010.08461
43. Our collaborators also synthesized a prediction, finding a moderate zT of 0.14
Tshitoyan, V. et al. Unsupervised word embeddings capture
latent knowledge from materials science literature. Nature
571, 95–98 (2019).
44. How is this working?
“Context words” link together information from different sources
46. 1. Automatic creation of structured materials databases from the literature, e.g. doping database
Sentence | Base material | Dopant | Doping concentration
…the influence of yttrium doping (0-10mol%) on BSCF… | BSCF | Yttrium | 0-10 mol%
undoped, anion-doped(Sb,Bi) and cation-doped(Ca,Zn) solid sln. of Mg10Si2Sn3… | Mg10Si2Sn3 | Sb, Bi, Ca, Zn | -
The zT of As2Cd3 with electron doping is found to be ~ with n=10^20cm-3 | As2Cd3 | electron | n=10^20cm-3
This leads to zT=0.5 obtained at 500K (p=10^20cm-3) in p-type As2Cd3T | As2Cd3 | p-type | p=10^20cm-3
The undoped and 0.25wt% La doped CdO films show 111… …however, …. for doping concentrations greater than 0.50wt%. | CdO | La | 0.25wt%, >0.5%
Will allow you to answer questions like “what
are all the materials known to be doped with
Eu3+” ?
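Once the extracted rows are stored as structured records, such a question becomes a simple lookup. The sketch below is a hypothetical illustration: `DopingRecord` and `materials_doped_with` are invented names, and the records are three of the example rows from the slide's table, not real database contents.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DopingRecord:
    base_material: str
    dopants: tuple
    concentration: str = ""


# Three example rows taken from the table on this slide.
RECORDS = [
    DopingRecord("BSCF", ("Yttrium",), "0-10 mol%"),
    DopingRecord("Mg10Si2Sn3", ("Sb", "Bi", "Ca", "Zn")),
    DopingRecord("CdO", ("La",), "0.25wt%, >0.5%"),
]


def materials_doped_with(records, dopant):
    """Return the sorted base materials that list `dopant` as a dopant."""
    return sorted({r.base_material for r in records if dopant in r.dopants})


print(materials_doped_with(RECORDS, "La"))  # ['CdO']
print(materials_doped_with(RECORDS, "Bi"))  # ['Mg10Si2Sn3']
```

In practice the dopant names would also need normalizing (for example “Yttrium” versus “Y”) before a lookup like this is reliable.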
47. 2. Learning representations of materials
● Mat2vec suggested that embeddings contain chemical information
● Can we make embeddings for arbitrary materials as material descriptors?
● i.e., word embeddings for materials not in the literature
● Descriptors could be used for direct classification for application (link
prediction), or quantitative property prediction (regression features)
49. Initial results – predicting experimental band gap from composition (~3000 data points)
50. 3. Creating a comprehensive software library for materials science NLP research (multiple LBNL research groups)
https://github.com/lbnlp
51. 4. D2S2 - data driven synthesis science (in progress, larger LBNL collaboration)
Can we combine natural language processing with theory
and experiments to control synthesis?
52. Title auto-generated from abstract | Published title
Dynamics of molecular hydrogen confined in narrow nanopores | Restricted dynamics of molecular hydrogen confined in activated carbon nanopores
Microfluidic Generation of Polydisperse Solid Foams | Generation of Solid Foams with Controlled Polydispersity Using Microfluidics
Minimum variance unbiased estimator of product performance | Assessing the lifetime performance index of gamma lifetime products in the manufacturing industry
Angle resolved ultraviolet photoemission study of fluorescein films on Ag 110 | The growth of thin fluorescein films on Ag 110
... and also some fun things, like automatic title generation
Also have results in suggesting journals to submit a new article to, etc.
53. The Matscholar team
Kristin Persson
Anubhav Jain
Gerbrand Ceder
John Dagdelen
Leigh Weston
Vahe Tshitoyan
Amalie Trewartha
Alex Dunn
Viktoriia Baibakova
Funding from
(now at Google) (now at Medium)