Discovering advanced materials for energy
applications by mining the scientific literature
Anubhav Jain
Energy Technologies Area
Lawrence Berkeley National Laboratory
Berkeley, CA
AFRL meeting, Jan 2020
Slides (already) posted to hackingmaterials.lbl.gov
• Often, materials are known for several decades
before their functional applications are known
– MgB2 sitting on lab shelves for 50 years before its
identification as a superconductor in 2001
– LiFePO4 known since 1938, only identified as a Li-ion
battery cathode in 1997
• Even after discovery, optimization and
commercialization still take decades
• To get a sense for why this is so hard, let’s look at
the problem in more detail …
2
Typically, both new materials discovery and optimization
take decades
What constrains traditional approaches to materials design?
3
“[The Chevrel] discovery resulted from a lot of
unsuccessful experiments of Mg ions insertion
into well-known hosts for Li+ ions insertion, as
well as from the thorough literature analysis
concerning the possibility of divalent ions
intercalation into inorganic materials.”
-Aurbach group, on discovery of Chevrel cathode
for multivalent (e.g., Mg2+) batteries
Levi, Levi, Chasid, Aurbach
J. Electroceramics (2009)
4
Researchers are starting to fundamentally re-think how we
invent the materials that make up our devices
Next-generation materials design
Computer-aided materials design
Natural language processing
“Self-driving laboratories”
Outline
5
① Natural language processing - where are
we right now?
② What’s next for the NLP work?
6
Can ML help us work through our backlog of information we
need to assimilate from text sources?
papers to read “someday”
NLP algorithms
• It is difficult to look up all information on any given material
due to the many different ways chemical compositions
are written
– a search for “TiNiSn” will give different results than “NiTiSn”
– a search for “GaSb” won’t match text that reads “Ga0.5Sb0.5”
– a search for “SnBi4Te7” won’t match text that reads “we studied
SnBi4X7 (X=S, Se, Te)”.
– a search for “AgCrSe2”, if it doesn’t have any hits, won’t suggest
“CuCrSe2” as a similar result
• It is difficult to compile summaries, e.g.:
– A list of all materials studied for an application
– A list of all synthesis methods for a material
7
Traditional search doesn’t answer the questions we want
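One way to see why exact-string search fails is to normalize compositions before comparing them. The sketch below is a minimal illustration, not the Matscholar normalizer: the function name `normalize_formula` is hypothetical, and it assumes simple formulas without parentheses, charges, or variables such as "X". It reduces amounts to the smallest integers and sorts the element symbols alphabetically.

```python
import re
from fractions import Fraction
from functools import reduce
from math import gcd

# Element symbol followed by an optional integer or decimal amount.
TOKEN = re.compile(r"([A-Z][a-z]?)(\d+(?:\.\d+)?)?")


def normalize_formula(formula):
    """Return a canonical string: smallest integer amounts, elements sorted A-Z."""
    counts = {}
    for element, amount in TOKEN.findall(formula):
        value = Fraction(amount) if amount else Fraction(1)
        counts[element] = counts.get(element, Fraction(0)) + value
    if not counts:
        raise ValueError(f"no element symbols found in {formula!r}")

    # Clear fractional amounts (least common multiple of the denominators),
    # then divide out the greatest common divisor.
    scale = reduce(lambda a, b: a * b // gcd(a, b),
                   (v.denominator for v in counts.values()), 1)
    integers = {el: int(v * scale) for el, v in counts.items()}
    divisor = reduce(gcd, integers.values())

    parts = []
    for element in sorted(integers):
        amount = integers[element] // divisor
        parts.append(element if amount == 1 else f"{element}{amount}")
    return "".join(parts)
```

With this, "TiNiSn" and "NiTiSn" map to the same key, and "Ga0.5Sb0.5" maps to the same key as "GaSb". Cases like "SnBi4X7 (X=S, Se, Te)" or suggesting "CuCrSe2" for "AgCrSe2" need more than string normalization.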
What is matscholar?
• Matscholar is an attempt to organize the world’s
information on materials science, connecting
together topics of study, synthesis and
characterization methods, and specific materials
compositions
• It is also an effort to use state-of-the-art natural
language processing to make collective use of
the information in millions of articles
One of our main projects concerns named entity
recognition, or automatically labeling text
9
10
• > 4 million papers collected
• 31 million properties
• 19 million materials mentions
• 8.8 million characterization methods
• 7.5 million applications
• 5 million synthesis methods
• Data Collection: Over 4 million full papers*
collected from more than 2100 journals.
* Entities only extracted from abstracts deemed relevant to inorganic materials
science (~2M) so far.
11
Now we can search!
Live on www.matscholar.com
12
Another example …
13
Ok so how does this work? High-level view
Weston, L. et al Named Entity
Recognition and Normalization
Applied to Large-Scale
Information Extraction from
the Materials Science
Literature. J. Chem. Inf. Model.
(2019).
Extracted 4 million
abstracts of relevant
scientific articles using
various APIs from
journal publishers
Some are more difficult
than others to obtain.
Abstract collection
continues …
14
Step 1 – data collection
• First split the text into sentences
– Seems simple, but remember edge cases: “et al.” or
“etc.” do not necessarily signify the end of a sentence despite
the period
• Then split the sentences into words
– Tricky things are detecting and normalizing chemical
formulas, selective lowercasing (“Battery” vs “battery” or
“BaS” vs “BAs”), homogenizing numbers, etc.
• Done largely with ChemDataExtractor*, plus
some custom improvements
– We may move to a fully custom tokenizer soon
16
Step 2 - tokenization
*http://chemdataextractor.org
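The two tokenization pitfalls above can be sketched in a few lines. This is a toy illustration under stated assumptions, not ChemDataExtractor and not the Matscholar tokenizer: the abbreviation list and the element subset are deliberately tiny, and the function names are hypothetical.

```python
import re

# Tokens that end in a period but usually do not end a sentence (illustrative list).
ABBREVIATIONS = {"al.", "etc.", "e.g.", "i.e.", "vs.", "Fig."}

# Small illustrative subset of element symbols; a real tokenizer needs all of them.
ELEMENTS = {"B", "As", "Ba", "S", "Ti", "Ni", "Sn", "Ga", "Sb", "O", "Li", "Fe", "P", "Mg"}

# One or more "symbol + optional amount" groups, e.g. BaS, LiFePO4, Ga0.5Sb0.5.
FORMULA = re.compile(r"(?:[A-Z][a-z]?(?:\d+(?:\.\d+)?)?)+")


def split_sentences(text):
    """Split on sentence-final punctuation, skipping known abbreviations."""
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        if token.endswith((".", "?", "!")) and token not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:
        sentences.append(" ".join(current))
    return sentences


def looks_like_formula(token):
    """True if the token parses as element symbols (from ELEMENTS) with amounts."""
    if not FORMULA.fullmatch(token):
        return False
    return all(symbol in ELEMENTS for symbol in re.findall(r"[A-Z][a-z]?", token))


def normalize_token(token):
    """Lowercase ordinary words but keep the case of chemical formulas."""
    return token if looks_like_formula(token) else token.lower()
```

The sketch still gets ambiguous cases wrong: an "etc." that really does end a sentence is not split, and a sentence-initial "In" would be kept as indium once the element set is complete. Those ambiguities are part of why tokenization needs custom handling.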
• Part A is marking abstracts
as relevant / non-relevant
to inorganic materials
science
• Part B is tediously labeling
~600 abstracts
– Largely done by one person
– Spot-check of 25 abstracts
by a second person gave
87.4% agreement
18
Step 3 – hand label abstracts
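The slide does not say how the 87.4% agreement was measured. One simple possibility, assumed here purely for illustration, is the fraction of tokens that both annotators gave the same label. The label names ("MAT", "PRO", "O") are hypothetical.

```python
def percent_agreement(labels_a, labels_b):
    """Fraction of positions where two annotators assigned the same label."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both annotators must label the same tokens")
    if not labels_a:
        raise ValueError("no labels to compare")
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)


# Toy example: the annotators disagree on one of four tokens.
annotator_1 = ["MAT", "O", "PRO", "O"]
annotator_2 = ["MAT", "O", "O", "O"]
agreement = percent_agreement(annotator_1, annotator_2)
```

Raw agreement does not correct for agreement expected by chance; a chance-corrected statistic such as Cohen's kappa is a common alternative.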
• We use the word2vec
algorithm (Google) to turn
each unique word in our
corpus into a 200-dimensional
vector
• These vectors encode the
meaning of each word,
learned by trying to
predict the context words
around a target word
20
Step 4a: the word2vec algorithm is used to “featurize” words
Barazza, L. How does Word2Vec’s Skip-Gram work? Becominghuman.ai. 2017
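The "predict context words around the target" idea corresponds to the skip-gram training pairs. The sketch below only generates those (target, context) pairs; it does not train any vectors, and the window size of 2 is an arbitrary choice for illustration rather than the setting used in this work.

```python
def skipgram_pairs(tokens, window=2):
    """Return (target, context) pairs for every token and its neighbours."""
    pairs = []
    for i, target in enumerate(tokens):
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs


# Each pair is one training example: given the target word, predict the context word.
pairs = skipgram_pairs(["the", "band", "gap"], window=1)
```

A word2vec model is then trained so that the vector of each target word is good at predicting its context words, which is what pushes words used in similar contexts toward similar vectors.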
“You shall know a word by
the company it keeps”
- John Rupert Firth (1957)
• The classic example is:
– “king” - “man” + “woman” = ? → “queen”
22
Word embeddings trained on “normal” text learn
relationships between words
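The analogy arithmetic can be shown with hand-made vectors. The two-dimensional vectors below are invented for illustration (roughly "royalty" and "gender" axes); real word2vec vectors are learned from text and have many more dimensions.

```python
import math

# Hand-made toy vectors, NOT learned embeddings.
VECTORS = {
    "king": (1.0, 1.0),
    "queen": (1.0, -1.0),
    "man": (0.0, 1.0),
    "woman": (0.0, -1.0),
    "apple": (0.2, 0.1),
}


def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norms


def analogy(a, b, c, vectors=VECTORS):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = tuple(x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c]))
    candidates = [word for word in vectors if word not in {a, b, c}]
    return max(candidates, key=lambda word: cosine(vectors[word], target))


result = analogy("king", "man", "woman")
```

Here "king" - "man" + "woman" lands exactly on the toy "queen" vector. With learned embeddings the result is only the nearest neighbour, not an exact match.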
• If you read this sentence:
“The band gap of ___ is 4.5 eV”
It is clear that the blank should be filled in with a
material word (not a synthesis method, characterization
method, etc.)
How do we get a neural network to take into account
context (as well as properties of the word itself)?
24
Step 4b: How do we train a model to recognize context?
25
Step 4b. An LSTM neural net classifies words by reading
word sequences
Weston, L. et al Named Entity
Recognition and Normalization
Applied to Large-Scale
Information Extraction from
the Materials Science
Literature. J. Chem. Inf. Model.
(2019).
26
Ok so how does this work? High-level view
Weston, L. et al Named Entity
Recognition and Normalization
Applied to Large-Scale
Information Extraction from
the Materials Science
Literature. J. Chem. Inf. Model.
(2019).
27
Step 5. Sit back and let the model label things for you!
Named Entity Recognition
• Custom machine learning models to
extract the most valuable materials-related
information.
• Utilizes a long short-term memory (LSTM)
network trained on ~1000 hand-annotated
abstracts.
• f1 scores of ~0.9. f1 score for inorganic
materials extraction is >0.9.
Weston, L. et al Named Entity
Recognition and Normalization
Applied to Large-Scale
Information Extraction from
the Materials Science
Literature. J. Chem. Inf. Model.
(2019).
28
Live online …
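For readers unfamiliar with the f1 score quoted for the entity recognizer, the sketch below computes it per label from token-level gold and predicted labels. Token-level scoring and the label names are assumptions for illustration; entity-level scoring over whole spans is also common and the slides do not specify which was used.

```python
def f1_for_label(gold, predicted, label):
    """Precision, recall and F1 for one label, compared token by token."""
    pairs = list(zip(gold, predicted))
    true_pos = sum(g == label and p == label for g, p in pairs)
    false_pos = sum(g != label and p == label for g, p in pairs)
    false_neg = sum(g == label and p != label for g, p in pairs)

    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Toy example: one correct "MAT", one spurious "MAT", one missed "MAT".
gold = ["MAT", "O", "MAT", "PRO"]
predicted = ["MAT", "MAT", "O", "PRO"]
precision, recall, f1 = f1_for_label(gold, predicted, "MAT")
```

F1 is the harmonic mean of precision and recall, so it is only high when the model both finds most entities and rarely labels non-entities.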
29
Could these techniques also be used to predict which
materials we might want to screen for an application?
• The classic example is:
– “king” - “man” + “woman” = ? → “queen”
30
Remember that word embeddings seem to learn
relationships in text
31
For scientific text, it learns scientific concepts as well
crystal structures of the elements
Tshitoyan, V. et al. Unsupervised word embeddings capture latent
knowledge from materials science literature. Nature 571, 95–98 (2019).
32
There seems to be materials knowledge encoded in the
word vectors
Tshitoyan, V. et al. Unsupervised word embeddings capture latent
knowledge from materials science literature. Nature 571, 95–98 (2019).
33
Note that more data is not always better!
We want relevance
Tshitoyan, V. et al. Unsupervised word embeddings capture latent
knowledge from materials science literature. Nature 571, 95–98 (2019).
34
Word embeddings also have the periodic table encoded in them
with no prior knowledge
“word embedding”
periodic table
• Dot product of a composition word with
the word “thermoelectric” essentially
predicts how likely that word is to appear
in an abstract with the word
thermoelectric
• Compositions with high dot products are
typically known thermoelectrics
• Sometimes, compositions have a high dot
product with “thermoelectric” but have
never been studied as a thermoelectric
• These compositions usually have high
computed power factors!
(DFT+BoltzTraP)
35
Making predictions: dot products measure likelihood for
words to co-occur
Tshitoyan, V. et al. Unsupervised word embeddings capture latent knowledge from
materials science literature. Nature 571, 95–98 (2019).
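The ranking step described above amounts to scoring each composition vector against the vector for the application word and sorting. The vectors and material names below are placeholders invented for illustration; they are not embeddings or results from the paper.

```python
def dot(u, v):
    """Dot product of two vectors of equal length."""
    return sum(a * b for a, b in zip(u, v))


def rank_by_dot(candidates, query_vector):
    """Sort candidate names by dot product with the query vector, highest first."""
    scored = [(name, dot(vector, query_vector)) for name, vector in candidates.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Placeholder vectors standing in for learned word embeddings.
thermoelectric = (1.0, 0.5, 0.0)
compositions = {
    "material_A": (0.9, 0.8, 0.1),
    "material_B": (0.1, 0.2, 0.9),
    "material_C": (0.5, 0.1, 0.3),
}
ranking = rank_by_dot(compositions, thermoelectric)
```

In the approach described on the slide, the interesting candidates are those that rank highly but have never appeared alongside "thermoelectric" in the literature.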
36
Try “going back in time” and ranking materials, and follow
what happens in later years
Tshitoyan, V. et al.
Unsupervised word
embeddings capture latent
knowledge from materials
science literature. Nature
571, 95–98 (2019).
– For every year since
2001, see which
compounds we would
have predicted using
only literature data until
that point in time
– Make predictions of
what materials are the
most promising
thermoelectrics for
data until that year
– See if those materials
were actually studied as
thermoelectrics in
subsequent years
37
A more comprehensive “back in time” test
Tshitoyan, V. et al. Unsupervised word embeddings capture
latent knowledge from materials science literature. Nature
571, 95–98 (2019).
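The protocol above can be sketched as a small evaluation function. It assumes the scores were already computed from embeddings trained only on text published up to the cutoff year; the function name, the data layout, and the toy numbers are all hypothetical.

```python
def back_in_time_test(scores, first_studied, cutoff_year, top_k=50):
    """Rank materials not yet studied at the cutoff and check which were studied later.

    scores: material -> score from a model trained on text up to cutoff_year.
    first_studied: material -> year first reported for the application, or None.
    """
    candidates = [
        material for material in scores
        if first_studied.get(material) is None or first_studied[material] > cutoff_year
    ]
    top = sorted(candidates, key=lambda material: scores[material], reverse=True)[:top_k]
    hits = [material for material in top if first_studied.get(material) is not None]
    hit_rate = len(hits) / len(top) if top else 0.0
    return top, hit_rate


# Toy data: material_A was already known in 2001, so it is excluded from the ranking.
scores_2001 = {"material_A": 0.9, "material_B": 0.8, "material_C": 0.1, "material_D": 0.7}
first_studied = {"material_A": 1999, "material_B": 2005, "material_D": None}
top, hit_rate = back_in_time_test(scores_2001, first_studied, cutoff_year=2001, top_k=2)
```

Repeating this for each cutoff year and comparing the hit rate against randomly chosen candidates gives a baseline-corrected view of how predictive the ranking is.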
• Thus far, 2 of our top 20 predictions made in
~August 2018 have already been reported in the
literature for the first time as thermoelectrics
– Li3Sb was the subject of a computational study
(predicted zT=2.42) in Oct 2018
– SnTe2 was experimentally found to be a moderately
good thermoelectric (expt zT=0.71) in Dec 2018
• We are working with an experimentalist on one
of the predictions (as a “spare time” project)
38
How about “forward” predictions?
[1] Yang et al. "Low lattice thermal conductivity and
excellent thermoelectric behavior in Li3Sb and Li3Bi."
Journal of Physics: Condensed Matter 30.42 (2018):
425401
[2] Wang et al. "Ultralow lattice thermal conductivity and
electronic properties of monolayer 1T phase semimetal
SiTe2 and SnTe2." Physica E: Low-dimensional Systems and
Nanostructures 108 (2019): 53-59
39
How is this working?
“Context words” link together information
from different sources
Outline
40
① Natural language processing - where are
we right now?
② What’s next for the NLP work?
• Currently, we only have word vectors for
compositions that explicitly appear in abstracts
• We can rank known materials for an application,
but for materials with zero or little mention in the
scientific literature, we are stuck!
• How do we get word embeddings for
compositions that do not exist in the text?
41
Making predictions for entirely new compositions
42
“Hidden representation learning”
43
Initial results – predicting experimental band gap from
composition (~3000 data points)
44
Going beyond entity recognition towards relationship
extraction
45
Current approach is not good enough
• E.g., automatically generate databases from the
literature
– Materials and their numerical band gaps (or thermal
conductivities, or bulk modulus, or superconducting
temperature, etc.)
– If materials can be made n-type, p-type, or both
– Which synthesis techniques led to various sample
descriptors
• Will likely require more powerful techniques, e.g.,
attention-based algorithms (BERT, Google XLNet …)
– To be investigated …
46
Once the accuracy improves, we can start to make much
more powerful searches
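To make the target output concrete, the sketch below pulls (material, value, unit) records from sentences of the form "The band gap of ___ is 4.5 eV". It is a naive pattern-matching baseline written for illustration, not the approach used or proposed in this talk; the pattern, the function name, and the placeholder compound "XYZ2" are assumptions.

```python
import re

# Matches only one rigid phrasing; real text states band gaps in many other ways.
BAND_GAP = re.compile(
    r"band gap of (?P<material>\S+) is (?P<value>\d+(?:\.\d+)?)\s*(?P<unit>eV)"
)


def extract_band_gaps(text):
    """Return (material, value, unit) tuples for every match in the text."""
    return [
        (match.group("material"), float(match.group("value")), match.group("unit"))
        for match in BAND_GAP.finditer(text)
    ]


records = extract_band_gaps("The band gap of XYZ2 is 4.5 eV.")
missed = extract_band_gaps("A gap of 4.5 eV was measured for XYZ2.")
```

The second sentence carries the same information but yields nothing, which illustrates why relationship extraction likely needs models that read context rather than fixed patterns.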
47
D2S2 - data driven synthesis science (just starting)
Can we combine natural language processing with theory
and experiments to control synthesis?
Title auto-generated from abstract vs. published title
• Auto-generated: Dynamics of molecular hydrogen confined in narrow nanopores
– Published: Restricted dynamics of molecular hydrogen confined in activated carbon nanopores
• Auto-generated: Microfluidic Generation of Polydisperse Solid Foams
– Published: Generation of Solid Foams with Controlled Polydispersity Using Microfluidics
• Auto-generated: Minimum variance unbiased estimator of product performance
– Published: Assessing the lifetime performance index of gamma lifetime products in the manufacturing industry
• Auto-generated: Angle resolved ultraviolet photoemission study of fluorescein films on Ag 110
– Published: The growth of thin fluorescein films on Ag 110
48
... and also some fun things, like automatic title generation
49
Acknowledgements
• High-throughput DFT
– Gerbrand Ceder and “BURP” team
– Funding: Bosch / Umicore
• Natural language processing
– Gerbrand Ceder, Kristin Persson, and “Matscholar” team
– Funding: Toyota Research Institute
• Overall work funded by US Department of Energy
50
The Matscholar team
Kristin Persson, Anubhav Jain, Gerbrand Ceder
John Dagdelen
Leigh Weston
Vahe Tshitoyan
Amalie Trewartha
Alex Dunn
Viktoriia Baibakova
Funding from
(now at Google) (now at Medium)

 
Vip profile Call Girls In Lonavala 9748763073 For Genuine Sex Service At Just...
Vip profile Call Girls In Lonavala 9748763073 For Genuine Sex Service At Just...Vip profile Call Girls In Lonavala 9748763073 For Genuine Sex Service At Just...
Vip profile Call Girls In Lonavala 9748763073 For Genuine Sex Service At Just...
 
Stunning ➥8448380779▻ Call Girls In Panchshil Enclave Delhi NCR
Stunning ➥8448380779▻ Call Girls In Panchshil Enclave Delhi NCRStunning ➥8448380779▻ Call Girls In Panchshil Enclave Delhi NCR
Stunning ➥8448380779▻ Call Girls In Panchshil Enclave Delhi NCR
 
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 60009654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
 
Creating and Analyzing Definitive Screening Designs
Creating and Analyzing Definitive Screening DesignsCreating and Analyzing Definitive Screening Designs
Creating and Analyzing Definitive Screening Designs
 
Seismic Method Estimate velocity from seismic data.pptx
Seismic Method Estimate velocity from seismic  data.pptxSeismic Method Estimate velocity from seismic  data.pptx
Seismic Method Estimate velocity from seismic data.pptx
 
All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...
All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...
All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...
 
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdfPests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
 
Formation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disksFormation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disks
 
Botany 4th semester file By Sumit Kumar yadav.pdf
Botany 4th semester file By Sumit Kumar yadav.pdfBotany 4th semester file By Sumit Kumar yadav.pdf
Botany 4th semester file By Sumit Kumar yadav.pdf
 
Animal Communication- Auditory and Visual.pptx
Animal Communication- Auditory and Visual.pptxAnimal Communication- Auditory and Visual.pptx
Animal Communication- Auditory and Visual.pptx
 
Chemical Tests; flame test, positive and negative ions test Edexcel Internati...
Chemical Tests; flame test, positive and negative ions test Edexcel Internati...Chemical Tests; flame test, positive and negative ions test Edexcel Internati...
Chemical Tests; flame test, positive and negative ions test Edexcel Internati...
 
SCIENCE-4-QUARTER4-WEEK-4-PPT-1 (1).pptx
SCIENCE-4-QUARTER4-WEEK-4-PPT-1 (1).pptxSCIENCE-4-QUARTER4-WEEK-4-PPT-1 (1).pptx
SCIENCE-4-QUARTER4-WEEK-4-PPT-1 (1).pptx
 
Nightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43b
Nightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43bNightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43b
Nightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43b
 
Recombinant DNA technology (Immunological screening)
Recombinant DNA technology (Immunological screening)Recombinant DNA technology (Immunological screening)
Recombinant DNA technology (Immunological screening)
 
Biopesticide (2).pptx .This slides helps to know the different types of biop...
Biopesticide (2).pptx  .This slides helps to know the different types of biop...Biopesticide (2).pptx  .This slides helps to know the different types of biop...
Biopesticide (2).pptx .This slides helps to know the different types of biop...
 

Discovering advanced materials for energy applications by mining the scientific literature

  • 1. Discovering advanced materials for energy applications by mining the scientific literature Anubhav Jain Energy Technologies Area Lawrence Berkeley National Laboratory Berkeley, CA AFRL meeting, Jan 2020 Slides (already) posted to hackingmaterials.lbl.gov
  • 2. Typically, both new materials discovery and optimization take decades
      • Often, materials are known for several decades before their functional applications are discovered
          – MgB2 sat on lab shelves for 50 years before its identification as a superconductor in 2001
          – LiFePO4 has been known since 1938 but was only identified as a Li-ion battery cathode in 1997
      • Even after discovery, optimization and commercialization still take decades
      • To get a sense for why this is so hard, let’s look at the problem in more detail …
  • 3. What constrains traditional approaches to materials design? “[The Chevrel] discovery resulted from a lot of unsuccessful experiments of Mg ions insertion into well-known hosts for Li+ ions insertion, as well as from the thorough literature analysis concerning the possibility of divalent ions intercalation into inorganic materials.” - Aurbach group, on discovery of Chevrel cathode for multivalent (e.g., Mg2+) batteries. Levi, Levi, Chasid, Aurbach, J. Electroceramics (2009)
  • 4. Researchers are starting to fundamentally re-think how we invent the materials that make up our devices. Next-generation materials design: computer-aided materials design, natural language processing, “self-driving laboratories”
  • 5. Outline: ① Natural language processing - where are we right now? ② What’s next for the NLP work?
  • 6. Can ML help us work through the backlog of information we need to assimilate from text sources? [Figure labels: papers to read “someday”; NLP algorithms]
  • 7. Traditional search doesn’t answer the questions we want
      • It is difficult to look up all information on any given material because chemical compositions are written in many different ways
          – a search for “TiNiSn” will give different results than “NiTiSn”
          – a search for “GaSb” won’t match text that reads “Ga0.5Sb0.5”
          – a search for “SnBi4Te7” won’t match text that reads “we studied SnBi4X7 (X=S, Se, Te)”
          – a search for “AgCrSe2”, if it doesn’t have any hits, won’t suggest “CuCrSe2” as a similar result
      • It is difficult to compile summaries, e.g.:
          – A list of all materials studied for an application
          – A list of all synthesis methods for a material
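The first two search failures on this slide (element order and fractional stoichiometry) can be illustrated with a small normalization routine. This is a minimal sketch, not the Matscholar implementation: `parse_formula` and `normalize` are hypothetical helper names, the parser assumes flat formulas without parentheses, and it does not expand placeholders such as “SnBi4X7 (X=S, Se, Te)”.

```python
import re
from fractions import Fraction
from functools import reduce
from math import gcd


def parse_formula(formula):
    """Parse a flat formula such as 'Ga0.5Sb0.5' into {element: amount}."""
    counts = {}
    for element, amount in re.findall(r"([A-Z][a-z]?)(\d+(?:\.\d+)?)?", formula):
        value = Fraction(amount) if amount else Fraction(1)
        counts[element] = counts.get(element, Fraction(0)) + value
    return counts


def normalize(formula):
    """Return a canonical string: alphabetical elements, smallest integer ratios."""
    counts = parse_formula(formula)
    # Multiply by the least common multiple of the denominators to get integers.
    lcm = reduce(lambda a, b: a * b // gcd(a, b),
                 (v.denominator for v in counts.values()), 1)
    integers = {el: int(v * lcm) for el, v in counts.items()}
    # Divide by the greatest common divisor to reach the reduced formula.
    common = reduce(gcd, integers.values())
    parts = []
    for element in sorted(integers):
        n = integers[element] // common
        parts.append(element + (str(n) if n != 1 else ""))
    return "".join(parts)


print(normalize("TiNiSn"), normalize("NiTiSn"))
print(normalize("GaSb"), normalize("Ga0.5Sb0.5"))
```

Indexing both the query and the text under the normalized string lets “TiNiSn” match “NiTiSn” and “GaSb” match “Ga0.5Sb0.5”. Alphabetical ordering is an arbitrary choice made here for simplicity.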
  • 8. What is matscholar? • Matscholar is an attempt to organize the world’s information on materials science, connecting together topics of study, synthesis and characterization methods, and specific materials compositions • It is also an effort to use state-of-the-art natural language processing to make collective use of the information in millions of articles
  • 9. One of our main projects concerns named entity recognition, or automatically labeling text
  • 10. > 4 million papers collected; 31 million properties; 19 million materials mentions; 8.8 million characterization methods; 7.5 million applications; 5 million synthesis methods. Data collection: over 4 million full papers* collected from more than 2100 journals. * Entities only extracted from abstracts deemed relevant to inorganic materials science (~2M) so far.
  • 11. Now we can search! Live on www.matscholar.com
  • 13. Ok, so how does this work? High-level view. Weston, L. et al. Named Entity Recognition and Normalization Applied to Large-Scale Information Extraction from the Materials Science Literature. J. Chem. Inf. Model. (2019).
  • 14. Step 1 – data collection: Extracted 4 million abstracts of relevant scientific articles using various APIs from journal publishers. Some are more difficult than others to obtain. Abstract collection continues …
  • 16. Step 2 - tokenization
      • First split the text into sentences
          – Seems simple, but remember edge cases: “et al.” or “etc.” do not necessarily signify the end of a sentence despite the period
      • Then split the sentences into words
          – Tricky parts include detecting and normalizing chemical formulas, selective lowercasing (“Battery” vs “battery” or “BaS” vs “BAs”), homogenizing numbers, etc.
      • Done largely with ChemDataExtractor* with some custom improvements
          – We may move to a fully custom tokenizer soon
      * http://chemdataextractor.org
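The two tokenization steps can be sketched in plain Python to show why the edge cases matter. This is an illustrative sketch only and does not use or reproduce ChemDataExtractor: the abbreviation list, the `<num>` placeholder, and the formula heuristic (two or more element-like symbols, without checking them against the periodic table) are all assumptions made for this example.

```python
import re

# Hypothetical abbreviation list; a real tokenizer would need a longer one.
ABBREVIATIONS = ("et al.", "etc.", "e.g.", "i.e.", "Fig.")

# Heuristic: two or more element-like symbols, each with an optional amount.
FORMULA = re.compile(r"^(?:[A-Z][a-z]?(?:\d+(?:\.\d+)?)?){2,}$")
NUMBER = re.compile(r"^-?\d+(?:\.\d+)?$")


def split_sentences(text):
    """Split on a period followed by whitespace and a capital letter,
    unless the period belongs to a known abbreviation."""
    sentences, start = [], 0
    for match in re.finditer(r"\.\s+(?=[A-Z])", text):
        candidate = text[start:match.start() + 1]
        if candidate.endswith(ABBREVIATIONS):
            continue  # the period is part of an abbreviation, keep going
        sentences.append(candidate.strip())
        start = match.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


def tokenize(sentence):
    """Lowercase ordinary words, keep formula-like tokens, homogenize numbers."""
    tokens = []
    for raw in sentence.split():
        word = raw.strip(".,;:()")
        if not word:
            continue
        if NUMBER.match(word):
            tokens.append("<num>")
        elif FORMULA.match(word):
            tokens.append(word)  # "BaS" and "BAs" must stay distinct
        else:
            tokens.append(word.lower())
    return tokens


text = "Data from Jain et al. Fig. 2 show a gap of 4.5 eV. The Battery was cycled."
for sentence in split_sentences(text):
    print(sentence)
print(tokenize("The Battery used BaS, not BAs."))
```

The heuristic also keeps all-caps acronyms such as “LSTM” unchanged and lowercases units such as “eV”, which is one reason a production tokenizer needs more careful rules than this sketch.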
  • 18. Step 3 – hand label abstracts
      • Part A is marking abstracts as relevant / non-relevant to inorganic materials science
      • Part B is tediously labeling ~600 abstracts
          – Largely done by one person
          – Spot-check of 25 abstracts by a second person gave 87.4% agreement
  • 20. Step 4a: the word2vec algorithm is used to “featurize” words
      • We use the word2vec algorithm (Google) to turn each unique word in our corpus into a 200-dimensional vector
      • These vectors encode the meaning of each word, learned by trying to predict the context words around a target word
      “You shall know a word by the company it keeps” - John Rupert Firth (1957)
      Barazza, L. How does Word2Vec’s Skip-Gram work? Becominghuman.ai. 2017
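The “predict context words around the target” idea corresponds to the skip-gram training setup cited on the slide. The sketch below only generates the (target, context) training pairs; it does not train any vectors. The function name `skipgram_pairs` and the window size are assumptions chosen for illustration.

```python
def skipgram_pairs(tokens, window=2):
    """Return (target, context) pairs for every token and each neighbor
    within `window` positions on either side."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs


tokens = ["the", "band", "gap", "of", "GaSb"]
for target, context in skipgram_pairs(tokens, window=1):
    print(target, "->", context)
```

A skip-gram model is then trained so that the vector of each target word is good at predicting its context words, which is why words that appear in similar contexts end up with similar vectors.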
  • 22. Word embeddings trained on “normal” text learn relationships between words. The classic example is: “king” - “man” + “woman” = ? → “queen”
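The analogy arithmetic can be demonstrated with hand-made vectors. The three-dimensional values below are invented purely so the arithmetic is easy to follow; they are not trained embeddings, and `analogy` is a hypothetical helper name.

```python
import math

# Toy vectors with made-up values, for illustration only.
TOY_VECTORS = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man": [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "apple": [0.0, 0.2, 0.2],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def analogy(a, b, c, vectors):
    """Return the word whose vector is closest to vectors[a] - vectors[b] + vectors[c],
    excluding the three input words."""
    query = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = (w for w in vectors if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(vectors[w], query))


print(analogy("king", "man", "woman", TOY_VECTORS))
```

With real embeddings the same lookup runs over the whole vocabulary, and the input words are excluded because they are usually the nearest neighbors of the query vector.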
  • 24. Step 4b: How do we train a model to recognize context? If you read the sentence “The band gap of ___ is 4.5 eV”, it is clear that the blank should be filled in with a material word (not a synthesis method, characterization method, etc.). How do we get a neural network to take into account context (as well as properties of the word itself)?
  • 25. Step 4b: An LSTM neural net classifies words by reading word sequences. Weston, L. et al. Named Entity Recognition and Normalization Applied to Large-Scale Information Extraction from the Materials Science Literature. J. Chem. Inf. Model. (2019).
  • 27. Step 5. Sit back and let the model label things for you! Named Entity Recognition
      • Custom machine learning models to extract the most valuable materials-related information.
      • Utilizes a long short-term memory (LSTM) network trained on ~1000 hand-annotated abstracts.
      • f1 scores of ~0.9. f1 score for inorganic materials extraction is >0.9.
      Weston, L. et al. Named Entity Recognition and Normalization Applied to Large-Scale Information Extraction from the Materials Science Literature. J. Chem. Inf. Model. (2019).
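For readers unfamiliar with the f1 metric quoted above, it is the harmonic mean of precision and recall. The sketch below computes it per label at the token level. The label names (`MAT`, `PRO`, `O`) and the example sequences are hypothetical, and the exact scoring scheme used in the cited paper is not stated on the slide, so this should be read only as the generic definition.

```python
def f1_for_label(gold, predicted, label):
    """Token-level F1 for one label, given parallel gold and predicted label lists."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == label and p == label)
    fp = sum(1 for g, p in pairs if g != label and p == label)
    fn = sum(1 for g, p in pairs if g == label and p != label)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up labels for a seven-token sentence (illustration only).
gold = ["O", "MAT", "O", "PRO", "PRO", "O", "MAT"]
predicted = ["O", "MAT", "O", "PRO", "O", "O", "O"]
print(f1_for_label(gold, predicted, "MAT"))
print(f1_for_label(gold, predicted, "PRO"))
```

In this toy case every predicted `MAT` token is correct (precision 1.0) but only half of the true `MAT` tokens are found (recall 0.5), so the score is pulled well below the precision.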
  • 29. Could these techniques also be used to predict which materials we might want to screen for an application? [Figure labels: papers to read “someday”; NLP algorithms]
  • 30. Remember that word embeddings seem to learn relationships in text. The classic example is: “king” - “man” + “woman” = ? → “queen”
  • 31. For scientific text, the embeddings learn scientific concepts as well. [Figure label: crystal structures of the elements] Tshitoyan, V. et al. Unsupervised word embeddings capture latent knowledge from materials science literature. Nature 571, 95–98 (2019).
  • 32. There seems to be materials knowledge encoded in the word vectors. Tshitoyan, V. et al. Unsupervised word embeddings capture latent knowledge from materials science literature. Nature 571, 95–98 (2019).
  • 33. Note that more data is not always better! We want relevance. Tshitoyan, V. et al. Unsupervised word embeddings capture latent knowledge from materials science literature. Nature 571, 95–98 (2019).
  • 34. Word embeddings also have the periodic table encoded in them with no prior knowledge. [Figure label: “word embedding” periodic table]
  • 35. Making predictions: dot products measure likelihood for words to co-occur
      • The dot product of a composition word with the word “thermoelectric” essentially predicts how likely that composition is to appear in an abstract with the word “thermoelectric”
      • Compositions with high dot products are typically known thermoelectrics
      • Sometimes, compositions have a high dot product with “thermoelectric” but have never been studied as a thermoelectric
      • These compositions usually have high computed power factors! (DFT+BoltzTraP)
      Tshitoyan, V. et al. Unsupervised word embeddings capture latent knowledge from materials science literature. Nature 571, 95–98 (2019).
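The ranking step described on this slide can be sketched as follows. The vectors are arbitrary made-up numbers and the compound names are placeholders, so nothing here reflects real embeddings or real predictions; `rank_candidates` is a hypothetical helper name.

```python
# Toy 3-dimensional vectors with arbitrary values, for illustration only.
VECTORS = {
    "thermoelectric": [1.0, 0.2, 0.0],
    "compound_A": [0.9, 0.1, 0.1],
    "compound_B": [0.6, 0.3, 0.2],
    "compound_C": [0.1, 0.9, 0.4],
}


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def rank_candidates(keyword, compositions, vectors, already_studied=()):
    """Rank compositions by dot product with the keyword vector,
    skipping compositions already studied for that application."""
    target = vectors[keyword]
    scored = [(name, dot(vectors[name], target))
              for name in compositions
              if name not in already_studied]
    return sorted(scored, key=lambda item: item[1], reverse=True)


ranking = rank_candidates(
    "thermoelectric",
    ["compound_A", "compound_B", "compound_C"],
    VECTORS,
    already_studied={"compound_A"},  # placeholder for "known thermoelectric"
)
for name, score in ranking:
    print(name, round(score, 2))
```

Filtering out the already-studied compositions is what turns a similarity ranking into a list of new candidates, matching the third bullet on the slide.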
  • 36. Try “going back in time” and ranking materials, and follow what happens in later years. Tshitoyan, V. et al. Unsupervised word embeddings capture latent knowledge from materials science literature. Nature 571, 95–98 (2019).
  • 37. A more comprehensive “back in time” test
      – For every year since 2001, see which compounds we would have predicted using only literature data until that point in time
      – Make predictions of which materials are the most promising thermoelectrics using data until that year
      – See if those materials were actually studied as thermoelectrics in subsequent years
      Tshitoyan, V. et al. Unsupervised word embeddings capture latent knowledge from materials science literature. Nature 571, 95–98 (2019).
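The last step of the test above, checking whether predictions were studied later, can be expressed as a simple hit rate. This is a sketch under stated assumptions: `hit_rate` is a hypothetical helper, the compound names and years are placeholders rather than results from the paper, and the predictions are assumed to have been generated from embeddings trained only on abstracts published up to the cutoff year.

```python
def hit_rate(predictions, first_study_year, cutoff_year):
    """Fraction of predicted materials that were first studied for the
    application after cutoff_year.

    predictions: materials ranked highly using data up to cutoff_year.
    first_study_year: {material: year of first report for the application};
        materials never reported are simply absent.
    """
    if not predictions:
        return 0.0
    hits = [m for m in predictions
            if m in first_study_year and first_study_year[m] > cutoff_year]
    return len(hits) / len(predictions)


# Placeholder data, not taken from the paper.
predictions_2003 = ["compound_A", "compound_B", "compound_C", "compound_D"]
first_study_year = {"compound_A": 2005, "compound_C": 2010}
print(hit_rate(predictions_2003, first_study_year, cutoff_year=2003))
```

Repeating this for each cutoff year and comparing against the hit rate of randomly chosen materials would indicate whether the ranking carries predictive signal.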
  • 38. How about “forward” predictions?
      • Thus far, 2 of our top 20 predictions made in ~August 2018 have already been reported in the literature for the first time as thermoelectrics
          – Li3Sb was the subject of a computational study (predicted zT=2.42) in Oct 2018
          – SnTe2 was experimentally found to be a moderately good thermoelectric (expt zT=0.71) in Dec 2018
      • We are working with an experimentalist on one of the predictions (but as a “spare time” project)
      [1] Yang et al. "Low lattice thermal conductivity and excellent thermoelectric behavior in Li3Sb and Li3Bi." Journal of Physics: Condensed Matter 30.42 (2018): 425401
      [2] Wang et al. "Ultralow lattice thermal conductivity and electronic properties of monolayer 1T phase semimetal SiTe2 and SnTe2." Physica E: Low-dimensional Systems and Nanostructures 108 (2019): 53-59
  • 39. How is this working? “Context words” link together information from different sources
  • 40. Outline: ① Natural language processing - where are we right now? ② What’s next for the NLP work?
  • 41. Making predictions for entirely new compositions
      • Currently, we only have word vectors for compositions that explicitly appear in abstracts
      • We can rank known materials for an application, but for materials with zero or little mention in the scientific literature, we are stuck!
      • How do we get word embeddings for compositions that do not exist in the text?
  • 43. Initial results – predicting experimental band gap from composition (~3000 data points)
  • 44. Going beyond entity recognition towards relationship extraction
  • 45. Current approach is not good enough
  • 46. Once the accuracy improves, we can start to make much more powerful searches
      • E.g., automatically generate databases from the literature
          – Materials and their numerical band gaps (or thermal conductivities, or bulk moduli, or superconducting temperatures, etc.)
          – Whether materials can be made n-type, p-type, or both
          – Which synthesis techniques led to various sample descriptors
      • Will likely require more powerful techniques, e.g., attention-based algorithms (BERT, Google XLNet …)
          – To be investigated …
  • 47. D2S2 - data driven synthesis science (just starting). Can we combine natural language processing with theory and experiments to control synthesis?
  • 48. ... and also some fun things, like automatic title generation
      Title auto-generated from abstract | Published title
      Dynamics of molecular hydrogen confined in narrow nanopores | Restricted dynamics of molecular hydrogen confined in activated carbon nanopores
      Microfluidic Generation of Polydisperse Solid Foams | Generation of Solid Foams with Controlled Polydispersity Using Microfluidics
      Minimum variance unbiased estimator of product performance | Assessing the lifetime performance index of gamma lifetime products in the manufacturing industry
      Angle resolved ultraviolet photoemission study of fluorescein films on Ag 110 | The growth of thin fluorescein films on Ag 110
  • 49. Acknowledgements
      • High-throughput DFT
          – Gerbrand Ceder and “BURP” team
          – Funding: Bosch / Umicore
      • Natural language processing
          – Gerbrand Ceder, Kristin Persson, and “Matscholar” team
          – Funding: Toyota Research Institutes
      • Overall work funded by US Department of Energy
      Slides (already) posted to hackingmaterials.lbl.gov
  • 50. The Matscholar team: Kristin Persson, Anubhav Jain, Gerbrand Ceder, John Dagdelen, Leigh Weston, Vahe Tshitoyan, Amalie Trewartha, Alex Dunn, Viktoriia Baibakova. [Slide annotations: “Funding from”; “(now at Google)”; “(now at Medium)”]