This document discusses using natural language processing and machine learning techniques to extract and analyze synthesis recipes from materials science literature. It presents work using sequence-to-sequence models to extract entities and relationships for the synthesis of gold nanorods and bismuth ferrite from research papers. Decision trees trained on the extracted data are able to reproduce conclusions about the effects of synthesis parameters from literature. However, applying these techniques to predictive synthesis still faces challenges regarding reproducibility, missing information, and lack of negative examples in literature datasets.
1. Natural Language Processing for Data Extraction
and Synthesizability Prediction from the Energy
Materials Literature
Anubhav Jain
Lawrence Berkeley National Laboratory
MRS Fall meeting, Nov 2022
Slides (already) posted to hackingmaterials.lbl.gov
2. Literature data can be a key source of materials learning
[Figure: three lab workflows, each cycling Plan → Synthesize → Characterize → Analyze; Automated Labs A and C feed a local database + ML, shown alongside a Conventional Lab B]
Literature data
+ broad coverage
– difficult to parse
– lacks negative examples
– reproducibility concerns
Other A-lab data
+ structured data formats
+ negative examples
– not much out there …
Theory data
+ readily available
– difficult to establish relevance to synthesis
– computation time
3. Several research groups are now attempting to
collect data sets from the research literature
Weston, L. et al. Named Entity Recognition and Normalization Applied to Large-Scale Information Extraction from the Materials Science Literature. J. Chem. Inf. Model. (2019)
Recently, we also tried BERT variants
Trewartha, A.; Walker, N.; Huo, H.; Lee, S.; Cruse, K.; Dagdelen, J.; Dunn, A.; Persson, K. A.; Ceder, G.; Jain, A. Quantifying the Advantage of Domain-Specific Pre-Training on Named Entity Recognition Tasks in Materials Science. Patterns 2022, 3 (4), 100488.
4. Models were good for labeling entities, but
didn’t understand relationships
Named Entity Recognition
• Custom machine learning models to
extract the most valuable materials-related
information.
• Utilizes a long short-term memory (LSTM)
network trained on ~1000 hand-annotated
abstracts.
Trewartha, A. et al. Patterns 2022, 3 (4), 100488.
Relationships have usually been extracted via manual or semi-automated regular-expression construction combined with grammar-tree analysis (e.g., ChemDataExtractor), which can be tedious!
5. Outline
• Using sequence-to-sequence models for combined entity
detection and relationship extraction
• Analyzing synthesis of Au nanorods using literature data
• Analyzing synthesis of phase-pure BiFeO3 using literature data
6. A Sequence-to-Sequence Approach
• Language model takes a sequence of tokens
as input and outputs a sequence of tokens
• Maximizes the likelihood of the output
conditioned on the input
• Additionally includes task conditioning, which can
learn the desired format for outputs
• We’ve now done many explorations with OpenAI’s GPT-3, which has 175 billion parameters
• Interact with the model through their (paid) API, although costs are relatively modest
• Capacity for “understanding” language as well
as “world knowledge”
7. How a sequence-to-sequence approach works
Text in (“prompt”) → Seq2Seq model (GPT-3) → Text out (“completion”)
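The prompt/completion pattern above can be sketched in code. This is a minimal illustration, not the group's actual pipeline: the template wording, the JSON field names (`precursors`, `temperature_C`, `target`), and the mock completion are all assumptions, and a real system would send the prompt to a hosted LLM API instead of hard-coding the response.

```python
import json

# Hypothetical prompt template: the instruction ("task conditioning")
# teaches the model the desired structured output format.
PROMPT_TEMPLATE = (
    "Extract the synthesis recipe from the paragraph below as JSON "
    "with keys 'precursors', 'temperature_C', and 'target'.\n\n"
    "Paragraph: {paragraph}\n\nJSON:"
)

def build_prompt(paragraph: str) -> str:
    """Build the task-conditioned prompt for one paragraph."""
    return PROMPT_TEMPLATE.format(paragraph=paragraph)

def parse_completion(completion: str) -> dict:
    """The completion is expected to be the JSON continuation of the prompt."""
    return json.loads(completion)

prompt = build_prompt("BiFeO3 was prepared from Bi2O3 and Fe2O3 at 800 C.")
# A real call would send `prompt` to the model; here we parse a mock completion.
recipe = parse_completion(
    '{"precursors": ["Bi2O3", "Fe2O3"], "temperature_C": 800, "target": "BiFeO3"}'
)
print(recipe["target"])  # BiFeO3
```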
10. But it’s not perfect for technical data
Text in (“prompt”) → Seq2Seq model (GPT-3) → Text out (“completion”)
11. A workflow for fine-tuning GPT-3
1. Initial training set of templates filled mostly manually, as zero-shot GPT is often poor for technical tasks
2. Fine-tune the model to fill templates, then use the model to assist in annotation
3. Repeat as necessary until the desired inference accuracy is achieved
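The three-step loop can be sketched as follows. The helper names (`human_annotate`, `fine_tune`, `accuracy`) and the stopping threshold are stand-ins for illustration: in practice the annotation step uses human correction of model suggestions and the fine-tuning step calls the GPT-3 fine-tuning API.

```python
# Minimal sketch of the iterative fine-tune/annotate loop (assumed names).

def human_annotate(paragraphs, model=None):
    # Steps 1-2: fill templates manually at first; once a model exists,
    # it proposes fills that a human would correct.
    return [model(p) if model else {"filled": "manually"} for p in paragraphs]

def fine_tune(examples):
    # Stand-in for fine-tuning: returns a "model" that fills any paragraph.
    def model(paragraph):
        return {"filled": "by model", "source": paragraph[:20]}
    return model

def accuracy(model, held_out):
    return 1.0  # placeholder metric; a real pipeline scores held-out annotations

paragraphs = ["Paragraph one ...", "Paragraph two ..."]
model = None
for round_ in range(3):  # Step 3: repeat until accuracy is acceptable
    examples = human_annotate(paragraphs, model)
    model = fine_tune(examples)
    if accuracy(model, paragraphs) >= 0.95:
        break
```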
12. This procedure can extract complex,
hierarchical relationships between entities
13. Outline
• Using sequence-to-sequence models for combined entity
detection and relationship extraction
• Analyzing synthesis of Au nanorods using literature data
• Analyzing synthesis of phase-pure BiFeO3 using literature data
14. Templated extraction of synthesis recipes
• Annotate paragraphs to output structured recipe templates (JSON format)
• Templates designed using domain knowledge from experimentalists
• Template is a relation graph to be filled in by the model
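An illustrative recipe template is shown below. The field names and nesting are assumptions for the sake of example, not the group's actual schema; the model's job is to fill the empty slots from a paragraph of text.

```python
import json

# Hypothetical JSON recipe template for AuNR synthesis (assumed field names).
# Empty/None slots are the relation graph the model fills in.
recipe_template = {
    "target": None,                 # e.g. "Au nanorods"
    "seed_solution": {
        "precursors": [],           # chemicals with amounts/concentrations
        "capping_agent": None,      # e.g. "CTAB" or "citrate"
    },
    "growth_solution": {
        "precursors": [],
        "additives": [],            # e.g. AgNO3, HCl
    },
    "conditions": {"temperature_C": None, "time_min": None},
}

print(json.dumps(recipe_template, indent=2))
```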
15. Example Extraction for Au nanorod synthesis
Note: we are still formally evaluating performance; various issues complicate an accurate evaluation, e.g., predictions that are functionally correct but written differently
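One simple way to reduce "functionally correct but written differently" mismatches is to normalize strings before exact comparison. This is a sketch, not the group's actual evaluation code; note it only handles surface variants, and chemical synonyms (e.g., "HAuCl4" vs. "gold(III) chloride") would still need a lookup table.

```python
def normalize(value: str) -> str:
    """Lowercase, collapse whitespace, and strip a trailing period
    so trivially different spellings compare equal."""
    return " ".join(value.lower().split()).rstrip(".")

print(normalize("CTAB ") == normalize("ctab"))        # surface variants now match
print(normalize("HAuCl4") == normalize("gold salt"))  # synonyms still mismatch
```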
16. Analyzing AuNR synthesis data set
Note that this data set was collected manually via hand-tuned regular expressions, not NLP or GPT-3, as it was assembled in parallel to that work. We are currently examining the pros and cons of the manual approach vs. the GPT-3 approach.
Representing recipes as precursor vectors for machine learning
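The precursor-vector representation can be sketched as follows: each recipe becomes a fixed-length vector of precursor amounts over a fixed chemical vocabulary, with zeros for absent chemicals. The vocabulary, units, and amounts here are illustrative assumptions.

```python
# Fixed precursor vocabulary (illustrative; chemicals drawn from AuNR recipes).
PRECURSORS = ["HAuCl4", "CTAB", "AgNO3", "AA", "HCl"]

def to_vector(recipe: dict) -> list:
    """Map a recipe (chemical -> amount) to a fixed-length feature vector,
    using 0.0 for any precursor the recipe does not mention."""
    return [recipe.get(chem, 0.0) for chem in PRECURSORS]

recipe = {"HAuCl4": 0.5, "CTAB": 100.0, "AgNO3": 0.06}
print(to_vector(recipe))  # [0.5, 100.0, 0.06, 0.0, 0.0]
```

Fixing the vocabulary up front is what makes vectors from different recipes comparable, at the cost of dropping any chemical outside the vocabulary.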
17. Training a decision tree to predict AuNR
shape shows similar conclusions as literature
[Decision tree figure: root split on seed capping agent; leaves classify products as Rod, Cube, Bipyramid, or Star]
• Decision tree shows seed capping
agent type as first decision
boundary for shape determination
• “Citrate-capped gold seeds form
penta-twinned structure, while
CTAB-capped seeds are single
crystalline, hence former leads to
bipyramids and latter leads to
rods”1,2
1 Liu and Guyot-Sionnest, J. Phys. Chem. B 2005, 109 (47), 22192–22200
2 Grzelczak et al., Chem. Soc. Rev. 2008, 37, 1783–1791
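The root split described above can be reproduced with a minimal depth-1 decision tree (a "stump") in pure Python. The data here are synthetic and the stump trainer is a toy stand-in for the full decision tree, which was trained on many more features.

```python
from collections import Counter

def fit_stump(X, y):
    """Fit a depth-1 tree on a single binary feature: the prediction on
    each side of the split is simply the majority label on that side."""
    left = Counter(label for x, label in zip(X, y) if x == 0)
    right = Counter(label for x, label in zip(X, y) if x == 1)
    return {0: left.most_common(1)[0][0], 1: right.most_common(1)[0][0]}

# Synthetic AuNR data; feature: seed capping agent, 0 = CTAB, 1 = citrate.
X = [0, 0, 0, 1, 1, 1]
y = ["rod", "rod", "cube", "bipyramid", "bipyramid", "star"]

stump = fit_stump(X, y)
print(stump)  # {0: 'rod', 1: 'bipyramid'} -- CTAB seeds -> rods, citrate -> bipyramids
```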
18. We also see some effect of AgNO3
concentration on AuNR size, but data is noisy
N. D. Burrows et al., Langmuir 2017 33 (8), 1891-1907
[Plots of AuNR size vs. AgNO3 concentration under three growth conditions: HAuCl4 + CTAB + AA + AgNO3; the same with a HAuCl4/CTAB < 0.01 filter; and HAuCl4 + CTAB + AA + AgNO3 + HCl]
19. Overall thoughts on AuNR data set
• The seq2seq method is showing good capabilities in terms of
extracting complex nanorod synthesis data
• We are going to start integrating this into our own pipeline to replace
manual regex for relationship extraction
• Performing machine learning for hypothesis generation on AuNR shape and size is messy
• Data sets are messy, and not particularly large
• Nevertheless, it is encouraging that conclusions from the
literature can be automatically found by machine learning
20. Outline
• Using sequence-to-sequence models for combined entity
detection and relationship extraction
• Analyzing synthesis of Au nanorods using literature data
• Analyzing synthesis of phase-pure BiFeO3 using literature
data
21. Seq2Seq approach for solid state synthesis
Initial tests of the seq2seq method on solid-state synthesis have shown encouraging results, but need further testing
22. For now, we use manual data extraction to
tackle the problem of BiFeO3 synthesis
340 total synthesis recipes (from 178 articles); 57 features per recipe
23. Machine learning (decision tree) predictions
are in-line with common knowledge
24. Missing synthesis information – can it be
recovered / reproduced easily?
[Chart: recipes categorized as Reproducible, Partially reproducible, or Could not reproduce]
26. Conclusions
• As language models grow larger and more capable, they are able to parse increasingly complex scientific text into structured formats
• Applying NLP + ML on synthesis data sets shows that scientific heuristics can be
automatically uncovered, which is promising
• Nevertheless, issues remain in applying NLP to predictive synthesis
• Reproducibility / missing information / conflicting information
• General lack of negative examples
• Unknown data quality
• Thus, results from such techniques will likely need to be treated as initial
hypotheses to be complemented by further experiments
27. Acknowledgements
NLP (seq2seq)
• Alex Dunn
• John Dagdelen
• Nick Walker
• Sanghoon Lee
• Amalie Trewartha
Funding provided by:
• U.S. Department of Energy, Basic Energy Science, “D2S2” program
• Toyota Research Institutes, Accelerated Materials Design program
AuNR analysis
• Sanghoon Lee
• Sam Gleason
• Kevin Cruse
BiFeO3 analysis
• Kevin Cruse
• Viktoriia Baibakova
• Maged Abdelsamie
• Kootak Hong
• Carolin Sutter-Fella
• Gerbrand Ceder