Natural Language Processing for Data Extraction
and Synthesizability Prediction from the Energy
Materials Literature
Anubhav Jain
Lawrence Berkeley National Laboratory
MRS Fall meeting, Nov 2022
Slides (already) posted to hackingmaterials.lbl.gov
Literature data can be a key source of materials learning
[Figure: plan → synthesize → characterize → analyze loops for Automated Lab A (local db + ML), Conventional Lab B, and Automated Lab C (local db + ML)]
Literature data
+ broad coverage
– difficult to parse
– lack negative examples
– reproducibility
Other A-lab data
+ structured data formats
+ negative examples
– not much out there …
Theory data
+ readily available
– difficult to establish relevance to synthesis
– computation time
Several research groups are now attempting to
collect data sets from the research literature
Weston, L. et al. Named Entity Recognition
and Normalization Applied to Large-Scale
Information Extraction from the Materials
Science Literature. J. Chem. Inf. Model.
(2019).
Recently, we also tried BERT variants
Trewartha, A.; Walker, N.; Huo, H.; Lee, S.;
Cruse, K.; Dagdelen, J.; Dunn, A.; Persson,
K. A.; Ceder, G.; Jain, A. Quantifying the
Advantage of Domain-Specific Pre-Training
on Named Entity Recognition Tasks in
Materials Science. Patterns 2022, 3 (4),
100488.
Models were good for labeling entities, but
didn’t understand relationships
Named Entity Recognition
• Custom machine learning models to
extract the most valuable materials-related
information.
• Utilizes a long short-term memory (LSTM)
network trained on ~1000 hand-annotated
abstracts.
Trewartha, A.; Walker, N.; Huo, H.; Lee, S.;
Cruse, K.; Dagdelen, J.; Dunn, A.; Persson,
K. A.; Ceder, G.; Jain, A. Quantifying the
Advantage of Domain-Specific Pre-Training
on Named Entity Recognition Tasks in
Materials Science. Patterns 2022, 3 (4),
100488.
Relationships have usually been extracted
via either manual or semi-automated
regular expression construction along
with grammar tree analysis, e.g.
ChemDataExtractor – can be tedious!
Outline
• Using sequence-to-sequence models for combined entity
detection and relationship extraction
• Analyzing synthesis of Au nanorods using literature data
• Analyzing synthesis of phase-pure BiFeO3 using literature data
A Sequence-to-Sequence Approach
• A language model takes a sequence of tokens as input and outputs a sequence of tokens
• It maximizes the likelihood of the output conditioned on the input
• It additionally includes task conditioning, through which the model can learn the desired format for outputs
• We’ve done many explorations now with OpenAI’s GPT-3, which has 175 billion parameters
• We interact with the model through their (paid) API, although costs are relatively modest
• Capacity for “understanding” language as well as “world knowledge”
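The likelihood objective mentioned in the bullets above can be written out explicitly. This is the standard autoregressive formulation rather than anything shown on the slide; the symbols x (prompt tokens), y (completion tokens), c (task-conditioning text), D (training set) and θ (model parameters) are notation introduced here for illustration.

```latex
\hat{\theta} \;=\; \arg\max_{\theta} \sum_{(c,\,x,\,y)\,\in\,\mathcal{D}} \; \sum_{t=1}^{|y|} \log p_{\theta}\!\left(y_t \mid y_{<t},\, x,\, c\right)
```

Each output token y_t is scored given the tokens before it, the input text, and the task description, so the desired output format is learned through the same objective as the content.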
How a sequence-to-sequence approach works
[Figure: text in (“prompt”) → Seq2Seq model (GPT-3) → text out (“completion”)]
Another example
[Figure: same prompt → Seq2Seq model (GPT-3) → completion layout]
Structured data
[Figure: same prompt → Seq2Seq model (GPT-3) → completion layout]
But it’s not perfect for technical data
[Figure: same prompt → Seq2Seq model (GPT-3) → completion layout]
A workflow for fine-tuning GPT-3
1. Initial training set of templates filled mostly manually, as zero-shot GPT-3 is often poor for technical tasks
2. Fine-tune model to fill templates, use the model to assist in annotation
3. Repeat as necessary until desired inference accuracy is achieved
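The three steps above can be sketched as a human-in-the-loop annotation loop. This is a minimal sketch, not the pipeline used in the talk: `fine_tune` is a stand-in that simply memorises its training pairs (a real version would call a fine-tuning service), and `annotation_loop`, `human_correct`, `accuracy` and the "paragraph i" / "template i" strings are all hypothetical names and placeholder data.

```python
from typing import Callable, Dict, List, Tuple

Example = Tuple[str, str]          # (paragraph text, filled template as text)
Model = Callable[[str], str]       # maps a paragraph to a filled template


def fine_tune(examples: List[Example]) -> Model:
    """Stand-in for a real fine-tuning call: memorises the training pairs."""
    lookup: Dict[str, str] = dict(examples)
    return lambda paragraph: lookup.get(paragraph, "{}")


def annotation_loop(
    seed: List[Example],
    unlabeled: List[str],
    correct: Callable[[str, str], str],
    evaluate: Callable[[Model], float],
    target_accuracy: float,
    batch_size: int = 2,
) -> Tuple[Model, List[Example]]:
    train = list(seed)                       # step 1: manually filled templates
    pool = list(unlabeled)
    model = fine_tune(train)                 # step 2: fine-tune on what we have
    while evaluate(model) < target_accuracy and pool:
        batch, pool = pool[:batch_size], pool[batch_size:]
        for paragraph in batch:
            draft = model(paragraph)         # model pre-fills the template
            train.append((paragraph, correct(paragraph, draft)))  # annotator fixes it
        model = fine_tune(train)             # step 3: repeat with the larger set
    return model, train


# Placeholder data: five made-up paragraphs with made-up "gold" templates.
gold = {f"paragraph {i}": f"template {i}" for i in range(5)}


def human_correct(paragraph: str, draft: str) -> str:
    return gold[paragraph]                   # the annotator always returns the gold answer


def accuracy(model: Model) -> float:
    # For brevity this scores on the same examples; a real evaluation would use held-out data.
    return sum(model(p) == t for p, t in gold.items()) / len(gold)


model, train = annotation_loop(
    seed=[("paragraph 0", gold["paragraph 0"])],
    unlabeled=[f"paragraph {i}" for i in range(1, 5)],
    correct=human_correct,
    evaluate=accuracy,
    target_accuracy=1.0,
)
```

The point of the loop is that annotators correct model drafts instead of filling every template from scratch, so annotation effort per example should fall as the model improves.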
This procedure can extract complex,
hierarchical relationships between entities
Outline
• Using sequence-to-sequence models for combined entity
detection and relationship extraction
• Analyzing synthesis of Au nanorods using literature data
• Analyzing synthesis of phase-pure BiFeO3 using literature data
Templated extraction of synthesis recipes
• Annotate paragraphs to output structured recipe templates
• JSON format
• Designed using domain knowledge from experimentalists
• The template is a relation graph to be filled in by the model
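As a rough illustration of what a JSON recipe template and a check on the model's output might look like, here is a minimal sketch. The field names in `RECIPE_TEMPLATE` are hypothetical and are not the schema used in the talk; `same_shape` and `parse_completion` are helper names invented for this example.

```python
import json

# Hypothetical template: field names are illustrative, not the talk's schema.
RECIPE_TEMPLATE = {
    "target": {"material": "", "morphology": ""},
    "precursors": [{"name": "", "amount": "", "role": ""}],
    "steps": [{"action": "", "temperature": "", "duration": ""}],
}


def same_shape(template, filled) -> bool:
    """Return True if `filled` has the same nested keys as `template`."""
    if isinstance(template, dict):
        return (
            isinstance(filled, dict)
            and set(filled) == set(template)
            and all(same_shape(template[key], filled[key]) for key in template)
        )
    if isinstance(template, list):
        # A template list holds one prototype item; every filled item must match it.
        return isinstance(filled, list) and all(
            same_shape(template[0], item) for item in filled
        )
    return isinstance(filled, str)


def parse_completion(completion: str):
    """Parse model output as JSON; accept it only if it fits the template."""
    try:
        data = json.loads(completion)
    except json.JSONDecodeError:
        return None
    return data if same_shape(RECIPE_TEMPLATE, data) else None


# Made-up completions for demonstration only.
good = json.dumps({
    "target": {"material": "Au", "morphology": "nanorod"},
    "precursors": [
        {"name": "HAuCl4", "amount": "", "role": ""},
        {"name": "CTAB", "amount": "", "role": ""},
    ],
    "steps": [],
})
bad = '{"target": "Au nanorods"}'
```

Here `parse_completion(good)` returns the parsed dictionary, while `parse_completion(bad)` and any non-JSON text return `None`, so malformed completions can be routed back to an annotator.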
Example Extraction for Au nanorod synthesis
Note: we are still formally evaluating performance. There are various
issues in getting an accurate evaluation, e.g., predictions
that are functionally correct but written differently
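One way to reduce the "functionally correct but written differently" problem is to normalise values before comparing them. The sketch below is an assumption about how such scoring could be done, not the evaluation used in the talk; `normalize` and `field_accuracy` are hypothetical names and the example values are made up.

```python
import re
import unicodedata


def normalize(value: str) -> str:
    """Canonicalise a field value so trivially different spellings compare equal."""
    text = unicodedata.normalize("NFKC", value)  # e.g. subscript digits -> plain digits
    text = text.casefold()                       # lossy for formulas: "Co" and "CO" collide
    return re.sub(r"\s+", "", text)              # ignore all whitespace


def field_accuracy(predicted: dict, gold: dict) -> float:
    """Fraction of gold fields whose normalised prediction matches."""
    if not gold:
        return 1.0
    hits = sum(
        normalize(predicted.get(key, "")) == normalize(value)
        for key, value in gold.items()
    )
    return hits / len(gold)


# Made-up example: two fields differ only in notation, one differs in content.
gold = {"precursor": "HAuCl4", "temperature": "30 °C", "shape": "nanorod"}
predicted = {"precursor": "HAuCl₄", "temperature": "30°C", "shape": "rod"}
score = field_accuracy(predicted, gold)
```

Exact string matching would score this prediction 0 out of 3; after normalisation it scores 2 out of 3. Deciding whether "rod" should count as "nanorod" still needs domain judgment, which simple normalisation does not provide.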
Analyzing AuNR synthesis data set
Note that this data set was collected manually via hand-tuned
regular expressions, not NLP or GPT-3, as it was done in parallel
to that work.
We are currently looking at the pros and cons of the manual approach vs.
the GPT-3 approach.
Representing recipes as precursor vectors for machine learning
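A minimal sketch of one possible precursor-vector encoding follows. The slide does not say how the vectors are defined, so this assumes one fixed-order vector per recipe with one entry per precursor; the entries here are presence flags (1.0 or 0.0), and concentrations could be used in their place. `build_vocabulary` and `to_vector` are hypothetical names, and the two recipes are illustrative.

```python
from typing import Dict, List


def build_vocabulary(recipes: List[Dict[str, float]]) -> List[str]:
    """Sorted list of every precursor name seen in any recipe."""
    return sorted({name for recipe in recipes for name in recipe})


def to_vector(recipe: Dict[str, float], vocabulary: List[str]) -> List[float]:
    """One entry per vocabulary item; absent precursors are encoded as 0.0."""
    return [float(recipe.get(name, 0.0)) for name in vocabulary]


# Illustrative recipes: 1.0 marks that a precursor is present.
recipes = [
    {"HAuCl4": 1.0, "CTAB": 1.0, "AA": 1.0, "AgNO3": 1.0},
    {"HAuCl4": 1.0, "CTAB": 1.0, "AA": 1.0, "AgNO3": 1.0, "HCl": 1.0},
]
vocabulary = build_vocabulary(recipes)
vectors = [to_vector(recipe, vocabulary) for recipe in recipes]
```

Because every recipe is mapped onto the same vocabulary, the resulting vectors all have the same length and can be passed directly to a standard classifier or regressor.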
Training a decision tree to predict AuNR
shape shows similar conclusions as literature
[Figure: decision tree for AuNR shape; leaf labels include Rod, Cube, Bipyramid, Star, and None]
• Decision tree shows seed capping agent type as first decision boundary for shape determination
• “Citrate-capped gold seeds form penta-twinned structure, while CTAB-capped seeds are single crystalline, hence former leads to bipyramids and latter leads to rods” [1, 2]
[1] Liu and Guyot-Sionnest, J. Phys. Chem. B, 2005, 109 (47), 22192-22200
[2] Grzelczak et al., Chem. Soc. Rev., 2008, 37, 1783-1791
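To make "first decision boundary" concrete, the sketch below shows how a decision tree picks its first split: it chooses the feature whose split leaves the lowest weighted Gini impurity. This is a plain-Python stand-in, not the tree implementation used in the talk, and it uses a multiway split on categorical values where many libraries use binary splits. The four rows are made up to mirror the quoted statement only; `silver_nitrate` is an invented second feature, and none of this is data from the study.

```python
from collections import Counter
from typing import Dict, List, Tuple

Row = Tuple[Dict[str, str], str]  # (categorical features, shape label)


def gini(labels: List[str]) -> float:
    """Gini impurity: 0.0 for a pure group, larger when classes are mixed."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def split_impurity(rows: List[Row], feature: str) -> float:
    """Size-weighted Gini impurity after grouping rows by one feature's value."""
    groups: Dict[str, List[str]] = {}
    for features, label in rows:
        groups.setdefault(features[feature], []).append(label)
    return sum(len(group) / len(rows) * gini(group) for group in groups.values())


def best_first_split(rows: List[Row]) -> str:
    """Feature whose split gives the lowest weighted impurity."""
    features = sorted(rows[0][0])
    return min(features, key=lambda feature: split_impurity(rows, feature))


# Made-up rows, constructed only to mirror the quoted literature statement.
rows: List[Row] = [
    ({"seed_capping_agent": "CTAB", "silver_nitrate": "yes"}, "rod"),
    ({"seed_capping_agent": "CTAB", "silver_nitrate": "no"}, "rod"),
    ({"seed_capping_agent": "citrate", "silver_nitrate": "yes"}, "bipyramid"),
    ({"seed_capping_agent": "citrate", "silver_nitrate": "no"}, "bipyramid"),
]
first_split = best_first_split(rows)
```

On these toy rows the capping-agent split separates the two shapes completely (impurity 0.0), while the invented second feature separates nothing (impurity 0.5), so the capping agent is chosen first.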
We also see some effect of AgNO3
concentration on AuNR size, but data is noisy
N. D. Burrows et al., Langmuir 2017 33 (8), 1891-1907
growth: HAuCl4, CTAB, AA, AgNO3
growth: HAuCl4, CTAB, AA, AgNO3 w/ HAuCl4/CTAB<0.01 filter
growth: HAuCl4, CTAB, AA, AgNO3 + HCl
Overall thoughts on AuNR data set
• The seq2seq method is showing good capabilities for extracting complex nanorod synthesis data
• We are going to start integrating this into our own pipeline to replace manual regex for relationship extraction
• Using machine learning for hypothesis generation on AuNR shape and size is messy
• Data sets are messy, and not particularly large
• Nevertheless, it is encouraging that conclusions from the literature can be automatically found by machine learning
Outline
• Using sequence-to-sequence models for combined entity
detection and relationship extraction
• Analyzing synthesis of Au nanorods using literature data
• Analyzing synthesis of phase-pure BiFeO3 using literature data
Seq2Seq approach for solid state synthesis
Initial tests of the seq2seq method on solid-state synthesis show encouraging results, but the method needs further testing
For now, we use manual data extraction to
tackle the problem of BiFeO3 synthesis
340 total synthesis recipes (from 178 articles); 57 features per recipe
Machine learning (decision tree) predictions
are in line with common knowledge
Missing synthesis information – can it be
recovered / reproduced easily?
Could not reproduce
Partially reproducible
Reproducible
Exploring unexplored portions of synthesis
space
These decision trees are interpretable, but are they physical?
Conclusions
• As large language models grow larger and more capable, they are able to parse increasingly complex scientific text into structured formats
• Applying NLP + ML on synthesis data sets shows that scientific heuristics can be automatically uncovered, which is promising
• Nevertheless, issues remain in applying NLP to predictive synthesis
  • Reproducibility / missing information / conflicting information
  • General lack of negative examples
  • Unknown data quality
• Thus, results from such techniques will likely need to be treated as initial hypotheses to be complemented by further experiments
Acknowledgements
NLP (seq2seq)
• Alex Dunn
• John Dagdelen
• Nick Walker
• Sanghoon Lee
• Amalie Trewartha
AuNR analysis
• Sanghoon Lee
• Sam Gleason
• Kevin Cruse
BiFeO3 analysis
• Kevin Cruse
• Viktoriia Baibakova
• Maged Abdelsamie
• Kootak Hong
• Carolin Sutter-Fella
• Gerbrand Ceder
Funding provided by:
• U.S. Department of Energy, Basic Energy Sciences, “D2S2” program
• Toyota Research Institute, Accelerated Materials Design program
Sol-gel synthesis of BiFeO3

Contenu connexe

Tendances

情報抽出入門 〜非構造化データを構造化させる技術〜
情報抽出入門 〜非構造化データを構造化させる技術〜情報抽出入門 〜非構造化データを構造化させる技術〜
情報抽出入門 〜非構造化データを構造化させる技術〜
Yuya Unno
 

Tendances (20)

合成経路探索 -論文まとめ- (PFN中郷孝祐)
合成経路探索 -論文まとめ-  (PFN中郷孝祐)合成経路探索 -論文まとめ-  (PFN中郷孝祐)
合成経路探索 -論文まとめ- (PFN中郷孝祐)
 
【DL輪読会】Dropout Reduces Underfitting
【DL輪読会】Dropout Reduces Underfitting【DL輪読会】Dropout Reduces Underfitting
【DL輪読会】Dropout Reduces Underfitting
 
PFP:材料探索のための汎用Neural Network Potential - 2021/10/4 QCMSR + DLAP共催
PFP:材料探索のための汎用Neural Network Potential - 2021/10/4 QCMSR + DLAP共催PFP:材料探索のための汎用Neural Network Potential - 2021/10/4 QCMSR + DLAP共催
PFP:材料探索のための汎用Neural Network Potential - 2021/10/4 QCMSR + DLAP共催
 
全力解説!Transformer
全力解説!Transformer全力解説!Transformer
全力解説!Transformer
 
WSDM2018 読み会 Latent cross making use of context in recurrent recommender syst...
WSDM2018 読み会 Latent cross making use of context in recurrent recommender syst...WSDM2018 読み会 Latent cross making use of context in recurrent recommender syst...
WSDM2018 読み会 Latent cross making use of context in recurrent recommender syst...
 
論文紹介: "MolGAN: An implicit generative model for small molecular graphs"
論文紹介: "MolGAN: An implicit generative model for small molecular graphs"論文紹介: "MolGAN: An implicit generative model for small molecular graphs"
論文紹介: "MolGAN: An implicit generative model for small molecular graphs"
 
【DL輪読会】AUTOGT: AUTOMATED GRAPH TRANSFORMER ARCHITECTURE SEARCH
【DL輪読会】AUTOGT: AUTOMATED GRAPH TRANSFORMER ARCHITECTURE SEARCH【DL輪読会】AUTOGT: AUTOMATED GRAPH TRANSFORMER ARCHITECTURE SEARCH
【DL輪読会】AUTOGT: AUTOMATED GRAPH TRANSFORMER ARCHITECTURE SEARCH
 
階層ディリクレ過程事前分布モデルによる画像領域分割
階層ディリクレ過程事前分布モデルによる画像領域分割階層ディリクレ過程事前分布モデルによる画像領域分割
階層ディリクレ過程事前分布モデルによる画像領域分割
 
汎用なNeural Network Potential「Matlantis」を使った新素材探索_浅野_JACI先端化学・材料技術部会 高選択性反応分科会主...
汎用なNeural Network Potential「Matlantis」を使った新素材探索_浅野_JACI先端化学・材料技術部会 高選択性反応分科会主...汎用なNeural Network Potential「Matlantis」を使った新素材探索_浅野_JACI先端化学・材料技術部会 高選択性反応分科会主...
汎用なNeural Network Potential「Matlantis」を使った新素材探索_浅野_JACI先端化学・材料技術部会 高選択性反応分科会主...
 
EfficientDet: Scalable and Efficient Object Detection
EfficientDet: Scalable and Efficient Object DetectionEfficientDet: Scalable and Efficient Object Detection
EfficientDet: Scalable and Efficient Object Detection
 
自由エネルギー原理から エナクティヴィズムへ
自由エネルギー原理から エナクティヴィズムへ自由エネルギー原理から エナクティヴィズムへ
自由エネルギー原理から エナクティヴィズムへ
 
統計モデリングで癌の5年生存率データから良い病院を探す
統計モデリングで癌の5年生存率データから良い病院を探す統計モデリングで癌の5年生存率データから良い病院を探す
統計モデリングで癌の5年生存率データから良い病院を探す
 
PFP:材料探索のための汎用Neural Network Potential_中郷_20220422POLセミナー
PFP:材料探索のための汎用Neural Network Potential_中郷_20220422POLセミナーPFP:材料探索のための汎用Neural Network Potential_中郷_20220422POLセミナー
PFP:材料探索のための汎用Neural Network Potential_中郷_20220422POLセミナー
 
Matlantisを活用した蓄電池材料研究_名古屋工業大学 中山氏_Matlantis User Conference
Matlantisを活用した蓄電池材料研究_名古屋工業大学 中山氏_Matlantis User ConferenceMatlantisを活用した蓄電池材料研究_名古屋工業大学 中山氏_Matlantis User Conference
Matlantisを活用した蓄電池材料研究_名古屋工業大学 中山氏_Matlantis User Conference
 
ポアソン画像合成
ポアソン画像合成ポアソン画像合成
ポアソン画像合成
 
PyTorch, PixyzによるGenerative Query Networkの実装
PyTorch, PixyzによるGenerative Query Networkの実装PyTorch, PixyzによるGenerative Query Networkの実装
PyTorch, PixyzによるGenerative Query Networkの実装
 
ベイジアンディープニューラルネット
ベイジアンディープニューラルネットベイジアンディープニューラルネット
ベイジアンディープニューラルネット
 
情報抽出入門 〜非構造化データを構造化させる技術〜
情報抽出入門 〜非構造化データを構造化させる技術〜情報抽出入門 〜非構造化データを構造化させる技術〜
情報抽出入門 〜非構造化データを構造化させる技術〜
 
[DL Hacks] Learning Transferable Features with Deep Adaptation Networks
[DL Hacks] Learning Transferable Features with Deep Adaptation Networks[DL Hacks] Learning Transferable Features with Deep Adaptation Networks
[DL Hacks] Learning Transferable Features with Deep Adaptation Networks
 
MixMatch: A Holistic Approach to Semi- Supervised Learning
MixMatch: A Holistic Approach to Semi- Supervised LearningMixMatch: A Holistic Approach to Semi- Supervised Learning
MixMatch: A Holistic Approach to Semi- Supervised Learning
 

Similaire à Natural Language Processing for Data Extraction and Synthesizability Prediction from the Energy Materials Literature

Building a Knowledge Graph with Spark and NLP: How We Recommend Novel Drugs t...
Building a Knowledge Graph with Spark and NLP: How We Recommend Novel Drugs t...Building a Knowledge Graph with Spark and NLP: How We Recommend Novel Drugs t...
Building a Knowledge Graph with Spark and NLP: How We Recommend Novel Drugs t...
Databricks
 
Knowledge extraction and visualisation using rule-based machine learning
Knowledge extraction and visualisation using rule-based machine learningKnowledge extraction and visualisation using rule-based machine learning
Knowledge extraction and visualisation using rule-based machine learning
jaumebp
 
OpenDiscovery
OpenDiscoveryOpenDiscovery
OpenDiscovery
gwprice
 
DNA Query Language DNAQL: A Novel Approach
DNA Query Language DNAQL: A Novel ApproachDNA Query Language DNAQL: A Novel Approach
DNA Query Language DNAQL: A Novel Approach
Editor IJCATR
 
32_Nov07_MachineLear..
32_Nov07_MachineLear..32_Nov07_MachineLear..
32_Nov07_MachineLear..
butest
 

Similaire à Natural Language Processing for Data Extraction and Synthesizability Prediction from the Energy Materials Literature (20)

Natural language processing for extracting synthesis recipes and applications...
Natural language processing for extracting synthesis recipes and applications...Natural language processing for extracting synthesis recipes and applications...
Natural language processing for extracting synthesis recipes and applications...
 
Progress Towards Leveraging Natural Language Processing for Collecting Experi...
Progress Towards Leveraging Natural Language Processing for Collecting Experi...Progress Towards Leveraging Natural Language Processing for Collecting Experi...
Progress Towards Leveraging Natural Language Processing for Collecting Experi...
 
Thesis def
Thesis defThesis def
Thesis def
 
2014 nicta-reproducibility
2014 nicta-reproducibility2014 nicta-reproducibility
2014 nicta-reproducibility
 
Building a Knowledge Graph with Spark and NLP: How We Recommend Novel Drugs t...
Building a Knowledge Graph with Spark and NLP: How We Recommend Novel Drugs t...Building a Knowledge Graph with Spark and NLP: How We Recommend Novel Drugs t...
Building a Knowledge Graph with Spark and NLP: How We Recommend Novel Drugs t...
 
Knowledge extraction and visualisation using rule-based machine learning
Knowledge extraction and visualisation using rule-based machine learningKnowledge extraction and visualisation using rule-based machine learning
Knowledge extraction and visualisation using rule-based machine learning
 
2014 11-13-sbsm032-reproducible research
2014 11-13-sbsm032-reproducible research2014 11-13-sbsm032-reproducible research
2014 11-13-sbsm032-reproducible research
 
Software tools for high-throughput materials data generation and data mining
Software tools for high-throughput materials data generation and data miningSoftware tools for high-throughput materials data generation and data mining
Software tools for high-throughput materials data generation and data mining
 
Sharing massive data analysis: from provenance to linked experiment reports
Sharing massive data analysis: from provenance to linked experiment reportsSharing massive data analysis: from provenance to linked experiment reports
Sharing massive data analysis: from provenance to linked experiment reports
 
OpenDiscovery
OpenDiscoveryOpenDiscovery
OpenDiscovery
 
Link Mining for Kernel-based Compound-Protein Interaction Predictions Using a...
Link Mining for Kernel-based Compound-Protein Interaction Predictions Using a...Link Mining for Kernel-based Compound-Protein Interaction Predictions Using a...
Link Mining for Kernel-based Compound-Protein Interaction Predictions Using a...
 
ADPosterFinal
ADPosterFinalADPosterFinal
ADPosterFinal
 
Applications of Natural Language Processing to Materials Design
Applications of Natural Language Processing to Materials DesignApplications of Natural Language Processing to Materials Design
Applications of Natural Language Processing to Materials Design
 
DNA Query Language DNAQL: A Novel Approach
DNA Query Language DNAQL: A Novel ApproachDNA Query Language DNAQL: A Novel Approach
DNA Query Language DNAQL: A Novel Approach
 
An interactive approach to multiobjective clustering of gene expression patterns
An interactive approach to multiobjective clustering of gene expression patternsAn interactive approach to multiobjective clustering of gene expression patterns
An interactive approach to multiobjective clustering of gene expression patterns
 
32_Nov07_MachineLear..
32_Nov07_MachineLear..32_Nov07_MachineLear..
32_Nov07_MachineLear..
 
Engineering Data Science Objectives for Social Network Analysis
Engineering Data Science Objectives for Social Network AnalysisEngineering Data Science Objectives for Social Network Analysis
Engineering Data Science Objectives for Social Network Analysis
 
2015 ohsu-metagenome
2015 ohsu-metagenome2015 ohsu-metagenome
2015 ohsu-metagenome
 
Cornell Pbsb 20090126 Nets
Cornell Pbsb 20090126 NetsCornell Pbsb 20090126 Nets
Cornell Pbsb 20090126 Nets
 
The Smart Way To Invest in AI and ML_SFStartupDay
The Smart Way To Invest in AI and ML_SFStartupDayThe Smart Way To Invest in AI and ML_SFStartupDay
The Smart Way To Invest in AI and ML_SFStartupDay
 

Plus de Anubhav Jain

Plus de Anubhav Jain (20)

Discovering advanced materials for energy applications: theory, high-throughp...
Discovering advanced materials for energy applications: theory, high-throughp...Discovering advanced materials for energy applications: theory, high-throughp...
Discovering advanced materials for energy applications: theory, high-throughp...
 
Applications of Large Language Models in Materials Discovery and Design
Applications of Large Language Models in Materials Discovery and DesignApplications of Large Language Models in Materials Discovery and Design
Applications of Large Language Models in Materials Discovery and Design
 
An AI-driven closed-loop facility for materials synthesis
An AI-driven closed-loop facility for materials synthesisAn AI-driven closed-loop facility for materials synthesis
An AI-driven closed-loop facility for materials synthesis
 
Best practices for DuraMat software dissemination
Best practices for DuraMat software disseminationBest practices for DuraMat software dissemination
Best practices for DuraMat software dissemination
 
Best practices for DuraMat software dissemination
Best practices for DuraMat software disseminationBest practices for DuraMat software dissemination
Best practices for DuraMat software dissemination
 
Available methods for predicting materials synthesizability using computation...
Available methods for predicting materials synthesizability using computation...Available methods for predicting materials synthesizability using computation...
Available methods for predicting materials synthesizability using computation...
 
Efficient methods for accurately calculating thermoelectric properties – elec...
Efficient methods for accurately calculating thermoelectric properties – elec...Efficient methods for accurately calculating thermoelectric properties – elec...
Efficient methods for accurately calculating thermoelectric properties – elec...
 
Machine Learning for Catalyst Design
Machine Learning for Catalyst DesignMachine Learning for Catalyst Design
Machine Learning for Catalyst Design
 
Discovering new functional materials for clean energy and beyond using high-t...
Discovering new functional materials for clean energy and beyond using high-t...Discovering new functional materials for clean energy and beyond using high-t...
Discovering new functional materials for clean energy and beyond using high-t...
 
Accelerating New Materials Design with Supercomputing and Machine Learning
Accelerating New Materials Design with Supercomputing and Machine LearningAccelerating New Materials Design with Supercomputing and Machine Learning
Accelerating New Materials Design with Supercomputing and Machine Learning
 
DuraMat CO1 Central Data Resource: How it started, how it’s going …
DuraMat CO1 Central Data Resource: How it started, how it’s going …DuraMat CO1 Central Data Resource: How it started, how it’s going …
DuraMat CO1 Central Data Resource: How it started, how it’s going …
 
The Materials Project
The Materials ProjectThe Materials Project
The Materials Project
 
Evaluating Chemical Composition and Crystal Structure Representations using t...
Evaluating Chemical Composition and Crystal Structure Representations using t...Evaluating Chemical Composition and Crystal Structure Representations using t...
Evaluating Chemical Composition and Crystal Structure Representations using t...
 
Perspectives on chemical composition and crystal structure representations fr...
Perspectives on chemical composition and crystal structure representations fr...Perspectives on chemical composition and crystal structure representations fr...
Perspectives on chemical composition and crystal structure representations fr...
 
Discovering and Exploring New Materials through the Materials Project
Discovering and Exploring New Materials through the Materials ProjectDiscovering and Exploring New Materials through the Materials Project
Discovering and Exploring New Materials through the Materials Project
 
The Materials Project: Applications to energy storage and functional materia...
The Materials Project: Applications to energy storage and functional materia...The Materials Project: Applications to energy storage and functional materia...
The Materials Project: Applications to energy storage and functional materia...
 
The Materials Project: A Community Data Resource for Accelerating New Materia...
The Materials Project: A Community Data Resource for Accelerating New Materia...The Materials Project: A Community Data Resource for Accelerating New Materia...
The Materials Project: A Community Data Resource for Accelerating New Materia...
 
Machine Learning Platform for Catalyst Design
Machine Learning Platform for Catalyst DesignMachine Learning Platform for Catalyst Design
Machine Learning Platform for Catalyst Design
 
Assessing Factors Underpinning PV Degradation through Data Analysis
Assessing Factors Underpinning PV Degradation through Data AnalysisAssessing Factors Underpinning PV Degradation through Data Analysis
Assessing Factors Underpinning PV Degradation through Data Analysis
 
The Status of ML Algorithms for Structure-property Relationships Using Matb...
The Status of ML Algorithms for Structure-property Relationships Using Matb...The Status of ML Algorithms for Structure-property Relationships Using Matb...
The Status of ML Algorithms for Structure-property Relationships Using Matb...
 

Dernier

Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
Lokesh Kothari
 
Biopesticide (2).pptx .This slides helps to know the different types of biop...
Biopesticide (2).pptx  .This slides helps to know the different types of biop...Biopesticide (2).pptx  .This slides helps to know the different types of biop...
Biopesticide (2).pptx .This slides helps to know the different types of biop...
RohitNehra6
 
The Philosophy of Science
The Philosophy of ScienceThe Philosophy of Science
The Philosophy of Science
University of Hertfordshire
 
Pests of cotton_Sucking_Pests_Dr.UPR.pdf
Pests of cotton_Sucking_Pests_Dr.UPR.pdfPests of cotton_Sucking_Pests_Dr.UPR.pdf
Pests of cotton_Sucking_Pests_Dr.UPR.pdf
PirithiRaju
 
Presentation Vikram Lander by Vedansh Gupta.pptx
Presentation Vikram Lander by Vedansh Gupta.pptxPresentation Vikram Lander by Vedansh Gupta.pptx
Presentation Vikram Lander by Vedansh Gupta.pptx
gindu3009
 
CALL ON ➥8923113531 🔝Call Girls Kesar Bagh Lucknow best Night Fun service 🪡
CALL ON ➥8923113531 🔝Call Girls Kesar Bagh Lucknow best Night Fun service  🪡CALL ON ➥8923113531 🔝Call Girls Kesar Bagh Lucknow best Night Fun service  🪡
CALL ON ➥8923113531 🔝Call Girls Kesar Bagh Lucknow best Night Fun service 🪡
anilsa9823
 

Dernier (20)

Isotopic evidence of long-lived volcanism on Io
Isotopic evidence of long-lived volcanism on IoIsotopic evidence of long-lived volcanism on Io
Isotopic evidence of long-lived volcanism on Io
 
Recombinant DNA technology (Immunological screening)
Recombinant DNA technology (Immunological screening)Recombinant DNA technology (Immunological screening)
Recombinant DNA technology (Immunological screening)
 
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
 
CELL -Structural and Functional unit of life.pdf
CELL -Structural and Functional unit of life.pdfCELL -Structural and Functional unit of life.pdf
CELL -Structural and Functional unit of life.pdf
 
fundamental of entomology all in one topics of entomology
fundamental of entomology all in one topics of entomologyfundamental of entomology all in one topics of entomology
fundamental of entomology all in one topics of entomology
 
Biopesticide (2).pptx .This slides helps to know the different types of biop...
Biopesticide (2).pptx  .This slides helps to know the different types of biop...Biopesticide (2).pptx  .This slides helps to know the different types of biop...
Biopesticide (2).pptx .This slides helps to know the different types of biop...
 
GBSN - Biochemistry (Unit 1)
GBSN - Biochemistry (Unit 1)GBSN - Biochemistry (Unit 1)
GBSN - Biochemistry (Unit 1)
 
The Philosophy of Science
The Philosophy of ScienceThe Philosophy of Science
The Philosophy of Science
 
Pests of cotton_Sucking_Pests_Dr.UPR.pdf
Pests of cotton_Sucking_Pests_Dr.UPR.pdfPests of cotton_Sucking_Pests_Dr.UPR.pdf
Pests of cotton_Sucking_Pests_Dr.UPR.pdf
 
Unlocking the Potential: Deep dive into ocean of Ceramic Magnets.pptx
Unlocking  the Potential: Deep dive into ocean of Ceramic Magnets.pptxUnlocking  the Potential: Deep dive into ocean of Ceramic Magnets.pptx
Unlocking the Potential: Deep dive into ocean of Ceramic Magnets.pptx
 
Chemistry 4th semester series (krishna).pdf
Chemistry 4th semester series (krishna).pdfChemistry 4th semester series (krishna).pdf
Chemistry 4th semester series (krishna).pdf
 
GBSN - Microbiology (Unit 2)
GBSN - Microbiology (Unit 2)GBSN - Microbiology (Unit 2)
GBSN - Microbiology (Unit 2)
 
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 60009654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
 
Botany 4th semester file By Sumit Kumar yadav.pdf
Botany 4th semester file By Sumit Kumar yadav.pdfBotany 4th semester file By Sumit Kumar yadav.pdf
Botany 4th semester file By Sumit Kumar yadav.pdf
 
❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.
❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.
❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.
 
Presentation Vikram Lander by Vedansh Gupta.pptx
Presentation Vikram Lander by Vedansh Gupta.pptxPresentation Vikram Lander by Vedansh Gupta.pptx
Presentation Vikram Lander by Vedansh Gupta.pptx
 
CALL ON ➥8923113531 🔝Call Girls Kesar Bagh Lucknow best Night Fun service 🪡
CALL ON ➥8923113531 🔝Call Girls Kesar Bagh Lucknow best Night Fun service  🪡CALL ON ➥8923113531 🔝Call Girls Kesar Bagh Lucknow best Night Fun service  🪡
CALL ON ➥8923113531 🔝Call Girls Kesar Bagh Lucknow best Night Fun service 🪡
 
Green chemistry and Sustainable development.pptx
Green chemistry  and Sustainable development.pptxGreen chemistry  and Sustainable development.pptx
Green chemistry and Sustainable development.pptx
 
Chromatin Structure | EUCHROMATIN | HETEROCHROMATIN
Chromatin Structure | EUCHROMATIN | HETEROCHROMATINChromatin Structure | EUCHROMATIN | HETEROCHROMATIN
Chromatin Structure | EUCHROMATIN | HETEROCHROMATIN
 
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...
 

Natural Language Processing for Data Extraction and Synthesizability Prediction from the Energy Materials Literature

  • 1. Natural Language Processing for Data Extraction and Synthesizability Prediction from the Energy Materials Literature Anubhav Jain Lawrence Berkeley National Laboratory MRS Fall meeting, Nov 2022 Slides (already) posted to hackingmaterials.lbl.gov
  • 2. Literature data can be a key source of materials learning 2 Plan Synthesize Characterize Analyze local db + ML Automated Lab A Plan Synthesize Characterize Analyze Conventional Lab B Plan Synthesize Characterize Analyze local db + ML Automated Lab C Literature data + broad coverage – difficult to parse – lack negative examples – reproducibility Other A-lab data + structured data formats + negative examples – not much out there … Theory data + readily available – difficult to establish relevance to synthesis – computation time
  • 3. Several research groups are now attempting to collect data sets from the research literature 3 Weston, L. et al Named Entity Recognition and Normalization Applied to Large-Scale Information Extraction from the Materials Science Literature. J. Chem. Inf. Model. (2019) Recently, we also tried BERT variants Trewartha, A.; Walker, N.; Huo, H.; Lee, S.; Cruse, K.; Dagdelen, J.; Dunn, A.; Persson, K. A.; Ceder, G.; Jain, A. Quantifying the Advantage of Domain-Specific Pre-Training on Named Entity Recognition Tasks in Materials Science. Patterns 2022, 3 (4), 100488.
  • 4. Models were good for labeling entities, but didn’t understand relationships 4 Named Entity Recognition • Custom machine learning models to extract the most valuable materials-related information. • Utilizes a long short-term memory (LSTM) network trained on ~1000 hand-annotated abstracts. Trewartha, A.; Walker, N.; Huo, H.; Lee, S.; Cruse, K.; Dagdelen, J.; Dunn, A.; Persson, K. A.; Ceder, G.; Jain, A. Quantifying the Advantage of Domain-Specific Pre-Training on Named Entity Recognition Tasks in Materials Science. Patterns 2022, 3 (4), 100488. Relationships have usually been extracted via either manual or semi-automated regular expression construction along with grammar tree analysis, e.g. ChemDataExtractor – can be tedious!
  • 5. Outline • Using sequence-to-sequence models for combined entity detection and relationship extraction • Analyzing synthesis of Au nanorods using literature data • Analyzing synthesis of phase-pure BiFeO3 using literature data 5
  • 6. A Sequence-to-Sequence Approach • Language model takes a sequence of tokens as input and outputs a sequence of tokens • Maximizes the likelihood of the output conditioned on the input • Additionally includes task conditioning, which can learn the desired format for outputs • We’ve done many explorations now with OpenAI’s GPT-3 which has 175 billion parameters • interact with the model through their (paid) API, although costs are relatively modest • Capacity for “understanding” language as well as “world knowledge”
  • 7. How a sequence-to-sequence approach works 7 Seq2Seq model (GPT3) Text in (“prompt”) Text out (“completion”)
  • 8. Another example 8 Seq2Seq model (GPT3) Text in (“prompt”) Text out (“completion”)
  • 9. Structured data 9 Seq2Seq model (GPT3) Text in (“prompt”) Text out (“completion”)
  • 10. But it’s not perfect for technical data 10 Seq2Seq model (GPT3) Text in (“prompt”) Text out (“completion”)
  • 11. A workflow for fine-tuning GPT-3 1. Initial training set of templates filled mostly manually, as zero-shot GPT is often poor for technical tasks 2. Fine-tune model to fill templates, use the model to assist in annotation 3. Repeat as necessary until desired inference accuracy is achieved
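The annotation loop above needs paragraphs paired with filled templates in a form a model can be fine-tuned on. The sketch below assumes a prompt/completion JSONL layout; the separator strings, function names, and example fields are hypothetical choices for illustration, since the slides do not give them.

```python
import json

# Hypothetical separator and stop markers (not specified in the slides).
PROMPT_END = "\n\n###\n\n"
COMPLETION_END = "\n\nEND"

def make_record(paragraph, filled_template):
    """Pair one paragraph with its annotated template as a training record."""
    return {
        "prompt": paragraph.strip() + PROMPT_END,
        "completion": " " + json.dumps(filled_template, sort_keys=True) + COMPLETION_END,
    }

def to_jsonl(records):
    """Serialize records one per line."""
    return "\n".join(json.dumps(record) for record in records)

def parse_completion(completion):
    """Recover a template from model output; return None if it is not valid JSON."""
    text = completion.partition(COMPLETION_END)[0].strip()
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return None

record = make_record("Au seeds were prepared with CTAB.",
                     {"seed": {"capping_agent": "CTAB"}})
print(to_jsonl([record]))
```

In step 2 of the workflow, `parse_completion` would be applied to model output so that an annotator only corrects the pre-filled template rather than writing it from scratch; outputs that fail to parse are returned as `None` and can be routed back to manual annotation.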
  • 12. This procedure can extract complex, hierarchical relationships between entities
  • 13. Outline • Using sequence-to-sequence models for combined entity detection and relationship extraction • Analyzing synthesis of Au nanorods using literature data • Analyzing synthesis of phase-pure BiFeO3 using literature data
  • 14. Templated extraction of synthesis recipes • Annotate paragraphs to output structured recipe templates • JSON-format • Designed using domain knowledge from experimentalists • Template is relation graph to be filled in by model
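As an illustration of a recipe template acting as a relation graph, the sketch below defines a small nested template and checks that a filled output only uses fields the template defines. The field names and the `unexpected_keys` helper are hypothetical; the actual schema designed with the experimentalists is not shown in the slides.

```python
# Hypothetical template: field names are illustrative, not the talk's schema.
TEMPLATE = {
    "seed_solution": {"precursors": [], "capping_agent": None},
    "growth_solution": {"precursors": [], "temperature": None},
    "product": {"shape": None},
}

def unexpected_keys(filled, template, path=""):
    """Return dotted paths of keys in `filled` that the template does not define."""
    problems = []
    for key, value in filled.items():
        here = f"{path}.{key}" if path else key
        if key not in template:
            problems.append(here)
        elif isinstance(template[key], dict):
            if isinstance(value, dict):
                problems.extend(unexpected_keys(value, template[key], here))
            else:
                # The template expects a nested object at this position.
                problems.append(here)
    return problems

filled = {"seed_solution": {"capping_agent": "CTAB"}, "product": {"shape": "rod"}}
print(unexpected_keys(filled, TEMPLATE))
```

A check like this is one way to catch model outputs that are valid JSON but drift from the intended structure.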
  • 15. Example Extraction for Au nanorod synthesis Note: we are still formally evaluating performance; there are various issues in getting an accurate evaluation, e.g., predictions that are functionally correct but written differently
  • 16. Analyzing AuNR synthesis data set Note that this data set was collected manually via hand-tuned regular expressions, not NLP or GPT-3, as it was done in parallel to that work. We are currently looking at pros/cons of the manual approach vs. the GPT-3 approach. Representing recipes as precursor vectors for machine learning
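One simple way to represent a recipe as a precursor vector is a fixed-length list with one slot per precursor. The sketch below assumes a vocabulary built from precursor names mentioned elsewhere in the talk and uses made-up amounts in arbitrary units; the real featurization is not described in the slides.

```python
# Hypothetical vocabulary; the ordering defines the vector layout.
PRECURSORS = ["HAuCl4", "CTAB", "AA", "AgNO3", "HCl"]

def recipe_to_vector(recipe, vocabulary=PRECURSORS):
    """Map {precursor: amount} to a fixed-length list; absent precursors become 0.0."""
    unknown = set(recipe) - set(vocabulary)
    if unknown:
        raise KeyError(f"precursors not in vocabulary: {sorted(unknown)}")
    return [float(recipe.get(name, 0.0)) for name in vocabulary]

# Made-up amounts, arbitrary units, for illustration only.
print(recipe_to_vector({"HAuCl4": 0.5, "AgNO3": 0.1}))
```

Raising on unknown precursors, rather than silently dropping them, keeps vocabulary gaps visible when the data set is messy.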
  • 17. Training a decision tree to predict AuNR shape reaches conclusions similar to the literature [Figure: decision tree with leaf labels Rod, Cube, Bipyramid, Star, and None] • Decision tree shows seed capping agent type as first decision boundary for shape determination • “Citrate-capped gold seeds form penta-twinned structure, while CTAB-capped seeds are single crystalline, hence former leads to bipyramids and latter leads to rods” [1,2] [1] Liu and Guyot-Sionnest, J. Phys. Chem. B, 2005, 109 (47), 22192-22200 [2] Grzelczak et al., Chem. Soc. Rev., 2008, 37, 1783-1791
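To show how a decision tree picks its first decision boundary, here is a minimal sketch that scores every "feature == value" split by weighted Gini impurity and keeps the best one. The four rows are made-up toy data constructed to mirror the quoted heuristic, not the data set from the talk, and the feature names are hypothetical; the talk's actual tree-building method and settings are not stated.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def best_categorical_split(rows, labels):
    """Return (feature, value, weighted_gini) for the best 'feature == value' split."""
    best = None
    n = len(rows)
    for feature in rows[0]:
        for value in sorted({row[feature] for row in rows}):
            left = [y for row, y in zip(rows, labels) if row[feature] == value]
            right = [y for row, y in zip(rows, labels) if row[feature] != value]
            if not left or not right:
                continue  # a split must separate the data into two groups
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or score < best[2]:
                best = (feature, value, score)
    return best

# Toy, made-up rows for illustration only.
rows = [
    {"seed_capping": "CTAB", "has_AgNO3": "yes"},
    {"seed_capping": "CTAB", "has_AgNO3": "no"},
    {"seed_capping": "citrate", "has_AgNO3": "yes"},
    {"seed_capping": "citrate", "has_AgNO3": "no"},
]
labels = ["rod", "rod", "bipyramid", "bipyramid"]
print(best_categorical_split(rows, labels))
```

On this toy data the seed capping agent separates the shapes perfectly, so it is chosen as the root split; a full tree would repeat the same search recursively inside each branch.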
  • 18. We also see some effect of AgNO3 concentration on AuNR size, but data is noisy [Figure panels: growth: HAuCl4, CTAB, AA, AgNO3; growth: HAuCl4, CTAB, AA, AgNO3 w/ HAuCl4/CTAB<0.01 filter; growth: HAuCl4, CTAB, AA, AgNO3 + HCl] N. D. Burrows et al., Langmuir 2017, 33 (8), 1891-1907
  • 19. Overall thoughts on AuNR data set • The seq2seq method is showing good capabilities in terms of extracting complex nanorod synthesis data • We are going to start integrating this into our own pipeline to replace manual regex for relationship extraction • Using machine learning for hypothesis generation on AuNR shape and size is messy • Data sets are messy, and not particularly large • Nevertheless, it is encouraging that conclusions from the literature can be automatically found by machine learning
  • 20. Outline • Using sequence-to-sequence models for combined entity detection and relationship extraction • Analyzing synthesis of Au nanorods using literature data • Analyzing synthesis of phase-pure BiFeO3 using literature data
  • 21. Seq2Seq approach for solid state synthesis Initial tests of the seq2seq method on solid state synthesis show encouraging results, but further testing is needed
  • 22. For now, we use manual data extraction to tackle the problem of BiFeO3 synthesis: 340 total synthesis recipes (from 178 articles); 57 features per recipe
  • 23. Machine learning (decision tree) predictions are in line with common knowledge
  • 24. Missing synthesis information: can it be recovered / reproduced easily? [Figure legend: Could not reproduce, Partially reproducible, Reproducible]
  • 25. Exploring unexplored portions of synthesis space. These decision trees are interpretable, but are they physical?
  • 26. Conclusions • As large language models grow larger and more capable, they are able to parse increasingly complex scientific text into structured formats • Applying NLP + ML on synthesis data sets shows that scientific heuristics can be automatically uncovered, which is promising • Nevertheless, issues remain in applying NLP to predictive synthesis • Reproducibility / missing information / conflicting information • General lack of negative examples • Unknown data quality • Thus, results from such techniques will likely need to be treated as initial hypotheses to be complemented by further experiments
  • 27. Acknowledgements NLP (seq2seq): • Alex Dunn • John Dagdelen • Nick Walker • Sanghoon Lee • Amalie Trewartha AuNR analysis: • Sanghoon Lee • Sam Gleason • Kevin Cruse BiFeO3 analysis: • Kevin Cruse • Viktoriia Baibakova • Maged Abdelsamie • Kootak Hong • Carolin Sutter-Fella • Gerbrand Ceder Funding provided by: • U.S. Department of Energy, Basic Energy Sciences, “D2S2” program • Toyota Research Institute, Accelerated Materials Design program
  • 28. Sol-gel synthesis of BiFeO3