Logan Ward1 (loganw@uchicago.edu), Ben Blaiszik1,2 (blaiszik@uchicago.edu),
Ian Foster1,2 (foster@uchicago.edu), Ryan Chard2,
Jonathon Gaff1, Kyle Chard1, Jim Pruyne1,
Rachana Ananthakrishnan1, Steven Tuecke1,
Michael Ondrejcek3, Kenton McHenry3, John Towns3
1 University of Chicago, 2 Argonne National Laboratory, 3 University of Illinois at Urbana-Champaign
materialsdatafacility.org
globus.org
Materials Data Facility:
A Distributed Model for
the Materials Data Community
15 August 2017
The Materials Data Facility Team
UC/Argonne: Ian Foster (PI), Ben Blaiszik, Steve Tuecke, Kyle Chard, Jim Pruyne, Logan Ward, Jonathon Gaff, Rachana Ananthakrishnan, Stephen Rosen, Ryan Chard
Illinois (Urbana-Champaign): John Towns (PI), Kenton McHenry, Michal Ondrejcek
Data-Intensive Materials Science
Materials Databases, High-Throughput Screening, Machine Learning, Multi-scale Modeling
Kirklin et al. Acta Mat. (2016); de Jong et al. Sci Rep. (2016); Sparks et al. Scr. Mat. (2015); https://www.mpg.de/
Data-Intensive Materials Science
Science is becoming limited by the ability to handle data
- Where to get it?
- How to selectively share it?
- Where to store it?
- How do we know what it is?
- How to build software that uses it?
- How to get others to share theirs?
- How to keep track of provenance?
- …?
Our goal is to create easy answers to these questions
Why create the MDF?
1. Make your data shareable
   Custom access control, using institution credentials
2. Make your data open
   Access to >100TB of storage space
3. Make your data accessible
   Search across distributed resources
   Automatic, domain-specific metadata extraction
4. Make your data computable
   Tight integration with computing resources
5. Make your data valuable
   Citable with DOIs, measured with usage stats
What is the MDF?
[Figure: MDF architecture]
Data discovery service: deep indexing, query, browse, aggregate
Data publication service: publish, mint DOIs, associate metadata
Distributed data storage (EP)
Indexed sources: databases, datasets, APIs, LIMS, etc.
SHAREABLE AND OPEN DATA
Globus and the research data lifecycle
1. Researcher initiates transfer request; or requested automatically by script, science gateway
2. Globus transfers files reliably, securely
3. Researcher selects files to share, selects user or group, and sets access permissions
4. Globus controls access to shared files on existing storage; no need to move files to cloud storage!
5. Collaborator logs in to Globus and accesses shared files; no local account required; download via Globus
6. Researcher assembles data set; describes it using metadata (DataCite & domain-specific)
7. Curator reviews and approves; data set published on campus or other system
8. Peers, collaborators search and discover datasets; transfer and share using Globus
Locations: Instrument, Compute Facility, Publication Repository, Personal Computer
Stages: Transfer, Share, Publish, Discover
• Only a Web browser required
• Use storage system of your choice
• Access using your campus credentials
Data sharing and Globus
Easily control who gains access to your data:
- Globus can use University/Laboratory credentials
- You can establish groups of authorized users
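As a rough illustration of user- and group-based sharing, the sketch below models an access-control list in plain Python. It does not call Globus; the `SharedFolder` class, its methods, and the identities and group names are all hypothetical and exist only to show the idea.

```python
from dataclasses import dataclass, field


@dataclass
class SharedFolder:
    """Toy model of a shared folder with user- and group-level read access."""
    path: str
    allowed_users: set = field(default_factory=set)
    allowed_groups: set = field(default_factory=set)

    def grant_user(self, identity: str) -> None:
        # Give a single identity (e.g. a campus login) read access.
        self.allowed_users.add(identity)

    def grant_group(self, group: str) -> None:
        # Give every member of a named group read access.
        self.allowed_groups.add(group)

    def can_read(self, identity: str, memberships: dict) -> bool:
        # Access is allowed by a direct grant or by any shared group.
        if identity in self.allowed_users:
            return True
        user_groups = memberships.get(identity, set())
        return bool(user_groups & self.allowed_groups)


# Hypothetical identities and group memberships.
memberships = {"alice@uchicago.edu": {"alloy-project"}, "bob@anl.gov": set()}
folder = SharedFolder("/data/alloy-run-01")
folder.grant_group("alloy-project")
folder.grant_user("carol@illinois.edu")
```

Keeping group membership separate from the folder means one change to a group updates access everywhere that group is used.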
Simple to move data to/from any resource
Open data and Globus
Bottom Line: Globus provides a robust, highly-developed, well-supported platform for sharing and managing open data
DATA ACCESSIBILITY
What do I mean by “accessibility”?
Need: Simplify finding and acquiring materials data
Major Challenges:
1. Data spread across many resources
   - Have to search each repository individually
   - Different services, different APIs to get data
2. Contents of resources are poorly described
   - Lack domain-specific metadata
Goal: Link together the world’s materials data resources, with enough metadata to make them useful
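The first challenge, searching each repository individually, can be pictured with a small sketch of one query fanned out to several sources. This is an assumption-laden toy, not the MDF implementation: `make_repository`, `federated_search`, and the in-memory records are hypothetical stand-ins for real repositories and their APIs.

```python
from typing import Callable, Dict, List

Record = Dict[str, str]
SearchFn = Callable[[str], List[Record]]


def make_repository(name: str, records: List[Record]) -> SearchFn:
    """Wrap an in-memory record list behind a uniform search function."""
    def search(term: str) -> List[Record]:
        term_lower = term.lower()
        # Tag each hit with its source so results stay traceable.
        return [dict(r, source=name) for r in records
                if term_lower in r["title"].lower()]
    return search


def federated_search(term: str, repositories: List[SearchFn]) -> List[Record]:
    """Send one query to every repository and merge the hits."""
    hits: List[Record] = []
    for search in repositories:
        hits.extend(search(term))
    return hits


# Two hypothetical repositories with made-up records.
repo_a = make_repository("repo_a", [{"title": "Al DFT forces"},
                                    {"title": "Ni EBSD maps"}])
repo_b = make_repository("repo_b", [{"title": "Aluminum diffusion DFT"}])
results = federated_search("dft", [repo_a, repo_b])
```

In practice each repository would need its own adapter, which is where the "different services, different APIs" cost shows up.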
Part 1: Linking with the Data Community
Materials Project, Citrination, Materials Commons
Other Facilities (APS, SNS, NSLS, …), Institutional Repositories, Publishers!
[Figure: links between MDF and these resources, labeled Metadata, Publishing, MD, Pub., Compute, NCSA-PIRE, HV/TMS, MBDH]
MDF data discovery ecosystem
[Figure: data sources (EP) are harvested and deep-indexed into the data discovery service, which registers and syncs with the NIST MRR. Services and bots, including the MDF publication service, automate, process, refine, and analyze data between data input and data output. User interfaces support query, browse, and aggregate.]
Identify resources for indexing
MDF + NIST Database Tools
[Figure: MDF data discovery service linked with the NIST MDCS and the NIST MRR]
Ref: Dima et al. JOM 68 (2016), 2053. doi:10.1007/s11837-016-2000-4
MDF automates publicizing data and provides a uniform search interface
Piping DFT data from MDF to Citrine
{ "category": "system.chemical",
"chemicalFormula": "MgO2",
"properties": {
"units": "eV", "name": "Band gap",
"scalars": [ { "value": 7.8 } ] } }
1. User publishes DFT dataset
2. Bot requests open DFT data periodically
3. Bot accesses data, runs DFT parser to refine data
4. Push metadata to Citrine
5. Ingest DFT data quality report
…
Our datasets are discoverable through many tools
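As a minimal sketch of how a bot might assemble the record shown on this slide, the snippet below builds the same structure as a Python dictionary and serializes it to JSON. The helper name `make_band_gap_record` is hypothetical, and no Citrine client call is made; how the payload is actually pushed is not shown in the slides.

```python
import json


def make_band_gap_record(formula: str, band_gap_ev: float) -> dict:
    """Build a record with the same fields as the example on the slide."""
    return {
        "category": "system.chemical",
        "chemicalFormula": formula,
        "properties": {
            "units": "eV",
            "name": "Band gap",
            "scalars": [{"value": band_gap_ev}],
        },
    }


# Reproduce the slide's example values.
record = make_band_gap_record("MgO2", 7.8)
payload = json.dumps(record, indent=2)
```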
Part 2: A Materials Data Search Engine
Goal: Simplify finding useful data
Key Issue: Lack of metadata
Approaches:
1. Simplifying metadata capture from the source
2. Extracting useful information from datasets
Route 1: Integrating with LIMS/Workflow Tools
MAST
Materials Commons (MC)
T2C2 (4CeeD)
• Build connections to international materials
efforts and registries (e.g., NIMS, RDA, NIST,
EUDAT, NDS)
• Promote IMaD data services, tools, and
accomplishments to the community
• Develop video tutorials, webinars, and shared
code repositories
• Interface with the Materials Accelerator
Network (MAN)
• Engage with colleges, industry, and
consortiums
• (Wisconsin) Regional Materials and
Manufacturing Network (RM2N)
• (Illinois) Digital Manufacturing and
Design Innovation Institute DMDII
• (Michigan) LIFT consortium
Engagement
Linking Software and Services
PIs: I. Foster1,2, J. Allison3, D. Morgan4, D. Trinkle5, P. Voorhees6
1 University of Chicago, 2 Argonne National Laboratory, 3 University of Michigan, 4 University of Wisconsin-Madison, 5 University of Illinois at Urbana-Champaign, 6 Northwestern University
Overview
Linking Data from Major Facilities
• NSF Midwest Big Data Spoke
• Argonne Leadership Computing Facility (>1000 users/year)
  - Working with datasets that comprise ~300M core hours, with 200M more identified for the near term
  - New joint effort to roll out MDF-like capabilities to ALCF users
• Advanced Photon Source (>5000 users/year)
  - Building pipelines and procedures to index and publish data from 15 beamlines (~1/3 of the facility) in conjunction with the APS software team (Schwartz)
• Advanced Light Source (>2000 users/year)
  - Integration with CAMERA project and associated tomography beamlines
Working with user facilities to facilitate capturing data/metadata
Ripple: Home automation for research data
doi:10.1109/ICDCSW.2017.30
Procedure for automating tomography experiments:
1. At ALS: Detect new beamline data, and transfer it to NERSC
2. At NERSC: Submit, run jobs on Edison, transfer data back to ALS
3. At ALS: Create a shared endpoint, notify collaborators of result via email
Automate capturing results and metadata
Ryan Chard
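The tomography procedure reads naturally as a set of trigger-action rules. The sketch below models that idea in plain Python; it is not Ripple's API, and the `Rule` class, the event names, and the action strings are hypothetical. Actions only return a description of what would happen, so nothing is transferred or submitted.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Rule:
    """When `event` happens at `site`, apply `action` to the file path."""
    site: str
    event: str
    action: Callable[[str], str]


def run_rules(rules: List[Rule], site: str, event: str, path: str) -> List[str]:
    """Return the actions triggered by one event (described, not executed)."""
    return [rule.action(path) for rule in rules
            if rule.site == site and rule.event == event]


# Hypothetical rules mirroring the three stages of the procedure.
rules = [
    Rule("ALS", "file_created", lambda p: f"transfer {p} -> NERSC"),
    Rule("NERSC", "file_created", lambda p: f"submit job for {p}"),
    Rule("NERSC", "job_finished", lambda p: f"transfer {p} -> ALS"),
    Rule("ALS", "result_arrived", lambda p: f"share {p} and email collaborators"),
]

log = run_rules(rules, "ALS", "file_created", "scan_001.h5")
```

Because each rule reacts only to a local event, the chain runs end to end without a central script holding the whole workflow.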
Route 2: Deep Indexing Materials Data
MDF Index
Data resources indexed: 116
Records: >3.4M
Repositories harvested: 6
• MDF
• NIST MML Repo
• MATIN
• Materials Commons
• CXIDB
• NIST Materials Resource Registry
Made discoverable: ~200 datasets, ~260 TB
Adding More Metadata to NIST MatDL
Dataset as published: limited metadata, querying difficult
Deep-indexed into the MDF: data available programmatically, can be used for scripting
Another benefit: domain-specific querying
Example service possible with DFT data files
Answer questions like:
“Do we have any data about anatase-TiO2?”
“Who else has studied Li-MnO3 batteries with DFT?”
Input: Crystal structure file (.cif, VASP, etc.)
Output: Entries from MDF that are structurally similar
Skluma: A Statistical Learning Pipeline for Taming Unkempt Data Repositories
doi:10.1145/3085504.3091116
Goal: Build intelligent search indexes with minimal human effort
Method: Employ machine learning to extract metadata from file repositories
- Classify data files
- Detect file types
Tyler Skluzacek
Search Otherwise-Unusable Data Repositories
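To make "detect file types" concrete, here is a deliberately simple sketch that guesses a file's type from its name and first few lines. It uses hand-written rules rather than the machine learning Skluma employs, so it should be read as a stand-in for the idea only; `guess_file_type` and its category labels are hypothetical.

```python
import json


def guess_file_type(name: str, head: str) -> str:
    """Guess a coarse file type from the file name and its leading text."""
    lowered = name.lower()
    # CIF crystal-structure files are organized into blocks opened by "data_".
    if lowered.endswith(".cif") or head.lstrip().startswith("data_"):
        return "crystal-structure"
    if lowered.endswith(".json"):
        try:
            json.loads(head)
            return "json"
        except ValueError:
            return "text"
    # Treat content as tabular if every non-empty line has a comma.
    lines = [ln for ln in head.splitlines() if ln.strip()]
    if lines and all("," in ln for ln in lines):
        return "tabular"
    return "text"
```

A learned classifier would replace these fixed rules with features extracted from file contents, which is what lets it cope with files that have no reliable extension.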
APIs, Automation, and Examples
MDF Forge Python package (under development)
• Interface to MDF services
• Helper functions for common tasks
https://github.com/materials-data-facility/forge
Tools for using these capabilities will be available soon
COMPUTABLE DATA
Computable Data
Reproducing data-driven science should be trivial
It often is not. Common problems:
- If available, datasets lack documentation
- Algorithms/methods are not open sourced
- Models are rarely published
- Software installation/configuration requires expertise
Our goal: Simplify publishing data-driven science
- Storing software and models
- Integrating them with compute resources
Integrating analytics tools with MDF
Result: Scientists connected with data, analytics tools, and compute capability
[Figure: data flows from data repositories to compute resources to end users. Repositories shown: MDF Data Publication, MATIN (GT), MML Repository (NIST), Materials Commons (UM PRISMS), Coherent X-Ray Tomography Database (LNL)]
MATIN (GT): ~10 datasets, used in education
Jetstream is a self-provisioned, scalable science and engineering cloud environment operated by Indiana University for the National Science Foundation: jetstream-cloud.org
Building a machine learning model using MDF
A simple web service to train ML forcefields
Example: Building force-field potentials from different datasets
Method: Botu et al. JPCC. (2017)
Data resources: 3 DFT datasets with Aluminum data
1 dataset from khazana.uconn.edu, 2 datasets from materialsdata.nist.gov
[Figure: training set and holdout set results using only original data, including Diffusion Dataset, and including 𝐷 + 𝑇# Dataset]
Result: Improved performance by integrating data sources
Better performance in original application: No new DFT calculations
Reproducing data-driven MSE with MDF
• Summer Intern (Jiming Chen, UIUC) reproducing and extending materials and ML papers with the MDF
• Joined our team with the NSF WholeTale project
Users publish data to the MDF… and code to WholeTale
Long-term goals:
- Assemble community-driven resource for ML tools/examples
- Use MDF/WholeTale to create benchmark challenges
Replicating Ward et al. 2016
DLHub: Advancing Deep Learning Adoption
• Publish and share models and code linked with full training datasets
• Link database with HPC/Cloud computing resources
• Provide uniform interface for training, running models
INCREASING VALUE OF DATA
Data publication service
• Mechanisms to create and enforce schemas and logical collections
• Web UI to create datasets and manage curation and admin tasks
• Tools to automate publication process
• Dataset record permanent landing page for DOI link
• Record shows some metadata and links to the rest
• Direct link to underlying files
• Download statistics
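Enforcing a schema at publication time can be pictured as a check that runs before a dataset record is accepted. The sketch below is a minimal, hypothetical version: the field names in `REQUIRED_FIELDS` and the `validate_metadata` helper are assumptions for illustration and are not the MDF or DataCite schema.

```python
# Hypothetical minimal schema: field name -> required Python type.
REQUIRED_FIELDS = {"title": str, "authors": list, "publication_year": int}


def validate_metadata(record: dict, schema: dict = REQUIRED_FIELDS) -> list:
    """Return a list of problems; an empty list means the record passes."""
    problems = []
    for name, expected in schema.items():
        if name not in record:
            problems.append(f"missing field: {name}")
        elif not isinstance(record[name], expected):
            problems.append(
                f"wrong type for {name}: expected {expected.__name__}")
    return problems


# Made-up records for illustration.
good = {"title": "Example dataset", "authors": ["A. Researcher"],
        "publication_year": 2017}
bad = {"title": "Untitled", "authors": "Someone"}
```

Returning every problem at once, instead of stopping at the first, gives a submitter or curator a complete list to fix in one pass.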
Published Data Highlights
~30 datasets, ~6.5 TB
MATIN (GT): ~10 datasets, used in education
• X-ray Scattering Image Classification Using Deep Learning (Yager et al.), http://dx.doi.org/10.18126/M2Z30Z
• Electron Backscattering and Diffraction Datasets for Ni, Mg, Fe, Si (Marc De Graef et al.)
• Phase Field Benchmark I Dataset (Jokisaari et al.)
• Grain Structure, Grain-averaged Lattice Strains, and Macro-scale Strain Data for Superelastic Nickel-Titanium Shape Memory Alloy Polycrystal Loaded in Tension (Paranjape et al.)
• Largest dataset to date (>1.5 TB). Showcases MDF's unique capabilities and makes a unique dataset discoverable for code development, analysis, and benchmarking
Datasets Are Citable
Streamline & automate data publication
Data volumes: 12.5 TB; 12.4 TB out
Publication authors: 94
Institutions: 14
Accesses: >1000
Total datasets: 50
CHiMaD datasets: 16
Pipeline: CHiMaD datasets +14, total datasets +30
Advantages of Globus Publish
Capable of handling large datasets
- Publish data in place
- Integration with Globus Transfer/HTTPS
Deep indexing of materials-specific metadata
- Parse common materials data types
- Make data searchable at the file level
Automatically re-publishing data elsewhere
- Publishing dataset metadata to MRR, Google Scholar, etc.
- Sending fine-grained metadata to other databases (e.g., Citrine)
In Progress: Know how often your data is used
- Track when it is used in analytics tools
All of these capabilities increase the value of your data
Why create the MDF?
http://materialsdatafacility.org
1. Make your data shareable
   Custom access control, using institution credentials
2. Make your data open
   Access to >100TB of storage space
3. Make your data accessible
   Search across distributed resources
   Automatic, domain-specific metadata extraction
4. Make your data computable
   Tight integration with computing resources
5. Make your data valuable
   Citable with DOIs, measured with usage stats
Thanks to our sponsors!
U.S. Department of Energy

The Materials Data Facility: A Distributed Model for the Materials Data Community

  • 1. Logan Ward¹ (loganw@uchicago.edu), Ben Blaiszik¹,² (blaiszik@uchicago.edu), Ian Foster¹,² (foster@uchicago.edu), Ryan Chard², Jonathon Gaff¹, Kyle Chard¹, Jim Pruyne¹, Rachana Ananthakrishnan¹, Steven Tuecke¹, Michael Ondrejcek³, Kenton McHenry³, John Towns³ ¹University of Chicago, ²Argonne National Laboratory, ³University of Illinois at Urbana-Champaign materialsdatafacility.org globus.org Materials Data Facility: A Distributed Model for the Materials Data Community 15 August 2017
  • 2. The Materials Data Facility Team UC/Argonne Ian Foster (PI) Ben Blaiszik Steve Tuecke Kyle Chard Jim Pruyne Logan Ward Jonathon Gaff Illinois (Urbana-Champaign) Rachana Ananthakrishnan John Towns (PI) Kenton McHenry Michal Ondrejcek Stephen Rosen Ryan Chard
  • 3. Data-Intensive Materials Science 3 Materials Databases High-Throughput Screening Machine Learning Multi-scale Modeling Kirklin et al. Acta Mat. (2016) de Jong et al. Sci Rep. (2016) Sparks et al. Scr. Mat. (2015) https://www.mpg.de/
  • 4. Data-Intensive Materials Science Science is becoming limited by the ability to handle data - Where to get it? - How to selectively share it? - Where to store it? - How do we know what it is? - How to build software that uses it? - How to get others to share theirs? - How to keep track of provenance? - …? Our goal is to create easy answers to these questions
  • 5. Why create the MDF? 1. Make your data shareable: custom access control, using institution credentials 2. Make your data open: access to >100 TB of storage space 3. Make your data accessible: search across distributed resources; automatic, domain-specific metadata extraction 4. Make your data computable: tight integration with computing resources 5. Make your data valuable: citable with DOIs, measured with usage stats
  • 6. What is the MDF? Distributed data storage: databases, datasets, APIs, LIMS, etc. Data publication service: publish, mint DOIs, associate metadata Data discovery service: deep indexing, query, browse, aggregate
  • 7. SHAREABLE AND OPEN DATA
  • 8. Globus and the research data lifecycle 1. Researcher initiates transfer request, or it is requested automatically by a script or science gateway 2. Globus transfers files reliably, securely 3. Researcher selects files to share, selects user or group, and sets access permissions 4. Globus controls access to shared files on existing storage; no need to move files to cloud storage! 5. Collaborator logs in to Globus and accesses shared files; no local account required; download via Globus 6. Researcher assembles data set; describes it using metadata (DataCite & domain-specific) 7. Curator reviews and approves; data set published on campus or other system 8. Peers, collaborators search and discover datasets; transfer and share using Globus Locations: Instrument, Compute Facility, Publication Repository, Personal Computer Stages: Transfer, Share, Publish, Discover • Only a Web browser required • Use storage system of your choice • Access using your campus credentials
  • 9. Data sharing and Globus 9 Easily control who gains access to your data: - Globus can use University/Laboratory credentials - You can establish groups of authorized users
  • 10. Data sharing and Globus 10 Simple to move data to/from any resource
  • 11. Open data and Globus 11
  • 12. Open data and Globus Bottom Line: Globus provides a robust, highly-developed, well-supported platform for sharing and managing open data
  • 14. What do I mean by “accessibility”? Need: Simplify finding and acquiring materials data Major Challenges: 1. Data spread across many resources § Have to search each repository individually § Different services, different APIs to get data 2. Contents of resources are poorly described § Lack domain-specific metadata Goal: Link together the world’s materials data resources, with enough metadata to make them useful
  • 15. Part 1: Linking with the Data Community Materials Project Citrination Materials Commons Other Facilities (APS, SNS, NSLS, …), Institutional Repositories, Publishers! Metadata Publishing Metadata MD, Pub., Compute Metadata Publishing NCSA-PIRE HV/TMS MBDH
  • 16. MDF data discovery ecosystem EP NIST MRR Data discovery service Harvest Deep index Register / Sync Services Bots MDF Pub Service Automate Process Refine Analyze Data Output Data Input EP Data Sources Query Browse Aggregate User Interfaces Identify resources for indexing 16
  • 17. MDF + NIST Database Tools 17 Data discovery service MDCS NIST MRR Ref: Dima, et al. JOM. 68 (2016), 2053. doi: 10.1007/s11837-016-2000-4
  • 18. MDF + NIST Database Tools 18 Data discovery service MDCS NIST MRR MDF automates publicizing data and provides a uniform search interface
  • 19. Piping DFT data from MDF to Citrine 1. User publishes DFT dataset 2. Bot requests open DFT data periodically 3. Bot accesses data, runs DFT parser to refine data 4. Push metadata to Citrine 5. Ingest DFT data quality report … Example record: { "category": "system.chemical", "chemicalFormula": "MgO2", "properties": { "units": "eV", "name": "Band gap", "scalars": [ { "value": 7.8 } ] } } Our datasets are discoverable through many tools
  • 20. Part 2: A Materials Data Search Engine Goal: Simplify finding useful data Key Issue: Lack of metadata Approaches: 1. Simplifying metadata capture from the source 2. Extracting useful information from dataset 20
  • 21. Route 1: Integrating with LIMS/Workflow Tools 21 MAST Materials Commons (MC) T2C2 (4CeeD) • Build connections to international materials efforts and registries (e.g., NIMS, RDA, NIST, EUDAT, NDS) • Promote IMaD data services, tools, and accomplishments to the community • Develop video tutorials, webinars, and shared code repositories • Interface with the Materials Accelerator Network (MAN) • Engage with colleges, industry, and consortiums • (Wisconsin) Regional Materials and Manufacturing Network (RM2N) • (Illinois) Digital Manufacturing and Design Innovation Institute DMDII • (Michigan) LIFT consortium Engagement Linking Software and Services PIs: I. Foster1,2, J. Allison3, D. Morgan4, D. Trinkle5, P. Voorhees6 1 University of Chicago 2 Argonne National Laboratory 3 University of Michigan 4 University of Wisconsin-Madison 5 University of Illinois at Urbana-Champaign, 6 Northwestern University Overview • NSF Midwest Big Data Spoke
  • 22. Linking Data from Major Facilities: Working with user facilities to facilitate capturing data/metadata • Argonne Leadership Computing Facility (>1000 users/year) § Working with datasets that comprise ~300M core hours, with 200M more identified for near term § New joint effort to roll out MDF-like capabilities to ALCF users • Advanced Photon Source (>5000 users/year) § Building pipelines and procedures to index and publish data from 15 beamlines (~1/3 of the facility) in conjunction with the APS software team (Schwartz) • Advanced Light Source (>2000 users/year) § Integration with CAMERA project and associated tomography beamlines
  • 23. Ripple: Home automation for research data (doi:10.1109/ICDCSW.2017.30, Ryan Chard) Procedure for automating tomography experiments: At ALS: Detect new beamline data, and transfer it to NERSC At NERSC: Submit, run jobs on Edison, transfer data back to ALS At ALS: Create a shared endpoint, notify collaborators of result via email Automate capturing results and metadata
  • 24. Route 2: Deep Indexing Materials Data MDF Index Data resources indexed: 116 Records: >3.4M Repositories harvested: 6 (MDF, NIST MML Repo, MATIN, Materials Commons, CXIDB, NIST Materials Resource Registry) Made discoverable: ~200 datasets, ~260 TB
  • 25. Adding More Metadata to NIST MatDL Dataset As Published Limited Metadata Querying Difficult 25
  • 26. Adding More Metadata to NIST MatDL Deep-Indexed into the MDF Data Available Programmatically 26
  • 27. Adding More Metadata to NIST MatDL Deep-Indexed into the MDF Can be used for scripting 27
  • 28. Another benefit: domain-specific querying Example service possible with DFT data files Answer questions like: “Do we have any data about anatase-TiO2?” “Who else has studied Li-MnO3 batteries with DFT?” Crystal Structure File .cif, VASP, etc. Entries from MDF that are structurally-similar 28
  • 29. Skluma: A Statistical Learning Pipeline for Taming Unkempt Data Repositories 29 doi:10.1145/3085504.3091116 Goal: Build intelligent search indexes with minimal human effort Method: Employ machine learning to extract metadata from file repositories - Classify data files - Detect file types Tyler Skluzacek Search Otherwise-Unusable Data Repositories
  • 30. MDF Forge python package (under development) • Interface to MDF services • Helper functions for common tasks APIs, Automation, and Examples https://github.com/materials-data-facility/forge 30 Tools for using these capabilities will be available soon
  • 32. Computable Data Reproducing data-driven science should be trivial It often is not. Common problems: § If available, datasets lack documentation § Algorithms/methods are not open sourced § Models rarely published § Software installation/configuration require expertise Our goal: Simplify publishing data-driven science - Storing software and models - Integrating them with compute resources 32
  • 33. Integrating analytics tools with MDF MATIN (GT) ~ 10 datasets Used in education Result: Scientists connected with data, analytics tools, and compute capability MDF Data Publication MATIN (GT) MML Repository (NIST) Materials Commons (UM PRISMS) Coherent X-Ray Tomography Database (LNL) To End Users To End Users To Compute Resources From Data Repositories Jetstream is a self-provisioned, scalable science and engineering cloud environment operated by Indiana University for the National Science Foundation: jetstream-cloud.org
  • 34. Building a machine learning model using MDF A simple web service to train ML forcefields 34
  • 35. 35 Building a machine learning model using MDF
  • 36. Building a machine learning model using MDF Example: Building force-field potentials from different datasets Data resources: 3 DFT datasets with Aluminum data (1 dataset from khazana.uconn.edu, 2 datasets from materialsdata.nist.gov) Method: Botu et al. JPCC. (2017) Result: Improved performance by integrating data sources Plot panels: Using only original data (Training Set, Holdout Set)
  • 37. Building a machine learning model using MDF Example: Building force-field potentials from different datasets Data resources: 3 DFT datasets with Aluminum data (1 dataset from khazana.uconn.edu, 2 datasets from materialsdata.nist.gov) Method: Botu et al. JPCC. (2017) Result: Improved performance by integrating data sources Plot panels: Using only original data (Training Set, Holdout Set); Including Diffusion Dataset
  • 38. Building a machine learning model using MDF Example: Building force-field potentials from different datasets Data resources: 3 DFT datasets with Aluminum data (1 dataset from khazana.uconn.edu, 2 datasets from materialsdata.nist.gov) Method: Botu et al. JPCC. (2017) Result: Improved performance by integrating data sources Plot panels: Using only original data (Training Set, Holdout Set); Including Diffusion Dataset; Including 𝐷 + 𝑇# Dataset Better performance in original application: No new DFT calculations
  • 39. • Summer Intern (Jiming Chen) reproducing and extending materials and ML papers with the MDF • Joined our team with the NSF WholeTale project Reproducing data-driven MSE with MDF Users publish data to the MDF… … and code to WholeTale Long-term goals: - Assemble community-driven resource for ML tools/examples - Use MDF/WholeTale to create benchmark challenges Jiming Chen (UIUC) 39
  • 40. Replicating Ward et al. 2016 40
  • 41. • Publish and share models and code linked with full training datasets • Link database with HPC/Cloud computing resources • Provide uniform interface for training, running models DLHub: Advancing Deep Learning Adoption
  • 42. INCREASING VALUE OF DATA
  • 43. What is the MDF? Distributed data storage: databases, datasets, APIs, LIMS, etc. Data publication service: publish, mint DOIs, associate metadata Data discovery service: deep indexing, query, browse, aggregate
  • 44. Data publication service 44 • Mechanisms to create and enforce schemas and logical collections • Web UI to create datasets and manage curation and admin tasks • Tools to automate publication process • Dataset record permanent landing page for DOI link • Record shows some metadata links to the rest • Direct link to underlying files • Download statistics
  • 45. Published Data Highlights ~ 30 datasets ~ 6.5 TB MATIN (GT) ~ 10 datasets Used in education • X-ray Scattering Image Classification Using Deep Learning, Yager et al. http://dx.doi.org/10.18126/M2Z30Z • Electron Backscattering and Diffraction Datasets for Ni, Mg, Fe, Si, Marc De Graef et al. • Phase Field Benchmark I Dataset, Jokisaari et al. • Grain Structure, Grain-averaged Lattice Strains, and Macro-scale Strain Data for Superelastic Nickel-Titanium Shape Memory Alloy Polycrystal Loaded in Tension, Paranjape et al. • Largest dataset to date (>1.5 TB). Showcases MDF unique capabilities and makes a unique dataset discoverable for code development, analysis, and benchmarking
  • 47. Streamline & automate data publication Data volumes: 12.5 TB; 12.4 TB out Publication: authors 94, institutions 14, accesses >1000, total datasets 50, CHiMaD datasets 16 Pipeline: CHiMaD datasets +14, total datasets +30
  • 48. Advantages of Globus Publish Capable of handling large datasets § Publish data in place § Integration with Globus Transfer/HTTPS Deep indexing of materials-specific metadata § Parse common materials data types § Make data searchable on the file-level Automatically re-publishing data elsewhere § Publishing dataset metadata to MRR, Google Scholar, etc. § Sending fine-grained metadata to other databases (e.g., Citrine) In Progress: Know how often your data is used § Track when it is used in analytics tools 48 All of these capabilities increase the value of your data
  • 49. Why create the MDF? http://materialsdatafacility.org 1. Make your data shareable: custom access control, using institution credentials 2. Make your data open: access to >100 TB of storage space 3. Make your data accessible: search across distributed resources; automatic, domain-specific metadata extraction 4. Make your data computable: tight integration with computing resources 5. Make your data valuable: citable with DOIs, measured with usage stats
  • 50. Thanks to our sponsors! U.S. Department of Energy
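
The record shown on slide 19 (a band gap value pushed from an MDF dataset to Citrine) can be illustrated with a short Python sketch. It only assembles and serializes a dictionary with the same fields as the slide; the helper name `make_band_gap_record` is hypothetical, and the code makes no claim about how the MDF bot or the Citrine ingest actually works.

```python
import json


def make_band_gap_record(formula, band_gap_ev):
    """Build a dict with the same fields as the slide-19 example record.

    Hypothetical helper for illustration only; it is not part of any
    MDF or Citrine library.
    """
    return {
        "category": "system.chemical",
        "chemicalFormula": formula,
        "properties": {
            "units": "eV",
            "name": "Band gap",
            "scalars": [{"value": band_gap_ev}],
        },
    }


# Values copied from the slide.
record = make_band_gap_record("MgO2", 7.8)

# Serialize to JSON, the form in which the record appears on the slide.
payload = json.dumps(record, indent=2)
print(payload)
```

In a pipeline like the one the slide outlines, a parser would presumably supply the formula and band gap from each DFT output file, and the serialized payload would be what gets pushed downstream.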