|
From Maslow’s Hierarchy to Knowledge Graphs:
Experiments in Big and Small Data at Elsevier
Anita de Waard, a.dewaard@elsevier.com
VP Research Data Management, Elsevier
Charleston Conference, November 4, 2016
| 2
Big Data vs. Small Data: What Will I Be Talking About?
Data Type     Small                             Big
User          UX                                User analytics
Performance   Pure                              SciVal
Research      Research Data Management (RDM)    HPC systems (HEP, astronomy, etc.)
Text          Text mining                       Knowledge Graphs
Health        Medical systems                   Precision Medicine
Legend: Elsevier does / I will talk about
|
When You Leave Your Institution, What Happens To Your Data?
[Chart categories: Stays at institution; Take it with me; Don’t know; Data is lost; Other]
Source: Bauer, B. et al. (2015), ‘Forschende und ihre Daten. Ergebnisse einer österreichweiten Befragung’ [Researchers and their data: results of an Austria-wide survey] (eBook, in German),
E-infrastructures Austria, https://phaidra.univie.ac.at/detail_object/o:407736
|
When we talk about data, we really talk about the following:
Machine & environment settings
Raw data
Processed data
Scripts & analyses
Protocols, methods, algorithms
Accessibility
Reproducibility
Reusability
Discoverability
Note: images for illustrative purpose only
4
|
https://www.elsevier.com/connect/10-aspects-of-highly-effective-research-data
A Maslow Hierarchy for Research Data:
|
Preserve Process: Hivebench (http://www.hivebench.com)
|
Preserve Data: Mendeley Data (https://data.mendeley.com/)
• Linked to published papers – or not
• Linked to GitHub – or not
• Versioning and provenance
|
Share and Comprehend: SoftwareX (http://www.journals.elsevier.com/softwarex/)
|
Access: Linking papers to data: www.Scholix.org
• ICSU/WDS/RDA Publishing Data Service Working Group
• Creating a linked-data model for exposing DOI-to-DOI links outside the publisher’s firewall
• Merged with a National Data Service pilot with the same goal
• Collaboration between CrossRef, DataCite, Europe PubMed Central, ANDS, Thomson Reuters, Elsevier, OpenAIRE
Objective: move from a plethora of (mostly) bilateral arrangements between the different players to a one-for-all cross-referencing service for articles and data.
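As an illustration of the move from bilateral arrangements to a single cross-referencing service, here is a minimal Python sketch. The `Link` record, its field names, the `build_index` helper and all DOIs are hypothetical; this is not the Scholix schema or API. It only shows per-provider link lists being merged into one article-to-dataset lookup.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Link:
    """One article-to-dataset link (hypothetical record, not the Scholix schema)."""
    article_doi: str
    dataset_doi: str
    provider: str  # who contributed the link, e.g. a publisher or a repository


def build_index(*provider_feeds):
    """Merge any number of per-provider link lists into one lookup table.

    DOIs are compared case-insensitively, so they are lower-cased before
    being used as keys or values.
    """
    index = {}
    for feed in provider_feeds:
        for link in feed:
            key = link.article_doi.lower()
            index.setdefault(key, set()).add(link.dataset_doi.lower())
    return index


# Made-up feeds from two hypothetical providers describing the same article.
publisher_feed = [Link("10.1234/ARTICLE.1", "10.5678/data.a", "publisher")]
repository_feed = [
    Link("10.1234/article.1", "10.5678/data.b", "repository"),
    Link("10.1234/article.2", "10.5678/data.c", "repository"),
]

index = build_index(publisher_feed, repository_feed)
print(sorted(index["10.1234/article.1"]))
```

With a shared index, a reader of either feed sees both datasets for the first article, which is the practical difference between many bilateral exchanges and one common service.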
|
Discover: Data Search (http://datasearch.elsevier.com)
DataSearch.Elsevier.com
1. Across repositories
2. (Deep) indexing of data, so not just metadata
3. Data preview
|
https://www.elsevier.com/connect/10-aspects-of-highly-effective-research-data
A Maslow Hierarchy for Research Data:
Data at Risk
Reproducibility Papers
|
TOWARDS AN ELSEVIER KNOWLEDGE GRAPH
GOAL: IDENTIFY ENTITIES AND RELATIONSHIPS ACROSS THE ENTIRE ELSEVIER CORPUS IN SCIENCEDIRECT
TEXT MINING + ENTITY IDENTIFICATION, USING OUR TAXONOMIES (EMMET, COMPENDEX, AND OTHERS)
UNSUPERVISED, SCALABLE AND BUILT WITH OFF-THE-SHELF TECHNOLOGIES
COLLABORATION WITH UNIVERSITY COLLEGE LONDON AND UMASS AMHERST [1]
[Pipeline diagram labels: Content; Universal schema; Surface form relations; Structured relations; Factorization model; Matrix Construction; Open Information Extraction; Entity Resolution; Matrix Factorization; Knowledge graph; Curation; Predicted relations; Matrix Completion; Taxonomy; Triple Extraction]
[Diagram annotations: 14M articles from ScienceDirect; 3.3M triples; 475M triples; 49M triples; p x r matrix; p x k, k x r latent factor matrices; ~102 triples; 920K concepts from EMMeT]
[1] Riedel, S., L. Yao, A. McCallum, and B. M. Marlin (2013).
"Relation extraction with matrix factorization and universal
schemas", http://www.aclweb.org/anthology/N13-1008
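The "p x r matrix" and the "p x k, k x r latent factor matrices" named on this slide can be illustrated with a small Python sketch. It rests on several assumptions: the entity pairs, relation names and matrix values are toy placeholders, and a truncated SVD is swapped in for brevity where the cited paper [1] trains a ranking-based factorization model. It is not Elsevier's pipeline.

```python
import numpy as np

# Rows are entity pairs (p = 4), columns are relations (r = 4).
# All names and values are toy placeholders, not data from the talk.
pairs = ["(a, b)", "(c, d)", "(e, f)", "(g, h)"]
relations = ["surface: causes", "surface: leads to",
             "structured: has_cause", "surface: located in"]

# 1 = relation observed for that pair, 0 = not observed.
M = np.array([
    [1, 1, 1, 0],
    [1, 1, 1, 0],
    [1, 1, 0, 0],   # structured relation missing for pair (e, f)
    [0, 0, 0, 1],
], dtype=float)


def factorize(matrix, k):
    """Return P (p x k) and R (k x r) whose product is the best rank-k
    approximation of `matrix`, computed with a truncated SVD."""
    u, s, vt = np.linalg.svd(matrix, full_matrices=False)
    root = np.sqrt(s[:k])
    P = u[:, :k] * root            # scale each latent column
    R = root[:, None] * vt[:k, :]  # scale each latent row
    return P, R


P, R = factorize(M, k=2)
scores = P @ R  # completed matrix: a score for every (pair, relation) cell

# The unobserved cell for pair (e, f) and "has_cause" receives a clearly
# positive score because that pair shares surface relations with pairs that
# do have the structured relation; the unrelated column stays near zero.
print(round(float(scores[2, 2]), 3), round(float(scores[2, 3]), 3))
```

Cells that were unobserved but score highly after completion correspond to the "predicted relations" in the diagram, which would then go to curation.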
|
Deduplication/normalization: downsampled from 49M entity-resolved triples
SAMPLE OUTPUT (subject | relation | object):
glaucoma | developed many years after | chronic inflammation of uveal tract
glaucoma | develop following | chronic inflammation of uveal tract
glaucoma | can appear soon in | family history of glaucoma
glaucoma | can appear soon in | age over 40
glaucoma | the risk of | functional visual field loss
glaucoma | contributing causes of | functional visual field loss
glaucoma | contributed to | functional visual field loss
glaucoma | is considered the second leading cause of | functional visual field loss
glaucoma | remains the second leading cause of | functional visual field loss
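A rough idea of what deduplication and normalization of such triples can involve is sketched below in Python. The `normalize` and `group_triples` helpers are hypothetical and are not the method used in the talk; the second input triple is a made-up spelling variant added to show a duplicate being collapsed.

```python
from collections import defaultdict


def normalize(text):
    """Lower-case and collapse whitespace so trivial variants compare equal."""
    return " ".join(text.lower().split())


def group_triples(triples):
    """Group relation phrases by (subject, object), dropping exact repeats
    that remain after normalization. Insertion order is preserved."""
    grouped = defaultdict(list)
    for subj, rel, obj in triples:
        key = (normalize(subj), normalize(obj))
        rel_n = normalize(rel)
        if rel_n not in grouped[key]:
            grouped[key].append(rel_n)
    return dict(grouped)


triples = [
    ("glaucoma", "contributed to", "functional visual field loss"),
    ("Glaucoma", "contributed  to", "functional visual field loss"),  # made-up variant
    ("glaucoma", "remains the second leading cause of", "functional visual field loss"),
    ("glaucoma", "can appear soon in", "age over 40"),
]

grouped = group_triples(triples)
for (subj, obj), rels in grouped.items():
    print(subj, "->", obj, ":", rels)
```

String normalization of this kind only removes trivial duplicates; recognizing that "contributed to" and "contributing causes of" express the same relation needs a learned model such as the factorization approach cited on the previous slide.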
|
Knowledge Graphs for the Life Sciences:
Bradley Allen, DC Conference, Oct 2016,
http://www.slideshare.net/bpa777/dc2016-keynote-20161013-67164305/15
| 15
Trends driving Digital Health & Precision Medicine: need for health data with consent
1. Genetic testing
   • 4,500 tests for gene disorders available (2013: 3,200; +20% CAGR)
   • $1,245 cost to sequence a full genome (10/2014: $5,730)
   • $199 cost of a 23andMe test
2. Information explosion
   • 25 million biomed articles referenced on PubMed
   • 1.2 million new biomed articles p.a.
3. Patient data
   • 76% of US hospitals use at least a basic EMR
   • 130 million patient data sets at a large insurer: 21 million complete for the last 2 years, 7 million with clinical and lab data
   • NB: 6 million (no clinical or lab data) in Germany; 6.5 million in Catalonia
4. Biosensors - IoT in health
   • 105 mm ECG: high ECG quality, heart rate, respiratory, body temp, activity, body position, water tight, induction charged, Bluetooth, continuous data feed
5. Machine learning
   • 30 days → 1 hour (manual to machine learning): time needed to develop one prediction model at Elsevier
6. Patient empowerment
   • PatientsLikeMe has 400,000+ members; 31 million data points covering 2,500+ conditions, donating data
| 16
The Elsevier Medical Graph is a deep predictive model
that relates attributes of over 2000 medical conditions
to phenotypes of patients at potential risk of re-admission.
Probability of occurrence within the next five years. 2,083 ICD-10 conditions.
Based on a 6-year longitudinal history of 6 million German patients.
| 17
Big Data vs. Small Data: What Did I Talk About?
Data Type     Small                             Big
User          UX                                User analytics
Performance   Pure                              SciVal
Research      Research Data Management (RDM)    HPC systems (HEP, astronomy, etc.)
Text          Text mining                       Knowledge Graphs
Health        Medical systems                   Precision Medicine
Legend: Elsevier does / I discussed!
|
Thank you!
18
Anita de Waard, VP Research Data Collaborations,
Elsevier RDM Services
Jericho, VT 05465
a.dewaard@elsevier.com
