Dunning - SIGMOD - Data Economy.pptx

Ted Dunning, Software Engineer at MapR Technologies
FROM ROOTS TO FRUITS: EXPLORING LINEAGE FOR DATASET RECOMMENDATIONS
Ted Dunning, Fellow, HPE
18 June, 2023
the meaning of words lies in their use
the meaning of data lies in its use
(apologies to Dr. Wittgenstein)
A meteorologist’s data
- rainfall
- windspeed
- temperature
A business uses the data to predict umbrella sales.
What does the data actually mean?
the meaning of data lies in its use
TRAINING PROCESS
[Diagram: README, URL, History, Datasets + Models, Metadata]
We start with explicit metadata. Examples: column and table names, documentation, common values, and others.
This is encoded as a large artifact × characteristic incidence table.
At this point, direct metadata search is possible.
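As a rough illustration of this step, the sketch below builds a sparse artifact-by-characteristic incidence table and runs a direct metadata search over it. The artifact names, the characteristics, and the helper names `build_incidence` and `search` are all hypothetical; the slides do not describe the actual encoding.

```python
from collections import defaultdict

# Hypothetical explicit metadata: artifact name -> characteristics (tokens
# drawn from column names, documentation, common values, and so on).
metadata = {
    "climate_monthly.csv": {"rainfall", "windspeed", "temperature"},
    "umbrella_sales.csv": {"umbrella", "sales", "store"},
    "dengue_monthly.csv": {"dengue", "cases", "month"},
}

def build_incidence(metadata):
    """Return a sparse incidence table as characteristic -> set of artifacts."""
    table = defaultdict(set)
    for artifact, characteristics in metadata.items():
        for c in characteristics:
            table[c].add(artifact)
    return table

def search(table, query_terms):
    """Rank artifacts by how many query terms they carry directly."""
    scores = defaultdict(int)
    for term in query_terms:
        for artifact in table.get(term, ()):
            scores[artifact] += 1
    # Highest score first; ties broken by artifact name for determinism.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))

incidence = build_incidence(metadata)
results = search(incidence, ["rainfall", "temperature", "sales"])
```

Note that a direct search for "umbrella" finds only `umbrella_sales.csv` here; nothing yet connects it to the climate data.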
We augment with metadata from all ancestors and descendants in the global data lineage graph.
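A minimal sketch of lineage augmentation, assuming the lineage graph is available as a list of (input, output) edges. The edges, the artifact names, and the helpers `reachable` and `augment` are invented for illustration and are not taken from the talk.

```python
from collections import defaultdict, deque

# Hypothetical lineage edges: (input artifact, output artifact).
edges = [
    ("climate_monthly.csv", "forecast_model"),
    ("umbrella_sales.csv", "forecast_model"),
    ("forecast_model", "sales_forecast.csv"),
]
# Hypothetical explicit metadata for each artifact.
own = {
    "climate_monthly.csv": {"rainfall", "windspeed", "temperature"},
    "umbrella_sales.csv": {"umbrella", "sales"},
    "forecast_model": set(),
    "sales_forecast.csv": {"forecast"},
}

def reachable(start, neighbors):
    """Breadth-first search: every node reachable from start via neighbors."""
    seen, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for nxt in neighbors[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

def augment(own, edges):
    """Add the characteristics of all ancestors and descendants to each artifact."""
    children, parents = defaultdict(set), defaultdict(set)
    for src, dst in edges:
        children[src].add(dst)
        parents[dst].add(src)
    augmented = {}
    for artifact, chars in own.items():
        related = reachable(artifact, parents) | reachable(artifact, children)
        merged = set(chars)
        for r in related:
            merged |= own.get(r, set())
        augmented[artifact] = merged
    return augmented

augmented = augment(own, edges)
```

In this toy graph, `forecast_model` and `sales_forecast.csv` end up carrying both "umbrella" and "rainfall", which is what lets the later cooccurrence step relate the two.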
Finally, we reduce the characteristic cooccurrences using indicator-based recommendation methods.
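The slides do not say which indicator-based method is used. One widely used choice is a log-likelihood ratio (G²) test on cooccurrence counts, and the sketch below assumes that choice. The threshold, the positive-association filter, the toy data, and the function names are all assumptions made for illustration.

```python
import math
from itertools import combinations

def llr(k11, k12, k21, k22):
    """G^2 statistic for a 2x2 contingency table (0 * log 0 treated as 0)."""
    n = k11 + k12 + k21 + k22
    rows = (k11 + k12, k21 + k22)
    cols = (k11 + k21, k12 + k22)
    cells = ((k11, 0, 0), (k12, 0, 1), (k21, 1, 0), (k22, 1, 1))
    total = 0.0
    for k, i, j in cells:
        if k > 0:
            total += k * math.log(k * n / (rows[i] * cols[j]))
    return 2.0 * total

def indicators(artifacts, threshold=3.0):
    """Keep characteristic pairs that cooccur anomalously often.

    artifacts: artifact -> set of (augmented) characteristics.
    Returns characteristic -> set of indicator characteristics.
    """
    n = len(artifacts)
    count, pair = {}, {}
    for chars in artifacts.values():
        for c in chars:
            count[c] = count.get(c, 0) + 1
        for a, b in combinations(sorted(chars), 2):
            pair[(a, b)] = pair.get((a, b), 0) + 1
    out = {}
    for (a, b), k11 in pair.items():
        k12 = count[a] - k11          # a without b
        k21 = count[b] - k11          # b without a
        k22 = n - k11 - k12 - k21     # neither
        # Assumption: keep only positive associations above the threshold.
        if k11 * k22 > k12 * k21 and llr(k11, k12, k21, k22) >= threshold:
            out.setdefault(a, set()).add(b)
            out.setdefault(b, set()).add(a)
    return out

# Toy data: four weather-related artifacts and four unrelated ones.
toy = {}
for i in range(4):
    toy[f"weather_{i}"] = {"umbrella", "rainfall"}
    toy[f"finance_{i}"] = {"invoice"}
found = indicators(toy)
```

The result is a much smaller, sparse table than the raw cooccurrence counts: only pairs that pass the test survive as indicators.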
A NOTE ON IMPLICATIONS
The characteristic indicator matrix is what connects “umbrella” with “rainfall” or “mosquito” with “temperature” + “windspeed”.
QUERY PROCESS
The original query is often textual, possibly a README, augmented by recent project behavior (queries, references).
The query is expanded based on indicators (when they say “umbrellas” they also mean “rainfall”) as well as semantic token embedding using BERT.
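The indicator half of this expansion can be sketched as a simple lookup. The sketch assumes the indicators are stored as a mapping from a characteristic to its related characteristics; the mapping contents and the name `expand_query` are hypothetical, and the BERT embedding half is left out entirely.

```python
def expand_query(terms, indicators):
    """Expand query terms with their indicator characteristics.

    Returns a dict mapping each expanded term to the original term that
    introduced it, so that results can later be explained.
    """
    expanded = {}
    # Original terms come first and explain themselves.
    for t in terms:
        expanded.setdefault(t, t)
    # Then add indicator terms, remembering which query term brought them in.
    for t in terms:
        for related in sorted(indicators.get(t, ())):
            expanded.setdefault(related, t)
    return expanded

# Hypothetical indicator table produced by the training process.
indicator_table = {
    "umbrellas": {"rainfall"},
    "mosquito": {"temperature", "windspeed"},
}
expanded = expand_query(["umbrellas", "sales"], indicator_table)
```

Keeping the originating term alongside each expanded term is a design choice made here so that an explanation can point back to what the user actually asked for.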
Recommendations        Explanation
positives.csv          “dengue” ancestor
notpositives.csv       “dengue” ancestor
SARIMA_model           “dengue” ancestor
dengue_monthly.csv     “dengue”
climate_monthly.csv    “wind speed”
The final results include an explanation of why files or programs are included.
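One possible reading of explanations such as “dengue” ancestor is that the query term matched metadata on an ancestor of the artifact rather than on the artifact itself. The sketch below assumes that reading. The lineage shown, the per-artifact metadata, and the function `explain` are invented for illustration and are not taken from the talk.

```python
def explain(query_terms, own, parents):
    """Return artifact -> explanation for artifacts that match a query term.

    own:     artifact -> set of its own characteristics
    parents: artifact -> list of direct inputs (hypothetical lineage)
    """
    def ancestors(artifact):
        seen, stack = set(), list(parents.get(artifact, ()))
        while stack:
            p = stack.pop()
            if p not in seen:
                seen.add(p)
                stack.extend(parents.get(p, ()))
        return seen

    out = {}
    for artifact in own:
        for term in query_terms:
            if term in own[artifact]:
                out[artifact] = f'"{term}"'            # direct match
                break
            if any(term in own.get(p, ()) for p in ancestors(artifact)):
                out[artifact] = f'"{term}" ancestor'   # matched via lineage
                break
    return out

# Hypothetical metadata and lineage, assumed only for this example.
own = {
    "dengue_monthly.csv": {"dengue"},
    "climate_monthly.csv": {"wind speed"},
    "positives.csv": set(),
    "notpositives.csv": set(),
    "SARIMA_model": set(),
}
parents = {
    "positives.csv": ["dengue_monthly.csv"],
    "notpositives.csv": ["dengue_monthly.csv"],
    "SARIMA_model": ["dengue_monthly.csv"],
}
explanations = explain(["dengue", "wind speed"], own, parents)
```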
EVALUATION
• Evaluation is difficult due to a lack of public datasets
• Most machine learning examples are truncated to final steps
• Very few non-machine learning pipelines exist outside of toy examples
• Private datasets generally cannot be shared
• Still important to use when possible due to scale
• Evaluation of recommendation engines is a subtle art
• Their purpose is to change behaviors
• Today’s recommendations select tomorrow’s training data
• We aren’t at this point yet; reaching it would be a symptom of success
THANK YOU
ted.dunning@hpe.com
@ted_dunning
@ted_dunning@mastodon.social