SlideShare une entreprise Scribd logo
1  sur  60
Télécharger pour lire hors ligne
Big Data Modeling
Today:
Marc C. Hadfield, Founder

Vital AI

http://vital.ai
marc@vital.ai
917.463.4776
intro
Marc C. Hadfield, Founder Vital AI

http://vital.ai

marc@vital.ai
Big Data Modeling is

Data Modeling with the
“Variety” Big Data Dimension
in mind…
Big Data “Variety” Dimension
The “Variety” problem can be addressed by a
combination of improved tools and a methodology
involving both system architecture and

data science / analysis.
Compared to Volume and Velocity, Variety is a very
labor -intensive human-centric process.
Variety is the many types of data to be utilized
together in a data-driven application.

Potentially too many types for any single person to
keep track of (especially in Life Sciences).
Key Takeaways:
Using OWL as a “meta-schema” can drastically
reduce operations/development effort and
increase the value of the data for analysis.
OWL can augment and not replace familiar
development processes and tools.
A huge amount of ongoing development effort is
spent transforming data across components and
keeping data consistent during analysis.
Collecting Good Data = Good Analytics
Big Data Modeling:
Challenges
Goals
OWL as Modeling Language
Using OWL-based Models…
Collaboration/Modeling Tools
Examples from NYC Department of Education:
Domain Ontology
Application Architecture
Development Methodology/Tools
NYC Department of Education:
Data Architecture
Data Science
Data Models in:
Challenges
Mobile/Web App Architecture
Data Model Data Model
Mobile App
Server
Implementation Database
Database
Master Database
"Data Lake"
Database
Database
Database
Business Intelligence
Data Analytics
Dashboard
Enterprise DataWarehouse Architecture
Schema “on read” or “on write”
Data Model
Data Model
Data Model
Data Model
Data Model
ETL Process
MobileApp
Server Layer
Real Time Data
Calculated Views
Hadoop
Predictive Analytics
Master Database
"Data Lake"
Business Intelligence
Data Analytics
Dashboard
Lambda Architecture + Hadoop: Data Driven App
Data Model
Data Model
Data Model
Data ModelData Model
Data Wrangling / Data Science Master Database
"Data Lake"
Business Intelligence
Data Analytics
Raw Data
R
Data Model
Data Model
Data Model
Prediction Models must integrate back with
production environment:
Same Data, Difference Contexts…
Redundant Models.
Data Architecture Issues
{
Database Schema

JSON Data

Data Object Classes

Avro/Parquet
Redundant Data Definitions:
Considerable Development / Maintenance / Operational Overhead
Data Science / Data Wrangling Issues


Data Harmonization: Merging Datasets from Multiple Sources



Loss of Context: Feature f123 = Column135 X Column45 / Column13

Side note: Let’s stop using CSV files for datasets!

No more flat datasets!
Goals
Goals:
Reduce redundancy in Data Definitions
Enforce Clean/Harmonized Data

Use Contextual Datasets
Use Best Software Components (Databases, Analytics, …)
Use Familiar Tools (IDE, git, Languages, R)
OWL as Modeling Language
Web Ontology Language (OWL)
Specifies an Ontology (“Data Model”)
Formal Semantics, W3C Standard
Provides a language to describe the meaning of data
properties and how they relate to classes.
Example: Mammal

Necessary Conditions: warm-blooded, vertebrate animal,
has hair or fur, secrets milk, (typically) live birth
Greater descriptive power than Schema (SQL Tables) and
Serialization Frameworks (Avro)
Why OWL?
If we can more formally specify what the data *means*,
then we can have a single data model (ontology) apply to
our entire architecture, and data can be transformed
automatically locally as per the needs of a specific
software module.
Manually coded data transforms may be “lossy” and/or
introduce errors, so eliminating them helps keep data
clean.
Why OWL? (continued)
Example: if we specify what a “Document” is, then a text-
mining analyzer will know how to access the textual data
without further prompting.
Example: if we specify Features for use in Machine
Learning in the ontology, then features can be generated
automatically to train Machine Learning Models, and the
same features would be generated when we use the model
in production.
Why OWL? (continued)
Note: As ontologies can extend other ontologies, rather
than a single ontology, a collection of linked ontologies
can be used, allowing segmentation across an
organization.
Vital Core Ontology
Protege Editor…
Nodes, Edges, HyperNodes, HyperEdges get URIs
John/WorksFor/IBM —> Node / Edge / Node
Vital Core
Ontology
Vital Domain
Ontology
Application
Domain Ontology
Extending the Ontology
NYC Dept of Education Domain Ontology
Generating Data Bindings with VitalSigns:
Ontology VitalSigns
Groovy Bindings
Semantic Bindings
Hadoop Bindings
Prolog Bindings
Graph Bindings
HBase Bindings
JavaScript Bindings
Code/Schema Generation
vitalsigns generate -ont name…
person123.name = "John"
person123.worksFor.company456
<person123> <hasName> "John"
<worksFor123> <hasSource> <person123>
<worksFor123> <hasDestination> <company456>
<worksFor123> <hasType> <worksFor>
person123, Node:type=Person, Node:hasName="John"
worksFor123, Edge:type=worksFor, Edge:hasSource=person123, Edge:hasDestination=company456
Groovy
RDF
HBase
Data Representations
VitalSigns
Generation —> JAR Library
Runtime
Domain Ontology
Domain Ontology
Domain Ontology
Domain Ontology
VitalSigns
Class
Using OWL-based Models
Developing with the Ontology in UI, Hadoop, NLP, Scripts, ...
Node:Person Node:PersonEdge:hasFriend
Set<Friend> person123.getFriends()
Eclipse IDE
// Reference to an NYCSchool object
NYCSchool school123 = … // get from database

!
// Get a list of programs, local context (cache)
List<NYCSchoolProgram> programs = school123.getPrograms()
!
// Get list of programs, global context (database)
List<NYCSchoolProgram> programs =
school123.getPrograms(Context.ServiceWide)
!
JVM Development
Using JSON-Schema Data in JavaScript
for(var i = 0 ; i < progressReports.length; i++) {	

	

 	

 	

 	

	

 var r = progressReports[i];	

	

 	

 	

 	

	

 var sub = $('<ul>');	

	

 	

 	

 	

	

 sub.append('<li>Overall Grade : ' + r.progReportOverallGrade + '</li>');	

	

 sub.append('<li>Progress Grade: ' + r.progReportProgressGrade + '</li>');	

	

 sub.append('<li>Environment Grade: ' + r.progReportEnvironmentGrade + '</li>');	

	

 sub.append('<li>College and Career Readiness Grade: ' + r.progRepCollegeAndCareerReadinessGrade+ '</li>');	

	

 sub.append('<li>Performance Grade: ' + r.progReportPerformanceGrade+ '</li>');	

	

 sub.append('<li>Closing the Achievement Gap Points: ' + r.progReportClosingTheAchievementGapPoints+ '</li>');	

	

 sub.append('<li>Percentile Rank: ' + r.progReportPercentileRank + '</li>');	

	

 sub.append('<li>Overall Score: ' + r.progReportOverallScore + '</li>');	

	

 	

 	

 	

	

 }
NoSQL Queries
Query API / CRUD Operations

!
Queries generated into “native” NoSQL Query format:
Sparql / Triplestore (Allegrograph)
HBase / DynamoDB
MongoDB
Hive/HiveQL (on Spark/Hadoop2.0)
Query Types: “Select” and “Graph”
Abstract type of datastore from application/analytics code
Pass in a “native” query when necessary
Data Serialization, Analytics Jobs
Data Serialized into file format by blocks of objects
Leverage Hadoop Serialization Standards:
Sequence File, Avro, Parquet
Get data in and out of HDFS Files
Spark/Hadoop jobs passed a set of objects as input

URI of object is key
Data Objects are serialized into Compressed Strings for
transport over Flume, etc.
Machine Learning
Via Hadoop, Spark, R
Mahout, MLLib
Build Predictive Models
Classification, Clustering...
Use Features defined in
Ontology
Learn Target defined in
Ontology
Models consume Ontology
Data as input
Natural Language Processing/Text Mining
Topic Categorization…
Extract Entities… Text Features from Ontology
Classes extending Document…
Graph Analytics
GraphX, Giraph: PageRank,
Centrality, Interest Graph, …
Inference / Rules
Use Semantic Web Rule Engines / Reasoners

!
Load Ontology + RDF Representation of Data Instances (Individuals)
R Analytics
Load Serialized Data into R Dataframes
!
Reference Classes and Properties by Name in Dataframes
(cleaner code than huge number of columns)
Graph Visualization with Cytoscape
Data already in Node/Edge Graph Form
Graph Visualization with Cytoscape
Visualize Data “Hot Spots”
NYC Schools Architecture
Mobile App
JSON
Schema
VertX
Vital Flow Queue
Rule
Engine
NLP
DynamoDB
Vital Prime
VitalService
Client
NYC Schools
Data Model
R
Serialized Data
Data
Insights
Collaboration & Tools
Collaboration/Tools
git - code revision system
OWL Ontologies treated as code artifact
Coordinate across Teams:

“Front End”, “Back End”, “Operations”,
“Business Intelligence”, “Data Science”…
Coordinate across Enterprise:
Departments / Business Units
“Data Model of Record”
Ontology Versioning
NYCSchoolRecommendation-0.1.8.owl
Semantic Versioning (http://semver.org/)
vitalsigns command line
vitalsigns generate
vitalsigns upversion/downversion
code/schema generation
increase version patch number

move previous version to archive

rename OWL file including username
JAR files pushed into Maven

(Continuous Integration)
Git Integration
git: add, commit, push, pull
diff: determine differences
merge: merge two Ontologies
detect types of Ontology changes
merge into new patch version
OWL as Data Modeling Language:

Data Architecture & Data Science / Analytics
Conclusions
Leverage Existing Tools, Components
Reduce model redundancy, reduce effort.
A Means to Collaborate Across Teams: Data Model of Record
Cleaner Data
Integrate additional analysis
For more information, please
contact:
Marc C. Hadfield
http://vital.ai
marc@vital.ai
917.463.4776
Thank You!
Vital AI: Big Data Modeling
Vital AI: Big Data Modeling
Vital AI: Big Data Modeling

Contenu connexe

Tendances

Multi-model database
Multi-model databaseMulti-model database
Multi-model databaseJiaheng Lu
 
Semantic Graph Databases: The Evolution of Relational Databases
Semantic Graph Databases: The Evolution of Relational DatabasesSemantic Graph Databases: The Evolution of Relational Databases
Semantic Graph Databases: The Evolution of Relational DatabasesCambridge Semantics
 
NoSQL: what does it mean, how did we get here, and why should I care? - Hugo ...
NoSQL: what does it mean, how did we get here, and why should I care? - Hugo ...NoSQL: what does it mean, how did we get here, and why should I care? - Hugo ...
NoSQL: what does it mean, how did we get here, and why should I care? - Hugo ...South London Geek Nights
 
An Introduction to Graph: Database, Analytics, and Cloud Services
An Introduction to Graph:  Database, Analytics, and Cloud ServicesAn Introduction to Graph:  Database, Analytics, and Cloud Services
An Introduction to Graph: Database, Analytics, and Cloud ServicesJean Ihm
 
Graph Analysis over JSON, Larus
Graph Analysis over JSON, LarusGraph Analysis over JSON, Larus
Graph Analysis over JSON, LarusNeo4j
 
The Graph Database Universe: Neo4j Overview
The Graph Database Universe: Neo4j OverviewThe Graph Database Universe: Neo4j Overview
The Graph Database Universe: Neo4j OverviewNeo4j
 
Tutorial@BDA 2017 -- Knowledge Graph Expansion and Enrichment
Tutorial@BDA 2017 -- Knowledge Graph Expansion and Enrichment Tutorial@BDA 2017 -- Knowledge Graph Expansion and Enrichment
Tutorial@BDA 2017 -- Knowledge Graph Expansion and Enrichment Paris Sud University
 
AnzoGraph DB: Driving AI and Machine Insights with Knowledge Graphs in a Conn...
AnzoGraph DB: Driving AI and Machine Insights with Knowledge Graphs in a Conn...AnzoGraph DB: Driving AI and Machine Insights with Knowledge Graphs in a Conn...
AnzoGraph DB: Driving AI and Machine Insights with Knowledge Graphs in a Conn...Cambridge Semantics
 
Multi-model Databases and Tightly Integrated Polystores
Multi-model Databases and Tightly Integrated PolystoresMulti-model Databases and Tightly Integrated Polystores
Multi-model Databases and Tightly Integrated PolystoresJiaheng Lu
 
Risk Analytics Using Knowledge Graphs / FIBO with Deep Learning
Risk Analytics Using Knowledge Graphs / FIBO with Deep LearningRisk Analytics Using Knowledge Graphs / FIBO with Deep Learning
Risk Analytics Using Knowledge Graphs / FIBO with Deep LearningCambridge Semantics
 
Choosing the Right Graph Database to Succeed in Your Project
Choosing the Right Graph Database to Succeed in Your ProjectChoosing the Right Graph Database to Succeed in Your Project
Choosing the Right Graph Database to Succeed in Your ProjectOntotext
 
Enterprise Data Governance: Leveraging Knowledge Graph & AI in support of a d...
Enterprise Data Governance: Leveraging Knowledge Graph & AI in support of a d...Enterprise Data Governance: Leveraging Knowledge Graph & AI in support of a d...
Enterprise Data Governance: Leveraging Knowledge Graph & AI in support of a d...Connected Data World
 
Using Cloud Automation Technologies to Deliver an Enterprise Data Fabric
Using Cloud Automation Technologies to Deliver an Enterprise Data FabricUsing Cloud Automation Technologies to Deliver an Enterprise Data Fabric
Using Cloud Automation Technologies to Deliver an Enterprise Data FabricCambridge Semantics
 
Large Scale Graph Analytics with RDF and LPG Parallel Processing
Large Scale Graph Analytics with RDF and LPG Parallel ProcessingLarge Scale Graph Analytics with RDF and LPG Parallel Processing
Large Scale Graph Analytics with RDF and LPG Parallel ProcessingCambridge Semantics
 
How Lyft Drives Data Discovery
How Lyft Drives Data DiscoveryHow Lyft Drives Data Discovery
How Lyft Drives Data DiscoveryNeo4j
 
Democratizing Data within your organization - Data Discovery
Democratizing Data within your organization - Data DiscoveryDemocratizing Data within your organization - Data Discovery
Democratizing Data within your organization - Data DiscoveryMark Grover
 
How to Build a Smart Data Lake Using Semantics
How to Build a Smart Data Lake Using SemanticsHow to Build a Smart Data Lake Using Semantics
How to Build a Smart Data Lake Using SemanticsCambridge Semantics
 
How Lyft Drives Data Discovery
How Lyft Drives Data DiscoveryHow Lyft Drives Data Discovery
How Lyft Drives Data DiscoveryNeo4j
 
Going Beyond Rows and Columns with Graph Analytics
Going Beyond Rows and Columns with Graph AnalyticsGoing Beyond Rows and Columns with Graph Analytics
Going Beyond Rows and Columns with Graph AnalyticsCambridge Semantics
 
Introduction to Anzo Unstructured
Introduction to Anzo UnstructuredIntroduction to Anzo Unstructured
Introduction to Anzo UnstructuredCambridge Semantics
 

Tendances (20)

Multi-model database
Multi-model databaseMulti-model database
Multi-model database
 
Semantic Graph Databases: The Evolution of Relational Databases
Semantic Graph Databases: The Evolution of Relational DatabasesSemantic Graph Databases: The Evolution of Relational Databases
Semantic Graph Databases: The Evolution of Relational Databases
 
NoSQL: what does it mean, how did we get here, and why should I care? - Hugo ...
NoSQL: what does it mean, how did we get here, and why should I care? - Hugo ...NoSQL: what does it mean, how did we get here, and why should I care? - Hugo ...
NoSQL: what does it mean, how did we get here, and why should I care? - Hugo ...
 
An Introduction to Graph: Database, Analytics, and Cloud Services
An Introduction to Graph:  Database, Analytics, and Cloud ServicesAn Introduction to Graph:  Database, Analytics, and Cloud Services
An Introduction to Graph: Database, Analytics, and Cloud Services
 
Graph Analysis over JSON, Larus
Graph Analysis over JSON, LarusGraph Analysis over JSON, Larus
Graph Analysis over JSON, Larus
 
The Graph Database Universe: Neo4j Overview
The Graph Database Universe: Neo4j OverviewThe Graph Database Universe: Neo4j Overview
The Graph Database Universe: Neo4j Overview
 
Tutorial@BDA 2017 -- Knowledge Graph Expansion and Enrichment
Tutorial@BDA 2017 -- Knowledge Graph Expansion and Enrichment Tutorial@BDA 2017 -- Knowledge Graph Expansion and Enrichment
Tutorial@BDA 2017 -- Knowledge Graph Expansion and Enrichment
 
AnzoGraph DB: Driving AI and Machine Insights with Knowledge Graphs in a Conn...
AnzoGraph DB: Driving AI and Machine Insights with Knowledge Graphs in a Conn...AnzoGraph DB: Driving AI and Machine Insights with Knowledge Graphs in a Conn...
AnzoGraph DB: Driving AI and Machine Insights with Knowledge Graphs in a Conn...
 
Multi-model Databases and Tightly Integrated Polystores
Multi-model Databases and Tightly Integrated PolystoresMulti-model Databases and Tightly Integrated Polystores
Multi-model Databases and Tightly Integrated Polystores
 
Risk Analytics Using Knowledge Graphs / FIBO with Deep Learning
Risk Analytics Using Knowledge Graphs / FIBO with Deep LearningRisk Analytics Using Knowledge Graphs / FIBO with Deep Learning
Risk Analytics Using Knowledge Graphs / FIBO with Deep Learning
 
Choosing the Right Graph Database to Succeed in Your Project
Choosing the Right Graph Database to Succeed in Your ProjectChoosing the Right Graph Database to Succeed in Your Project
Choosing the Right Graph Database to Succeed in Your Project
 
Enterprise Data Governance: Leveraging Knowledge Graph & AI in support of a d...
Enterprise Data Governance: Leveraging Knowledge Graph & AI in support of a d...Enterprise Data Governance: Leveraging Knowledge Graph & AI in support of a d...
Enterprise Data Governance: Leveraging Knowledge Graph & AI in support of a d...
 
Using Cloud Automation Technologies to Deliver an Enterprise Data Fabric
Using Cloud Automation Technologies to Deliver an Enterprise Data FabricUsing Cloud Automation Technologies to Deliver an Enterprise Data Fabric
Using Cloud Automation Technologies to Deliver an Enterprise Data Fabric
 
Large Scale Graph Analytics with RDF and LPG Parallel Processing
Large Scale Graph Analytics with RDF and LPG Parallel ProcessingLarge Scale Graph Analytics with RDF and LPG Parallel Processing
Large Scale Graph Analytics with RDF and LPG Parallel Processing
 
How Lyft Drives Data Discovery
How Lyft Drives Data DiscoveryHow Lyft Drives Data Discovery
How Lyft Drives Data Discovery
 
Democratizing Data within your organization - Data Discovery
Democratizing Data within your organization - Data DiscoveryDemocratizing Data within your organization - Data Discovery
Democratizing Data within your organization - Data Discovery
 
How to Build a Smart Data Lake Using Semantics
How to Build a Smart Data Lake Using SemanticsHow to Build a Smart Data Lake Using Semantics
How to Build a Smart Data Lake Using Semantics
 
How Lyft Drives Data Discovery
How Lyft Drives Data DiscoveryHow Lyft Drives Data Discovery
How Lyft Drives Data Discovery
 
Going Beyond Rows and Columns with Graph Analytics
Going Beyond Rows and Columns with Graph AnalyticsGoing Beyond Rows and Columns with Graph Analytics
Going Beyond Rows and Columns with Graph Analytics
 
Introduction to Anzo Unstructured
Introduction to Anzo UnstructuredIntroduction to Anzo Unstructured
Introduction to Anzo Unstructured
 

Similaire à Vital AI: Big Data Modeling

Natural Language Processing & Semantic Models in an Imperfect World
Natural Language Processing & Semantic Modelsin an Imperfect WorldNatural Language Processing & Semantic Modelsin an Imperfect World
Natural Language Processing & Semantic Models in an Imperfect WorldVital.AI
 
Document Based Data Modeling Technique
Document Based Data Modeling TechniqueDocument Based Data Modeling Technique
Document Based Data Modeling TechniqueCarmen Sanborn
 
Linked data for Enterprise Data Integration
Linked data for Enterprise Data IntegrationLinked data for Enterprise Data Integration
Linked data for Enterprise Data IntegrationSören Auer
 
State of the Semantic Web
State of the Semantic WebState of the Semantic Web
State of the Semantic WebIvan Herman
 
Enterprise knowledge graphs
Enterprise knowledge graphsEnterprise knowledge graphs
Enterprise knowledge graphsSören Auer
 
Hadoop Frameworks Panel__HadoopSummit2010
Hadoop Frameworks Panel__HadoopSummit2010Hadoop Frameworks Panel__HadoopSummit2010
Hadoop Frameworks Panel__HadoopSummit2010Yahoo Developer Network
 
NLP2RDF Wortschatz and Linguistic LOD draft
NLP2RDF Wortschatz and Linguistic LOD draftNLP2RDF Wortschatz and Linguistic LOD draft
NLP2RDF Wortschatz and Linguistic LOD draftSebastian Hellmann
 
Semantics in Financial Services -David Newman
Semantics in Financial Services -David NewmanSemantics in Financial Services -David Newman
Semantics in Financial Services -David NewmanPeter Berger
 
X api chinese cop monthly meeting feb.2016
X api chinese cop monthly meeting   feb.2016X api chinese cop monthly meeting   feb.2016
X api chinese cop monthly meeting feb.2016Jessie Chuang
 
Coping with Data Variety in the Big Data Era: The Semantic Computing Approach
Coping with Data Variety in the Big Data Era: The Semantic Computing ApproachCoping with Data Variety in the Big Data Era: The Semantic Computing Approach
Coping with Data Variety in the Big Data Era: The Semantic Computing ApproachAndre Freitas
 
Knowledge Graph Introduction
Knowledge Graph IntroductionKnowledge Graph Introduction
Knowledge Graph IntroductionSören Auer
 
FIWARE Global Summit - IDS Implementation with FIWARE Software Components
FIWARE Global Summit - IDS Implementation with FIWARE Software ComponentsFIWARE Global Summit - IDS Implementation with FIWARE Software Components
FIWARE Global Summit - IDS Implementation with FIWARE Software ComponentsFIWARE
 
Structured Dynamics' Semantic Technologies Product Stack
Structured Dynamics' Semantic Technologies Product StackStructured Dynamics' Semantic Technologies Product Stack
Structured Dynamics' Semantic Technologies Product StackMike Bergman
 
Is multi-model the future of NoSQL?
Is multi-model the future of NoSQL?Is multi-model the future of NoSQL?
Is multi-model the future of NoSQL?Max Neunhöffer
 
CSHALS 2010 W3C Semanic Web Tutorial
CSHALS 2010 W3C Semanic Web TutorialCSHALS 2010 W3C Semanic Web Tutorial
CSHALS 2010 W3C Semanic Web TutorialLeeFeigenbaum
 
Why I don't use Semantic Web technologies anymore, event if they still influe...
Why I don't use Semantic Web technologies anymore, event if they still influe...Why I don't use Semantic Web technologies anymore, event if they still influe...
Why I don't use Semantic Web technologies anymore, event if they still influe...Gautier Poupeau
 
Research Objects: more than the sum of the parts
Research Objects: more than the sum of the partsResearch Objects: more than the sum of the parts
Research Objects: more than the sum of the partsCarole Goble
 
LinkML Intro July 2022.pptx PLEASE VIEW THIS ON ZENODO
LinkML Intro July 2022.pptx PLEASE VIEW THIS ON ZENODOLinkML Intro July 2022.pptx PLEASE VIEW THIS ON ZENODO
LinkML Intro July 2022.pptx PLEASE VIEW THIS ON ZENODOChris Mungall
 
Linked Data Tutorial
Linked Data TutorialLinked Data Tutorial
Linked Data TutorialSören Auer
 

Similaire à Vital AI: Big Data Modeling (20)

Natural Language Processing & Semantic Models in an Imperfect World
Natural Language Processing & Semantic Modelsin an Imperfect WorldNatural Language Processing & Semantic Modelsin an Imperfect World
Natural Language Processing & Semantic Models in an Imperfect World
 
Document Based Data Modeling Technique
Document Based Data Modeling TechniqueDocument Based Data Modeling Technique
Document Based Data Modeling Technique
 
Linked data for Enterprise Data Integration
Linked data for Enterprise Data IntegrationLinked data for Enterprise Data Integration
Linked data for Enterprise Data Integration
 
State of the Semantic Web
State of the Semantic WebState of the Semantic Web
State of the Semantic Web
 
Enterprise knowledge graphs
Enterprise knowledge graphsEnterprise knowledge graphs
Enterprise knowledge graphs
 
Hadoop Frameworks Panel__HadoopSummit2010
Hadoop Frameworks Panel__HadoopSummit2010Hadoop Frameworks Panel__HadoopSummit2010
Hadoop Frameworks Panel__HadoopSummit2010
 
NLP2RDF Wortschatz and Linguistic LOD draft
NLP2RDF Wortschatz and Linguistic LOD draftNLP2RDF Wortschatz and Linguistic LOD draft
NLP2RDF Wortschatz and Linguistic LOD draft
 
Semantics in Financial Services -David Newman
Semantics in Financial Services -David NewmanSemantics in Financial Services -David Newman
Semantics in Financial Services -David Newman
 
X api chinese cop monthly meeting feb.2016
X api chinese cop monthly meeting   feb.2016X api chinese cop monthly meeting   feb.2016
X api chinese cop monthly meeting feb.2016
 
Coping with Data Variety in the Big Data Era: The Semantic Computing Approach
Coping with Data Variety in the Big Data Era: The Semantic Computing ApproachCoping with Data Variety in the Big Data Era: The Semantic Computing Approach
Coping with Data Variety in the Big Data Era: The Semantic Computing Approach
 
Knowledge Graph Introduction
Knowledge Graph IntroductionKnowledge Graph Introduction
Knowledge Graph Introduction
 
FIWARE Global Summit - IDS Implementation with FIWARE Software Components
FIWARE Global Summit - IDS Implementation with FIWARE Software ComponentsFIWARE Global Summit - IDS Implementation with FIWARE Software Components
FIWARE Global Summit - IDS Implementation with FIWARE Software Components
 
Structured Dynamics' Semantic Technologies Product Stack
Structured Dynamics' Semantic Technologies Product StackStructured Dynamics' Semantic Technologies Product Stack
Structured Dynamics' Semantic Technologies Product Stack
 
Is multi-model the future of NoSQL?
Is multi-model the future of NoSQL?Is multi-model the future of NoSQL?
Is multi-model the future of NoSQL?
 
CSHALS 2010 W3C Semanic Web Tutorial
CSHALS 2010 W3C Semanic Web TutorialCSHALS 2010 W3C Semanic Web Tutorial
CSHALS 2010 W3C Semanic Web Tutorial
 
Why I don't use Semantic Web technologies anymore, event if they still influe...
Why I don't use Semantic Web technologies anymore, event if they still influe...Why I don't use Semantic Web technologies anymore, event if they still influe...
Why I don't use Semantic Web technologies anymore, event if they still influe...
 
Research Objects: more than the sum of the parts
Research Objects: more than the sum of the partsResearch Objects: more than the sum of the parts
Research Objects: more than the sum of the parts
 
LinkML Intro July 2022.pptx PLEASE VIEW THIS ON ZENODO
LinkML Intro July 2022.pptx PLEASE VIEW THIS ON ZENODOLinkML Intro July 2022.pptx PLEASE VIEW THIS ON ZENODO
LinkML Intro July 2022.pptx PLEASE VIEW THIS ON ZENODO
 
Linked Data Tutorial
Linked Data TutorialLinked Data Tutorial
Linked Data Tutorial
 
The future of Big Data tooling
The future of Big Data toolingThe future of Big Data tooling
The future of Big Data tooling
 

Dernier

Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Paola De la Torre
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEarley Information Science
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfEnterprise Knowledge
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...gurkirankumar98700
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024Results
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 

Dernier (20)

Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 

Vital AI: Big Data Modeling

  • 1. Big Data Modeling Today: Marc C. Hadfield, Founder
 Vital AI
 http://vital.ai marc@vital.ai 917.463.4776
  • 2. intro Marc C. Hadfield, Founder Vital AI
 http://vital.ai
 marc@vital.ai
  • 3. Big Data Modeling is
 Data Modeling with the “Variety” Big Data Dimension in mind…
  • 4. Big Data “Variety” Dimension The “Variety” problem can be addressed by a combination of improved tools and a methodology involving both system architecture and
 data science / analysis. Compared to Volume and Velocity, Variety is a very labor -intensive human-centric process. Variety is the many types of data to be utilized together in a data-driven application.
 Potentially too many types for any single person to keep track of (especially in Life Sciences).
  • 5. Key Takeaways: Using OWL as a “meta-schema” can drastically reduce operations/development effort and increase the value of the data for analysis. OWL can augment and not replace familiar development processes and tools. A huge amount of ongoing development effort is spent transforming data across components and keeping data consistent during analysis. Collecting Good Data = Good Analytics
  • 6. Big Data Modeling: Challenges Goals OWL as Modeling Language Using OWL-based Models… Collaboration/Modeling Tools
  • 7. Examples from NYC Department of Education: Domain Ontology Application Architecture Development Methodology/Tools
  • 8. NYC Department of Education:
  • 9.
  • 10.
  • 11.
  • 12.
  • 13.
  • 16. Mobile/Web App Architecture Data Model Data Model Mobile App Server Implementation Database
  • 17. Database Master Database "Data Lake" Database Database Database Business Intelligence Data Analytics Dashboard Enterprise DataWarehouse Architecture Schema “on read” or “on write” Data Model Data Model Data Model Data Model Data Model ETL Process
  • 18. MobileApp Server Layer Real Time Data Calculated Views Hadoop Predictive Analytics Master Database "Data Lake" Business Intelligence Data Analytics Dashboard Lambda Architecture + Hadoop: Data Driven App Data Model Data Model Data Model Data ModelData Model
  • 19. Data Wrangling / Data Science Master Database "Data Lake" Business Intelligence Data Analytics Raw Data R Data Model Data Model Data Model Prediction Models must integrate back with production environment:
  • 20. Same Data, Difference Contexts… Redundant Models.
  • 21. Data Architecture Issues { Database Schema
 JSON Data
 Data Object Classes
 Avro/Parquet Redundant Data Definitions: Considerable Development / Maintenance / Operational Overhead
  • 22. Data Science / Data Wrangling Issues 
 Data Harmonization: Merging Datasets from Multiple Sources
 
 Loss of Context: Feature f123 = Column135 X Column45 / Column13
 Side note: Let’s stop using CSV files for datasets!
 No more flat datasets!
  • 23. Goals
  • 24. Goals: Reduce redundancy in Data Definitions Enforce Clean/Harmonized Data
 Use Contextual Datasets Use Best Software Components (Databases, Analytics, …) Use Familiar Tools (IDE, git, Languages, R)
  • 25. OWL as Modeling Language
  • 26. Web Ontology Language (OWL) Specifies an Ontology (“Data Model”) Formal Semantics, W3C Standard Provides a language to describe the meaning of data properties and how they relate to classes. Example: Mammal
 Necessary Conditions: warm-blooded, vertebrate animal, has hair or fur, secrets milk, (typically) live birth Greater descriptive power than Schema (SQL Tables) and Serialization Frameworks (Avro)
  • 27. Why OWL? If we can more formally specify what the data *means*, then we can have a single data model (ontology) apply to our entire architecture, and data can be transformed automatically locally as per the needs of a specific software module. Manually coded data transforms may be “lossy” and/or introduce errors, so eliminating them helps keep data clean.
  • 28. Why OWL? (continued) Example: if we specify what a “Document” is, then a text- mining analyzer will know how to access the textual data without further prompting. Example: if we specify Features for use in Machine Learning in the ontology, then features can be generated automatically to train Machine Learning Models, and the same features would be generated when we use the model in production.
  • 29. Why OWL? (continued) Note: As ontologies can extend other ontologies, rather than a single ontology, a collection of linked ontologies can be used, allowing segmentation across an organization.
  • 30. Vital Core Ontology Protege Editor… Nodes, Edges, HyperNodes, HyperEdges get URIs John/WorksFor/IBM —> Node / Edge / Node
  • 32. NYC Dept of Education Domain Ontology
  • 33. Generating Data Bindings with VitalSigns: Ontology VitalSigns Groovy Bindings Semantic Bindings Hadoop Bindings Prolog Bindings Graph Bindings HBase Bindings JavaScript Bindings Code/Schema Generation vitalsigns generate -ont name…
  • 34. person123.name = "John" person123.worksFor.company456 <person123> <hasName> "John" <worksFor123> <hasSource> <person123> <worksFor123> <hasDestination> <company456> <worksFor123> <hasType> <worksFor> person123, Node:type=Person, Node:hasName="John" worksFor123, Edge:type=worksFor, Edge:hasSource=person123, Edge:hasDestination=company456 Groovy RDF HBase Data Representations
  • 35. VitalSigns Generation —> JAR Library Runtime Domain Ontology Domain Ontology Domain Ontology Domain Ontology VitalSigns Class
  • 37. Developing with the Ontology in UI, Hadoop, NLP, Scripts, ... Node:Person Node:PersonEdge:hasFriend Set<Friend> person123.getFriends() Eclipse IDE
  • 38. // Reference to an NYCSchool object NYCSchool school123 = … // get from database
 ! // Get a list of programs, local context (cache) List<NYCSchoolProgram> programs = school123.getPrograms() ! // Get list of programs, global context (database) List<NYCSchoolProgram> programs = school123.getPrograms(Context.ServiceWide) ! JVM Development
  • 39. Using JSON-Schema Data in JavaScript for(var i = 0 ; i < progressReports.length; i++) { var r = progressReports[i]; var sub = $('<ul>'); sub.append('<li>Overall Grade : ' + r.progReportOverallGrade + '</li>'); sub.append('<li>Progress Grade: ' + r.progReportProgressGrade + '</li>'); sub.append('<li>Environment Grade: ' + r.progReportEnvironmentGrade + '</li>'); sub.append('<li>College and Career Readiness Grade: ' + r.progRepCollegeAndCareerReadinessGrade+ '</li>'); sub.append('<li>Performance Grade: ' + r.progReportPerformanceGrade+ '</li>'); sub.append('<li>Closing the Achievement Gap Points: ' + r.progReportClosingTheAchievementGapPoints+ '</li>'); sub.append('<li>Percentile Rank: ' + r.progReportPercentileRank + '</li>'); sub.append('<li>Overall Score: ' + r.progReportOverallScore + '</li>'); }
  • 40. NoSQL Queries Query API / CRUD Operations
 ! Queries generated into “native” NoSQL Query format: Sparql / Triplestore (Allegrograph) HBase / DynamoDB MongoDB Hive/HiveQL (on Spark/Hadoop2.0) Query Types: “Select” and “Graph” Abstract type of datastore from application/analytics code Pass in a “native” query when necessary
  • 41. Data Serialization, Analytics Jobs Data Serialized into file format by blocks of objects Leverage Hadoop Serialization Standards: Sequence File, Avro, Parquet Get data in and out of HDFS Files Spark/Hadoop jobs passed a set of objects as input
 URI of object is key Data Objects are serialized into Compressed Strings for transport over Flume, etc.
  • 42. Machine Learning Via Hadoop, Spark, R Mahout, MLLib Build Predictive Models Classification, Clustering... Use Features defined in Ontology Learn Target defined in Ontology Models consume Ontology Data as input
  • 43. Natural Language Processing/Text Mining Topic Categorization… Extract Entities… Text Features from Ontology Classes extending Document…
  • 44. Graph Analytics GraphX, Giraph: PageRank, Centrality, Interest Graph, …
  • 45. Inference / Rules Use Semantic Web Rule Engines / Reasoners
 ! Load Ontology + RDF Representation of Data Instances (Individuals)
  • 46. R Analytics Load Serialized Data into R Dataframes ! Reference Classes and Properties by Name in Dataframes (cleaner code than huge number of columns)
  • 47. Graph Visualization with Cytoscape Data already in Node/Edge Graph Form
  • 50. NYC Schools Architecture Mobile App JSON Schema VertX Vital Flow Queue Rule Engine NLP DynamoDB Vital Prime VitalService Client NYC Schools Data Model R Serialized Data Data Insights
  • 52. Collaboration/Tools git - code revision system OWL Ontologies treated as code artifact Coordinate across Teams:
 “Front End”, “Back End”, “Operations”, “Business Intelligence”, “Data Science”… Coordinate across Enterprise: Departments / Business Units “Data Model of Record”
  • 54. vitalsigns command line vitalsigns generate vitalsigns upversion/downversion code/schema generation increase version patch number
 move previous version to archive
 rename OWL file including username JAR files pushed into Maven
 (Continuous Integration)
  • 55. Git Integration git: add, commit, push, pull diff: determine differences merge: merge two Ontologies detect types of Ontology changes merge into new patch version
  • 56. OWL as Data Modeling Language:
 Data Architecture & Data Science / Analytics Conclusions Leverage Existing Tools, Components Reduce model redundancy, reduce effort. A Means to Collaborate Across Teams: Data Model of Record Cleaner Data Integrate additional analysis
  • 57. For more information, please contact: Marc C. Hadfield http://vital.ai marc@vital.ai 917.463.4776 Thank You!