Data Structure Graphs
An overview
Presentation by @dougneedham
Introduction
 @dougneedham
 Data Guy - Started as a DBA in the Marine Corps, evolved to Architect, now Data
Scientist.
 Oracle, SQL Server, Cassandra, Hadoop, MySQL, Spark.
 I have a strong relational/traditional background.
 Perpetual Student
 Learning new things challenges our assumptions. Forces us to take a new perspective
on “old” problems. Eventually maybe even shows us that there is a better way to solve
a problem.
Introducing Data Structure Graphs
 Data Structure Graph Level 1 (DSG-L1) – This is roughly like an Entity Relationship
Diagram (ERD). Tables are Vertices, Foreign Keys are Edges.
 A DSG-L1 can show you where you are going to have the most interesting query
performance of your tables.
 Data Structure Graph Level 2 (DSG-L2) – Each Vertex in this graph is an application.
Each Edge is data transfer. Roughly equivalent to what we used to call Data Flow
diagrams.
 A DSG-L2 can show you where the most work is going on in your
Enterprise.
 Data Structure Graph Dependency (DSG-D) – Each vertex is a job, script, program, or
process that is dependent on something happening in sequence before it can do its
work.
 A DSG-D can show you the sequence of events that need to take place in order for
something to be completed.
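A minimal sketch of the DSG-D idea, assuming a handful of made-up job names (none of them come from a real system). Each job lists the jobs it depends on, and the standard library's `graphlib.TopologicalSorter` (Python 3.9+) returns one valid sequence of events.

```python
from graphlib import TopologicalSorter

# Hypothetical jobs: each key depends on every job in its value set.
dependencies = {
    "load_staging": {"extract_orders", "extract_customers"},
    "build_warehouse": {"load_staging"},
    "refresh_reports": {"build_warehouse"},
}


def run_order(deps):
    """Return one valid execution order for a dependency graph."""
    return list(TopologicalSorter(deps).static_order())


order = run_order(dependencies)
print(order)
```

The two extract jobs may come out in either order, since nothing forces one before the other. A cycle in the dependencies raises `graphlib.CycleError`.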
Definition
 A Data Structure Graph is a group of atomic entities that are related to each other,
stored in a repository, then moved from one persistence layer to another, rendered as
a Graph.
 A group of atomic entities.
 Related to each other.
 Stored in a repository.
 Moved from one persistence layer to another.
 Rendered as a Graph.
In summary: Social Network analysis applied
to data modeling.
 Data modeling is a topic we are all familiar with here at Data Modeling Zone.
 Social Network analysis is, perhaps, something new.
 So a little background on the topic we may not be familiar with.
What is Social Network Analysis?
 “Social network analysis (SNA) is a strategy for investigating social structures through the use of network and
graph theories.
 It characterizes networked structures in terms of nodes (individual actors, people, or things within the network)
and the ties or edges (relationships or interactions) that connect them.
 Examples of social structures commonly visualized through social network analysis include
 social media networks,
 friendship and
 acquaintance networks,
 kinship,
 disease transmission, and
 sexual relationships.
 These networks are often visualized through sociograms in which nodes are represented as points and ties are represented
as lines.” – Wikipedia
 https://en.wikipedia.org/wiki/Social_network_analysis
Example From wiki:
"Kencf0618FacebookNetwork" by Kencf0618 -
Own work. Licensed under CC BY-SA 3.0 via Wikimedia Commons -
https://commons.wikimedia.org/wiki/File:Kencf0618FacebookNetwork.jpg#/media/File:Kencf0618FacebookNetwork.jpg
A little History
 The 7 Bridges of Königsberg
 Every tome on Graph theory or Network analysis devotes a small portion of its pages
to the 7 Bridges of Königsberg.
 If I don’t cover this with you, the gods of mathematics will strike me down, and never
allow me to do analysis again in the future.
The Bridges
The Problem
 Folks enjoyed their Sunday afternoon strolls across the bridges, and occasionally people
would wonder whether there was a route that crossed every bridge exactly once.
 Eventually Leonhard Euler was brought into the debate.
 Euler used Vertices to represent the land masses and edges (or arcs, at the time) to
represent bridges. He realized that the odd number of edges at every vertex made the problem
unsolvable.
 Sarada Herke provides one of the best explanations of the solution in her video "Solution to
Konigsberg".
 Basically, every vertex must have an even number of edges in order to make
it possible to start from one vertex and arrive back at the point of origin while crossing every
edge exactly once. Essentially, the number of bridges touching each land mass must be even. (More details in
the above video.)
 And here is the cool thing about mathematicians: if we tell you something is impossible, we
have to tell you why in a way you can understand. In doing so, Euler also founded the branch of
mathematics we today call Graph Theory.
 http://en.wikipedia.org/wiki/Leonhard_Euler
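The parity argument can be checked in a few lines. This is a sketch only: the land-mass labels A to D are my own naming, and the functions assume the graph is connected, which is true for Königsberg.

```python
from collections import Counter

# The seven bridges as edges between four land masses.
# Labels are illustrative: A = the island, B and C = the two river banks,
# D = the eastern land mass.
bridges = [
    ("A", "B"), ("A", "B"),
    ("A", "C"), ("A", "C"),
    ("A", "D"), ("B", "D"), ("C", "D"),
]


def degrees(edges):
    """Count how many edge ends touch each vertex."""
    deg = Counter()
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return deg


def odd_vertices(edges):
    """Vertices touched by an odd number of edges."""
    return sorted(v for v, d in degrees(edges).items() if d % 2 == 1)


def has_euler_circuit(edges):
    """Closed walk using every edge once: no odd vertices (connected graph assumed)."""
    return len(odd_vertices(edges)) == 0


def has_euler_path(edges):
    """Open or closed walk using every edge once: zero or two odd vertices."""
    return len(odd_vertices(edges)) in (0, 2)


print(dict(degrees(bridges)))
print(odd_vertices(bridges), has_euler_circuit(bridges), has_euler_path(bridges))
```

All four land masses have an odd number of bridges, so neither kind of walk exists.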
A few terms
 Stand back, we are going to talk about math!
 Basically we are talking about a bunch of dots joined together by lines
 Vertex – Dot on a graph
 Edge – Line connecting two vertices
 Edge_Label – this is a term I coined originally related to Data Structure Graphs that helps trace a
path. If you label your edges, and you have multiple edges with the same label in a Graph you can
quite easily identify walks, paths, and cycles through your graph.
 A lot of things are networks if you look at them the right way.
 Mark Newman has done a number of really cool presentations about Network analysis, available on
YouTube.
 https://www.youtube.com/watch?v=lETt7IcDWLI
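One way to hold vertices, edges, and Edge_Labels in code, as a sketch. The application names and labels are invented for illustration. Grouping edges by label is what makes a single entity's walk through the graph easy to pull out.

```python
from collections import defaultdict
from typing import NamedTuple


class Edge(NamedTuple):
    source: str   # vertex the edge leaves
    target: str   # vertex the edge enters
    label: str    # Edge_Label: the data entity carried along the edge


# Hypothetical data movements between applications.
edges = [
    Edge("CRM", "Billing", "customer"),
    Edge("Billing", "Warehouse", "customer"),
    Edge("Orders", "Warehouse", "order"),
    Edge("Warehouse", "Reporting", "customer"),
]


def edges_by_label(edge_list):
    """Group (source, target) pairs under their shared Edge_Label."""
    grouped = defaultdict(list)
    for e in edge_list:
        grouped[e.label].append((e.source, e.target))
    return dict(grouped)


for label, pairs in edges_by_label(edges).items():
    print(label, pairs)
```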
More terms
 What is a path?
 Shortest path – How are two vertices connected?
 Longest Path – Tracing the flow of an interesting item through a large collection of
applications.
 Directed Graphs – or Digraphs
 If you rearrange things how does the layout affect understanding?
 This is not just data visualization, it can also be used for prediction.
https://www.youtube.com/watch?v=rwA-y-XwjuU
Final terms
 Centrality – Hub and Authority
 This is almost a whole topic by itself, since there are different types of Centrality:
 Degree Centrality, Eigenvector Centrality, PageRank, etc.
 Power law.
 Transitivity
 Homophily – how things are similar
 Contagion – How do things “spread” through a network?
 Order of a graph – number of vertices
 Size of a graph – number of edges
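Order, size, and the simplest centrality measure can all be computed from a plain edge list. The sketch below uses invented table names and normalizes degree by n - 1, which is one common convention rather than the only one.

```python
def graph_summary(edges):
    """Return (order, size, degree centrality per vertex) for an edge list."""
    vertices = {v for edge in edges for v in edge}
    degree = {v: 0 for v in vertices}
    for source, target in edges:
        degree[source] += 1
        degree[target] += 1
    n = len(vertices)
    # Degree centrality: share of the other n - 1 vertices a vertex touches.
    centrality = {v: (d / (n - 1) if n > 1 else 0.0) for v, d in degree.items()}
    return n, len(edges), centrality


# Hypothetical foreign keys, written as (child table, parent table).
foreign_keys = [
    ("orders", "customers"),
    ("orders", "products"),
    ("invoices", "orders"),
    ("invoices", "customers"),
]

order, size, centrality = graph_summary(foreign_keys)
print(order, size)
print(sorted(centrality.items(), key=lambda item: -item[1]))
```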
The Math doesn’t change.
 One thing I like about Graphs –
 The Math does not change.
 The math behind Graph theory can be a little intense, but it does not change
regardless of the scale of the graph.
 Once you understand how to “do the math” on a small graph, the same math
applies whether it is a graph of the people in this room or a graph of the
people on this planet.
Before we get to the analysis we must collect data.
 DBeaver can reverse engineer an ERD.
 Point it at the source system, select a few options, then you have a diagram.
 I wrote a small piece of Python code to translate the XML to a file suitable for import into Gephi.
 One small caveat: the foreign keys have to be defined for DBeaver to work. If the foreign keys are not
defined, the output file will need to be modified.
 Also, some aggregate or summary tables may not help your visualization.
 This is subjective, so it is at the discretion of the person reviewing the diagram.
 If you remove tables from the graph, please document the removals so that the visualization can be
reconciled with the reality of your data model.
 The URL for DBeaver is here: https://dbeaver.jkiss.org/
 (This section is a little hand-wavy, I know, but the tool or method for creating the file for import into
Gephi is largely irrelevant.)
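A sketch of the last step only: writing an edge list with the Source and Target columns that Gephi's edges-table import asks for. Parsing the DBeaver XML is left out because its layout is not shown here, so the foreign-key pairs below are made up and stand in for whatever the parser would produce.

```python
import csv
import io

# Hypothetical foreign keys as (child table, parent table) pairs.
foreign_keys = [
    ("orders", "customers"),
    ("orders", "products"),
    ("invoices", "orders"),
]


def write_edges_csv(pairs, handle):
    """Write an edge list with a Source,Target header row."""
    writer = csv.writer(handle)
    writer.writerow(["Source", "Target"])
    writer.writerows(pairs)


# An in-memory buffer keeps the example self-contained; a real run would
# pass a file opened with open(path, "w", newline="").
buffer = io.StringIO()
write_edges_csv(foreign_keys, buffer)
print(buffer.getvalue())
```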
Gephi
 http://gephi.github.io/
 From the website: “Gephi is an interactive visualization and exploration platform for all
kinds of networks and complex systems, dynamic and hierarchical graphs.”
 We are going to use data generated from my book: Data Structure Graphs.
 These are inspired by my experience consulting, but do not represent an actual data
model or ETL process.
 The following slides are for a DSG Level 2 (ETL process).
Gephi Startup
New Project, Data Table, Import data.
Load as “Edges Table” Source, Target (required)
Choose Create Missing Nodes
After a few calculations and layout runs
PageRank – Which application is most important?
A few more tweaks
Where is that Node with the highest PageRank?
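For readers who want to see what sits behind the PageRank number, here is a plain power-iteration sketch. It is not Gephi's implementation. The damping factor of 0.85, the even spreading of rank from nodes with no outgoing edges, and the application names are all assumptions of this example.

```python
def pagerank(edges, damping=0.85, iterations=100):
    """Power-iteration PageRank over a directed edge list."""
    nodes = sorted({v for edge in edges for v in edge})
    out_links = {v: [] for v in nodes}
    for source, target in edges:
        out_links[source].append(target)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        # Rank held by nodes with no outgoing edges is shared by everyone.
        dangling = sum(rank[v] for v in nodes if not out_links[v])
        new_rank = {v: (1 - damping) / n + damping * dangling / n for v in nodes}
        for v in nodes:
            for target in out_links[v]:
                new_rank[target] += damping * rank[v] / len(out_links[v])
        rank = new_rank
    return rank


# Hypothetical data transfers between applications (source feeds target).
transfers = [
    ("CRM", "Warehouse"),
    ("Orders", "Warehouse"),
    ("Billing", "Warehouse"),
    ("Warehouse", "Reporting"),
]

ranks = pagerank(transfers)
for app, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(f"{app:10s} {score:.3f}")
```

In this toy graph the application at the end of the chain collects the most rank, because everything upstream eventually feeds it.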
Now things get interesting:
 New metrics for our data model follow.
 Remember all those metrics we defined earlier?
 Here are many of them:
Data Table
Configure Labels
Labeled by degree count
Change some of the coloring
Visualization
Export to Excel
Finally, here we are.
 Within a Data Architecture there are lots of moving pieces. ETL, FTP, SFTP, Web-
Services, External data feeds. Data moving into Data Marts, and Data Warehouses.
Data Moving between applications.
 Let’s imagine how to visualize this using the information we just gained.
Data Structure Graphs
 Today, there are a few tools like ERWin, and SQL Developer that begin to organize
visualizations in this manner.
 Very few of them allow you to perform analysis on the visualization.
 As you find new tools that do this, please let me know.
 I would love to evaluate those tools and see what interesting metrics can be arrived at
from new tools.
Dijkstra's algorithm
 Some of you may have heard of Dijkstra’s algorithm.
 It is a method for finding the shortest path between two nodes on a Graph.
 This is a great optimization technique, but what if you need to find the longest path?
 What “Edge_Label” has the most influence on my organization?
 Iterate through each Edge_Label, create a subgraph that consists of only the nodes
this Edge_Label touches, then calculate the diameter of that Graph.
 The Edge_Label whose subgraph has the largest diameter has the most “impact” on your organization.
 This is mostly applied to Data Structure Graph Level 2.
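The Edge_Label procedure can be sketched as follows. Two assumptions are mine: edges are followed in their direction of flow, and the "diameter" is taken as the longest shortest path among pairs that can actually reach each other (a strict diameter is undefined when some pairs are unreachable). The sample edges are invented.

```python
from collections import defaultdict, deque


def longest_shortest_path(edges):
    """Greatest BFS distance between any reachable ordered pair of nodes."""
    adjacency = defaultdict(list)
    nodes = set()
    for source, target in edges:
        adjacency[source].append(target)
        nodes.update((source, target))
    best = 0
    for start in nodes:
        distance = {start: 0}
        queue = deque([start])
        while queue:
            current = queue.popleft()
            for nxt in adjacency[current]:
                if nxt not in distance:
                    distance[nxt] = distance[current] + 1
                    queue.append(nxt)
        best = max(best, max(distance.values()))
    return best


def diameter_by_label(labeled_edges):
    """Build one subgraph per Edge_Label and measure each."""
    grouped = defaultdict(list)
    for source, target, label in labeled_edges:
        grouped[label].append((source, target))
    return {label: longest_shortest_path(es) for label, es in grouped.items()}


# Hypothetical DSG-L2 edges: (source application, target application, Edge_Label).
labeled_edges = [
    ("CRM", "Billing", "customer"),
    ("Billing", "Warehouse", "customer"),
    ("Warehouse", "Reporting", "customer"),
    ("Orders", "Warehouse", "order"),
]

result = diameter_by_label(labeled_edges)
print(result, max(result, key=result.get))
```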
Now let’s answer some questions.
 Which tables are “most important” to import when building a data warehouse?
 The tables with the higher centrality measures.
 For an operational system these will also be the tables that have the most queries written against them.
 These will be the bottlenecks of the system.
 Is this data model optimized for reading or writing?
 What is the density of the data model?
 A higher-density model is optimized for writing; a lower-density model is optimized for reading.
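Density is the number of edges divided by the number of edges that could exist. A small sketch, treating each foreign key as a directed edge (that choice, and the table names, are assumptions of this example):

```python
def density(edges):
    """Directed graph density: distinct edges / (n * (n - 1))."""
    nodes = {v for edge in edges for v in edge}
    n = len(nodes)
    if n < 2:
        return 0.0
    return len(set(edges)) / (n * (n - 1))


# Hypothetical foreign keys as (child table, parent table).
foreign_keys = [
    ("orders", "customers"),
    ("orders", "products"),
    ("invoices", "orders"),
    ("invoices", "customers"),
]

print(round(density(foreign_keys), 3))
```

Four tables allow 12 directed edges, and this model uses 4 of them.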
Barabasi-Albert model and Scale free networks.
 Preferential attachment.
 There are a few different models available for analysis and prediction of networks.
 A Barabási-Albert model can be summarized as a “rich get richer” model. In other words, when new
nodes are added, they are more likely to connect to nodes that are already well
connected.
 This suspiciously sounds similar to our data modeling concepts related to conformed dimensions.
 My suspicion is there are many data models that fit this model.
 Please send me some anonymized data models. I want to research this more.
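A small simulation shows the "rich get richer" effect. This sketch is the simplest variant, where each new node attaches with a single edge. The trick of sampling from a list of edge endpoints is a standard way to pick a target in proportion to its degree; the seed and node count are arbitrary.

```python
import random
from collections import Counter


def preferential_attachment(n_nodes, seed=0):
    """Grow a graph one node at a time, one edge per new node."""
    rng = random.Random(seed)
    edges = [(1, 0)]        # start from a single edge between nodes 0 and 1
    endpoints = [0, 1]      # every node appears once per edge it touches
    for new in range(2, n_nodes):
        # Picking uniformly from endpoints picks a node proportionally to its degree.
        target = rng.choice(endpoints)
        edges.append((new, target))
        endpoints.extend((new, target))
    return edges


edges = preferential_attachment(200)
degree = Counter(v for edge in edges for v in edge)
print(degree.most_common(5))
```

A few early nodes typically end up with far more edges than the rest, which is the pattern a conformed dimension would show in a data model.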
Some theoretical thoughts.
 Let’s assume we have an equation for the growth of every table we have collected
from our little topological study above (more on this in a couple of slides).
 Let us further assume we have a graph of the same tables.
 Can you do anything interesting with this?
 The derivative of each equation shows us the growth rate of the table.
 What happens if we plug that derivative in the entropy equation for the graph?
 What would this represent?
 Could this be considered a valuation method?
 A way to put a dollar value on a data model?
 If you try it, let me know what you find out.
Apply the theory.
 Using a few metrics from each table we can do some clustering.
 Take the number of columns of a table, the centrality measure, and the growth rate,
and you have a vector for each table.
 Doing some simple cosine similarity on these vectors will tell you mathematically which
tables are similar.
 Is this finding consistent with expectations?
 If not should the model be adjusted?
 What does this result say to you?
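A sketch of the comparison, with invented numbers for each table's (column count, centrality, growth rate) vector:

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Hypothetical vectors: (number of columns, centrality, rows added per day).
tables = {
    "orders": (12, 0.90, 5000.0),
    "order_lines": (8, 0.60, 20000.0),
    "customers": (25, 0.80, 50.0),
}

names = sorted(tables)
for i, first in enumerate(names):
    for second in names[i + 1:]:
        score = cosine_similarity(tables[first], tables[second])
        print(f"{first} vs {second}: {score:.3f}")
```

With raw values the growth rate dominates because its scale is so much larger, so rescaling each feature to a comparable range before comparing is probably worth doing.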
Deriving the growth rate of each table.
 Little R demonstration to follow.
 Using a design methodology like the data vault mandates that every table have date timestamps for when the
data is loaded.
 Collect how many records are loaded per day.
 A calculation that represents the growth formula for each table can be derived with R.
 Using the growth rate, centrality, and the width of a table (number of columns) you can do cosine similarity
to determine the tables that are mathematically similar to each other.
 Using this information you may be able to reallocate the infrastructure that the data warehouse sits on.
 Is every table stored on the same disk storage media? Does it need to be?
 How about caching? Using these metrics alone you can make a well informed decision about your storage
platform.
 The following image is a small topological representation of this process.
 This is still slightly theoretical, and I welcome having a conversation with anyone that may want to know
more.
 Again, send me anonymized data. Hopefully along with the Data Structure Graph you generated from your
data.
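The demonstration uses R; the sketch below shows the same idea in Python so that it matches the other examples. It assumes the simplest growth formula, a straight line fitted by least squares to the cumulative row count, and the daily load figures are invented. It needs at least two days of data.

```python
def linear_growth(daily_loads):
    """Fit cumulative rows = slope * day + intercept by ordinary least squares."""
    cumulative = []
    total = 0
    for count in daily_loads:
        total += count
        cumulative.append(total)
    n = len(cumulative)
    days = range(n)
    mean_x = sum(days) / n
    mean_y = sum(cumulative) / n
    sxx = sum((x - mean_x) ** 2 for x in days)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(days, cumulative))
    slope = sxy / sxx                 # growth rate in rows per day
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical rows loaded per day, taken from load timestamps.
loads = [980, 1010, 1005, 995, 1020, 990]
rate, start = linear_growth(loads)
print(f"about {rate:.0f} rows per day")
```

The slope is the derivative of the fitted line, so it can go straight into the table's vector as the growth rate.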
This is what the topology may look like.
Consider the following:
 If you need assistance, contact me directly (I am easy to find @dougneedham)
 Network/Graph Analysis is cool.
 It can show you some interesting things about your data that you may not have
considered.
What did I leave out?
 Graphs that change over time – What happens when you remove a single Edge or
Vertex?
 Comparing two networks – If you have the same number of edges and nodes, are two
graphs the same?
 Contagion – How will data spread through the network? (Since a DSG represents
different types of Edges based on Edge_Label, Contagion should not affect the entire
network.) This is also commonly known as data lineage. If you don’t have a tool that
does it, with a bit of metadata management this can be derived from a Data Structure
Graph Level 2.
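Deriving lineage from a DSG-L2 comes down to following one Edge_Label downstream from a starting application. A breadth-first sketch, with invented applications and labels:

```python
from collections import defaultdict, deque


def downstream(labeled_edges, start, label):
    """Applications reachable from start by following edges with one Edge_Label."""
    adjacency = defaultdict(list)
    for source, target, edge_label in labeled_edges:
        if edge_label == label:
            adjacency[source].append(target)
    seen = set()
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for nxt in adjacency[current]:
            if nxt not in seen and nxt != start:
                seen.add(nxt)
                queue.append(nxt)
    return seen


# Hypothetical DSG-L2 edges: (source application, target application, Edge_Label).
flows = [
    ("CRM", "Billing", "customer"),
    ("Billing", "Warehouse", "customer"),
    ("Warehouse", "Reporting", "customer"),
    ("Orders", "Warehouse", "order"),
]

print(sorted(downstream(flows, "CRM", "customer")))
print(sorted(downstream(flows, "Orders", "order")))
```

The order data stops at the Warehouse in this example because no edge with that label leaves it, which is the sense in which contagion does not affect the entire network.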
Other Analysis
 What else can be done with Social Network Analysis?
 How about risk exposure to banks?
 http://www.federalreserve.gov/newsevents/speech/yellen20130104a.htm
A little history
One other cool bit of Math
 How many reports can your dimensional data model support?
 Do you have the situation where people want to create a project out of a report,
rather than do a proper data model design up front?
 Here is some help.
 The upper bound of the total number of reports that a conformed dimension data
model can support is calculated by:
 Calculate the number of selectable columns in each dimension ($2^c - 1$)
 Create the adjacency matrix for the dimensions to facts
 A bit of multiplication.
 More details here: http://bit.ly/MeasuringDimensionalModels
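The first two steps can be sketched as below. The final multiplication is deliberately left out, because the steps listed here do not say how the counts and the matrix are combined; the linked write-up has the details. Reading c as the number of selectable columns in a dimension is my assumption, and the dimension and fact names are invented.

```python
def selectable_subsets(column_count):
    """Non-empty subsets of c selectable columns: 2**c - 1."""
    return 2 ** column_count - 1


def adjacency_matrix(fact_dimensions, dimension_columns):
    """Rows are facts, columns are dimensions; 1 where a fact uses a dimension."""
    facts = sorted(fact_dimensions)
    dims = sorted(dimension_columns)
    rows = [[1 if d in fact_dimensions[f] else 0 for d in dims] for f in facts]
    return facts, dims, rows


# Hypothetical star schema: selectable columns per dimension, dimensions per fact.
dimension_columns = {"customer": 3, "product": 2, "date": 4}
fact_dimensions = {
    "sales": ["customer", "product", "date"],
    "returns": ["product", "date"],
}

counts = {d: selectable_subsets(c) for d, c in dimension_columns.items()}
facts, dims, matrix = adjacency_matrix(fact_dimensions, dimension_columns)
print(counts)
print(dims)
for fact, row in zip(facts, matrix):
    print(fact, row)
```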
Graphs are Cool!
 Help me.
 Please send me anonymized data.
 In order to present more about how the mathematics of Graph theory, and social
network analysis can be applied in general to the application of data modeling, I need
more data. 
 This is a fascinating topic, if you want to reach out to me directly I can be reached at:
dougthedataguy@gmail.com
 Here is my GitHub for the code and data from the book, and examples:
http://bit.ly/DataStructureGraph_github
https://dougneedham.shinyapps.io/DataStructureGraph
Hard to see, I know, but the top diagram is the “master graph”, the bottom image is a single Edge_Label. You can see how an individual
data entity flows through an organization.
My book
Goes through a number of examples of doing a Graph analysis of a fictional organization.
Final Thoughts – Questions?

Learn How Data Science Changes Our World
 
Real-Time AI Streaming - AI Max Princeton
Real-Time AI  Streaming - AI Max PrincetonReal-Time AI  Streaming - AI Max Princeton
Real-Time AI Streaming - AI Max Princeton
 
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
 
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
 

Data Structure Graph DMZ #DMZone

  • 1. Data Structure Graphs An overview Presentation by @dougneedham
  • 2. Introduction
      • @dougneedham
      • Data Guy: started as a DBA in the Marine Corps, evolved to Architect, now Data Scientist.
      • Oracle, SQL Server, Cassandra, Hadoop, MySQL, Spark.
      • I have a strong relational/traditional background.
      • Perpetual Student
      • Learning new things challenges our assumptions. It forces us to take a new perspective on “old” problems. Eventually it may even show us that there is a better way to solve a problem.
  • 3. Introducing Data Structure Graphs
      • Data Structure Graph Level 1 (DSG-L1): roughly like an Entity Relationship Diagram (ERD). Tables are Vertices, Foreign Keys are Edges.
      • A DSG-L1 can show you which of your tables are going to have the most interesting query performance.
      • Data Structure Graph Level 2 (DSG-L2): each Vertex in this graph is an application. Each Edge is a data transfer. Roughly equivalent to what we used to call Data Flow diagrams.
      • A DSG-L2 can show you where the most work is going on in your Enterprise.
      • Data Structure Graph Dependency (DSG-D): each Vertex is a job, script, program, or process that depends on something happening in sequence before it can do its work.
      • A DSG-D can show you the sequence of events that need to take place in order for something to be completed.
  • 4. Definition
      • A Data Structure Graph is a group of atomic entities that are related to each other, stored in a repository, then moved from one persistence layer to another, rendered as a Graph.
      • A group of atomic entities.
      • Related to each other.
      • Stored in a repository.
      • Moved from one persistence layer to another.
      • Rendered as a Graph.
  • 5. In summary: Social Network analysis applied to data modeling.
      • Data modeling is a topic we are all familiar with here at Data Modeling Zone.
      • Social Network analysis is, perhaps, something new.
      • So, a little background on the topic we may not be familiar with.
  • 6. What is Social Network Analysis?
      • “Social network analysis (SNA) is a strategy for investigating social structures through the use of network and graph theories. It characterizes networked structures in terms of nodes (individual actors, people, or things within the network) and the ties or edges (relationships or interactions) that connect them. Examples of social structures commonly visualized through social network analysis include social media networks, friendship and acquaintance networks, kinship, disease transmission, and sexual relationships. These networks are often visualized through sociograms in which nodes are represented as points and ties are represented as lines.” (Wikipedia)
      • https://en.wikipedia.org/wiki/Social_network_analysis
  • 7. Example. From wiki: "Kencf0618FacebookNetwork" by Kencf0618 - Own work. Licensed under CC BY-SA 3.0 via Wikimedia Commons - https://commons.wikimedia.org/wiki/File:Kencf0618FacebookNetwork.jpg#/media/File:Kencf0618FacebookNetwork.jpg
  • 8. A little History
      • The 7 Bridges of Konigsberg
      • Every tome on Graph theory or Network analysis devotes a small portion of its pages to the 7 Bridges of Konigsberg.
      • If I don’t cover this with you, the gods of mathematics will strike me down, and never allow me to do analysis again in the future.
  • 10. The Problem
      • Folks enjoyed their Sunday afternoon strolls across the bridges, but occasionally people would wonder if one particular route was more efficient than another.
      • Eventually Leonhard Euler was brought into the debate about the efficiency problem.
      • Euler used Vertices to represent the land masses and Edges (or arcs, at the time) to represent bridges. He realized that the odd number of edges at each vertex made the problem unsolvable.
      • Sarada Herke provides one of the best explanations of the solution in her video “Solution to Konigsberg”.
      • Basically, the solution is that every vertex must have an even number of edges in order to make it possible to start from one vertex and arrive back at the point of origin, crossing every edge once and none twice. Essentially, the number of bridges touching each land mass must be even. (More details in the video mentioned above.)
      • And here is the cool thing about mathematicians: if we tell you something is impossible, we have to tell you why in a way you can understand. In doing so, Euler also invented the branch of mathematics we today call Graph Theory.
      • http://en.wikipedia.org/wiki/Leonhard_Euler
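The parity argument on this slide can be checked mechanically. The following is a minimal Python sketch, not part of the original talk: it models the seven bridges as a multigraph and counts the bridges touching each land mass. The land-mass names are illustrative labels chosen for this sketch.

```python
from collections import Counter

# The seven bridges as an undirected multigraph.
# Land-mass names are illustrative labels, not taken from the slides.
BRIDGES = [
    ("north_bank", "island"), ("north_bank", "island"),
    ("south_bank", "island"), ("south_bank", "island"),
    ("north_bank", "east_bank"), ("south_bank", "east_bank"),
    ("island", "east_bank"),
]

def degrees(edges):
    """Count how many edge endpoints touch each vertex."""
    count = Counter()
    for a, b in edges:
        count[a] += 1
        count[b] += 1
    return count

def round_trip_possible(edges):
    """Euler's parity condition: every vertex has an even degree.

    Assumes the graph is connected; connectivity is not checked here.
    """
    return all(d % 2 == 0 for d in degrees(edges).values())

print(dict(degrees(BRIDGES)))
print(round_trip_possible(BRIDGES))
```

Every land mass ends up with an odd count, so the condition fails for Konigsberg, while a simple triangle of three vertices passes it.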
  • 11. A few terms
      • Stand back, we are going to talk about math!
      • Basically we are talking about a bunch of dots joined together by lines.
      • Vertex: a dot on a graph.
      • Edge: a line connecting two dots.
      • Edge_Label: a term I originally coined for Data Structure Graphs that helps trace a path. If you label your edges, and you have multiple edges with the same label in a Graph, you can quite easily identify walks, paths, and cycles through your graph.
      • A lot of things are networks if you look at them the right way.
      • Mark Newman has done a number of really cool presentations about Network analysis, available on YouTube.
      • https://www.youtube.com/watch?v=lETt7IcDWLI
  • 12. More terms
      • What is a path?
      • Shortest path: how are two vertices connected?
      • Longest path: tracing the flow of an interesting item through a large collection of applications.
      • Directed Graphs, or Digraphs.
      • If you rearrange things, how does the layout affect understanding?
      • This is not just data visualization, it can also be used for prediction. https://www.youtube.com/watch?v=rwA-y-XwjuU
  • 13. Final terms
      • Centrality: Hub and Authority.
      • This is almost a whole topic by itself, since there are different types of Centrality: Degree Centrality, Eigenvector Centrality, PageRank, etc.
      • Longest path: tracing the flow of an interesting item through a large collection of applications.
      • Power law.
      • What is a path?
      • Transitivity.
      • Homophily: how things are similar.
      • Directed Graphs, or Digraphs.
      • Contagion: how do things “spread” through a network?
      • Let’s rearrange things: how does the layout affect understanding?
      • Order of a graph: the number of vertices.
      • Size of a graph: the number of edges.
      • This is not just data visualization, it can also be used for prediction. https://www.youtube.com/watch?v=rwA-y-XwjuU
  • 14. The Math doesn’t change.
      • One thing I like about Graphs: the Math does not change.
      • The math behind Graph theory can be a little intense, but it does not change regardless of the scale of the graph.
      • Once you understand how to “do the math” on a small graph, the same math applies whether it is a graph of the people in this room or a graph of the people on this planet.
  • 15. Before we get to the analysis we must collect data.
      • DBeaver can reverse engineer an ERD.
      • Point it at the source system, select a few options, and you have a diagram.
      • I wrote a small piece of Python code to translate the XML to a file suitable for import into Gephi.
      • One small caveat: the foreign keys have to be defined for DBeaver to work. If the foreign keys are not defined, the output file will need to be modified.
      • Also, some aggregate or summary tables may not help your visualization.
      • This is subjective, so it is at the discretion of the person reviewing the diagram.
      • If you remove tables from the graph, please document the removals so that the visualization can be reconciled with the reality of your data model with no discrepancies.
      • The URL for DBeaver is here: https://dbeaver.jkiss.org/
      • (This section is a little hand-wavy, I know, but the tool or method for creating the file for import into Gephi is largely irrelevant.)
  • 16. Gephi
      • http://gephi.github.io/
      • From the website: “Gephi is an interactive visualization and exploration platform for all kinds of networks and complex systems, dynamic and hierarchical graphs.”
      • We are going to use data generated from my book: Data Structure Graphs.
      • These are inspired by my consulting experience, but do not represent an actual data model or ETL process.
      • The following slides are for a DSG Level 2 (ETL process).
  • 18. New Project, Data Table, Import data.
  • 19. Load as “Edges Table” Source, Target (required)
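As a rough illustration of the file this import step expects, here is a small Python sketch that writes an edge list with the required Source and Target columns. The application names are hypothetical, and the author's own XML-to-Gephi script is not reproduced here.

```python
import csv
import io

# Hypothetical data transfers between applications: (source, target).
transfers = [
    ("CRM", "Staging"),
    ("Staging", "Warehouse"),
    ("Warehouse", "Reporting"),
]

def edges_csv(edges):
    """Render an edge list as CSV text with Source and Target columns."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["Source", "Target"])
    writer.writerows(edges)
    return buffer.getvalue()

print(edges_csv(transfers))
```

Saving that text to a `.csv` file gives something that can be loaded as an edges table; an extra column for the Edge_Label could be added in the same way.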
  • 21. After a few calculations and layout runs
  • 22. PageRank – Which application is most important?
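For readers who want to see what sits behind this metric, below is a minimal power-iteration sketch of PageRank in plain Python. It is an illustration only: the damping factor of 0.85, the iteration count, and the application names are assumptions, and Gephi's own implementation may differ in its details.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Approximate PageRank for a directed edge list by power iteration."""
    nodes = sorted({node for edge in edges for node in edge})
    out_links = {node: [] for node in nodes}
    for source, target in edges:
        out_links[source].append(target)

    rank = {node: 1.0 / len(nodes) for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes without outgoing edges is spread evenly.
        dangling = sum(rank[n] for n in nodes if not out_links[n])
        base = (1.0 - damping + damping * dangling) / len(nodes)
        new_rank = {node: base for node in nodes}
        for source in nodes:
            for target in out_links[source]:
                new_rank[target] += damping * rank[source] / len(out_links[source])
        rank = new_rank
    return rank

# Hypothetical DSG Level 2: applications feeding a warehouse.
transfers = [
    ("CRM", "Warehouse"),
    ("Billing", "Warehouse"),
    ("Web", "Warehouse"),
    ("Warehouse", "Reporting"),
]

for name, score in sorted(pagerank(transfers).items(), key=lambda kv: -kv[1]):
    print(f"{name}: {score:.3f}")
```

Applications that receive data from many sources, or from highly ranked sources, float to the top of the list.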
  • 23. A few more tweaks
  • 24. Where is that Node with the highest PageRank?
  • 25. Now things get interesting:
      • New metrics for our data model follow.
      • Remember all those metrics we defined earlier?
      • Here are many of them:
  • 29. Change some of the coloring
  • 32. Finally, here we are.
      • Within a Data Architecture there are lots of moving pieces: ETL, FTP, SFTP, Web Services, external data feeds. Data moving into Data Marts and Data Warehouses. Data moving between applications.
      • Let’s imagine how to visualize this using the information we just gained.
  • 33. Data Structure Graphs
      • Today, there are a few tools like ERWin and SQL Developer that begin to organize visualizations in this manner.
      • Very few of them allow you to perform analysis on the visualization.
      • As you find new tools that do this, please let me know.
      • I would love to evaluate those tools and see what interesting metrics can be derived with them.
  • 34. Dijkstra's algorithm
      • Some of you may have heard of Dijkstra’s algorithm.
      • It is a method for finding the shortest path between two nodes on a Graph.
      • This is a great optimization technique, but what if you need to find the longest path?
      • What “Edge_Label” has the most influence on my organization?
      • Iterate through each Edge_Label, create a subgraph that consists of only the nodes this Edge_Label touches, then calculate the diameter of that subgraph.
      • The Edge_Label whose subgraph has the largest diameter has the most “impact” on your organization.
      • This is mostly applied to Data Structure Graph Level 2.
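The Edge_Label procedure on this slide can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the author's code: the edge list is hypothetical, each subgraph keeps only the edges carrying that label, edges are treated as undirected, and the diameter is the longest shortest path between vertices that can reach each other.

```python
from collections import defaultdict, deque

# Hypothetical labelled data transfers: (source, target, edge_label).
EDGES = [
    ("CRM", "Staging", "customer"),
    ("Staging", "Warehouse", "customer"),
    ("Warehouse", "Mart", "customer"),
    ("Mart", "Dashboard", "customer"),
    ("Billing", "Staging", "invoice"),
    ("Staging", "Warehouse", "invoice"),
]

def eccentricity(adjacency, start):
    """Largest breadth-first hop count from start to any reachable vertex."""
    distance = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in adjacency[node]:
            if neighbour not in distance:
                distance[neighbour] = distance[node] + 1
                queue.append(neighbour)
    return max(distance.values())

def diameter_by_label(edges):
    """Diameter of the undirected subgraph formed by each label's edges."""
    subgraphs = defaultdict(lambda: defaultdict(set))
    for source, target, label in edges:
        subgraphs[label][source].add(target)
        subgraphs[label][target].add(source)
    return {
        label: max(eccentricity(adjacency, node) for node in list(adjacency))
        for label, adjacency in subgraphs.items()
    }

diameters = diameter_by_label(EDGES)
print(diameters)
print(max(diameters, key=diameters.get))
```

In this toy data the "customer" entity travels through more applications than "invoice", so it comes out as the label with the largest diameter.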
  • 35. Now let’s answer some questions.
      • Which tables are “most important” to import when building a data warehouse?
      • The tables with the higher centrality measures.
      • For an operational system these will also be the tables that have the most queries written against them.
      • These will be your bottlenecks for any system.
      • Is this data model optimized for reading or writing?
      • What is the density of the data model?
      • A higher-density model is optimized for writes; a lower-density model is optimized for reads.
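Density here is the standard graph measure: the number of edges present divided by the number of edges that could exist. A minimal Python sketch follows; treating foreign keys as directed edges is an assumption of this example.

```python
def density(vertex_count, edge_count, directed=True):
    """Fraction of possible edges that are present in the graph."""
    if vertex_count < 2:
        return 0.0
    possible = vertex_count * (vertex_count - 1)
    if not directed:
        possible //= 2
    return edge_count / possible

# Hypothetical data model: 4 tables joined by 3 foreign keys.
print(density(4, 3))
print(density(4, 3, directed=False))
```

Gephi reports the same figure in its statistics panel, so this is mainly useful for checking a number by hand or computing it outside the tool.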
  • 36. Barabási-Albert model and scale-free networks.
      • Preferential attachment.
      • There are a few different models available for analysis and prediction of networks.
      • A Barabási-Albert model can be summarized as a “rich get richer” model. In other words, the more connected a node already is, the more likely newly added nodes are to connect to it.
      • This sounds suspiciously similar to our data modeling concepts related to conformed dimensions.
      • My suspicion is there are many data models that fit this model.
      • Please send me some anonymized data models. I want to research this more.
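The “rich get richer” rule is easy to simulate. The sketch below is a simplified preferential-attachment generator in plain Python, offered as an illustration rather than a reference implementation; the function name, the fully connected starting core, and the parameter values are all assumptions of this example.

```python
import random
from collections import Counter

def preferential_attachment(total_nodes, links_per_node, seed=0):
    """Grow a graph where new nodes prefer already well-connected nodes.

    Assumes links_per_node >= 1 and total_nodes > links_per_node.
    """
    rng = random.Random(seed)
    # Start from a small fully connected core of links_per_node + 1 nodes.
    core = range(links_per_node + 1)
    edges = [(a, b) for a in core for b in core if a < b]
    # Each node appears here once per edge endpoint, so a uniform draw
    # from this list picks nodes in proportion to their degree.
    endpoints = [node for edge in edges for node in edge]
    for new_node in range(links_per_node + 1, total_nodes):
        targets = set()
        while len(targets) < links_per_node:
            targets.add(rng.choice(endpoints))
        for target in sorted(targets):
            edges.append((new_node, target))
            endpoints.extend((new_node, target))
    return edges

edges = preferential_attachment(total_nodes=50, links_per_node=2)
degree = Counter(node for edge in edges for node in edge)
print(degree.most_common(5))
```

Comparing the degree distribution of a real data model against a simulated graph of the same order and size is one way to test the suspicion raised on this slide.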
  • 37. Some theoretical thoughts.
      • Let’s assume we have an equation for the growth of every table we have collected from our little topological study above (more on this in a couple of slides).
      • Let us further assume we have a graph of the same tables.
      • Can you do anything interesting with this?
      • The derivative of each equation shows us the growth rate of the table.
      • What happens if we plug that derivative into the entropy equation for the graph?
      • What would this represent?
      • Could this be considered a valuation method?
      • A way to put a dollar value on a data model?
      • If you try it, let me know what you find out.
  • 38. Apply the theory.
      • Using a few metrics from each table we can do some clustering.
      • Take the number of columns of a table, its centrality measure, and its growth rate, and you have a vector for each table.
      • Doing some simple cosine similarity on these vectors will tell you mathematically which tables are similar.
      • Is this finding consistent with expectations?
      • If not, should the model be adjusted?
      • What does this result say to you?
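As a small worked example of the comparison described on this slide, the Python sketch below computes cosine similarity between per-table vectors of (column count, centrality, growth rate). The table names and numbers are made up for illustration.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical tables: (column count, centrality, rows added per day).
tables = {
    "customer": (12, 0.40, 150.0),
    "orders": (9, 0.35, 160.0),
    "audit_log": (4, 0.02, 90000.0),
}

names = sorted(tables)
for i, first in enumerate(names):
    for second in names[i + 1:]:
        score = cosine_similarity(tables[first], tables[second])
        print(f"{first} vs {second}: {score:.4f}")
```

One practical caveat: when the features sit on very different scales, as growth rate does here, the largest one dominates the result, so rescaling each feature before comparing is usually worthwhile.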
  • 39. Deriving the growth rate of each table.
      • Little R demonstration to follow.
      • Using a design methodology like the data vault mandates that every table have date timestamps for when the data is loaded.
      • Collect how many records are loaded per day.
      • A calculation that represents the growth formula for each table can be derived with R.
      • Using the growth rate, centrality, and the width of a table (number of columns), you can do cosine similarity to determine the tables that are mathematically similar to each other.
      • Using this information you may be able to reallocate the infrastructure that the data warehouse sits on.
      • Is every table stored on the same disk storage media? Does it need to be?
      • How about caching? Using these metrics alone you can make a well-informed decision about your storage platform.
      • The following image is a small topological representation of this process.
      • This is still slightly theoretical, and I welcome having a conversation with anyone that may want to know more.
      • Again, send me anonymized data, hopefully along with the Data Structure Graph you generated from your data.
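The talk derives the growth formula in R, and that demonstration is not reproduced here. As a stand-in, the Python sketch below fits a straight line to the cumulative row count per day and reports its slope as the growth rate. The choice of a linear model is an assumption; a table that grows exponentially would need a different fit.

```python
def growth_rate(daily_loads):
    """Least-squares slope of cumulative row count against day index."""
    cumulative = []
    total = 0
    for loaded in daily_loads:
        total += loaded
        cumulative.append(total)

    n = len(cumulative)
    if n < 2:
        raise ValueError("need at least two days of load counts")

    mean_x = (n - 1) / 2
    mean_y = sum(cumulative) / n
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(cumulative))
    denominator = sum((x - mean_x) ** 2 for x in range(n))
    return numerator / denominator

# Hypothetical rows loaded per day, taken from load timestamps.
print(growth_rate([100, 120, 90, 110, 105]))
```

The resulting number is the third component of the per-table vector used for the cosine similarity comparison.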
  • 40. This is what the topology may look like.
  • 41. Consider the following:
      • If you need assistance, contact me directly (I am easy to find @dougneedham).
      • Network/Graph Analysis is cool.
      • It can show you some interesting things about your data that you may not have considered.
  • 42. What did I leave out?
      • Graphs that change over time: what happens when you remove a single Edge or Vertex?
      • Comparing two networks: if you have the same number of edges and nodes, are two graphs the same?
      • Contagion: how will data spread through the network? (Since a DSG represents different types of Edges based on Edge_Label, Contagion should not affect the entire network.) This is also commonly known as data lineage. If you don’t have a tool that does it, with a bit of metadata management this can be derived from a Data Structure Graph Level 2.
  • 43. Other Analysis
      • What else can be done with Social Network Analysis?
      • How about risk exposure to banks?
      • http://www.federalreserve.gov/newsevents/speech/yellen20130104a.htm
  • 45. One other cool bit of Math
      • How many reports can your dimensional data model support?
      • Do you have the situation where people want to create a project out of a report, rather than do a proper data model design up front?
      • Here is some help.
      • The upper bound of the total number of reports that a conformed dimension data model can support is calculated by:
      • Calculating the number of selectable column combinations in each dimension (2^c − 1, where c is the number of columns in the dimension).
      • Creating the adjacency matrix for the dimensions to facts.
      • A bit of multiplication.
      • More details here: http://bit.ly/MeasuringDimensionalModels
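The slide leaves the "bit of multiplication" to the linked article, so the Python sketch below is only one plausible reading, not the author's formula. It counts the non-empty column selections per dimension (2^c − 1), then, for each fact, multiplies the counts of the dimensions attached to it and sums across facts. The model, the names, and the way the counts are combined are all assumptions.

```python
def selectable_combinations(column_count):
    """Number of non-empty subsets of a dimension's columns: 2^c - 1."""
    return 2 ** column_count - 1

# Hypothetical conformed-dimension model.
dimension_columns = {"date": 3, "customer": 4, "product": 2}

# Adjacency of facts to dimensions, written as a mapping instead of a matrix.
fact_dimensions = {
    "sales": ["date", "customer", "product"],
    "returns": ["date", "product"],
}

def report_upper_bound(facts, columns):
    """Assumed combination rule: per-fact product of counts, summed over facts."""
    total = 0
    for dimensions in facts.values():
        per_fact = 1
        for dimension in dimensions:
            per_fact *= selectable_combinations(columns[dimension])
        total += per_fact
    return total

print(report_upper_bound(fact_dimensions, dimension_columns))
```

Under this reading a report picks at least one column from every dimension attached to a fact; relaxing that rule changes the bound, which is why the linked article remains the place to check the intended calculation.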
  • 46. Graphs are Cool!
      • Help me.
      • Please send me anonymized data.
      • In order to present more about how the mathematics of Graph theory and social network analysis can be applied to data modeling in general, I need more data.
      • This is a fascinating topic. If you want to reach out to me directly, I can be reached at: dougthedataguy@gmail.com
      • Here is my GitHub for the code and data from the book, and examples: http://bit.ly/DataStructureGraph_github
  • 47. https://dougneedham.shinyapps.io/DataStructureGraph Hard to see, I know, but the top diagram is the “master graph”, the bottom image is a single Edge_Label. You can see how an individual data entity flows through an organization.
  • 48. My book goes through a number of examples of doing a Graph analysis of a fictional organization.
  • 49. Final Thoughts – Questions?

Editor's notes

  1. This is an overview of what I call Data Structure Graphs given at the Data Modeling Zone conference in October of 2017.
  2. Let me introduce myself. I have an incredibly traditional background. I have certifications in Oracle and SQL Server, I started my career as a mainframe DBA.