RECONSTRUCTING PRIMARY INFORMATION FROM SECONDARY INFORMATION
BY DREW J. LIPMAN, PH.D.
HYPERGIANT.COM
DATA SCIENTISTS SPEND 60% OF THEIR TIME CLEANING AND ORGANIZING DATA.1

As part of this, they engage in something known as feature
engineering: the process of using existing, known information or
features to extract new information or features about particular
data.2

A classic example is extracting the age or gender of the Titanic’s
passengers from their salutations. This information (age and gender)
is, generally, not included in the passenger manifest, but it becomes
very useful when trying to predict whether a particular passenger
survived or calculate overall survival rates.

1 Press, Gil. “Cleaning Big Data: Most Time-Consuming, Least Enjoyable Data Science Task, Survey Says.” Forbes.com. 2016. https://www.forbes.com/sites/gilpress/2016/03/23/data-preparation-most-time-consuming-least-enjoyable-data-science-task-survey-says/#76969b3d6f63.
2 Microsoft Azure. “Feature engineering in data science.” Microsoft.com. 2017. https://docs.microsoft.com/en-us/azure/machine-learning/team-data-science-process/create-features.
FEATURE ENGINEERING
Feature engineering tends to be applied in two versions. One version
is deemed forward construction, meaning that the extracted features
are directly derived from the original information. In other words,
all of the information needed for the new feature is contained in the
original feature. (Extracting age from a date of birth is a clear
example of this.)
The other version of feature engineering – and the focus of our
exploration here – is backwards construction. This is formally known
as an inverse problem, a case in which the features originally used
to construct the given information are recovered. For example, each
passenger’s gender and age were used to establish the correct
salutation for addressing them on the Titanic. The inverse problem is
to take the resulting salutation – a form of secondary information
– and compute the inputted age and gender – a form of primary
information – from which it was derived.
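The two versions can be contrasted in a short Python sketch. Everything in it is an assumption made for illustration: the function names, the age cutoff of 18, and the simplified rule that maps age and gender to a salutation (real salutations also depend on marital status). The point is only that the forward direction returns one value, while the inverse direction returns a set of candidates.

```python
from datetime import date

ADULT_AGE = 18  # hypothetical cutoff, an assumption for illustration only


def age_on(birth: date, today: date) -> int:
    """Forward construction: age is fully determined by the date of birth."""
    had_birthday = (today.month, today.day) >= (birth.month, birth.day)
    return today.year - birth.year - (0 if had_birthday else 1)


def salutation(age: int, gender: str) -> str:
    """Simplified, hypothetical forward rule from (age, gender) to a salutation."""
    if gender == "male":
        return "Mr." if age >= ADULT_AGE else "Master"
    return "Mrs." if age >= ADULT_AGE else "Miss"


def inverse_salutation(title: str, ages=range(0, 100)):
    """Inverse problem: every (gender, age) pair that the rule maps to `title`."""
    return {(g, a) for g in ("male", "female") for a in ages
            if salutation(a, g) == title}


# The salutation pins down the gender exactly, but only an age range.
candidates = inverse_salutation("Master")
genders = {g for g, _ in candidates}
ages = sorted(a for _, a in candidates)
print(genders, ages[0], ages[-1])
```

Under these assumed rules the inverse does not return a single passenger profile; it narrows the primary information to a set, which is the pattern the graph examples later in the paper follow.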
BACKWARDS CONSTRUCTION IS A FORM OF FEATURE
ENGINEERING THAT USES SECONDARY INFORMATION
(INFORMATION DERIVED FROM PRIMARY FEATURES)
TO RECOVER MISSING, PRIMARY FEATURES.
THE TOOLS
While the framework of an inverse problem is very general, different
problems produce distinct difficulties. Some of the new, recommended
tools for this form of feature engineering are graph databases
or knowledge graphs. These can model how often datapoint features
interact or how a single feature interacts across a database.
Across the examples in the following pages, graph databases (GDBs)
will be modeled using vertex and arc-labeled directed graphs
(digraphs): more specifically, a primary digraph and a secondary
digraph. The vertices (and their weights) of the secondary digraph are
the same as those of the primary digraph, while the arcs (and their
weights) of the secondary digraph are derived from those of the
primary digraph. Using this language, the feature engineering problem
is described as follows: Given the secondary digraph – and,
potentially, parts of the primary digraph – extract useful information
and features about the primary digraph that were previously unknown.
This framework opens several questions, not least of which is
determining when it is feasible to extract a particular feature: arc
weights or the arcs themselves. Some variations of this problem
(all of which can be extended to include partial, primary
information) are:
3
“Directed graph.” Wikipedia.com. https://en.wikipedia.org/wiki/Directed_graph.
What is the minimum amount of secondary digraph features needed
to extract a particular feature?
What features can be extracted given a set of secondary features?
How dependent are the resulting methods on the topologies of the
chosen GDBs?
Where complete information about a feature cannot be extracted,
can useful, partial information be extracted instead?
Three example problems will be used to highlight the various
difficulties and useful methods that apply to recovering a primary graph.
Note that if a feature is symmetric, then the relations in our
visualizations will be represented as edges without direction. That is,
if the order of a relation between two vertices does not matter, then
the relation will be represented by a line rather than an arrow.
EXAMPLE PROBLEMS
FIRST EXAMPLE PROBLEM
The primary information here is the Parents Graph. That is, we are
given a set of people going back several generations, and we have
an arc from a vertex x to another vertex y, which will be denoted
as (x,y), where x is a parent of y. The derived information is the
Cousins Graph, which is a graph with an edge (x,y) if x and y are
first cousins. Equivalently, there is a path in the Parents Graph
with arcs (a,x),(b,a),(b,c),(c,y). That is, there is an edge in the
Cousins Graph if and only if x and y have a common grandparent, b.
An example of a Parents Graph and its associated Cousins Graph. Note
that x and y are cousins if and only if they share a grandparent. In
this case, they share b. Note that the relation between x and y is
symmetric and, therefore, lacks arrows.
[Figure: Primary: Parents Graph with arcs (b,a), (b,c), (a,x), (c,y), each labeled PARENT. Secondary: Cousins Graph on the same vertices with the single edge (x,y), labeled COUSINS.]
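The derivation of the Cousins Graph from the Parents Graph can be sketched in Python as follows. The function name `cousins_graph` and the representation of arcs as `(parent, child)` tuples are assumptions for illustration. The sketch also assumes, following the path (a,x),(b,a),(b,c),(c,y), that the two parents a and c are distinct, so siblings are not reported as cousins.

```python
def cousins_graph(parent_arcs):
    """Return the edges (x, y), with x < y, of the Cousins Graph.

    parent_arcs is an iterable of (parent, child) arcs of the Parents Graph.
    x and y are joined when a parent of x and a different parent of y
    share at least one parent, that is, a common grandparent b.
    """
    parents = {}
    for parent, child in parent_arcs:
        parents.setdefault(child, set()).add(parent)

    edges = set()
    for x in parents:
        for y in parents:
            if x < y and any(
                a != c and parents.get(a, set()) & parents.get(c, set())
                for a in parents[x]
                for c in parents[y]
            ):
                edges.add((x, y))
    return edges


# The Parents Graph from the first example problem.
example_arcs = [("b", "a"), ("b", "c"), ("a", "x"), ("c", "y")]
print(cousins_graph(example_arcs))
```

Only x and y have grandparents in this graph, and they share b through different parents, so the result is the single edge between them.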
SECOND EXAMPLE PROBLEM
The primary information here is the Maternal Graph. This is similar
to the Parents Graph, but now each vertex has at most one parent:
their mother. The secondary information is the Maternal Cousins
Graph, which has an edge (x,y) whenever the Maternal Graph contains
the arcs (a,x),(b,a),(b,c),(c,y). That is, the Maternal Cousins Graph
has an edge (x,y) if and only if x and y have the same maternal
grandmother, b.
An example of a Maternal Graph and its Maternal Cousins Graph. Note
that x and y are maternal cousins if and only if they share the same
maternal grandmother, b. Note that the relation between x and y is
symmetric and, therefore, lacks arrows.
[Figure: Primary: Maternal Graph with arcs (b,a), (b,c), (a,x), (c,y), each labeled MOTHER. Secondary: Maternal Cousins Graph on the same vertices with the single edge (x,y), labeled MATERNAL COUSINS.]
THIRD EXAMPLE PROBLEM
The primary information here is the Handshake Graph – a timestamped
graph of handshakes among a collection of people that has an edge
between x and y with weight t if x and y shook hands at time t. The
secondary information is known as the Influence Digraph, which has
an arc (x,y) if there is a path (of arbitrary length) connecting x
to y along which the weights (timestamps) increase from each edge to
the next. For simplicity, we make the assumption that no person is
shaking two hands at the same time. Note that the Influence Digraph
can be used to model potential pathogen spread across a group of
people: an arc (x,y) indicates that y could be infected by a pathogen
originating at x. The lack of such an arc indicates the impossibility
of a contact pathogen spreading from x to y.
An example of a timestamped Handshake Graph and its Influence
Digraph. Note that the arc (x,z) exists since there is a path xy,yz
with increasing timestamps 1 and 2 that allows x to influence z.
[Figure: Primary: Handshake Graph on vertices x, y, z with edge xy at time 1 and edge yz at time 2. Secondary: Influence Digraph on the same vertices.]
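One way to compute the Influence Digraph from a Handshake Graph is to scan the handshakes in time order once per starting person and record who can be reached. The Python sketch below does this; the function name `influence_digraph` and the `(u, v, t)` tuple format are assumptions for illustration, and the code relies on the stated assumption that no person shakes two hands at the same time.

```python
def influence_digraph(vertices, handshakes):
    """Return the arcs (x, y) such that a path with increasing timestamps leads from x to y.

    handshakes is an iterable of (u, v, t): u and v shook hands at time t.
    """
    ordered = sorted(handshakes, key=lambda edge: edge[2])
    arcs = set()
    for source in vertices:
        # arrival[p] is the earliest time at which `source` can have influenced p.
        arrival = {source: float("-inf")}
        for u, v, t in ordered:
            # A handshake is undirected, so try both directions.
            for a, b in ((u, v), (v, u)):
                if a in arrival and arrival[a] < t and b not in arrival:
                    arrival[b] = t
        arcs.update((source, target) for target in arrival if target != source)
    return arcs


# The example above: x and y shake hands at time 1, y and z at time 2.
example = influence_digraph(["x", "y", "z"], [("x", "y", 1), ("y", "z", 2)])
print(sorted(example))
```

In this example z cannot influence x, because the only route back uses the handshake at time 2 followed by the one at time 1, which is why the Influence Digraph is directed even though handshakes are symmetric.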
Across all of these examples, it is assumed that a single piece of
information is missing: an arc from the Parents Graph, an arc from
the Maternal Graph, and a timestamp from the Handshake Graph. It is
also assumed that all of the information from the Cousins Graph,
Maternal Cousins Graph, and Influence Digraph is known. Note that there
is also a graph topology implied by the structure; namely, the
assumption that each vertex on the Maternal Graph has at most one arc
coming in.
A LITTLE BACKGROUND IN GRAPH THEORY
In graph theory, a class of problems similar to feature engineering
exists. These are known as graph reconstruction problems and,
in this area, there are long-standing, open conjectures about graph
reconstruction.
Starting with Kelly and Ulam’s reconstruction theory, there are many
questions and results about reconstructing and rebuilding original
graphs from some other data originally derived from them.4 For
example, in Ulam’s reconstruction problem, the secondary information is
a collection of (unlabeled) graphs called the deck, with each graph
in the deck produced by deleting a distinct vertex from the original
graph.
The conjecture is that this secondary information uniquely
determines the graph. That is, given two graphs with at least three
vertices, if the decks are the same, then the graphs are the same (up
to isomorphism). This conjecture can be stated as: given the secondary
graphs, the unique, correct, primary graph can be constructed.
4 Farhadian, Ameneh. A Simple Explanation for the Reconstruction of Graphs. 2017. https://arxiv.org/pdf/1704.01454.pdf.
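The deck can be made concrete with a small Python sketch. The function names and the brute-force canonical form (trying every relabeling, which is only practical for very small graphs) are assumptions for illustration rather than part of the reconstruction literature. The sketch shows a deck separating a path from a triangle, and also shows that the two graphs on two vertices have the same deck, which is why the conjecture excludes very small graphs.

```python
from collections import Counter
from itertools import permutations


def canonical(vertices, edges):
    """Label-independent form of a small undirected graph (brute force)."""
    names = sorted(vertices)
    best = None
    for perm in permutations(range(len(names))):
        relabel = dict(zip(names, perm))
        form = tuple(sorted(tuple(sorted((relabel[u], relabel[v]))) for u, v in edges))
        if best is None or form < best:
            best = form
    return (len(names), best)


def deck(vertices, edges):
    """Multiset of unlabeled vertex-deleted subgraphs (the cards of the deck)."""
    cards = Counter()
    for removed in vertices:
        rest = [v for v in vertices if v != removed]
        kept = [(u, v) for u, v in edges if removed not in (u, v)]
        cards[canonical(rest, kept)] += 1
    return cards


path = deck([1, 2, 3], [(1, 2), (2, 3)])
triangle = deck([1, 2, 3], [(1, 2), (2, 3), (1, 3)])
relabeled_path = deck(["a", "b", "c"], [("b", "a"), ("a", "c")])

print(path == triangle)        # different graphs, different decks
print(path == relabeled_path)  # same graph under another labeling, same deck
print(deck([1, 2], [(1, 2)]) == deck([1, 2], []))  # two vertices: decks coincide
```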
Since Ulam’s reconstruction conjecture was first stated in 1957 (by
Kelly), many partial results have been found, and similar questions
have been posed using differently-derived information. This includes
Harary’s variation for edge deletion, where the edge-deck is the
collection of all subgraphs formed by deleting an edge.5
Similarly, some classes of graphs – Regular Graphs, Trees,
Disconnected Graphs, Maximal Planar Graphs, and Outerplanar Graphs, in
particular – have been shown to be reconstructable. In the setting
of directed graphs, there are infinite families of non-reconstructable
graphs, including infinite families of tournaments.
While a wide range of questions involving the recovery of original
graphs from secondary information have been posed, they, for
the most part, cannot be applied to GDBs as they either only apply
to unlabeled, unweighted, and undirected graphs (as in the original
reconstruction conjecture) or rely on properties that generally do not
appear in field conditions.
This presents a large number of open problems based on reconstructing
the primary GDB from secondary information – in particular,
reconstructing labeled, edge-weighted digraphs from secondary
digraphs.
5
Harary, F., “On the reconstruction of a graph from a collection of subgraphs.” Theory of Graphs and its
Applications (Proc. Sympos. Smolenice, 1963). Publ. House Czechoslovak Acad. Sci., Prague, 1964, pp. 47–52.
UNIQUENESS AND PROBABILISTIC APPROACHES
As with many classical graph reconstruction problems, particular
instances are often intractable. In particular, there are examples
of Parents Graphs that have the same derived Cousins Graph and are
the same after deleting one directed edge. Consider the example
Parents Graphs with vertex set {a,b,c,d,e,f}. The first one has arc set
{(a,c),(c,e),(b,d),(d,f),(b,c)} and the second has arc set
{(a,c),(c,e),(b,d),(d,f),(a,d)}. Note that the only difference is that
the first Parents Graph has arc (b,c), while the second has (a,d).
Their Cousins Graphs, however, are the same – both have the unique
edge (e,f).
In this example, we see two Parents Graphs that differ only in their
arcs (a,d) and (b,c). The Cousins Graphs for each, however, are
identical. This means that – even when all of the arcs except (a,d)
and (b,c) are known – one cannot tell which of these two Parents
Graphs generated the Cousins Graph.
[Figure: two primary Parents Graphs on vertices a, b, c, d, e, f, differing only in the arc (b,c) versus (a,d), and their shared secondary Cousins Graph with the single edge (e,f).]
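The non-uniqueness in this example can be checked mechanically. The Python sketch below recomputes the Cousins Graph for both arc sets; the helper `cousins_graph` is an illustrative assumption (it joins x and y when two different parents of theirs share a parent), not code from the paper.

```python
def cousins_graph(parent_arcs):
    """Edges (x, y), x < y, for pairs that share a grandparent via different parents."""
    parents = {}
    for parent, child in parent_arcs:
        parents.setdefault(child, set()).add(parent)
    return {
        (x, y)
        for x in parents
        for y in parents
        if x < y and any(
            a != c and parents.get(a, set()) & parents.get(c, set())
            for a in parents[x]
            for c in parents[y]
        )
    }


# Arcs common to both Parents Graphs, then the one arc in which they differ.
common = [("a", "c"), ("c", "e"), ("b", "d"), ("d", "f")]
first = common + [("b", "c")]
second = common + [("a", "d")]

print(cousins_graph(first))
print(cousins_graph(second))
print(cousins_graph(common))
```

Both full graphs yield the same single edge between e and f, while the common arcs alone yield no edge at all. The Cousins Graph therefore reveals that an arc is missing but cannot say which of the two it is.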
[Figure: two candidate Maternal Graphs on vertices w, x, y, z: the directed path w, x, y, z and the directed path y, z, w, x.]
In this example, even the additional information that w, x, y, and z
are all directly related is not enough to determine the exact
maternal relations. Without more information, it is impossible to
determine whether x is the mother of y or z is the mother of w.
While this example shows that recovering the missing edge is
fundamentally intractable, useful information can still be extracted
from these relationships. For example, if the cousin relationships
are rich enough, then there is a finite number of potential locations
for the missing arc in the Parents Graph. That is, a small number of
potentials can be determined for further querying, which can be very
useful if the tests used are expensive.
For instance, constructing the Cousins Graph of the Parents Graph
with the deleted edge restricts the location of the missing edge to
a subgraph of the Parents Graph related to the vertices with the
missing relationship. Similarly, in the Handshake Graph, the missing
timestamp can take on a (potentially unbounded) range of values. But
if the Influence Digraph contains enough arcs that used the edge with
the missing timestamp, then the interval of values can be bounded.
An easily understandable example of the difficulty lies with the
Maternal Graph. Suppose the true Maternal Graph has vertices {w,x,y,z}
and arcs {(w,x),(x,y),(y,z)}, but the arc (x,y) is missing from the
data. Note that the Maternal Cousins Graph is a graph with no edges.
This means that, even with the knowledge of what the Maternal Graph
looks like (a directed path), without information from outside of the
model (say, dates of birth) one cannot tell if the missing arc is
(x,y) or (z,w), as both produce isomorphic Maternal Graphs and
identical Maternal Cousins Graphs.
Alternatively, consider a true Maternal Graph with vertices
{v,w,x,y,z}, arcs {(v,w),(w,x),(v,y),(y,z)}, and the missing arc
(v,w). The Maternal Cousins Graph, in this case, has one edge,
(x,z). Given that this is (as mentioned) a Maternal Graph, it is
certain that each vertex has at most one arc coming in, which means
that the missing arc is either incoming to v or w – there are no
other options. However, only (v,w) will produce the edge (x,z) in
the Maternal Cousins Graph.
In this example, the presence of a cousin relationship between x and
z in the Maternal Cousins Graph implies that x and z have the same
maternal grandmother. Since v is the maternal grandmother of z and w
is the mother of x, one can conclude that v is the mother of w.
[Figure: Primary: Maternal Graph with arcs (v,w), (w,x), (v,y), (y,z). Secondary: Maternal Cousins Graph on the same vertices with the single edge (x,z).]
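Both Maternal Graph examples can be reproduced by brute force: try every arc that respects the topology (at most one mother per vertex, no cycles) and keep those that reproduce the observed Maternal Cousins Graph. The Python sketch below does this; all function names, and the optional `extra_ok` hook for additional topology knowledge, are assumptions for illustration.

```python
from itertools import combinations


def maternal_cousins(arcs):
    """Edges (x, y), x < y, for pairs with different mothers but the same maternal grandmother."""
    mother = {child: m for m, child in arcs}
    return {
        (x, y)
        for x, y in combinations(sorted(mother), 2)
        if mother[x] != mother[y]
        and mother[x] in mother
        and mother[mother[x]] == mother.get(mother[y])
    }


def is_acyclic(arcs):
    """True if following mother links upward never revisits a vertex."""
    mother = {child: m for m, child in arcs}
    for start in mother:
        seen, v = {start}, mother[start]
        while v in mother:
            if v in seen:
                return False
            seen.add(v)
            v = mother[v]
    return True


def candidate_missing_arcs(vertices, known_arcs, observed, extra_ok=lambda arcs: True):
    """Arcs (mother, child) whose addition is consistent with everything known."""
    has_mother = {child for _, child in known_arcs}
    found = []
    for child in sorted(set(vertices) - has_mother):
        for m in sorted(vertices):
            arcs = list(known_arcs) + [(m, child)]
            if (m != child and is_acyclic(arcs) and extra_ok(arcs)
                    and maternal_cousins(arcs) == observed):
                found.append((m, child))
    return found


def is_directed_path(arcs):
    """Extra topology knowledge: nobody has more than one child."""
    mothers = [m for m, _ in arcs]
    return len(mothers) == len(set(mothers))


# Five-vertex example: the cousin edge (x, z) singles out the arc (v, w).
unique = candidate_missing_arcs(
    ["v", "w", "x", "y", "z"], [("w", "x"), ("v", "y"), ("y", "z")], {("x", "z")})

# Four-vertex path example: two candidates remain even when the path shape is known.
ambiguous = candidate_missing_arcs(
    ["w", "x", "y", "z"], [("w", "x"), ("y", "z")], set(), extra_ok=is_directed_path)

print(unique, ambiguous)
```

The exhaustive search is quadratic in the number of vertices per missing arc, so it is meant as a check on small examples rather than as a method for large graph databases.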
Given that these problems generally do not have a single solution,
when are the solutions unique? And, if they are not, then can useful
information still be extracted? Take the example of the Maternal
Graph: If the Maternal Cousins Graph has a rich enough structure,
then the possible locations of the missing arc are limited. More
specifically, if the Maternal Cousins Graph is constructed without
the missing arc and the edge (x,y) from the original Maternal
Cousins Graph does not appear in the new one, then it is clear that the
arc needed to produce this cousin relationship is not there. The
arc that models the mother of x, grandmother of x, mother of y, or
grandmother of y is, indeed, missing. But given that exactly one arc
is missing, data scientists can be sure that x or y will have their
grandmother known; thus, the exact arc that is missing in relation
to x and y can be deduced.
For example, if the arc modeling the mother of x is missing, then
one knows that it has to be a daughter of the grandmother of y, but
not y’s mother (i.e., an aunt of y). The secondary graph, in this
case, provides enough information so that – in conjunction with the
known topology of the Maternal Graph – the missing arc can be
limited to a relatively small set of possibilities. Therefore, an
(almost) unique solution is derived due to the topology of the Maternal
Graph and the rich structure of the Maternal Cousins Graph.
Similarly, if the timestamp from an edge (x,y) in the Handshake
Graph is missing, but it is used on a path from x to a vertex w –
that is, the edge (x,y) is followed by a path from y to w – then an
upper bound can be attached to the missing timestamp. If all of the
paths from y to w where the timestamps increase (producing arcs in the
Influence Digraph) are listed, then the timestamp of (x,y) must be
smaller than the largest initial timestamp. That is, if a unique
path from y to w exists and the first edge has timestamp t, then the
only way the path from x to w can use the edge (x,y) is if the
missing timestamp is smaller than t. Similarly, the missing timestamp
can be bounded below if it is the last edge on a path with
increasing timestamps. While this does not, in general, provide an
exact value, it does provide an interval in which the handshake
occurred. And, if additional information (such as all timestamps
stored as integer values) is known, then a specific value may, in
fact, be extracted.
[Figure: Primary: Handshake Graph on vertices a, x, y, w with timestamps 1, 2, 4 and one unknown timestamp (?). Secondary: Influence Digraph on the same vertices.]
In this figure, x and y shook hands at an unknown time t, while the
Influence Digraph includes arcs (x,w) and (a,x). So, while the
exact time of the handshake remains unknown, the time can still be
bounded. Two paths (with increasing times) exist from a to x, which
shows that (at the earliest) t happened after time 1. The two
paths from x to w show that, at the latest, t happened before time
4. Therefore, t lies between 1 and 4. If the further assumption that
all interactions take place at integer times is applied, then one
can safely say that t is either 2 or 3.
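The bounding argument can be sketched in Python by testing candidate integer timestamps and keeping those that reproduce the known Influence Digraph. The edge set below is hypothetical: it was chosen to be consistent with the bounds described in the caption and is not a transcription of the figure. The function names and the candidate range 0 to 6 are also assumptions for illustration.

```python
def influence_digraph(vertices, handshakes):
    """Arcs (x, y) such that a path with increasing timestamps leads from x to y."""
    ordered = sorted(handshakes, key=lambda edge: edge[2])
    arcs = set()
    for source in vertices:
        arrival = {source: float("-inf")}
        for u, v, t in ordered:
            for a, b in ((u, v), (v, u)):
                if a in arrival and arrival[a] < t and b not in arrival:
                    arrival[b] = t
        arcs.update((source, target) for target in arrival if target != source)
    return arcs


def feasible_timestamps(vertices, known, missing, observed, candidates):
    """Candidate times for the `missing` edge that reproduce the observed digraph."""
    x, y = missing
    # Nobody shakes two hands at once, so times already used by x or y are excluded.
    busy = {t for u, v, t in known if {u, v} & {x, y}}
    return [
        t for t in candidates
        if t not in busy
        and influence_digraph(vertices, known + [(x, y, t)]) == observed
    ]


people = ["a", "x", "y", "w"]
known = [("a", "y", 1), ("y", "w", 4)]  # hypothetical known handshakes

# Simulate the known secondary information from a hidden true time of 2.
observed = influence_digraph(people, known + [("x", "y", 2)])

feasible = feasible_timestamps(people, known, ("x", "y"), observed, range(0, 7))
print(feasible)
```

With these assumed handshakes, the arc from a to x forces the unknown time to be later than 1 and the arc from x to w forces it to be earlier than 4, so only the integer times 2 and 3 survive.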
FUTURE DIRECTIONS
Backwards construction in feature engineering offers great potential
for extracting new features, determining when a unique answer can be
extracted, formalizing the information required to extract features,
and providing meaningful possibilities when the answer to a problem
is not unique. Future directions of inquiry, for those interested in
broadening their knowledge of this and other approaches, include:
A more robust understanding of when a unique solution can be
extracted;
Studying the relationship between the definition of secondary
information, the size and location of potential answers, and how
both of these relate to multiple, missing features;
Studying the existence of multiple sources of secondary
information;
And determining a metric for how robust secondary information is
in relation to recovering missing, primary information.
ABOUT THE AUTHOR
Dr. Drew Lipman is head of the data science department at
Hypergiant. He has a doctorate in discrete mathematics focusing on how to
transform difficult algebraic geometric problems into easy-to-compute
graph theory parameters.
Dr. Lipman’s recent work includes building a machine intelligent
mixologist for TGI Fridays and building the world’s first
personalized health platform.
DREW J. LIPMAN, PH.D.
drew@hypergiant.com
Charbagh + Female Escorts Service in Lucknow | Starting ₹,5K To @25k with A/C...
 
Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Research
 
Lecture_2_Deep_Learning_Overview-newone1
Lecture_2_Deep_Learning_Overview-newone1Lecture_2_Deep_Learning_Overview-newone1
Lecture_2_Deep_Learning_Overview-newone1
 
Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...
 
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
 
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
 
Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...
Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...
Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...
 

Reconstructing Primary Information from Secondary Information

  • 2. Data scientists spend 60% of their time cleaning and organizing data.[1] As part of this, they engage in something known as feature engineering: the process of using existing, known information or features to extract new information or features about particular data.[2] A classic example is extracting the age or gender of the Titanic’s passengers from their salutations. This information (age and gender) is generally not included in the passenger manifest, but it becomes very useful when trying to predict whether a particular passenger survived or to calculate overall survival rates. [1] Press, Gil. “Cleaning Big Data: Most Time-Consuming, Least Enjoyable Data Science Task, Survey Says.” Forbes.com. 2016. https://www.forbes.com/sites/gilpress/2016/03/23/data-preparation-most-time-consuming-least-enjoyable-data-science-task-survey-says/#76969b3d6f63. [2] Microsoft Azure. “Feature engineering in data science.” Microsoft.com. 2017. https://docs.microsoft.com/en-us/azure/machine-learning/team-data-science-process/create-features.
  • 3. FEATURE ENGINEERING: Feature engineering tends to be applied in two versions. One version is deemed forward construction, meaning that the extracted features are directly derived from the original information. In other words, all of the information needed for the new feature is contained in the original feature. (Extracting age from a date of birth is a clear example of this.) The other version of feature engineering, and the focus of our exploration here, is backwards construction. This is formally known as an inverse problem: a case in which the features originally used to construct the given information are recovered. For example, each passenger’s gender and age were used to establish the correct salutation for addressing them on the Titanic. The inverse problem is to take the resulting salutation (a form of secondary information) and compute the age and gender (a form of primary information) from which it was derived. In short, backwards construction is a form of feature engineering that uses secondary information (information derived from primary features) to recover missing primary features.
  • 4. THE TOOLS: While the framework of an inverse problem is very general, different problems produce distinct difficulties. Some of the new, recommended tools for this form of feature engineering are graph databases or knowledge graphs. These can model how often datapoint features interact or how a single feature interacts across a database. Across the examples in the following pages, graph databases (GDBs) will be modeled using vertex- and arc-labeled directed graphs (digraphs):[3] more specifically, a primary digraph and a secondary digraph. The vertices (and their weights) of the secondary digraph are the same as those of the primary digraph, while the arcs (and their weights) of the secondary digraph are derived from those of the primary digraph. Using this language, the feature engineering problem is described as follows: given the secondary digraph and, potentially, parts of the primary digraph, extract useful information and features about the primary digraph that were previously unknown. This framework opens several questions, not least of which is determining when it is feasible to extract a particular feature: the arc weights or the arcs themselves. Some variations of this problem (all of which can be extended to include partial primary information) are: What is the minimum number of secondary digraph features needed to extract a particular feature? What features can be extracted given a set of secondary features? [3] “Directed graph.” Wikipedia.com. https://en.wikipedia.org/wiki/Directed_graph.
  • 5. How dependent are the resulting methods on the topologies of the chosen GDBs? Where complete information about a feature cannot be extracted, can useful, partial information be extracted instead? EXAMPLE PROBLEMS: Three example problems will be used to highlight the various difficulties and useful methods that apply to recovering a primary graph. Note that if a feature is symmetric, then the relations in our visualizations will be represented as edges without direction. That is, if the order of a relation between two vertices does not matter, then the relation will be represented by a line rather than an arrow.
  • 6. FIRST EXAMPLE PROBLEM: The primary information here is the Parents Graph. That is, we are given a set of people going back several generations, and we have an arc from a vertex x to another vertex y, denoted (x,y), where x is a parent of y. The derived information is the Cousins Graph, which is a graph with an edge (x,y) if x and y are first cousins. Equivalently, there is a path in the Parents Graph with arcs (a,x),(b,a),(b,c),(c,y). That is, there is an edge in the Cousins Graph if and only if x and y have a common grandparent, b. [Figure: an example of a Parents Graph (primary) and its associated Cousins Graph (secondary), with parent arcs (b,a), (b,c), (a,x), (c,y) and a cousins edge between x and y.] Note that x and y are cousins if and only if they share a grandparent. In this case, they share b. Note that the relation between x and y is symmetric and, therefore, lacks arrows.
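The forward direction of this first example (primary to secondary) can be sketched in a few lines of Python. This is only an illustrative sketch, not code from the slides: the function name `cousins_graph` and the arc-list representation are assumptions, and so is the requirement that the two intermediate parents a and c be distinct (otherwise siblings would be counted as cousins).

```python
from collections import defaultdict
from itertools import combinations


def cousins_graph(parent_arcs):
    """Return the Cousins Graph as a set of undirected edges (frozensets).

    parent_arcs holds (parent, child) pairs, i.e. the arcs of the Parents Graph.
    Assumption: x and y are cousins when they descend from a common
    grandparent b through two *distinct* children a and c of b.
    """
    children = defaultdict(set)
    for parent, child in parent_arcs:
        children[parent].add(child)

    edges = set()
    for b in list(children):
        # Every unordered pair of distinct children (a, c) of grandparent b.
        for a, c in combinations(sorted(children[b]), 2):
            for x in children.get(a, ()):
                for y in children.get(c, ()):
                    if x != y:
                        edges.add(frozenset((x, y)))
    return edges


# The Parents Graph described in the slide: b -> a -> x and b -> c -> y.
example = [("b", "a"), ("b", "c"), ("a", "x"), ("c", "y")]
print(cousins_graph(example))
```

For the slide's five-person graph, the only edge produced is the one between x and y.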
  • 7. SECOND EXAMPLE PROBLEM: The primary information here is the Maternal Graph. This is similar to the Parents Graph, but now each vertex has at most one parent: their mother. The secondary information is the Maternal Cousins Graph, which has an edge (x,y) whenever the Maternal Graph contains arcs (a,x),(b,a),(b,c),(c,y). That is, the Maternal Cousins Graph has an edge (x,y) if and only if x and y have the same maternal grandmother, b. [Figure: an example of a Maternal Graph (primary) and its Maternal Cousins Graph (secondary), with mother arcs (b,a), (b,c), (a,x), (c,y) and a maternal cousins edge between x and y.] Note that x and y are cousins if and only if they share the same maternal grandmother, b. Note that the relation between x and y is symmetric and, therefore, lacks arrows.
  • 8. THIRD EXAMPLE PROBLEM: The primary information here is the Handshake Graph: a timestamped graph of handshakes among a collection of people that has an edge between x and y with weight t if x and y shook hands at time t. The secondary information is known as the Influence Digraph, which has an arc (x,y) if there is a path (of arbitrary length) connecting x to y along which the weights (timestamps) increase from edge to edge. For simplicity, we assume that no person shakes two hands at the same time. Note that the Influence Digraph can be used to model potential pathogen spread across a group of people: an arc (x,y) indicates that y could be infected by a pathogen originating at x. The lack of such an arc indicates the impossibility of a contact pathogen spreading from x to y. [Figure: an example of a timestamped Handshake Graph (primary), with edge xy at time 1 and edge yz at time 2, and its Influence Digraph (secondary).] Note that the arc (x,z) exists since there is a path xy,yz with increasing timestamps 1 and 2 that allows x to influence z.
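A minimal sketch of how the Influence Digraph could be computed from a Handshake Graph is given below. The function name `influence_digraph` and the `(person, person, time)` tuple format are assumptions made for illustration. The sketch relies on the slide's simplifying assumption that no person shakes two hands at the same time, which is what makes a single pass over the handshakes in time order sufficient.

```python
def influence_digraph(handshakes):
    """Return the arcs (source, target) of the Influence Digraph.

    handshakes is a list of (u, v, t) tuples: u and v shook hands at time t.
    Assumption (from the slide): nobody shakes two hands at the same time,
    so handshakes sharing a timestamp never involve the same person.
    """
    # into[p] = everyone who can influence p so far (p itself included).
    into = {}
    for u, v, _ in handshakes:
        into.setdefault(u, {u})
        into.setdefault(v, {v})

    # Process handshakes in time order; a handshake lets each side pass on
    # everything that has reached it strictly earlier.
    for u, v, _ in sorted(handshakes, key=lambda h: h[2]):
        merged = into[u] | into[v]
        into[u] = set(merged)
        into[v] = set(merged)

    return {(src, dst) for dst, sources in into.items() for src in sources if src != dst}


# The slide's example: x and y shake hands at time 1, y and z at time 2.
arcs = influence_digraph([("x", "y", 1), ("y", "z", 2)])
print(sorted(arcs))
```

In this example x can influence z (via y), but z cannot influence x, because the only route back would require decreasing timestamps.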
  • 9. Across all of these examples, it is assumed that a single piece of information is missing: an arc from the Parents Graph, an arc from the Maternal Graph, and a timestamp from the Handshake Graph. It is also assumed that all of the information from the Cousins Graph, Maternal Cousins Graph, and Influence Digraph is known. Note that there is also a graph topology implied by the structure; namely, the assumption that each vertex of the Maternal Graph has at most one incoming arc. A LITTLE BACKGROUND IN GRAPH THEORY: In graph theory, a class of problems similar to feature engineering exists. These are known as graph reconstruction problems and, in this area, there are long-standing open conjectures. Starting with Kelly and Ulam’s reconstruction conjecture, there are many questions and results about rebuilding original graphs from data derived from them.[4] For example, in Ulam’s reconstruction problem, the secondary information is a collection of (unlabeled) graphs called the deck, with each graph in the deck produced by deleting a distinct vertex from the original graph. The conjecture is that this secondary information uniquely determines the graph. That is, given two graphs with at least three vertices, if the decks are the same, then the graphs are isomorphic. This conjecture can be stated as: given the secondary graphs, the unique, correct primary graph can be constructed. [4] Farhadian, Ameneh. A Simple Explanation for the Reconstruction of Graphs. 2017. https://arxiv.org/pdf/1704.01454.pdf.
  • 10. Since the reconstruction conjecture was first stated in 1957 (by Kelly), many partial results have been found, and similar questions have been posed using differently derived information. This includes Harary’s variation for edge deletion, where the edge-deck is the set of all subgraphs formed by deleting an edge.[5] Similarly, some classes of graphs (regular graphs, trees, disconnected graphs, maximal planar graphs, and outerplanar graphs, in particular) have been shown to be reconstructible. In the setting of directed graphs, there are infinite families of non-reconstructible graphs, including certain families of tournaments. While a wide range of questions involving the recovery of original graphs from secondary information have been posed, they, for the most part, cannot be applied to GDBs: they either apply only to unlabeled, unweighted, and undirected graphs (as in the original reconstruction conjecture) or rely on properties that generally do not appear in field conditions. This presents a large number of open problems based on reconstructing the primary GDB from secondary information; in particular, reconstructing labeled, edge-weighted digraphs from secondary digraphs. [5] Harary, F. “On the reconstruction of a graph from a collection of subgraphs.” Theory of Graphs and its Applications (Proc. Sympos. Smolenice, 1963). Publ. House Czechoslovak Acad. Sci., Prague, 1964, pp. 47–52.
  • 11. UNIQUENESS AND PROBABILISTIC APPROACHES: As with many classical graph reconstruction problems, particular instances are often intractable. In particular, there are examples of distinct Parents Graphs that have the same derived Cousins Graph and become identical after deleting one directed edge from each. Consider two example Parents Graphs on the vertex set {a,b,c,d,e,f}. The first has arc set {(a,c),(c,e),(b,d),(d,f),(b,c)} and the second has arc set {(a,c),(c,e),(b,d),(d,f),(a,d)}. Note that the only difference is that the first Parents Graph has arc (b,c), while the second has (a,d). Their Cousins Graphs, however, are the same: both have the unique edge (e,f). [Figure: the two Parents Graphs (primary) on vertices a to f, and their shared Cousins Graph (secondary).] In this example, we see two Parents Graphs that differ only in their arcs (a,d) and (b,c), yet the Cousins Graphs for each are identical. This means that, even when all of the arcs except (a,d) and (b,c) are known, one cannot tell which of these two Parents Graphs generated the Cousins Graph.
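The ambiguity in this example can be checked by brute force: take the four arcs the two Parents Graphs share, try every possible single extra arc, and keep those whose Cousins Graph matches the observed one. The sketch below is illustrative only; the helper names are assumptions, as is the rule that cousins must descend through two distinct children of the shared grandparent. It performs no genealogical sanity checks (such as acyclicity) on the candidates.

```python
from collections import defaultdict
from itertools import combinations, permutations


def cousins_graph(parent_arcs):
    """Cousins Graph (set of frozenset edges) of a list of (parent, child) arcs."""
    children = defaultdict(set)
    for parent, child in parent_arcs:
        children[parent].add(child)
    edges = set()
    for b in list(children):
        for a, c in combinations(sorted(children[b]), 2):
            for x in children.get(a, ()):
                for y in children.get(c, ()):
                    if x != y:
                        edges.add(frozenset((x, y)))
    return edges


vertices = "abcdef"
known_arcs = [("a", "c"), ("c", "e"), ("b", "d"), ("d", "f")]  # arcs shared by both graphs
observed = {frozenset(("e", "f"))}                              # the known Cousins Graph

# Try every ordered pair as the single missing arc.
consistent = [
    arc
    for arc in permutations(vertices, 2)
    if arc not in known_arcs and cousins_graph(known_arcs + [arc]) == observed
]
print(sorted(consistent))
```

Both (a,d) and (b,c) survive the check, so the secondary information alone cannot separate them; what it does provide is a short candidate list for further querying.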
  • 12. An easily understandable example of the difficulty lies with the Maternal Graph. Suppose the true Maternal Graph has vertices {w,x,y,z} and arcs {(w,x),(x,y),(y,z)}, but is missing the arc (x,y). Note that the Maternal Cousins Graph is a graph with no edges. This means that, even with the knowledge of what the Maternal Graph looks like (a directed path), without information from outside of the model (say, dates of birth) one cannot tell if the missing arc is (x,y) or (z,w), as both produce isomorphic Maternal Graphs and Maternal Cousins Graphs. [Figure: the two candidate Maternal Graphs, the directed paths w, x, y, z and y, z, w, x.] In this example, even the additional information that w, x, y, and z are all directly related is not enough to determine the exact maternal relations. Without more information, it is impossible to determine whether x is the mother of y or z is the mother of w. While this example shows that recovering the missing arc is fundamentally intractable, useful information can still be extracted from these relationships. For example, if the cousin relationships are rich enough, then there are only a small number of potential locations for the missing arc in the Parents Graph. That is, a small number of candidates can be determined for further querying, which can be very useful if the tests used are expensive. For instance, constructing the Cousins Graph of the Parents Graph with the deleted arc restricts the location of the missing arc to a subgraph of the Parents Graph related to the vertices with the missing relationship. Similarly, in the Handshake Graph, the missing timestamp can take on a (potentially unbounded) range of values. But if the Influence Digraph contains enough arcs that used the edge with the missing timestamp, then the interval of values can be bounded.
  • 13. Alternatively, consider a true Maternal Graph with vertices {v,w,x,y,z}, arcs {(v,w),(w,x),(v,y),(y,z)}, and the missing arc (v,w). The Maternal Cousins Graph, in this case, has one edge, (x,z). Given that this is (as mentioned) a Maternal Graph, each vertex has at most one incoming arc. Since x, y, and z already have mothers among the known arcs, the missing arc must be incoming to either v or w; there are no other options. However, only (v,w) will produce the edge (x,z) in the Maternal Cousins Graph. In this example, the presence of a cousin relationship between x and z in the Maternal Cousins Graph implies that x and z have the same maternal grandmother. Since v is the maternal grandmother of z and w is the mother of x, one can conclude that v is the mother of w. [Figure: the Maternal Graph (primary) on vertices v, w, x, y, z and its Maternal Cousins Graph (secondary) with the single edge between x and z.]
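The argument on this slide can be reproduced by enumeration: the at-most-one-mother topology restricts where the missing arc can land, and the observed Maternal Cousins Graph then singles out one candidate. The following is a rough sketch under stated assumptions; the function name `maternal_cousins` is hypothetical, and cousins are taken to be people with the same maternal grandmother but different mothers.

```python
def maternal_cousins(arcs):
    """Maternal Cousins Graph (set of frozenset edges) from (mother, child) arcs.

    Assumption: x and y are maternal cousins when they share a maternal
    grandmother but have different mothers.
    """
    mother = {child: mom for mom, child in arcs}
    people = sorted(mother)
    edges = set()
    for i, x in enumerate(people):
        for y in people[i + 1:]:
            gx = mother.get(mother[x])
            gy = mother.get(mother[y])
            if gx is not None and gx == gy and mother[x] != mother[y]:
                edges.add(frozenset((x, y)))
    return edges


vertices = {"v", "w", "x", "y", "z"}
known_arcs = [("w", "x"), ("v", "y"), ("y", "z")]  # the arc (v, w) is missing
observed = {frozenset(("x", "z"))}                 # known Maternal Cousins Graph

# Topology constraint: the missing arc must point at someone with no known mother.
motherless = vertices - {child for _, child in known_arcs}
candidates = [(p, q) for q in sorted(motherless) for p in sorted(vertices) if p != q]

consistent = [arc for arc in candidates
              if maternal_cousins(known_arcs + [arc]) == observed]
print(consistent)
```

Of the eight candidates allowed by the topology, only (v,w) reproduces the cousin edge between x and z, matching the conclusion in the text.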
  • 14. Given that these problems generally do not have a single solution, when are the solutions unique? And, if they are not, can useful information still be extracted? Take the example of the Maternal Graph: if the Maternal Cousins Graph has a rich enough structure, then the possible locations of the missing arc are limited. More specifically, if the Maternal Cousins Graph is constructed without the missing arc and the edge (x,y) from the original Maternal Cousins Graph does not appear in the new one, then it is clear that an arc needed to produce this cousin relationship is not there. One of the arcs that models the mother of x, the grandmother of x, the mother of y, or the grandmother of y is missing. But given that exactly one arc is missing, data scientists can be sure that x or y will have their grandmother known; thus, the exact arc that is missing in relation to x and y can be deduced. For example, if the arc modeling the mother of x is missing, then one knows that x’s mother has to be a daughter of the grandmother of y, but not y’s mother (i.e., an aunt of y). The secondary graph, in this case, provides enough information so that, in conjunction with the known topology of the Maternal Graph, the missing arc can be limited to a relatively small set of possibilities. Therefore, an (almost) unique solution is derived due to the topology of the Maternal Graph and the rich structure of the Maternal Cousins Graph. Similarly, if the timestamp of an edge (x,y) in the Handshake Graph is missing, but the edge is used on a path from x to a vertex w (that is, the edge (x,y) is followed by a path from y to w), then an upper bound can be attached to the missing timestamp. If all of the paths from y to w where the timestamps increase (producing arcs in the Influence Digraph) are listed, then the timestamp of (x,y) must be smaller than the largest initial timestamp among them. That is, if a unique path from y to w exists and its first edge has timestamp t, then the only way the path from x to w can use the edge (x,y) is if the missing timestamp is smaller than t. Similarly, the missing timestamp can be bounded below if it is the last edge on a path with increasing timestamps. While this does not, in general, provide an exact value, it does provide an interval in which the handshake occurred. And, if additional information (such as all timestamps being stored as integer values) is known, then a specific value may, in fact, be extracted.
  • 15. [Figure: a Handshake Graph (primary) on vertices a, x, y, w with timestamps 1, 2, 4 and one unknown timestamp, and its Influence Digraph (secondary).] In this figure, x and y shook hands at an unknown time t, while the Influence Digraph includes arcs (x,w) and (a,x). So, while the exact time of the handshake remains unknown, the time can still be bounded. Two paths (with increasing times) exist from a to x, which shows that (at the earliest) t happened after time 1. The two paths from x to w show that, at the latest, t happened before time 4. Therefore, t lies between 1 and 4. If the further assumption that all interactions take place at integer times is applied, then one can safely say that t is either 2 or 3.
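The bounding argument can be sketched as a search over candidate integer timestamps. The graph used below is a simplified, assumed layout (a and y shake hands at time 1, y and w at time 4, and x and y at the unknown time); it is not a transcription of the slide's figure, whose exact edges are not recoverable here. The function name, the candidate range 0 to 7, and the choice to check only the two arcs named in the text are likewise assumptions.

```python
def influence_digraph(handshakes):
    """Arcs (source, target) reachable along strictly increasing timestamps.

    Assumes nobody shakes two hands at the same time.
    """
    into = {}
    for u, v, _ in handshakes:
        into.setdefault(u, {u})
        into.setdefault(v, {v})
    for u, v, _ in sorted(handshakes, key=lambda h: h[2]):
        merged = into[u] | into[v]
        into[u] = set(merged)
        into[v] = set(merged)
    return {(s, d) for d, sources in into.items() for s in sources if s != d}


# Assumed layout (hypothetical, for illustration only).
known = [("a", "y", 1), ("y", "w", 4)]
unknown_edge = ("x", "y")
required_arcs = {("a", "x"), ("x", "w")}  # arcs known to be in the Influence Digraph

feasible = []
for t in range(0, 8):  # assumed integer time range
    # y is already busy at times 1 and 4 (no simultaneous handshakes).
    if any(t == s and {u, v} & set(unknown_edge) for u, v, s in known):
        continue
    arcs = influence_digraph(known + [(*unknown_edge, t)])
    if required_arcs <= arcs:
        feasible.append(t)

print(feasible)
```

Under these assumptions the arc (a,x) forces t to be later than 1 and the arc (x,w) forces it to be earlier than 4, so the search returns 2 and 3, the same interval the text derives.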
  • 16. FUTURE DIRECTIONS: Backwards construction in feature engineering offers great potential for extracting new features, determining when a unique answer can be extracted, formalizing the information required to extract features, and providing meaningful possibilities when the answer to a problem is not unique. Future directions of inquiry, for those interested in broadening their knowledge of this and other approaches, include: a more robust understanding of when a unique solution can be extracted; studying the relationship between the definition of secondary information and the size and location of potential answers, and how both of these relate to multiple missing features; studying the existence of multiple sources of secondary information; and determining a metric for how robust secondary information is in relation to recovering missing primary information.
  • 17. ABOUT THE AUTHOR: Dr. Drew J. Lipman is head of the data science department at Hypergiant. He has a doctorate in discrete mathematics focusing on how to transform difficult algebraic geometric problems into easy-to-compute graph theory parameters. Dr. Lipman’s recent work includes building a machine intelligent mixologist for TGI Fridays and building the world’s first personalized health platform. drew@hypergiant.com
  • 18. WE ARE HYPERGIANT. TOMORROWING TODAY™. A guiding light for Fortune 500 companies. Analyzing data. Teaching machines to teach themselves. Providing understanding, creation, and implementation at the intersection of experience and machine intelligence. Merging with partners to create powerful technology solutions and smarter, more efficient human workforces.