Reconstructing Primary Information from Secondary Information

HYPERGIANT.COM
BY DREW J. LIPMAN, PH.D.

As part of cleaning and organizing data, data scientists engage in something known as feature engineering: the process of using existing, known information or features to extract new information or features about particular data.[2]

A classic example is extracting the age or gender of the Titanic’s passengers from their salutations. This information (age and gender) is generally not included in the passenger manifest, but it becomes very useful when trying to predict whether a particular passenger survived or when calculating overall survival rates.
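The salutation example can be sketched in a few lines of Python. This is only an illustration: the "Surname, Title. Given names" name format, the sample names, and the TITLE_HINTS table (including its coarse age groups) are assumptions made for the sketch, not details taken from the text or from any particular manifest.

```python
import re

# Hypothetical lookup table: what each salutation suggests about a passenger.
# The age groups are coarse guesses for illustration only.
TITLE_HINTS = {
    "Mr": ("male", "adult"),
    "Mrs": ("female", "adult"),
    "Miss": ("female", None),    # age is ambiguous: a girl or an unmarried woman
    "Master": ("male", "child"),
}

def extract_title(name):
    """Return the salutation in a name such as 'Doe, Mr. John', or None."""
    match = re.search(r",\s*([A-Za-z]+)\.", name)
    return match.group(1) if match else None

def gender_and_age_group(name):
    """Derive (gender, age_group) hints; (None, None) if the title is unknown."""
    return TITLE_HINTS.get(extract_title(name), (None, None))

print(gender_and_age_group("Doe, Mr. John"))
print(gender_and_age_group("Doe, Master. Tom"))
```

A table lookup keeps the ambiguity visible: "Miss" fixes the gender but leaves the age group open, which is the kind of partial recovery discussed later in this paper.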
DATA SCIENTISTS SPEND 60% OF THEIR TIME CLEANING AND ORGANIZING DATA.[1]

[1] Press, Gil. “Cleaning Big Data: Most Time-Consuming, Least Enjoyable Data Science Task, Survey Says.” Forbes.com. 2016. https://www.forbes.com/sites/gilpress/2016/03/23/data-preparation-most-time-consuming-least-enjoyable-data-science-task-survey-says/#76969b3d6f63.
[2] Microsoft Azure. “Feature engineering in data science.” Microsoft.com. 2017. https://docs.microsoft.com/en-us/azure/machine-learning/team-data-science-process/create-features.

FEATURE ENGINEERING
Feature engineering tends to be applied in two versions. One version is deemed forward construction, meaning that the extracted features are directly derived from the original information. In other words, all of the information needed for the new feature is contained in the original feature. (Extracting age from a date of birth is a clear example of this.)

The other version of feature engineering, and the focus of our exploration here, is backwards construction. This is formally known as an inverse problem: a case in which the features originally used to construct the given information are recovered. For example, each passenger’s gender and age were used to establish the correct salutation for addressing them on the Titanic. The inverse problem is to take the resulting salutation (a form of secondary information) and compute the age and gender that were its inputs (a form of primary information).
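The contrast between the two versions can be made concrete with a small sketch. The function names, the dates, and the toy salutation rule (including the is_married input) are all hypothetical and chosen only to illustrate the idea. Forward construction is a plain function of its input, while backwards construction asks which inputs could have produced an observed output.

```python
from datetime import date
from itertools import product

def age_on(birth_date, on_date):
    """Forward construction: age in whole years, fully determined by the input."""
    had_birthday = (on_date.month, on_date.day) >= (birth_date.month, birth_date.day)
    return on_date.year - birth_date.year - (0 if had_birthday else 1)

def salutation(gender, is_adult, is_married):
    """Hypothetical forward rule deriving a salutation from primary features."""
    if gender == "male":
        return "Mr" if is_adult else "Master"
    return "Mrs" if (is_adult and is_married) else "Miss"

def invert_salutation(observed):
    """Inverse problem: every (gender, is_adult) pair that can yield the title."""
    inputs = product(["female", "male"], [False, True], [False, True])
    return sorted({(gender, adult) for gender, adult, married in inputs
                   if salutation(gender, adult, married) == observed})

print(age_on(date(1880, 6, 15), date(1912, 4, 15)))
print(invert_salutation("Master"))
print(invert_salutation("Miss"))
```

Under this toy rule, "Master" has a single preimage, while "Miss" is consistent with both a girl and an unmarried woman, so the inverse problem does not always have a unique answer.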
BACKWARDS CONSTRUCTION IS A FORM OF FEATURE
ENGINEERING THAT USES SECONDARY INFORMATION
(INFORMATION DERIVED FROM PRIMARY FEATURES)
TO RECOVER MISSING, PRIMARY FEATURES.

THE TOOLS
While the framework of an inverse problem is very general, different problems produce distinct difficulties. Among the newer tools recommended for this form of feature engineering are graph databases and knowledge graphs. These can model how often the features of data points interact or how a single feature interacts across a database.

Across the examples in the following pages, graph databases (GDBs) will be modeled using vertex- and arc-labeled directed graphs (digraphs)[3]: more specifically, a primary digraph and a secondary digraph. The vertices (and their weights) of the secondary digraph are the same as those of the primary digraph, while the arcs (and their weights) of the secondary digraph are derived from those of the primary digraph. Using this language, the feature engineering problem is described as follows: given the secondary digraph (and, potentially, parts of the primary digraph), extract useful information and features about the primary digraph that were previously unknown.
This framework opens several questions, not least of which is determining when it is feasible to extract a particular feature: arc weights or the arcs themselves. Some variations of this problem (all of which can be extended to include partial primary information) are:

- What is the minimum number of secondary digraph features needed to extract a particular feature?
- What features can be extracted given a set of secondary features?
- How dependent are the resulting methods on the topologies of the chosen GDBs?
- Where complete information about a feature cannot be extracted, can useful, partial information be extracted instead?

[3] “Directed graph.” Wikipedia.com. https://en.wikipedia.org/wiki/Directed_graph.
EXAMPLE PROBLEMS
Three example problems will be used to highlight the various difficulties and useful methods that apply to recovering a primary graph. Note that if a feature is symmetric, then the relations in our visualizations will be represented as edges without direction. That is, if the order of a relation between two vertices does not matter, then the relation will be represented by a line rather than an arrow.

FIRST EXAMPLE PROBLEM
The primary information here is the Parents Graph. That is, we are given a set of people going back several generations, with an arc from a vertex x to another vertex y, denoted (x,y), whenever x is a parent of y. The derived information is the Cousins Graph, which has an edge (x,y) if x and y are first cousins. Equivalently, the Parents Graph contains arcs (b,a),(a,x),(b,c),(c,y), with a and c distinct. That is, there is an edge in the Cousins Graph if and only if x and y have a common grandparent, b.
[Figure: PRIMARY: PARENTS GRAPH, with PARENT arcs (b,a), (b,c), (a,x), and (c,y); SECONDARY: COUSINS GRAPH, on the same vertices with the single COUSINS edge between x and y.]
An example of a Parents Graph and its associated Cousins Graph. Note that x and y are cousins if and only if they share a grandparent. In this case, they share b. Note that the relation between x and y is symmetric and, therefore, lacks arrows.
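The derivation of the secondary graph from the primary one can be sketched as follows. The function name and the representation (a list of (parent, child) pairs in, a set of unordered pairs out) are assumptions made for this illustration. The sketch also assumes that the two parents on the path must be distinct, so that siblings are not reported as cousins.

```python
from collections import defaultdict
from itertools import combinations

def cousins_graph(parent_arcs):
    """Derive the Cousins Graph (a set of unordered pairs) from Parents Graph arcs.

    Each arc (p, c) means p is a parent of c. x and y are treated as first
    cousins when distinct parents a of x and c of y share a parent b.
    """
    parents = defaultdict(set)
    for p, c in parent_arcs:
        parents[c].add(p)

    edges = set()
    for x, y in combinations(sorted(parents), 2):
        for a in parents[x]:
            for c in parents[y]:
                # .get avoids adding new keys while the outer loops are running
                if a != c and parents.get(a, set()) & parents.get(c, set()):
                    edges.add(frozenset((x, y)))
    return edges

# The Parents Graph from the figure: b is a parent of a and c, and so on.
figure_arcs = [("b", "a"), ("b", "c"), ("a", "x"), ("c", "y")]
print(cousins_graph(figure_arcs))
```

Edges are stored as frozensets because the cousin relation is symmetric, matching the undirected line in the figure.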

SECOND EXAMPLE PROBLEM
The primary information here is the Maternal Graph. This is similar to the Parents Graph, but now each vertex has at most one parent: their mother. The secondary information is the Maternal Cousins Graph, which has an edge (x,y) whenever there are arcs (b,a),(a,x),(b,c),(c,y) in the Maternal Graph. That is, the Maternal Cousins Graph has an edge (x,y) if and only if x and y have the same maternal grandmother, b.
[Figure: PRIMARY: MATERNAL GRAPH, with MOTHER arcs (b,a), (b,c), (a,x), and (c,y); SECONDARY: MATERNAL COUSINS GRAPH, on the same vertices with the single MATERNAL COUSINS edge between x and y.]
An example of a Maternal Graph and its Maternal Cousins Graph. Note that x and y are cousins if and only if they share the same maternal grandmother, b. Note that the relation between x and y is symmetric and, therefore, lacks arrows.

THIRD EXAMPLE PROBLEM
The primary information here is the Handshake Graph: a timestamped graph of handshakes among a collection of people that has an edge between x and y with weight t if x and y shook hands at time t. The secondary information is known as the Influence Digraph, which has an arc (x,y) if there is a path (of arbitrary length) from x to y along which the weights (timestamps) increase from each edge to the next. For simplicity, we assume that no person shakes two hands at the same time. Note that the Influence Digraph can be used to model potential pathogen spread across a group of people: an arc (x,y) indicates that y could be infected by a pathogen originating at x. The lack of such an arc indicates that a contact pathogen cannot spread from x to y.
[Figure: PRIMARY: HANDSHAKE GRAPH on x, y, and z, with edge xy at time 1 and edge yz at time 2; SECONDARY: INFLUENCE DIGRAPH on the same vertices.]
An example of a timestamped Handshake Graph and its Influence Digraph. Note that the arc (x,z) exists since there is a path xy,yz with increasing timestamps 1 and 2 that allows x to influence z.
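One way to compute the Influence Digraph from a Handshake Graph is an earliest-arrival scan over the handshakes in time order. This is a minimal sketch under the stated assumption that nobody shakes two hands at once. The function name and the (u, v, t) triple representation are choices made for this illustration and are not taken from the text.

```python
import math

def influence_digraph(vertices, handshakes):
    """Return arcs (x, y) such that a path with increasing timestamps runs x -> y.

    handshakes is a list of (u, v, t): u and v shook hands at time t.
    """
    ordered = sorted(handshakes, key=lambda h: h[2])
    arcs = set()
    for source in vertices:
        # earliest[v] is the smallest timestamp at which an increasing path
        # from source can arrive at v; the source counts as reached beforehand.
        earliest = {v: math.inf for v in vertices}
        earliest[source] = -math.inf
        for u, v, t in ordered:
            # Evaluate both directions before updating, since edges are undirected.
            reach_v = earliest[u] < t
            reach_u = earliest[v] < t
            if reach_v:
                earliest[v] = min(earliest[v], t)
            if reach_u:
                earliest[u] = min(earliest[u], t)
        arcs |= {(source, v) for v in vertices
                 if v != source and earliest[v] < math.inf}
    return arcs

# The example from the figure: x and y shake hands at time 1, y and z at time 2.
print(sorted(influence_digraph(["x", "y", "z"], [("x", "y", 1), ("y", "z", 2)])))
```

For the figure's example, the result contains (x,z) but not (z,x): the path from z to x would use timestamps 2 and then 1, which is not increasing.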

Across all of these examples, it is assumed that a single piece of information is missing: an arc from the Parents Graph, an arc from the Maternal Graph, or a timestamp from the Handshake Graph. It is also assumed that all of the information from the Cousins Graph, Maternal Cousins Graph, and Influence Digraph is known. Note that there is also a graph topology implied by the structure; namely, the assumption that each vertex in the Maternal Graph has at most one arc coming in.
A LITTLE BACKGROUND IN GRAPH THEORY
In graph theory, a class of problems similar to feature engineering exists. These are known as graph reconstruction problems, and the area includes long-standing open conjectures.

Starting with Kelly and Ulam’s reconstruction theory, there are many questions and results about rebuilding original graphs from other data derived from them.[4] For example, in Ulam’s reconstruction problem, the secondary information is a collection of (unlabeled) graphs called the deck, with each graph in the deck produced by deleting a distinct vertex from the original graph.

The conjecture is that this secondary information uniquely determines the graph. That is, given two graphs with at least three vertices, if the decks are the same, then the graphs are the same. In the language used here, the conjecture can be stated as: given the secondary graphs, the unique, correct primary graph can be constructed.

[4] Farhadian, Ameneh. A Simple Explanation for the Reconstruction of Graphs. 2017. https://arxiv.org/pdf/1704.01454.pdf.

Since Ulam’s reconstruction theory was first stated in 1957 (by Kelly), many partial results have been found, and similar questions have been posed using differently derived information. This includes Harary’s variation for edge deletion, where the edge-deck is the set of all subgraphs formed by deleting an edge.[5]

Similarly, some classes of graphs (Regular Graphs, Trees, Disconnected Graphs, Maximal Planar Graphs, and Outerplanar Graphs, in particular) have been shown to be reconstructible. In the setting of directed graphs, there are infinite families of non-reconstructible graphs, including families of tournaments.

While a wide range of questions involving the recovery of original graphs from secondary information have been posed, they, for the most part, cannot be applied to GDBs: they either apply only to unlabeled, unweighted, and undirected graphs (as in the original reconstruction conjecture) or rely on properties that generally do not appear in field conditions.

This presents a large number of open problems based on reconstructing the primary GDB from secondary information; in particular, reconstructing labeled, edge-weighted digraphs from secondary digraphs.

[5] Harary, F. “On the reconstruction of a graph from a collection of subgraphs.” Theory of Graphs and its Applications (Proc. Sympos. Smolenice, 1963). Publ. House Czechoslovak Acad. Sci., Prague, 1964, pp. 47–52.

UNIQUENESS AND PROBABILISTIC APPROACHES
As with many classical graph reconstruction problems, particular instances are often intractable in the sense that no unique answer exists. In particular, there are examples of distinct Parents Graphs that have the same derived Cousins Graph and become identical after one directed edge is deleted from each. Consider two example Parents Graphs on the vertex set {a,b,c,d,e,f}. The first has arc set {(a,c),(c,e),(b,d),(d,f),(b,c)} and the second has arc set {(a,c),(c,e),(b,d),(d,f),(a,d)}. Note that the only difference is that the first Parents Graph has arc (b,c), while the second has (a,d). Their Cousins Graphs, however, are the same: both have the unique edge (e,f).
[Figure: two panels labeled PRIMARY: PARENTS GRAPH and one labeled SECONDARY: COUSINS GRAPH, each drawn on the vertices a, b, c, d, e, and f.]
In this example, we see two Parents Graphs that differ only in their arcs (a,d) and (b,c). The Cousins Graphs for each, however, are isomorphic. This means that, even when all of the arcs except (a,d) and (b,c) are known, one cannot tell which of these two Parents Graphs generated the Cousins Graph.
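The claim that both Parents Graphs produce the same Cousins Graph can be checked mechanically. The helper below is an illustrative sketch, not the author's code. It assumes that two people are first cousins when distinct parents of theirs share a parent.

```python
from itertools import combinations

def cousins_graph(parent_arcs):
    """Unordered pairs {x, y} whose distinct parents share a parent."""
    parents = {}
    for p, c in parent_arcs:
        parents.setdefault(c, set()).add(p)
    edges = set()
    for x, y in combinations(sorted(parents), 2):
        if any(a != c and parents.get(a, set()) & parents.get(c, set())
               for a in parents[x] for c in parents[y]):
            edges.add(frozenset((x, y)))
    return edges

# Arcs common to both example Parents Graphs, plus the one arc that differs.
shared = [("a", "c"), ("c", "e"), ("b", "d"), ("d", "f")]
first = shared + [("b", "c")]
second = shared + [("a", "d")]

print(cousins_graph(first) == cousins_graph(second))
print(cousins_graph(shared))
```

With only the shared arcs, the Cousins Graph is empty, so the edge (e,f) reveals that an arc is missing but not which of the two candidates it is.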

An easily understandable example of the difficulty lies with the Maternal Graph. Suppose the true Maternal Graph has vertices {w,x,y,z} and arcs {(w,x),(x,y),(y,z)}, but is missing the arc (x,y). Note that the Maternal Cousins Graph is a graph with no edges. This means that, even with the knowledge of what the Maternal Graph looks like (a directed path), one cannot tell, without information outside of the model (say, dates of birth), whether the missing arc is (x,y) or (z,w), as both produce isomorphic Maternal Graphs and Maternal Cousins Graphs.

[Figure: the two candidate Maternal Graphs, drawn as the directed paths w, x, y, z and y, z, w, x.]
In this example, even the additional information that w, x, y, and z are all directly related is not enough to determine the exact maternal relations. Without more information, it is impossible to determine whether x is the mother of y or z is the mother of w.

While this example shows that recovering the missing edge exactly can be fundamentally intractable, useful information can still be extracted from these relationships. For example, if the cousin relationships are rich enough, then there are only a few potential locations for the missing arc in the Parents Graph. That is, a small number of candidates can be determined for further querying, which can be very useful if the tests used are expensive.

For instance, constructing the Cousins Graph of the Parents Graph with the deleted edge restricts the location of the missing edge to a subgraph of the Parents Graph related to the vertices with the missing relationship. Similarly, in the Handshake Graph, the missing timestamp can take on a (potentially unbounded) range of values. But if the Influence Digraph contains enough arcs that used the edge with the missing timestamp, then the interval of values can be bounded.

Alternatively, consider a true Maternal Graph with vertices {v,w,x,y,z}, arcs {(v,w),(w,x),(v,y),(y,z)}, and the missing arc (v,w). The Maternal Cousins Graph, in this case, has one edge, (x,z). Given that this is (as mentioned) a Maternal Graph, it is certain that each vertex has at most one arc coming in, which means that the missing arc must be incoming to either v or w, the only vertices without a known mother. However, only (v,w) will produce the edge (x,z) in the Maternal Cousins Graph.
[Figure: PRIMARY: MATERNAL GRAPH on v, w, x, y, and z; SECONDARY: MATERNAL COUSINS GRAPH on the same vertices.]
In this example, the presence of a cousin relationship between x and z in the Maternal Cousins Graph implies that x and z have the same maternal grandmother. Since v is the maternal grandmother of z and w is the mother of x, one can conclude that v is the mother of w.
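The argument above can be reproduced by brute force: try every arc that could be added to the known Maternal Graph and keep those that reproduce the observed Maternal Cousins Graph. This is a sketch under stated assumptions rather than a definitive method: exactly one arc is missing, each vertex has at most one mother, nobody is their own ancestor, and cousins must have different mothers. All function names are hypothetical.

```python
from itertools import combinations

def maternal_cousins(mother):
    """Pairs {x, y} with different mothers but the same maternal grandmother."""
    grandmother = {c: mother[m] for c, m in mother.items() if m in mother}
    return {frozenset((x, y)) for x, y in combinations(sorted(grandmother), 2)
            if grandmother[x] == grandmother[y] and mother[x] != mother[y]}

def has_cycle(mother):
    """True if following mother links from some vertex revisits a vertex."""
    for start in mother:
        seen, current = set(), start
        while current in mother:
            if current in seen:
                return True
            seen.add(current)
            current = mother[current]
    return False

def candidate_missing_arcs(vertices, known_arcs, observed_cousins):
    """Every single arc (m, c) whose addition reproduces the observed cousins."""
    mother = {c: m for m, c in known_arcs}
    found = []
    for c in vertices:
        if c in mother:              # at most one incoming arc per vertex
            continue
        for m in vertices:
            if m == c:
                continue
            trial = {**mother, c: m}
            if not has_cycle(trial) and maternal_cousins(trial) == observed_cousins:
                found.append((m, c))
    return found

# Five-vertex example: arc (v,w) is missing and the cousins edge (x,z) is observed.
print(candidate_missing_arcs("vwxyz", [("w", "x"), ("v", "y"), ("y", "z")],
                             {frozenset(("x", "z"))}))
# Four-vertex path example: arc (x,y) is missing and no cousin edges are observed.
print(candidate_missing_arcs("wxyz", [("w", "x"), ("y", "z")], set()))
```

For the five-vertex example, the search returns only (v,w). For the four-vertex path example, it returns several candidates, including both (x,y) and (z,w), so the secondary information alone cannot separate them.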

Given that these problems generally do not have a single solution, when are the solutions unique? And, if they are not, can useful information still be extracted? Take the example of the Maternal Graph: if the Maternal Cousins Graph has a rich enough structure, then the possible locations of the missing arc are limited. More specifically, suppose the Maternal Cousins Graph is recomputed from the Maternal Graph without the missing arc, and an edge (x,y) from the original Maternal Cousins Graph does not appear in the new one. Then it is clear that an arc needed to produce this cousin relationship is not there: the arc that models the mother of x, the grandmother of x, the mother of y, or the grandmother of y is, indeed, missing. But given that exactly one arc is missing, data scientists can be sure that x or y will have their grandmother known; thus, the arc that is missing in relation to x and y can be deduced.
For example, if the arc modeling the mother of x is missing, then one knows that x’s mother has to be a daughter of the grandmother of y, but not y’s mother (i.e., an aunt of y). The secondary graph, in this case, provides enough information so that, in conjunction with the known topology of the Maternal Graph, the missing arc can be limited to a relatively small set of possibilities. Therefore, an (almost) unique solution is derived due to the topology of the Maternal Graph and the rich structure of the Maternal Cousins Graph.
Similarly, suppose the timestamp of an edge (x,y) in the Handshake Graph is missing, but the edge is used on a path from x to a vertex w; that is, the edge (x,y) is followed by a path from y to w. Then an upper bound can be attached to the missing timestamp. If all of the paths from y to w where the timestamps increase (producing arcs in the Influence Digraph) are listed, then the timestamp of (x,y) must be smaller than the largest initial timestamp among them. For instance, if a unique path from y to w exists and its first edge has timestamp t, then the only way the path from x to w can use the edge (x,y) is if the missing timestamp is smaller than t. Similarly, the missing timestamp can be bounded below if it belongs to the last edge on a path with increasing timestamps. While this does not, in general, provide an exact value, it does provide an interval in which the handshake occurred. And, if additional information is known (such as all timestamps being stored as integer values), then a specific value may, in fact, be extracted.

[Figure: PRIMARY: HANDSHAKE GRAPH on the vertices a, x, y, and w, with edge timestamps 1, 2, 4, and one unknown timestamp (?); SECONDARY: INFLUENCE DIGRAPH on the same vertices.]
In this figure, x and y shook hands at an unknown time t, while the Influence Digraph includes arcs (x,w) and (a,x). So, while the exact time of the handshake remains unknown, the time can still be bounded. Two paths (with increasing times) exist from a to x, which shows that (at the earliest) t happened after time 1. The two paths from x to w show that, at the latest, t happened before time 4. Therefore, t lies between 1 and 4. If the further assumption that all interactions take place at integer times is applied, then one can safely say that t is either 2 or 3.
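The bounding argument can be illustrated by testing candidate timestamps directly: insert each candidate into the Handshake Graph, recompute the Influence Digraph, and keep the values that reproduce the observed one. The graph below is a hypothetical three-edge example chosen to give the same bounds as the figure; it is not a reconstruction of the figure itself. The function names, the integer candidate range, and the true value used to generate the observation are likewise assumptions.

```python
import math

def influence_digraph(vertices, handshakes):
    """Arcs (x, y) such that a path with increasing timestamps runs x -> y."""
    ordered = sorted(handshakes, key=lambda h: h[2])
    arcs = set()
    for source in vertices:
        earliest = {v: math.inf for v in vertices}
        earliest[source] = -math.inf
        for u, v, t in ordered:
            reach_v, reach_u = earliest[u] < t, earliest[v] < t
            if reach_v:
                earliest[v] = min(earliest[v], t)
            if reach_u:
                earliest[u] = min(earliest[u], t)
        arcs |= {(source, v) for v in vertices
                 if v != source and earliest[v] < math.inf}
    return arcs

def consistent_timestamps(vertices, known, missing_edge, observed, candidates):
    """Candidate times for missing_edge that reproduce the observed digraph."""
    u, v = missing_edge
    # Times already used by u or v are excluded: nobody shakes two hands at once.
    busy = {t for a, b, t in known if {a, b} & {u, v}}
    return [t for t in candidates
            if t not in busy
            and influence_digraph(vertices, known + [(u, v, t)]) == observed]

# Hypothetical data: a-x at time 1, y-w at time 4, x-y at an unknown time.
vertices = ["a", "x", "y", "w"]
known = [("a", "x", 1), ("y", "w", 4)]
# Simulate the observed Influence Digraph using an assumed true time of 2.
observed = influence_digraph(vertices, known + [("x", "y", 2)])
print(consistent_timestamps(vertices, known, ("x", "y"), observed, range(0, 7)))
```

In this hypothetical graph, the arc (a,y) forces the unknown time to be later than 1 and the arc (x,w) forces it to be earlier than 4, so the integer candidates that survive are 2 and 3.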

FUTURE DIRECTIONS
Backwards construction in feature engineering offers great potential for extracting new features, determining when a unique answer can be extracted, formalizing the information required to extract features, and providing meaningful possibilities when the answer to a problem is not unique. Future directions of inquiry, for those interested in broadening their knowledge of this and other approaches, include:

- A more robust understanding of when a unique solution can be extracted;
- Studying the relationship between the definition of secondary information, the size and location of potential answers, and how both of these relate to multiple missing features;
- Studying the existence of multiple sources of secondary information;
- And determining a metric for how robust secondary information is in relation to recovering missing primary information.

ABOUT THE AUTHOR
DREW J. LIPMAN, PH.D.
Dr. Drew Lipman is head of the data science department at Hypergiant. He has a doctorate in discrete mathematics focusing on how to transform difficult algebraic geometric problems into easy-to-compute graph theory parameters.

Dr. Lipman’s recent work includes building a machine intelligent mixologist for TGI Fridays and building the world’s first personalized health platform.

drew@hypergiant.com

WE ARE HYPERGIANT
TOMORROWING TODAY™
A guiding light for Fortune 500 companies. Analyzing data. Teaching machines to teach themselves. Providing understanding, creation, and implementation at the intersection of experience and machine intelligence. Merging with partners to create powerful technology solutions and smarter, more efficient human workforces.
