Gaining insight from data is rarely as straightforward as we wish it were: the quality and quantity of the data at hand are as diverse as the questions we ask of it. Any attempt to turn data into knowledge therefore depends strongly on whether we are dealing with big or not-so-big data, high- or low-dimensional data, exact or fuzzy data, and exact or fuzzy questions, and on whether the goal is accurate prediction or understanding. This presentation emphasizes the need for a multi-paradigm data science to tackle the challenges we face today and may face in the future. Luckily, solutions are starting to emerge...
1. Multi-Paradigm Data Science
On the many dimensions of Knowledge Discovery
Data Natives, Berlin, November 17th, 2017
Dr. Kai Gansel
ADDITIVE GmbH
kai.gansel@additive-net.de
2. Dimensions of Knowledge Discovery I and II: Data
[Quadrant diagram: data volume ("big data" vs. "not-so-big data") plotted against dimensionality ("high-dimensional data" vs. "low-dimensional data"); the four regions are labeled Statistics & Modeling, Data Mining, ClusterPC, and ML & NN]
3. Dimensions of Knowledge Discovery III and IV: Approach
[Quadrant diagram: question type ("exact question" vs. "fuzzy question") plotted against data type ("exact data" vs. "fuzzy data"); the regions are labeled Statistics, Data Mining, and ML & NN]
6. Correlation of SNPs with schizophrenic phenotypes
(Lencz et al., 2013)
7. Special Topic: Higher order correlations
Definition
(Schneider & Grün, 2003)
An observed correlation between items or events is called genuine if it cannot be explained by correlations of lower order, i.e. by a random superposition of any of its constituent parts.
Meaning
Genuine higher-order patterns arise from non-random, interacting processes and reflect the correlational structure of those processes. For example, three events that co-occur more often than their pairwise correlations predict form a genuine third-order pattern. The appearance of such patterns may therefore provide insight into their hidden causes.
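As a toy illustration of the lowest-order case (not from the talk; the baskets and item names are made up), the observed co-occurrence rate of two items can be compared with the rate predicted from their individual frequencies. A genuine second-order correlation is present exactly when the two differ beyond sampling noise:
(* 1000 random baskets over three items; each item enters a basket independently with probability 1/2 *)
baskets = Table[Pick[{"a", "b", "c"}, RandomChoice[{True, False}, 3]], {1000}];
p[i_] := N@Mean[Boole[MemberQ[#, i] & /@ baskets]] (* marginal frequency of item i *)
observedAB = N@Mean[Boole[SubsetQ[#, {"a", "b"}] & /@ baskets]]; (* joint frequency of a and b *)
{observedAB, p["a"] p["b"]} (* independent by construction, so the two values agree up to sampling noise; a persistent discrepancy in real data would signal a genuine pairwise pattern *)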
8. General task
W = region defining one data point
τ = class / feature / quality
Application areas: visited websites, market basket analysis...
...you name it!
The problem
Combinatorial explosion of the number of candidate patterns and tests with increasing number of dimensions: for n items, every subset of two or more items is a candidate, giving 2^n − n − 1 patterns (all 2^n subsets minus the n singletons and the empty set):
n = 20; 2^n - n - 1
1 048 555
9. Reducing the complexity of data: DimensionReduce
Advantages of dimensionality reduction:
◼ It reduces the time and storage space required.
◼ Removing multicollinearity can improve the performance of many machine learning models.
◼ It becomes easier to visualize the data when reduced to very low dimensions such as 2D or 3D.
Here are some multi-dimensional example data:
data = Import[NotebookDirectory[] <> "Example.dat"];
Rearrange example data to represent individual measurements.
Structure of the data:
ListPlot[Tally[First /@ data], PlotRange → All, Filling → Axis, AxesLabel → {"Sort ID", "Number of measurements"}]
[Plot: number of measurements per Sort ID; Sort IDs run from 1 to about 100, with up to roughly 50 measurements each]
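A minimal sketch of the reduction step itself, assuming the rearranged measurements are stored in measurements (a hypothetical name); slide 12 below then works with the resulting three-dimensional data3D:
(* project each measurement vector down to three dimensions for plotting and clustering *)
data3D = DimensionReduce[measurements, 3];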
12. Clustering and classifying data: ClusterClassify
ClusterClassify automatically determines the number of clusters and classifies the data accordingly:
Manipulate[With[{CC = ClusterClassify[data3D, Method → method][data3D]}, ListPointPlot3D[Map[Last, GatherBy[Transpose[{CC, data3D}], First], {2}],
ImageSize → 500, PlotLegends → SwatchLegend[Union[CC], LegendLabel → "Cluster ID", LegendFunction → Panel, LegendMarkers → "SphereBubble"]]],
{method, {"GaussianMixture", "DBSCAN", "MeanShift", "Agglomerate", "NeighborhoodContraction"}}, SaveDefinitions → True]
[Interactive output: 3D scatter plot of data3D colored by cluster ID (1–3), with buttons to switch between the five clustering methods]
13. Classifying data: Classify
Rock-paper-scissors
Click Reset.
Hold up a fist in front of the camera and click Rock. Change your hand to paper as you click the Paper button, and do the same for scissors. Capture 10–12 images of each, then click Stop when you are done. Click Train and wait. Finally, click Watch and hold up some rock-paper-scissors gestures: the classifier should recognize what you are doing.
[Interactive demo: live camera view with a counter of captured images and the buttons Capture: Rock, Paper, Scissors, Watch, Stop]
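A minimal sketch of the training step behind this demo, assuming the captured frames have been collected into the lists rockImgs, paperImgs and scissorsImgs (hypothetical names):
(* train an image classifier from the captured example frames *)
gesture = Classify[<|"rock" -> rockImgs, "paper" -> paperImgs, "scissors" -> scissorsImgs|>];
(* classify the current camera frame *)
gesture[CurrentImage[]]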
14. Find the optimal parameters of a classifier
Load a dataset and split it into a training set and a test set.
data = RandomSample[ExampleData[{"MachineLearning", "Titanic"}, "Data"]];
training = data[[ ;; 1000]];
test = data[[1001 ;;]];
Define a function computing the performance of a classifier as a function of its (hyper)parameters.
loss[{c_, gamma_, b_, d_}] :=
-ClassifierMeasurements[Classify[training, Method → {"SupportVectorMachine", "KernelType" → "Polynomial", "SoftMarginParameter" → Exp[c],
"GammaScalingParameter" → Exp[gamma], "BiasParameter" → Exp[b], "PolynomialDegree" → d}], test, "LogLikelihoodRate"];
Define the possible values of the parameters.
region = ImplicitRegion[And[-3. ≤ c ≤ 3., -3. ≤ gamma ≤ 3., -1. ≤ b ≤ 2., 1 ≤ d ≤ 3, d ∈ Integers], {c, gamma, b, d}]
Search for a good set of parameters; the "MinimumConfiguration" property of the result gives the best parameter values found.
bmo = BayesianMinimization[loss, region]
bmo["MinimumConfiguration"]
Train a classifier with these parameters.
Classify[training, Method → {"SupportVectorMachine", "KernelType" → "Polynomial", "SoftMarginParameter" → Exp[2.979837222482109`],
"GammaScalingParameter" → Exp[-2.1506497693543025`], "BiasParameter" → Exp[-0.9038364134482837`], "PolynomialDegree" → 2}]
ClassifierMeasurements[%, test, "Accuracy"]
15. Neural Networks: Digit classification
Use the MNIST database of handwritten digits to train a convolutional network to predict the digit given an image.
First obtain the training and validation data.
resource = ResourceObject["MNIST"];
trainingData = ResourceData[resource, "TrainingData"];
testData = ResourceData[resource, "TestData"];
RandomSample[trainingData, 5]
Define a convolutional neural network that takes in 28×28 grayscale images as input.
lenet = NetChain[{ConvolutionLayer[20, 5], Ramp, PoolingLayer[2, 2], ConvolutionLayer[50, 5], Ramp, PoolingLayer[2, 2], FlattenLayer[], 500, Ramp, 10, SoftmaxLayer[]},
"Output" → NetDecoder[{"Class", Range[0, 9]}], "Input" → NetEncoder[{"Image", {28, 28}, "Grayscale"}]]
[NetChain summary (uninitialized): ConvolutionLayer → Ramp → PoolingLayer → ConvolutionLayer → Ramp → PoolingLayer → FlattenLayer → LinearLayer (500) → Ramp → LinearLayer (10) → SoftmaxLayer]
Train the network for one training round.
lenet = NetTrain[lenet, trainingData, ValidationSet → testData, MaxTrainingRounds → 1];
Evaluate the trained network directly on images randomly sampled from the validation set.
imgs = Keys@RandomSample[testData, 5];
Thread[imgs → lenet[imgs]]
[Output: five sampled test images classified as 4, 0, 6, 7, and 2]
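To quantify overall performance on the validation set (a natural follow-up, not shown on the slide), the predicted digits can be compared with the labels directly:
(* fraction of test images whose predicted digit matches its label *)
predictions = lenet[Keys[testData]];
N@Mean[Boole[Thread[predictions == Values[testData]]]]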
17. Neural Networks: Unsupervised learning with autoencoders
Train an autoencoder network to reconstruct images of handwritten digits after projecting them to a lower-dimensional “code” vector space. Use these code vectors to perform clustering and visualization.
First obtain the training data, then select images corresponding to digits 0 through 4.
resource = ResourceObject["MNIST"];
trainingData = ResourceData[resource, "TrainingData"];
trainingSubset = Select[trainingData, Last[#] ≤ 4 &];
testData = ResourceData[resource, "TestData"];
testSubset = Select[testData, Last[#] ≤ 4 &];
RandomSample[trainingSubset, 8]
[Output: eight sampled training images with the labels 1, 3, 0, 0, 4, 2, 1, 4]
Obtain the “mean image” to subtract from the training data.
trainingImages = Keys[trainingSubset];
meanImage = Image[Mean@Map[ImageData, trainingImages]]
Create a network to train that produces both the reconstruction and the reconstruction error.
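The network definition itself is cut off at the page break. Here is a minimal sketch consistent with the encoder extracted on slide 19, using a mean-squared reconstruction loss; the layer sizes and the 50-dimensional code are assumptions:
(* layers 1-4 form the encoder (image -> code), layers 5-8 the decoder (code -> image) *)
net = NetGraph[{FlattenLayer[], 50, Ramp, 50, 50, Ramp, 784, ReshapeLayer[{1, 28, 28}], MeanSquaredLossLayer[]},
  {NetPort["Input"] -> 1, 1 -> 2 -> 3 -> 4 -> 5 -> 6 -> 7 -> 8, 8 -> NetPort[9, "Input"], NetPort["Input"] -> NetPort[9, "Target"]},
  "Input" -> NetEncoder[{"Image", {28, 28}, "Grayscale"}]];
(* subtract the mean image computed above, then train; trained4 is the name used on slide 19 *)
trained4 = NetTrain[net, <|"Input" -> (ImageSubtract[#, meanImage] & /@ trainingImages)|>];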
19. encoder = Take[trained4, {NetPort["Input"], 4}]
[NetGraph summary: four layers (FlattenLayer → LinearLayer → Ramp → LinearLayer) mapping the 1×28×28 input image to a 50-dimensional code vector]
Compute codes for all of the test images.
testImages = Keys[testSubset];
features = encoder[testImages];
Project the code vectors to three dimensions and visualize them along with the original classes (not seen by the network). The digit classes tend to cluster together.
coords = DimensionReduce[features, 3];
classes = Values[testSubset];
Table[Extract[coords, Position[classes, i]], {i, 0, 4}]
ListPointPlot3D[Table[Extract[coords, Position[classes, i]], {i, 0, 4}], PlotLegends → PointLegend[96, Range[0, 4]],
BoxRatios → 1, Axes → None, Boxed → True, PlotStyle → Map[ColorData[96], Range[1, 5]], AspectRatio → 1]
[3D scatter plot of the reduced code vectors, colored by digit class 0–4]
20. Visualize a hierarchical clustering of random representatives from each class.
representatives = Catenate@GroupBy[testSubset, Last → First, RandomSample[#, 6] &];
ClusteringTree[encoder[representatives] → Map[ImageCrop, representatives]]
21. Neural Networks: Avoid overfitting using a hold-out set
Use the ValidationSet option of NetTrain to ensure that the trained net does not overfit the training data. The validation data is commonly referred to as a test or hold-out dataset.
Create synthetic training data based on a Gaussian curve.
data = Table[x → Exp[-x^2] + RandomVariate[NormalDistribution[0, .15]], {x, -3, 3, .2}];
plot = ListPlot[List @@@ data, PlotStyle → Red]
[Plot: the noisy samples scattered around a Gaussian bump; x runs from -3 to 3, y from about -0.2 to 1.0]
Train a net with a large number of parameters relative to the amount of training data.
net = NetChain[{150, Tanh, 150, Tanh, 1}, "Input" → "Scalar", "Output" → "Scalar"];
net1 = NetTrain[net, data, Method → "ADAM"]
[NetChain summary: LinearLayer (150) → Tanh → LinearLayer (150) → Tanh → LinearLayer (1)]
The resulting net overfits the data, learning the noise in addition to the underlying function.
22. Show[Plot[net1[x], {x, -3, 3}], plot]
[Plot: net1's wiggly fit overlaid on the data points; the curve follows the noise]
Subdivide the data into a training set and a hold-out validation set.
data = RandomSample[data];
{train, test} = TakeDrop[data, 24];
Use the ValidationSet option to have NetTrain select the net that achieved the lowest validation loss during training.
net2 = NetTrain[net, train, ValidationSet → test]
[NetChain summary: LinearLayer (150) → Tanh → LinearLayer (150) → Tanh → LinearLayer (1)]
The result returned by NetTrain was the net that generalized best to points in the validation set, as measured by validation loss. This penalizes overfitting, as the noise present in the training data is uncorrelated with the noise present in the validation set.
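For comparison, the smoother fit of net2 can be overlaid on the data exactly as before (a natural follow-up, not shown on this slide):
Show[Plot[net2[x], {x, -3, 3}], plot]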
28. Conclusion
◼ Don’t restrict yourself to any particular approach or method without need!
◼ Don’t imply the answer when defining a question!
◼ Stay curious!
29. Thanks for listening!
For questions and suggestions, contact kai.gansel@additive-net.de.
http://www.additive-mathematica.de
ADDITIVE Soft- und Hardware für Technik und Wissenschaft GmbH
Max-Planck-Straße 22b, 61381 Friedrichsdorf
Sales: 06172 - 5905 - 30 // mathematica@additive-net.de
Academy: 06172 - 5905 - 90 // academy@additive-net.de
Support: 06172 - 5905 - 20 // support@additive-net.de