Intelligent analysis of environmental data: an introduction Mikhail Kanevski – Institute of Geomatics and Risk Analysis (IGAR), University of Lausanne (Switzerland)
This document summarizes an international workshop on intelligent analysis of environmental data hosted by the Institute of Geomatics and Analysis of Risk at the University of Lausanne. The workshop covered topics like monitoring networks, exploratory spatial data analysis, predictions and simulations using methods like geostatistics and machine learning. It provided examples of applying these methods to problems in environmental fields like measuring radiation, metals, and radon levels.
1. International Workshop:
Intelligent Analysis of Environmental Data
Institute of Geomatics and
Analysis of Risk (IGAR)
University of Lausanne,
Switzerland
Prof. Mikhail Kanevski
M. Kanevski, Palermo 2009 1
2. Comments and questions to:
• Mikhail.Kanevski@unil.ch
– www.unil.ch/igar
– www.geokernels.org
4. Geo- and Environmental Data
(classes, continuous, images, networks, geomanifolds,…)
• Spatio-temporal
• Multi-scale
• Multivariate
• Highly variable at many scales
• High-dimensional geo-feature spaces
• Uncertainties
• ………….
• In some cases we do have science-based
models: data/knowledge/models integration
5. Spatio-temporal data in terms of
patterns/structures:
a. pattern recognition (pattern
discovery, pattern extraction),
b. pattern modelling,
c. pattern prediction
6. Main Topics:
• Review and posing of typical problems.
• From “numbers” to data
• Collection of data: Monitoring networks and data
representativity? Monitoring network optimisation.
• Get more information value from your data –
EXPLORE ! Exploratory spatio-temporal data
analysis (EDA, ESDA).
• Predictions/estimations or simulations? Risk
analysis and mapping
• Let data speak for themselves: learning from data.
Data mining, Machine learning.
7. Methods:
• Monitoring networks descriptions
• Geostatistics: predictions/simulations
• Machine Learning (neural nets, SLT):
– Neural networks: MLP, PNN, GRNN, RBF, SOM.
ANNEX models. Hybrid models
– Support Vector Machines
• Recent trends in geostatistics: Multiple-points
geostatistics, pattern based geostatistics.
• Bayesian approach for uncertainty assessment,
integration of data and science-based models
(Bayesian Maximum Entropy)
8. Spatial data analysis: typical tasks
• Predict a value at a given point.
• Build a map (isolines, 3D surfaces,..).
• Estimate prediction error.
• Take into account measurement errors.
• Risk mapping: uncertainty mapping around an unknown
value. Estimate the probability of exceeding a
given decision level.
• Joint predictions of several variables (improve
predictions on primary variable using auxiliary data and
information).
• Optimization of monitoring network (design/ redesign)
• Simulations: modelling of spatial uncertainty and
variability
• Data/Science-based models assimilation/fusion
• Image analysis. Remote sensing
• Spatio-temporal events (forest fires, epidemiology,
crime,…)
• Predictions/simulations in high dimensional spaces
• ………………………………………..
9. Generic Methodology
[Flowchart blocks: data and data base management system; statistical description; quick visualisation; monitoring network analysis; variography; deterministic interpolations; monitoring network generation; cross-validation; machine learning algorithms; geostatistical algorithms; predictions and simulations; decision-oriented mapping; GIS, remote sensing]
10. GEOSTATISTICAL ANALYSIS
• Basic/Naïve statistical analysis. EDA
• ESDA (regionalized EDA)
• Structural analysis. Spatial correlation analysis
(variography)
• Model selection: Cross-validation, jack-knife,…
• Prediction and error mapping for decision
making (family of kriging models)
• Probability and Risk mapping. Conditional
stochastic simulations
11. Some Geostatistics
• Exploration of spatial correlations
• Family of kriging models (simple, ordinary,
disjunctive, indicator,…)
• Conditional Stochastic Simulations
19. Use of Cs137 to improve Sr90 predictions (reduced errors and uncertainty).
Decision-oriented mapping: « Thick isolines »
25. Geostatistics: some comments
• Geostatistics is a powerful and well elaborated
model-dependent approach.
• Geostatistics offers a variety of models for spatial
data analysis and modeling. It has a long and
successful history of development and applications.
• Some problems:
Nonlinearity
Non-stationarity
Two-point statistics
Data/models integration
Data mining. Pattern recognition
• Hybrid Models (ANN/SVM + Geostat) can help.
26. Some useful comments, conclusions
and future research
• Detection of patterns: try k-NN or GRNN
as exploratory tools
• Cross-validation: leave-one-out, leave-k-out,
jackknife, etc. as a control tool
• Model selection and model assessment
28. K-NN prediction:
k-NN methods use the k observations in the training data
set T closest in input space to the prediction point x to
estimate Y:
$$\hat{Y}(x) = \frac{1}{k} \sum_{x_i \in N_k(x)} y_i$$
where N_k(x) is the neighborhood of x defined by the
k closest points in the training set.
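The k-NN average can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in the workshop: the function name `knn_predict`, the Euclidean distance and the toy data are assumptions made for the example.

```python
import numpy as np

def knn_predict(X_train, y_train, x, k):
    """Average the targets of the k training points closest to x."""
    X_train = np.asarray(X_train, dtype=float)
    y_train = np.asarray(y_train, dtype=float)
    # Euclidean distance from the query point to every training point
    # (the choice of metric is an assumption of this sketch)
    dist = np.linalg.norm(X_train - np.asarray(x, dtype=float), axis=1)
    # Indices of the k nearest neighbours, i.e. the set N_k(x)
    nearest = np.argsort(dist)[:k]
    return y_train[nearest].mean()

# Toy data (hypothetical): four measurement locations and one value at each
X = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
y = [1.0, 2.0, 3.0, 10.0]

print(knn_predict(X, y, [0.1, 0.1], k=3))
```

With k = 3 the distant point at (5, 5) is ignored and the prediction is the mean of the three nearby values.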
29. k-NN Classifiers
These classifiers are memory-based and do
not require any model to be fit! Given a
query point x, we find the k training points
closest in distance to x and then
classify using a MAJORITY vote among the
k neighbors.
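The majority vote can be sketched as follows. The function name `knn_classify`, the Euclidean distance, the tie-breaking behaviour and the toy labels are assumptions of this example, not details taken from the slides.

```python
from collections import Counter

import numpy as np

def knn_classify(X_train, labels, x, k):
    """Return the most frequent label among the k nearest training points."""
    X_train = np.asarray(X_train, dtype=float)
    dist = np.linalg.norm(X_train - np.asarray(x, dtype=float), axis=1)
    nearest = np.argsort(dist)[:k]
    # Count the labels of the neighbours, nearest first; on a tie,
    # Counter returns the label that was encountered first
    votes = Counter(labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Toy data (hypothetical): two well-separated groups of points
X = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [5.0, 5.0], [5.0, 6.0]]
labels = ["low", "low", "low", "high", "high"]

print(knn_classify(X, labels, [4.5, 5.0], k=3))
```

No model is fitted here: the training set itself is the "memory" and all the work happens at query time.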
30. Because it uses only the training point closest to
the query point, the bias of the 1-NN estimate is
often low, but the variance is high.
A famous result of Cover and Hart (1967) shows
that asymptotically the error rate of the 1-NN
classifier is never more than twice the Bayes
rate.
This result can provide a rough idea about the best
performance that is possible in a given problem:
if the 1-NN rule has a 10% error rate, then
asymptotically the Bayes error rate is at least
5%.
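The arithmetic of this bound can be written out as a small sketch. The first function is the factor-of-two statement from the slide. The second uses the tighter two-class form of the Cover and Hart bound, err_1NN <= 2 R* (1 - R*), which is not quoted on the slide and is included here as an assumption about the two-class setting. Both function names are hypothetical.

```python
import math

def bayes_lower_bound(err_1nn):
    """Factor-of-two bound: the Bayes rate is at least err_1nn / 2."""
    return err_1nn / 2.0

def bayes_lower_bound_two_class(err_1nn):
    """Two-class bound obtained by solving err_1nn <= 2 R (1 - R) for R.

    Valid for asymptotic 1-NN error rates up to 0.5.
    """
    return (1.0 - math.sqrt(1.0 - 2.0 * err_1nn)) / 2.0

err = 0.10
print(bayes_lower_bound(err))            # the 5% quoted on the slide
print(bayes_lower_bound_two_class(err))  # slightly above 5%
```

Both bounds are asymptotic, so on a finite sample they give only a rough indication.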
40. Machine Learning Algorithms
• Machine learning is an area of artificial intelligence
concerned with the development of techniques
which allow computers to "learn".
• More specifically, machine learning is a method
for creating computer programs by the analysis of
data sets. Machine learning overlaps heavily with
statistics, since both fields study the analysis of
data, but unlike statistics, machine learning is
concerned with the algorithmic complexity of
computational implementations. ...
41. Algorithms
Common algorithm types include:
• supervised learning – where the algorithm generates a function that
maps inputs to desired outputs.
• unsupervised learning – which models a set of inputs: labeled
examples are not available.
• semi-supervised learning – which combines both labeled and
unlabeled examples to generate an appropriate function or classifier.
• reinforcement learning – where the algorithm learns a policy of how to
act given an observation of the world. Every action has some impact in
the environment, and the environment provides feedback that guides
the learning algorithm.
• transduction – similar to supervised learning, but does not explicitly
construct a function: instead, tries to predict new outputs based on
training inputs, training outputs, and new inputs.
• The performance and computational analysis of machine learning
algorithms is a branch of statistics known as
computational learning theory.
42. ML Topics (short lists)
• Machine learning topics
• Modeling conditional probability density functions,
regression and classification
– Artificial neural networks
– Decision trees
– Gene expression programming
– Genetic Programming
– Gaussian process regression
– Linear discriminant analysis
– k-Nearest Neighbor
– Minimum message length
– Perceptron
– Quadratic classifier
– Radial basis functions
– Support vector machines
43. ML Topics (continued)
• Modeling probability density functions through generative models:
– Expectation-maximization algorithm
– Graphical models including Bayesian networks and Markov Random Fields
– Generative Topographic Mapping
• Approximate inference techniques:
– Markov chain Monte Carlo method
– Variational Bayes
• Meta-Learning (Ensemble methods):
– Boosting
– Bootstrap Aggregating aka Bagging
– Random forest
– Weighted Majority Algorithm
• Optimization: most of the methods listed above either use optimization or are
instances of optimization algorithms.
• Multi-objective Machine Learning: an approach that addresses multiple, and
often conflicting, learning objectives explicitly using Pareto-based
multi-objective optimization techniques.
44. Machine Learning
• Artificial Neural Networks
3. Multilayer perceptrons (MLP)
4. General Regression Neural
Networks (GRNN)
• Statistical Learning Theory
Support Vector Classification
Support Vector Regression
Monitoring Networks Optimization
45. A Generic Model of
Learning from Data/Examples
[Diagram blocks: Generator, Supervisor, Learning Machine]
46. The Problem of Risk Minimization
In order to choose the best available approximation
to the supervisor’s response, one measures
the loss or discrepancy L(y, f(x,α))
between the response y of the supervisor
to a given input x and the response f(x,α)
provided by the learning machine.
47. Three Main Learning Problems
• Regression Estimation. Let the supervisor’s
answer y be a real value, and let f(x,α), α∈Λ,
be a set of real functions which contains the
regression function
$$f(x, \alpha_0) = \int y \, dF(y \mid x)$$
48. The Problem of Risk Minimization
Consider the expected value of the loss,
given by the risk functional
$$R(\alpha) = \int L(y, f(x, \alpha)) \, dF(x, y)$$
The goal is to find the function f(x, α_0) which minimises
the risk in the situation where the joint pdf is
unknown and the only available information is
contained in the training set.
50. Three Main Learning Problems
• Pattern Recognition (classification).
y ∈ {0,1}, classification error:
$$L(y, f(x, \alpha)) = \begin{cases} 0, & \text{if } y = f(x, \alpha) \\ 1, & \text{if } y \neq f(x, \alpha) \end{cases}$$
52. Three Main Learning Problems
• Regression Estimation
It is known that the regression function is the one
which minimizes the risk functional with the following loss function:
$$L(y, f(x, \alpha)) = (y - f(x, \alpha))^2$$
54. Three Main Learning Problems
• Density Estimation. For this problem
we consider the following loss function:
$$L(p(x, \alpha)) = -\log p(x, \alpha)$$
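The three loss functions can be sketched in Python, together with the average loss over a sample. Because the joint distribution is unknown, that sample average is the quantity one can actually compute; it is commonly called the empirical risk, a term that does not appear on the slides. All function names and the toy numbers are assumptions of this example.

```python
import numpy as np

def zero_one_loss(y, f):
    """Classification loss: 0 where y equals f, 1 otherwise."""
    return (np.asarray(y) != np.asarray(f)).astype(float)

def squared_loss(y, f):
    """Regression loss: (y - f)^2."""
    return (np.asarray(y, dtype=float) - np.asarray(f, dtype=float)) ** 2

def neg_log_loss(p):
    """Density estimation loss: -log p(x, alpha), given the density values."""
    return -np.log(np.asarray(p, dtype=float))

def empirical_risk(losses):
    """Average loss over a sample, a stand-in for the risk integral."""
    return float(np.mean(losses))

# Toy responses (hypothetical): supervisor y versus learning machine f
print(empirical_risk(zero_one_loss([0, 1, 1, 0], [0, 1, 0, 0])))
print(empirical_risk(squared_loss([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])))
print(empirical_risk(neg_log_loss([0.5, 0.25])))
```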
55. Inductive, Deductive and Transductive
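[Diagram: induction leads from the training samples (xi, yi) to the model F(x,y); deduction leads from F(x,y) to new predictions (ynew, xnew); transduction leads directly from the training samples to (ynew, xnew)]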
56. Why Machine Learning algorithms?
• Universal, nonlinear, robust tools
• Data adapted
• Easy data and knowledge integration
• Efficient in high dimensional spaces
• Good generalisation (low prediction
error)
• Input/feature selection
57. Our experience, some applications
• Hydrogeology, pollution/contamination (soil, water, air,
food chains,…), topo-climatic modelling, geophysics
• Renewable resources – wind fields
• Natural hazards/risks: forest fires, avalanches, indoor
radon,
• Optimization of monitoring networks
• Crime data, epidemiology
• MNL for remote sensing, change detection
• Socio-economic spatio-temporal multivariate data
• Spatial econometrics. Financial data. Econophysics
• Fractals, Chaos, EVT,
• Time series
59. Guillaume d'Occam (1285 - 1349)
“Pluralitas non est ponenda sine
necessitate”
Occam’s razor:
“The simpler explanation of a
phenomenon is more likely to be
correct”
60. Model Assessment and Model
Selection:
Two separate goals
61. Model Selection:
Estimating the performance of different
models in order to choose the
(approximate) best one
Model Assessment:
Having chosen a final model, estimating its
prediction error (generalization error) on
new data
62. If we are in a data-rich situation, the best
solution is to split the data randomly (?) into three parts:
Raw data: Train 50% | Validation 25% | Test 25%
Alternative naming: (Train) | (test) | (validation)
63. Interpretation
• The training set is used to fit the models
• The validation set is used to estimate prediction
error for model selection (tuning
hyperparameters)
• The test set is used for assessment of the
generalization error of the final chosen model
Elements of Statistical Learning- Hastie, Tibshirani & Friedman 2001
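The three roles can be illustrated with a short sketch: a random 50/25/25 split, selection of the k-NN hyperparameter k on the validation set, and a final assessment on the test set. The synthetic data, the candidate values of k and all function names are assumptions chosen for the example. A purely random split may also be questionable for spatially clustered data, which is presumably what the "(?)" on the previous slide points to.

```python
import numpy as np

def split_indices(n, rng, train=0.5, valid=0.25):
    """Randomly split range(n) into train, validation and test indices."""
    idx = rng.permutation(n)
    n_train, n_valid = int(train * n), int(valid * n)
    return idx[:n_train], idx[n_train:n_train + n_valid], idx[n_train + n_valid:]

def knn_predict(X_train, y_train, X_query, k):
    """k-NN regression for several query points at once."""
    d = np.linalg.norm(X_query[:, None, :] - X_train[None, :, :], axis=2)
    nearest = np.argsort(d, axis=1)[:, :k]
    return y_train[nearest].mean(axis=1)

def mse(y, f):
    return float(np.mean((y - f) ** 2))

# Synthetic 1D data (hypothetical): noisy sine curve
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 10.0, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(0.0, 0.3, size=200)

tr, va, te = split_indices(len(X), rng)

# Model selection: choose k by the validation error
candidates = [1, 3, 5, 9, 15, 25]
val_err = {k: mse(y[va], knn_predict(X[tr], y[tr], X[va], k)) for k in candidates}
best_k = min(val_err, key=val_err.get)

# Model assessment: the test set is used once, for the chosen model only
test_err = mse(y[te], knn_predict(X[tr], y[tr], X[te], best_k))
print(best_k, test_err)
```

The test error is computed only after k is fixed, so it is not biased by the selection step.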
64. Bias and Variance.
Model’s complexity
[Figure panels: b. Overfitting; c. Underfitting]
65. One of the most serious problems that arises in
connectionist learning by neural networks is
overfitting of the provided training examples.
This means that the learned function fits the training data very
closely but does not generalise well; that is, it cannot model
unseen data from the same task sufficiently well.
Solution: balance the statistical bias and statistical
variance during neural network learning in
order to achieve the smallest average generalization
error.
67. We can derive an expression for the
expected prediction error of a
regression at an input point X=x0
using squared-error loss:
68.
$$
\begin{aligned}
\mathrm{Err}(x_0) &= E[(Y - \hat{f}(x_0))^2 \mid X = x_0] \\
&= \sigma_\varepsilon^2 + [E\hat{f}(x_0) - f(x_0)]^2 + E[\hat{f}(x_0) - E\hat{f}(x_0)]^2 \\
&= \sigma_\varepsilon^2 + \mathrm{Bias}^2(\hat{f}(x_0)) + \mathrm{Var}(\hat{f}(x_0)) \\
&= \text{Irreducible Error} + \text{Bias}^2 + \text{Variance}
\end{aligned}
$$
69. • The first term is the variance of the target around
its true mean f(x0), and cannot be avoided no
matter how well we estimate f(x0), unless $\sigma_\varepsilon^2 = 0$.
• The second term is the squared bias, the amount
by which the average of our estimate differs from
the true mean.
• The last term is the variance, the expected
squared deviation of $\hat{f}(x_0)$ around its mean.
70. For the k-NN regression fit
$$
\mathrm{Err}(x_0) = E[(Y - \hat{f}(x_0))^2 \mid X = x_0]
= \sigma_\varepsilon^2 + \Big[ f(x_0) - \frac{1}{k} \sum_{l=1}^{k} f(x_{(l)}) \Big]^2 + \frac{\sigma_\varepsilon^2}{k}
$$
where $x_{(l)}$ is the l-th nearest neighbor of $x_0$.
Here we assume for simplicity that the training
inputs are fixed, and that the randomness arises
from Y. The number of neighbors k is
inversely related to the model complexity.
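The three terms of the k-NN formula can be evaluated directly when the true function is known, and checked against a simulation. The sketch below assumes a 1D sine as the true function f, a fixed regular grid of training inputs and Gaussian noise; these choices and the function names are illustrative, not taken from the slides.

```python
import numpy as np

def knn_error_terms(f, x_train, x0, k, sigma):
    """Irreducible error, squared bias and variance of k-NN at x0."""
    nearest = np.argsort(np.abs(x_train - x0))[:k]
    bias2 = (f(x0) - f(x_train[nearest]).mean()) ** 2
    variance = sigma ** 2 / k
    return sigma ** 2, bias2, variance

def simulate_error(f, x_train, x0, k, sigma, n_rep, rng):
    """Monte Carlo estimate of Err(x0) with fixed inputs and random Y."""
    nearest = np.argsort(np.abs(x_train - x0))[:k]
    # Fresh noisy targets at the k neighbours for every replicate
    y_near = f(x_train[nearest]) + rng.normal(0.0, sigma, size=(n_rep, k))
    f_hat = y_near.mean(axis=1)
    # Independent new observation at x0
    y0 = f(x0) + rng.normal(0.0, sigma, size=n_rep)
    return float(np.mean((y0 - f_hat) ** 2))

f = np.sin
x_train = np.linspace(0.0, 10.0, 51)  # fixed training inputs
x0 = x_train[8]                       # query at a training location
sigma = 0.3

rng = np.random.default_rng(1)
for k in (1, 5, 25):
    irreducible, bias2, variance = knn_error_terms(f, x_train, x0, k, sigma)
    simulated = simulate_error(f, x_train, x0, k, sigma, 20000, rng)
    print(k, irreducible, bias2, variance, irreducible + bias2 + variance, simulated)
```

As k grows, the variance term shrinks as sigma^2 / k while the squared bias grows, because more distant neighbours enter the average.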
71. Elements of Statistical Learning. Hastie, Tibshirani & Friedman 2001
73. • A neural network is only as good as the
training data!
• Poor training data inevitably leads to an
unreliable and unpredictable network.
• Exploratory Data Analysis and data
preprocessing are extremely important!!!
74. • If possible, prior to training, add some
noise or other randomness to your
examples (such as a random scaling
factor). This helps to account for noise and
natural variability in real data, and tends to
produce a more reliable network.
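One possible way to add such randomness is sketched below. The function name `jitter`, the noise level and the scaling range are hypothetical values chosen for illustration; suitable magnitudes depend on the data and are not given in the slides.

```python
import numpy as np

def jitter(X, rng, noise_std=0.05, scale_range=0.02):
    """Return a perturbed copy of the training inputs.

    Each example (row) is multiplied by its own random scaling factor
    in [1 - scale_range, 1 + scale_range], then Gaussian noise is added.
    """
    X = np.asarray(X, dtype=float)
    scale = 1.0 + rng.uniform(-scale_range, scale_range, size=(X.shape[0], 1))
    noise = rng.normal(0.0, noise_std, size=X.shape)
    return X * scale + noise

rng = np.random.default_rng(0)
X = np.ones((4, 2))  # toy inputs (hypothetical)
X_noisy = jitter(X, rng)
print(X_noisy)
```

The inputs are usually standardised first, so that one noise level is meaningful for all features.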
76. [Flowchart of the NNRK/CK algorithm: raw data F1,F2,...,Fn; statistical description; structural (variogram) analysis; multivariate structural analysis; statistical trend analysis; data split for training, validation and testing; ANN architecture choice; ANN training, validation and testing; accuracy test; ANN estimates for F1,F2,...,Fn; ANN residuals; variogram model for residuals; cross-validation; cokriging; final estimates (ANN + geostatistics) and error estimates]
77. Model: Neural Network Residual Cokriging
[Figure panels: Artificial Neural Network Estimate; Geostatistical Estimate of the Residuals; Final estimate of 90Sr with NNRCK]
78. Conclusions
• Machine Learning: a universal, data-driven,
recently developed approach with many
successful applications. Nonlinear, robust.
Integration of different types of data and
information. Efficient in high dimensional
spaces.
• But: depends on the quality and quantity of
data. Open issues: uncertainty characterization,
diagnostic tools, hyper-parameter tuning.
79. Topics for the research
• Multitask learning
• Automatic feature selection/ feature extraction
• Uncertainties characterisation
• Understanding and visualisation of high
dimensional data
• Modelling on geomanifold, semi-supervised
learning
• Active learning
• MLA and simulations?
• ……………………………………………………
80. Thank you for your attention!
www.geokernels.org
www.unil.ch/igar