1. A comparison of learning methods to predict N2O fluxes and N leaching
Nathalie Villa-Vialaneix
http://www.nathalievilla.org
Joint work with Marco Follador, Marco Ratto and Adrian Leip (EC, Ispra, Italy)
April 27th, 2012 - BIA, INRA Auzeville
SAMM (Université Paris 1) &
IUT de Carcassonne (Université de Perpignan)
Nathalie Villa-Vialaneix (April 27th, 2012) Comparison of metamodels SAMM & UPVD 1 / 27
3. DNDC-Europe model description
Outline
1 DNDC-Europe model description
2 Methodology
3 Results
5. DNDC-Europe model description
General overview
Modern issues in agriculture
• fight against the food crisis;
• while preserving environments.
EC needs simulation tools to
• link direct aid payments to compliance with standards ensuring proper management;
• quantify the environmental impact of European policies (“Cross Compliance”).
6. DNDC-Europe model description
Cross Compliance Assessment Tool
DNDC is a biogeochemical model.
7. DNDC-Europe model description
Zoom on DNDC-EUROPE
8. DNDC-Europe model description
Moving from DNDC-Europe to metamodeling
Needs for metamodeling
• easier integration into CCAT
• faster execution and more responsive scenario analysis
10. DNDC-Europe model description
Data [Villa-Vialaneix et al., 2012]
Data extracted from the biogeochemical simulator DNDC-EUROPE: ∼19 000 HSMU (Homogeneous Soil Mapping Units; nominally 1 km², but of quite variable area) used for corn cultivation:
• corn corresponds to 4.6% of UAA;
• HSMU for which at least 10% of the agricultural land was used for
corn were selected.
11. DNDC-Europe model description
Data [Villa-Vialaneix et al., 2012]
Data extracted from the biogeochemical simulator DNDC-EUROPE:
11 input (explanatory) variables (selected by experts and previous
simulations)
• N FR (N input through fertilization; kg/ha y);
• N MR (N input through manure spreading; kg/ha y);
• Nfix (N input from biological fixation; kg/ha y);
• Nres (N input from root residue; kg/ha y);
• BD (Bulk Density; g/cm³);
• SOC (Soil organic carbon in topsoil; mass fraction);
• PH (Soil pH);
• Clay (Ratio of soil clay content);
• Rain (Annual precipitation; mm/y);
• Tmean (Annual mean temperature; °C);
• Nr (Concentration of N in rain; ppm).
12. DNDC-Europe model description
Data [Villa-Vialaneix et al., 2012]
Data extracted from the biogeochemical simulator DNDC-EUROPE:
2 outputs to be estimated (independently) from the inputs:
• N2O fluxes (greenhouse gas);
• N leaching (one major cause for water pollution).
16. Methodology
Methodology
Purpose: Comparison of several metamodeling approaches (accuracy,
computational time...).
For every data set, every output and every method,
1 The data set was split into a training set and a test set (on an 80%/20% basis);
2 The regression function was learned from the training set (with a
full validation process for the hyperparameter tuning);
3 The performances were calculated on the basis of the test set: for
the test set, predictions were made from the inputs and compared to
the true outputs.
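The three steps above can be sketched as follows. This is a minimal illustration with synthetic data standing in for the HSMU table (the actual experiments used the DNDC-EUROPE outputs), and ordinary least squares standing in for any of the metamodels:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the data: 1000 observations, 11 inputs, 1 output.
X = rng.normal(size=(1000, 11))
y = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=1000)

# Step 1: random 80%/20% train/test split.
idx = rng.permutation(len(X))
n_train = int(0.8 * len(X))
train, test = idx[:n_train], idx[n_train:]

# Step 2: learn the regression function on the training set only.
coef, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)

# Step 3: predict on the test set and compare to the true outputs (R^2).
pred = X[test] @ coef
r2 = 1 - np.sum((y[test] - pred) ** 2) / np.sum((y[test] - y[test].mean()) ** 2)
```

In practice step 2 also includes a validation loop for hyper-parameter tuning, which ordinary least squares does not need.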
17. Methodology
Methods
• 2 linear models:
• one with the 11 explanatory variables;
• one with the 11 explanatory variables plus several nonlinear
transformations of these variables (square, log...): stepwise AIC was
used to train the model;
• MLP (multilayer perceptrons)
• SVM (support vector machines)
• RF (random forests)
• 3 approaches based on splines: ACOSSO (ANOVA splines), SDR
(improvement of the previous one) and DACE (kriging based
approach).
18. Methodology
Regression
Consider the problem where:
• Y ∈ R has to be estimated from X ∈ R^d;
• we are given a learning set, i.e., N i.i.d. observations of (X, Y),
(x1, y1), . . . , (xN, yN).
Example: Predict N2O fluxes from PH, climate, concentration of N in rain,
fertilization for a large number of HSMU . . .
19. Methodology
Multilayer perceptrons (MLP)
A “one-hidden-layer perceptron” takes the form:
Φ_w : x ∈ R^d ↦ ∑_{i=1}^{Q} w_i^{(2)} G(x^T w_i^{(1)} + w_i^{(0)}) + w_0^{(2)}
where:
• the w are the weights of the MLP that have to be learned from the learning set;
• G is a given activation function: typically, G(z) = (1 − e^{−z}) / (1 + e^{−z});
• Q is the number of neurons on the hidden layer. It controls the flexibility of the MLP. Q is a hyper-parameter that is usually tuned during the learning process.
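The formula can be written directly as code. A minimal sketch following the definitions above (the sizes d = 11 and Q = 5 and the random weights are arbitrary illustration values):

```python
import numpy as np

def G(z):
    # Activation from the slide: G(z) = (1 - e^{-z}) / (1 + e^{-z}), i.e. tanh(z/2).
    return (1 - np.exp(-z)) / (1 + np.exp(-z))

def mlp(x, W1, w0, w2, b2):
    """One-hidden-layer perceptron Phi_w with Q hidden neurons.

    W1: (Q, d) first-layer weights; w0: (Q,) hidden biases;
    w2: (Q,) second-layer weights; b2: scalar output bias w_0^(2).
    """
    hidden = G(W1 @ x + w0)   # the Q hidden-neuron outputs
    return w2 @ hidden + b2   # weighted sum of hidden outputs + output bias

rng = np.random.default_rng(0)
d, Q = 11, 5
x = rng.normal(size=d)
out = mlp(x, rng.normal(size=(Q, d)), rng.normal(size=Q), rng.normal(size=Q), 0.5)
```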
20. Methodology
Symbolic representation of MLP
[Diagram: inputs x_1, …, x_d are connected to hidden neurons 1, …, Q through first-layer weights w^{(1)}; the hidden outputs are combined through second-layer weights w^{(2)} and a bias to produce Φ_w(x).]
24. Methodology
Learning MLP
• Learning the weights: w are learned by a mean squared error minimization scheme, penalized by a weight decay to avoid overfitting (and thus ensure a better generalization ability):
w* = arg min_w ∑_{i=1}^{N} L(y_i, Φ_w(x_i)) + C ‖w‖².
Problem: the penalized MSE is not quadratic in w, so the optimization can end in a local minimum.
• Tuning the hyper-parameters C and Q: simple validation was used to tune C and Q.
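Simple validation can be sketched generically: hold out part of the training set and keep the candidate value that minimizes the held-out error. Here a closed-form ridge penalty C stands in for the MLP's (C, Q) pair, which would be scanned the same way (data and grid are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 11))
y = X[:, 0] - X[:, 1] + rng.normal(scale=0.5, size=500)

# Simple validation: split the training data into a fit part and a held-out part.
idx = rng.permutation(500)
tr, val = idx[:400], idx[400:]

def ridge_fit(X, y, C):
    # Closed-form ridge: w = (X^T X + C I)^{-1} X^T y -- a quadratic (hence
    # exactly solvable) stand-in for the penalized MSE minimization above.
    return np.linalg.solve(X.T @ X + C * np.eye(X.shape[1]), X.T @ y)

def val_mse(C):
    w = ridge_fit(X[tr], y[tr], C)
    return np.mean((y[val] - X[val] @ w) ** 2)

grid = [0.01, 0.1, 1.0, 10.0, 100.0]
best_C = min(grid, key=val_mse)   # keep the value with the lowest validation MSE
```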
25. Methodology
SVM
SVM is also an algorithm based on penalized error loss minimization:
1 Basic linear SVM for regression: Φ_{(w,b)} is of the form x ↦ w^T x + b with (w, b) solution of
arg min_{(w,b)} ∑_{i=1}^{N} L_ε(y_i, Φ_{(w,b)}(x_i)) + λ ‖w‖²
where
• λ is a regularization (hyper-)parameter (to be tuned);
• L_ε(y, ŷ) = max{|y − ŷ| − ε, 0} is the ε-insensitive loss function.
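The ε-insensitive loss is one line of code (eps = 0.1 here is an arbitrary illustration value, not the value used in the experiments):

```python
import numpy as np

def eps_insensitive(y, y_hat, eps=0.1):
    # L_eps(y, y_hat) = max(|y - y_hat| - eps, 0):
    # prediction errors smaller than eps cost nothing.
    return np.maximum(np.abs(y - y_hat) - eps, 0.0)
```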
2 Non-linear SVM for regression is the same, except that a fixed non-linear transformation of the inputs is applied first: φ(x) ∈ H is used instead of x.
Kernel trick: in fact, φ is never made explicit but is used through a kernel K : R^d × R^d → R, with K(x_i, x_j) = φ(x_i)^T φ(x_j).
Common kernel: the Gaussian kernel
K_γ(u, v) = e^{−γ ‖u−v‖²}
is known to have good theoretical properties for both accuracy and generalization.
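A sketch of the Gaussian kernel matrix K(x_i, x_j) computed over a sample (the data and γ = 0.1 are illustrative assumptions):

```python
import numpy as np

def gaussian_kernel_matrix(X, gamma):
    # K[i, j] = exp(-gamma * ||x_i - x_j||^2), via pairwise squared distances:
    # ||u - v||^2 = ||u||^2 + ||v||^2 - 2 u.v
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * X @ X.T
    return np.exp(-gamma * np.maximum(d2, 0.0))  # clamp tiny negative round-off

rng = np.random.default_rng(0)
K = gaussian_kernel_matrix(rng.normal(size=(5, 11)), gamma=0.1)
```

The matrix is symmetric with unit diagonal, as any kernel of this form must be.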
31. Methodology
Learning SVM
• Learning (w, b): w = ∑_{i=1}^{N} α_i K(x_i, ·) and b are calculated by an exact optimization scheme (quadratic programming). The only step that can be time consuming is the calculation of the kernel matrix K(x_i, x_j) for i, j = 1, …, N.
The resulting Φ̂_N is known to be of the form
Φ̂_N(x) = ∑_{i=1}^{N} α_i K(x_i, x) + b
where only a few α_i are non-zero. The corresponding x_i are called support vectors.
• Tuning the hyper-parameters C = 1/λ and γ: simple validation was used. To save time, ε was not tuned in our experiments but was set to the default value (1), which ensured at most 0.5N support vectors.
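The prediction formula Φ̂_N(x) = ∑ α_i K(x_i, x) + b is easy to sketch once (α, b) are known. The coefficients below are toy values chosen by hand so that only three α_i are non-zero (three support vectors); in practice they come out of the quadratic-programming step:

```python
import numpy as np

def gaussian_k(u, v, gamma=0.1):
    return np.exp(-gamma * np.sum((u - v) ** 2))

def svr_predict(x, X_train, alpha, b, gamma=0.1):
    # Phi_hat(x) = sum_i alpha_i K(x_i, x) + b; only support vectors
    # (alpha_i != 0) contribute, so the others can be skipped entirely.
    sv = np.flatnonzero(alpha)
    return sum(alpha[i] * gaussian_k(X_train[i], x, gamma) for i in sv) + b

rng = np.random.default_rng(0)
X_train = rng.normal(size=(20, 11))
alpha = np.zeros(20)
alpha[[2, 7, 11]] = [0.5, -1.2, 0.8]   # toy coefficients: 3 support vectors
pred = svr_predict(rng.normal(size=11), X_train, alpha, b=0.3)
```

Sparsity in α is what makes SVM prediction fast even when N is large.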
32. Methodology
From regression tree to random forest
Example of a regression tree
[Tree diagram: successive binary splits on thresholds of SOC, PH, N_FR and clay, ending in leaves whose predicted values range from about 2.7 to 59.3.]
Each split is made such that the two induced subsets have the greatest possible homogeneity.
The prediction of a final node is the mean of the Y values of the observations belonging to this node.
34. Methodology
Random forest
Basic principle: combination of a large number of individually weak regression trees (the prediction is the mean of all the trees' predictions).
For each tree, two simplifications of the original method are performed:
1 A given number of observations is randomly chosen from the training set: this subset is called the in-bag sample, whereas the remaining observations are called out-of-bag and are used to control the error of the tree;
2 For each node of the tree, a given number of variables is randomly chosen among all possible explanatory variables.
The best split is then calculated on the basis of these variables and of the chosen observations. The chosen observations are the same for a given tree, whereas the variables taken into account change at each split.
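The two randomizations can be sketched as follows. The choice of d/3 candidate variables per split is the common default for regression (e.g. in R's randomForest package), an assumption here rather than something stated on the slide:

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 1000, 11

# 1) Per tree: observations drawn at random with replacement (in-bag sample);
#    the observations never drawn are out-of-bag and monitor the tree's error.
in_bag = rng.choice(N, size=N, replace=True)
out_of_bag = np.setdiff1d(np.arange(N), in_bag)

# 2) Per split: a random subset of the explanatory variables.
mtry = max(1, d // 3)
candidate_vars = rng.choice(d, size=mtry, replace=False)
```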
36. Methodology
Learning a random forest
Random forests are not very sensitive to hyper-parameters (number of observations for each tree, number of variables for each split): the default values have been used.
The number of trees should be large enough for the mean squared error based on out-of-bag observations to stabilize:
[Plot: out-of-bag (training) and test mean squared error versus the number of trees (0–500).]
39. Results
Influence of the training sample size
[Plot: test R² versus log training-set size for N leaching prediction; one curve per method: LM1, LM2, Dace, SDR, ACOSSO, MLP, SVM, RF.]
40. Results
Computational time
Use        | LM1  | LM2    | Dace   | SDR     | ACOSSO
Train      | <1 s | 50 min | 80 min | 4 hours | 65 min
Prediction | <1 s | <1 s   | 90 s   | 14 min  | 4 min

Use        | MLP       | SVM     | RF
Train      | 2.5 hours | 5 hours | 15 min
Prediction | 1 s       | 20 s    | 5 s

Time for DNDC: about 200 hours with a desktop computer and about 2 days using a cluster!
41. Results
Further comparisons
Evaluation of the different steps (time/difficulty)

       | Training | Validation | Test
LM1    | ++       |            | +
LM2    | +        |            | +
ACOSSO | =        | +          | -
SDR    | =        | +          | -
DACE   | =        | -          | -
MLP    | -        | -          | +
SVM    | =        | -          | -
RF     | +        | +          | +
43. Results
Understanding which inputs are important
Importance: a measure of the importance of the input variables can be defined as follows:
• for a given input variable, randomly permute its values and compute the predictions from these permuted inputs;
• compare the accuracy of these predictions to the accuracy of the predictions obtained with the true inputs: the increase in mean squared error is called the importance.
This comparison is made on data not used to build the model: either the validation set or the out-of-bag observations.
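The permutation scheme above in code, as a minimal sketch: `predict` stands in for any fitted metamodel, and the synthetic data (where only the first input matters) is an illustrative assumption:

```python
import numpy as np

def permutation_importance(predict, X, y, rng):
    """Increase in MSE when each input column is randomly permuted,
    computed on held-out data (validation set or out-of-bag observations)."""
    base = np.mean((y - predict(X)) ** 2)
    imp = np.empty(X.shape[1])
    for j in range(X.shape[1]):
        Xp = X.copy()
        Xp[:, j] = rng.permutation(Xp[:, j])   # break the link between input j and y
        imp[j] = np.mean((y - predict(Xp)) ** 2) - base
    return imp

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = 3.0 * X[:, 0] + rng.normal(scale=0.1, size=500)   # only column 0 matters
imp = permutation_importance(lambda X: 3.0 * X[:, 0], X, y, rng)
```

Permuting an informative input inflates the MSE sharply; permuting an irrelevant one leaves it unchanged.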
44. Results
Understanding which inputs are important
Example (N2O, RF):
[Plot: permutation importance (mean decrease in MSE) versus rank for the 11 inputs; SOC and pH stand clearly above the others.]
The variables SOC and PH are the most important for accurate
predictions.
45. Results
Understanding which inputs are important
Example (N leaching, SVM):
[Plot: permutation importance (decrease in MSE) versus rank for the 11 inputs; N_MR, N_FR, Nres and pH stand clearly above the others.]
The variables N_MR, N_FR, Nres and pH are the most important for
accurate predictions.
46. Results
Thank you for your attention
Any questions?
47. Results
Villa-Vialaneix, N., Follador, M., Ratto, M., and Leip, A. (2012). A comparison of eight metamodeling techniques for the simulation of N2O fluxes and N leaching from corn crops. Environmental Modelling and Software, 34:51–66.