This document discusses trait mining using the Focused Identification of Germplasm Strategy (FIGS) approach. FIGS uses climate data to predict which plant accessions are most likely to contain useful genetic traits before conducting field trials. Case studies are presented showing predictive links between climate data and traits like disease resistance in barley and wheat. The document outlines the FIGS methodology and discusses its potential to efficiently select germplasm for evaluation and increase genetic gains in crop breeding.
The 7 Things I Know About Cyber Security After 25 Years | April 2024
FIGS workshop in Madrid, PGR Secure (9 to 13 January 2012)
1.
2. • Trait mining with FIGS
– Predictive link between climate
data and trait data
– Case studies:
• Morphological traits, Nordic barley
• Net blotch on barley
• Stem rust on wheat
• Stem rust, Ug99 on bread wheat
landraces
Wheat at Alnarp, June 2010
2
4. • Scientists and plant breeders want a
few hundred germplasm accessions to
evaluate for a particular trait.
• How does the scientist select a small
subset likely to have the useful trait?
4
6. What is
Focused Identification of Germplasm Strategy
Mediterranean Sea
South Australia
Origin of Concept:
Boron toxicity of wheat and
barley example of late 1980s Slide made by
M.C. Mackay, 1995
7. Illustration by Mackay (1995)
based on latitude & longitude
Data layers sieve accessions
Temperature
Salinity score
Origin of FIGS:
Michael Mackay (1986,
1990, 1995)
Elevation
Rainfall
Agro-climatic zone
Disease
distribution
FOCUSED IDENTIFICATION OF GERMPLASM STRATEGY 7
9. – Identification of plant germplasm
with a higher likelihood of having
desired genetic diversity for a target
trait property.
– Using climate data for prediction of
crop traits a priori BEFORE the field
trials.
Bread wheat at Nöbbelöv in Lund
9
10. • Focused Identification of
Germplasm Strategy (FIGS).
• Identify new and useful
genetic diversity for crop
improvement.
• Based on eco-geographic data
analysis using climate data.
European mountain ash
(Sorbus aucuparia L.)
at Alnarp, July 2004 10
12. The climate at the original source
location, where the plant
germplasm was developed is
correlated to the trait property.
To build a computer model
explaining the crop trait score
from the climate data.
12
13. Genebank
accessions Field trials (€€€)
Trait
(landraces & CWR)
data
High cost
data
Climate
data Low cost
data
13
14. Wild relatives are shaped Primitive cultivated crops are Traditional cultivated crops
by the environment shaped by local climate and (landraces) are shaped by climate
humans and humans
Modern cultivated crops are Perhaps future crops are shaped
mostly shaped by humans (plant in the molecular laboratory…?
breeders)
14
15. It is possible that the
human mediated selection
of landraces will contribute
to the link between
ecogeography and traits.
During traditional
cultivation the farmer will
select for and introduce
germplasm for improved
suitability of the landrace
to the local conditions.
15
16. The climate data can be extracted
from the WorldClim dataset.
http://www.worldclim.org/
(Hijmans et al., 2005)
Data from weather stations
worldwide are combined to a
continuous surface layer.
Climate data for each landrace is Precipitation: 20 590 stations
extracted from this surface layer.
Temperature: 7 280 stations
16
17. Layers used in these early FIGS studies:
• Precipitation (rainfall)
• Maximum temperatures
• Minimum temperatures
Some of the other layers available:
• Potential evapotranspiration (water-loss)
• Agro-climatic Zone (UNESCO classification)
• Soil classification (FAO Soil map)
• Aridity (dryness) Eddy De Pauw
(ICARDA, 2008)
(mean values for month and year)
17
18. • Landraces and wild relatives
– The link between climate data and the trait
data is required for trait mining with FIGS.
Modern cultivars are not expected to show this
predictive link (complex pedigree).
• Georeferenced accessions
– Trait mining with FIGS is based on multivariate
models using climate data from the source
location of the germplasm. To extract climate
data the accessions need to be accurately
georeferenced.
18
19. Field observations by Agnese
Kolodinska Brantestam (2002-
2003)
Multi-way N-PLS data
analysis, Dag Endresen (2009-
2010)
19
Priekuli (LVA) Bjørke (NOR) Landskrona (SWE)
20. Experiment Heading Ripening Length Harvest Volumetric Thousand
Site Year days days of plant index weight grain weight
LVA 20021 n.s. n.s. n.s. n.s. *** n.s.
LVA 2003 *** n.s. ** ** *** n.s.
NOR 2002 - * ** *** ** n.s.
NOR 2003 ** *** *** * * n.s.
SWE 2002 ** *** n.s. ** * n.s.
SWE 20032 n.s. ** n.s. n.s. ** n.s.
*** Significant at the 0.001 level (p-value)
1 LVA 2002 Germination on spikes (very wet June)
** Significant at the 0.01 level
2 SWE 2003 Incomplete grain filling (very dry June)
* Significant at the 0.05 level
n.s. Not significant (at the above levels)
Endresen, D.T.F. (2010). Predictive association between trait data and ecogeographic data for
Nordic barley landraces. Crop Science 50: 2418-2430. DOI: 10.2135/cropsci2010.03.0174 20
21. • Positive predictive value (PPV)
• PPV = True positives / (True positives + False positives)
• Classification performance for the identification of
resistant samples (positives)
• Positive diagnostic likelihood ratio (LR+)
• LR+ = sensitivity / (1 – specificity)
• Less sensitive to prevalence than PPV
21
22. Green dots indicate collecting sites for resistant wheat landraces and red
dots collecting sites for susceptible landraces.
Field experiments made in
USDA GRIN, trait data online: Minnesota, North Dakota
http://www.ars-grin.gov/cgi-bin/npgs/html/desc.pl?1041 and Georgia in the USA
22
23. Dataset (unit) PPV LR+ Estimated gain
Net blotch (accession) 0.54 (0.48-0.60) 1.75 (1.42-2.17) 1.35 (1.19-1.50)
Random 0.40 (0.35-0.45) 0.99 (0.84-1.17) 0.99 (0.87-1.12)
(40 % resistant samples)
PPV = Positive Predictive Value; LR+ = Positive Diagnostic Likelihood Ratio
Endresen, D.T.F., K. Street, M. Mackay, A. Bari, E. De Pauw (2011). Predictive association
between biotic stress traits and ecogeographic data for wheat and barley landraces. Crop
Science 51: 2036-2055. DOI: 10.2135/cropsci2010.12.0717
23
24. Green dots indicate collecting sites for resistant wheat landraces and red
dots collecting sites for susceptible landraces.
USDA GRIN, trait data online: Field experiments made in
http://www.ars-grin.gov/cgi-bin/npgs/html/desc.pl?65049 Minnesota by Don McVey
24
25. Dataset (unit) PPV LR+ Estimated gain
Stem rust (accession) 0.54 (0.50-0.59) 3.07 (2.66-3.54) 1.95 (1.79-2.09)
Random 0.29 (0.26-0.33) 1.04 (0.90-1.20) 1.03 (0.91-1.16)
(28 % resistant samples)
Stem rust (site) 0.50 (0.40-0.60) 4.00 (2.85-5.66) 2.51 (2.02-2.98)
Random 0.19 (0.13-0.26) 0.94 (0.63-1.39) 0.95 (0.66-1.33)
(20 % resistant samples)
PPV = Positive Predictive Value; LR+ = Positive Diagnostic Likelihood Ratio
Endresen, D.T.F., K. Street, M. Mackay, A. Bari, E. De Pauw (2011). Predictive association
between biotic stress traits and ecogeographic data for wheat and barley landraces. Crop
Science 51: 2036-2055. DOI: 10.2135/cropsci2010.12.0717
25
26. Classifier method AUC Cohen’s Kappa
Principal Component Regression 0.69 (0.68-0.70) 0.40 (0.37-0.42)
(PCR)
Partial Least Squares (PLS) 0.69 (0.68-0.70) 0.41 (0.39-0.43)
Random Forest (RF) 0.70 (0.69-0.71) 0.42 (0.40-0.44)
Support Vector Machines (SVM) 0.71 (0.70-0.72) 0.44 (0.42-0.45)
Artificial Neural Networks (ANN) 0.71 (0.70-0.72) 0.44 (0.42-0.46)
AUC = Area Under the ROC Curve (ROC, Receiver Operating Curve)
Bari, A., K. Street, , M. Mackay, D.T.F. Endresen, E. De Pauw, and A. Amri
(2011). Focused Identification of Germplasm Strategy (FIGS) detects wheat
stem rust resistance linked to environment variables. Genetic Resources and
Crop Evolution [online first]. doi:10.1007/s10722-011-9775-5; Published
online 3 Dec 2011.
Abdallah Bari (ICARDA)
26
27. Ug99 set with 4563 wheat landraces screened for Ug99 in Yemen 2007, 10.2 % resistant accessions. The true
trait scores for 20% of the accessions (825 samples) were revealed. We used trait mining with SIMCA to
select 500 accessions more likely to be resistant from 3728 accession with true scores hidden (to the person
making the analysis). The FIGS set was observed to hold 25.8 % resistant samples and thus 2.3 times higher
than expected by chance.
27
28. Classifier method PPV LR+ Estimated gain
kNN (pre-study) 0.29 (0.13-0.53) 5.61 (2.21-14.28) 4.14 (1.86-7.57)
SIMCA 0.28 (0.14-0.48) 5.26 (2.51-11.01) 4.00 (2.00-6.86)
Ensemble classifier 0.33 (0.12-0.65) 8.09 (2.23-29.42) 6.47 (2.05-11.06)
Random 0.06 (0.01-0.27) 0.95 (0.13-6.73) 0.97 (0.16-4.35)
(pre-study, 550 + 275 accessions)
Ensemble 0.26 (0.22-0.30) 2.78 (2.34-3.31) 2.32 (2.00-2.68)
Random 0.11 (0.09-0.15) 1.02 (0.77-1.36) 0.95 (0.77-1.32)
(blind study, 825 + 3738 accessions)
PPV = Positive Predictive Value; LR+ = Positive Diagnostic Likelihood Ratio
Endresen, D.T.F., K. Street, M. Mackay, A. Bari, E. De Pauw, K. Nazari, and A. Yahyaoui (2012). Sources
of Resistance to Stem Rust (Ug99) in Bread Wheat and Durum Wheat Identified Using Focused
Identification of Germplasm Strategy (FIGS). Crop Science [online first]. doi:
10.2135/cropsci2011.08.0427; Published online 8 Dec 2011. 28
29. • PDF available from:
– http://db.tt/lZMpwgJ
• Available from Libris (Sweden)
– ISBN: 978-91-628-8268-6
29
30. Michael Clemet Mackay (2011)
• PDF available from:
– http://pub.epsilon.slu.se/8439/
• Available from Libris (Sweden)
– ISBN: 978-91-576-7634-4
30
32. • Advice the planning of new
collecting/gathering expeditions
– Identification of relevant areas were the
crop species is predicted to be present
(using GBIF data and niche models).
– Focus on areas least well represented in
the genebank collection (maximize
diversity).
– Focus on areas with a higher likelihood
for a desired target trait (FIGS).
See http://gisweb.ciat.cgiar.org/GapAnalysis/ for more information.
32
33. Species
distribution Wormwood (Artemisia absinthium L.)
model
(7 364 records)
Using the Maxent
desktop software.
33
34. http://data.gbif.org/datasets/network/2
Using GBIF/TDWG technology (and The compatibility of data standards
contributing to its development), the between PGR and biodiversity collections
PGR community can more easily made it possible to integrate the
establish specific PGR networks worldwide germplasm collections into the
without duplicating GBIF's work. biodiversity community (TDWG, GBIF).
34
35.
36. • Data collection and preparation
• Geo-referencing of collecting locations
• Initial data exploration
• Pre-processing of dataset
• Choose modeling method
• Calibration of model
• Validation of model
• Validation of prediction results
36
37. Example of georeferencing for
NGB9529, a barley landrace
reported as originating from
Lyderupgaard using KRAK.dk
and maps.google.com
37
38. The influence plot
(residuals against
leverage) shows
sample
(FRO) observed at
Priekuli in 2003
(replicate 2) with a
very high leverage -
well separated from
the “data cloud”.
After looking into the
raw data, this
observation point
was removed as
(set to NaN).
38
39. Mean centering removes the absolute intensity to avoid the model to focus on the variables with
the highest numerical values (intensity).
Scaling makes the relative distribution of values (range spread) more equal between variables.
Auto-scaling is a combination of mean centering and variance scaling.
After auto-scaling all variables have a mean of zero and a standard deviation of one.
The objective is to help the model to separate the relevant information from the noise.
39
40. No preprocessing Mean centering Auto scale
Priekuli
Bjørke
Landskrona
40
41. – No model can ever be absolutely correct
– A simulation model can only be an
approximation
– A model is always created for a specific
purpose
– The simulation model is applied to make
predictions based on new fresh data
– Be aware to avoid extrapolation problems
41
42. – For the initial calibration or training
step.
– Further calibration, tuning step
– Often cross-validation on the training
set is used to reduce the
consumption of raw data.
– For the model validation or goodness
of fit testing.
– External data, not used in the model
calibration.
42
43. 36 variables
Min. temperature Max. temperature Precipitation
mode 1
14 samples
Jan, Feb, Mar, … Jan, Feb, Mar, … Jan, Feb, Mar, …
(mode 2) (mode 2) (mode 2)
1st level for mode 3 2nd level for mode 3 3rd level for mode 3
Precipitation
Max temp
Min temp
14 samples (mode 1)
14 samples (mode 1)
12 months (mode 2) 12 months (mode 2) 43
44. The two PARAFAC models
each calibrated from two
independent split-half
subsets, both converge to
a very similar solution as
the model calibrated
from the complete
dataset.
The PARAFAC model is
thus a general and stable
model for the scope of
Scandinavia.
Example used here is the Trait
data model (mode 1) from
Endresen (2010).
44
46. The distance between the model (predictions) and
the reference values (validation) is the residuals.
Example of a bad model
calibration
Calibration step
Cross-validation indicates
the appropriate model
Be aware of over-fitting! NB! Model validation! complexity. 46
48. • Parallel Factor Analysis (PARAFAC) (Multi-way)
• Multi-linear Partial Least Squares (N-PLS) (Multi-way)
• Soft Independent Modeling of Class Analogy (SIMCA)
• k-Nearest Neighbor (kNN)
• Partial Least Squares Discriminant Analysis (PLS-DA)
• Linear Discriminant Analysis (LDA)
• Principal component logistic regression (PCLR)
• Generalized Partial Least Squares (GPLS)
• Random Forests (RF)
• Neural Networks (NN)
• Support Vector Machines (SVM)
These methods above are the modeling methods used by Endresen (2010),
Endresen et al (2011, 2012), and Bari et al (2012).
• Boosted Regression Trees (BRT)
• Multivariate Regression Trees (MRT)
• Bayesian Regression Trees
48
49. Example from the stem rust set:
2 PCs Principal component 3 3 PCs
Resistant samples
1 PC
* Intermediate
Susceptible
Illustration modified from Wise et al., 2006:201 (PLS Toolbox software manual)
49
50. • Pearson product-moment correlation (R) (-1 to 1)
• Coefficient of determination (R2) (0 to 1)
• Cohen’s Kappa (K) (-1 to 1)
• Proportion observed agreement (PO) (0 to 1)
• Proportion positive agreement (PA) (0 to 1)
• Positive predictive value (PPV) (0 to 1)
• Positive diagnostic likelihood ratio (LR+) (from 0)
• Sensitivity and specificity
• Area under the curve (AUC)
– Receiver operating characteristics (ROC)
• Root mean square error (RMSE)
– RMSE of calibration (RMSEC)
– RMSE of cross-validation (RMSECV)
– RMSE of prediction (RMSEP)
• Predicted residual sum of squares (PRESS)
50
52. Predictions for the cross-validated
(leave-one-out) samples for a N-
PLS model
Predicted trait scores
Endresen (2010) for trait 5 (volumetric weight)
observations from Priekuli, 2002 (mean of the
replications) and 6 principal components.
Correlation coefficient:
Sum squared residuals:
Actual trait scores 52
53. Predictions for the cross-
validated (leave-one-out)
samples with a N-PLS model.
Predicted trait scores
Endresen (2010) for trait 5 (volumetric
weight) observations from Bjørke, 2002 (mean
of the replications) and 4 principal
components.
Correlation coefficient:
Sum squared residuals:
Actual trait scores 53
54. • Often the critical levels ( ) for the p-value significance is set
as 0.05, 0.01 and 0.001 (5 %, 1 %, 0.1 %).
• The significance level is often marked with one star (*) for
the 0.05 level, two stars (**) for the 0.01 level and three
stars (***) for the 0.001 level.
– 5% (even a random effect when an experiment is repeated 20
times is likely to be observed one time)
– 1% (if an experiment is repeated 100 times a random effect is
likely to be observed one time)
– 0.1% (if an experiment is repeated 1000 times a random effect is
likely to be observed one time)
54
55. PGR Secure (EU 7th Framework)
Workshop FIGS approach
9-13 Jan 2012, Madrid
Dag Endresen
dag.endresen@gmail.com
Abdallah Bari (ICARDA)
abdallah.bari@gmail.com
55
Editor's Notes
PGR Secure Workshop, 9-13 January, Madrid, Spain. Focused Identification of Germplasm Strategy (FIGS) analysis for the wild relatives of the cultivated plants. Predictive characterization of crop wild relatives. http://pgrsecure.org Photo: Oat (Avena sativa L.) at Alnarp on 3 Aug 2010, by Dag Endresen.
Photo: Wheat, Triticum aestivum L., at Nöbbelöv in Lund Sweden, June 2010 by Dag Endresen. URL: http://www.flickr.com/photos/dag_endresen/4826175873/, https://picasaweb.google.com/dag.endresen/GermplasmCrops#5497796034327520578
Genetic resources from the wild relatives of the cultivated plants contributes the raw material required for domesticated forms and the furtherdevelopment of these food crops. Genebanks preserve and provides plant genetic resources for utilization by plant breeders and other bona fide use.
Photo from the USDA Photo archive. Slide text by Ken Street, ICARDA FIGS team (2009).
Photo: Dag Endresen.Field of sugar beet (Beta vulgarisL.) at Alnarp (June 2005). URL: http://www.flickr.com/photos/dag_endresen/4189812241/
Photo: Bread wheat (Triticum aestivum L.) at Nöbbelöv in Lund July 2010 by Dag Endresen. URL: http://www.flickr.com/photos/dag_endresen/4826565058/
Photo: European mountain ash (Sorbusaucuparia L.) July 2004 by Dag Endresen.https://picasaweb.google.com/115547050550954466285/GermplasmCrops#5362091001327096818
Illustration of trait mining with ecoclimatic GIS layers. GIS layers included in the illustration are from the ICARDA ecoclimatic database, average: annual temperature (front), annual precipitation (middle), and winter precipitation (back) (De Pauw, 2003).
The assumption for trait mining using the FIGS approach is that there is a link between trait properties for crop landraces and crop wild relatives and the eco-climatic environment at the source location (collecting site). And that this link can be captured and described using a computer modeling approach.
Landrace samples (genebank seed accessions)Trait observations (experimental design) - High cost dataClimate data (for the landrace location of origin) - Low cost dataThe accession identifier (accession number) provides the bridge to the crop trait observations.The longitude, latitude coordinates for the original collecting site of the accessions (landraces) provide the bridge to the environmental data.
Modern agriculture uses advanced plant varieties based on the most productive genetics. The original land races and wild forms produce lower yields, but their greater genetic variation contains a higher diversity in e.g. resistance to disease. High-yielding modern crops are therefore vulnerable when a new disease arises.
Illustration traditional cattle farming: http://commons.wikimedia.org/wiki/File:Traditional_farming_Guinea.jpg (USAID, Public Domain).
The WorldClim dataset is described in: Hijmans, R.J., S.E. Cameron, J.L. Parra, P.G. Jones and A. Jarvis, 2005. Very high resolution interpolated climate surfaces for global land areas. International Journal of Climatology 25: 1965-1978.NOAA GHCN-Monthly version 2: http://www.ncdc.noaa.gov/oa/climate/ghcn-monthly/index.phpWeather stations, precipitation: 20 590 stations; temperature:7280 stations.
Endresen, D.T.F. (2010). Predictive association between trait data and ecogeographic data for Nordic barley landraces. Crop Science 50: 2418-2430. DOI: 10.2135/cropsci2010.03.0174
GRIN database (USDA-ARS, National Plant Germplasm System, Germplasm Resources Information Network, online http://www.ars-grin.gov/npgs).USDA GRIN, trait data online: http://www.ars-grin.gov/cgi-bin/npgs/html/desc.pl?1041.Dr. Harold Bockelman extracted the trait data (C&E).
Endresen, D.T.F., K. Street, M. Mackay, A. Bari, E. De Pauw (2011). Predictive association between biotic stress traits and ecogeographic data for wheat and barley landraces. Crop Science 51: 2036-2055. DOI: 10.2135/cropsci2010.12.0717
GRIN database (USDA-ARS, National Plant Germplasm System, Germplasm Resources Information Network, online http://www.ars-grin.gov/npgs). USDA GRIN, trait data online: http://www.ars-grin.gov/cgi-bin/npgs/html/desc.pl?65049. Dr. Harold Bockelmanextracted the trait data (C&E).
Photo: USDA ARS Image k1192-1, http://www.ars.usda.gov/is/graphics/photos/mar09/k11192-1.htm
GRIN database (USDA-ARS, National Plant Germplasm System, Germplasm Resources Information Network, online http://www.ars-grin.gov/npgs). USDA GRIN, trait data online: http://www.ars-grin.gov/cgi-bin/npgs/html/desc.pl?65049. Dr. Harold Bockelmanextracted the trait data (C&E). Photo: USDA ARS Image Archive, http://www.ars.usda.gov/is/graphics/photos/
Endresen, D.T.F., K. Street, M. Mackay, A. Bari, E. De Pauw, K. Nazari, and A. Yahyaoui (2012). Sources of Resistance to Stem Rust (Ug99) in Bread Wheat and Durum Wheat Identified Using Focused Identification of Germplasm Strategy (FIGS). Crop Science [online first]. doi: 10.2135/cropsci2011.08.0427; Published online 8 Dec 2011.
Photo: Wheat infected by stem rust (Ug99) at the Kenya Agricultural Research Station in Njoro northwest of Nairobi. This study is in press and will soon be available from Crop Science.
Endresen, D.T.F. (2011). Utilization of plant genetic resources, a lifeboat to the gene pool. PhD thesis. Department of Agriculture and Ecology, Faculty of Life Sciences, University of Copenhagen. ISBN: 978-91-628-8268-6. Available at http://goo.gl/pYa9x, verified 7 Jan 2011.
Mackay, M.C. (2011). Surfing the genepool, the effective and efficient use of plant genetic resources. PhD thesis. Department of Plant Breeding and Biotechnology, Faculty of landscape planning, horticulture and agricultural science, Swedish University of Agricultural Sciences (SLU), Alnarp. ISBN: 978-91-576-7634-4. Available at http://pub.epsilon.slu.se/8439/, verified 7 Jan 2011.
More than 7.4 million genebank accessions; and more than 1400 genebanks - including approximately 140 large genebanks each holding more than 10.000 accessions: Second Report on the State of the World’s Plant Genetic Resources for Food and Agriculture (2010) Food and Agriculture Organization of the United Nations (FAO).Photo: Genebank Accession at IPK Gatersleben by Dag Endresen, 2010.
For more information on Gap Analysis see: http://gisweb.ciat.cgiar.org/GapAnalysis/.Photo: South of Tunisia, http://www.flickr.com/photos/dag_endresen/4221301525/
NordGen study in June 2010, Wormwood (Artemisia absinthiumL.). Species distribution model using the Maxent desktop ecological niche modeling software. Only the niche model study to identify suitable locations for the presence of the species was made. The next step to combine this result with a FIGS study to identify locations where a target trait in these predicted populations would be likely to be found was only planned but not completed for this experiment conducted at NordGen.
A data exchange protocol, format, infrastructure and a network for sharing datasets are important elements.
PGR Secure Workshop, 9-13 January, Madrid, Spain. Focused Identification of Germplasm Strategy (FIGS) analysis for the wild relatives of the cultivated plants. Photo: Oat (Avena sativa L.) at Alnarp on 3 Aug 2010, by Dag Endresen. This previous slides were for the first day of the workshop (9 Jan 2012). The following slides were for the second day of the workshop (10 Jan 2012).
These are the steps to follow in a trait mining experiment using the FIGS approach.
KRAK: http://www.krak.dk/query?mop=aq&mapstate=7%3B9.305588071850734%3B56.61105751259899%3Bh%3B9.282591620463698%3B56.61775781407488%3B9.328584523237769%3B56.60435721112311%3B853%3B469&what=map_adr# Google Maps: http://maps.google.com/maps/ms?ie=UTF8&hl=en&msa=0&msid=107144586665622662057.00045ff98921bd0418037&ll=56.606941,9.297695&spn=0.055554,0.150204&t=h&z=13
NGB6300 (accide 9039, FRO) observed at Priekuli, Latvia in 2003, replicate 2 is highlighted in red (with high leverage as indicated by the high value for Hoteling T2).NGB776 (accide 8510, SWE) observed at Landskrona, Sweden in 2002 (Replicate 2, LYR312) is highlighted in green.
Box-plot of the trait scores to illustrate the effect of the preprocessing. First row is no preprocessing; row 2 is mean-centering (centering across mode 1, samples); last row is auto-scale (centering across mode 1 and scaling across mode 2, traits).Mean centeringremoves the absolute intensity information (the mean for each variable is subtracted from the individual data values). This pre-processing strategy is applied to avoid the model to focus on the variables with the highest numerical values (intensity). Scaling: In general, scaling a variable in the data can be viewed as a multiplication of the corresponding column vector entries with some number. If the significances of the variables to the model are known prior to modeling, then it might be a good idea to upscale the highly relevant variables. In contrast, if a variable is supposed to bear merely noise, then its significance must be downscaled. However this is a rare case in reality. Therefore, unit-variance scaling (UV-scaling) is most often used. Moreover scaling itself is sometimes associated with UV-scaling. (Johann Gasteiger, and Dr. Thomas Engel (editors). 2003. Chemoinformatics: a textbook. Wiley-VCH, Weinheim. ISBN 9783527306817. Page 214)
Box-plot of the trait scores to illustrate the effect of the preprocessing. First row is no preprocessing; row 2 is mean-centering (centering across mode 1, samples); last row is auto-scale (centering across mode 1 and scaling across mode 2, traits).
We often divide the data for a simulation model project in three equal parts: one set for initial model calibration or training, one set for further calibration or fine tuning; and one test set for validation on the model.
Multi-way analysis using a multi-linear data cube representation of the dataset preserves much more information for the model to learn from compared to a more common 2-way data table. The multi-way model can calibrate against systematic patterns in all dimensions (ways) of the data cube including the dimensions for the different types and months for the climate variables - while the reduced 2-way data table will only present systematic variation across the accessions/samples (1st way) and across all of the climate variables (2nd way).
The PARAFAC algorithm provides a very powerful and compact model with the consumption of very few degrees of freedom (!). However one must be particular careful to validate the PARAFAC model solution because the PARAFAC sometimes calibrate to bad solutions. The split half approach is a suitable method where the dataset is split by random in two parts each used to calibrate independent PARAFAC models. The profiles from a plot of the scores and loadings for the latent factors (synthetic variables) will give a good indication if the two halves of the dataset calibrate to the same PARAFAC model solutions. The calculated core consistency is another useful indicator for these PARAFAC models.
Map to illustrated the first successful split-half subsets. Set 1: NGB6300, NGB27, NGB469, NGB776, NGB4701, NGB2072, NGB4641 are indicated with blue placemarks. Set 2: NGB792, NGB13458, NGB9529, NGB468, NGB775, NGB456, NGB2565 are indicated with red placemarks. Map of the second good split-half. Set 1: NGB456, NGB9529, NGB469, NGB2072, NGB468, NGB4641, NGB776 are indicated with blue placemarks. Set 2: NGB4701, NGB27, NGB2565, NGB792, NGB13458, NGB6300, NGB775 are indicated by red placemarks.
Residuals can tell much about the model. If the residuals are not on a normal distribution this is often an indication that there remain systematic information in the dataset that the model has not captured.
Photo: Wheat field at Nöbbelöv, July 2010, by Dag Endresen.
Left side illustration is modified from Wise et al., 2006:201 (PLS Toolbox software manual). The right side illustration is made by the PLS Toolbox software in MATLAB.
Photo: Wheat field at Nöbbelöv, July 2010, by Dag Endresen.
The 2 by 2 confusion matrix provides the starting point for the calculation of many useful performance indicators – for classification of ordinal and categorical trait variables.
Notice that the Root Mean Square Error from Cross-Validation (RMSECV) reach a minimum at 6 latent factors(LVs) while the Root Mean Square Error from Calibration (RMSEC) continue to fall. Models with a higher complexity than 5 to 6 LVs will thus most likely be overfitted to the training data.
Notice that the Root Mean Square Error from Cross-Validation (RMSECV) reach a minimum at 9 latent factors(LVs) while the Root Mean Square Error from Calibration (RMSEC) continue to fall. Notice also that the RMSECV level out around 4 LVs after which the cross-validation performance show very minor gains before a final drop from 8 to 9 LVs. Models with a higher complexity than 4 LVs will here most likely be overfitted to the training data.
http://en.wikipedia.org/wiki/Correlation, http://en.wikipedia.org/wiki/Coefficient_of_determination, http://en.wikipedia.org/wiki/Statistical_model_validationTable of critical values for r: http://www.runet.edu/~jaspelme/statsbook/Chapter%20files/Table_of_Critical_Values_for_r.pdfTable of critical values for r: http://www.gifted.uconn.edu/siegle/research/Correlation/corrchrt.htmTable of critical values for r: http://www.jeremymiles.co.uk/misc/tables/pearson.html