1. Data Mining to Reveal
Biological Invaders
VOJTĚCH JAROŠÍK
1Department of Ecology, Faculty of Science, Charles
University, Prague and 2Institute of Botany, Academy
of Sciences of the Czech Republic, Průhonice, the
Czech Republic
2. Background on biological invasions
• Economic impacts
• Changes in habitat properties
• Loss of species diversity
• Biological homogenization
8. Aim
• Data-mining tools were originally designed for analyzing vast
databases of often incomplete data, with the aim of finding financial
fraud, suitable candidates for loans, potential customers, and other
uncertain outcomes
• I will show that searching for potential invasive species and the
traits responsible for their invasiveness, or identifying factors that
distinguish invasible communities from those that resist invasion,
is a similar kind of risk assessment
– This is perhaps the main reason why CART® and related methods are
becoming increasingly popular in the field of invasion biology
– Identifying homogeneous groups with high or low risk and
constructing rules for making predictions about individual cases is, in
essence, the same for financial credit scoring as for pest risk
assessment
– In both cases, one searches for rules that can be used to predict
uncertain future events
9. Basic principles of data mining
• Main literature sources
• Binary recursive partitioning
• Classification And Regression Trees (CART®)
• Random forests (RFTM)
10. Basic principles of data mining
• Classification and Regression Trees (CART®):
– Breiman L, Friedman JH, Olshen RA, Stone CJ (1984) Classification and Regression Trees. Belmont, Wadsworth
International Group
– Steinberg D, Colla P (1995) CART: Tree-structured Non-parametric Data Analysis. Salford Systems, San Diego, USA
– Steinberg D, Colla P (1997) CART: Classification and Regression Trees. Salford Systems, San Diego, USA
– Steinberg D, Golovnya M (2006) CART 6.0 User's Manual. Salford Systems, San Diego, USA
– De'ath G, Fabricius KE (2000) Classification and regression trees: a powerful yet simple technique for ecological data
analysis. Ecology, 81, 3178-3192
– Bourg NA, McShea WJ, Gill DE (2005) Putting a cart before the search: successful habitat prediction for a rare forest
herb. Ecology, 86, 2793-2804
– Jarošík V (2011) CART and related methods. In: Encyclopedia of Biological Invasions (eds Simberloff D, Rejmánek
M), pp. 104-108. University of California Press, Berkeley and Los Angeles, USA
• Random ForestsTM
– Breiman L (2001) Random Forests. Machine Learning, 45, 5-32
– Breiman L, Cutler A (2004) Random Forests TM. An Implementation of Leo Breiman’s RFTM by Salford Systems.
Salford Systems, San Diego, USA
– Cutler DR, Edwards TC, Beard KH, Cutler A, Hess KT, et al. (2007) Random forests for classification in ecology.
Ecology, 88, 2783-2792
– Hochachka WM, Caruana R, Fink D, Munson A, Riedewald M, et al. (2007) Data-mining discovery of pattern and
process in ecological systems. Journal of Wildlife Management, 71, 2427-2437
• TreeNetTM
– Friedman JH (1999) Stochastic Gradient Boosting. Technical report, Dept. of Statistics, Stanford University.
– Friedman JH (2001) Greedy function approximation: a gradient boosting machine. Annals of Statistics, 29, 1189-
1232
– Friedman JH (2002) Stochastic gradient boosting. Computational Statistics and Data Analysis, 38, 367–378
– Hastie TJ, Tibshirani RJ, Friedman JH (2001) The Elements of Statistical Learning: Data Mining, Inference, and
Prediction. Springer, New York
– Jarošík V, Pyšek P, Kadlec T (2011) Alien plants in urban nature reserves: from red-list species to future invaders.
NeoBiota, 10, 27–46
11. Basic principles of predictive mining
Binary recursive partitioning: the data are successively split
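The splitting step can be sketched in a few lines of plain Python. This is an illustrative toy, not the Salford CART® implementation: at a node, every threshold on a predictor is tried and the one that most reduces Gini impurity is kept; the procedure is then repeated on each half. The data and variable names below are hypothetical.

```python
def gini(labels):
    """Gini impurity of a list of class labels (0 = pure node)."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def best_split(xs, ys):
    """Try every threshold t (split: x <= t vs x > t) on one predictor
    and return (best_threshold, impurity_decrease)."""
    parent, n = gini(ys), len(ys)
    best_t, best_gain = None, 0.0
    for t in sorted(set(xs)):
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        if not left or not right:
            continue  # a split must send cases to both sides
        child = (len(left) * gini(left) + len(right) * gini(right)) / n
        if parent - child > best_gain:
            best_t, best_gain = t, parent - child
    return best_t, best_gain

# Toy example: presence/absence perfectly separated at x <= 3
print(best_split([1, 2, 3, 10, 11, 12], [0, 0, 0, 1, 1, 1]))  # (3, 0.5)
```

Recursion then applies `best_split` to each half until nodes are pure or too small, which is exactly the "successive splitting" named above.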
12. Basic principles of CART®
CART® provides a graphical, highly intuitive insight
[Figure: example classification tree (N = 637) for presence/absence of alien plant records, with a primary categorical split on run-off, further splits on road density outside and natural areas outside, and on road presence inside, ending in five terminal nodes. The tree is represented (in defiance of gravity) with the root, standing for the undivided data, at the top.]
14. Basic principles of RandomForestsTM
Random forests can be seen as an extension of classification trees by fitting many sub-trees
to parts of the dataset and then combining the predictions from all trees
Figure 3. Ranking of importance values (%) for all invasive species. Ranking is scaled relative to the best-performing variable, based on the out-of-bag method of random forests. White bars are predictors from outside Kruger National Park, grey bars from inside.
15. Important properties of data mining
models
• Exploratory and flexible
• Non-parametric
• Surrogates
• Penalization
• Weighting
• Scoring
• Artificially placing some predictors at the top of a tree
16. Data mining models are exploratory
• Unlike classical linear methods, data-mining techniques enable
predictions to be made from the data and identify the most
important predictors by screening a large number of candidate
variables, without requiring the user to make any assumptions
about the form of the relationships between the predictors and
the target variable, and without a priori formulated hypotheses
17. Data mining models are flexible
• These techniques are also more flexible than traditional statistical analyses
because they can reveal structures in the dataset that are other than
linear, and resolve complex interactions
• Unlike linear models, which uncover a single dominant structure in the
data, data mining models are designed to work with data that might have
multiple structures:
– The models can use the same explanatory variable in different parts of the
tree, dealing effectively with nonlinear relationships and higher order
interactions
• In fact, provided there are enough observations, the more complex the data
and the more variables available, the better the models will appear
compared with alternative methods.
– With a complex data set, understandable and generally interpretable results
often can be found only by constructing data mining models
• Data mining models are also excellent for initial data inspection
– Models are often used to select a manageable number of core measures from
databases with hundreds of variables
– A useful subset of predictors from a large set of variables can then be used in
building a formal linear model
18. With a complex data set, understandable and generally interpretable results often
can be found only by constructing data mining models
19. With a complex data set, understandable and generally interpretable results
often can be found only by constructing data mining models
Table 2. Test statistics and significances of the explanatory variables and their interactions in the AIC
minimal adequate models for proportion of archaeophytes. Non-significant variables and their
interactions are not shown.
R2 = 0.82
Explanatory variable archaeophytes
df Deviance P R2
Habitat type 31 90521.2 <0.001 0.760
Riverside position 1 28.5 <0.001 <0.001
Vegetation cover 1 88.4 <0.001 <0.001
Surrounding urban/industrial land 1 660.7 <0.001 0.005
Surrounding agricultural land 1 852.8 <0.001 0.007
Altitudinal floristic region 2 1755.1 <0.001 0.015
Altitude 1 647.7 <0.001 0.005
Temperature 1 28.3 <0.001 <0.001
Precipitation 1 392.8 <0.001 0.003
Habitat type x riverside position n.s.
Habitat type x vegetation cover 32 373 <0.001 0.003
Habitat type x urban/industrial 32 480.2 <0.001 0.004
Habitat type x agricultural 32 278.7 <0.001 0.002
Habitat type x human density 32 172.9 <0.001 0.001
Habitat type x floristic region 53 405.3 <0.001 0.003
Habitat type x altitude 32 284.0 <0.001 0.002
Habitat type x temperature 32 91.1 <0.001 <0.001
Habitat type x precipitation 32 181.2 <0.001 0.001
Riverside position x precipitation 1 11.4 <0.001 <0.001
Vegetation cover x agricultural 1 34.6 <0.001 <0.001
Vegetation cover x human density 1 4.4 0.036 <0.001
Vegetation cover x floristic region 2 9.4 0.009 <0.001
Vegetation cover x precipitation 1 9.2 0.002 <0.001
Urban/industrial x human density n.s.
Urban/industrial x floristic region 2 9.4 0.009 <0.001
Agricultural x altitude n.s.
Agricultural x temperature 1 38.0 <0.001 <0.001
Agricultural x precipitation 1 10.0 0.002 <0.001
Floristic region x altitude 2 11.1 0.004 <0.001
Floristic region x temperature 2 8.2 0.016 <0.001
Floristic region x precipitation n.s.
Altitude x precipitation 1 6.2 0.013 <0.001
Temperature x precipitation n.s.
20. With a complex data set, understandable and generally interpretable results
often can be found only by constructing data mining models
R2 = 0.86
21. With a complex data set, understandable and generally interpretable results
often can be found only by constructing data mining models
R2 = 0.74
ANOVA table and deletion tests for linear minimal adequate model (MAM) of ANCOVA describing population characteristics
determining the impact on diversity between invaded and uninvaded pairs of plots.
ANOVA on MAM Deletion tests on MAM
Source of variation Df SS MS R2 (R2adj.) F Df P
Impact on diversity H':
Species x Height 13 25.4219 1.9555 1.93 13, 100 0.04
Species x Differences in height 13 5.0490 0.3884 2.11 13, 100 0.02
Species x Differences in cover 13 14.1461 1.0882 10 13, 100 < 0.001
Height x Differences in Height 1 0.3246 0.3246 4.44 1, 88 0.04
Height x Differences in cover 1 0.2572 0.2572 4.83 1, 88 0.03
Cover x Differences in cover 1 0.6834 0.6834 6.48 1, 88 0.01
Residuals 87 9.1763 0.1055
Total 129 55.0584 0.83 (0.76) 10.36 42, 87 < 0.001
22. Models are often used to select a manageable number of core measures…A useful
subset of predictors … can then be used in building a formal linear model
[Figure: classification tree (N = 637) with a primary split on water run-off (<= 6.0 vs > 6.0) and a secondary split on road density outside (<= 0.1 vs > 0.1), ending in three terminal nodes, shown alongside a 3D surface of the probability of an alien record plotted against water run-off and road density within 10 km.]
23. Models are often used to select a manageable number of core measures…A useful
subset of predictors … can then be used in building a formal linear model
24. Models are often used to select a manageable number of core measures…A useful
subset of predictors … can then be used in building a formal linear model
The probability of non-native plant presence was associated with water run-off (0–27 million m3) and major road density within a 10 km radius outside the KNP boundary (0–0.15 km/km2). Overall significance of the model is G2 = 531.54, df = 3, P < 0.0001. Parameters of the model are given in the table below.
Parameter               Estimate  ASE   Standardized estimate  ASE   G2      df  P
Intercept               -6.30     0.66  -0.73                  0.18
Run-off                 0.74      0.08  3.37                   0.31  264.82  1   0.0001
Road density            69.70     8.59  1.54                   0.20  127.96  1   0.0001
Run-off × Road density  -5.90     0.89  -1.63                  0.25  49.13   1   0.0001
[Figure: 3D surface of the probability of an alien record plotted against water run-off and road density within 10 km.]
25. Data mining methods are non-parametric
• Consequently, unlike with parametric linear
models
– Nonnormal distribution does not require
transformations
• Because the trees are invariant to monotonic
transformations of continuous predictor variables, no
transformation is needed prior to analyses
• Outliers among the predictors generally do not affect the
models because splits usually occur at non-outlier values
• (However, in some circumstances, transformation of the
target variable may be important to alleviate variance
heterogeneity).
26. Surrogates
• Surrogates of each split, describing splitting rules
that closely mimic the action of the primary
split, can be assessed and ranked according to
their association values, with the highest possible
value 1.0 corresponding to the surrogate
producing exactly the same split as the primary
split
– Surrogates can then be used to replace an expensive
primary explanatory variable by a less expensive or
easier to interpret, though probably less accurate one
– Surrogates can also be used to build alternative trees
on surrogates
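A simplified sketch of how surrogate splits can be ranked: score each candidate by the fraction of cases it routes to the same side as the primary split. The real CART® association value also corrects for the default majority rule; plain agreement is used here for illustration, and the run-off / road-density numbers are invented to echo the earlier example.

```python
def side(x, t):
    """Which side of the split 'x <= t' does a case fall on?"""
    return 'L' if x <= t else 'R'

def agreement(primary_values, primary_t, surrogate_values, surrogate_t):
    """Fraction of cases sent to the same side by both splits;
    1.0 means the surrogate reproduces the primary split exactly."""
    same = sum(side(p, primary_t) == side(s, surrogate_t)
               for p, s in zip(primary_values, surrogate_values))
    return same / len(primary_values)

runoff = [2.0, 4.0, 5.0, 8.0, 9.0, 12.0]        # primary: runoff <= 6.0
density = [0.05, 0.08, 0.09, 0.30, 0.09, 0.40]  # candidate: density <= 0.1
print(agreement(runoff, 6.0, density, 0.1))     # 5 of 6 cases agree
```

Ranking all candidate variables by this score, highest first, gives the ordered surrogate list used to replace an expensive primary variable or to build an alternative tree.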
29. Surrogates can be used to build alternative trees
Appendix S6. Alternative model to the optimal classification tree. The alternative categorical
split “Run off” (none: no rivers intersected the segment; low: 2-10; medium: 10-15; high: 15-
26 million m3 / quaternary watershed / annum) replaced the continuous primary split “Water
run-off” at node 1 of the optimal tree (Fig. 3) as a surrogate with association value = 0.86.
Overall misclassification rate of the model is 13.5%, sensitivity 0.90 and specificity 0.80.
“Natural areas outside” refers to the percentage of natural areas in a 5-km radius outside the
KNP boundary, “Road present inside” refers to the presence or absence of roads in a studied
segment inside KNP. Otherwise as in Fig. 3.
30. Surrogates
• Surrogates of each split, describing splitting rules
that closely mimic the action of the primary split,
can be assessed and ranked according to their
association values, with the highest possible
value 1.0 corresponding to the surrogate
producing exactly the same split as the primary
split
– Surrogates also serve to treat missing values, because
the alternative splits are used to classify a case when
its primary splitting variable is missing
31. Surrogates serve to treat missing values
• The fact that data-mining techniques can handle
data gaps by calculating surrogate variables to
replace missing values often enables the use of larger
datasets than in linear models, in which all
cases with missing values usually have to be discarded
• Consequently, the data mining techniques can
often reveal additional factors relevant for
predictions that may have been overlooked when
tested by classical statistical approaches
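Missing-value handling via surrogates can be sketched like this: when a case lacks a value for the primary splitting variable, the highest-ranked surrogate that is observed for the case decides the direction instead, so the case is classified rather than discarded. Variable names and thresholds are hypothetical, echoing the run-off / road-density example.

```python
def route(case, splits):
    """Send a case left ('L') or right ('R') using the primary split,
    falling back to surrogates (in rank order) when values are missing.
    `splits` is [(variable, threshold), ...], primary first."""
    for variable, threshold in splits:
        value = case.get(variable)  # None represents a missing value
        if value is not None:
            return 'L' if value <= threshold else 'R'
    raise ValueError("no usable splitting variable for this case")

splits = [('runoff', 6.0), ('road_density', 0.1)]  # primary, then surrogate
print(route({'runoff': 2.0, 'road_density': 0.4}, splits))   # 'L' (primary)
print(route({'runoff': None, 'road_density': 0.4}, splits))  # 'R' (surrogate)
```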
32. Calculating surrogate variables to replace missing values …enables use of a larger dataset… the data mining
techniques thus often reveal additional factors relevant for predictions
[Figure: classification tree, N = 136]
33. Calculating surrogate variables to replace missing values …enables use of a larger dataset… the data mining
techniques thus often reveal additional factors relevant for predictions
[Figure: optimal classification tree (N = 173) for eradication success vs failure, with a primary split on man-made habitat and further splits on habitat (indoor vs outdoor/both), pathway (escaped), world region, spatial extent, and area infested, ending in seven terminal nodes.]
Fig. 1. Optimal classification tree for factors relating to success and failure of 173 eradication
campaigns against invertebrate plant pests, weeds and plant pathogens (viruses/viroids,
bacteria and fungi). Overall misclassification rate of the optimal tree is 15.8% compared to
50% for the null model, with 16.7% misclassified success and 14.8% failure cases. Sensitivity
is 83.3 and specificity 85.2% for learning samples, and 77.1 and 69.0%, respectively, for
cross-validated samples.
Submitted to: PLoS One
Which factors affect the success or failure of eradication campaigns against alien species?
Therese Pluess1, Vojtěch Jarošík2,3*, Petr Pyšek3,2, Ray Cannon4, Jan Pergl3, Annemarie Breukers5, Sven Bacher1
1University of Fribourg, Department of Biology, Ecology & Evolution Unit, Chemin du Musée 10, 1700 Fribourg, Switzerland
2Charles University in Prague, Faculty of Science, Department of Ecology, CZ-128 44 Praha 2, Czech Republic
3Institute of Botany, Academy of Sciences of the Czech Republic, CZ-252 43 Průhonice, Czech Republic
4The Food and Environment Research Agency, Sand Hutton, York YO41 1LZ, United Kingdom
5LEI, part of Wageningen UR, P.O. Box 8130, 6700 EW, Wageningen, The Netherlands
34. Penalization
• As it is easier to be a good splitter on a small
number of records (e.g., splitting a node with just
two records), explanatory variables with missing
values could gain an unfair advantage as splitters;
to prevent this, the splitting power of explanatory
variables should be penalized in proportion to the
degree to which they are missing
• High-level categorical predictor variables have
inherently higher splitting power than continuous
predictors and therefore can also be penalized to
level the playing field
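The penalization rule can be sketched as a simple down-weighting of a variable's impurity improvement by the fraction of cases on which it is observed. This is a schematic of the idea, not Salford's exact formula, and the numbers are hypothetical.

```python
def penalized_improvement(raw_improvement, n_observed, n_total):
    """Scale a splitter's impurity improvement by the fraction of
    non-missing cases, so variables that look good only on a handful
    of complete records lose their advantage."""
    return raw_improvement * (n_observed / n_total)

# Variable A splits "perfectly" but is observed on only 50 of 637 cases;
# variable B improves less but is complete. After penalization, B wins.
a = penalized_improvement(0.50, 50, 637)
b = penalized_improvement(0.40, 637, 637)
print(a < b)  # True
```

An analogous down-weighting, driven by the number of categories rather than by missingness, levels the playing field between high-level categorical and continuous predictors.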
35. High-level categorical explanatory variables have inherently
higher splitting power than continuous explanatory variables
and therefore can also be penalized to level the playing field
36. Weighting
• Weighting enables one to give a different weight to each
case in analysis
• A usual application is on proportional data, where
proportions calculated from larger samples give more
precise estimates, and therefore, proportional response
variables are weighted by their sample sizes
• A similar approach can be applied to stratified sampling on
strata having different sampling intensities
• Data-mining models can also accommodate situations in
which some misclassifications are more serious than others
– For instance, if invasion risks are classified as
low, moderate, and high, it would be more costly to classify a
high-risk species as low-risk than as moderate-risk
– This can be treated by specifying a differential penalty, weighting
misclassifications of high, moderate, and low risk differently
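The differential-penalty idea can be sketched with an explicit misclassification-cost matrix: a node is labelled not by its majority class but by the class with the lowest total cost over the cases in the node. The cost values below are invented for illustration.

```python
# COST[true_class][predicted_class]: calling a high-risk species "low"
# is made ten times as costly as any other mistake (values hypothetical).
COST = {
    'high':     {'high': 0, 'moderate': 1, 'low': 10},
    'moderate': {'high': 1, 'moderate': 0, 'low': 1},
    'low':      {'high': 1, 'moderate': 1, 'low': 0},
}

def assign_class(node_labels):
    """Label a terminal node by the class minimizing total cost."""
    return min(COST, key=lambda pred: sum(COST[true][pred]
                                          for true in node_labels))

# A node with a 'low' majority is still labelled 'high', because missing
# the two high-risk cases would be far more expensive:
print(assign_class(['low', 'low', 'low', 'high', 'high']))  # 'high'
```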
38. Weighting: stratified sampling
[Figure: the same optimal classification tree for the 173 eradication campaigns as shown in Fig. 1 above.]
41. Scoring
[Figure: predictability of presence for individual genera, compared with 92.9% correct predictions for all aliens combined: Ageratum 77.8%, Argemone 84.8%, Chromolaena 89.5%, Opuntia 95.4%, Xanthium 95.6%, Lantana 98.1%.]
42. Artificially placing some factors at the top of a tree using
Splitter Variable Disallow Criteria
[Figure: optimal classification tree (N = 173) with event-specific factors placed at the top: a primary split on area infested (<= 4905 ha vs > 4905 ha), then sanitary control and reaction time, followed by man-made habitat, detectability, pathway (escaped), and world region, ending in eight terminal nodes.]
Fig. 3. Optimal classification tree with event-specific factors placed at the top of the tree. Overall misclassification rate of the optimal tree is
18.0% with 15.3% misclassified success and 23.4% failure cases. Sensitivity and specificity are respectively 84.7 and 76.6% for learning, and 66.7
and 64.9% for cross-validated samples.
43. Limitations
• Data mining models are good, but they cannot completely solve all
problems with data that violate the basic assumption of independence
of errors of observations, due to temporal or spatial autocorrelation,
or due to the related problem of phylogenetic relatedness
• Data mining methods cannot distinguish fixed and random effects and
thus cannot be used with mixed effect and nested statistical designs
• The tree-growing method is data intensive, requiring many more cases
than classical regression
– While for multiple regression it is usually recommended to keep the
number of explanatory variables six to ten times smaller than the number
of observations, for classification trees, at least 200 cases are
recommended for binary response variables and about 100 more cases for
each additional level of a categorical variable
– Efficiency of trees decreases rapidly with decreasing sample size, and for
small data sets, no test samples may be available
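The sample-size rule of thumb above can be written out explicitly; this is a direct transcription of the recommendation stated in the text, not a formal result.

```python
def recommended_min_cases(n_levels):
    """Rule of thumb from the text: at least 200 cases for a binary
    target, plus about 100 more for each additional level of a
    categorical target variable."""
    return 200 + 100 * (n_levels - 2)

print(recommended_min_cases(2))  # 200 (binary response)
print(recommended_min_cases(4))  # 400 (four-level categorical response)
```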
44. The tree-growing method is data intensive, requiring
many more cases than classical regression
[Figure: two bar charts ranking the relative importance (%) of the explanatory variables: A. variable importance for plant richness; B. variable importance for butterfly richness.]
Figure A1 Rank of importance of the individual explanatory variables from the six groups of explanatory variables (A. Geography, B. Past habitats, C. Present
habitats, D. Substrate, E. Vegetation heterogeneity, F. Urbanization) from random forests classifications, used for predicting the above and below median value for each
observation of species richness and endangered species across 48 reserves. Variable importance is rescaled to have values between 0 and 100. A. Variable importance
for plant richness: final error rate 45.8%, sensitivity 50.0%, specificity 58.3%. B. Variable importance for butterfly richness: final error rate 31.6%, sensitivity
63.6%, specificity 73.1%.
45. Hints for further development: a
problem of species relatedness
• The problem is that related species can share similar
traits inherited from a common ancestor
• Consequently, the traits of related species are not
independent, and in statistical analyses this
relatedness has to be taken into account
• Considering the effects of species relatedness is important
not only to treat the lack of statistical independence, but
also to address practical questions
– For instance, when one needs to know how species
belonging to a particular taxon (such as a family, order, or class)
are predisposed to invasion
46. Species relatedness
approximated by taxonomic
hierarchy
• As a first approximation, instead of real relatedness based
on traits inherited from common ancestors and described by a
phylogenetic tree, we can use the taxonomic
hierarchy, supposing that species relatedness
increases from subkingdoms within kingdoms down to subspecies
within species
• First, trees can be constructed with only the highest
taxonomic level, and then by including each lower taxonomic
level, one after another
– This successive treatment of taxonomy reveals at
which taxonomic level the examined trait has the largest effect
47. Species relatedness
approximated by taxonomic
hierarchy
• As a first approximation, instead of real relatedness based on traits
inherited from common ancestors, we can use the taxonomic hierarchy, supposing
that species relatedness increases from subkingdoms within
kingdoms down to subspecies within species
• Second, the splitting power of taxonomic levels in regression and
classification trees can be penalized proportionally to the number of
categories at each taxonomic level, taking advantage of the fact that the
number of units increases from the highest taxonomic level to
the lowest
– This penalization, weighting the taxonomic levels from the lowest to the
highest proportionally to the number of categories at each level, can thus
reveal the real weighted effect of a particular taxon on a particular species trait
48. Species relatedness
following phylogenetic
tree
• The best way to treat species relatedness is to use the actual
phylogenetic distances between species, calculated from a specific phylogenetic tree
• Technically, this can be done by calculating principal coordinate axes that
describe the distances between all taxa in a phylogenetic tree
• However, their use in CART® is hindered by the fact that these principal
coordinate axes are orthogonal, while classification and regression trees
exhibit their greatest strengths on highly nonlinear structures and
complex interactions
– Their usefulness decreases with increasing linearity of the relationships, and
consequently, no trees are built on mutually independent principal
coordinates
49. Hints for further development: a
problem of species relatedness
The problem can
be solved by using
the Splitter
Variable Disallow
Criteria in CART®
50. The problem of species relatedness could be solved
by artificially placing some factors at the top of a tree
• When species
relatedness is
approximated by
taxonomic hierarchy,
the Splitter Variable
Disallow Criteria can
be used directly to
place the individual
levels of the nested
taxonomic hierarchy
appropriately
51. The problem of species relatedness could be solved
by artificially placing some factors at the top of a tree
• When species relatedness directly
follows the phylogenetic distances
between taxa on a phylogenetic
tree, the Splitter Variable Disallow
Criteria can again be used directly to
place the individual splits
• A further development of the Splitter
Variable Disallow Criteria might
enable inclusion of the real
distances between the individual
taxa, by developing disallow criteria
that define not only the split region
but also the distances between the
individual splits, using e.g.
predefined importance values for the
distances among the splits
– This might also be useful for other
applications in which one wants not
only to place some predictors
deliberately at a chosen position in
a tree structure, but also to define
the precise distance between the
predictors
54. Acknowledgements
• Grants
– ALARM: Assessing LArge-scale environmental Risks with tested
Methods (European Union 6th Framework Programme), DAISIE:
Delivering Alien Invasive Species Inventories for Europe
(European Union 6th Framework Programme), PRATIQUE:
Enhancements of Pest Risk Analysis Techniques (European
Union 7th Framework Programme, 2008–2011)
– Czech Science Foundation grant no. 206/09/0563
– Long-term research projects no. RVO 67985939 from the
Academy of Sciences of the Czech Republic, MSM 21620828 and
LC06073 from the Ministry of Education, Youth and Sports of
the Czech Republic
– Praemium Academiae award from the Academy of Sciences of
the Czech Republic to Petr Pyšek