SlideShare une entreprise Scribd logo
1  sur  55
Data Mining to Reveal
     Biological Invaders
            VOJTĚCH JAROŠÍK
    1Department   of Ecology, Faculty of Science, Charles
    University, Prague and 2Institute of Botany, Academy
     of Sciences of the Czech Republic, Průhonice, the
                       Czech Republic
1                                              2
Backgroud on biological invasions
•   Economic impacts
•   Changes in habitat properties
•   Loss of species diversity
•   Biological homogenization
Economic impacts
Economic impacts




Skibbereen 1847 by Cork artist James Mahony   Emigrants Leave Ireland, engraving by Henry Doyle
Changes of habitat properties
Extinction of native species




Brown tree snake on a fence post in Guam
Biological homogenization
Aim
• Data-mining tools were originally designed for analyzing vast
  databases of often incomplete data, with an aim to find financial
  frauds, suitable candidates for loans, potential customers and
  other uncertain outputs
• I will show that searching for potential invasive species and their
  traits responsible for invasiveness, or identifying factors that
  distinguish invasible communities from those that resist invasion
  are similar risk assessments
   – This is perhaps the main reason why CART® and related methods are
     becoming increasingly popular in the field of invasion biology
   – Identifying homogeneous groups with high or low risk and
     constructing rules for making predictions about individual cases is, in
     essence, the same for financial credit scoring as for pest risk
     assessment
   – In both cases, one searches for rules that can be used to predict
     uncertain future events
Basic principles of data mining

•   Main literature sources
•   Binary recursive partitioning
•   Classification And Regression Trees (CART®)
•   Random forests (RFTM)
Basic principles of data mining
•   Classification and Regression Trees (CART®):
     –   Breiman L, Friedman JH, Olshen RA, Stone CG (1984) Classification and Regression Trees. Belmont, Wadsworth
         International Group
     –   Steinberg D, Colla P (1995) CART: Tree-structured Non-parametric Data Analysis. Salford Systems, San Diego, USA
     –   Steinberg D, Colla P (1997) CART: Classification and Regression Trees. Salford Systems, San Diego, USA
     –   Steinberg G, Golovnya M (2006) CART 6.0 User's Manual. Salford Systems, San Diego, USA
     –   De'ath G, Fabricius KE (2000) Classification and regression trees: a powerful yet simple technique for ecological data
         analysis. Ecology, 81, 3178-3192
     –   Bourg NA, McShea WJ, Gill DE (2005) Putting a cart before the search: successful habitat prediction for a rare forest
         herb. Ecology, 86, 2793-2804
     –   Jarošík V (2011) CART and related methods. In: Encyclopedia of Biological Invasions (eds Simberloff D, Rejmánek
         M), pp. 104-108. University of California Press, Berkeley and Los Angeles, USA
•   Random ForestsTM
     –   Breiman L (2001) Random Forests. Machine Learning, 45, 5--32
     –   Breiman L, Cutler A (2004) Random Forests TM. An Implementation of Leo Breiman’s RFTM by Salford Systems.
         Salford Systems, San Diego, USA
     –   Cutler DR, Edwards TC, Beard KH, Cutler A, Hess KT, et al. (2007) Random forests for classification in ecology.
         Ecology, 88, 2783-2792
     –   Hochachka WM, Caruana R, Fink D, Munson A, Riedewald M, et al. (2007) Data-mining discovery of pattern and
         process in ecological systems. Journal of Wildlife Management, 71, 242-2437
•   TreeNetTM
     –   Friedman JH (1999) Stochastic Gradient Boosting. Technical report, Dept. of Statistics, Stanford University.
     –   Friedman JH (2001) Greedy function approximation: a gradient boosting machine. Annals of Statistics, 29, 1189-
         1232
     –   Friedman JH (2002) Stochastic gradient boosting. Computational Statistics and Data Analysis, 38, 367–378
     –   Hastie TJ, Tibshirani RJ, Friedman JH (2001) The Elements of Statistical Learning: Data Mining, Inference, and
         Prediction. Springer, New York
     –   Jarošík V, Pyšek P, Kadlec T (2011) Alien plants in urban nature reserves: from red-list species to future invaders.
         NeoBiota, 10, 27–46
Basic principles of predictive mining
the data are successively split   Binary recursive partitioning
Basic principles of CART®
CART® provide graphical, highly intuitive inside

                                                             Run off
                                                         Class Cases       %
                                             None        Absent     384    60.3    High, Medium, Low
                                                         Present    253    39.7
                                                                N= 637




                      Road density outside                                                      Natural areas outside
                      Class Cases     %                                                        Class Cases      %
            =< 0.1    Absent    287 91.4      > 0.1                                  =< 0.9    Absent      97 30.0      > 0.9
                      Present    27   8.3                                                      Present    226 70.0
                             N= 314                                                                   N= 323
Terminal node 1                                Terminal node 2            Terminal node 3
Class Cases     %                              Class Cases     %          Class Cases     %                             Road present inside
Absent     277 96.9                            Absent     10 35.7         Absent     34 15.7                            Class Cases     %
Present      9  3.1                            Present    18 64.3         Present 183 84.3                      No      Absent     63 59.4     Yes
      N= 286                                          N= 28                      N= 217                                 Present    43 40.6
                                                                                                 Terminal node 4               N= 106
                                                                                                                                              Terminal node 5
                                                                                                 Class Cases     %
                                                                                                                                              Class Cases     %
                                                                                                 Absent     34 89.5
                                                                                                                                              Absent     29 42.6
                                                                                                 Present      4 10.5
                                                                                                                                              Present    39 57.4
                                                                                                        N= 38
  From:                                                                                                                                              N= 68



                                                                                               The tree is represented (in defiance of
                                                                                               gravity) with the root standing for undivided
                                                                                               data at the top
Basic principles of CART®




      From:
Basic principles of RandomForestsTM
Random forests can be seen as an extension of classification trees by fitting many sub-trees
to parts of the dataset and then combining the predictions from all trees
                                                               Figure 3. Ranking of
                                                               importance values (%) for all
                                                               invasive species. Ranking is
                                                               scaled relative to the best
                                                               performing variable based
                                                               on out of-bag method of
                                                               random forests. White bars
                                                               are predictors from the
                                                               outside of Kruger National
                                                               Park and grey bars from the
                                                               inside.


                                                               From:
Important properties of data mining
             models
•   Exploratory and flexible
•   Non parametric
•   Surrogates
•   Penalization
•   Weighting
•   Scoring
•   Artificial placing some predictors at the top of
    a tree
Data mining models are exploratory
• Unlike the classical linear methods, the data
  mining techniques enable predictions to be
  made from the data and to identify the most
  important predictors by screening a large
  number of candidate variables without
  requiring the user to make any assumptions
  about the form of the relationships between
  the predictors and the target variable, and
  without a priori formulated hypotheses
Data mining models are flexible
• These techniques are also more flexible than traditional statistical analyses
  because they can reveal structures in the dataset that are other than
  linear, and solve complex interactions
• Unlike linear models, which uncover a single dominant structure in the
  data, data mining models are designed to work with data that might have
  multiple structures:
    – The models can use the same explanatory variable in different parts of the
      tree, dealing effectively with nonlinear relationships and higher order
      interactions
• In fact, provided there are enough observations, the more complex the data
  and the more variables that are available, the better models will appear
  compared to alternative methods.
    – With a complex data set, understandable and generally interpretable results
      often can be found only by constructing data mining models
• Data mining models are also excellent for initial data inspection
    – Models are often used to select a manageable number of core measures from
      databases with hundreds of variables
    – A useful subset of predictors from a large set of variables can then be used in
      building a formal linear model
With a complex data set, understandable and generally interpretable results often
             can be found only by constructing data mining models




From:
With a complex data set, understandable and generally interpretable results
          often can be found only by constructing data mining models
Table 2. Test statistics and significances of the explanatory variables and their interactions in the AIC
minimal adequate models for proportion of archaeophytes. Non-significant variables and their
interactions are not shown.
                                                                                                             R2 = 0.82
Explanatory variable                                                         archaeophytes
                                                              df        Deviance         P           R2
Habitat type                                                  31          90521.2     <0.001        0.760
Riverside position                                             1             28.5     <0.001       <0.001
Vegetation cover                                               1             88.4     <0.001       <0.001
Surrounding urban/industrial land                              1            660.7     <0.001        0.005
Surrounding agricultural land                                  1            852.8     <0.001        0.007
Altitudinal floristic region                                   2           1755.1     <0.001        0.015
Altitude                                                       1            647.7     <0.001        0.005
Temperature                                                    1             28.3     <0.001       <0.001
Precipitation                                                  1            392.8     <0.001        0.003
Habitat type x riverside position                                                       n.s.
Habitat type x vegetation cover                               32              373     <0.001        0.003
Habitat type x urban/industrial                               32            480.2     <0.001        0.004
Habitat type x agricultural                                   32            278.7     <0.001        0.002
Habitat type x human density                                  32            172.9     <0.001        0.001
Habitat type x floristic region                               53            405.3     <0.001        0.003
Habitat type x altitude                                       32            284.0     <0.001        0.002
                                                                                                            From:
Habitat type x temperature                                    32             91.1     <0.001       <0.001
Habitat type x precipitation                                  32            181.2     <0.001        0.001
Riverside position x precipitation                             1             11.4     <0.001       <0.001
Vegetation cover x agricultural                                1             34.6     <0.001       <0.001
Vegetation cover x human density                               1               4.4     0.036       <0.001
Vegetation cover x floristic region                            2               9.4     0.009       <0.001
Vegetation cover x precipitation                               1               9.2     0.002       <0.001
Urban/industrial x human density                                                        n.s.
Urban/industrial x floristic region                            2               9.4     0.009       <0.001
Agricultural x altitude                                                                 n.s.
Agricultural x temperature                                     1             38.0     <0.001       <0.001
Agricultural x precipitation                                   1             10.0      0.002       <0.001
Floristic region x altitude                                    2             11.1      0.004       <0.001
Floristic region x temperature                                 2               8.2     0.016       <0.001
Floristic region x precipitation                                                        n.s.
Altitude x precipitation                                       1               6.2     0.013       <0.001
Temperature x precipitation                                                             n.s.
With a complex data set, understandable and generally interpretable results
        often can be found only by constructing data mining models


                                                         R2 = 0.86




                                               From:
With a complex data set, understandable and generally interpretable results
            often can be found only by constructing data mining models




                  R2 = 0.74




ANOVA table and deletion tests for linear minimal adequate model (MAM) of ANCOVA describing population characteristics
determining the impact on diversity between invaded and uninvaded pairs of plots.

                                               ANOVA on MAM                           Deletion tests on MAM
Source of variation                Df          SS       MS         R2 (R2adj.)       F          Df        P
Impact on diversity H':
Species x Height                  13        25.4219     1.9555                     1.93     13, 100      0.04
                                                                                                                  From:
Species x Differences in height   13        5.0490      0.3884                     2.11     13, 100      0.02
Species x Differences in cover    13        14.1461     1.0882                      10      13, 100     < 0.001
Height x Differences in Height     1        0.3246      0.3246                     4.44      1, 88       0.04
Height x Differences in cover      1        0.2572      0.2572                     4.83      1, 88       0.03
Cover x Differences in cover       1        0.6834      0.6834                     6.48      1, 88       0.01
Residuals                         87        9.1763      0.1055
Total                             129       55.0584               0.83 (0.76)      10.36     42, 87     < 0.001
Models are often used to select a manageable number of core measures…A useful
     subset of predictors … can then be used in building a formal linear model


                                                             Water run-off
                                                         Class Cases       %
                                        =< 6.0           Absent     384 60.3   > 6.0                                                 1




                                                                                                    Probability of an alien record
                                                         Present    253 39.7
                                                                N= 637




                                                                                                                                     0

                         Road density outside
                         Class Cases     %
                                                                                                                                         Ro
                =< 0.1   Absent    323 89.7      > 0.1                                                                                        ad
                                                                                                                                                   de
                         Present    37 10.3                                    Terminal node 3                                                       ns
                                                                                                                                                                                   -of
                                                                                                                                                                                         f
                                                                                                                                                       ity
                                N= 360                                                                                                                       10              run
                                                                               Class Cases     %                                                                       ter
                                                                                                                                                                km   Wa
   Terminal node 1                                Terminal node 2              Absent     61 22.0
   Class Cases     %                              Class Cases     %            Present   216 78.0
   Absent    313 94.6                             Absent     10 34.5                  N= 277
   Present    18   5.4                            Present    19 65.5
         N= 331                                          N= 29




From:
Models are often used to select a manageable number of core measures…A useful
   subset of predictors … can then be used in building a formal linear model




  From:
Models are often used to select a manageable number of core measures…A useful
                                    subset of predictors … can then be used in building a formal linear model


                                                                                                The probability of non-native plant presence was
                                                                                                associated with water runoff (0–27 million m3) and major
                                 1
                                                                                                road density within 10 km radius outside the KNP
                                                                                                boundary (0–0.15 km/km2). Overall significance of the
Probability of an alien record




                                                                                                model is G2 = 531.54, df=3, P<0.0001. Parameters of the
                                                                                                model are given in the Table below.
                                                                                             Parameter      Estimate ASE   Standardized ASE    G2      df   P
                                                                                                                           estimate
                                                                                             Intercept      -6.30   0.66   -0.73        0.18
                                  0
                                                                                             Run-off        0.74    0.08   3.37         0.31   264.82 1         0.0001

                                                                                             Road density   69.70   8.59   1.54         0.20   127.96 1         0.0001
                                      Ro
                                           ad
                                                de
                                                  ns
                                                                                   -of
                                                                                         f   Run-off × Road -5.90   0.89   -1.63        0.25   49.13   1        0.0001
                                                     ity                       n
                                                           10          te r ru               density
                                                                km   Wa




                                 From:
Data mining methods are non parametric
• Consequently, unlike with parametric linear
  models
  – Nonnormal distribution does not require
    transformations
     • Because the trees are invariant to monotonic
       transformations of continuous predictor variables, no
       transformation is needed prior to analyses
     • Outliers among the predictors generally do not affect the
       models because splits usually occur at non-outlier values
     • (However, in some circumstances, transformation of the
       target variable may be important to alleviate variance
       heterogeneity).
Surrogates
• Surrogates of each split, describing splitting rules
  that closely mimic the action of the primary
  split, can be assessed and ranked according to
  their association values, with the highest possible
  value 1.0 corresponding to the surrogate
  producing exactly the same split as the primary
  split
   – Surrogates can then be used to replace an expensive
     primary explanatory variable by a less expensive or
     easier to interpret, though probably less accurate one
   – Surrogates can also be used to build alternative trees
     on surrogates
Surrogates can be used to build alternative trees
Surrogates can be used to build alternative trees




 From:
Surrogates can be used to build alternative trees




Appendix S6. Alternative model to the optimal classification tree. The alternative categorical
split “Run off” (none: no rivers intersected the segment; low: 2-10; medium: 10-15; high: 15-
26 million m3 / quaternary watershed / annum) replaced the continuous primary split “Water
run-off” at node 1 of the optimal tree (Fig. 3) as a surrogate with association value = 0.86.
Overall misclassification rate of the model is 13.5%, sensitivity 0.90 and specificity 0.80.
“Natural areas outside” refers to the percentage of natural areas in a 5-km radius outside the
KNP boundary, “Road present inside” refers to the presence or absence of roads in a studied
segment inside KNP. Otherwise as in Fig. 3.

 From:
Surrogates
• Surrogates of each split, describing splitting rules
  that closely mimic the action of the primary split,
  can be assessed and ranked according to their
  association values, with the highest possible
  value 1.0 corresponding to the surrogate
  producing exactly the same split as the primary
  split
   – Surrogates also serve to treat missing values, because
     the alternative splits are used to classify a case when
     its primary splitting variable is missing
Surrogates serve to treat missing values
• The fact that data mining techniques can handle
  data gaps by calculating surrogate variables to
  replace missing values often enable to use larger
  dataset than in linear models in which usually all
  cases with missing values had to be discarded
• Consequently, the data mining techniques can
  often reveal additional factors relevant for
  predictions that may have been overlooked when
  tested by classical statistical approaches
Calculating surrogate variables to replace missing values …enable to use larger dataset… the data mining
                 techniques thus often reveal additional factors relevant for predictions




                                                                N = 136
Calculating surrogate variables to replace missing values …enable to use larger dataset… the data mining
                 techniques thus often reveal additional factors relevant for predictions
                                                                                             Man-made habitat
                                                                                             Class         %
                                                                            yes              success      54.5        no
                                                                                             failure      45.5
                                                                                                     N = 173

                                Habitat                                                                                                          Pathway: Escaped
                                Class         %            outdoor,                                                                              Class          %
                     indoor     success      72.5          both                                                                    yes           success      28.6   no
                                failure      27.5                                                                                                failure      71.4
                                        N = 114                                                                                                          N = 59
      Terminal node 1                                                                                               Terminal node 6                                       Terminal node 7
      Class           %                                        World region                                         Class         %                                       Class           %
      success        91.7                                      Class          %            Americas,                success      75.0                                     success        16.6
      failure         8.3                    Australasia       success      64.8           Europe                   failure      25.0                                     failure        83.4
              N = 23                                           failure      35.2                                            N=8                                                   N = 51
                                                                       N = 91
                              Terminal node 2
                              Class           %
                              success        86.7                                              Spatial extent
                              failure        13.3                          local,              Class          %        regional,
                                      N = 23                               international       success      51.6       national
                                                                                               failure      48.4
                                                                                                       N = 68
                                                                                                                           Terminal node 5
                                                                                                                           Class           %
                                                              Area infested                                                success        16.3
                                                              Class          %                                             failure        83.7
                                           <= 0.2 ha          success      77.0               > 0.2 ha                             N = 34
                                                              failure      23.0
                                                                      N = 34
                              Terminal node 3                                                 Terminal node 4
                              Class         %                                                 Class           %
                              success      44.6                                               success        89.8
                              failure      55.4                                               failure        10.2
                                      N=9                                                             N = 25


                                                                Fig. 1. Optimal classification tree for factors relating to success and failure of 173 eradication
                                                                campaigns against invertebrate plant pests, weeds and plant pathogens (viruses/viroids,
                                                                bacteria and fungi) . Overall misclassification rate of the optimal tree is 15.8% compared to
                                                                50% for the null model, with 16.7% misclassified success and 14.8% failure cases. Sensitivity
                                                                is 83.3 and specificity 85.2% for learning samples, and 77.1 and 69.0%, respectively, for
                                                                cross-validated samples.

     Submitted to: PLoS One
     Which factors affect the success or failure of eradication campaigns against alien species?
     Therese Pluess1, Vojtěch Jarošík2,3*, Petr Pyšek3,2, Ray Cannon4, Jan Pergl3, Annemarie Breukers5, Sven Bacher1
     1University  of Fribourg, Department of Biology, Ecology & Evolution Unit, Chemin du Musée 10, 1700 Fribourg, Switzerland
     2Charles  University in Prague, Faculty of Science, Department of Ecology, CZ-128 44 Praha 2, Czech Republic
     3Institute of Botany, Academy of Sciences of the Czech Republic, CZ-252 43 Průhonice, Czech Republic
     4The Food and Environment Research Agency, Sand Hutton, York YO41 1LZ, United Kingdom
     5LEI, part of Wageningen UR, P.O. Box 8130, 6700 EW, Wageningen, The Netherlands
Penalization
• As it is easier to be a good splitter on a small
  number of records (e.g., splitting a node with just
  two records), to prevent missing explanatory
  variables from having an advantage as
  splitters, the power of explanatory variables
  should be penalized in proportion to the degree
  to which they are missing
• High-level categorical predictor variables have
  inherently higher splitting power than continuous
  predictors and therefore can also be penalized to
  level the playing field
High-level categorical explanatory variables have inherently
higher splitting power than continuous explanatory variables
 and therefore can also be penalized to level the playing field
Weighting
• Weighting enables one to give a different weight to each
  case in analysis
• A usual application is on proportional data, where
  proportions calculated from larger samples give more
  precise estimates, and therefore, proportional response
  variables are weighted by their sample sizes
• A similar approach can be applied to stratified sampling on
  strata having different sampling intensities
• Data-mining models can also accommodate situations in
  which some misclassifications are more serious than others
   – For instance, if invasion risks are classified as
     low, moderate, and high, it would be more costly to classify a
     high-risk species as low-risk than as moderate-risk
   – This can be treated by specifying a differential penalty by
     weighting different way misclassifying high, moderate, and low
     risk
Weighting: a usual application is on proportional data
Weighting: stratified sampling
                                                                                        Man-made habitat
                                                                                        Class         %
                                                                       yes              success      54.5        no
                                                                                        failure      45.5
                                                                                                N = 173

                           Habitat                                                                                                          Pathway: Escaped
                           Class         %            outdoor,                                                                              Class          %
                indoor     success      72.5          both                                                                    yes           success      28.6   no
                           failure      27.5                                                                                                failure      71.4
                                   N = 114                                                                                                          N = 59
 Terminal node 1                                                                                               Terminal node 6                                       Terminal node 7
 Class           %                                        World region                                         Class         %                                       Class           %
 success        91.7                                      Class          %            Americas,                success      75.0                                     success        16.6
 failure         8.3                    Australasia       success      64.8           Europe                   failure      25.0                                     failure        83.4
         N = 23                                           failure      35.2                                            N=8                                                   N = 51
                                                                  N = 91
                         Terminal node 2
                         Class           %
                         success        86.7                                              Spatial extent
                         failure        13.3                          local,              Class          %        regional,
                                 N = 23                               international       success      51.6       national
                                                                                          failure      48.4
                                                                                                  N = 68
                                                                                                                      Terminal node 5
                                                                                                                      Class           %
                                                         Area infested                                                success        16.3
                                                         Class          %                                             failure        83.7
                                      <= 0.2 ha          success      77.0               > 0.2 ha                             N = 34
                                                         failure      23.0
                                                                 N = 34
                         Terminal node 3                                                 Terminal node 4
                         Class         %                                                 Class           %
                         success      44.6                                               success        89.8
                         failure      55.4                                               failure        10.2
                                 N=9                                                             N = 25


                                                           Fig. 1. Optimal classification tree for factors relating to success and failure of 173 eradication
                                                           campaigns against invertebrate plant pests, weeds and plant pathogens (viruses/viroids,
                                                           bacteria and fungi). Overall misclassification rate of the optimal tree is 15.8% compared to
                                                           50% for the null model, with 16.7% misclassified success and 14.8% failure cases. Sensitivity
                                                           is 83.3 and specificity 85.2% for learning samples, and 77.1 and 69.0%, respectively, for
                                                           cross-validated samples.

Submitted to: PLoS One
Which factors affect the success or failure of eradication campaigns against alien species?
Therese Pluess1, Vojtěch Jarošík2,3*, Petr Pyšek3,2, Ray Cannon4, Jan Pergl3, Annemarie Breukers5, Sven Bacher1
1University  of Fribourg, Department of Biology, Ecology & Evolution Unit, Chemin du Musée 10, 1700 Fribourg, Switzerland
2Charles  University in Prague, Faculty of Science, Department of Ecology, CZ-128 44 Praha 2, Czech Republic
3Institute of Botany, Academy of Sciences of the Czech Republic, CZ-252 43 Průhonice, Czech Republic
4The Food and Environment Research Agency, Sand Hutton, York YO41 1LZ, United Kingdom
5LEI, part of Wageningen UR, P.O. Box 8130, 6700 EW, Wageningen, The Netherlands
Scoring




Ageratum houstonianum   Chromolaena odorata   Xanthium strumarium




  Argemone ochroleuca                           Lantana camara
                            Opuntia stricta
Scoring
Scoring
        Ageratum 77.8%


                   Argemone 84.8%


                            Chromolaena 89.5%


                                                      Opuntia 95.4%


                                                      Xanthium 95.6%


                                                          Lantana 98.1%


-25.0      -20.0    -15.0       -10.0    -5.0   0.0     5.0      10.0     15.0   20.0   25.0

   Predictability of presence comparing to all aliens with 92.9% of correct predictions
Artificial placing some factors at the top of a tree using
                                                 Splitter Variable Disallow Criteria
                                                                                                   Area infested
                                                                                                   Class         %
                                                                                <= 4905 ha         success      54.5            > 4905 ha
                                                                                                   failure      45.5
                                                                                                           N = 173

                           Sanitary control                                                                                                                                             Reaction time
                           Class         %                                                                                                                                              Class          %
               yes         success      66.7     no                                                                                                                     <= 11 yrs       success      32.5   > 11 yrs
                           failure      33.3                                                                                                                                            failure      67.5
                                   N = 101                                                                                                                                                      N = 72
Terminal node 1                                                                                                                                                                                               Terminal node 8
Class           %                                     Man-made habitat                                                                                     Detectability                                      Class           %
success        86.1                                   Class          %                                                                                     Class          %         easy,                     success        15.2
failure        13.9                        yes        success      60.8   no                                                           difficult           success      46.6                                  failure        84.8
                                                                                                                                                                                    intermediate
        N = 29                                        failure      39.2                                                                                    failure      53.4                                          N = 32
                                                              N = 72                                                                                               N = 40
                        Terminal node 2                                                                                   Terminal node 6                                             Terminal node 7
                        Class           %                                                                                 Class         %                                             Class           %
                        success        82.4                               Pathway: escaped                                success      80.1                                           success        31.8
                        failure        17.6                               Class          %                                failure      19.9                                           failure        68.2
                                N = 37                              yes   success      42.3          no                           N=9                                                         N = 31
                                                                          failure      57.7
                                                                                  N = 35
                                                  Terminal node 3
                                                  Class         %                                     World region
                                                  success      85.7                                   Class          %             Australasia,
                                                  failure      14.3                     Americas      success      25.7            Europe
                                                          N=7                                         failure      74.3
                                                                                                              N = 28
                                                                          Terminal node 4                                            Terminal node 5
                                                                          Class         %                                            Class           %
                                                                          success      59.3                                          success        11.5
                                                                          failure      40.7                                          failure        88.5
                                                                                  N=7                                                        N = 21




Fig. 3. Optimal classification tree with event-specific factors placed at the top of the tree. Overall misclassification rate of the optimal tree is
18.0% with 15.3% misclassified success and 23.4% failure cases. Sensitivity and specificity are respectively 84.7 and 76.6% for learning, and 66.7
and 64.9% for cross-validated samples.
                Submitted to: PLoS One
                Which factors affect the success or failure of eradication campaigns against alien species?
                Therese Pluess1, Vojtěch Jarošík2,3*, Petr Pyšek3,2, Ray Cannon4, Jan Pergl3, Annemarie Breukers5, Sven Bacher1
                1University  of Fribourg, Department of Biology, Ecology & Evolution Unit, Chemin du Musée 10, 1700 Fribourg, Switzerland
                2Charles  University in Prague, Faculty of Science, Department of Ecology, CZ-128 44 Praha 2, Czech Republic
                3Institute of Botany, Academy of Sciences of the Czech Republic, CZ-252 43 Průhonice, Czech Republic
                4The Food and Environment Research Agency, Sand Hutton, York YO41 1LZ, United Kingdom
                5LEI, part of Wageningen UR, P.O. Box 8130, 6700 EW, Wageningen, The Netherlands
Limitations
• Data mining models are good, but not as good to solve completely all
  problems with data that violate a basic assumption of the independence
  of errors of observations due to temporal or spatial autocorrelation, or
  due to a related problem of phylogenetic relatedness
• Data mining methods cannot distinguish fixed and random effects and
  thus cannot be used with mixed effect and nested statistical designs
• The tree-growing method is data intensive, requiring many more cases
  than classical regression
   – While for multiple regression it is usually recommended to keep the
     number of explanatory variables six to ten times smaller than the number
     of observations, for classification trees, at least 200 cases are
     recommended for binary response variables and about 100 more cases for
     each additional level of a categorical variable
   – Efficiency of trees decreases rapidly with decreasing sample size, and for
     small data sets, no test samples may be available
The tree-growing method is data intensive, requiring
                            many more cases than classical regression
                                          A. Variable importance for plant richness                                                                     B. Variable importance for butterfly richness

                           A. area                                                                                        A. altitudinal range
                    B. w asteland                                                                                                         D. soil
          F. natural perim eter                                                                               F. distance to natural habitat
                   B. arable land                                                                                                     A. aspect
                B. build-up area                                                                                        F. natural perim eter
                            D. soil                                                                                                      A. area
                         C. forest                                                                                                     B. forest
                                                                                                                                  B. w asteland
                        A. aspect                                                                                                   A. isolation
            A. altitudinal range                                                                                       F. built-up perim eter
                         B. forest                                                                                                    A. railw ay
                     C. orchards                                                                                                     A. altitude
                     B. orchards                                                                                             D. bare bedrock
                       B. pasture                                                                                                    B. pasture
               D. bare bedrock                                                                                  F. distance to built-up area
         F. built-up perim eter                                                                                                  C. arable land
                  C. arable land                                                                                                 B. arable land
                      A. isolation                                                                                               C. w asteland
                        A. railw ay                                                                                                  C. pasture
                         D. quarry                                                                                                     D. quarry
                                                                                                                                   C. orchards
                 A. reserve age
                                                                                                                                  C. grassland
                       A. altitude                                                                                           C. build-up area
                    B. grassland                                                                                               A. reserve age
                    C. grassland                                                                                                  B. grassland
                       C. pasture                                                                                                  B. orchards
                   C. w asteland                                                                                              B. build-up area
               C. build-up area                                                                                                        C. forest
F. distance to natural habitat                                                                                                         E. plants
  F. distance to built-up area                                                                                                E. com m unities

                                      0     10     20     30     40      50      60      70   80   90   100                                         0       10     20     30     40      50      60      70   80   90   100

                                                               Relative importance (%)                                                                                         Relative importance (%)


Figure A1 Rank of importance of the individual explanatory variables from the six groups of explanatory variables (A. Geography, B. Past habitats, C. Present
habitats, D. Substrate, E. Vegetation heterogeneity, F. Urbanization) from random forests classifications, used for predicting the above and below median value for each
observation of species richness and endangered species across 48 reserves. Variable importance is rescaled to have values between 0 and 100. A. Variable importance
for plant richness: final error rate 45.8%, sensitivity 50.0%, specificity 58.3%. B. Variable importance for butterfly richness: final error rate 31.6%, sensitivity
63.6%, specificity 73.1%.
Hints for further development: a
     problem of species relatedness
• The problem is that related species can have similar
  traits due to inherited traits from a common ancestor
• Consequently, the traits of related species are not
  independent, and in statistical analyses, this
  relatedness have to be taken into account
• Considering effects of species relatedness is important
  not only to treat a lack of statistical independence, but
  also to solve practical implications
   – For instance, when one need to know how species
     belonging to a particular taxon (like family, order, or class)
     are predisposed to invasion
Species relatedness
                             approximated by taxonomic
                             hierarchy


• As a first approximation, instead of a real relatedness based
  on inherited traits from common ancestors described by a
  phylogenetic tree, we can use taxonomic
  hierarchy, supposing that the species relatedness is
  increasing from subkingdoms within kingdoms to subspecies
  within species
• First, the trees can be constructed only with the highest
  taxonomic level and then including each lower taxonomic
  level, one after another
   – This successional treatment of taxonomy enable to reveal at
     which taxonomic level the examined trait has the largest effect
Species relatedness
                                   approximated by taxonomic
                                   hierarchy


• As a first approximation, instead of a real relatedness based on inherited
  traits from common ancestors, we can use taxonomic hierarchy, supposing
  that the species relatedness is increasing from subkingdoms within
  kingdoms to subspecies within species
• Second, the splitting power of taxonomic levels in regression and
  classification trees can be penalized proportionally to the number of
  categories at each taxonomic level, taking advantage of that that the
  number of units at the taxonomic levels is increasing from the highest to
  the lowest
   – This penalization, weighting the taxonomic levels from the lowest to the
     highest proportionally to the number of categories at each level, thus can
     reveal a real weighted effect of a particular taxon on a particular species trait
Species relatedness
                                           following phylogenetic
                                           tree


• The best way how to treat species relatedness is to use the actual species
  phylogenetic distances calculated from a specific phylogenetic tree
• Technically, this can be done by calculating principal coordinate axes, that
  describe distances between all taxa in a phylogenetic tree
• However, their use in CART® is hindered by the fact that these principal
  coordinate axes are orthogonal, and classification and regression trees
  exhibit their greatest strengths with a highly nonlinear structure and
  complex interactions
    – Their usefulness decreases with increasing linearity of the relationships, and
      consequently, on mutually independent principle coordinates, no trees are
      built
Hints for further development: a
problem of species relatedness
                     The problem can
                     be solved by using
                     the Splitter
                     Variable Disallow
                     Criteria in CART®
The problem of species relatedness could be solved
by artificial placing some factors at the top of a tree
                             • When species
                               relatedness
                               approximated by
                               taxonomic hierarchy,
                               the Splitter Variable
                               Disallow Criteria can
                               be directly used for
                               appropriate placing
                               the individual levels of
                               the nested taxonomic
                               hierarchy
The problem of species relatedness could be solved
by artificial placing some factors at the top of a tree
                             • When species relatedness directly
                               follows phylogenetic distances
                               between taxa on a phylogenetic
                               tree, the Splitter Variable Disallow
                               Criteria can be again directly used to
                               place the individual splits
                             • A further development of Splitter
                               Variable Disallow Criteria might
                               enable to include also the real
                               distances between the individual
                               taxa, by developing a disallow criteria
                               defining not only the split region, but
                               also the distances between the
                               individual splits, using e.g. a
                               predefined importance values for the
                               distances among the splits
                                 – This might be useful also for other
                                   applications in which it can appear
                                   useful to place some predictors not
                                   only deliberately at a chosen place in
                                   a tree structure, but also to define
                                   the precise distance between the
                                   predictors
Acknowledgements
• Co-authors




      Petr Pyšek            Jan Pergl
Acknowledgements
• Wikipedia for freely available pictures
Acknowledgements
• Grants
   – ALARM: Assessing LArge-scale environmental Risks with tested
     Methods (European Union 6th Framework Programme), DAISIE:
     Delivering Alien Invasive Species Inventories for Europe
     (European Union 6th Framework Programme), PRATIQUE:
     Enhancements of Pest Risk Analysis Techniques (European
     Union 7th Framework Programme, 2008–2011)
   – Czech Science Foundation grant no. 206/09/0563
   – Long-term research projects no. RVO 67985939 from the
     Academy of Sciences of the Czech Republic, MSM 21620828 and
     LC06073 from the Ministry of Education, Youth and Sports of
     the Czech Republic
   – Praemium Academiae award from the Academy of Sciences of
     the Czech Republic to Petr Pyšek
Acknowledgements
• And last but not least, …

Contenu connexe

Plus de Salford Systems

Improved Predictions in Structure Based Drug Design Using Cart and Bayesian M...
Improved Predictions in Structure Based Drug Design Using Cart and Bayesian M...Improved Predictions in Structure Based Drug Design Using Cart and Bayesian M...
Improved Predictions in Structure Based Drug Design Using Cart and Bayesian M...Salford Systems
 
Churn Modeling-For-Mobile-Telecommunications
Churn Modeling-For-Mobile-Telecommunications Churn Modeling-For-Mobile-Telecommunications
Churn Modeling-For-Mobile-Telecommunications Salford Systems
 
The Do's and Don'ts of Data Mining
The Do's and Don'ts of Data MiningThe Do's and Don'ts of Data Mining
The Do's and Don'ts of Data MiningSalford Systems
 
Introduction to Random Forests by Dr. Adele Cutler
Introduction to Random Forests by Dr. Adele CutlerIntroduction to Random Forests by Dr. Adele Cutler
Introduction to Random Forests by Dr. Adele CutlerSalford Systems
 
9 Data Mining Challenges From Data Scientists Like You
9 Data Mining Challenges From Data Scientists Like You9 Data Mining Challenges From Data Scientists Like You
9 Data Mining Challenges From Data Scientists Like YouSalford Systems
 
Statistically Significant Quotes To Remember
Statistically Significant Quotes To RememberStatistically Significant Quotes To Remember
Statistically Significant Quotes To RememberSalford Systems
 
Using CART For Beginners with A Teclo Example Dataset
Using CART For Beginners with A Teclo Example DatasetUsing CART For Beginners with A Teclo Example Dataset
Using CART For Beginners with A Teclo Example DatasetSalford Systems
 
CART Classification and Regression Trees Experienced User Guide
CART Classification and Regression Trees Experienced User GuideCART Classification and Regression Trees Experienced User Guide
CART Classification and Regression Trees Experienced User GuideSalford Systems
 
Evolution of regression ols to gps to mars
Evolution of regression   ols to gps to marsEvolution of regression   ols to gps to mars
Evolution of regression ols to gps to marsSalford Systems
 
Data Mining for Higher Education
Data Mining for Higher EducationData Mining for Higher Education
Data Mining for Higher EducationSalford Systems
 
Comparison of statistical methods commonly used in predictive modeling
Comparison of statistical methods commonly used in predictive modelingComparison of statistical methods commonly used in predictive modeling
Comparison of statistical methods commonly used in predictive modelingSalford Systems
 
Molecular data mining tool advances in hiv
Molecular data mining tool  advances in hivMolecular data mining tool  advances in hiv
Molecular data mining tool advances in hivSalford Systems
 
TreeNet Tree Ensembles & CART Decision Trees: A Winning Combination
TreeNet Tree Ensembles & CART Decision Trees:  A Winning CombinationTreeNet Tree Ensembles & CART Decision Trees:  A Winning Combination
TreeNet Tree Ensembles & CART Decision Trees: A Winning CombinationSalford Systems
 
SPM User's Guide: Introducing MARS
SPM User's Guide: Introducing MARSSPM User's Guide: Introducing MARS
SPM User's Guide: Introducing MARSSalford Systems
 
Hybrid cart logit model 1998
Hybrid cart logit model 1998Hybrid cart logit model 1998
Hybrid cart logit model 1998Salford Systems
 
Session Logs Tutorial for SPM
Session Logs Tutorial for SPMSession Logs Tutorial for SPM
Session Logs Tutorial for SPMSalford Systems
 
Some of the new features in SPM 7
Some of the new features in SPM 7Some of the new features in SPM 7
Some of the new features in SPM 7Salford Systems
 
TreeNet Overview - Updated October 2012
TreeNet Overview  - Updated October 2012TreeNet Overview  - Updated October 2012
TreeNet Overview - Updated October 2012Salford Systems
 
TreeNet Tree Ensembles and CART Decision Trees: A Winning Combination
TreeNet Tree Ensembles and CART  Decision Trees:  A Winning CombinationTreeNet Tree Ensembles and CART  Decision Trees:  A Winning Combination
TreeNet Tree Ensembles and CART Decision Trees: A Winning CombinationSalford Systems
 

Plus de Salford Systems (20)

Improved Predictions in Structure Based Drug Design Using Cart and Bayesian M...
Improved Predictions in Structure Based Drug Design Using Cart and Bayesian M...Improved Predictions in Structure Based Drug Design Using Cart and Bayesian M...
Improved Predictions in Structure Based Drug Design Using Cart and Bayesian M...
 
Churn Modeling-For-Mobile-Telecommunications
Churn Modeling-For-Mobile-Telecommunications Churn Modeling-For-Mobile-Telecommunications
Churn Modeling-For-Mobile-Telecommunications
 
The Do's and Don'ts of Data Mining
The Do's and Don'ts of Data MiningThe Do's and Don'ts of Data Mining
The Do's and Don'ts of Data Mining
 
Introduction to Random Forests by Dr. Adele Cutler
Introduction to Random Forests by Dr. Adele CutlerIntroduction to Random Forests by Dr. Adele Cutler
Introduction to Random Forests by Dr. Adele Cutler
 
9 Data Mining Challenges From Data Scientists Like You
9 Data Mining Challenges From Data Scientists Like You9 Data Mining Challenges From Data Scientists Like You
9 Data Mining Challenges From Data Scientists Like You
 
Statistically Significant Quotes To Remember
Statistically Significant Quotes To RememberStatistically Significant Quotes To Remember
Statistically Significant Quotes To Remember
 
Using CART For Beginners with A Teclo Example Dataset
Using CART For Beginners with A Teclo Example DatasetUsing CART For Beginners with A Teclo Example Dataset
Using CART For Beginners with A Teclo Example Dataset
 
CART Classification and Regression Trees Experienced User Guide
CART Classification and Regression Trees Experienced User GuideCART Classification and Regression Trees Experienced User Guide
CART Classification and Regression Trees Experienced User Guide
 
Evolution of regression ols to gps to mars
Evolution of regression   ols to gps to marsEvolution of regression   ols to gps to mars
Evolution of regression ols to gps to mars
 
Data Mining for Higher Education
Data Mining for Higher EducationData Mining for Higher Education
Data Mining for Higher Education
 
Comparison of statistical methods commonly used in predictive modeling
Comparison of statistical methods commonly used in predictive modelingComparison of statistical methods commonly used in predictive modeling
Comparison of statistical methods commonly used in predictive modeling
 
Molecular data mining tool advances in hiv
Molecular data mining tool  advances in hivMolecular data mining tool  advances in hiv
Molecular data mining tool advances in hiv
 
TreeNet Tree Ensembles & CART Decision Trees: A Winning Combination
TreeNet Tree Ensembles & CART Decision Trees:  A Winning CombinationTreeNet Tree Ensembles & CART Decision Trees:  A Winning Combination
TreeNet Tree Ensembles & CART Decision Trees: A Winning Combination
 
SPM v7.0 Feature Matrix
SPM v7.0 Feature MatrixSPM v7.0 Feature Matrix
SPM v7.0 Feature Matrix
 
SPM User's Guide: Introducing MARS
SPM User's Guide: Introducing MARSSPM User's Guide: Introducing MARS
SPM User's Guide: Introducing MARS
 
Hybrid cart logit model 1998
Hybrid cart logit model 1998Hybrid cart logit model 1998
Hybrid cart logit model 1998
 
Session Logs Tutorial for SPM
Session Logs Tutorial for SPMSession Logs Tutorial for SPM
Session Logs Tutorial for SPM
 
Some of the new features in SPM 7
Some of the new features in SPM 7Some of the new features in SPM 7
Some of the new features in SPM 7
 
TreeNet Overview - Updated October 2012
TreeNet Overview  - Updated October 2012TreeNet Overview  - Updated October 2012
TreeNet Overview - Updated October 2012
 
TreeNet Tree Ensembles and CART Decision Trees: A Winning Combination
TreeNet Tree Ensembles and CART  Decision Trees:  A Winning CombinationTreeNet Tree Ensembles and CART  Decision Trees:  A Winning Combination
TreeNet Tree Ensembles and CART Decision Trees: A Winning Combination
 

Dernier

Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CVKhem
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdflior mazor
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...apidays
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)wesley chun
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodJuan lago vázquez
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
Ransomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdfRansomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdfOverkill Security
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...apidays
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDropbox
 
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...apidays
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?Igalia
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu SubbuApidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbuapidays
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businesspanagenda
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...Zilliz
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 

Dernier (20)

Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Ransomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdfRansomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdf
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu SubbuApidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 

Data mining to reveal biological invaders

  • 1. Data Mining to Reveal Biological Invaders VOJTĚCH JAROŠÍK 1Department of Ecology, Faculty of Science, Charles University, Prague and 2Institute of Botany, Academy of Sciences of the Czech Republic, Průhonice, the Czech Republic 1 2
  • 2. Backgroud on biological invasions • Economic impacts • Changes in habitat properties • Loss of species diversity • Biological homogenization
  • 4. Economic impacts Skibbereen 1847 by Cork artist James Mahony Emigrants Leave Ireland, engraving by Henry Doyle
  • 5. Changes of habitat properties
  • 6. Extinction of native species Brown tree snake on a fence post in Guam
  • 8. Aim • Data-mining tools were originally designed for analyzing vast databases of often incomplete data, with an aim to find financial frauds, suitable candidates for loans, potential customers and other uncertain outputs • I will show that searching for potential invasive species and their traits responsible for invasiveness, or identifying factors that distinguish invasible communities from those that resist invasion are similar risk assessments – This is perhaps the main reason why CART® and related methods are becoming increasingly popular in the field of invasion biology – Identifying homogeneous groups with high or low risk and constructing rules for making predictions about individual cases is, in essence, the same for financial credit scoring as for pest risk assessment – In both cases, one searches for rules that can be used to predict uncertain future events
  • 9. Basic principles of data mining • Main literature sources • Binary recursive partitioning • Classification And Regression Trees (CART®) • Random forests (RFTM)
  • 10. Basic principles of data mining • Classification and Regression Trees (CART®): – Breiman L, Friedman JH, Olshen RA, Stone CG (1984) Classification and Regression Trees. Belmont, Wadsworth International Group – Steinberg D, Colla P (1995) CART: Tree-structured Non-parametric Data Analysis. Salford Systems, San Diego, USA – Steinberg D, Colla P (1997) CART: Classification and Regression Trees. Salford Systems, San Diego, USA – Steinberg G, Golovnya M (2006) CART 6.0 User's Manual. Salford Systems, San Diego, USA – De'ath G, Fabricius KE (2000) Classification and regression trees: a powerful yet simple technique for ecological data analysis. Ecology, 81, 3178-3192 – Bourg NA, McShea WJ, Gill DE (2005) Putting a cart before the search: successful habitat prediction for a rare forest herb. Ecology, 86, 2793-2804 – Jarošík V (2011) CART and related methods. In: Encyclopedia of Biological Invasions (eds Simberloff D, Rejmánek M), pp. 104-108. University of California Press, Berkeley and Los Angeles, USA • Random ForestsTM – Breiman L (2001) Random Forests. Machine Learning, 45, 5--32 – Breiman L, Cutler A (2004) Random Forests TM. An Implementation of Leo Breiman’s RFTM by Salford Systems. Salford Systems, San Diego, USA – Cutler DR, Edwards TC, Beard KH, Cutler A, Hess KT, et al. (2007) Random forests for classification in ecology. Ecology, 88, 2783-2792 – Hochachka WM, Caruana R, Fink D, Munson A, Riedewald M, et al. (2007) Data-mining discovery of pattern and process in ecological systems. Journal of Wildlife Management, 71, 242-2437 • TreeNetTM – Friedman JH (1999) Stochastic Gradient Boosting. Technical report, Dept. of Statistics, Stanford University. – Friedman JH (2001) Greedy function approximation: a gradient boosting machine. Annals of Statistics, 29, 1189- 1232 – Friedman JH (2002) Stochastic gradient boosting. Computational Statistics and Data Analysis, 38, 367–378 – Hastie TJ, Tibshirani RJ, Friedman JH (2001) The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, New York – Jarošík V, Pyšek P, Kadlec T (2011) Alien plants in urban nature reserves: from red-list species to future invaders. NeoBiota, 10, 27–46
  • 11. Basic principles of predictive mining the data are successively split Binary recursive partitioning
  • 12. Basic principles of CART® CART® provide graphical, highly intuitive inside Run off Class Cases % None Absent 384 60.3 High, Medium, Low Present 253 39.7 N= 637 Road density outside Natural areas outside Class Cases % Class Cases % =< 0.1 Absent 287 91.4 > 0.1 =< 0.9 Absent 97 30.0 > 0.9 Present 27 8.3 Present 226 70.0 N= 314 N= 323 Terminal node 1 Terminal node 2 Terminal node 3 Class Cases % Class Cases % Class Cases % Road present inside Absent 277 96.9 Absent 10 35.7 Absent 34 15.7 Class Cases % Present 9 3.1 Present 18 64.3 Present 183 84.3 No Absent 63 59.4 Yes N= 286 N= 28 N= 217 Present 43 40.6 Terminal node 4 N= 106 Terminal node 5 Class Cases % Class Cases % Absent 34 89.5 Absent 29 42.6 Present 4 10.5 Present 39 57.4 N= 38 From: N= 68 The tree is represented (in defiance of gravity) with the root standing for undivided data at the top
  • 13. Basic principles of CART® From:
  • 14. Basic principles of RandomForestsTM Random forests can be seen as an extension of classification trees by fitting many sub-trees to parts of the dataset and then combining the predictions from all trees Figure 3. Ranking of importance values (%) for all invasive species. Ranking is scaled relative to the best performing variable based on out of-bag method of random forests. White bars are predictors from the outside of Kruger National Park and grey bars from the inside. From:
  • 15. Important properties of data mining models • Exploratory and flexible • Non parametric • Surrogates • Penalization • Weighting • Scoring • Artificial placing some predictors at the top of a tree
  • 16. Data mining models are exploratory • Unlike the classical linear methods, the data mining techniques enable predictions to be made from the data and to identify the most important predictors by screening a large number of candidate variables without requiring the user to make any assumptions about the form of the relationships between the predictors and the target variable, and without a priori formulated hypotheses
  • 17. Data mining models are flexible • These techniques are also more flexible than traditional statistical analyses because they can reveal structures in the dataset that are other than linear, and solve complex interactions • Unlike linear models, which uncover a single dominant structure in the data, data mining models are designed to work with data that might have multiple structures: – The models can use the same explanatory variable in different parts of the tree, dealing effectively with nonlinear relationships and higher order interactions • In fact, provided there are enough observations, the more complex the data and the more variables that are available, the better models will appear compared to alternative methods. – With a complex data set, understandable and generally interpretable results often can be found only by constructing data mining models • Data mining models are also excellent for initial data inspection – Models are often used to select a manageable number of core measures from databases with hundreds of variables – A useful subset of predictors from a large set of variables can then be used in building a formal linear model
  • 18. With a complex data set, understandable and generally interpretable results often can be found only by constructing data mining models From:
  • 19. With a complex data set, understandable and generally interpretable results often can be found only by constructing data mining models Table 2. Test statistics and significances of the explanatory variables and their interactions in the AIC minimal adequate models for proportion of archaeophytes. Non-significant variables and their interactions are not shown. R2 = 0.82 Explanatory variable archaeophytes df Deviance P R2 Habitat type 31 90521.2 <0.001 0.760 Riverside position 1 28.5 <0.001 <0.001 Vegetation cover 1 88.4 <0.001 <0.001 Surrounding urban/industrial land 1 660.7 <0.001 0.005 Surrounding agricultural land 1 852.8 <0.001 0.007 Altitudinal floristic region 2 1755.1 <0.001 0.015 Altitude 1 647.7 <0.001 0.005 Temperature 1 28.3 <0.001 <0.001 Precipitation 1 392.8 <0.001 0.003 Habitat type x riverside position n.s. Habitat type x vegetation cover 32 373 <0.001 0.003 Habitat type x urban/industrial 32 480.2 <0.001 0.004 Habitat type x agricultural 32 278.7 <0.001 0.002 Habitat type x human density 32 172.9 <0.001 0.001 Habitat type x floristic region 53 405.3 <0.001 0.003 Habitat type x altitude 32 284.0 <0.001 0.002 From: Habitat type x temperature 32 91.1 <0.001 <0.001 Habitat type x precipitation 32 181.2 <0.001 0.001 Riverside position x precipitation 1 11.4 <0.001 <0.001 Vegetation cover x agricultural 1 34.6 <0.001 <0.001 Vegetation cover x human density 1 4.4 0.036 <0.001 Vegetation cover x floristic region 2 9.4 0.009 <0.001 Vegetation cover x precipitation 1 9.2 0.002 <0.001 Urban/industrial x human density n.s. Urban/industrial x floristic region 2 9.4 0.009 <0.001 Agricultural x altitude n.s. Agricultural x temperature 1 38.0 <0.001 <0.001 Agricultural x precipitation 1 10.0 0.002 <0.001 Floristic region x altitude 2 11.1 0.004 <0.001 Floristic region x temperature 2 8.2 0.016 <0.001 Floristic region x precipitation n.s. Altitude x precipitation 1 6.2 0.013 <0.001 Temperature x precipitation n.s.
  • 20. With a complex data set, understandable and generally interpretable results often can be found only by constructing data mining models R2 = 0.86 From:
  • 21. With a complex data set, understandable and generally interpretable results often can be found only by constructing data mining models R2 = 0.74 ANOVA table and deletion tests for linear minimal adequate model (MAM) of ANCOVA describing population characteristics determining the impact on diversity between invaded and uninvaded pairs of plots. ANOVA on MAM Deletion tests on MAM Source of variation Df SS MS R2 (R2adj.) F Df P Impact on diversity H': Species x Height 13 25.4219 1.9555 1.93 13, 100 0.04 From: Species x Differences in height 13 5.0490 0.3884 2.11 13, 100 0.02 Species x Differences in cover 13 14.1461 1.0882 10 13, 100 < 0.001 Height x Differences in Height 1 0.3246 0.3246 4.44 1, 88 0.04 Height x Differences in cover 1 0.2572 0.2572 4.83 1, 88 0.03 Cover x Differences in cover 1 0.6834 0.6834 6.48 1, 88 0.01 Residuals 87 9.1763 0.1055 Total 129 55.0584 0.83 (0.76) 10.36 42, 87 < 0.001
  • 22. Models are often used to select a manageable number of core measures…A useful subset of predictors … can then be used in building a formal linear model Water run-off Class Cases % =< 6.0 Absent 384 60.3 > 6.0 1 Probability of an alien record Present 253 39.7 N= 637 0 Road density outside Class Cases % Ro =< 0.1 Absent 323 89.7 > 0.1 ad de Present 37 10.3 Terminal node 3 ns -of f ity N= 360 10 run Class Cases % ter km Wa Terminal node 1 Terminal node 2 Absent 61 22.0 Class Cases % Class Cases % Present 216 78.0 Absent 313 94.6 Absent 10 34.5 N= 277 Present 18 5.4 Present 19 65.5 N= 331 N= 29 From:
  • 23. Models are often used to select a manageable number of core measures…A useful subset of predictors … can then be used in building a formal linear model From:
  • 24. Models are often used to select a manageable number of core measures…A useful subset of predictors … can then be used in building a formal linear model The probability of non-native plant presence was associated with water runoff (0–27 million m3) and major 1 road density within 10 km radius outside the KNP boundary (0–0.15 km/km2). Overall significance of the Probability of an alien record model is G2 = 531.54, df=3, P<0.0001. Parameters of the model are given in the Table below. Parameter Estimate ASE Standardized ASE G2 df P estimate Intercept -6.30 0.66 -0.73 0.18 0 Run-off 0.74 0.08 3.37 0.31 264.82 1 0.0001 Road density 69.70 8.59 1.54 0.20 127.96 1 0.0001 Ro ad de ns -of f Run-off × Road -5.90 0.89 -1.63 0.25 49.13 1 0.0001 ity n 10 te r ru density km Wa From:
  • 25. Data mining methods are non parametric • Consequently, unlike with parametric linear models – Nonnormal distribution does not require transformations • Because the trees are invariant to monotonic transformations of continuous predictor variables, no transformation is needed prior to analyses • Outliers among the predictors generally do not affect the models because splits usually occur at non-outlier values • (However, in some circumstances, transformation of the target variable may be important to alleviate variance heterogeneity).
  • 26. Surrogates • Surrogates of each split, describing splitting rules that closely mimic the action of the primary split, can be assessed and ranked according to their association values, with the highest possible value 1.0 corresponding to the surrogate producing exactly the same split as the primary split – Surrogates can then be used to replace an expensive primary explanatory variable by a less expensive or easier to interpret, though probably less accurate one – Surrogates can also be used to build alternative trees on surrogates
  • 27. Surrogates can be used to build alternative trees
  • 28. Surrogates can be used to build alternative trees From:
  • 29. Surrogates can be used to build alternative trees Appendix S6. Alternative model to the optimal classification tree. The alternative categorical split “Run off” (none: no rivers intersected the segment; low: 2-10; medium: 10-15; high: 15- 26 million m3 / quaternary watershed / annum) replaced the continuous primary split “Water run-off” at node 1 of the optimal tree (Fig. 3) as a surrogate with association value = 0.86. Overall misclassification rate of the model is 13.5%, sensitivity 0.90 and specificity 0.80. “Natural areas outside” refers to the percentage of natural areas in a 5-km radius outside the KNP boundary, “Road present inside” refers to the presence or absence of roads in a studied segment inside KNP. Otherwise as in Fig. 3. From:
  • 30. Surrogates • Surrogates of each split, describing splitting rules that closely mimic the action of the primary split, can be assessed and ranked according to their association values, with the highest possible value 1.0 corresponding to the surrogate producing exactly the same split as the primary split – Surrogates also serve to treat missing values, because the alternative splits are used to classify a case when its primary splitting variable is missing
  • 31. Surrogates serve to treat missing values • The fact that data mining techniques can handle data gaps by calculating surrogate variables to replace missing values often enable to use larger dataset than in linear models in which usually all cases with missing values had to be discarded • Consequently, the data mining techniques can often reveal additional factors relevant for predictions that may have been overlooked when tested by classical statistical approaches
  • 32. Calculating surrogate variables to replace missing values …enable to use larger dataset… the data mining techniques thus often reveal additional factors relevant for predictions N = 136
  • 33. Calculating surrogate variables to replace missing values …enable to use larger dataset… the data mining techniques thus often reveal additional factors relevant for predictions Man-made habitat Class % yes success 54.5 no failure 45.5 N = 173 Habitat Pathway: Escaped Class % outdoor, Class % indoor success 72.5 both yes success 28.6 no failure 27.5 failure 71.4 N = 114 N = 59 Terminal node 1 Terminal node 6 Terminal node 7 Class % World region Class % Class % success 91.7 Class % Americas, success 75.0 success 16.6 failure 8.3 Australasia success 64.8 Europe failure 25.0 failure 83.4 N = 23 failure 35.2 N=8 N = 51 N = 91 Terminal node 2 Class % success 86.7 Spatial extent failure 13.3 local, Class % regional, N = 23 international success 51.6 national failure 48.4 N = 68 Terminal node 5 Class % Area infested success 16.3 Class % failure 83.7 <= 0.2 ha success 77.0 > 0.2 ha N = 34 failure 23.0 N = 34 Terminal node 3 Terminal node 4 Class % Class % success 44.6 success 89.8 failure 55.4 failure 10.2 N=9 N = 25 Fig. 1. Optimal classification tree for factors relating to success and failure of 173 eradication campaigns against invertebrate plant pests, weeds and plant pathogens (viruses/viroids, bacteria and fungi) . Overall misclassification rate of the optimal tree is 15.8% compared to 50% for the null model, with 16.7% misclassified success and 14.8% failure cases. Sensitivity is 83.3 and specificity 85.2% for learning samples, and 77.1 and 69.0%, respectively, for cross-validated samples. Submitted to: PLoS One Which factors affect the success or failure of eradication campaigns against alien species? Therese Pluess1, Vojtěch Jarošík2,3*, Petr Pyšek3,2, Ray Cannon4, Jan Pergl3, Annemarie Breukers5, Sven Bacher1 1University of Fribourg, Department of Biology, Ecology & Evolution Unit, Chemin du Musée 10, 1700 Fribourg, Switzerland 2Charles University in Prague, Faculty of Science, Department of Ecology, CZ-128 44 Praha 2, Czech Republic 3Institute of Botany, Academy of Sciences of the Czech Republic, CZ-252 43 Průhonice, Czech Republic 4The Food and Environment Research Agency, Sand Hutton, York YO41 1LZ, United Kingdom 5LEI, part of Wageningen UR, P.O. Box 8130, 6700 EW, Wageningen, The Netherlands
  • 34. Penalization • As it is easier to be a good splitter on a small number of records (e.g., splitting a node with just two records), to prevent missing explanatory variables from having an advantage as splitters, the power of explanatory variables should be penalized in proportion to the degree to which they are missing • High-level categorical predictor variables have inherently higher splitting power than continuous predictors and therefore can also be penalized to level the playing field
  • 35. High-level categorical explanatory variables have inherently higher splitting power than continuous explanatory variables and therefore can also be penalized to level the playing field
  • 36. Weighting • Weighting enables one to give a different weight to each case in analysis • A usual application is on proportional data, where proportions calculated from larger samples give more precise estimates, and therefore, proportional response variables are weighted by their sample sizes • A similar approach can be applied to stratified sampling on strata having different sampling intensities • Data-mining models can also accommodate situations in which some misclassifications are more serious than others – For instance, if invasion risks are classified as low, moderate, and high, it would be more costly to classify a high-risk species as low-risk than as moderate-risk – This can be treated by specifying a differential penalty by weighting different way misclassifying high, moderate, and low risk
  • 37. Weighting: a usual application is on proportional data
  • 38. Weighting: stratified sampling Man-made habitat Class % yes success 54.5 no failure 45.5 N = 173 Habitat Pathway: Escaped Class % outdoor, Class % indoor success 72.5 both yes success 28.6 no failure 27.5 failure 71.4 N = 114 N = 59 Terminal node 1 Terminal node 6 Terminal node 7 Class % World region Class % Class % success 91.7 Class % Americas, success 75.0 success 16.6 failure 8.3 Australasia success 64.8 Europe failure 25.0 failure 83.4 N = 23 failure 35.2 N=8 N = 51 N = 91 Terminal node 2 Class % success 86.7 Spatial extent failure 13.3 local, Class % regional, N = 23 international success 51.6 national failure 48.4 N = 68 Terminal node 5 Class % Area infested success 16.3 Class % failure 83.7 <= 0.2 ha success 77.0 > 0.2 ha N = 34 failure 23.0 N = 34 Terminal node 3 Terminal node 4 Class % Class % success 44.6 success 89.8 failure 55.4 failure 10.2 N=9 N = 25 Fig. 1. Optimal classification tree for factors relating to success and failure of 173 eradication campaigns against invertebrate plant pests, weeds and plant pathogens (viruses/viroids, bacteria and fungi). Overall misclassification rate of the optimal tree is 15.8% compared to 50% for the null model, with 16.7% misclassified success and 14.8% failure cases. Sensitivity is 83.3 and specificity 85.2% for learning samples, and 77.1 and 69.0%, respectively, for cross-validated samples. Submitted to: PLoS One Which factors affect the success or failure of eradication campaigns against alien species? Therese Pluess1, Vojtěch Jarošík2,3*, Petr Pyšek3,2, Ray Cannon4, Jan Pergl3, Annemarie Breukers5, Sven Bacher1 1University of Fribourg, Department of Biology, Ecology & Evolution Unit, Chemin du Musée 10, 1700 Fribourg, Switzerland 2Charles University in Prague, Faculty of Science, Department of Ecology, CZ-128 44 Praha 2, Czech Republic 3Institute of Botany, Academy of Sciences of the Czech Republic, CZ-252 43 Průhonice, Czech Republic 4The Food and Environment Research Agency, Sand Hutton, York YO41 1LZ, United Kingdom 5LEI, part of Wageningen UR, P.O. Box 8130, 6700 EW, Wageningen, The Netherlands
  • 39. Scoring Ageratum houstonianum Chromolaena odorata Xanthium strumarium Argemone ochroleuca Lantana camara Opuntia stricta
  • 41. Scoring Ageratum 77.8% Argemone 84.8% Chromolaena 89.5% Opuntia 95.4% Xanthium 95.6% Lantana 98.1% -25.0 -20.0 -15.0 -10.0 -5.0 0.0 5.0 10.0 15.0 20.0 25.0 Predictability of presence comparing to all aliens with 92.9% of correct predictions
  • 42. Artificial placing some factors at the top of a tree using Splitter Variable Disallow Criteria Area infested Class % <= 4905 ha success 54.5 > 4905 ha failure 45.5 N = 173 Sanitary control Reaction time Class % Class % yes success 66.7 no <= 11 yrs success 32.5 > 11 yrs failure 33.3 failure 67.5 N = 101 N = 72 Terminal node 1 Terminal node 8 Class % Man-made habitat Detectability Class % success 86.1 Class % Class % easy, success 15.2 failure 13.9 yes success 60.8 no difficult success 46.6 failure 84.8 intermediate N = 29 failure 39.2 failure 53.4 N = 32 N = 72 N = 40 Terminal node 2 Terminal node 6 Terminal node 7 Class % Class % Class % success 82.4 Pathway: escaped success 80.1 success 31.8 failure 17.6 Class % failure 19.9 failure 68.2 N = 37 yes success 42.3 no N=9 N = 31 failure 57.7 N = 35 Terminal node 3 Class % World region success 85.7 Class % Australasia, failure 14.3 Americas success 25.7 Europe N=7 failure 74.3 N = 28 Terminal node 4 Terminal node 5 Class % Class % success 59.3 success 11.5 failure 40.7 failure 88.5 N=7 N = 21 Fig. 3. Optimal classification tree with event-specific factors placed at the top of the tree. Overall misclassification rate of the optimal tree is 18.0% with 15.3% misclassified success and 23.4% failure cases. Sensitivity and specificity are respectively 84.7 and 76.6% for learning, and 66.7 and 64.9% for cross-validated samples. Submitted to: PLoS One Which factors affect the success or failure of eradication campaigns against alien species? Therese Pluess1, Vojtěch Jarošík2,3*, Petr Pyšek3,2, Ray Cannon4, Jan Pergl3, Annemarie Breukers5, Sven Bacher1 1University of Fribourg, Department of Biology, Ecology & Evolution Unit, Chemin du Musée 10, 1700 Fribourg, Switzerland 2Charles University in Prague, Faculty of Science, Department of Ecology, CZ-128 44 Praha 2, Czech Republic 3Institute of Botany, Academy of Sciences of the Czech Republic, CZ-252 43 Průhonice, Czech Republic 4The Food and Environment Research Agency, Sand Hutton, York YO41 1LZ, United Kingdom 5LEI, part of Wageningen UR, P.O. Box 8130, 6700 EW, Wageningen, The Netherlands
  • 43. Limitations • Data mining models are good, but not as good to solve completely all problems with data that violate a basic assumption of the independence of errors of observations due to temporal or spatial autocorrelation, or due to a related problem of phylogenetic relatedness • Data mining methods cannot distinguish fixed and random effects and thus cannot be used with mixed effect and nested statistical designs • The tree-growing method is data intensive, requiring many more cases than classical regression – While for multiple regression it is usually recommended to keep the number of explanatory variables six to ten times smaller than the number of observations, for classification trees, at least 200 cases are recommended for binary response variables and about 100 more cases for each additional level of a categorical variable – Efficiency of trees decreases rapidly with decreasing sample size, and for small data sets, no test samples may be available
  • 44. The tree-growing method is data intensive, requiring many more cases than classical regression A. Variable importance for plant richness B. Variable importance for butterfly richness A. area A. altitudinal range B. w asteland D. soil F. natural perim eter F. distance to natural habitat B. arable land A. aspect B. build-up area F. natural perim eter D. soil A. area C. forest B. forest B. w asteland A. aspect A. isolation A. altitudinal range F. built-up perim eter B. forest A. railw ay C. orchards A. altitude B. orchards D. bare bedrock B. pasture B. pasture D. bare bedrock F. distance to built-up area F. built-up perim eter C. arable land C. arable land B. arable land A. isolation C. w asteland A. railw ay C. pasture D. quarry D. quarry C. orchards A. reserve age C. grassland A. altitude C. build-up area B. grassland A. reserve age C. grassland B. grassland C. pasture B. orchards C. w asteland B. build-up area C. build-up area C. forest F. distance to natural habitat E. plants F. distance to built-up area E. com m unities 0 10 20 30 40 50 60 70 80 90 100 0 10 20 30 40 50 60 70 80 90 100 Relative importance (%) Relative importance (%) Figure A1 Rank of importance of the individual explanatory variables from the six groups of explanatory variables (A. Geography, B. Past habitats, C. Present habitats, D. Substrate, E. Vegetation heterogeneity, F. Urbanization) from random forests classifications, used for predicting the above and below median value for each observation of species richness and endangered species across 48 reserves. Variable importance is rescaled to have values between 0 and 100. A. Variable importance for plant richness: final error rate 45.8%, sensitivity 50.0%, specificity 58.3%. B. Variable importance for butterfly richness: final error rate 31.6%, sensitivity 63.6%, specificity 73.1%.
  • 45. Hints for further development: a problem of species relatedness • The problem is that related species can have similar traits due to inherited traits from a common ancestor • Consequently, the traits of related species are not independent, and in statistical analyses, this relatedness have to be taken into account • Considering effects of species relatedness is important not only to treat a lack of statistical independence, but also to solve practical implications – For instance, when one need to know how species belonging to a particular taxon (like family, order, or class) are predisposed to invasion
  • 46. Species relatedness approximated by taxonomic hierarchy • As a first approximation, instead of a real relatedness based on inherited traits from common ancestors described by a phylogenetic tree, we can use taxonomic hierarchy, supposing that the species relatedness is increasing from subkingdoms within kingdoms to subspecies within species • First, the trees can be constructed only with the highest taxonomic level and then including each lower taxonomic level, one after another – This successional treatment of taxonomy enable to reveal at which taxonomic level the examined trait has the largest effect
  • 47. Species relatedness approximated by taxonomic hierarchy • As a first approximation, instead of a real relatedness based on inherited traits from common ancestors, we can use taxonomic hierarchy, supposing that the species relatedness is increasing from subkingdoms within kingdoms to subspecies within species • Second, the splitting power of taxonomic levels in regression and classification trees can be penalized proportionally to the number of categories at each taxonomic level, taking advantage of that that the number of units at the taxonomic levels is increasing from the highest to the lowest – This penalization, weighting the taxonomic levels from the lowest to the highest proportionally to the number of categories at each level, thus can reveal a real weighted effect of a particular taxon on a particular species trait
  • 48. Species relatedness following phylogenetic tree • The best way how to treat species relatedness is to use the actual species phylogenetic distances calculated from a specific phylogenetic tree • Technically, this can be done by calculating principal coordinate axes, that describe distances between all taxa in a phylogenetic tree • However, their use in CART® is hindered by the fact that these principal coordinate axes are orthogonal, and classification and regression trees exhibit their greatest strengths with a highly nonlinear structure and complex interactions – Their usefulness decreases with increasing linearity of the relationships, and consequently, on mutually independent principle coordinates, no trees are built
  • 49. Hints for further development: a problem of species relatedness The problem can be solved by using the Splitter Variable Disallow Criteria in CART®
  • 50. The problem of species relatedness could be solved by artificial placing some factors at the top of a tree • When species relatedness approximated by taxonomic hierarchy, the Splitter Variable Disallow Criteria can be directly used for appropriate placing the individual levels of the nested taxonomic hierarchy
  • 51. The problem of species relatedness could be solved by artificial placing some factors at the top of a tree • When species relatedness directly follows phylogenetic distances between taxa on a phylogenetic tree, the Splitter Variable Disallow Criteria can be again directly used to place the individual splits • A further development of Splitter Variable Disallow Criteria might enable to include also the real distances between the individual taxa, by developing a disallow criteria defining not only the split region, but also the distances between the individual splits, using e.g. a predefined importance values for the distances among the splits – This might be useful also for other applications in which it can appear useful to place some predictors not only deliberately at a chosen place in a tree structure, but also to define the precise distance between the predictors
  • 52. Acknowledgements • Co-authors Petr Pyšek Jan Pergl
  • 53. Acknowledgements • Wikipedia for freely available pictures
  • 54. Acknowledgements • Grants – ALARM: Assessing LArge-scale environmental Risks with tested Methods (European Union 6th Framework Programme), DAISIE: Delivering Alien Invasive Species Inventories for Europe (European Union 6th Framework Programme), PRATIQUE: Enhancements of Pest Risk Analysis Techniques (European Union 7th Framework Programme, 2008–2011) – Czech Science Foundation grant no. 206/09/0563 – Long-term research projects no. RVO 67985939 from the Academy of Sciences of the Czech Republic, MSM 21620828 and LC06073 from the Ministry of Education, Youth and Sports of the Czech Republic – Praemium Academiae award from the Academy of Sciences of the Czech Republic to Petr Pyšek
  • 55. Acknowledgements • And last but not least, …