This document summarizes a study that used feature selection and classification methods to identify tree species in high-resolution satellite images. The researchers tested 35 features on over 1000 ground reference samples to rank their effectiveness for classification. They found that 6 spectral features performed best when used in a 5-nearest neighbor classifier, achieving over 80% accuracy for tree species identification. While species proportions were estimated accurately, stem numbers per species showed only moderate correlation with field data. Future work could explore more advanced classifiers, cross-validation, and improving stem number estimation.
Feature Selection for Tree Species ID in VHR Satellite Images
1. Feature Selection for Tree Species Identification in Very High Resolution Satellite Images. Matthieu Molinier and Heikki Astola, VTT Technical Research Centre of Finland, [email_address], [email_address]. IGARSS 2011, Vancouver, 28.7.2011
3. NewForest approach in forest variable estimation
- Modelling based on satellite image pixel reflectances and contextual features
- Individual tree crown (ITC) detection and crown width estimation
- Combining both data sources to predict total amount and size variation by species: segmentation estimates are refined into more accurate species-wise estimates
7. Input for feature selection: 35 + 4 features
- SPECTRAL (5), set A: R, G, B, NIR and PAN mean intensity within a 1.5 m radius around tree candidates (TC)
- CONTEXTUAL (9), set B, from PAN within a 7.5 m radius around the TC: mean; mean/median; skewness; kurtosis; contrast; pm1 (mean of brightest pixels); ps1 (std of brightest pixels); pm2 (mean of darkest pixels); ps2 (std of darkest pixels)
- SEGMENT-WISE (21), set C, from PAN over 3 segment sizes (50 m², 85 m², 125 m²): mean; mean/median; skewness; kurtosis; std (standard deviation); pmean (partial mean); pstd (partial standard deviation)
- Probe variables (4): random vectors or random permutations of a feature vector: probe_gauss1, probe_gauss2, probe_shuffle1, probe_shuffle2
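The partial statistics above (pmean, pstd) can be sketched as trimmed statistics over a segment's pixel values. This is a minimal illustration, not the authors' code; the 10% trim on each side is an assumed value, since the slide does not state the exact fraction.

```python
import statistics

def partial_stats(values, trim=0.1):
    """Partial mean and std (pmean, pstd): statistics computed after
    cutting the tails of the pixel distribution. The trim fraction is
    an assumption for illustration only."""
    v = sorted(values)
    k = round(len(v) * trim)
    core = v[k:len(v) - k] or v  # fall back to all values for tiny inputs
    return statistics.mean(core), statistics.pstdev(core)
```

Discarding both tails makes the segment statistics less sensitive to a few shadowed or saturated pixels than the plain mean and standard deviation.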
8. Class definitions and training scheme
- Whole dataset: 1164 samples (900 trees, 264 non-trees)
- Model design: 773 samples (2/3); testing: 391 samples (1/3)
- Model design further split into training: 512 (2/3) and validation: 261 (1/3)
- Stratified sampling to preserve class proportions
- Training set used for model building; validation set for ranking
- Classes: 1 pine, 2 spruce, 3 deciduous (tree classes); 4 shadow, 5 open area / sunlit, 6 bare ground, 7 green vegetation (non-tree classes)
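The stratified 2/3 - 1/3 scheme above can be sketched as follows. This is an illustrative implementation, not the authors' code; exact subset sizes depend on per-class rounding.

```python
import random
from collections import defaultdict

def stratified_split(labels, frac=2 / 3, rng=None):
    """Split sample indices into (design, test) while preserving the
    class proportions, as in the slide's stratified sampling scheme."""
    rng = rng or random.Random(0)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    design, test = [], []
    for idx in by_class.values():
        rng.shuffle(idx)
        cut = round(len(idx) * frac)  # 2/3 of each class to the design set
        design.extend(idx[:cut])
        test.extend(idx[cut:])
    return design, test
```

Applying the same split twice reproduces the nested scheme: first whole dataset into model design / testing, then model design into training / validation.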
9.
10.
11.
- 6-10 features are enough; spectral features performed best; segment-wise features are not suited to a mixed-species study
- Overall classification accuracy on tree classes: over 80%
- Probe variables were ranked among the first places more often with LDA than with kNN: the linear classifier is too simple, while quadratic LDA was overfitting
- kNN with k=5 gave the best overall performance and the smallest gap between training and validation error, hence a lower risk of overfitting
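The best-performing classifier above, 5-nearest neighbours over the selected spectral features, can be sketched in a few lines. This is a minimal illustration of the technique, not the authors' implementation; the toy feature vectors below are hypothetical.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, x, k=5):
    """Plain k-nearest-neighbour vote (k=5, the value the study found
    best), using Euclidean distance over the selected features."""
    neighbours = sorted(
        (math.dist(xi, x), yi) for xi, yi in zip(train_X, train_y)
    )[:k]
    votes = Counter(y for _, y in neighbours)
    return votes.most_common(1)[0][0]
```

With k=5 the decision boundary is nonlinear but the model stays simple, which matches the slide's observation of a small training-to-validation error gap.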
This presentation was supposed to be given on Monday but I could not make it because the flight from London was delayed.
Good image quality, not many clouds. Spruce is the dominant species.
100% pure species plots were used for training because:
- mixed plots are not good for contextual features (the radius of analysis around the tree overlaps neighbouring trees)
- this makes it possible to obtain both the plot-level error and the stand-level error
- the datasets should not be mixed: train and test data were not measured in exactly the same way, nor by the same operators (one by forest centres, public; one by a private company)
1.5 m radius: signature of the tree only. 7.5 m radius: context, i.e. neighbouring trees. Partial mean and std: cutting the tails of the distributions. Probe variables are random vectors, or permutations of a feature vector over all trees / samples, so they are obviously not related or correlated to the target forest variables. The question is then: are the true variables more relevant than the probes? If a variable is ranked worse than a probe for a given classification task, it should not be selected.
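The probe idea above can be sketched in two small functions: generating the probes, and applying the selection rule that any feature ranked below a probe is discarded. This is an illustrative sketch; the function names are not from the paper, only the probe naming convention is.

```python
import random

def make_probes(feature, rng):
    """Build one gaussian and one shuffle probe (cf. probe_gauss1,
    probe_shuffle1 on the slides). A gaussian probe is pure noise; a
    shuffle probe permutes a real feature over all samples, keeping its
    distribution but destroying any link to the target variable."""
    probe_gauss = [rng.gauss(0.0, 1.0) for _ in range(len(feature))]
    probe_shuffle = list(feature)
    rng.shuffle(probe_shuffle)
    return probe_gauss, probe_shuffle

def filter_by_probes(ranking, probe_names):
    """Keep only the features ranked better than the first probe:
    anything a noise variable outranks should not be selected."""
    kept = []
    for name in ranking:  # `ranking` is best-first
        if name in probe_names:
            break
        kept.append(name)
    return kept
```

For example, if a ranking places a probe third, only the two features ahead of it survive the filter.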
The primary interest is in the tree classes. Keep the class proportions when splitting.
ATTENTION: correlation between two features does not mean we can eliminate one of them from the selection (see the toy examples from Guyon et al.).
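A tiny synthetic example makes the point concrete. It is in the spirit of the toy examples from Guyon et al., not their exact data: two features are strongly correlated through a shared component, yet the class depends on their difference, so eliminating either feature destroys the class information.

```python
import random

rng = random.Random(0)
data = []
for _ in range(200):
    base = rng.gauss(0.0, 1.0)       # shared component -> high correlation
    x1 = base + rng.gauss(0.0, 0.1)
    x2 = base + rng.gauss(0.0, 0.1)
    data.append((x1, x2, 1 if x1 > x2 else 0))  # class = sign of difference

def accuracy(predict):
    return sum(predict(x1, x2) == y for x1, x2, y in data) / len(data)

acc_x1_alone = accuracy(lambda x1, x2: int(x1 > 0))   # near chance level
acc_both = accuracy(lambda x1, x2: int(x1 - x2 > 0))  # perfect by design
```

Each feature alone is nearly useless for this class, but together they separate it perfectly, which is why correlation-based pruning can discard exactly the wrong variable.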
Always start from the simplest model, i.e. linear; then move to nonlinear models, but keep them simple.