A workshop on “Quantifying Error in Training Data for Mapping and Monitoring the Earth System” was held on January 8–9, 2019 at Clark University, with support from Omidyar Network’s Property Rights Initiative, now PlaceFund.
Generating Training Data from Noisy Labels
LEAD GEOSPATIAL DATA SCIENTIST
ML Hub Earth
Machine Learning commons for EO
Standards and best practices
Global Land Cover Training Dataset
Human-verified training dataset
Using open-source Sentinel-2 imagery
10 m spatial resolution.
Global and geo-diverse
10 Sentinel-2 bands: Red, Green, Blue, Red-Edge1-3, NIR, Narrow NIR, SWIR1-2
20 m bands resampled to 10 m using bicubic interpolation (see the sketch after this list)
GlobeLand30 labels for 2010 used as a source
Classes mapped to REF Land Cover Taxonomy
Labels re-gridded to Sentinel-2 grid using nearest neighbor
Labels filtered by agreement with classes from Sentinel-2’s 20 m scene classification layer (produced as part of atmospheric correction)
Filtered labels used as reference labels for training (see the sketch below)
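To make these label-preparation steps concrete, here is a minimal sketch using NumPy and SciPy on synthetic arrays. The class IDs, the class mapping, and the label-vs-SCL agreement rule are illustrative assumptions, not the dataset's actual values:

```python
import numpy as np
from scipy.ndimage import zoom

# --- Synthetic stand-ins for real rasters (illustrative shapes/values) ---
# A 20 m Sentinel-2 band (e.g. Red-Edge1), to be brought to the 10 m grid.
band_20m = np.random.rand(50, 50).astype(np.float32)

# GlobeLand30 labels at 30 m; class IDs here are placeholders, not the
# actual GlobeLand30 codes.
labels_30m = np.random.randint(1, 6, size=(34, 34)).astype(np.int16)

# 1) Bicubic interpolation (order=3) to scale 20 m bands to 10 m.
band_10m = zoom(band_20m, 2, order=3)

# 2) Map source classes to the target taxonomy (hypothetical mapping).
class_map = {1: 10, 2: 20, 3: 30, 4: 40, 5: 50}
mapped = np.vectorize(class_map.get)(labels_30m).astype(np.int16)

# 3) Re-grid labels to the Sentinel-2 10 m grid with nearest neighbor
#    (order=0 keeps labels categorical); 30 m -> 10 m is a factor of 3.
labels_10m = zoom(mapped, 3, order=0)[:100, :100]

# 4) Keep only labels whose class is consistent with the Sentinel-2 scene
#    classification (SCL). The consistency table below is a hypothetical
#    placeholder for the real agreement rules.
scl_10m = np.random.randint(0, 12, size=(100, 100))  # SCL classes 0-11
scl_to_taxonomy = {4: 30, 5: 40, 6: 10}              # e.g. vegetation, bare, water
expected = np.vectorize(lambda s: scl_to_taxonomy.get(s, -1))(scl_10m)
agreement = labels_10m == expected
reference_labels = np.where(agreement, labels_10m, 0)  # 0 = filtered out
```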
A pixel-based supervised Random Forests model trained for each scene.
Pixels without valid reflectance are excluded from training.
Training on class-stratified samples of half the pixels in a scene, with one 10 m Sentinel-2 pixel paired to each 30 m label pixel (see the sketch after this list).
Predictions are made on all pixels marked with usable classes during Level-2A
processing, including pixels labeled as unclassified.
Annual labels will be generated by aggregating time series of predictions and
probabilities from the same tile throughout the year.
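As an illustration of this per-scene workflow, the sketch below trains a Random Forest on a class-stratified half of the valid pixels, predicts class probabilities, and averages them across dates into an annual label. Shapes, class IDs, and hyperparameters are assumptions; the slides do not specify them:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic scene: 10 spectral bands per pixel, plus reference labels.
n_pixels, n_bands = 10_000, 10
classes = np.array([10, 20, 30, 40])
X = rng.random((n_pixels, n_bands)).astype(np.float32)
y = rng.choice(classes, size=n_pixels)
valid = rng.random(n_pixels) > 0.1        # pixels with valid reflectance

# Exclude invalid pixels, then take a class-stratified 50% sample.
Xv, yv = X[valid], y[valid]
X_train, _, y_train, _ = train_test_split(
    Xv, yv, train_size=0.5, stratify=yv, random_state=0)

# One pixel-based Random Forest per scene (hyperparameters assumed).
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# Predict class probabilities for all usable pixels in this scene.
proba = model.predict_proba(Xv)           # shape: (n_valid, n_classes)

# Annual aggregation (sketch): average probabilities over the dates in
# the year's time series for the same tile, then take the argmax.
proba_by_date = [proba, proba]            # stand-in for multiple dates
annual_proba = np.mean(proba_by_date, axis=0)
annual_label = model.classes_[np.argmax(annual_proba, axis=1)]
```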
88.75% average model accuracy across 4 diverse scenes.
Some classes, like water and snow/ice, predicted with high accuracy and high
confidence across all scenes.
Other classes, like wetland and (semi-)natural vegetation, are subtler and were
expected to be more difficult to classify.
Woody vegetation and cultivated vegetation were predicted relatively
accurately and were not confused with each other, as a result of including the
20 m red-edge bands resampled to 10 m.
Artificial bare ground tended to be predicted in regions left unclassified in
the reference data, taking over areas of natural bare ground and cultivated
vegetation; this suggests that traces of human activity lead to pixels being
classified as artificial bare ground in the off-vegetation season.
What about non-categorical variables?
True value of categorical variables vs. true value of continuous variables:
all measurements of continuous variables are prone to uncertainty (noise and bias).
How to reduce/eliminate these uncertainties in training data?
Noisy and biased measurement systems
slide courtesy of K. McColl
Generating Training Dataset
Triple collocation (TC) is a technique for estimating the unknown error standard
deviations (or RMSEs) of three mutually independent measurement systems,
without treating any one system as zero-error “truth”.
$$Q_{ij} \equiv \mathrm{Cov}(X_i, X_j)$$

$$\sigma_{\varepsilon_i} = \sqrt{Q_{ii} - \frac{Q_{ij}\,Q_{ik}}{Q_{jk}}}, \qquad i \neq j \neq k$$
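To make the estimator concrete, here is a minimal NumPy sketch on three synthetic collocated series with independent errors; the series names and noise levels are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Unknown "truth" plus independent noise for three measurement systems.
truth = rng.standard_normal(5_000)
x1 = truth + 0.1 * rng.standard_normal(5_000)
x2 = truth + 0.2 * rng.standard_normal(5_000)
x3 = truth + 0.3 * rng.standard_normal(5_000)

Q = np.cov([x1, x2, x3])  # Q[i, j] = Cov(X_i, X_j)

# sigma_eps_i = sqrt(Q_ii - Q_ij * Q_ik / Q_jk) for distinct i, j, k.
sigma = np.array([
    np.sqrt(Q[0, 0] - Q[0, 1] * Q[0, 2] / Q[1, 2]),
    np.sqrt(Q[1, 1] - Q[0, 1] * Q[1, 2] / Q[0, 2]),
    np.sqrt(Q[2, 2] - Q[0, 2] * Q[1, 2] / Q[0, 1]),
])
print(sigma)  # approximately [0.1, 0.2, 0.3]
```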
TC-based RMSE estimates at each pixel are used to compute an a priori probability
($P_i$) of selecting a particular dataset:
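The exact formula for $P_i$ is not preserved here; as an illustration only, a plausible choice is normalized inverse-error-variance weighting. This is an assumption, not necessarily the slide's actual formula:

```python
import numpy as np

# Hypothetical weighting: inverse-error-variance, normalized to sum to 1.
rmse = np.array([0.1, 0.2, 0.3])            # TC-based RMSEs at one pixel
P = (1.0 / rmse**2) / np.sum(1.0 / rmse**2)
print(P)  # higher selection probability for the lower-error dataset
```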
[Figure: sample time series of a pixel from the three measurement systems X1, X2, X3]