Interpreting yield variation in commercial production of crops / Como interpretar la variación de la productividad a partir de información comercial de cultivos

www.ciat.cgiar.org Eco-Efficient Agriculture to Reduce Poverty
Interpreting yield variation in commercial production of crops
DAPA (Decision and Policy Analysis Program)

What we do
• Farmers' production experiences / commercial production of crops
• Principles of operational research
• Modern information technology
• Observations made by farmers according to their particular circumstances (e.g. kg/tree, temperature, age)
• Environmental characterization of the production system
• Analysis of the observations to optimize the system

Introduction
The challenges: parametric or non-parametric? The reality: the distribution of yield

Introduction
The challenges: parametric or non-parametric? It depends on the distribution of the residuals
PARAMETRIC
• Models rely on assumptions of:
  • Normality
  • Homogeneity of variance
  • Independence
• Mostly based on linear relationships
NON-PARAMETRIC
• Models do not rely on these distributional assumptions
• Linear / non-linear relationships

Introduction
The challenges: parametric or non-parametric?
As Sharon quoted, "the wisdom of the internet":
"I have never come across a situation where a normal test is the right thing to do. When the sample size is small, even big departures from normality are not detected, and when your sample size is large, even the smallest deviation from normality will lead to a rejected null"
http://stackoverflow.com/questions/7781798/seeing-if-data-is-normally-distributed-in-r
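The sample-size effect described in the quote can be illustrated with a minimal sketch. It uses the Jarque-Bera statistic, which is only one of several normality tests and is not named in the slides. The yield values are hypothetical, and repeating the same pattern is just a device to hold the shape of the distribution fixed while n grows (it is not independent sampling).

```python
def jarque_bera(sample):
    """Jarque-Bera statistic: n/6 * (skewness^2 + excess_kurtosis^2 / 4)."""
    n = len(sample)
    mean = sum(sample) / n
    m2 = sum((x - mean) ** 2 for x in sample) / n
    m3 = sum((x - mean) ** 3 for x in sample) / n
    m4 = sum((x - mean) ** 4 for x in sample) / n
    skewness = m3 / m2 ** 1.5
    excess_kurtosis = m4 / m2 ** 2 - 3.0
    return n / 6.0 * (skewness ** 2 + excess_kurtosis ** 2 / 4.0)

# 5% critical value of a chi-square distribution with 2 degrees of freedom
CHI2_2DF_5PCT = 5.991

# Hypothetical, mildly right-skewed yield pattern (not data from the study)
pattern = [1.0, 1.2, 1.4, 1.6, 1.8, 2.0, 2.2, 2.4, 2.6, 3.4]
small = pattern * 2      # n = 20
large = pattern * 200    # n = 2000, identical shape

jb_small = jarque_bera(small)
jb_large = jarque_bera(large)
print("n = 20:   JB =", round(jb_small, 2), "reject normality?", jb_small > CHI2_2DF_5PCT)
print("n = 2000: JB =", round(jb_large, 2), "reject normality?", jb_large > CHI2_2DF_5PCT)
```

With the shape held fixed, the statistic grows in proportion to n, so the same departure from normality goes undetected in the small sample and is rejected in the large one.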

Introduction
The challenges
"The wisdom of" Nassim Nicholas Taleb, a "superhero of the mind" (The Black Swan, Fooled by Randomness, Antifragile)
The statistical regress argument:
"We need the data to tell us what the probability distribution is, and a probability distribution to tell us how much data we need"

Introduction
The challenges: in terms of Big Data
• Approaching "N = All"
• Collect and use a lot of data rather than settle for small amounts or samples, as researchers have done for well over a century
• We can learn from a large body of information things that we could not comprehend when we used only smaller amounts
• Sometimes to inform is better than to explain: looking for patterns
• Doctors save lives in Canada by knowing that something is likely to occur; this can be far more important than understanding exactly why
Source: Big Data (Foreign Affairs magazine / McKinsey's High Tech)

Introduction
The challenges: parametric or non-parametric? Yield is not always normally distributed
[Figure: "What people think it is…" versus "What it actually is…", as was clear to Antoine de Saint-Exupéry (The Little Prince)]
[Figure: "What people think it is…" versus "What it actually is…": some of our findings]

Analytical approaches
Supervised models: parametric and non-parametric
Independent variables / inputs / predictors: V1 … V60 and L1 … L5
Dependent variable / output / response (known): Kg/plot
Obs      V1    V2  V3  V4   V5   …  V60  L1  L2  L3  L4  L5  …  Kg/plot
Obs 1    0.1   18  3   312  0.3  …  89   0   1   0   1   0   …  2.39
Obs 2    0.2   15  4   526  0.1  …  52   1   0   0   0   1   …  30.35
Obs 3    0.6   14  1   489  0.2  …  64   0   1   1   1   1   …  42.25
Obs 4    0.05  19  2   523  0.5  …  13   0   0   0   0   1   …  52.50
Obs 5    0.4   13  3   214  0.6  …  57   1   1   1   1   1   …  70.52
Obs 6    0.8   12  4   265  0.4  …  24   1   1   0   1   0   …  82.25
Obs 7    0.2   15  1   236  0.8  …  26   0   0   1   0   0   …  89.28
Obs 8    0.1   17  3   541  0.1  …  35   0   1   1   1   0   …  125.0
Obs 9    0.6   16  2   845  0.3  …  51   0   0   1   1   0   …  142.8
Obs 10   0.1   18  1   126  0.1  …  43   1   1   0   0   1   …  150.0
…        …     …   …   …    …    …  …    …   …   …   …   …   …  …
Obs 3000 0.04  15  3   235  0.6  …  85   1   1   1   1   0   …  180

Analytical approaches: parametric and non-parametric
Unsupervised models
Obs      V1    V2  V3  V4   V5   …  V60  L1  L2  L3  L4  L5
Obs 1    0.1   18  3   312  0.3  …  89   0   1   0   1   0
Obs 2    0.2   15  4   526  0.1  …  52   1   0   0   0   1
Obs 3    0.6   14  1   489  0.2  …  64   0   1   1   1   1
Obs 4    0.05  19  2   523  0.5  …  13   0   0   0   0   1
Obs 5    0.4   13  3   214  0.6  …  57   1   1   1   1   1
Obs 6    0.8   12  4   265  0.4  …  24   1   1   0   1   0
Obs 7    0.2   15  1   236  0.8  …  26   0   0   1   0   0
Obs 8    0.1   17  3   541  0.1  …  35   0   1   1   1   0
Obs 9    0.6   16  2   845  0.3  …  51   0   0   1   1   0
Obs 10   0.1   18  1   126  0.1  …  43   1   1   0   0   1
…        …     …   …   …    …    …  …    …   …   …   …   …
Obs 3000 0.04  15  3   235  0.6  …  85   1   1   1   1   0
Self-Organizing Maps (SOM): similar observations lie close to each other in the visualization space
[Figure: observations projected onto two axes (Axis1, Axis2)]
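How a Kohonen map places similar observations in nearby cells can be sketched as follows. This is a minimal, illustrative implementation under stated assumptions, not the software used in the case studies: the grid size, learning-rate schedule, neighbourhood radius and the two groups of observations are all hypothetical.

```python
import math
import random

def best_matching_unit(grid, x):
    """Grid coordinate of the prototype closest to observation x."""
    return min(grid, key=lambda node: sum((w - v) ** 2 for w, v in zip(grid[node], x)))

def train_som(data, rows=4, cols=4, epochs=200, seed=0):
    """Train a small rectangular Kohonen map; returns {(row, col): prototype}."""
    rng = random.Random(seed)
    dim = len(data[0])
    grid = {(r, c): [rng.random() for _ in range(dim)]
            for r in range(rows) for c in range(cols)}
    total_steps = epochs * len(data)
    step = 0
    for _ in range(epochs):
        for x in rng.sample(data, len(data)):      # shuffled pass over the data
            frac = step / total_steps
            rate = 0.5 * (1.0 - frac)              # learning rate decays towards 0
            radius = 0.5 + 2.0 * (1.0 - frac)      # neighbourhood shrinks to 0.5
            bmu = best_matching_unit(grid, x)
            for node, proto in grid.items():
                d2 = (node[0] - bmu[0]) ** 2 + (node[1] - bmu[1]) ** 2
                pull = rate * math.exp(-d2 / (2.0 * radius ** 2))
                for i in range(dim):
                    proto[i] += pull * (x[i] - proto[i])
            step += 1
    return grid

# Hypothetical, already-scaled observations forming two distinct groups
rng = random.Random(1)
low = [[0.1 + rng.uniform(-0.05, 0.05) for _ in range(3)] for _ in range(10)]
high = [[0.9 + rng.uniform(-0.05, 0.05) for _ in range(3)] for _ in range(10)]

som = train_som(low + high)
low_cells = {best_matching_unit(som, x) for x in low}
high_cells = {best_matching_unit(som, x) for x in high}
print("cells used by the low group: ", sorted(low_cells))
print("cells used by the high group:", sorted(high_cells))
```

Each observation is assigned to the cell whose prototype is closest to it, so the two groups end up in separate regions of the grid.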

1st case study: Andean blackberry, based on ANNs
Supervised models: non-linear regression
[Figure: histogram of the Andean blackberry yield distribution (kg/plant/week); y-axis: number of observations]
[Figure: scatter plot of MLP-predicted yield versus real Andean blackberry yield (kg/plant/week), using only the validation dataset]
Coefficient of determination: R² = 0.892 (≈ 0.89)
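For reference, a coefficient of determination between predicted and real yields can be computed as in the sketch below. The slide does not state which definition was used (1 − SS_res/SS_tot, as here, or the squared correlation of the scatter plot), and the yield values are hypothetical.

```python
def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot

# Hypothetical validation yields in kg/plant/week (not data from the study)
observed = [0.2, 0.5, 0.8, 1.1, 1.4]
predicted = [0.3, 0.4, 0.9, 1.0, 1.5]

r2 = r_squared(observed, predicted)
print("R2 =", round(r2, 3))
```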

Results: Andean blackberry
Sensitivity matrix
[Figure: sensitivity distribution of the model with respect to the inputs/predictors; y-axis: %Sensitivity, ticks from 0 to 0.08; x-axis, predictors in plotted order: EffDepth, TempAvg_1, Na_un_chical, Na_un_cusba, TempAvg_0, TempAvg_2, TempAvg_3, ExtDrain, PrecAcc_1, Trmm_3, Nar-Cal, Cal_riosu_zr, Srtm, Slope, PrecAcc_0, Trmm_2, Na_un_cusal, Trmm_0, PrecAcc_3, TempRang_0, TempRang_2, AB_Thorn_N, Na_un_lajac, PrecAcc_2, Trmm_1, IntDrain, TempRang_3, TempRang_1]
Highlighted predictors: effective soil depth, temperature averages, geographic location
Jiménez, D., Cock, J., Satizábal, F., Barreto, M., Pérez-Uribe, A., Jarvis, A. and Van Damme, P., 2009. Computers and Electronics in Agriculture 69 (2): 198–208
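One common way to obtain a sensitivity profile of this kind is to perturb each input slightly and record how much the model output changes. The slides do not specify the exact measure used in the study, so the sketch below is an assumption: `toy_model` is a hypothetical stand-in for a trained network, and the finite-difference scheme and normalisation are illustrative choices.

```python
def sensitivity_shares(model, rows, delta=0.01):
    """Mean absolute finite-difference sensitivity per input, normalised to sum to 1."""
    n_inputs = len(rows[0])
    totals = [0.0] * n_inputs
    for row in rows:
        base = model(row)
        for j in range(n_inputs):
            bumped = list(row)
            bumped[j] += delta                       # perturb one input at a time
            totals[j] += abs(model(bumped) - base) / delta
    grand_total = sum(totals)
    return [t / grand_total for t in totals]

def toy_model(x):
    """Hypothetical stand-in for a trained model: input 2 has no effect."""
    return 3.0 * x[0] + x[1] ** 2 + 0.0 * x[2]

# Hypothetical scaled observations (three predictors each)
rows = [[0.2, 0.5, 0.1], [0.4, 0.5, 0.9], [0.6, 0.5, 0.3]]
shares = sensitivity_shares(toy_model, rows)
print([round(s, 3) for s in shares])
```

Ranking the shares from largest to smallest gives a bar chart of the same form as the sensitivity distribution.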

Unsupervised model: visualization, component planes (SOM)
[Figure: (a) Kohonen map displaying the resultant 6 clusters and their labels according to yield values; (b) component plane of Andean blackberry yield. The scale bar (right) indicates the range of productivity in kg/plant/week: the upper side shows high yield values, the lower side low values]

[Figure: component plane of effective soil depth. The scale bar (right) indicates the range of soil depth in cm: the upper side of the scale shows high values, the lower side low values]

[Figure: component planes of the temperature averages. In all figures, the scale bar (right) indicates the range of temperature in °C: the upper side shows high values, the lower side low values]

Visualization: component planes (SOM)
[Figure: component planes of the specific geographic areas Nariño - La Union - Chical Alto (left) and Nariño - La Union - Cusillo Bajo (right). Because these are categorical variables, the highest values indicate presence and the lowest absence]

Drawbacks
• Crop management factors not included (only variety)
• Only non-parametric approaches (based on ANNs)
• Limited spatial variation (two locations, two departments)
Advantages
• Predictor-predictor and predictor-response dependencies revealed through Kohonen maps
• Combination of factors
• Non-linear approach

2nd case study: Lulo
Supervised modelling
[Figure: histogram of the lulo yield distribution (g/plant/week); y-axis: number of observations]
[Figure: distribution of the R² obtained with each model (MLP, robust regression); x-axis: R² from 0.2877 to 0.8227; y-axis: number of observations]
Regression         R² (mean)   95% confidence interval
Robust (linear)    0.65        0.63 - 0.66
MLP (non-linear)   0.69        0.67 - 0.70
Both models explained more than 60% of the variability in lulo production

Results: Lulo
The sensitivity matrix
[Figure: sensitivity distribution of the model with respect to the inputs/predictors; y-axis: %Sensitivity, ticks from 0 to 0.18]
Highlighted predictors: temperature averages, slope
Jiménez, D., Cock, J., Jarvis, A., Garcia, J., Satizábal, H.F., Van Damme, P., Pérez-Uribe, A. and Barreto, M., 2010. Interpretation of Commercial Production Information: A case study of lulo, an under-researched Andean fruit. Agricultural Systems 104 (3): 258-270

Results: Lulo
Unsupervised model: clustering, component planes (SOM)
The three most relevant variables were used to train a Kohonen map and identify clusters of Homogeneous Environmental Conditions (HECs)
[Figure: (a) U-matrix displaying the distance among prototypes. The scale bar (right) indicates the distance values: the upper side shows high distances, the lower side low distances; (b) Kohonen map displaying the 3 clusters obtained with the K-means algorithm and the Davies–Bouldin index]
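The Davies–Bouldin index scores a partition by comparing within-cluster scatter with the distance between cluster centroids; lower values indicate more compact, better-separated clusters, which is how it can guide the choice of the number of clusters. The sketch below is illustrative only: the points and the two candidate partitions are hypothetical and are not the lulo prototypes.

```python
import math

def euclid(p, q):
    """Euclidean distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def centroid(points):
    """Coordinate-wise mean of a list of points."""
    return [sum(coord) / len(points) for coord in zip(*points)]

def davies_bouldin(clusters):
    """Davies-Bouldin index of a partition given as a list of clusters."""
    centres = [centroid(c) for c in clusters]
    scatter = [sum(euclid(p, m) for p in c) / len(c) for c, m in zip(clusters, centres)]
    k = len(clusters)
    worst_ratios = [
        max((scatter[i] + scatter[j]) / euclid(centres[i], centres[j])
            for j in range(k) if j != i)
        for i in range(k)
    ]
    return sum(worst_ratios) / k

# Hypothetical points forming two well-separated groups
group_a = [(0, 0), (0, 1), (1, 0), (1, 1)]
group_b = [(10, 10), (10, 11), (11, 10), (11, 11)]

good_partition = [group_a, group_b]
bad_partition = [group_a[:2] + group_b[:2], group_a[2:] + group_b[2:]]

db_good = davies_bouldin(good_partition)
db_bad = davies_bouldin(bad_partition)
print("good partition:", round(db_good, 3), "bad partition:", round(db_bad, 3))
```

In practice the index would be evaluated for the K-means result at each candidate number of clusters, keeping the partition with the lowest value.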

Results: Lulo
Clustering, component planes (SOM)
A mixed model with the categorical variables of three HECs, location and farm explained more than 80% of the variation in lulo yield
Variance components of the mixed model estimations (model including categorical variables of 3 HECs, location and farm)
Parameter   Estimate (g/plant/week)   Standard error   % of total variance
HEC         1.85                      2.01             61.2%
Location    0.07                      0.20             2.5%
Site-Farm   0.57                      0.21             19.0%
Error       0.52                      0.04             17.3%
Total                                                  100.0%
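The "more than 80%" figure follows from the variance components: everything that is not residual error is attributed to HEC, location or farm. A minimal sketch of that arithmetic, using the rounded estimates from the table (so the shares differ slightly from the published percentages, presumably because of rounding):

```python
# Variance component estimates as printed in the table (rounded values)
components = {"HEC": 1.85, "Location": 0.07, "Site-Farm": 0.57, "Error": 0.52}

total = sum(components.values())
shares = {name: 100.0 * estimate / total for name, estimate in components.items()}
explained = 100.0 - shares["Error"]   # share attributed to HEC, location and farm

for name, share in shares.items():
    print(f"{name:10s} {share:5.1f}%")
print(f"Explained by HEC, location and farm: {explained:.1f}%")
```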

Results: Lulo
Effects of clusters of environmental conditions
Variable ranges by HEC
HEC   Slope (degrees)   EffDepth (cm)   TempAvg_0 (°C)
1     5-14              21-40           15-16.5
2     8-15              32-69           15-18.9
3     13-24             40-67           15.8-19
[Figure: bar chart of HEC effects on lulo yield (g/plant/week), axis from -30 to 50, for HECs 1, 2 and 3]
HEC 3 yielded 41 g/plant/week more fruit than average

Results: Lulo
Effects of farms across clusters of environmental conditions
[Figure: bar chart of farm effects on lulo yield (g/plant/week), axis from -80 to 60, with farms grouped by HEC (1, 2, 3)]
Farms 7 and 9 are both in HEC 3. Farm 7 produced 68 g/plant/week less than average, whilst farm 9 produced 51 g/plant/week more than average
Jiménez, D., Cock, J., Jarvis, A., Garcia, J., Satizábal, H.F., Van Damme, P., Pérez-Uribe, A. and Barreto, M., 2010. Interpretation of Commercial Production Information: A case study of lulo, an under-researched Andean fruit. Agricultural Systems 104 (3): 258-270

Drawbacks
• Crop management factors not included (only variety)
• Compared with the Andean blackberry study, even more limited spatial variation (locations within one department)
Advantages
• Iterative procedure (combination of parametric and non-parametric, linear and non-linear approaches)
• Combination of factors
• First formal research study to document the yield gap between farmers under similar climatic conditions in Colombia; it provided the basis for the site-specific analytical approaches
• Successfully identified farms that have superior management practices for given environmental conditions

3rd case study: Plantain
FactoClass (climate clusters)
[Figure: variables factor map (PCA) of the bioclimatic variables bio_1 to bio_19; Dim 1 (44.64%), Dim 2 (27.62%)]
[Figure: factor map of the observations with clusters 1 to 8; Dim 1 (43.43%), Dim 2 (29.83%)]

PCA
CATPCA (soil clusters)

C4S5: climate cluster 4, soil cluster 5

C4S5
Generalized linear model (GLM)
The model: dependencies between the predictors and the response variable
$\log(Y) = \beta_0 + \beta_1 X + \varepsilon$
$\log(\text{Yield}) = 1.22 + 0.0008 \cdot \text{planting density} + \varepsilon$
Significance level: 5%

C5S5
$\log(Y) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \varepsilon$
$\log(\text{Yield}) = 0.80 + 0.00101 \cdot \text{planting density} + 0.324154 \cdot \text{MezcVar} + \varepsilon$ (MezcVar: mixed varieties)
Significance level: 5%

• Interpretation of the parameters
$\log(\text{Yield}) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \varepsilon$
$e^{\log(\text{Yield})} = e^{\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \varepsilon}$ (non-linear)
$\text{Yield} = e^{\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \varepsilon}$ (back to the original unit, tons/ha)
$\text{Yield} = e^{\beta_0} \, e^{\beta_1 X_1} \, e^{\beta_2 X_2} \cdots e^{\varepsilon}$ (dependencies between the predictors and tons/ha)
With the model it is possible to calculate by what factor yield increases or decreases when a specific practice is changed

C4S5 (planting density)
$\log(\text{Yield}) = 1.22 + 0.0008 \cdot \text{planting density} + \varepsilon$
$\text{Yield} = e^{1.22} \, e^{0.0008 \cdot \text{planting density}} \, e^{\varepsilon}$
Planting density = 100 → $e^{100 \cdot 0.0008}$
At a 90% confidence level, for every additional 100 trees/ha the annual yield in tons/ha can be expected to increase by 3.2% to 14.2%.
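Because the model is linear on the log scale, a change of Δx in a predictor multiplies yield by e^(β·Δx). The sketch below applies this to the two plantain coefficients reported in the slides (0.0008 for planting density in C4S5 and 0.324154 for mixed varieties in C5S5); the helper name `percent_change` is hypothetical, and the results are point estimates, whereas the slides report confidence bounds.

```python
import math

def percent_change(beta, delta_x):
    """Percentage change in yield when a predictor changes by delta_x in a log-linear model."""
    return 100.0 * (math.exp(beta * delta_x) - 1.0)

# C4S5: 100 additional trees/ha with a planting density coefficient of 0.0008
density_effect = percent_change(0.0008, 100)

# C5S5: presence (1) versus absence (0) of mixed varieties, coefficient 0.324154
mixed_varieties_effect = percent_change(0.324154, 1)

print(f"+100 trees/ha:   {density_effect:.1f}% yield change")
print(f"mixed varieties: {mixed_varieties_effect:.1f}% yield change")
```

The point estimate for planting density falls inside the 3.2% to 14.2% interval quoted for C4S5, and the estimate for mixed varieties lies above the 10.46% lower bound quoted for C5S5.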

3rd case study: Plantain
C5S5 (mixed varieties)
$\log(\text{Yield}) = 0.80 + 0.00101 \cdot \text{planting density} + 0.324154 \cdot \text{MezcVar} + \varepsilon$
$\text{Yield} = e^{0.80} \, e^{0.00101 \cdot \text{planting density}} \, e^{0.324154 \cdot \text{MezcVar}} \, e^{\varepsilon}$
MezcVar = 1 (presence) → $e^{0.324154}$
At a 90% confidence level, planting mixed varieties can be expected to increase production by more than 10.46%.

4th case study: Avocado
C4S5 (planting density)
$\text{Yield} = e^{-2.078} \, e^{0.0077 \cdot \text{planting density}} \, e^{0.2079 \cdot \text{planting pattern}} \, e^{\varepsilon}$
Planting density = 10 → $e^{10 \cdot 0.0077}$
At a 90% confidence level, for every 10 trees/ha added to the planting density the annual yield in tons per hectare can be expected to increase by 2.3% to 13.2%.

4th case study: Avocado
C2S4 (planting pattern)
$\text{Yield} = e^{3.6} \, e^{-0.006 \cdot \text{planting density}} \, e^{0.434 \cdot \text{variety}} \, e^{0.7946 \cdot \text{planting pattern}} \, e^{\varepsilon}$
Planting pattern = 1 (presence) → $e^{0.7946}$
At a 90% confidence level, a producer in this zone who plants in a triangular (tresbolillo) pattern instead of a square one can be expected to increase production by more than 30.21%.

Drawbacks
• Not enough crop management factors to apply a hierarchical approach such as mixed models
• Limited temporal variation
Advantages
• Iterative procedure (combination of parametric and semi-parametric)
• Crop management factors included (farmers can control them)
• Predictor-response dependencies through GLM
• Large spatial variation
• Soil information included
• Linear and non-linear approach

[Figure: factor map of the observations, labelled by observation ID and grouped into clusters cl1, cl2 and cl3; Factor 1: 3.8369 (48%), Factor 2: 2.518 (31.5%); variables shown: bio_4, bio_6, bio_7, bio_12, bio_13, bio_14, bio_15, cons_mths]
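The factor-map axis labels pair an eigenvalue with a percentage of explained variance (Factor 1: 3.8369, 48%; Factor 2: 2.518, 31.5%). A minimal sketch of how the two relate, assuming the analysis was a PCA on standardized variables (so the eigenvalues sum to the number of variables) and that the eight variables named on the map were the inputs; both points are assumptions, not stated in the slides.

```python
def variance_share(eigenvalue, n_variables):
    """Percentage of total variance carried by one factor of a standardized PCA."""
    return 100.0 * eigenvalue / n_variables

# Assumed inputs: bio_4, bio_6, bio_7, bio_12, bio_13, bio_14, bio_15, cons_mths
N_VARIABLES = 8

factor1_share = variance_share(3.8369, N_VARIABLES)
factor2_share = variance_share(2.518, N_VARIABLES)
print(f"Factor 1: {factor1_share:.1f}%  Factor 2: {factor2_share:.1f}%")
print(f"First two factors together: {factor1_share + factor2_share:.1f}%")
```

Under these assumptions the computed shares agree with the percentages printed on the axes.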
Analytical approaches: data-driven
Parametric methods
• Ordinary least squares regression (OLS)
• Principal component analysis (PCA)
• Robust linear regression
• Mixed models
• Best linear unbiased prediction (BLUP)
• FactoClass (factor analysis, Ward's method, K-means)
• Categorical principal components analysis (CATPCA)
Semi- or non-parametric methods
• Generalized linear model (GLM)
• Self-Organizing Maps (SOM)
• Multilayer perceptron (MLP)
• Fuzzy logic
We adapt a range of methodologies to the analysis of real data, rather than adapting the data to particular methodologies.
