3. http://goo.gl/j8F64
Q: What to do about rare domain segmentation zones?
A: Select the nearest ones from the rest
But how?
4. In the literature: within vs cross = ??
Before:
• Kitchenham et al., TSE 2007
• Within-company learning (just use local data)
• Cross-company learning (just use data from other companies)
• Results mixed: no clear win from cross or within
This work:
• Cross vs. within are not rigid boundaries; they are soft borders
• We can move a few examples across the border
• After making those moves, "cross" is the same as "local"
6. The Locality(1) Assumption
Data divides best on one attribute:
1. development centers of developers;
2. project type (e.g. embedded);
3. development language;
4. application type (MIS, GNC, etc.);
5. targeted hardware platform;
6. in-house vs. outsourced projects;
7. etc.
If Locality(1) holds, it is hard to use data across these boundaries. Then it is harder to build effort models: we need to collect local data (slow).
7. The Locality(N) Assumption
Data divides best on a combination of attributes.
If Locality(N) holds, it is easier to use data across these boundaries: the relevant data is spread all around, like little diamonds floating in the dust.
8. Roadmap
WHY: Motivation
WHAT: Background (SEE = software effort estimation)
HOW: Technology (TEAK); Results (with TEAK, no distinction cross / within); Related Work
SO WHAT: Conclusions
10. What is SEE?
Software effort estimation (SEE) is the activity of estimating the total effort required to complete a software project (Keung2008 [1]).
SEE has been heavily investigated since the early 80's (Mendes2003 [2]; Kemerer1987 [3]; Boehm1981 [4]).
11. What is the SEE problem?
SEE as an industry problem:
• Software projects (60%-80%) encounter overruns
• Avg. overrun is 89% (Standish Group 2004)
• Acc. to Jorgensen the amount is less (around 30%), but still dire (Jorgensen2011 [5])
12. Active research area
Jorgensen & Shepperd review 304 journal papers after filtering (Jorgensen2007 [8])
• For "software effort cost" in the 2000-2011 period, IEEE Xplore returns 1098 conference papers and 161 journal papers
Their literature review reveals (Jorgensen2007 [8]):
• Since the 80's, 61% of SEE studies deal with proposing a new model and comparing it to old ones
13. Roadmap
WHY: Motivation
WHAT: Background (SEE = software effort estimation)
HOW: Technology (TEAK); Results (with TEAK, no distinction cross / within); Related Work
SO WHAT: Conclusions
14. TEAK = ABE0 + instance selection
Kocaguneli et al. 2011, ASE journal: 17,000+ variants of analogy-based effort estimation.
ABE0 = analogy-based effort estimator, version 0; just the most commonly used analogy method:
• Normalize numerics min to max, 0 to 1
• Euclidean distance (ignoring dependent variables)
• Equal weighting for all attributes
• Return median effort of the k-nearest neighbors
Instance selection: a smart way to adjust the training data.
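The ABE0 recipe is simple enough to sketch in a few lines. A minimal illustration (the function name and numpy usage are ours, not the authors' code):

```python
import numpy as np

def abe0_estimate(train_X, train_y, query, k=3):
    """ABE0 sketch: min-max normalize attributes to 0..1, use
    equal-weight Euclidean distance, and return the median effort of
    the k nearest neighbors.  The dependent variable (effort) never
    enters the distance calculation."""
    X = np.asarray(train_X, dtype=float)
    q = np.asarray(query, dtype=float)
    # Normalize each attribute to 0..1 using the training ranges
    lo, hi = X.min(axis=0), X.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)  # guard zero-range attributes
    Xn, qn = (X - lo) / span, (q - lo) / span
    # Equal-weight Euclidean distance over independent attributes only
    dists = np.sqrt(((Xn - qn) ** 2).sum(axis=1))
    nearest = np.argsort(dists)[:k]
    return float(np.median(np.asarray(train_y, dtype=float)[nearest]))
```

For example, a query identical to two "similar" training rows with efforts 2 and 3 returns their median, 2.5, when k=2.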
15. How to find relevant training data?
Use the similar projects? The more variant ones? The aliens?

              independent attributes
              w    x    y    z   class
similar 1     0    1    1    1     2
similar 2     0    1    1    1     3
different 1   7    7    6    2     5
different 2   1    9    1    8     8
different 3   5    4    2    6    10
alien 1      74   15   73   56    20
alien 2      77   45   13    6    40
alien 3      35   99   31   21    60
alien 4      49   55   37    4    80
16. Variance pruning

              w    x    y    z   class
similar 1     0    1    1    1     2   <- KEEP!
similar 2     0    1    1    1     3   <- KEEP!
different 1   7    7    6    2     5   <- KEEP!
different 2   1    9    1    8     8   <- KEEP!
different 3   5    4    2    6    10   <- KEEP!
alien 1      74   15   73   56    20   <- PRUNE!
alien 2      77   45   13    6    40   <- PRUNE!
alien 3      35   99   31   21    60   <- PRUNE!
alien 4      49   55   37    4    80   <- PRUNE!

"Easy path": cull the examples that hurt the learner:
1) Sort the clusters by "variance"
2) Prune the high-variance ones
3) Estimate on the rest
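The three steps above can be sketched directly, ranking clusters by the variance of their effort values. (The `keep_fraction` knob is illustrative only, not the paper's setting.)

```python
import statistics

def prune_high_variance(clusters, keep_fraction=0.5):
    """Variance pruning sketch: sort clusters of effort values by
    variance, keep only the calmest fraction, and estimate from the
    survivors."""
    # Step 1: sort the clusters by variance of the effort (class) column
    ranked = sorted(clusters, key=statistics.pvariance)
    # Step 2: prune the high-variance ones
    n_keep = max(1, int(len(ranked) * keep_fraction))
    # Step 3: the caller estimates on what remains
    return ranked[:n_keep]
```

With three clusters of efforts, e.g. [2, 3], [5, 8, 10] and [20, 40, 60, 80], the default setting keeps only the calm [2, 3] cluster and discards the alien-like one.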
17. TEAK: clustering + variance pruning (TSE, Jan 2011)
• TEAK is a variance-based instance selector
• It is built via GAC trees
• TEAK is a two-pass system
• First pass selects low-variance relevant projects
• Second pass retrieves projects to estimate from
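A much-simplified stand-in for the two-pass idea (plain k-nearest neighborhoods instead of GAC trees; `var_cap` is a hypothetical knob, not TEAK's actual policy, and this is not the published TEAK code):

```python
import statistics

def euclid(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def teak_like_estimate(train, query, k=1, var_cap=5.0):
    """Two-pass sketch in the spirit of TEAK.
    Pass 1: keep only projects whose k-nearest neighborhood has low
    effort variance.  Pass 2: estimate from the survivors."""
    def neighbours(pool, feats, n):
        return sorted(pool, key=lambda p: euclid(p[0], feats))[:n]

    # Pass 1: prune projects sitting in high-variance regions
    calm = []
    for feats, effort in train:
        rest = [p for p in train if p != (feats, effort)]
        hood = neighbours(rest, feats, k)
        if statistics.pvariance([e for _, e in hood] + [effort]) <= var_cap:
            calm.append((feats, effort))
    if not calm:
        calm = train  # fall back rather than return nothing
    # Pass 2: retrieve from the pruned, low-variance pool
    hood = neighbours(calm, query, k)
    return statistics.median(e for _, e in hood)
```

On toy data like the table two slides back, the first pass discards the "alien" rows (their neighborhoods have wildly varying efforts) before the second pass retrieves an estimate.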
18. Essential point
TEAK finds the local regions important to the estimation of particular cases.
TEAK finds those regions via locality(N), not locality(1).
19. Roadmap
WHY: Motivation
WHAT: Background (SEE = software effort estimation)
HOW: Technology (TEAK); Results (with TEAK, no distinction cross / within); Related Work
SO WHAT: Conclusions
20. Within and Cross Datasets
Out of 20 datasets, only 6 are found suitable for within/cross experiments.
Note: all are Locality(1) divisions.
21. Experiment 1: Performance Comparison of Within and Cross-Source Data
• TEAK run on within & cross data for each dataset group (lines separate groups)
• LOOCV (leave-one-out cross-validation) used for the runs
• 20 runs performed for each treatment
• Results evaluated w.r.t. MAR, MMRE, MdMRE and Pred(30), but see http://goo.gl/6q0tw
• If within data outperforms cross data, the dataset is highlighted in gray
• Only 2 datasets are highlighted
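The four evaluation measures named above are standard in SEE; a generic sketch (the study's exact implementation may differ):

```python
import statistics

def see_metrics(actual, predicted):
    """Common SEE error measures:
    MAR      = mean absolute residual
    MMRE     = mean magnitude of relative error
    MdMRE    = median magnitude of relative error
    Pred(30) = fraction of estimates within 30% of the actual effort"""
    residuals = [abs(a - p) for a, p in zip(actual, predicted)]
    mre = [abs(a - p) / a for a, p in zip(actual, predicted)]  # assumes actual > 0
    return {
        "MAR": statistics.mean(residuals),
        "MMRE": statistics.mean(mre),
        "MdMRE": statistics.median(mre),
        "Pred(30)": sum(m <= 0.30 for m in mre) / len(mre),
    }
```

Note that MMRE and Pred(30) can disagree about which estimator is better, which is one reason the slide points to the extra material at http://goo.gl/6q0tw.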
23. Experiment 2: Retrieval Tendency of TEAK from Within and Cross-Source Data
Diagonal (WC) vs. off-diagonal (CC) selection percentages, sorted.
Percentiles of the diagonals and off-diagonals.
24. Roadmap
WHY: Motivation
WHAT: Background (SEE = software effort estimation)
HOW: Technology (TEAK); Results (with TEAK, no distinction cross / within); Related Work
SO WHAT: Conclusions
26. Highlights
1. Don't listen to everyone: when listening to a crowd, first filter the noise.
2. Once the noise clears, bits of me are similar to bits of you: the probability of selecting cross or within instances is the same.
3. Cross-vs-within is not a useful distinction: Locality(1) is not informative. This enables "cross-company" learning.
27. Implications
Companies can learn from each other's data.
There is a business case for building a shared repository.
Maybe there are general effects in SE: effects that transcend the boundaries of one company.
28. Future Work
1. Check external validity: does cross == within (after instance selection) hold in other data?
2. Build more repositories: more useful than previously thought for effort estimation.
3. Synonym discovery: we can only use cross-data if it has the same ontology. Auto-generate lexicons to map terms between data sets?
30. References
1) J. W. Keung, "Theoretical Maximum Prediction Accuracy for Analogy-Based Software Cost Estimation," 2008 15th Asia-Pacific Software Engineering Conference, pp. 495–502, 2008. [Online]. Available: http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=4724583
2) E. Mendes, I. D. Watson, C. Triggs, N. Mosley, and S. Counsell, "A comparative study of cost estimation models for web hypermedia applications," Empirical Software Engineering, vol. 8, no. 2, pp. 163–196, 2003.
3) C. Kemerer, "An empirical validation of software cost estimation models," Communications of the ACM, vol. 30, no. 5, pp. 416–429, May 1987.
4) B. W. Boehm, Software Engineering Economics. Upper Saddle River, NJ, USA: Prentice Hall PTR, 1981.
5) M. Jørgensen, "Contrasting ideal and realistic conditions as a means to improve judgment-based software development effort estimation," Information and Software Technology, July 2011. doi:10.1016/j.infsof.2011.07.001
6) J. Rost and R. L. Glass, The Dark Side of Software Engineering: Evil on Computing Projects. John Wiley & Sons, 2011.
7) M. Jørgensen, "Contrasting ideal and realistic conditions as a means to improve judgment-based software development effort estimation," Information and Software Technology, July 2011. doi:10.1016/j.infsof.2011.07.001
8) M. Jorgensen and M. Shepperd, "A systematic review of software development cost estimation studies," IEEE Trans. Softw. Eng., vol. 33, no. 1, pp. 33–53, 2007.
9) M. Shepperd and G. F. Kadoda, "Comparing software prediction techniques using simulation," IEEE Trans. Softw. Eng., vol. 27, no. 11, pp. 1014–1022, 2001.
10) I. Myrtveit, E. Stensrud, and M. Shepperd, "Reliability and validity in comparative studies of software prediction models," IEEE Trans. Softw. Eng., vol. 31, no. 5, pp. 380–391, May 2005.
11) B. Kitchenham, E. Mendes, and G. H. Travassos, "Cross versus within-company cost estimation studies: A systematic review," IEEE Trans. Softw. Eng., vol. 33, no. 5, pp. 316–329, 2007.
12) T. Menzies, Z. Chen, J. Hihn, and K. Lum, "Selecting best practices for effort estimation," IEEE Trans. Softw. Eng., vol. 32, no. 11, pp. 883–895, 2006.
13) K. Lum, J. Powell, and J. Hihn, "Validation of spacecraft software cost estimation models for flight and ground systems," ISPA Conference Proceedings, Software Modeling Track, May 2002.
14) J. Keung, E. Kocaguneli, and T. Menzies, "A ranking stability indicator for selecting the best effort estimator in software cost estimation," Automated Software Engineering (submitted), 2011.
15) M. Shepperd and C. Schofield, "Estimating software project effort using analogies," IEEE Trans. Softw. Eng., vol. 23, no. 12, Nov. 1997.
16) L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone, Classification and Regression Trees, 1984.
17) T. Zimmermann, N. Nagappan, H. Gall, E. Giger, and B. Murphy, "Cross-project defect prediction," ESEC/FSE'09, p. 91, 2009.
18) B. Turhan, T. Menzies, A. Bener, and J. Di Stefano, "On the relative value of cross-company and within-company data for defect prediction," Empirical Software Engineering, vol. 14, no. 5, pp. 540–578, 2009.
19) E. Kocaguneli, G. Gay, T. Menzies, Y. Yang, and J. W. Keung, "When to use data from other projects for effort estimation," ASE'10, pp. 321–324, 2010.
31. Related work
Cross-vs-within (defect prediction):
• Zimmermann et al., FSE'09: over pairs of projects (x, y), for 96% of pairs the predictors learned from "x" failed for "y"; no relevancy filtering.
• Turhan et al., ESE'09: opposite result; with nearest-neighbor filtering, predictors from "x" work well for "y"; but no variance filtering.
Other:
• Keung et al. 2011: 90 effort estimators; the best methods built multiple local models (CART, CBR); single-dimensional models did comparatively worse.
• Instance selection: can discard 70 to 90% of the data without hurting accuracy; since 1974, 100s of papers: http://goo.gl/8iAUz