How to Find Relevant Data for Effort Estimation

                Ekrem Kocaguneli, Tim Menzies


     LCSEE, West Virginia University, Morgantown, USA



http://goo.gl/j8F64


[Figure: USD, DOD military projects (last decade)]

You must segment to find relevant data
Domain Segmentations

Q: What to do about rare zones?
A: Select the nearest ones from the rest
But how?
In the literature: within vs. cross = ??

Before (Kitchenham et al., TSE 2007):
• Within-company learning (just use local data)
• Cross-company learning (just use data from other companies)
• Results mixed: no clear win from cross or within

This work:
• Cross vs. within are not rigid boundaries; they are soft borders
• We can move a few examples across the border
• After making those moves, "cross" performs the same as "local"
Some data does not divide neatly on existing dimensions
The Locality(1) Assumption

Data divides best on one attribute:
1. development centers of developers
2. project type (e.g., embedded)
3. development language
4. application type (MIS, GNC, etc.)
5. targeted hardware platform
6. in-house vs. outsourced projects
7. etc.

If Locality(1) holds, it is hard to use data across these boundaries
• Then it is harder to build effort models:
• Need to collect local data (slow)
The Locality(N) Assumption

Data divides best on a combination of attributes

If Locality(N) holds:
• Easier to use data across these boundaries
• Relevant data is spread all around
• "little diamonds floating in the dust"
Roadmap

WHY: Motivation
WHAT: Background (SEE = software effort estimation)
HOW: Technology (TEAK)
Results: With TEAK, no distinction cross / within
Related Work
SO WHAT: Conclusions
What is SEE?

Software effort estimation (SEE) is the activity of estimating the total effort required to complete a software project (Keung2008 [1]).

SEE has been heavily investigated since the early 80s (Mendes2003 [2], Kemerer1987 [3], Boehm1981 [4]).
What is the SEE problem?

SEE as an industry problem:
• Most software projects (60%-80%) encounter overruns
  • Average overrun is 89% (Standish Group 2004)
  • According to Jorgensen the figure is lower (around 30%), but still dire (Jorgensen2011 [5])
Active research area

Jorgensen & Shepperd review 304 journal papers after filtering (Jorgensen2007 [8])
• For "software effort cost" in the 2000-2011 period, IEEE Xplore returns:
  • 1098 conference papers
  • 161 journal papers

The Jorgensen & Shepperd literature review reveals (Jorgensen2007 [8]):
• Since the 80s, 61% of SEE studies deal with proposing a new model and comparing it to old ones
TEAK = ABE0 + instance selection

Kocaguneli et al. 2011, ASE journal
• 17,000+ variants of analogy-based effort estimation

ABE0 = analogy-based effort estimator, version 0 (a minimal sketch follows below)
• just the most commonly used analogy method
• normalized numerics: min to max mapped to 0 to 1
• Euclidean distance (ignoring dependent variables)
• equal weighting for all attributes
• return the median effort of the k-nearest neighbors

Instance selection
• a smart way to adjust the training data
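To make the ABE0 baseline concrete, here is a minimal sketch in Python (not the authors' code; the matrix layout, the toy query, and k=3 are illustrative assumptions): normalize each independent attribute to 0-1, measure Euclidean distance, and return the median effort of the k nearest training projects.

```python
import numpy as np

def abe0_estimate(train_X, train_effort, test_x, k=3):
    """Minimal ABE0 sketch: normalize attributes to 0-1, use Euclidean
    distance over the independent attributes only, and return the median
    effort of the k nearest training projects. k=3 is an assumption."""
    train_X = np.asarray(train_X, dtype=float)
    test_x = np.asarray(test_x, dtype=float)
    # Normalize each attribute to [0, 1] using the training data's min and max
    lo, hi = train_X.min(axis=0), train_X.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)              # guard against constant columns
    norm_train = (train_X - lo) / span
    norm_test = (test_x - lo) / span
    # Equal weighting: plain Euclidean distance, dependent variable excluded
    dists = np.sqrt(((norm_train - norm_test) ** 2).sum(axis=1))
    nearest = np.argsort(dists)[:k]
    return float(np.median(np.asarray(train_effort, dtype=float)[nearest]))

# Toy usage with the w, x, y, z rows from the "relevant training data" slide
X = [[0, 1, 1, 1], [0, 1, 1, 1], [7, 7, 6, 2], [1, 9, 1, 8], [5, 4, 2, 6]]
effort = [2, 3, 5, 8, 10]
print(abe0_estimate(X, effort, test_x=[1, 2, 1, 1]))
```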

How to find relevant training data?

               independent attributes
                w    x    y    z   class
  similar 1     0    1    1    1     2
  similar 2     0    1    1    1     3
  different 1   7    7    6    2     5
  different 2   1    9    1    8     8
  different 3   5    4    2    6    10
  alien 1      74   15   73   56    20
  alien 2      77   45   13    6    40
  alien 3      35   99   31   21    60
  alien 4      49   55   37    4    80

Use similar? Use more variant? Use aliens?
Variance pruning

               independent attributes
                w    x    y    z   class
KEEP !
  similar 1     0    1    1    1     2
  similar 2     0    1    1    1     3
  different 1   7    7    6    2     5
  different 2   1    9    1    8     8
  different 3   5    4    2    6    10
PRUNE !
  alien 1      74   15   73   56    20
  alien 2      77   45   13    6    40
  alien 3      35   99   31   21    60
  alien 4      49   55   37    4    80

1) Sort the clusters by "variance"
2) Prune those high-variance clusters
3) Estimate on the rest

"Easy path": cull the examples that hurt the learner (see the sketch below)
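A minimal sketch of the pruning step, assuming clusters are given as lists of effort values and using a keep-below-median-variance rule purely for illustration (the paper's exact cutoff may differ):

```python
import numpy as np

def prune_high_variance(clusters):
    """Keep low-variance clusters, drop the rest. `clusters` is a list of
    lists of effort values, one list per cluster. The keep-below-median
    cutoff is illustrative, not necessarily TEAK's exact rule."""
    variances = [np.var(c) for c in clusters]
    cutoff = np.median(variances)
    return [c for c, v in zip(clusters, variances) if v <= cutoff]

# Toy usage with the effort column of the slide's table
clusters = [[2, 3], [5, 8, 10], [20, 40, 60, 80]]   # similar, different, alien
print(prune_high_variance(clusters))                # the high-variance "alien" cluster is pruned
```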
TEAK: clustering + variance pruning (TSE, Jan 2011)

• TEAK is a variance-based instance selector
• It is built via GAC (greedy agglomerative clustering) trees

• TEAK is a two-pass system (see the sketch below)
  • The first pass selects low-variance relevant projects
  • The second pass retrieves the projects to estimate from
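A TEAK-flavoured sketch of the two passes, with simplifying assumptions: scipy's agglomerative clustering stands in for GAC trees, and the median-variance cutoff, n_clusters and k are illustrative choices rather than the published rules.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import cdist

def teak_like_estimate(train_X, train_effort, test_x, n_clusters=4, k=1):
    """Two-pass sketch. Pass 1: cluster the training projects and keep only
    clusters whose effort variance is at or below the median (variance
    pruning). Pass 2: retrieve the k nearest kept projects and return the
    median of their efforts."""
    train_X = np.asarray(train_X, dtype=float)
    train_effort = np.asarray(train_effort, dtype=float)
    # Pass 1: agglomerative clustering (stand-in for GAC trees), then prune
    labels = fcluster(linkage(train_X, method="average"),
                      t=n_clusters, criterion="maxclust")
    variances = {c: train_effort[labels == c].var() for c in np.unique(labels)}
    cutoff = np.median(list(variances.values()))
    keep = np.isin(labels, [c for c, v in variances.items() if v <= cutoff])
    kept_X, kept_effort = train_X[keep], train_effort[keep]
    # Pass 2: retrieve the nearest kept projects and estimate from them
    dists = cdist(kept_X, np.asarray(test_x, dtype=float)[None, :]).ravel()
    return float(np.median(kept_effort[np.argsort(dists)[:k]]))
```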

Essential point

• TEAK finds local regions important to the estimation of particular cases
• TEAK finds those regions via Locality(N), not Locality(1)
Within and Cross Datasets

Out of 20 datasets, only 6 are found suitable for within/cross experiments

Note: all are Locality(1) divisions (a sketch of such a split follows below)
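For concreteness, a minimal sketch of a Locality(1) within/cross split (the "source" key name and the toy rows are assumptions for illustration): for a test project from source s, the "within" data is the other projects from s and the "cross" data is everything from the other sources.

```python
def within_cross_split(projects, test_index, source_key="source"):
    """Locality(1) split: 'within' = other projects from the same source as
    the test project, 'cross' = projects from every other source.
    `projects` is a list of dicts; the 'source' key name is an assumption."""
    test = projects[test_index]
    within = [p for i, p in enumerate(projects)
              if i != test_index and p[source_key] == test[source_key]]
    cross = [p for p in projects if p[source_key] != test[source_key]]
    return within, cross

# Toy usage: projects from two hypothetical development centers
toy = [{"source": "centerA", "kloc": 10, "effort": 120},
       {"source": "centerA", "kloc": 12, "effort": 150},
       {"source": "centerB", "kloc": 11, "effort": 300}]
within, cross = within_cross_split(toy, test_index=0)
```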




Experiment 1: Performance Comparison of Within and Cross-Source Data

• TEAK run on within and cross data for each dataset group (lines separate groups)
• LOOCV used for the runs
• 20 runs performed for each treatment
• Results evaluated w.r.t. MAR, MMRE, MdMRE and Pred(30) (formulas sketched below), but see http://goo.gl/6q0tw
• If within data outperforms cross data, the dataset is highlighted in gray
  • Only 2 datasets are highlighted
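For reference, these are the standard formulas for the four measures (a small sketch, not the authors' evaluation scripts; the printed numbers are toy values):

```python
import numpy as np

def see_error_measures(actual, predicted):
    """Standard SEE error measures. MRE_i = |actual_i - predicted_i| / actual_i
    (actual efforts must be positive)."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    abs_res = np.abs(actual - predicted)
    mre = abs_res / actual
    return {
        "MAR": float(abs_res.mean()),            # mean absolute residual
        "MMRE": float(mre.mean()),               # mean magnitude of relative error
        "MdMRE": float(np.median(mre)),          # median magnitude of relative error
        "Pred(30)": float((mre <= 0.30).mean()), # fraction within 30% of the actual
    }

print(see_error_measures(actual=[100, 40, 250], predicted=[120, 30, 240]))
```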

Experiment 2: Retrieval Tendency of TEAK from Within and Cross-Source Data
Experiment 2: Retrieval Tendency of TEAK from Within and Cross-Source Data

• Diagonal (WC) vs. off-diagonal (CC) selection percentages, sorted (see the sketch below)
• Percentiles of the diagonals and off-diagonals
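As a rough illustration of how such percentages can be computed, assume a retrieval-count matrix whose entry (i, j) counts the analogies retrieved from source j for test projects of source i (the matrix below holds toy numbers, not the paper's results):

```python
import numpy as np

def retrieval_percentages(counts):
    """counts[i, j] = number of analogies that test projects from source i
    retrieved from source j. The diagonal is within-source (WC) retrieval,
    the off-diagonal is cross-source (CC) retrieval."""
    counts = np.asarray(counts, dtype=float)
    total = counts.sum()
    within = np.trace(counts)
    return {"within %": 100.0 * within / total,
            "cross %": 100.0 * (total - within) / total}

# Toy matrix for three sources (made-up counts, for illustration only)
toy_counts = [[12, 10, 9],
              [11, 13, 12],
              [10, 9, 14]]
print(retrieval_percentages(toy_counts))
```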



Highlights

1. Don't listen to everyone
   • When listening to a crowd, first filter the noise

2. Once the noise clears: bits of me are similar to bits of you
   • The probability of selecting cross or within instances is the same

3. Cross-vs-within is not a useful distinction
   • Locality(1) is not informative
   • This enables "cross-company" learning
Implications

• Companies can learn from each other's data
• There is a business case for building a shared repository
• Maybe there are general effects in SE
  • effects that transcend the boundaries of one company
Future Work

1. Check external validity
   • Does cross == within (after instance selection) hold in other data?

2. Build more repositories
   • More useful than previously thought for effort estimation

3. Synonym discovery
   • Cross data can only be used if it has the same ontology
   • Auto-generate lexicons to map terms between data sets?
Questions?
Comments?






References
1) J. W. Keung, “Theoretical Maximum Prediction Accuracy for Analogy-Based Software Cost Estimation,” 2008 15th Asia-Pacific Software Engineering
      Conference, pp. 495–502, 2008. [Online]. Available: http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=4724583
2) E. Mendes, I. D. Watson, C. Triggs, N. Mosley, and S. Counsell, “A comparative study of cost estimation models for web hypermedia applications,” Empirical
      Software Engineering, vol. 8, no. 2, pp. 163–196, 2003.
3) C. Kemerer, “An empirical validation of software cost estimation models,” Communications of the ACM, vol. 30, no. 5, pp. 416–429, May 1987.
4) B. W. Boehm, Software Engineering Economics. Upper Saddle River, NJ, USA: Prentice Hall PTR, 1981.
5) Magne Jørgensen, “Contrasting ideal and realistic conditions as a means to improve judgment-based software development effort estimation”, Information
      and Software Technology (July 2011) doi:10.1016/j.infsof.2011.07.001
6) Johann Rost, Robert L. Glass, “The Dark Side of Software Engineering: Evil on Computing Projects”, Wiley John & Sons Inc., 2011.
7) Magne Jørgensen, “Contrasting ideal and realistic conditions as a means to improve judgment-based software development effort estimation”, Information
      and Software Technology (July 2011) doi:10.1016/j.infsof.2011.07.001
8) M. Jorgensen and M. Shepperd, “A systematic review of software development cost estimation studies,” IEEE Trans. Softw. Eng., vol. 33, no. 1, pp. 33–53,
      2007.
9) M. Shepperd and G. F. Kadoda, “Comparing software prediction techniques using simulation,” IEEE Trans. Software Eng, vol. 27, no. 11, pp. 1014–1022, 2001.
10) I. Myrtveit, E. Stensrud, and M. Shepperd, “Reliability and validity in comparative studies of software prediction models,” IEEE Transactions on Software
      Engineering, vol. 31, no. 5, pp. 380–391, May 2005.
11) B. Kitchenham, E. Mendes, and G. H. Travassos. Cross versus Within-Company Cost Estimation Studies: A Systematic Review. IEEE Trans. Softw. Eng., 33(5):
      316–329, 2007.
12) T. Menzies, Z. Chen, J. Hihn, and K. Lum. Selecting Best Practices for Effort Estimation. IEEE Transactions on Software Engineering, 32(11):883–895, 2006.
13) K. Lum, J. Powell, and J. Hihn. Validation of Spacecraft Software Cost Estimation Models for Flight and Ground Systems. In ISPA Conference Proceedings,
      Software Modeling Track, May 2002.
14) J. Keung, E. Kocaguneli, and T. Menzies. A Ranking Stability Indicator for Selecting the Best Effort Estimator in Software Cost Estimation. Automated
      Software Engineering (submitted), 2011.
15) M. Shepperd and C. Schofield. Estimating Software Project Effort Using Analogies. IEEE Transactions on Software Engineering, 23(12), Nov. 1997.
16) L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone. Classification and Regression Trees, 1984.
17) T. Zimmermann, N. Nagappan, H. Gall, E. Giger, and B. Murphy. Cross-project defect prediction. ESEC/FSE’09, page 91, 2009.
18) B. Turhan, T. Menzies, A. Bener, and J. Di Stefano. On the relative value of cross-company and within-company data for defect prediction. Empirical
      Software Engineering, 14(5):540–578, 2009.
19) E. Kocaguneli, G. Gay, T. Menzies, Y. Yang, and J. W. Keung. When to use data from other projects for effort estimation. In ASE’10, pages 321–324, 2010.





Related work

Cross-vs-within (defect prediction):
• Zimmermann et al. FSE’09
  • pairs of projects (x, y)
  • for 96% of pairs, predictors from “x” failed for “y”
  • no relevancy filtering
• Opposite result: Turhan et al. ESE’09
  • with nearest-neighbor filtering, predictors from “x” work well for “y”
  • but no variance filtering

Other:
• Keung et al. 2011
  • 90 effort estimators
  • the best methods built multiple local models (CART, CBR)
  • single-dimensional models did comparatively worse
• Instance selection
  • can discard 70 to 90% of the data without hurting accuracy
  • since 1974, 100s of papers: http://goo.gl/8iAUz
