2. Data – Input for Analysis and Interpretation
Data are generally collected as a basis for action.
You must always use some method of analysis to extract and interpret the information that lies in the data.
The type of data that has been collected will determine the type of statistics or analysis that can be performed.
Making sense of the data is a process in itself.
Always provide a “context” for data; data have no meaning apart from their context.
Data should always be presented in a way that preserves the evidence in the data for all the predictions that might be made from them.
3. Data - 2
Data should be completely and fully described:
Who collected the data?
How were the data collected?
When were the data collected?
Where were the data collected?
What do these values represent?
If the data are computed values, how were the values computed from the raw inputs?
4. Data - 3
Variation exists in all data and consists of both noise (random or common cause variation) and signal (nonrandom or special cause variation).
Without formal and standardized approaches for analyzing data, you may have difficulty interpreting and using your measurement results.
When you interpret and act on measurement results, you are presuming that the measurements represent reality.
5. Data - 4
To use data safely, you must have simple and effective methods not only for detecting signals that are surrounded by noise, but also for recognizing and dealing with normal process variation when no signals are present.
Drawing conclusions and predictions from data depends not only on using appropriate analytical methods and tools, but also on understanding the underlying nature of the data and the appropriateness of assumptions about the conditions and environments in which the data were obtained.
6. Data Definitions
Categorical vs. Quantitative Variables – Variables can be classified as categorical (aka qualitative) or quantitative (aka numerical).
Categorical – Categorical variables take on values that are names or labels. The color of a ball (e.g., red, green, blue) or the breed of a dog (e.g., collie, shepherd, terrier) would be examples of categorical variables.
Quantitative – Quantitative variables are numerical. They represent a measurable quantity. For example, when we speak of the population of a city, we are talking about the number of people in the city – a measurable attribute of the city. Therefore, population would be a quantitative variable.
7. Data Definitions - 2
Discrete vs. Continuous Variables – Quantitative variables can be further classified as discrete or continuous. If a variable can take on any value between two specified values, it is called a continuous variable; otherwise, it is called a discrete variable.
Examples to clarify the difference between discrete and continuous variables:
Suppose the fire department mandates that all fire fighters must weigh between 150 and 250 pounds. The weight of a fire fighter would be an example of a continuous variable, since a fire fighter's weight could take on any value between 150 and 250 pounds.
Suppose we flip a coin and count the number of heads. The number of heads could be any integer value between 0 and plus infinity. However, it could not take on just any value in that range; we could not, for example, get 2.5 heads. Therefore, the number of heads must be a discrete variable.
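To make the distinction concrete, here is a minimal Python sketch (all sample values are invented for illustration) of the point that the type of data determines the type of statistics that can be computed from it:

```python
# Illustrative only: which summary statistic is valid depends on
# whether the variable is categorical or quantitative.
from statistics import mean, mode

colors = ["red", "green", "blue", "red", "red"]  # categorical: names/labels
weights = [151.2, 203.7, 188.5, 172.0]           # continuous quantitative
heads_per_trial = [4, 6, 5, 3, 5]                # discrete quantitative (counts)

print(mode(colors))   # a mode is meaningful for categorical data -> red
print(mean(weights))  # a mean requires quantitative data -> 178.85
# mean(colors) would fail: names and labels cannot be averaged.
```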
9. Variables Data
Variables data are measured and plotted on a continuous scale.
With variables data, an actual numeric estimate is derived for one or more characteristics of the population being sampled, such as time, temperature, length, weight, height, volume, voltage, horsepower, torque, speed, and cost.
10. Variables Data - 2
In software, examples of variables data include:
Effort expended – number of hours, days, weeks, years, etc., that have been expended by a workforce member on an identified topic
Years of experience – total number of years of experience per category
Memory utilization – percentage of total memory available
CPU utilization – percentage of CPU used at any given moment in time
Cost of rework – dollars-and-cents calculation of the rework based on the effort put forth by anyone involved in finding and fixing reported problems
11. “Counts” Could Be Treated as Variables Data
There are many situations where “counts” get used as measures of size:
Total number of requirements
Total lines of code
Total bubbles in a data-flow diagram
Customer sites
Change requests received
Total people assigned to a project
When we count these things, we are counting all the entities in a population, not just the occurrences of entities with specific attributes.
These should always be treated as “variables” data even though they are instances of discrete counts.
12. Attributes Data
When working with attributes data, the focus is on learning about one or more specific non-numerical characteristics of the population being sampled.
When attributes data are used for direct comparisons, they must be based on consistent “areas of opportunity” if the comparisons are to be meaningful:
If the number of defects that are likely to be observed depends on the size (lines of code) of a module or component, all sizes must be nearly equal.
If the probabilities associated with defect discovery depend on the time spent inspecting or testing, the elapsed time spent must be nearly equal.
13. Attributes Data - 2
In general, when the areas of opportunity for observing a specific event are not equal or nearly so, the chances of observing the event will differ across the observations.
We must then normalize (convert to rates) by dividing each count by its area of opportunity before valid comparisons can be made.
The conditions that make us willing to assume constant areas of opportunity seem to be less common in software environments.
Normalization is almost always needed for software!
14. Attributes Data - 3
Example: If defects are being counted and the size of an item inspected influences the number of defects found, some measure of item size will also be needed to convert defect counts to relative rates that can be compared in meaningful ways (e.g., defects per thousand lines of code).
If variations in the amount of time spent inspecting or testing can influence the number of defects found, these times should be clearly defined and measured as well.
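The normalization step described above can be sketched in a few lines of Python; the module names, defect counts, and sizes are made up for illustration:

```python
# Convert raw defect counts to rates (defects per KLOC) by dividing each
# count by its area of opportunity (module size), so that unequal-sized
# modules can be compared meaningfully.
modules = {
    # name: (defects_found, lines_of_code)
    "parser":    (12, 4000),
    "scheduler": (3,  500),
    "logger":    (2,  250),
}

for name, (defects, loc) in modules.items():
    rate = defects / loc * 1000  # defects per thousand lines of code
    print(f"{name}: {rate:.1f} defects/KLOC")
```

On raw counts alone the parser looks worst (12 defects), but after normalization it has the lowest rate (3.0 defects/KLOC, versus 6.0 and 8.0 for the smaller modules).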
15. Attributes Data - 4
One of the keys to making effective use of attributes data lies in preserving the ordering of each count in space and time.
Sequence information (the order in time or space in which the data are collected) is almost always needed to interpret counts of attributes correctly.
Make the counts specific – make sure there is an operational definition (a clear set of rules and procedures) for recognizing an attribute or entity, so that what gets counted is what the user of the data expects it to be.
16. Attributes Data - 5
Attributes data are counted and plotted as discrete events:
Shipping errors
Percentage waste
Number of defects found
Number of defective items
Number of source statements of a given type
Number of lines of comments in a module of n lines
Number of people with certain skills on a project
Percentage of projects using formal inspections
Team size
Elapsed time between milestones
Staff hours logged per task
Backlog
Number of priority-one customer complaints
Percentage of non-conforming products in the output of an activity or a process
17. The Key to Classifying Data
The key to classifying data as attributes data or variables data depends not so much on whether the data are discrete or continuous, but on how they are collected and used.
For example, the total number of defects found is often used as a measure of the amount of rework or retesting to be performed: it is then viewed as a measure of size and treated as variables data, even though it originates as a count based on attributes.
The method of analysis you choose for any data will depend on:
The questions you are asking
The data distribution model you have in mind
The assumptions you are willing to make with respect to the nature of the data (Page 79)
19. Distributional Models – Relationship to Chart Types
Each type of chart is related to a set of assumptions (a distributional model) that must hold for that type of chart to be valid.
There are six types of charts for “attributes data”:
np
p
c
u
XmR for counts
XmR for rates
20. Distributional Models – Relationship to Chart Types - 2
XmR charts have an advantage over np, p, c, and u charts in that they require fewer and less stringent assumptions:
They are easier to plot and use
They have wide applicability
They are recommended by many quality-control professionals
When the assumptions of the distributional model are met, however, the more specialized np, p, c, and u charts can give better bounds for control limits and can offer advantages.
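As a sketch of why XmR charts are easy to use: the natural process limits come directly from the individual values and their moving ranges, using the standard constants 2.66 and 3.268. The weekly defect counts below are invented for illustration:

```python
# XmR (individuals and moving range) chart limits.
from statistics import mean

counts = [14, 9, 11, 16, 8, 12, 10, 15, 13, 11]  # e.g., defects per week

mR = [abs(a - b) for a, b in zip(counts[1:], counts)]  # moving ranges
x_bar, mr_bar = mean(counts), mean(mR)

ucl_x = x_bar + 2.66 * mr_bar            # upper natural process limit
lcl_x = max(0.0, x_bar - 2.66 * mr_bar)  # counts cannot fall below zero
ucl_mr = 3.268 * mr_bar                  # upper limit for the range chart

print(f"X chart: CL={x_bar:.2f}, UCL={ucl_x:.2f}, LCL={lcl_x:.2f}")
print(f"mR chart: UCL={ucl_mr:.2f}")
```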
21. Distributional Models – Relationship to Chart Types - 3
np Chart – An np chart is used when the count data are binomially distributed and all samples have equal areas of opportunity. These conditions occur in manufacturing settings, for example when there is 100% inspection of lots of size n (n constant) and the number of defective units in each lot is recorded.
p Chart – A p chart is used when the data are binomially distributed but the areas of opportunity vary from sample to sample. A p chart could be appropriate if the lot size n were to change from lot to lot.
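A sketch of the np-chart limits under the binomial assumptions just described; the lot size and defective counts are invented:

```python
# np chart: center line n*p_bar, limits n*p_bar +/- 3*sqrt(n*p_bar*(1 - p_bar)).
from math import sqrt

n = 200                                  # constant lot size (100% inspection)
defectives = [6, 9, 4, 7, 5, 8, 6, 7]    # defective units found per lot

p_bar = sum(defectives) / (n * len(defectives))  # overall fraction defective
center = n * p_bar
sigma = sqrt(n * p_bar * (1 - p_bar))

ucl = center + 3 * sigma
lcl = max(0.0, center - 3 * sigma)
print(f"np chart: CL={center:.2f}, UCL={ucl:.2f}, LCL={lcl:.2f}")

# For a p chart with varying lot sizes n_i, plot p_i = d_i / n_i and use
# limits p_bar +/- 3*sqrt(p_bar*(1 - p_bar)/n_i), recomputed for each lot.
```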
22. Distributional Models – Relationship to Chart Types - 4
c Chart – A c chart is used when the count data are samples from a Poisson distribution and the samples all have equal-sized areas of opportunity.
u Chart – A u chart is used in place of a c chart when the count data are samples from a Poisson distribution and the areas of opportunity are not constant. Defects per thousand lines of code is an example from software.
np, p, c, and u charts are the traditional control charts used with attributes data.
XmR Chart – Useful when little is known about the underlying distribution or when the justification for assuming a binomial or Poisson process is questionable. Almost always a reasonable choice.
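The u-chart limits for unequal areas of opportunity can be sketched as follows; note that the limits are recomputed per sample and widen as the area of opportunity shrinks. The module sizes and counts are invented:

```python
# u chart: center line u_bar, per-sample limits u_bar +/- 3*sqrt(u_bar / a_i),
# where a_i is the area of opportunity (here, module size in KLOC).
from math import sqrt

# (defects found, size in KLOC) for each inspected module
samples = [(12, 4.0), (3, 0.5), (6, 2.0), (9, 3.0)]

total_defects = sum(d for d, _ in samples)
total_kloc = sum(k for _, k in samples)
u_bar = total_defects / total_kloc  # center line: overall defects per KLOC

for d, k in samples:
    ucl = u_bar + 3 * sqrt(u_bar / k)  # limits widen as the area shrinks
    lcl = max(0.0, u_bar - 3 * sqrt(u_bar / k))
    print(f"u={d / k:.2f}, UCL={ucl:.2f}, LCL={lcl:.2f}")
```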
23. Distributional Models – Relationship to Chart Types - 5
More About u Charts – u charts seem to have the greatest prospects for use in software settings.
u charts require normalization (conversion to rates) when the areas of opportunity are not constant.
A Poisson model might be appropriate when counting the number of defects in modules during inspection or testing; defects per thousand lines of source code is an example of attributes data that is a candidate for u charts.
Although u charts may be appropriate for studying software defect densities in an operational environment, we are not aware of any empirical studies that have generally validated the use of Poisson models for nonoperational environments such as inspections.
24. Distributional Models – Relationship to Chart Types - 6
Defects per module or defects per test are unlikely candidates for u charts, c charts, or any other charts, for that matter:
The ratios are not based on equal areas of opportunity and cannot be normalized.
There is no reason to expect them to be constant across all modules or tests when the process is in statistical control.
25. Distributional Models – Relationship to Chart Types - 7
If you are uncertain as to which model applies, it can make sense to use more than one set of charts.
If you think you may have a Poisson situation but are not sure that all conditions for a Poisson process are present, then plotting both a u chart and the corresponding XmR charts should bracket the situation.
If both charts point to the same conclusions, you are unlikely to be led astray; if the conclusions differ, then you should investigate your assumptions or the events.
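The bracketing tactic can be sketched by computing both u-chart and XmR limits for the same rates and checking whether they flag the same points; the defect counts and module sizes are invented:

```python
# Compare the points flagged by a u chart with those flagged by an XmR
# chart of the same rates.
from math import sqrt
from statistics import mean

samples = [(12, 4.0), (3, 0.5), (6, 2.0), (9, 3.0), (25, 2.5)]  # (defects, KLOC)
rates = [d / k for d, k in samples]

# u chart: per-sample upper limits around the pooled rate
u_bar = sum(d for d, _ in samples) / sum(k for _, k in samples)
u_flags = [d / k > u_bar + 3 * sqrt(u_bar / k) for d, k in samples]

# XmR chart of the individual rates
mR = [abs(a - b) for a, b in zip(rates[1:], rates)]
ucl_x = mean(rates) + 2.66 * mean(mR)
x_flags = [r > ucl_x for r in rates]

print("u chart flags:  ", u_flags)
print("XmR chart flags:", x_flags)
```

In this invented data set the u chart flags the last point but the XmR chart (whose limits are widened by the large moving ranges) does not; per the slide, such a disagreement is the cue to investigate the distributional assumptions.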
26. Presenting Data
While it is simple and easy to compare one number with another, such comparisons are limited and weak:
Limited, because of the small amount of data used.
Weak, because both of the numbers are subject to variation.
This makes it difficult to determine just how much of the difference between the values is due to variation in the numbers and how much is due to real changes in the process.
27. Presenting Data - 2
Graphs – there are two basic graphs that are the most helpful in providing the context for interpreting the current value:
Time series graph (run chart) – has months or years marked off on the horizontal axis and possible values marked off on the vertical axis. As you move from left to right, time passes. By visually comparing the current value with the plotted values for the preceding months, you can quickly see whether the current value is unusual.
Histogram (tally plot) – an accumulation of the different values as they occur, without trying to display the time-order sequence.
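The two graphs can be sketched as plain-text plots; the monthly change counts are invented for illustration:

```python
# A run chart keeps the time order; a tally plot (histogram) discards it
# and accumulates how often each value occurred.
from collections import Counter

monthly_changes = [4, 7, 5, 6, 9, 5, 4, 6, 8, 5, 6, 12]

print("Run chart (time order preserved):")
for month, value in enumerate(monthly_changes, start=1):
    print(f"  month {month:2d} | {'*' * value}")

print("Tally plot (time order discarded):")
for value, count in sorted(Counter(monthly_changes).items()):
    print(f"  {value:2d} | {'#' * count}")
```

The run chart makes the month-12 value stand out against its predecessors; the tally plot shows the overall shape of the distribution.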
28. Run Charts
[Figure: run chart of the number of required changes to a module as the project approaches systems test, plotted across the phases Syntax Check, Desk Check, Code Review, Unit Test, Integration and Test, and Systems Test]
29. 20 18 16 14 12 10 Number of Days 8 6 4 2 0 32 56 48 46 44 42 40 38 36 54 52 50 34 Product – Service Staff Hours Histograms
30. Process Control Chart
[Figure: annotated control chart (with blank “chart type” and “metric” fields) plotting numerical data taken in time sequence. Plotted points are either individual measurements or the means of small groups of measurements. The Upper Control Limit (UCL) and Lower Control Limit (LCL) represent the natural variation in the process; the Center Line (CL) is the mean of the data used to set up the chart. A point above or below the control limits suggests that the measurement has a special, preventable or removable cause. The chart is used for continuous and timely control of the process and prevention of causes, and is analyzed using standard rules to determine the control status of the process. Source: Adrian Burr & Mal Owen, Statistical Methods for Software Quality, 1996]
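The basic detection test annotated on the chart (a point outside the limits suggests a special cause) can be sketched in a few lines; the limits and data here are invented:

```python
# Flag points that fall outside the natural process limits; such points
# are candidates for a special (assignable) cause investigation.
def beyond_limits(points, lcl, ucl):
    """Return the indices of points outside the control limits."""
    return [i for i, x in enumerate(points) if x > ucl or x < lcl]

data = [11.2, 12.8, 10.9, 13.1, 19.4, 12.2, 11.7]
print(beyond_limits(data, lcl=8.0, ucl=16.0))  # -> [4]
```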
31. Impacts of Poor Data Quality
Inability to conduct hypothesis testing and predictive modeling
Inability to manage the quality and performance of software or application development
Ineffective process change instead of process improvement
Ineffective and inefficient testing, causing issues with time to market, field quality, and development costs
Products that are costly to use within real-life usage profiles
32. References
Brassard, Michael & Ritter, Diane. The Memory Jogger II – A Pocket Guide of Tools for Continuous Improvement & Effective Planning. Salem, New Hampshire: GOAL/QPC, 1994.
Florac, W.A. & Carleton, A.D. Measuring the Software Process. Addison-Wesley, 1999.
Six Sigma Academy. The Black Belt Memory Jogger – A Pocket Guide for Six Sigma Success. Salem, New Hampshire: GOAL/QPC, 2002.
Wheeler, Donald J. Understanding Variation: The Key to Managing Chaos. Knoxville, Tennessee: SPC Press, 2000.