3. Categorical Variables Bar graphs Recall that the horizontal axis is the category name and the vertical axis is the count or percentage. Create a bar graph for “mobile phone carrier” for the students in this class period. Start with a survey!
4. Categorical Variables Pie Chart The area of each slice of pie reflects the relative frequency of the category the slice represents, e.g. if “ATT” is used by 25% of the class, the area of the ATT slice must be 25% of the entire pie. Remember: all categories must be represented in the pie. Typically, these are not fun to create.
5. Quantitative Data Stemplot (a.k.a. “Stem and Leaf Plot”) A stemplot displays the distribution in a very meaningful way Preview the example of pg 43!
6. Quantitative Data Stemplot steps Arrange the observations in numerical order. Separate each observation into a stem and a leaf. Write stems in a vertical column. Write the leaf of each observation next to the stem. Leaves that are closest to the stem are lower in numerical value.
7. Quantitative Data The following measurements are the number of points scored by THS football in each game of the 2009 season. 42, 27, 19, 14, 20, 47, 53, 28, 32, 30, 44, 20
8. Quantitative Data Stemplot steps Arrange the observations in numerical order 14, 19, 20, 20, 27, 28, 30, 32, 42, 44, 47, 53
9. Quantitative Data Stemplot steps Separate each observation into a stem and a leaf 1/4, 1/9, 2/0, 2/0, 2/7, 2/8, 3/0, 3/2, 4/2, 4/4, 4/7, 5/3
10. Quantitative Data Stemplot steps Write stems in a vertical column
1/4, 1/9, 2/0, 2/0, 2/7, 2/8, 3/0, 3/2, 4/2, 4/4, 4/7, 5/3
1 |
2 |
3 |
4 |
5 |
11. Quantitative Data Write the leaf of each observation next to the stem. Leaves that are closest to the stem are lower in numerical value.
1/4, 1/9, 2/0, 2/0, 2/7, 2/8, 3/0, 3/2, 4/2, 4/4, 4/7, 5/3
1 | 4 9
2 | 0 0 7 8
3 | 0 2
4 | 2 4 7
5 | 3
YAY!
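The stemplot steps can be sketched in Python. This is only an illustration, assuming two-digit non-negative whole numbers where the tens digit is the stem and the ones digit is the leaf; the function name `stemplot` is made up for this example.

```python
from collections import defaultdict

def stemplot(values):
    """Return stemplot rows, assuming tens digit = stem and ones digit = leaf."""
    leaves = defaultdict(list)
    # Step 1: arrange the observations in numerical order.
    for v in sorted(values):
        # Step 2: separate each observation into a stem and a leaf.
        leaves[v // 10].append(v % 10)
    rows = []
    # Step 3: write the stems in a vertical column (including any empty stems).
    for stem in range(min(leaves), max(leaves) + 1):
        # Step 4: write each leaf next to its stem, smallest leaf first.
        row = f"{stem} | " + " ".join(str(leaf) for leaf in leaves.get(stem, []))
        rows.append(row.rstrip())
    return rows

# THS football 2009 scores from the slides.
scores = [42, 27, 19, 14, 20, 47, 53, 28, 32, 30, 44, 20]
for row in stemplot(scores):
    print(row)
```

Because the leaves are added in sorted order, the leaves closest to each stem are automatically the smallest.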
12. Quantitative Data Histogram A histogram is similar to a bar graph, but is used for quantitative data only. Observations are separated into classes (number ranges) All classes must have equal width Like a bar graph, the height of each bar represents the count for each class Example 1.6 on pg 49
13. Quantitative Data Histogram Let’s use the same data from our previous example 14, 19, 20, 20, 27, 28, 30, 32, 42, 44, 47, 53
14. Quantitative Data Histogram 1. Separate the range into classes of equal width. Every observation must fall into exactly one class. Let’s try the following:
0 ≤ score < 15
15 ≤ score < 30
30 ≤ score < 45
45 ≤ score < 60
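Counting the observations in each class can be sketched as follows. This is a rough illustration that assumes four classes of width 15 starting at 0, with each class including its left boundary and excluding its right boundary; the variable names are invented for this example.

```python
# THS football 2009 scores from the slides.
scores = [14, 19, 20, 20, 27, 28, 30, 32, 42, 44, 47, 53]

# Assumed classes: [0, 15), [15, 30), [30, 45), [45, 60).
width = 15
edges = [0, 15, 30, 45, 60]

counts = [0] * (len(edges) - 1)
for s in scores:
    # Integer division finds which class the score falls in.
    counts[(s - edges[0]) // width] += 1

# Print a text histogram: one '#' per observation in the class.
for low, c in zip(edges, counts):
    print(f"{low:2d} <= score < {low + width:2d}: {'#' * c} ({c})")
```

The bar heights in the histogram are exactly these counts.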
19. Examining Distributions Look for the pattern and any deviations from the general pattern In written work, you must describe C.U.S.S. Center Unusual features (outliers) Shape Spread Note: CUSS is just a mnemonic device. It is customary to discuss “unusual features” last
20. Examining Distributions Center- We will discuss at greater length later. For now, you can use the median as a measure of center Spread- Also discussed later. For now, give the minimum and maximum values to describe spread
21. Examining Distributions Shape- We generally want to know two things How many peaks? Is it unimodal (one distinct peak) or is it uniform (no distinct peaks)? Is the distribution symmetric (both tails are approximately equal) or skewed (one of the tails is longer) Left skewed- left tail is longer Right skewed- right tail is longer
22. Examining Distributions Outliers- like many things in statistics, outliers can be a judgment call. Although we will learn a customary formula for determining outliers, the formula is arbitrary. In a histogram, outliers will be clearly separated from the rest of the observations. Because class widths can be arbitrary, be sure to thoroughly examine the data before classifying an observation as an outlier. Do not ignore or delete outlier observations!
24. Relative Freq. and Cumulative Freq. We will add a column to show relative frequency Yes, “relative frequency” is the same thing as “percentage” At this point, you could make a histogram using relative frequencies, if desired.
25. Relative Freq. and Cumulative Freq. Now add a column to show cumulative frequency Yes, keep adding the next rel. freq. The last cell in the column should be 100, unless there is roundoff error (not a big deal)
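The two extra columns can be sketched in Python. This is an illustration only: the class counts below are an assumption, taken from splitting the THS 2009 scores into four classes of width 15 starting at 0.

```python
from itertools import accumulate

# Assumed class counts for classes [0, 15), [15, 30), [30, 45), [45, 60).
counts = [1, 5, 4, 2]
total = sum(counts)

# Relative frequency: each count as a percentage of the total.
rel = [100 * c / total for c in counts]

# Cumulative frequency: keep adding the next relative frequency.
cum = list(accumulate(rel))

for c, r, cf in zip(counts, rel, cum):
    print(f"count={c:2d}  rel. freq.={r:5.1f}%  cum. freq.={cf:5.1f}%")
```

The last cumulative value comes out at 100, up to a tiny floating-point roundoff.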
26. Relative Freq. and Cumulative Freq. To create a “Cumulative Frequency Plot” or “Ogive”, start by creating axes similar to a histogram. The vertical axis is percentage and should be labeled 0 to 100%.
[Figure: empty axes. Vertical axis: Cumulative freq. (%), 0 to 100. Horizontal axis: Number of points scored, 0 to 60.]
27. Relative Freq. and Cumulative Freq. Plot points for each Cum. Freq. The left boundary of the first class should be plotted at 0%. Each cumulative frequency is plotted at the right boundary of its class, so the last point plotted will be the right boundary of the last class at 100%.
[Figure: cumulative frequency points. Vertical axis: Cumulative freq. (%), 0 to 100. Horizontal axis: Number of points scored, 0 to 60.]
28. Relative Freq. and Cumulative Freq. CONNECT THE DOTS!
[Figure: completed ogive. Vertical axis: Cumulative freq. (%), 0 to 100. Horizontal axis: Number of points scored, 0 to 60.]
29. Relative Freq. and Cumulative Freq. Some notes about ogives. It’s pronounced “Oh-Jives”. Ogives can be used to find approx. percentile rank. The vertical axis is percentile! In particular, we are interested in: Median (50th percentile), First Quartile (25th percentile), Third Quartile (75th percentile). The above vocab. will come up again. Memorize it!
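Reading a percentile off an ogive amounts to following the connected line segments. A minimal sketch, assuming class boundaries at 0, 15, 30, 45, and 60 and the THS 2009 counts of 1, 5, 4, and 2 per class; the function name `value_at_percentile` is invented for this example.

```python
def value_at_percentile(p, boundaries, cum_pct):
    """Follow the ogive's line segments to find the value at percentile p."""
    points = list(zip(boundaries, cum_pct))
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if y0 <= p <= y1 and y1 > y0:
            # Linear interpolation along this segment of the ogive.
            return x0 + (p - y0) / (y1 - y0) * (x1 - x0)
    raise ValueError("percentile outside the plotted range")

# Assumed class boundaries and cumulative counts at each boundary.
boundaries = [0, 15, 30, 45, 60]
cum_counts = [0, 1, 6, 10, 12]
cum_pct = [100 * c / 12 for c in cum_counts]

for name, p in [("Q1", 25), ("Median", 50), ("Q3", 75)]:
    print(f"{name}: about {value_at_percentile(p, boundaries, cum_pct):.2f}")
```

These readings are only approximations, since the ogive is built from grouped data, so they need not match values computed from the raw scores.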
32. Measuring Center MEAN- calculated the same way you always calculate mean (average). The symbol x̄ is read as “x-bar”. The mean is not a resistant measure of center- it is sensitive to a few extreme observations.
33. Measuring Center Median- the “middle” number in a set of observations is known as the median If the data set has an even number of observations, then the median is the average of the two middle numbers Unlike the mean, the median is a resistant measure of center.
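The difference in resistance can be seen with a quick Python sketch using the standard `statistics` module. The second data set is hypothetical: it swaps the 53 for a made-up score of 153 purely to show the effect of one extreme observation.

```python
from statistics import mean, median

# THS football 2009 scores from the slides.
scores = [14, 19, 20, 20, 27, 28, 30, 32, 42, 44, 47, 53]

# Hypothetical data: replace the highest score with an extreme value.
with_extreme = [153 if s == 53 else s for s in scores]

print(f"original:     mean = {mean(scores):.2f}, median = {median(scores)}")
print(f"with extreme: mean = {mean(with_extreme):.2f}, median = {median(with_extreme)}")
```

The mean is pulled upward by the single extreme value, while the median does not move.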
34. Measuring Spread The Quartiles The median of the subset of data less than the median is the First Quartile (Q1) The median of the subset of data greater than the median is the Third Quartile (Q3) Notice that the median is not included in either of the above calculations Q1 is the 25th percentile Q3 is the 75th percentile
35. Measuring Spread Recall the data from THS Football 2009: 14, 19, 20, 20, 27, 28, 30, 32, 42, 44, 47, 53. We can number the positions of the ordered observations to help:
01, 02, 03, 04, 05, 06, 07, 08, 09, 10, 11, 12
14, 19, 20, 20, 27, 28, 30, 32, 42, 44, 47, 53
36. Measuring Spread
01, 02, 03, 04, 05, 06, 07, 08, 09, 10, 11, 12
14, 19, 20, 20, 27, 28, 30, 32, 42, 44, 47, 53
Notice that the median is the average of 28 and 30. Med. = 29
37. Measuring Spread
Lower half (positions 01 to 06): 14, 19, 20, 20, 27, 28. Q1 is the avg. of 20 and 20.
Upper half (positions 07 to 12): 30, 32, 42, 44, 47, 53. Q3 is the avg. of 42 and 44.
Q1 = 20, Med. = 29, Q3 = 43
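The quartile method from these slides can be sketched in Python. Note the assumption: this follows the "median of each half, leaving out the overall median" rule described here, which can differ from the quartile rules built into calculators and software. The function name `quartiles` is made up for this example.

```python
from statistics import median

def quartiles(values):
    """Return (Q1, median, Q3) using the median-of-each-half rule."""
    data = sorted(values)
    half = len(data) // 2
    # Observations below the median.
    lower = data[:half]
    # Observations above the median; skip the middle value when n is odd.
    upper = data[half:] if len(data) % 2 == 0 else data[half + 1:]
    return median(lower), median(data), median(upper)

# THS football 2009 scores from the slides.
scores = [14, 19, 20, 20, 27, 28, 30, 32, 42, 44, 47, 53]
q1, med, q3 = quartiles(scores)
print(f"Q1 = {q1}, Med. = {med}, Q3 = {q3}")
```

The sketch needs at least two observations, since each half must contain at least one value.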
38. Measuring Spread InterQuartile Range (IQR) IQR is the preferred measurement of spread when the median is used to describe center. IQR = Q3 - Q1 IQR = 43 - 20 IQR = 23
39. Measuring Spread InterQuartile Range and Outliers The previously mentioned formula for determining outlier observations depends on IQR High outliers (outliers to the right) measurements greater than Q3 +1.5 x IQR Low Outliers (outliers to the left) measurements less than: Q1 -1.5 x IQR
40. Measuring Spread InterQuartile Range and Outliers High outliers greater than Q3 +1.5 x IQR = 43 + 1.5 x 23 or any observation greater than 77.5 Low Outliers less than: Q1 -1.5 x IQR = 20 – 1.5 x 23 or observations less than -14.5 Clearly, THS had no outlier football scores in 2009!
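The 1.5 x IQR rule can be checked with a short Python sketch. It takes Q1 = 20 and Q3 = 43 from the earlier slides as given; the function name `outlier_fences` is invented for this example.

```python
def outlier_fences(q1, q3):
    """Return the (low, high) cutoffs from the 1.5 x IQR rule."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

# THS football 2009 scores and quartiles from the slides.
scores = [14, 19, 20, 20, 27, 28, 30, 32, 42, 44, 47, 53]
low, high = outlier_fences(20, 43)

# An observation is flagged only if it falls outside the cutoffs.
outliers = [s for s in scores if s < low or s > high]

print(f"low cutoff = {low}, high cutoff = {high}")
print(f"outliers: {outliers}")
```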
41. Five Number Summary A snapshot of a data distribution can be given with the 5 number summary: Minimum, Q1, Median, Q3, Maximum For our THS Football 2009, the five number summary is: 14, 20, 29, 43, 53
42. Five Number Summary The 5 number summary is used to create a box plot (“box and whiskers” plot).
[Figure: box plot with Min, Q1, Med, Q3, and Max labeled, drawn above a number line from 0 to 60.]
43. Five Number Summary BOX PLOT A number line must be included with a box plot. Outliers appear as unconnected dots.
[Figure: box plot drawn above a number line from 0 to 60.]
45. The Standard Deviation The preferred measures of spread when using the mean as a measure of center are the related measurements “variance” and “standard deviation”. variance = s² standard deviation = s Yes, standard deviation is the square root of variance.
46. The Standard Deviation Formulation of variance: s² = Σ(xᵢ - x̄)² / (n - 1), where n is the number of observations. Yes, take the square root to find the std. dev.
47. The Standard Deviation For the THS 2009 data: Mean = 31.33
s² = [(14-31.33)² + (19-31.33)² + (20-31.33)² + (20-31.33)² + (27-31.33)² + (28-31.33)² + (30-31.33)² + (32-31.33)² + (42-31.33)² + (44-31.33)² + (47-31.33)² + (53-31.33)²] / (12-1)
s² = 1730.67 / 11
s² = 157.33
48. The Standard Deviation Notice that the variance s² = 157.33 is in squared units, so it is hard to relate directly to the data set! However, the standard deviation s = 12.54 is in the same units as the data, so it has some meaning for our data. For many data sets, especially reasonably symmetric ones, the majority of observations are within one standard deviation of the mean. Here, most of the data is between 31.33 - 12.54 and 31.33 + 12.54, that is, between 18.79 and 43.87 (8 of the 12 scores).
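The variance and standard deviation arithmetic can be reproduced with a short Python sketch. It assumes the sample formula, dividing by n - 1, as in the THS calculation; the variable names are invented for this example.

```python
import math

# THS football 2009 scores from the slides.
scores = [14, 19, 20, 20, 27, 28, 30, 32, 42, 44, 47, 53]
n = len(scores)

x_bar = sum(scores) / n

# Sum of squared deviations from the mean, divided by n - 1.
sum_sq = sum((x - x_bar) ** 2 for x in scores)
s2 = sum_sq / (n - 1)
s = math.sqrt(s2)

# Count the observations within one standard deviation of the mean.
within = sum(1 for x in scores if x_bar - s <= x <= x_bar + s)

print(f"mean = {x_bar:.2f}, s^2 = {s2:.2f}, s = {s:.2f}")
print(f"{within} of {n} scores are within one std. dev. of the mean")
```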
49. Which measurements do I choose? Use “mean and standard deviation” when the data is reasonably symmetric with no outliers. Use “median and IQR” or 5 num. sum. in cases where the “mean and std. dev.” is not appropriate. Remember: “5 num sum” is resistant to outliers, while the “mean and std dev” is not resistant
50. Linear Transformation of Data If every member of a data set is multiplied by a positive number b, then the measures of center and spread are also multiplied by b (the variance is multiplied by b²). If a constant a is added to every member of a data set, then a is added to the measures of center, but the measures of spread remain unchanged.
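Both rules can be checked numerically with the THS 2009 data. In this sketch the values b = 2 and a = 10 are hypothetical, chosen only for illustration.

```python
from statistics import mean, median, stdev

# THS football 2009 scores from the slides.
scores = [14, 19, 20, 20, 27, 28, 30, 32, 42, 44, 47, 53]

b = 2   # hypothetical multiplier
a = 10  # hypothetical added constant

scaled = [b * x for x in scores]   # multiply every member by b
shifted = [x + a for x in scores]  # add a to every member

for label, data in [("original", scores), ("times b", scaled), ("plus a", shifted)]:
    print(f"{label:9s} mean = {mean(data):6.2f}  median = {median(data):5.1f}  "
          f"std. dev. = {stdev(data):6.2f}")
```

Multiplying by b scales the mean, median, and standard deviation by b, while adding a shifts the mean and median by a and leaves the standard deviation alone.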
52. Comparing Data Sets The AP Exam always asks students to compare data. Clearly identify the populations that are being compared Make sure to compare each of CUSS Make reference to the measurement you are comparing i.e. use “mean” and not “center” Give the values of the measurements you are comparing. Make use of comparison phrases “is greater than” “is less than”