B409 W11 Sas Collaborative Stats Guide V4.2

Table of Contents
Numerical Summaries
Variation Within The Data
Confidence Intervals
Simple Regression
Correlation Coefficient
Test of Significance
Limits (Confidence / Prediction)
Appendix

Chapter 01
Numerical Summaries

Team 1
Baljeet Kaur
Trystan McDonald
Jaswant Seahra
Mriseal Sinha
Surbhi Surbhi
Theo Wolski

Introduction

Collecting, processing and presenting data are skills that are widely sought after in today’s business world. To make effective business decisions you must be able to analyse, manipulate and present findings derived from the mining of raw data. Data can be produced in numerical and non-numerical forms. When judging the significance of data, it helps to give the process context: knowing where your data sits (location) and how it is spread (dispersion) can provide valuable insight into your department’s current and future campaigns. Numerical summaries that describe location state the data’s mean, median and mode. Numerical summaries that describe dispersion state its range and standard deviation.
http://www.palgrave.com/business/taylor/taylor1/lecturers/
Numerical Summaries

Definition: A set of numeric data summarized and described by two kinds of measure.

Measure of centrality: Data described by its mean, median and mode.
Measure of spread: Data described by its range, quartile range and standard deviation.
Mean: The arithmetic average of all data.
Median: The middle value of ordered data. Data must be ordinal or interval.
Mode: The most commonly occurring value in a data set.

Terms and Concepts

Mean: The arithmetic average of all data points.
Mean = Σx / n
Example: 3, 7, 5, 13, 20, 23, 39, 23, 40, 23, 14, 12, 56, 23, 29
The sum of these numbers is 330. There are fifteen numbers.
Mean = 330 / 15 = 22

Median: The center value of ordinal or interval data arranged in ascending order.
3, 5, 7, 12, 13, 14, 20, 23, 23, 23, 23, 29, 39, 40, 56
There are 15 values, an odd number, so the median is the single middle value, at position (15 + 1) / 2 = 8. The 8th value is 23, so the median is 23. When the number of values is even, the median is the average of the two middle values.

Mode: The most commonly occurring value in a data set.
3, 5, 7, 12, 13, 14, 20, 23, 23, 23, 23, 29, 39, 40, 56
23 is the mode because it occurs 4 times.

Range: Largest value − smallest value.
Example: 2, 6, 2, 4, 1, 4, 3, 1, 1
6 − 1 = 5

Quartile Range: The range covered by the middle half of the ordered data.
Example: 1, 11, 15, 19, 20, 24, 28, 34, 37, 47, 50, 57
Lower quartile (Q1) = the value at the centre of the first half of the data.
The median of 1, 11, 15, 19, 20, 24 is (third + fourth observations) ÷ 2 = (15 + 19) ÷ 2 = 17
Upper quartile (Q3) = the value at the centre of the second half of the data.
The median of 28, 34, 37, 47, 50, 57 is (third + fourth observations) ÷ 2 = (37 + 47) ÷ 2 = 42
Interquartile range = Q3 − Q1 = 42 − 17 = 25

Standard Deviation: Also known as 'Root Mean Square Deviation', it is calculated by squaring the deviations from the mean, adding them, finding the average of the squared deviations (the variance), and then taking the square root of the result.
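The worked examples above can be checked with a short script. This is only a sketch using Python's standard `statistics` module, not part of the SAS workflow in this guide. The helper `quartiles_by_halves` is a hypothetical name introduced here; it follows the "median of each half" method used in the quartile example, and other software may use a different quartile convention.

```python
import statistics

# Data from the mean, median and mode examples in this chapter.
data = [3, 7, 5, 13, 20, 23, 39, 23, 40, 23, 14, 12, 56, 23, 29]

mean = statistics.mean(data)      # 330 / 15
median = statistics.median(data)  # middle (8th) value of the sorted data
mode = statistics.mode(data)      # most frequently occurring value

# Data from the range example.
range_data = [2, 6, 2, 4, 1, 4, 3, 1, 1]
value_range = max(range_data) - min(range_data)


def quartiles_by_halves(values):
    """Return (Q1, Q3) as the medians of the lower and upper halves.

    Hypothetical helper: for an odd number of values the middle value
    is left out of both halves.
    """
    ordered = sorted(values)
    half = len(ordered) // 2
    lower = ordered[:half]
    upper = ordered[len(ordered) - half:]
    return statistics.median(lower), statistics.median(upper)


# Data from the quartile range example.
quartile_data = [1, 11, 15, 19, 20, 24, 28, 34, 37, 47, 50, 57]
q1, q3 = quartiles_by_halves(quartile_data)
iqr = q3 - q1

# Sample standard deviation (divides by n - 1 before the square root).
sample_sd = statistics.stdev(data)

print("mean:", mean, "median:", median, "mode:", mode)
print("range:", value_range)
print("Q1:", q1, "Q3:", q3, "IQR:", iqr)
print("sample standard deviation:", round(sample_sd, 2))
```

Running it reproduces the mean of 22, the mode of 23, the range of 5 and the interquartile range of 25, and gives a median of 23 for the fifteen values.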
σ = √Variance, where Variance = Σ(x − Mean)² / (n − 1) for a sample of n values.

http://www.mathsisfun.com/median.html
www.palgrave.com/business/taylor

Example (http://hubpages.com/hub/Musical-Terms)

Numeric summaries are to a mathematician what sheet music is to a musician. As we know, numerical summaries include equations for the mean, median, range, quartile range and standard deviation; each of these equations takes data as input for the purpose of analysis. Without the numeric summaries and the equations for these terms one would not be able to determine the desired findings, much as a musician without sheet music would not be able to play his or her instrument. Sheet music is the language that musicians speak; numerical summaries are the language of statisticians.

A bar line separates musical notes into manageable areas, allowing the musician to read where the tempo and notes are going within the song. In statistics, if someone asks you to measure 46 points of data within a particular data set without identifying the centre point (mean or median), you cannot measure the data effectively. As with sheet music, following the building blocks of a process is an essential first step in determining the outcome of the data or the song being played.

[Diagram: a musical staff labelled "Bar line" and "Double bar line". The distance from one bar line to another is called a measure.]

In the above diagram, the bar line is referenced as:
Mean (i.e. mean = Staff / 4 spaces)
Standard deviation (i.e. σ = one bar line − another bar line)

Implementing with SAS

In this tutorial we’ll walk through the Heights data set. It is a rather simple data set, but effective for showing numerical summaries. The data set shown above has three columns: Family, Gender, and Height. Although you can do many things with numerical summaries in SAS using variables that relate to one another, we will work only with the Height variable.
Click Tasks > Describe > Summary Statistics (shown below). A window will pop up. This is the main interface you will use to run numerical summaries. You can do a lot with this function of SAS, but for this exercise we will use only the variable Height. Click it and drag it to the right-hand side, under Analysis variables.

On the left-hand side of the window you will see a tab labelled “Statistics”. Click on Statistics. The window will change to list all the numerical summaries that can be run on the data. Select the ones you need for your research. In this example we used Mean, Standard Deviation, Minimum, Maximum, and Number of Observations. If you want a more visual element in your research, click Plots and pick a graph design. When you’re done, click Run at the bottom of the window.

After you click Run, SAS will process the data and show the results you requested. At the top are the Mean, Standard Deviation, Variance, Minimum, Maximum, Range, and the number of values used to produce these statistics. In this tutorial the average (Mean) of the values in the data set is 66.83, the standard deviation is 2.72, the variance is 7.4, and the range is 9.

Conclusion

With this application of SAS you can make statistical observations and decisions based on numerical summaries, depending on the marketing questions you need to answer in your career.

Chapter 02
Variation Within The Data

Blueprint
Christopher Atkinson
Fredric Ayih
Gauvtam Bajaaj
Danusha Fernando
Paramjeet Kaur

Introduction

Variation is seen in every part of our day-to-day lives, from our home to the workplace to anything in which we can observe a difference. On a daily basis, you see cars of different brands, models, colors and sizes. The very differences in these observations illustrate variance.
When looking at a dataset of all Toyota cars, for example, one can observe that they come at different prices and in different sizes, with different features such as engine size, horsepower and number of cylinders. These differences within a dataset illustrate the concept of variation within data.

What Is Variation?

Data variation measures the spread of data around the mean. It shows the differences in the variables, which may be quantitative as well as qualitative. Two sets of data may have different input values but similar means. Here variation may be observed in the number of inputs, the range of the data, the dispersion of the data, and so on. To measure the amount of variability in a data set we use statistical tools such as variance and standard deviation. Variance is the average of the squared differences between each value and the mean; squaring removes the effect of the sign. The standard deviation is the square root of the variance, which brings the measure back to the scale of the data. Together with the mean, the standard deviation gives a first-level indication of the characteristics of any set of numbers. Standard deviation indicates the degree to which the values are clustered around the mean. A large standard deviation shows that results tend to lie far from the mean. In this way the variation within the data is measured quantitatively. Variation within data can also be represented pictorially using bar charts and other graphs.

What causes variation within the data?

It is necessary to find out whether the variation within the data is a regular event or a random event, so that the results do not come as a surprise. Common causes, such as process inputs and conditions, contribute to regular everyday variation. For example, a known 3% rate of errors in the data still allows statisticians to forecast the temperature within a desired range.
On the other hand, there may be special causes, such as the random occurrence of a temporary event, which create variation within the existing data and make it difficult to work with. For example, a sudden north-east wind may cause a sudden drop in temperature, making the temperature difficult to predict.

Process Flow for Implementation in SAS

To further understand the concept of variance, we will explore and analyze the CARS dataset, which contains a variety of variables such as origin, type, horsepower, number of cylinders and retail price for vehicles sold by dealers. We will start by opening the dataset and familiarizing ourselves with the data and the variables. Following this, we will create several reports to describe the data, identify trends and explain variance within the dataset, using both numerical and alphabetical variables. You will also be given an opportunity to filter the data in order to focus on a smaller set of variables to run reports from.

Creating a Simple Bar Chart:
Open the Cars data table by selecting Servers > Libraries > SASHELP from the Server List. Navigate to the CARS data table and select it. Click Open.

Creating a Bar Chart:
On the menu bar click Tasks and then select Graph to open Bar Chart. The Bar Chart window has five pages: Bar Chart, Data, Appearance, Titles and Properties. In the Bar Chart page, click Simple Vertical Bar (Figure 2.1).

Figure 2.1

To produce a report that shows the frequency of each category of the variable Type, click the variable Type and drag it to the Column to chart role (Figure 2.2).

Figure 2.2

Click Run to run the task and produce the report. To change the title, click Modify Task and give the graph an appropriate title (Figure 2.3).

Figure 2.3

Rerun the task by clicking the Run button.

Figure 2.4

The resulting graph (Figure 2.4) shows the number of cars in the data table by type. There are more sedans than any other type of car, but there are also some SUVs, Sports cars and Trucks. Note that the number of cars differs from one type to the next. This illustrates the concept of variance when it comes to frequency.

Creating a Scatter Plot:
To generate a scatter plot, return to the Cars data set, click Tasks and then select Graph to open Scatter Plot. Select the simple two-dimensional scatter plot in the Scatter Plot page (Figure 2.5).

Figure 2.5

Click Data in the selection pane to assign columns. Drag Horsepower to the Horizontal task role and MSRP to the Vertical task role (Figure 2.6). Rename the titles and click Run.

Figure 2.6

Figure 2.7

This scatter plot (Figure 2.7) displays horsepower against the manufacturer's suggested retail price. Most cars have between 100 and 300 horsepower and are priced below $50,000. Because of the variance in the data, you can observe that certain cars have horsepower values as high as 500 and some cars are priced closer to $200,000. The scatter plot allows you to visualize variation by assigning a point to every car, based on two measurable variables.

Creating a Tile Chart:
Click Tasks and then Graph to open Tile Chart. For this report, click the variable Type and drag it to Classification variable under column roles, drag the variable Invoice to Color analysis variable and drag the variable Horsepower to Size analysis variable (Figure 2.8).

Figure 2.8

Click Titles in the list of options in the selection pane and click Graph. From the drop-down arrow under Tile Layout click Flow layout. In the Titles page of the Tile Chart window give the chart an appropriate name. Click Run.

Figure 2.9

In this chart (Figure 2.9) variance in a data set is expressed through numerical and alphabetical variables (Type, Invoice and Horsepower).
The cars in the data table are arranged into boxes based on their type, and the sizes of the boxes are determined by the total horsepower in each type. Note that sedans do not have the highest horsepower per car, but because the data table contains many more sedans than any other type of car (see the frequency by vehicle type chart), the total horsepower of sedans is higher than that of any other type. This is why the sedan box is the largest and the hybrid box is the smallest. Lastly, the variance in total invoice is illustrated by the color of the box. Note that Sedan is a darker green not because sedans are more expensive, but because there are more sedans than any other car type; hence the total invoice for sedans is much higher.

Filtering Data:
To filter the Cars data table, return to the process flow, click the Tasks tab on the menu bar and select Data to open Filter and Sort. Click and drag all the variables to the Selected pane. To filter the data, click the Filter tab. The Filter page contains four empty boxes. Click the down-arrow on the first box and select Type as the variable. In the second box select Equal to as the criterion from the drop-down list. In the third box click the ellipsis button, select Sports as the value and click OK.

Creating a Stacked Vertical Bar Chart from the Filtered Data:
To generate a stacked vertical bar chart, click the Tasks tab on the menu bar and select Graph to open the Bar Chart window. In the Bar Chart page click Stacked Vertical Bar. In the Data page drag the variable MSRP to Column to chart and Origin to Stack. Give the graph an appropriate name and click Run.

Figure 2.10

Figure 2.10 displays variance within the data on three levels: the manufacturer's suggested retail price, the number of cars (frequency), and the origin of the car. Note that Europe is the only origin where the number of cars at the $90,000 price point is higher than at other price points.
The bulk of the cars manufactured in Asia are at the $30,000 price point, and a little more than half of the cars manufactured in the USA are at the same retail price. The fact that Europe produces the majority of cars above $90,000 can indicate a focus on higher-end vehicles.

To generate and view a stacked vertical bar chart with a different variable, click the Tasks tab on the menu bar and select Graph to open the Bar Chart window. In the Bar Chart page click Stacked Vertical Bar. In the Data page drag the variable MSRP to Column to chart and the variable Cylinders to Stack, and run the report.

Figure 2.11

The above chart (Figure 2.11) displays variance within the data on three levels: the manufacturer's suggested retail price, the number of cars (frequency), and the number of cylinders. Note that the Origin variable has been replaced by the Cylinders variable. The heights of the bars have not changed, and the majority of cars are priced at $30,000. As price increases there are fewer cars with six and eight cylinders available. Cars with four cylinders or fewer are only available at prices below $30,000, and ten or twelve cylinder cars are only available above the $90,000 price point. Note that this picture of variance allows you to identify an outlier: the only car with a price of $180,000 has six cylinders.

Similarly, to generate a chart comparing the variables Engine Size and Cylinders, drag Engine Size to Column to chart and Cylinders to Stack to produce a report on two other variables. Give the graph an appropriate name and click Run.

Figure 2.12

Figure 2.12 displays variance on three levels: the engine size (L), the frequency, and the numbers of cylinders within each engine size. Based on what you have learnt thus far, read the following statements and indicate whether they are (T) TRUE or (F) FALSE.
1. The most common engine size is 3.0 [ ]
2. The most common cylinder size is 6 [ ]
3. There are more 8 cylinder cars with 4.2 engine sizes than there are at 5.4 [ ]
4. There are as many 12 cylinder cars as there are 10 cylinder cars [ ]
5. Across all cylinder sizes, the least common engine size is 7.8 [ ]
6. As you increase engine size, the number of cars with four cylinders increases [ ]

Conclusion

As demonstrated, SAS can organize the variation within data around a specific set of objectives, from the perspective of a specific department such as marketing or of the company as a whole. This allows management to plan future strategies from historical data and to draw conclusions that may help build an overall analysis of the company in the long run. From engine sizes, miles per gallon (city or highway), manufacturer names, types and origins of vehicles, SAS provides a relatively easy way to calculate and visually verify the variation of data within different samples. Through several charts and graphs, one can arrive at decisions that support strategies (e.g. increase sales, decrease production of non-selling vehicles, address under-achieving miles per gallon). Breaking the variation within the data down into simple forms lets us draw conclusions in a simplified and comprehensive manner, and those conclusions are used to create strategies.

Answers for the exercise based on Figure 2.12:
1 – True, 2 – True, 3 – True, 4 – True, 5 – True, 6 – False

Chapter 03
Confidence Intervals

Spice Girls
Alexandra Gonchar
Ellen Guimaraes
Ksenia Knyazeva
Ekaterina Loskutova

What is a Confidence Interval?

In statistics, a Confidence Interval is a particular kind of interval estimate of a population parameter. It is an observed interval, which differs from sample to sample and which frequently includes the parameter of interest (for example the mean of the population) if the experiment is repeated.
How frequently the observed interval contains the parameter of interest is determined by the confidence level, or confidence coefficient. The Confidence Interval is calculated from a sample and contains the value of a certain population parameter with a specified probability; the end-points of the interval are the confidence limits. The specified probability is called the confidence level.

What is the purpose of a Confidence Interval?

To estimate the mean, the standard deviation, and the variance of a population, a random sample is taken from the larger population and a statistic is calculated. It is usually very important to judge how reliable the results provided by the sample are. This is where the Confidence Interval comes in. The Confidence Interval provides a range in which one can be relatively certain that the population mean is located. Therefore, as the name states, a Confidence Interval is used to express the confidence that one can have in the result of a sample.

When are Confidence Intervals most commonly used?

A confidence interval does not say that the true value of the parameter of interest has a particular probability of lying in the interval computed from the data actually obtained. The Confidence Interval lets us estimate the true mean of a population using the results of measurements already taken (sample size, standard deviation, and confidence level). It is used to indicate the reliability of an estimate.

Examples where Confidence Intervals can be used:

Likelihood of certain candidates to be elected

Reactions to certain new products

Survey response rate reliability

Predict results based on previous research

Using a Confidence Interval

An example of how one can arrive at a Confidence Interval is the following. Getting statistics from an entire population may be impossible, information may be correct but outdated, and response rates on surveys may be very low. Because of this, researchers simplify the statistical process by picking a sample of the population of interest, finding answers to their research questions, and trying to estimate the reliability and precision of the results. This reliability estimate is where the Confidence Interval comes in.

For example, let's answer the following question: With 95% confidence, what is the average number of languages spoken by each student at George Brown?

We could ask every student at George Brown, but that would be time-consuming and some students may not answer truthfully. Therefore, a convenient way to answer our question is to pick and analyze a sample that we can work with. This will help us to calculate the Confidence Interval, which will be the answer to our question.

In this case, we will pick a reasonably large proportion of the students in the school, so that the results will be representative of the larger population (we will be using a representative class). Once we have chosen the sample, we need to estimate the range (Confidence Interval) that is likely to contain the mean of the entire population.

Results:
Mean = 2.6 languages per student
Standard deviation = 1.836
(Intervals are calculated from the mean, the standard deviation and the size of the sample.)

By doing the Confidence Interval calculations we arrive at a conclusion: the mean number of languages, with 95% confidence, is between 1.945 and 3.255.

Applying Confidence Intervals to SAS

The Distribution Analysis task produces statistics describing the distribution of a single variable. The next example explores the distribution of the variable Height in the Volcanoes data set.
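As a cross-check on the George Brown languages example earlier in this chapter, the interval can be sketched with the usual normal-approximation formula, mean ± z × standard deviation / √n. This is an illustration, not the calculation the authors necessarily used. The function name `mean_confidence_interval` is hypothetical, z = 1.96 is the standard 95% critical value, and the sample size is not stated in the text, so n = 30 below is purely an assumption, chosen because it gives limits close to the ones quoted.

```python
import math


def mean_confidence_interval(mean, sd, n, z=1.96):
    """Normal-approximation confidence interval for a mean.

    Returns (lower, upper) = mean -/+ z * sd / sqrt(n).
    z = 1.96 corresponds to a 95% confidence level.
    """
    margin = z * sd / math.sqrt(n)
    return mean - margin, mean + margin


# Mean and standard deviation are taken from the languages example.
# ASSUMPTION: the class size is not given in the text; n = 30 is a guess.
lower, upper = mean_confidence_interval(mean=2.6, sd=1.836, n=30)

print(f"95% confidence interval: {lower:.3f} to {upper:.3f}")
```

Under that assumed sample size the result is roughly 1.94 to 3.26, in line with the interval reported above. A smaller sample or a higher confidence level would widen the interval.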
In the Process Flow, click the Volcanoes data icon to make it active. Then select Tasks > Describe > Distribution Analysis. In the Data tab choose the Height variable for analysis. Then in the Distributions tab click Normal. In the Tables tab you can choose all the statistics you would like to explore. We are particularly interested in Basic Confidence Intervals and Basic Measures (Mean, Standard Deviation, and Variance). To calculate confidence intervals we have to specify the confidence level in the drop-down box at the top right. You can choose among 90%, 95%, and 99%. After making your selection click Run.

The resulting report starts with basic statistical measures of the distribution of the variable: mean, median, standard deviation, variance, and range. Another section of the report contains confidence limits assuming normality. This table shows confidence intervals for the main parameters (mean, standard deviation, and variance) at the 95% confidence level.

We can also build a plot to better evaluate the normality of the variable's distribution. Click Modify Task and in the window that opens click the Plots page. You can choose among different appearances. Choose Histogram Plot. Click the Inset page and choose the statistics you would like to include in the plot (for this example we took sample size, sample mean, and standard deviation). Choose the location of this information on the graph and click Run.

From the example we can see that the sample size is 32. The graph shows that the data is approximately normally distributed and that the mean height of the volcanoes is 3113.563. With 95% confidence, the mean volcano height is between 2481.3 m and 3745.9 m.

Chapter 04
Simple Regression

Sukhoi
Amit Bansal
Sheleena Jaria
Kalpesh Patel
Ishan Sangrai
Pranay Sankhe

Introduction to Regression Analysis

In statistical terms, regression is the study of the relationship between variables, so that one may predict the unknown value of one variable from a known value of another variable.
According to the Oxford English Dictionary, the word ‘regression’ means “stepping back” or “returning to average value”. The term was first used in the 19th century by Sir Francis Galton. He found an interesting result by studying the heights of about 1000 fathers and sons. His findings were that (i) sons of tall fathers tend to be tall and sons of short fathers tend to be short, but (ii) the mean height of the tall fathers was greater than the mean height of their sons, whereas the mean height of the sons of short fathers was greater than the mean height of the short fathers. Galton termed this tendency of mankind to return to average height ‘Regression towards Mediocrity’, and the line that shows the trend was named the ‘Regression Line’.

In the words of M.M. Blair, ‘Regression is the measure of the average relationship between two or more variables.’

Regression analysis is used to:

Explain the impact of changes in an independent variable on the dependent variable.

Dependent variable: the variable we wish to predict or explain.
Independent variable: the variable used to explain the dependent variable.

Regression Formula

To calculate the relationship between X and Y we need an equation, the Regression Equation:

Y = a + bX

where X and Y are the variables, b is the slope of the regression line, and a is the intercept of the regression line.

Slope (b) = (nΣXY − (ΣX)(ΣY)) / (nΣX² − (ΣX)²)
Intercept (a) = (ΣY − b(ΣX)) / n

Figure 4.1 shows Simple Regression

As Figure 4.1 shows, the regression line represents the average relationship between two variables. It is also known as the Line of Best Fit. Using the regression line, we can predict the value of the dependent variable for a given value of the independent variable. The regression line of Y on X therefore gives the best estimate of the value of Y for any given value of X.

Steps In Linear Regression

Compute the regression equation.

Examine tests of statistical significance and measures of association.

Relate statistical findings to the hypothesis. Accept or reject the null hypothesis.

Reject, accept or revise the original hypothesis. Make suggestions for research design and management aspects of the problem.

Regression Example

To illustrate Simple Regression, let’s take a simple example where X is Cattle and Y is Cost. The example shows the relationship between the two. First we need a data set. To find the regression equation, we first find the slope and the intercept and then use them to form the equation.

Step 1: Count the number of values (n).
Step 2: Find XY, X² and Y² for each pair of values.
Step 3: Find ΣX, ΣY, ΣXY, ΣX² and ΣY².
ΣX = 116.969; ΣY = 670.575; ΣXY = 5570.426; ΣX² = 1036.087; ΣY² = 32134.66
Step 4: Substitute the values into the slope formula.
Slope (b) = (nΣXY − (ΣX)(ΣY)) / (nΣX² − (ΣX)²) = 1.4086
Step 5: Substitute the values into the intercept formula.
Intercept (a) = (ΣY − b(ΣX)) / n = 26.6211
Step 6: Substitute these values into the regression equation.
Regression Equation: Y = a + bX = 26.6211 + 1.4086X

Suppose we want to know the approximate value of Y for X = 3.437. We substitute the value into the above equation:
Y = a + bX = 26.6211 + 1.4086 (3.437) = 26.6211 + 4.8416 = 31.4627

The above example shows how to find the relationship between two variables by calculating the regression with the steps listed.

Assumptions Of Simple Regression

In theory, there are several important assumptions that must be satisfied if linear regression is to be used. These are:

The relationship between the independent (X) and the dependent (Y) variables is linear.

Errors in prediction of the value of Y are distributed in a way that approaches the normal curve.

Errors in prediction of the value of Y are all independent of one another.

The variance of the errors in prediction of the value of Y is constant regardless of the value of X.

Implementing within SAS

To do the same task in SAS, we need a data set on which to calculate the relationship between the variables. First, open SAS > File > Open > Data.

Figure 4.2 shows how to access Data in SAS

Now select from the computer the data you want to analyze. After you select the data, a window like the one shown below pops up:

Figure 4.3

After selecting the data, go to Graph > Scatter Chart.

Figure 4.4

Click 2D Scatter chart.

Figure 4.5

Figure 4.6 shows Columns to assign different Task Roles

Drag Cattle into Horizontal and Cost into Vertical, then click Run.

Figure 4.7 shows after selection of variables in their Task roles

Figure 4.8 shows Scatter Plot Graph

Now we need to find the relationship between X and Y in SAS. Select the process flow and then double-click the Market data set.

Figure 4.9

Select Analyze > Regression > Linear Regression.

Figure 4.10

Then assign Cost to Dependent variable and Cattle to Explanatory variables, matching Y = Cost and X = Cattle in the worked example.

Figure 4.11

Click Run. The output will contain several graphs, but we focus on only one, which is shown below.

Figure 4.12 shows relationship between Cattle and Cost.

Figure 4.13 shows the window after clicking Process Flow

In SAS, we can modify the output. Right-click on Linear Regression > Modify Linear Regression.

Figure 4.14

The Linear Regression window will pop up. Here we want to put a name in the footer, so click Titles > Footnote.

Figure 4.15

Clear Default text, type your name in place of “the SAS system”, then click Run.

Figure 4.16

Conclusion

Having done the analysis first manually and then with SAS software, we find that the output is the same but the effort required is very different. With SAS software it is easy to get output that would otherwise take many tedious hours.
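For readers who want a middle ground between the manual calculation and the SAS task, the slope and intercept formulas from this chapter can be sketched in a few lines of code. This is an illustration only. The function name `fit_line` and the five data pairs are hypothetical (they are not the Cattle and Cost data, which is not reproduced in this guide); the pairs lie exactly on Y = 1 + 2X so the result is easy to check by eye.

```python
def fit_line(xs, ys):
    """Least-squares line Y = a + bX using the chapter's formulas.

    b = (n*SUM(XY) - SUM(X)*SUM(Y)) / (n*SUM(X^2) - SUM(X)^2)
    a = (SUM(Y) - b*SUM(X)) / n
    """
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)

    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    a = (sum_y - b * sum_x) / n
    return a, b


# HYPOTHETICAL data for illustration: every point satisfies Y = 1 + 2X.
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]

a, b = fit_line(xs, ys)
print(f"Regression equation: Y = {a:.4f} + {b:.4f}X")

# Predict Y for a new X by substituting into the fitted equation.
x_new = 6
y_pred = a + b * x_new
print(f"Predicted Y for X = {x_new}: {y_pred:.4f}")
```

The same function could be applied to the Cattle and Cost columns once they are loaded as two lists of equal length.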
The best thing about the SAS software is that you can make changes at any point in a fraction of a second, whereas by hand you would need to redo the complete calculation. In a nutshell, Simple Regression gives us the relationship between two variables, so that we can predict one value if the other is known, and with SAS software we get the output quickly and free of calculation errors.

Chapter 05
Correlation Coefficient

Fusion
Gaurav Anand
Maninder Kaur
Anil Khurana
Rizwan Maknojia
Bikramjit Singh

Definition

The correlation is one of the most common and most useful statistics. A correlation is a single number that describes the degree of relationship between two variables. It puts a number on whether two numeric variables are related or not, and it ranges from -1 to +1. A correlation of “+1” indicates a perfect positive correlation, meaning that both variables move in the same direction together. A correlation of “-1” indicates a perfect negative correlation, meaning that as one variable goes up, the other goes down.

In mathematical terms, correlation is referred to as “r”. The degree of relationship between variables can be read from the r value as shown in Table 5.1.

Value of r                         Strength of relationship
-1.0 to -0.5 or 0.5 to 1.0         Strong
-0.5 to -0.3 or 0.3 to 0.5         Moderate
-0.3 to -0.1 or 0.1 to 0.3         Weak
-0.1 to 0.1                        None or very weak

Table 5.1 – r value table

Correlation Example

Let’s assume that we want to look at the relationship between two variables, the age of a student and their marks. Perhaps we have a hypothesis that a student’s age affects their marks. We have sample data for 10 students and their marks out of 50.

Age   Marks
25    35
30    48
26    36
24    36
28    45
25    40
31    46
31    40
26    36
25    31

Table 5.2

Based on the data in Table 5.2, the calculated correlation value is approximately r = .76. According to Table 5.1 this indicates a strong positive relationship between the age of the student and their mark: in this sample, the older students tended to get higher marks. A sample of 10 students cannot prove that age causes higher marks, but there is clearly no negative relationship between the two. By contrast, an r value very close to 0 (for example .105) would indicate that there is hardly any relationship between the two variables.

Implementing Within SAS

Let’s see how correlation can be used in SAS Enterprise Guide.

Open SAS Enterprise Guide 4.2.
Open the Class data set from Library > SASHelp. The Class data set has the name of each student along with their Sex, Age, Height and Weight.

Now we will check whether there is any relationship between the Height of the students and their Weight.

On the menu bar at the top, click on Tasks > Multivariate > Correlations as shown in Figure 5.1.

The Correlations window will pop up.

Select and drag Height under Analysis variables and Weight under Correlate with, and click Run.

Figure 5.4 – Correlation output window

In the output in Figure 5.4, “Correlation Analysis” at the top displays the variables whose degree of relationship is being checked. Below that, “Simple Statistics” shows the Mean, Standard Deviation, Minimum Value and Maximum Value for both Height and Weight, where N is the number of students in the class. All these statistics are used to calculate the correlation between the two variables.

In the output displayed above, we can see that the Correlation Coefficient is .877. Thus we can say that there is a strong positive relationship between the height of the students and their weight: taller students tend to weigh more.

Modifying Output

In SAS Enterprise Guide, we can modify the output in different ways. For example, suppose we want to check the correlation between height and weight separately for males and females, and we also want a scatter plot in the output.

Figure 5.5 – Modify correlation path

The Correlations window will pop up. Drag Sex under Group analysis by, as shown in Figure 5.6.

Figure 5.6 – Assigning variables for group analysis

Click Results and check the option Create a scatter plot for each correlation pair.

Figure 5.7 – Result screen of correlation window

Click Titles and edit the analysis title and footnote by un-checking “Use default text”. Click Run, then click Yes to replace the results from the previous run.

Figure 5.8 – Titles & footnotes

In the output shown in Figures 5.9 and 5.10, we have two different results, one for males and the other for females, both with scatter plots. The correlation values are:

Males, r = .85
Females, r = .88

Thus, both males and females have a strong positive relationship between their height and weight.

Figure 5.9 – Correlation output window

Figure 5.10 – Correlation output window

Multiple Correlations

We can also compute multiple correlations at the same time. For example, we will now check the relationships between “Height & Weight” and “Age & Height” in the Class data set. Height is the common variable here; we want to correlate Weight and Age with it. So we put Height under Analysis variables and Age and Weight under Correlate with, because each variable in the “Correlate with” role is correlated with each variable in the “Analysis variables” role.

To do multiple correlations, right-click Correlations > Modify Correlations in the process flow and drag Age into the Correlate with field alongside Weight. Click Run, then click Yes to replace the results from the previous run.

The output in Figure 5.12 displays the correlation between “Weight & Height” and “Age & Height” for males and females separately.

Figure 5.12 – Correlation output

Note: The calculations we have done so far are based on simple correlation. SAS offers further options for calculating correlation in different ways. Right-click Correlations > Modify Correlations in the process flow and click Options to see them, as shown in Figure 5.13.

Figure 5.13 – Correlation options window

You can try the different options to see what results they produce.
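For readers who want to see what the Correlations task computes, the following is a minimal Python sketch, not the SAS implementation. It calculates Pearson's r by hand, labels it using the bands from Table 5.1, and repeats the calculation per group in the way "Group analysis by" does. The function names and the sample rows are hypothetical, and assigning the band boundaries (0.5, 0.3, 0.1) to the stronger category is an assumption, since Table 5.1 leaves the boundary values ambiguous.

```python
from statistics import fmean


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    mx, my = fmean(x), fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def strength(r):
    """Label r with the bands of Table 5.1 (boundary handling is an assumption)."""
    size = abs(r)
    if size >= 0.5:
        return "Strong"
    if size >= 0.3:
        return "Moderate"
    if size >= 0.1:
        return "Weak"
    return "None or very weak"


def correlation_by_group(rows, group_key, x_key, y_key):
    """Compute r separately for each value of group_key."""
    groups = {}
    for row in rows:
        groups.setdefault(row[group_key], []).append(row)
    return {
        name: pearson_r([m[x_key] for m in members], [m[y_key] for m in members])
        for name, members in groups.items()
    }


# Hypothetical rows for illustration only; these are not the SASHelp Class data.
rows = [
    {"sex": "F", "height": 56, "weight": 80},
    {"sex": "F", "height": 60, "weight": 95},
    {"sex": "F", "height": 64, "weight": 105},
    {"sex": "M", "height": 58, "weight": 90},
    {"sex": "M", "height": 63, "weight": 100},
    {"sex": "M", "height": 70, "weight": 130},
]

for sex, r in correlation_by_group(rows, "sex", "height", "weight").items():
    print(sex, round(r, 3), strength(r))
```

Splitting the rows first and then reusing one correlation function mirrors how the task produces a separate output table for each group.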
Chapter 06 Test of Significance

Gotcha: Luz Alvarez, Hasan Can, Michell Escutia, LiLi Xu, Sharon Yang

Tests of significance are statistical tests used to make claims or inferences about the population from which a sample has been drawn. To begin, a null hypothesis H0 and a confidence level must be determined based on a given scenario. H0 represents an assumption that is put forward either because it is believed to be true or because it is to be used as a basis for argument, but that has not been proved. The confidence level sets the estimated range (the confidence interval) calculated from a given set of sample data. The common choices are 0.90, 0.95, and 0.99. These values correspond to the area of the normal curve being covered. The outcome of the test is either “Reject H0” or “Do not reject H0.”

There are different tools available, but we are going to look at the most common ones: t-Test, One-Way ANOVA, Nonparametric One-Way ANOVA, Linear Models, and Mixed Models.

6.1 t-Test

Within the t-Test, there are three different types: Two Sample, Paired, and One Sample. We will walk through each of them based on a given scenario.

To run the t-Test in SAS Enterprise Guide 4.2, open the dataset named marathons.sas.7bdat using File > Open > Data. Once the dataset is open, we can access the t-Test task by clicking Analyze > ANOVA > t-Test. (Figure 6.1)

Figure 6.1: Open a Task

We can also access this task through Tasks > ANOVA > t-Test. (Figure 6.2)

Figure 6.2: Open a Task

t-Test Two Sample

This test evaluates whether or not two independent samples are representative of the same population. It is assumed that each sample is normally distributed and that the two have equal variances. For instance, suppose you want to compare the running times in the New York and Boston marathons. A random sample of 50 observations from the Boston marathon and 100 observations from the New York marathon has been recorded and saved. The variables in the dataset include City and Time (in hours).
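As a rough illustration of what a two-sample comparison calculates, here is a small Python sketch of the two test statistics commonly reported for it: the pooled statistic, which assumes equal variances, and the Satterthwaite statistic, which does not. The function name and the running times are hypothetical and are not taken from the marathons dataset. Turning the statistic into a P-value needs the t distribution, which is left to the statistics software.

```python
import statistics


def two_sample_t(a, b):
    """Return (pooled_t, pooled_df, satterthwaite_t, satterthwaite_df)
    for the null hypothesis that mean(a) - mean(b) = 0."""
    n1, n2 = len(a), len(b)
    m1, m2 = statistics.fmean(a), statistics.fmean(b)
    v1, v2 = statistics.variance(a), statistics.variance(b)

    # Pooled method: both groups are assumed to share one variance.
    pooled_var = ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)
    pooled_t = (m1 - m2) / (pooled_var * (1 / n1 + 1 / n2)) ** 0.5
    pooled_df = n1 + n2 - 2

    # Satterthwaite method: each group keeps its own variance.
    se_squared = v1 / n1 + v2 / n2
    satt_t = (m1 - m2) / se_squared ** 0.5
    satt_df = se_squared ** 2 / (
        (v1 / n1) ** 2 / (n1 - 1) + (v2 / n2) ** 2 / (n2 - 1)
    )
    return pooled_t, pooled_df, satt_t, satt_df


# Hypothetical finishing times in hours, for illustration only.
boston = [3.1, 3.4, 3.3, 3.6, 3.2]
new_york = [4.0, 4.3, 3.9, 4.4, 4.1, 4.2]

pooled_t, pooled_df, satt_t, satt_df = two_sample_t(boston, new_york)
print("Pooled:        t =", round(pooled_t, 3), "df =", pooled_df)
print("Satterthwaite: t =", round(satt_t, 3), "df =", round(satt_df, 2))
```

When the two groups have the same size and the same variance, both methods give the same statistic; they differ more as the group sizes and variances diverge.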
In the new window, click t-Test type. You will find three different types of t-Test; select Two Sample. (Figure 6.3)

Figure 6.3: Select t-Test type

Then click Data to assign the variable that identifies the groups and the variable to be analyzed. Click the variable City and drag it to Classification variables. Then click the variable Time and drag it to Analysis variables. (Figure 6.4)

Figure 6.4: Select variables

Click Analysis on the left menu. Specify the test value for the null hypothesis H0 and the confidence level. Set H0 to 0 because the null hypothesis is that the difference between the two group means is 0. Then set the confidence level to 95%. (Figure 6.5)

Figure 6.5: Set Null Hypothesis and Confidence Level

Click Plots and select the types of plots you need to display in the report. (Figure 6.6)

Figure 6.6: Select plots

After customizing the titles, click Run. (Figure 6.7)

Figure 6.7: Customize titles

The t-Test result is shown below. To decide whether or not to reject the null hypothesis, we can use either the Pooled method for equal variances or the Satterthwaite method for unequal variances. The column labeled t Value gives the t-test statistic, the column labeled DF gives the degrees of freedom, and the column labeled Pr > |t| gives the P-value that has to be interpreted. Since we have assumed that the two samples have equal variances, we use the Pooled P-value, which is < 0.0001. With the 95% confidence level we chose, the significance level is (1 – 0.95) = 0.05. The P-value for equal variances is < 0.0001, which is smaller than 0.05, so we reject the null hypothesis. (Figure 6.8)

Figure 6.8: t-Test Two Sample Results

t-Test Paired

This tests whether or not two matched samples are representative of the same population.
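Because the samples are matched, a paired test works on the difference within each pair rather than on the two columns separately. The Python sketch below shows that calculation under the usual null hypothesis that the mean difference is 0. The function name and the readings are hypothetical and are not values from the blood pressure dataset.

```python
import statistics


def paired_t(first, second):
    """Return (t statistic, degrees of freedom) for matched measurements,
    testing the null hypothesis that the mean difference is 0."""
    diffs = [a - b for a, b in zip(first, second)]
    n = len(diffs)
    mean_diff = statistics.fmean(diffs)
    std_error = statistics.stdev(diffs) / n ** 0.5
    return mean_diff / std_error, n - 1


# Hypothetical diastolic readings for the same five subjects.
baseline = [95, 98, 102, 94, 99]
after_one_month = [90, 96, 95, 93, 92]

t_stat, df = paired_t(baseline, after_one_month)
print("t =", round(t_stat, 3), "df =", df)
```

Working with the differences removes the variation between subjects, which is why a paired design can detect a change that a two-sample comparison of the same numbers might miss.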
Open the dataset named bloodpressure.sas.7bdat in order to examine the effectiveness of a medication in reducing blood pressure. A random sample of individuals with high blood pressure is taken and their diastolic pressure is recorded. The individuals are then placed on medication, and one month later their diastolic blood pressure is recorded again. The dataset contains the following variables: subject, age, baseline blood pressure, and new blood pressure.

In the t-Test window, select Paired. (Figure 6.9)

Figure 6.9: Select t-Test type

Click Data, and then assign the variables Baseline BP and New BP to Paired variables. (Figure 6.10)

Figure 6.10: Select variables

After customizing the titles, click Run. (Figure 6.11)

Figure 6.11: t-Test Paired Results

t-Test One Sample

This test determines whether a sample is representative of a population with a specified mean. Let’s use the same data set (bloodpressure.sas.7bdat) as in the paired test.

Under t-Test type, select One Sample. (Figure 6.12)

Figure 6.12: Choose t-Test type

Under Data, click Age and drag it to Analysis variables. (Figure 6.13)

Figure 6.13: Select variables

After customizing the titles, click Run. See the results below. (Figure 6.14)

Figure 6.14: Results

One-Way ANOVA

The One-Way ANOVA (analysis of variance) test is another way to test hypotheses. It performs an analysis of variance by testing whether or not the means of two or more samples are equal. It assumes that all the samples are drawn from normally distributed populations with equal variance, which is similar to the two-sample t-Test. It is based on the fact that two independent estimates of the population variance can be obtained from the sample data.

Select Analyze > ANOVA > One-Way ANOVA. (Figure 6.15)

Figure 6.15: Open a Task

Click Data and select the dependent and independent variables. In this case, Weight is the Dependent variable and Displacement is the Independent variable.
(Figure 6.16)

Figure 6.16: Select variables

Click Tests and select the tests for equal variance. (Figure 6.17)

Figure 6.17: Tests

Click Means Comparison, and then select the method and confidence level you want to use. We will stay with the 95% confidence level. (Figure 6.18)

Figure 6.18: Comparison

Click Breakdown and select the statistics for the qualitative variables that you want in the report. (Figure 6.19)

Figure 6.19: Breakdown

Click Plots and choose which of the two plot types (Box and Whisker or Means) you want to display in your results. (Figure 6.20)

Figure 6.20: Breakdown

Customize your titles and click Run. See the results below. (Figure 6.21)

Figure 6.21: Results

Nonparametric One-Way ANOVA

This type of test allows you to run nonparametric tests for location and scale when you have a continuous dependent variable and a single independent variable. In statistical inference, or hypothesis testing, the traditional tests are called parametric tests because they depend on the specification of a probability distribution except for a set of free parameters. Parametric tests are said to depend on distributional assumptions; nonparametric tests do not require distributional assumptions. Nonparametric methods are often almost as powerful as parametric methods, even if the data are normally distributed.

Select Analyze > ANOVA > Nonparametric One-Way ANOVA. (Figure 6.22)

Figure 6.22: Open a Task

Click Data and select the dependent and independent variables. (Figure 6.23)

Figure 6.23: Select variables

Click Analysis and select the test scores you want in your results. (Figure 6.24)

Figure 6.24: Analysis Tests

Then click Extract p-values. (Figure 6.25)

Figure 6.25: Extract p-values

Customize your titles and click Run. See the results. (Figure 6.26)

Figure 6.26: Results

Linear Models

The Linear Models task is used to perform an analysis of variance when you have a continuous dependent variable with classification variables, quantitative variables, or both.

Select Analyze > ANOVA > Linear Models.
(Figure 6.27)

Figure 6.27: Open a Task

Click Data and select the dependent variable. (Figure 6.28)

Figure 6.28: Select variables

Click Model Options and select the hypothesis test options that you want in your results. (Figure 6.29)

Figure 6.29: Select model options

Customize your titles and click Run. See the results. (Figure 6.30)

Figure 6.30: Linear Model Results

Mixed Models

The Mixed Models task provides facilities for fitting a number of basic mixed models. These models enable you to handle both fixed effects and random effects in a linear model for a continuous response. Numerous experimental designs produce data for which mixed models are appropriate.

Select Analyze > ANOVA > Mixed Models. (Figure 6.31)

Figure 6.31: Open a Task

Click Data and select the dependent variable and the quantitative variables you want to analyze. (Figure 6.32)

Figure 6.32: Select variables

Customize your titles and click Run. See the results as shown below. (Figure 6.33)

Figure 6.33: Mixed Model Results

Chapter 07 Limits (Confidence / Prediction)

Dean Squad: Eric Plaskacz, Christina Mofid, Edison Nguyen, Marissa Shaver, Alexandra Wackett

Confidence Limits

Definition

A confidence interval is the likely range of the true value. Since there is only one true value, the confidence interval defines a range where it is likely to be. Most often, confidence intervals are calculated at the 95% level and are called 95% confidence intervals. This means that, on average, 95% of such ranges will capture the true population mean, while 5% of them, on average, will not capture the true population mean. Confidence intervals are used because it might not be possible to measure everyone in a given population, simply because of a lack of resources. By using a confidence interval, it is possible to use a sample of the population to calculate a range within which the population value is likely to fall.

Confidence Limits – Confidence limits are the upper and lower boundaries of the interval.
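The statement that 95% of such ranges capture the true mean can be checked with a small simulation. The Python sketch below draws many samples from a population whose mean is known, builds an interval from each sample, and counts how often the interval contains the true mean. It uses the normal approximation for the interval rather than the t distribution, which is an assumption made to keep the example within the standard library; the population values (mean 100, standard deviation 15) and the function names are hypothetical.

```python
import random
import statistics


def z_interval(sample, confidence=0.95):
    """Approximate confidence limits for the mean (normal approximation)."""
    z = statistics.NormalDist().inv_cdf(0.5 + confidence / 2)
    mean = statistics.fmean(sample)
    half_width = z * statistics.stdev(sample) / len(sample) ** 0.5
    return mean - half_width, mean + half_width


def coverage(true_mean=100.0, true_sd=15.0, n=50, trials=2000, seed=1):
    """Share of simulated intervals that contain the true mean."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        sample = [rng.gauss(true_mean, true_sd) for _ in range(n)]
        lower, upper = z_interval(sample)
        if lower <= true_mean <= upper:
            hits += 1
    return hits / trials


print("Share of intervals capturing the true mean:", coverage())
```

The printed share should land close to 0.95. Any single interval either contains the true mean or it does not; the 95% describes how the method performs over many repeated samples.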
Width of Confidence Intervals

Confidence intervals give us a range between an upper and a lower boundary. If the interval is narrow, meaning there is a small difference between the upper and lower boundaries, then the estimate of the true value is precise, which usually reflects a large study. If the confidence interval is wide, we can conclude that the study was most likely small, which means that the estimate of the true value is imprecise.

Prediction Intervals

Definition

A prediction interval is a range that tells you where you can expect to see future observations. These intervals are useful in determining what future values should be, based on present or past data. They are useful to us because they estimate future data points before the information is even collected, as opposed to having to wait to collect it. Since there is additional uncertainty in knowing what future data will be, the prediction interval will always be wider than the confidence interval.

Example in SAS EG

Beer Sales Data

The data shows trends of beer sales and the relationship

This chapter will focus on computing confidence and prediction intervals as well as interpreting the associated output.

How to Complete in SAS EG

As you can see from the raw data, temperature is strongly and positively correlated with beer sales.

If we make a simple line plot before we start computing confidence intervals, it will give us a better sense of the information we’re looking at.

After selecting the first line plot, add High Temp to the horizontal axis (independent variable) and Sales to the vertical axis (dependent variable).

Make sure to change the appropriate titles and footnotes in the Properties tab. Click Run.

The results show that, as temperature increases, beer sales increase as well.

Confidence Limits

To compute confidence limits for month, click Tasks > Describe > Distribution Analysis.

To compute confidence limits on sales, drag Sales to the task roles pane under Analysis variables.

Click Run.

Figure 7.3 Output Analysis

At the 95% level, there is a 5% chance that limits constructed in this way fail to capture the true mean by chance alone.

If more data on beer sales is collected, we expect the confidence interval to become narrower.
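The effect of collecting more data can be seen directly from the formula for the half-width of a confidence interval for a mean, which is the critical value multiplied by the standard deviation and divided by the square root of the sample size. The Python sketch below uses the normal critical value as an approximation; the standard deviation of 10 and the sample sizes are hypothetical and are not taken from the beer sales data.

```python
from statistics import NormalDist


def half_width(sd, n, confidence=0.95):
    """Half-width of a confidence interval for a mean (normal approximation)."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return z * sd / n ** 0.5


# Same spread in the data, increasing sample sizes.
for n in (25, 100, 400):
    print("n =", n, "half-width =", round(half_width(10.0, n), 2))
```

Because the sample size sits under a square root, four times as many observations are needed to halve the width of the interval.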

Limits on other variables, including month and temperature, can be computed by changing the analysis variable accordingly.

Figure 7.4

Appendix

Team Contributions

A simple breakdown by team, showing how the work was distributed among the members:

Team 1 – Numerical Summaries
Definition – Jaswant Seahra, Mriseal Sinha
Example – Baljeet Kaur, Jaswant Seahra, Theo Wolski
Implementing with SAS – Trystan Macdonald, Surbhi Surbhi
Documenting design process – Trystan Macdonald, Theo Wolski
Defining SAS results – Trystan Macdonald
Conclusion – Trystan Macdonald, Theo Wolski, Jaswant Seahra
Compilation – Theo Wolski, Surbhi Surbhi

Blueprint – Variation Within The Data
Definition – Paramjeet Kaur, Christopher Atkinson
Example – Fredric Ayih, Danusha Fernando
Designing within SAS – Danusha Fernando, Christopher Atkinson
Documenting design process – Gauvtam Bajaaj, Frederic Ayih
Defining SAS results – Fredric Ayih, Paramjeet Kaur, Gauvtam Bajaaj
Conclusion – Christopher Atkinson, Gauvtam Bajaaj
Compilation – Danusha Fernando, Paramjeet Kaur

Spice Girls – Confidence Intervals
Definition – Everyone
Example – Everyone
Implementing within SAS – Everyone
Documenting design process – Everyone
Defining SAS results – Everyone
Conclusion – Everyone
Compilation – Everyone

Sukhoi – Simple Regression
Definition – Kalpesh Patel, Ishan Sangrai, Sheleena Jaria
Example – Kalpesh Patel, Ishan Sangrai, Sheleena Jaria
Implementing within SAS – Amit Bansal, Pranay Sankhe
Documenting design process – Amit Bansal, Pranay Sankhe
Defining SAS results – Amit Bansal, Pranay Sankhe
Conclusion – Amit Bansal, Ishan Sangrai
Compilation – Amit Bansal, Ishan Sangrai

Fusion – Correlation Coefficient (r)
Definition – Everyone
Example – Everyone
Implementing within SAS – Everyone
Documenting design process – Everyone
Defining SAS results – Everyone
Conclusion – Everyone
Compilation – Everyone

Gotcha – Test Of Significance
Definition – Hasan Can, Michell Escutia, Lily Xu
Example – Luz Alvarez
Implementing within SAS – Luz Alvarez
Documenting design process – Luz Alvarez, Hasan Can, Michell Escutia
Defining SAS results – Luz Alvarez, Lily Xu, Sharon Yang
Conclusion – Lily Xu, Sharon Yang
Compilation – Luz Alvarez, Lily Xu, Sharon Yang, Hasan Can, Michell Escutia

Dean Squad – Limits (Confidence / Prediction)
Definition – Everyone
Example – Everyone
Implementing within SAS – Everyone
Documenting design process – Everyone
Defining SAS results – Everyone
Conclusion – Everyone
Compilation – Everyone
