- 1. CLA 2 Presentation BUS 606 Advanced Statistical Concepts And Business Analytics Agenda Introduction Multiple linear regression is the most appropriate statistical technique for predicting the outcome of a dependent variable at different values of the predictors (Keith, 2019). The study assessed the relationship between the cost of constructing an LWR plant and the three predictor variables S, N, and CT. We also assessed the association between the two tests used to examine employee performance.
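A multiple regression of the kind described here can be sketched outside the original software. The snippet below fits ordinary least squares with NumPy on synthetic stand-ins for S, N, CT and ln(C); the data and coefficients are illustrative assumptions, not the study's estimates.

```python
import numpy as np

# Synthetic stand-ins for the predictors S, N, CT and the outcome ln(C);
# the generating coefficients below are illustrative, not the study's.
rng = np.random.default_rng(0)
n = 32
S = rng.uniform(400.0, 1200.0, n)
N = rng.integers(1, 15, n).astype(float)
CT = rng.integers(0, 2, n).astype(float)
ln_C = 5.3 + 0.001 * S + 0.012 * N + rng.normal(0.0, 0.3, n)

# Ordinary least squares: design matrix with an intercept column
X = np.column_stack([np.ones(n), S, N, CT])
beta, residuals, rank, _ = np.linalg.lstsq(X, ln_C, rcond=None)
print(beta.round(3))  # estimated intercept and slopes for S, N, CT
```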
- 2. Assumption of Regression Analysis Multicollinearity Multicollinearity is the condition in which the predictor variables are highly correlated (Alin, 2010). Correlation Analysis 4 Assumption of Regression Analysis Cont’ Normality test The normality assumption is not violated after transforming the outcome variable C using the natural log, ln(C) (Shapiro-Wilk = 0.967, p = 0.414). 5 Results and Discussion – Regression Analysis Use Residual Analysis and R2 to Check Your Model
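The log-transform-then-test workflow on this slide can be reproduced with SciPy. The data below is a hypothetical lognormal stand-in for the 32 observed costs C, not the study's data.

```python
import numpy as np
from scipy import stats

# Hypothetical stand-in for the 32 observed plant costs C
rng = np.random.default_rng(42)
C = rng.lognormal(mean=6.0, sigma=0.4, size=32)

ln_C = np.log(C)            # the transformation used in the study
W, p = stats.shapiro(ln_C)  # Shapiro-Wilk test on the transformed outcome

# p > 0.05 means we fail to reject normality, as reported on this slide
print(f"W = {W:.3f}, p = {p:.3f}")
```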
- 3. The R-Squared of 0.232 indicates that the model explains about 23.2% of the variation in ln(C). This low R-Square indicates that the model does not fit the data well (Brown, 2009). 6 Results and Discussion Cont’ State which variables are important in predicting the cost of constructing an LWR plant. S is a significant predictor of ln(C) (p = 0.021), but N and CT have no significant effect (p > 0.05). 7 Results and Discussion Cont’ State a prediction equation that can be used to predict ln(C). After dropping N and CT from the model, since they do not have a significant effect in predicting ln(C), the prediction equation is given by:
- 4. Does adding CT improve R2? If so, by what amount? Adding CT to the model changes R-Square by 0.001, from 0.232 to 0.234, which is not significantly different from zero (p = 0.822). 8 Results and Discussion Cont’ – Correlational Analysis Evaluate the correlation between the two scores and state if there seems to be any association between the two. There was a weak positive correlation between the two tests (r = 0.187), suggesting that the two test scores were essentially uncorrelated. 9 Results and Discussion Cont’ Find the probability of upgrading for each division of the sample by Bayes’ theorem. P(Up | T1) = P(T1 | Up) P(Up) ÷ P(T1) = (23/46 × 46/86) ÷ (43/86) = 23/43 P(Up | T2) = P(T2 | Up) P(Up) ÷ P(T2) = (23/46 × 46/86) ÷ (43/86)
- 5. = 23/43 10 Results and Discussion Cont’ Find the probability of upgrading for each division of the sample by the naïve version of Bayes’ theorem. P(Up | T1) = P(T1 | Up) P(Up) ÷ P(T1) = (23/46 × 46/86) ÷ (43/86) = 23/43 P(Up | T2) = P(T2 | Up) P(Up) ÷ P(T2) = (23/46 × 46/86) ÷ (43/86) = 23/43 11 Results and Discussion Cont’ Compare your results in parts b and c and explain the difference or indifference based on observed probabilities. The naïve version and Bayes’ theorem give identical probabilities because we have only one predictor in each sample division: naïve Bayes is Bayes’ theorem applied with an assumption of independence between the predictor features, and with a single feature that assumption changes nothing (Webb, 2010). 12
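The Bayes arithmetic on these slides can be checked with exact fractions. The counts below are those implied by the slide fractions: 86 employees in total, 46 upgraded, 43 in each test group, and 23 upgraders in each test group.

```python
from fractions import Fraction as F

# Probabilities implied by the slide fractions
p_up = F(46, 86)           # P(Up)
p_t1 = F(43, 86)           # P(T1)
p_t1_given_up = F(23, 46)  # P(T1 | Up)

# Bayes' theorem: P(Up | T1) = P(T1 | Up) * P(Up) / P(T1)
p_up_given_t1 = p_t1_given_up * p_up / p_t1
print(p_up_given_t1)  # 23/43, matching the slide
```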
- 6. Conclusion and Recommendations – LWR Plant Most of the variation in the outcome variable is explained by variables not included in the model. Further analysis indicated that the S predictor had a significant effect in predicting the cost of constructing an LWR plant (C); N and CT did not. The N and CT predictors should therefore be dropped from the model. 13 Conclusion and Recommendations – Employee Performance The analysis also indicated that the two tests were not related to each other, but the first test had better discriminating ability than the second. This suggested that the first test was the better choice for predicting whether employees will be unsuccessful or successful in the position. 14 References Alin, A. (2010). Multicollinearity. Wiley Interdisciplinary Reviews: Computational Statistics, 2(3), 370-374. Brown, J. D. (2009). The coefficient of determination. Daoud, J. I. (2017, December). Multicollinearity and regression analysis. In Journal of Physics: Conference Series (Vol. 949, No. 1, p. 012009). IOP Publishing. Keith, T. Z. (2019). Multiple regression and beyond: An introduction to multiple regression and structural equation modeling. Routledge. Puth, M. T., Neuhäuser, M., & Ruxton, G. D. (2014). Effective use of Pearson's product–moment correlation coefficient. Animal Behaviour, 93, 183-189. Webb, G. I. (2010). Naïve Bayes. Encyclopedia of Machine Learning, 15, 713-714.
- 7. Tests of Normality
                Kolmogorov-Smirnov(a)         Shapiro-Wilk
          Statistic    df    Sig.       Statistic    df    Sig.
  ln_C      .104       32    .200*        .967       32    .414
  *. This is a lower bound of the true significance.
  a. Lilliefors Significance Correction

  Model Summary
  Model   R       R Square   Adj. R Square   Std. Error   R Sq. Change   F Change   df1   df2   Sig. F Change
  1       .482a     .232        .179          .34240          .232         4.385     2     29       .022
  2       .483b     .234        .151          .34814          .001          .052     1     28       .822
  a. Predictors: (Constant), N, S
  b. Predictors: (Constant), N, S, CT
- 8. Coefficients(a)
  Model           B       Std. Error   Beta       t       Sig.   Tolerance    VIF
  1  (Constant)  5.300      .277                19.161    .000
     S            .001      .000       .406      2.447    .021     .963      1.039
     N            .012      .010       .193      1.164    .254     .963      1.039
  2  (Constant)  5.294      .283                18.718    .000
     S            .001      .000       .403      2.385    .024     .958      1.044
     N            .011      .010       .189      1.110    .276     .950      1.053
     CT           .028      .125       .038       .227    .822     .978      1.022
  a. Dependent Variable: ln_C
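Taking the Model 1 coefficients from the table above (intercept 5.300, S slope .001, N slope .012), a prediction for ln(C) can be sketched as follows; the S and N inputs are made-up values for illustration only.

```python
import math

# Prediction from Model 1 in the coefficients table:
# ln(C) = 5.300 + 0.001*S + 0.012*N
def predict_ln_cost(S, N):
    return 5.300 + 0.001 * S + 0.012 * N

S, N = 1000, 10        # made-up inputs, not from the study
ln_c = predict_ln_cost(S, N)
cost = math.exp(ln_c)  # back-transform to the original cost scale
print(round(ln_c, 3))  # 6.42
```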
- 9. Mohammed Alsaadi Maria Claver Gern 400 6/11/2021 Life Review: Proposal/Script In life, every individual goes through unique experiences during the different stages of their lives. It is through these experiences that one is able to develop personal perspectives and judgments and to acquire personal strengths and wisdom. Much of what an individual is today is determined by the physical, emotional, and mental experiences, challenges, and hardships that they have gone through in life. For this proposal, I plan to interview my elderly neighbour Salim, whom I have always been in awe of and who has always fascinated me with his dinner conversation during his visits. He is 67 years old, and we have been neighbours for more than 20 years. He and his family are very nice; I have spent so much time with them that they have become like family. The interview should be very comfortable and easy-going, since we have known each other for quite a long time. Furthermore, this is a great chance to talk to him and learn more about his
- 10. experience growing up. I hope to learn about his family background, his childhood, his career and retirement, and all the significant experiences and events that have taken place in his life. Through this interview, I expect to find out how one’s life experiences can shape a person’s mental outlook and physical condition in the course of growing old. Apart from acquiring information in this interview, I also intend to apply the lessons and knowledge that I have accumulated in life. The questions in this interview will contain components of the biopsychosocial model, which entails the biological, psychological, and social aspects that make up the reality of life (Bolton & Gillett, 2019). To conclude, I am hopeful of gaining a better understanding of how these three aspects have influenced his life and how they have affected both his mental and physical health. Biological Questions 1. When were you born? Do you know the date? 2. Can you describe your family background? 3. What was your childhood like, and how were you brought up? 4. Who were your friends growing up, and how did they influence you? Are they still present in your life? 5. What are your opinions on the physical aspects of your life and the changes to your body? 6. Do you suffer from any health complications, and if so, do you need assistance with them? 7. What is your current diet, and how does it relate to your health? Psychological Questions 1. How do you perceive yourself, and how would you describe your life experiences? 2. What are some of the challenges that you have experienced in your life, and how have they influenced you? How were you able to overcome these situations, and how did they affect your life and the person that you are today?
- 11. 3. What was your coping mechanism for overcoming the stress, anxiety, and frustration caused by the challenges in your life? 4. Would you change the past given an opportunity to go back in time, and what would you do differently? 5. What do you think is most different today from when you were growing up? 6. How have technology and telecommunications affected you? Do you find it easy or hard to cope with this era of technology? Social Questions 1. What is your support system: family or social support? How has it changed your experience? 2. Do you have a written will or testament, and what does it entail? What are your views on nursing homes or specialized facilities? 3. Are you happy with the life you have led, and do you feel happier, sadder, or depressed as you have advanced in age? 4. What do you do in your spare time for fun? What do you find most entertaining? 5. What is your legacy? What do you think you will be remembered for? 6. What positive or negative impacts will you leave on the world? Will you leave the world a better place than you found it? Works Cited Bolton, Derek, and Grant Gillett. The Biopsychosocial Model of Health and Disease: New Philosophical and Scientific Developments. Springer, 2019.
- 12. The Biology of Aging. Alazhar Alsaadi CSULB Dr. Maria Claver 6/11/2021 Study of Aging. Abstract. I had the opportunity to interview Mrs. Fatima Al-Saadi, with whom I live in the same house. She suffers from back pains. I had the chance to assess her life, the challenges facing her, and how she deals with them. Further, we assessed the cause of her condition and how it is being managed.
- 13. Proposal. Mrs. Fatima, 75 years old, is my grandmother, and we stay in the same house. I am entrusted with the duty of caring for her. She is of medium build and has a condition of the backbone; as a result, it does not support her body well, and all her movement is through the use of a wheelchair. Before conducting the interview, she must be prepared, to give her ample time to recollect herself and be ready for the questions. After preparation, the interview will take place over a period of two days: one session on the first day of the week, after which she will be allowed to rest, and the other on the third day of the week. The interviews will take place in the living room, which is where she receives and entertains guests, and will be conducted in the mid-morning. The timing will ensure that she is not too exhausted to give accurate answers. The reason I chose her as the subject of the interview was the stories that she narrated about herself as a child. She was a strong and hardworking lady before an illness at the age of 45 affected her spinal cord; since then, she has never been able to walk again. This aroused a curiosity within me as I tried to figure out the cause of her medical condition and its effect on her health and her perception of life. Every time I look at her, vibrant in her wheelchair, it becomes harder to imagine her without it. Because of my curiosity to know what happened to my grandmother, the comfort level of this assignment is very high. Further, she has accepted her condition and moved on and is an inspiration to us, so I was comfortable choosing her as the subject of this assignment. Besides, as a result of the bond between us, we are free with each other, making the interview easier to conduct. By the time I complete this assignment, I hope to have learned the cause of her illness, whether it can be cured, and whether it
- 14. is hereditary and, if so, how to protect future generations. The interview theory that I shall use is the biology of aging theory, because it deals with the human body and the damage that results from how it has been programmed. Script. Interview questions. 1. Can you elaborate on your life prior to the illness? 2. Are there activities that you used to carry out regularly before the illness? 3. Was there anything that happened that could have triggered the illness? 4. What were the initial signs and symptoms of the condition? 5. Were you bullied? 6. How do you explain the illness to your friends and relatives? 7. Does being questioned about the illness in any way make you uncomfortable? 8. Do you feel as if you differ from other people? 9. What is the most challenging thing about being disabled? 10. What is your perception of self? 11. Do people treat you differently when they visit than they once did? 12. Do you ever wish you were able to walk again? 13. Are the medication expenses a burden to you? 14. What does it feel like to be on medication on a daily basis? 15. How are you able to be so optimistic about life in your condition? 16. Was the illness by any chance preventable? If it was, how would you advise future generations to protect themselves?
- 15. LOGISTIC REGRESSION AND DISCRIMINANT ANALYSIS NAME: INSTRUCTOR: DATE: Part I Logistic Regression The age and gender of guests in a nursing home were examined to determine whether they predicted guest deaths in 2015. Data were collected on gender, age, and whether the guest died or not. In this case, death is the dependent variable, while age and gender are the independent variables. Since the dependent variable "died" was categorical with two levels, logistic regression analysis was suitable for prediction in this study (Austin & Merlo, 2017). The assumption of a dichotomous dependent variable was met, with "died" taking two values: 0) No, 1) Yes. The assumption of one or more predictor variables was also met: age was quantitative, reporting the respective ages of the guests, and gender was categorical with two levels, 0) Female and 1) Male. Analysis The collected data were analyzed to examine the relationship between the predictor variables and the binary
- 16. dependent variable. A sample of 284 guests was used for this study for ease of analysis and generalization. The overall logistic regression model is given as ln(p/(1 − p)) = β0 + β1(Gender) + β2(Age), where p is the probability of death. Table 1 shows the total number of participants and the valid sample that was utilized in this study. Table 1: Logistic regression summary The analysis showed that there were 144 successes and 140 failures. According to the results, the overall model was statistically significant, with χ2(2) = 82.46, p < 0.001 (Warner, 2012), which implies that we can carry on with the analysis. Gender and age were both statistically significant and contributed to the variation in deaths. Gender was significant with b = 1.96, OR = 7.08, p < 0.05, implying it had an impact on deaths; age was also significant with b = 0.196, OR = 1.22, p < 0.05. According to the odds ratios, the likelihood of dying is 7.08 times higher in males than in females, and each additional year of age multiplies the odds of dying by 1.22 (Norton et al., 2018). The logistic regression equation for predicting the death of a person given age and gender is therefore ln(p/(1 − p)) = b0 + 1.96(Gender) + 0.196(Age), with b0 the estimated intercept. Part II Discriminant Analysis Two tests were developed in a firm to determine whether employees will perform well in a given position. A sample of 43 employees was examined. The main aim is to group employees as either successful or unsuccessful by using the given tests. Discriminant analysis suits this case since exclusive grouping was required and the dependent variable was categorical with two groups, 0) Unsuccessful and 1) Successful (Bowerman et al., 2019). The two independent variables used in this study (Test 1 and
- 17. Test 2) were quantitative, reporting the scores of the employees on the two tests. Analysis Discriminant analysis was carried out in SPSS to classify the employees as successful or unsuccessful based on the two tests. Descriptive statistics were as shown in Table 2.

Table 2: Descriptive statistics (Group Statistics)
  Group                     Mean      Std. Deviation   Unweighted N   Weighted N
  Unsuccessful   Test1    84.7500        4.24109            20          20.000
                 Test2    79.1000        4.38778            20          20.000
  Successful     Test1    92.4348        3.47492            23          23.000
                 Test2    84.7826        6.23740            23          23.000
  Total          Test1    88.8605        5.43175            43          43.000
                 Test2    82.1395        6.10847            43          43.000

- 18. The mean of Test 1 in the unsuccessful group was 84.75, while that of Test 2 in the unsuccessful group was 79.10. The means of Tests 1 and 2 in the successful group were 92.43 and 84.78, respectively. Table 3 shows the importance of the independent variables in the discriminant function used to group the employees.

Table 3: Tests of Equality of Group Means
           Wilks' Lambda      F       df1   df2   Sig.
  Test1        .490        42.644      1     41   .000
  Test2        .780        11.593      1     41   .001

- 19. According to the analysis, both test scores were statistically significant for the discriminant function. Table 4 shows the correlation matrix of the predictor variables.

Table 4: Correlation matrix (Pooled Within-Groups)
           Test1   Test2
  Test1    1.000    .187
  Test2     .187   1.000

According to the analysis, the correlation between the scores of Test 1 and Test 2 was r = 0.19. This is a weak positive relationship, implying the independent variables are not strongly correlated. The assumption of multivariate normality was examined, and the test results were as shown in the Box's M statistics given in Table 5 (Ul Hassan et al., 2017).

- 20. Table 5: Homogeneity of covariance matrices (Box's M)
  Box's M            5.014
  F    Approx.       1.582
       df1           3
       df2           936960.353
       Sig.          .191
  Tests the null hypothesis of equal population covariance matrices.

According to the analysis, the groups did not differ in their covariance matrices, implying that the assumption is not violated and the analysis can continue. According to Table 6, one discriminant function was found, given the two-group dependent variable.

Table 6: Canonical discriminant function (Eigenvalues)
  Function   Eigenvalue   % of Variance   Cumulative %   Canonical Correlation
  1            1.161a         100.0           100.0               .733
  a. First 1 canonical discriminant functions were used in the analysis.

- 21. The strong positive canonical correlation implies that there was a strong association between the discriminant function and the dependent variable (Uurtio et al., 2017). Table 7 shows the coefficients of the independent variables.

Table 7: Standardized Canonical Discriminant Function Coefficients
           Function 1
  Test1      .885
  Test2      .328

According to the analysis, Test 1 had better discriminating ability than Test 2. This implies that Test 1 is very significant in predicting whether employees will be successful or unsuccessful in the position. Table 8 shows the unstandardized canonical coefficients of the model.

Table 8: Canonical Discriminant Function Coefficients (unstandardized)
              Function 1
  Test1          .230
  Test2          .060
  (Constant)  -25.380

The discriminant equation becomes: D = -25.38 + 0.23*Test 1 + 0.06*Test 2. Table 9 shows the classification of the given variables.
- 22. Table 9: Classification Results(a,c)

                                        Predicted Group Membership
                          Group        Unsuccessful   Successful    Total
  Original         Count  Unsuccessful      16             4          20
                          Successful         5            18          23
                   %      Unsuccessful     80.0          20.0       100.0
                          Successful       21.7          78.3       100.0
- 23. Cross-validated(b)
                   Count  Unsuccessful      16             4          20
                          Successful         5            18          23
                   %      Unsuccessful     80.0          20.0       100.0
                          Successful       21.7          78.3       100.0
  a. 79.1% of original grouped cases correctly classified.
  b. Cross validation is done only for those cases in the analysis. In cross validation, each case is classified by the functions derived from all cases other than that case.
  c. 79.1% of cross-validated grouped cases correctly classified.

The analysis showed that 80% of the unsuccessful employees were correctly classified as unsuccessful, while 20% were misclassified as successful. Likewise, 78.3% of the successful employees were correctly classified as successful, while 21.7% were misclassified as unsuccessful. Overall, 79.1% of cases were
- 24. correctly classified. References Austin, P. C., & Merlo, J. (2017). Intermediate and advanced topics in multilevel logistic regression analysis. Statistics in Medicine, 36(20), 3257-3277. Warner, R. M. (2012). Applied statistics: From bivariate through multivariate techniques. Sage Publications. Norton, E. C., Dowd, B. E., & Maciejewski, M. L. (2018). Odds ratios—current best practice and use. JAMA, 320(1), 84-85. Bowerman, B., Drougas, A. M., Duckworth, A. G., Hummel, R. M., Moniger, K. B., & Schur, P. J. (2019). Business statistics and analytics in practice (9th ed.). McGraw-Hill. Ul Hassan, E., Zainuddin, Z., & Nordin, S. (2017). A review of financial distress prediction models: Logistic regression and multivariate discriminant analysis. Indian-Pacific Journal of Accounting and Finance, 1(3), 13-23. Uurtio, V., Monteiro, J. M., Kandola, J., Shawe-Taylor, J., Fernandez-Reyes, D., & Rousu, J. (2017). A tutorial on canonical correlation methods. ACM Computing Surveys (CSUR), 50(6), 1-33.
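The discriminant function reported in this paper (D = -25.38 + 0.23*Test 1 + 0.06*Test 2) can be applied directly to a pair of test scores. The sketch below assumes a cutoff of 0, roughly the midpoint of the two group centroids implied by the group means; the example scores are made up.

```python
# Unstandardized discriminant function from the paper:
# D = -25.38 + 0.23*Test1 + 0.06*Test2
def discriminant_score(test1, test2):
    return -25.38 + 0.23 * test1 + 0.06 * test2

# Cutoff of 0 is an assumption: approximately midway between the
# centroids computed from the unsuccessful and successful group means.
def classify(test1, test2, cutoff=0.0):
    return "Successful" if discriminant_score(test1, test2) > cutoff else "Unsuccessful"

print(classify(90, 85))  # Successful   (scores near the successful means)
print(classify(83, 78))  # Unsuccessful (scores near the unsuccessful means)
```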
- 25. Introduction In this paper, I will use the cardUpgrade dataset to run the K nearest neighbor algorithm in JMP and predict the likelihood of a customer upgrading to platinum status or not. This dataset has three attributes: a) UpGrade – this column contains categorical data represented in nominal form. b) Purchases – a numeric column that describes the monetary value of purchases made by each customer. c) PlatProfile – also categorical data, used at the time the customers signed up to evaluate whether they fit the profile for a platinum member or not. K Nearest Neighbors (KNN) The K Nearest Neighbors algorithm is a supervised, non-parametric, lazy learning algorithm applicable to either classification or regression tasks (Okfalisa et al., 2017). First, a supervised machine learning algorithm relies on labeled training input to produce a desired output given non-labelled data. Secondly, non-parametric means that the algorithm makes no assumptions about the structure of the data; any model built on it relies entirely on the training data it is fed. Thirdly, the algorithm is a lazy learner, meaning that it does not make any generalization, since the training is minimal. In essence, as opposed to most machine learning algorithms, the training data used in a KNN model is also used to test the model. KNN classifies a single data point by comparing it to the points it is closest and most similar to. It
- 26. assumes that similar items exist in close proximity. KNN is apt for different ML tasks, including decision-making (as in this task), recommender systems, and image recognition. For this particular task, I will use JMP to apply KNN to a customer dataset. The model will compare the initial customer profile attribute and purchase history against the current upgrade status to predict whether other customers would be willing to upgrade or not. Because this is a classification task, the output should be a discrete value that shows whether a customer will upgrade or not. There is no middle ground, which is why the values are binary, that is, 0 or 1. The model will have two predictors and a label. The output from this model will also be nominal, which means it represents the upgrade status of an individual. While the output will be in the form of numerical values (0 or 1), these numbers are only representational and have no mathematical meaning (Ghattas et al., 2017). KNN Scheme in JMP First, I imported the Excel dataset into JMP. The application treats the two categorical columns as numeric, so they must be set back to nominal in JMP. While this step is rather limited, it is a form of data cleaning when creating a machine learning model. Other data cleaning tasks typically include identifying and removing duplicate records/observations and finding ways to handle missing values. The second step is to conduct an exploratory data analysis (EDA). According to Jebb et al. (2017), this stage helps to decide the algorithm and variables to be used. EDA also involves data visualization, which gives a cursory insight into the observations. For instance, below is a bubble plot of UpGrade status against Purchases. It shows that those who did not upgrade are concentrated on the lower end of purchases, while those who did upgrade made higher purchase volumes. It also shows some outliers which should be handled.
I have chosen to ignore these outliers because there is only one for each class, meaning they would have minimal impact on the model. Additionally, the dataset contains only forty observations, so there is very
- 27. little room for cropping out other observations. I also visualized the distribution of the UpGrade column to determine the class percentages. The graph shows that the split is almost even (55% non-upgrades and 45% upgrades). Splitting the Data For most machine learning modeling tasks, the data should only be split into training and testing sets. The train-test approach is used to avoid evaluating the model on the training dataset, as this would result in a biased score. Kuhn & Johnson (2013, p. 67) state that it is pragmatic for the model to be evaluated on data that was not used to either build or fine-tune it. Some algorithms, including KNN, on the other hand, produce the best results when the data is split into three sets, including a validation set. This third set is often seen as another set of test data, but it is used to tune the hyperparameters of the model. However, Touvron et al. (2020) elaborate that splitting the data into two or three sets only produces optimal results when the dataset is large enough that each class that could potentially be observed is included in each set. Since this dataset only has forty observations, it cannot be split three ways without compromising model performance. It is, instead, split into training and test sets at a ratio of 4:1. This gives 32 observations for training and 8 observations for testing. The KNN algorithm also expects a k value, which is typically selected as the square root of the total number of observations. When there are only two classes to be predicted, the standard practice is to pick an odd number to avoid a tie during majority voting. For this reason, I have picked 7 as the value of k. Confusion Matrix Interpretation
The confusion or error matrix is used to enhance the understanding of a classification task. Simply reporting the accuracy of the model does not fully capture the performance of the algorithm (Okfalisa et al., 2017). The precision and recall scores are calculated from the output of the confusion matrix. In the matrix, each prediction (positive or negative) is compared against the actual value (true or false). In the end, there are four possibilities:

a) True Positive – values the model predicted as positive that are actually positive.
b) True Negative – values the model predicted as negative that are actually negative.
c) False Positive – values predicted as positive that are actually negative.
d) False Negative – values predicted as negative that are actually positive.

In this case, these values would be interpreted as follows:

a) True Positive (TP) – customers who did upgrade to platinum and whom the model correctly predicted to have upgraded.
b) True Negative (TN) – customers who did not upgrade and whom the model correctly predicted not to have upgraded.
c) False Positive (FP) – customers predicted to upgrade who did not actually upgrade.
d) False Negative (FN) – customers predicted not to upgrade who actually did upgrade.

The matrix for this model is as in the image below. It shows 2 TPs, 5 TNs, 1 FP, and 0 FNs, so only one of the eight test predictions is incorrect. This model, therefore, has an accuracy score of (2 + 5) / 8 = 87.5%.

PART 2

Using the Bayes theorem formula as indicated below, we can calculate the probability that a customer who makes purchases above 32.450 will also be likely to upgrade to platinum (Rouder & Morey, 2018).
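The arithmetic in this part can be checked with a short script. The confusion-matrix counts below are the combination consistent with the reported 87.5% accuracy (2 TP, 5 TN, 1 FP, 0 FN), and the probabilities are the ones quoted in this report, not newly derived:

```python
# Confusion-matrix metrics, assuming the counts consistent with the
# reported 87.5% accuracy: 2 TP, 5 TN, 1 FP, 0 FN.
tp, tn, fp, fn = 2, 5, 1, 0

accuracy = (tp + tn) / (tp + tn + fp + fn)   # fraction of correct predictions
precision = tp / (tp + fp)                   # correct positives among predicted positives
recall = tp / (tp + fn)                      # correct positives among actual positives

print(f"accuracy:  {accuracy:.3f}")   # 0.875
print(f"precision: {precision:.3f}")
print(f"recall:    {recall:.3f}")

# Bayes' theorem with the probabilities quoted in Part 2:
# P(A) = P(upgrade), P(B) = P(purchases >= 32.450), P(B|A) as reported.
p_a = 0.45
p_b = 0.525
p_b_given_a = 0.9254

# P(A|B) = P(B|A) * P(A) / P(B)
p_a_given_b = p_b_given_a * p_a / p_b
print(f"P(A|B) = {p_a_given_b:.4f}")  # ~0.7932, matching the 79.32% in the notations
```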
Below are the notations for the probabilities:

P(A) – the probability that a customer upgrades to platinum (45%)
P(B) – the probability that a customer makes purchases equal to or more than 32.450 (52.5%)
P(A|B) – the probability that a customer upgrades given that he makes purchases equal to or above 32.450 (79.32%)
P(B|A) – the probability that a customer makes purchases equal to or above 32.450 given that he has upgraded.

Thus P(B|A) = .9254, that is, 92.54%, which is higher than the KNN model's accuracy of 87.5%. Therefore, using the Bayes theorem formula, the estimate of a customer's likelihood of upgrading has been improved.

References

Bowerman, B., Drougas, A. M., Duckworth, A. G., Hummel, R. M., Moniger, K. B., & Schur, P. J. (2019). Business statistics and analytics in practice (pp. 186–189). McGraw-Hill Education.

Ghattas, B., Michel, P., & Boyer, L. (2017). Clustering nominal data using unsupervised binary decision trees: Comparisons with the state of the art methods. Pattern Recognition, 67, 177–185. https://doi.org/10.1016/j.patcog.2017.01.031

Jebb, A. T., Parrigon, S., & Woo, S. E. (2017). Exploratory data analysis as a foundation of inductive research. Human Resource Management Review, 27(2), 265–276. https://doi.org/10.1016/j.hrmr.2016.08.003

Kuhn, M., & Johnson, K. (2013). Applied predictive modeling (p. 67). Springer Science & Business Media. https://books.google.com/books/about/Applied_Predictive_Modeling.html?id=xYRDAAAAQBAJ&source=kp_book_description

Okfalisa, Gazalba, I., Mustakim, & Reza, N. G. I. (2017). Comparative analysis of k-nearest neighbor and modified k-nearest neighbor algorithm for data classification. 2017 2nd International Conferences on Information Technology,
Information Systems and Electrical Engineering (ICITISEE). https://doi.org/10.1109/icitisee.2017.8285514

Rouder, J. N., & Morey, R. D. (2018). Teaching Bayes' theorem: Strength of evidence as predictive accuracy. The American Statistician, 73(2), 186–190. https://doi.org/10.1080/00031305.2017.1341334

Touvron, H., Vedaldi, A., Douze, M., & Jégou, H. (2020). Fixing the train-test resolution discrepancy. ArXiv:1906.06423 [Cs]. https://arxiv.org/abs/1906.06423

Merging and Acquisition

Introduction
Some of the factors determining merger and acquisition activities in retailing were examined in this paper to aid in decision making that has a positive impact on a firm. The characteristics of firms that were targeted for acquisition and of the firms that were willing to make acquisitions were looked into. The sales growth rates of target firms and of bidders were tested to enable management to make acquisition decisions wisely (Christofi et al., 2017).

Growth rate of sales for target firms

The growth rate of sales for firms targeted for acquisition was tested using normal distribution tests. A sample of 25 firms was collected for this study. The mean sales growth rate was 0.16 and the standard deviation was 0.12. These are required in order to test for normally distributed data (D'Agostino, 2017). We perform a t-test since only the sample statistics were known (Emmert-Streib & Dehmer, 2019). The research question for this study is: Is there a statistically significant difference between the population mean and the sample mean of the sales growth rate of target firms? The hypothesis testing process was as given below.

Hypotheses

The following are the null and alternative hypotheses.

H0: μ = .10. The mean growth rate of sales for target firms is not different from 10%.
Ha: μ > .10. The mean growth rate of sales for target firms exceeds 10%.

Test statistics

A t-test was carried out to examine whether the sample growth rate of 0.16 was statistically significantly different from 0.10. The t statistic is given as:

t = (x̄ − μ0) / (s / √n) = (0.16 − 0.10) / (0.12 / √25) = 2.50

Calculate the degrees of freedom by subtracting 1 from the sample size (Sung & Han, 2018):

df = 25 − 1 = 24
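This one-sample t calculation can be reproduced in a few lines; the upper-tail critical values below are the standard t-table entries at df = 24 used in this paper:

```python
import math

# One-sample t statistic: t = (x_bar - mu0) / (s / sqrt(n))
x_bar, mu0, s, n = 0.16, 0.10, 0.12, 25
t = (x_bar - mu0) / (s / math.sqrt(n))
print(f"t = {t:.2f}")  # 2.50

# Upper-tail critical values from a t table at df = n - 1 = 24.
critical = {0.10: 1.318, 0.05: 1.711, 0.01: 2.492, 0.001: 3.467}

# Reject H0 (mu = 0.10) in favour of Ha (mu > 0.10) when t exceeds t_alpha.
for alpha, t_alpha in critical.items():
    decision = "reject H0" if t > t_alpha else "fail to reject H0"
    print(f"alpha = {alpha}: t_alpha = {t_alpha} -> {decision}")
```

Running this shows rejection at α = 0.10, 0.05, and 0.01, but not at α = 0.001, which is exactly the pattern of decisions reported below.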
Determine the critical values from the t table at α = 0.10, 0.05, 0.01 and 0.001.

For α = 0.10, tα = 1.318, which is less than the calculated value of t = 2.50, implying we reject the null hypothesis at α = 0.10 and conclude Ha: μ > .10.
For α = 0.05, tα = 1.711, which is less than the calculated value of t = 2.50, implying we reject the null hypothesis at α = 0.05 and conclude Ha: μ > .10.
For α = 0.01, tα = 2.492, which is less than the calculated value of t = 2.50, implying we reject the null hypothesis at α = 0.01 and conclude Ha: μ > .10.
For α = 0.001, tα = 3.467, which is greater than the calculated value of t = 2.50, implying we fail to reject the null hypothesis at α = 0.001.

Decision

There was very strong evidence, at α = 0.01, to reject the null hypothesis, and we conclude that the mean growth rate of sales for target firms exceeds 10% (Bowerman et al., 2019).

Growth rate of sales for bidders

The growth rate for firms that were willing to make acquisitions was examined in order to come up with decisions regarding acquisition. A sample of 25 firms was collected for the analysis necessary for decision making. The mean growth rate of sales was 0.12 and the standard deviation was 0.09. We perform a t-test since only the sample statistics were known.

Hypotheses Tested

The hypotheses to be tested are:

H0: μ = .10. The mean growth rate of sales for bidders is not different from 10%.
Ha: μ > .10. The mean growth rate of sales for bidders exceeds 10%.

Test statistics
A t-test was carried out to examine whether the sample growth rate of 0.12 was statistically significantly different from 0.10. The t statistic is given as:

t = (x̄ − μ0) / (s / √n) = (0.12 − 0.10) / (0.09 / √25) = 1.111

Calculate the degrees of freedom by subtracting 1 from the sample size:

df = 25 − 1 = 24

Determine the critical values from the t table at α = 0.10, 0.05, 0.01 and 0.001.

For α = 0.10, tα = 1.318, which is greater than the calculated value of t = 1.111, implying we fail to reject the null hypothesis at α = 0.10 (Trafimow & Earp, 2017).
For α = 0.05, tα = 1.711, which is greater than the calculated value of t = 1.111, implying we fail to reject the null hypothesis at α = 0.05.
For α = 0.01, tα = 2.492, which is greater than the calculated value of t = 1.111, implying we fail to reject the null hypothesis at α = 0.01.
For α = 0.001, tα = 3.467, which is greater than the calculated value of t = 1.111, implying we fail to reject the null hypothesis at α = 0.001.

Decision

We failed to reject the null hypothesis even at α = 0.10 and conclude that there is no evidence that the mean growth rate of sales for bidders exceeds 10%.

Conclusion

The analysis shows that the sales growth rate of firms targeted for acquisition exceeded 10%, but the firms that were willing to place bids for acquisition had a sales growth rate not exceeding 10%. The bidding firms should be analyzed critically to ensure they meet the requirements for acquiring already existing firms that have strong growth curves.

References

Christofi, M., Leonidou, E., & Vrontis, D. (2017). Marketing
research on mergers and acquisitions: A systematic review and future directions. International Marketing Review.

D'Agostino, R. B. (2017). Tests for the normal distribution. In Goodness-of-fit techniques (pp. 367–420). Routledge.

Emmert-Streib, F., & Dehmer, M. (2019). Understanding statistical hypothesis testing: The logic of statistical inference. Machine Learning and Knowledge Extraction, 1(3), 945–961.

Bowerman, B., Drougas, A. M., Duckworth, A. G., Hummel, R. M., Moniger, K. B., & Schur, P. J. (2019). Business statistics and analytics in practice (9th ed.). McGraw-Hill. ISBN 9781260187496.

Sung, W. P., & Han, T. Y. (Eds.). (2018, July). Exploration and practice of a new formula for calculating the degree of freedom. In MATEC Web of Conferences (Vol. 175, p. 03018). EDP Sciences.

Trafimow, D., & Earp, B. D. (2017). Null hypothesis significance testing and Type I error: The domain problem. New Ideas in Psychology, 45, 19–27.

Introduction

Prediction is the process of determining the magnitude of the effect of predictors on the response variable. It helps determine the future value of the outcome variable using the predictors or factors included in the study. Multiple linear regression is a statistical test used to assess the relationship between the response variable and more than one predictor variable (Keith, 2019). Also, Cacoullos (2014) argued that discriminant analysis, which is used in testing the equality of group centroids, is associated with multivariate analysis of variance since it uses Wilks' lambda as applied in GLM multivariate procedures. In this regard, multiple linear regression and discriminant analysis are the most appropriate statistical techniques for predicting the outcome of a dependent variable at different
values of the predictor variables, assessing the contribution of the predictors to the outcome and the association among the independent variables, respectively (Keith, 2019). Besides, the researcher can assess how much of the variation in the outcome is explained by the independent variables included in the model. The test is used in predicting the values of the response variable using more than one predictor variable. Discriminant analysis was used to assess the association between the two tests that were developed in a firm to examine employee performance. The study used a sample of 43 employees who were grouped as either successful or unsuccessful and took the two tests to assess their performance in a given position.

Before conducting either multiple linear regression or discriminant analysis, one should check whether the variables meet the necessary assumptions. For multiple linear regression, the dependent variable must be continuous and approximately normally distributed, while the predictor variables can be either continuous or categorical. For discriminant analysis, the dependent variable must be divided into two or more groups. Besides, for both statistical tests the predictor variables should not be correlated with each other; that is, there should be no multicollinearity (Alin, 2010). This can be assessed using the variance inflation factor, which should be less than ten, or through the correlation coefficients between the predictor variables.

The current study assessed the relationship between the cost of constructing an LWR plant and the three predictor variables S, N, and CT, and assessed the association between the two tests used to examine employee performance.

Assumption of Regression Analysis

Multicollinearity

Correlation analysis is used in determining the strength and direction of association between two variables (Puth et al., 2014).
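As an aside, the correlation coefficient used throughout this section can be computed directly from its definition, r = cov(x, y) / (s_x · s_y). The snippet below does so on invented illustrative data, not the study's actual S and N measurements:

```python
import math

def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    # Sums of cross-products and squared deviations; the 1/n factors cancel in r.
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Illustrative data only (roughly linear, so r should be close to +1).
x = [2.0, 4.0, 5.0, 7.0, 9.0]
y = [1.5, 3.1, 4.9, 6.2, 9.4]
print(f"r = {pearson_r(x, y):.3f}")
```

A value near +1 or −1 indicates a near-perfect linear relationship, while a value near zero, as found for S and N below, indicates no meaningful linear association.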
The Pearson correlation coefficient is used for testing the strength and direction of association between two variables with a continuous level of measurement. However, when the variables have an ordinal level of measurement, we use the
Spearman rank correlation coefficient (Puth et al., 2014). The Pearson correlation coefficient ranges from -1 to 1, with -1 or 1 indicating perfect correlation and zero indicating no correlation.

Table 1: Correlation Analysis

                            S        N
S   Pearson Correlation     1        .193
    Sig. (2-tailed)                  .289
    N                       32       32
N   Pearson Correlation     .193     1
    Sig. (2-tailed)         .289
    N                       32       32

The correlation analysis in Table 1 above indicates that the correlation between the two predictor variables (S and N) is weakly positive and not significant at the 0.05 level of significance
(r = 0.193, p = 0.289). This suggested that multicollinearity does not exist and the multicollinearity assumption is not violated. Furthermore, based on the analysis, the variance inflation factor (VIF) is less than 10 (Daoud, 2017), suggesting that the multicollinearity assumption is not violated.

Normality test

Table 2: Tests of Normality

        Kolmogorov-Smirnov(a)            Shapiro-Wilk
        Statistic   df   Sig.            Statistic   df   Sig.
ln_C    .104        32   .200*           .967        32   .414

*. This is a lower bound of the true significance.
a. Lilliefors Significance Correction

The test of normality of the dependent variable (ln(C)) revealed that the normality assumption is not violated after transforming the variable C using the natural log, at the 0.05 level of significance (Shapiro-Wilk = 0.967, p = 0.414).

Results and discussion

Regression Analysis

a. Use residual analysis and R2 to check your model.
Table 3: Model Summary

Model   R       R Square   Adjusted R Square   Std. Error of the Estimate   R Square Change   F Change   df1   df2   Sig. F Change
1       .482a   .232       .179                .34240                       .232              4.385      2     29    .022
2       .483b   .234       .151                .34814                       .001              .052       1     28    .822

a. Predictors: (Constant), N, S
b. Predictors: (Constant), N, S, CT

The R-Squared of 0.232 indicates that the model can explain about 23.2% of the variation in ln(C), and 76.8% of the variation is explained by other variables not included in the model. Besides, the analysis indicated high residuals. The low R-Square and high residuals indicated that the model does not fit the data well (Brown, 2009).

b. State which variables are important in predicting the cost of constructing an LWR plant?

Table 4: Regression Coefficients

Model            Unstandardized B   Std. Error   Standardized Beta   t        Sig.   Tolerance   VIF
1   (Constant)   5.300              .277                             19.161   .000
    N            .011               .010         .189                1.110    .276   .950        1.053
    CT           .028               .125         .038                .227     .822   .978        1.022

a. Dependent Variable: ln_C

The regression analysis displayed in Table 4 above indicates that S is a significant contributing factor in predicting ln(C) at a 0.05 level of significance (p = 0.021). However, the predictor variable N does not significantly predict ln(C) at a 0.05 level of significance (p = 0.254). According to the analysis, there is no significant difference in ln(C) between the two levels of the cooling tower (p = 0.822), suggesting that the dummy variable CT does not have a significant effect in predicting ln(C). Therefore, the researcher used the S predictor to predict the cost of constructing an LWR plant but removed N and CT from the model.

c. State a prediction equation that can be used to predict ln(C).

After dropping N and CT from the model since they do not have a significant effect in predicting ln(C), the prediction equation is given by:
d. Does adding CT improve R2? If so, by what amount?

Based on the analysis displayed in Table 3 above, there is no significant improvement in R-Squared after adding CT (p = 0.822). Adding CT to the model changes R-Square by 0.001, from 0.232 to 0.234, which is not significantly different from zero.

Correlational Analysis

a. Evaluate the correlation between the two scores and state if there seems to be any association between the two.

Table 5: Pooled Within-Groups Matrices

                     Test1    Test2
Correlation  Test1   1.000    .187
             Test2   .187     1.000

The correlation analysis shown in Table 5 above indicates that there was a weak positive correlation between the two tests (r = 0.187). This suggested that the two test scores were not correlated.

b. Find the probability of upgrading for each division of the sample by the Bayes' theorem.

Given that: P(T1) = 43/86; P(T2) = 43/86; P(T1|Up) = 23/46; P(T2|Up) = 23/46

P(Up|T1) = P(T1|Up) P(Up) ÷ P(T1) = (23/46 × 46/86) ÷ 43/86 = 23/43
P(Up|T2) = P(T2|Up) P(Up) ÷ P(T2) = (23/46 × 46/86) ÷ 43/86 = 23/43

c. Find the probability of upgrading for each division of the
sample by the naïve version of the Bayes' theorem.

P(Up|T1) = P(T1|Up) P(Up) ÷ P(T1) = (23/46 × 46/86) ÷ 43/86 = 23/43
P(Up|T2) = P(T2|Up) P(Up) ÷ P(T2) = (23/46 × 46/86) ÷ 43/86 = 23/43

d. Compare your results in parts b and c and explain the difference or indifference based on observed probabilities.

Since we have only one predictor in each sample division, the naïve version and Bayes' theorem give identical probabilities. The naïve version applies Bayes' theorem with an assumption of independence between the predictor features (Webb, 2010); with a single predictor per division, that assumption has no effect, so there is no difference in the observed probabilities.

Conclusion and Recommendations

The analysis revealed that the model with the three predictors predicting the cost of constructing an LWR plant does not fit the data well. This suggested that most of the variation in the outcome variable is explained by variables not included in the model. Further analysis indicated that the S predictor had a significant effect in predicting the cost of constructing an LWR plant (C). However, N and CT did not have a significant effect in predicting it. Therefore, the researcher should drop the N and CT predictors from the model and use only the S predictor in predicting the cost of constructing the LWR plant.

The analysis also indicated that the two tests were not associated with each other, but the first test had greater potential for discriminating than the second test. This suggested that the first test was the best to use in predicting whether employees will be unsuccessful or successful in the position. Nevertheless, the study did not control for alternative explanations that would affect the validity of the findings. Further study is needed that includes variables with a good fit to the data to help in predicting the cost of constructing an LWR plant. In addition, a study that controls all possible
confounders is required to help in the prediction of the outcome variable and in assessing the best test for predicting employee performance.

References

Alin, A. (2010). Multicollinearity. Wiley Interdisciplinary Reviews: Computational Statistics, 2(3), 370–374.

Brown, J. D. (2009). The coefficient of determination.

Cacoullos, T. (Ed.). (2014). Discriminant analysis and applications. Academic Press.

Daoud, J. I. (2017, December). Multicollinearity and regression analysis. In Journal of Physics: Conference Series (Vol. 949, No. 1, p. 012009). IOP Publishing.

Keith, T. Z. (2019). Multiple regression and beyond: An introduction to multiple regression and structural equation modeling. Routledge.

Puth, M. T., Neuhäuser, M., & Ruxton, G. D. (2014). Effective use of Pearson's product-moment correlation coefficient. Animal Behaviour, 93, 183–189.

Webb, G. I. (2010). Naïve Bayes. Encyclopedia of Machine Learning, 15, 713–714.