Quantitative analysis: A brief introduction

Basic concepts of quantitative analysis, without the underlying math: t-test, one-way ANOVA, and correlation.

Published in: Data & Analytics

  1. Quantitative analysis: A brief introduction. Petri Lankoski, 2018
  2. You should be familiar with the following: • Mean (medelvärde), for a normal distribution • Median (median) • Mode (typvärde) • Line chart (linjediagram) • Bar chart (stapeldiagram)
  3. Is the die loaded? Throws so far: 1st: 1, 2nd: 1, 3rd: 4, 4th: 1, 5th: 2, 6th: 1. The chance of getting a 1 is 1/6, but as a first throw it is as likely as any other result, so one throw gives us too little information to say anything more. Six throws are probably still too few to judge the die, so we would need to roll more. We cannot say for certain, but we can estimate how likely or unlikely the observed sequence is. In the long run we expect to see equal numbers of 1s, 2s, 3s, 4s, 5s and 6s.
  4. Is the die loaded? We roll the sequence 1 1 4 1 2 1 3 6 1 1 1 5. Testing this sequence against the expected sequence indicates that the die is loaded • But we have around a 1% chance of being wrong. We roll another sequence: 2 6 2 6 6 4 6 5 4 1 3 4 4 6 5 3 5 3 2 5 • The numbers of 6s and 1s do not match the expected numbers • But we would have a 70% likelihood of being wrong if we claimed that the die is loaded
  5. Boxplot. [Figure: boxplot. The line inside the box is the median, the box spans the interquartile range (IQR, 50% of the data), and the whiskers extend to 1.5 * IQR.]
  6. Density and violin plots. A violin plot is a form of density plot. [Figure: density plot with data points.]
  7. Scatter plot. A scatter plot shows the values of two variables • For example, how a participant answered the questions. [Figure: scatter plot of Variable 1 against Variable 2.]
  8. Random sampling. Predicting election results: it is not practically possible to ask everyone how they will vote, so we pick a sample of people at random and ask them. We get: A: 37.6% B: 12.3% C: 33.1% D: 5.2% … However, we know that there is uncertainty here. If we draw a random sample again, we might get something else. We get: A: 36.9% B: 13.0% C: 32.7% D: 6.1% … And again: A: 38.7% B: 11.0% C: 31.7% D: 6.3% … We can estimate the uncertainty, but we need to make some assumptions.
  9. Normal distribution. 68.3% of the data lies within ±1𝜎 of the mean and 95.4% within ±2𝜎. 𝜎 = standard deviation • describes the width of the distribution
  10. Back to polling. 95% of the population lies within ±1.96𝜎; the sampling distribution behaves similarly. We do not know the true population value (black vertical line in the figure), and we cannot know where in the population distribution our observations fall (red vertical lines; support for A: 36.1%, 38.7%, 37.6%). However, with 95% certainty what we observed falls between -1.96𝜎 and 1.96𝜎.
  11. Random sampling. Instead of uncertainty, confidence is usually reported. The confidence interval (CI), usually 95%, is a function of the sample size and the probability of someone choosing a candidate. For candidate A: $0.376 \pm 1.96 \cdot \sqrt{0.376(1 - 0.376)/N}$, where the square-root term is 𝜎 and N is the sample size. We can backtrack from the sampling distribution and estimate the uncertainty in what we observed when polling • When we poll next time, with 95% certainty what we observe falls between -1.96𝜎 and 1.96𝜎
  12. Are two means different? The t-test. We have two sample means, A and B. Their difference is ∆ = B - A. A mean is calculated from the sampled values: $\mathrm{Mean}(A) = \frac{\sum a}{n}$ (for normally distributed variables). To infer whether there is a difference between groups A and B at the population level (from which A and B were sampled), we need to account for uncertainty. Again, the population mean and the sample mean can differ.
  13. Are two means different? The t-test. We have two sample means, A and B. Their difference is ∆ = B - A. The t statistic describes the difference in a way that takes into account the variance (𝜎²) and the sample size. p is the probability of observing a difference at least as large as the one in the data if the null hypothesis is true; for the t-test, the null hypothesis is that the means do not differ. p depends on the t-value and the sample size; a t-value further from zero means a lower p. p = 0.05 means that, if there were no difference, data deviating at least this much from the expected would occur 5% of the time. p < 0.05 is a typical criterion for a statistically significant result.
  14. Are three means different? One-way ANOVA • One-way ANOVA is similar to the t-test • The F-statistic describes the difference in a way that takes into account the variance and the sample size • p is the probability of observing differences at least as large as those in the data if the null hypothesis is true; for ANOVA, the null hypothesis is that the means do not differ • A significant result (p < 0.05) tells us that at least one mean differs from the others • But not which one • Post hoc comparisons are needed to determine which group differs from which
  15. Correlation. Correlation (r) describes the strength of the association between two variables. p is the probability of observing a correlation at least this strong if the null hypothesis (that there is no relation between the two variables) is true. Correlation does not tell us whether v1 causes v2 or vice versa • There is a strong correlation between ice cream sales and drowning • Neither causes the other • A third variable, temperature, is related to both
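
The slides do not say which test was used on the two die sequences in slide 4. A minimal sketch, assuming a chi-square goodness-of-fit test against equal expected counts, gives p-values close to the 1% and 70% quoted there. The function names are illustrative, and the closed-form tail probability is valid only for 5 degrees of freedom (six faces). With so few throws the chi-square approximation is rough.

```python
import math
from collections import Counter


def chi_square_stat(throws, sides=6):
    """Chi-square goodness-of-fit statistic against a fair die."""
    counts = Counter(throws)
    expected = len(throws) / sides  # a fair die gives equal expected counts
    return sum((counts[face] - expected) ** 2 / expected
               for face in range(1, sides + 1))


def chi2_sf_df5(x):
    """Upper-tail probability of the chi-square distribution, df = 5 only."""
    return (math.erfc(math.sqrt(x / 2))
            + math.sqrt(2 * x / math.pi) * math.exp(-x / 2) * (1 + x / 3))


# The two sequences from slide 4.
first = [1, 1, 4, 1, 2, 1, 3, 6, 1, 1, 1, 5]
second = [2, 6, 2, 6, 6, 4, 6, 5, 4, 1, 3, 4, 4, 6, 5, 3, 5, 3, 2, 5]

for name, throws in (("first", first), ("second", second)):
    stat = chi_square_stat(throws)
    print(f"{name}: chi-square = {stat:.2f}, p = {chi2_sf_df5(stat):.3f}")
```

A small p (first sequence) means such lopsided counts would be rare for a fair die; a large p (second sequence) means the counts are unremarkable.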
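
The quantities labelled on the boxplot in slide 5 can be computed directly. This is a sketch under assumptions: the data are made up, `boxplot_summary` is an illustrative name, and the "inclusive" quartile method is only one of several conventions, so plotting libraries may give slightly different quartiles.

```python
import statistics


def boxplot_summary(values):
    """Median, quartiles, IQR, 1.5 * IQR fences, and points outside them."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1                  # the box: middle 50% of the data
    lower_fence = q1 - 1.5 * iqr   # whiskers reach at most this far
    upper_fence = q3 + 1.5 * iqr
    outliers = [v for v in values if v < lower_fence or v > upper_fence]
    return {"q1": q1, "median": median, "q3": q3, "iqr": iqr,
            "lower_fence": lower_fence, "upper_fence": upper_fence,
            "outliers": outliers}


data = [1, 2, 3, 4, 5, 6, 7, 8, 30]  # hypothetical data with one extreme value
summary = boxplot_summary(data)
print(summary)
```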
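
The point of slide 8, that repeated random samples give somewhat different results, can be illustrated with a small simulation. The true support (37%) and the sample size (1000) are hypothetical values chosen for the sketch; the slides give neither.

```python
import random


def poll(true_support, sample_size, rng):
    """Share of a simulated random sample that supports the candidate."""
    votes = sum(rng.random() < true_support for _ in range(sample_size))
    return votes / sample_size


rng = random.Random(2018)  # fixed seed so the run is repeatable
TRUE_SUPPORT = 0.37        # hypothetical; unknown in a real poll
SAMPLE_SIZE = 1000         # hypothetical; not stated in the slides

estimates = [poll(TRUE_SUPPORT, SAMPLE_SIZE, rng) for _ in range(3)]
for i, estimate in enumerate(estimates, start=1):
    print(f"poll {i}: A = {estimate:.1%}")
```

Each simulated poll lands near the true value but rarely exactly on it, which is the uncertainty the later slides quantify.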
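
The percentages in slides 9 and 10 follow from the normal distribution: the share of values within ±k𝜎 of the mean is erf(k/√2). A short check, with `coverage` as an illustrative name:

```python
import math


def coverage(k):
    """Share of a normal distribution lying within k standard deviations."""
    return math.erf(k / math.sqrt(2))


for k in (1, 1.96, 2):
    print(f"within ±{k}σ: {coverage(k):.1%}")
```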
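
The confidence interval formula in slide 11 can be evaluated once a sample size is chosen. The slides leave N unspecified, so N = 1000 below is an assumption made only for illustration, and `proportion_ci` is an illustrative name.

```python
import math


def proportion_ci(p_hat, n, z=1.96):
    """Confidence interval p_hat ± z * sqrt(p_hat * (1 - p_hat) / n)."""
    sigma = math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - z * sigma, p_hat + z * sigma


N = 1000  # assumed sample size; not given in the slides
low, high = proportion_ci(0.376, N)
print(f"support for A: 37.6%, 95% CI {low:.1%} to {high:.1%}")
```

Because N sits under the square root, quadrupling the sample size only halves the width of the interval.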
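
Slide 13 says the t statistic takes variance and sample size into account. One common form is Welch's t, sketched below; the slides do not say which variant is meant, so this choice is an assumption, and the sample values are made up. Turning t into a p-value needs the t distribution, which the Python standard library does not provide.

```python
import math
import statistics


def welch_t(a, b):
    """Welch's t statistic for the difference of means, delta = B - A."""
    delta = statistics.mean(b) - statistics.mean(a)
    # Standard error combines each group's variance and sample size.
    se = math.sqrt(statistics.variance(a) / len(a)
                   + statistics.variance(b) / len(b))
    return delta / se


group_a = [1, 2, 3, 4, 5]  # hypothetical samples
group_b = [2, 3, 4, 5, 6]
print(f"t = {welch_t(group_a, group_b):.2f}")
```

The same difference ∆ gives a larger t when the variances are smaller or the samples are larger.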
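
The F-statistic from slide 14 is the ratio of the variation between group means to the variation within groups. A minimal sketch with made-up groups follows; `one_way_f` is an illustrative name. As with t, converting F to a p-value requires the F distribution, which is not in the standard library.

```python
import statistics


def one_way_f(*groups):
    """F statistic of a one-way ANOVA: between-group / within-group mean squares."""
    all_values = [v for group in groups for v in group]
    grand_mean = statistics.mean(all_values)
    ss_between = sum(len(g) * (statistics.mean(g) - grand_mean) ** 2
                     for g in groups)
    ss_within = sum((v - statistics.mean(g)) ** 2 for g in groups for v in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)


# Hypothetical groups of three observations each.
f_value = one_way_f([1, 2, 3], [2, 3, 4], [3, 4, 5])
print(f"F = {f_value:.2f}")
```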
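
The correlation coefficient r from slide 15 can be computed by hand. The figures below are invented to mimic the ice cream and drowning example; they are not real data, and `pearson_r` is an illustrative name.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(ss_x * ss_y)


# Invented monthly figures; both rise with temperature.
temperature = [12, 16, 20, 24, 28, 32]
ice_cream_sales = [20, 35, 38, 55, 60, 78]
drownings = [1, 2, 2, 4, 3, 5]

print(f"ice cream vs drownings: r = {pearson_r(ice_cream_sales, drownings):.2f}")
print(f"temperature vs ice cream: r = {pearson_r(temperature, ice_cream_sales):.2f}")
```

A high r between sales and drownings appears here only because both were constructed to follow temperature, which is the third-variable situation the slide warns about.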