
Statistics "Descriptive & Inferential"

It is the science of dealing with numbers.
It is used for collection, summarization, presentation and analysis of data.



  1. 1. Dr. Dalia El-Shafei, Assistant Professor, Community Medicine Department, Zagazig University
  2. 2. STATISTICS It is the science of dealing with numbers. It is used for the collection, summarization, presentation and analysis of data. It provides a way of organizing data to get information on a wider and more formal (objective) basis than relying on personal experience (subjective).
  3. 3. USES OF MEDICAL STATISTICS: • Planning, monitoring & evaluating community health care programs. • Epidemiological research studies. • Diagnosis of community health problems. • Comparison of health status & diseases in different countries and in one country over the years. • Forming standards for different biological measurements such as weight and height. • Differentiating between diseased & normal groups.
  4. 4. TYPES OF STATISTICS
      Descriptive: • Describe or summarize the data of a target population. • Describe the data which is already known. • Organize, analyze & present data in a meaningful manner. • Final results are shown in the form of tables and graphs. • Tools: measures of central tendency & dispersion.
      Inferential: • Use data to make inferences or generalizations about a population. • Draw conclusions about a population that go beyond the available data. • Compare, test and predict future outcomes. • Final results are probability scores. • Tools: hypothesis tests.
  5. 5. TYPES OF DATA
  6. 6. Data
      Quantitative:
        • Discrete (no decimals): no. of hospitals, no. of patients.
        • Continuous (decimals allowed): weight, height, hemoglobin level.
      Qualitative:
        • Categorical: blood groups, male & female, black & white.
        • Ordinal: have levels such as low, moderate, high.
  7. 7. SOURCES OF DATA COLLECTION
  8. 8. Primary sources. Secondary sources.
  9. 9. PRESENTATION OF DATA
  10. 10. Tabular presentation. Graphical presentation. Graphic presentations usually accompany tables to illustrate & clarify information. Tables are essential in the presentation of scientific data, & diagrams are complementary, summarizing these tables in an easy way.
  11. 11. TABULATION Basic form of presentation. • The table must be self-explanatory. • Title: written at the top of the table to define precisely the content, the place and the time. • Clear headings for the columns & rows. • Units of measurement should be indicated. • The size of the table depends on the number of classes “2 - 10 rows or classes”.
  12. 12. TYPES OF TABLES: • List • Frequency distribution table
  13. 13. LIST Number of patients in each hospital department:
      Department      Patients
      Medicine        100
      Surgery         80
      ENT             28
      Ophthalmology   30
  14. 14. FREQUENCY DISTRIBUTION TABLE
  15. 15. Assume we have a group of 20 individuals whose blood groups were as follows: A, AB, AB, O, B, A, A, B, B, AB, O, AB, AB, A, B, B, B, A, O, A. We want to present these data in a table titled: Distribution of the studied individuals according to blood group.
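The table itself did not survive in this transcript. As a minimal sketch of how such a frequency distribution could be tallied, the snippet below counts the 20 blood groups listed on the slide; the column layout and the row order (A, B, AB, O) are assumptions, not taken from the original table.

```python
from collections import Counter

# Blood groups of the 20 individuals listed on the slide.
blood_groups = ["A", "AB", "AB", "O", "B", "A", "A", "B", "B", "AB",
                "O", "AB", "AB", "A", "B", "B", "B", "A", "O", "A"]

counts = Counter(blood_groups)
total = sum(counts.values())

# One row per category: label, frequency, percentage of the total.
# The row order is an assumption made for display only.
frequency_table = [
    (group, counts[group], 100 * counts[group] / total)
    for group in ("A", "B", "AB", "O")
]

print(f"{'Blood group':<12}{'Number':>8}{'%':>8}")
for group, number, percent in frequency_table:
    print(f"{group:<12}{number:>8}{percent:>8.1f}")
print(f"{'Total':<12}{total:>8}{100.0:>8.1f}")
```

Counting by hand gives the same result: A = 6, B = 6, AB = 5 and O = 3, for a total of 20.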
  16. 16. These are blood pressure measurements of 30 patients with hypertension. Present these data in a frequency table: 150, 155, 160, 154, 162, 170, 165, 155, 190, 186, 180, 178, 195, 200, 180, 156, 173, 188, 173, 189, 190, 177, 186, 177, 174, 155, 164, 163, 172, 160.
      Frequency distribution of blood pressure measurements among studied patients:
      Blood pressure “mmHg”   Tally        Number   %
      150 –                   ||||| |      6        20
      160 –                   ||||| |      6        20
      170 –                   ||||| |||    8        26.7
      180 –                   ||||| |      6        20
      190 –                   |||          3        10
      200 –                   |            1        3.3
      Total                                30       100
  17. 17. GRAPHICAL PRESENTATION • Simple and easy to understand. • Saves a lot of words. • Self-explanatory. • Has a clear title indicating its content, “written under the graph”. • Fully labeled. • The y axis (vertical) is usually used for frequency.
  18. 18. Graphs: bar chart, pie diagram, histogram, scatter diagram, line graph, frequency polygon.
  19. 19. BAR CHART • Used for presenting discrete or qualitative data. • It is a graphical presentation of magnitude (value or percentage) by rectangles of constant width & lengths proportional to the frequency, separated by gaps. • Types: simple, multiple, component.
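The grouping into class intervals can be sketched as follows. This assumes that each class includes its lower limit and excludes the lower limit of the next class (so a reading of 160 is counted in the class “160 –”), which is the convention that reproduces the frequencies on the slide; the names `CLASS_WIDTH`, `LOWEST_LIMIT` and `class_lower_limit` are illustrative.

```python
# Blood pressure readings (mmHg) of the 30 patients on the slide.
readings = [150, 155, 160, 154, 162, 170, 165, 155, 190, 186,
            180, 178, 195, 200, 180, 156, 173, 188, 173, 189,
            190, 177, 186, 177, 174, 155, 164, 163, 172, 160]

CLASS_WIDTH = 10    # width of each class interval
LOWEST_LIMIT = 150  # lower limit of the first class


def class_lower_limit(value):
    """Return the lower limit of the class interval that contains value."""
    steps = (value - LOWEST_LIMIT) // CLASS_WIDTH
    return LOWEST_LIMIT + steps * CLASS_WIDTH


# Count how many readings fall into each class.
frequencies = {}
for value in readings:
    lower = class_lower_limit(value)
    frequencies[lower] = frequencies.get(lower, 0) + 1

# Print lower limit, tally strokes, number and percentage for each class.
for lower in sorted(frequencies):
    number = frequencies[lower]
    percent = 100 * number / len(readings)
    print(f"{lower} -  {'|' * number:<10}{number:>4}{percent:>7.1f}")
```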
  20. 20. SIMPLE BAR CHART
  21. 21. MULTIPLE BAR CHART Percentage of Persons Aged ≥18 Years Who Were Current Smokers, by Age and Sex — United States, 2002
  22. 22. COMPONENT BAR CHART
  23. 23. PIE DIAGRAM • Consists of a circle whose area represents the total frequency (100%) and which is divided into segments. • Each segment represents a proportional part of the total frequency.
  24. 24. HISTOGRAM • It is very similar to the bar chart, with the difference that the rectangles or bars are adjacent (without gaps). • It is used for presenting a class frequency table (continuous data). • Each bar represents a class: its height represents the frequency (number of cases) and its width represents the class interval.
  25. 25. SCATTER DIAGRAM It is useful to represent the relationship between 2 numeric measurements, each observation being represented by a point corresponding to its value on each axis.
  26. 26. LINE GRAPH • It is a diagram showing the relationship between two numeric variables (like the scatter diagram), but the points are joined together to form a line (either a broken line or a smooth curve).
  28. 28. FREQUENCY POLYGON • Derived from a histogram by connecting the midpoints of the tops of the rectangles in the histogram. • The line connecting the centers of the histogram rectangles is called the frequency polygon. We can draw the polygon without the rectangles, which gives a simpler form of line graph. • A special type of frequency polygon is “the Normal Distribution Curve”.
  29. 29. NORMAL DISTRIBUTION CURVE “GAUSSIAN DISTRIBUTION CURVE”
  30. 30. The NDC is the frequency polygon of a quantitative continuous variable measured in a large number of individuals. It is a form of presentation of the frequency distribution of biologic variables such as weight, height, hemoglobin level and blood pressure.
  31. 31. CHARACTERISTICS OF THE CURVE: • Bell-shaped, continuous curve. • Symmetrical, i.e. can be divided vertically into 2 equal halves. • Tails never touch the base line but extend to infinity in either direction. • The mean, median and mode values coincide. • Described by 2 parameters: arithmetic mean (X), “location of the center of the curve”, & standard deviation (SD), “scatter around the mean”.
  32. 32. AREAS UNDER THE NORMAL CURVE: • X ± 1 SD covers about 68% of the area under the curve (about 34% on each side of the mean). • X ± 2 SD covers about 95% of the area. • X ± 3 SD covers about 99.7% of the area.
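These areas can be checked numerically. The sketch below relies on the standard result that the fraction of a normal distribution lying within k SD of the mean equals erf(k / √2); the function name `area_within` is illustrative.

```python
import math


def area_within(k):
    """Fraction of a normal distribution lying within mean ± k SD."""
    return math.erf(k / math.sqrt(2))


for k in (1, 2, 3):
    print(f"mean ± {k} SD: {100 * area_within(k):.1f}% of the area")
```

The three values come out at roughly 68%, 95% and 99.7%, which is why mean ± 2 SD is used as the 95% range in the slides that follow.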
  33. 33. SKEWED DATA If we represent collected data by a frequency polygon & the resulting curve does not resemble the NDC (with all its characteristics), then these data are “not normally distributed”; the curve may be skewed to the right or to the left.
  34. 34. CAUSES OF SKEWED CURVE The data collected are from: • a heterogeneous group; • a diseased or abnormal population. So the results obtained from these data cannot be applied or generalized to the whole population.
  35. 35. Example: Suppose we have an NDC for Hb levels in a population of normal adult males with mean ± SD = 11 ± 1.5, and we obtain an Hb reading of 8.1 for an individual. We want to know whether he is normal or anemic. If this reading lies within the 95% area under the curve (i.e. mean ± 2 SD), he is considered normal. If his reading is lower than that, he is anemic. In this way the NDC can be used to distinguish normal from abnormal measurements.
  36. 36. • The normal range for Hb in this example will be: Upper Hb level: 11 + 2 (1.5) = 14. Lower Hb level: 11 – 2 (1.5) = 8. i.e. the normal Hb range of adult males is from 8 to 14. Our reading (8.1) lies within this range, so this individual is considered normal because his reading lies within the 95% of his population.
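The same check can be written as a small sketch. The helper names `normal_range` and `is_within_normal_range` are illustrative, and the cut-off of 2 SD is the one used in the example.

```python
def normal_range(mean, sd, k=2):
    """Return the (lower, upper) limits of mean ± k SD."""
    return mean - k * sd, mean + k * sd


def is_within_normal_range(reading, mean, sd, k=2):
    """True if the reading lies inside mean ± k SD (limits included)."""
    lower, upper = normal_range(mean, sd, k)
    return lower <= reading <= upper


# Values from the example: mean Hb = 11, SD = 1.5, reading = 8.1.
lower, upper = normal_range(11, 1.5)
print(lower, upper)
print(is_within_normal_range(8.1, 11, 1.5))
```

With these numbers the limits are 8 and 14, and the reading of 8.1 falls just inside the range.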
  37. 37. DATA SUMMARIZATION
  38. 38. Data summarization
      Measures of central tendency: mean, mode, median.
      Measures of dispersion: range, variance, standard deviation, coefficient of variation.
  40. 40. ARITHMETIC MEAN The sum of the observations divided by the number of observations: $\bar{x} = \frac{\sum x}{n}$, where $\bar{x}$ is the mean, $\sum$ denotes “the sum of”, $x$ the values of the observations and $n$ the number of observations.
  41. 41. ARITHMETIC MEAN
  42. 42. ARITHMETIC MEAN In the case of frequency distribution data, we calculate the mean by this equation: $\bar{x} = \frac{\sum f x}{\sum f}$, where $f$ is the frequency of each value $x$.
  43. 43. ARITHMETIC MEAN
  44. 44. If data are presented in a frequency table with class intervals, we calculate the mean by the same equation but using the midpoint of each class interval.
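As a hedged illustration, the sketch below applies the midpoint method to the blood pressure frequency table from slide 16. It assumes every class is 10 mmHg wide, including the open-ended last class “200 –”, so the midpoints are 155, 165, and so on up to 205; the function name `grouped_mean` is illustrative.

```python
# (lower class limit, frequency) pairs from the blood pressure table.
classes = [(150, 6), (160, 6), (170, 8), (180, 6), (190, 3), (200, 1)]
CLASS_WIDTH = 10  # assumed for every class, including the last one


def grouped_mean(classes, width):
    """Mean of grouped data: sum(f * midpoint) / sum(f)."""
    total_fx = sum((lower + width / 2) * freq for lower, freq in classes)
    total_f = sum(freq for _, freq in classes)
    return total_fx / total_f


mean_bp = grouped_mean(classes, CLASS_WIDTH)
print(mean_bp)
```

Here the sum of frequency times midpoint is 5220 over 30 patients, giving a grouped mean of 174 mmHg. The result is an approximation, since the midpoint stands in for every reading in its class.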
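  45. 45. MEDIAN • The middle observation in a series of observations after arranging them in ascending or descending order. • Rank of the median: for an odd number of observations, (n + 1)/2; for an even number, the median is the average of the two middle observations, ranked n/2 and (n/2) + 1.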
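A minimal sketch of locating the median by rank is shown below. For an even number of observations it uses the usual convention of averaging the two middle observations (ranks n/2 and n/2 + 1); the function name `median_by_rank` and the second data set are hypothetical.

```python
def median_by_rank(values):
    """Median found by ranking the observations in ascending order."""
    ordered = sorted(values)
    n = len(ordered)
    if n % 2 == 1:
        # Odd n: the observation ranked (n + 1) / 2.
        rank = (n + 1) // 2
        return ordered[rank - 1]
    # Even n: average of the observations ranked n/2 and n/2 + 1.
    lower_rank = n // 2
    return (ordered[lower_rank - 1] + ordered[lower_rank]) / 2


print(median_by_rank([6, 8, 9, 7, 6]))  # odd number of observations
print(median_by_rank([3, 1, 4, 2]))     # even number of observations
```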
  46. 46. MEDIAN
  47. 47. MEDIAN
  48. 48. MODE The most frequently occurring value in the data.
  49. 49. ADVANTAGES & DISADVANTAGES OF THE MEASURES OF CENTRAL TENDENCY:
      Mean: • Usually preferred since it takes into account each individual observation. • Its main disadvantage is that it is affected by the value of extreme observations.
      Median: • A useful descriptive measure if there are one or two extremely high or low values.
      Mode: • Seldom used.
  51. 51. MEASURE OF DISPERSION Describes the degree of variations or scatter or dispersion of the data around its central values (dispersion = variation = spread = scatter).
  52. 52. RANGE • The difference between the largest & smallest values. • It is the simplest measure of variation. • It can be expressed as an interval such as 4-10, where 4 is the smallest value & 10 is the highest. But often it is expressed as the interval width; for example, the range 4-10 can also be expressed as a range of 6.
  53. 53. RANGE Disadvantages:
  54. 54. VARIANCE • To get the average of the differences between the mean & each observation in the data, we subtract each value from the mean, sum these differences and divide by the number of observations: $V = \frac{\sum (\bar{x} - x)}{n}$, where $\bar{x}$ is the mean. • The value of this equation will always be zero, because the negative and positive differences cancel out on algebraic summation. • To overcome this, we square the difference between the mean & each value so that every term is positive, and divide by $n - 1$. Thus we get: $V = \frac{\sum (\bar{x} - x)^2}{n - 1}$.
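The argument can be checked numerically. The sketch below uses the observations 5, 7, 10, 12 and 16 from the example on slide 57; variable names are illustrative.

```python
observations = [5, 7, 10, 12, 16]
n = len(observations)
mean = sum(observations) / n

# Differences between the mean and each observation.
deviations = [mean - x for x in observations]

# The plain sum of the differences cancels out to zero.
sum_of_deviations = sum(deviations)

# Squaring removes the signs; dividing by n - 1 gives the variance.
variance = sum(d ** 2 for d in deviations) / (n - 1)

print(sum_of_deviations)
print(variance)
```

The differences are 5, 3, 0, -2 and -6, which sum to zero, while their squares sum to 74 and give a variance of 74 / 4 = 18.5.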
  55. 55. STANDARD DEVIATION “SD” The main disadvantage of the variance is that it is expressed in the square of the units used. So it is more convenient to express the variation in the original units by taking the square root of the variance. This is called the standard deviation (SD). Therefore $SD = \sqrt{V}$, i.e. $SD = \sqrt{\frac{\sum (\bar{x} - x)^2}{n - 1}}$, where $\bar{x}$ is the mean.
  56. 56. COEFFICIENT OF VARIATION “C.V.” • The coefficient of variation expresses the standard deviation as a percentage of the sample mean: C.V. = SD / mean x 100. • The C.V. is useful when we are interested in the relative size of the variability in the data.
  57. 57. • Example: If we have observations 5, 7, 10, 12 and 16, their mean is 50/5 = 10. SD = √[(25 + 9 + 0 + 4 + 36) / (5 - 1)] = √(74/4) = 4.3. C.V. = 4.3/10 x 100 = 43%. Another set of observations is 2, 2, 5, 10 and 11. Their mean = 30/5 = 6. SD = √[(16 + 16 + 1 + 16 + 25) / (5 - 1)] = √(74/4) = 4.3. C.V. = 4.3/6 x 100 = 71.7%. Both sets have the same SD but differ in C.V., because the data in the 1st group are homogeneous (so the C.V. is not high), while the data in the 2nd group are heterogeneous (so the C.V. is high).
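The comparison of the two groups can be reproduced with Python's standard `statistics` module, whose `stdev` function uses the same n - 1 denominator as the slides. The helper name `coefficient_of_variation` is illustrative.

```python
import statistics


def coefficient_of_variation(values):
    """Sample SD expressed as a percentage of the mean."""
    return 100 * statistics.stdev(values) / statistics.mean(values)


group_1 = [5, 7, 10, 12, 16]
group_2 = [2, 2, 5, 10, 11]

sd_1 = statistics.stdev(group_1)
sd_2 = statistics.stdev(group_2)
cv_1 = coefficient_of_variation(group_1)
cv_2 = coefficient_of_variation(group_2)

print(round(sd_1, 1), round(cv_1, 1))
print(round(sd_2, 1), round(cv_2, 1))
```

Both groups have an SD of about 4.3, but the C.V. is about 43% in the first group and about 72% in the second, because the same spread is larger relative to a mean of 6 than to a mean of 10.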
  58. 58. • Example: In a study where age was recorded, the observed values were 6, 8, 9, 7, 6 (5 observations). • Calculate the mean, SD, range, mode and median. Mean = (6 + 8 + 9 + 7 + 6) / 5 = 7.2. Variance = [(7.2 - 6)² + (7.2 - 8)² + (7.2 - 9)² + (7.2 - 7)² + (7.2 - 6)²] / (5 - 1) = [(1.2)² + (-0.8)² + (-1.8)² + (0.2)² + (1.2)²] / 4 = 1.7. S.D. = √1.7 = 1.3. Range = 9 – 6 = 3. Mode = 6. Median = 7.
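For readers who want to verify this worked example, here is a short sketch using Python's standard `statistics` module; the dictionary layout is only an illustration.

```python
import statistics

ages = [6, 8, 9, 7, 6]

summary = {
    "mean": statistics.mean(ages),
    "variance": statistics.variance(ages),  # sample variance, n - 1
    "sd": statistics.stdev(ages),           # square root of the variance
    "range": max(ages) - min(ages),
    "mode": statistics.mode(ages),
    "median": statistics.median(ages),
}

for name, value in summary.items():
    print(f"{name}: {value:.1f}")
```

The values agree with the hand calculation: mean 7.2, variance 1.7, SD 1.3, range 3, mode 6 and median 7.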
  59. 59. INFERENTIAL STATISTICS
  61. 61. INFERENCE Inference involves making a generalization about a larger group of individuals on the basis of a subset or sample.
  62. 62. HYPOTHESIS TESTING Used to find out whether the observed variation among samples is explained by sampling variation (chance) or reflects a real difference between groups. The method of testing hypotheses is known as the “significance test”. Significance testing is a method for assessing whether a result is likely to be due to chance or due to a real effect.
  63. 63. NULL & ALTERNATIVE HYPOTHESES: • In hypothesis testing, a specific hypothesis is formulated & data are collected to accept or to reject it. • The null hypothesis is H0: x1 = x2, which means that there is no difference between x1 & x2. • If we reject the null hypothesis, i.e. there is a difference between the 2 readings, the alternative is either H1: x1 < x2 or H2: x1 > x2. • In other words, the null hypothesis is rejected because x1 is different from x2.
  64. 64. GENERAL PRINCIPLES OF TESTS OF SIGNIFICANCE • Set up a null hypothesis and its alternative. • Find the value of the test statistic. • Refer the value of the test statistic to a known distribution which it would follow if the null hypothesis were true. • Conclude that the data are consistent or inconsistent with the null hypothesis.
  65. 65. • If the data are not consistent with the null hypothesis, the difference is said to be “statistically significant”. • If the data are consistent with the null hypothesis, we do not reject it (it is commonly said that we accept it), i.e. the difference is statistically insignificant. • In medicine, we usually consider differences significant if the probability (P-value) is <0.05. • This means that if the null hypothesis is true, we shall make a wrong decision (rejecting it) fewer than 5 times in 100.
  66. 66. TESTS OF SIGNIFICANCE
      Quantitative variables:
        • 2 means, large sample “>60”: z test.
        • 2 means, small sample “<60”: t-test, paired t-test.
        • >2 means: ANOVA.
      Qualitative variables: Χ² test, z test.
  67. 67. COMPARING TWO MEANS OF LARGE SAMPLES USING THE NORMAL DISTRIBUTION (Z TEST OR SND, STANDARD NORMAL DEVIATE): If we have a large sample size “≥ 60” & it follows a normal distribution, then we use the z-test: z = (population mean - sample mean) / SD. If the absolute value of z is >2, then there is a significant difference. The normal range for any biological reading lies between the population mean ± 2 SD (this includes 95% of the area under the normal distribution curve).
  68. 68. COMPARING TWO MEANS OF SMALL SAMPLES USING T-TEST • If we have a small sample size (<60), we can use the t distribution instead of the normal distribution.
  69. 69. Degree of freedom = (n1 + n2) - 2. The value of t is compared with the value in the “t distribution” table at the corresponding degree of freedom. If the t-value is less than that in the table, the difference between the samples is insignificant. If the t-value is larger than that in the table, the difference is significant, i.e. the null hypothesis is rejected.
  70. 70. Big t-value → small P-value → statistical significance.
  71. 71. PAIRED T-TEST: If we are comparing repeated observations in the same individual or the difference between paired data, we use the paired t-test, where the analysis is carried out using the mean and standard deviation of the differences between each pair.
  72. 72. Paired $t = \frac{\bar{d}}{\sqrt{SD_d^2 / n}}$, where $\bar{d}$ is the mean of the differences, $SD_d$ the standard deviation of the differences and $n$ the number of pairs. d.f. = n – 1
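A minimal sketch of this calculation follows. The before and after readings are hypothetical numbers invented for illustration, and the critical value 2.776 is the standard two-sided 0.05 table value for 4 degrees of freedom, hard-coded here instead of being looked up.

```python
import math
import statistics

# Hypothetical paired readings for 5 individuals (illustration only).
before = [140, 150, 160, 155, 145]
after = [135, 148, 150, 150, 142]

# Work on the difference within each pair.
differences = [b - a for b, a in zip(before, after)]
n = len(differences)

mean_difference = statistics.mean(differences)
variance_of_differences = statistics.variance(differences)  # SD squared

# Paired t = mean of differences / sqrt(SD^2 of differences / n).
t_value = mean_difference / math.sqrt(variance_of_differences / n)
degrees_of_freedom = n - 1

# Table value of t for df = 4 at the two-sided 0.05 level.
CRITICAL_T_DF4 = 2.776
significant = abs(t_value) > CRITICAL_T_DF4

print(round(t_value, 2), degrees_of_freedom, significant)
```

For these made-up data the differences are 5, 2, 10, 5 and 3, giving t of about 3.63 with 4 degrees of freedom, which exceeds the table value.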
  73. 73. ANALYSIS OF VARIANCE “ANOVA” • The main idea in ANOVA is that we have to take into account the variability within the groups & between the groups.
      One-way ANOVA: • The subgroups to be compared are defined by just one factor. • Example: comparison between the means of different socio-economic classes.
      Two-way ANOVA: • Used when the subdivision is based upon more than one factor.
  74. 74. The F-value is the ratio of the between-groups mean square (MS) to the within-groups mean square: F = between-groups MS / within-groups MS
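The F ratio for a one-way ANOVA can be sketched as below. The three groups are hypothetical values chosen so the arithmetic is easy to follow, and the function name `one_way_anova_f` is illustrative. Each mean square is the corresponding sum of squares divided by its degrees of freedom (number of groups minus 1 between groups, number of observations minus number of groups within groups).

```python
def one_way_anova_f(groups):
    """F = between-groups mean square / within-groups mean square."""
    all_values = [value for group in groups for value in group]
    grand_mean = sum(all_values) / len(all_values)
    group_means = [sum(group) / len(group) for group in groups]

    # Variability of the group means around the grand mean.
    ss_between = sum(
        len(group) * (mean - grand_mean) ** 2
        for group, mean in zip(groups, group_means)
    )
    # Variability of the observations around their own group mean.
    ss_within = sum(
        (value - mean) ** 2
        for group, mean in zip(groups, group_means)
        for value in group
    )

    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)


# Hypothetical measurements in three groups (illustration only).
groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
f_value = one_way_anova_f(groups)
print(round(f_value, 2))
```

For these groups the between-groups sum of squares is 26 (mean square 13) and the within-groups sum of squares is 6 (mean square 1), so F = 13.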
  76. 76. CHI-SQUARED TEST A chi-squared test is used to test whether there is an association between the row variable & the column variable or, in other words, whether the distribution of individuals among the categories of one variable is independent of their distribution among the categories of the other. Qualitative data are arranged in a table formed by rows & columns. The categories of one variable define the rows & the categories of the other variable define the columns.
  77. 77. $\chi^2 = \sum \frac{(O - E)^2}{E}$, where O = observed value in the table and E = expected value. Expected (E) = (Row total x Column total) / Grand total. Degree of freedom = (rows - 1) (columns - 1)
  78. 78. EXAMPLE HYPOTHETICAL STUDY • Two groups of patients are treated using different spinal manipulation techniques: Gonstead vs. Diversified. • The presence or absence of pain after treatment is the outcome measure. • Two categorical variables: technique used; pain after treatment.
  79. 79. GONSTEAD VS. DIVERSIFIED EXAMPLE - RESULTS
                     Pain after treatment
      Technique      Yes    No    Row total
      Gonstead         9    21    30
      Diversified     11    29    40
      Column total    20    50    70 (grand total)
      9 out of 30 (30%) still had pain after Gonstead treatment and 11 out of 40 (27.5%) still had pain after Diversified, but is this difference statistically significant?
  80. 80. FIRST FIND THE EXPECTED VALUES FOR EACH CELL • To find E for cell a (and similarly for the rest), take the observed table (Gonstead: 9, 21, row total 30; Diversified: 11, 29, row total 40; column totals 20 and 50; grand total 70), multiply the row total by the column total and divide by the grand total. Expected (E) = (Row total x Column total) / Grand total
  81. 81. Evidence-based Chiropractic. Find E for all cells:
      Technique      Yes                         No                          Row total
      Gonstead       9  (E = 30*20/70 = 8.6)     21 (E = 30*50/70 = 21.4)    30
      Diversified    11 (E = 40*20/70 = 11.4)    29 (E = 40*50/70 = 28.6)    40
      Column total   20                          50                          70 (grand total)
  82. 82. Use the Χ² formula, (O - E)² / E, with each cell and then add the results together: (9 - 8.6)² / 8.6 = 0.0186; (21 - 21.4)² / 21.4 = 0.0075; (11 - 11.4)² / 11.4 = 0.0140; (29 - 28.6)² / 28.6 = 0.0056. Χ² = 0.0186 + 0.0075 + 0.0140 + 0.0056 = 0.0457
  83. 83. • Find df and then consult a Χ² table to see if the result is statistically significant. • Degree of freedom = (rows - 1) (columns - 1). There are two categories for each variable in this case, so df = 1. • The critical value at the 0.05 level and one df is 3.84. • The calculated Χ² is far below 3.84; therefore, it is not statistically significant.
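The whole calculation for the Gonstead vs. Diversified table can be sketched in a few lines. The function name `chi_squared` is illustrative, and the code carries the expected values unrounded, so the total differs slightly from a hand calculation with E rounded to one decimal.

```python
# Observed counts: rows = technique, columns = pain after treatment (yes, no).
observed = [[9, 21],    # Gonstead
            [11, 29]]   # Diversified


def chi_squared(table):
    """Return (chi-squared statistic, degrees of freedom) for a table."""
    row_totals = [sum(row) for row in table]
    column_totals = [sum(column) for column in zip(*table)]
    grand_total = sum(row_totals)

    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed_count in enumerate(row):
            # E = row total x column total / grand total
            expected = row_totals[i] * column_totals[j] / grand_total
            statistic += (observed_count - expected) ** 2 / expected

    degrees_of_freedom = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, degrees_of_freedom


chi2, df = chi_squared(observed)
CRITICAL_VALUE = 3.84  # 0.05 level, 1 degree of freedom (from the slide)
print(round(chi2, 4), df, chi2 > CRITICAL_VALUE)
```

With unrounded expected values the statistic is about 0.05 with 1 degree of freedom, far below 3.84, so the conclusion is the same: the difference is not statistically significant.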
  84. 84. Z TEST FOR COMPARING TWO PERCENTAGES $z = \frac{p_1 - p_2}{\sqrt{\frac{p_1 q_1}{n_1} + \frac{p_2 q_2}{n_2}}}$, where p1 = % in the 1st group, p2 = % in the 2nd group, q1 = 100 - p1, q2 = 100 - p2, n1 = sample size of the 1st group and n2 = sample size of the 2nd group. The z test is significant (at the 0.05 level) if the absolute value of the result is >2.
  85. 85. Example: Group 1 includes 50 patients, of whom 5 are anemic, & group 2 includes 60 patients, of whom 20 are anemic. To find whether groups 1 & 2 differ statistically in the prevalence of anemia, we calculate the z test. p1 = 5/50 = 10%, p2 = 20/60 = 33%, q1 = 100 - 10 = 90, q2 = 100 - 33 = 67. z = |10 – 33| / √(10x90/50 + 33x67/60) = 23 / √(18 + 36.85) = 23 / 7.4 = 3.1. Therefore there is a statistically significant difference between the percentages of anemia in the studied groups (because z >2).
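The anemia example can be reproduced with the short sketch below. The function name `z_two_percentages` is illustrative, and the code keeps p2 unrounded (33.3% instead of 33%), so intermediate values differ slightly from the hand calculation.

```python
import math


def z_two_percentages(cases_1, n1, cases_2, n2):
    """z statistic for the difference between two percentages."""
    p1 = 100 * cases_1 / n1
    p2 = 100 * cases_2 / n2
    q1 = 100 - p1
    q2 = 100 - p2
    standard_error = math.sqrt(p1 * q1 / n1 + p2 * q2 / n2)
    return (p1 - p2) / standard_error


# 5 anemic out of 50 in group 1, 20 anemic out of 60 in group 2.
z = z_two_percentages(5, 50, 20, 60)
print(round(abs(z), 1), abs(z) > 2)
```

The sign of z only reflects which group is listed first; its absolute value is about 3.1, which is above 2, so the difference is significant at the 0.05 level.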
  86. 86. CORRELATION & REGRESSION
  87. 87. CORRELATION & REGRESSION Correlation measures the closeness of the association between 2 continuous variables, while Linear regression gives the equation of the straight line that best describes & enables the prediction of one variable from the other.
  88. 88. CORRELATION
  89. 89. t-test for correlation is used to test the significance of the association.
  90. 90. CORRELATION IS NOT CAUSATION!!!
  91. 91. LINEAR REGRESSION
      Same as correlation: • Determines the relation & predicts the change in one variable due to changes in another variable. • The t-test is also used to assess the level of significance.
      Differs from correlation: • The independent variable has to be distinguished from the dependent variable. • The dependent variable in linear regression must be continuous. • Allows the prediction of the dependent variable for a particular value of the independent variable, “but should not be used outside the range of the original data”.
  92. 92. SCATTERPLOTS • An X-Y graph with symbols that represent the values of two variables. (Figure: scatterplot with a regression line.)
  93. 93. MULTIPLE REGRESSION • Describes the dependency of a dependent variable on several independent variables, not just one. • The test of significance used is ANOVA (F test).
  94. 94. For example: suppose neonatal birth weight depends on these factors: gestational age, length of baby and head circumference, and each factor correlates significantly with birth weight (i.e. has a positive linear correlation). We can do multiple regression analysis to obtain a mathematical equation by which we can predict the birth weight of any neonate if we know the values of these factors.
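The slides do not give formulas for correlation or regression, so the following is only a sketch under stated assumptions: it uses the Pearson correlation coefficient r, the least-squares line y = intercept + slope · x, and the usual t statistic for a correlation, t = r √(n - 2) / √(1 - r²) with n - 2 degrees of freedom. The x and y values are hypothetical.

```python
import math

# Hypothetical paired measurements (illustration only).
x = [1, 2, 3, 4, 5]   # independent variable
y = [2, 4, 5, 4, 5]   # dependent variable

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Sums of squares and cross-products around the means.
s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
s_xx = sum((xi - mean_x) ** 2 for xi in x)
s_yy = sum((yi - mean_y) ** 2 for yi in y)

# Correlation: closeness of the linear association (between -1 and +1).
r = s_xy / math.sqrt(s_xx * s_yy)

# Regression: the straight line that best predicts y from x.
slope = s_xy / s_xx
intercept = mean_y - slope * mean_x

# t statistic for testing the significance of r (df = n - 2).
t_value = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

print(round(r, 3), round(slope, 2), round(intercept, 2), round(t_value, 2))
```

For these made-up data r is about 0.77 and the fitted line is y = 2.2 + 0.6x. As the slide notes, the line should only be used to predict y for x values inside the observed range (here 1 to 5).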
