Steps in Developing A Valid and Reliable Scale.pdf

Professor of community medicine and public health
25 Mar 2023

  1. Steps in Developing a Valid and Reliable Scale of Measurement
     By: Omnia Samir Elseifi, Assistant Professor of Public Health and Community Medicine, Faculty of Medicine, Zagazig University
     23 January 2020
  2. Scale development process
     • Measurement scales are useful tools for scoring health aspects that cannot be measured directly, such as quality of life.
     • The researcher must pass through several steps to reach the ultimate goal: a valid and reliable scale that supports the application of the test results.
     Phase I: Item Development
       1- Identification of domain
       2- Item generation
       3- Content validity
     Phase II: Scale Development
       4- Pretesting (pilot testing of the items)
       5- Item reduction
       6- Extraction of factors
     Phase III: Scale Evaluation
       7- Test of dimensionality
       8- Test of reliability
       9- Test of validity
     (1,2,3)
  3. Scale development process: Scheme
     1- Identification of domain(s)
        Purpose: To specify the boundaries of the domain.
        1- Purpose
        2- Justification
        3- Describing domains
        4- Specify the dimensions
        5- Define each dimension
     2- Item generation
        Purpose: To select which items to ask.
        1- Appropriate questions
        2- Number of items
        3- Item wording
        4- Translation of items
        5- Types of questions
        6- Response to items
     3- Content validity
        Purpose: To assess whether the items adequately measure the content of the domain of interest.
        • CVR
        • CVI
        • Face validity
  4. Scale development process: Scheme
     4- Pretesting
        Purpose: To gather enough data from the right people.
        1- Interview with target population
        2- Sample size
        3- Distribution of scale
     5- Item reduction
        Purpose: To identify items that are not related to the domain, so they can be deleted or modified.
        1- Item difficulty index
        2- Item discrimination index
        3- Item-item correlation and item-total correlation
        4- Distractor efficiency analysis
     6- Extraction of factors
        Purpose: To explore the number of latent constructs that fit the observed data.
        • Exploratory Factor Analysis (EFA)
        • Confirmatory Factor Analysis (CFA)
  5. Scale development process: Scheme
     7- Test of dimensionality (using factor analysis)
        Purpose: To identify the number of latent variables that are measured by the scale.
        • Unidimensional scale
        • Multidimensional scale
     8- Test of reliability
        Purpose: To establish whether responses are consistent when repeated.
        1- Test-retest reliability
        2- Internal consistency
        3- Parallel-form reliability
        4- Inter-rater reliability
     9- Test of validity
        Purpose: To ensure the scale measures the intended latent dimension.
        • Criterion validity: concurrent validity, predictive validity
        • Construct validity: convergent validity, divergent validity, known-group validity
  6. Example of Validated Scale Development Research
     A study conducted in Pakistan, "Development of a stress scale for pregnant women in the South Asian context: the A–Z Stress Scale", is used as the example in most of the steps.
  7. Phase 1: Item development, Step 1: Identification of the Domain(s)
     The purpose: To specify the boundaries of the domain and facilitate item generation.
     In the example study:
     1- The purpose: to develop a scale based on stressors to measure stress among pregnant women in developing countries.
     2- Justification: the researchers found that pre-existing scales record the somatic and psychological symptoms of stress, not the stressors themselves.
     3- Describing domains: they agreed on definitions of the different stressors to which pregnant women are exposed.
     4- Specify the dimensions: they decided that the scale would consist of three dimensions: daily, life-event and pregnancy-related stressors.
     5- Define each dimension.
     (4,5)
  8. Phase 1: Item development, Step 1: Identification of the Domain(s)
     Pitfalls
     1. This step is often neglected or dealt with superficially.
     2. Construct underrepresentation (focusing on a narrow aspect of the domain).
     These problems cause a significant number of difficulties later in the validation process (6,7).
  9. Phase 1: Item development, Step 2: Item Generation
     The purpose: To create appropriate questions that fit the identified domain.
     1- Appropriate questions
     2- Number of items (must be 2-5 times the number in the final scale). In the example, the item pool contained 235 items.
     3- Item wording
     4- Translation of the items
     5- Types of questions
     6- Response to questions
     Methods used in the example:
     • Deductive methods: literature review.
     • Inductive methods: interviews with 25 experts from different specialties (psychiatry, gynecology and sociology), and interviews with 79 pregnant women, who were asked about possible stressors.
     (5,8-11)
  10. Phase 1: Item development, Step 2: Item Generation
      Pitfalls
      1. Items that are irrelevant to the defined domain can lead to failure of validation of the scale, poor data quality, and invalid conclusions about the results and the relationship with other constructs.
      2. An inappropriate response format: a response scale that is too short can reduce the reliability of the instrument, and so can one with too many response options (more than 7) (12).
  11. Phase 1: Item development, Step 3: Content Validity
      • Content validity ensures that the items of the generated scale measure what they are presumed to measure, covering the whole content of the domain of interest (2).
      • Content validity is assessed by experts and by the target population (2).
  12. Phase 1: Item development, Step 3: Content Validity
      Purpose: To evaluate the items constituting the domain with regard to content relevance and technical quality.
      Expert evaluation uses:
      • Content Validity Ratio (CVR)
      • Content Validity Index (CVI): I-CVIs and S-CVI
      • Kappa coefficient: > 0.74 is considered excellent, 0.60 to 0.74 good, and 0.40 to 0.59 fair.
      (2)
  13. Phase 1: Item development, Step 3: Content Validity
      Content Validity Ratio (CVR)
      • The experts are asked to specify whether an item is necessary for the construct or not:
        - Score 1 for a [not necessary] item.
        - Score 2 for a [useful but not essential] item.
        - Score 3 for an [essential] item.
      • $\mathrm{CVR} = \dfrac{n_e - N/2}{N/2}$, where $n_e$ is the number of experts rating the item essential and $N$ is the total number of experts.
      • For the minimum number of experts (5 or 6), CVR must be not less than 0.99; for 8 experts, not less than 0.85; for 10 experts, not less than 0.62. Otherwise the item should be eliminated from the scale.
      (13)
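The CVR calculation on this slide can be sketched in a few lines of Python. This is a minimal illustration rather than code from the presentation: the function name `content_validity_ratio` and the example panel (10 experts, 9 of whom rate the item essential) are assumptions, while the 0.62 cut-off for 10 experts is the value quoted on the slide.

```python
def content_validity_ratio(n_essential: int, n_experts: int) -> float:
    """Return CVR = (n_e - N/2) / (N/2) for a single item."""
    if n_experts <= 0:
        raise ValueError("n_experts must be positive")
    if not 0 <= n_essential <= n_experts:
        raise ValueError("n_essential must be between 0 and n_experts")
    half = n_experts / 2
    return (n_essential - half) / half


# Hypothetical panel: 9 of 10 experts rate the item "essential".
cvr = content_validity_ratio(9, 10)
keep_item = cvr >= 0.62  # minimum CVR quoted on the slide for 10 experts
print(f"CVR = {cvr:.2f}, keep item: {keep_item}")
```

The ratio is 0 when exactly half of the panel rates the item essential and 1 when every expert does.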
  14. Phase 1: Item development, Step 3: Content Validity
      Content Validity Index (CVI)
      Panel members are asked to rate the instrument items for clarity and relevance to the construct on a 4-point scale:
        - Score 1 for [not relevant or not clear] items.
        - Score 2 for [somewhat relevant, or somewhat clear and in need of some revision] items.
        - Score 3 for [quite relevant or quite clear] items.
        - Score 4 for [highly relevant or highly clear] items.
      I-CVI (for each item) = number of experts giving a score of 3 or 4 / total number of experts
      • > 79%: the item is appropriate and is retained in the scale.
      • Between 70% and 79%: the item needs revision.
      • < 70%: the item is eliminated from the scale.
      S-CVI/UA = number of items rated relevant by all experts / total number of items. It should be not less than 0.80.
      S-CVI/Ave = sum of the I-CVIs / total number of items. It should be not less than 0.90.
      (14)
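As a rough sketch of how the three indices on this slide relate to each other, the following Python computes I-CVI, S-CVI/UA and S-CVI/Ave from a small ratings table. The function names and the ratings (3 items rated by 5 experts) are made up for illustration and are not taken from the presentation.

```python
def item_cvi(ratings):
    """I-CVI: share of experts who scored the item 3 or 4 on the 4-point scale."""
    return sum(1 for r in ratings if r >= 3) / len(ratings)


def scale_cvi(ratings_by_item):
    """Return (list of I-CVIs, S-CVI/UA, S-CVI/Ave) for a list of items."""
    i_cvis = [item_cvi(ratings) for ratings in ratings_by_item]
    # Universal agreement: items that every expert scored 3 or 4.
    s_cvi_ua = sum(1 for value in i_cvis if value == 1.0) / len(i_cvis)
    # Average: mean of the item-level indices.
    s_cvi_ave = sum(i_cvis) / len(i_cvis)
    return i_cvis, s_cvi_ua, s_cvi_ave


# Hypothetical ratings: one inner list per item, one score per expert.
ratings_by_item = [
    [4, 4, 3, 4, 3],  # all five experts give 3 or 4
    [4, 3, 3, 4, 2],  # four of five
    [2, 1, 3, 4, 2],  # two of five
]
i_cvis, s_cvi_ua, s_cvi_ave = scale_cvi(ratings_by_item)
print(i_cvis, round(s_cvi_ua, 2), round(s_cvi_ave, 2))
```

With these made-up ratings, the third item falls below the 70% cut-off and would be eliminated, and both scale-level indices fall short of the thresholds quoted on the slide.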
  15. Phase 1: Item development, Step 3: Content Validity
      Face validity
      • Face validity is the degree to which the designed measuring instrument appears appropriate and related to the domain under study.
      • Aspects assessed: readability, feasibility, layout and clarity of words.
      • The target population shares with the experts in evaluating the face validity of the scale (15).
  16. Phase 1: Item development, Step 3: Content Validity
      Example: In the study on the development of a stress scale for pregnant women in the South Asian context, the A–Z Stress Scale (5), the researchers stated that they evaluated the content validity of the scale through experts and through the target population of pregnant women (face validity). On that basis, 78 items were selected from the item pool.
      Pitfalls
      • Some studies fail to assess content validity, possibly because of a lack of resources or skills. This is expected to affect the data collected with the scale and the statistical analysis.
      • Only a limited number of newly developed scales undergo evaluation by the target population, although this is an important step because that population is the target of the new scale (16).
  17. Phase 2: Scale Development, Step 4: Pre-testing Questions
      The purpose: To ensure that sufficient data are available for scale development with a minimum level of error.
      1- Cognitive interviews with the target population (pregnant women in the example).
      2- Sample size: the golden rule of thumb is 10 respondents per survey item (10:1). In the example, 70 pregnant women were interviewed.
      3- Distribution of the scale: paper-based survey or online survey. In the example, a paper-based, face-to-face interview was used.
      (5,17,18)
      Pitfalls
      • Sample size in many validation studies is smaller than the golden rule, possibly because this type of study is difficult to fund.
      • Missing data increase the risk of inaccurate conclusions because they increase the occurrence of errors.
  18. Phase 2: Scale Development, Step 5: Item Reduction
      The purpose: To identify items that are not related to the domain under study so they can be deleted or modified. (5)
      • Item difficulty index
      • Item discrimination test
      • Inter-item and item-total correlations
      • Distractor efficiency analysis
  19. Phase 2: Scale Development, Step 5: Item Reduction
      Inter-item and Item-Total Correlations
      Purpose: To determine the correlations between scale items, as well as the correlations between each item and the sum score of the scale items.
      • Inter-item correlations examine the correlation between each item in the scale and the other items.
      • Item-total correlations examine the relationship between each item score and the total scale score.
      In both techniques, items with low correlations (r < 0.30) are less desirable and could be deleted. (19,20)
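A minimal sketch of the item-total screening described on this slide, using only the Python standard library. The helper names and the tiny response matrix are assumptions made for illustration. The total here includes the item itself, as in the slide's definition; excluding the item from the total is a common variant.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    # Note: fails with ZeroDivisionError if either sequence has zero variance.
    return cov / sqrt(var_x * var_y)


def item_total_correlations(responses):
    """responses: one row per respondent, one column per item."""
    totals = [sum(row) for row in responses]
    n_items = len(responses[0])
    return [
        pearson_r([row[j] for row in responses], totals)
        for j in range(n_items)
    ]


# Hypothetical data: 4 respondents x 3 items; item 3 runs against the others.
responses = [
    [1, 1, 4],
    [2, 2, 3],
    [3, 3, 2],
    [4, 4, 1],
]
correlations = item_total_correlations(responses)
# Flag items below the r < 0.30 threshold quoted on the slide (1-based item numbers).
flagged = [j + 1 for j, r in enumerate(correlations) if r < 0.30]
print(correlations, flagged)
```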
  20. Phase 2: Scale Development, Step 5: Item Reduction
      Example: In the study on the development of a stress scale for pregnant women in the South Asian context, the A–Z Stress Scale (5), the researchers conducted item-total analysis; the correlations ranged from r = 0.2 to r = 0.8. As a result, the scale was reduced to a final 30 items.
  21. Phase 2: Scale Development, Step 5: Item Reduction
      Item Difficulty Index
      Purpose: To assess the difficulty level of the scale's test items.
      Item difficulty index = number of correct answers for the item / total number of answers on that item. It ranges between 0.0 and 1.0.

      Item difficulty index    Difficulty level
      0.86 and above           Very easy
      0.71 to 0.85             Easy
      0.30 to 0.70             Moderate
      0.15 to 0.29             Difficult
      0.14 and below           Very difficult

      A high difficulty index means that a greater proportion of the sample answered the question correctly. A lower difficulty index means that a smaller proportion of the sample understood the question and answered it correctly. (2,21)
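The difficulty index and its bands can be illustrated with a short Python sketch. The function names and the sample answers are hypothetical. Because the printed bands leave small gaps (for example between 0.70 and 0.71), this sketch assumes each band's lower bound is the cut-off.

```python
def item_difficulty(correct_flags):
    """Proportion of respondents who answered the item correctly (0.0 to 1.0)."""
    return sum(correct_flags) / len(correct_flags)


def difficulty_level(p):
    """Map a difficulty index to the labels listed on the slide.

    Assumption: each band's lower bound is used as the cut-off, so a value
    between two printed bands (e.g. 0.705) falls into the lower band.
    """
    if p >= 0.86:
        return "Very easy"
    if p >= 0.71:
        return "Easy"
    if p >= 0.30:
        return "Moderate"
    if p >= 0.15:
        return "Difficult"
    return "Very difficult"


# Hypothetical item: 8 of 10 respondents answered correctly (1 = correct).
answers = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
p = item_difficulty(answers)
print(p, difficulty_level(p))
```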
  22. Phase 2: Scale Development, Step 5: Item Reduction
      Item Discrimination Test
      Purpose: To identify the degree to which an item correctly differentiates between respondents.
      Respondents are divided into an upper group (with high scores) and a lower group (with low scores).
      Item discrimination index = proportion of respondents who got the item correct in the upper group - proportion of respondents who got the item correct in the lower group. It ranges between -1 and +1.

      Item discrimination index    Discrimination level
      0.19 and below               Poor item; should be eliminated or revised
      0.20 to 0.29                 Marginal item; needs revision
      0.30 to 0.39                 Good item; may need some improvement
      0.40 or above                Very good item

      (22,23)
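The discrimination index can be sketched as follows. The slide does not say how the upper and lower groups are formed, so this illustration takes the two groups as given; the function names and the answers are hypothetical.

```python
def discrimination_index(upper_correct, lower_correct):
    """Proportion correct in the upper group minus proportion correct in the lower group."""
    p_upper = sum(upper_correct) / len(upper_correct)
    p_lower = sum(lower_correct) / len(lower_correct)
    return p_upper - p_lower


def discrimination_level(d):
    """Map a discrimination index to the labels listed on the slide."""
    if d >= 0.40:
        return "Very good item"
    if d >= 0.30:
        return "Good item; may need some improvement"
    if d >= 0.20:
        return "Marginal item; needs revision"
    return "Poor item; should be eliminated or revised"


# Hypothetical answers to one item (1 = correct) in each group.
upper_group = [1, 1, 1, 1, 0]  # 4 of 5 correct
lower_group = [0, 1, 0, 0, 0]  # 1 of 5 correct
d = discrimination_index(upper_group, lower_group)
print(round(d, 2), discrimination_level(d))
```

A negative index means the lower group outperformed the upper group on the item, which lands in the "poor item" band.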
  23. Phase 2: Scale Development, Step 5: Item Reduction
      Distractor Efficiency Analysis
      Purpose: To determine the distribution of the incorrect options ("distractors") and how they contribute to the quality of items.
      Respondents are divided into an upper group (with high scores), a middle group (with middle scores) and a lower group (with low scores).
      For an appropriate item, the correct option is chosen by:
      • 100% of participants in the upper group,
      • about 50% of participants in the middle group,
      • few or none of those in the lower group.
      If those with adequate knowledge (the upper group) cannot differentiate between the correct option and the distractors, the question may need to be modified or deleted. (24,25)
  24. Phase 2: Scale Development, Step 6: Extraction of Factors
      Factor analysis is a method for explaining the structure of data by explaining the correlations between variables. It summarizes data into a few dimensions by condensing many variables into a smaller set of latent variables or factors.
      • Exploratory Factor Analysis (EFA) examines the interrelations between items in the construct. It is used to reduce the set of observed variables to a smaller, more closely related set of variables.
      • Confirmatory Factor Analysis (CFA) is used to confirm the factors by statistically testing hypotheses about the expected factor loading (FL) of the observed items on the underlying (latent) factors and about the correlation between latent variables.
      • Items with factor loadings or slope coefficients below 0.30 are considered inadequate ("unrelated items") and should be eliminated.
      • Items with cross-loading > 0.4 should be eliminated.
      (4,23,26)
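Once a factor analysis has produced a loading matrix, the two elimination rules on this slide can be applied mechanically. The sketch below assumes the loadings are already available from statistical software; the function name, the item names and the loading values are invented for illustration. It also assumes that "cross-loading > 0.4" means an item loads above 0.4 on two or more factors, which the slide does not spell out.

```python
def screen_item(loadings, min_loading=0.30, cross_loading=0.40):
    """Classify one item from its loadings on each extracted factor."""
    magnitudes = [abs(value) for value in loadings]
    if max(magnitudes) < min_loading:
        return "eliminate: low loading"
    # Assumed reading of the slide: loading above the cut-off on 2+ factors.
    if sum(1 for value in magnitudes if value > cross_loading) >= 2:
        return "eliminate: cross-loading"
    return "retain"


# Hypothetical two-factor loading matrix (item -> loadings on factor 1 and 2).
loading_matrix = {
    "item_1": [0.72, 0.10],
    "item_2": [0.25, 0.12],
    "item_3": [0.55, 0.48],
}
decisions = {item: screen_item(values) for item, values in loading_matrix.items()}
for item, decision in decisions.items():
    print(item, decision)
```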
  25. Phase 2: Scale Development, Step 6: Extraction of Factors
      Example: In a study developing a disease-specific tool for assessing the quality of life of patients with hepatitis C virus associated chronic liver disease (27), the researchers conducted CFA and calculated factor loadings; any item with a factor loading of less than 0.3 was eliminated.
      Pitfalls: Many scale developers hesitate to use factor analysis because:
      • it requires a large sample size;
      • it involves many confusing and complicated steps and interpretations (16).
  26. Phase 3: Scale Evaluation, Step 7: Test of Dimensionality
      • Purpose: To test the scale's dimensionality, that is, to identify the number of latent variables that are measured by the scale.
      • It usually depends on factor extraction and analysis.
      (12)
  27. Phase 3: Scale Evaluation, Step 7: Test of Dimensionality
      Example: In the study on the development of a stress scale for pregnant women in the South Asian context, the A–Z Stress Scale (5), the researchers stated that multidimensional scaling showed their scale to have two dimensions:
      1- a socioenvironmental-related hassles dimension (items 1-26);
      2- a chronic illness dimension (items 27-30).
      Pitfalls
      • Failure to carry out EFA and CFA properly leads to misclassification of the dimensions of the construct.
      • Many researchers rely on the literature and expert opinion to divide the construct into dimensions rather than using factor analysis (12).
  28. Phase 3: Scale Evaluation, Step 8: Tests of Reliability
      Reliability is the ability to reproduce the same result consistently under the same conditions.
      Purpose: To measure reliability with regard to stability, internal consistency, equivalence and inter-rater reliability.
      Stability
      • The test is administered twice or more to the same participants to ensure that the same results are obtained.
      • Example: the developing scale was tested on 43 pregnant women twice, at a one-week interval (r = 0.86).
      Internal consistency
      • It measures whether items measuring the same general construct produce the same scores (homogeneity).
      • It is assessed by Cronbach's α (value 0-1; ≥ 0.7 is acceptable), Kuder-Richardson, and split-half reliability (the scale is divided into two equal halves, which are then compared).
      • Example: Cronbach's alpha was 0.82 for the scale and ranged between 0.75 and 0.86 for different items.
      Equivalence
      • It determines the level of agreement between two or more instruments at the same point in time.
      Inter-rater reliability
      • It assesses the degree of agreement between two or more raters in assessing a certain phenomenon at the same point in time.
      • Example: the developing scale was applied to 50 pregnant women by two interviewers (r = 0.91).
      (22, 28, 29)
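The slide names Cronbach's α without giving its formula. As a hedged illustration, the standard form is α = k/(k-1) × (1 - Σ item variances / variance of total scores), where k is the number of items. The sketch below implements that formula with the standard library; the function name and the response matrix are assumptions, not data from the A–Z Stress Scale study.

```python
from statistics import variance


def cronbach_alpha(responses):
    """Cronbach's alpha for responses given as one row per respondent."""
    k = len(responses[0])  # number of items
    if k < 2:
        raise ValueError("at least two items are required")
    # Sample variance of each item across respondents.
    item_variances = [variance([row[j] for row in responses]) for j in range(k)]
    # Sample variance of the respondents' total scores.
    total_variance = variance([sum(row) for row in responses])
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)


# Hypothetical data: 5 respondents x 3 items on a 1-5 response scale.
responses = [
    [3, 4, 3],
    [2, 2, 3],
    [4, 5, 4],
    [1, 2, 2],
    [5, 4, 5],
]
alpha = cronbach_alpha(responses)
acceptable = alpha >= 0.7  # threshold quoted on the slide
print(f"alpha = {alpha:.3f}, acceptable: {acceptable}")
```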
  29. Phase 3: Scale Evaluation, Step 8: Tests of Reliability
      Pitfalls:
      • Test-retest reliability should be used with caution because scores can change over time in some types of studies (e.g., intervention studies). In that case the change is not due to an unreliable measure; it is a true change in the participants.
      • A scale with fewer than 10 items can yield a lower Cronbach's alpha.
      • Lack of standardization between observers decreases inter-rater agreement (1,2).
  30. Phase 3: Scale Evaluation, Step 9: Tests of Validity
      Validity is the ability of the measuring scale to evaluate the domain that it was intended to measure.
      • Content validity (including face validity)
      • Criterion validity
        - Concurrent validity: the new measure is compared with a gold standard at the same time.
        - Predictive validity: the new measure is used to predict a gold standard or a behavior after time.
      • Construct validity
        - Convergent validity: two related measures applied to the same group give the same result.
        - Divergent (discriminant) validity: two different measures applied to the same group give different results.
        - Known-groups validity: the same measurement applied to two different groups gives different results.
      (22, 28, 30)
  31. Phase 3: Scale Evaluation, Step 9: Tests of Validity
      Example: In the study on the development of a stress scale for pregnant women in the South Asian context, the A–Z Stress Scale (5), concurrent validity (a form of criterion validity, compared at the same time) was assessed by comparing the new A–Z Stress Scale with a multicultural validated depression scale. There was a moderate correlation between the two scales (r = 0.56).
      Pitfalls for validity calculation:
      1- Criterion validity cannot be assessed with a small sample size because of sampling error.
      2- Criterion validity cannot be used in all circumstances, especially in the social sciences, because a relevant criterion ("gold standard") may not exist. It is therefore ignored and not calculated in most validation studies.
      3- Lack of sufficient resources or skills for calculation and assessment (22).
  32. Phase 3: Scale Evaluation, Step 9: Tests of Validity
      Pitfalls for validity calculation (cont.):
      4- Scale developers usually use a homogeneous group from the population in the pilot study, which limits the calculation of construct validity. Recruiting a heterogeneous group or a random sample of the population is recommended.
      5- A single-time calculation of validity is inaccurate if the variable under study changes with time, and it can lead to pseudo-correlations between variables. Longitudinal studies during scale development are therefore recommended to obtain accurate validity measures, especially for predictive validity.
      6- Social desirability bias: a systematic error in self-reported measures in which participants want to maintain a good image. It is considered one of the important threats to validity (22).
  33. Conclusion
      • Valid research results begin with valid and reliable measurement. This can be achieved if a systematic, scientifically based process is followed.
      • Developing a valid and reliable scale is a multiphase procedure that needs a researcher with adequate knowledge and an appropriate level of skill.
      • Poor scale development affects the validity and reliability of the results and, therefore, their applicability in practice. A comprehensive guide for scale development is therefore essential.
  34. References
      1. Fabrigar LR, Ebel-Lam A. Questionnaires. In: Salkind NJ (Ed.), Encyclopedia of Measurement and Statistics. Thousand Oaks, CA: Sage; 2007. pp. 808-812.
      2. DeVellis RF. Scale Development: Theory and Application (3rd ed.). Los Angeles, CA: Sage Publications; 2012.
      3. Hinkin TR. A review of scale development practices in the study of organizations. J Manag. 1995; 21:967-88. doi:10.1016/0149-2063(95)90050-0
      4. McCoach DB, Gable RK, Madura JP. Instrument Development in the Affective Domain: School and Corporate Applications (3rd ed.). New York, NY: Springer; 2013.
      5. Kazi A, Fatmi Z, Hatcher J, Niaz U, Aziz A. Development of a stress scale for pregnant women in the South Asian context: the A-Z Stress Scale. East Mediterr Health J. 2009 Mar-Apr; 15(2):353-61. PMID: 19554982.
      6. Messick S. Validity of psychological assessment: validation of inferences from persons' responses and performance as scientific inquiry into score meaning. Am Psychol. 1995; 50:741-9. doi:10.1037/0003-066X.50.9.741
      7. MacKenzie SB. The dangers of poor construct conceptualization. Journal of the Academy of Marketing Science. 2003; 31(3):323-326.
      8. Streiner DL, Norman GR, Cairney J. Health Measurement Scales: A Practical Guide to Their Development and Use (5th ed.). Oxford, UK: Oxford University Press; 2015.
      9. Schinka JA, Velicer WF, Weiner IR. Handbook of Psychology, Research Methods in Psychology. Hoboken, NJ: John Wiley & Sons, Inc.; 2012.
      10. DeVellis RF. Scale Development: Theory and Applications (4th ed.). Thousand Oaks, CA: Sage; 2017.
      11. Price LR. Psychometric Methods: Theory into Practice. New York: The Guilford Press; 2017. pp. 190-191.
      12. Furr RM. Scale Construction and Psychometrics for Social and Personality Psychology. New Delhi, IN: Sage Publications; 2011.
      13. Streiner DL, Norman GR, Cairney J. Health Measurement Scales: A Practical Guide to Their Development and Use (5th ed.). Oxford, UK: Oxford University Press; 2015.
      14. Polit DF, Beck CT, Owen SV. Is the CVI an acceptable indicator of content validity? Appraisal and recommendations. Res Nurs Health. 2007; 30(4):459-67.
  35. References (cont.)
      15. Haynes SN, Richard DCS, Kubany ES. Content validity in psychological assessment: a functional approach to concepts and methods. Psychol Assess. 1995; 7:238-47.
      16. Morgado FFR, Meireles JFF, Neves CM, Amaral ACS, Ferreira MEC. Scale development: ten main limitations and recommendations to improve future research practices. Psicol Reflex E Crítica. 2018; 30:3.
      17. Greenlaw C, Brown-Welty S. A comparison of web-based and paper-based survey methods: testing assumptions of survey mode and response cost. Eval Rev. 2009; 33:464-80.
      18. Fanning J, McAuley E. A comparison of tablet computer and paper-based questionnaires in healthy aging research. JMIR Res Protoc. 2014; 3:e38.
      19. Raykov T, Marcoulides GA. Introduction to Psychometric Theory. New York, NY: Routledge, Taylor & Francis Group; 2011.
      20. Cohen RJ, Swerdlik ME. Psychological Testing and Assessment: An Introduction to Tests and Measurement (6th ed.). New York: McGraw-Hill; 2005.
      21. Si-Mui Sim, Rasiah RI. Relationship between item difficulty and discrimination indices in true/false type multiple choice questions of a para-clinical multidisciplinary paper. Ann Acad Med Singapore. 2006; 35:67-71.
      22. Whiston SC. Principles and Applications of Assessment in Counseling. Cengage Learning; 2008.
      23. Zubairi AM, Kassim NLA. Classical and Rasch analysis of dichotomously scored reading comprehension test items. Malaysian J of ELT Res. 2006; 2:1-20.
      24. Tarrant M, Ware J, Mohammed AM. An assessment of functioning and nonfunctioning distractors in multiple-choice questions: a descriptive analysis. BMC Med Educ. 2009; 9:40.
      25. Fulcher G, Davidson F. The Routledge Handbook of Language Testing. New York, NY: Routledge; 2012.
      26. Polit DF, Beck CT. Nursing Research: Generating and Assessing Evidence for Nursing Practice (9th ed.). Philadelphia, USA: Wolters Kluwer Health, Lippincott Williams & Wilkins; 2012.
      27. Sobhi SA, Ibrahim AS, Serwah AA, Tawfik MY. Developing a disease-specific tool for assessment of quality of life of patients with hepatitis C virus associated chronic liver disease. Suez Canal University Medical Journal. 2008; 11(2):207-214.
      28. Boateng GO, Neilands TB, Frongillo EA, Melgar-Quiñonez HR, Young SL. Best practices for developing and validating scales for health, social, and behavioral research: a primer. Front Public Health. 2018; 6:149.
      29. Wong KL, Ong SF, Kuek TY. Constructing a survey questionnaire to collect data on service quality of business academics. Eur J Soc Sci. 2012; 29:209-21.
      30. Sackett PR, Lievens F, Berry CM, Landers RN. A cautionary note on the effects of range restriction on predictor intercorrelations. Journal of Applied Psychology. 2007; 92(2):538-544.