Validity and Reliability
Session 2
Chapter 4
Colton & Covert (2007)
What is Validity?
According to Colton and Covert (2007), validity is
“the ability of an instrument to measure what you intend
it to measure” (p. 65).
Validity ensures trustworthy and credible information.
“Validity is a matter of degree.”
(Colton & Covert, 2007, p. 65)
• Assessment instruments are not merely valid or
invalid.
• Validity exists in varying degrees across a
continuum.
• Validity is a characteristic of the responses/data
gathered.
• The greater the evidence of validity, the greater
the likelihood of credible, trustworthy data.
• Hence, the importance of establishing/testing
the validity before the instrument is used.
In order to gather evidence that an
instrument is valid, we need to establish
that it is measuring:
1. the right content (Content Validity)
(Does the instrument measure the content it’s intended to measure?)
2. the right construct (Construct Validity)
(Does the instrument measure the construct it’s designed to measure?)
3. the right criterion (Criterion Validity)
(Do the instrument scores align with one or more standards or outcomes
related to the instrument’s intent?)
Establishing Evidence of
Content Validity
To determine this, ask:
Do the items in the instrument represent
the topics or process being investigated?
Ex: An instrument designed to measure alcohol use
should measure behaviors associated with alcohol use
(not smoking, drug use, etc.).
Establishing Evidence of
Content Validity
These steps are done during the assessment development stage:
1. Define the content domain that the assessment intends to measure.
2. Through a literature review, define the components of the content
domain that should be represented in the assessment.
3. Write the items/questions that reflect this defined content domain.
4. Have a panel of topic experts review the items/questions.
An Example: Establishing Evidence of
Content-related Validity
You are to design an instrument to measure undergraduate college teaching
effectiveness.
1. Clearly define the domain of the content that the assessment intends to represent.
Determine the topics/principles related to college teaching effectiveness using the
literature.
2. Define the components of the content domain that should be represented in the
assessment.
Select the content areas that are specific to effective undergraduate college
teaching (not graduate school or adult learning).
3. Write items/questions that reflect this defined content domain.
Write response items for each component.
4. Have a panel of topic experts review the response items for clarity and coverage.
Recommended method for a response-item
review by a panel of topic experts (Popham, 2000):
1. Have the panel of experts individually examine each item for
content relevance, noting YES (relevant) or NO (not relevant).
2. Calculate the percentage of YES responses for each item
and then the average percent of YES across all items. This
reflects item relevance.
3. Have panel members individually review the instrument for
content coverage, noting a percentage estimate.
4. Compute the average of all panelists’ estimates of
coverage. This reflects content coverage (a short calculation
sketch follows).
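As a rough illustration, these tallies are simple percentage arithmetic. The sketch below assumes a hypothetical three-person panel rating five items; all numbers are invented for the example.

```python
# Hypothetical Popham-style panel tallies.
# relevance_votes: one inner list per panelist; True = "YES, relevant".
relevance_votes = [
    [True, True, True, False, True],   # panelist 1, items 1-5
    [True, True, False, True, True],   # panelist 2
    [True, True, True, True, True],    # panelist 3
]
coverage_estimates = [90, 80, 85]      # each panelist's coverage estimate (%)

n_panelists = len(relevance_votes)
n_items = len(relevance_votes[0])

# Step 2: percent of YES votes per item, then the average across items.
per_item = [
    100 * sum(votes[i] for votes in relevance_votes) / n_panelists
    for i in range(n_items)
]
item_relevance = sum(per_item) / n_items

# Step 4: average of the panelists' coverage estimates.
content_coverage = sum(coverage_estimates) / n_panelists

print(f"Item relevance:   {item_relevance:.0f}%")
print(f"Content coverage: {content_coverage:.0f}%")
```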
What do the results mean?
95% item relevance
85% content coverage
Impressive evidence of content-related validity!
You could say, with relative confidence, that the instrument
validly measures the content it intends to measure.
---------------------
65% item relevance
40% content coverage
Poor evidence of content-related validity.
You could NOT say, with confidence, that the instrument validly
measures the content it intends to measure.
Establishing Criterion Validity
To determine this, ask:
Are the results from the instrument
comparable to an external standard or
outcome?
There are 2 types:
1. Predictive-related validity
2. Concurrent-related validity
1. Predictive-related Criterion Validity
The assessment scores are valid for predicting future
outcomes regarding similar criteria.
A significant lag time is needed.
Ex: A group of students take a standardized math & verbal
aptitude test in 10th grade and score very low. In the
students’ senior year, 2 years later, the students’ math and
verbal aptitude scores (criterion data) on the SAT (a college
entrance exam) turn out to be similarly low.
In this case, evidence of predictive criterion-related validity has been established.
We can trust the predictive inferences regarding math & verbal skills made from
this standardized instrument 
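The usual statistic for this comparison is a correlation between the earlier test scores and the later criterion scores. A minimal sketch, with invented score pairs, using SciPy's pearsonr:

```python
# Hypothetical data: 10th-grade aptitude scores vs. SAT scores (criterion)
# collected for the same students two years later.
from scipy.stats import pearsonr

grade10_scores = [410, 380, 450, 400, 360, 430, 390, 370]
sat_scores     = [420, 400, 470, 410, 350, 450, 380, 390]

r, p = pearsonr(grade10_scores, sat_scores)
print(f"predictive validity coefficient r = {r:.2f} (p = {p:.3f})")
# A strong positive r supports predictive criterion-related validity.
```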
2. Concurrent-related Criterion Validity
The assessment scores are valid for indicating
current behavior.
Ex: A group of students take a standardized math &
reading comprehension aptitude test in 10th grade
and receive very low scores. The scores are
compared to grades in 10th grade algebra and English
literature courses. They are equally low.
In this case, evidence of concurrent criterion-related validity has been established.
We trust the inferences regarding math and reading comprehension scores made
from the standardized instrument.
Establishing Construct Validity
To determine this, ask:
Does the instrument measure the construct
(i.e. psychological characteristic or human
behavior) it’s designed to measure?
Note: “Constructs” are hypothetical or theoretical.
Example: “Love” is a theoretical construction. Everyone
has constructed their own theory of what it is.
Establishing Construct Validity
• The first step is to use the literature to
operationalize (i.e. define) the construct.
• A panel of topic experts can add additional
support.
• Specific studies provide additional evidence.
Studies for Establishing Construct
Validity
1. Intervention studies
2. Differential-population studies
3. Related-measures studies: compare scores to
other measures that measure the same
construct
1. Intervention studies
Demonstrate pre-post changes in the construct
of interest based on a treatment.
Ex: An inventory designed to measure test anxiety is given to 25
students self-identified as having test anxiety and 25 students
who claim they do not. The inventory is administered just
before a high-stakes final exam. As predicted, the scores were
significantly different between the test anxiety group and
non-test anxiety group.
In this case, evidence of construct-related validity has been established.
We can trust inferences regarding anxiety based on the anxiety inventory scores.
2. Differential-population studies:
Demonstrate different populations score differently
on the measure.
Ex: An inventory is designed to measure insecurity due to
baldness. The inventory is given to bald-headed men and
men with a head full of hair. As predicted, the bald-headed
men had much higher scores than the men with hair.
Evidence of construct-related validity has been established.
We can trust inferences regarding insecurity due to baldness based on the inventory scores.
3. Related-measures studies:
Correlate scores (positive or negative) to other
measures that measure similar constructs.
Ex: An inventory is designed to measure introversion.
The inventory is given to salespeople who scored high on an
extroversion inventory. As predicted, the salespeople
scored very low on the introversion inventory.
Evidence of construct-related validity has been established.
We can trust inferences regarding introversion based on the inventory scores.
It is recommended to continually establish construct-related validity as the instrument
is used, since the theoretical definition of a construct changes over time.
Other types of validity
• Convergent Validity
• Discriminant Validity
• Multicultural Validity: Evidence that the instrument measures what it
intends to measure as understood by participants of a particular culture.
For example: If your instrument is to be administered to the Hmong population, then
the language, phrases, and connotations should be understood by this culture.
Convergent and discriminant validity are both types of construct
validity (see the correlation sketch after this list).
• Convergent validity refers to evidence that
similar constructs are strongly related.
For example: If your instrument is
measuring Depression, the response items
related to Sadness should score similarly.
• Discriminant validity refers to evidence that
dissimilar constructs are NOT related.
For example: If your instrument is
measuring Depression, the response items
related to Happiness should score
dissimilarly.
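A minimal sketch of both checks, assuming invented subscale scores for Depression, Sadness, and Happiness:

```python
# Convergent check: Depression should correlate strongly with Sadness.
# Discriminant check: Depression should NOT track Happiness.
from scipy.stats import pearsonr

depression = [30, 12, 25, 8, 28, 15, 22]
sadness    = [28, 14, 24, 10, 27, 16, 20]   # similar construct
happiness  = [8, 25, 12, 30, 10, 22, 14]    # dissimilar construct

r_conv, _ = pearsonr(depression, sadness)
r_disc, _ = pearsonr(depression, happiness)
print(f"convergent r   = {r_conv:.2f}  (predicted: strongly positive)")
print(f"discriminant r = {r_disc:.2f}  (predicted: low or negative)")
```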
Quick summary
 In order to make valid decisions we need to use
appropriate instruments that have established
evidence of content-related, construct-related, and
criterion-related validity.
 This is determined in the developmental stage of the
instrument.
 If you are designing an instrument, you need to
establish this.
 If the instrument is already designed, review the
instrument’s manual to determine how this was done.
 If you alter an established instrument from its original
state, you need to re-establish validity and reliability.
Reliability
Assessment instruments need to yield valid data
AND be
reliable
What’s “Reliability”?
The ability to gather consistent results from a
particular instrument.
There are 3 approaches to establishing
instrument reliability.
1. Stability reliability
2. Alternate-form reliability
3. Internal consistency reliability
Each relies on a statistical test of correlation to measure
consistency.
1. Stability Reliability
Definition: Consistent results over time
Also known as “test-retest” reliability
Use this if the assessment is to be given to the same individuals
at different times.
How do we determine this?
 Give the assessment over again to the same group of people.
 Calculate the correlation between the 2 scores.
 Be sure to wait several days or a few weeks.
 Long enough to reduce the influence of the 1st testing (i.e.,
memory of test items) and short enough to limit the effect of
intervening events (see the sketch below).
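A minimal test-retest sketch, assuming invented scores from two administrations of the same instrument a few weeks apart:

```python
from scipy.stats import pearsonr

first_administration  = [23, 31, 18, 27, 35, 22, 29, 25]
second_administration = [25, 30, 17, 28, 33, 24, 28, 26]  # weeks later

# Stability (test-retest) reliability is the correlation between the two.
r, _ = pearsonr(first_administration, second_administration)
print(f"stability reliability r = {r:.2f}")
```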
2. Alternate-form reliability
Definition: Consistent results between different forms of the
same test.
Also known as “parallel form” reliability
Use this if multiple test forms are needed for interchangeability —
usually for test security (i.e. prevent cheating).
How do we determine this?
Create different forms that are similar in content (i.e. “content
parallel”) and difficulty (i.e. “difficulty-parallel”). Administer both
forms to the same group of people and calculate the correlation.
Are stability reliability and alternate-form
reliability ever combined?
YES!
This is called stability and alternate-form reliability.
This is where there are consistent results over time using two
different test forms of parallel-content and parallel-difficulty.
3. Internal Consistency reliability
The degree to which all test items measure the content domain
consistently.
Use this when there is no concern about stability over time and no
need for an alternate form.
How do we do this?
Split-half technique: Divide the test in half by treating the odd-numbered
items and even-numbered items as 2 separate tests. The entire
test is administered and the 2 sub-scores (scores from even items
& scores from odd items) are correlated.
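A minimal split-half sketch with an invented item-response matrix (rows = respondents, columns = items). The Spearman-Brown step shown here is the common adjustment for full-test length (it also appears as an option in the self-check later in this session):

```python
from scipy.stats import pearsonr

responses = [  # invented answers to an 8-item instrument
    [4, 3, 4, 5, 2, 3, 4, 4],
    [2, 2, 3, 2, 1, 2, 2, 3],
    [5, 4, 5, 5, 4, 4, 5, 5],
    [3, 3, 2, 3, 3, 2, 3, 2],
    [4, 4, 4, 3, 4, 4, 3, 4],
]

odd_half  = [sum(row[0::2]) for row in responses]   # items 1, 3, 5, 7
even_half = [sum(row[1::2]) for row in responses]   # items 2, 4, 6, 8

r_half, _ = pearsonr(odd_half, even_half)
r_full = 2 * r_half / (1 + r_half)   # Spearman-Brown length correction
print(f"half-test r = {r_half:.2f}, corrected full-test r = {r_full:.2f}")
```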
Reliability Coefficients (known as “r”)
When establishing reliability, a correlation between the two sets of data
needs to be calculated using an appropriate statistical formula.
• Stability reliability: Pearson product-moment
• Alternate-form reliability: Pearson product-moment
• Internal consistency reliability: Pearson product-moment (used to
correlate the two halves), Kuder-Richardson, or Cronbach’s alpha
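Cronbach's alpha can also be computed directly from an item-response matrix. A sketch with invented data:

```python
import numpy as np

responses = np.array([  # rows = respondents, columns = items
    [4, 3, 4, 5, 2],
    [2, 2, 3, 2, 1],
    [5, 4, 5, 5, 4],
    [3, 3, 2, 3, 3],
    [4, 4, 4, 3, 4],
])

k = responses.shape[1]                         # number of items
item_vars = responses.var(axis=0, ddof=1)      # variance of each item
total_var = responses.sum(axis=1).var(ddof=1)  # variance of total scores

# alpha = k/(k-1) * (1 - sum of item variances / variance of totals)
alpha = (k / (k - 1)) * (1 - item_vars.sum() / total_var)
print(f"Cronbach's alpha = {alpha:.2f}")
```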
Acceptable r values
A reliability value of 0.00 means a complete absence of
reliability, whereas a value of 1.00 means perfect reliability. An
acceptable reliability coefficient should not be below
0.80; anything less indicates inadequate reliability.
However, with stability and alternate-form combined
reliability, .70 is acceptable since more variables are
involved.
So let’s check your understanding
You design an instrument to be used as a pre-post assessment.
Which form of reliability should definitely be established?
____Stability
____Alternate-form
____Internal consistency
What type of statistical formula should you use to correlate the two
results (i.e., test and retest scores)?
____Pearson product-moment
____Spearman-Brown
The reliability coefficient was .70. Is the assessment reliable?
____Yes
____No (it needs to be at least .80)
Remember….
In order for an assessment to be worthwhile it
needs to be
RELIABLE
and able to yield
VALID data
AND….
It’s quite possible for an instrument to be
RELIABLE
and not
provide VALID inferences
HOWEVER….
It’s NOT possible for an instrument to provide
VALID inferences
without being
RELIABLE
This ends Info Session 2
“Validity and Reliability”
I highly recommend traveling through this session at least TWICE

Editor’s notes
1. It can be argued that it is almost impossible to establish predictive validity, since so many outside variables can impact the results over time.
2. Many believe it is nearly impossible to create two tests of the SAME difficulty. Assessment experts equalize the difficulty variance through a mathematical adjustment.