1. Using tests for high-stakes evaluation: what
educators need to know in Connecticut
Presenter - John Cronin, Ph.D.
Contacting us:
Rebecca Moore: 503-548-5129
E-mail: rebecca.moore@nwea.org
Visit our website: www.kingsburycenter.org
2. Connecticut requirements
• Components of the evaluation
– Student growth (45%) - including the state test, one non-standardized
indicator, and (optional) one other standardized indicator.
• Requires a beginning-of-year, mid-year, and end-of-year conference
– Teacher practice and performance (40%) –
• First and second year teachers – 3 in-class observations
• Developing or below standard – 3 in-class observations
• Proficient or exemplary – 3 observations of practice, one in-class
– Whole-school learning indicator or student feedback (5%)
– Parent or peer feedback (10%)
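The 45/40/5/10 weighting above can be sketched in code. The component names, the 1-to-4 score scale, and the simple weighted average are illustrative assumptions, not Connecticut's actual scoring rules:

```python
# Illustrative only: combine component scores (each on a hypothetical 1-4 scale)
# using the Connecticut weights listed above. Real scoring rules are more detailed.
WEIGHTS = {
    "student_growth": 0.45,
    "practice_and_performance": 0.40,
    "whole_school_or_student_feedback": 0.05,
    "parent_or_peer_feedback": 0.10,
}

def summative_score(scores: dict) -> float:
    """Weighted average of component scores."""
    assert set(scores) == set(WEIGHTS)
    return sum(WEIGHTS[k] * v for k, v in scores.items())

example = {
    "student_growth": 3.0,
    "practice_and_performance": 3.5,
    "whole_school_or_student_feedback": 2.0,
    "parent_or_peer_feedback": 3.0,
}
print(round(summative_score(example), 2))  # 3.15
```

Because the weights sum to 1.0, the composite stays on the same scale as the component scores.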
3. Connecticut requirements
Requirements for goal setting
• Each teacher sets one to four goals with their principal. Goals should:
– Take into account the academic track record and the overall needs and strengths
of the students the teacher is teaching that year or semester;
– Address the most important purposes of the teacher’s assignment through
self-reflection;
– Be aligned with school, district, and state student achievement objectives;
– Take into account students’ starting learning needs vis-à-vis relevant baseline
data, when available.
• Control factors tracked by the state-wide public school information system that
may influence teacher performance ratings should also be considered, including, but
not limited to, student characteristics, student attendance, and student mobility.
4. What changes for educators?
1. The proficiency standards get higher.
2. Teachers become accountable for all students.
11. Connecticut requirements
• Criteria for student growth indicator
– Fair to students
• The indicator of academic growth and development is used in such a way as to provide
students an opportunity to show that they have met or are making progress in meeting the
learning objective. The use of the indicator of academic growth and development is as free as
possible from bias and stereotype.
– Fair to teachers
• The use of an indicator of academic growth and development is fair when a teacher has the
professional resources and opportunity to show that his/her students have made growth
and when the indicator is appropriate to the teacher’s content, assignment and class
composition.
– Reliable
– Valid
– Useful
• The indicator may be used to provide the teacher with meaningful feedback about student
knowledge, skills, perspective and classroom experience that may be used to enhance student
learning and provide opportunities for teacher professional growth and development.
12. Issues in the use of growth and value-
added measures
Measurement design of the
instrument
Many assessments are not
designed to measure growth.
Others do not measure growth
equally well for all students.
13. Tests are not equally accurate for all
students
(Charts: California STAR; NWEA MAP)
14. Tests are not equally accurate for all
students
(Chart: Grade 6 New York Mathematics)
15. Issues in the use of growth and value-
added measures
Measurement sensitivity
Assessments must align with the
curriculum and should be
instructionally sensitive.
16. College and career readiness assessments will
not necessarily be instructionally sensitive
When science is defined in terms of knowledge of facts that are taught in
school…(then) students who have been taught the facts will know them, and those
who have not will…not. A test that assesses these skills is likely to be highly
sensitive to instruction.

…when science is defined in terms of scientific reasoning ability…achievement
will be less closely tied to age and exposure, and more closely related to
general intelligence. In other words, the assessment is not particularly
sensitive to instruction.

A third case might arise in the discussion of the ethical and moral dimensions
of science, where it may well be that maturity, rather than intelligence or
curriculum exposure, might be the most important factor. Here science reasoning
tasks are relatively insensitive to instruction.
Black, P. and Wiliam, D.(2007) 'Large-scale assessment systems: Design
principles drawn from international comparisons', Measurement:
Interdisciplinary Research & Perspective, 5: 1, 1 — 53
17. Issues in the use of growth and value-
added measures
Measurement sensitivity
Classroom tests, which are
designed to measure mastery, may
not measure improvement well.
18. Issues in the use of growth and value-
added measures
Instructional alignment
Tests should align to the teacher’s
instructional responsibilities.
19. Issues in the use of growth and value-
added measures
Uncovered Subjects and Teachers
High quality tests may not be
administered, or available, for many
teachers and grades. Subjects like
social studies may be particularly
problematic.
20. Considerations for developing your own
assessment and student learning objectives
• Developing valid instruments is very time consuming
and resource intensive.
• The assessments developed must discriminate
between effective and ineffective teachers.
• The assessments must be valid in other respects.
– Aligned to curriculum
– Unbiased items
• The assessments can’t be open to security
violations or cheating
22. Issues in the use of growth and value-
added measures
Control for statistical error
All models attempt to address this
issue. Nevertheless, many teachers’
value-added scores will fall within
the range of statistical error.
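To make the point concrete, here is a minimal sketch, with invented numbers, of how a value-added estimate and its standard error determine whether a teacher can be statistically distinguished from average:

```python
def distinguishable_from_average(estimate: float, se: float, z: float = 1.96) -> bool:
    """True if the 95% confidence interval around the estimate excludes zero,
    where zero represents the average teacher."""
    lower, upper = estimate - z * se, estimate + z * se
    return not (lower <= 0.0 <= upper)

# Hypothetical teachers: (value-added estimate, standard error) in test-score units.
teachers = {"A": (0.15, 0.12), "B": (0.35, 0.12), "C": (-0.10, 0.12)}
for name, (est, se) in teachers.items():
    print(name, distinguishable_from_average(est, se))
# Only B's interval excludes zero; A and C fall within the range of statistical error.
```

With a standard error of 0.12, any estimate smaller in magnitude than about 0.24 cannot be distinguished from average at the 95% level.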
23. Sources of error in assessment
• The students.
• The testing conditions.
• The assessments.
Measurement error in the assessments can be dwarfed by error
introduced by the testing conditions and the students.
24. New York City
• Margins of error can be very large
• Increasing n doesn't always decrease the margin of error
• The margin of error in math is typically less than in reading
26. Issues in the use of growth and value-
added measures
“Among those who ranked in the top
category on the TAKS reading test, more
than 17% ranked among the lowest two
categories on the Stanford. Similarly
more than 15% of the lowest value-added
teachers on the TAKS were in the highest
two categories on the Stanford.”
Corcoran, S., Jennings, J., & Beveridge, A., Teacher Effectiveness on High and Low Stakes
Tests, Paper presented at the Institute for Research on Poverty summer workshop, Madison, WI
(2010).
27. Issues in the use of growth and value-
added measures
Instability of results
A variety of factors can cause value-
added results to lack stability.
Results are more likely to be stable
at the extremes. The use of
multiple-years of data is highly
recommended.
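One reason multiple years are recommended: averaging k independent, equally precise yearly estimates shrinks the standard error by a factor of √k. A small sketch, where the single-year standard error of 0.12 is an invented figure:

```python
import math

def pooled_se(single_year_se: float, k: int) -> float:
    """Standard error of the mean of k independent, equally precise yearly estimates."""
    return single_year_se / math.sqrt(k)

se1 = 0.12  # hypothetical single-year standard error of a value-added estimate
for k in (1, 2, 3, 4):
    print(k, round(pooled_se(se1, k), 3))
# The error shrinks with each added year, but only as the square root of k.
```

Note this assumes the yearly estimates are independent and that the teacher's true effectiveness is stable across years; real value-added models relax both assumptions.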
28. Los Angeles Unified
• Teachers can easily rate in multiple categories
• The choice of model can have a large impact
• Models affect English more than math
• Teachers do better in some subjects than others
• More complex models don't necessarily favor the teacher
29. Possible racial bias in models
“Significant evidence of bias plagued the value-added model
estimated for the Los Angeles Times in 2010, including significant
patterns of racial disparities in teacher ratings both by the race of
the student served and by the race of the teachers (see Green,
Baker and Oluwole, 2012). These model biases raise the possibility
that Title VII disparate impact claims might also be filed by teachers
dismissed on the basis of their value-added estimates.
Additional analyses of the data, including richer models using
additional variables mitigated substantial portions of the bias in the
LA Times models (Briggs & Domingue, 2010).”
Baker, B. (2012, April 28).
If it’s not valid, reliability doesn’t matter so much! More on VAM-ing
30. Instability at the tails of the
distribution
“The findings indicate that these modeling
choices can significantly influence outcomes
for individual teachers, particularly those in
the tails of the performance distribution who
are most likely to be targeted by high-stakes
policies.”
Ballou, D., Mokher, C. and Cavalluzzo, L. (2012)
Using Value-Added Assessment for Personnel Decisions: How Omitted Variables and Model Specification…
(Charts: LA Times Teacher #1; LA Times Teacher #2)
31. Reliability of teacher value-added
estimates
Teachers with growth scores in lowest and
highest quintile over two years using NWEA’s
Measures of Academic Progress
          Bottom quintile Y1&Y2    Top quintile Y1&Y2
Number    59/493                   63/493
Percent   12%                      13%
r = .64, r² = .41
Typical r values for measures of teaching effectiveness range
between .30 and .60 (Brown Center on Education Policy, 2010)
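The stability figures above can be roughly reproduced by simulation: draw two correlated years of scores (r = .64, as reported) from a bivariate normal distribution and count the share of teachers landing in the bottom quintile both years. This is an illustrative sketch, not NWEA's actual analysis:

```python
import numpy as np

rng = np.random.default_rng(0)
r = 0.64          # year-to-year correlation reported on the slide
n = 200_000      # simulated teachers

# Two correlated years of growth scores.
cov = [[1.0, r], [r, 1.0]]
y1, y2 = rng.multivariate_normal([0.0, 0.0], cov, size=n).T

# Share of teachers in the bottom quintile in both years.
q1, q2 = np.quantile(y1, 0.2), np.quantile(y2, 0.2)
both_bottom = float(np.mean((y1 < q1) & (y2 < q2)))
print(round(both_bottom, 3))
# Well above the 0.04 expected if the two years were independent,
# but far short of the 0.20 a perfectly stable measure would give.
```

In other words, even a correlation of .64 leaves most bottom-quintile teachers outside the bottom quintile the following year.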
33. Challenges with goal setting
• Lack of a “racing form”. What have this
teacher and these students done in the past?
• Lack of comparison groups. What have other
teachers done in the past?
• What is the objective? Is the objective to
meet a standard of performance or
demonstrate improvement?
• Do you set safety goals or stretch goals?
34. Issues in the use of growth and value-
added measures
Model Wars
There are a variety of models in the
marketplace. These models may
come to different conclusions about
the effectiveness of a teacher or
school. Differences in findings are
more likely to happen at the
extremes.
35. Issues in the use of growth and value-
added measures
Lack of random assignment
The use of a value-added model
assumes that the school doesn’t
add a source of variation that isn’t
controlled for in the model.
e.g. Young teachers are assigned
disproportionate numbers of
students with poor discipline
records.
37. New York Rating System
• 60 points assigned from classroom observation
• 20 points assigned from state assessment
• 20 points assigned from local assessment
• A score of 64 or less is rated ineffective.
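The point scheme above can be sketched as follows; the input values are hypothetical, and the full New York rubric includes rating bands beyond the single "ineffective" cutoff shown on the slide:

```python
# Illustrative sketch of the New York composite described above.
# Maximum points per component are taken from the slide.
MAX_POINTS = {"observation": 60, "state_assessment": 20, "local_assessment": 20}

def composite(points: dict) -> int:
    """Sum the points earned in each component, validating the maxima."""
    for component, earned in points.items():
        assert 0 <= earned <= MAX_POINTS[component]
    return sum(points.values())

def is_ineffective(points: dict) -> bool:
    """Per the slide, a composite of 64 or less is rated ineffective."""
    return composite(points) <= 64

print(is_ineffective({"observation": 45, "state_assessment": 10, "local_assessment": 9}))   # 64 -> True
print(is_ineffective({"observation": 50, "state_assessment": 10, "local_assessment": 10}))  # 70 -> False
```

Note how the cutoff interacts with the weights: a teacher scoring zero on both assessment components (maximum 40 points lost) can still clear 64 only with a near-perfect observation score.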
40. Other issues
Security and Cheating
When measuring growth, one
teacher who cheats disadvantages
the next teacher.
41. Other issues
(1) Each district shall define effectiveness and
ineffectiveness utilizing a pattern of summative
ratings derived from the new evaluation system.
(2) At the request of a district or employee, the State
Department of Education or a third-party entity
approved by the SDE will audit the evaluation
components that are combined to determine an
individual's summative rating in the event that such
components are significantly dissimilar (i.e. include
both exemplary and below standard ratings) to
determine a final summative rating.
(3) The State Department of Education or a third-party
designated by the SDE will audit evaluations ratings
of exemplary and below standard to validate such
exemplary or below standard ratings by selecting ten
districts at random annually.
43. Cheating
Atlanta Public Schools
Crescendo Charter Schools
Philadelphia Public Schools
Washington DC Public Schools
Houston Independent School
District
Michigan Public Schools
44. Case Study #1 - Mean value-added performance in mathematics by
school – fall to spring
45. Case Study #1 - Mean spring and fall test duration in minutes by
school
46. Case Study #1 - Mean value-added growth by school and test
duration
47. Case Study # 2
Differences in fall-spring test durations Differences in growth index score
based on fall-spring test durations
48. Case Study # 2
How much of summer loss is really summer loss?
Differences in spring-fall test durations Differences in raw growth by
spring-fall test duration
49. Case Study # 2
Differences in fall-spring test duration (yellow-black) and
Differences in growth index scores (green) by school
50. Security considerations
• Teachers should not be allowed to view the contents
of the item bank or record items.
• Districts should have policies for accommodations that
are based on student IEPs.
• Districts should consider having both the teacher and
a proctor in the test room.
• Districts should consider whether other security
measures are needed for both the protection of the
teacher and administrators.
51. Other issues
Proctoring
Proctoring both with and without the
classroom teacher raises possible
problems.
Documentation that test
administration procedures were
properly followed is important.
52. Potential Litigation Issues
The use of value-added data for high stakes
personnel decisions does not yet have a strong,
coherent body of case law.
Expect litigation if value-added results are the
lynchpin evidence for a teacher-dismissal case
until a body of case law is established.
53. Possible legal issues
• Title VII of the Civil Rights Act of 1964 –
Disparate impact of sanctions on a protected
group.
• State statutes that provide tenure and other
related protections to teachers.
• Challenges to a finding of “incompetence”
stemming from the growth or value-added
data.
54. Recommendations
• Embrace the formative advantages of growth
measurement as well as the summative.
• Create comprehensive evaluation systems with
multiple measures of teacher effectiveness (Rand,
2010)
• Select measures as carefully as value-added models.
• Use multiple years of student achievement data.
• Understand the issues and the tradeoffs.
55. Thank you for attending
Presenter - John Cronin, Ph.D.
Contacting us:
NWEA Main Number: 503-624-1951
E-mail: rebecca.moore@nwea.org
The presentation and recommended resources are
available at our website: www.kingsburycenter.org