October 2009

Test-Based Accountability Fails to Measure Up

By Donna Forsyth

Measuring Up: What Educational Testing Really Tells Us

Daniel Koretz. Cambridge, MA: Harvard University Press, 2008; 368 pp; ISBN: 978-0-674-02805-0, hardcover $29.95 US.

At first glance, Canadian readers might be tempted to discount the relevance of this book because of its American focus, specifically its focus on the issues raised by large-scale, high-stakes testing in American school systems. However, at a time when Canadian school systems face increasing demands for accountability, Koretz’s book provides insight into the complexities and repercussions of educational testing that can serve as a buffer against common misunderstandings about testing held by the public as well as many educators and policy makers on both sides of the border.

Koretz teaches at the Harvard Graduate School of Education. His extensive research on educational testing policy, on high-stakes testing, on the inclusion of English language learners and students with disabilities in large-scale testing, and on international testing programs qualifies him well to write about testing. The inspiration for Measuring Up came from student responses to an introductory master’s level course on educational testing that Koretz teaches. In the course he focuses on helping his students become well-informed users of tests and test information rather than psychometricians.

The title of the first chapter, “If Only It Were So Simple,” sets the tone for the rest of the book as it unravels the complexities and pitfalls of educational testing, and the interpretation of test results. Koretz starts by providing thorough explanations of technical concepts such as reliability, validity, bias, measurement error and sampling. He goes on to trace the history of American testing and the evolution of large-scale achievement tests, including the widely used norm-referenced tests of the 1950s, the minimum-competency testing movement, performance assessment, and the standards-based or criterion-referenced assessments prevalent today.

Koretz believes the most significant change in testing in the last 50 years has been in its purpose — a shift from using tests as sources of information about student learning to using tests to hold teachers and students accountable, especially under the federal No Child Left Behind Act.

Koretz focuses on the repercussions of current test-based accountability systems that tie rewards and sanctions to the number of students in certain groups who meet or exceed a predetermined proficiency level. He exposes the erroneous thinking behind policies which place an almost exclusive emphasis on test scores to evaluate the effectiveness of a teacher, school or district.

Although achievement test scores purport to convey an overall measure of what students have learned, they really offer only an incomplete estimate of what students may have learned on small subsets of educational goals within very large domains of knowledge and skills.

On international achievement testing such as the OECD Programme for International Student Assessment, Koretz cautions against the simplistic interpretation of results, censuring the tendency to report results as a simple ranking of countries. He suggests that factors such as the small sampling of content and changes in the emphasis given to subsets within content areas significantly affect those rankings.

He postulates that international comparisons “do not provide a consistent and logical norm group for comparison.” (p. 105) A more useful way to look at international comparisons, Koretz suggests, would be to compare American results with those from high-scoring countries that might serve as exemplars, and with countries that are similar to the United States, such as England and Australia. Another suggestion is to heed large differences and general patterns instead of dwelling on small differences.

One of the major issues Koretz tackles is the phenomenon of score inflation on high-stakes tests. He argues that high-stakes testing has encouraged the practice of “teaching to the test,” which can result in artificial gains in test scores that skew assumptions about student learning. Rising scores may not necessarily reflect real improvements in student achievement.

Koretz identifies a score pattern which calls the legitimacy of the reported increases into question — at first, scores on a new test are relatively low, but then show rapid increases over a period of several years before leveling out. When the test is replaced by a new one, the pattern repeats itself. If improvements in learning are genuine and the test items are representative of the curriculum, the scores should increase at only moderate rates, and should remain at similar levels even when the test changes. One remedy Koretz suggests is the use of audit tests to verify scoring trends.

Another area of educational testing Koretz sees as problematic is the ubiquitous use of standards-based reporting, an area Canadian educators will find relevant. Koretz asserts that standards-based reporting of student achievement is more complex than it appears, and that labels assigned to describe performance levels (e.g., basic, proficient, advanced) are quite arbitrary.

He argues that “… there are only trivial differences between students just above and just below a standard, and there can be huge differences among students who fall between the two standards and who are therefore assigned the same label.” (p. 324)

Koretz challenges the accuracy of achievement trend and achievement gap reports expressed in terms of “percent proficient.” He also points out a particularly serious error that arises when school systems try to compare changes in achievement over time between two groups of students that start out at different achievement levels (e.g., African American and white students). For example, calculating the changes in percentages of students achieving above the “proficient” level confuses the amount of progress made with the proportion of the group clustered around that standard.
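
The arithmetic behind this error can be illustrated with a small sketch. It is not taken from the book: the cut score, the group means, the standard deviation, the size of the gain and the assumption of normally distributed scores are all hypothetical, chosen only to show how two groups that improve by the same number of scale points can post very different changes in “percent proficient.”

```python
from statistics import NormalDist

# All values below are hypothetical and chosen for illustration only.
CUT = 250    # scale score labelled "proficient"
SD = 30      # spread of scores, assumed equal in both groups
GAIN = 10    # identical improvement, in scale points, for both groups

# Starting mean scores: Group A sits at the cut, Group B well below it.
START_MEANS = {"Group A": 250, "Group B": 200}


def percent_proficient(mean, sd=SD, cut=CUT):
    """Percent of a normally distributed group scoring at or above the cut."""
    return 100 * (1 - NormalDist(mean, sd).cdf(cut))


def change_in_percent_proficient(start_mean, gain=GAIN):
    """Change in percent proficient after the group mean rises by `gain`."""
    before = percent_proficient(start_mean)
    after = percent_proficient(start_mean + gain)
    return after - before


for name, mean in START_MEANS.items():
    before = percent_proficient(mean)
    change = change_in_percent_proficient(mean)
    print(f"{name}: {before:.1f}% proficient at start, "
          f"change of {change:+.1f} points after a {GAIN}-point gain")
```

Under these assumptions, the group whose scores cluster around the cut shows a much larger jump in percent proficient than the group that starts far below it, even though both made the same progress on the underlying scale. A report based only on the change in percent proficient would wrongly suggest that the gap between them had widened.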

Here Koretz tries to make a case for a return to norm-referenced reporting, suggesting that standards-based reporting be accompanied by what he considers more useful forms of reporting, such as scale scores and percentiles.

Recent trends to increase the participation of students with disabilities and students with limited proficiency in English in large-scale achievement testing constitute another controversial topic examined in Measuring Up. Koretz suggests that, while it is important to move ahead with the inclusion of students with special needs in testing programs, we must do it cautiously, with full awareness of the inadequacies inherent in the methods currently available and with realistic expectations about the inferences that can be drawn from the scores.

The title of the final chapter, “Sensible Uses of Tests,” encapsulates Koretz’s message: educators and policy makers need a good understanding of the core concepts and principles of educational testing in order to interpret the results well and to make sound decisions.

He concludes, “In all, educational testing is much like a powerful medication. If used carefully, it can be … (a) powerful tool for changing education for the better. Used indiscriminately, it poses a risk of various and severe side effects.” (pp. 331–332)

---------------------------------------------------------------
Donna Forsyth is a professor of education at Brandon University.

CAUT Bulletin October 2009
