Berliner: Why we are ‘smart’ about evaluating athletes and ‘dumb’ about assessing students, teachers and schools
The Answer Sheet asked prominent researcher and educational psychologist David Berliner of Arizona State University to explain why using a standardized test score as a single measure of academic achievement doesn’t make sense and why we should use multiple measures.
By David C. Berliner
Americans are smart about evaluating athletes and sports teams, and dumb about evaluating students, teachers and schools. Let me explain.
Recently, two NFL teams were still unbeaten more than a dozen games into the 2009 season. Then both lost.
Suppose that you were observing them on the day they lost, rather than on the 13 or 14 times they had previously won. Given the circumstances, we might all agree that the day you watch a team matters, and thus a single observation can lead to a big mistake in judgment about a team’s proficiency.
Suppose that on the day you watched one of these teams the quarterback threw for 350 yards and three touchdowns. If that were all you were assessing that day, and you reported the quarterback’s performance to others, it would sound impressive.
But the team you watched actually lost its game because the quarterback threw two interceptions, fumbled once, and had negative rushing yards. So despite his three touchdowns and impressive passing yardage in that particular game, we might well have reasons to boo the quarterback.
We might boo because we know that playing quarterback requires split-second decision making, skill in passing and running, holding on to the ball while being swarmed by blitzing linemen, reading defensive formations at the line of scrimmage, rallying teammates when the going gets rough, and a host of other skills.
In other words, the concept of quarterback is a complex one, made up of different bits of knowledge and skill that overlap and together make up the notion of “quarterbacking proficiency.”
“Touchdowns thrown” is by itself a pretty poor measure of the worth of a quarterback.
Even the most naïve sports fan knows that a team, or any player on it, should not be evaluated on the basis of a single observation.
Yet we often judge students, teachers, and schools on the basis of just that. A single test, often given over a few days in the spring semester, constitutes the assessment of our students’ knowledge and skill for the purposes of evaluation under the No Child Left Behind (NCLB) law, to which our schools must adhere.
But the test may have on it a form of item the students hadn’t encountered before. For example they might see:
rather than the more familiar 4 + ? = 8.
This little difference has been found to change the rate of correct responses in some primary grades by 20 percentage points or more!
Or on any given day students might misinterpret an item or two. That may not sound like much of a problem, but it can reduce average test scores by many points, because scores are not usually reported to a student or a class as the number of items answered correctly.
A small two-item difference on a test on any one day could result in much larger apparent score differences for students or schools. In addition, a student might misalign the questions with the answer sheet and then have fewer answers scored as correct by a machine.
Perhaps many students stayed up late the night before watching their favorite basketball team play in a tournament or viewing a TV special, throwing their sleep pattern off for that particular day of testing. Or suppose the school cafeteria served a particularly nutritious breakfast the day of the test. Those students who received the school breakfast could have performed better than they usually do.
The point is that on any given day, in any curriculum area measured, dozens of influences could affect the scores of a student or a class. The scores obtained on any one day may diverge a lot from the scores obtained on another day. The solution is to have multiple observations from which we could take an average that might better characterize the typical score of a student or class.
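To see why averaging helps, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the "true" score of 70, the day-to-day swing of 10 points, the bell-shaped noise, and the function names are all hypothetical, and none of it comes from real test data. It simply compares how far a one-day score and a five-day average tend to stray from a student's typical performance.

```python
import random
import statistics

def simulate_scores(true_score, day_noise, n_days, rng):
    """Observed scores for n_days: the true score plus random daily noise.

    Gaussian noise is an illustrative assumption, not a claim about real tests.
    """
    return [rng.gauss(true_score, day_noise) for _ in range(n_days)]

def spread_of_estimates(true_score, day_noise, n_days, n_students, rng):
    """Standard deviation, across simulated students, of each n_days average."""
    estimates = [
        statistics.mean(simulate_scores(true_score, day_noise, n_days, rng))
        for _ in range(n_students)
    ]
    return statistics.stdev(estimates)

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical values: true score 70, daily swing of 10 points, 2,000 students.
single_day_spread = spread_of_estimates(70, 10, 1, 2000, rng)
five_day_spread = spread_of_estimates(70, 10, 5, 2000, rng)

print(f"Spread of one-day scores:     {single_day_spread:.1f} points")
print(f"Spread of five-day averages:  {five_day_spread:.1f} points")
```

Under these assumptions the five-day average wanders noticeably less than the one-day score, which is the sense in which several observations characterize a student better than one.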
In addition, the tests in the United States often have no items covering major areas of the domain that we call reading or mathematics.
We recognize that quarterbacking is multi-faceted and we understand that all of the skills needed to hold that position must be assessed or we would make a mistake in judging the worth of a particular quarterback.
Reading is no different. It is about more than decoding, spelling and punctuation. Reading is about making sense. It requires connecting what was read with one’s own experience, being able to retell what was read in one’s own words, predicting the next events in a story, analyzing plot and theme and the motives of characters, recognizing metaphors and symbols, discerning the author’s intent, and dozens of other sophisticated forms of “comprehending” the text.
Mathematics is an equally complex domain. Being proficient at mathematics requires many skills, most of which are never tapped in the single administration of the tests we use throughout the nation.
Why in the world do we readily recognize the problems of using a single observation in judging sports figures or teams, and abandon those ideas in education?
Almost without exception, the tests for compliance with the No Child Left Behind law are given only once. And a single observation is no more appropriate for judging students, teachers, and schools than it is for judging athletes or their teams.
Education Secretary Arne Duncan is an athlete. He should know this.
But because Secretary Duncan does not have classroom experience, he may not know that teachers evaluate their students every day, 180 times a school year. The fact that under our laws these teachers have no say in evaluating their students’ skills and abilities is really quite ludicrous.
Their clinical knowledge, derived from many mini-tests, homework assignments, and classroom interactions, is devalued in the quest for “objective” scores. But we now see that those one-shot “objective” scores may be invalid as measures of what a child actually knows or what a teacher can accomplish.
Our national assessment practices have another problem. It makes the situation even worse for teachers and administrators than it is for students—and it is plenty bad for students.
Teachers and schools are evaluated on the basis of how well their students do, not on the basis of their actual teaching.
We all know that the coach of a Class D high school football team isn’t expected to win often if he plays against teams in tougher leagues. He simply hasn’t the talent pool to do so. And we are aware that physicians in cancer hospitals would have a low rating were we to judge their performance primarily on patient longevity.
Yet we rarely factor in students’ social class or their family and neighborhood characteristics when judging teachers and schools.
Every student under NCLB has to be proficient, which is as ridiculous as saying every athlete has to excel and every patient has to get well. Is it so hard to acknowledge that English Language Learners will probably not do well, at first, on the literacy parts of the tests?
Do we really expect asthmatic children to learn as readily as healthy children? Are families with food insecurity likely to have children who can perform as well as some others?
What is it that lets most people cut the coach some slack and rate physicians on the basis of the severity of the illnesses that they treat, yet prevents them from applying anything like the same logic to teachers and schools?
Why do we judge teachers and schools to be superior, or not, regardless of the conditions under which they work and on the basis of a one-shot test of student knowledge and skill that is clearly inadequate in assessing a) their typical performance, and b) the breadth of the skills needed to be proficient in the domain being assessed?
Although defended by many politicians and parents, the vast majority of the assessment systems we use for compliance with federal law are no damn good. Single rather than multiple observations to assess the competency of a quarterback or team are understood not to be valid and fair.
Single observations are just as invalid and unfair when used to judge students, teachers, or schools.
We can do better.
David C. Berliner is co-author with Sharon L. Nichols of Collateral Damage: How High-Stakes Testing Corrupts America’s Schools (2007). He is also co-author with Bruce J. Biddle of The Manufactured Crisis: Myths, Fraud, and the Attack on America’s Public Schools (1995).
January 5, 2010; 12:28 PM ET