
Posted at 1:15 PM ET, 03/16/2010

Jay Mathews (and Obama) vs. me on tests

By Valerie Strauss

“Valerie says the standardized tests we use now are too unreliable to tolerate.” -- Jay Mathews

Actually, I didn’t say that.

But my unrivaled colleague Jay Mathews wrote that I did, just before he went on, in his Class Struggle blog, to extol the virtues of standardized tests. He also complained that I had not provided any evidence for “my side.”

So let’s start with exactly what my side is. I am not anti-test. What we should not do, but have done for the last eight years under No Child Left Behind, is use results of standardized tests to make major decisions about personnel and the fate of schools. On the basis of standardized test scores alone, for example, excellent schools were deemed to be failing because of NCLB's irrational accountability formula. President Obama, in his newly stated vision for rewriting NCLB, rightly said he wants to get rid of that scheme.

The results of standardized tests also should not, by themselves, determine whether a teacher gets merit pay, is fired or promoted, etc. This is what Obama and Education Secretary Arne Duncan, wrongly in my view, are advocating in part, and that is what I objected to in the post on Sunday to which Jay refers.

Opposition to using standardized tests for high-stakes decisions--including to measure how effective teachers are in the classroom--hardly originates with me. Experts in assessment have been complaining about it for most of the eight years that the country’s public schools have become obsessed with testing under NCLB.

The reason, they say, is that we don’t have standardized tests that are drawn well enough to be a single measure of anything important. The tests are just not that sophisticated, and, besides, all tests are not born equal.

Some state tests are more aligned with what kids are supposed to be learning than others. The federal National Assessment of Educational Progress--called "the nation's report card" because it is the only standardized test given in districts across the country--sometimes produces results that vary widely from state test scores.

So which tests should high-stakes decisions rest upon?

Daniel Koretz is a professor at Harvard University's School of Education and author of the book, “Measuring Up: What Educational Testing Really Tells Us.” He’s non-partisan when it comes to testing--that is, he’s not for or against it. He just researches tests.

In the book, he explains in great detail the limits of standardized testing. No single test can tell us everything a child has learned or knows. He and others note that there is a certain amount of measurement error in every test. (And there are other complications as well; for example, when the tests are taken by a child with a disability that limits his or her chances of having the same conditions as other children. Some kids are given accommodations, but not many.)

Koretz also discusses how any single indicator used for high-stakes decision-making is more likely to become corrupted. Teaching to the test, which is what happened in many schools under NCLB, is one such way scores become corrupted.

Here’s what he said in an interview on this Web site:

“The misconception that matters the most is the notion somehow a good test measures all of what’s important. A good test is like a political poll. It’s a very small sample of something much larger.... So just as you predict a presidential election by polling 500 or 700 or 1,000 out of 120 million voters, you sample from this big domain of achievement a modest number of things that allow you to predict the whole. That’s all a test is, and its value is only as a tool for estimating what kids really know about the whole. Failing to understand that underlies, I think, a lot of the unfortunate consequences of high stakes testing today because people think that if they teach what’s on the test, they must be doing the right thing.”

And this:

"It’s really a matter of degree, that if the pressure becomes too severe, then people game the system. And this is not a problem limited to education; it’s just everywhere you look. So for example, some years ago, the British National Health Service imposed time limits on the amount of time that patients could be waiting in emergency rooms, which for people who’ve waited in emergency rooms would seem like a very good thing to do. And it was a good thing to do, but unfortunately people gamed the system in a number of ways, one of which is that some hospitals kept patients in queues of ambulances out on the street until they had enough room that they were confident that they could get them through in four hours. Well, the answer to that problem isn’t, stop worrying about wait times. It’s [to] find a better way to hold hospitals accountable for keeping wait times short. And the same is true in education. The answer to the current problems we’re seeing is not, in my view, stop holding schools accountable for teaching kids. It’s [to] find a better way to do it, one that has fewer side effects."

That makes sense to me.

Now, Jay says that he has seen a lot of test results over the last 30 years that “seem to conform” with what he knows of the quality of teaching and the socio-economic level of the students being tested.

Research has long shown that students who live in high- and middle-income areas do better on tests--and in school--than students who live in low-income areas. Obviously the lucky kids are exposed to more experiences at a younger age, more words, better teachers and health care, schools with more resources, etc. etc.

This is from a report by the non-profit research organization the Rand Corporation:

Research that attempts to explain the variance in test scores across populations of diverse groups of students shows that family and demographic variables explain the largest part of total explained variance. Among commonly collected family characteristics, the strongest associations with test scores are parental educational levels, family income, and race and/or ethnicity. Secondary predictors are family size, the age of the mother at the child’s birth, and family mobility. Other variables, such as being in a single-parent family and having a working mother, are sometimes significant after controlling for other variables. The states differ significantly in the racial or ethnic composition of students and in the characteristics of the families of students, so it would be expected that a significant part of the differences in the NAEP test scores might be accounted for by these differences.

Jay also wrote that “schools that have taken unusual measures to deepen and invigorate the learning of impoverished children, such as Achievement First, Uncommon schools and KIPP, show significantly better scores than schools that have not.”

Well yes, but for one thing, that isn’t simply because the teacher in the classroom worked miracles all alone. It is because the teacher in the classroom was one of a number of measures that, together, helped the student succeed.

How fair, then, is it for individual teachers to be judged strictly on the basis of standardized test results, without any of the other factors in a child’s life being changed?

If teachers get no support for what they do from within their school, if they work in a non-collegial atmosphere with few resources, how is it fair to judge them entirely on the scores of their students?

Why is the teacher being held completely responsible for student achievement when we know that many factors go into student performance?

It doesn’t make sense to me.

Tell me if/why I’m wrong and Jay Mathews is right/wrong.


Categories:  Education Secretary Duncan, No Child Left Behind, Standardized Tests  | Tags:  No Child Left Behind, standardized tes  



I am almost certain that you are correct. There is no single test that is reliable enough to measure the progress of each child in a class while evaluating the effectiveness of the teacher at the same time. Add to this the fact that many desperate teachers and administrators are "gaming the system" by drilling the children on the exact items on the test, which of course invalidates it. Most, if not all, testing experts will agree with what you have stated.

During my career I had the good fortune of teaching very different student populations. When I taught in a school with mainly affluent children, the test scores were very high (above the 85th percentile). When I taught in a very poor city school, the scores were, on average, BELOW the tenth percentile. In a middle-income community in Iowa, the test scores hovered around the norm. In the working-class, Spanish-speaking California city where I spent most of my career, the average score was around the 30th percentile.

I enjoyed teaching English to little children in CA, but if I had to teach under the present conditions, I'd go back to the affluent school and stay there. A teacher has a right to protect herself from nonsense.

By the way, there ARE tests that can assess the progress of children, but they must be given individually and they must be secure (no peeking). Teachers CAN be evaluated but there is no single test that can do it.

Posted by: Linda/RetiredTeacher | March 16, 2010 1:43 PM

Valerie said:

"Some kids are given accommodations, but not many."

Oh if this were true. I have way too much first hand knowledge about the use of accommodations as a technique to raise test scores. It's far more of an entrenched way of doing business than you seem to understand. From a school's point of view it's win-win. The kid passes the high stakes exam and moves on, the principal looks good for generating better results, the teacher administering the test feels good about the student success. It would be enlightening to get some data on the number of students receiving accommodations on these tests as well as the details of the accommodations.

Posted by: mamoore1 | March 16, 2010 1:48 PM


I know that many citizens, including journalists, do not understand why teachers are so opposed to being evaluated on the basis of a single test, so I'll try to give an example of what can, and does, happen:

Mrs. V (a real teacher) teaches sixth grade in a low-income school in CA. She is authorized to teach English as a second language so she is assigned many children who are new to the United States. The standardized test that she must give in the spring is based mainly on a sixth grade curriculum so there are very few items that the Spanish-speaking children can answer correctly. It looks as though Mrs. V has taught nothing when in actuality the children have learned a great deal. When the children are tested individually by the school's psychologist, much progress is noted.

This is the sort of testing (one test for whole class) that is going on across the country and it is so unfair to the teachers at these schools. We can do better.

Posted by: Linda/RetiredTeacher | March 16, 2010 2:18 PM

I don't see what's so complicated. Pick a test, it doesn't much matter which, but the NAEP seems reasonable. Look at how the kids score before they enter a teacher's classroom. Look at how they score after a year with him/her. How does the change compare with that of other teachers in your school district? If it's significantly better, give the teacher a raise, worse, fire him/her. (It's probably best to average this procedure over a few years before you fire people, but you get the idea.)

Posted by: qaz1231 | March 16, 2010 4:20 PM

Another reason not to judge teachers by a test score is that, in many schools, it isn't only the classroom teacher who is involved with remediation. I've worked in two schools where students who were falling behind received intensive tutoring both to improve understanding of the material AND to improve test-taking skills (teaching to the test). The people doing the extra tutoring were an assortment of people, including retired teachers paid to tutor during school hours, librarians, and special ed teachers doing extra duty, just to give a few examples.

Is it fair for the classroom teacher of record to get the credit when it was a team effort? Raises based on test results discourage collaboration.

Posted by: aed3 | March 16, 2010 7:49 PM

Valerie, I posted on Jay's thread about this, so a bit repetitive, sorry. I agree with the general points you've made, and I also recommend Daniel Koretz's book.

Your concern is not with test reliability, but with the validity of the decisions made using test scores or other sources of information. The crucial point with validity is that a test that is valid for one purpose or type of decision may not be valid for another, so the question with standardized tests is, what decisions can they validly support?

I don't know any test developers who believe that standardized test scores alone are valid evidence for hiring or firing teachers (and, yes, I do know people who work in high-stakes testing). The misuse of standardized test scores seems to come from administrators and politicians who are either unaware of, or willfully ignore, the very extensive literature on test validity and interpretation.

Posted by: Trev1 | March 17, 2010 1:17 AM

To some people it sounds so simple - give the student a pre-test, do some teaching, give a post-test, and voilà - measure of learning.

Of course, for people who actually work with the students, day in, day out, thousands of them, over decades, it's not so clear. Could it be that we understand something that you don't?

No one in the field of education, or science, for that matter, will claim that measuring something once is good enough. So, how do you know if the pretest is valid? There's a reason that score reports usually include a margin of error for an individual student. Add in all of the unknown factors and influences on one student, and you really have trouble saying that the single pre-test is proof of anything conclusive about the student. In my experience, there isn't even a pre-test - it's just last year's test compared to this year's. So, if there's a bad sample in there, how do you know which one it is? If the student improved at reading, and has been in a good school where every teacher understands and teaches reading strategies, how do you determine which teachers are responsible for growth? If you're an excellent teacher on a weak team within a school, how would anyone know from one test result? If students show growth in algebra and have been using algebra regularly in their science class, how do you separate the influence of the math and science teachers?

The purveyors of "value-added measures" claim that with enough data over enough time, you can control for all of those factors, but that's a serious case of ivory tower blindness. Maybe, maybe... if you could hold other factors constant, such as the courses I teach, the number of students I teach, the curriculum, the technology and other resources, all the other teachers and all of their courses, the administrative policies, the school's schedule, school transportation, extra-curricular activities and field-trips that affect instructional time... if you haven't worked in a school, if you haven't administered the tests and watched the students' eyes glaze over as they go through a meaningless exercise with no consequences for them... then you simply do not understand this issue as well as you need to in order to make pronouncements about it. I'm surprised someone as smart as Jay Mathews hasn't got that part figured out yet. How many experts does it take to shake some people's blind faith in multiple choice tests???

Posted by: DavidBCohen | March 17, 2010 2:00 AM

DavidBCohen- there is extensive literature on the problems you describe, which are very real and well understood by the people who develop and validate tests.

In my experience talking to those people, they universally express frustration that the test scores are misused in the ways you describe. Major standardized test performance is typically analyzed and reported in both research journals and on the websites of the testing organizations. These reports include warnings about interpretation and misuse of test scores. These warnings are routinely ignored by people who can't be bothered wading through the thousands of pages of technical literature, or misrepresented for political or financial reasons.

Ultimately it comes down to cost. If test developers had the sort of budgets that NASA or the Pentagon have, things might be different. That is not the world we live in, however, so the emphasis in large-scale testing will continue to be low-cost and administrative convenience.

Posted by: Trev1 | March 17, 2010 3:24 AM

The devil is in the details: In my state, we have begun to move to a "growth model." The administration official that tried to explain this to my school didn't even quite understand it. In what I'm sure seemed a logical statement, we were told that "all students had to show growth." I asked how a student that was already in the 99th percentile on the test was going to show growth--and the official couldn't answer that. Well, sure enough, this year I have a student that topped out the test last year. She keeps showing up on reports, in big red ink, as "in danger of not meeting growth." I have to keep explaining to upper administration--SHE CAN'T SHOW GROWTH WHEN SHE'S ALREADY AT THE TOP!!!! The folks who are in charge and making the rules don't know enough about standardized testing to know that it shouldn't/can't be used the way they want it to be used!

Posted by: inthetrenches1 | March 17, 2010 9:05 AM


You have said it best! Most people, including journalists, administrators and many teachers, do not understand these tests. They are designed to do only so much. They can't measure the progress of children who are way below or way above grade level. In any school, there might be quite a number of students who are "off the scale" and therefore make it look like they are not making progress.

If any single tests are used to evaluate teachers, I suspect there will be court cases where testing experts will testify on behalf of teachers.

Posted by: Linda/RetiredTeacher | March 17, 2010 10:36 AM


"Ultimately it comes down to cost. If test developers had the sort of budgets that NASA or the Pentagon have, things might be different."

Actually, NASA doesn't have a big budget. For every federal dollar spent, NASA's share is less than a penny.

Posted by: clevin | March 19, 2010 10:47 AM


© 2010 The Washington Post Company