
Posted at 12:00 PM ET, 01/14/2011

The biggest flaw in Gates value-added study

By Valerie Strauss

The following was written by Matthew Di Carlo, senior fellow at the non-profit Albert Shanker Institute, located in Washington, D.C. This post originally appeared on the institute’s blog.

By Matthew Di Carlo
The National Education Policy Center just released a scathing review of last month’s preliminary report from the Gates Foundation-funded Measures of Effective Teaching (MET) project. The critique was written by Jesse Rothstein, a highly respected Berkeley economist and author of an elegant and oft-cited paper demonstrating how non-random classroom assignment biases value-added estimates (also see the follow-up analysis).

Very quickly on the project: Over two school years (this year and last), MET researchers, working in six large districts—Charlotte-Mecklenburg, Dallas, Denver, Hillsborough County (FL), Memphis, and New York City—have been gathering an unprecedented collection of data on teachers and students, grades 4-8.

Using a variety of assessments, videotapes of classroom instruction, and surveys (student surveys are featured in the preliminary report), the project is attempting to address some of the heretofore under-addressed issues in the measurement of teacher quality (especially non-random classroom assignment and how different classroom practices lead to different outcomes, neither of which is part of this preliminary report). The end goal is to use the information to guide in the creation of more effective teacher evaluation systems that incorporate high-quality multiple measures.

Despite my disagreements with some of the Gates Foundation’s core views about school reform, I think that they deserve a lot of credit for this project. It is heavily resourced, the research team is top-notch, and the issues they’re looking at are huge. The study is very, very important — done correctly.

But Rothstein’s general conclusion about the initial MET report is that the results “do not support the conclusions drawn from them.” Very early in the review, the following assertion also jumps off the page: “there are troubling indications that the Project’s conclusions were predetermined.”

On first read, it might sound like Rothstein is saying that the MET researchers cooked the books. He isn’t. In the very next sentence, he points out that the MET project has two stated premises guiding its work: that, whenever feasible, teacher evaluations should be based “to a significant extent” on student test score gains; and that other components of evaluations (such as observations), in order to be considered valid, must be correlated with test score gains.

(For the record, there is also a third premise – that evaluations should include feedback on teacher practice to support growth and development – but it is not particularly relevant to his review.)

So, by “predetermined,” Rothstein is saying that the MET team’s acceptance of the two premises colors the methods they choose and, more importantly, how they interpret and present their results. That is, since the project does not ask all the right questions, it cannot provide all of the most useful answers in the most useful manner. This issue is evident throughout Rothstein’s entire critique, and I will return to it later. But first, a summary of the MET report’s three most important findings and Rothstein’s analysis of each (I don’t review every one of his points):

MET finding: In every grade and subject, a teacher’s past track record of value-added is among the strongest predictors of their students’ achievement gains in other classes and academic years.

This was perhaps the major conclusion of the preliminary report. The implication is that value-added scores are good signals of future teacher effects, and it therefore makes sense to evaluate teachers using these estimates.

In response, Rothstein points out that one cannot assess whether value-added is “among the strongest predictors” of teacher effects without comparing it with a broad array of alternative predictors.

Although the final MET report will include several additional “competitors” (including scoring of live and videotaped classroom instruction), this initial report compares only two: value-added scores and the surveys of students’ perception of teachers’ skills and performance — not the strongest basis to call something “among the best.” (Side note: One has to wonder whether it is reasonable to expect that any alternative measure would ever predict value-added better than value-added itself, and what it really proves if none does.)

Moreover, notes Rothstein, while it’s true that past value-added is a better predictor than are student perceptions, the actual predictive power of the former is quite low. Because the manner in which the data are presented makes this difficult to assess, Rothstein makes his own calculations (see his appendix). I’ll spare you the technical details, but he finds that the explanatory power is not impressive, and the correlations it implies are actually very modest.

Even in math (which was more stable than reading), a teacher with a value-added score at the 25th percentile in one year (less effective than 75 percent of other teachers) is just as likely to be above average the next year as she is to be below average. And there is only a one-in-three chance that she will be that far below average in year two compared with year one. This is the old value-added story: somewhat low stability, a lot of error (though this error likely decreases with more years of data). The report claims that this volatility is not too high to preclude the utility of value-added in evaluations, and indeed uses the results as evidence of value-added’s potential; Rothstein is not nearly so sure.
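To get a feel for how a year-to-year correlation translates into statements like the one above, here is a minimal sketch. It assumes that teachers' value-added scores in two years are jointly normal with correlation `r`; that assumption, the function name, and the `r` values tried below are all illustrative and are not taken from the MET report or from Rothstein's appendix.

```python
from statistics import NormalDist


def chance_above_average_next_year(percentile_now: float, r: float) -> float:
    """Probability that next year's score is above average, given this
    year's percentile rank.

    Assumption (not from the report): scores in both years are standard
    normal with year-to-year correlation r, where -1 < r < 1.
    """
    z_now = NormalDist().inv_cdf(percentile_now)
    # Under bivariate normality, next year's score given this year's is
    # normal with mean r * z_now and variance 1 - r**2.
    next_year = NormalDist(mu=r * z_now, sigma=(1 - r * r) ** 0.5)
    return 1 - next_year.cdf(0.0)


# Hypothetical correlations, chosen only to show the shape of the relationship.
for r in (0.0, 0.2, 0.4, 0.6):
    p = chance_above_average_next_year(0.25, r)
    print(f"r = {r:.1f}: chance of being above average next year = {p:.2f}")
```

The lower the correlation, the closer a 25th-percentile teacher's chance of landing above average next year gets to an even 50 percent, which is the intuition behind the "low stability" complaint.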

MET finding: Teachers with high value-added on state tests tend to promote conceptual understanding as well.

This finding is about whether value-added scores differ between tests (see here and here for prior work). It is based on a comparison of two different value-added scores for teachers: one derived from the regular state assessment (which varies by participating district) and the other from an alternative assessment that is specifically designed to measure students’ depth of higher-order conceptual understanding in each subject.

The MET results, as presented, are meant to imply that teachers whose students show gains on the state test also produce gains on the test of conceptual understanding—that teachers are not just “teaching to the test.”

Rothstein, on the other hand, concludes that these correlations are also weak, so much so that “it casts serious doubt on the entire value-added enterprise.” For example, more than 20 percent of the teachers in the bottom quartile (lowest 25 percent) on the state math test are in the top two quartiles on the alternative assessment. And it’s even worse for the reading test. Rothstein characterizes these results as “only slightly better than coin tosses.” So, while the MET finding (that high value-added teachers “tend to” promote conceptual understanding) is technically true, the modest size of these correlations makes this characterization somewhat misleading.

According to Rothstein, critically so.
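The quartile comparison above can be explored with a small simulation. This is only a sketch: it assumes the two value-added scores are jointly normal with correlation `r`, and the correlations tried are hypothetical rather than the MET estimates. A share of 0.5 would mean the state-test ranking tells you nothing about which half of the alternative-assessment ranking a teacher falls in (a pure coin toss).

```python
import random


def share_bottom_quartile_in_top_half(r: float, n_teachers: int = 50_000,
                                      seed: int = 0) -> float:
    """Simulate two value-added scores per teacher with correlation r and
    return the share of bottom-quartile teachers on the first score who
    land in the top half on the second.

    Assumption (illustrative only): both scores are standard normal.
    """
    rng = random.Random(seed)
    pairs = []
    for _ in range(n_teachers):
        state = rng.gauss(0, 1)
        noise = rng.gauss(0, 1)
        # Build a second score whose correlation with the first is r.
        alternative = r * state + (1 - r * r) ** 0.5 * noise
        pairs.append((state, alternative))

    state_cutoff = sorted(s for s, _ in pairs)[n_teachers // 4]  # 25th percentile
    alt_median = sorted(a for _, a in pairs)[n_teachers // 2]    # 50th percentile

    bottom = [a for s, a in pairs if s < state_cutoff]
    return sum(a >= alt_median for a in bottom) / len(bottom)


# Hypothetical correlations; none of these is a figure from the report.
for r in (0.0, 0.3, 0.6, 0.9):
    print(f"r = {r:.1f}: share in top half = {share_bottom_quartile_in_top_half(r):.2f}")
```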

MET finding: Teacher performance—in terms of both value-added and student perceptions—is stable across different classes taught by the same teacher.

This is supposed to represent evidence that estimates of a given teacher’s effects are fairly consistent, even with different groups of students. But, as Rothstein notes, the stability of value-added and student perceptions across classes is not necessarily due to the teacher.

There may be other stable but irrelevant factors that are responsible for the consistency, but remain unobserved. For example, high scores on surveys querying a teacher’s ability to control students’ behavior may be largely a result of non-random classroom assignment (some teachers are deliberately assigned more or less difficult students), rather than actual classroom management skills. Since the initial MET report makes no attempt to adjust methods (especially the survey questions) to see if the stability is truly a teacher effect, the results, says Rothstein, must be considered inconclusive (the non-random assignment issue also applies to most of the report's other findings on value-added and student surveys).

The final MET report will, however, directly address this issue in its examination of how value-added estimates change when students are randomly assigned to classes (also see here for previous work). In the meantime, Rothstein points out that the MET researchers could have checked their results for evidence of bias but did not, and in ignoring non-random assignment, they seem to be anticipating the results of the unfinished final study.

General Issues

In addition to these points, Rothstein raises a bunch of important general problems with the study. Many of them are well-known, and they mostly pertain to the presentation of results, but a few are worth repeating:

*Because it contains no literature review, the MET report largely ignores many of the serious concerns about value-added established by this literature – these issues may not be well-known to the average reader (for the record, there are citations in the study, but no formal review of prior research).

*The value-added model that the MET project employs, while common in the literature, is also not designed to address how the distribution of teacher effects varies between high- and low-performing classrooms (e.g., teachers of ELL classes are assumed to be of the same average effectiveness as teachers of gifted/talented classes). The choice of models can have substantial impacts on results.

*Technical point (important, but I’ll keep it short here, and you can read the review if you want more detail): the report’s methods overstate the size of correlations by focusing only on that portion of performance variation that is “explained” by who the teacher is (even though most of the total variation is not explained).

*Perhaps most importantly, MET operates in a no-stakes context, but its findings—both preliminary and final—will be used to recommend high-stakes policies. There is no way to know how the introduction of these stakes will affect results (this goes for both value-added and the student surveys).
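On the technical point in the list above: one way to read it is through the standard attenuation relationship from classical test theory, in which a correlation between the stable ("explained") components of two measures shrinks once measurement noise is put back in. The sketch below uses that textbook formula; the function name and every number in it are hypothetical and are not drawn from the MET report or Rothstein's review.

```python
def observed_correlation(r_persistent: float, reliability_a: float,
                         reliability_b: float) -> float:
    """Correlation you would actually see between two noisy measures.

    r_persistent  : correlation between the stable components of the measures
    reliability_* : share of each measure's variance that is stable (0 to 1)

    Textbook attenuation formula; the inputs used below are made up.
    """
    return r_persistent * (reliability_a * reliability_b) ** 0.5


# Hypothetical example: stable components correlate at 0.5, but only
# 30 percent of the variation in each measure is stable.
print(round(observed_correlation(0.5, 0.3, 0.3), 2))
```

Reporting only the first number, without the second, is the kind of presentation that can make a relationship look stronger than it is in the raw scores that an evaluation system would actually use.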

In conclusion, Rothstein asserts that the preliminary report “is not very useful in guiding policy, but the guidance it does provide points in the opposite direction from that indicated by its poorly-supported conclusions” (my emphasis).

Pretty brutal, but not necessarily as perplexing as it seems. With a few exceptions, most of Rothstein’s criticisms pertain to how the results are presented and the conclusions drawn from them, rather than to actual methods. This is telling, and it brings us back to the two premises (out of three) that guide the MET project—that value-added measures should be included in evaluations, and that other measures should only be included if they are predictive of students’ test score growth.

You may also notice that both premises are among the most contentious issues in education policy today. They are less assumptions than empirical questions, which might be subject to testing (especially the first one). Now, in a sense, the MET project addresses many of the big issues surrounding these questions – e.g., by checking how value-added varies across classrooms, tests, and years. And the researchers are very clear about acknowledging the limitations of value-added, but they still think it needs to be, whenever possible, a part of multi-measure evaluations. In other words, they’re not asking whether value-added should be used, only how. Rothstein, on the other hand, is still asking the former question (and he is most certainly not alone).

Moreover, the two premises represent a tautology—student test score growth is the most important measure, and we have to choose other teacher evaluation measures based on their correlation with student test score growth because student test score growth is the most important measure… This point, by the way, has already been made about the Gates study, as well as about seniority-based layoffs and about test-based policies in general.

There is tension inherent in a major empirical research project being guided by circular assumptions that are the very questions the research might be trying to answer. Addressing these questions would be very difficult, and doing so might not change the technical methods of MET, but it might well influence how they interpret and present their results.

This, I think, goes a long way towards explaining how Rothstein could draw opposite conclusions from the study based on the same set of results.

For example, some of Rothstein’s most important arguments pertain to the size of the correlations (e.g., value-added between years and classrooms). He finds many of these correlations to be perilously low, while the MET report concludes that they are high enough to serve as evidence of value-added’s utility. This may seem exasperating. But think about it: If your premise is that value-added is the most essential component of an evaluation (as is the MET project’s), so much so that all other measures must be correlated with it in order to be included, your bar for precision may not be as high as that of someone who is still skeptical. You may be more likely to tolerate mistakes, even a lot of mistakes, as collateral damage. And you will be less concerned about asking whether evaluations should target short-term testing gains versus some other outcome.

For Rothstein and many others, the MET premises are still unanswered questions. Assuming them away leaves little room for the possibility that performance measures that are not correlated with value-added might be transmitting crucial information about teaching quality, or that there is a disconnect between good teaching and testing gains.

The proper approach, as Rothstein notes, is not to ask whether all these measures correlate with each other or over time or across classrooms, but whether they lead to various types of better student outcomes in a high-stakes, real life context. In other words, testing policies, not measures.

The MET project's final product will provide huge insights into teacher quality and practice no matter what, but how it is received and interpreted by policymakers and the public is also critical. We can only hope that, in making these assumptions, the project is not compromising the usefulness and impact of what is potentially one of the most important educational research projects in a long time.



By Valerie Strauss  | January 14, 2011; 12:00 PM ET
Categories:  Guest Bloggers, Matthew Di Carlo, Research, Teacher assessment  | Tags:  MET project, MET study, assessing teachers, bill gates, bill gates foundation, gates foundation, teacher assessment, teacher evaluation, value-added, value-added measures  


Thanks for a lucid and important synopsis of Jesse Rothstein's critique. One of the most difficult aspects of shaping effective education policy is boiling complex research results down into concepts that can be easily understood by practitioners, parents and policy-makers. Di Carlo did that masterfully in this piece--bravo.

The question is: Do the "predetermined" policy implications of this first report stand, especially in federal policy, even when they're weak, garbled and unsupported by careful analysis? My guess is that Gates will trump Rothstein.

Posted by: nflanagan2 | January 14, 2011 12:47 PM

I found this passage to be very interesting:

"MET finding: Teacher performance—in terms of both value-added and student perceptions—is stable across different classes taught by the same teacher.

This is supposed to represent evidence that estimates of a given teacher’s effects are fairly consistent, even with different groups of students. But, as Rothstein notes, the stability of value-added and student perceptions across classes is not necessarily due to the teacher.

There may be other stable but irrelevant factors that are responsible for the consistency, but remain unobserved. For example, high scores on surveys querying a teacher’s ability to control students’ behavior may be largely a result of non-random classroom assignment (some teachers are deliberately assigned more or less difficult students), rather than actual classroom management skills."

I think that it is HIGHLY likely that a teacher's demonstrated classroom management performance is related to the types of students in her class. In my experience as a former teacher, I was able to manage a class of 30+ honors students with ease, had some trouble with classes of 25 general ed kids, and struggled mightily to keep control of a class of 12-15 students with a history of behavior problems and other low-level special education needs.

If an administrator evaluated my ability to "control" two classes: honors History and honors English, I'd probably be seen to have consistent classroom management skills across all my classes. But how much of that "consistency" is due to my personal, inherent classroom management skills, and how much is due, instead, to my assigned students, is certainly not clear from such a comparison.

Posted by: AttorneyDC | January 14, 2011 2:39 PM

An excellent summary, but both the DOE and mainstream media will probably simply ignore this additional questioning of their currently accepted myths concerning teacher effectiveness. An interesting paragraph is the one that starts "Moreover, the two premises represent a tautology…" Here the author points out the tendency for circular reasoning in those cases where the same data can support many different conclusions. In the hard sciences nature simply laughs in one's face when an attempt is made to impose incorrect conclusions, but the life and social sciences are so complex that such conclusions sometimes stand for years. As a comparison, I would think the complexity of determining teacher effectiveness for over 3 million teachers and 50 million students can easily rival that of the science involved with the Large Hadron Collider, but the concentrated scientific effort to determine teacher effectiveness is very small compared to that involved with the LHC. Keep in mind that the brain (maybe only part of the problem) of each teacher and student has about 100 billion neurons, etc.

Posted by: bpeterson1931 | January 14, 2011 3:45 PM

Incisive critique. Sadly, news outlets will probably trumpet VAM as if its efficacy is indisputable.

Other social service workers -- psychologists, nurses, social workers -- are evaluated based on observation and evaluation of their practices. It makes good sense to evaluate teachers the same way -- by their own actions and practices -- rather than the actions and practices of another human being.

Posted by: joshofstl1 | January 14, 2011 4:02 PM

The use of the term value-added suffers mightily in that the term originates in, and is a staple of, manufacturing and is certainly not a measure of someone's professional competence, in any field. The comment by AttorneyDC touches on what I believe to be the most important criterion of all when trying to make judgments concerning the quality and ability of teachers and their work: the origins, background and family influences of students respecting education. Perhaps we get too bound up in trying to measure teacher effectiveness without seeing what their students' backgrounds are; they certainly are not equal. Why try to judge a teacher striving mightily to work with disinterested or uninterested students whose families don't care?

Posted by: sashaneedles | January 14, 2011 4:19 PM

I think a big assumption in this teacher evaluation game is that all teachers are super-swiss army knives and can deal with any situation. The truth is a lot of the best teaching is done by specialists: professionals who are experienced at a narrow area of teaching and whose success is dependent on the systems and administrators around them. At my high school we had a number of math teachers, but two are emblematic of this issue. One I adored and was fortunate to have for 3 of my 4 years. He was a master at teaching bright kids talented at math at warp speed and advancing them to their potential. He was honestly not very good with kids for whom math was more of a challenge. We had another teacher who really was the reverse, excellent at keeping kids engaged who weren't as good at math (or had decided they didn't like it) and keeping them moving and learning. It was important that a good administrator matched kids with the right teacher or no one did well. Probably both of them did better with the types of kids they specialized in than a good generalist would do with either.

On the less anecdotal side, here in DC the first year of our IMPACT system that is born out of this ideology found that teachers with more affluent students saw more growth in their students' test scores. I don't think anyone needed to develop a new evaluation tool to predict that.

Posted by: Mulch5 | January 14, 2011 7:50 PM

We can save a bundle of money with the value added analysis of test scores.

Since we can totally predict, before the school year, the expected test scores of students based upon previous test scores and effective teachers, we can simply weed out all the students that would fail even if they were assigned to a class with an effective teacher.

These students can be placed at low costs in the auditorium instead of class rooms. Many of the poverty schools have failure rates of 50 percent and above. Since the value added method allows the identification of students that will fail it would be possible to identify the 50 percent that will fail. In these schools it would be possible to lay off 50 percent of the teaching staff since the 50 percent of failures would not require teachers. Why throw away good money on these students when we know by value added analysis that these students will fail?

Using value added analysis would allow for significant savings in the poverty schools with the early identification of the students that can not learn with effective teachers.

Posted by: bsallamack | January 14, 2011 9:51 PM

"The end goal is to use the information to guide in the creation of more effective teacher evaluation systems that incorporate high-quality multiple measures."

Multiple measures implies a degree of objectivity be included in the evaluations. If VAM is the best objective measure of the day then...

That, in and of itself, would be an improvement over the anemic system used today where a number of teachers are seemingly not evaluated at all, and most who are evaluated receive a "satisfactory" rating.

All of the hoopla around evaluating teacher effectiveness, of course, is nothing but an attempt to avoid the politically incorrect issue of the instability of the poor/minority family structure. We know poverty, teenage motherhood, single-parent homes, the lack of value on education from this cohort, generational public assistance, etc., etc., are the primary causes of the achievement gap, NOT INCOMPETENT TEACHERS.

So, Gates and others are seemingly taking a page out of the medical profession which goes to great lengths to RULE OUT certain variables; in this case ineffective teachers. Only in education, ruling out ineffective teachers has become the focus of the achievement gap for some instead of a potential (or not) cause thereof.

Posted by: phoss1 | January 15, 2011 7:50 AM

From a review of comments I see that Americans still do not understand the implications of value added analysis and the significant cost reductions that value added analysis will allow.

Americans still do not recognize that once there are only effective teachers the value added analysis will allow school administrators to predict all of the students that can not be educated with effective teachers. This allows schools to identify all ineffective students. The cost saving from removing ineffective teachers is minor in comparison to the significant cost savings from not wasting resources on students that have been scientifically identified as ineffective students.

There no longer will be an achievement gap in public education since, by using the scientifically proved method of value added analysis, schools will be able to determine all of the students that are ineffective and there no longer will be a need to waste resources in a totally futile effort to educate these students. There no longer will be a nebulous achievement gap since with value added analysis and effective teachers we will have scientifically proven the inability of certain students to learn.

Value added analysis will also allow for further savings. Since schools can predict those students that perform best with an effective teacher these students can be placed in classes with larger number of students. A cost benefit analysis would be done to indicate the number of specific students that can be placed in a single classroom. In many middle class schools classes that now require two teachers can be combined into a class only requiring one teacher.

An intelligent use of value added analysis would allow easily for a 50 percent reduction in the number of teachers in the nation.

For example: With an effective teacher the test scores on a mathematics course such as Algebra 1 would indicate those students who will fail in more advanced mathematics courses. Schools would not have to waste further resources on futile attempts to educate these students in mathematics. This would significantly lessen the number of mathematics teachers required by schools.

There is no cost benefit in educating students in advanced mathematics when it has been proven by mathematics that these students will simply fail these classes with an effective teacher.

Time for Americans to understand that value added analysis is not a reform of public education but a revolution that will allow us to significantly lessen the costs of public education while obtaining the greatest cost benefits from the dollars spent on education.

A win win situation for America.

Posted by: bsallamack | January 15, 2011 10:15 AM


You are kidding, right?

Posted by: educationlover54 | January 16, 2011 3:40 PM

I read last week a 26 page interview with a great American statistician, Leo Goodman, one which reviewed his career, and then, toward the end of the article, the crisis in his life he survived for 30+ years, a form of cancer. Goodman tells that in supporting one course of therapy after he was diagnosed, his oncologist gave him several medical research report summaries from medical journals on this treatment regime, one which contrasted sharply with the therapeutic regime mostly associated with doctors and medical researchers in another country. Curious about the evidence in support of the summaries, Goodman tells that he retrieved and read the research papers themselves, and found the conclusions in the summaries were unsupported by the evidence in the bodies of the papers. That led him to find another oncologist and to go with the other course of treatment. Goodman tells that several years later a large clinical trial comparing the treatments found the course he chose for treatment far outperformed the one his first oncologist would have had him on.

Posted by: incredulous | January 16, 2011 9:42 PM


You are kidding, right?

Posted by: educationlover54

Just trying to point out the lunacy of value added analysis. It is based upon the idea of being able to predict test scores for students. Supposedly if a student takes a test in the fifth grade with an effective teacher the method is able to predict the score of the student in the sixth grade with an effective teacher.

This is ludicrous but if it was true we could stop spending money to try to teach students that fail tests when they have an effective teacher. Logically this follows from the idea of value added analysis.

Not one supporter of value added analysis would accept that we stop trying to teach students that have failed with an effective teacher, even though the logic indicates this follows from the idea of firing teachers based upon value added analysis.

The reality is that the ideas in public education are absurd. Reliance on testing has simply lowered standards and teaching as everyone looks for tricks to have students pass the tests. The tests themselves are watered down.

The only way to tell if you have good teachers is by having the principal responsible for ensuring this. If you want better teachers you have to have better principals whose fundamental priority is ensuring the quality of teachers and teaching in their schools. Like it or not, it is the principal that is accountable, and more resources should be given to principals to manage their schools.

Schools are not factories.

Posted by: bsallamack | January 17, 2011 9:51 AM


© 2011 The Washington Post Company