Posted at 3:00 PM ET, 08/29/2010

Study blasts popular teacher evaluation method

By Valerie Strauss

Student standardized test scores are not reliable indicators of how effective any teacher is in the classroom, not even with the addition of new “value-added” methods, according to a study released today. It calls on policymakers and educators to stop using test scores as a central factor in holding teachers accountable.

“Value-added modeling” is indeed all the rage in teacher evaluation: The Obama administration supports it, and the Los Angeles Times used it to grade more than 6,000 California teachers in a controversial project. States are changing laws in order to make standardized tests an important part of teacher evaluation.

Unfortunately, this rush is occurring without evidence that the method works well. The study, by the Economic Policy Institute, a nonpartisan nonprofit think tank based in Washington, concludes that VAM results should not dominate high-stakes decisions about teacher evaluation and pay.

Value-added measures use test scores to track the growth of individual students as they progress through the grades and see how much “value” a teacher has added. They do not include other factors that affect students, and can skew results by giving better scores to teachers who “teach to the test” and lesser scores to teachers who are assigned students with the greatest educational needs.

As much as we’d like a simple way to identify and remove bad teachers, the study concludes that “there is simply no shortcut.”

The authors give the study, “Problems with the Use of Student Test Scores to Evaluate Teachers,” unusual credibility: It was written by four former presidents of the American Educational Research Association; two former presidents of the National Council on Measurement in Education; the current and two former chairs of the Board of Testing and Assessment of the National Research Council of the National Academy of Sciences; the president-elect of the Association for Public Policy Analysis and Management; the former director of the Educational Testing Service’s Policy Information Center; a former associate director of the National Assessment of Educational Progress; a former assistant U.S. secretary of education; a member of the National Assessment Governing Board; and the vice president, a former president, and three other members of the National Academy of Education.

I’m publishing the executive summary, and after that, a list of the authors of the study. Here’s a link to the entire report.



Executive summary

Every classroom should have a well-educated, professional teacher, and school systems should recruit, prepare and retain teachers who are qualified to do the job. Yet in practice, American public schools generally do a poor job of systematically developing and evaluating teachers.

Many policymakers have recently come to believe that this failure can be remedied by calculating the improvement in students’ scores on standardized tests in mathematics and reading, and then relying heavily on these calculations to evaluate, reward, and remove the teachers of these tested students.

While there are good reasons for concern about the current system of teacher evaluation, there are also good reasons to be concerned about claims that measuring teachers’ effectiveness largely by student test scores will lead to improved student achievement.

If new laws or policies specifically require that teachers be fired if their students’ test scores do not rise by a certain amount, then more teachers might well be terminated than is now the case.

But there is not strong evidence to indicate either that the departing teachers would actually be the weakest teachers, or that the departing teachers would be replaced by more effective ones. There is also little or no evidence for the claim that teachers will be more motivated to improve student learning if teachers are evaluated or monetarily rewarded for student test score gains.

A review of the technical evidence leads us to conclude that, although standardized test scores of students are one piece of information for school leaders to use to make judgments about teacher effectiveness, such scores should be only a part of an overall comprehensive evaluation.

Some states are now considering plans that would give as much as 50 percent of the weight in teacher evaluation and compensation decisions to scores on existing tests of basic skills in math and reading. Based on the evidence, we consider this unwise.

Any sound evaluation will necessarily involve a balancing of many factors that provide a more accurate view of what teachers in fact do in the classroom and how that contributes to student learning.

Evidence about the use of test scores to evaluate teachers

Recent statistical advances have made it possible to look at student achievement gains after adjusting for some student and school characteristics. These approaches that measure growth using “value-added modeling” (VAM) are fairer comparisons of teachers than judgments based on their students’ test scores at a single point in time or comparisons of student cohorts that involve different students at two points in time.
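To make the mechanics concrete, here is a toy sketch of how a value-added estimate is typically computed: regress current scores on prior scores, then average each teacher's students' residuals. This is my own simplified illustration with simulated data, not the model used by any district or by the studies discussed here.

```python
# Toy value-added sketch (illustrative only, not any district's actual model).
# A teacher's "value added" is taken as the mean regression residual of her
# students: how far their current scores sit above or below what prior scores predict.
import numpy as np

rng = np.random.default_rng(0)

n_teachers, class_size = 20, 25
true_effect = rng.normal(0, 2, n_teachers)           # hypothetical true teacher effects
teacher = np.repeat(np.arange(n_teachers), class_size)

prior = rng.normal(50, 10, n_teachers * class_size)  # last year's scores
noise = rng.normal(0, 8, prior.size)                 # everything else: home, peers, luck
current = 5 + 0.9 * prior + true_effect[teacher] + noise

# Regress current on prior scores, then average residuals within each class.
slope, intercept = np.polyfit(prior, current, 1)
residual = current - (intercept + slope * prior)
vam = np.array([residual[teacher == t].mean() for t in range(n_teachers)])

# Even with everything else held simple, 25-student classes leave sizable noise.
print(np.corrcoef(vam, true_effect)[0, 1])
```

Even in this idealized setup, where nothing but random classroom noise confounds the estimate, a 25-student class tracks the true effects imperfectly, because the class-average noise is nearly as large as the spread of true teacher effects.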

VAM methods have also contributed to stronger analyses of school progress, program influences, and the validity of evaluation methods than were previously possible.

Nonetheless, there is broad agreement among statisticians, psychometricians, and economists that student test scores alone are not sufficiently reliable and valid indicators of teacher effectiveness to be used in high-stakes personnel decisions, even when the most sophisticated statistical applications such as value-added modeling are employed.

For a variety of reasons, analyses of VAM results have led researchers to doubt whether the methodology can accurately identify more and less effective teachers. VAM estimates have proven to be unstable across statistical models, years, and classes that teachers teach.

One study found that across five large urban districts, among teachers who were ranked in the top 20 percent of effectiveness in the first year, fewer than a third were in that top group the next year, and another third moved all the way down to the bottom 40 percent. Another found that teachers’ effectiveness ratings in one year could only predict from 4 percent to 16 percent of the variation in such ratings in the following year. Thus, a teacher who appears to be very ineffective in one year might have a dramatically different result the following year.

The same dramatic fluctuations were found for teachers ranked at the bottom in the first year of analysis. This runs counter to most people’s notions that the true quality of a teacher is likely to change very little over time and raises questions about whether what is measured is largely a “teacher effect” or the effect of a wide variety of other factors.
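To see what those year-to-year numbers imply, here is a quick Monte Carlo sketch (my own illustration, not taken from the report): draw two years of ratings correlated so that one year explains 10 percent of the next year's variance, the middle of the 4 to 16 percent range cited above, and check how often a top-quintile teacher repeats or drops.

```python
# Simulation: if one year's rating explains only ~10% of the next year's
# variance (r^2 = 0.10), how often does a "top 20%" teacher stay on top?
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
r = np.sqrt(0.10)                       # year-to-year correlation of ratings

year1 = rng.standard_normal(n)
year2 = r * year1 + np.sqrt(1 - r**2) * rng.standard_normal(n)

top1 = year1 > np.quantile(year1, 0.8)                    # top quintile in year 1
stayed = np.mean(year2[top1] > np.quantile(year2, 0.8))   # still top quintile
fell = np.mean(year2[top1] < np.quantile(year2, 0.4))     # bottom 40% next year

print(f"stayed in top 20%: {stayed:.0%}; dropped to bottom 40%: {fell:.0%}")
```

With 100,000 simulated teachers and that level of reliability, only about a third of the top quintile repeat the next year, which is consistent with the "fewer than a third" figure the summary cites.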

A study designed to test this question used VAM methods to assign effects to teachers after controlling for other factors, but applied the model backwards to see if credible results were obtained. Surprisingly, it found that students’ fifth-grade teachers were good predictors of their fourth-grade test scores.

Inasmuch as a student’s later fifth-grade teacher cannot possibly have influenced that student’s fourth-grade performance, this curious result can only mean that VAM results are based on factors other than teachers’ actual effectiveness.

VAM’s instability can result from differences in the characteristics of students assigned to particular teachers in a particular year, from small samples of students (made even less representative in schools serving disadvantaged students by high rates of student mobility), from other influences on student learning both inside and outside school, and from tests that are poorly lined up with the curriculum teachers are expected to cover, or that do not measure the full range of achievement of students in the class.

For these and other reasons, the research community has cautioned against the heavy reliance on test scores, even when sophisticated VAM methods are used, for high stakes decisions such as pay, evaluation, or tenure.

For instance, the Board on Testing and Assessment of the National Research Council of the National Academy of Sciences stated:

...VAM estimates of teacher effectiveness should not be used to make operational decisions because such estimates are far too unstable to be considered fair or reliable.

A review of VAM research from the Educational Testing Service’s Policy Information Center concluded,

VAM results should not serve as the sole or principal basis for making consequential decisions about teachers. There are many pitfalls to making causal attributions of teacher effectiveness on the basis of the kinds of data available from typical school districts. We still lack sufficient understanding of how seriously the different technical problems threaten the validity of such interpretations.

And RAND Corporation researchers reported that,

The estimates from VAM modeling of achievement will often be too imprecise to support some of the desired inferences....

and that

The research base is currently insufficient to support the use of VAM for high-stakes decisions about individual teachers or schools.

Factors that influence student test score gains attributed to individual teachers

A number of factors have been found to have strong influences on student learning gains, aside from the teachers to whom their scores would be attached.

These include the influences of students’ other teachers—both previous teachers and, in secondary schools, current teachers of other subjects—as well as tutors or instructional specialists, who have been found often to have very large influences on achievement gains.

These factors also include school conditions—such as the quality of curriculum materials, specialist or tutoring supports, class size, and other factors that affect learning.

Schools that have adopted pull-out, team teaching, or block scheduling practices will find it especially difficult to accurately isolate individual teacher “effects” for evaluation, pay, or disciplinary purposes.

Student test score gains are also strongly influenced by school attendance and a variety of out-of-school learning experiences at home, with peers, at museums and libraries, in summer programs, on-line, and in the community. Well educated and supportive parents can help their children with homework and secure a wide variety of other advantages for them. Other children have parents who, for a variety of reasons, are unable to support their learning academically.

Student test score gains are also influenced by family resources, student health, family mobility, and the influence of neighborhood peers and of classmates who may be relatively more advantaged or disadvantaged.

Teachers’ value-added evaluations in low-income communities can be further distorted by the summer learning loss their students experience between the time they are tested in the spring and the time they return to school in the fall.

Research shows that summer gains and losses are quite substantial. A research summary concludes that while students overall lose an average of about one month in reading achievement over the summer, lower-income students lose significantly more, and middle-income students may actually gain in reading proficiency over the summer, creating a widening achievement gap.

Indeed, researchers have found that three-fourths of schools identified as being in the bottom 20 percent of all schools, based on the scores of students during the school year, would not be so identified if differences in learning outside of school were taken into account. Similar conclusions apply to the bottom 5 percent of all schools.

For these and other reasons, even when methods are used to adjust statistically for student demographic factors and school differences, teachers have been found to receive lower “effectiveness” scores when they teach new English learners, special education students, and low-income students than when they teach more affluent and educationally advantaged students.

The nonrandom assignment of students to classrooms and schools—and the wide variation in students’ experiences at home and at school—mean that teachers cannot be accurately judged against one another by their students’ test scores, even when efforts are made to control for student characteristics in statistical models.

Recognizing the technical and practical limitations of what test scores can accurately reflect, we conclude that changes in test scores should be used only as a modest part of a broader set of evidence about teacher practice.

The potential consequences of the inappropriate use of test-based teacher evaluation

Besides concerns about statistical methodology, other practical and policy considerations weigh against heavy reliance on student test scores to evaluate teachers.

Research shows that an excessive focus on basic math and reading scores can lead to narrowing and over-simplifying the curriculum to only the subjects and formats that are tested, reducing the attention to science, history, the arts, civics, and foreign language, as well as to writing, research, and more complex problem-solving tasks.

Tying teacher evaluation and sanctions to test score results can discourage teachers from wanting to work in schools with the neediest students, while the large, unpredictable variation in the results and their perceived unfairness can undermine teacher morale.

Surveys have found that teacher attrition and demoralization have been associated with test-based accountability efforts, particularly in high-need schools. Individual teacher rewards based on comparative student test results can also create disincentives for teacher collaboration.

Better schools are collaborative institutions where teachers work across classroom and grade-level boundaries toward the common goal of educating all children to their maximum potential. A school will be more effective if its teachers are more knowledgeable about all students and can coordinate efforts to meet students’ needs.

Some other approaches, with less reliance on test scores, have been found to improve teachers’ practice while identifying differences in teachers’ effectiveness.

They use systematic observation protocols with well-developed, research-based criteria to examine teaching, including observations or videotapes of classroom practice, teacher interviews, and artifacts such as lesson plans, assignments, and samples of student work. Quite often, these approaches incorporate several ways of looking at student learning over time in relation to a teacher’s instruction.

Evaluation by competent supervisors and peers, employing such approaches, should form the foundation of teacher evaluation systems, with a supplemental role played by multiple measures of student learning gains that, where appropriate, could include test scores. Some districts have found ways to identify, improve, and as necessary, dismiss teachers using strategies like peer assistance and evaluation that offer intensive mentoring and review panels.

These and other approaches should be the focus of experimentation by states and districts.

Adopting an invalid teacher evaluation system and tying it to rewards and sanctions is likely to lead to inaccurate personnel decisions and to demoralize teachers, causing talented teachers to avoid high-needs students and schools, or to leave the profession entirely, and discouraging potentially effective teachers from entering it.

Legislatures should not mandate a test-based approach to teacher evaluation that is unproven and likely to harm not only teachers, but also the children they instruct.

The report’s co-authors are:
Eva L. Baker, professor of education at UCLA and co-director of the National Center for Evaluation Standards and Student Testing (CRESST)
Paul E. Barton, former director of the Policy Information Center of the Educational Testing Service
Linda Darling-Hammond, professor of education at Stanford University, former president of the American Educational Research Association
Edward Haertel, professor of education at Stanford University, former president of the National Council on Measurement in Education, chair of the National Research Council’s Board on Testing and Assessment, former chair of the committee on methodology of the National Assessment Governing Board
Helen F. Ladd, professor of public policy and economics at Duke University, president-elect of the Association for Public Policy Analysis and Management
Robert L. Linn, professor emeritus at the University of Colorado, former president of the National Council on Measurement in Education and of the American Educational Research Association, former chair of the National Research Council’s Board on Testing and Assessment
Diane Ravitch, research professor at New York University and historian of American education, former U.S. assistant secretary of education
Richard Rothstein, research associate of the Economic Policy Institute
Richard J. Shavelson, professor of education (emeritus) and former dean of the School of Education at Stanford University, former president of the American Educational Research Association
Lorrie A. Shepard, dean and professor at the School of Education at the University of Colorado at Boulder, former president of the American Educational Research Association, immediate past president of the National Academy of Education


Follow my blog every day by bookmarking washingtonpost.com/answersheet. And for admissions advice, college news and links to campus papers, please check out our Higher Education page at washingtonpost.com/higher-ed. Bookmark it!


Comments

The executive summary of the executive summary is that VA methods should not be the principal factor in evaluating teachers. However, many school districts are not making VA methods a determining factor, but a contributing factor. I believe LA is looking at making the VA score 30% of a teacher's evaluation.

These authors do not say VA methods are worthless, only that they are imprecise. Most authors agree that VA methods are not good at distinguishing the large middle of teachers, but that they can be effective at identifying the extreme tails of very good teachers and very bad teachers.

I have read that one of the first patients given penicillin was a young police officer who developed sepsis from a small cut while shaving. He ultimately died. Did this mean we should have stopped evaluating penicillin as a treatment for infection? Of course not! (Open heart surgery is another amazing technique that had many setbacks at first, only to become widely practiced today.)

Similarly, no one is saying that the VA methods being used today are perfect and will not be altered in the future. What is happening in LA and DC, though, is an incredibly important first step that will spur a tremendous amount of research into accurately and quantitatively evaluating teachers and teaching methods.

I will point out that the LA data encompassed 7 years of teaching, and one would think that there would be less noise in the data than an evaluation of only 1 year of teaching.

The reactions we have seen among public school teachers remind me of the reactions I have seen among health care workers as the burden of regulation and outside evaluation increases every year: threats of leaving the field, warnings that no one will enter health care under such work conditions, decreased morale. And yet, by and large, these things haven't happened (at least not to the anticipated degree). It will be the same with teaching.

Finally, for those teachers who belittle "teaching to the test," I would ask: how can you quantitatively, reproducibly, and cost-effectively evaluate a child's progress in learning? Other than using test scores, how can you quantitatively, reproducibly, and cost-effectively evaluate a teacher's effectiveness? (The semi-quantitative observation methods suggested in the paper would probably not be considered cost-effective because they would require a much higher skill set than is available in our public school system. Moreover, we have been doing observations for years already, with poor results.)

Posted by: cypherp | August 29, 2010 3:55 PM

As a parent, I am not sure that I would want my teacher paid based primarily on test scores...my personal focus, as the parent of kids who I am sure can be proficient on the tests, would be preparing my child for college...The question arises: would I rather my kid's teacher drill him to perfection on some tested skill or spend the time moving him on to more challenging material once it is clear that he is proficient? When he or she reaches 7th grade, is it better to drill him or her to become advanced on the 7th grade skills or move him or her on to Algebra? The system must be thought through carefully and, like most things, cannot be applied universally...

Posted by: petercat926 | August 29, 2010 3:57 PM

Even teachers deemed highly effective in the Los Angeles Times database question the VAM score they received.
--------
Terry Little's Response:

I'm proud of my results, but I caution all viewers of this data, that effective teachers impact students and families in many ways including: socio-emotional support, building trusting relationships, encouragement, empathy, providing supplies, food, and other resources.
-----
Helen Steinmetz's Response:

I fear that the emphasis on test scores will encourage more teaching to the test instead of the much more important skills such as critical thinking in both Math and Language Arts.

It will encourage a quantity over quality approach to teaching that adds no value whatsoever to student achievement.

This data would have been put to better use if it were not made public. It demeans our profession and the teachers who try their utmost to do a difficult job. What scores don't tell is the complete story of what goes into making an effective teacher, and it is so much more than a score on a test.
-----

Even the teacher who was in the original article, portrayed as a young phenom, wrote about the reporter's sneaky way of explaining why the teacher was being observed:

Miguel Garcia's Response:

I was not surprised by my value added measure simply because at Sunrise it has been our practice to view our incoming students' prior year's test scores at the onset of the new school year, and then again in the following school year. When Jason Song came to our school he stated that he was there to research why our school, compared to others of similar backgrounds, was outperforming them. Jason Song was interested in researching what successful teachers have in common. Furthermore, Jason asked--without mentioning names--whether there are teachers at our school who I believe are not good teachers. Frankly, I said, "Jason, of course there are some that might need to improve, but, let's be clear, in any profession there are those that need to improve. There are doctors, lawyers, etc. that need to improve. Do you think that I think that every writer for the Los Angeles Times is a good writer?" Are public servants such as police officers and doctors going to have their job performances made public?
-----------

I hope the public can understand that teachers are not trying to avoid improved evaluation systems. Most would welcome them willingly. Many of us believe VAM will not provide the type of meaningful evaluation we seek and the parents, students, and community deserve. It is a red herring.

If the lay public is interested in a better evaluation system I recommend they examine the National Board of Professional Teaching Standards at nbpts.org

Posted by: avalonsensei | August 29, 2010 4:12 PM

These authors do not say VA methods are worthless, only that they are imprecise.

Most authors agree that VA methods are not good at distinguishing the large middle of teachers, but that they can be effective at identifying the extreme tails of very good teachers and very bad teachers.

Posted by: cypherp
......................
The authors say that the VA should be not used because it is not valid and not because it is not precise. The authors give numerous reasons why VA is invalid.

Great how cypherp claims "Most authors agree ... they can be effective at identifying the extreme tails of very good teachers and very bad teachers." when none of the authors make this claim.

Apparently cypherp believes he is a mind reader.

It looks like American adults also have the same problem in reading as many American children.

Posted by: bsallamack | August 29, 2010 5:16 PM

If the lay public is interested in a better evaluation system I recommend they examine the National Board of Professional Teaching Standards at nbpts.org

Posted by: avalonsensei
......................
Many Americans are not interested in a better evaluation system.

All that these Americans are interested in is following the political leaders in blaming teachers for the problem. No need to deal with the problem.

Also the political leaders do not need to worry about anyone blaming them for doing nothing since they have already made many Americans believe that the teachers are the ones creating the problem.

See how it works.

Posted by: bsallamack | August 29, 2010 5:25 PM

I'd like to respond to this question, asked by cypherp:

"Finally, for those teachers who belittle 'teaching to the test,' I would ask how can you quantitatively, reproducibly, and cost-effectively evaluate a child's progress in learning?"

Before I answer I'd like to say that there are distinctly different definitions of "teaching to the test." One definition means teaching to the curriculum. For example, if multiplication is to be tested, then the wise and effective teacher will teach it and teach it well. There is nothing wrong with this view of "teaching to the test." In fact, this is what effective teachers do.

The other definition is teaching the items that are likely to be on the test by studying last year's test, which is likely the same or nearly the same as this year's test. So instead of teaching multiplication, the teacher might drill on the multiplication items likely to be on the test. In spelling, instead of focusing on the hundreds of words suggested for the grade level, the teacher might drill the children on the twenty that are likely to appear on the test. This is what many people mean by "teaching to the test." This type of test preparation invalidates the results and has a very detrimental effect on the quality of classroom instruction. In my opinion this explains high state test scores but low NAEP scores.

I DO believe that a child's progress can be measured TO SOME EXTENT by a standardized test, but the integrity of the test must be assured. A different test would have to be used each year and it would have to be handled by someone other than the classroom teacher.

This brings us to "cost-effectiveness" because administering the test properly would cost a lot of money that districts don't have at the present time. I was thinking about this today and came up with this idea:

Each state could come up with a criterion-referenced test based on national standards. In this way the tests could be used interchangeably, e.g. one year Iowa could use its own test and the next year exchange with Minnesota. Or perhaps the federal government could provide states with many different forms. To ensure proper administration, the test for each class could be sent to each school in a sealed envelope. The principal would deliver the sealed envelope to a teacher of another grade. For example, Miss Smith of second grade would administer the test to Mr. Jones' fourth grade class. The test would stay with her until the end, when she would seal it and deliver it to the office (or better yet, mail it to a central location).

Perhaps my ideas are impractical, but I'm certain of this: if the validity of the test cannot be reasonably assured, then the results should not be taken that seriously and should definitely not be used to evaluate teachers.

Posted by: Linda/RetiredTeacher | August 29, 2010 5:27 PM

cypherp-

Your comparison of teaching to health care staffing is mistaken. There is a shortage of health care workers, especially skilled nurses. Because of the shortage, nurses make good salaries that increase steadily with experience throughout their careers, unlike teaching salaries that flatten and top out much more quickly. Older experienced nurses are valued, whereas older experienced teachers are objects of scorn.

Maybe there isn't a big problem with morale in the health care industry (I have no experience there) but to say there isn't a morale problem in teaching shows a complete lack of knowledge about what is going on. It's too late--many great teachers have already left, and most quit within 5-7 years of starting. The reasons range from poor working conditions, to lack of respect on the part of administrators and students, to being held responsible for things they cannot control (i.e. being scapegoats for everything wrong in society that no one else seems to be able to fix). Most (not all) schools of education accept the dregs of the academic world because of a lack of competition and turn them loose to fill teaching spots left vacant by those who decided not to accept a lifetime of being treated like scum. Some turn out to be great and some are awful. Either way, most will quit within 5-7 years.

One of the problems with using observations is that too many of the people doing the observing are principals and other administrators who only care about test results. Many have virtually no teaching experience and don’t understand what they are looking at. Teaching quality becomes irrelevant. People who equate high test scores with good teaching are profoundly ignorant about what constitutes a good education. Plenty of very poor teachers are able to train kids to score well on tests without teaching them to develop higher-level skills, such as creative ways to solve problems or how to write a decent paper; you don’t need those skills to perform well on the tests.

To use a flawed evaluation instrument while we’re still trying to improve it is one thing. To use that instrument as a weapon against teachers when it was never intended to be used that way is simply abusive. It’s an example of the kind of mean-spirited disrespect that makes so many quit early on in their careers.

Posted by: aed3 | August 29, 2010 5:32 PM

Of course Linda Darling-Hammond (and many of her Stanford associates) doesn't like objective data on teachers or students. She'd prefer we evaluate students via projects, essays, portfolios, etc., as opposed to tests. The problem with these types of assessments: no one would ever know who actually did the work. Was it, in fact, the student, or was it the teacher, an administrator, a classmate, a sibling, a friend, a neighbor, a relative, etc., etc.? It's too easy to compromise any of these methods, ANY of them.

Darling-Hammond and the Stanford Ed School operated a charter school in East Palo Alto for poor/minority ELLs and it was closed last spring because the state of California deemed it chronically under-performing since 2001. It was operated in the same "soft" manner she and her cronies want future assessments to be made on the new Common Core Standards.

VAMs may not be ready for prime time but they could well serve as part of a "mixed measures" approach to evaluate teachers. The subjective administrative evaluations used nationwide to evaluate teachers are essentially worthless. Too often the teacher knows in advance the principal is coming in for an observation and it turns into a dog and pony show. The teacher gets all gussied up, hair and nails done, beautiful make-up, a new outfit and does a terrific lesson for the evaluator. The very next day she shows up looking like she just got dragged out of the closet and performs less than admirably.

This entire process has gone on for too long and is an embarrassment to the teaching profession. There must be some degree of objectivity injected into the process. VAMs could initially be used to "improve instruction." Over time, after they have been refined further, then perhaps they could be used as part of the "mixed measures" approach, with the actual value of student tests to be determined by the collective bargaining agreement.

There are problems with VAMs. They are NOT insurmountable and should not be dismissed by the educational establishment because some consider them threatening.

Posted by: phoss1 | August 29, 2010 5:35 PM | Report abuse

phoss-

You are right about the dog and pony shows. Assessors have a checklist and they have to observe each item in order to give credit for a successful lesson. This creates an unnatural situation in which the teacher has to put on an elaborate display that would ordinarily take place once or twice a week, maybe. That's one reason why many principals give notice before they show up. If they walk in on a lesson after the teacher has already explained what the objective is, the teacher might well be penalized for teaching a lesson with no purpose. Does that sound stupid? It happens all the time.

Some principals, often the ones who spent many years in the classroom, can walk into a classroom at any random time and "see" a complete lesson cycle. For example, if she sees students working on different stages of writing research papers using copies of primary sources, and the kids can all explain where the material came from and its significance, and have developed an informed opinion about it, it's a safe assumption that the teacher did something that helped the child get to that point. Good principals are also in close enough contact with teachers and students that they know who is really teaching and who is not.

This kind of informed assessment is now considered too subjective to be of any use. In reality, any assessment is subjective. Also, too many principals these days don't have enough teaching experience to be able to understand what's going on anyway. Even interpreting test scores is subjective.

Posted by: aed3 | August 29, 2010 8:44 PM | Report abuse

bsallamack -

Many authors (including those of this brief) state that VA methods *should* be a component of teacher evaluation. They suggest that VA methods should not be the main factor in an evaluation. If, as you say, VA methods are invalid (that is, VA methods do not measure what they purport to measure), then the authors' recommendation makes no sense and they would be undermining their own credibility. Why recommend something that is no better than a coin flip?

A major issue in teacher evaluation is that there is no "gold standard" - there is no reproducible, accurate method for evaluation of a teacher's performance. Therefore, it is difficult to state whether a method is "valid" or not - what do we compare it to? Even these authors believe there is some validity to VA methods.

I think most people would agree that a teacher who, year after year, performs more poorly than her grade-level peers needs to work on her teaching method. Similarly, a teacher who, year after year, moves her students ahead of her peers is probably an effective teacher. It's the teachers in the middle, the ones who seem to be effective one year and less effective the following year, that are more difficult to differentiate from each other.

Posted by: cypherp | August 29, 2010 8:49 PM | Report abuse

cypherp wrote:
"These authors do not say VA methods are worthless, only that they are imprecise. Most authors agree that VA methods are not good at distinguishing the large middle of teachers, but that they can be effective at identifying the extreme tails of very good teachers and very bad teachers."

The executive summary authors seem to disagree:
"...among teachers who were ranked in the top 20 percent of effectiveness in the first year, fewer than a third were in that top group the next year, and another third moved all the way down to the bottom 40 percent."

Posted by: Trev1 | August 29, 2010 9:20 PM | Report abuse

Trev1 -

You do realize that you can be a top or bottom 20% teacher and still be within 1 standard deviation of the mean? (About 68% of teachers will be within 1 standard deviation of the mean, assuming a normal distribution). As I originally stated, I was referring to the extreme tails of the distribution. If you want to quantitate it, look at teachers 2 or more standard deviations away from the mean.
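cypherp's two figures are straightforward to verify; a minimal Python sketch, using only the standard library's error function (the bisection-based percentile helper is an illustrative stand-in for a proper inverse-CDF routine):

```python
import math

def normal_cdf(x):
    # Standard normal CDF expressed via the error function
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

# Fraction of a normal population within 1 standard deviation of the mean
within_one_sd = normal_cdf(1.0) - normal_cdf(-1.0)
print(f"within 1 SD: {within_one_sd:.4f}")   # ~0.6827, the "about 68%"

def normal_ppf(p):
    # Inverse CDF by bisection; accurate enough for illustration
    lo, hi = -10.0, 10.0
    for _ in range(200):
        mid = (lo + hi) / 2.0
        if normal_cdf(mid) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

# z-score marking the top-20% cutoff (the 80th percentile)
cutoff = normal_ppf(0.80)
print(f"top-20% cutoff z: {cutoff:.3f}")     # ~0.842, i.e. inside 1 SD
```

So a teacher just inside the top (or bottom) 20% sits at |z| of about 0.84, well within one standard deviation of the mean, exactly as the comment says.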

If you go to the Urban Institute paper on the Stability of Value-Added Measures, you will find that year-to-year correlation is about 0.25-0.3. They have more detailed tables on year-on-year variation in value-added estimates.

It should be noted from the Urban Institute paper that VAMs seem to be better estimators of teacher effectiveness than teacher experience or educational attainment, and VAMs are equally good or better than administrator evaluations. Why should we make pay decisions on teacher experience and educational attainment, and not make pay decisions on value-added scores?
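A year-to-year correlation near 0.3 is enough, by itself, to produce roughly the quintile churn quoted from the executive summary. A minimal Monte Carlo sketch, taking the 0.3 correlation as an assumption from the Urban Institute figure cited above (the simulation itself is an editorial illustration, not from either paper):

```python
import math
import random

random.seed(42)

R = 0.3        # assumed year-to-year correlation of value-added estimates
N = 200_000    # simulated teachers

# year2 = R*year1 + sqrt(1 - R^2)*noise yields correlation R with year1
year1 = [random.gauss(0.0, 1.0) for _ in range(N)]
year2 = [R * x + math.sqrt(1.0 - R * R) * random.gauss(0.0, 1.0)
         for x in year1]

def percentile(xs, p):
    # Value at the p-th quantile of the sample
    return sorted(xs)[int(p * len(xs))]

top_cut1 = percentile(year1, 0.80)   # top-20% threshold, year 1
top_cut2 = percentile(year2, 0.80)   # top-20% threshold, year 2
bot_cut2 = percentile(year2, 0.40)   # bottom-40% threshold, year 2

# Follow the year-1 top quintile into year 2
top_year1 = [(a, b) for a, b in zip(year1, year2) if a >= top_cut1]
stay_top = sum(b >= top_cut2 for _, b in top_year1) / len(top_year1)
drop_low = sum(b < bot_cut2 for _, b in top_year1) / len(top_year1)

print(f"top-20% teachers still in top 20% next year: {stay_top:.2f}")
print(f"top-20% teachers falling to bottom 40%:      {drop_low:.2f}")
```

With these assumptions, roughly a third of the simulated top-quintile teachers stay in the top quintile and roughly a quarter to a third fall into the bottom 40%, so the instability Trev1 quotes is what a 0.25-0.3 correlation predicts, not evidence against it.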

Posted by: cypherp | August 30, 2010 7:30 AM | Report abuse

In order to know what was accomplished, you must be able to measure -- and that's what tests do. Otherwise, parents, students and teachers would just be in a state of "mutual mystification" about what exactly transpired in school, did anyone learn anything, etc.

Posted by: ohioan | August 30, 2010 11:14 AM | Report abuse

"It should be noted from the Urban Institute paper that VAMs seem to be better estimators of teacher effectiveness than teacher experience or educational attainment, and VAMs are equally good or better than administrator evaluations. Why should we make pay decisions on teacher experience and educational attainment, and not make pay decisions on value-added scores?"

Posted by: cypherp

Note the assumptive fallacy: higher student standardized test scores = effective teacher. A false assumption leads to a faulty policy choice.

Posted by: mcstowy | August 30, 2010 11:27 AM | Report abuse

bsallamack -

Many authors (including those of this brief) state that VA methods *should* be a component of teacher evaluation. They suggest that VA methods should not be the main factor in an evaluation. If, as you say, VA methods are invalid (that is, VA methods do not measure what they purport to measure), then the authors recommendation makes no sense and they would be undermining their own credibility. Why recommend something that is no better than a coin flip?
Posted by: cypherp
.............................
Nonetheless, there is broad agreement among statisticians, psychometricians, and economists that student test scores alone are not sufficiently reliable and valid indicators of teacher effectiveness to be used in high-stakes personnel decisions, even when the most sophisticated statistical applications such as value-added modeling are employed.
-----------
One tires of this nonsense. These supposedly mathematical methods are flawed since the composition of students in a classroom is not random.

One tires of the continuing naive belief that a mathematically flawed method can still be used.

The reality is that money should not be spent at all on these systems, since they are expensive and flawed. With Race to the Top, billions will be spent to use flawed systems to evaluate teachers from test results.

"Why recommend something that is no better than a coin flip?"

I would recommend using a coin flip instead since it would save billions on worthless systems. This will not allow you to mathematically evaluate teachers but it will only cost a penny.

Posted by: bsallamack | August 30, 2010 11:49 AM | Report abuse

More....such a projection is simply not generally realistic for struggling students, especially as they go up into higher grades. Part of this is that there is more and more language arts embedded into math. It is no longer simply computation of “all symbolic” math, but often challenging “contextual” word problems, which are historically difficult for students who lag behind. That teachers are often able to make significant strides among these students is admirable, but to suggest that this should be the norm is delusional.

3) To say that "Value-Added" removes outside influences (environmental, economic, language, etc.) is simplistic at best. Yes, you may be measuring the student's progress against their previous year. However, that progress will continually be influenced (positively or negatively) by the world they live in: their parental situation and support, nutrition, community pressures such as gangs or church involvement, school support including class size and environment, teacher experience, transiency due to economics, etc. Particularly at middle school levels, the emotional well-being of students is extremely fragile, and often school academia has very little to do with how they are surviving and evolving emotionally as young people.

4) Are you aware that the scores on CSTs have absolutely nothing to do with student report card grades? Did you know that even where schools have their own "benchmark tests", which are most often linked directly to CA standards and given a number of times during the school year, these tests are most often NOT used to evaluate individual students in terms of grading? Where, then, is any student or parent accountability in this system? Yes, AFTER school is over, usually in mid to late August, students are sent their CST test scores. Those are generally used to place students in one of those leveled classes mentioned above. But often students have already been placed for the next year, and administrators resist replacement after the school year has begun.

Posted by: Wizard64 | August 30, 2010 12:45 PM | Report abuse

5) What ARE valid expectations regarding student progress? I don't know if you are aware of this fact: every year, as students go through the grades, the overall trend is for their scores to DECLINE. That is, the score of student A in 4th grade will on average decline anywhere from 5 to 10 points or more in 5th grade. I know. The state publishes results and pridefully states that this year (and last) the scores have increased in each subject. But actually that is quite misleading. If you simply look at the state averages (or pick a district such as LA Unified) you will quickly notice that through the grades the average score goes down each grade, especially for math. And it has been doing that for as long as you wish to research (cde.ca.gov). As an aside, it has appeared for a while that in 3rd grade, scores drop significantly. This could mean several things, including the possibility that the test for that grade is somewhat too difficult or covers too many standards. It also means that third grade teachers may need much more assistance in math education. Is more research needed? Your call. 4th grade is the only grade where the scores increase overall in the state results. For the state to say that scores have risen, the "trick" is that if you look at last year’s score for a particular grade level, then this year at that same grade level, there has in general been an increase. Why? My theory is that it all goes back to second grade.

As the second grade scores increase, so do the next year's third grade scores, etc. In other words, increasing scores at second grade every year raise all other grade levels in subsequent years. "A rising tide floats all boats." Another way to look at this is that since students are our "natural resources", the richer the incoming resource, the stronger the outgoing product. I would also warn that second grade scores may be reaching levels where there won't be much increase in the future. This means we can predict that all grade level scores will begin to flatten in subsequent years. How will the press publicize that lack of change? Will it mean that schools and teachers will again become whipping posts for "failure" to improve?

What this means is: to expect individual student scores to RISE each year is not only unfair and shows a lack of understanding of data, but it places very unrealistic expectations on educators. And understand that this is what is being demanded by the “Value-Added” so-called experts: that teachers actually raise the scores of all students each year of their school life. BTW, what I've written here is fundamentally understood by many teachers, even if they are not as knowledgeable concerning the data. It is not understood by all administrators, politicians or the public.

Posted by: Wizard64 | August 30, 2010 12:49 PM | Report abuse

Why is this declining data trend true and to be expected? Well, it actually makes common sense. As subjects get more difficult, even at lower grade levels, and especially with mathematics, students naturally begin to find the work conceptually more challenging. Think about it. If this wasn't the case, EVERYBODY would be an expert in Calculus, and at the same time. So it makes sense that math scores, as well as scores in other subjects, SHOULDN'T increase each year as students age, but should slowly decline. To actually raise a student's score over a year is a stunning victory against the natural trend. Somehow, that decline phenomenon hasn't been made public, so there is this unrealistic expectation that scores for each student should go up each grade level. This places the whole "Value-Added" concept in question in terms of fair and reasonable evaluation. It is especially meaningful when comparing "benchmark" to "intensive" populations.

Posted by: Wizard64 | August 30, 2010 12:50 PM | Report abuse

But that is a story for another time. What I have tried to show here is that the concept of basing everything upon CST test results is clearly leading us in the wrong direction, away from creative and exciting and useful education with deep learning. In too many cases our standards are "a mile wide and an inch deep". It is certainly important to evaluate educators, schools and districts. But simplistic tools will most likely make already serious problems only worse.

Finally, I hope you’ve noticed the irony of my beginning these ruminations by touting my CST scores for 2010. I did that to at least help the reader realize that by a single, and I feel misleading definition, it appears that I was a qualified, relatively successful teacher. What you should also know is that the year before this one, I taught 8th grade Algebra, which I’d done successfully for many, many years, and I had a high percentage of students who probably shouldn’t have been taking that subject due to their prior low grades and scores. Our school was attempting to be more inclusive and give students at least an opportunity to take Algebra in 8th grade even if they didn’t appear functionally prepared. While most schools limit Algebra students to those who have scored 350 or higher in Pre-Algebra, our school that year allowed students who scored “low basic” (300-315) to be part of a period and a half of Algebra each day, giving up their elective classes (another entirely relevant issue for another time).

Having ALL or even most 8th grade students attempt Algebra in 8th grade is very controversial among math educators, but California “punishes” schools in terms of their API for students not taking Algebra at grade level, although that appears to be changing. The NCTM (National Council of Teachers of Mathematics), for example, does not support pushing every child to take Algebra in 8th grade, and has criticized California’s stand. Anyway, partially because my students were in a “Strategic Algebra” course, the number of students who attained Proficiency was considerably lower that year (2008-09), closer to the state average. That, even though I would say I worked harder that year than in some previous years to help students attain learning. Question: was I a lesser teacher that year than this year, when 66% were either advanced or proficient? How would my differing results be easily evaluated for the two-year span? That, in essence, gets to the heart of the Value-Added issue in terms of evaluation, IMO. I won’t even go there when we speak absurdly of 100% proficiency in 2014. Anybody who has ever taught can tell you that is a ridiculous goal set to once again tout failure where it doesn’t exist, by those who neither understand education nor support public education.


Thank you for your time,

Cory M. Wisnia

Posted by: Wizard64 | August 30, 2010 12:51 PM | Report abuse

This was the top of the comments I made (out of order!)
Ruminations on “Value-Added” Evaluation Concepts

My wife and I are teachers. I retired after 35 years of teaching math and science, primarily at middle school levels, at the end of this last school year. But during that time I taught every grade from 2nd through 12th. To put my rather lengthy comments below in perspective, I taught math (Pre-Algebra and Algebra) for many years as well as Physical Science at 7th and 8th grade levels. Last year, for my 5 Pre-Algebra classes, approximately 65%+ of my students rated "Proficient" or "Advanced", compared to state averages of approximately 42% and my own school's 49% average. I also have been a state certified Mathematics Staff Developer for various programs, including SB472, AB466, AB 1331, and others, and I've trained teachers in science and mathematics all over the state for the past decades. All this is by way of letting you know up front that at least when it comes to teaching math, I have some perspective and experience.

Now, about "Value-Added" Evaluations. While I believe teachers, students, administrators and parents should be able to evaluate educational practices, using state test scores as a single measurement is fraught with tendencies of unfairness, errors in judgment, and inequity. Below I'll go through a number of areas of concern, discussing issues of importance when looking at such evaluation methods.

1) Are all the classes in the school heterogeneous? More and more schools, even at elementary levels, are de facto "tracked" into up to 4 levels: honors, benchmark, strategic, intensive. Intensive students are at least two, and can be four or more, years behind grade level when entering a grade. How do we evaluate teachers who teach at these different levels? Certainly at middle schools such leveling is quite common in California. Many elementary schools place their EL (English Learner) students in common classes, which means more of those students lag behind in subject areas like math and language arts than other students, due to language acquisition issues.

2) If students are at differing levels, can you easily project their learning targets? It is my experience (and that of many teachers) that students entering classes with difficulties in math generally do not make the same type of progress that benchmark or honor students do. In other words, struggling students tend to struggle more as they get older and the subjects get harder. Whereas one might have the expectation that a "benchmark" student would be able to continue to make at least one year's growth in their next class, such a projection is simply not generally realistic for struggling students, especially as they go up into higher grades.

Posted by: Wizard64 | August 30, 2010 12:54 PM | Report abuse

From phoss1:
"Of course Linda Darling-Hammond (and many of her Stanford associates) doesn't like objective data on teachers or students. She'd prefer we evaluate students via projects, essays, portfolios etc., as opposed to tests. The problem with these types of assessments; no one would ever know who actually did the work. Was it, in fact, the student. or was it the teacher, an administrator, a classmate, a sibling, a friend, a neighbor, a relative, etc., etc.? It's too easy to compromise any of these methods, ANY of them.

Darling-Hammond and the Stanford Ed School operated a charter school in East Palo Alto for poor/minority ELLs and it was closed last spring because the state of California deemed it chronically under-performing since 2001. It was operated in the same "soft" manner she and her cronies want future assessments to be made on the new Common Core Standards."

This might be of interest to you seeing as how you want to slam Dr. Darling-Hammond.

http://www.paloaltoonline.com/news/show_story.php?id=17992
East Palo Alto school test results soar -- too late
Boost in test scores too late to save Stanford charter school, shuttered in June

Posted by: ms_teacher | August 30, 2010 2:10 PM | Report abuse

Here's a way to validate the VAM model. Assign the top 10% teachers to the bottom 10% students from last year, and the bottom 10% teachers to the top 10% students. If the bottom students become the top and the top students become the bottom, you'll know it works. If not, throw it out.

Posted by: mcstowy | August 30, 2010 3:49 PM | Report abuse

phoss1
Obviously you are not in the trenches. If you were, you would realize that any 'idea' that is not proved to be statistically reliable will come back to bite the teachers. Let's not forget that we have had a huge exodus from the profession recently and administration is typically wet behind the ears... meaning they have spent less than ten years in the classroom. Anything Darling-Hammond, Ravitch, etc. put forward gets my vote and should get yours too, if you are indeed in the trenches. Who else is looking out for the profession these days???? The expertise of the Darling-Hammond et al. crowd is unsurpassed and should be shown the respect it deserves.

Posted by: vpientka | August 30, 2010 4:03 PM | Report abuse

Note: The paper that was released was NOT a "study." It was a political briefing paper released by a union-sponsored left-wing think tank. In no sense was it a "study" -- it didn't identify any new bit of knowledge, it merely reviewed some other actual studies.

It's amazing that Valerie Strauss doesn't know the difference between an actual study and a political document.

Posted by: educationobserver | August 30, 2010 4:53 PM | Report abuse

@cypherp: There are flaws with each measure of teacher effectiveness (observations, test scores, educational attainment, board certification, etc.). I haven't heard the peer-review model mentioned yet, and that one has some track record. So there are certainly other options beyond the few you want to address.

As an Autistic Support teacher in Philadelphia, I have a few concerns about VAM. Many of the problems regarding outside factors influencing test scores are widespread in our school district (absenteeism; I would add homelessness too). However, I am not against value-added, because it does offer a small snapshot of how the student fares from year to year when measured against a grade level standard. Can it be used in evaluations? I think it has a place among several other measures that form a picture of student progress in one's classroom. There is a caveat: the test must be valid. That has not been the case in NYC, and that city has seen schools closed and $30 million in bonuses doled out over several years on the basis of invalid standardized test scores.

Taxpayers deserve to have their money count for something and those tests, believe me, cost states and districts millions of dollars. The citizens of NY just spent millions to be told very pretty lies for years about their students' competencies.

Posted by: Nikki1231 | August 31, 2010 7:53 AM | Report abuse

Cypherp wrote: "...for those teachers who belittle "teaching to the test", I would ask how can you quantitatively, reproducibly, and cost-effectively evaluate a child's progress in learning? Other than using test scores, how can you quantitatively, reproducibly, and cost-effectively evaluate a teacher's effectiveness?"

This is very typical of people who look at a subject from a distance and "see" a relationship that makes absolute sense, so you're not alone--especially in education, because basically everyone has been in school themselves, so they think they "know."

Obviously you don't have any kids in school who have been tested to death and "pre-tested" to death and "practice-tested" to death. I do.

When the students do well on the tests but now hate the subject and make it a point to forget all the material covered, do you still consider the teacher effective?? The administration doesn't care how dull the lessons are or even if the students actually LEARN anything as long as the test scores are good.

Is that where we want to go in education? Is the emphasis on test scores going to discourage students from dropping out?

Tests only measure how many questions a student answers correctly, not what the student "knows". This is especially true in math where students are frequently taught "recipe math" (i.e., "Just learn these steps to get the right answer.")

Here's an example I observed in a 6th-grade math class recently: The teacher was giving a lesson about adding integers and wrote the following on the board for the students to copy in their notebooks, which was taken directly from the Glencoe Math study guide: "When you add integers, remember the following. The sum of two positive integers is positive. The sum of two negative integers is negative. The sum of a positive integer and a negative integer is positive if the positive integer has the greater absolute value and negative if the negative integer has the greater absolute value."

You get the right answer by following those steps, but it is difficult to understand the actual number sense of it that way. As if to confirm that, one student raised her hand and said, "But I thought we were adding." To which the teacher said, "We are. You just need to follow the steps."

So what was being learned? The math? Or just how to get the right answer?
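The quoted rules are, mechanically, nothing more than a long-winded restatement of what integer addition already means, which is the student's point. A minimal sketch (an editorial illustration, with add_by_rules as a hypothetical helper that follows the quoted recipe literally) makes that explicit:

```python
def add_by_rules(a, b):
    """Add two integers by mechanically following the quoted sign rules."""
    if a >= 0 and b >= 0:                 # sum of two positives is positive
        return abs(a) + abs(b)
    if a < 0 and b < 0:                   # sum of two negatives is negative
        return -(abs(a) + abs(b))
    # Mixed signs: the result takes the sign of the larger absolute value
    big, small = (a, b) if abs(a) >= abs(b) else (b, a)
    sign = 1 if big >= 0 else -1
    return sign * (abs(big) - abs(small))

# The "recipe" agrees with plain a + b everywhere: the steps are a verbose
# restatement of what integer addition already means.
for a in range(-10, 11):
    for b in range(-10, 11):
        assert add_by_rules(a, b) == a + b
print("recipe matches plain addition on all tested pairs")
```

Following the steps always yields the right answer, which is exactly why a student can execute them perfectly while still asking what any of it has to do with adding.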

In Finland, which has one of the most envied elementary/secondary systems in the world, their students don't start school (officially) until age 7, they have a similar number of school days to us, and they have very few standardized tests. (It was in the WSJ--look it up.)

What did we do before we had all this standardized testing that pumps billions into the pockets of the testing companies--back when we put numerous men on the moon, developed computers, and had numerous other accomplishments?

If you want to close the "achievement gap," you need to address poverty.

Posted by: MathEdReseacher | August 31, 2010 4:41 PM | Report abuse

This business of VAM is not so different from paying professionals to give the specifications for and sort out parents, good from bad. Will any readers here admit to having been challenged, and perhaps defeated by their own children?

None of the findings and science of the Executive Summary is unpublished. Educational statisticians have analyzed classroom gains for decades and have never advanced these models to utility. We're just seeing new salesmen. Greater sophistication in statistical estimation of model parameters, the majority of what economists have been peddling for a decade, doesn't improve the VAM or change the hazards of false prediction.

Anybody familiar with what sports statisticians have done to change perceptions of player quality should appreciate how many characteristics of performance are part of the sabermetricians' developing arsenal and models; and then look at the pathetic math and reading tests that are the totality of the reductionist economist's stock of measurements.

Dissenting readers here repeat the errors of enterprising econometric professionals -- ignoring even brilliantly done empirical work demonstrating the hazards of the underlying model. Professional VAM modelers have demonstrated the same practice -- ignor-ance -- in reading their own literature: alchemists insisting they are close to the creation of gold. We have mostly been patients insisting the alchemists can simultaneously remove sickening impurities from the pond and find the Miracle Grow (tm) agents for intellectual growth lurking in there as well.

We buy VAM because we will to believe it can identify terrible teachers at one extreme of quality, and brilliantly successful teachers at the other end. Why do we believe this? Because our own memories and data are selective. We imagine our own "great" teachers were so effective with every student in our classes; and we blame unsuccessful teachers of our past for not overcoming obstacles we placed in the path of their success.

The fact is, most of the bad teachers we knew were mediocre, not bad. We thought we knew the bad teachers and dropped their courses or never enrolled in them.

If this were medicine, and we killed as many patients from the side effects of a false model (promoting suicides among those we were sure we were helping with depression, or causing strokes among those we were sure we were protecting from heart infarctions), then eventually the remedy, VAM, would be pulled from the market.

As in medicine, there's a stronger role for clinical judgement in identifying syndromes and treating behavior. Teaching and managing schools is still art.

Posted by: incredulous | September 1, 2010 10:07 AM | Report abuse
