The Land of Disenchantment: Bias in New Mexico Teacher Evaluation Measures

Geiger, Tray

Over the past 20 years in the United States (U.S.), teachers have seen a marked shift in the policies that govern how their performance is evaluated. Spurred by federal mandates, teachers have increasingly been held accountable for their students’ academic achievement, most notably through value-added models (VAMs), statistically complex tools that aim to isolate and then quantify teachers’ effects on their students’ achievement. This heightened focus on accountability ultimately led to numerous lawsuits across the U.S. in which teachers contested what they considered unfair evaluations informed by invalid, unreliable, and biased measures, most notably VAMs.

Although New Mexico’s teacher evaluation system was labeled a “gold standard” for its purported ability to differentiate objectively and accurately between effective and ineffective teachers, teachers filed suit in 2015 contesting the fairness and accuracy of their evaluations. Amrein-Beardsley and Geiger’s (revise and resubmit) initial analyses of the state’s teacher evaluation data revealed that the four individual measures comprising teachers’ overall evaluation scores showed evidence of bias: teachers in schools with different student body compositions (e.g., special education students, poorer students, gifted students) had significantly different scores than their peers. The purpose of this study was to extend these prior analyses by investigating whether those conclusions still held when controlling for a variety of potentially confounding factors at the school, class, and teacher levels, as such covariates were not included in the prior analyses.
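
Why controlling for covariates matters can be illustrated with a small simulation. The following is a minimal, hypothetical sketch in Python with NumPy, not the study’s actual model or data: the variables `poverty`, `experience`, and `score`, the helper functions `ols` and `simulate`, and every coefficient are invented for illustration. In the toy data, a school’s poverty share has no direct effect on evaluation scores; it is only correlated with teacher experience, which does affect scores. A regression of scores on poverty alone therefore shows an apparent demographic effect that largely disappears once experience is added as a covariate.

```python
import numpy as np


def ols(y, *predictors):
    """Ordinary least squares fit; returns [intercept, slope_1, slope_2, ...]."""
    X = np.column_stack([np.ones_like(y), *predictors])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef


def simulate(n=5000, seed=0):
    """Hypothetical data: poverty affects scores only through teacher experience."""
    rng = np.random.default_rng(seed)
    # Share of low-income students at each teacher's school (assumed uniform)
    poverty = rng.uniform(0.0, 1.0, n)
    # Confounder: in this toy setup, higher-poverty schools have less experienced teachers
    experience = 15.0 - 10.0 * poverty + rng.normal(0.0, 2.0, n)
    # Evaluation score depends on experience only; poverty has no direct effect
    score = 50.0 + 1.5 * experience + rng.normal(0.0, 5.0, n)
    return poverty, experience, score


poverty, experience, score = simulate()
naive = ols(score, poverty)                 # score ~ poverty
adjusted = ols(score, poverty, experience)  # score ~ poverty + experience
print(f"naive poverty slope:    {naive[1]:.2f}")
print(f"adjusted poverty slope: {adjusted[1]:.2f}")
```

The naive slope should land near -15 (the 1.5 experience effect times the -10 link between poverty and experience), while the adjusted slope should land near zero. Conversely, if a demographic coefficient stays substantial after adjustment, the included covariates do not account for it. That is the pattern this study reports for New Mexico’s measures.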

Results from multiple linear regression analyses indicated that, in general, the measures used to inform New Mexico teachers’ overall evaluation scores still showed evidence of bias by school-level student demographic factors, with VAMs potentially being the most susceptible and classroom observations the least. This study is particularly notable given the juxtaposition of such a highly touted evaluation system with teachers’ legal challenge to its constitutionality. Study findings are important for all education stakeholders to consider, especially as teacher evaluation systems and related policies continue to be transformed.

Copyright Statement