In a paper released Friday, a UC Berkeley professor and a campus consultant found that student course evaluations do not properly measure teaching effectiveness.
Philip Stark, chair and professor of statistics at UC Berkeley, and Richard Freishtat, a senior consultant in the Center for Teaching and Learning at UC Berkeley, examined the statistics behind student course evaluations and determined that, because of imperfect response rates and a flawed averaging process, student evaluations should not be used to gauge teaching effectiveness.
“There is empirical support that student evaluations are, to some extent, reliable, repeatable,” Stark said. “But they are not valid in the sense of measuring what people claim to measure.”
The paper found that students comment on their experience with a class but are typically not equipped to evaluate pedagogy. According to the paper, controlled, randomized experiments show that evaluations are negatively associated with direct measures of teaching effectiveness. Additionally, factors such as the ethnicity, gender and attractiveness of the instructor can influence results.
Since 1975, UC Berkeley has collected student evaluations that ask respondents to rate their instructor’s “overall teaching effectiveness” on a scale from one to seven, according to the paper.
“The question that is asked is one of teaching effectiveness, and the question that seems to be answered is, ‘Did I like the class, or did I like the professor?’ ” said Steven Evans, a professor of statistics and mathematics at UC Berkeley.
Johann Gagnon-Bartsch, a Neyman assistant professor of statistics on campus, cited his experience as a graduate student instructor, or GSI, to illustrate how student response rates alter the results of evaluations. He recalled receiving different evaluations for his 9 a.m. and 11 a.m. sections, adding that just one-third of students would attend the former.
“I’m the same GSI teaching the same stuff,” Gagnon-Bartsch said. “So based on the time of the class, you can get different results.”
The process of averaging the numerical results of course evaluations is also problematic, according to Evans, Gagnon-Bartsch and the paper’s authors.
“One professor might get a bunch of sevens and a bunch of ones because they might teach in a particular way that may be helpful to some students and not to others,” Gagnon-Bartsch said. “Another teacher might get fours. If you average them, you get the same result, but they obviously aren’t the same.”
Stark and Freishtat suggested reporting the distribution of scores, the number of respondents and the response rate rather than the average.
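A hypothetical example, not drawn from the paper, shows why an average alone can mislead; in this sketch, a sharply divided class and a uniformly lukewarm one produce the same mean score on the campus one-to-seven scale:

    # Hypothetical ratings on a 1-to-7 scale; illustrative only, not data from the paper.
    polarized = [7, 7, 7, 1, 1, 1]   # half the class rated the course highly, half poorly
    lukewarm = [4, 4, 4, 4, 4, 4]    # everyone rated the course in the middle

    def mean(scores):
        return sum(scores) / len(scores)

    print(mean(polarized), mean(lukewarm))  # both print 4.0, though the experiences differ sharply

Reporting the full distribution, along with how many students responded, makes that difference visible where a single average hides it.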
In the paper, they also suggest that peer evaluation, in which colleagues observe one another's teaching, should be part of the process.
“In academia, we want everyone to read our papers and books and come to our seminars and talks, but we have a funny squeamish feeling about having colleagues in our classroom,” Stark said. “Looking at each other’s teaching has an enormous beneficial effect.”
While Evans agreed with the points made in the paper, he expressed some concern with the idea of peer evaluations.
“If I see one of my colleagues teach, perhaps I’ll approve of the teaching just because they do things in the same way that I do,” Evans said, adding that it may be challenging to tell peers that their teaching is subpar.
Stark said he hopes the Academic Senate will look more closely at how the campus evaluates teaching effectiveness and engagement.