An agreement-based approach for reliability assessment of Students’ Evaluations of Teaching

Students’ Evaluations of Teaching (SETs) are the most common way to measure teaching quality in Higher Education: they play a strategic role in monitoring teaching quality and inform the major formative and summative academic decisions. The majority of studies investigating SET reliability focus on the instruments and the procedures adopted to collect students' evaluations rather than on the capability of the students as teaching quality assessors. To fill this gap, a study was carried out with the aim of measuring SET reliability in terms of inter-student agreement and intra-student agreement. The results of our study show that the majority of students provided substantially repeatable evaluations, whereas only a few students provided almost perfectly repeatable evaluations; the evaluations provided by different students generally agreed only slightly, which means that the students did not share the same opinions and beliefs on teaching quality.


Introduction
Measuring the student experience is assuming increasing importance in Higher Education (hereafter, HE): it is a widespread method for evaluating teaching quality, and it informs the major formative and summative academic decisions (Berk, 2005; Gravestock & Gregor-Greenleaf, 2008; Onwuegbuzie et al., 2009).
Student ratings, also known as Student Evaluations of Teaching (SETs), have dominated as the primary measure of teaching quality over the past 40 years (e.g., Centra, 1979; Seldin, 1999; Emery et al., 2003; Gaertner, 2014), forming the basis for the rankings of HE institutions. Although widely used, SETs are one of the most controversial and highly debated measures of teaching quality: many researchers argue that there is no better option that provides the same sort of quantifiable and comparable data on teaching quality (McKeachie, 1997; Abrami, 2001), whereas others point out significant biasing factors for SETs.
The fear that students cannot provide reliable teaching quality evaluations is, by far, one of the primary concerns about SETs. As a matter of fact, even highly motivated students can base their current evaluations on their past teaching experience, which can vary substantially depending on the college or university attended and/or on the student's individual belief toward the degree (Ackerman et al., 2009). Students who are generally satisfied/dissatisfied with the course and/or the instruction can bias the results upward/downward (Sliusarenko et al., 2013). In addition, it is known that demographic (e.g., gender and age; Thorpe, 2002; Fidelman, 2007; Kherfi, 2011) as well as logistic (e.g., class size; Kuo, 2007) factors can influence SETs. The above considerations call into question whether students can be considered able to provide reliable evaluations of teaching quality. For this reason, unlike the majority of available studies, which focus on the instruments and the procedures adopted to collect SETs, our study aims at investigating the abilities of the students as teaching quality assessors by measuring SET reliability in terms of inter-student and intra-student agreement. In particular, the former evaluates the students' ability to provide the same score, on average, as the other students, whereas the latter, also known as repeatability, evaluates the students' ability to score a given quality item consistently on different occasions.

Measuring inter-student and intra-student agreement: kappa-type indexes
The easiest approach for assessing the degree of agreement among repeated evaluations would be to simply calculate the observed agreement. This approach, however, provides a biased measure of agreement, especially when a rating scale with few categories is adopted. In order to avoid this problem, inter-student and intra-student agreement are assessed using the well-known kappa-type indexes, where the observed agreement is corrected for the agreement expected by chance.

Specifically, the degree of inter-student agreement is assessed by calculating the statistic proposed by Marasini et al. (2014), that is, a rescaled measure of the probability of observed agreement $p_a^{s}$ corrected with the probability of agreement expected by chance alone $p_{a|c}^{s}$:

$$\hat{s} = \frac{\hat{p}_a^{s} - p_{a|c}^{s}}{1 - p_{a|c}^{s}}$$

Let $r$ be the number of students who rated twice (i.e., in two replications) the same $n$ quality items on a $k \geq 3$ point ordinal scale; $r_{hi}$ and $r_{hj}$ the number of students who assigned the $h$-th quality item to the $i$-th and $j$-th category during the first and second replication, respectively; and $w_{ij}$ the corresponding weight, introduced to account for the fact that some disagreements (i.e., on categories that are at least two steps apart) are more serious than others (i.e., on neighboring categories). The observed proportion of agreement is obtained by averaging over the quality items:

$$\hat{p}_a^{s} = \frac{1}{n}\sum_{h=1}^{n}\hat{p}_h$$

where $\hat{p}_h$ is the proportion of agreement on the $h$-th quality item, computed from the counts $r_{hi}$ and $r_{hj}$ and the weights $w_{ij}$.

The degree of intra-student agreement, instead, is assessed using the weighted version of the Brennan-Prediger coefficient (1981) proposed by Gwet (2014), that is, a rescaled measure of the probability of observed agreement $p_a$ corrected with the probability of agreement expected by chance alone $p_{a|c}$, of the form $(\hat{p}_a - p_{a|c}) / (1 - p_{a|c})$. The chance measurement system adopted in the Brennan-Prediger coefficient is the uniform one. Let $n$ be the number of quality items rated twice on a $k \geq 3$ point ordinal scale by the same student and $n_{ij}$ the number of quality items classified into the $i$-th category in the first replication and into the $j$-th category in the second replication; the observed proportion of agreement $\hat{p}_a$ and the proportion of agreement expected by chance alone $p_{a|c}$ are:

$$\hat{p}_a = \frac{1}{n}\sum_{i=1}^{k}\sum_{j=1}^{k} w_{ij}\, n_{ij}, \qquad p_{a|c} = \frac{1}{k^{2}}\sum_{i=1}^{k}\sum_{j=1}^{k} w_{ij}$$

The values of kappa-type indexes range between -1 and 1, with negative values meaning disagreement. The index magnitude can be interpreted by adopting the Landis and Koch (1977) benchmark scale. According to this scale, there are 5 categories of agreement corresponding to as many ranges of coefficient values: slight, fair, moderate, substantial and almost perfect agreement for coefficient values ranging between 0 and 0.2, 0.21 and 0.4, 0.41 and 0.6, 0.61 and 0.8, and 0.81 and 1.0, respectively.
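As an illustration of the intra-student index, the following is a minimal Python sketch of a weighted Brennan-Prediger coefficient with uniform chance agreement. It assumes linear agreement weights of the usual form $w_{ij} = 1 - |i - j|/(k - 1)$, categories coded as integers from 0 to k-1, and NumPy; the function names `linear_weights` and `weighted_brennan_prediger` are hypothetical and are not taken from the study.

```python
import numpy as np


def linear_weights(k):
    """Linear agreement weights (assumed form): w_ij = 1 - |i - j| / (k - 1)."""
    idx = np.arange(k)
    return 1.0 - np.abs(idx[:, None] - idx[None, :]) / (k - 1)


def weighted_brennan_prediger(first, second, k):
    """Weighted Brennan-Prediger coefficient for a single student.

    first, second: category indices (0 .. k-1) assigned to the same n
    quality items in the first and second replication.
    """
    first = np.asarray(first)
    second = np.asarray(second)
    w = linear_weights(k)
    n = first.size

    # n_ij: number of items put in category i first and category j second.
    counts = np.zeros((k, k))
    np.add.at(counts, (first, second), 1)

    p_a = (w * counts).sum() / n   # weighted observed agreement
    p_ac = w.sum() / k**2          # chance agreement under a uniform model
    return (p_a - p_ac) / (1.0 - p_ac)


if __name__ == "__main__":
    # Six items on a 4-point scale; only one item moves by one category.
    s1 = [0, 1, 2, 3, 3, 2]
    s2 = [0, 1, 2, 3, 2, 2]
    print(round(weighted_brennan_prediger(s1, s2, k=4), 4))  # 0.8667
```

With four categories and linear weights the chance term equals 7/12, so even a single one-step shift over six items leaves the coefficient well below the raw weighted agreement of 17/18, which is the correction the kappa-type indexes are meant to provide.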

Case Study
The case study was conducted at the Department of Industrial Engineering of the University of Naples "Federico II" and consisted of 3 supervised experiments (hereafter, E.1, E.2, E.3) carried out on classes of students attending the course of Statistical Quality Control (SQC) in 3 successive academic years. All three classes included more than 20 students, all of whom had obtained the first-level degree in Management Engineering from the University of Naples "Federico II"; they can thus be reasonably assumed to be homogeneous in curriculum and instruction.
Students were asked to fill in two evaluation sheets (each with a specific rating scale) in order to collect their quality evaluations for a set of items (regarding, for example, organization, workload and readings) of the SQC course they were attending. The first evaluation sheet used a Numeric Rating Scale (NRS) with scores ranging from 0 to 10, whereas the other used a Verbal Rating Scale (VRS) with four agreement grades: "strongly disagreeing with the statement", "slightly agreeing with the statement", "quite agreeing with the statement" and "strongly agreeing with the statement". For comparability purposes, students' evaluations on the NRS were rescaled to the 4-point VRS using the following cut-off ranges: 0 to 2, 3 to 5, 6 to 8 and 9 to 10.
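The rescaling of NRS scores to the 4-point scale can be sketched as follows. This is only an illustration of the cut-off ranges stated above; the function name `nrs_to_vrs` and the coding of the four grades as integers 1 to 4 are assumptions made for the example.

```python
def nrs_to_vrs(score):
    """Map an integer NRS score (0-10) to a 4-point category (1-4).

    Cut-off ranges: 0-2 -> 1, 3-5 -> 2, 6-8 -> 3, 9-10 -> 4.
    The 1-4 coding of the verbal grades is an assumption of this sketch.
    """
    if not 0 <= score <= 10:
        raise ValueError(f"NRS score out of range: {score}")
    if score <= 2:
        return 1
    if score <= 5:
        return 2
    if score <= 8:
        return 3
    return 4


if __name__ == "__main__":
    print([nrs_to_vrs(s) for s in range(11)])
    # [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4]
```

Note that the ranges are not of equal width: the first three categories each span three NRS scores, while the top category spans only two.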
Each experiment consisted of two sessions: the first evaluation session (i.e., S.I) took place at mid-term and the second evaluation session (i.e., S.II) took place at the following lesson. Between S.I and S.II there was no new lesson and no interaction with the teacher; therefore, no change in quality evaluation was expected. In order to guarantee evaluation traceability while preserving anonymity, each student signed her/his evaluation sheets with a nickname, which made it possible to match the ratings provided by the same student in the two evaluation sessions and thus to estimate intra-student agreement. Only those students who rated all quality items in both experimental sessions were retained as participants in the study (viz. 17 students in E.1, 18 students in E.2 and 17 students in E.3).
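The matching and retention rule described above can be sketched in a few lines of Python. The data layout (one dictionary per session, keyed by nickname, with `None` marking an unrated item) and the function name `match_sessions` are assumptions made for illustration, not the procedure actually used in the study.

```python
def match_sessions(session_1, session_2, n_items):
    """Pair each student's ratings across the two sessions by nickname.

    session_1, session_2: dicts mapping nickname -> list of ratings,
    where None marks an item the student did not rate (assumed layout).
    Only students who rated all n_items in both sessions are retained.
    """
    matched = {}
    # Nicknames that appear in both sessions.
    for nick in session_1.keys() & session_2.keys():
        r1, r2 = session_1[nick], session_2[nick]
        complete = all(len(r) == n_items and None not in r for r in (r1, r2))
        if complete:
            matched[nick] = (r1, r2)
    return matched


if __name__ == "__main__":
    s1 = {"fox": [3, 4, 2], "owl": [1, None, 2], "bee": [4, 4, 4]}
    s2 = {"fox": [3, 3, 2], "owl": [1, 2, 2], "cat": [2, 2, 2]}
    print(match_sessions(s1, s2, n_items=3))
    # {'fox': ([3, 4, 2], [3, 3, 2])}
```

In this toy example "owl" is dropped for a missing rating and "bee" and "cat" are dropped because they attended only one session, which mirrors how the class sizes above 20 reduce to 17 or 18 participants per experiment.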
The collected data were used to estimate the inter-student agreement on the NRS and on the VRS (hereafter, $\hat{s}_{NRS}$ and $\hat{s}_{VRS}$, respectively), together with the corresponding intra-student agreement on each of the two scales; the intra-student agreement coefficients were both computed adopting the linear weighting scheme (Cicchetti & Allison, 1971).

Study results
The values of $\hat{s}_{NRS}$ and $\hat{s}_{VRS}$ for E.1, E.2 and E.3 are reported in Table 1. The results in Table 1 highlight that the inter-student agreement is at most moderate, so it is not possible to assume that the involved students shared the same opinions about teaching quality. The difference between the two rating scales is negligible only for the students of E.1; however, the results do not allow preferring one rating scale over the other.
The intra-student agreement was generally higher than the inter-student agreement: 73% of students were at least substantially repeatable on both NRS and VRS, whereas 19% of them were even almost perfectly repeatable on both NRS and VRS. In addition, for the majority of students across the three years, the repeatability values on the two rating scales belong to the same agreement category, and only for a few (i.e., 10) students do they belong to non-adjacent categories of agreement.
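The classification behind these percentages can be sketched as follows: each intra-student coefficient is mapped to a Landis and Koch category, and the categories obtained on the two scales are then compared. The function names, the treatment of values falling between two published ranges (e.g., 0.205 is assigned to the upper category) and the "disagreement" label for negative values are assumptions of this sketch.

```python
import bisect

LABELS = ["slight", "fair", "moderate", "substantial", "almost perfect"]
UPPER_BOUNDS = [0.2, 0.4, 0.6, 0.8]  # inclusive upper ends of the first four ranges


def category_index(value):
    """Index of the Landis-Koch category (0-4); -1 for negative values."""
    if value < 0:
        return -1  # assumption: negative coefficients are treated as disagreement
    return bisect.bisect_left(UPPER_BOUNDS, value)


def landis_koch(value):
    """Verbal label of the agreement category for a coefficient value."""
    idx = category_index(value)
    return "disagreement" if idx < 0 else LABELS[idx]


def compare_scales(nrs_value, vrs_value):
    """Say whether the two coefficients fall in the same, adjacent or
    non-adjacent agreement categories."""
    gap = abs(category_index(nrs_value) - category_index(vrs_value))
    if gap == 0:
        return "same"
    return "adjacent" if gap == 1 else "non-adjacent"


if __name__ == "__main__":
    print(landis_koch(0.87), landis_koch(0.61))  # almost perfect substantial
    print(compare_scales(0.87, 0.65))            # adjacent
    print(compare_scales(0.90, 0.50))            # non-adjacent
```

Counting how many students return "substantial" or "almost perfect" on both scales, and how many return "non-adjacent" from `compare_scales`, would give shares of the kind reported above.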

Figure 1. Intra-student agreement on NRS (as abscissa) and VRS (as ordinate) for each student participating in E.1 (on the left), E.2 (in the middle) and E.3 (on the right).

Table 1. Inter-student agreement on NRS and VRS
The results for intra-student agreement for each student participating in E.1, E.2 and E.3 are reported in Table 2 and plotted in Figure 1 against the 5 regions of intra-student agreement on NRS and intra-student agreement on VRS identified according to the Landis and Koch benchmark scale.