# -*- coding: utf-8 -*-
import matplotlib.pyplot as plt
from wordcloud import WordCloud, STOPWORDS

#%%
f = open('output_file.txt', 'r', encoding="utf8")
text = f.read()
f.close()

# Stopwords
stop = []
# f = open('stop-words-spanish-snowball-mod.txt', 'r', encoding="utf8")
# stop += f.read().split()
# f.close()
# f = open('spanish_stopwords.txt', 'r', encoding="utf8")
# stop += f.read().split()
# f.close()

wordcloud = WordCloud(stopwords=STOPWORDS.union(set(stop)),
                      background_color='#3c3c3c',
                      width=1800,
                      height=1400,
                      max_words=100,
                      colormap="magma",
                      #font_path='./CabinSketch-Bold.ttf'
                      )
wordcloud.generate(text)

plt.figure(figsize=(9, 7))
plt.imshow(wordcloud)
plt.axis('off')
plt.tight_layout()
plt.savefig('word_cloud-SLR.png', dpi=300, transparent=True)
plt.show()
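If the Spanish stop-word lists referenced in the commented-out block are available, they can be loaded a little more defensively. This is a minimal sketch only, assuming the two file names above refer to plain whitespace-separated word lists:

from pathlib import Path

def load_stopwords(*paths):
    """Merge stopwords from whitespace-separated text files, skipping missing ones."""
    words = set()
    for path in map(Path, paths):
        if path.exists():
            words |= set(path.read_text(encoding="utf8").split())
    return words

# Hypothetical usage with the file names from the commented-out lines above:
# stop = load_stopwords('stop-words-spanish-snowball-mod.txt', 'spanish_stopwords.txt')
# wordcloud = WordCloud(stopwords=STOPWORDS.union(stop), colormap="magma")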
Refocusing on the Traditional and Effective Teaching Evaluation:
Rational Thoughts About SETEs in Higher Education

Guo Cui, Panzhihua University
Zhong Ni, Huaihua University
Camilla Hong Wang (Corresponding Author), Shantou University

Some higher educational institutions use a student evaluation of teaching effectiveness (SETE) as the only way to evaluate teaching. Unfortunately, this instrument often fails to serve as a tool for improving instruction and frequently acts as a disincentive to introducing rigor. Studies have found that student feedback alone is not a sufficient basis for evaluating teaching. This paper reviews the literature on using student evaluations to measure teaching effectiveness, highlights the problems involved, and offers suggestions for improving SETEs and refocusing teaching effectiveness on outcome-based academic standards.

Keywords: SETE, teaching interaction, teaching evaluation, performance assessment
INTRODUCTION

Student evaluation of teaching effectiveness (SETE) originated in the United States (Zhou, 2009). Experts who support SETE believe that students' evaluation of teachers' teaching is objective (Zhang, Ma, and Jiang, 2017): from the students' perspective, the teaching effect can reflect classroom quality and can be used as the primary method to evaluate teaching quality in universities and vocational colleges (Wang and Yu, 2016). However, some scholars argue that if SETE is used alone, without being combined with other evaluation bases, students effectively become the decision-makers in teachers' appointment, evaluation, promotion, and salary increases (Uttl, White, & Gonzalez, 2017). Others argue that if teachers are evaluated by student satisfaction, students are directly empowered to assess teaching effectiveness, which would significantly lower teaching quality (Emery, Kramer, & Tian, 2003).

Many universities and higher vocational schools regard students as consumers rather than products (Emery & Tian, 2002). As a result, SETE tends to reflect the popularity of teachers rather than the actual quality of teaching. SETE results are subject to many factors and do not depend entirely on teachers' teaching levels and effectiveness. A study by Chang et al. found that students' "attitude toward teaching evaluation," "attitude toward learning," and "attitude toward the course" significantly affected the measurement error of SETE (Dong, 2014). The authors argue that the existing SETE-based evaluation method can hardly improve the level of teaching, so it is necessary to examine the advantages and disadvantages of the current SETE method through literature analysis and cases.
LITERATURE REVIEW

SETE was embraced by U.S. college and higher vocational education administrators as early as the 1960s and has been prevalent in U.S. higher education for more than 50 years because of its practicality, sophistication, and accessibility. However, SETE is neither the only nor the best way to assess the quality of teaching and learning. Below, research cases on the reliability and validity of SETE are analyzed along several dimensions.
Personal Traits and Popularity

Most educational researchers believe that SETE essentially has little to do with teaching itself. In some courses, the same materials and assessment methods are used but different instructors teach them, and the assessed teaching effectiveness differs across instructors. Several Chinese and foreign scholars have reached conclusions supporting these ideas (Dooris, 1997; Xie & Zhang, 2019; Guan, 2012; Wu, 2013; Zhong, 2012; Aleamoni, 1987). Research findings indicate that teachers' performance significantly affects SETE results but not student achievement (Feldman, 1978). When completing SETEs, students often base their evaluations on teachers' personal attributes (Abrami, Leventhal & Perry, 1982). Feldman noted a positive correlation between teacher personality and assessment results when evaluations are based on what students or colleagues know about the teachers (Feldman, 1978). Abrami et al. have suggested that schools should not decide teacher promotion and tenure based solely on SETE, because teachers who are popular with students receive good SETE scores regardless of teaching ability. Relying on SETE alone to assess teaching quality is therefore academically problematic (Abrami, Leventhal & Perry, 1982).
Student Achievement

Numerous studies have shown that student achievement bears little relation to actual evaluation results of teaching effectiveness. Cohen noted that the variation in overall SETE results attributable to differences in student achievement was only 14.4% (Cohen, 1983). Dowell and Neal suggested that the correlation between student achievement and SETE results was only 3.9% (Dowell & Neal, 1982). In a broader study, Damron noted that SETE scores were not related to teachers' ability to improve student achievement. If the weight of classroom satisfaction in SETE results were increased, teachers would receive lower evaluation scores, potentially depriving them of opportunities for promotion, salary increases, or even reappointment (Damron, 1996).
Situational Factors and Effectiveness

Some researchers have proposed that situational factors can interfere with SETE (Damron, 1996), making the results unrepresentative (Cohen, 1983). Cashin noted that there is a sizeable disciplinary bias in SETE: some surveys suggest that teachers in the arts and humanities consistently score higher on SETE, while teachers in business, mathematics, and engineering consistently score lower. In addition, differences between compulsory and optional courses, and between senior and junior students, may affect the evaluation results (Aleamoni, 1989). The amount and intensity of coursework can also influence students' evaluations of a teacher. Consider a faculty member at a university who teaches an introductory course. Because the course uses a collectively developed syllabus, there is no coursework, only three multiple-choice exams. As a result, students give the teacher high evaluations every year, with scores above the college average. Two other courses taught by the same teacher receive low evaluations from students because the teacher developed the syllabi individually and assigns more coursework.

It should be noted that this teacher is the leading scholar for these two courses and authored the textbook used, so the teacher is thoroughly familiar with the content, yet receives poor evaluations simply because of the large amount of coursework. In one of these courses, the average student evaluation score was 73, but the standard error was as high as 35, which makes one wonder what the validity of such a teaching evaluation is.
Assessors

The issue of assessors in SETE also deserves attention. Assessors who are not familiar with the assessment system may be misled by useless data and draw conclusions that deviate from the facts. The evaluation of teaching effectiveness should rest on sound statistics: any sample of fewer than 30 respondents is a small sample and requires appropriate statistical methods. An unscientific statistical approach can lead to three types of errors: first, the data processing itself is unsound; second, assessors confuse critical and non-critical difference factors; and third, assessors cannot reasonably explain the differences among respondents or identify the sources of those differences. Therefore, college administrators should master sound statistical analysis theories and methods (Zhong, 2012).
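To make the small-sample point concrete, a common approach is to report a t-based confidence interval for a class's mean rating rather than the raw average alone. The following is a minimal sketch, not taken from the paper, using hypothetical ratings from a class of 12 respondents:

import numpy as np
from scipy import stats

def mean_ci(scores, confidence=0.95):
    """t-based confidence interval for the mean of a small sample (n < 30)."""
    scores = np.asarray(scores, dtype=float)
    n = scores.size
    mean = scores.mean()
    sem = stats.sem(scores)  # standard error of the mean (ddof=1)
    margin = sem * stats.t.ppf((1 + confidence) / 2, df=n - 1)
    return mean, mean - margin, mean + margin

# Hypothetical SETE ratings on a 1-5 scale from a small class.
ratings = [5, 4, 5, 3, 4, 2, 5, 4, 3, 5, 4, 1]
print(mean_ci(ratings))  # a wide interval shows how little the raw mean says on its own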
Qualifications

Many researchers argue that students who lack critical thinking skills cannot properly assess teachers. Accordingly, most researchers hold that SETE can serve as a teaching evaluation only to the extent that the students are qualified to evaluate (Wu, 2004). It has also been proposed that assessors receive appropriate training before evaluating (Aleamoni, 1989). Conversations between assessors are generally protected from defamation suits, a protection grounded in fundamental civil rights (Cascio and Bernardin, 1981); however, if unqualified assessors nevertheless assess others, those being assessed may sue the assessors for defamation (Chen, 2012).
CASE ANALYSIS

The literature review revealed that administrators' practice of using SETE as the sole basis for decisions about faculty promotions and salary increases has been widely resented and opposed by faculty. The following cases, analyzed from teachers' and students' perspectives, illustrate why this approach needs to be rationalized.
Case 1. What Is Excellent Teaching?

A professor at a university in the United States had a SETE average of 4.25 (out of 5) in the first semester, 4.23 in the second semester, and 4.21 in the third semester. The professor constantly reflected on his teaching and made improvements over those three semesters, but his SETE scores were always below average. The professor was recognized as an outstanding faculty member, with excellent results on all aspects of the performance evaluation. However, based on his SETE score, he was not given the Excellence in Teaching Award; the award went instead to another professor who had a high SETE score but performed poorly on the performance evaluation. This phenomenon was brought to the president's attention, who became aware that the SETE system was flawed (Emery, Kramer & Tian, 2003).

It is also worth noting that the professor's scores were all above 4.0. In this regard, the authors question what counts as "good" if a score higher than 4.0 out of 5 is considered not good. If other factors are not considered, how should SETE scores be interpreted? And if these so-called "other factors" are more influential than SETE, why is the SETE method used to assess teaching and learning at all?
Case 2. Differences in Scores of Different Classes Taught by the Same Professor

A professor at Anhui University of Finance and Economics taught four classes in one semester. His SETE score in one class was 94.33 (out of 100), which ranked 6th in the university, while his score in another class was 62.5, the lowest in the university. In other words, the same professor was considered by one class to be one of the best teachers in the university, while students in another class considered him one of the worst. If SETE were an indicator of the actual situation, the scores of the same professor should be very close. Such a significant contrast calls into question the objectivity and validity of SETE (Dong, 2014).
Case 3. Differences From the Control Group

A professor at a U.S. university who was not yet tenured received scores of 4.10 and 4.24 in the two classes he taught in the fall semester. In the following spring semester, he taught the same course at the same university and scored 4.04 and 4.33 in his two classes. The average score for the entire university was 3.99 in the fall semester and 4.31 in the spring semester. Compared longitudinally, the professor's scores differed little between the two semesters; compared to the school average, however, his teaching performance appeared worse in the spring semester than in the fall. Could this be attributed to an improvement in teaching quality throughout the university during the spring semester? The answer is no. To a large extent, the difference depends on the composition of the faculty participating in SETE: in the fall semester, all faculty members were required to take part in SETE, whereas in the spring semester only non-tenured professors and teaching assistants were required to do so (Emery, Kramer & Tian, 2003).

Many researchers believe that teaching assistants are often more "likely" to meet student expectations and, therefore, more likely to receive high scores. In addition, because SETE has a significant impact on their careers, non-tenured professors tend to make more effort to gain favor with students and thus earn higher scores. Both factors push up SETE scores for the university as a whole. Since SETE scores have little impact on their careers, tenured professors have no need to please their students to obtain higher evaluations; therefore, the overall average score decreases when tenured professors are also involved in the SETE process. This phenomenon is quite common in U.S. colleges and universities. Does this mean that tenured and experienced professors should be considered inferior teachers (Feldman, 1986)?
Case 4. Score Differences and Teachers' Teaching Styles

A researcher from Nanjing Communications Institute of Technology analyzed the correlation between interviewed teachers' personality traits and their SETE results, based on surveys and interviews with full-time teachers in several higher vocational colleges and universities, and developed a comparison table of teaching style indicators. SETE scores are relatively low for teachers who are demanding about student attendance and classroom discipline, and high for teachers who are not. SETE scores are also lower for teachers whose classroom style or appearance is rigorous and formal, and higher for teachers whose style is not, as shown in Table 1 (Schmelkin, Spencer & Gellman, 1997).
TABLE 1
COMPARISON TABLE OF TEACHING STYLE INDICATORS FOR TEACHERS WITH SIGNIFICANT DIFFERENCES IN SETE SCORES

Attendance and classroom discipline
- Faculty group with lower SETE scores: Be strict in attendance. Teacher and student are like father and son, and the teacher should criticize the student if they make mistakes deserving criticism.
- Faculty group with higher SETE scores: Teachers are not necessarily rigorous; teachers and students are like friends, and teachers should be tolerant of students when tolerance is called for.

Classroom style/teaching manner and appearance
- Faculty group with lower SETE scores: The teachers are strict and severe and dress traditionally or with slight variation.
- Faculty group with higher SETE scores: The teachers are relaxed and lively (female) or humorous (male), and dress in fashionable and neat styles.

Classroom communication and break-time interaction
- Faculty group with lower SETE scores: Teachers maintain the dignity of the teacher, keep a psychological distance between teacher and student, and hold "orthodox" values.
- Faculty group with higher SETE scores: Teachers and students are friends; teachers can comment on fashion or criticize current affairs and communicate with students without distance.

Extracurricular communication and life interactions
- Faculty group with lower SETE scores: Teachers rarely communicate with students outside of class and do not communicate with them on matters other than academic work.
- Faculty group with higher SETE scores: Teachers want students to talk to them, even if it is not related to their studies.

Examination standards and requirements
- Faculty group with lower SETE scores: Teachers should not leave students unattended and should not lower their standards to cater to them, or else the quality of graduates is bound to decline.
- Faculty group with higher SETE scores: Teachers should "teach students according to their abilities" so that students' performance can be reasonably distributed and as many "good students" as possible can emerge.

Teaching and research/teaching preferences
- Faculty group with lower SETE scores: The teachers prefer academic research, are willing to teach cutting-edge educational theories, and are meticulous in deriving formulas.
- Faculty group with higher SETE scores: The teachers are skilled in case-study or scenario-based teaching and enjoy writing school-based textbooks, reference books, or teaching casebooks.
Case 5. Students' Use of the Right to Evaluate Teaching at a University

A random sample of 350 students at a university was surveyed on how students evaluate their teachers. The results showed that 68% of the students said they evaluated their teachers based on how much they liked them; in other words, 68% of the students valued the teacher's personality more than basic teaching skills or effectiveness. At the same time, 47% of the students surveyed admitted to a disciplinary bias when evaluating their teachers: a student who prefers music to physical education is likely to give a higher rating to the music teacher and a lower rating to the physical education teacher (see Table 2).
TABLE 2
QUESTIONNAIRE FOR STUDENTS' EVALUATION OF TEACHERS

Question: I do not attach much importance to the final course evaluation, and I do not think it has much influence on the teachers.
- I agree / I strongly agree: 181 respondents (51.7%)
- I don't know: 83 respondents (23.7%)
- I can't entirely agree / I strongly disagree: 82 respondents (23.4%)

Question: The mechanism of student evaluation of teachers weakens the authority of teachers.
- I agree / I strongly agree: 179 respondents (51.1%)
- I don't know: 104 respondents (29.7%)
- I can't entirely agree / I strongly disagree: 62 respondents (17.7%)

*Only valid data were selected.
To ensure the rigor and accuracy of the study, a questionnaire on the credit system and teacher evaluation was distributed to students to explore the relationship between course evaluation, teachers, and students in a quantitative way, complemented by in-depth interviews. We found that teacher evaluation did not seem to have the desired effect. As shown in Table 2, more than half (51.7%) of the students thought that course evaluation had little impact on teachers, while only 23.4% disagreed with this statement. Since most students do not think that course evaluation has much impact on teachers, they can hardly take course evaluation seriously. Students may therefore give teachers positive or negative comments casually, discouraging teachers' motivation and weakening the teacher-student relationship.

In addition, more than half (51.1%) of the students agreed with the statement that "the mechanism of student evaluation of teachers weakens the authority of teachers," and only 17.7% disagreed. This result is highly consistent with our interviews with some teachers and indicates that most students believe student assessment of teachers' courses can undermine teachers' sense of authority. It can be inferred from both teachers and students that teachers' authority has been weakened by the SETE mechanism, which is far removed from the value of "a teacher for a day is a father for a lifetime" in traditional Chinese culture, and this has a significant negative impact on the teacher-student relationship in colleges and universities.

At the same time, the in-depth interviews showed that 74% of students would change their opinion of a teacher, and thus their evaluation score, if they received some special benefit from the teacher outside of teaching. A teacher who treats students to chocolate, for example, increases student favorability and receives higher scores on student evaluations, which is highly consistent with Professor Emery's findings (Emery & Tian, 2002). It is also interesting to note that 52% of the students did not evaluate teaching based on the teacher's actual performance but simply gave the teacher a full 5 out of 5. This group of students gave two reasons for their scoring: some think the teachers work very hard and should be recognized and appreciated; others believe it is simply convenient to give all 5s and complete the SETE task quickly.
DISCUSSION

Many scholars believe that the SETE method has more disadvantages than advantages: (1) SETE tends to reward mediocrity and discourages risk-taking. (2) It focuses on short-term performance, lacks a long-term perspective, and ignores critical factors that are not easily measured. (3) It focuses on individuals and is not conducive to teamwork. (4) It is based on detection rather than prevention. (5) It is unfair, and the assessment is highly subjective. (6) It does not distinguish between endogenous factors reflecting individual differences and exogenous factors that are not under human control (Huang and Qi, 2014; Trout, 2000; McGregor, 1972; Meyer, Kay, and French, 1965).

American scholars Milliman and McFadden found in one study that 90% of GM employees considered themselves to be among the top 10% of employees in the company. The two scholars then asked these employees whether their motivation would be seriously undermined if managers did not evaluate their performance highly; the implication is that unfavorable ratings do undermine motivation, so the scientific evaluation of employee performance has a significant impact on a company's labor productivity. Likewise, if employees are allowed to evaluate their supervisors in the reverse direction, supervisors' managerial motivation can be seriously affected and, as a result, the company's labor productivity suffers (Milliman & McFadden, 1997). For these reasons, Deming strongly condemned such performance evaluation procedures (Deming, 1986). The expectancy model of motivation developed by the human resource management scholars Porter and Lawler explains why this matters: if employees do not believe that "the harder they work, the greater the reward," they will not work as hard as they should and will lose their way (Porter & Lawler, 1968).
In our opinion, the evaluation of teaching has two primary purposes: to serve as a basis for reward and punishment, and to serve as a reference for development. In evaluation for reward and punishment, the results are used as the basis for teachers' promotion and salary increases; in evaluation for development, the results are used as reference and suggestions for teachers to improve their teaching and enhance their teaching skills. However, from our observation and research, in China's universities reward and punishment overwhelm development in practice, and teaching evaluation functions more as a convenient means of administrative control. As a result, teachers who genuinely want feedback from students to improve their teaching seek alternative approaches.

We also believe that the greatest value of evaluating teaching is to provide a platform for teachers and students to communicate with each other. In implementing the evaluation system, school administrators must make clear that evaluation scores should be used only as a reference for teachers to improve their teaching; they should not be used as the basis for appraisal and promotion, or at least not as the only or primary basis. By analogy, business managers may provide employees with feedback on their work through performance appraisal so that employees are aware of their strengths and weaknesses, and to a certain extent performance appraisals help companies make decisions related to employee management. We believe that the primary purpose of SETE for educational administrators should likewise be to provide information and feedback, not to serve as the basis for decisions about teachers' promotion. Refocusing on the essence of teaching in higher education and attaching importance to the practical effectiveness of education is the key to the sustainable development of teaching evaluation (Tan, 2014).
CONCLUSIONS AND RECOMMENDATIONS

The SETE approach as widely used today in effect rewards teachers for earning high SETE scores by catering to students, thereby lowering expectations for students and diminishing the quality of teaching (Emery, Kramer, and Tian, 2003; Zhong, 2012; Feldman, 1986; Tan, 2014). The purpose of teaching evaluation is to help teachers improve their performance, yet in practice administrators use it to make decisions about the fate of teachers (Abrami, d'Apollonia & Cohen, 1990). Worse still, many colleges and universities have adopted various means and regulations to push students into teaching evaluation: some universities require students to evaluate their teachers before they can check their final grades, others require students to evaluate their teachers before they can enroll in a course, and others let a failure to evaluate teachers affect students' final grades. We believe that performance evaluation is necessary for making decisions about individual teachers, but SETE results should serve only as a reference factor, not as a determinant. In this regard, some recommendations for management are proposed:
(1) The SETE method should be oriented toward teaching performance rather than student satisfaction; at the same time, the sources of evaluation data should be broadened, and SETE results should not be used as the sole basis for measuring teaching quality.
(2) Teachers should be evaluated against explicit criteria, not just through cross-sectional comparisons between universities, and comparisons of course evaluations should be made between similar courses.
(3) The measures should be feasible and the data statistically sound. If a student gives a grade below satisfactory, the student should be asked to write a comment to add credibility to the negative assessment.
(4) Assessors and third-party monitors should be trained to ensure that the evaluation system is legitimate, adaptable, and diverse.
(5) Graduates can be invited to evaluate their former teachers. When there is no longer a stake between teachers and students, and graduates are more mature because of their social experience, the evaluation will be more objective, fair, and rational.
In short, we should all hold to the principle that teachers are responsible for teaching and students are accountable for their own success. Likewise, we should encourage evaluation procedures that judge professors on their teaching performance. Teaching is essentially an interpersonal interaction and cannot be separated from students' perceptions of the teacher's characteristics; nevertheless, teaching evaluation must be based on teaching performance, with all other factors treated as secondary and supplementary.
REFERENCES

Abrami, P.C., d'Apollonia, S., & Cohen, P.A. (1990). Validity of student ratings of instruction: What we know and what we do not. Journal of Educational Psychology, 82(2), 219–231.
Abrami, P.C., Leventhal, L., & Perry, R.P. (1982). Educational seduction. Review of Educational Research, 32, 446–464.
Aleamoni, L. (1987). Student rating: Myths versus research facts. Journal of Personnel Evaluation in Education, 1, 111–119.
Aleamoni, L. (1989). Typical faculty concerns about evaluation of teaching. In L.M. Aleamoni (Ed.), Techniques for Evaluating and Improving Instruction. San Francisco, CA: Jossey-Bass.
Cascio, W.F., & Bernardin, H.J. (1981). Implications of performance appraisal litigation for personnel decisions. Personnel Psychology, 34, 211–226.
Cashin, W.E. (1989). Defining and evaluating college teaching. IDEA Paper No. 21, Center for Faculty Evaluation and Development, Kansas State University, Manhattan, KS.
Cashin, W.E. (1990). Students do rate different academic fields differently. In M. Theall & J. Franklin (Eds.), Student Ratings of Instruction: Issues for Improving Practice. San Francisco, CA: Jossey-Bass.
Cashin, W.E. (1996). Developing an effective faculty evaluation system. IDEA Paper No. 33, Center for Faculty Evaluation and Development, Kansas State University, Manhattan, KS.
Chen, Q. (2012). On the development path of civil rights protection in the United States. The Journal of Shandong Agricultural Administrators' College, 6, 71–73.
Cohen, P.A. (1983). Comment on a selective review of the validity of student ratings of teaching. Journal of Higher Education, 54, 448–458.
Damron, J.C. (1996). Instructor personality and the politics of the classroom. Douglas College, New Westminster, British Columbia, Canada.
Deming, W.E. (1986). Out of the Crisis. Cambridge, MA: MIT Center for Advanced Engineering Study.
Dong, G.C. (2014). A study of non-classroom factors in SETE. Higher Education Exploration, 2, 104–106.
Dooris, M.J. (1997). An analysis of the Penn State student rating of teaching effectiveness. A report presented to the University Faculty Senate of the Pennsylvania State University.
Dowell, D.A., & Neal, J.A. (1982). A selective review of the validity of student ratings of teaching. Journal of Higher Education, 53, 51–62.
Dowell, D.A., & Neal, J.A. (1983). The validity and accuracy of student ratings of instruction: A reply to Peter A. Cohen. Journal of Higher Education, 54, 459–463.
Emery, C., & Tian, R. (2002). Schoolwork as products, professors as customers: A practical teaching approach in business education. Journal for Business Education, 78(2), 97–102.
Emery, C.R., Kramer, T.R., & Tian, R.G. (2003). Return to academic standards: A critique of student evaluations of teaching effectiveness. Quality Assurance in Education, 11(1), 37–46.
Feldman, K.A. (1978). Course characteristics and college students' ratings of their teachers: What we know and what we don't. Research in Higher Education, 9, 199–242.
Feldman, K.A. (1986). The perceived instructional effectiveness of college teachers as related to their personality and attitudinal characteristics: A review and synthesis. Research in Higher Education, 24, 139–213.
Guan, H.H. (2012). An empirical study on the effectiveness of SETE in Ningde Normal University. Journal of Ningde Normal University, 3, 103–109.
Huang, T.Y., & Qi, H.X. (2014). An analysis of the factors influencing SETE based on individual teachers' perspectives. Education and Vocation, 3, 103–105.
McGregor, D. (1972). An uneasy look at performance appraisal. Harvard Business Review, pp. 19–27.
Meyer, H.H., Kay, E., & French, J.R. (1965). Split roles in performance appraisal. Harvard Business Review, pp. 28–37.
Milliman, J.F., & McFadden, F.R. (1997). Toward changing performance appraisal to address TQM concerns: The 360-degree feedback process. Quality Management Journal, 4(3), 44–64.
Mohrman, A.M. (1989). Deming versus performance appraisal: Is there a resolution? Center for Effective Organisations. Los Angeles, CA: University of Southern California.
Porter, L.W., & Lawler, E.E. (1968). Managerial Attitudes and Performance. Burr Ridge, IL: Irwin Publishing.
Schmelkin, L.P., Spencer, K.J., & Gellman, E.S. (1997). Faculty perspectives on course and teacher evaluations. Research in Higher Education, pp. 575–592.
Tan, Y.E. (2014). Reflection and trend of teaching evaluation in universities. Chongqing Higher Education Research, 2(5), 83–87.
Trout, P.A. (2000). Flunking the test: The dismal record of student evaluations. The Touchstone, 10(4), 11–15.
Uttl, B., White, C.A., & Gonzalez, D.W. (2017). Meta-analysis of faculty's teaching effectiveness: Student evaluation of teaching ratings and student learning are not related. Studies in Educational Evaluation, 54, 22–42.
Wang, J., & Yu, J.J. (2016). Teaching-centered or learning-centered teacher ratings by students: An analysis based on indexes of 30 institutions of higher education. Journal of Soochow University (Educational Science Edition), 2, 104–112.
Wu, S. (2013). Study on the Factors Affecting SETE in China's Universities. Dalian: Dalian University of Technology.
Wu, Y.Q. (2004). The actual malice rule as applied under American defamation law. National Chung Cheng University Law Journal, 15, 1–97.
Xie, J.L., & Zhang, C. (2019). A study on the influence of non-instructional factors on the effectiveness of SETE in higher education: Based on the perspective of student subjects. Heilongjiang Education (Higher Education Research & Appraisal), 7, 25–28.
Zhang, G.J., Ma, X.P., & Jiang, T.K. (2017). On the feedback of SETE outcomes. University Education, 7, 194–195.
Zhong, G.Z. (2012). Validity of college students' evaluation of teaching and its optimization strategies. Journal of Jimei University, 13(1), 74–77.
Zhou, W. (2009). SETE system in U.S. colleges and universities and its inspirations. Journal of Hulunbeier College, 4, 107–110.
Teaching in Higher Education: Critical Perspectives
ISSN: 1356-2517 (Print) 1470-1294 (Online)
Journal homepage: https://www.tandfonline.com/loi/cthe20

Course evaluation scores: valid measures for teaching effectiveness or rewards for lenient grading?

Guannan Wang, Accounting Department, Suffolk University, Boston, MA, USA
Aimee Williamson, Institute for Public Service, Suffolk University, Boston, MA, USA

To cite this article: Guannan Wang & Aimee Williamson (2020). Course evaluation scores: valid measures for teaching effectiveness or rewards for lenient grading? Teaching in Higher Education. DOI: 10.1080/13562517.2020.1722992
Published online: 05 Feb 2020.
ABSTRACT
Course Evaluation Instruments (CEIs) are critical aspects of faculty assessment and evaluation across most higher education institutions, but heated debates surround the value and validity of such instruments. While some argue that CEI scores are valid measures of course and instructor quality, others argue that faculty members can game the system, most notably with lenient grading practices to achieve higher student ratings. This article synthesizes the literature on course evaluation instruments as they relate to student grades to assess the evidence supporting and refuting the major theoretical frameworks (i.e. the leniency hypothesis and the validity hypothesis), explores the implications of research design and methods, and proposes practical recommendations for colleges and universities. This paper also goes beyond the CEI-grade relationship and provides a framework that illustrates the relationships between teaching quality and CEI scores, and the potential confounding factors and omitted variables which may significantly deteriorate the informativeness of the CEI score.

ARTICLE HISTORY: Received 25 July 2019; Accepted 23 January 2020
KEYWORDS: Course evaluation instrument; expected grade; teaching quality; student learning
JEL Classification: I20; I21
1. Introduction

Course evaluation processes are critical and influential components of teaching, with significant weight in review, tenure, and promotion decisions across most universities. The course evaluation instrument (CEI) is widely used by institutions of higher education to evaluate and improve teaching quality. Student evaluations of courses are common among colleges and universities, and virtually all business schools use some form of student evaluations (Clayson 2009; Brockx, Spooren, and Mortelmans 2011). The first student rating forms were completed at the University of Washington in the 1920s, and the first research on student ratings followed soon after (Kulik 2001). Despite closing in on a century of use, there is still much debate as to the validity and appropriate use of student evaluations of courses. Given the important role these evaluations play in faculty tenure and promotion processes, it is not surprising that they continue to generate significant debate and attention in the literature.
Student rating programs were originally designed, and continue to be used, for two main reasons: (1) to help instructors improve their teaching and (2) to help administrators oversee teaching quality across the institution and make related decisions (Kulik 2001; Brockx, Spooren, and Mortelmans 2011). These broad goals have evolved into many significant and influential uses of student evaluations, including the use of course evaluation scores in hiring new full-time and adjunct faculty, annual review processes, promotion and tenure decisions, teaching awards, assignment of faculty to courses, accreditation reviews, development of professional development programs, merit pay, and student selection of courses (Kulik 2001; Barth 2008; Benton 2011; Brockx, Spooren, and Mortelmans 2011; Catano and Harvey 2011; Chulkov and Alstine 2011). Some schools have thresholds for student course evaluation scores below which a faculty member is ineligible for tenure, and one or two bad CEI scores may mean that an adjunct faculty member will not be given another opportunity to teach at a school. It is therefore critical that we do our best to fully understand the course evaluation process, create valid and informative course evaluation forms, and use them in the most appropriate manner.

Student evaluations are the most widely used source for evaluating teaching effectiveness, even serving as the only source in many colleges (Benton 2011). The use and influence of such evaluations have increased in recent years, in part due to broader trends in accountability and marketization of higher education (Brockx, Spooren, and Mortelmans 2011). Accreditation requirements may also drive the use of student evaluations (Brockx, Spooren, and Mortelmans 2011). Even with the availability of other forms of evaluation, student evaluations typically have the most impact and receive the most attention (Dodeen 2013).
Research on CEIs has identified relationships between CEI scores and a variety of factors, such as course grades (Krautmann and Sander 1999; McPherson 2006; Weinberg, Hashimoto, and Fleisher 2009; Brockx, Spooren, and Mortelmans 2011; among others), class attendance (Arnold 2009; Brockx, Spooren, and Mortelmans 2011; Braga, Paccagnella, and Pellizzari 2014), discipline (McPherson 2006; Nowell 2007; Driscoll and Cadden 2010; Matos-Díaz and Ragan 2010), class type (Krautmann and Sander 1999; Centra 2003; Driscoll and Cadden 2010), class level (Nelson and Lynch 1984; Nowell 2007; Driscoll and Cadden 2010; Ewing 2012), and many other factors.

The interpretation of these relationships has generated even further debate, and it is critical that we develop a good understanding of this process and its impact. As others have suggested, if evaluation scores can be 'bought', the instrument most used for measuring teaching effectiveness is flawed and may contribute to grade inflation at a more systemic level (Krautmann and Sander 1999).

At its very core is the debate over whether course evaluation instruments are valid measures of teaching. As Kulik (2001) succinctly states, '[t]o say that student ratings are valid is to say that they reflect teaching effectiveness' (p. 10). While some faculty members see CEI scores as valid measures that inform their teaching and bring needed accountability to higher education, others view CEI scores as invalid measures that are more likely to reflect student bias and retaliation than instructor performance. Some point out that student evaluation ratings are more appropriately measures of 'satisfaction' than of outcomes or teaching value (Benton 2011). Given that other measures of teaching performance, such as exam scores and peer evaluations, carry similar or even stronger concerns about validity and reliability, there is no holy grail by which to measure teaching effectiveness and compare it to CEI scores (Kulik 2001).
2. Research questions & method

As explained above, one of the most controversial topics in the CEI literature is the association between students' expected grades and CEI scores. We identified two critical questions surrounding this debate: (1) Is there a relationship between grades (actual, expected, etc.) and CEI scores? (2) If so, what is the nature of or explanation for that relationship? While previous studies have offered answers to these questions, the findings are mixed, demonstrating a strong need for a more comprehensive analysis.

To answer these questions and inform the debate surrounding the validity and leniency hypotheses, we conducted a comprehensive survey of the CEI literature, identifying and analyzing pedagogical studies that shed light on the relationship between grades and CEI scores, particularly students' expected grades. First, we searched for educational articles related to course evaluation in the major databases, including ABI/INFORM, Business Source Complete, ScienceDirect, and Google Scholar, using a list of keywords.1 The initial search found 72 published articles related to student evaluations. Second, given that our focus is the impact of grades on student evaluations, we further limited the sample to studies incorporating grade (actual grade or expected grade) in their analysis. That narrowed the sample down to the 28 studies listed in Tables 1 and 2.

Tables 1 and 2 summarize the research type, research question, and data source of the related literature. Table 3 presents a summary of the choice of research method, dependent variables, independent variables, statistical results, and control variables. Our analysis includes an evaluation of the arguments in the existing literature, implications of research designs and methods, confounding factors, and practical implications. Among the 28 studies reviewed in Tables 1 and 2, 24 are empirical analyses and thus discussed in Table 3.
3. Prior discussion on the CEI-grade relationship

3.1. Leniency hypothesis vs. validity hypothesis

As noted above, there has been widespread debate around the association between students' grades and CEI scores. Many studies show consistent evidence that course grades, both expected and relative among peers, have a positive relationship with the CEI score (Marsh and Roche 2000; Isely and Singh 2005; Driscoll and Cadden 2010; Brockx, Spooren, and Mortelmans 2011). However, several researchers have cast doubt on that contention and find no significant association between course grades and CEI scores, or find that the impact of expected grades on CEI scores is subtle and can be explained by other factors (Centra 2003; Arnold 2009). Among the 28 studies we surveyed, 24 performed statistical analyses of the relationship between grades and CEI scores: 19 of these studies demonstrate a positive association, to some degree, between the (average) CEI score and (average) grade expectation, 4 do not find any significant association, and 1 finds a negative association.
Table 1. Pedagogical research on the impact of course grade on student evaluation: publication outlet and research question.

Author (Year) | Article Name | Journal | Research question
Arnold, I. J. M. (2009) | Do examinations influence student evaluations? | International Journal of Educational Research | Measures the impact of timing on student evaluations
Bausell, R. B. and J. Magoon (1972) | Expected grade in a course, grade point average, and student ratings of the course and the instructor | Educational and Psychological Measurement | Examines the relation between expected grade and the course rating
Beleche, T., D. Fairris, and M. Marks (2012) | Do course evaluations truly reflect student learning? Evidence from an objectively graded post-test | Economics of Education Review | The relationship between student course evaluations and an objective measure of student learning
Braga, M., M. Paccagnella, and M. Pellizzari (2014) | Evaluating students' evaluations of professors | Economics of Education Review | Contrasts measures of teacher effectiveness
Brockx, B., P. Spooren, and D. Mortelmans (2011) | Taking the grading leniency story to the edge. The influence of student, teacher, and course characteristics on student evaluations of teaching in higher education | Educational Assessment, Evaluation and Accountability | Examines the influence of course grades and other characteristics of students on student evaluations
Butcher, K. F., P. J. McEwan, and A. Weerapana (2014) | The effects of an anti-grade-inflation policy at Wellesley College | Journal of Economic Perspectives | Evaluates the consequences of the mandatory grade ceiling on student evaluations
Centra, J. A. (2003) | Will teachers receive higher student evaluations by giving higher grades and less course work? | Research in Higher Education | Examines the relationship between the expected grades, the level of difficulty, workload in courses, and course rating
Clayson, D. E. (2009) | Student evaluations of teaching: Are they related to what students learn? A meta-analysis and review of the literature | Journal of Marketing Education | The relationship between the evaluations and learning
Driscoll, J. and D. Cadden (2010) | Student evaluation instruments: the interactive impact of course requirements, student level, department and anticipated grade | American Journal of Business Education | Examines the relationship between measures of teaching effectiveness and several factors, including the students' anticipated grade
Ewing, A. M. (2012) | Estimating the impact of relative expected grade on student evaluation of teachers | Economics of Education Review | Investigates instructors' incentives to 'buy' higher evaluation scores by inflating grades
Gorry, D. (2017) | The impact of grade ceilings on student grades and course evaluations: Evidence from a policy change | Economics of Education Review | The effects of a grade ceiling policy on grade distributions and course evaluations
Greenwald, A. G. and G. M. Gillmore (1997a) | Grading leniency is a removable contaminant of student ratings | American Psychologist | Examines the relation between grading leniency and student evaluations
Greenwald, A. G. and G. M. Gillmore (1997b) | No pain, no gain? The importance of measuring course workload in student ratings of instruction | Journal of Educational Psychology | Examines the relation between course grade and student evaluations
Hoefer, P., J. Yurkiewicz, and J. C. Byrne (2012) | The association between students' evaluation of teaching and grades | Decision Sciences Journal of Innovative Education | Examines the relation between course grade and course rating, and the moderating role of gender, academic level, and field
Isely, P. and H. Singh (2005) | Do higher grades lead to favorable student evaluations? | The Journal of Economic Education | Examines the relation between the expected grade in other classes of the same course and student evaluations
Krautmann, A. C. and W. Sander (1999) | Grades and student evaluations of teachers | Economics of Education Review | Examines the relation between grading practices and student evaluations
Love, D. A. and M. J. Kotchen (2010) | Grades, course evaluations, and academic incentives | Eastern Economic Journal | Investigates how the incentives created by academic institutions affect students' evaluation of faculty and grade inflation
Marsh, H. W. and L. A. Roche (2000) | Effects of grading leniency and low workload on students' evaluations of teaching | Journal of Educational Psychology | Examines the relation between grading leniency and student evaluations
Matos-Díaz, H. and J. R. Ragan Jr (2010) | Do student evaluations of teaching depend on the distribution of expected grade? | Education Economics | Examines the relation between the distribution of expected grades and student evaluations
McPherson, M. A. (2006) | Determinants of how students evaluate teachers | The Journal of Economic Education | Grade expectations and student evaluation of teaching
Millea, M. and P. W. Grimes (2002) | Grade expectations and student evaluation of teaching | College Student Journal | Examines the links between course rigor and grades to evaluation scores
Nelson, J. P. and K. Lynch (1984) | Grade inflation, real income, simultaneity, and teaching evaluations | The Journal of Economic Education | Examines the relation between student evaluation and grade inflation and the moderating role of faculty real income
Nowell, C. (2007) | The impact of relative grade expectations on student evaluation of teaching | International Review of Economics Education | Examines the relation between student evaluations and relative grades among peers
Remedios, R. and D. A. Lieberman (2008) | I like your course because you taught me well: The influence of grades, workload, expectations and goals on students' evaluations of teaching | British Educational Research Journal | Investigates how factors such as students' pre-course expectations, achievement goals, grades, workload, and perceptions of course difficulty affect how they rate their courses
Stumpf, S. A. and R. D. Freedman (1979) | Expected grade covariation with student ratings of instruction: Individual versus class effects | Journal of Educational Psychology | Compares individual and class effects and their role in student ratings of instruction
Uttl, B., C. A. White, and D. W. Gonzales (2017) | Meta-analysis of faculty's teaching effectiveness: Student evaluation of teaching ratings and student learning are not related | Studies in Educational Evaluation | Re-estimates previously published meta-analyses and examines the relationship between CEI scores and student learning
VanMaaren, V. G., C. M. Jaquett, and R. L. Williams (2016) | Factors most likely to contribute to positive course evaluations | Innovative Higher Education | Determines the extent to which students differentially rated ten factors likely to affect their ratings on overall course evaluations
Weinberg, B. A., M. Hashimoto, and B. M. Fleisher (2009) | Evaluating teaching in higher education | Journal of Economic Education | Examines the relation between grading practices and student evaluations and the role of learning
Table 2. Pedagogical research on the impact of course grade on student evaluation: research type, data source, and sample size.

Author (Year) | Method | Target Sample (Survey/experimental) | Sample Size
Arnold, I. J. M. (2009) | Archival | Erasmus School of Economics | Around 3,000 students
Bausell, R. B. and J. Magoon (1972) | Archival | University of Delaware | Over 17,000 students
Beleche, T., D. Fairris, and M. Marks (2012) | Archival | Unidentified four-year public university | 4,293 students
Braga, M., M. Paccagnella, and M. Pellizzari (2014) | Archival | Bocconi University | 1,206 students
Brockx, B., P. Spooren, and D. Mortelmans (2011) | Archival | University of Antwerp | 1,244 students
Butcher, K. F., P. J. McEwan, and A. Weerapana (2014) | Archival | Wellesley College | 104,454 students
Centra, J. A. (2003) | Archival | Student Instructional Report II by Educational Testing Service | 55,000 classes
Clayson, D. E. (2009) | Meta-analysis | More than 17 prior archival studies | N/A
Driscoll, J. and D. Cadden (2010) | Archival | Quinnipiac University | 29,596 students
Ewing, A. M. (2012) | Archival | University of Washington | 53,658 classes
Gorry, D. (2017) | Archival | Unidentified state university | 281 classes
Greenwald, A. G. and G. M. Gillmore (1997a) | Theory | N/A | N/A
Greenwald, A. G. and G. M. Gillmore (1997b) | Archival | University of Washington | 200 classes
Hoefer, P., J. Yurkiewicz, and J. C. Byrne (2012) | Archival | Pace University | 381 classes
Isely, P. and H. Singh (2005) | Archival | Grand Valley State University | 260 classes
Krautmann, A. C. and W. Sander (1999) | Archival | DePaul University in Chicago | 258 classes
Love, D. A. and M. J. Kotchen (2010) | Theory | N/A | N/A
Marsh, H. W. and L. A. Roche (2000) | Archival | American University | 5,433 classes
Matos-Díaz, H. and J. R. Ragan Jr (2010) | Archival | University of Puerto Rico at Bayamón | 1,232 classes
McPherson, M. A. (2006) | Archival | University of North Texas | 607 classes
Millea, M. and P. W. Grimes (2002) | Archival | Mississippi State University | 149 students
Nelson, J. P. and K. Lynch (1984) | Archival | Penn State University | 146 classes
Nowell, C. (2007) | Archival | A large public university in the US | 716 students
Remedios, R. and D. A. Lieberman (2008) | Archival | Scottish university | 610 students
Stumpf, S. A. and R. D. Freedman (1979) | Archival | New York University | 5,894 students and 197 classes
Uttl, B., C. A. White, and D. W. Gonzales (2017) | Meta-analysis | More than 58 prior studies | N/A
VanMaaren, V. G., C. M. Jaquett, and R. L. Williams (2016) | Archival | A large state university in the southeastern US | 148 students
Weinberg, B. A., M. Hashimoto, and B. M. Fleisher (2009) | Archival | Ohio State University | 26,666 students
individual expected grade divided by GPA, individual expected grade relative to the | |
section average, individual expected grade relative to the actual grade, actual course | |
grade, overall GPA, grade in the subsequent course, and high school grades. The most | |
used measures of CEI scores include overall course rating, overall instructor rating, and | |
rating on instructor’s teaching ability. We present a summary of the choices of research | |
methods, dependent variables, independent variables, statistical results, and control variables in Table 3. | |
While many studies have provided empirical evidence supporting the relationship | |
between grades and CEI scores, the interpretation of such a relationship is under | |
debate. Greenwald and Gillmore (1997) suggest that the grade–rating correlation primarily results from instructors’ grading leniency. This study established the fundamental | |
theory of the relationship between course grades and CEI scores and represents the | |
leniency hypothesis.

Table 3. Sample selection and variable definition. For each of the surveyed studies, the table reports the author(s), year, sample level (student and/or class), dependent variables, independent variables, statistical results, and control variables. The dependent variables are chiefly individual or class-average course and instructor evaluation scores, including overall ratings, ratings of the instructor’s teaching ability, and whether the student would recommend the instructor. The independent variables include the expected grade; the average, absolute, or relative expected grade (for example, relative to the section or course average, the student’s cumulative GPA, or the instructor’s average grade across classes); the actual, average, or normalized course grade; the grade in the current or subsequent course; the current earned grade and attitudes toward remaining graded work; average high school grade; overall teaching quality and clarity of lectures; and a mandatory grade cap. The reported estimates are predominantly positive and statistically significant, with several non-significant results and a few negative ones (for example, for overall teaching quality, clarity of lectures, and the mandatory grade cap). Control variables differ across studies and span course characteristics (discipline, type, level, class size, workload, difficulty, class time and frequency, evaluation response rate), instructor characteristics (gender, age, rank, tenure status, degree, experience, real income), and student characteristics (gender, age, race, ability, effort, attendance, prior grades and GPA).

Another interpretation is the validity hypothesis which posits that
more effective teaching leads to greater student learning, which translates into higher grades and, in turn, higher CEI scores. In the following sections, we provide a detailed discussion of the two competing hypotheses proposed by prior studies.
3.1.1. Leniency hypothesis | |
The leniency hypothesis posits that students give higher CEI scores to instructors from | |
whom they receive higher grades. Supporters of the leniency hypothesis generally argue that instructors can, in effect, ‘buy’ higher evaluation scores by grading more leniently (Krautmann and
Sander 1999; McPherson 2006; Weinberg, Hashimoto, and Fleisher 2009; among others). | |
In an early study, Greenwald and Gillmore (1997) find that courses that receive higher | |
CEI scores are those in which students expect to receive higher grades or a lighter workload, not necessarily those with higher teaching quality. Many studies interpret the | |
relationship between course grades and CEI scores to support the leniency hypothesis. | |
For example, Krautmann and Sander (1999) show that a one-point increase in the | |
expected classroom grade point average (GPA) leads to an improvement of between | |
0.34 and 0.56 in the CEI score. Similarly, McPherson (2006) finds that an increase of | |
one point on a four-point expected grade scale results in an improvement in the CEI | |
score of around 0.34 for foundational courses and 0.30 for upper-level courses. Brockx, | |
Spooren, and Mortelmans (2011) find that when a student’s course grade increases by | |
one point, the CEI score increases by 0.33 (grand-mean centered) and 1.56 (group-mean centered). Millea and Grimes (2002) report similar findings: both the current
grade and expected grade have a positive relationship with the CEI score. | |
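To make the magnitudes above concrete, the following minimal sketch (in Python, using synthetic, invented data) mirrors the kind of class-level specification used in this strand of literature: the average CEI score is regressed on the average expected grade with a simple class-size control. The variable names and the 0.4-point effect built into the simulation are assumptions for illustration, not estimates from any cited study.

    # Illustrative class-level regression of CEI scores on expected grades.
    # All data below are synthetic; the 0.4 slope is assumed for demonstration.
    import numpy as np

    rng = np.random.default_rng(0)
    n_classes = 500
    expected_gpa = rng.uniform(2.0, 4.0, n_classes)      # class-average expected grade
    class_size = rng.integers(10, 200, n_classes)        # simple control variable
    cei = 2.5 + 0.4 * expected_gpa - 0.001 * class_size + rng.normal(0, 0.3, n_classes)

    # Ordinary least squares via numpy's least-squares solver.
    X = np.column_stack([np.ones(n_classes), expected_gpa, class_size])
    coef, *_ = np.linalg.lstsq(X, cei, rcond=None)
    print("intercept, expected-grade slope, class-size slope:", np.round(coef, 3))

The expected-grade coefficient recovered here plays the same role as the 0.30–0.56 improvements reported by Krautmann and Sander (1999) and McPherson (2006), although those studies include far richer sets of controls.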
Some studies dig deeper to provide clearer evidence of the leniency hypothesis. According to Handelsman et al. (2005), most college students can be classified as performance-oriented rather than mastery-oriented, indicating that their satisfaction with a course is largely based on their grade in that course. Braga, Paccagnella, and Pellizzari (2014) perform a similar analysis of teaching effectiveness and find that teaching quality is negatively correlated with students’ CEI scores.
In addition to empirical evidence, both Gorry (2017) and Butcher, McEwan, and Weerapana (2014) provide anecdotal evidence regarding the impact of a change in grading policy on CEI scores. Butcher, McEwan, and Weerapana (2014)
examine the policy change at Wellesley College by comparing the CEI scores between | |
departments that were obligated to lower their grades with the outcomes in departments | |
that were not. The study finds that students in the ‘grading-decreasing’ courses lowered | |
their evaluations of the instructors accordingly. Similarly, Gorry (2017) analyzes the | |
effects of a grade ceiling policy implemented by a large state university on grade distributions and CEI scores; such research shows that lowering the grade ceiling significantly | |
decreases CEI scores across a variety of measures. | |
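The grade-cap comparisons described above are essentially before/after contrasts between treated and untreated departments. The sketch below (Python, invented numbers) shows the difference-in-differences arithmetic behind such a comparison; it is a simplified illustration, not a reproduction of Butcher, McEwan, and Weerapana’s (2014) design.

    # Difference-in-differences on synthetic department-level CEI averages.
    import numpy as np

    rng = np.random.default_rng(1)
    def mean_cei(mu, n=200):
        # Average rating over n synthetic course sections with true mean mu.
        return rng.normal(mu, 0.5, n).mean()

    capped_before, capped_after = mean_cei(4.2), mean_cei(4.0)   # departments under the grade cap
    exempt_before, exempt_after = mean_cei(4.1), mean_cei(4.1)   # comparison departments

    did = (capped_after - capped_before) - (exempt_after - exempt_before)
    print(f"difference-in-differences estimate: {did:+.2f} rating points")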
3.1.2. Validity hypothesis | |
The main difference between the leniency and validity hypotheses is whether student | |
evaluations reflect the quality of teaching or simply capture the grading-satisfaction | |
game between the instructors and students. Supporters of the validity hypothesis argue | |
that instructors who teach more effectively receive better evaluation scores because their | |
students learn more, thereby earning higher grades. In other words, CEI is a valid instrument (Centra 2003; Barth 2008; Remedios and Lieberman 2008; Arnold 2009; Clayson | |
2009). Essentially, the validity hypothesis suggests that even if there is a strong correlation | |
between student grades and CEI scores, we cannot be sure that there is causality. | |
Using more than 50,000 CEI scores, Centra (2003) investigates the previously examined | |
relationship between grades and student evaluations. Unlike previous researchers, Centra | |
(2003) controls for a series of variables in regression analyses, including factors such as | |
subject area, class size, teaching method, and student-perceived learning outcomes. Contrary to many other analyses, Centra (2003) does not find convincing evidence that students’ course ratings are influenced by the grades they receive from their instructors | |
when controlling for other factors. Rather, the findings suggest a curvilinear relationship between the difficulty/workload level of courses and the CEI score, a pattern that is more indicative of students’ learning experiences.
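A simple way to see what a curvilinear specification adds is to fit both a linear and a quadratic model and inspect the squared term, as in the short sketch below (Python, synthetic data with an inverted-U pattern deliberately built in). This is only an illustration of the modeling idea, not a replication of Centra’s (2003) analysis.

    # Compare linear and quadratic fits of ratings on perceived workload/difficulty.
    import numpy as np

    rng = np.random.default_rng(2)
    workload = rng.uniform(1, 10, 400)                    # synthetic perceived workload
    cei = 3.0 + 0.5 * workload - 0.05 * workload**2 + rng.normal(0, 0.2, 400)

    linear = np.polyfit(workload, cei, deg=1)
    quadratic = np.polyfit(workload, cei, deg=2)
    print("linear slope:", round(linear[0], 3))
    print("quadratic term:", round(quadratic[0], 3), "(negative => inverted-U pattern)")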
Centra’s (2003) arguments are further confirmed by a few other studies. Remedios and | |
Lieberman (2008) find that grades only have a small impact on student ratings compared | |
with other influential factors. By controlling for students’ achievement goals and expectations at the beginning of the semester, Remedios and Lieberman (2008) show that students’ course ratings are largely determined by the extent to which the students find their | |
courses stimulating, interesting, and useful. The impact of grades and course difficulty | |
appears to be small. Marsh and Roche (2000) find similar results that many CEI scores | |
are not related to grading leniency; rather, they are more related to the learning experience | |
and teaching efforts. Clayson (2009) conducts a meta-analysis of more than thirty studies and shows that a small average relationship exists between learning and the CEI score.
However, the author highlights that such a relationship is situational and may vary | |
across teachers, disciplines, or class levels. Barth (2008) shows that the overall instructor | |
rating is primarily driven by the quality of instruction. Beleche, Fairris, and Marks (2012) | |
examine the learning–CEI association by using a centrally graded exam as a proxy for | |
actual student learning. This exam was not related to any specific course, so the sample | |
was independent of course type, faculty grading policy, and students’ grade expectations. | |
The literature also suggests inconsistencies and a lack of linearity. For example, Arnold (2009) finds that successful students do not raise the CEI score in response to their successful performance, whereas unsuccessful students externalize failure by lowering the CEI score. Such results are inconsistent with the common criticism of CEIs, which is that students use them as a tool to reward or penalize teachers.
3.2. Other factors that impact CEI scores
As suggested above, it is well documented that the CEI–grade relationship varies considerably across different subgroups of observations, and that other factors are as impactful as, if not more impactful than, the grade itself. This paper focuses on the relationship between student grades and CEI scores, but it is important to remember that this is just one piece of a complex picture. Figure 1 proposes a diagram representing the various factors that impact CEI scores and their relationships, including student grades. Going into depth on all of these factors is beyond the scope of this paper, but they are important to keep in mind, particularly where confounding factors intersect strongly with the student grade–CEI score relationship. The most commonly documented confounding factors include workload, course discipline, course level, class size, class attendance, percentage of non-local students, percentage of late enrollees, student effort, class time, class location, class frequency, instructor’s ranking, instructor’s gender, course evaluation response rate, etc. (Krautmann and Sander 1999; Millea and Grimes 2002; Centra 2003; among others). We will highlight a few factors found to have a strong impact on the CEI score.
Figure 1. A Framework for Understanding the Relationship Between Teaching Quality and CEI Scores.
3.2.1. Workload | |
A number of studies find that there is a negative relationship between workload and CEI | |
score, as students typically rate courses higher if they are more manageable (Feldman | |
1978; Marsh 1987; Paswan and Young 2002; Centra 2003; Clayson 2009; Driscoll and | |
Cadden 2010). The results from Marsh and Roche (2000) and Centra (2003) indicate | |
that courses with lighter workloads, such as lower ‘hours per week required outside of | |
class’, receive higher student ratings. | |
3.2.2. Course characteristics | |
Course type, course level, and discipline all have a significant impact on CEI scores. | |
Brockx, Spooren, and Mortelmans (2011) conclude that instructors teaching elective | |
courses receive higher scores than instructors teaching required ones. Benton and | |
Cashin (2012) conclude that higher-level courses tend to receive higher evaluation | |
ratings in comparison to lower-level courses. Similarly, Ewing (2012) also finds that graduate courses tend to receive better evaluations than undergraduate courses. Such factors can | |
be so strong that they mitigate or exacerbate the CEI-grade relationship. For example, | |
Hoefer, Yurkiewicz, and Byrne (2012) extend the discussion and find the correlation | |
between grade and CEI score is stronger for undergraduate courses and for
those in some specific disciplines, such as management and marketing. Their results also | |
indicate that CEI scores vary considerably across disciplines. These studies suggest that the highest SET scores are received in arts and humanities, followed by biological and social sciences, business, computer science, math, engineering, and physical science (Matos-Díaz and Ragan 2010; Brockx, Spooren, and Mortelmans 2011). Nowell (2007) finds that courses receive higher CEI scores if students exert more effort in the course or the class meets at least twice per week. Such variation creates endogeneity issues when CEI scores are used to assess an instructor’s performance, because courses
with different characteristics may not be truly comparable. Driscoll and Cadden (2010) | |
suggest that, given that CEI scores vary significantly across courses, instructors should | |
be evaluated within their respective departments by a department average rather than | |
by an overall university measure. | |
3.2.3. Instructor characteristics | |
In addition, the literature documents that full-time faculty members generally receive
higher scores than part-time faculty (Nowell 2007; Driscoll and Cadden 2010). Ewing | |
(2012) further documents that pre-tenure professors tend to receive lower evaluation | |
scores than tenured professors. An instructor’s age may also have an impact on the CEI | |
score. Interestingly, this is not in the direction that would be predicted based on an expectation that experience improves teaching. Rather, Brockx, Spooren, and Mortelmans | |
(2011) find that younger professors tend to receive better evaluations. Driscoll and | |
Cadden’s (2010) literature review reports that other studies have found perceptions of | |
an instructor’s personality and/or enthusiasm to be strong factors in course evaluation | |
instruments (Clayson and Sheffet 2006; Clayson 2009; Driscoll and Cadden 2010). | |
Again, some factors have been found to strengthen the CEI-grade relationship, with | |
Hoefer, Yurkiewicz, and Byrne (2012) finding the correlation between grade and CEI | |
score to be stronger for courses taught by female faculty.
4. The caveats of CEI score as a measure of teaching quality | |
To examine the relationship between grades and CEI scores, prior literature builds | |
different empirical models and uses various proxies for grades and CEI scores. | |
Beyond the CEI-grade relationship documented by prior literature (see 3.1 and 3.2 for | |
detailed review), there are a number of caveats which concern the validity of CEI as a | |
measure of teaching quality. In this section, we will discuss the possible biases introduced | |
by CEI: (1) relative performance and peer effect, (2) selection biases, and (3) grade | |
inflation. | |
4.1. Relative performance and peer effect | |
While most of the variables included in these studies capture an individual’s absolute | |
grade or CEI score, the relative student standing is also shown to have a significant | |
impact on the student’s decision making regarding CEI scores. Economists and sociologists have found that individuals’ satisfaction depends not only on their own performance but also on their circumstances relative to a reference group (Becker 1974). | |
Therefore, it is possible that although students’ satisfaction with a course – as captured | |
by CEI scores – may be influenced by individual performance, it may also be influenced by | |
their relative performance among their peers. Knowing the impact of peer effect is important, as suggested by Nowell (2007): | |
If students reward teachers for high relative grades as opposed to simply high absolute grades, | |
there may be limits to an instructor’s ability to ‘purchase’ better teaching evaluations by | |
increasing the grades of all students. Conversely, if individual students reward teachers for | |
their own high grades as well as the high grades of their peers, it becomes expensive to | |
give low grades to anyone in the class and increases the incentive to ‘buy’ higher SET | |
ratings. (p. 44) | |
Stumpf and Freedman (1979) provide early evidence of the relationship between grades | |
and student ratings at both the individual and class levels. Their results suggest that both | |
the individual’s expected grade and the instructor’s overall expected grading policy contribute to the grade–rating relationship, and that the latter tends to have a stronger | |
impact. As an extension of Stumpf and Freedman (1979), several studies further | |
explore the relationship between relative performance and CEI score. Common measures | |
for relative performance include: (1) the difference between the expected grade for the | |
current course and the students’ historical GPA, (2) the average grade earned by all students who take the same course, (3) the expected grades in other classes in which the | |
student is enrolled, and (4) the distribution of expected grades. | |
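For concreteness, the sketch below computes rough versions of these four measures from a toy gradebook (Python with pandas). The column names and the handful of records are invented purely for illustration.

    # Toy computation of the four relative-performance measures listed above.
    import pandas as pd

    df = pd.DataFrame({
        "student":        ["s1", "s2", "s3", "s4", "s1", "s2"],
        "course":         ["ECON101"] * 4 + ["MATH201"] * 2,
        "expected_grade": [3.7, 3.0, 2.3, 3.3, 3.0, 3.7],
        "historical_gpa": [3.5, 3.2, 2.8, 3.0, 3.5, 3.2],
    })

    # (1) expected grade relative to the student's own historical GPA
    df["rel_to_gpa"] = df["expected_grade"] - df["historical_gpa"]
    # (2) average expected grade of all students taking the same course
    df["course_avg"] = df.groupby("course")["expected_grade"].transform("mean")
    # (3) average expected grade in the student's other enrolled courses
    total = df.groupby("student")["expected_grade"].transform("sum")
    count = df.groupby("student")["expected_grade"].transform("count")
    df["other_courses_avg"] = (total - df["expected_grade"]) / (count - 1).where(count > 1)
    # (4) dispersion of expected grades within each course
    df["course_grade_sd"] = df.groupby("course")["expected_grade"].transform("std")

    print(df)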
Isely and Singh (2005) measure peer performance with two variables: expected grades | |
in other classes taught by the same instructor and the gap between the expected grade in | |
the current course and the students’ cumulative GPA. Their findings indicate that if an | |
instructor has other classes in which students expect higher grades, then the average | |
CEI score tends to be higher. | |
Analogous to the findings in Isely and Singh (2005), Nowell (2007) adopts three | |
measurements for peer performance: the difference between the expected grade for the | |
current course and the student’s historical GPA, the average grade earned by all students | |
who take the same course, and the expected grades in other classes in which the student is | |
enrolled. The study reveals that the grade students care most about has a considerable | |
impact on the CEI score. If students use their own grades as the benchmark, then the grade–rating relationship is stronger. In contrast, if students use their peers’ grades as the benchmark, then the grade–rating relationship is weakened.
Matos-Díaz and Ragan (2010) explore the impact of the expected grade on the CEI | |
score from another perspective. They draw inferences from economics theories about | |
risk and uncertainty and argue that the variance of expected grades signals the teacher’s | |
reward structure. The narrow distribution of expected grades indicates that the penalty | |
for lower study time or unfavorable performance (e.g. poor performance on an examination or assignment) is relatively low and is, therefore, more likely to lead to favorable | |
student ratings. As expected, Matos-Díaz and Ragan (2010) report a negative relationship | |
between the variance of the expected grade and CEI score, showing that instructors can | |
strategically obtain favorable ratings by narrowing the grade distribution. This finding | |
also weakens the argument that students care more about their relative performance in | |
a class. | |
Overall, the literature on peer effect suggests that instructors can significantly increase | |
CEI scores not only by increasing grades for individual students but also by lowering the
grading standards for the entire class. In this scenario, the incentives and costs of ‘buying’ | |
high CEI scores may be greater than has been suggested by the literature documented in | |
sections 3.1 and 3.2. | |
4.2. Self-selection bias | |
A favorable CEI score may also reflect factors that increase students’ satisfaction, but are | |
unrelated to teaching quality, such as students’ initial ability, course type, and instructor | |
grading leniency. To better isolate the link between the CEI score and teaching quality, it | |
is necessary to introduce objective measures of student characteristics at the individual | |
level to control for the impact of learning ability on the students’ evaluation of the | |
instructors. However, due to the anonymous nature of CEI processes, it is challenging | |
to incorporate individual-level variables, and self-selection bias may occur. CEI scores are mostly calculated as course means and represent only the subset of students who choose to fill out the evaluations (Beleche, Fairris, and Marks 2012). This introduces
crucial measurement errors, especially when the pool of students who complete the | |
CEI differs from the total student population (Clayson 2009; Isely and Singh 2005; | |
Kherfi 2011). | |
Moreover, the students who participate in a CEI administration cannot fully represent the total student population. The course evaluation response rate is normally less than 100 percent, and it is questionable to simply assume that the students who do not complete the survey are well represented by the students who do. Even assuming a random sample of students, as the number of students incorporated into CEI scores decreases, the effect of individual variation and bias grows stronger (Isely and Singh 2005). Average CEI scores are also more strongly influenced by such bias when the class size is small.
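The sampling problem can be illustrated with a short simulation (Python, assumed parameters): holding the underlying opinion fixed, smaller classes and lower response rates make the reported class-average rating a noticeably noisier estimate.

    # How class size and response rate affect the spread of the reported average CEI.
    # The true mean rating and the 0.8 standard deviation are assumptions.
    import numpy as np

    rng = np.random.default_rng(3)
    true_mean = 4.0

    def spread_of_reported_average(class_size, response_rate, trials=10_000):
        n_resp = max(1, int(class_size * response_rate))
        samples = rng.normal(true_mean, 0.8, size=(trials, n_resp))
        return samples.mean(axis=1).std()     # variability of the class-average rating

    for size in (15, 60, 240):
        for rate in (0.3, 0.9):
            print(f"class={size:3d}, response={rate:.0%}: "
                  f"spread of reported average = {spread_of_reported_average(size, rate):.3f}")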
4.3. Grade inflation | |
CEI processes may exacerbate the problem of grade inflation and can even decrease a professor’s teaching effort (Krautmann and Sander 1999; Love and Kotchen 2010; Butcher, | |
McEwan, and Weerapana 2014). Love and Kotchen (2010) examine the effects of CEI | |
use on faculty behavior and show that excessive institutional emphasis on teaching,
research, or both can exacerbate the problems of grade inflation and result in diminished | |
faculty teaching effort. To better align instructors’ incentives with the institution’s objectives on teaching and research, the authors suggest that universities should ensure uniform | |
grade distributions for individual classes and restrain grade inflation. | |
Reaching similar conclusions, Nelson and Lynch (1984) find that the evaluation process itself produces grade inflation. They also determine that faculty members’ grading policies
are related to their real incomes because faculty members are more willing to adopt easier | |
grading policies when the real income from teaching is falling. | |
Given the pressures on faculty to maintain favorable CEI scores and the impact of | |
expected grades on instructors’ evaluations, enforcing lower expected grades may inevitably have adverse consequences for an instructor’s evaluation. Institutions should carefully
evaluate such impacts, especially when CEI scores are used in tenure and promotion | |
decisions. To ensure fairness across faculty, it would be important to ensure even | |
application of uniform grade distributions across faculty and programs, and to account for | |
any overall reduction in CEI scores. | |
5. Discussion & recommendations | |
Overall, the literature suggests that course grades are positively correlated with CEI scores, | |
but there is considerably less evidence as to whether that relationship is properly attributed | |
to the leniency hypothesis or the validity hypothesis. Given the evidence of correlation | |
between grades and CEI scores and the lack of clear indication that the validity hypothesis | |
is more accurate, colleges and universities should consider potential actions to mitigate the | |
potential for various forms of bias in CEIs. We propose the following to continue efforts to | |
assess this relationship and mitigate its potential impact: (1) ensuring quality design of the | |
instrument, (2) attention to qualitative items on CEIs, (3) university level internal analyses | |
to identify (and address) potential biases and validity issues, (4) consideration of a portfolio approach to instructor evaluation, and (5) increased efforts to tease out the nature of | |
the relationship in future research. | |
In this section, we aim to provide examples and practical techniques that schools can use
to improve the objectivity and informativeness of teaching evaluations. Particularly, we | |
tailor the recommendation section for schools and institutions that are going through a | |
CEI adoption or revision process. | |
5.1. Quality instrument design | |
First and foremost, it is imperative that colleges and universities review CEIs and design or | |
adopt a quality instrument. While instrument design alone cannot alleviate student biases, | |
a poorly designed instrument can exacerbate such biases. In particular, we advocate for | |
clarity in the wording of the items on CEIs and a clear separation of instructor versus | |
course questions to help avoid the exacerbation of biases. Item clarity is important to | |
reduce misinterpretation of items. While items should be broad enough to refer to all | |
types of courses and instructors, clear and directed questions will give the respondent | |
something specific to reflect upon. | |
Given that many instructors do not have complete control over course characteristics, | |
we also advocate for a clear separation of items focused on the instructor versus those | |
focused on the course. Based on this analysis of prior studies, it is clear that student | |
course ratings are determined by multiple variables beyond the instructor’s teaching performance, such as course characteristics, course grade, student qualities, and student | |
biases. Among these factors, course characteristics, which are frequently not under an | |
individual instructor’s control, have considerable impact on the student’s perception of | |
the course. This problem is particularly common when multiple sections of the same | |
course are taught by different instructors while the textbook, course syllabus, exams, | |
and other materials are all designed by one faculty member or a small group. In such | |
cases, instructors tend to have limited freedom in choosing course content or structure, | |
but these factors still count toward the instructor’s evaluation.
To separate the uncontrollable factors from instructor effectiveness, universities can | |
design the course evaluation questionnaire to improve item clarity and reduce response | |
bias. For example, we recommend presenting questions related to ‘Evaluation of | |
Instructor’ and questions related to ‘Evaluation of Course’ separately to students. In cases | |
where a faculty member has little to no control over course content and design, the ‘Evaluation of Instructor’ items provide a more objective assessment of the instructor’s teaching
quality for hiring, tenure and promotion purposes. The ‘Evaluation of Course’ provides | |
insights on both course-level pedagogy and program-level curriculum, and can be used | |
by faculty members to improve and enhance their teaching skills. Below is a sample CEI from a business school located in Boston, MA; a brief scoring sketch follows the item list.
Evaluation of Instructor | |
The instructor was well prepared and organized for class. | |
The instructor communicated information effectively. | |
The instructor promoted useful classroom discussions, as appropriate for the course. | |
The instructor demonstrated the importance of the subject matter. | |
The instructor provided timely and useful feedback. | |
The instructor was responsive to students outside the classroom. | |
Overall rating of this instructor. | |
Evaluation of Course | |
The syllabus clearly described the goals, content, and requirements of the course. | |
The course materials, assigned text(s), and/or other resources helped me understand concepts and ideas related to the | |
course. | |
The workload for this course (reading, assignments, papers, homework, etc.) was manageable given the subject matter | |
and course level. | |
Assignments (exams, quizzes, papers, etc.) adequately reflected course concepts. | |
Overall rating of this course. | |
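To make the separation operational, the sketch below (Python, with invented responses on an assumed 1–5 agreement scale) scores the two blocks above independently, so that an instructor composite can be reported and used separately from the course composite. The item keys are shorthand, not the official wording.

    # Score the 'Evaluation of Instructor' and 'Evaluation of Course' blocks separately.
    # Responses are invented; a 1-5 agreement scale is assumed.
    instructor_items = {
        "prepared_and_organized":   [5, 4, 5, 4],
        "communicated_effectively": [4, 4, 5, 3],
        "timely_useful_feedback":   [5, 5, 4, 4],
    }
    course_items = {
        "clear_syllabus":               [4, 4, 4, 5],
        "manageable_workload":          [3, 4, 3, 4],
        "assignments_reflect_concepts": [4, 5, 4, 4],
    }

    def block_mean(items):
        # Average each item across respondents, then average the item means.
        item_means = [sum(v) / len(v) for v in items.values()]
        return sum(item_means) / len(item_means)

    print(f"instructor composite: {block_mean(instructor_items):.2f}")
    print(f"course composite:     {block_mean(course_items):.2f}")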
5.2. Attention to qualitative items on CEIs | |
One disadvantage of quantitative CEI (scaled) questions is that the questions are specifically pre-designed, and the dimensions covered might be somewhat narrow. Qualitative | |
evaluation questions give students opportunities to provide in-depth feedback on
broader dimensions, resulting in an extensive examination of the student experience | |
(Steyn, Davies, and Sambo 2019). Consistent with this argument, Sherry, Fulford, and | |
Zhang (1998) examine the accuracy, utility, and feasibility of both quantitative and qualitative evaluation approaches and find that both efficiently capture aspects of the instructional climate. Grebennikov and Shah (2013) also focus on the use of qualitative evaluation feedback from students and find that efficient use of such feedback, together with timely responses to it, helps increase student satisfaction and retention.
5.3. University-level analyses | |
The complexity of the literature, variety of findings, and heterogeneity of CEIs themselves | |
suggest that colleges and universities may wish to examine these questions internally to | |
evaluate the validity of their own instruments as a measure of teaching effectiveness, | |
assess the impact of grading policies, and identify potential biases. | |
While the literature provides mixed findings on the validity/leniency approach, universities often have thousands of data points they could use to conduct internal analyses of | |
CEI scores. With the variation in CEIs, grading scales, and other confounding factors | |
across universities, internal analyses could provide clearer evidence of the state of the | |
student grade–CEI score relationship as it exists in a particular university. In particular, analyzing the grade–CEI score relationship for specific faculty members’ courses over time would control for teaching quality to some extent, especially if there is sufficient data to analyze by specific course or type of course, given that teaching quality could easily vary
based on a faculty member’s expertise in a particular course topic. | |
As an extension of this, universities could also consider adoption of a relative performance approach to mitigate the effects of teaching courses or disciplines that typically result | |
in lower course averages, such as more quantitatively focused courses or particularly challenging first year courses. A relative performance approach that compares the ratings of | |
the instructors with others who teach the same or similar courses, or at least within the | |
same discipline, can help reduce student grade effects on CEI scores. | |
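One way to operationalize such a relative performance approach is to standardize each course’s CEI score within its own department before making comparisons, as in the hedged sketch below (Python; the department names, instructors, and scores are invented).

    # Express each course's CEI score as a z-score within its department.
    import pandas as pd

    scores = pd.DataFrame({
        "department": ["MATH", "MATH", "MATH", "ART", "ART", "ART"],
        "instructor": ["a", "b", "c", "d", "e", "f"],
        "cei":        [3.6, 3.9, 3.3, 4.5, 4.7, 4.3],
    })

    grp = scores.groupby("department")["cei"]
    scores["dept_z"] = (scores["cei"] - grp.transform("mean")) / grp.transform("std")
    print(scores.sort_values("dept_z", ascending=False))

Under this scoring, an instructor in a low-rated discipline is compared against departmental peers rather than against the university-wide average.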
5.4. Consideration of a portfolio approach to expand the measures of teaching | |
quality | |
The prior recommendations focus on the CEI itself and ways it can be designed or analyzed to mitigate the potential for student grades to drive CEI scores inappropriately. | |
The extent to which the instruments themselves, both quantitative and qualitative | |
items, can do this, however, is limited. Thus, we also recommend that schools consider
a portfolio approach to expand measures of teaching quality, particularly in cases where | |
the internal analyses recommended above suggest significant student biases or the | |
ability for instructors to effectively ‘buy’ grades. | |
A portfolio approach is based on a combination of measures, such as student evaluations, peer evaluations, chair evaluations, and self-evaluations. Portfolio approaches are | |
well discussed in the current literature (Mullens et al. 1999; Laverie 2002; Berk 2005; | |
Chism and Banta 2007) and the details of such an approach are beyond the scope of | |
this article, so we will limit our discussion. As examples, Berk (2005) discusses some | |
potential sources of evidence of teaching effectiveness including student ratings, peer | |
ratings, self-evaluation, student interviews, alumni ratings, teaching scholarship, learning
outcome measures, etc. While portfolio approaches cannot alleviate any student grade | |
biases in CEI scores, they allow for alternative measures of teaching effectiveness to | |
provide a more holistic evaluation. | |
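As a purely illustrative example of how such a portfolio could be aggregated, the sketch below combines several normalized evidence sources with explicit weights. Both the sources and the weights are assumptions chosen for illustration; institutions would need to set and justify their own.

    # Weighted composite of multiple, separately normalized evidence sources (0-1 scale).
    # Sources and weights are hypothetical, for illustration only.
    evidence = {"student_ratings": 0.72, "peer_review": 0.80,
                "self_evaluation": 0.65, "chair_review": 0.75}
    weights  = {"student_ratings": 0.40, "peer_review": 0.30,
                "self_evaluation": 0.10, "chair_review": 0.20}

    composite = sum(evidence[k] * weights[k] for k in evidence)
    print(f"weighted teaching-effectiveness composite: {composite:.2f}")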
While some of this information is more difficult to collect than others, these different | |
sources of information focus on different aspects of teaching effectiveness. The instructor’s | |
self-evaluation may provide informal evidence of teaching performance. Information provided by the department chair or course coordinator may highlight the instructor’s compliance with the internal policies and procedures. Colleagues who have expertise in the | |
discipline can provide important feedback through classroom visits or course material | |
reviews. Schools and departments can randomly select courses and solicit evaluations | |
from external professors in the same field. This approach allows the school and department to evaluate the teacher’s teaching skills from an educator’s point of view in addition | |
to the recipients’ (students’) perspectives, but there are recognizably more resource allocation costs involved with such an approach. | |
It is also important to recognize and balance the benefits and caveats for different types of | |
peer review. For instance, internal reviewers have a good understanding of schools’ institutional backgrounds, but may feel social pressure to overpraise the reviewees or understate | |
the concerns. Institutions need to balance the benefits and costs associated with these | |
different approaches, and such debates might be further explored in future research.
5.5. Further research | |
Finally, given the literature’s mixed findings and continued debate over the relationship
between student grades and CEI scores, most notably whether or not such relationships | |
are causal, there is a strong need for continued research in this area, specifically targeted | |
to teasing out the nature of the relationship. As the discussion above demonstrates, it is not | |
sufficient to argue that there is a relationship between student grades and CEI scores, if the | |
argument can also be made that effective teaching leads to increased student grades. What | |
we really want to know is the extent to which student grades or expected grades bias | |
student evaluations of instructors. | |
Given that universities across the nation are already collecting troves of CEI data, the | |
real need is for strong methodologists to design studies that can better determine or refute | |
claims to causality. This is not to suggest that there are no challenges involved. Anonymity is a critical feature of CEIs, so disaggregating data to the individual student level is problematic, but with the increase in online CEI distribution, it may be increasingly possible
to do so. On a related note, further refinement of effective quantitative measurements and | |
analyses of CEI scores is advised. While the challenges of finding adequate proxies for | |
student learning are clear, additional efforts on this front are worthwhile, as the critical
nature of CEIs in higher education should not be underestimated. | |
Our study also suggests the need for more qualitative research in this area, as most prior | |
research in this stream of literature uses quantitative research designs such as correlation | |
tests or multivariate regression. Qualitative research, such as focus group interviews or | |
quasi-experiments, will provide valuable insights on how these course evaluation questions are truly perceived by students. Future studies can also investigate students’ judgments and decision making with regard to their responses to the quantitative CEI | |
questions. Such a study would help CEI designers to better align CEI questions with students’ perceptions of their own learning. | |
6. Conclusions and remarks | |
As described at length above, prior research has provided ample evidence on the relationship between CEI scores and various instructor and course factors, including grades and | |
many other characteristics. However, most previous literature either focuses on a single | |
study, or takes a broad look at the extensive factors that play a role, without fully unpacking particular relationships. In addition, there has been no conceptual framework that
synthesizes these existing theories and research findings. This article seeks to fill that gap, | |
focusing on the relationship between student grades and CEI scores, to synthesize the | |
findings to date, assess the leniency and validity hypotheses, identify closely related | |
factors, discuss potential biases, and make practical recommendations for schools and | |
universities. | |
Overall, the literature suggests that course grades are positively correlated with CEI | |
scores, but there is considerably less evidence as to whether that relationship is properly | |
attributed to the leniency hypothesis or the validity hypothesis. In this paper, we survey | |
28 prior studies and discuss the impact of course grades on course evaluation scores. | |
We specifically explore the leniency hypothesis, which posits that students give higher | |
CEI scores to instructors from whom they receive higher grades, and the validity | |
hypothesis, which posits that instructors who teach more effectively receive better evaluation scores because their students learn more and therefore earn higher grades. Our | |
review reveals that existing research focuses more on the extent of the relationship than | |
the nature of that relationship. The empirical studies that do assess this, however, tend
to be more consistent with the leniency hypothesis. | |
One of the major implications of these findings is that colleges and universities should | |
be thoughtful about their reliance on CEI scores in the broader faculty evaluation process | |
and consider a variety of approaches to meet their needs. To address these serious limitations of the CEI and provide a more objective evaluation of the instructor’s teaching quality,
we propose five recommendations: quality design of the instrument, attention to qualitative items, university level internal analyses, a portfolio approach to instructor evaluation, | |
and increased efforts to tease out the nature of the relationship in future research. | |
In addition, as shown in Figure 1, this study proposes a conceptual framework that | |
illustrates the relationships between actual teaching quality and CEI scores, and suggests | |
where confounding factors may play a role. While we are trying to focus on one specific | |
relationship between CEI scores and grades, we believe that a broad overview of the evaluation-teaching quality relationships is informative to the readers of this study. The | |
proposed framework lays the groundwork for future research regarding the potential confounding factors and omitted variables that may significantly diminish the informativeness of the CEI score.
Note | |
1. The keywords include teaching evaluation, course evaluation, student evaluation, student | |
feedback, student perception, and student rating. | |
Disclosure statement | |
No potential conflict of interest was reported by the author(s). | |
References | |
Arnold, I. J. M. 2009. “Do Examinations Influence Student Evaluations?” International Journal of
Educational Research 48 (4): 215–224. | |
Barth, M. M. 2008. “Deciphering Student Evaluations of Teaching: A Factor Analysis Approach.” | |
Journal of Education for Business 84 (1): 40–46. | |
Bausell, R. B., and J. Magoon. 1972. “Expected Grade in a Course, Grade Point Average, and Student | |
Ratings of the Course and the Instructor.” Educational and Psychological Measurement 32 (4): | |
1013–1023. | |
Becker, G. S. 1974. “A Theory of Social Interactions.” Journal of Political Economy 82 (6): 1063– | |
1093. | |
Beleche, T., D. Fairris, and M. Marks. 2012. “Do Course Evaluations Truly Reflect Student | |
Learning? Evidence From an Objectively Graded Post-Test.” Economics of Education Review | |
31 (5): 709–719. | |
Benton, S. 2011. “Using Student Course Evaluations to Design Faculty Development Workshops.” | |
Academy of Educational Leadership Journal 15 (2): 41–53. | |
Benton, S., and W. E. Cashin. 2012. Student Ratings of Teaching: A Summary of Research and | |
Literature. IDEA Paper No. 50. | |
Berk, R. A. 2005. “Survey of 12 Strategies to Measure Teaching Effectiveness.” International Journal | |
of Teaching and Learning in Higher Education 17 (1): 48–62. | |
Braga, M., M. Paccagnella, and M. Pellizzari. 2014. “Evaluating Students’ Evaluations of Professors.” | |
Economics of Education Review 41 (August): 71–88. | |
Brockx, B., P. Spooren, and D. Mortelmans. 2011. “Taking the Grading Leniency Story to the Edge. | |
The Influence of Student, Teacher, and Course Characteristics on Student Evaluations of | |
Teaching in Higher Education.” Educational Assessment, Evaluation and Accountability 23 | |
(4): 289–306. | |
Butcher, K. F., P. J. McEwan, and A. Weerapana. 2014. “The Effects of an Anti-Grade Inflation | |
Policy at Wellesley College.” Journal of Economic Perspectives 28 (3): 189–204. | |
Catano, V., and S. Harvey. 2011. Student Perception of Teaching Effectiveness: Development and | |
Validation of the Evaluation of Teaching Competencies Scale (ETCS). Halifax, Nova Scotia, Canada: Routledge.
Centra, J. A. 2003. “Will Teachers Receive Higher Student Evaluations by Giving Higher Grades | |
and Less Course Work?” Research in Higher Education 44 (5): 495–518. | |
Chism, N. V. N., and T. W. Banta. 2007. “Enhancing Institutional Assessment Efforts Through | |
Qualitative Methods.” New Directions for Institutional Research 136 (winter): 15–28. | |
Chulkov, D. V., and J. V. Alstine. 2011. “Challenges in Designing Student Teaching Evaluations in a | |
Business.” International Journal of Educational Management 26 (2): 162–174. | |
Clayson, D. E. 2009. “Student Evaluations of Teaching: Are They Related to What Students Learn: A | |
Meta-Analysis and Review of the Literature.” Journal of Marketing Education 31 (1): 16–30. | |
Clayson, D. E., and M. J. Sheffet. 2006. “Personality and the Student Evaluation of Teaching.” | |
Journal of Marketing Education 28 (2): 149–160. | |
Dodeen, H. 2013. “Validity, Reliability, and Potential Bias of Short Forms of Students’ Evaluation of | |
Teaching: The Case of UAE University.” Educational Assessment 18 (4): 235–250. | |
Driscoll, J., and D. Cadden. 2010. “Student Evaluation Instruments: The Interactive Impact of | |
Course Requirements, Student Level, Department and Anticipated Grade.” American Journal | |
of Business Education 3 (5): 21–30. | |
Ewing, A. M. 2012. “Estimating the Impact of Relative Expected Grade on Student Evaluations of | |
Teachers.” Economics of Education Review 31: 141–154. | |
Feldman, K. A. 1978. “Course Characteristics and Variability Among College Students in | |
Rating Their Teachers and Courses: A Review and Analysis.” Research in Higher Education 9: | |
199–242. | |
Gorry, D. 2017. “The Impact of Grade Ceilings on Student Grades and Course Evaluations: | |
Evidence from a Policy Change.” Economics of Education Review 56 (February): 133–140. | |
Grebennikov, L., and M. Shah. 2013. “Student Voice: Using Qualitative Feedback from Students to | |
Enhance Their University Experience.” Teaching in Higher Education 18 (6): 606–618. | |
Greenwald, A. G., and G. M. Gillmore. 1997a. “Grading Leniency is a Removable Contaminant of | |
Student Ratings.” American Psychologist 52 (11): 1209–1217. | |
Greenwald, A. G., and G. M. Gillmore. 1997b. “No Pain, no Gain? The Importance of Measuring | |
Course Workload in Student Ratings of Instruction.” Journal of Educational Psychology 89 (4): | |
743–751. | |
Handelsman, M. M., W. L. Briggs, N. Sullivan, and A. Towler. 2005. “A Measure of College Student | |
Course Engagement.” The Journal of Educational Research 98 (3): 184–192. | |
Hoefer, P., J. Yurkiewicz, and J. C. Byrne. 2012. “The Association between Students’ Evaluation of | |
Teaching and Grades.” Decision Sciences Journal of Innovative Education 10 (3): 447–459. | |
Isely, P., and H. Singh. 2005. “Do Higher Grades Lead to Favorable Student Evaluations?” The
Journal of Economic Education 36 (1): 29–42. | |
Kherfi, S. 2011. “Whose Opinion is it Anyway? Determinants of Participation in Student Evaluation | |
of Teaching.” The Journal of Economic Education 42 (2): 19–30. | |
Krautmann, A. C., and W. Sander. 1999. “Grades and Student Evaluations of Teachers.” Economics | |
of Education Review 18 (1): 59–63. | |
Kulik, J. A. 2001. “Student Ratings: Validity, Utility, and Controversy.” In The Student Ratings Debate: Are They Valid? How Can We Best Use Them? Vol. 2001, edited by Michael Theall, Philip C. Abrami, and Lisa A. Mets, 9–25.
Laverie, D. A. 2002. “Improving Teaching Through Improving Evaluation: A Guide to Course | |
Portfolios.” Journal of Marketing Education 24 (2): 104–113. | |
Love, D. A., and M. J. Kotchen. 2010. “Grades, Course Evaluations, and Academic Incentives.” | |
Eastern Economic Journal 36 (2): 151–163. | |
Marsh, H. W. 1987. “Students’ Evaluations of University Teaching: Research Findings, | |
Methodological Issues, and Directions for Future Research.” International Journal of | |
Educational Research 11 (3): 253–388. | |
Marsh, H. W., and L. A. Roche. 2000. “Effects of Grading Leniency and Low Workload on Students’ | |
Evaluations of Teaching: Popular Myth, Bias, Validity, or Innocent Bystanders?” Journal of | |
Educational Psychology 92 (1): 202–228. | |
Matos-Díaz, H., and J. R. Ragan Jr. 2010. “Do Student Evaluations of Teaching Depend on the | |
Distribution of Expected Grade?” Education Economics 18 (3): 317–330. | |
McPherson, M. A. 2006. “Determinants of how Students Evaluate Teachers.” The Journal of | |
Economic Education 37 (1): 3–20. | |
Millea, M., and P. W. Grimes. 2002. “Grade Expectations and Student Evaluation of Teaching.” | |
College Student Journal 36 (4): 582–590. | |
Mullens, J., M. S. Leighton, K. G. Laguarda, and E. O’Brian. 1999. Student Learning, Teaching | |
Quality, and Professional Development: Theoretical Linkages, Current Measurement, and | |
Recommendations for Future Data Collection. Working paper. | |
Nelson, J. P., and K. Lynch. 1984. “Grade Inflation, Real Income, Simultaneity, and Teaching | |
Evaluations.” The Journal of Economic Education 15 (1): 21–37. | |
Nowell, C. 2007. “The Impact of Relative Grade Expectations on Student Evaluation of Teaching.” | |
International Review of Economics Education 6 (2): 42–56. | |
Paswan, A. K., and J. A. Young. 2002. “Student Evaluation of Instructor: A Nomological | |
Investigation Using Structural Equation Modeling.” Journal of Marketing Education 24 (3): | |
193–202. | |
Remedios, R., and D. A. Lieberman. 2008. “I Liked Your Course Because you Taught me Well: The | |
Influence of Grades, Workload, Expectations and Goals on Students’ Evaluations of Teaching.” | |
British Educational Research Journal 34 (1): 91–115. | |
Sherry, A. C., C. Fulford, and S. Zhang. 1998. “Assessing Distance Learners’ Satisfaction with | |
Instruction: A Quantitative and a Qualitative Measure.” American Journal of Distance | |
Education 12 (3): 4–28. | |
Steyn, C., D. Davies, and A. Sambo. 2019. “Eliciting Student Feedback for Course Development: the | |
Application of a Qualitative Course Evaluation Tool among Business Research Students.” | |
Assessment and Evaluation in Higher Education 44 (1): 11–24. | |
Stumpf, S. A., and R. D. Freedman. 1979. “Expected Grade Covariation with Student Ratings of | |
Instruction: Individual Versus Class Effects.” Journal of Educational Psychology 71 (3): 293–302. | |
Uttl, B., C. A. White, and D. W. Gonzales. 2017. “Meta-analysis of Faculty’s Teaching Effectiveness: | |
Student Evaluation of Teaching Ratings and Student Learning are not Related.” Studies in | |
Educational Evaluation 54 (1): 22–42. | |
VanMaaren, V. G., C. M. Jaquett, and R. L. Williams. 2016. “Factors Most Likely to Contribute to | |
Positive Course Evaluations.” Innovative Higher Education 41 (5): 425–440. | |
Weinberg, B. A., M. Hashimoto, and B. M. Fleisher. 2009. “Evaluating Teaching in Higher | |
Education.” Journal of Economic Education 40 (3): 227–261. | |
[ResearchGate record] David M. McCord (Western Carolina University), Appropriate and inappropriate uses of students' assessment of instruction, article, January 2013. https://www.researchgate.net/publication/259503622
Medical Teacher 2013; 35: 15–26. DOI: 10.3109/0142159X.2012.732247
Top five flashpoints in the assessment of | |
teaching effectiveness | |
RONALD A. BERK | |
Johns Hopkins University, USA | |
Abstract | |
Background: Despite thousands of publications over the past 90 years on the assessment of teaching effectiveness, there is still | |
confusion, misunderstanding, and hand-to-hand combat on several topics that seem to pop up over and over again on listservs, | |
blogs, articles, books, and medical education/teaching conference programs. If you are measuring teaching performance in | |
face-to-face, blended/hybrid, or online courses, then you are probably struggling with one or more of these topics or flashpoints. | |
Aim: To decrease the popping and struggling by providing a state-of-the-art update of research and practices and a ‘‘consumer’s | |
guide to trouble-shooting these flashpoints.’’ | |
Methods: Five flashpoints are defined, the salient issues and research described, and, finally, specific, concrete recommendations | |
for moving forward are proffered. Those flashpoints are: (1) student ratings vs. multiple sources of evidence; (2) sources | |
of evidence vs. decisions: which come first?; (3) quality of ‘‘home-grown’’ rating scales vs. commercially-developed scales;
(4) paper-and-pencil vs. online scale administration; and (5) standardized vs. unstandardized online scale administrations. The first | |
three relate to the sources of evidence chosen and the last two pertain to online administration issues. | |
Results: Many medical schools/colleges and higher education in general fall far short of their potential and the available | |
technology to comprehensively assess teaching effectiveness. Specific recommendations were given to improve the quality and | |
variety of the sources of evidence used for formative and summative decisions and their administration procedures. | |
Conclusions: Multiple sources of evidence collected through online administration, when possible, can furnish a solid foundation | |
from which to infer teaching effectiveness and contribute to fair and equitable decisions about faculty contract renewal, merit pay, | |
and promotion and tenure. | |
Introduction | |
FLASHPOINT: a critical stage in a process, trouble | |
spot, discordant topic, or lowest temperature at | |
which a flammable liquid will give off enough | |
vapor to ignite. | |
If you have read any of my previous articles, you know I have | |
given off buckets of vapor. For you language scholars, | |
‘‘flashpoint’’ is derived from two Latin words, ‘‘flashus,’’ | |
meaning ‘‘your shorts,’’ and ‘‘pointum,’’ meaning, ‘‘are on fire.’’ | |
Why flashpoints? | |
This article is not another review of the research on student | |
ratings. It is a state-of-the-art update of research and practices, | |
primarily since 2006 (Berk 2006; Seldin & Associates 2006; | |
Arreola 2007), with specific TARGETS: the flashpoints that have | |
emerged, which are critical issues, conflicts, contentious | |
problems, and volatile hot buttons in the assessment of | |
teaching effectiveness. They are the most prickly, thorny, | |
vexing, and knotty topics that every medical school/college | |
and institution in higher education must confront. | |
These flashpoints cause confusion, misunderstanding, dissension, hand-to-hand combat, and, ultimately, inaccurate and unfair decisions about faculty. Although there are many more than five in this percolating cauldron of controversy, the ones tackled here seem to pop up over and over again on listservs, blogs, articles, books, and medical education/teaching conference programs, plus they generate a firestorm of debate by faculty and administrators more than others. This contribution is an attempt to decrease some of that percolating and popping.
Practice points
. Polish your student rating scale, but start building multiple sources of evidence to assess teaching effectiveness.
. Match your highest quality sources to the specific formative and summative decisions using the 360° MSF model.
. Review current measures of teaching effectiveness with your faculty and plan specifically how you can improve their psychometric quality.
. Design an online administration system in-house or out-house with a vendor to conduct the administration and score reporting.
. Standardize directions, administration procedures, and a narrow window for completion of your student rating scale and other measures of teaching effectiveness.
Correspondence: R.A. Berk, Johns Hopkins University, 10971 Swansfield Road, Columbia, MD 21044, USA. Tel: +1 410 9407118; fax: +1 206 3091618; email: [email protected]
Trouble-shooting flashpoints | |
If you are currently using any instrument to measure teaching | |
performance in face-to-face, blended/hybrid, or online | |
courses, then you are probably struggling with one or more | |
flashpoints. This article is a ‘‘consumer’s guide to troubleshooting these flashpoints.’’ The motto of this article is: ‘‘Get to | |
the flashpoint and the solution.’’ | |
This is the inauguration of my new PBW series on problem-based writing. Your problems are the foci of my writing. The
structure of each section will be governed by the PBW | |
perspective: | |
(1) Definition: Each flashpoint will be succinctly defined.
(2) Options: The options available based on research and practice will be described.
(3) Recommended Solution: Specific, concrete recommendations for faculty and administrators will be proffered to move them to action.
There does not seem to be any short-cut, quick fix, or multilevel marketing scheme to improve the quality of teaching. | |
Tackling these flashpoints head-on will hopefully be one | |
positive step toward that improvement. | |
The top five flashpoints are: (1) student ratings vs. multiple | |
sources of evidence; (2) sources of evidence vs. decisions: | |
which come first?; (3) quality of ‘‘home-grown’’ rating scales | |
vs. commercially-developed scales; (4) paper-and-pencil vs. | |
online scale administration; and (5) standardized vs. unstandardized online scale administration. The first three relate to | |
critical decisions about the sources of evidence chosen and the | |
last two pertain to online scale administration issues. | |
Top five flashpoints | |
Student ratings vs. multiple sources of evidence | |
FLASHPOINT 1: Student rating scales have dominated as the primary or, usually, the only measure of | |
teaching effectiveness in medical schools/colleges | |
and universities worldwide and in a few remote | |
planets. This state of practice is contrary to the advice | |
of a cadre of experts and the limitations of student | |
input to comprehensively evaluate teaching effectiveness. Several other measures should be used in | |
conjunction with student ratings. | |
Student ratings. Historically, student rating scales have been | |
the primary measure of teaching effectiveness for the past 50 | |
years. Students have had a critical role in the teaching–learning | |
feedback system. The input from their ratings in summative | |
decision making has been recommended on an international | |
level (Strategy Group 2011; Surgenor 2011). | |
There are nearly 2000 references on the topic (Benton & | |
Cashin 2012) with the first journal article published 90 years | |
ago (Freyd 1923). There is more research and experience in | |
higher education with student ratings than with all of the other | |
measures of teaching effectiveness combined (Berk 2005, | |
2006). If you need to be brought up to speed quickly with the | |
research on student ratings, check out these up-to-date | |
reviews (Gravestock & Gregor-Greenleaf 2008; Benton & | |
Cashin 2012; Kite 2012). | |
Unfortunately, in medical/healthcare education, student | |
ratings have not received the same level of research attention. | |
There is only a sprinkling of studies over the last 20 years | |
(e.g., Hoeks & van Rossum 1988; Jones & Froom 1994; Mazor | |
et al. 1999; Elzubeir & Rizk 2002; Barnett et al. 2003; Kidd & | |
Latif 2004; Pierre et al. 2004; Turhan et al. 2005; Maker et al. | |
2006; Ahmady et al. 2009; Barnett & Matthews 2009; Berk | |
2009a; Chenot et al. 2009; Donnon et al. 2010; Boerboom et al. | |
2012; Stalmeijer et al. 2010). There is far more research on peer | |
observation (e.g., Berk et al. 2004; Siddiqui et al. 2007; Wellein | |
et al. 2009; DiVall et al. 2012; Pattison et al. 2012; Sullivan et al. | |
2012). There are also a few qualitative studies that are | |
peripherally related (Stark 2003; Steinert 2004; Martens et al. | |
2009; Schiekirka et al. 2012). | |
With this volume of scholarly productivity and practice in | |
academia worldwide, student ratings seem like the solution to | |
assessing teaching effectiveness in medical/healthcare education and higher education in general. So, what is the problem? | |
Limitations of student ratings. As informative as student | |
ratings can be about teaching, there are numerous behaviors | |
and skills defining teaching effectiveness which students are | |
NOT qualified to rate, such as a professor’s knowledge and | |
content expertise, teaching methods, use of technology, | |
course materials, assessment instruments, and grading practices (Cohen & McKeachie 1980; Calderon et al. 1996; | |
d’Apollonia & Abrami 1997a; Ali & Sell 1998; Green et al. | |
1998; Hoyt & Pallett 1999; Coren 2001; Ory & Ryan 2001; | |
Theall & Franklin 2001; Marsh 2007; Svinicki & McKeachie | |
2011). Students can provide feedback at a certain level in each | |
of those areas, but it will take peers and other qualified | |
professionals to rate those skills in depth. BOTTOM LINE: | |
Student ratings from well-constructed scales are a necessary, | |
but not sufficient, source of evidence to comprehensively assess | |
teaching effectiveness. | |
Student ratings provide only one portion of the information | |
needed to infer teaching effectiveness. Yet, that is pretty much | |
all that is available at most institutions. When those ratings | |
alone are used for decision making, they will be incomplete | |
and biased. Without additional evidence of teaching effectiveness, student ratings can lead to incorrect and unfair career | |
decisions about faculty that can affect their contract renewal, | |
annual salary increase, and promotion and tenure. | |
It is the process of evaluation or assessment that permits | |
several sources of appropriate evidence to be collected | |
for the purpose of decision making. Assessment is a | |
‘‘systematic method of obtaining information from [scales] | |
and other sources, used to draw inferences about characteristics of people, objects, or programs,’’ according to | |
the US Standards for Educational and Psychological | |
Testing (AERA, APA, & NCME Joint Committee on | |
Standards 1999, p. 272). Student ratings represent one | |
measure and just one source of information in that process. | |
Multiple sources of evidence. Over the past decade, there has | |
been a trend toward augmenting student ratings with other | |
data sources of teaching effectiveness. Such sources can serve | |
to broaden and deepen the evidence base used to assess | |
courses and the quality of teaching (Theall & Franklin 1990; | |
Braskamp & Ory 1994; Hoyt & Pallett 1999; Knapper & | |
Cranton 2001; Ory 2001; Cashin 2003; Berk 2005, 2006; Seldin | |
2006; Arreola 2007; Theall & Feldman 2007; Gravestock & | |
Gregor-Greenleaf 2008; Benton & Cashin 2012). In fact, several | |
comprehensive models of ‘‘faculty evaluation’’ have been | |
proposed (Centra 1993; Braskamp & Ory 1994; Berk 2006, | |
2009a; Arreola 2007; Gravestock & Gregor-Greenleaf 2008), | |
which include multiple sources of evidence with some models | |
attaching greater weight to student and peer ratings and less | |
weight to self-, administrator, and alumni ratings, and other | |
sources. All of these models are used to arrive at formative and | |
summative decisions. | |
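To make the idea of differential weighting concrete, here is a minimal sketch, in Python, of a compensatory composite across several sources; the source names, weights, and 1-5 scaling are hypothetical illustrations, not values prescribed by Berk or the models cited above.
# Minimal sketch of a weighted multi-source composite (hypothetical weights).
# Assumes each source has already been rescaled to a common 1-5 rating metric.
WEIGHTS = {
    "student_ratings": 0.40,      # heavier weight for student input
    "peer_ratings": 0.30,         # heavier weight for peer input
    "self_ratings": 0.10,
    "administrator_ratings": 0.10,
    "alumni_ratings": 0.10,
}

def composite_score(scores):
    """Weighted mean over whichever sources are actually available."""
    used = {name: w for name, w in WEIGHTS.items() if name in scores}
    return sum(w * scores[name] for name, w in used.items()) / sum(used.values())

print(composite_score({"student_ratings": 4.2, "peer_ratings": 4.5, "self_ratings": 4.8}))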
15 Sources. There are 15 potential sources of evidence of | |
teaching effectiveness: (1) student ratings; (2) peer observations; (3) peer review of course materials; (4) external expert | |
ratings; (5) self-ratings; (6) videos; (7) student interviews; | |
(8) exit and alumni ratings; (9) employer ratings; (10) mentor’s | |
advice; (11) administrator ratings; (12) teaching scholarship; | |
(13) teaching awards; (14) learning outcome measures; and | |
(15) teaching (course) portfolio. Berk (2006) described several | |
major characteristics of each source, including type of measure | |
needed to gather the evidence, the person(s) responsible for | |
providing the evidence (students, peers, external experts, | |
mentors, instructors, or administrators), the person or committee who uses the evidence, and the decision(s) typically | |
rendered based on that data (formative, summative, or | |
program). He also critically examined the value and contribution of these sources for teaching effectiveness based on the | |
current state of research and practice. His latest recommendations will be presented in Flashpoint 2. | |
Triangulation. Much has been written about the merits and | |
shortcomings of these various sources of evidence (Berk 2005, | |
2006). Put simply: There is no perfect source or combination of | |
sources. Each source can supply unique information, but also | |
is fallible, usually in a way different from the other sources. For | |
example, the unreliability and biases of peer ratings are not the | |
same as those of student ratings; student ratings have other | |
weaknesses. By drawing on three or more different sources of | |
evidence, you can leverage the strengths of each source to | |
compensate for weaknesses of the other sources, thereby | |
converging on a decision about teaching effectiveness that is | |
more accurate and reliable than one based on any single | |
source (Appling et al. 2001). This notion of triangulation is | |
derived from a compensatory model of decision making. | |
Given the complexity of measuring the act of teaching in a | |
real-time classroom environment or online course, it is | |
reasonable to expect that multiple sources can provide a | |
more accurate, reliable, and comprehensive picture of teaching effectiveness than just one source. However, the decision | |
maker should integrate the information from only those | |
sources for which validity evidence is available (see Standard | |
14.13). The quality of the sources chosen should be beyond | |
reproach, according to the Standards (AERA, APA, & NCME | |
Joint Committee on Standards 1999). | |
Since there is not enough experience with multiple sources, | |
there is a scarcity of empirical evidence to support the use of | |
any particular combination of sources (e.g., Barnett et al. 2003; | |
Stalmeijer et al. 2010; Stehle et al. 2012). There are a few | |
surveys of the frequency of use of individual sources (Seldin | |
1999; Barnett & Matthews 2009). Research is needed on | |
various combinations of measures for different decisions to | |
determine ‘‘best practices.’’ | |
Recommendations. All experts on faculty evaluation recommend multiple sources of evidence to assess teaching effectiveness. Beyond student ratings, is it worth the extra effort, | |
time, and cost to develop the additional measures suggested in | |
this section? Just what new information do you have to gain? | |
As those instruments are being built, it should become clear | |
that they are intended to measure different teaching behaviors | |
that contribute to teaching effectiveness. Each measure should | |
bite off a separate chunk of behaviors. They should be | |
designed to be complementary, not redundant, although there | |
may be justification for some overlap for corroboration. | |
There is even research evidence on the relationships | |
between student ratings and several other measures to support | |
their complementarity. Benton and Cashin’s (2012) research | |
review reported the following validity coefficients with student | |
ratings: trained observers (0.50 with global ratings), self (0.30– | |
0.45), alumni (0.54–0.80), and administrators (0.47–0.62; 0.39 | |
with global ratings). Since 0.50 is only 25% explained variance | |
and 75% unexplained or new information, these coefficients | |
suggest a lot of insight can be gained using observers’, self, and | |
administrators’ ratings as sources of evidence. | |
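The arithmetic behind that claim is simply the squared correlation: r-squared is the proportion of variance a source shares with student ratings, and the remainder is the new information it adds. A short Python check, using only the coefficients quoted above:
# Shared vs. unexplained variance for the validity coefficients reported above
# (Benton & Cashin 2012): r**2 is the proportion of variance shared with student ratings.
validity_coefficients = {
    "trained observers (global)": 0.50,
    "self (low)": 0.30,
    "self (high)": 0.45,
    "alumni (low)": 0.54,
    "alumni (high)": 0.80,
    "administrators (low)": 0.47,
    "administrators (high)": 0.62,
}
for source, r in validity_coefficients.items():
    shared = r ** 2
    print(f"{source}: r = {r:.2f}, shared variance = {shared:.0%}, new information = {1 - shared:.0%}")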
Sources of evidence vs. decisions: Which come | |
first? | |
FLASHPOINT 2: Rating scales are typically administered and then confusion occurs over what to do | |
with the results and how to interpret them for specific | |
decisions. A better strategy would be to do exactly the | |
opposite of that practice. Spin your head around | |
180°, exorcist style. The decision should drive the
selection of the appropriate sources of evidence, the | |
types of data needed for the decision, and the design | |
of the report form. Custom tailor the sources, data, | |
and form to fit the decision. The information and | |
format of the evidence a professor needs to improve his | |
or her teaching are very different from that required | |
by a department chair or associate dean for annual | |
review (contract renewal or merit pay) or by a faculty | |
committee for promotion and tenure review. The | |
sources of evidence and formats of the reports can | |
either hinder or facilitate the decision process. | |
Types of decisions. According to Seldin (1999), teaching is | |
the major criterion (98%) in assessing overall faculty performance in liberal arts colleges compared to student advising | |
(64%), committee work (59%), research (41%), publications | |
(31%), and public service (24%). Although these figures may | |
not hold up in research universities and, specifically, in | |
medical schools/colleges, teaching didactic, and/or clinical | |
courses is still a critical job requirement and criterion on which | |
most faculty members are assessed. | |
There are two types of individual decisions in faculty | |
assessment with which you may already be familiar in the | |
context of student assessment, plus one decision about | |
programs: | |
(1) Formative decisions. These are decisions faculty make
to improve and shape the quality of their teaching. It is | |
based on evidence of teaching effectiveness they gather | |
to plan and revise their teaching semester after semester. This evidence and the subsequent adjustments in | |
teaching can occur anytime during the course, so the | |
students can benefit from those changes, or after the | |
course in preparation for the next course. | |
(2) Summative decisions. These decisions are rendered by
the administrative-type person who controls a professor’s destiny and future in higher education. This | |
individual is usually the dean, associate dean, program | |
director, or department head or chair. This administrator uses evidence of a professor’s teaching effectiveness | |
along with other evidence of research, publications, | |
clinical practice, and service to ‘‘sum up’’ his or her | |
overall performance or status to decide about contract | |
renewal or dismissal, annual merit pay, teaching | |
awards, and promotion and tenure. | |
Although promotion and tenure decisions are often | |
made by a faculty committee, a letter of recommendation by the dean is typically required to reach the | |
committee for review. These summative decisions are | |
high-stakes, final employment decisions reached at | |
different points in time to determine a professor’s | |
progression through the ranks and success as an | |
academician. | |
(3) Program decisions. Several sources of evidence can also
be used for program decisions, as defined in the | |
Program Evaluation Standards by the US Joint | |
Committee on Standards for Educational Evaluation | |
(Yarbrough et al. 2011). They relate to the curriculum, | |
admissions and graduation requirements, and program effectiveness. They are NOT individual decisions; | |
instead, they focus on processes and products. The | |
evidence usually is derived from various types of faculty | |
and student input and employers’ performance appraisal of students. It is also collected to provide documentation to satisfy the criteria for accreditation review. | |
Matching sources of evidence to decisions. The challenge is | |
to pick the most appropriate and highest quality sources of | |
evidence for the specific decision to be made; that is, match the | |
sources to the decision. The decision drives your choices of | |
evidence. Among the aforementioned 15 sources of evidence of | |
teaching effectiveness, here are my best picks based on the | |
literature for formative, summative, and program decisions: | |
Formative decisions
. student ratings,
. peer observations,
. peer review of course materials,
. external expert ratings,
. self-ratings,
. videos,
. student interviews, and
. mentor’s advice.
Summative decisions (annual review for contract renewal and merit pay)
. student ratings,
. self-ratings,
. teaching scholarship,
. administrator ratings,
. teaching portfolio (for several courses over the year),
. peer observation (report written expressly for summative decision),
. peer review of course materials (report written expressly for summative decision), and
. mentor’s review (progress report written expressly for summative decision).
Summative decisions (promotion and tenure)
. student ratings,
. self-ratings,
. teaching scholarship,
. administrator ratings,
. teaching portfolio (across several years’ courses),
. peer review (written expressly for summative decision), and
. mentor’s review (progress report written expressly for summative decision).
Program decisions | |
. Student ratings | |
. Exit and alumni ratings | |
. Employer ratings | |
The multiple sources identified for each decision can be | |
configured into the 360° multisource feedback (MSF) model of
assessment (Berk 2009a, 2009b) or other model for accreditation documentation of teaching assessment. The sources for | |
each decision may be added gradually to the model. This is an | |
on-going process for your institution. | |
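For institutions building such a model incrementally, one simple way to keep the decision-to-source pairings explicit is a lookup table like the sketch below; it only transcribes the picks listed above, and the key names and helper function are illustrative rather than part of the 360° MSF model itself.
# Illustrative lookup of evidence sources per decision type, transcribed from the lists above.
SOURCES_BY_DECISION = {
    "formative": [
        "student ratings", "peer observations", "peer review of course materials",
        "external expert ratings", "self-ratings", "videos",
        "student interviews", "mentor's advice",
    ],
    "summative (annual review)": [
        "student ratings", "self-ratings", "teaching scholarship", "administrator ratings",
        "teaching portfolio (courses over the year)", "peer observation report",
        "peer review of course materials report", "mentor's progress report",
    ],
    "summative (promotion and tenure)": [
        "student ratings", "self-ratings", "teaching scholarship", "administrator ratings",
        "teaching portfolio (several years' courses)", "peer review report",
        "mentor's progress report",
    ],
    "program": ["student ratings", "exit and alumni ratings", "employer ratings"],
}

def sources_for(decision):
    """Return the evidence sources matched to a given decision type (empty list if unknown)."""
    return SOURCES_BY_DECISION.get(decision, [])

print(sources_for("program"))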
Recommendations. So now that you have seen my picks, | |
which sources are you going to choose? So many sources, so | |
little time! Which sources are already available in your | |
department? What is the quality of the measures used to | |
provide evidence of teaching effectiveness? Are the faculty | |
stakeholders involved in the current process? | |
You have some decisions to make. Where do you begin? | |
Here are a few suggestions: | |
(1) Start with student ratings. Consider the content and quality of your current scale and determine whether it needs a minor or major tune-up for the decisions being made.
(2) Review the other sources of evidence with your faculty to decide the next steps. Which sources that reflect best practices in teaching will your faculty embrace? Weigh the pluses and minuses of the different sources.
(3) Decide which combination of sources is best for your faculty. Identify which sources should be used for both formative and summative decisions, such as self- and peer ratings, and which sources should be used for one type of decision but not the other, such as administrator ratings and teaching portfolio.
(4) Map out a plan to build those sources, one at a time, to create an assessment model for each decision (see Berk 2009a).
Whatever combination of sources you choose to use, take | |
the time and make the effort to design the scales, administer | |
the scales, and report the results appropriately. The accuracy | |
of faculty assessment decisions depends on the integrity of the | |
process and the validity and reliability of the multiple sources | |
of evidence you collect. This endeavor may seem rather | |
formidable, but, keep in mind, you are not alone in this | |
process. Your colleagues at other institutions are probably | |
struggling with the same issues. Maybe you could pool | |
resources. | |
Quality of ‘‘home-grown’’ rating scales vs. | |
commercially-developed scales | |
FLASHPOINT 3: Many of the rating scales developed by faculty committees in medical schools/ | |
colleges and universities do not meet even the most | |
basic criteria for psychometric quality required by | |
professional and legal standards. Most of the scales | |
are flawed internally, administered incorrectly, and | |
rarely is there any evidence of score reliability and | |
validity. The serious concern is that decisions about | |
the careers of faculty are being made with these | |
instruments. | |
Quality control. Researchers have reviewed the quality of | |
student rating scales used by colleges and universities | |
throughout the US and Canada (Berk 1979, 2006; Franklin & | |
Theall 1990; d’Apollonia & Abrami 1997b, 1997c; Seldin 1999; | |
Theall & Franklin 2000; Abrami 2001; Franklin 2001; Ory & | |
Ryan 2001; Arreola 2007; Gravestock & Gregor-Greenleaf | |
2008). The instruments are either commercially developed | |
scales with pre-designed reporting forms or ‘‘home-grown,’’ | |
locally constructed measures built usually by faculty committees. The former exhibit the quality control of the company | |
that developed the scales and reports, such as Educational | |
Testing Service and The IDEA Center (see Flashpoint 4); the | |
latter have no consistency in the development process and | |
rarely any formal procedures for controlling psychometric | |
quality. | |
Quality of ‘‘home-grown’’ scales. That lack of quality control | |
may very well extend to institutions worldwide. It could be | |
due to a lack of commitment, importance, accountability, or | |
interest; inappropriate personnel without the essential skills; or | |
limited resources. No one knows for sure. Regardless of the | |
reason, the picture is ugly. | |
Reviewers of practices at institutions in North America have | |
found the following problems with ‘‘home-grown’’ scales: | |
. poor or no specifications of teaching behaviors,
. faulty items (statements and anchors),
. ambiguous or confusing directions,
. unstandardized administration procedures,
. inappropriate data collection, analysis, and reporting,
. no adjustments in ratings for extraneous factors,
. no psychometric studies of score reliability and validity, and
. no guidelines or training for faculty and administrators to use the results correctly for appropriate decisions.
Does the term psychometrically putrid summarize current | |
practices? How does your scale stack up against those | |
problems? Fertilizer-wise, ‘‘home-grown’’ scales are not growing. Their development is arrested. They are more like ‘‘Peter | |
Pan scales.’’ | |
The potential negative consequences of using faulty | |
measures to make biased and unfair decisions to guide | |
teaching improvement and faculty careers can be devastating. | |
Moreover, this assessment only addresses the quality of | |
student rating scales. What would be the quality of peer | |
observations, self-ratings, and administrator ratings and their | |
interpretations? Serious attention needs to be devoted to the | |
quality control of all ‘‘home-grown’’ scales. | |
From a broader perspective, poor-quality scales violate US
testing/scaling standards according to the Standards for | |
Educational and Psychological Testing (AERA, APA, & | |
NCME Joint Committee on Standards 1999), Personnel | |
Evaluation Standards (Joint Committee on Educational | |
Evaluation Standards 2009), and the US Equal Employment | |
Opportunity Commission’s (EEOC) Uniform Guidelines on | |
Employee Selection Procedures (US Equal Employment | |
Opportunity Commission 2010). The psychometric requirements for instruments used for summative ‘‘employment’’ | |
decisions about faculty are rigorous and appropriate for their | |
purposes. | |
Recommendations. This issue reduces to the leadership and | |
the composition of the faculty committee that accepts the | |
responsibility to develop the scales and reports and/or the | |
external consultant or vendor hired to guide the development | |
process. The psychometric standards for the construction, | |
administration, analysis, and interpretation of scales must be | |
articulated and guided by professionals trained in those | |
standards (AERA, APA, & NCME Joint Committee on Standards | |
1999). As Flashpoint 2 emphasized, if the committee does not | |
contain one or more professors with expertise in psychometrics, then it should be ashamed of itself. That is a prescription | |
for putridity and the previous problem list. Reviewers rarely | |
found any one with these skills on the committees of the | |
institutions surveyed. | |
It is also recommended that all faculty members be given | |
workshops on item writing and scale structure. In the | |
development process, they will be reviewing, selecting, | |
critiquing, adapting, and writing items. Even if faculty are | |
excellent test item writers, that does not mean they can write | |
scale items. | |
The structure and criteria for writing scale items are very | |
different from test items (Berk 2006), not difficult, just different. | |
Even with commercially developed instruments, professors are | |
usually given the option to add up to 10 course-specific items; | |
in other words, they will need to write items. Rules for writing | |
scale items are available in references on scale construction | |
(Netemeyer et al. 2003; Dunn-Rankin et al. 2004; Streiner & | |
Norman 2008; Berk 2006; deVellis 2012). | |
Paper-and-pencil vs. online scale administration
FLASHPOINT 4: The battle between paper-and-pencil versus online administration of student rating
scales is still being fought in medical schools and on | |
many campuses worldwide. Despite an international trend and numerous advantages and | |
improvements in online systems over the past | |
decade, there are faculty who still dig their heels in | |
and institutions that have resisted the conversion. | |
Much has been learned about how to increase | |
response rates, which is a flashpoint by itself, and | |
how to overcome many of the deterrents to adopting | |
an online system. Online administration, analysis, | |
and reporting can be executed in-house or by an | |
out-house vendor specializing in that processing. | |
Comparison of paper-and-pencil and online administration. A detailed examination of the advantages and disadvantages of the two modes of administration according to | |
15 key factors has been presented by Berk (2006). There are | |
major differences between them. Although it was concluded | |
that both are far from perfect, the benefits of the online mode | |
and the improvements in the delivery system with the research | |
and experiences over the past few years exceed the pluses of the | |
paper-based mode. Furthermore, most Net Geners do not | |
know what a pencil is. Unless it is an iPencil, it is not on their | |
radar or part of their mode. | |
The benefits of the online mode include ease of administration, administration flextime, low cost, rapid turnaround | |
time for results, ease of scale revision, and higher quality and | |
greater quantity of unstructured responses (Sorenson & | |
Johnson 2003; Anderson et al. 2005; Berk 2006; Liu 2006; | |
Heath et al. 2007). Students’ concerns with lack of anonymity, | |
confidentiality of ratings, inaccessibility, inconvenience, and | |
technical problems have been eliminated at many institutions. | |
Faculty resistance issues of low response rates and negative | |
bias and lower ratings than paper-based version have been | |
addressed (Berk 2006). Two major topics that still need | |
attention are lack of standardization (Flashpoint 5) and | |
response bias, which tends to be the same for both paper | |
and online. | |
Three online delivery options. Online administration, scoring, | |
analysis, and reporting of student ratings can be handled in | |
three ways: (1) in-house by the department of computer | |
services, IT, or equivalent unit; (2) out-house by a vendor that | |
provides all delivery services for the institution’s ‘‘home-grown’’ scale; or (3) out-house by a vendor that provides all services, plus their own scale or a scale you create from their catalog of items. These options are listed in order of increasing cost. Depending on in-house resources, it is possible to execute the entire processing in a very cost-effective manner. Alternatively, estimates from a variety of vendors should be obtained for the out-house options.
(1) In-house administration. If you have developed or
plan to develop your own scale, you should consider | |
this option. Convene the key players who can make | |
this happen, including administrators and staff from IT | |
or computer services, faculty development, and a | |
testing center, plus at least one measurement expert. | |
A discussion of scale design, scoring, analysis, report | |
design, and distribution can determine initially | |
whether the resources are available to execute the | |
system. Once a preliminary assessment of the resources | |
required has been completed, costs should be estimated for each phase. A couple of meetings can | |
provide enough information to consider the possibility. | |
Your in-house system components, products, and | |
personnel can then be compared to the two options | |
described next. As you go shopping for an online | |
system, at least you will have done your homework and | |
be able to identify what the commercial vendors offer, | |
including qualitative differences, that you cannot execute yourself. Although the cost could be the dealbreaker, you will know all the options available to | |
make an informed final decision. Further, you can | |
always change your system if your stocks plummet, the | |
in-house operation has too many bumps that cannot be | |
squished and ends up in Neverland, or the commercial | |
services do not deliver as promised. | |
(2) Vendor administration with ‘‘home-grown’’ scale.
If outsourcing to a vendor is your preference or you | |
just want to explore this option, but you want to | |
maintain control over your own scale content and | |
structure, there are certain vendors that can online your | |
scale. For some strange reason, they are all located in | |
Madagascar. Kidding. They include CollegeNET (What | |
Do You Think?), ConnectEDU (courseval), and IOTA | |
Solutions (MyClassEvaluation). They will administer | |
your scale online, perform all analyses, and generate | |
reports for different decision makers. Thoroughly | |
compare all of their components with yours. Evaluate | |
the pluses and minuses of each package. | |
Make sure to investigate the compatibility of the | |
packages with your course management system. The | |
choice of the system is crucial to provide the anonymity | |
for students to respond, which can boost response rates | |
(Oliver & Sautter 2005). Most of the vendors’ packages | |
are compatible with Blackboard, WebCT, Moodle, | |
Sakai, and other campus portal systems. | |
There are even free online survey providers, such as | |
Zoomerang (MarketTools 2006), which can be used | |
easily by any instructor without a course management | |
system (Hong 2008). Other online survey software, both | |
free and pay, has been reviewed by Wright (2005). | |
There are specific advantages and disadvantages of the | |
different packages, especially with regard to rating | |
scale structure and reporting score results (Hong 2008). This is a viable online option worth investigating for formative feedback.
(3) Vendor administration and rating scale. If you want a
vendor to supply the rating scale and all of the delivery | |
services, there are several commercial student rating | |
systems you should consider. Examples include Online | |
Course Evaluation, Student Instructional Report II, | |
Course/Instructor Evaluation Questionnaire, IDEA | |
Student Ratings of Instruction, Student Evaluation of | |
Educational Quality, Instructional Assessment System, | |
and Purdue Instructor Course Evaluation Service. | |
Sample forms and lists of services with prices are | |
given on the websites for these scales. | |
This is the simplest solution to the student rating | |
scale online system: Just go buy one. The seven | |
packages are designed for you, Professor Consumer. | |
The items are professionally developed; the scale has | |
usually undergone extensive psychometric analyses to | |
provide evidence of reliability and validity; and there | |
are a variety of services provided, including the scale, | |
online administration, scanning, scoring, and reporting | |
of results in a variety of formats with national norms. | |
For some, you can access a specimen set of rating | |
scales and report forms online. All of the vendors | |
provide a list of services and prices on their websites. | |
Carefully shop around to find the best fit for your | |
faculty and administrator needs and institutional | |
culture. The packages vary considerably in scale | |
design, administration options, report forms, norms, | |
and, of course, cost. | |
Comparability of paper-and-pencil and online ratings. | |
Despite all of the differences between paper-based and | |
online administrations and the contaminating biases that afflict | |
the ratings they produce, researchers have found consistently | |
that online students and their in-class counterparts rate | |
courses and instructors similarly (Layne et al. 1999; Spooner | |
et al. 1999; Waschull 2001; Carini et al. 2003; Hardy 2003; | |
McGee & Lowell 2003; Dommeyer et al. 2004; Avery et al. | |
2006; Benton et al. 2010b; Venette et al. 2010; Perrett 2011; | |
Stowell et al. 2012). The ratings on the structured items are not | |
systematically higher or lower for online administrations. The | |
correlations between online and paper-based global item | |
ratings were 0.84 (overall instructor) and 0.86 (overall course) | |
(Johnson 2003). | |
Although the ratings for online and paper are not identical, | |
with more than 70% of the variance in common, any | |
differences in ratings that have been found are small. | |
Further, interrater reliabilities of ratings of individual items | |
and item clusters for both modalities were comparable (McGee | |
& Lowell 2003), and so were the underlying factor structures | |
(Layne et al. 1999; Leung & Kember 2005). All of these | |
similarities were also found in comparisons between face-to-face and online courses, although response rates were slightly
lower in the online courses (Benton et al. 2010a). | |
Alpha total scale (18 items) reliabilities were similar for | |
paper-based (0.90) and online (0.88) modes when all items | |
appeared on the screen (Peer & Gamliel 2011). Slightly lower | |
coefficients (0.74–0.83) for online displays of one, two, or | |
four items only on the screen were attributable to response | |
bias (Gamliel & Davidovitz 2005; Berk 2010; Peer & | |
Gamliel 2011). | |
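For readers unfamiliar with the coefficient being compared here, Cronbach's alpha is an internal-consistency estimate computed from the item variances and the variance of the total score. The sketch below computes it for simulated 18-item, 5-point ratings; the data are invented for illustration and have no connection to the cited studies.
import numpy as np

def cronbach_alpha(item_scores):
    """Cronbach's alpha for an (n_respondents, n_items) array of ratings."""
    item_scores = np.asarray(item_scores, dtype=float)
    k = item_scores.shape[1]
    item_variances = item_scores.var(axis=0, ddof=1)
    total_variance = item_scores.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

rng = np.random.default_rng(0)
# Simulated ratings: 200 respondents, 18 items on a 1-5 scale, driven by one latent factor.
latent = rng.normal(size=(200, 1))
ratings = np.clip(np.round(3 + latent + rng.normal(scale=0.8, size=(200, 18))), 1, 5)
print(f"alpha = {cronbach_alpha(ratings):.2f}")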
The one exception to the above similarities is the unstructured items, or open-ended comment section. The research | |
has indicated that the flexible time permitted to the onliners | |
usually, but not always, yields longer, more frequent and | |
thoughtful comments than those of in-class respondents | |
(Layne et al. 1999; Ravelli 2000; Johnson 2001, 2003; Hardy | |
2002, 2003; Anderson et al. 2005; Donovan et al. 2006; Venette | |
et al. 2010; Morrison 2011). Typing the responses is reported | |
by students to be easier and faster than writing them, plus it | |
preserves their anonymity (Layne et al. 1999; Johnson 2003). | |
Recommendations. Weighing all of the pluses and minuses | |
in this section strongly suggests that the conversion from a | |
paper-based to online administration system seems worthy of | |
serious consideration by medical schools/colleges and | |
every other institution of higher education using student | |
ratings. When the concerns of the online approach are | |
addressed, its benefits for face-to-face, blended/hybrid, and | |
online/distance courses far outweigh those of the traditional paper-based approach. (NOTE: Online administration should also be
employed for alumni ratings and employer ratings. The costs | |
for these ratings will be a small fraction of the cost of the | |
student rating system.) | |
Standardized vs. unstandardized online scale | |
administration | |
FLASHPOINT 5: Standardized administration | |
procedures for any measure of human or rodent | |
behavior are absolutely essential to be able to | |
interpret the ratings with the same meaning for all | |
individuals who completed the measure. Student | |
rating scales are typically administered online at the | |
end of the semester without regard for any standardization or controls. There do not seem to be any sound psychometric reasons why the administrations are scheduled the way they are. This is,
perhaps, the most neglected issue in the literature | |
and in practice. | |
Importance of standardization. A significant amount of | |
attention has been devoted to establishing standardized | |
times, conditions, locations, and procedures for administering | |
in-class tests and clinical measures, such as the OSCE, as well | |
as out-of-class admissions, licensing, and certification tests. | |
National standards for testing practices require this standardization to assure that students take tests under identical | |
conditions so their scores can be interpreted in the same way, | |
they are comparable from one student or group to another, | |
and they can be compared to norms (AERA, APA, & NCME | |
Joint Committee on Standards 1999). | |
Unfortunately, standardization has been completely | |
neglected in the faculty evaluation literature for the administration of online student rating scales (Berk 2006). This topic | |
was only briefly mentioned in a recent review of the student | |
ratings research (Addison & Stowell 2012). Although the | |
inferences drawn from the scale scores and other measures of | |
teaching effectiveness require the same administration precision as tests, procedures to assure scores will have the same | |
meaning from students completing the scales at the end of the | |
semester have not been addressed in research and practice. | |
Typically, students are given notice that they have 1 or 2 | |
weeks to complete the student ratings form with the deadline | |
before or after the final exam/project. | |
Confounding uncontrolled factors. Since students can complete online rating scales during their discretionary time, there | |
is no control over the time, place, conditions, or any | |
situational factors under which the self-administrations occur | |
(Stowell et al. 2012). Most of these factors were controlled with | |
the paper-and-pencil, in-class administration by the instructor | |
or a student appointed to handle the administration. | |
In fact, in the online mode, there is no way to insure that | |
the real student filled out the form or did not discuss it with | |
someone who already did. It could be a roommate, partner, | |
avatar, alien, student who has never been to class doing a | |
favor in exchange for a pizza, alcohol, or drugs, or all of the | |
preceding. Any of those substitutes would result in fraudulent | |
ratings (Standard 5.6). Bad, bad ratings! Although there is no | |
standardization of the actual administration, at least the written | |
directions given to all students can be the same. Therefore, the | |
procedures that the students follow should be similar if they | |
read the directions. | |
Timing of administration. The timing of the administration | |
can also markedly affect the ratings. For example, if some | |
students complete the scale before the final review and final | |
exam, on the day of the final, or after the exam, their feelings | |
about the instructor/course can be very different. Exposure to | |
the final exam alone can significantly affect ratings, | |
particularly if there are specific items on the scale measuring | |
testing and evaluation methods. It could be argued that the | |
final should be completed in order to provide a true rating of | |
all evaluation methods. | |
Despite a couple of ‘‘no difference’’ studies of paper-and-pencil administrations almost 40 years ago (Carrier et al. 1974;
Frey 1976) and one study examining final exam day administration (Ory 2001), which produced lower ratings, there does | |
not seem to be any agreement among the experts on the best | |
time to administer online scales or on any specific standardization procedures other than directions. | |
What is clear is that whatever time is decided must be the | |
same for all students in all courses; otherwise, the ratings of | |
these different groups of students will not have the same | |
meaning. For example, faculty within a department should | |
agree that all online administrations must be completed before | |
the final or after, but not both. Faculty must decide on the best | |
time to get the most accurate ratings. That decision will also | |
affect the legitimacy of any comparison of the ratings to | |
different norm groups. | |
Standards for standardization. So what is the problem with | |
the lack of standardization? The ratings by students are | |
assumed to be collected under identical conditions according | |
to the same rules and directions. Standardization of the | |
administration and environment provides a snapshot of how
students feel at one point in time. Although their individual | |
ratings will vary, they will have the same meaning. Rigorous | |
procedures for standardization are required by the US | |
Standards for Educational and Psychological Testing (AERA, | |
APA, & NCME Joint Committee on Standards 1999). | |
Groups of students must be given identical instructions, | |
which is possible, administered the scale under identical | |
conditions, which is nearly impossible, to assure the comparability of their ratings (Standards 3.15, 3.19, and 3.20). Only | |
then would the interpretation of the ratings and, in this case, | |
the inferences about teaching effectiveness from the ratings be | |
valid and reliable (Standard 3.19). In other words, without | |
standardization, such as when every student fills out the scale | |
willy-nilly at different times of the day and semester, in | |
different places, under different conditions, using different | |
procedures, the ratings from student to student and professor | |
to professor will not be comparable. | |
Recommendations. Given the limitations of online administration, what can be done to approximate some semblance of | |
standardized conditions or, at least, minimize the extent to | |
which the bad conditions contaminate the ratings? Here are a | |
few options extended from Berk’s (2006) previous suggestions, listed from highest level of standardization and control to | |
lowest level: | |
(1) In-class administration before final for maximum control: Set a certain time slot in class, just like the paper-and-pencil version, for students to complete the forms on their own PC/Mac, iPad, iPhone, iPencil, or other device. The professor should leave the room and have a student execute and monitor the process. Adequate time should be given for students to type comments for the unstructured section of the scale. (NOTE: Not recommended if there are several items or a subscale that measures course evaluation methods, since the final is part of those methods.)
(2) Computer lab time slots before or after final: Set certain time slots in the computer lab or an equivalent location during which students can complete the forms. The controls exercised in the previous option should be followed in the lab. If available, techie-type students should proctor the slots to eliminate distractions and provide technical support for any problems that arise.
(3) One or two days before or after final at students’ discretion: This is the most loosy-goosy option with the least control, albeit, the most popular. Specify a narrow window within which the ratings must be completed, such as one or two days after the final class and before the final exam, or one or two days after the exam before grades are submitted and posted. This gives new meaning to ‘‘storm window.’’
Any of these three options will improve the standardization | |
of current online administration practices beyond the typical | |
1- or 2-week bay window. Experience and research on these | |
procedures will hopefully identify the confounding variables | |
that can affect the online ratings. Ultimately, concrete | |
guidelines to assist faculty in deciding on the most appropriate | |
administration protocol will result. | |
Declaration of interest: The author reports no conflicts of
interest. The author alone is responsible for the content and | |
writing of the article. | |
Top five recommendations | |
After ruminating over these flashpoints, it can be concluded | |
that there are a variety of options within the reach of every | |
medical school/college and institution of higher education to | |
improve its current practices with its source(s) of evidence and | |
administration procedures. Everyone is wrestling with these | |
issues and, although more research is needed to test the | |
options, there are tentative solutions to these problems. As | |
experience and research continue to accumulate, even better | |
solutions will result. | |
There is a lot of activity and discourse on these flashpoints | |
because we know that all of the summative decisions about | |
faculty will be made with or without the best information | |
available. Further, professors who are passionate about | |
teaching will also seek out sources of evidence to guide | |
their improvement. | |
The contribution of this PBW article rests on the value and | |
usefulness of the recommendations that you can convert into | |
action. Without action, the recommendations are just dead | |
words on a page. Your TAKE-AWAYS are the concrete action | |
steps you choose to implement to improve the current state of | |
your teaching assessment system. | |
Here are the top five recommendations framed in terms of | |
action steps: | |
(1) polish your student rating scale, but also start building additional sources of evidence, such as self, peer, and mentor scales, to assess teaching effectiveness;
(2) match your highest quality sources to the specific formative and summative decisions using the 360° MSF model;
(3) review current measures of teaching effectiveness with your faculty and plan specifically how you can improve their psychometric quality;
(4) design an online administration system in-house or out-house with a vendor to conduct the administration and score reporting for your own student rating scale or the one it provides; and
(5) standardize directions, administration procedures, and a narrow window for completion of your student rating scale and other measures of teaching effectiveness.
Taking action on these five can yield major strides in | |
improving the practice of assessing teaching effectiveness and | |
the fairness and equity of the formative and summative | |
decisions made with the results. Just how important is | |
teaching in your institution? Your answer will be expressed | |
in your actions. What can you contribute to make it better than it has ever been? That is my challenge to you.
Notes on contributor | |
RONALD A. BERK, PhD, is Professor Emeritus, Biostatistics and | |
Measurement, and former Assistant Dean for Teaching at the Johns | |
Hopkins University, where he taught for 30 years. He has presented 400 | |
keynotes/workshops and published 14 books, 165 journal articles, and 300 | |
blogs. His professional motto is: ‘‘Go for the Bronze!’’ | |
References | |
Medical/healthcare education | |
Ahmady S, Changiz T, Brommels M, Gaffney FA, Thor J, Masiello I. 2009. | |
Contextual adaptation of the Personnel Evaluation Standards for | |
assessing faculty evaluation systems in developing countries: The | |
case of Iran. BMC Med Educ 9(18), DOI: 10.1186/1472-6920-9-18. | |
Anderson HM, Cain J, Bird E. 2005. Online student course evaluations: | |
Review of literature and a pilot study. Am J Pharm Educ 69(1):34–43. | |
Available from http://web.njit.edu/bieber/pub/Shen-AMCIS2004.pdf. | |
Appling SE, Naumann PL, Berk RA. 2001. Using a faculty evaluation triad to | |
achieve evidenced-based teaching. Nurs Health Care Perspect | |
22:247–251. | |
Barnett CW, Matthews HW. 2009. Teaching evaluation practices in colleges | |
and schools of pharmacy. Am J Pharm Educ 73(6). | |
Barnett CW, Matthews HW, Jackson RA. 2003. A comparison between | |
student ratings and faculty self-ratings of instructional effectiveness. | |
J Pharm Educ 67(4). | |
Berk RA. 2009a. Using the 360 multisource feedback model to evaluate | |
teaching and professionalism. Med Teach 31(12):1073–1080. | |
Berk RA, Naumann PL, Appling SE. 2004. Beyond student ratings: Peer | |
observation of classroom and clinical teaching. Int J Nurs Educ | |
Scholarsh 1(1):1–26. | |
Boerboom TBB, Mainhard T, Dolmans DHJM, Scherpbier AJJA, van | |
Beukelen P, Jaarsma ADC. 2012. Evaluating clinical teachers with the | |
Maastricht clinical teaching questionnaire: How much ‘teacher’ is in | |
student ratings? Med Teach 34(4):320–326. | |
Chenot J-F, Kochen MM, Himmel W. 2009. Student evaluation of a primary | |
care clerkship: Quality assurance and identification of potential for | |
improvement. BMC Med Educ 9(17), DOI: 10.1186/1472-6920-9-17. | |
DiVall M, Barr J, Gonyeau M, Matthews SJ, van Amburgh J, Qualters D, | |
Trujillo J. 2012. Follow-up assessment of a faculty peer observation and | |
evaluation program. Am J Pharm Educ 76(4). | |
Donnon T, Delver H, Beran T. 2010. Student and teaching characteristics | |
related to ratings of instruction in medical sciences graduate programs. | |
Med Teach 32(4):327–332. | |
Elzubeir M, Rizk D. 2002. Evaluating the quality of teaching in medical | |
education: Are we using the evidence for both formative and | |
summative purposes? Med Teach 24:313–319. | |
Hoeks TW, van Rossum HJ. 1988. The impact of student ratings on a new | |
course: The general clerkship (ALCO). Med Educ 22(4):308–313. | |
Jones RF, Froom JD. 1994. Faculty and administration views of problems in | |
faculty evaluation. Acad Med 69(6):476–483. | |
Kidd RS, Latif DA. 2004. Student evaluations: Are they valid measures of | |
course effectiveness? J Pharm Educ 68(3). | |
Maker VK, Lewis MJ, Donnelly MB. 2006. Ongoing faculty evaluations: | |
Developmental gain or just more pain? Curr Surg 63(1):80–84. | |
Martens MJ, Duvivier RJ, van Dalen J, Verwijnen GM, Scherpbier AJ, van der | |
Vleuten. 2009. Student views on the effective teaching of physical | |
examination skills: A qualitative study. Med Educ 43(2):184–191. | |
Mazor K, Clauser B, Cohen A, Alper E, Pugnaire M. 1999. The dependability | |
of students’ rating of preceptors. Acad Med 74:19–21. | |
Pattison AT, Sherwood M, Lumsden CJ, Gale A, Markides M. 2012. | |
Foundation observation of teaching project – A developmental model | |
of peer observation of teaching. Med Teach 34(2):e36–e142. | |
Pierre RB, Wierenga A, Barton M, Branday JM, Christie CD. 2004. | |
Student evaluation of an OSCE in paediatrics at the University of | |
the West Indies, Jamaica. BMC Med Educ 4(22), DOI: 10.1186/1472-6920-4-22.
Schiekirka S, Reinhardt D, Heim S, Fabry G, Pukrop T, Anders S, | |
Raupach T. 2012. Student perceptions of evaluation in undergraduate | |
medical education: A qualitative study from one medical school. | |
BMC Med Educ 12(45), DOI: 10.1186/1472-6920-12-45. | |
Siddiqui ZS, Jonas-Dwyer D, Carr SE. 2007. Twelve tips for peer | |
observation of teaching. Med Teach 29(4):297–300. | |
Stalmeijer RE, Dolmans DH, Wolfhagen IH, Peters WG, van Coppenolle L, | |
Scherpbier AJ. 2010. Combined student ratings and self-assessment | |
provide useful feedback for clinical teachers. Adv Health Sci Educ | |
Theory Pract 15(3):315–328. | |
Stark P. 2003. Teaching and learning in the clinical setting: A qualitative | |
study of the perceptions of students and teachers. Med Educ | |
37(11):975–982. | |
Steinert Y. 2004. Student perceptions of effective small group teaching. Med | |
Educ 38(3):286–293. | |
Sullivan PB, Buckle A, Nicky G, Atkinson SH. 2012. Peer observation of | |
teaching as a faculty development tool. BMC Med Educ 12(26), DOI: | |
10.1186/1472-6920-12-26. | |
Turhan K, Yaris F, Nural E. 2005. Does instructor evaluation by students | |
using a web-based questionnaire impact instructor performance? Adv | |
Health Sci Educ Theory Pract 10(1):5–13. | |
Wellein MG, Ragucci KR, Lapointe M. 2009. A peer review process for | |
classroom teaching. Am J Pharm Educ 73(5). | |
General higher education | |
Abrami PC. 2001. Improving judgments about teaching effectiveness using | |
rating forms. In: Theall M, Abrami PC, Mets LA, editors. The student | |
ratings debate: Are they valid? How can we best use them? (New | |
Directions for Institutional Research, No. 109). San Francisco, CA: | |
Jossey-Bass. pp 59–87. | |
Addison WE, Stowell JR. 2012. Conducting research on student evaluations | |
of teaching. In: Kite ME, editor. Effective evaluation of teaching: A guide | |
for faculty and administrators. pp 1–12. E-book [Accessed 6 June 2012] | |
Available from the Society for the Teaching of Psychology website | |
http://teachpsych.org/ebooks/evals2012/index.php. | |
AERA (American Educational Research Association), APA (American Psychological Association), NCME (National Council on Measurement in Education) Joint Committee on Standards. 1999. Standards for educational and psychological testing. Washington, DC: AERA.
Ali DL, Sell Y. 1998. Issues regarding the reliability, validity and utility of | |
student ratings of instruction: A survey of research findings. Calgary, | |
AB: University of Calgary APC Implementation Task Force on Student | |
Ratings of Instruction. | |
Arreola RA. 2007. Developing a comprehensive faculty evaluation system: | |
A handbook for college faculty and administrators on designing and | |
operating a comprehensive faculty evaluation system. 3rd ed. Bolton, | |
MA: Anker. | |
Avery RJ, Bryan WK, Mathios A, Kang H, Bell D. 2006. Electronic course | |
evaluations: Does an online delivery system influence student | |
evaluations? J Econ Educ 37(1):21–37. | |
Benton SL, Cashin WE. 2012. Student ratings of teaching: A summary | |
of research and literature (IDEA Paper no. 50). Manhattan, KS: | |
The IDEA Center. [Accessed 8 April 2012] Available from http:// | |
www.theideacenter.org/sites/default/files/idea-paper_50.pdf. | |
Benton SL, Webster R, Gross A, Pallett W. 2010a. An analysis of IDEA | |
Student Ratings of Instruction in traditional versus online courses (IDEA | |
Technical Report no. 15). Manhattan, KS: The IDEA Center. | |
Benton SL, Webster R, Gross A, Pallett W. 2010b. An analysis of IDEA | |
Student Ratings of Instruction using paper versus online survey | |
methods (IDEA Technical Report no. 16). Manhattan, KS: The IDEA | |
Center. | |
Berk RA. 1979. The construction of rating instruments for faculty | |
evaluation: A review of methodological issues. J Higher Educ | |
50:650–669. | |
Berk RA. 2005. Survey of 12 strategies to measure teaching effectiveness. | |
Int J Teach Learn Higher Educ 17(1):48–62. Available from http://www.isetl.org/ijtlhe.
Berk RA. 2006. Thirteen strategies to measure college teaching: | |
A consumer’s guide to rating scale construction, assessment, and | |
decision making for faculty, administrators, and clinicians. Sterling, VA: | |
Stylus. | |
Berk RA. 2009b. Beyond student ratings: ‘‘A whole new world, a new | |
fantastic point of view.’’ Essays Teach Excellence 20(1). Available from | |
http://podnetwork.org/publications/teachingexcellence.htm. | |
Berk RA. 2010. The secret to the ‘‘best’’ ratings from any evaluation scale. | |
J Faculty Dev 24(1):37–39. | |
Braskamp LA, Ory JC. 1994. Assessing faculty work: Enhancing individual | |
and institutional performance. San Francisco, CA: Jossey-Bass. | |
Calderon TG, Gabbin AL, Green BP. 1996. Report of the committee on | |
promoting evaluating effective teaching. Harrisonburg, VA: James | |
Madison University. | |
Carini RM, Hayek JC, Kuh GD, Ouimet JA. 2003. College student responses | |
to web and paper surveys: Does mode matter? Res Higher Educ | |
44(1):1–19. | |
Carrier NA, Howard GS, Miller WG. 1974. Course evaluations: When? | |
J Educ Psychol 66:609–613. | |
Cashin WE. 2003. Evaluating college and university teaching: Reflections of | |
a practitioner. In: Smart JC, editor. Higher education: Handbook of | |
theory and research. Dordrecht, the Netherlands: Kluwer Academic | |
Publishers. pp 531–593. | |
Centra JA. 1993. Reflective faculty evaluation: Enhancing teaching and | |
determining faculty effectiveness. San Francisco: Jossey-Bass. | |
Cohen PA, McKeachie WJ. 1980. The role of colleagues in the evaluation of | |
teaching. Improving College Univ Teach 28(4):147–154. | |
Coren S. 2001. Are course evaluations a threat to academic freedom? | |
In: Kahn SE, Pavlich D, editors. Academic freedom and the inclusive | |
university. Vancouver, BC: University of British Columbia Press. | |
pp 104–117. | |
d’Apollonia S, Abrami PC. 1997a. Navigating student ratings of instruction. | |
Am Psychol 52:1198–1208. | |
d’Apollonia S, Abrami PC. 1997b. Scaling the ivory tower, part 1: | |
Collecting evidence of instructor effectiveness. Psychol Teach Rev | |
6:46–59. | |
d’Apollonia S, Abrami PC. 1997c. Scaling the ivory tower, part 2: | |
Student ratings of instruction in North America. Psychol Teach | |
Rev 6:60–76. | |
deVellis RF. 2012. Scale development: Theory and applications. 3rd ed. | |
Thousand Oaks, CA: Sage. | |
Dommeyer CJ, Baum P, Hanna RW, Chapman KS. 2004. Gathering | |
faculty teaching evaluations by in-class and online surveys: Their | |
effects on response rates and evaluations. Assess Eval Higher | |
Educ 29(5):611–623. | |
Donovan J, Mader CE, Shinsky J. 2006. Constructive student feedback: | |
Online vs. traditional course evaluations. J Interact Online Learn | |
5:283–296. | |
Dunn-Rankin P, Knezek GA, Wallace S, Zhang S. 2004. Scaling methods. | |
Mahwah, NJ: Erlbaum. | |
Franklin J. 2001. Interpreting the numbers: Using a narrative to help others | |
read student evaluations of your teaching accurately. In: Lewis KG, | |
editor. Techniques and strategies for interpreting student evaluations | |
(Special issue) (New Directions for Teaching and Learning, No. 87). | |
San Francisco, CA: Jossey-Bass. pp 85–100. | |
Franklin J, Theall M. 1990. Communicating student ratings to decision | |
makers: Design for good practice. In: Theall M, Franklin J, editors. | |
Student ratings of instruction: Issues for improving practice (Special | |
issue) (New Directions for Teaching and Learning, No. 43). San | |
Francisco, CA: Jossey-Bass. pp 75–93. | |
Frey PW. 1976. Validity of student instructional ratings as a function of their | |
timing. J Higher Educ 47:327–336. | |
Freyd M. 1923. A graphic rating scale for teachers. J Educ Res | |
8(5):433–439. | |
Gamliel E, Davidovitz L. 2005. Online versus traditional teaching | |
evaluation: Mode can matter. Assess Eval Higher Educ 30(6): | |
581–592. | |
Gravestock P, Gregor-Greenleaf E. 2008. Student course evaluations: | |
Research, models and trends. Toronto, ON: Higher Education | |
Quality Council of Ontario. E-book [Accessed 6 May 2012] Available | |
from http://www.heqco.ca/en-CA/Research/Research%20Publications/ | |
Pages/Home.aspx. | |
Green BP, Calderon TG, Reider BP. 1998. A content analysis of teaching | |
evaluation instruments used in accounting departments. Issues Account | |
Educ 13(1):15–30. | |
Hardy N. 2002. Perceptions of online evaluations: Fact and fiction. Paper | |
presented at the annual meeting of the American Educational Research | |
Association, April 1–5 2002, New Orleans, LA. | |
Hardy N. 2003. Online ratings: Fact and fiction. In: Sorenson DL, Johnson | |
TD, editors. Online student ratings of instruction (New Directions for | |
Teaching and Learning, No. 96). San Francisco, CA: Jossey-Bass. | |
pp 31–38. | |
Heath N, Lawyer S, Rasmussen E. 2007. Web-based versus paper and | |
pencil course evaluations. Teach Psychol 34(4):259–261. | |
Hong PC. 2008. Evaluating teaching and learning from students’ | |
perspectives in their classroom through easy-to-use online surveys. | |
Int J Cyber Soc Educ 1(1):33–48. | |
Hoyt DP, Pallett WH. 1999. Appraising teaching effectiveness: Beyond | |
student ratings (IDEA Paper no. 36). Manhattan, KS: Kansas State | |
University Center for Faculty Evaluation and Development. | |
Johnson TD. 2001. Online student ratings: Research and possibilities. | |
Invited plenary presented at the Online Assessment Conference, | |
September, Champaign, IL. | |
Johnson TD. 2003. Online student ratings: Will students respond?. | |
In: Sorenson DL, Johnson TD, editors. Online student ratings of | |
instruction (New Directions for Teaching and Learning, no. 96). | |
San Francisco, CA: Jossey-Bass. pp 49–60. | |
Joint Committee on Standards for Educational Evaluation. 2009. The | |
personnel evaluation standards: How to assess systems for evaluating | |
educators. 2nd ed. Thousand Oaks, CA: Corwin Press. | |
Kite ME, editor. 2012. Effective evaluation of teaching: A guide for faculty | |
and administrators. E-book [Accessed 6 June 2012] Available from the | |
Society for the Teaching of Psychology website http://teachpsych.org/ | |
ebooks/evals2012/index.php. | |
Knapper C, Cranton P, editors. 2001. Fresh approaches to the evaluation | |
of teaching (New Directions for Teaching and Learning, no. 88). | |
San Francisco, CA: Jossey-Bass. pp 19–29. | |
Layne BH, DeCristoforo JR, McGinty D. 1999. Electronic versus | |
traditional student ratings of instruction. Res Higher Educ | |
40(2):221–232. | |
Leung DYP, Kember D. 2005. Comparability of data gathered from | |
evaluation questionnaires on paper through the Internet. Res Higher | |
Educ 46:571–591. | |
Liu Y. 2006. A comparison of online versus traditional student evaluation of | |
instruction. Int J Instr Technol Distance Learn 3(3):15–30. | |
MarketTools. 2006. Zoomerang: Easiest way to ask, fastest way to know. | |
[Accessed 17 July 2012] Available from http://info.zoomerang.com. | |
Marsh HW. 2007. Students’ evaluations of university teaching: | |
Dimensionality, reliability, validity, potential biases and usefulness. | |
In: Perry RP, Smart JC, editors. The scholarship of teaching and learning | |
in higher education: An evidence-based perspective. Dordrecht, the | |
Netherlands: Springer. pp 319–383. | |
McGee DE, Lowell N. 2003. Psychometric properties of student ratings of | |
instruction in online and on-campus courses. In: Sorenson DL, Johnson | |
TD, editors. Online student ratings of instruction (New Directions for | |
Teaching and Learning, no. 96). San Francisco, CA: Jossey-Bass. | |
pp 39–48. | |
Morrison R. 2011. A comparison of online versus traditional student end-of-course critiques in resident courses. Assess Eval Higher Educ
36(6):627–641. | |
Netemeyer RG, Bearden WO, Sharma S. 2003. Scaling procedures. | |
Thousand Oaks, CA: Sage. | |
Oliver RL, Sautter EP. 2005. Using course management systems to | |
enhance the value of student evaluations of teaching. J Educ Bus | |
80(4):231–234. | |
Ory JC. 2001. Faculty thoughts and concerns about student ratings. | |
In: Lewis KG, editor. Techniques and strategies for interpreting student | |
evaluations (Special issue) (New Directions for Teaching and Learning, | |
No. 87). San Francisco, CA: Jossey-Bass. pp 3–15. | |
Ory JC, Ryan K. 2001. How do student ratings measure up to a new validity | |
framework?. In: Theall M, Abrami PC, Mets LA, editors. The student | |
ratings debate: Are they valid? How can we best use them? (Special | |
issue) (New Directions for Institutional Research, 109). San Francisco, | |
CA: Jossey-Bass. pp 27–44. | |
Peer E, Gamliel E. 2011. Too reliable to be true? Response bias as a | |
potential source of inflation in paper and pencil questionnaire | |
reliability. Practical Assess Res Eval 16(9):1–8. Available from http:// | |
pareonline.net/getvn.asp?v=16&n=9.
Perrett JJ. 2011. Exploring graduate and undergraduate course evaluations | |
administered on paper and online: A case study. Assess Eval Higher | |
Educ 1–9, DOI: 10.1080/02602938.2011.604123. | |
Ravelli B. 2000. Anonymous online teaching assessments: Preliminary | |
findings. [Accessed 12 June 2012] Available from http://www.edrs.com/ | |
DocLibrary/0201/ED445069.pdf. | |
Seldin P. 1999. Current practices – good and bad – nationally. In: Seldin P & | |
Associates Changing practices in evaluating teaching: A practical guide | |
to improved faculty performance and promotion/tenure decisions. | |
Bolton, MA: Anker. 1–24. | |
Seldin P. 2006. Building a successful evaluation program. In: Seldin P & | |
Associates Evaluating faculty performance: A practical guide to | |
assessing teaching, research, and service. Bolton, MA: Anker 1–19. | |
Seldin P, Associates, editors. 2006. Evaluating faculty performance: A | |
practical guide to assessing teaching, research, and service. Bolton, MA: | |
Anker. pp 201–216. | |
Sorenson DL, Johnson TD, editors. 2003. Online student ratings of | |
instruction (New Directions for Teaching and Learning, no. 96). | |
San Francisco, CA: Jossey-Bass. | |
Spooner F, Jordan L, Algozzine B, Spooner M. 1999. Student ratings of | |
instruction in distance learning and on-campus classes. J Educ Res | |
92:132–140. | |
Stehle S, Spinath B, Kadmon M. 2012. Measuring teaching effectiveness: | |
Correspondence between students’ evaluations of teaching and | |
different measures of student learning. Res Higher Educ. DOI: | |
10.1007/s11162-012-9260-9. | |
Stowell JR, Addison WE, Smith JL. 2012. Comparison of online and | |
classroom-based student evaluations of instruction. Assess Eval Higher | |
Educ 37(4):465–473. | |
Strategy Group. 2011. National strategy for higher education to 2030 (Report of the Strategy Group). Dublin, Ireland: Department of Education and Skills, Government Publications Office. [Accessed 17 July 2012] Available from http://www.hea.ie/files/files/DES_Higher_Ed_Main_Report.pdf.
Streiner DL, Norman GR. 2008. Health measurement scales: A practical | |
guide to their development and use. 4th ed. New York: Oxford | |
University Press. | |
Surgenor PWG. 2011. Obstacles and opportunities: Addressing the | |
growing pains of summative student evaluation of teaching. Assess | |
Eval Higher Educ 1–14, iFirst Article. DOI: 10.1080/ | |
02602938.2011.635247. | |
Svinicki M, McKeachie WJ. 2011. McKeachie’s teaching tips: Strategies, | |
research, and theory for college and university teachers. 13th ed. | |
Belmont, CA: Wadsworth. | |
Theall M, Feldman KA. 2007. Commentary and update on Feldman’s (1997) | |
‘‘Identifying exemplary teachers and teaching: Evidence from student | |
ratings’’. In: Perry RP, Smart JC, editors. The teaching and learning in | |
higher education: An evidence-based perspective. Dordrecht, the | |
Netherlands: Springer. pp 130–143. | |
Theall M, Franklin JL. 1990. Student ratings in the context of | |
complex evaluation systems. In: Theall M, Franklin JL, editors. | |
Student ratings of instruction: Issues for improving practice (New | |
Directions for Teaching and Learning, no. 43). San Francisco, CA: | |
Jossey-Bass. pp 17–34. | |
Theall M, Franklin JL. 2000. Creating responsive student ratings systems to | |
improve evaluation practice. In: Ryan KE, editor. Evaluating teaching in | |
higher education: A vision for the future (Special issue) (New Directions | |
for Teaching and Learning, no. 83). San Francisco, CA: Jossey-Bass. | |
pp 95–107. | |
Theall M, Franklin JL. 2001. Looking for bias in all the wrong places: | |
A search for truth or a witch hunt in student ratings of instruction?. | |
In: Theall M, Abrami PC, Mets LA, editors. The student ratings | |
debate: Are they valid? How can we best use them? (New Directions | |
for Institutional Research, no. 109). San Francisco, CA: Jossey-Bass. | |
pp 45–56. | |
US Equal Employment Opportunity Commission (EEOC). 2010. Employment | |
tests and selection procedures. [Accessed 20 August 2012] Available from | |
http://www.eeoc.gov/policy/docs/factemployment_procedures.html. | |
Venette S, Sellnow D, McIntire K. 2010. Charting new territory: Assessing | |
the online frontier of student ratings of instruction. Assess Eval Higher | |
Educ 35:101–115. | |
Waschull SB. 2001. The online delivery of psychology courses: Attrition, | |
performance, and evaluation. Comput Teach 28:143–147. | |
Wright KB. 2005. Researching internet-based populations: Advantages and disadvantages of online survey research, online questionnaire authoring software packages, and web survey services. J Comput Mediated Commun 10(3). Available from http://jcmc.indiana.edu/vol10/issue3/wright.html.
Yarbrough DB, Shulha LM, Hopson RK, Caruthers FA. 2011. The | |
program evaluation standards: A guide for evaluators and evaluation | |
users. 3rd ed. Thousand Oaks, CA: Sage. | |
Views from below: Students’ perceptions of teaching practice evaluations and stakeholder roles
Lungi Sosibo
Cape Peninsula University of Technology
Article in Perspectives in Education · December 2013
COLLEGE TEACHING, 60: 48–55, 2012
Copyright © Taylor & Francis Group, LLC
ISSN: 8756-7555 print / 1930-8299 online
DOI: 10.1080/87567555.2011.627896
Predicting Student Achievement in University-Level | |
Business and Economics Classes: Peer Observation | |
of Classroom Instruction and Student Ratings of | |
Teaching Effectiveness | |
Craig S. Galbraith | |
University of North Carolina Wilmington | |
Gregory B. Merrill | |
St Mary’s College of California | |
We examine the validity of peer observation of classroom instruction for purposes of faculty | |
evaluation. Using both a multi-section course sample and a sample of different courses across | |
a university’s School of Business and Economics we find that the results of annual classroom | |
observations of faculty teaching are significantly and positively correlated with student learning | |
outcome assessment measures. This finding supports the validity of classroom observation as | |
an assessment of teaching effectiveness. The research also indicates that student ratings of | |
teaching effectiveness (SETEs) were less effective at measuring student learning than annual | |
classroom observations by peers. | |
There is no question that teaching effectiveness is a very | |
personal, highly complex, and ever changing process involving a multitude of different skills and techniques. Teaching | |
is also part of the mission of every institution of higher learning, although certainly the weightings between teaching and | |
other components, such as scholarship, service, and regional | |
engagement, may vary between individual campuses. As the | |
primary institutional service providers for the core mission of | |
teaching, the faculty’s teaching effectiveness must be evaluated for various personnel decisions, such as promotion, | |
tenure, and retention. Today, most universities systematically | |
use a combination of peer evaluations and student ratings. | |
The notion of peer evaluations has evolved significantly | |
since the 1980s and now represents a relatively broad definition that includes both direct classroom observation and | |
review of a faculty member’s teaching portfolio of syllabi, | |
exam samples, and possibly other data points, such as statements of teaching philosophy and reflective reactions to student feedback. As Yon, Burnap, and Kohut (2002) observe, | |
“the expanding use of peers in the evaluation of teaching | |
is part of a larger trend in postsecondary education toward | |
a more systematic assessment of classroom performance” | |
(104). In fact, there now exists a broad normative literature describing the underlying theories, proposed protocols, | |
and content breadth of comprehensive peer evaluations (e.g., | |
Centra 1993; Cavanagh 1996; Malik 1996; Hutchings 1996,
1998; Bernstein and Edwards 2001; Bernstein et al. 2006;
Arreola 2007; Chism 2007; Bernstein 2008). | |
Of all the elements in a typical university peer evaluation | |
process, direct classroom observation continues to be one | |
of the more controversial. Not only can issues of peer bias, | |
observer training, and classroom intrusion be raised (e.g., | |
Cohen and McKeachie 1980; DeZure 1999; Yon et al. | |
2002; Costello et al. 2001; Arreola 2007; Courneya, Pratt, | |
and Collins 2007), but there remains a fundamental debate | |
whether classroom observation is most valid for formative | |
purposes in assisting faculty to improve their teaching effectiveness or for evaluative purposes in providing university | |
administrators and faculty colleagues useful data for personnel decisions (e.g., Cohen and McKeachie 1980; Centra
1993; Shortland 2004; Peel 2005; Chism 2007). In practice, | |
most universities that use classroom observation for evaluative purposes generally use observations from a single class | |
“visit,” and that assessment is then assumed to reflect an evaluation of the faculty member’s overall teaching ability during | |
that particular time period, or at least until another “visit” is | |
conducted. Despite the intermittent nature of some aspects of | |
peer evaluations, surveys tend to support the argument that | |
at least the faculty themselves believe that peer evaluations | |
can be an effective measure of teaching effectiveness (e.g., | |
Peterson 2000; Yon et al. 2002; Kohut, Burnap, and Yon | |
2007). | |
While faculty may believe that peer evaluations and classroom observations, if done properly, are valid measures of | |
teaching effectiveness, it is difficult to draw any conclusion | |
at all from empirical validity studies of peer evaluations and | |
classroom observation. As with any instructional related metric used for faculty personnel decisions, the argument for, or | |
against, the validity of peer evaluations should be based upon | |
convincing evidence that they indeed measure teaching effectiveness or student learning. As Cohen and McKeachie | |
(1980) succinctly noted early in this debate, “clearly what | |
is needed are studies demonstrating the validity of colleague | |
ratings against other criteria of teaching effectiveness. One | |
possibility would be to relate colleague ratings to student | |
achievement” (149). | |
In spite of Cohen and McKeachie’s (1980) call for more | |
empirical research tied to student achievement, this type of | |
validity testing for peer evaluations or classroom observation | |
has simply not yet occurred. To date, almost all the empirical arguments for, or against, the validity of peer evaluations | |
and classroom observation are based upon their correlations | |
with student evaluations of teaching effectiveness (SETEs) | |
or some other purported measure of teaching excellence, | |
such as teaching awards. Feldman’s (1989b) meta-analysis | |
of these types of studies, for example, reports a mean correlation between peer evaluations and student ratings of 0.55, | |
with correlations ranging from 0.19 to 0.84. Empirical results since Feldman’s meta-analysis report similar correlations (e.g., Kremer 1990; Centra 1994). In general, higher | |
correlations with SETEs are found when peers examined a | |
complete teaching portfolio, and therefore may have been | |
influenced by student evaluations included in the portfolio, | |
while the lower correlations were from studies involving primarily classroom observations (Burns 1998). | |
While this line of research is interesting, given the fact that | |
SETEs themselves are often challenged as valid measures of | |
teaching effectiveness, studies that correlate peer evaluations | |
with SETEs simply provide little or no insight regarding | |
the validly of peer evaluations and classroom observations | |
for purposes of assessing faculty teaching effectiveness. The | |
validity of SETEs as a measure of teaching effectiveness has | |
been challenged for a number of reasons. | |
First, early research indicates only a moderate amount | |
of statistical variation in independent and objective measures of teaching effectiveness is explained by SETE
scores—depending on the meta-analysis study, between
about 4% and 20% for the typical "global" item on SETE instruments (Cohen, P. 1981, 1982, 1983; Costin 1987; Dowel
and Neal 1982, 1983; McCallum 1984; Feldman 1989a, | |
2007)—with many of the studies finding validity within the | |
“weak” category of scale criterion validity suggested by Cohen, J. (1969, 1981)1. | |
Second, it has been noted that the vast majority of this | |
early SETE research relied upon data from introductory undergraduate college courses at research institutions taught by | |
teaching assistants (TAs) following a textbook or departmentally created lesson plan. These types of TA-taught introductory classes, however, only account for a small percentage
of a university’s total course offerings, and may not be at | |
all representative for non-doctorate granting colleges. In addition, as Taylor (2007) notes, it is in the more advanced | |
core, elective, and graduate courses where faculty members | |
have the greatest flexibility over pedagogical style, course | |
content, and assessment criteria—the factors most likely to | |
drive classroom learning. In fact, recent empirical research | |
has indicated a possible negative, or negatively bi-modal, relationship between SETEs and student achievement in more | |
advanced university courses (Carrell and West, 2010; Galbraith, Merrill, and Kline, 2011). | |
Third, in the past two decades a number of articles have | |
appeared that specifically challenge various validity related | |
aspects of SETEs (e.g., Balam and Shannon, 2010; Campbell | |
and Bozeman 2008; Davies et al. 2007; Emery, Kramer, and | |
Tian 2003; Pounder 2007; Langbein 2008; Carrell and West | |
2010). These include arguments that student perceptions of | |
teaching are notoriously subject to various types of manipulation, such as the often debated “grading leniency” hypothesis, or even giving treats such as “chocolate candy” prior | |
to the evaluation (e.g., Blackhart et al. 2006; Bowling 2008; | |
Boysen 2008; Felton, Mitchell, and Stinson 2004; Youmans | |
and Jee 2007). Other research has demonstrated that student | |
ratings are influenced by race, gender, and cultural biases | |
as well as various “likability and popularity” attributes of | |
the instructor, such as physical looks and “sexiness” (e.g., | |
Abrami, Levanthal, and Perry 1982; Ambady and Rosenthal | |
1993; Anderson and Smith 2005; Atamian and Ganguli 1993; | |
Buck and Tiene 1989; Davies et al. 2007; Felton, Mitchell, | |
and Stinson 2004; McNatt, 2010; Naftulin, Ware, and Donnelly 1973; Riniolo et al. 2006; Smith 2007; Steward and | |
Phelps 2000). | |
The lack of empirical studies linking classroom observations by peers to student achievement combined with the continuing questions surrounding the overall validity of SETEs | |
as an indicator of teaching effectiveness clearly underlines | |
the need for continued research as to how faculty members | |
1Cohen, J. (1969, 1981) refers to r = 0.10 (1.0% variance explained) as a small effect, r = 0.30 (9.0% variance explained) as a medium effect, and r = 0.50 (25.0% variance explained) as a large effect. Many researchers have used an r < 0.30 (less than 9% variance explained) to signify a "small" effect for purposes of testing scale validity (e.g., Barrett et al. 2009; Hon et al., 2010; Varni et al. 2001; Whitfield et al. 2006).
are evaluated. In this study, we directly examine issues surrounding the validity of peer classroom observations in relationship to student learning. Our analysis differs from previous empirical efforts in several respects. First, we investigate | |
the validity of classroom peer observations by using standardized learning outcome measures set by an institutional | |
process rather than simply correlating peer evaluations with | |
SETEs. Second, our sample of advanced but required core | |
undergraduate and graduate courses represents a mid-range | |
of content control by individual instructors. Third, we compare the explanatory power of peer evaluation ratings with | |
SETEs, and fourth, we have both part-time instructors and | |
full-time faculty members in our sample. This allows for a | |
test regarding the possible impact of independence in selecting which “peers” observe a faculty member’s classroom | |
instruction. | |
Data | |
The data come from courses taught by thirty-four different | |
faculty at a “School of Business and Economics” for a private | |
university located in a large urban region. Classes are offered at both the undergraduate and graduate (masters) level. | |
Similar to many urban universities, a number of adjunct or | |
part-time instructors are used to teach courses. Some of the | |
adjunct instructors hold terminal degrees, however, and are | |
associated with other colleges in the region. Those part-time | |
instructors not holding terminal degrees would be considered | |
“professionally qualified” under standards set by the Association to Advance Collegiate Schools of Business (AACSB). | |
The university would be classified as a non-research intensive | |
institution offering masters degrees in the Carnegie Foundation classification, with a mission that is clearly “teaching” | |
in orientation. | |
Courses in the sample include the disciplines of marketing, management, leadership, finance, accounting, statistics, | |
and economics, with 48% of the sample being graduate | |
courses. Sixty percent of the sample courses were taught | |
by full-time instructors. Average class size is 16 students. | |
Measures | |
Teaching effectiveness—Achievement of student | |
learning outcomes (ACHIEVE) | |
Encouraged by the guidelines of various accrediting agencies, the School has used course learning outcomes for several years. Learning outcomes are established by a faculty | |
committee for each core and concentration required course | |
within the School. There is an average of six to ten learning outcomes per course, and these learning outcomes are | |
specifically identified in the syllabus of each course. | |
Recently the School has invested substantial time and | |
resources in revising and quantifying its learning outcome | |
assessment process. Quantified assessment of course learning outcome attainment by students is measured by a stan- | |
dardized student learning outcome test. The School’s student | |
learning outcome exams are developed individually for each | |
course in each program by a committee of content experts in | |
the subject area, with four questions designed to assess each | |
of the stated learning outcomes. Student outcome exams go | |
beyond simple final exam questions in that they are institutionally agreed upon and formally tied to programmatic objectives. Approximately one-third of the School’s core and | |
required courses are given student learning outcome exams | |
at the present time. | |
Student outcome exams are administered to every student | |
in every section of the course being assessed, regardless of | |
instructor and delivery mode. The same student outcome | |
assessment is used for all sections of the same course, and | |
instructors are not allowed to alter the questions. Student | |
learning outcome exams are given at the end of the course | |
period. Since there are four questions per learning outcome, | |
the exams are all scored on a basis of zero (0) to four (4) | |
points per learning outcome. In the present study, for the | |
ACHIEVE score we use the mean score of all the learning | |
outcome questions on that particular course exam. | |
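As a rough illustration of the scoring arithmetic described above, the short sketch below computes a section-level ACHIEVE value as the mean of the 0 to 4 learning-outcome scores. This is a minimal sketch only; the nested-list data layout and the variable names are hypothetical and are not taken from the study.
# A minimal sketch, assuming each student's exam has been reduced to one 0-4
# score per learning outcome; the layout and names below are hypothetical.
from statistics import mean

section_scores = [
    [3, 4, 2, 3, 4, 3],   # student 1: scores on six learning outcomes
    [2, 3, 3, 2, 4, 2],   # student 2
    [4, 4, 3, 3, 3, 4],   # student 3
]

# ACHIEVE for the section: mean of all learning-outcome scores on the exam.
achieve = mean(score for student in section_scores for score in student)
print(f"ACHIEVE = {achieve:.2f} on the 0-4 scale")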
Depending upon the course and learning outcome, student learning outcome exams consist of multiple choice, or | |
occasionally, short essay questions. For the short essay questions a grading rubric is created so that there is consistency | |
in scoring across all sections of a particular course. Less than | |
10% of all student learning outcome exam content is essay, | |
with over 90% multiple choice. Exam design and rigor is | |
specifically modeled after professional certification exams | |
such as the Certified Public Accounting (CPA) exam. This | |
type of assessment data directly related to carefully articulated “course learning outcomes” is exactly what McKeachie | |
(1979) referred to when he noted, “we take teaching effectiveness to be the degree to which one has facilitated student | |
achievement of education goals” (McKeachie 1979, 385). | |
Although the student learning outcome exam questions | |
are designed to be the same format, the same level of difficulty, and all scored on a 0 to 4 scale, for the full cross-sectional sample the ACHIEVE measure does come from
assessments for different courses using different questions. | |
For our full sample analysis we therefore dichotomize the | |
student learning data based upon the median (low student | |
achievement v. high student achievement). Dichotomizing | |
the outcome variable using the median is common in these | |
situations when outcome data come from a relatively small | |
cross-sectional sample, and there is not sufficient sample | |
size to accurately calculate multiple means to normalize the | |
outcome data across the different categories in the sample (e.g., Bolotin 2006; Baarveld, Kollen, Groenier 2007; | |
Mazumdar and Glassman 2000; Muennig, Sohler, and Mahato 2007; Muthén and Speckart 1983). In addition, the implied binary benchmarking of teaching effectiveness for faculty across different departments is common in practice. Most | |
obvious are tenure and promotion decisions (yes or no) for | |
full-time faculty at smaller teaching-driven colleges, annual | |
contract renewals for non-tenure track full-time and part-time | |
teaching lecturers, faculty teaching award nominations, and | |
the formal use of binary assessment metrics of faculty teaching effectiveness by some institutions (e.g., Glazerman et al., | |
2010). Not surprisingly, faculty themselves often tend to informally categorize colleagues (or themselves) as effective or | |
“good” teachers versus being less effective in the classroom | |
(e.g., Fetterley 2005; Sutkin et al. 2008). However, when examining multiple sections of a single course using exactly | |
the same set of learning outcome questions, we use the raw | |
ACHIEVE score in our analysis. | |
Classroom observation (PEER) | |
In our sample, faculty members are required to undergo | |
one classroom observation per year. The classroom observation procedure is typical to many universities—the peer | |
observer reviews the syllabus and arranges the time to visit | |
the class. Although faculty training in peer evaluation processes is often recommended (e.g., Cohen and McKeachie | |
1980; Bernstein et al. 2006, Chism 2007), in our sample faculty peer classroom observers were not provided any specific | |
training in observation techniques. While some universities | |
suggest multiple observers, our sample university required | |
only one observer per instructor. | |
An “evaluation” form is used where the classroom observer checks/scores various questions related to ten different | |
categories of teaching: class meets stated outcomes, level of | |
student understanding, enthusiasm for teaching, sensitivity to | |
student needs, giving clear explanations, use of instructional | |
material, teaching methods and pacing, knowledge of subject, clarity of syllabus, and the course’s assessment process. | |
The last two categories are from review of the syllabus. Comments can also be added. After reading the submitted written | |
observation form, the senior administrator gave a “class observation” score between “1” (low) and “7” (high) based upon | |
the scoring and information in the form. For this study we | |
used this numerical score. In our sample, the numerical classroom peer observation score ranged between “3” and “7”. | |
There was an important difference in the classroom observation process for part-time faculty versus full-time faculty. | |
Full-time faculty could generally request which colleague | |
observed his or her class, with an obvious possible bias toward requesting friends or colleagues who might provide | |
more favorable comments. In contrast, for part-time faculty | |
the classroom observer was appointed by the department | |
chairperson rather than requested by the instructor. | |
It should be noted that there are certainly other possible | |
differences between full-time faculty and part-time faculty, | |
such as tenure status, types of courses taught and terminal | |
degree education. However, in our sample we feel that the | |
most likely explanation for any differences between the ability of the peer evaluation ratings of part-time versus full-time | |
faculty to explain student achievement would come from differences in the “peer” selection process; that is, controlling | |
for the “peer selection bias” commonly mentioned in the | |
literature. In fact, no apparent bias in the peer evaluation | |
score was noted for the part-time faculty across a number | |
of variables. For example, there was no significant difference in the mean peer evaluation score between part-time | |
faculty with terminal degrees versus those without terminal | |
degrees. Similarly, although full-time faculty taught a greater | |
percentage of graduate classes versus part-time faculty there | |
was no significant difference in the peer evaluation score for | |
part-time faculty that taught graduate classes versus undergraduate classes. | |
Student perception of teaching effectiveness | |
(SETE) | |
As with most universities, student course evaluations are | |
based upon multiple item forms that gather student perceptions, with several questions directly related to perceptions of | |
the instructor’s skill. We used the comprehensively worded | |
“global” item (SETE Global Instructor) asking students to | |
rate the instructor with the wording, “overall, I rate the | |
instructor of this course an excellent teacher.” Most SETE | |
scales use such a final “global” question, and from the authors’ experience it is this question that tends to hold the | |
most weight in faculty performance evaluations. | |
Control variables | |
As control variables we used class size, whether the course | |
was a graduate course, and delivery method (on-site versus | |
distance). Class size appears to be a particularly important | |
control variable. Zietz and Cochran (1997) found a negative | |
relationship between class size and test results, while Lopus | |
and Maxwell (1995) found a positive relationship in business | |
related classes. Pascarella and Terenzini (2005) argue that the | |
connection is still unknown. A more recent large-scale study | |
of science classes by Johnson (2010) indicates that while | |
class size negatively impacts student learning (as measured | |
by grades), the impact diminishes as class size increases. | |
ANALYSIS | |
We model the analysis close to the actual practice in universities. In our sample we have ACHIEVE, SETE, and the | |
control variables for forty-six classes taught by thirty-four | |
faculty within a one-year period. We use the faculty member’s annual classroom observation scores (PEER) from a | |
single “face-to-face” classroom visit that is closest to the | |
one-year period of our class-specific data2. The other important component of assessing teaching effectiveness in practice would be the collection of student ratings for the various | |
classes during the time period. | |
2We only have the numerical score for a faculty’s peer evaluation, not the
specific class it came from. | |
Within our sample, the bivariate correlation between | |
PEER and SETE is 0.43. This directly compares with | |
Feldman’s (1989b) meta-analysis report of a mean correlation between peer reviews and student evaluations of 0.55. | |
Since research indicates that ratings from direct class observation have somewhat lower correlations with SETEs than | |
for broad peer evaluations (e.g, Burns 1998), the 0.43 correlation between PEER and SETE suggests our sample is | |
probably representative. | |
Ideally, the best test of validity would involve multiple | |
sections of the same course, taught by different instructors, | |
using a common measurement of student performance. This | |
has been noted by several authors. For example, in their | |
discussion of the need to establish peer evaluation validity, | |
Cohen and McKeachie (1980) write, “this would require a | |
multi-section course with a standard post-term achievement | |
measures, such an endeavor would prove valuable for assessing the validity of colleague ratings” (149). In our sample, | |
one course (a graduate finance class) had a sufficient number of different instructors (N = 5) to calculate a correlation | |
between the faculty member’s annual classroom observation | |
score (PEER) and student learning outcomes (ACHIEVE)3. | |
All five of the instructors were full-time faculty. Since the | |
student learning outcome exam for this particular finance | |
course was the same across all sections, we could use the | |
raw scores in this analysis. The bivariate correlation between | |
PEER and ACHIEVE for this particular multi-section course | |
was 0.675 (p < 0.10, one-tailed), statistically significant and | |
in the expected direction in spite of the very small sample | |
size. On the other hand, for this one multi-section course | |
analysis, the bivariate correlation between student evaluation of teaching (SETE) and ACHIEVE was only 0.289; a | |
positive relationship but not statistically significant. In fact, | |
the amount of variance (8.26%) in student achievement explained by SETE in our analysis is very similar to many | |
of the SETE validity studies reviewed by Feldman (1989a, | |
2007) and falls within the “weak” category of scale criterion | |
validity suggested by Cohen, J. (1969, 1981). On the other | |
hand, the faculty member’s annual course observation score | |
(PEER) explains 45.6% of the variance in student achievement in this sample, and therefore falls within the “strong” | |
category of scale criterion validity. Thus, within this well controlled, albeit small, multi-section case, the faculty member’s | |
annual classroom observation score explained a much higher | |
percentage of student achievement than their SETEs. Given | |
the small sample size, these results should certainly be interpreted cautiously; however, it should be noted that Feldman’s
(1989a, 2007) often cited meta-analyses of SETE validity
also include research with only five instructors/sections in
their multi-section samples. | |
3The next largest multi-section course in our sample had only three | |
different instructors, and they were a combination of part-time and full-time | |
faculty. | |
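For readers who want to check the variance-explained figures quoted above, the snippet below simply squares the reported bivariate correlations, since r squared is the share of variance explained (the same convention behind Cohen's effect-size benchmarks cited earlier). This is illustrative arithmetic only, not the authors' analysis; the small gap from the reported 8.26% presumably reflects rounding of the published correlation.
# Variance explained by a bivariate correlation is r squared (illustration only).
correlations = {
    "PEER vs. ACHIEVE": 0.675,   # reported for the multi-section finance course
    "SETE vs. ACHIEVE": 0.289,
}

for label, r in correlations.items():
    print(f"{label}: r = {r:.3f}, variance explained = {r ** 2:.1%}")

# 0.675**2 is about 45.6% ("strong" by Cohen's benchmarks); 0.289**2 is about 8.4%,
# close to the paper's 8.26%, presumably computed from an unrounded correlation.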
TABLE 1
Binary Logistic Regression Analysis—Explaining Student Achievement (ACHIEVE)

Variables            Pooled-Sample    Full-Time Faculty    Part-Time Faculty
                     Regression       Regression           Regression
Constant             −7.738           −8.493               −13.847
Class Size           −0.117∗∗         −0.122∗              −0.077
Online Class          0.874            1.771∗              −0.120
Graduate Class        2.610∗∗∗         3.428∗∗              2.214∗
SETE                  1.178∗           1.351                1.139
PEER                  0.598∗           0.435                1.810∗∗
Nagelkerke R2         0.348            0.487                0.374
Cox and Snell R2      0.258            0.363                0.276
N                    46               28                   18

Note: ∗∗∗ p < 0.01; ∗∗ p < 0.05; ∗ p < 0.10
We are also interested in comparing the relationship between the two measures commonly used to evaluate a faculty | |
member’s teaching effectiveness (SETEs and PEER) and our | |
independent measure of student achievement (ACHIEVE) | |
across the full range of courses. This is important since most | |
universities compare, either directly or indirectly, a faculty | |
member’s teaching evaluation assessments with other faculty members across departments and schools during annual | |
review, tenure, and promotion decision discussions. | |
For this analysis, ACHIEVE was the dependent variable, while PEER, SETE, and the control variables were | |
independent variables. As previously discussed, for this | |
cross-sectional analysis we used the bivariate measure of | |
ACHIEVE, “high student achievement” and “low student | |
achievement”—the appropriate regression technique is therefore logistic regression. We estimate binary logistic regression models for the full pooled sample, and both the full-time | |
and part-time faculty sub-samples. Table 1 reports the results | |
of this analysis. | |
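The sketch below mirrors the analysis design just described, not the authors' actual code or data: ACHIEVE is split at its median into high and low achievement, and a binary logistic regression is fit with PEER, SETE, and the control variables as predictors. The synthetic data, the column names, and the choice of statsmodels are all assumptions made for illustration.
# A minimal sketch of the modeling step (synthetic data; hypothetical column names).
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 46  # pooled sample size reported in Table 1

df = pd.DataFrame({
    "achieve_raw": rng.uniform(1.5, 3.5, n),   # mean learning-outcome score (0-4 scale)
    "class_size": rng.integers(8, 30, n),
    "online": rng.integers(0, 2, n),           # 1 = online/distance delivery
    "graduate": rng.integers(0, 2, n),         # 1 = graduate course
    "sete": rng.uniform(2.5, 5.0, n),          # global student-rating item
    "peer": rng.integers(3, 8, n),             # classroom observation score (3-7)
})

# Dichotomize student achievement at the median: 1 = high, 0 = low.
df["achieve_high"] = (df["achieve_raw"] > df["achieve_raw"].median()).astype(int)

# Binary logistic regression of high/low achievement on the two evaluation
# metrics plus the control variables.
X = sm.add_constant(df[["class_size", "online", "graduate", "sete", "peer"]])
result = sm.Logit(df["achieve_high"], X).fit(disp=False)
print(result.summary())
Note that statsmodels reports McFadden's pseudo-R2 rather than the Nagelkerke and Cox and Snell values shown in Table 1; those would have to be computed separately.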
With respect to the control variables, graduate classes and | |
smaller classes clearly tend to have higher levels of student | |
achievement. The graduate class variable was positive, and | |
statistically significant in all three models—the pooled sample, and both the full-time and part-time faculty sub-samples. | |
Class size, while negative in all three regressions, was statistically significant in both the pooled sample and the full-time
faculty sample. The on-line class variable had opposite signs | |
between the estimated regression models, and was statistically significant only in the full-time faculty sample. Overall, | |
all the regression models were statistically significant, and | |
had reasonably high R2s. | |
Of interest to our research are the two “teaching effectiveness” evaluation metrics: student evaluations of teaching | |
(SETE) and the faculty member’s annual classroom peer | |
observations (PEER). Both metrics were positive in all three | |
equations. The SETE variable, however, was statistically significant only for the pooled sample. The PEER variable was | |
also statistically significant in the pooled sample regression. | |
Most interesting are the results for the two sub-samples | |
of full-time and part-time faculty. As previously discussed, | |
there was a significant difference in the way “peers” were | |
selected between these two groups, with the selection of | |
“peers” for part-time faculty more of an independent, “armslength” process. Given this important difference, the pooled | |
sample may be too heterogeneous across the PEER variable | |
and the model estimates therefore misleading. Examining
the two sub-samples should provide additional insight. In | |
the full-time faculty sub-sample, the PEER variable, while | |
indicating a positive relationship, was not statistically significant. However, in the part-time faculty model, which has a | |
much stronger peer selection control process, the classroom | |
observation variable (PEER) was both positive and statistically significant. The SETE metric, while positive in both | |
equations, was not statistically significant in either. | |
DISCUSSION AND CONCLUSION | |
As the debate continues about which measures of teaching | |
effectiveness should be used to evaluate faculty for personnel | |
decisions, there is an increasing need for continued investigation into the validity of these different metrics. With respect | |
to student evaluations of teaching, McKeachie (1996) succinctly summarized the problem, “If student ratings are part | |
of the data used in personnel decisions, one must have convincing evidence that they add valid evidence of teaching | |
effectiveness” (McKeachie 1996, 3). The same can certainly | |
be said for faculty peer evaluations. | |
While there is a large body of empirical literature examining the validity of SETEs, the results of these studies are | |
open to vast differences in interpretation. To date, however, | |
the empirical basis for arguing for the validity of peer evaluations or classroom observations of teaching is based primarily | |
on studies that correlate peer evaluations with SETES. Unfortunately, there are few, if any, studies that examine the | |
relationship between peer evaluations and an actual, independent measure of student achievement, and then compare | |
the strength of this relationship with student ratings. | |
Our research represents an attempt to start filling this gap | |
in our knowledge. That major institutions of higher
learning around the world regularly employ both student
ratings and peer evaluations of teaching for faculty personnel decisions without knowing more about the true validity
of these two metrics in assessing teaching effectiveness is
somewhat surprising.
In our study we were able to examine the validity of one | |
important component of peer evaluations, the classroom observation, from two perspectives. Using a multi-section class | |
taught by different instructors, we compared the annual classroom observation ratings for faculty members against the results of an independent student learning outcome assessment | |
measure in courses those faculty members taught. Not only | |
did we find that the annual classroom observation metric was | |
significantly and positively correlated with student achievement, but that it was also a much better predictor of student | |
achievement than student ratings of teaching (SETEs) from | |
the classes. This is exactly the type of validity testing called | |
for by Cohen and McKeachie (1980). | |
We also wanted to examine the validity of both classroom | |
observation and SETEs in a manner which somewhat paralleled the way that most universities actually use such measures for personnel decisions, that is, across different instructors, courses and departments. Again, using the standardized | |
learning assessment, our analysis again offered two conclusions. First, a faculty member’s annual classroom observation | |
rating was positively related to student achievement, particularly when the process reflected a somewhat “arms-length” | |
selection of the actual observer. Second, under these conditions of stricter peer-selection control, a faculty member’s | |
annual classroom observation rating was more significantly | |
related to student achievement than the course SETEs. In addition, although not directly the focus of our study, we also | |
found evidence that class size was negatively related to student achievement, with smaller classes outperforming larger | |
classes on the average. | |
Our analysis supports the validity of university-level classroom observation by peers, particularly if done under relatively strict peer-selection controls. And it should be noted | |
that our peer evaluation process followed few of the complex | |
observation, training, feedback, and reporting protocols suggested by the rapidly expanding normative peer evaluation | |
literature—our reviewers were simply colleagues asked to | |
observe another’s class with a simple check-list. | |
Obviously there are limitations to our research. First, our | |
sample size was relatively small, particularly in our multisection analysis. While this should suggest caution in interpretation, this type of research will always struggle with sample size issues. Second, it would have been ideal if we could | |
have obtained the actual post-observation forms so that multiple, independent scorers could have provided the quantitative | |
ratings. This would have allowed for a test of inter-rater reliability. And finally, our data came from only one university, | |
albeit across different departments and disciplines. | |
Given these limitations, however, we feel that our results | |
are noteworthy, particularly since almost no published research has appeared that directly correlates classroom peer | |
observation results to an independent measure of student | |
achievement designed around agreed upon student learning outcomes. Although university accrediting bodies are | |
encouraging more assurance of learning outcome measurement, few universities at the present time are taking a standardized and quantifiable approach to assessing learning | |
outcomes across all disciplines that lend themselves to cross-institutional analysis. We hope that as more and more outcome data become standardized, quantified, and available from different institutions, additional empirical analysis will continue to examine these fascinating and highly charged debates.
REFERENCES | |
Abrami, P., L. Leventhal, & R. Perry. 1982. Educational seduction. Review
of Educational Research 52: 446–464. | |
Ambady, N., & Rosenthal, R. 1993. Half a minute: Predicting teacher evaluations from thin slices of nonverbal behavior and physical attractiveness. | |
Journal of Personality and Social Psychology 64: 431–441. | |
Anderson, K., & Smith, G. 2005. Students’ preconceptions of professors:
Benefits and barriers according to ethnicity and gender. Hispanic Journal | |
of Behavioral Sciences 27(2): 184–201. | |
Arreola, R. 2007. Developing a comprehensive faculty evaluation system. | |
3rd ed. Bolton, MA: Anker Publishing. | |
Atamian, R., & G. Ganguli. 1993. Teacher popularity and teaching effectiveness: Viewpoint of accounting students. Journal of Education for Business | |
68(3): 163–169. | |
Baarveld, F., B. Kollen, & K. Groenier. 2007. Expertise in sports medicine
among family physicians: What are the benefits? The Open Sports | |
Medicine Journal 1: 1–4. | |
Balam, E., & D. Shannon. 2010. Student ratings of college teaching: A | |
comparison of faculty and their students. Assessment and Evaluation in | |
Higher Education 35(2): 209–221. | |
Barrett, J., K. Hart, J. Schmerier, K. Willmartch, J. Carey, & S. Mohammed. | |
2009. Criterion validity of the financial skills subscale of the direct assessment of functional status scale. Psychiatry Research 166(2/3): 148– | |
157. | |
Bernstein, D. 2008. Peer review and the evaluation of the intellectual work | |
of teaching. Change, March/April: 48–51. | |
Bernstein, D., A. Burnett, A. Goodburn, & P. Savory. 2006. Making teaching | |
and learning visible: Course portfolios and the peer review of teaching. | |
Bolton, MA: Anker Publishing. | |
Bernstein, D., & R. Edwards. 2001. We need objective, rigourous peer review | |
of teaching. Chronicle of Higher Education 47(17): B24. | |
Blackhart, G., B. Peruche, C. DeWall, & T. Joiner. 2006. Factors influencing | |
teaching evaluations in higher education. Teaching of Psychology 33: | |
37–39. | |
Bolotin, A. 2006. Fuzzy logic approach to robust regression of uncertain | |
medical categories. World Academy of Science, Engineering and Technology 22: 106–111. | |
Bowling, N. 2008. Does the relationship between student ratings of course | |
easiness and course quality vary across schools? The role of school academic rankings. Assessment and Evaluation in Higher Education 33(4): | |
455–464. | |
Boysen, G. 2008. Revenge and student evaluations of teaching. Teaching of | |
Psychology 35(3): 218–222. | |
Buck, S., & D. Tiene. 1989. The impact of physical attractiveness, gender, | |
and teaching philosophy on teacher evaluations. Journal of Educational | |
Research 82: 172–177. | |
Burns, C. 1998. Peer evaluation of teaching: Claims vs. research. University of Arkansas, Little Rock, AR. http://eric.ed.gov/ERICWebPortal/search/detailmini.jsp?_nfpb=true&_&ERICExtSearch_SearchValue_0=ED421470&ERICExtSearch_SearchType_0=no&accno=ED421470
Campbell, J. and W. Bozeman. 2008. The value of student ratings: Perceptions of students, teachers, and administrators. Community College | |
Journal of Research and Practice 32(1): 13–24. | |
Carrell, S., & J. West. 2010. Does professor quality matter? Evidence from | |
random assignments of students to professors. Journal of Political Economy 118(3): 409–432. | |
Cavanagh, R. 1996. Formative and summative evaluation in the faculty | |
peer review of teaching. Innovative Higher Education 20(4): 235– | |
240. | |
Centra, J. 1993. Reflective faculty evaluation. San Francisco: Jossey-Bass. | |
Centra, J. 1994. The use of teaching portfolios and student evaluations for | |
summative evaluation. Journal of Higher Education 65: 555–570. | |
Chism, N. 2007. Peer review of teaching: A sourcebook. 2nd ed. Bolton, | |
MA: Anker Publishing. | |
Cohen, J. 1969. Statistical power analysis for the behavioural sciences, San | |
Diego, CA: Academic Press. | |
Cohen, J. 1981. Statistical power analysis for the behavioural sciences. 2nd | |
ed. Hillsdale, NJ: Lawrence Erlbaum Associates. | |
Cohen, P. 1981. Student ratings of instruction and student achievement. | |
Review of Educational Research 51(3): 281–309. | |
Cohen, P. 1982. Validity of student ratings in psychology courses: A research | |
synthesis. Teaching of Psychology 9(2): 78–82. | |
Cohen, P. 1983. Comment on a selective review of the validity of student | |
ratings of teaching. Journal of Higher Education 54(4): 448–458. | |
Cohen, P., & W. McKeachie. 1980. The role of colleagues in the evaluation | |
of college teaching. Improving college and university teaching 28(4): | |
147–154. | |
Costello, J., B. Pateman, H. Pusey, & K. Longshaw. 2001. Peer review | |
of classroom teaching: An interim report. Nurse Education Today 21: | |
444–454. | |
Costin, F. 1978. Do student ratings of college teachers predict student
achievement? Teaching of Psychology 5(2): 86–88. | |
Courneya, C., D. Pratt, & J. Collins. 2007. Through what perspective do | |
we judge the teaching of peers? Teaching and Teacher Education 24: 69– | |
79. | |
Davies, M., J. Hirschberg, J. Lye, & C. Johnston. 2007. Systematic influences | |
on teaching evaluations: The case for caution. Australian Economic
Papers 46(1): 18–38. | |
DeZure, D. 1999. Evaluating teaching through peer classroom observation. | |
In Changing practices in evaluating teaching, ed. P. Seldin. Bolton, MA: | |
Anker Publishing. | |
Dowell, D., & J. Neal. 1982. A selective review of the validity of student ratings of teaching. Journal of Higher Education 53(1): 51–62.
Dowell, D., & J. Neal. 1983. The validity and accuracy of student ratings | |
of instruction: A reply to Peter A. Cohen. Journal of Higher Education | |
54(4): 459–463. | |
Emery, C., T. Kramer, & R. Tian. 2003. Return to academic standards: A critique of student evaluations of teaching effectiveness. Quality Assurance | |
in Education 11(1): 37–46. | |
Feldman, K. 1989a. The association between student ratings of specific | |
instructional dimensions and student achievement: Refining and extending the synthesis of data from multisection validity studies. Research in | |
Higher Education 30(6): 583–645. | |
Feldman, K. 1989b. Instructional effectiveness of college teachers as judged | |
by teachers themselves, current and former students, colleagues, administrators, and external (neutral) observers. Research in Higher Education | |
30(2): 137–194. | |
Feldman, K. 2007. Identifying exemplary teachers and teaching: Evidence
from student ratings. In The scholarship of teaching and learning in | |
higher education: An evidence-based perspective, eds. R. Perry and J. | |
Smart, 93–129. Dordrecht, The Netherlands: Springer. | |
Felton, J., J. Mitchell, & J. Stinson. 2004. Web-based student evaluations | |
of professors: The relations between perceived quality, easiness and sexiness. Assessment and Evaluation in Higher Education, 29(1): 91–108. | |
Fetterley, J. 2005. Teaching and “my work”. American Literary History | |
17(4): 741–752. | |
Galbraith, C., G. Merrill, & D. Kline. 2011. Are student evaluations of | |
teaching effectiveness valid for measuring student learning outcomes | |
in business related classes? A neural network and Bayesian analysis. Research in Higher Education: 1–22. http://www.springerlink.com/ | |
content/2058756205016652. | |
Glazerman, S., Loeb, S., Goldhaber, D., Raudenbush, S., Staiger, D., & Whitehurst, G. 2010. Evaluating teachers: The important role of value-added. Palo Alto, CA: Center for Educational Policy Analysis, Stanford University.
Hon, J., K. Lagden, A. McLaren, D. O’Sullivan, L. Orr, P. Houghton, & M. | |
Woodbury. 2010. A prospective multicenter study to validate use of the | |
PUSH© in patients with diabetic, venous, and pressure ulcers. Ostomy | |
Wound Management 56(2): 26–36. | |
Hutchings, P. 1996. The peer review of teaching: Progress, issues and | |
prospects. Innovative Higher Education 20(4): 221–234. | |
Hutchings, P., ed. 1998. The course portfolio. Sterling, VA: Stylus. | |
Johnson, I. 2010. Class size and student performance at a public research | |
university: A cross-classified model. Research in Higher Education. | |
http://www.springerlink.com/content/0l35t1821172j857/fulltext.pdf | |
Kremer, J. 1990. Construct validity of multiple measures in teaching, research, and service and reliability of peer ratings. Journal of Educational
Psychology 82: 213–218. | |
Kohut, G., C. Burnap, & M. Yon. 2007. Peer observation of teaching: Perceptions of the observer and the observed. College Teaching 55(1): 19–25. | |
Langbein, L. 2008. Management by results: Student evaluation of faculty | |
teaching and the mis-measurement of performance. Economics of Education Review 27(4): 417–428. | |
Lopus, J., & N. Maxwell. 1995. Should we teach microeconomic principles | |
before macroeconomic principles? Economic Inquiry 33(2): 336–350. | |
Malik, D. 1996. Peer review of teaching: External review of course content. | |
Innovative Higher Education. 20(4): 277–286. | |
Mazumdar, M., & R. Glassman. 2000. Categorizing a prognostic variable: Review of methods, code for easy implementation and applications to decision-making about cancer treatments. Statistics in Medicine 19:
113–132 | |
McCallum, L. 1984. A meta-analysis of course evaluation data and its use | |
in the tenure decision. Research in Higher Education 21: 150–158. | |
McKeachie, W. 1979. Student ratings of faculty: A reprise. Academe 65(6): | |
384–397. | |
McKeachie, W. 1996. Student ratings of teaching. Occasional Paper No. | |
33. American Council of Learned Societies, University of Michigan. | |
http://archives.acls.org/op/33 Professonal Evaluation of Teaching.htm | |
McNatt, B. 2010. Negative reputation and biased student evaluations of
teaching: Longitudinal results from a naturally occurring experiment. | |
Academy of Management Learning and Education 9(2): 225–242. | |
Muennig, P., N. Sohler, & B. Mahato. 2007. Socioeconomic status as an independent predictor of physiological biomarkers of cardiovascular disease: Evidence from NHANES. Preventive Medicine.
http://www.sciencedirect.com. | |
Muthén, B., & G. Speckart. 1983. Categorizing skewed, limited dependent | |
variables. Evaluation Review 7(2): 257–269. | |
Naftulin, D., J. Ware, & F. Donnelly. 1973. The Doctor Fox lecture: A | |
paradigm of educational seduction. Journal of Medical Education 48: | |
630–635. | |
Pascarella, E., & P. Terenzini. 2005. How college affects students: A third | |
decade of research. San Francisco: Jossey-Bass | |
Peel, D. 2005. Peer observation as a transformatory tool? Teaching in Higher | |
Education 10(4): 489–504. | |
Peterson, K. 2000. Teacher evaluation: A comprehensive guide to new directions and practices. 2nd ed. Thousand Oaks, CA: Corwin Press.
Pounder, J. 2007. Is student evaluation of teaching worthwhile? An analytical framework for answering the question. Quality Assurance in Education 15(2): 178–191. | |
Riniolo, T., K. Johnson, T. Sherman, & J. Misso. 2006. Hot or not: Do professors perceived as physically attractive receive higher student evaluations? | |
The Journal of General Psychology 133(1): 19–35. | |
Shortland, S. 2004. Peer observation: A tool for staff development or compliance? Journal of Further and Higher Education 28: 219–227.
Smith, B. 2007. Student ratings of teaching effectiveness: An analysis of end-of-course faculty evaluations. College Student Journal 41(4): 788–800.
Steward, R., & R. Phelps. 2000. Faculty of color and university students: | |
Rethinking the evaluation of faculty teaching. Journal of the Research | |
Association of Minority Professors 4(2): 49–56. | |
Sutkin, G., E. Wagner, I. Harris, & R. Schiffer. 2008. What makes a good clinical teacher in medicine? A review of the literature. Academic Medicine
83(5): 452–466. | |
Taylor, J. 2007. The teaching/research nexus: A model for institutional | |
management. Higher Education 54(6): 867–884. | |
Varni, J., M. Seid, & P. Kurtin. 2001. PedsQL™ 4.0: Reliability and validity of the Pediatric Quality of Life Inventory™ Version 4.0 Generic Core Scales in healthy and patient populations. Medical Care 39(8): 800–812.
Whitfield, K., R. Buchbinder, L. Segal, & R. Osborne. 2006. Parsimonious and efficient assessment of health-related quality of life in | |
osteoarthritis research, validation of the Assessment of Quality of | |
Life (AQoL) instrument. Health and Quality of Life Outcomes 4(19). | |
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC1538577/#B19 | |
Yon, M., C. Burnap, & G. Kohut. 2002. Evidence of effective teaching: | |
Perceptions of peer reviewers. College Teaching 50(3): 104–110. | |
Youmans, R., & B. Jee. 2007. Fudging the numbers: Distributing chocolate | |
influences student evaluations of an undergraduate course. Teaching of | |
Psychology 34(4): 245–247. | |
Zietz, J., & H. Cochran. 1997. Containing cost without sacrificing achievement: Some evidence from college-level economics classes. Journal of | |
Education Finance 23: 177–192. | |
Reliability and Construct Validity of the edTPA for Music Education
Journal of Music Teacher Education, 1–15
© National Association for Music Education 2021
DOI: 10.1177/10570837211007859
Phillip M. Hash1 | |
Abstract | |
The purpose of this study was to examine the psychometric quality of Educative | |
Teacher Performance Assessment (edTPA) scores for 136 preservice music teachers | |
at a Midwest university. I addressed the factor structure of the edTPA for music | |
education, the extent to which the edTPA fits the one- and three-factor a priori | |
models proposed by the test authors, and the reliability of edTPA scores awarded | |
to music education students. Factor analysis did not support the a priori one-factor | |
model around teacher readiness, or the three-factor model based on the edTPA | |
tasks of Planning, Instruction, and Assessment. Internal consistency was acceptable | |
for all rubrics together and for the Instruction task. However, estimates of interrater | |
reliability fell substantially below those reported by test administrators. These findings | |
indicate the need for revision of the edTPA for music education and call into question | |
its continued use among music teacher candidates in its current form. | |
Keywords | |
assessment, edTPA, music teacher preparation, student teaching, teacher readiness | |
The Educative Teacher Performance Assessment (edTPA) is a portfolio-based, subject-specific project completed by preservice candidates during their clinical semester.
Educator preparation programs in 41 states and the District of Columbia currently | |
administer the edTPA and at least 19 states and the District of Columbia use the assessment for initial licensure. The Stanford Center for Assessment, Learning, and Equity | |
(SCALE) is the sole developer of the edTPA and Stanford University is the exclusive | |
owner. The university has licensed the Evaluation Systems group of Pearson to provide | |
1 Illinois State University, Normal, USA
Corresponding Author: | |
Phillip M. Hash, School of Music, Illinois State University, Campus Box 5660, Normal, IL 61790-5660, USA. | |
Email: [email protected] | |
operational support for national administration of the assessment (Powell & Parkes, | |
2020; SCALE, 2019a). | |
Candidates completing the edTPA engage in three tasks: Planning, Instruction, and | |
Assessment. The complete portfolio consists of several artifacts including lesson | |
plans, instructional materials, assessments, written commentaries, teaching videos, | |
and student work samples, as dictated by separate handbooks for 28 content areas. | |
Music candidates follow the K–12 Performing Arts Assessment Handbook (SCALE, | |
2018a), which also includes theater and dance. SCALE (2013) states that the theoretical framework for the edTPA evolved from a three-step process that included the | |
following: | |
1. Subject-specific expert design teams who provided content validity evidence of the specific job-related competencies assessed within each subject area.
2. A job analysis study to confirm the degree to which the job requirements of a teacher align to the edTPA.
3. A content validation committee to rate the importance, alignment, and representativeness of the knowledge and skills required for each edTPA rubric in relation to national pedagogical and content-specific standards.
Among other requirements, candidates must attend to academic language demands in | |
all three tasks, which include teaching subject-specific vocabulary, engaging in a language function (e.g., analyze, describe, identify, create), and demonstrating the use of | |
syntax and/or discourse (e.g., speaking or writing) within the discipline (SCALE, | |
2018a). | |
According to SCALE (2019a), handbooks in all content areas share approximately | |
80% of their design. The other 20% contains key subject-specific components of | |
teaching and learning drawn from the content standards authored by national organizations. However, it is unclear how the edTPA relates to standards of the National | |
Association of Schools of Music (2020), the National Association for Music Education | |
(2014), or any other arts organization. | |
Candidates submit their portfolios to Pearson, who employs independent evaluators | |
to score the materials. Scoring for most portfolios involves 15 rubrics, five per task, | |
graded on a scale of one (novice not ready to teach) to five (highly accomplished | |
beginner). This process results in a possible maximum total score of 75. Evaluators | |
review specified artifacts and written commentary separately for each rubric rather | |
than considering all parts of the assessment together. Candidates not achieving the | |
minimum benchmark set by their state or institution can revise and resubmit one, two, | |
or all three tasks (Parkes & Powell, 2015; SCALE, 2018b). | |
The pool of edTPA scorers includes P–12 teachers and college faculty with pedagogical content knowledge and experience preparing novice teachers. They possess | |
discipline-specific expertise and score only those portfolios for which they are qualified. Although the performing arts include music, theater, and dance, only scorers with | |
knowledge and experience in music evaluate candidates in this discipline (Pearson, | |
personal communication, April 21, 2020). | |
Evaluators complete an extensive training program and must demonstrate their | |
ability to determine scores consistently and accurately (SCALE, 2019a). SCALE randomly selects 10% of portfolios for double scoring to maintain reliability. In addition, | |
portfolios scored within a defined range above and below the state-specific (currently | |
35–41) or SCALE-recommended (currently 42) cut score undergo a second and sometimes third review. In these cases, a scoring supervisor resolves instances where Scorer | |
1 and Scorer 2 (a) are more than 1 point apart on any rubric or (b) determine total | |
scores on opposite sides of the cut score. The supervisor also resolves cases where | |
both scorers fall above or below the cut score but have five or more adjacent rubric | |
scores (SCALE, 2019b). | |
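Read literally, these resolution rules form a small decision procedure. The Python sketch below is one hypothetical reading of them, not SCALE's or Pearson's actual implementation; the function name, the list-based score representation, and the default cut score of 42 (the SCALE-recommended value mentioned above) are illustrative assumptions.

# Hypothetical sketch of the double-scoring resolution rules described above;
# not SCALE's or Pearson's actual procedure.

def needs_supervisor_resolution(scores_1, scores_2, cut_score=42):
    """Return True if a scoring supervisor would need to resolve the ratings.

    scores_1, scores_2: lists of 15 rubric scores (1-5) from Scorer 1 and Scorer 2.
    cut_score: state-specific or SCALE-recommended total cut score (assumed 42).
    """
    total_1, total_2 = sum(scores_1), sum(scores_2)
    diffs = [abs(a - b) for a, b in zip(scores_1, scores_2)]

    # (a) the two scorers are more than 1 point apart on any rubric
    if any(d > 1 for d in diffs):
        return True
    # (b) the two total scores fall on opposite sides of the cut score
    if (total_1 >= cut_score) != (total_2 >= cut_score):
        return True
    # (c) same side of the cut score, but five or more adjacent rubric scores
    if sum(1 for d in diffs if d == 1) >= 5:
        return True
    return False

# Example: no rubric differs by more than one point and both totals clear the
# cut score, but six rubrics are adjacent, so a supervisor still resolves it.
s1 = [3, 3, 4, 3, 3, 4, 3, 3, 3, 3, 4, 3, 3, 3, 3]
s2 = [3, 4, 3, 3, 3, 3, 3, 4, 3, 4, 4, 3, 4, 3, 3]
print(needs_supervisor_resolution(s1, s2))  # True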
Proponents claim that the edTPA provides an authentic means of assessing teacher | |
readiness by measuring candidates’ ability to create lesson plans, implement instruction, and assess student learning in an actual classroom environment. Supporters also | |
emphasize the assessment’s uniformity across disciplines and seemingly impartial | |
evaluation, as well as the potential for the edTPA to shape teacher education programs | |
and curricula. Some college faculty believe that the edTPA has fostered their professional growth, while cooperating teachers in K–12 school districts report that the | |
assessment provides guidance for them in mentoring candidates during the student | |
teaching semester (Darling-Hammond & Hyer, 2013; Pecheone & Whittaker, 2016; | |
Sato, 2014). | |
Critics cite concerns with ecological validity of the edTPA and state that candidates | |
might make instructional decisions to meet the requirements of the rubrics rather than | |
long-term student needs (Parkes & Powell, 2015). In addition, the two required video | |
excerpts (maximum 10 minutes each) could alter the teaching environment, create | |
privacy concerns, foster anxiety among candidates, and fail to capture nuanced student | |
interactions and other aspects of teaching stipulated in the rubrics (Bernard & McBride, | |
2020; Choppin & Meuwissen, 2017). | |
Behizadeh and Neely (2018) questioned the consequential validity of the edTPA in | |
relation to positive and negative social outcomes, especially in an urban teacher preparation program focused on social justice. Participants (N = 16) in this study, who were | |
mostly candidates of color and first-generation college students, stated that the edTPA | |
increased their mental and financial stress and lacked a social justice orientation in the | |
scoring procedures. They also felt pressure to select the highest achieving classes for | |
their lesson segment and to teach content that fulfilled scoring criteria, regardless of | |
students’ needs. Authors have also criticized the corporate control of the scoring process, the high cost for teacher candidates ($300 for initial submission), and the effect | |
of the edTPA on preparation program autonomy (e.g., Dover et al., 2015; Heil & Berg, | |
2017; Parkes, 2020). | |
The content and evaluation standards of the edTPA can present problems specific to | |
the music classroom. For example, the timeline requiring candidates to teach their | |
entire unit in three to five consecutive lessons might not allow K–12 students to engage | |
in creative artistic processes authentically (Heil & Berg, 2017). The assessment can | |
also force candidates in secondary ensemble programs to teach edTPA lessons unrelated to the goals of the classroom and within a tight rehearsal schedule dictated by | |
public performances (Powell & Parkes, 2020). | |
SCALE (e.g., 2015, 2018c, 2019a) annually reports the reliability and validity of | |
the edTPA using data from statistical tests conducted on aggregated scores of all content areas with at least 10 portfolio submissions from the previous calendar year. In | |
2018, internal consistency as measured by Cronbach’s α equaled .89 for the performing arts, and for all subjects combined. Interrater reliability estimates using the kappan (kn) statistic averaged .91 among the 15 evaluation rubrics. Factor analysis | |
supported both the one-factor and three-factor models, with all loadings exceeding | |
.50. According to SCALE (2015), these results “confirm [ ] that the tasks are measuring a common unifying teaching construct and that there are three common latent | |
constructs . . . which [comprise] each of the three tasks” (p. 22). Factor correlations | |
in the three-factor model ranged from .71 to .78, which SCALE (2018c) claims “supports the edTPA structure consisting of three correlated abilities: Planning, Instruction, | |
and Assessment” (p. 25). | |
Gitomer et al. (2021) questioned the psychometric validity and reliability of the | |
edTPA due to (a) the use of aggregated data across content areas, (b) the supposed | |
existence of both a one- and a three-factor model, (c) measures of internal consistency | |
involving scores of all evaluators combined, and (d) the utilization of exact + adjacent | |
agreements rather than only exact agreements to calculate interrater reliability through | |
the kn statistic. The authors illustrated the difference in interrater agreement indices | |
attained using exact agreements only versus exact + adjacent agreements as used by | |
SCALE. The simulation involved rubric scores from 184 students from one of the | |
author’s institutions and interrater agreement coefficients for all handbooks combined | |
from the 2017 edTPA Administrative Report (SCALE, 2018c). Results indicated that | |
interrater reliability for individual rubrics ranged from kappa indices of .06 to .32 (M | |
= .23) using only exact agreements, compared with .85 to .97 (M = .91) as reported | |
by SCALE. The authors acknowledged the need for analysis of individual content | |
areas and called for SCALE to make these data publicly available. | |
Musselwhite and Wesolowski (2019) used the Rasch Measurement Model to analyze edTPA scores of music students (N = 100) from three universities in the United | |
States. They examined (a) the validity and reliability of the 15 rubrics, (b) the extent | |
to which the rubric criteria fit the measurement model and vary in difficulty, and (c) | |
if category response structures for each criterion empirically cooperate to provide | |
meaningful measures. Reliability of separation, similar in interpretation to | |
Cronbach’s alpha, fell within the upper range of acceptability for students (Rel. = | |
.89) and rubric criteria (Rel. = .95), meaning edTPA scores could be used to separate | |
high- and low-achieving students and the most and least difficult rubric criteria. | |
Rubrics within each of the three tasks demonstrated adequate data-model fit. | |
However, based on underuse of the lowest (1) and highest (5) ratings, the authors | |
suggested that response categories were not capturing the full range of candidate | |
performance or the results may not reflect the expected and intended meaning of the | |
rubrics (e.g., “novice not ready to teach,” “highly accomplished beginner”). In addition, violations of monotonicity (i.e., the assumption that variables move consistently in the same or opposite directions) raised concerns with the overall rating | |
scale structure. | |
Austin and Berg (2020) analyzed the reliability, validity, and utility of edTPA scores | |
for music teacher candidates (N = 60) over a 3-year period from 2013 to 2015. Scores | |
for all three tasks (α = .76-.81) and the 15 rubrics combined (α = .84) demonstrated | |
adequate internal consistency. Factor analysis supported the construct validity of the | |
assessment and produced a clear structure that corresponded to the three edTPA tasks. | |
Criterion-related validity evidence was mixed, however, with most correlations | |
between edTPA scores and the 16 variables examined being of modest magnitude | |
(<.25). | |
Purpose and Need for the Study | |
Annual edTPA Administrative Reports (e.g., SCALE, 2015, 2018c, 2019a) provide | |
factor analysis and interrater agreement data for all content areas combined, as well as | |
Cronbach’s alpha for each handbook with at least 10 submissions. The reports provide | |
no data related to internal consistency (α) of each task or to factor analysis for specific | |
disciplines. The 2018 Administrative Report states that “factor analyses models of | |
latent structure are reviewed for each field [handbook] with appropriate sample size” | |
(SCALE, 2019a, p. 15). However, only state-level technical advisory committees have | |
access to these data (Pearson, personal communication, March 18, 2020). | |
Detailed reliability and validity data for individual subject areas assessed by the | |
edTPA are necessary for policymakers and teacher educators to evaluate the efficacy | |
of this instrument. However, SCALE does not make these data available to the public | |
(Gitomer et al., 2021). Therefore, the purpose of this study was to examine the psychometric quality of edTPA scores for portfolios completed by 136 preservice music | |
teachers. Research questions were as follows: | |
Research Question 1: What factor structure emerges from edTPA scores for music | |
education? | |
Research Question 2: To what extent do edTPA scores for music education fit the | |
one- and three-factor a priori models proposed by SCALE (2013)? | |
Research Question 3: What is the internal consistency and interrater reliability of | |
the edTPA for music education students? | |
This research will help estimate the reliability and construct validity of the edTPA | |
specifically among preservice music educators and provide discipline-specific data to | |
compare against that available publicly (e.g., SCALE, 2019a). | |
Method | |
Data | |
Data for this study consisted of all edTPA rubric scores (N = 2,040) attained between | |
fall 2015 and spring 2020 for preservice music educators (N = 136) at one large university in the Midwestern United States. The sample involved 61 males and 75 females. | |
All participants were pursuing a Bachelor of Music Education degree and following | |
either an instrumental (n = 93, 68.4%) or a vocal (n = 43, 31.6%) track. With one | |
exception, all students passed the edTPA on their initial attempt. The institution piloted | |
the edTPA beginning in 2013, two years before the state implemented the assessment | |
as a requirement for teacher licensure (Adkins et al., 2015). | |
A comparison of music scores from this study with national data for the K–12 | |
Performing Arts Handbook indicated higher than average final (Music: M = 51.62; | |
Performing Arts: M = 46.36) and rubric scores (Music: M = 3.44; Performing Arts: | |
M = 3.18). Individual rubric means all exceeded national averages and the 3.0 benchmark associated with candidates being “competent and ready to teach.” In addition, | |
frequency counts and skewness indices indicated a normal distribution (see Table S1 | |
in the online supplement). Consistent with national data for the K–12 Performing Arts | |
(SCALE, 2018c, 2019a, 2019c), 6% of rubric scores in this study consisted of a 1 or a | |
5 with about 95% of scores falling within the 2 to 4 range. | |
Construct Validity | |
Preliminary examination of construct validity involved a series of factor analyses | |
using various methods and rotations to determine the best model fit based on criteria | |
for simple structure (Asmus, 1989; J. D. Brown, 2009): | |
1. Each variable produces at least one zero loading (−.10 to +.10) on some factor.
2. Each factor has at least as many zero loadings as there are factors.
3. Each pair of factors contains variables with significant loadings (≥.30) on one and zero loadings on the other.
4. Each pair of factors contains only a few complex variables (loading ≥.30 on more than one factor).
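These criteria can also be checked mechanically against a rotated loading matrix. The short Python sketch below is a minimal illustration under the thresholds stated in the list (zero loadings between −.10 and +.10, significant loadings of .30 or higher); the function name and the toy matrix are mine, not part of Asmus (1989) or J. D. Brown (2009).

import numpy as np

# Minimal sketch: check the simple-structure criteria listed above against a
# (variables x factors) rotated loading matrix. Thresholds follow the text.
def simple_structure_report(loadings, zero=0.10, sig=0.30):
    L = np.asarray(loadings, dtype=float)
    n_factors = L.shape[1]
    is_zero = np.abs(L) <= zero
    is_sig = np.abs(L) >= sig

    report = {
        # 1. every variable has at least one zero loading on some factor
        "every_variable_has_zero_loading": bool(is_zero.any(axis=1).all()),
        # 2. every factor has at least as many zero loadings as there are factors
        "each_factor_has_enough_zero_loadings": bool((is_zero.sum(axis=0) >= n_factors).all()),
        # 4. number of "complex" variables loading >= .30 on more than one factor
        "n_complex_variables": int((is_sig.sum(axis=1) > 1).sum()),
    }

    # 3. each pair of factors should have variables significant on one factor
    #    and zero on the other
    pairs = {}
    for i in range(n_factors):
        for j in range(i + 1, n_factors):
            contrast = (is_sig[:, i] & is_zero[:, j]) | (is_sig[:, j] & is_zero[:, i])
            pairs[(i + 1, j + 1)] = bool(contrast.any())
    report["pairs_with_contrasting_loadings"] = pairs
    return report

# Toy 6-variable, 2-factor pattern matrix (illustrative values only)
toy = [[0.62, 0.05], [0.55, 0.08], [0.48, 0.21],
       [0.04, 0.58], [0.09, 0.66], [0.31, 0.35]]
print(simple_structure_report(toy))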
Final analysis for this study involved principal axis factoring using Kaiser normalization and promax rotation with kappa set at the default value of 4. The first analysis | |
used an eigenvalue of one criterion to determine if a factor structure other than that | |
determined by SCALE (2013) might emerge from the edTPA for music education. | |
Subsequent analysis tested the existence of the a priori models, which include a single-factor solution around teacher readiness and a three-factor model aligned with edTPA
tasks: Planning, Instruction, and Assessment. | |
I considered the effectiveness of the models based on communalities (proportion of | |
each variable’s total variance explained by all factors) and the extent to which items | |
achieved a high loading on their intended factor. Generally, researchers consider loadings of .30 to .40 meaningfully large (Miksza & Elpus, 2018). The pattern matrix | |
(unique contribution of each factor to a variable’s variance) served as the primary | |
determinant used to identify which items clustered into factors. I also examined the | |
structure matrix (correlation of each variable and factor) to verify the interpretation. | |
Bartlett’s test of sphericity indicated if there were adequate correlations for data | |
reduction, and the Kaiser-Meyer-Olkin measure determined sampling adequacy | |
(Asmus, 1989; J. D. Brown, 2009). Maximum interfactor correlations of .80 served as | |
the standard for adequate discriminant validity (T. A. Brown, 2015). | |
SCALE analyzes internal structure of the edTPA for all content areas combined | |
through a confirmatory factor analysis using maximum likelihood estimation, which | |
assumes a normal distribution and is most appropriate for large sample sizes (Costello | |
& Osborne, 2005; Miksza & Elpus, 2018). Principal axis factoring used in this study | |
better fit the data and proved more effective in achieving a simple solution (e.g., J. D. | |
Brown, 2009). | |
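For readers who want to run a comparable analysis, the sketch below shows one way to perform the adequacy checks and a principal axis factoring with promax rotation in Python using the open-source factor_analyzer package. It is a sketch under assumptions rather than the author's actual code: the file name and the candidates-by-rubrics data frame are hypothetical, and the package's API should be verified against its current documentation.

# Sketch only (not the author's code): adequacy checks and principal axis
# factoring with promax rotation via the factor_analyzer package.
# "edtpa_rubric_scores.csv" is a hypothetical candidates x 15-rubric file.
import pandas as pd
from factor_analyzer import FactorAnalyzer
from factor_analyzer.factor_analyzer import calculate_bartlett_sphericity, calculate_kmo

rubric_scores = pd.read_csv("edtpa_rubric_scores.csv")

# Adequacy of the correlation matrix for data reduction
chi_square, p_value = calculate_bartlett_sphericity(rubric_scores)
kmo_per_item, kmo_overall = calculate_kmo(rubric_scores)
print(f"Bartlett chi-square = {chi_square:.1f}, p = {p_value:.4f}")
print(f"KMO overall = {kmo_overall:.2f}")

# Principal axis factoring with promax (oblique) rotation; the number of
# factors would first be chosen with the eigenvalue-greater-than-one rule.
fa = FactorAnalyzer(n_factors=3, method="principal", rotation="promax")
fa.fit(rubric_scores)

eigenvalues, _ = fa.get_eigenvalues()
print("Eigenvalues:", eigenvalues.round(2))
print("Pattern matrix (rotated loadings):")
print(pd.DataFrame(fa.loadings_, index=rubric_scores.columns).round(2))
print("Communalities:", fa.get_communalities().round(2))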
Reliability | |
Cronbach’s alpha provided a measure of internal consistency for the complete edTPA, | |
individual tasks determined by SCALE (2019a), and factors identified in this study. A | |
coefficient of α ≥ .80 served as the minimum acceptable benchmark as per general | |
practice in the social sciences (e.g., Carmines & Zeller, 1979; Krippendorff, 2013). | |
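Cronbach's alpha is straightforward to compute directly from an observations-by-items score matrix using the standard formula alpha = k/(k − 1) × (1 − sum of item variances / variance of total scores). The Python sketch below applies it to simulated rubric scores; the data are illustrative, not the study's.

import numpy as np

def cronbach_alpha(item_scores):
    """Cronbach's alpha for an (observations x items) score matrix."""
    X = np.asarray(item_scores, dtype=float)
    k = X.shape[1]                         # number of items (rubrics)
    item_vars = X.var(axis=0, ddof=1)      # variance of each item
    total_var = X.sum(axis=1).var(ddof=1)  # variance of total scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Simulated example: 136 candidates x 15 rubrics on a 1-5 scale, with a shared
# "readiness" component so the items are positively correlated.
rng = np.random.default_rng(0)
readiness = rng.normal(3.4, 0.5, size=(136, 1))
scores = np.clip(np.rint(readiness + rng.normal(0, 0.7, size=(136, 15))), 1, 5)
print(f"alpha = {cronbach_alpha(scores):.2f}")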
SCALE (2019a) analyzed interrater reliability for each rubric using the kappan | |
statistic: | |
kn = (AO − 1/n) / (1 − 1/n)
where AO represents observed agreement and n equals the number of possible adjudication categories/classifications.1 Due to the lack of agreement indices for data in this | |
study, I replicated the procedure of Gitomer et al. (2021) and calculated Cohen’s kappa | |
formula instead: | |
k = (AO − AC) / (1 − AC)
This estimate of interrater reliability used the proportions of exact agreement (AO) | |
reported in the 2018 edTPA Administrative Report for all content areas combined and | |
chance agreement (AC) coefficients from the music scores analyzed here. Chance | |
agreement indices equaled the sum of the cross-multiplied proportions of rubric scores | |
in each category (1–5) from portfolios that did not contain fractional numbers (e.g., | |
2.5) due to double scoring (n = 128).2 Thus, kappa is higher to the extent that observed | |
agreement exceeds the expected level of chance agreement (Brennan & Prediger, | |
1981). Due to the unavailability of data from two independent evaluators, calculations | |
of chance agreement involved multiplying duplicate proportions of one scorer | |
(Gitomer et al., 2021). | |
Like Gitomer et al. (2021), I only considered exact agreements when estimating | |
kappa to provide a more precise estimate of interrater reliability. Kappa coefficients | |
reported by SCALE (2019a) are likely inflated because calculations involved exact + | |
adjacent agreements on a 5-point scale, where about 95% of scores fell between 2 and | |
4 (Stemler & Tsai, 2008). Kappa can range from −1 to +1 with interpretations of poor | |
(below 0.00), slight (0.00-0.20), fair (0.21-0.40), moderate (0.41-0.60), substantial | |
(0.61-0.80), and almost perfect (0.81-1.00; Landis & Koch, 1977). | |
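The two coefficients differ only in how baseline agreement is handled, and the replication described above can be expressed in a few lines. The Python sketch below is a minimal illustration under stated assumptions: the category proportions and the exact-agreement rate are invented for the example, and the chance-agreement term simply squares one scorer's proportions, mirroring the procedure described in the text rather than reproducing SCALE's or Gitomer et al.'s actual code.

# Minimal sketch of the agreement statistics discussed above; all numbers
# below are illustrative, not edTPA results.

def kappa_n(observed_agreement, n_categories):
    """Brennan & Prediger's kappa-n: baseline agreement fixed at 1/n."""
    return (observed_agreement - 1 / n_categories) / (1 - 1 / n_categories)

def cohen_kappa_from_margins(observed_agreement, proportions):
    """Cohen's kappa with chance agreement estimated by duplicating one
    scorer's category proportions, as described in the text."""
    chance = sum(p * p for p in proportions)
    return (observed_agreement - chance) / (1 - chance)

def landis_koch(k):
    """Landis & Koch (1977) verbal interpretation of a kappa value."""
    if k < 0:
        return "poor"
    for upper, label in [(0.20, "slight"), (0.40, "fair"), (0.60, "moderate"),
                         (0.80, "substantial"), (1.00, "almost perfect")]:
        if k <= upper:
            return label
    return "almost perfect"

# Hypothetical rubric: scores concentrated in categories 2-4, with an exact
# agreement rate of .55 between two scorers.
proportions = [0.01, 0.25, 0.45, 0.24, 0.05]
ao = 0.55
k = cohen_kappa_from_margins(ao, proportions)
print(f"Cohen's kappa = {k:.2f} ({landis_koch(k)})")  # ~0.33, "fair"
print(f"kappa-n (n = 5) = {kappa_n(ao, 5):.2f}")      # ~0.44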
The purpose of estimating k was to demonstrate the difference in readings based on | |
exact + adjacent agreements versus those attained with exact agreements only. The | |
use of exact + adjacent agreements is problematic on a 5-point scale, especially when | |
scorers rarely use the highest and lowest categories. This approach is less problematic | |
when estimating reliability for longer scales because of the underlying possible score | |
range and the precision required to attain perfect agreement (Stemler & Tsai, 2008). | |
Results | |
Factor Structure | |
I conducted an exploratory factor analysis (EFA) using principal axis factoring with | |
promax rotation and an eigenvalue of one criterion for extraction. Based on Bartlett’s | |
test output (χ2 = 726.9, p < .001) and Kaiser-Meyer-Olkin measure (.89), I concluded | |
the underlying data were adequately correlated and the sample size was appropriate | |
for conducting an EFA. Eigenvalues equaled 5.88 (Factor 1), 1.34 (Factor 2), and 1.06 | |
(Factor 3), with Factor 1 accounting for 35.7% of the variance in edTPA ratings, followed by Factors 2 (5.8%) and 3 (3.4%) for a cumulative explained variance of 44.9%. | |
The resulting three-factor model met all criteria for simple structure with the exception | |
of three rubrics failing to achieve a 0 loading on any factor, a criterion which may be | |
difficult to meet with smaller sample sizes and fewer extracted factors. | |
The three-factor model resulting from edTPA scores for music education (see Table | |
S2 in the online supplement) did not support the a priori structure proposed by SCALE | |
(2019a) around the three tasks. R1-R5, R11, R12, and R15 from Tasks 1 and 3 clustered into Factor 1. Factor 2 consisted of R6-R9 from Task 2, and Factor 3 consisted of | |
R10 from Task 2, and R13 and R14 from Task 3. The eight rubrics comprising Factor | |
1 suggest an interpretation of “Planning and Assessment.” Factor 2, consisting of | |
R6-R9, resembled Task 2 (Instruction). Factor 3, containing R10, R13, and R14, defied | |
a clear interpretation. | |
With interfactor correlations ranging from .51 (Factors 2 and 3) to .67 (Factors 1 | |
and 3), I concluded that the three-factor model provided adequate discriminant validity (T. A. Brown, 2015), despite lack of support for construct validity. An additional | |
analysis examined the one-factor solution around teacher readiness, which resulted in | |
factor loadings of .46 to .74 (M = .59; SD = .09) and explained just 35.1% of the
variance (see Table S2 in the online supplement). | |
Reliability | |
(In subsequent discussions, tasks refer to the a priori groupings of rubrics around | |
Planning, Instruction, and Assessment [e.g., SCALE, 2019a] and factors denote groupings that emerged from the analysis described here.) I report two forms of reliability | |
estimation for preservice music teachers’ edTPA scores—internal consistency (how | |
consistent ratings are within a priori edTPA tasks or factors extracted through EFA) | |
and interrater reliability (how consistent edTPA rubric scores are across evaluators). | |
Estimates of internal consistency (α) for the three a priori tasks ranged from .73 for | |
Task 1 (Planning) and .74 for Task 3 (Assessment) to .81 for Task 2 (Instruction). | |
Alpha coefficients for factors produced by the EFA ranged from .66 for Factor 3 | |
(rubrics 10, 13, 14) to .81 for Factor 1 (rubrics 1-5, 11, 12, 15) and .82 for Factor 2 | |
(rubrics 6-9). Regardless of whether tasks or factors served to frame the grouping of | |
rubrics, scores for rubrics thought to represent Instruction yielded the highest level of | |
internal consistency. When all 15 rubrics were considered together as a single measure | |
of teacher readiness, the resulting alpha was .88. | |
Estimated interrater reliability using only exact agreements for Cohen’s kappa | |
(Gitomer et al., 2021) ranged from .07 to .51 for individual rubrics and averaged .25 | |
(SD = .12) overall. These findings are similar to estimated k for the Performing Arts | |
(Range = −.01-.32; M = .24; SD = .09) calculated from rubric scores reported in the | |
spring 2019 edTPA National Performance Summary (SCALE, 2019c) and exact agreement indices for all content areas combined from the 2018 Administrative Report. | |
Estimated k from both analyses differed greatly from kappan statistics reported by | |
SCALE (Range = .85-.98, M = .91, SD = .04) for all handbooks together (SCALE, | |
2019a; see Table S3 in the online supplement). | |
Discussion | |
In this study, I examined the reliability and construct validity of edTPA scores for | |
preservice music teachers. Readers should interpret results with caution due to limitations of the study. In addition to a relatively small nonrandom sample, all data came | |
from one institution and may not reflect broader trends. It is also important to note | |
differences between statistical procedures used in this study and those involved in | |
analyses published in the Administrative Reports (SCALE, 2013, 2015, 2018c, 2019a) | |
when making comparisons. | |
Construct Validity | |
The factor structure that I obtained through EFA raises important questions about the | |
construct validity of the edTPA for music education. According to SCALE (2019a), all | |
28 content areas share approximately 80% of their design around Planning, Instruction, | |
and Assessment. However, this design results in standardization that might fail to capture the uniqueness of teaching and learning in some disciplines (e.g., Powell & | |
Parkes, 2020). The percent of variance explained by the one- and three-factor solutions in this study indicates that the 15 rubrics do not represent the totality of what | |
occurs in the music classroom. Individual factor loadings also suggest that some | |
rubrics might measure elements of instruction connected less to music than other content areas. Variables related to academic language demands, for example, loaded .39 to | |
.55 on either model. | |
It is unclear why three of the Assessment (Task 3) rubrics (R11, R12, & R15) loaded | |
with the Planning (Task 1) rubrics (R1-R5) onto Factor 1. The titles of these tasks, | |
“Planning for Instruction and Assessment;” and “Assessing Student Learning,” imply | |
a relationship. Maybe these tasks are more closely related in music than in other subjects. However, the factor structure that emerged in Austin and Berg (2020) clearly | |
aligned with the Planning, Instruction, and Assessment tasks, and does not support this | |
assertion. Perhaps the teacher preparation program involved in this study taught | |
assessment in such a way that caused students to view planning and assessment as | |
being so closely associated that the scores they received for the a priori assessment | |
rubrics did not coalesce in a meaningful way and, instead, loaded onto two different | |
factors. | |
The failure of the theoretical model (e.g., SCALE, 2019a) to emerge in this study is | |
problematic when scorers evaluate individual tasks. Rubrics in Task 1 (R1-R5) are not | |
inclusive of those that represented a single construct (i.e., Factor 1: R1-R5, R11, R12, | |
R15). Likewise, R10 from Task 2 did not load with other instructional rubrics (R6-R9), and rubrics associated with Task 3 (R11-R15) loaded onto two different factors.
SCALE could mitigate this concern by allowing graders to consult all materials as | |
evidence for any task. However, scoring rules treat the three tasks as separate entities | |
by prohibiting evaluators from considering evidence from one task when scoring | |
another. For example, a scorer cannot use lesson plans from Task 1 as evidence for | |
achievement on Task 3 (Parkes & Powell, 2015). | |
Reliability | |
Measures of internal consistency (α) in this study for all tasks and total scores exceeded | |
.70 and were similar to those attained by Austin and Berg (2020). However, only Task | |
2 (Instruction, R6-R10) met the .80 benchmark for acceptable reliability while Tasks 1 | |
(Planning, R1-R5) and 3 (Assessment, R11-R15) did not. Lower alpha coefficients for | |
individual tasks could be a function of the number of items in each (Carmines & Zeller, | |
1979). Alpha readings in this study, like those by SCALE (2019a), might also be inaccurate due to combining (a) different observations by multiple evaluators and (b) nonindependent rubric scores assigned by single raters. This procedure ignores the effects | |
of individual scores on internal consistency of the edTPA evaluation form, which might | |
result in inflated alpha coefficients (Gitomer et al., 2021; Miksza & Elpus, 2018). | |
Interrater reliability estimates (k) in this study might be imprecise because of the | |
statistical procedures used in the absence of two sets of evaluator ratings for preservice | |
music teachers in this study. However, the wide disparity between these and kn indices | |
listed for all content areas combined (SCALE, 2019a) were likely due to SCALE’s use | |
of exact + adjacent agreements in the calculations rather than differences in k and kn | |
formulas (Gitomer et al., 2021). Although coefficients based on adjacent + exact | |
agreements appear in the literature, their use depends on raters assigning scores across | |
all possible categories for discrete 5-point rubrics. Underuse of the highest and lowest | |
scoring options results in a scale where nearly all points will be adjacent and in agreement indices usually above 90% (Stemler & Tsai, 2008). About 95% of edTPA music | |
scores in this study and for content areas nationally (SCALE, 2019c) fell within a scale | |
of 2 to 4. Consequently, agreement indices for individual rubrics listed in the 2018 | |
Administrative Report (2019a) ranged from .94 to .99. | |
Summary and Recommendations | |
Factor analysis indicated that while the three-factor model for the edTPA accounted | |
for almost one-half of the variance in music teacher readiness, the single-factor model | |
accounted for just over one-third. Although scores for all rubrics sufficiently loaded on | |
the single-factor model, the three-factor model lacked clarity and interpretability in | |
relation to a priori tasks proposed by the test authors (e.g., SCALE, 2019a). In addition, measures of internal consistency for two of the three tasks did not meet the .80 | |
benchmark for acceptability (Carmines & Zeller, 1979), and estimated interrater | |
agreement ranged from only slight to moderate (Landis & Koch, 1977). These findings support the need for analysis by content area (e.g., Gitomer et al., 2021) and challenge the aggregated data published by SCALE. | |
Policymakers, teacher educators, and other stakeholders should consider findings | |
from this study when making decisions about implementation and continuation of the | |
edTPA. Although this research focused solely on psychometric qualities, decision-makers must also weigh ethical and philosophical concerns such as consequential and
ecological validity, socioeconomic factors, racial bias, and potential effects on K–12 | |
student learning (e.g., Powell & Parkes, 2020). If the edTPA is to continue to serve as | |
a high-stakes assessment for preservice music teachers, it should act as only one component among multiple measures of readiness. Perhaps policymakers should allow | |
candidates scoring below their benchmark to make up the deficiency through grade | |
point averages, student teaching evaluations, content exams, or other criteria (e.g., | |
Parkes, 2020). | |
SCALE should consider revising the edTPA for specific content areas, especially | |
when data do not support reliability and validity. For the performing arts, test authors | |
should divide music, theater, and dance into separate handbooks, and then work with | |
educators to develop scoring procedures and criteria to better reflect the specific types | |
of teaching and learning that occur in these classrooms. Changes might include altering | |
the number of rubrics and their descriptors to focus more on creating, performing, and | |
responding, and less on learning about the subject through writing and discussion. | |
These changes are not unprecedented. The world languages and classical languages | |
handbooks each contain 13 rubrics. In addition, one version of the elementary education handbook consists of four tasks with 18 rubrics total (SCALE, 2019a). Regardless, | |
scoring rules should allow evaluators to consult all materials throughout the grading | |
process to account for the holistic nature of teaching (e.g., Powell & Parkes, 2020) and | |
to compensate for different factor structures that might exist in various subject areas. | |
Public data published in the Administrative Reports for all handbooks combined | |
hold little meaning, since the assessment is designed, administered, and scored within | |
separate disciplines. Instead, these analyses should reflect a higher level of transparency and contain complete data for each area. Results from factor analysis, for example, should include information not currently available such as the percentage of | |
variance explained by each factor, communalities, and the type of matrix (e.g., pattern, | |
structure) used in the interpretation. | |
Internal consistency coefficients (α) should account for measurement error caused | |
by raters. One method might be to calculate α for all portfolios graded by an individual | |
scorer, and then report an average for each task and all 15 rubrics combined within a | |
content area. Test administrators should also consider a different procedure for calculating interrater reliability. The current method of combining exact + adjacent agreements for use in kn is too liberal, especially with underuse of the lowest and highest | |
ratings (Stemler & Tsai, 2008). Likewise, using only exact agreements in the measurement might be too conservative concerning the practical application of edTPA scores | |
in readiness-for-licensure decisions, which flow from total scores rather than individual rubric scores. Instead, SCALE should consider use of a weighted kappa to provide | |
a more accurate representation of interrater reliability. This procedure penalizes disagreements in terms of their severity, whereas unweighted kappa treats all disagreements equally (Sim & Wright, 2005). Regardless, agreement indices and proportions | |
of scores for all rubrics in each content area should appear with other public data so | |
that scholars outside of SCALE and Pearson can verify statistical analysis and conduct | |
further research. | |
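A weighted kappa of the kind recommended here is available in common statistical libraries. The Python sketch below uses scikit-learn's cohen_kappa_score with linear and quadratic weights on two hypothetical rater vectors, purely to illustrate how weighting changes the penalty for larger disagreements; it is not an analysis of edTPA data.

# Illustration of unweighted vs. weighted kappa on hypothetical rater data;
# not an analysis of edTPA scores.
from sklearn.metrics import cohen_kappa_score

# Two raters scoring the same 12 rubrics on the 1-5 scale; most disagreements
# are adjacent (one point apart), one is two points apart.
rater_1 = [3, 3, 4, 2, 3, 4, 3, 3, 2, 4, 3, 3]
rater_2 = [3, 4, 4, 2, 2, 4, 3, 4, 2, 3, 3, 5]

unweighted = cohen_kappa_score(rater_1, rater_2)
linear = cohen_kappa_score(rater_1, rater_2, weights="linear")
quadratic = cohen_kappa_score(rater_1, rater_2, weights="quadratic")

# Unweighted kappa treats every disagreement equally; the weighted versions
# penalize the two-point disagreement more heavily than the one-point ones,
# which is the behavior argued for above (Sim & Wright, 2005).
print(f"unweighted kappa:   {unweighted:.2f}")
print(f"linear-weighted:    {linear:.2f}")
print(f"quadratic-weighted: {quadratic:.2f}")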
The high-stakes nature of the edTPA for preservice teachers requires valid and reliable results in all disciplines. Continued research is needed to monitor the psychometric qualities of this assessment and identify weaknesses. In the absence of publicly
available data, researchers could replicate this study and others (Austin & Berg, 2020; | |
Musselwhite & Wesolowski, 2019) by combining scores from multiple institutions to | |
create analyses that are more robust. Future studies should involve multiple statistical | |
procedures, given the limitations and advantages of the various methods. For example, the
Rasch model can compensate for differences in rater severity or sample characteristics | |
(Musselwhite & Wesolowski, 2019; Stemler & Tsai, 2008). Educator preparation programs considering the edTPA for internal use or states adopting the assessment as a | |
licensure requirement should not do so without evidence of validity and reliability for | |
each content area. | |
Declaration of Conflicting Interests | |
The author declared no potential conflicts of interest with respect to the research, authorship, | |
and/or publication of this article. | |
Funding | |
The author received no financial support for the research, authorship, and/or publication of this | |
article. | |
ORCID iD | |
Phillip M. Hash | |
https://orcid.org/0000-0002-3384-4715 | |
Supplemental Material | |
Supplemental material for this article is available online. | |
Notes | |
1. SCALE (2019b) uses categories of agreement (n = 2) rather than rubric categories (n = 5) as the unit of n in their calculations, stating that, “given the three possible classifications of agreement (perfect, adjacent, and nonagreement), . . . perfect and adjacent were combined as the agreement statistic” (p. 6). SCALE does not provide details about these calculations beyond stating the use of kappan. However, calculating this statistic using the exact + adjacent agreements for each rubric provided by SCALE (2019) and 2 as the value for n resulted in the same kn coefficients provided in the 2018 Administrative Report.
2. Rubrics that undergo double scoring are averaged when Scorer 1 and Scorer 2 reach adjacent agreement. Rubric scores more than one number apart are resolved by a scoring supervisor.
References | |
Adkins, A., Klass, P., & Palmer, E. (2015, January). Identifying demographic and preservice | |
teacher performance predictors of success on the edTPA [Conference presentation]. 2015 | |
Hawaii International Conference on Education. Honolulu, Hawaii. http://hiceducation.org/ | |
wp-content/uploads/proceedings-library/EDU2015.pdf | |
Asmus, E. P. (1989). Factor analysis: A look at the technique through the data of Rainbow. | |
Bulletin of the Council for Research in Music Education, 101, 1–29. www.jstor.org/stable/40318371 | |
Austin, J. R., & Berg, M. H. (2020). A within-program analysis of edTPA score reliability, | |
validity, and utility. Bulletin of the Council for Research in Music Education, 226, 46–65. | |
https://doi.org/10.5406/bulcouresmusedu.226.0046 | |
Behizadeh, N., & Neely, A. (2018). Testing injustice: Examining the consequential validity of | |
edTPA. Equity & Excellence in Education, 51(3–4), 242–264. http://doi.org/10.1080/1066 | |
5684.2019.1568927 | |
Bernard, C., & McBride, N. (2020). “Ready for primetime:” edTPA, preservice music educators, | |
and the hyperreality of teaching. Visions of Research in Music Education, 35, 1–26. wwwusr.rider.edu/%7Evrme/v35n1/visions/Bernard%20and%20McBride_Hyperreality%20 | |
Manuscript.pdf | |
Brennan, R. L., & Prediger, D. J. (1981). Coefficient kappa: Some uses, misuses, and alternatives. Educational and Psychological Measurement, 41(3), 687–699. https://doi. | |
org/10.1177/001316448104100307 | |
Brown, J. D. (2009). Choosing the right type of rotation in PCA and EFA. Shiken: JALT Testing | |
& Evaluation Newsletter, 13(3), 20–25. http://hosted.jalt.org/test/PDF/Brown31.pdf | |
Brown, T. A. (2015). Confirmatory factor analysis for applied research (2nd ed.). Guilford | |
Press. | |
Carmines, E. G., & Zeller, R. A. (1979). Reliability and validity assessment. Sage. | |
Choppin, J., & Meuwissen, K. (2017). Threats to validity in the edTPA video component. Action | |
in Teacher Education, 39(1), 39–53, https://doi.org/10.1080/01626620.2016.1245638 | |
Costello, A. B., & Osborne, J. (2005). Best practices in exploratory factor analysis: Four recommendations for getting the most from your analysis. Practical Assessment, Research, and
Evaluation, 10, Article 7. https://doi.org/10.7275/jyj1-4868 | |
Darling-Hammond, L., & Hyer, M. E. (2013). The role of performance assessment in developing teaching as a profession. Rethinking Schools, 27(4). www.rethinkingschools.org/ | |
articles/the-role-of-performance-assessment-in-developing-teaching-as-a-profession | |
Dover, A., Schultz, B., Smith, K., & Duggan, T. (2015). Embracing the controversy: edTPA, | |
corporate influence, and the cooptation of teacher education. Teachers College Record, | |
Article 18109. www.tcrecord.org/books/Content.asp?ContentID=18109 | |
Gitomer, D. H., Martinez, J. F., Battey, D., & Hyland, N. E. (2021). Assessing the assessment: | |
Evidence of reliability and validity in the edTPA. American Educational Research Journal, | |
58(1), 3–31. https://doi.org/10.3102%2F0002831219890608 | |
Heil, L., & Berg, M. H. (2017). Something happened on the way to completing the edTPA: | |
A case study of teacher candidates’ perceptions of the edTPA. Contributions to Music | |
Education, 42, 181–200. www.jstor.org/stable/26367442 | |
Krippendorff, K. (2013). Content analysis: An introduction to its methodology (3rd ed.). Sage. | |
Landis, J. R., & Koch, G. G. (1977). The measurement of observer agreement for categorical data. Biometrics, 33(1), 159–174. http://dx.doi.org/10.2307/2529310
Miksza, P., & Elpus, K. (2018). Design and analysis for quantitative research in music education. Oxford University Press. | |
Musselwhite, D. J., & Wesolowski, B. C. (2019). Evaluating the psychometric qualities of the | |
edTPA in the context of pre-service music teachers. Research Studies in Music Education. | |
Advance online publication. https://doi.org/10.1177/1321103X19872232 | |
National Association for Music Education. (2014). 2014 Music standards. https://nafme.org/ | |
my-classroom/standards/core-music-standards/ | |
National Association of Schools of Music. (2020). Handbook 2019-20. https://bit.ly/3jTVKQi | |
Parkes, K. A. (2020). Student teaching and certification assessments. In C. Conway, K. | |
Pellegrino, A. M. Stanley, & C. West (Eds.), Oxford handbook of preservice music teacher | |
education in the United States (pp. 231–252). Oxford University Press. | |
Parkes, K. A., & Powell, S. R. (2015). Is the edTPA the right choice for evaluating teacher | |
readiness? Arts Education Policy Review, 116(2), 103–113. https://doi.org/10.1080/1063 | |
2913.2014.944964 | |
Pecheone, R. L., & Whittaker, A. (2016). Well-prepared teachers inspire student learning. Phi | |
Delta Kappan, 97(7), 8–13. https://doi.org/10.1177/0031721716641641 | |
Powell, S. R., & Parkes, K. A. (2020). Teacher evaluation and performativity: The edTPA as a | |
fabrication. Arts Education Policy Review, 121(4), 131–140. https://doi.org/10.1080/1063 | |
2913.2019.1656126 | |
Sato, M. (2014). What is the underlying conception of teaching of the edTPA? Journal of | |
Teacher Education, 65(5), 421–434. http://doi.org/10.1177/0022487114542518 | |
Sim, J., & Wright, C. C. (2005). The kappa statistic in reliability studies: Use, interpretation, | |
and sample size requirements. Physical Therapy, 85(3), 257–268. https://doi.org/10.1093/ | |
ptj/85.3.257 | |
Stanford Center for Assessment, Learning, and Equity. (2013). 2013 edTPA field test: Summary
report. https://secure.aacte.org/apps/rl/res_get.php?fid=827 | |
Stanford Center for Assessment, Learning, and Equity. (2015). Educative assessment and | |
meaningful support: 2014 EdTPA administrative report. https://secure.aacte.org/apps/rl/ | |
res_get.php?fid=2188&ref=edtpa | |
Stanford Center for Assessment, Learning, and Equity. (2018a). edTPA K-12 Performing arts | |
assessment handbook (Version 06). http://ceit.liu.edu/Certification/EdTPA/2018/edtpapfa-handbook%202018.pdf | |
Stanford Center for Assessment, Learning, and Equity. (2018b). Understanding rubric level | |
progressions: K–12 performing arts (Version 01). https://concordia.csp.edu/teachered/wpcontent/uploads/sites/3/K-12-Performing-Arts-Rubric-Progressions.pdf | |
Stanford Center for Assessment, Learning, and Equity. (2018c). Educative assessment and | |
meaningful support: 2017 EdTPA administrative report. https://secure.aacte.org/apps/rl/ | |
res_get.php?fid=4271&ref=edtpa | |
Stanford Center for Assessment, Learning, and Equity. (2019a). Educative assessment and | |
meaningful support: 2018 EdTPA administrative report. https://secure.aacte.org/apps/rl/ | |
res_get.php?fid=4769&ref=edtpa | |
Stanford Center for Assessment, Learning, and Equity. (2019b). Affirming the validity and | |
reliability of edTPA [White paper]. http://edtpa.aacte.org/wp-content/uploads/2019/12/ | |
Affirming-Validity-and-Reliability-of-edTPA.pdf | |
Stanford Center for Assessment, Learning, and Equity. (2019c). edTPA EPP performance summary: January 2019 - June 2019. https://sasn.rutgers.edu/sites/default/files/sites/default/ | |
files/inline-files/Jan%20to%20June%202019%20edTPA.pdf | |
Stemler, S. E., & Tsai, J. (2008). Best practices in interrater reliability: Three common | |
approaches. In J. Osborne (Ed.), Best practices in quantitative methods (pp. 29–49). Sage.
Revista Educación | |
ISSN: 0379-7082 | |
ISSN: 2215-2644 | |
Universidad de Costa Rica | |
Costa Rica | |
Actualización de la evaluación docente | |
de posgrados en una universidad | |
multicampus: experiencia desde la | |
Universidad Santo Tomás (Colombia)[1] | |
Patiño-Montero, Freddy; Godoy-Acosta, Diana Carolina; Arias Meza, Deyssy Catherine | |
Actualización de la evaluación docente de posgrados en una universidad multicampus: experiencia desde la | |
Universidad Santo Tomás (Colombia)[1] | |
Revista Educación, vol. 46, núm. 2, 2022 | |
Universidad de Costa Rica, Costa Rica | |
Disponible en: https://www.redalyc.org/articulo.oa?id=44070055006 | |
DOI: https://doi.org/10.15517/revedu.v46i2.47955 | |
Esta obra está bajo una Licencia Creative Commons Atribución-NoComercial-CompartirIgual 3.0 Internacional. | |
Artículos científicos | |
Actualización de la evaluación docente de posgrados en una universidad | |
multicampus: experiencia desde la Universidad Santo Tomás (Colombia)[1] | |
Update of the Postgraduate Teaching Evaluation in a Multicampus University: Experience from the Santo Tomás | |
University (Colombia) | |
Freddy Patiño-Montero
Universidad Santo Tomás, Bogotá, Colombia
https://orcid.org/0000-0001-5795-4911
Diana Carolina Godoy-Acosta
Universidad Santo Tomás, Bogotá, Colombia
https://orcid.org/0000-0002-1903-0854
Deyssy Catherine Arias Meza
Universidad Santo Tomás, Bogotá, Colombia
https://orcid.org/0000-0001-6689-5706
Recepción: 20 Agosto 2021 | |
Aprobación: 20 Septiembre 2021 | |
Resumen: | |
Este artículo presenta los resultados de una investigación evaluativa cuyo objetivo estuvo orientado a realizar un ejercicio de | |
metaevaluación de la evaluación docente de posgrados de la Universidad Santo Tomás, durante el período 2017-2020. Los | |
referentes teóricos se ubican en orden a las categorías: evaluación educativa, evaluación del profesorado e investigación evaluativa,
así como los referentes institucionales que se tuvieron en cuenta dentro del proceso. La investigación se ubica en el paradigma | |
cualitativo y corresponde a una metodología de investigación evaluativa que permitió el diseño de ocho pasos que orientaron la | |
realización del estudio; esto posibilitó la metaevaluación de la evaluación docente de los posgrados de la Universidad Santo Tomás, | |
donde participaron personas estudiantes, docentes, directoras de programa y decanas de facultad en el diagnóstico, mesas de trabajo, | |
aplicación de pilotaje, evaluación del instrumento final e implementación de la evaluación. Los resultados alcanzados se presentan | |
en coherencia con la metodología, en consideración de que no se hacían efectivas políticas y procedimientos institucionales, unido | |
a que el derecho a réplica del profesorado era casi nulo. Por otro lado, lo más relevante es la definición de una evaluación docente | |
personalizada de acuerdo con su plan de trabajo y la evaluación del desempeño del personal docente contratado por orden de | |
prestación de servicios. Todo esto conllevó a la consolidación y parametrización de un aplicativo institucional. Finalmente, se | |
esbozan algunas conclusiones del proceso y recomendaciones de carácter metodológico para adelantar este tipo de trabajos en | |
instituciones de educación superior multicampus. | |
Palabras clave: Evaluación educativa, Evaluación docente, Educación superior, Investigación evaluativa. | |
Abstract: | |
This article presents the results of an evaluative research project whose objective was oriented to carrying out a meta-evaluation exercise of the postgraduate teacher evaluation of the Santo Tomás University during the period 2017-2020. The theoretical referents are placed in order of the categories: educational evaluation, the evaluation of the teaching staff, and evaluative research, as well as the institutional referents that were considered within the process. The research is located in the qualitative paradigm and corresponds to an evaluative research methodology that permitted the eight-step design that oriented the accomplishment of the study. This made possible the meta-evaluation of the teaching evaluation of the postgraduate programs of the Santo Tomás University, where students, teachers, program directors, and deans of faculty participated in the diagnosis, working groups, application of piloting, evaluation of the final instrument, and implementation of the evaluation. The results achieved are presented in coherence with the methodology, considering that institutional policies and procedures were not implemented, together with the fact that the teachers' right to reply was almost nil. On the other hand, the most relevant result is the definition of a personalized teacher evaluation according to each teacher's work plan, and the evaluation of the performance of the teaching staff hired by order of service provision. All this led to the consolidation and parameterization of an institutional application. Finally, some conclusions of the process and recommendations of a methodological nature are outlined to carry out this type of work in multi-campus higher education institutions.
Keywords: Educational Evaluation, Teacher Evaluation, Higher Education, Evaluative Research. | |
Epígrafe | |
“O la evaluación es útil o no tiene sentido realizarla; tiene que ser un instrumento para la acción y no un mero | |
mecanismo de justificación o para tranquilizar conciencias. Los evaluadores debemos ser beligerantes en este | |
sentido” (Escudero-Escorza, 2000, p. 406). | |
Introducción | |
La Universidad Santo Tomás [2] (USTA) es una Institución de Educación Superior de carácter privado, | |
con presencia nacional a través de la sede principal Bogotá, seccionales en Bucaramanga y Tunja, y sedes | |
en Medellín y Villavicencio. Adicionalmente, cuenta con Centros de Atención Universitaria (CAU) en 23 | |
ciudades y municipios del país. La oferta académica comprende 76 programas de pregrado y 129 de posgrado, | |
en los cuales están matriculadas cerca de 32,000 personas estudiantes, divididas en 29,000 en pregrado y 3000 | |
en posgrado. Para cumplir con su misión, la USTA cuenta con 2,350 docentes con dedicación de tiempo | |
completo, medio tiempo y hora cátedra (Mesa-Angulo, 2020). | |
Desde este contexto, el proceso investigativo inició con un ejercicio de diagnóstico, realizado en 2017, | |
sobre el estado de la evaluación docente en posgrado, donde se pudo constatar que esta no obedecía a las | |
mismas dinámicas que en pregrado, al punto que cada uno de los programas tenía sus propios instrumentos | |
y metodologías. Además de lo anterior, se identificaron algunos problemas como: | |
• Poca significatividad del instrumento que diligencia el estudiantado, puesto que la redacción de los descriptores en su mayoría está referida únicamente a la modalidad presencial.
• La escasa motivación y participación por parte del estudiantado.
• La participación intermitente por parte del profesorado.
• La poca implementación de planes de mejoramiento por parte del cuerpo docente.
• La escasa información para la toma de decisiones desde la gestión de los programas respecto a la continuidad del cuerpo docente.
En virtud de lo anterior, el objetivo principal de esta investigación fue realizar un ejercicio de | |
metaevaluación de la evaluación docente de posgrados de la USTA. Sus objetivos específicos fueron: a) | |
analizar los referentes conceptuales y metodológicos que soportan los procesos de investigación evaluativa | |
y evaluación educativa, b) evaluar el nivel de implementación de las políticas y procedimientos para la | |
evaluación docente de posgrados de la USTA, y c) proponer una nueva batería de instrumentos que atienda | |
a las necesidades de los programas de posgrado. | |
Estado de la cuestión | |
Respecto a la evaluación educativa, como se evidencia en los estudios que se refieren a lo largo de esta | |
publicación, su evolución se ubica en la misma historia de la educación, en cuanto al concepto, alcance y | |
metodologías. Asimismo, indican que el siglo XX fue especialmente significativo en tanto que se alcanza la | |
profesionalización de la evaluación educativa y se supera la perspectiva objetivante centrada en la medición. | |
Así, con base en el enfoque constructivista, el propósito pasa a ser el aprendizaje del estudiantado, lo que | |
implica un cambio de perspectiva respecto a los fines de la educación y la misma evaluación, como se percibe | |
en los análisis de Santos-Guerra (2010), Casanova (2007), Escudero-Escorza (2003), Rossett y Sheldon | |
(2001), House (1995), Stufflebeam y Shinkfield (1987), Guba y Lincoln (1989), entre otros. | |
En cuanto a la evaluación del profesorado, se encuentra que los diversos trabajos de revisión teórica | |
identifican algunas tendencias, entre las cuales se destacan: la complejidad de la labor docente, la falta de | |
consenso frente a lo que significa ser docente de calidad en la universidad; la diversidad de criterios con | |
relación a la selección y evaluación, asociadas a las nociones subyacentes sobre la buena enseñanza; la tendencia | |
a reducir las funciones del profesorado universitario únicamente a la docencia; la influencia de la docencia | |
en la calidad educativa; la diversidad de funciones, agentes y metodologías de evaluación, hasta los estímulos | |
salariales y la carrera académica, entre otras, de acuerdo con los trabajos de Rueda (2014), Ramírez-Garzón | |
y Montoya-Vargas (2014), Montoya y Largacha (2013), Fernández y Coppola (2012), Escudero-Muñoz | |
(2010), Murillo-Torrecilla (2008), y Tejedor-Tejedor y Jornet-Meliá (2008). | |
De forma complementaria, la revisión permitió constatar que la investigación evaluativa se ha venido | |
fortaleciendo desde las últimas décadas como una de las metodologías de investigación, cuya finalidad se | |
centra en el mejoramiento de la calidad, especialmente del servicio educativo, con un alto énfasis a generar | |
participación de las partes involucradas y con una amplia flexibilidad metodológica, como indican Belando-Montoro y Alanís-Jiménez (2019), Escudero-Escorza (2006, 2019), Tejedor-Tejedor y Jornet-Meliá (2008),
Tejedor-Tejedor (2009), Litwin (2010) y Saravia-Gallardo (2004). | |
Referentes conceptuales | |
Evaluación educativa | |
Respecto a la primera categoría, evaluación educativa, se identifica como un proceso formativo que se | |
realiza sobre las acciones desarrolladas en el marco de las instituciones educativas, con la intención de detectar | |
dificultades e implementar planes de mejora que permitan solucionarlas de manera satisfactoria y pertinente. | |
Por ende, implica aspectos tales como los resultados de aprendizaje y la evaluación institucional. | |
Con base en el concepto propuesto, resulta relevante lo expuesto por Casanova (2007) en el Manual de | |
Evaluación Educativa, cuando afirma que este “consiste en un proceso sistemático y riguroso de recogida de | |
datos” (p. 60), cuyo propósito es disponer de información continua y significativa, que permita formar juicios | |
de valor para tomar decisiones que mejoren la actividad educativa. | |
Los elementos planteados por Casanova adquieren relevancia puesto que se entiende como el resultado de | |
un conjunto de actividades, claramente relacionadas entre sí, que se dan en el marco del transcurrir cotidiano | |
de las instituciones educativas, desde su fase inicial (diagnóstico) hasta la entrega de resultados, pero sin | |
terminar con estos, pues una vez se obtienen, se da inicio al ciclo de mejoramiento, que implica ir a cada | |
una de las instancias, factores y actores evaluados para establecer las rutas más adecuadas para asegurar que | |
efectivamente el proceso en sí mismo se vaya cualificando. | |
En línea con ello, Scriven (1967) afirma que “la evaluación es en sí misma una actividad metodológica que | |
es esencialmente similar si estamos tratando de evaluar máquinas de café o máquinas de enseñanza, los planes | |
para una casa o los planes para un programa de estudios” (p. 40). Scriven enmarca la evaluación como un | |
procedimiento, que lleva implícita la idea de secuencia, progresión en la ejecución de una serie de pasos. De | |
hecho, al revisar la propuesta de Scriven es posible afirmar que su objetivo consiste en desplazar la evaluación | |
desde los objetivos hacia las necesidades, en tanto que toda ella está orientada hacia la persona consumidora | |
(usuario). | |
Por su parte, para Rossett y Sheldon (2001), la evaluación es el proceso de examen de un programa o | |
proceso para determinar qué funciona, qué no y por qué. La evaluación determina el valor de los programas y | |
actúa como modelo para el juicio y la mejora. Respecto a este punto, las personas autoras nuevamente toman | |
el término proceso para referirse a la evaluación, al tiempo que incluyen el elemento valorativo que se debe | |
dar dentro de este, así como la intención de utilizarlos en perspectiva de mejoramiento, en este caso de los | |
programas. | |
Entonces, conviene preguntarse: ¿cuál evaluación y al servicio de quién? La evaluación educativa es un | |
tema que ha logrado mantener especial relevancia en el ámbito académico, en cuanto aspecto neurálgico | |
en los procesos educativos. Es decir, se ha evidenciado que la evaluación es un tema que trasciende los | |
espacios convencionales de debate, puesto que en ella convergen múltiples factores y relaciones humanas, | |
tales como: la dimensión social, en cuanto que es de alguna manera una forma de establecer relaciones entre | |
el estudiantado, entre el estudiantado y el profesorado, y entre el cuerpo docente, ya que se pregunta por
los valores, el respeto por las personas y el sentido de la justicia [por mencionar algunas] (Santos-Guerra, | |
2010); una dimensión política (House, 1995), por tanto, no debe ser solo veraz, sino justa, en la medida que | |
es tomada la mayoría de la veces como un instrumento de poder que, según su uso, llega a determinar la vida | |
de las personas; una dimensión filosófica, en cuanto que se debe preguntar por el fundamento, la razón de ser | |
de la acción evaluativa para que no quede reducida a una mera actividad desarticulada, es decir, al plano del | |
activismo sin ninguna reflexión; de igual manera una dimensión teleológica, es decir, que exista una mirada, | |
un horizonte claro hacia el cual se quiere llegar, un para qué, que ayude a darle sentido al proceso. | |
Ahora bien, se asume la línea teórica definida por personas autoras como Stake (2006), según la cual la evaluación educativa es el proceso de emitir un juicio de valor con base en evidencias objetivas sobre el mérito y las deficiencias de algo. Del mismo modo, Cordero y Luna (2010) argumentan que “la evaluación comprende dos
componentes: el estudio empírico, determinar los hechos y recolectar la información de manera sistemática; | |
y la delimitación de los valores relevantes para los resultados del estudio” (p. 193). Justamente esa es la postura | |
que se asume al inicio de este apartado. | |
En síntesis, como afirma Cabra-Torres (2014): | |
la evaluación ha servido de motor para gran parte de los cambios de orientación de los sistemas educativos, en razón de | |
la información que produce y de los interrogantes que despiertan la gestión y el análisis de los resultados que entrega a la | |
sociedad (p. 178). | |
Evaluación del profesorado | |
En cuanto a la evaluación del profesorado, se concibe como una herramienta de gestión que posibilita | |
el desarrollo de la carrera docente, en el marco de una institución educativa (en este caso, de educación | |
superior). En este sentido, implica la recolección de información por parte de los agentes e instancias en las | |
que se desempeña, en el marco de las funciones universitarias, con el fin de establecer estrategias y actividades | |
que le sirvan al personal docente para identificar sus fallas y mejorarlas con apoyo de la institución. Al | |
mismo tiempo, posibilita a las Instituciones de educación superior la implementación de planes de formación | |
docente que redunden en beneficios para quienes obtienen bajas calificaciones en este proceso, lo cual ha | |
permitido caracterizar cada vez más ideas acerca de los atributos del buen profesor (Belando-Montoro y Alanís-Jiménez, 2019). Como última instancia, provee de herramientas a las Instituciones de Educación Superior
(IES) para la toma de decisiones informadas sobre la continuidad o no de un profesor o profesora. | |
Lo afirmado hasta el momento se encuentra en plena consonancia con los planteamientos de Montoya | |
y Largacha (2013), Vásquez-Rizo y Gabalán-Coello (2012), Fernández y Coppola (2010), y Luna-Serrano | |
(2008), quienes destacan que la evaluación de la docencia universitaria implica una amplia diversidad de | |
agentes evaluadores, en tanto que la profesión académica no se limita únicamente a la función docente, | |
sino que son amplias y diversas las funciones y roles del profesorado en las universidades. De allí que aún | |
hoy no haya consenso respecto a cómo evaluarla ni cuál es el mejor método. En consecuencia, como indica | |
Rueda (2014), “es necesario reconocer la relevancia del rol que puede cumplir la evaluación sistemática del | |
desempeño docente en la profesionalización y perfeccionamiento permanente del profesorado” (p. 99). Es | |
decir, se enmarcan en una categoría más amplia como es la profesión académica. | |
De allí que un error bastante común es considerar la profesión académica desde un ideal de institución | |
de educación superior, desde el desempeño de la función docente propiamente dicha e incluso desde una | |
modalidad tradicional, puesto que “la actividad docente no se restringe a la interacción en el aula, existen | |
otros modelos de enseñanza como la formación en servicio o la educación a distancia” (Rueda, Luna, García | |
y Loredo, 2011; citado en Rueda, 2014, p. 100) | |
Así, por ejemplo, al revisar textos clásicos sobre evaluación del maestro y la maestra, se encuentra con que | |
ya en ellos se enunciaban retos a los que se enfrenta como el incremento de conocimientos, los cambios del | |
estudiantado, como se indicaba en su momento, “la creciente investigación en la psicología, la sociología y | |
campos afines, que es pertinente a la enseñanza y aprendizaje” (Simpson, 1967, p. 12), toda vez que tensionan | |
sus propias prácticas pedagógicas. | |
Lo planteado hasta el momento evidencia la necesidad de realizar un análisis multidimensional del trabajo | |
del profesorado, que atiende a diferentes perspectivas de quienes reciben su servicio, e incluso que propios | |
miembros del cuerpo docente se puedan autoevaluar a partir de los mismos criterios con que son evaluados | |
externamente, de manera que este ejercicio realmente se haga a partir de aspectos conmensurables. En este | |
sentido, se debe tener en cuenta la finalidad del proceso evaluativo y autoevaluativo. Es decir, “para que | |
el maestro adquiera una preparación excelente y su enseñanza alcance un nivel superior, se requiere que | |
preste continua atención al problema de la autoevaluación y su meta reconocida: el automejoramiento del | |
maestro” (Simpson, 1967, p. 11). | |
Ahora bien, en cuanto a estos aspectos metodológicos, se encuentra que la estrategia y el instrumento más | |
común para realizar la evaluación es el uso de cuestionarios de opinión que responde el estudiantado, los | |
cuales remiten especialmente a aspectos didácticos y evaluativos, tal como se encuentra en las investigaciones | |
de Rueda, Luna, García y Loredo (2011; citado en Rueda, 2014) y Litwin (2010), realizadas en México y | |
Argentina, respectivamente. | |
Sobre los últimos rasgos mencionados, es pertinente enunciar que, efectivamente, también son utilizados | |
en la evaluación realizada en la USTA, como se aprecia en algunas referencias a documentos institucionales: | |
Referentes institucionales | |
TABLA 1 | |
Referentes institucionales | |
Fuente: elaboración propia con base en documentos institucionales | |
Procedimientos Metodológicos | |
De acuerdo con diversas personas autoras, especialmente al revisar la obra de Escudero-Escorza (2000, 2006, | |
2016 y 2019), la investigación evaluativa es concebida como una metodología del ámbito de las ciencias | |
sociales que se ha fortalecido en las últimas décadas, en tanto que brinda las herramientas suficientes para | |
la implementación de un ejercicio de evaluación riguroso, con la participación de los directos involucrados. | |
Ello, en perspectiva de que sea posible la definición de los aspectos que requieren ajustes, modificaciones o | |
supresiones en el marco del mejoramiento de los procesos o las prácticas donde se requieran implementar | |
cambios a través de un ejercicio evaluativo más participativo, consciente y pertinente, que contribuya a la | |
calidad de la educación. | |
Delimitada de esta forma, se puede afirmar que se ubica en el campo de la investigación cualitativa, en tanto
que “incluye formulaciones paradigmáticas múltiples y, también, complejas críticas, epistemológicas y éticas, | |
a la metodología de investigación tradicional en las ciencias sociales” (Denzin y Lincoln, 2012, p. 24). Es | |
decir, al centrarse en la evaluación respecto a diferentes campos de conocimiento, advierte en sí misma una | |
intención de valorar y mejorar los procesos que se dan en su interior. Dicha perspectiva transformadora | |
enfatiza, como afirma Escudero-Escorza (2016), “la función de esta al servicio del cambio social y, en | |
concreto, al servicio de la mejora social” (p.14). En ese sentido, la investigación evaluativa en educación | |
beneficia de forma significativa a todos los agentes e instancias educativas y, por tanto, a las instituciones y | |
a la sociedad en general. | |
Dado que en parte su sello diferenciador radicó en la evaluación de programas sociales, en algún punto de su | |
desarrollo llegó a confundirse con esta actividad. Sin embargo, dada su amplitud y fundamentación terminó | |
por imponerse la tradición de la investigación evaluativa; “mientras que la evaluación de programas se definió | |
como la investigación evaluativa directamente aplicada a programas sociales” (Owen y Rogers, 1999, citado | |
en Escudero-Escorza, 2006, p. 181). | |
En el mismo texto, Escudero-Escorza (2006) presenta una serie de elementos que permiten identificar la | |
investigación evaluativa. A continuación, se retoman algunos de estos: | |
• La solución de problemas concretos como finalidad.
• Se investiga sobre todo en situaciones naturales.
• La observación es la principal fuente de conocimiento.
• Se emplean tanto métodos cuantitativos como cualitativos.
• Se busca la mejora de programas sociales.
• Se informa a los responsables de tomar decisiones sobre programas y prácticas. (pp. 180-181)
Asimismo, no se puede soslayar la intencionalidad o los propósitos con los cuales se realizan las evaluaciones | |
del profesorado, entre los cuales, según un estudio reciente, se cuentan cerca de 15 tipos distintos de | |
propósitos (Escudero-Escorza, 2019). A pesar de ello, en la actualidad sigue sin haber suficiente consenso | |
respecto a lo que es “un buen profesor” (Tejedor-Tejedor, 2009, p. 79). Por tanto, cobra sentido el hecho | |
que sean las propias IES quienes, a la luz de su filosofía institucional y horizonte estratégico, puedan | |
“determinar el modelo de profesor que se quiere, estableciendo los comportamientos que se consideran | |
deseables para después analizar en qué medida la conducta del profesor satisface el referente de calidad | |
establecido” (Tejedor-Tejedor, 2009, p. 93). Lo cual, para el caso de la USTA está claramente identificado | |
en los elementos destacados de sus documentos institucionales, tal como se evidenció en la Tabla 1. | |
En función de lo anterior y en atención al proyecto definido se estableció una serie de etapas que hicieron | |
posible el ejercicio definido desde el alcance de sus objetivos. Como se percibe en la Figura 1, las personas | |
investigadoras trazaron una ruta metodológica propia para la investigación, compuesta por ocho (8) fases, | |
la cual recoge, en buena medida, las principales recomendaciones de los expertos en este campo, en el | |
entendido que “todas las aproximaciones metodológicas son útiles en algún momento y para alguna faceta | |
evaluativa y que todas tienen sus limitaciones y que en la práctica se requieren generalmente aproximaciones | |
metodológicas diversas y complementarias” (Escudero-Escorza, 2019, p. 24). Asimismo, dada la naturaleza | |
del estudio, se establece un muestreo cualitativo, en atención a las fases establecidas y a los agentes e instancias | |
intervinientes en el proceso, es decir, un tipo de muestreo no probabilístico, como se recomienda en estos casos
(Hernández-Sampieri y Mendoza-Torres, 2018). Por tanto, se fijó la muestra por criterios y por conveniencia | |
(Otzen y Manterola, 2017). Por criterios, porque debido a la estructura orgánica de la universidad fue | |
necesario garantizar representación de diversos organismos colegiados, y por conveniencia, en cuanto a que | |
se realizaron invitaciones al grupo de representantes referido más adelante, el cual participó de manera | |
voluntaria en diversos ejercicios que contribuyeron en las diferentes fases del proceso. | |
A continuación, se representan los momentos, los cuales son detallados posteriormente en el apartado | |
sobre resultados y discusión. | |
FIGURA 1. | |
Metaevaluación de la evaluación docente de posgrados | |
Fuente: elaboración propia | |
Análisis y discusión de resultados | |
Desarrollo de las fases: | |
1. Análisis contextual. Se realiza la revisión de referentes documentales que constate la necesidad de la
consolidación de un sistema de evaluación, en el cual se encontró que en 2013 la Unidad de Posgrados | |
realizó un primer ejercicio diagnóstico y según el Informe de Gestión de la Vicerrectoría Académica General | |
(VAG) - Plan de Acción 2011-2013, se recomendó consolidar el sistema institucional de evaluación docente | |
a nivel de Posgrados. Por otro lado, se evidencia que entre los años 2014 y 2016 los programas de Posgrados | |
aplicaron instrumentos de evaluación de manera espontánea, no unificados ni sistemáticos, estos ejercicios | |
evaluativos no fueron obligatorios y, en general, no contemplaron lo definido en la Dimensión de la Política | |
Docente ([USTA], 2015), sino criterios establecidos al interior de cada programa. Se percibe una serie de dificultades asociadas a la baja participación de docentes y de estudiantes; esta última no llegaba al 30 %, lo que evidenció que buena parte de quienes participaban eran estudiantes que perdieron asignaturas.
A la luz de los resultados obtenidos y tal como se muestran en el acápite anterior, cabe resaltar que es | |
imposible generar una propuesta nueva si se desconocen los esfuerzos previos que ha realizado la institución, | |
dado que es allí donde se obtienen experiencias e insumos sobre los cuales replantearse dinámicas y alcances | |
para que, según un contexto con sus particularidades, se logre cumplir con los objetivos académicos y | |
administrativos de los programas. | |
2. Revisión de referentes conceptuales e institucionales. Esta actividad fue realizada de manera independiente
y posteriormente se unificaron y discutieron los hallazgos por parte de los equipos de trabajo de Currículo | |
de la DUAD [3], en su momento VUAD [4], junto con los profesionales de la Unidad de Posgrados de la
sede Principal. A partir de allí, se consolida el proyecto de evaluación docente, que responde a los procesos | |
planeados en la articulación con la VAG, donde se realizó un primer diagnóstico acerca de las metodologías | |
e instrumentos aplicados en los diferentes programas de posgrados, en la Sede Principal y de la DUAD. En | |
este punto se pudo constatar que, incluso, en términos de políticas y procedimientos institucionales, muchos | |
aspectos que estaban definidos en los documentos, o bien no se conocían o no se hacían efectivos en la DUAD. | |
3. Análisis de la implementación de la evaluación. Unido a los hallazgos del punto anterior, aquí se encontró | |
que la socialización de los resultados de su evaluación directamente con el profesorado y el derecho a réplica a | |
partir de estos, antes de consignarlos de manera definitiva, eran casi nulos. Asimismo, que un alto porcentaje de
docentes no consultaba los resultados de su evaluación, o lo hacía únicamente como requisito para participar | |
en las convocatorias de ascenso en el escalafón docente. Finalmente, y quizás, lo más relevante aquí fue | |
la definición de una evaluación docente personalizada en función del plan de trabajo elaborado por cada | |
docente al inicio de semestre, lo cual no ocurría y llamó la atención como la mayor oportunidad de mejora | |
en el nuevo procedimiento a implementar. También se identificó que los procesos de autoevaluación con fines de renovación de Registro Calificado y de Acreditación de Alta Calidad de los programas de posgrado exigían información documental sobre los procesos de evaluación docente, de cara a la definición de planes de mejora y de formación propios de la carrera docente; esto demandaba un ejercicio más riguroso respecto a la trazabilidad del ejercicio docente.
4. Evaluación de la batería de instrumentos utilizada. A partir del trabajo articulado entre la Unidad
de Posgrados y el Equipo de Currículo de la DUAD, se constata la existencia de múltiples instrumentos | |
utilizados por los programas, diferentes al oficial. Ahora bien, respecto al instrumento oficial definido por | |
la Unidad de Desarrollo Curricular y Formación Docente [UDCFD], se constató que buena parte de | |
los descriptores estaban determinados para la modalidad presencial y que incluso no correspondían a las | |
dinámicas propias de los posgrados. A continuación, se refieren algunos a modo de ejemplo: | |
• Manual del deportista fue divulgado a tiempo y es de total conocimiento.
• El docente utiliza el gimnasio de la USTA como soporte de la preparación física integral.
Asimismo, se evidenció que los factores relacionados con la integración de las funciones sustantivas desde | |
el currículo, y que son evidentes para el estudiantado, no se evalúan de manera directa. Es decir, no se evalúan | |
acciones relacionadas con investigación y proyección social, que son trabajadas desde la docencia. | |
En este mismo sentido es importante resaltar en implementaciones similares que existen instituciones | |
que, tal como la Universidad Santo Tomás, cuentan con distintas modalidades de enseñanza dentro de sus | |
Facultades, esto hace aún más desafiante el reto de construir un único modelo de evaluación, pues el ejercicio | |
académico y administrativo requiere de especificidades acorde a las necesidades de cada modalidad. Esto | |
requiere de tiempo y negociaciones entre los diferentes actores participantes del instrumento de evaluación | |
para lograr que en consenso se acojan las particularidades de cada uno. | |
5. Formulación del nuevo instrumento de evaluación. Teniendo en cuenta el marco de referencia que ofrece | |
el documento Dimensión de la Política Docente [DPD], en el cual se definen todos los aspectos de orden | |
conceptual y metodológico de la evaluación docente, el equipo de trabajo decide acatarlas en su gran mayoría, | |
especialmente aquellas de orden conceptual. Se contemplan adicionalmente las particularidades del personal | |
docente que está vinculado por orden de prestación de servicios (OPS), dado que suman un gran número en | |
los programas de posgrado. En el aspecto metodológico, específicamente en lo referido a los instrumentos y | |
a la escala de ponderación se proponen los cambios más significativos. A continuación, se presentan algunos | |
de ellos: | |
• Luego de diversos análisis, el equipo define asumir la escala de valoración de la DPD, que contempla seis niveles de ponderación que van del 0 al 5, correspondientes con los siguientes criterios: 0 No se cumple, 1 Se cumple insuficientemente, 2 Se cumple con bajo grado, 3 Se cumple medianamente, 4 Se cumple en alto grado y 5 Se cumple plenamente.
• La DPD define tres instrumentos: uno para estudiantes, otro para decanos y uno para docentes. Los tres son diferentes en cuanto a la redacción de los descriptores, el número de descriptores por aspecto evaluado, etc. Lo anterior se considera una oportunidad de mejora que se asume en la nueva propuesta. Así, la nueva propuesta consta de los siguientes instrumentos:
I. Instrumento de evaluación de estudiantes sobre el desempeño docente de posgrado.
II. Instrumento de evaluación del director o líder del Programa Académico de Posgrado al docente.
III. Instrumento de evaluación del decano al docente, que se utilizaría en caso de que el programa de posgrado no cuente con la figura de director o líder del Programa Académico de Posgrado.
IV. Instrumento de Autoevaluación Docente de Posgrado.
Estos instrumentos están basados directamente en la evaluación que realiza el estudiantado sobre el desempeño docente, ya que es este quien tiene la asignación más alta en la escala de ponderación: la evaluación de estudiantes llega al 50 %, la autoevaluación docente a un 25 % y la evaluación de la persona decana o directora de programa al restante 25 %, para un total de 100 %.
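As an illustration of the weighting just described (student questionnaire 50 %, teacher self-evaluation 25 %, dean or program director 25 %, all on the institutional 0-5 scale), here is a minimal Python sketch; the function name and the sample scores are hypothetical and are not part of the USTA application.

# Weights defined for the proposal: students 50 %, self-evaluation 25 %,
# dean or program director 25 %; every instrument uses the 0-5 scale.
WEIGHTS = {"estudiantes": 0.50, "autoevaluacion": 0.25, "decano_o_director": 0.25}

def overall_teacher_score(scores):
    """Weighted average of the three instruments on the 0-5 scale."""
    return sum(WEIGHTS[source] * scores[source] for source in WEIGHTS)

# Hypothetical example: 4.2 from students, 4.8 self-evaluation, 4.5 from the director
# -> 0.50 * 4.2 + 0.25 * 4.8 + 0.25 * 4.5 = 4.425
print(overall_teacher_score({"estudiantes": 4.2, "autoevaluacion": 4.8, "decano_o_director": 4.5}))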
En este punto, se remitió al Departamento de Talento Humano el nuevo instrumento propuesto para | |
evaluar la viabilidad jurídica relacionada con la contratación de personal docente OPS. Además, para efectos | |
de garantizar el cumplimiento de aspectos señalados en la DPD, desde la UDCFD se asigna a una persona | |
docente que haga parte del equipo de trabajo para el diseño e implementación de la propuesta de la Evaluación | |
Docente de Posgrados. | |
Es importante resaltar que para los programas de posgrado presenciales se logró, en articulación con el | |
departamento de TIC de la USTA, que la batería de preguntas se filtrara según el plan de trabajo de cada | |
docente para que solo aparecieran aquellas preguntas que estuvieran relacionadas con las actividades para las | |
cuales fueron asignadas horas desde la nómina de cada programa (Plan de Trabajo Docente). Para el caso de | |
la DUAD, debido a una incompatibilidad de sistema, se realizó en un formulario de Google que permitía | |
tener toda la batería de preguntas con la posibilidad de decir no aplica a la actividad que no hacía parte del | |
plan de trabajo. | |
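The filtering of the question battery by each teacher's work plan, described above, can be pictured with a short sketch; the activity labels, question texts, and data structures are assumptions for illustration and do not reproduce the implementation built with the TIC department.

# Hypothetical question bank: each item is tagged with the work-plan activity it evaluates.
QUESTION_BANK = [
    {"actividad": "docencia", "texto": "El docente presenta el syllabus y los criterios de evaluación."},
    {"actividad": "investigacion", "texto": "El docente articula resultados de investigación con el curso."},
    {"actividad": "proyeccion_social", "texto": "El docente vincula el curso con proyectos de proyección social."},
]

def questions_for(work_plan_activities):
    """Keep only the items tied to activities with hours assigned in the teacher's work plan."""
    return [q for q in QUESTION_BANK if q["actividad"] in work_plan_activities]

# A teacher whose plan only assigns hours to teaching and research would not see the
# 'proyección social' item; in the Google Forms variant it would instead be marked 'no aplica'.
for question in questions_for({"docencia", "investigacion"}):
    print(question["texto"])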
6. Socialización y ajustes con los agentes involucrados. Se realizó la socialización con grupos de representantes | |
de cada uno de los agentes involucrados, tales como: estudiantes, docentes, comités de currículo de | |
Facultad (DUAD), personas coordinadoras de Programa, decanas, decanos, vicerrectoras y vicerrectores de | |
la modalidad distancia (2017-2018). En todos los casos se recibieron las sugerencias y recomendaciones en | |
términos de forma, redacción y número de descriptores propuestos, las cuales fueron asumidas casi en su
totalidad. | |
7. Validación de métricas de los instrumentos. Habida cuenta del proceso anterior, se definió la versión | |
preliminar de los instrumentos para la Evaluación Docente de Posgrados, la cual se validó por parte de pares | |
académicos y de un experto en psicometría de la Facultad de Psicología. Conforme a la retroalimentación, | |
se realizaron los respectivos ajustes. | |
8. Implementación. La implementación comienza con un pilotaje con algunos grupos de estudiantes de los | |
programas de posgrado de modalidad distancia, que mostraban una continuidad en matrículas, tales como: | |
Maestría en Didáctica, Maestría en Educación, Especialización en Pedagogía para la Educación Superior, | |
Especialización en Patología de la Construcción, Maestría en Gestión de Cuencas Hidrográficas. | |
Además de lo anterior, se realizó una segunda validación del instrumento que se hizo a través de un segundo | |
ejercicio piloto para la parametrización del Aplicativo Institucional con los posgrados de Maestría en Calidad | |
y Gestión Integral, Especialización en Administración y Gerencia de Sistemas de la Calidad, Especialización | |
en Finanzas y Especializaciones en Finanzas y Gerencia Empresarial, todos ellos de modalidad presencial. | |
En este orden de ideas, parte de la innovación se concentró en la actualización y mejoramiento de la | |
herramienta de sistematización de los procesos de evaluación docente de posgrados en la Universidad Santo | |
Tomás, mediante la parametrización del instrumento, el cual consta de dos interfaces: la primera para el | |
administrador del aplicativo institucional a través de un micrositio, y la segunda para el usuario, en este caso | |
los actores involucrados (personal directivo, docentes y estudiantes). | |
Este instrumento genera información confiable y clasificada en 9 tipos de reportes que permiten evidenciar | |
oportunamente los resultados del proceso de Evaluación Docente en los Posgrados; también se puede acceder | |
fácilmente a esta información en línea, por parte de los decanos, decanas, directoras, directores y docentes, lo | |
anterior como suministro para la toma de decisiones en las diferentes instancias institucionales (Dirección | |
de Investigación e Innovación, 2021). | |
Después del desarrollo de la actualización de los instrumentos, la validación y la parametrización en | |
el aplicativo institucional, se continuó con la implementación de esta herramienta únicamente en los | |
posgrados de la sede principal en Bogotá, dadas las diferencias en los sistemas académicos entre los programas | |
presenciales y los programas a distancia, así como en las sedes y seccionales. | |
Así las cosas, el proceso de implementación se ha llevado a cabo desde el primer semestre de 2019 hasta el | |
segundo semestre de 2020. Sin embargo, dados los ajustes realizados a nivel institucional con ocasión de la | |
pandemia mundial por la COVID-19, durante el período 2020-1, en los programas presenciales no se realizó | |
la evaluación docente debido a una decisión de la alta dirección de la Universidad. En virtud de ello, en la | |
Figura 2, se presenta la información del proceso de aplicación en los diferentes períodos hasta el 2020-2: | |
FIGURA 2 | |
Relación de participantes en la evaluación docente, período 2019-1. | |
Fuente: elaboración propia | |
Como se evidencia en la figura, en contraste con lo afirmado en los apartados anteriores, es notable la | |
participación en el primer ejercicio de evaluación docente correspondiente al nuevo instrumento, aplicativo | |
y procedimiento, en todos los casos con cifras superiores al 70 %, cuando anteriormente estas llegaban al 30 | |
%. De esta manera, el grupo poblacional con mayor participación fue el personal directivo del programa con | |
un 96,43 %, seguido por el docente con un 81,27 % y, finalmente, el de estudiantes con un 72,87 %.
Del mismo modo, en la Figura 3 se puede apreciar que en el 2019-2 el porcentaje de estudiantes que | |
participaron en el proceso de evaluación docente fue del 69,47%, lo cual da cuenta de una leve disminución | |
respecto al período anterior, no así en los casos relacionados con el personal directivo del programa y docentes, | |
donde es notoria la disminución en la participación. En el primer caso, tal contracción está en el orden del | |
14 % de diferencia, mientras en el segundo es cercano al 6 %, con respecto al período 2019-1. Lo anterior | |
puede ser evidencia de la importancia de trabajar de forma sistemática en la cultura de la evaluación entre los | |
miembros de la comunidad académica, además, es posible que la mayor participación del profesorado en el | |
primer semestre obedezca a contar con un insumo indispensable para la convocatoria al ascenso en el escalafón
docente que se realiza en el segundo semestre. | |
FIGURA 3. | |
Relación participantes evaluación docente período 2019-2 | |
Fuente: elaboración propia | |
Tal como se observa en la Figura 4, en el 2020-2 del total de estudiantes matriculados, el 68,58 % llevó | |
a cabo el proceso de evaluación docente. Por su parte, la autoevaluación contó con una participación del | |
81,86 % del total de docentes y un 78,13 % de las personas directoras de posgrados. Estos porcentajes | |
evidencian que se mantiene la disminución en la participación del estudiantado, mientras que se presenta la | |
mayor participación del cuerpo docente desde la implementación del nuevo instrumento y procedimiento. | |
Asimismo, se recupera la tasa de participación de las personas directoras de programa. | |
FIGURA 4. | |
Relación de participantes en la evaluación docente, período 2020-2. | |
Fuente: elaboración propia | |
En las Figuras 2, 3 y 4 se observa que, dada la naturaleza voluntaria de la participación por parte del estudiantado, se ha contado con una participación cercana al 70 %, lo que evidencia la responsabilidad de las personas estudiantes en su proceso de aprendizaje y una conciencia respecto a las implicaciones que tiene su voz en el mejoramiento de los programas académicos. Para el cuerpo docente, el porcentaje de participación en su autoevaluación es cercano al 80 %, participación que se considera lejana del ideal del 100 %, dadas las directrices institucionales que incentivan al personal docente a participar en su proceso de calificación. Algo parecido ocurre con las personas directoras de los programas que, aunque participan en su gran mayoría, aún no alcanzan la totalidad en su compromiso de evaluar al estamento docente de los programas que dirigen.
Finalmente, los resultados obtenidos durante estos 3 periodos académicos dan cuenta de la necesidad de | |
seguir cultivando la participación de todos los actores con el fin de llegar al 100 % en la participación de todos | |
los integrantes. | |
En este mismo orden de ideas, y con el fin de complementar los resultados, en la Figura 5 se observa el | |
porcentaje promedio de participantes para el corte de los tres períodos de implementación. | |
FIGURA 5. | |
Porcentaje promedio de participación en la evaluación de docentes de posgrado | |
Fuente: elaboración propia | |
Por otra parte, en la Figura 6 se presenta el promedio obtenido en la evaluación global de docentes | |
de posgrado en la modalidad presencial en los períodos 2019-1 y 2019-2, lo cual da cuenta de una alta | |
ponderación, en consideración de que la máxima escala posible es 5,0. | |
FIGURA 6. | |
Promedio general - Evaluación docente para los periodos 2019-1 y 2019-2 | |
Fuente: elaboración propia | |
Ahora bien, para el caso de los posgrados de la DUAD, se llevó a cabo la implementación de los nuevos | |
instrumentos de evaluación docente y la compilación de los datos a través de formularios inteligentes, en este | |
caso se utilizó la plataforma Google Forms, de la cual se obtuvo la información de la Figura 7: | |
FIGURA 7. | |
Participación en la evaluación docente en los posgrados de la DUAD periodos 2019-1 a 2020-2 | |
Fuente: elaboración propia | |
El estudiantado, como principal fuente de información, realiza de forma anónima el proceso de evaluación docente; por lo anterior, los datos extraídos de los formularios se calculan con base en el total de docentes asignados a espacios académicos en los diferentes periodos, es decir, que la participación presentada en la Figura 7 se determinó así:
Para el 2019-1, del total de 115 docentes, el 57 % fue evaluado mínimo por una persona estudiante, el 23 % llevó a cabo su autoevaluación y el 73 % de las personas directoras de posgrado evaluaron al estamento docente; para el 2019-2, del total de 102 docentes, el 53 % fue evaluado mínimo por una persona estudiante, el 55 % llevó a cabo su autoevaluación y el 64 % de las personas directoras de posgrados la realizaron; para el 2020-1, del total de 96 docentes, el 74 % fue evaluado mínimo por una persona estudiante, el 60 % llevó a cabo su autoevaluación y todas las personas directoras de posgrados evaluaron al equipo docente; para el 2020-2, del total de 96 docentes, el 69 % fue evaluado mínimo por una persona estudiante, el 70 % llevó a cabo su autoevaluación y el 72 % de las personas directoras de posgrados evaluaron al profesorado.
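A small sketch of how the participation rates reported for the DUAD (Figure 7) can be computed from raw counts; the count of 66 teachers below is only an illustrative reconstruction consistent with the 57 % reported for 2019-1, not a figure taken from the institutional records.

# Participation is reported against the total number of teachers assigned in each period.
def participation_rate(teachers_reached, total_teachers):
    """Percentage of teachers covered by a given instrument in one period."""
    return round(100 * teachers_reached / total_teachers, 1)

# Illustrative check against the 2019-1 figures above (115 teachers in total):
# about 66 of them evaluated by at least one student corresponds to roughly 57 %.
print(participation_rate(66, 115))   # 57.4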
TABLA 2. | |
Disminución de matrícula entre los períodos 2019-1 y 2020-2 | |
Fuente: elaboración propia | |
En los datos de la Tabla 2 se aprecia la disminución de la matrícula entre los períodos relacionados, lo cual | |
da cuenta de un fenómeno nacional, marcado por una reducción significativa en la educación superior, la cual | |
aún hoy es objeto de estudio por parte de las IES, personas investigadoras y el mismo Ministerio de Educación | |
Nacional. Además, es evidente que, en período de lanzamiento del nuevo instrumento de evaluación, se | |
presentó la tasa más alta de participación en los tres agentes establecidos. Sin embargo, llama la atención | |
que la participación de estudiantes y directivos ha venido decreciendo, lo cual implica la implementación de | |
nuevas estrategias de motivación y divulgación por parte de la Unidad de Posgrados, para poder retomar los | |
índices iniciales e incluso llegar a mejorarlos. Este fenómeno no ocurre con el cuerpo docente, que muestra
una participación sostenida durante los períodos, lo cual puede ser atribuible al interés que representan estos | |
resultados en aspectos relacionados con el escalafón docente. | |
Conclusiones | |
El objetivo principal de la investigación fue realizar un ejercicio de metaevaluación de la evaluación docente | |
de posgrado de la USTA, de manera que, a partir de allí, se pudiera proponer un nuevo procedimiento y | |
batería de instrumentos de evaluación unificado para todos los programas de posgrado. De allí que en 2019 | |
se implementó el aplicativo institucional de Evaluación Docente de Posgrados en todos los programas de posgrado de la sede principal en Bogotá. Esto permitió contar con información confiable, oportuna y clasificada
para tomar decisiones y formular los distintos planes de acción según las directrices institucionales vigentes. | |
Por los planteamientos realizados a lo largo del texto, se afirma que se cumplió el objetivo propuesto, al | |
tiempo que se espera que la evaluación docente adquiera mayor relevancia en el mejoramiento de las prácticas | |
pedagógicas en relación con su plan de trabajo; mejorar los resultados de aprendizaje en el estudiantado; | |
caracterizar mejor al personal docente para replantear la asignación de funciones en perspectiva de potenciar | |
las capacidades, y establecer planes de formación que permitan incidir directamente en los aspectos a mejorar. | |
Es decir, se espera que, a partir de las buenas prácticas en la implementación de la nueva propuesta de | |
evaluación docente, se impacte directamente en la toma de decisiones, de forma que antes de determinar | |
la salida de docentes de la Universidad, se logre aprovechar realmente los resultados de la evaluación en | |
la adecuada ubicación de ellos en las funciones que realizan mejor; al tiempo que se capaciten en aquellos | |
aspectos para los que su formación previa, su experiencia o simplemente la ausencia de ella, hayan llevado a | |
bajos resultados en los procesos de evaluación. Lo anterior, enmarcado plenamente en la cualificación de la | |
profesión académica al interior de la Universidad Santo Tomás. | |
An important aspect of the process was to show that in multicampus universities, as is the case in this study, although significant efforts are made at the executive management level to establish institution-wide policies and procedures, the inclusion of the three modalities of USTA's educational offering in these matters is at times an element that still requires improvement. This is because USTA currently has more than 34,000 students nationwide, which calls for synergies among campuses, branches, and the DUAD in order to fully carry out the institutional planning that, in this case, responds to the Integral Multicampus Plan, which projects the University through 2027.
The existence of a new procedure and battery of instruments, and their implementation, allowed USTA's graduate programs to have information and to access new scenarios of participation for decision-making in the different institutional bodies. Likewise, following up on the periodic application of these instruments helped to narrow the complexity gap in the university's evaluation culture, which had struggled to bring the actors and the evaluation process into tune.
Finally, it is important to recommend that university leadership support this type of research, which involves meta-evaluation of the practices, processes, and procedures of university life, since such actions require goodwill, resources, and decision-making regarding the conclusions or innovations that derive from them.
Lastly, anyone approaching an exercise similar to this one is advised to foster an institutional culture that recognizes the importance of evaluation at all levels and for all participants, administrative as well as academic, since this process permeates every area and makes it possible to reach levels of quality that strengthen institutions over time; evaluation is not reduced to a one-off exercise, but encompasses a process of constant change that entails a periodic review of its instruments and procedures.
Notes
[1] This article is a result of the meta-evaluation process carried out on the Teaching Evaluation (Procedures and Instruments) by the Graduate Unit and the Curriculum Team of the Faculty of Sciences and Technologies, Universidad Santo Tomás, Bogotá, 2017-2020.
[2] The first private university in the country to obtain High Quality Institutional Accreditation in the Multicampus modality (Resolution No. 01456 of January 29, 2016, MEN).
[3] DUAD: División de Educación Abierta y a Distancia (Division of Open and Distance Education).
[4] VUAD: Vicerrectoría de Educación Abierta y a Distancia (Vice-Rectory of Open and Distance Education).
Additional information
How to cite: Patiño-Montero, F., Godoy-Acosta, D. C. y Arias-Meza, D. C. (2022). Actualización de la evaluación docente de posgrados en una universidad multicampus. Experiencia desde la Universidad Santo Tomás (Colombia). Revista Educación, 46(2). http://doi.org/10.15517/revedu.v46i2.47955
Çukurova Üniversitesi Eğitim Fakültesi Dergisi
Vol. 48, No. 2, pp. 1299-1339
https://dergipark.org.tr/tr/pub/cuefd
Analyzing Academic Members’ Expectations from a Performance | |
Evaluation System and Their Perceptions of Obstacles to Such an | |
Evaluation System: Education Faculties Sample | |
Gürol YOKUŞ a*, Tuğba YANPAR YELKEN b
a Sinop Üniversitesi, Eğitim Fakültesi, Sinop/Türkiye
b Mersin Üniversitesi, Eğitim Fakültesi, Mersin/Türkiye
Article Info
DOI: 10.14812/cufej.467359
Article history: Received 04.10.2018; Revised 25.03.2019; Accepted 18.10.2019
Keywords: Performance evaluation, Quality in higher education, Accountability.
Abstract
The systematic assessment and evaluation of academic members in faculties is a crucial issue, because higher education institutions place a large emphasis on transparent, efficient and successful management. This study aims to conduct mixed (quantitative and qualitative) research on the expectations of Education Faculties' academic members regarding a performance evaluation approach and on the obstacles to such an evaluation system. A convergent parallel mixed-methods design was preferred as the research model. The "Expectations from performance assessment" subscale and the "Barriers to performance assessment" subscale, both developed by Tonbul (2008), were used as data collection tools. The independent samples t-test and ANOVA were used for the analysis of the quantitative data, and content analysis was used for the analysis of the qualitative data. As a result of this study, it was found that academic members have a moderate level of expectations from a performance evaluation approach. The highest expectations belong to assistant professors, while the lowest belong to professors. The most widely agreed expectations of academic members from a performance evaluation approach were found to be "developing a consensus about the criteria of an effective academician, affecting the professional development of academic members positively and increasing the workload of academic members". The most frequent obstacles to a performance evaluation approach emerged as the "current organizational mechanism of higher education institutions" and the "workload of faculty academic members". The scores for both expectations and obstacles differ significantly depending on "receiving the academic incentive, work experience in higher education, academic title and academicians' level of satisfaction with their institutions". As a result of the qualitative analysis, many themes and codes related to a performance evaluation system emerged. In the "Attitude Towards the Performance Approach" theme, the most frequent codes were "adopters, doubters". In the "Academicians' Priorities" theme, the codes emerged as "research and publications, evaluation of the quality of instruction, advising for undergraduates and postgraduates"; in the "Positive Effects" theme, as "motivation, financial support, search for quality"; in the "Negative Effects" theme, as "intra-institutional rivalry, academic dishonesty"; in the "Obstacles" theme, as "intense workload, lack of intrinsic motivation"; and finally, in the "Suggestions" theme, as "employment of more administrative staff, institutional support for academic efforts and research publications".
An Examination of Education Faculty Academic Members' Expectations from the Performance Evaluation Approach and Their Views on the Obstacles to Performance: A Mixed Methods Study
* Author: [email protected]
Öz (Abstract)
The systematic measurement and evaluation of academic members' performance is important for the quality of higher education institutions. The aim of this study is to examine, quantitatively and qualitatively, the expectations of academic members working in the Education Faculties of several state universities regarding the performance evaluation approach and their views on the obstacles to performance evaluation. Within the scope of this research, the convergent parallel mixed design, one of the mixed research methods, was preferred. The "Expectations from the Performance Evaluation Approach" subscale and the "Obstacles to the Performance Evaluation System" subscale developed by Tonbul (2008) were used as data collection tools. The independent samples t-test and ANOVA were used for the quantitative data, and content analysis for the qualitative data. As a result of the study, it emerged that academic members' expectations from the performance evaluation approach are at a moderate level, that assistant professors have the highest expectations and professors the lowest. The two most important obstacles to performance evaluation were found to be the current organizational functioning of higher education institutions and the workload of faculty members. The scores for Expectations from Performance Evaluation and for Obstacles differ significantly according to "receiving the academic incentive, work experience, academic title and level of satisfaction with the institution". As a result of the qualitative analysis, the most frequently recurring codes were "adopters, doubters" in the Attitude Towards Evaluation theme; "academic publications", "evaluation of the quality of instruction", "undergraduate and graduate advising" in the Academicians' Priorities theme; "motivation", "financial support", "search for quality" in the Positive Effects theme; "intra-institutional rivalry, academic dishonesty" in the Negative Effects theme; "intense workload, lack of intrinsic motivation" in the Obstacles theme; and "employment of support staff, institutional support for publications and studies" in the Suggestions theme.
Keywords: Performance evaluation, Quality in higher education, Accountability.
Introduction | |
Nowadays, many organizations focus on making a systematic performance evaluation of their members in order to achieve transparent, efficient and successful management. In higher education, public and private universities strive to produce a reliable evaluation system. Higher education institutions feel the need to identify performance indicators and to announce how well they achieve their mission and strategies, for reasons such as global competitiveness and societal pressure for transparency (Hamid, Leen, Pei & Ijab, 2008). Especially in the competitive environment of the 21st century, a better performance evaluation system creates advantages for universities and offers them opportunities to evaluate their own processes and members more effectively.
When the literature is reviewed, it is noticed that there are discussions about the accountability of higher education institutions. These discussions focus on evaluating institutions' performance and publicly announcing the results, including stakeholders' views. Universities are also criticized because their academic members behave like a closed society in an ivory tower (Glaser, Halliday, & Eliot, 2003). The criticisms are summarized by Esen and Esen (2015):
• The research conducted by academic members does not focus on societal problems.
• Their studies are too theoretical.
• Societal resources are wasted in vain (Etzkowitz, Webster, Gebhardt, & Terra, 2000).
• Research is not turned to communal benefit and is conducted esoterically.
• Academicians' identities shrink into individuals with constricted autonomy who worry about disturbing the university or its administrative structure (Elton, 1999).
Although they function autonomously, higher education institutions should not be viewed as organizations that answer to no one. They have the power to influence the society, the economic structure and the social life to which they belong. Therefore, instead of being ivory towers, universities should consider science, society and nation together, perform at international quality standards, and feel a conscientious responsibility to prioritize social benefit over career development. Vidovich and Slee (2001) claim that performance evaluations are necessary in universities for the following reasons:
• accountability to customers (continuous improvement activities for scientific research),
• accountability to government (efficient and productive use of resources),
• accountability to students and society (providing comprehensive educational experiences, providing vocational training to improve the quality of life, meeting the labor force needs of the society).
Since the beginning of the 21st century, higher education has gone through significant changes. UNESCO (2004) lists the global developments that carry new implications for higher education institutions: i) the emergence of new education providers such as multi-national companies, corporate universities, and media companies; ii) new forms of delivering education, including distance, virtual and new face-to-face modes, such as private companies; iii) greater diversification of qualifications and certificates; iv) increasing mobility of students, programmes, providers and projects across national borders; v) more emphasis on lifelong learning, which in turn increases the demand for postsecondary education; and vi) the increasing amount of private investment in the provision of higher education. Considering all these developments, higher education institutions have the capacity to affect the society, the economic structure and social life. Therefore, they are expected to perform at international quality standards, considering science, community and nation together instead of being ivory towers, and to prioritize societal benefit as well as career development. Vidovich and Slee (2001) emphasize that performance evaluation in universities is necessary in terms of accountability to members (sustainable enhancement efforts for scientific research), accountability to government (efficient and creative use of resources) and accountability to students and society (providing extensive educational experience, providing professional education for increasing the quality of life, meeting the needs of society's workforce).
Performance evaluation in higher education involves a variety of products and processes. In essence, performance evaluation indicates the minimum acceptable level in terms of quality and provides an opportunity to identify the strengths and weaknesses of individuals and institutions. In this way, individuals and institutions not only become aware of their weaknesses but also recognize what they are good at. Batool, Qureshi and Raouf (2010) state that performance evaluation might not include all dimensions of this concept and that the performance evaluation of an institution does not mean the same thing as assessing academic programs, courses or the quality of graduates. They point out that the performance evaluation of an institution means assessing the current situation in terms of the quality and effectiveness of the institution.
Within the context of this study, performance evaluation in higher education is defined as "assessing the professional qualifications of academic members related to their instructional roles and their level of contribution to accomplishing institutional goals". Therefore, a performance evaluation system is necessary for three purposes: assessing the variety of academic members' work such as research, academic service, instruction and publications; offering them comprehensive feedback that supports their self-development; and valuing their current performance. Vincent (2010) points out the advantages of a performance evaluation approach in higher education:
• Development and progression of individuals stand on realistic goals. | |
• It creates conformity between individuals’ goals and institution’s goals. | |
• It helps to identify the strengths and weakness of individuals within an organization. | |
• It works as a feedback mechanism for purpose of enhancement. | |
• It helps to identify which courses and instruction are needed. | |
• It helps the institution to take a major role and responsibility in terms of education, society, | |
economics and politics. | |
Tonbul (2008) claims that performance evaluation practices increase the level of accomplishment of institutional goals, help to identify failing issues in organizational processes and provide specific data about the effect of organizational climate and culture on members, which in turn leads to an increase in institutional performance. Organizations that make effective and functional use of feedback mechanisms in workflow and organizational processes are seen to become more successful and more lasting (Latham & Pinder, 2005). Kalaycı (2009) draws attention to the fact that it is very unlikely to predict success or failure in higher education without a proper evaluation; however, once the educational performance of academic members is evaluated, it becomes open to criticism by other stakeholders, and this situation is challenging. This issue might result in negative circumstances. For instance, Kim et al. (2016) claim that a large number of professors put a low emphasis on their role as educators while putting a greater emphasis on their researcher identity, because faculty evaluation systems are mainly based on research. In order not to cause negative consequences, performance evaluations should not be done merely to fulfil a formality or an obligation. This threat is especially valid for public universities funded by the government. Kalaycı and Çimen (2012) note that public universities now need quality studies and that it is a necessity for them to carry out institutional quality processes not just as a formality but in order to increase quality and stand out in this competitive environment.
The major reasons that encourage universities to make performance evaluations in the 21st century are institutional image and reputation, internationalization and global university rankings. There are many factors affecting institutional reputation and image. In a report published by the Higher Education Authority (2013), it appears that academic members are closely interested in their field of expertise, which indicates that they continually follow the studies conducted in the literature. When it comes to internationalization, an institution's including both national and international students and academic members indicates that it has a global identity and is ready for competitiveness in the global market (O'Connor et al., 2013). However, the number of students and academic members is not a sufficient indicator of quality. The quality of academic members and the quality of their teaching performance should also be assessed, because they affect the quality of education and are regarded as an assurance of quality control (Açan and Saydan, 2009).
When the literature is reviewed, it is noticed that the most frequently used performance assessment and evaluation techniques in higher education are Self-Assessment, Key Performance Indicators (KPI), Relative Evaluation, Appraisal, Six Sigma and Total Quality Management (Çalışkan, 2006; Kalaycı, 2009; Paige, 2005). Not all of these techniques may be appropriate for assessing the individual performance of academic members. For instance, the performance comparison technique involves evaluating the current performance of an individual against that of another who is accepted as a leader within the same context. This might not be appropriate for evaluating academic members' performance, because it is strictly dependent upon excellence of quality, whereas individuals differ from each other in terms of working style and self-development. Among these techniques, Key Performance Indicators stand out as a convenient evaluation method in higher education: in KPIs, performance indicators are operationally defined and it is specified which operations constitute a concept, as the sketch below illustrates.
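To make the idea of operational definition concrete, here is a minimal Python sketch of how a KPI-style score could be computed; the indicator names, weights and normalization are purely hypothetical and are not taken from the studies cited above.

# A hypothetical weighting of operationally defined indicators; names and
# weights are illustrative, not drawn from the article or the KPI literature.
kpi_weights = {
    "publications": 0.4,          # peer-reviewed outputs in the period, rescaled to 0-1
    "teaching_evaluation": 0.3,   # mean student rating, rescaled to 0-1
    "advising": 0.2,              # supervised theses, rescaled to 0-1
    "service": 0.1,               # committee and jury duties, rescaled to 0-1
}

def kpi_score(indicators, weights):
    """Weighted sum of normalized indicator values (each expected to lie in 0-1)."""
    return sum(weights[name] * indicators.get(name, 0.0) for name in weights)

example = {"publications": 0.6, "teaching_evaluation": 0.8, "advising": 0.5, "service": 1.0}
print(kpi_score(example, kpi_weights))  # 0.68

The point of the sketch is only that each indicator must be defined as a measurable operation before any weighting or comparison is meaningful.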
When current practices related to performance evaluation in Turkish higher education are reviewed, it is criticized that only a quantitative assessment of academic members' research and publications is made and that the evaluation is based on subjective judgements (Esen and Esen, 2015). In this regard, the Council of Higher Education launched the academic incentive system in 2015 to increase academic members' motivation in Turkey and to support their academic activities financially (Academic Incentive Grant
Regulation, 2015). Within this academic incentive regulation, the performance of the academic staff is | |
evaluated by the Council of Higher Education based on their national and international projects, research, | |
publications, exhibitions, patents received, references to their studies, and academic awards received. As | |
a result, faculty members who perform sufficient work are financially supported. Apart from academic | |
incentive, there is a variety of performance evaluations of academic members in Turkish higher education | |
system such as: | |
a) Registry system | |
b) Academic promotion and appointment criteria | |
c) Questionnaires of Academic Member Evaluation | |
d) Annual reports | |
e) Surveys of Student Views | |
(Esen & Esen, 2015) | |
Performance evaluation in higher education is very important for increasing the effectiveness of the services provided; however, the criteria and the reliability of this process are just as important. In this regard, Çakıroğlu, Aydın and Uzuntiryaki (2009) state that there are very promising studies indicating the reliability of evaluations made by experienced faculty members, and they emphasize that the following criteria should be taken into consideration during evaluation:
• collecting data from various sources related to teaching performance (such as colleagues, students, advisors, master students, graduates) and in different formats (student assessment surveys, student interviews, observation results, course materials, student products, etc.),
• clearly identifying evaluation criteria,
• informing about the evaluation process,
• informing the assessors on how to make an assessment,
• the candidates not playing an evaluative role,
• random selection of the assessors among those who meet the criteria,
• a minimum of 3 and a maximum of 5 members taking part in the jury.
The underlying aim of evaluating faculty members' performance is to increase the effectiveness of universities. There is increasing pressure on universities, nationally and globally, to carry out performance evaluations systematically, driven by concepts such as quality, efficiency, effectiveness and accountability. The reason why education faculties are preferred in this study is that the Higher Education Council of Turkey emphasizes accreditation studies especially in education faculties within the scope of the "Bologna Process". Higher education institutions in Turkey aim to increase their accountability as a quality indicator and to inform internal and external stakeholders about the current situation. In order to prove that they have accomplished their mission and vision within this scope, universities carry out performance evaluation studies of their instructors and present the results to the public, students, families, government and the private sector. In the accreditation process carried out in education faculties, it is important to identify academic staff's expectations of performance assessment and the barriers to it. Therefore, while performance evaluation is so important for higher education institutions, research is needed to determine the expectations of the instructors whose performance is evaluated.
Within the context of this study, a quantitative and qualitative analysis is made of Education Faculty academic members' expectations from a performance evaluation system and of the obstacles to such an evaluation system. The following research questions are addressed:
1. What are the expectations of academic members in Education Faculties from a performance | |
evaluation system? | |
1.1. Do the expectations of academic members from a performance evaluation system differ | |
depending on following as variables: academic title, academic experience, academic incentive status and | |
satisfaction from institutions? | |
2. What are the perceptions of academic members in Education Faculties related to obstacles to a | |
performance evaluation system? | |
2.1. Do the perceptions of academic members related to obstacles to a performance evaluation | |
system differ depending on following variables: academic title, academic experience, academic incentive | |
status and satisfaction from institutions? | |
3. What are general views of academic members in Education Faculties related to performance | |
evaluation system? | |
Method | |
A convergent parallel mixed design was preferred as the research model in this study. Quantitative and qualitative data were collected simultaneously, analyzed independently and then converged in the discussion. In a convergent mixed design, equal emphasis is placed on the quantitative and qualitative parts, the analyses are conducted independently, and interpretations are eventually made using both sets of data (Creswell and Plano Clark, 2014). Figure 1 shows the mixed design used in this research:
Figure 1. A model for a convergent parallel design in mixed research studies: quantitative data collection and analysis (descriptive statistics, t-test and ANOVA) and qualitative data collection and analysis (content analysis) are carried out in parallel, followed by interpretation of both sets of results.
Participants | |
The data of this study were collected in 2018 from academic members in Education Faculties in Turkey, including research assistants with a doctorate, assistant professors, associate professors and professors. Participants are from different regions of Turkey, including Marmara, the Black Sea, the Aegean, the Mediterranean and Eastern Anatolia. Instructors with too heavy a course load were not included in the study group, and data were collected only from faculty members who had completed their doctoral education. Within the context of this study, the convenience sampling technique was used to select the quantitative sample, and data were obtained from 104 academic members in six universities who agreed to participate in this research. For the qualitative data, participants were selected with the maximum diversity sampling technique, one of the purposeful sampling techniques, in order to collect all kinds of different views about the current situation. Qualitative data were obtained from 50 academic members in Education Faculties. The quantitative phase includes 25 research assistants with a doctorate, 35 assistant professors, 31 associate professors and 13 professors. Since convenience sampling was used, sampling was not made according to the department criterion; ultimately, 22 percent of participants teach in the Science Education Department, 11 percent in the Pre-School Education Department, 28 percent in the Educational Sciences Department and 31 percent in the Primary School Teaching Department. In the qualitative phase, the sample includes 13 research assistants, 17 assistant professors, 15 associate professors and 5 professors. Maximum diversity was achieved according to the academic title and department variables. 20 percent of participants teach in the Science Education Department, 10 percent in the Pre-School Education Department, 40 percent in the Educational Sciences Department and 30 percent in the Primary School Teaching Department.
Data Collection Tool | |
In this study, a personal information form, the "Expectations from the Performance Evaluation Approach" subscale (16 items on a 4-point Likert scale) and the "Obstacles to the Performance Evaluation Approach" subscale (10 items), both developed by Tonbul (2008), were used for data collection. Exploratory factor analysis with varimax rotation was applied during scale development. The internal consistency reliability of the "Expectations from the Performance Evaluation Approach" subscale was originally reported as .92, and that of the "Obstacles to the Performance Evaluation Approach" subscale as .87. The internal consistency of these subscales was recalculated in this study; the reliability of the first subscale appeared as .84 and of the second subscale as .78. If the Cronbach alpha coefficient, an indicator of homogeneity between scale items, is between .60 and .80, it is evidence of high reliability (Tonbul, 2008). The items in these subscales load on a single factor, and this one factor explains fifty-six percent of the total variance.
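As an illustration of how such an internal consistency coefficient is obtained, the following minimal sketch computes Cronbach's alpha from a raw respondent-by-item score matrix; the response values are made up and are not the study's data.

import numpy as np

def cronbach_alpha(item_scores):
    """Cronbach's alpha for an (n_respondents x n_items) matrix of item scores."""
    x = np.asarray(item_scores, dtype=float)
    k = x.shape[1]                                # number of items
    item_variances = x.var(axis=0, ddof=1)        # variance of each item
    total_variance = x.sum(axis=1).var(ddof=1)    # variance of the summed scale scores
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

# Made-up responses: 5 respondents answering 4 Likert-type items (1-4)
responses = [[3, 4, 3, 4],
             [2, 2, 3, 2],
             [4, 4, 4, 3],
             [1, 2, 2, 2],
             [3, 3, 4, 4]]
print(round(cronbach_alpha(responses), 2))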
Also, a questionnaire with open-ended questions was developed for the purpose of supporting the quantitative data and enabling a deeper analysis. A professor from the Educational Sciences Department, an associate professor from the Assessment and Evaluation Department and a professor who works as an expert in higher education quality studies analyzed the questions and made some suggestions. The questions were revised in light of these suggestions. The final form of the questions includes:
2.1. What do you think about making a periodic and data-based assessment of academic members? | |
2.2. What criteria should be assessed within performance evaluation? Could you order these criteria | |
according to significance level for you? | |
2.3. What are the positive and negative consequences of making a performance evaluation of | |
academic members? | |
2.4. What are the obstacles to performance of academic members in higher education and what do | |
you suggest for overcoming these obstacles? | |
Data Analysis | |
The equality of variances and the normality of the data were checked in order to identify the analysis method for the quantitative data. The skewness and kurtosis values ranged from -1 to +1, which indicated that the data were normally distributed. Also, since the sample size was larger than 50 (N=104), the Kolmogorov-Smirnov test was used to check normality and it was found not to be significant (p>.05), which was an indicator of normality. As a result, parametric tests were used in the study. An independent samples t-test was conducted to check whether there was a significant difference between participants in terms of the academic incentive variable. One-way analysis of variance (ANOVA) was conducted to check whether there was a significant difference between participants in terms of the work experience, academic title and satisfaction-with-institution variables.
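A minimal sketch of this kind of analysis pipeline (normality check, independent samples t-test and one-way ANOVA) is shown below using SciPy on simulated scores; the group sizes and means loosely echo figures reported later, but the data themselves are synthetic and purely illustrative.

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Simulated expectation scores for the two academic-incentive groups (synthetic data)
incentive_yes = rng.normal(loc=2.43, scale=0.38, size=52)
incentive_no = rng.normal(loc=2.16, scale=0.45, size=52)

# Normality check: Kolmogorov-Smirnov test against a normal with the sample's mean and SD
scores = np.concatenate([incentive_yes, incentive_no])
ks_stat, ks_p = stats.kstest(scores, "norm", args=(scores.mean(), scores.std(ddof=1)))

# Independent samples t-test for the academic incentive variable
t_stat, t_p = stats.ttest_ind(incentive_yes, incentive_no)

# One-way ANOVA for a four-level variable such as work experience
g1 = rng.normal(2.43, 0.51, 17)
g2 = rng.normal(2.43, 0.28, 38)
g3 = rng.normal(2.51, 0.44, 14)
g4 = rng.normal(2.00, 0.39, 35)
f_stat, anova_p = stats.f_oneway(g1, g2, g3, g4)

print(ks_p, t_p, anova_p)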
Inductive content analysis was used to analyze the qualitative data. Rater agreement percentages were identified by examining the academic members' views collected through the open-ended questions. The views collected by the questionnaire were coded by the researcher and one independent expert. Miles and Huberman's (1994) reliability formula was used to calculate agreement percentages:
Reliability = Agreement / (Agreement + Disagreement)
The interrater reliability across all codes identified by the two raters was found to be 0.89. It is possible to assert that reliability is met for the data analysis, because an agreement percentage of 80% and above is accepted as sufficient (Mokkink et al., 2010). In this study, a variety of validity strategies listed by Creswell (2003) and frequently used in qualitative research methods were employed, such as "Members' Check", "External Audits", "Rich, Thick Description" and "Chain of Evidence". The participants were asked whether the findings of the study reflected their own ideas correctly, an independent expert who had little contact with the study participants and who knew the method of the study was consulted, and the study remained as loyal to the nature of the data as possible by using direct quotations.
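The Miles and Huberman agreement formula is straightforward to compute; the sketch below applies it to a pair of hypothetical coding lists (the code labels are invented for illustration and are not the study's codings).

def miles_huberman_reliability(codes_rater_a, codes_rater_b):
    """Agreement / (Agreement + Disagreement) over paired coding decisions."""
    agreements = sum(1 for a, b in zip(codes_rater_a, codes_rater_b) if a == b)
    disagreements = len(codes_rater_a) - agreements
    return agreements / (agreements + disagreements)

# Invented coding decisions by the researcher and the independent expert
researcher = ["adopter", "doubter", "adopter", "resistant", "adopter"]
expert = ["adopter", "doubter", "adopter", "adopter", "adopter"]
print(miles_huberman_reliability(researcher, expert))  # 0.8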
Findings | |
3.1 Findings Related to Expectations From a Performance Evaluation System | |
The first research question of this study “What are the perceptions of academic members related to | |
their expectations from a performance evaluation system?” was attempted to be answered. Table 1 | |
presents the general score mean of participants and Table 2 presents the score means depending on | |
academic titles. | |
Table 1.
The General Score Mean Related to Academic Members' Expectations from a Performance Evaluation System

Expectation Subscale: N = 104; Minimum = 1,50; Maximum = 3,31; Mean = 2,3023; Standard Deviation = ,43859
In Table 1, when the score means of academic members are reviewed, it is seen that their expectations from a performance evaluation system are not at a high level (x̄ = 2,30) but at a moderate level (which means "partially agree"). Table 2 presents the ANOVA test results, which indicate whether academic members' expectations differ significantly depending on academic title:
Table 2.
The ANOVA Results Related to Whether Expectations from a Performance Evaluation System Differ Depending on Academic Title

Group                 N     Mean     Standard Deviation
Research Assistant    25    2,4525   ,50688
Assistant Professor   35    2,4875   ,25174
Associate Professor   31    2,1754   ,44177
Professor             13    1,8173   ,16230
Total                 104   2,3023   ,43859

ANOVA: Between Groups: Sum of Squares = 5,321; df = 3; Mean Square = 1,774; F = 12,24; p = ,000
       Within Groups: Sum of Squares = 14,492; df = 100; Mean Square = ,145
Source of difference: Research Assistant > Associate Prof.; Assistant Prof. > Associate Prof.; Associate Prof. > Prof.
When the arithmetic means and standard deviations according to academic title are analyzed, it is observed that assistant professors have the highest and professors the lowest expectations from a performance evaluation system. As a significant difference appears between groups in Table 2, post hoc tests were used to identify between which groups the significant difference lies. As the variances were found not to be equal with Levene's F test, the Games-Howell method, which works well with unequal groups, was preferred. As a result of the analysis, it was found that research assistants and assistant professors have a higher level of expectations than associate professors and professors.
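A sketch of this Levene check followed by Games-Howell pairwise comparisons is given below; it assumes the third-party pingouin package is available and uses a small, fabricated long-format dataset, so the group labels and scores are illustrative only.

import pandas as pd
from scipy import stats
import pingouin as pg  # third-party package, assumed to be installed

# Fabricated long-format data: one expectation score per academic member
df = pd.DataFrame({
    "title": ["Research Assistant"] * 5 + ["Assistant Prof."] * 5
             + ["Associate Prof."] * 5 + ["Professor"] * 5,
    "score": [2.5, 2.4, 2.6, 2.3, 2.7,
              2.5, 2.4, 2.5, 2.6, 2.4,
              2.2, 2.1, 2.3, 2.0, 2.2,
              1.8, 1.9, 1.7, 1.8, 1.9],
})

# Levene's test for equality of variances across the four title groups
groups = [g["score"].to_numpy() for _, g in df.groupby("title")]
print(stats.levene(*groups))

# Games-Howell pairwise comparisons, which do not assume equal variances
print(pg.pairwise_gameshowell(data=df, dv="score", between="title"))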
When the subscale is analyzed item by item, it appears that the highest expectations from a performance evaluation system are:
• It creates a consensus on the criteria of being an effective academic member (x̄ = 3,42)
• It positively affects academic members' professional development (x̄ = 3,27)
• It increases the workload of academic members (x̄ = 2,40)
• It causes tension within the institution (x̄ = 2,39)
Academic members' lowest expectations from a performance evaluation system appear as:
• It increases academic members' motivation (x̄ = 1,90)
• It contributes to the development of a qualified institutional culture (values, attitude towards work, understanding of responsibility, relationships etc.) (x̄ = 1,76)
• It helps academic members to get better prepared for their courses (x̄ = 1,70)
Table 3 presents the analysis results related to whether academic members' expectations from a performance evaluation system differ depending on academic incentive status.
Table 3.
T-test Results Related to Whether Expectations from a Performance Evaluation System Differ Depending on Academic Incentive Status

Academic Incentive   N    Mean   Standard Deviation
Yes, I take it       52   2,43   ,38
No, I don't          52   2,16   ,45
t = 3,22; p = ,002
First, the equality of variances was checked with Levene's test and the significance value in the appropriate t row was used. As a result of the analysis, expectations from a performance evaluation system differ significantly depending on academic incentive status [t(102) = 3,22, p < .05]. Academic members who receive the academic incentive (a financial aid given to academic members who produce a certain number of research studies and projects) have a significantly higher level of expectations than those who do not. Table 4 presents the ANOVA results related to whether academic members' expectations from a performance evaluation system differ depending on work experience.
Table 4.
The ANOVA Results Related to Whether Expectations from a Performance Evaluation System Differ Depending on Work Experience

Work experience      N     Mean   Standard Deviation
0-5 years            17    2,43   ,51
6-10 years           38    2,43   ,28
11-15 years          14    2,51   ,44
More than 15 years   35    2,00   ,39
Total                104   2,30   ,43

ANOVA: Between Groups: Sum of Squares = 4,67; df = 3; Mean Square = 1,55; F = 10,28; p = ,000
       Within Groups: Sum of Squares = 15,1; df = 100
Source of difference: 0-5 years > more than 15 years; 6-10 years > more than 15 years; 11-15 years > more than 15 years
When Table 4 is reviewed, it is clearly seen that the lowest expectation scores belong to academic members who have more than 15 years of working experience. The mean scores of the other three groups are significantly higher than the mean score of this group, but there is no significant difference among the mean scores of these three groups. Table 5 presents the ANOVA results related to whether academic members' expectations from a performance evaluation system differ depending on their level of satisfaction with their institutions.
Table 5.
The ANOVA Results Related to Whether Expectations from a Performance Evaluation System Differ Depending on Satisfaction Level with the Institution

Satisfaction level   N     Mean   Standard Deviation
Low                  10    2,70   ,31
Moderate             35    2,39   ,32
High                 42    2,00   ,47
Very high            17    1,80   ,11
Total                104   2,30   ,43859

ANOVA: Between Groups: Sum of Squares = 5,97; df = 3; Mean Square = 1,991; F = 14,383; p = ,000
       Within Groups: Sum of Squares = 13,08; df = 100; Mean Square = ,138
Source of variation: Low, Moderate > High, Very High
In Table 5, a significant difference is observed between the mean scores of the groups (p<.05); therefore, the Games-Howell post hoc test was conducted, due to the inequality of variances, to identify the source of variation. As a result of the post hoc test, it is observed that academic members who have a low or moderate level of satisfaction with their institutions have a significantly higher level of expectations from a performance evaluation system than those who have a high or very high level of satisfaction with their institution.
3.2 The Obstacles to a Performance Evaluation System | |
The second research question of this study, "What are the perceptions of academic members related to the obstacles to a performance evaluation system?", is addressed here. Table 6 presents the mean and standard deviation of the academic members' scores.
Table 6.
The General Mean Score of Academic Members Related to the Obstacles to a Performance Evaluation System

Obstacles Subscale: N = 104; Minimum = 2,20; Maximum = 3,80; Mean = 3,02; Standard Deviation = ,57517
When Table 6 is reviewed, it is seen that the mean score of academic members is high (x̄ = 3,02), which means that academic members agree with the items in this subscale as obstacles to a performance evaluation system. When the subscale is analyzed item by item, the most frequently agreed obstacles are:
• Higher education institutions' current organizational structure (hierarchical organization, distribution of authority and responsibilities, autonomy limits of units) (x̄ = 3,80)
• Academic members' workload (x̄ = 3,68)
Academic members least agree on the following obstacle to a performance evaluation system: "cultural structure (ignoring the problems, personal conflicts, extreme tolerance, discomfort with criticism, lack of confidence, lack of competitive understanding at European standards)" (x̄ = 1,91)
Table 7 presents the analysis results related to whether academic members' perceptions of obstacles to a performance evaluation system differ depending on academic incentive status.
Table 7.
The T-Test Results Related to Whether Academic Members' Perceptions of Obstacles to a Performance Evaluation System Differ Depending on Academic Incentive Status

Academic Incentive   N    Mean   Standard Deviation
Yes, I take it       52   2,14   ,54
No, I don't          52   2,74   ,51
t = 5,77; p = ,000
When Table 7 is reviewed, it is seen that academic members' perceptions of obstacles differ significantly depending on academic incentive status [t(102) = 5,77, p < .05]. Academic members who receive the academic incentive have significantly lower perceptions of obstacles to a performance evaluation system. Table 8 presents the ANOVA results related to whether academic members' perceptions of obstacles to a performance evaluation system differ depending on academic title.
Table 8.
The ANOVA Results Related to Whether Academic Members' Perceptions of Obstacles to a Performance Evaluation System Differ Depending on Academic Title

Group                 N     Mean   Standard Deviation
Research Assistant    25    2,98   ,30181
Assistant Professor   35    3,42   ,36202
Associate Professor   31    3,38   ,63314
Professor             13    2,96   ,83254
Total                 104   3,02   ,61101

ANOVA: Between Groups: Sum of Squares = 11,089; df = 3; Mean Square = 3,696; F = 13,508; p = ,000
       Within Groups: Sum of Squares = 27,365; df = 100; Mean Square = ,274
Source of variation: Assistant Prof. > Research Assistant, Professor; Associate Prof. > Research Assistant, Professor
When Table 8 is reviewed, it is seen that there is a statistically significant difference between groups (p<.05); therefore, the Games-Howell post hoc test (used in cases of unequal variances) was applied to identify the source of variation. As a result of the analysis, it is observed that the highest obstacle scores belong to assistant professors and associate professors, and the lowest scores belong to research assistants and professors. Table 9 presents the ANOVA results related to whether academic members' perceptions of obstacles to a performance evaluation system differ depending on working experience.
Table 9.
The ANOVA Results Related to Whether Perceptions of Obstacles to a Performance Evaluation System Differ Depending on Working Experience

Working experience   N     Mean   Standard Deviation
0-5 years            17    2,72   ,51
6-10 years           38    3,26   ,28
11-15 years          14    3,78   ,44
More than 15 years   35    2,88   ,39
Total                104   3,02   ,54

ANOVA: Between Groups: Sum of Squares = 21,938; df = 3; F = 44,27; p = ,000
       Within Groups: Sum of Squares = 16,51; df = 100
Source of variation: 11-15 years, 6-10 years > 0-5 years, more than 15 years; 11-15 years > 6-10 years
When Table 9 is reviewed, it is seen that there is a statistically significant difference; therefore, the Games-Howell post hoc test was conducted. As a result of the test, it appears that academic members with 0-5 years of working experience have the lowest scores on the obstacles subscale, and those with 11-15 years of working experience have the highest scores. Academic members who have worked more than 10 and fewer than 15 years think that almost all items in the subscale really pose an obstacle to a performance evaluation.
Table 10 presents the ANOVA results related to whether academic members’ perceptions of obstacles | |
to a performance evaluation system differ depending on satisfaction level from institution. | |
Table 10.
The ANOVA Results Related to Whether Academic Members' Perceptions of Obstacles to a Performance Evaluation System Differ Depending on Satisfaction Level with the Institution

Satisfaction level   N     Mean   Standard Deviation
Low                  10    3,36   ,31
Moderate             35    3,58   ,32
High                 42    2,62   ,47
Very high            17    2,58   ,11
Total                104   3,02   ,43859

ANOVA: Between Groups: Sum of Squares = 5,97; df = 3; Mean Square = 1,991; F = 14,38; p = ,00
       Within Groups: Sum of Squares = 13,08; df = 100; Mean Square = ,138
Source of variation: Low, Moderate > High, Very high
When Table 10 is reviewed, it is seen that there is a statistically significant difference; therefore | |
Games-Howell post hoc test is done. As a result of test, it appears that academic members who have high | |
and very high level of satisfaction from their institutions have significantly lower scores of obstacles to a | |
performance evaluation system. | |
3.3 Qualitative Analysis of Academic Members' General Views Related to the Performance Evaluation System
Within the context of this study, qualitative data were obtained from academic members about their views on a performance evaluation system. The collected data were analyzed with content analysis. As a result of the content analysis, the following six themes emerged: "attitude towards performance evaluation, priorities of academic members, positive effects of performance evaluation, negative effects of performance evaluation, obstacles to performance evaluation and suggestions for overcoming the obstacles".
1. What do you think about making a periodic and data-based assessment of academic members?
There is a difference of opinion among academic members in Education Faculties. Although it appears that most of the academic members support a periodic and data-based assessment, some others have negative attitudes towards, and criticisms of, such a system, asserting that it is wide open to abuse. Table 11 presents the analysis of the qualitative data about this theme.
Table 11.
The Analysis of Data Related to Making a Periodic Performance Evaluation

Theme: Attitude Towards Performance Evaluation
Description: Having positive, negative or recessive attitudes about performance evaluation
Codes (frequency): Adopters (28), Doubters (12), Resistants (10)
When Table 11 is reviewed, it is seen that most of the academic members adopt such a system. They | |
claim that performance evaluation would support their development in many aspects. The codes related | |
to views of academic members are given below: | |
Adopters: "I believe that performance evaluation will bring about good results in assuring quality in higher education." (K6)
Doubters: "It is very nice to be supported by the system. But is it all about publishing? It is a matter of question for me how this evaluation will be done, and by whom." (K5)
Resistants: "Performance cannot be assessed. It is ridiculous to compare individuals. It has been tried many times before, but it has turned out to be useless." (K13)
2. What criteria should be assessed within performance evaluation? Could you order these criteria according to their significance for you?
Academic members in Education Faculties express a variety of views on which criteria should be included in the evaluation. They also indicate the significance level of these criteria, which provides valuable qualitative data. Table 12 presents the analysis of qualitative data about which criteria should be included within performance evaluation.
Table 12.
The Codes Related to Academic Members' Preferences for Performance Evaluation Criteria

Theme: Priorities of Academic Members
Description: The criteria which should be included within performance evaluation, ordered by significance level

Codes                                                                 Frequency
Research and publications                                             17
The quality of instruction                                            10
Undergraduate and postgraduate advisory                               8
Workload (course hours etc.)                                          6
Jury memberships (jury of thesis, jury of associate professor etc.)   5
Personal interest and career                                          4
When Table 12 is reviewed, it is seen that academic members first want their research and publications to be assessed, and then their teaching quality in the classroom. According to them, teaching quality includes the methods they use, the quality of the presentation of content, the use of materials and every piece of effort which makes learning permanent. Academic members also want their personal interests and careers to be assessed by the system. The codes related to views of academic members about performance evaluation criteria are given below:
Research and Publications: "The most important criterion in a performance evaluation system must be academic members' publications in terms of quantity and quality." (K6)
The quality of instruction: "Instruction is as important as academic studies. Classroom work, especially activities and teaching methods, can be assessed."
Undergraduate and postgraduate advisory: "We are not just researchers but also advisors, and this issue is ignored by the system. For instance, thesis advisory is a tedious job and should be included within the evaluation." (K22)
Workload: "There is no time left for anything other than teaching courses. An academic member should be assessed by his/her courses and the efforts he/she makes for students and administration. Academic members who teach more courses are the best academicians." (K30)
3. What are the positive and negative consequences of making a performance evaluation of academic | |
members? | |
Academic members in Education Faculties put emphasis on both the positive and negative impacts of a performance evaluation system. In the "Positive Impacts" theme, the codes appear as "motivation", "financial support", "search of quality", "support of development via self-criticism" and "continuity of dynamism"; in the "Negative Impacts" theme, the codes appear as "intra-institutional rivalry", "academic dishonesty", "cause of stress" and "domination of quantity over quality". Table 13 presents the analysis of qualitative data about the positive and negative impacts of performance evaluation.
Table 13.
The Codes Related to Academic Members' Views about Positive and Negative Impacts of Performance Evaluation

Theme: Positive Impacts (positive consequences of the performance evaluation system)
Codes                                        Frequency
Motivation                                   12
Financial support                            8
Search of quality                            8
Supporting development via self-criticism    4
The continuity of dynamism                   4

Theme: Negative Impacts (negative consequences of performance evaluation)
Codes                                        Frequency
Intra-institutional rivalry                  7
Academic dishonesty                          6
Cause of stress                              6
Domination of quantity over quality          8
When Table 13 is reviewed, it is seen that academic members express 36 views under 5 codes related | |
to positive impacts theme and 27 views under 4 codes related to negative impacts theme. The codes | |
related to views of academic members about positive and negative impacts of performance evaluation | |
are given below: | |
Motivation: "This system motivates academic members to conduct new studies." (K24)
Search of quality: "Academic members who are subject to an evaluation feel the need to pursue quality. No one wants to be called a bad teacher." (K9)
The continuity of dynamism: "In public universities, especially older academic members are resistant to renewing themselves. This situation leads to fossilization in higher education, because there is no evaluation and no sanction. Evaluation results in dynamism." (K29)
Intra-institutional rivalry: "It prevents cooperation and breeds jealousy; a competitive environment increases egoistic behaviors rather than productivity." (K36)
Academic dishonesty: "Conducting research with fake data, asking others to add his/her name to studies to which he/she contributed no effort."
Domination of quantity over quality: "Publish, publish, publish; that is enough. There are lots of academic members who do research, but what about the quality? No one asks this question. No one talks about quality now."
4. What are the obstacles to performance of academic members in higher education and what do you | |
suggest for overcoming these obstacles? | |
Academic members in Education Faculties list a number of obstacles to a performance evaluation system and then offer some suggestions for overcoming these obstacles. In the "Obstacles" theme, the codes appear as "intensive workload (courses, advisory, administrative duties)", "efforts are not appreciated", "cumbersome organizational process", "lack of internal motivation" and "too crowded classrooms"; in the "Suggestions" theme, the codes appear as "reducing course loads of academic members", "institutional support for academic efforts and research publications", "evaluation criteria determined by universities", "periodical budget allocation to academic members from the Council of Higher Education" and lastly "employing more officers". Table 14 presents the analysis of qualitative data related to academic members' views about obstacles and suggestions.
Table 14.
The Codes Related to Academic Members' Views about Obstacles to a Performance Evaluation and Their Suggestions

Theme: Obstacles (obstacles to a performance evaluation system)
Codes                                                           Frequency
Intensive workload (courses, advisory, administrative duties)   18
Efforts are not appreciated                                     12
Cumbersome organizational process                               10
Too crowded classrooms                                          8
Lack of internal motivation                                     6

Theme: Suggestions (suggestions for overcoming obstacles)
Codes                                                                   Frequency
Reducing the course load of academic members                            11
Institutional support for academic efforts and research publications    11
Evaluation criteria should be determined by universities                10
Periodical budget allocation to academic members from YÖK               8
Employing more officers                                                 4
When Table 14 is reviewed, it is seen that academic members express 48 views under 5 codes in | |
“Obstacles” theme and 44 views under 5 codes in “Suggestions” theme. The codes related to views of | |
academic members about obstacles to a performance evaluation and their suggestions are given below: | |
Intensive workload: "It takes time to produce something of high quality. There is no time left for academic members. They teach courses, take care of students or are busy with administrative duties." (K25)
Lack of internal motivation: "There are lots of things in academic life which decrease motivation. If an individual starts this profession for some other reason, he/she has a low level of motivation for self-development." (K19)
Cumbersome organizational process: "Bureaucracy and very slow-running processes put an obstacle to performance while carrying out projects or other studies." (K8)
Institutional support for academic efforts and research publications: "The most important suggestion for increasing performance is that academic members should be supported by the institution. This might include research, publication, congress participation or training for self-development." (K7)
Employing more officers: "If the institution employs more officers, academic members will be freed from paperwork." (K21)
Periodic budget allocation to academic members from YÖK: "The Council of Higher Education should allocate a certain budget to academic members, ask them to plan their budget use and make a budget-product comparison at the end of the period." (K10)
Discussion & Conclusion | |
In accordance with the findings of this study, it is observed that there is a difference of opinion among academic members regarding the performance evaluation system. It is seen that academic members in education faculties who have more than 15 years of working experience and who are highly satisfied with their institutions have lower expectations about performance evaluation than others. When academic members' views are reviewed by academic title, it is seen that research assistants and assistant professors have a positive attitude towards performance evaluation, while associate professors and professors show a lower level of positive attitude. Accordingly, Stonebraker and Stone (2015)
emphasize that there is an increase in the average age of academic members with the elimination of | |
mandatory retirement and this raises some concerns about the impact of this aging on productivity in | |
class. They claim that age has a negative impact on student ratings of faculty members that is strong | |
across genders and groups of academic disciplines. However, this negative effect begins after faculty | |
members reach their mid-forties. This explains the reason for negative attitudes of professors towards | |
the performance evaluation system. This finding also parallels the findings of Esen and Esen (2015), who find in their study that academic members' positive perceptions of the positive impacts of performance evaluation decrease as they progress through academic titles. Bianchini, Lissoni, and Pezzoni (2013) emphasize that students tend to evaluate professors' performance more negatively than assistant professors' performance. From a general point of view, it appears in this study that there is hesitation and a lack of confidence in the academic community about the efficiency of a performance evaluation system.
This study indicates that academic members expect a performance evaluation system to develop a consensus about the criteria of an effective academician and to positively affect the professional development of academic members, but also to increase their workload and lead to intra-institutional tension. Qualitative analysis also shows that nearly half of the academic members support performance evaluation, while a certain number of them hesitate about how it will be applied and by whom. The academic members in the faculties of education claim that performance evaluation increases motivation and the search for quality, but that it may also lead to competition within the institution and academic fraud. Traditionally, performance evaluation in faculties tends to focus on research indicators (Bogt and Scapens, 2012); therefore, higher education institutions plan their evaluations considering governmental funding, research awards and high rankings, which all lead to an evaluation that only favours academic members with top publications (Douglas, 2013; Hopwood, 2008).
These findings differ to a certain extent from the studies of Tonbul (2008), Esen and Esen (2015) and Başbuğ and Ünsal (2010). Tonbul (2008) asserts that academicians have higher expectations of the performance evaluation approach because they think it helps to identify the obstacles to effective performance and to recognize one's own deficiencies. Accordingly, Esen and Esen (2015) emphasize that academic members expect a performance evaluation system to develop a qualified organizational culture, provide continuity of organizational innovation, positively affect the professional development of academic members and help them recognize their own deficiencies. This study also indicates that the most important obstacles to performance evaluation appear to be the organizational processes of higher education institutions, intensive workload and lack of intrinsic motivation. Within the scope of the proposals, academic members request the employment of more officers and institutional support for their publications and academic studies. As a result of his study, Tonbul (2008) lists the obstacles to performance evaluation as the inadequacy of organizational opportunities, the organizational culture and uncertainty in evaluation criteria. In Esen and Esen's (2015) study, it is found that the most important factors that pose an obstacle to performance evaluation are the inadequacy of organizational opportunities, the current organizational processes of higher education institutions and academic promotion criteria. Also, Başbuğ and Ünsal (2010) claim that the lack of physical conditions for scientific research is the most significant factor that hinders academic performance.
Academic members in this study emphasize that they prefer to be evaluated according to the following criteria: first, their academic publications and research; second, their quality of instruction; and third, their counseling service to postgraduates. This finding is supported by Braunstein and Benston (1973), who find that research and visibility are highly related in the evaluation of academic members' performance, whereas effective teaching is only moderately related to these performance criteria. In practice, the evaluation of academic members' teaching performance is mostly carried out by students. Arnăutu and Panc
(2015) criticize this situation by claiming that research and scientific productivity, administrative capacity | |
and reputation are not presented in the evaluation made by students, therefore they do not have | |
information necessary to evaluate academic members’ role within faculty. Ünver (2012) conducts | |
research about evaluation of academic members by students and it comes out that most of the academic | |
members think that students fail to make an objective evaluation of academic members; therefore, they | |
prefer making academic studies rather than focusing on students’ views about their teaching | |
performance. Turpen, Henderson, and Dancy (2012) state that the faculties focus on the students' test | |
performance and academic success as quality criteria while higher education institutions focus on | |
quantitative scoring of students when evaluating the quality of teaching. Within this respect, the quality | |
of the measurement tools is very important for assessment of teaching performance. Kalaycı and Çimen | |
(2012) examine the assessment tools used in the process of evaluating the instructional performance of | |
academicians in higher education institutions and find out that quality of instruction and course | |
evaluation surveys are developed without any particular approach and twenty percent of items are | |
inappropriate according to item construction and writing rules; therefore, these assessment tools fail to evaluate academic members' performance. It is shown in some studies that the assessment of the
performance of the instructors by the students may be related to the quality of the teaching as well as | |
the qualities of physical attraction and comfort of the course which are not related to the teaching | |
(Hornstein, 2017; Tan et al., 2019). Shao, Anderson, and Newsome (2007) claim that academic members | |
request peers/colleagues’ considerations for performance assessment and other criteria such as class | |
visits, preparation for class, follow-up of current developments in the field. There are other factors | |
affecting performance evaluations of academic members. Özgüngör and Duru (2014) find that perceptions of instructors deteriorate as course load, instructors' experience, and the total number of students taking the instructor's courses increase. It also emerges that the students of
the Faculty of Education tend to give higher scores to the faculty members than the students of all other | |
faculties, whereas the students of the Faculty of Technical Education and Engineering give lower scores | |
to the faculty members. It is also revealed that faculty members with a course load of 45 hours or more | |
are evaluated more negatively than other faculty members with a lower course load. In the Faculty of Education, the faculty members with 60-100 students receive the worst performance evaluations. Arnăutu and Panc
(2015) refer to students and academic members’ different expectations from each other; claiming that | |
students focus on communicative issues and expect from professors a good relationship and personalized | |
feedback, while professors believe that the attention should be focused on the quality of the education | |
process (such as information update). | |
In this study, it is found out that the performance evaluation of the academic members creates a | |
consensus on the criteria of the effective academic member and positively affects the professional | |
development of the academic members. These qualifications enhance the professional quality of | |
academic members working in the faculties of education and provide a sustainable professional | |
development process. Filipe, Silva, Stulting and Golnik (2014) emphasize that sustainable professional | |
development improved through performance evaluation is not only limited to educational activities, but | |
also develops qualities such as management, teamwork, professionalism, interpersonal communication | |
and accountability. Açan and Saydan (2009) attempt to determine the academic quality characteristics of academic members and arrive at the following criteria: "the teaching ability of the instructor, the
assessment and evaluation skills of the instructor, the empathy of the instructor, the professional | |
responsibility of the instructor, the instructor's interest in the course and the gentleness of the instructor”. | |
Esen and Esen (2015) state that the performances of faculty members in the United States are generally | |
based on four factors which include instruction, research (professional development), community service | |
and administrative service. Among them, they emphasize that the most important ones are the instruction | |
and research dimension. Performance evaluation results are used for making decisions about whether | |
they are appropriate in their current position, promoting them or extending working periods of academic | |
members. | |
In this study, it is seen that academic members who do not receive the academic incentive have lower expectations than those who qualify for such a payment. Kalaycı (2008) claims that the performance evaluation system in Turkey is not even at the preparation stage compared to global practices. However, there have been a number of promising developments in this area in Turkish higher education. Focusing on this problem, the Council of Higher Education in Turkey decided in 2015 to create the Higher Education Quality Council to provide assurance that "a higher education institution or program fully fulfills the quality and performance processes in line with internal and external quality standards". In parallel, the Academic Incentive Award Regulation has been put into practice in order to evaluate the performance of academic staff working in higher education according to standard and objective principles, to increase the effectiveness of scientific and academic studies and to support academic members. It seems to achieve its aim, because in this study academic members who receive the incentive are highly motivated, and they reach a consensus on the criteria of the effective faculty member, which are in compliance with the academic incentive award.
It is important to conduct performance evaluation in higher education in terms of increasing the efficiency of services; however, it is equally important to determine which criteria will be used and to assure the reliability of assessment. In this respect, Çakıroğlu, Aydın and Uzuntiryaki (2009) claim that there is very promising research on the reliability of experienced academic members' evaluations, and they emphasize that the following criteria should be considered within the context of evaluations (see the sketch after this list):
• Data about instructional performance should be collected from a variety of sources (colleagues, students, advisors, postgraduate students, graduates etc.) and in a variety of formats (student evaluation surveys, student interviews, observation results, course materials, student products etc.),
• clearly identifying evaluation criteria,
• informing evaluators about how to carry out the evaluation process,
• selecting evaluators randomly from candidates who meet the criteria for being an evaluator,
• the jury should include at least 3 and at most 5 members.
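As a rough illustration of these recommendations (multiple data sources and a randomly drawn jury of three to five eligible evaluators), the following Python sketch is hypothetical: the source names, weights and evaluator IDs are invented for the example and are not prescribed by the authors.

```python
# Hypothetical sketch of a multi-source, jury-based evaluation as outlined above.
# Source names, weights and evaluator IDs are illustrative assumptions.
import random
from statistics import mean

# Scores for one instructor, gathered from several sources (1-4 Likert scale)
evidence = {
    "student_surveys":   [3.1, 3.4, 3.0],
    "peer_observations": [3.5, 3.2],
    "course_materials":  [3.8],
    "alumni_feedback":   [3.3, 3.6],
}

def aggregate(evidence, weights):
    """Weighted mean across sources; each source is first averaged internally."""
    total_weight = sum(weights[src] for src in evidence)
    return sum(weights[src] * mean(scores) for src, scores in evidence.items()) / total_weight

weights = {"student_surveys": 0.4, "peer_observations": 0.3,
           "course_materials": 0.2, "alumni_feedback": 0.1}
print(f"Aggregated score: {aggregate(evidence, weights):.2f}")

# Randomly draw a jury of 3-5 members from evaluators who meet the criteria
eligible = ["E1", "E2", "E3", "E4", "E5", "E6", "E7"]  # placeholder IDs
jury = random.sample(eligible, random.randint(3, 5))
print("Jury:", jury)
```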
To sum up, academic members’ views about performance evaluation are analyzed and it is recognized | |
that there is no consensus among academic members about performance evaluation. Academic members are aware of the positive impacts of such a system; however, they also have concerns about the reliability of assessment, the evaluation criteria, the evaluation process and the evaluators. This study indicates that the most important criteria which, according to academic members, should be included in the evaluation are research and publication, quality of instruction, and undergraduate and postgraduate advisory. Among the positive impacts of a performance evaluation system, it stands out that performance evaluation motivates academic members, provides financial support and leads to a search for quality; however, academic members also emphasize the negative impacts of such a system, which include intra-institutional competition and academic fraud. Academic members make some suggestions for overcoming the obstacles, which include reducing course loads, providing more institutional support for academic efforts, allocating a certain amount of budget to each member from the Council of Higher Education and employing more officers. There is a variety of requests regarding performance evaluation criteria; however, in terms of improving the quality of higher education and making systematic improvements, it is important to establish an effective evaluation system based on monitoring performance through multiple types of data.
As a result of this research, it is recommended that higher education institutions increase the objectivity and efficiency of performance evaluations and create human resources services within faculties. They should also design sustainable, strong performance plans, use a holistic evaluation cycle, provide consultancy services to academic members, students and internal stakeholders on how to improve performance, prepare understandable and objective guidelines for performance evaluators, and develop an institutional culture in which feedback is regarded as valuable rather than judgmental.
Turkish Version

Introduction

The systematic measurement and evaluation of performance is a dimension that every kind of organization treats with care for the sake of transparent, effective and successful functioning. Most private and public higher education institutions carry out various studies for a systematic evaluation. Owing to reasons such as increasing competition on a global scale and social pressure for transparency, all higher education institutions feel the need not only to define standard performance indicators but also to demonstrate the extent to which they achieve their vision, mission and strategies (Hamid, Leen, Pei & Ijab, 2008). Especially in today's competitive environment, a better evaluation system offers universities guiding advantages and gives them the opportunity to evaluate their own staff and operations. The fact that state universities are publicly funded provides a suitable environment for conducting an efficient performance evaluation without being under pressure.
When the literature is reviewed, various debates about the accountability of higher education institutions can be seen. These debates are fundamentally about evaluating the performance of institutions and publishing the results publicly in a way that allows other stakeholders to participate. Another criticism directed at higher education is the claim that faculty members, who are among the most important determinants of university performance, live in an "ivory tower" as a "closed society" detached from the world (Glaser, Halliday, & Eliot, 2003). Esen and Esen (2015) summarize these criticisms as follows:
• the criticism that the work of faculty members is not oriented towards societal problems,
• that it is overly theoretical,
• that societal resources are being wasted (Etzkowitz, Webster, Gebhardt, & Terra, 2000),
• that research is carried out unilaterally and confined to its own field instead of being translated into societal benefit,
• that the academic identity has turned into an identity with narrowed autonomy, anxious about unsettling the university or administrative structure to which it belongs (Elton, 1999).
Although higher education institutions operate autonomously, they should not be treated like individual organizations and enterprises. Higher education institutions have the power to influence the society, economic structure and social life to which they belong. Therefore, instead of being ivory towers, universities should consider science, society and the nation together, perform at international quality standards, and feel it a matter of conscience to prioritize societal benefit over career advancement. Conducting performance evaluation in universities is necessary in terms of accountability to employees (continuous improvement activities for scientific research), accountability to the state (efficient and productive use of resources), and accountability to students and society (offering comprehensive educational experiences, providing vocational training that will improve quality of life, and meeting society's workforce needs) (Vidovich & Slee, 2001). In addition, UNESCO (2004) lists the global developments that make performance evaluation in higher education necessary as "new institutions such as entrepreneurial universities and corporate universities; new types of educational service delivery such as distance, virtual and private providers; greater diversification of qualifications and certificates; increasing mobility of students, programs, providers and projects across borders; and increasing private investment in the provision of higher education". These developments have important implications for higher education in terms of quality, access, diversity and funding (as cited in Tezsürücü & Bursalıoğlu, 2013). Performance evaluation in higher education covers both various processes and products. Fundamentally, performance evaluation indicates the minimum acceptable level in terms of quality and enables individuals and institutions to recognize the aspects in which they are open to improvement. Individuals or institutions not only become aware of the aspects open to improvement; they also identify the aspects in which they are currently strong. Batool, Qureshi & Raouf (2010) state that performance evaluation may not cover all dimensions of this concept, and that institutional performance evaluation does not mean the same thing as measuring the quality of academic programs, courses or graduates. They underline that institutional performance evaluation rather means evaluating the current state of the institution in terms of its quality and effectiveness.
Within the scope of this study, performance evaluation in higher education can be defined as "measuring academic staff members' professional competence in their instructional roles and, at the same time, the level of their contribution to the fulfillment of institutional goals". Evaluating academic staff members' various activities such as research, academic service, teaching and publication, supporting their development through feedback, and appreciating their work make the existence of a performance evaluation system indispensable. Vincent and Nithila (2010) mention the following among the advantages of a performance evaluation approach to be implemented in higher education:
• It ensures that individuals' development and progress are based on realistic goals.
• It aligns the goals of the individual with the goals of the institution.
• It diagnoses the strengths and weaknesses of individuals within the organization.
• It functions as a feedback mechanism for improvement.
• It helps identify the training and courses that are needed.
• It enables the institution to take on greater educational, social, economic and political roles and responsibilities.
Tonbul (2008), in turn, states that performance evaluation practices will increase the degree to which organizational goals are achieved, make it easier to identify the aspects of institutional functioning that falter, and provide specific data on the effect of organizational climate and institutional culture on employees, thereby increasing organizational performance. Organizations that employ the feedback mechanism effectively and functionally in workflow and organizational processes appear to be more successful and more enduring (Latham & Pinder, 2005). Kalaycı (2009) states that the probability of predicting success or failure in higher education without evaluation is low, but that with the evaluation of academics' teaching performance, learning-teaching environments will become open to questioning by everyone, which is quite challenging. In this regard, Kim et al. (2016) emphasize that many professors attach less importance to their role as educators and give greater priority to their role as researchers, because the faculty evaluation system is based on research. Performance evaluation should not be carried out merely out of obligation and as a formality. This danger is especially possible for state universities. Kalaycı and Çimen (2012) state that "state universities, too, should now carry out quality process practices not to complete a formality but with the aim of genuinely raising quality and standing out in competition, and that state universities also need quality work".
Among the reasons forcing universities to carry out performance evaluation in the 21st century are institutional reputation, internationalization and world university rankings. Many factors determine institutional reputation. In the reputation survey on research and teaching carried out by the Higher Education Authority (2013), it emerged that academics are closely engaged with, and knowledgeable about, the departments in their fields of expertise. It has been stated that an institution's hosting both international and national academic staff and students gives the impression that the institution has a global identity and is ready to compete in the global market (O'Connor et al., 2013). Having international academic staff and students is not sufficient on its own. One of the most important indicators of a university's quality is the performance of its academic staff and, directly related to this, the quality level of the courses offered. The quality of academic staff is among the foremost factors that directly affect the quality of education, and the evaluation of academic staff performance is seen as the assurance of quality control (Açan and Saydan, 2009).
When the most frequently used institutional performance measurement and evaluation techniques, including those used in higher education institutions, are examined, they appear to be "Self-Evaluation, Key Performance Indicators (KPI), Relative Evaluation, Recognition, Six Sigma and Total Quality Management" (Çalışkan, 2006; Kalaycı, 2009; Paige, 2005). Not all of these techniques may be appropriate or applicable for the individual evaluation of faculty members. For example, the performance comparison technique involves evaluating an individual's current performance by comparing him or her with someone else who is accepted as a pioneer/exemplar/leader in the same context. As can be understood, this technique may be suitable for an organization that wants to use the guiding power of best examples in the pursuit of excellence; however, it is not suitable for evaluating all personnel, because individuals differ from one another in their working styles and methods of self-development. Among these techniques, the KPI technique, for instance, is suitable for evaluating the performance of those in teaching positions in higher education. In the KPI technique, operational definitions are made of the performance indicators to be evaluated. What matters in an operational definition is specifying through which operations a concept is defined. Performance measurements and evaluation techniques implemented at the global level may not be applied in exactly the same way in every country's higher education institutions. When the current performance evaluation practices in Turkey are examined, it is seen that "faculty members' performance is measured quantitatively only with regard to their research and publication activities, or evaluations are made on the basis of subjective judgments" (Esen and Esen, 2015). In this regard, the Council of Higher Education launched the academic incentive scheme in 2015 in order to support the academic activities of academics in Turkey and to increase their motivation. This regulation contains detailed evaluation criteria concerning "the detailed characteristics of the activities to be taken as the basis for calculating academic incentive points according to the characteristics of scientific fields and the titles of academic staff, the point equivalents of these activities, and the formation of the commission that will make these calculations, with regard to the implementation of the academic incentive allowance to be paid to academic staff employed in state higher education institutions" (Academic Incentive Allowance Regulation, 2015). With the academic incentive system, the performance of academic staff is evaluated by the Council of Higher Education on the basis of the national and/or international projects they carry out, their research, publications, exhibitions, patents, citations to their work and the academic awards they have received. As a result, academic staff who produce sufficient work are supported financially. When the literature on how academic staff performance evaluations are carried out is examined, various methods can be seen. The various independent methods that can be used to evaluate faculty members' performance in Turkey are as follows:
a. The personnel registry (sicil) system
b. Academic promotion and appointment criteria
c. Faculty member evaluation surveys
d. Annual activity reports
e. The academic incentive scheme
f. Student surveys
(Esen and Esen, 2015)
Conducting performance evaluation in higher education is very important in terms of increasing the effectiveness of the services provided; however, the criteria according to which the performance evaluation will be made, and its reliability, are at least as important. On this issue, Çakıroğlu, Aydın and Uzuntiryaki (2009) state that "research on the reliability of evaluations made by experienced faculty members is quite promising" and emphasize that the following criteria should be taken into consideration:
• data on teaching performance should be collected from a variety of sources (colleagues, students, advisors, postgraduate students, graduates, etc.) and in a variety of formats (student evaluation surveys, student interviews, observation results, course materials, student products, etc.),
• evaluation criteria should be clearly defined,
• those who will be evaluated should be informed about how they will be evaluated,
• evaluators should be informed about how to carry out the evaluation,
• those who are in the position of candidate should not take on the role of evaluator,
• evaluators should be selected randomly from among those who meet the criteria,
• the jury should consist of at least 3 and at most 5 members.
The aim of increasing the effectiveness of universities lies at the basis of evaluating academic staff performance. The reason why education faculties were preferred in this study is that, particularly within the scope of the "Bologna Process", the Council of Higher Education (YÖK) places great emphasis on accreditation work in education faculties. In the accreditation work carried out in education faculties at universities, determining academic staff members' expectations of performance evaluation and the obstacles to it is important for achieving this aim. Higher education institutions in Turkey aim to increase their accountability as an indicator of quality and to report their current situation to their internal and external stakeholders. Within this scope, universities carry out performance evaluation studies of their academic staff in order to prove that they fulfill their missions and visions, and they present these as reports to the public, students, families, the government and the private sector. On a national and global scale, there is increasing pressure on universities to carry out systematic performance evaluations because of concepts such as quality, efficiency, effectiveness and accountability. Therefore, while performance evaluation is so important for higher education institutions, there is a need for research to determine the expectations of the academic staff whose performance is evaluated. Within the scope of this study, which aims to examine quantitatively and qualitatively academic staff members' expectations of the performance evaluation approach in higher education and their views on the obstacles to performance evaluation, the following questions were investigated:
1. What are the expectations of academic staff in the Faculty of Education regarding the performance evaluation approach?
1.1. Do academic staff members' expectations regarding the performance evaluation approach differ significantly according to various variables (academic title, academic experience, receipt of the academic incentive, level of satisfaction with their institution)?
2. What are the views of academic staff in the Faculty of Education on the obstacles to a performance evaluation system?
2.1. Do academic staff members' perceptions of the obstacles to a performance evaluation system differ significantly according to various variables (academic title, academic experience, receipt of the academic incentive, level of satisfaction with their institution)?
3. What are the general views of academic staff in the Faculty of Education on the performance evaluation approach?
Method
Within the scope of this research, the convergent parallel design, one of the mixed-methods research designs, was preferred. Quantitative and qualitative data were collected simultaneously, analyzed separately, and their findings were compared. In the convergent parallel design, the qualitative and quantitative strands are given equal priority, separate analyses are carried out during the analysis stage, and joint interpretation takes place at the end (Creswell and Plano Clark, 2014). The mixed design used in this research is shown in Figure 1:
[Figure 1. Convergent Parallel Design in Mixed-Methods Research: quantitative data collection and analysis (descriptive statistics, t-test and ANOVA) in parallel with qualitative data collection and analysis (content analysis), followed by joint interpretation of the quantitative and qualitative analyses.]
Participants
The data of this study were obtained in 2018 from academic staff working in the Faculties of Education of state universities in the positions of research assistant with a doctorate, assistant professor, associate professor and professor. The study group consists of participants working in the Faculties of Education of state universities located in the Marmara, Black Sea, Aegean, Mediterranean and Eastern Anatolia regions. Instructors (öğretim görevlisi) were not included in the study group because of their heavy course loads and because data were collected only from academic staff who had completed their doctoral education. While collecting the quantitative data, convenience sampling was used, and data were collected from 104 academic staff members at six universities that agreed to participate in the study. In the qualitative strand, the sample was selected through maximum variation sampling, and data were collected from 50 participants representing different views on the situation under investigation. The quantitative phase included 25 research assistants with a doctorate, 35 assistant professors, 31 associate professors and 13 professors. Since convenience sampling was used, no sampling was carried out according to department; in the end, however, 22 percent of the participants work in Science Education, 11 percent in Preschool Education, 28 percent in Educational Sciences and 31 percent in Primary School Education departments. The qualitative phase included 13 research assistants with a doctorate, 17 assistant professors, 15 associate professors and 5 professors. While determining the participants in the qualitative phase, maximum variation was ensured according to the variables of academic title and department. Twenty percent of the participants work in Science Education, 10 percent in Preschool Education, 40 percent in Educational Sciences and 30 percent in Primary School Education departments.
Data Collection Instruments
In this study, a personal information form and two subscales developed by Tonbul (2008) were used to collect data: the 16-item, 4-point Likert-type "Expectations Regarding the Performance Evaluation Approach" subscale and the 10-item "Obstacles to a Performance Evaluation System" subscale. While developing the scale, exploratory factor analysis and the varimax orthogonal rotation technique were applied. The Cronbach's alpha reliability values of the measurement instrument were 0.92 for the "Expectations Regarding the Performance Evaluation Approach" subscale and .87 for the "Obstacles to a Performance Evaluation System" subscale. The reliability analysis was repeated with the data of this study, and the Cronbach's alpha value was found to be .84 for the first subscale and .78 for the second subscale. A Cronbach's alpha value, which measures the homogeneity among scale items, between .60 and .80 is considered evidence that the scale has a high level of reliability (Tonbul, 2008). In the scale used, the items loaded on a single factor, and this single factor explains 55.8% of the total variance. In addition, open-ended questions about the performance evaluation approach were asked in order to support the quantitative data with qualitative data and to allow for rich analysis. The opinions of a professor from the field of Educational Sciences, an associate professor from the field of Measurement and Evaluation, and an expert working in the field of higher education studies were obtained, and the necessary revisions were made. The final versions of the open-ended questions are as follows:
2.1. What do you think about measuring and evaluating academics' performance periodically and on the basis of data?
2.2. Which dimensions would you like to be included in a performance-based evaluation? List these dimensions in order of importance.
2.3. What are the positive and negative aspects of performance evaluation that affect academics' performance?
2.4. What are the obstacles to increasing academics' performance in higher education, and what are your suggestions for removing these obstacles?
Data Analysis
To determine which method would be used to analyze the quantitative data, the equality of variances and the normality of the data distribution were examined. For this purpose, the skewness and kurtosis coefficients were checked and found to be within the (-1, +1) range. In addition, since the sample size was larger than 50 (N=104), the Kolmogorov-Smirnov test was performed, and the significance value was found to be p>.05. Since the normality assumption was met, an independent samples t-test was conducted to test whether there was a significant difference between participants' responses with respect to the variable of receiving the academic incentive. One-way analysis of variance (ANOVA) was conducted to test whether there was a significant difference between participants' responses to the scale items with respect to the variables of work experience, academic title and level of satisfaction with their institution.
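To make this analysis pipeline concrete, the sketch below mirrors the steps just described (skewness/kurtosis check, a Kolmogorov-Smirnov test, an independent-samples t-test and a one-way ANOVA) using scipy on randomly generated placeholder data; it stands in for, and is not identical to, the statistical software procedure the authors used.

```python
# Sketch of the quantitative checks and tests described above, on placeholder data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
scores = rng.normal(loc=2.3, scale=0.44, size=104)   # stand-in for scale means

# Normality: skewness and (excess) kurtosis should fall roughly within (-1, +1)
print("skewness:", round(stats.skew(scores), 2))
print("kurtosis:", round(stats.kurtosis(scores), 2))

# Kolmogorov-Smirnov test against a normal distribution with the sample's parameters
ks_stat, ks_p = stats.kstest(scores, "norm", args=(scores.mean(), scores.std(ddof=1)))
print(f"K-S: D = {ks_stat:.3f}, p = {ks_p:.3f}")     # p > .05 -> normality not rejected

# Independent-samples t-test (e.g. incentive received vs. not received)
received, not_received = scores[:52], scores[52:]
t_stat, t_p = stats.ttest_ind(received, not_received)
print(f"t-test: t = {t_stat:.2f}, p = {t_p:.3f}")

# One-way ANOVA (e.g. four experience groups)
g1, g2, g3, g4 = np.array_split(scores, 4)
f_stat, f_p = stats.f_oneway(g1, g2, g3, g4)
print(f"ANOVA: F = {f_stat:.2f}, p = {f_p:.3f}")
```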
In the analysis of the qualitative data, inductive content analysis was used. Intercoder reliability (agreement percentages) was determined over the academics' views collected with the open-ended questionnaire. While determining these values, the academics' views in the open-ended questionnaire were coded by one researcher and one expert. This procedure was repeated for each item in the questionnaire. Agreement percentages were calculated using Miles and Huberman's (1994) reliability formula:
Reliability = Number of Agreements / (Number of Agreements + Number of Disagreements)
As a result of the calculation, the reliability of the views on the performance evaluation approach was found to be 0.89. Since an agreement percentage of 80% or above is considered sufficient, it can be said that reliability was ensured in terms of data analysis (Mokkink et al., 2010). In this research, the validity strategies of "Member Checking", "Expert Opinion", "Thick Description" and "Chain of Evidence" listed by Creswell (2003) for qualitative research methods were used. Participants were asked whether the study findings accurately reflected their own views, an independent expert who had little contact with the study participants and who knew the study method was consulted, and direct quotations were used to remain as faithful as possible to the nature of the data.
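The Miles and Huberman (1994) agreement ratio used above is simple enough to verify directly. The sketch below compares two coders' labels for a handful of hypothetical responses and applies Reliability = Agreements / (Agreements + Disagreements); the labels are invented for illustration.

```python
# Minimal sketch of the Miles & Huberman (1994) intercoder agreement ratio.
# The two coders' labels below are hypothetical.
coder_1 = ["adopter", "doubter", "adopter", "resistant", "adopter", "doubter"]
coder_2 = ["adopter", "doubter", "adopter", "adopter",   "adopter", "doubter"]

agreements = sum(a == b for a, b in zip(coder_1, coder_2))
disagreements = len(coder_1) - agreements
reliability = agreements / (agreements + disagreements)
print(f"Agreement: {reliability:.2f}")  # values of .80 or above are taken as sufficient
```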
Findings
3.1 Expectations Regarding the Performance Evaluation Approach
In the study, an answer was first sought to the question "What are academic staff members' expectations regarding the performance evaluation approach?", and the participants' overall mean score on the scale is presented in Table 1.
Table 1.
The Overall Mean of Academic Staff's Expectations Regarding the Performance Evaluation Approach

                           N     Minimum   Maximum   Mean     Standard deviation
Overall expectation mean   104   1.50      3.31      2.3023   .43859
When the mean score obtained by the academic staff from the scale in Table 1 is examined (x̄ = 2.30), it is noteworthy that their expectations regarding the performance evaluation approach are not high but at a moderate level ("partially agree"). The ANOVA results on whether academic staff members' expectations regarding the performance evaluation approach differ significantly according to the variable of academic title are presented in Table 2.
Table 2.
ANOVA Results for Expectations Regarding the Performance Evaluation Approach by Academic Title

Title                      N     Mean     Standard deviation
Research assistant (PhD)   25    2.4525   .506
Assistant professor        35    2.4875   .251
Associate professor        31    2.1754   .441
Professor                  13    1.8173   .162
Total                      104   2.3023   .438

Source of variation   Sum of squares   df    Mean squares   F       p      Significant difference
Between groups        5.321            3     1.774          12.24   .000   Res. Asst. > Assoc. Prof., Prof.; Asst. Prof. > Assoc. Prof., Prof.; Assoc. Prof. > Prof.
Within groups         14.492           100   .145
Total                                  103
When the arithmetic means and standard deviations of the scale scores are examined by academic title, those with the highest expectations regarding the performance evaluation approach are the assistant professors, while those with the lowest expectations are the professors. Since Table 2 reveals a significant difference between groups, post hoc tests were examined to determine between which groups the significant difference lies. Since the significance (Sig.) value of the Levene F test was p<.05, the variances are not equal; therefore, the Games-Howell statistic, one of the post hoc tests preferred for between-group comparisons in such cases, was used. As a result of the analysis, the mean scores of research assistants and assistant professors are significantly higher than those of associate professors and professors. There is no significant difference in expectation scores between research assistants and assistant professors.
When the items in the scale are examined, the highest expectations regarding performance are related to the following items:
A consensus is reached on the criteria of an effective faculty member. (x̄ = 3.42)
The professional development of the faculty member is positively affected. (x̄ = 3.27)
The faculty member's workload increases. (x̄ = 2.40)
It causes tension within the institution. (x̄ = 2.39)
The lowest expectations of academic staff regarding the performance evaluation approach are the following:
The motivation of faculty members increases. (x̄ = 1.90)
It contributes to the development of a qualified institutional culture (values, attitudes towards work and sense of responsibility, relationships, etc.). (x̄ = 1.76)
It ensures that faculty members come to classes better prepared. (x̄ = 1.70)
The results of the analysis of whether academic staff members' expectations regarding the performance evaluation approach differ significantly according to the variable of receiving the academic incentive are presented in Table 3.
Table 3.
T-Test Results for Expectations Regarding the Performance Evaluation Approach by Receipt of the Academic Incentive

Academic incentive   N    Mean   SD    t      p
Received             52   2.43   .38   3.22   .002
Not received         52   2.16   .45
The analysis in Table 3 shows that expectations regarding the performance evaluation approach differ significantly according to whether the academic incentive has been received [t(102) = 3.22, p < .05]. The expectations of academic staff who have received the academic incentive are significantly higher than those of staff who have not. The ANOVA results on whether academic staff members' expectations regarding the performance evaluation approach differ significantly according to the variable of work experience are presented in Table 4.
Table 4.
ANOVA Results for Expectations Regarding the Performance Evaluation Approach by Work Experience

Work experience      N     Mean   SD
0-5 years            17    2.43   .51
6-10 years           38    2.43   .28
11-15 years          14    2.51   .39
More than 15 years   35    2.00   .43
Total                104   2.30   .44

Source of variation   Sum of squares   df    Mean squares   F       p     Significant difference
Between groups        4.67             3     1.55           10.28   .00   0-5 years > 15+ years; 6-10 years > 15+ years; 11-15 years > 15+ years
Within groups         15.1             100   .151
Total                                  103
The analysis in Table 4 shows that those with the lowest scores on expectations regarding performance evaluation are academic staff with more than 15 years of work experience. The mean scores of all other groups are significantly higher than the mean score of this group. There is no significant difference among the mean scores of the first three groups. The ANOVA results on whether academic staff members' expectations regarding the performance evaluation approach differ significantly according to the variable of level of satisfaction with their institution are presented in Table 5.
Table 5.
ANOVA Results for Expectations Regarding the Performance Evaluation Approach by Level of Satisfaction with the Institution

Satisfaction level   N     Mean   SD
Low                  10    2.70   .31
Moderate             35    2.39   .32
High                 42    2.00   .47
Complete             17    1.80   .11
Total                104   2.30   .438

Source of variation   Sum of squares   df    Mean squares   F       p     Significant difference
Between groups        5.97             3     1.991          14.38   .00   Low, Moderate > High, Complete
Within groups         13.08            100   .138
Total                                  103
Since the ANOVA test in Table 5 revealed a significant difference between groups (p<.05), it was examined between which groups the significant difference lies. Since the significance (Sig.) value of the Levene F test was p<.05, the variances are not equal; therefore, the Games-Howell statistic, one of the post hoc tests preferred for between-group comparisons in such cases, was used. According to the post hoc test, academic staff who are satisfied with their institution at a low or moderate level have significantly higher expectations regarding the performance evaluation approach than those who are highly or completely satisfied.
3.2 Obstacles to the Performance Evaluation Approach
Secondly, the study sought an answer to the question "What are academic staff members' views on the obstacles to the performance evaluation approach?", and the participants' mean scale score and the standard deviation of the distribution are presented in Table 6.
Table 6.
Academic Staff's Overall Mean Scores on the Obstacles to the Performance Evaluation Approach

                     N     Minimum   Maximum   Mean   Standard deviation
Obstacles subscale   104   2.20      3.80      3.02   .57517
Given the mean score faculty members obtained from the scale in Table 6 (x̄ = 3.02), they agree with the items on the scale concerning barriers to the performance evaluation approach. Looking item by item, the statements about barriers to performance evaluation with which faculty members agree most strongly are the following:
The current organizational functioning of higher education institutions (hierarchical structure, distribution of authority and responsibility, limits on the autonomy of units) (x̄ = 3.80).
The faculty member's workload (x̄ = 3.68).
The statement they agree with least regarding the performance evaluation approach is "Cultural structure (ignoring problems, personal rivalries, excessive tolerance, discomfort with being criticized, distrust, lack of a competitive mindset of Western standard, etc.)" (x̄ = 1.91).
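Subscale and item-level means of the kind reported above are straightforward to compute with pandas. The sketch below is a minimal illustration with placeholder item names and a hypothetical data file, not the authors' code.

import pandas as pd

# Placeholder item names for the barriers subscale
items = ["barrier_item_1", "barrier_item_2", "barrier_item_3"]
df = pd.read_csv("faculty_survey.csv")                    # hypothetical data file

subscale = df[items].mean(axis=1)                         # each respondent's subscale score
print(subscale.mean(), subscale.std())                    # overall mean and SD (cf. 3.02 and .58 above)

# Rank items by mean agreement to find the most and least endorsed statements
print(df[items].mean().sort_values(ascending=False))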
The results of the analysis of whether faculty members' views on the barriers to the performance evaluation approach differ significantly by whether they have received the academic incentive are presented in Table 7.
Table 7.
t-Test Results for Barriers to the Performance Evaluation Approach by Academic Incentive

Academic incentive   N    Mean   SD    t      p
Received             52   2.14   .54   5.77   .000
Not received         52   2.74   .51
Table 7 shows that views of the performance evaluation approach differ significantly by academic incentive status [t(102) = 5.77, p < .05]. Faculty members who have received the academic incentive score significantly lower on the barriers-to-performance-evaluation subscale.
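A comparison like the one in Table 7 amounts to an independent-samples t-test. The sketch below is illustrative only; the data file and the column names "incentive" and "barriers" are assumptions rather than the study's actual data set.

import pandas as pd
from scipy import stats

df = pd.read_csv("faculty_survey.csv")                    # hypothetical data file
received = df.loc[df["incentive"] == "received", "barriers"]
not_received = df.loc[df["incentive"] == "not received", "barriers"]

# Classic independent-samples t-test; pass equal_var=False for Welch's t-test
t_stat, p_val = stats.ttest_ind(received, not_received)
deg_f = len(received) + len(not_received) - 2
print(f"t({deg_f}) = {t_stat:.2f}, p = {p_val:.3f}")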
The ANOVA results showing whether faculty members' views on the barriers to the performance evaluation approach differ significantly by academic title are presented in Table 8.
Table 8.
ANOVA Test of Barriers to the Performance Evaluation Approach by Academic Title

Academic title                           N     Mean   SD
Arş. Gör. Dr. (research assistant)       25    2.98   .30181
Dr. Öğr. Üyesi (assistant professor)     35    3.42   .36202
Doç. Dr. (associate professor)           31    3.38   .63314
Prof. Dr. (professor)                    13    2.96   .83254
Total                                    104   3.02   .61101

Source           Sum of Squares   df    Mean Square   F       p      Source of difference
Between groups   11.089           3     3.696         13.50   .000   Doç. Dr. > Arş. Gör. Dr., Prof. Dr.; Dr. Öğr. Üyesi > Arş. Gör. Dr., Prof. Dr.
Within groups    27.365           100   .274
Since Table 8 revealed a significant between-group difference (p < .05), post hoc tests were examined to determine between which groups the difference lies. Because the (Sig.) value of the Levene F test is p < .05, the variances are not equal; therefore the Games-Howell statistic, one of the post hoc tests preferred for between-group comparisons in such cases, was used. The analysis showed that the highest scores concerning barriers to performance evaluation belong to assistant professors and associate professors, and the lowest to research assistants and full professors. There is no statistically significant difference between the barrier scores of research assistants and full professors.
The ANOVA results showing whether faculty members' scores on the barriers-to-performance-evaluation subscale differ significantly by work experience are presented in Table 9.
Table 9.
ANOVA Test of Barriers to the Performance Evaluation Approach by Work Experience

Work experience        N     Mean   SD
0-5 years              17    2.72   .51
6-10 years             38    3.26   .28
11-15 years            14    3.78   .44
More than 15 years     35    2.88   .39
Total                  104   3.02   .54

Source           Sum of Squares   df    Mean Square   F        p      Source of difference
Between groups   21.938           3     4.67          44.276   .000   11-15 years, 6-10 years > 0-5 years, more than 15 years; 11-15 years > 6-10 years
Within groups    16.516           100   1.51
Total                             103
According to the post hoc analysis in Table 9, those scoring lowest on the barriers-to-performance-evaluation subscale are faculty members with 0-5 years of work experience, and those scoring highest are those with 11-15 years of experience. The barrier scores of faculty members with 11-15 years of experience are significantly higher than those of all other groups. The 11-15 year group emerged as one that considers most things to obstruct performance evaluation and labels almost every item a barrier.
The ANOVA results showing whether faculty members' scores on the barriers-to-performance-evaluation subscale differ significantly by their level of satisfaction with their institution (low, moderate, quite, completely) are presented in Table 10.
Table 10.
ANOVA Test of Barriers to Performance Evaluation by Satisfaction with the Institution

Satisfaction level     N     Mean   SD
Low                    10    2.58   .31
Moderate               35    2.62   .32
Quite satisfied        42    3.48   .47
Completely satisfied   17    3.36   .11
Total                  104   3.02   .43859

Source           Sum of Squares   df    Mean Square   F        p      Source of difference
Between groups   5.97             3     1.991         14.383   .000   Low, Moderate > Quite, Completely
Within groups    13.08            100   .138
Total                             103
Since the ANOVA in Table 10 revealed a significant difference between groups (p < .05), a post hoc test was carried out to determine between which groups the difference lies. Because the (Sig.) value of the Levene F test is p < .05, the variances are not equal; therefore the Games-Howell statistic, one of the post hoc tests preferred for between-group comparisons in such cases, was used. The post hoc test showed that faculty members who are only slightly or moderately satisfied with their institution agree more strongly with the stated barrier items than those who are quite or completely satisfied.
3.3 Qualitative Analysis of the General Approach to Performance Evaluation
Within the scope of the study, qualitative data were collected on the general approach of Faculty of Education instructors to performance evaluation. The qualitative data were analyzed using content analysis. Four open-ended questions on the performance evaluation approach in higher education were asked, and the qualitative data obtained from the responses were examined through content analysis. The content analysis yielded the themes "attitude dimension, academics' priorities, positive effects of performance evaluation, negative effects of performance evaluation, barriers to performance evaluation, and suggestions regarding the factors hindering performance evaluation".
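Once open-ended responses have been hand-coded, the code frequencies reported in Tables 11-14 amount to simple tallies. The sketch below illustrates such a tally with collections.Counter; the coded responses shown are hypothetical and the code labels are used only as examples.

from collections import Counter

# Hypothetical hand-coded responses; in the study each response was assigned
# a code (e.g. "adopters", "skeptics", "resisters") by the researchers
coded_responses = ["adopters", "adopters", "skeptics", "resisters", "adopters"]

frequencies = Counter(coded_responses)
for code, freq in frequencies.most_common():
    print(f"{code}: {freq}")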
1. What do you think about measuring and evaluating academics' performance periodically and on the basis of data?
There are differences of opinion on this issue among instructors in Faculties of Education in Turkey. Although the majority of participants favor a data-based, periodic evaluation, there are individuals who hold negative attitudes towards this approach or who suspect that the evaluation approach is open to abuse. The analysis of the qualitative data on this point is presented in Table 11.
Table 11.
Content-Analysis Coding of the Data on Conducting Performance Evaluation Periodically

Theme: Attitude dimension
Definition: Holding a positive, negative, or reserved attitude towards the performance evaluation approach
Codes (frequency): Adopters (28); Skeptics (12); Resisters (10)
An examination of Table 11 shows that most faculty members stated that performance evaluation would be positive in many respects and that they would support such an evaluation. Their responses were examined under the codes "adopters", "skeptics", and "resisters" within the theme "Attitude Dimension". With the chain of evidence in mind, some of the views relating to these codes are given below:
Adopters: "I believe performance evaluation will bring good results in ensuring quality in higher education" (K6)
Skeptics: "Supporting the work is nice. But is everything about publications? Who will do the evaluation, and how, is a question mark for me" (K5)
Resisters: "Performance cannot be measured. Comparing individuals is meaningless. It has been tried throughout history and produced no benefit; there is no point in trying again" (K13)
2. When performance-based evaluation is carried out, which dimensions would you like it to include? List these dimensions in order of importance.
Instructors in Faculties of Education expressed various views on which dimensions should be included in a performance evaluation approach. Their statements about how much importance they attach to which dimensions provided valuable qualitative data. The analysis of the qualitative data on this point is presented in Table 12.
Table 12.
Content-Analysis Coding of the Data on the Dimensions That Should Be Included in Performance Evaluation

Theme: Academics' priorities
Definition: The elements that should be included in the performance evaluation approach, ranked by importance
Codes (frequency): Academic publications (17); Evaluation of teaching quality (10); Undergraduate and graduate advising (8); Workloads (teaching hours, etc.) (6); Jury memberships (theses, associate professorship, etc.) (5); Personal interests and pursuits (4)
An examination of Table 12 shows that faculty members stated that within a performance evaluation, the number and quality of academic publications should be measured first, followed by an evaluation of how the instructor conducts lessons in the classroom, the methods used, the quality of content delivery, the use of materials, and everything done to make learning lasting. Their responses about the importance of the evaluation elements were examined under the codes "academic publications", "evaluation of teaching quality", "undergraduate and graduate advising", "workloads", "jury memberships", and "personal interests and pursuits" within the theme "Academics' Priorities". With the chain of evidence in mind, some of the views relating to these codes are given below:
Academic publications: "The foremost dimension a performance evaluation of faculty members should focus on is their publishing, and measuring the quality of those publications." (K6)
Evaluation of teaching quality: "Another dimension as important as academic work is teaching. In-class work, especially activities and teaching methods, could be examined."
Undergraduate and graduate advising: "Advising given to students is overlooked. Thesis supervision, for instance, is quite a demanding job. This performance also needs to be included in the evaluation" (K22)
Workloads: "Teaching leaves no time for anything else. An academic could be measured by the courses they teach rather than by publications. Lecturers who teach many courses are lecturers who work hard." (K30)
3. What are the positive and negative aspects of performance evaluation that affect academics' performance?
Instructors in Faculties of Education stated that the performance evaluation approach could affect performance in higher education positively or negatively. Under the theme "Positive Effects", the codes "motivation", "financial support", "pursuit of quality", "self-criticism encouraging development", and "ensuring continuity of dynamism" emerged; under the theme "Negative Effects", the codes "intra-institutional competition", "academic fraud", "source of stress", and "quantity overshadowing quality" emerged. The analysis of the qualitative data on this point is presented in Table 13.
Table 13.
Content-Analysis Coding of the Data on the Positive and Negative Effects of the Performance Evaluation Approach

Theme: Positive effects
Definition: Positive situations the performance evaluation approach would lead to
Codes (frequency): Motivation (12); Financial support (8); Pursuit of quality (8); Self-criticism encouraging development (4); Ensuring continuity of dynamism (4)

Theme: Negative effects
Definition: Negative situations the performance evaluation approach would lead to
Codes (frequency): Intra-institutional competition (7); Academic fraud (6); Source of stress (6); Quantity overshadowing quality (8)
An examination of Table 13 shows that faculty members expressed views on both the positive and the negative situations the performance evaluation approach would lead to. They expressed 36 views under 5 codes regarding its positive effects and 27 views under 4 codes regarding its negative effects. With the chain of evidence in mind, some of the views relating to these codes are given below:
Motivation: "It directs the faculty member to undertake new work" (K24)
Pursuit of quality: "Academics subject to evaluation enter into a pursuit of quality. No one wants to be remembered as a bad lecturer" (K9)
Continuity of dynamism: "At state universities, the older lecturers in particular are reluctant to renew themselves. This leads to the stagnation of higher education, because there is no evaluation or sanction. Evaluation also means dynamism" (K29)
Intra-institutional competition: "It hinders cooperation, jealousies may arise; if a climate of competition emerges, it increases egoistic behavior instead of productivity" (K36)
Academic fraud: "Things like publishing with fabricated data or having one's name added as last author could happen"
Quantity overshadowing quality: "Publications, publications, publications; how far can it go? Everyone now produces a lot of publications, but how many are of good quality? This is not normal. Someone may produce many high-quality publications, but how many are like that?"
4. What are the barriers to increasing academics' performance in higher education, and what are your suggestions for removing these barriers?
Instructors in Faculties of Education identified barriers to performance evaluation and various suggestions for addressing them. Under the theme "Barriers", the codes "heavy workload (teaching, advising, administrative duties)", "lack of intrinsic motivation", "crowded classes", "efforts going unappreciated", and "cumbersome organizational functioning" emerged; under the theme "Suggestions", the codes "employing administrative staff", "institutional support for publications and research", "keeping the teaching load low", and "allocation of a semester budget to the individual by YÖK" emerged. The analysis of the qualitative data on this point is presented in Table 14.
Table 14.
Content-Analysis Coding of the Data on Barriers to Increasing Performance and Suggestions for Addressing Them

Theme: Barriers
Definition: Barriers to faculty members increasing their performance
Codes (frequency): Heavy workload (teaching, advising, administrative duties) (18); Efforts going unappreciated (10); Cumbersome organizational functioning (8); Crowded classes (6); Lack of intrinsic motivation (4)

Theme: Suggestions
Definition: Suggestions for removing the barriers to increasing performance
Codes (frequency): Keeping the teaching load low (9); Institutional support for publications and research (8); Criteria being set by the universities (5); Employing administrative staff (4); Allocation of a semester budget to the individual by YÖK (4)
An examination of Table 14 shows that faculty members expressed 44 views under 5 codes regarding barriers and 26 views under 4 codes regarding suggestions. With the chain of evidence in mind, some of the views relating to these codes are given below:
Heavy workload: "Time is needed to produce something of quality. Faculty members have no time. They are either teaching, dealing with a student, or they hold an administrative post and have to deal with its work" (K25)
Lack of intrinsic motivation: "In academia there are more demotivating things than motivating ones. If a person entered academia not willingly but for other reasons, they will not have the desire needed to improve their performance" (K19)
Cumbersome organizational functioning: "In projects and similar work, the very slow official process, bureaucracy, and paperwork are barriers to increasing performance" (K8)
Institutional support for publications and research: "My biggest suggestion for increasing performance is that employees' efforts be supported by the institution. This could be publications, conferences, or training for personal development" (K7)
Employing administrative staff: "If more administrative staff were hired for the departments, faculty members would at least be freed from the paperwork that occupies them and takes up so much of their time." (K21)
Allocation of a semester budget by YÖK: "YÖK should allocate a budget to every faculty member at the start of the semester, ask them to plan how they will use it, and compare budget and output at the end of the semester" (K10)
Discussion and Conclusion
In line with the findings of this study, the views of faculty members in faculties of education on the performance evaluation approach in higher education differ considerably. Faculty members with more than 15 years of work experience and those who are highly satisfied with their institution have lower expectations of performance evaluation than the others. By academic title, research assistants with a doctorate and assistant professors view the performance evaluation approach positively, whereas associate professors and full professors view it only slightly positively. Similarly, Stonebraker and Stone (2015) note that the average age of faculty has risen with the abolition of mandatory retirement and that there are concerns about the negative effects this ageing may have on productivity in the classroom. A negative effect of age on students' evaluations of faculty performance has been observed, and this effect also appears across gender and academic discipline; however, it does not appear until faculty members reach their mid-forties. This finding parallels the study by Esen and Esen (2015), which found that as academic titles rise, positive perceptions of the consequences performance evaluation would have for both faculty members and institutions decrease. Bianchini, Lissoni and Pezzoni (2013), in their study on performance evaluation, reported that students evaluated professors more negatively than assistant professors. When faculty members' qualitative views are considered as a whole, a degree of distrust and hesitation about the performance evaluation approach is evident in the academic community regardless of academic title.
Regarding faculty members' expectations of the performance evaluation approach, they hold high expectations with respect to reaching a consensus on the criteria for an effective faculty member, the positive effect on faculty members' professional development, the increase in faculty workload, and the tension it may cause within the institution. The qualitative findings show a divide among faculty members between those who embrace performance evaluation and those who are skeptical of it. Faculty members in faculties of education stated that performance evaluation increases motivation and the pursuit of quality, but that it may also lead to intra-institutional competition and academic fraud. Traditionally, performance evaluations in faculties focus on research indicators (Bogt & Scapens, 2012); accordingly, when higher education institutions conduct evaluations, they tend to support only those faculty members with the best publications, through state funding, research awards, and high research rankings (Douglas, 2013; Hopwood, 2008). In this study, faculty members saw heavy workload and lack of intrinsic motivation as the most important barriers to performance evaluation, while their suggestions included employing more administrative staff and institutional support for publications and research. These findings differ from those of Tonbul (2008), Esen and Esen (2015), and Başbuğ and Ünsal (2010), who also studied performance evaluation. Tonbul (2008) reported that faculty members generally viewed a performance evaluation approach to be implemented positively, and that their expectations were highest with respect to identifying the barriers to effective performance and enabling faculty members to see their own shortcomings. Esen and Esen (2015) note that faculty members perceive that evaluating performance would make a positive contribution to institutions and to faculty members. Regarding expectations, they emphasized that academics expect the development of a sound institutional culture around performance evaluation, the continuity of institutional renewal, a positive effect on faculty members' professional development, and faculty members seeing their own shortcomings more clearly.
This study found that the most important barriers to performance are the current organizational functioning of higher education institutions and faculty members' workload; in Tonbul's (2008) study they were the inadequacy of organizational resources, the prevailing institutional culture, and uncertainty about evaluation criteria; and in Esen and Esen's (2015) study the most important barriers were, in order, the lack of institutional resources, the current organizational functioning of higher education institutions, and academic promotion criteria. Başbuğ and Ünsal (2010) reported that the majority of academic staff view performance evaluation positively and that the most important hindering factor is being deprived of the physical conditions scientific research requires (laboratory, office, equipment, etc.). Özgüngör and Duru (2014) found that perceptions of instructors become more negative as teaching load, experience, and the instructor's total number of students increase. Their study showed that Faculty of Education students gave their instructors higher scores than the students of all other faculties, while Technical Education and Engineering Faculty students gave their instructors lower scores than the students of all other faculties. Analyses of teaching load revealed that instructors with a teaching load of 45 hours or more were evaluated more negatively than all instructors with a lower load. In the Faculty of Education, instructors with between 60 and 100 students received the worst evaluations. Arnăutu and Panc (2015) state that students and instructors have different expectations: students focus more on communication, expecting professors to build a good relationship and give personal feedback, whereas professors emphasize the quality of the educational process (such as the currency of knowledge).
This study shows that, within performance evaluation, faculty members want research and academic publications to be evaluated first, followed by teaching services and graduate advising. This finding is supported by Braunstein and Benston's (1973) study, which found that research and prestige are highly related to performance evaluation, while effective teaching is only moderately related to it. The quality of faculty members' teaching is evaluated by students; however, Arnăutu and Panc (2015) criticize this practice, emphasizing that such evaluations do not take into account research and publication productivity, management competencies, or academic recognition, and that students therefore lack sufficient knowledge of faculty members' roles within the faculty. Ünver (2012), who studied students' evaluation of faculty performance, reported that most faculty members do not believe students will evaluate teaching objectively, and that they prefer to carry out academic work rather than reflect on students' views of their teaching skills. Turpen, Henderson and Dancy (2012) note that higher education institutions focus on quantitative ratings from students when evaluating teaching quality, whereas faculties take students' test performance and academic achievement as the criterion. In this respect, the quality of the measurement instruments used in evaluating teaching performance becomes highly important. Kalaycı and Çimen (2012) examined the questionnaires used in evaluating academics' teaching performance at higher education institutions and showed that the questionnaires were prepared without any systematic basis, that one fifth of the items did not comply with item-writing rules, and that they were therefore inadequate for measuring faculty performance. Some studies have also shown that students' evaluations of faculty performance may relate not only to teaching quality but also to qualities unrelated to teaching, such as physical attractiveness and how easy the course is (Hornstein, 2017; Tan et al., 2019). Shao, Anderson and Newsome (2007) state that, with respect to evaluating the quality of teaching, academics expect greater weight to be given to classroom visits, lesson preparation, keeping up with current developments in the field, and peer evaluations.
This study found that, according to faculty members, performance evaluation builds consensus on the criteria for an effective faculty member and positively affects faculty members' professional development. These qualities raise the professional quality of faculty members working in faculties of education and provide a sustainable process of professional development. Filipe, Silva, Stulting and Golnik (2014)
emphasize that the sustainable professional development fostered by performance evaluation is not limited to educational activities but also develops qualities such as management, teamwork, professionalism, interpersonal communication, and accountability. Açan and Saydan (2009) sought to identify academic quality expectations of faculty members and found that an instructor's academic quality consists of the dimensions "the instructor's teaching ability, measurement and assessment skills, ability to empathize, professional responsibility, ability to encourage interest in the course, the importance the instructor attaches to the course, and the instructor's courtesy". Esen and Esen (2015) state that in the United States faculty performance is generally evaluated on four dimensions: teaching, research (professional development), service to the community, and service to administration. Among these four, they emphasize that the most important are the teaching and research dimensions. The results of performance evaluations based on these dimensions are reported to be used in extending faculty members' terms of appointment, deciding their suitability for their current position, and in promotion.
This study shows that faculty members who have not received the academic incentive have lower expectations of performance evaluation than the others. Kalaycı (2008) notes that, compared with practices around the world, efforts and studies on performance evaluation in Turkey are not yet even at the stage of preparing the ingredients, let alone at the fermentation stage. Focusing on this problem, the Council of Higher Education established the Higher Education Quality Board in 2015 "to provide assurance that a higher education institution or programme fully carries out quality and performance processes compatible with internal and external quality standards". In parallel, the Academic Incentive Allowance Regulation was put into effect in order to evaluate the performance of academic staff in higher education according to standard and objective criteria, to increase the effectiveness of scientific research and academic work, and to support academics. Among the positive effects of a performance evaluation system identified in this study is the motivation of academic staff; the academic incentive regulation is consistent with faculty members' expectation that consensus will form on the criteria for an effective faculty member, and faculty members who have qualified for the academic incentive hold high expectations of performance evaluation.
In summary, there is no consensus among faculty members in the faculty of education regarding performance evaluation. Faculty members are aware of its positive effects, but they have concerns about the reliability of measurement, the evaluation criteria, the evaluation process, and the evaluators. Within this study, the most important criteria that faculty members believe should be included in the evaluation are, in order, research and publication, the quality of teaching, and undergraduate and graduate advising. The positive effects of a performance evaluation system were stated to include motivating academic staff, providing financial support, and driving the pursuit of quality. On the other hand, according to faculty members, the negative effects of an evaluation system include intra-institutional competition and academic fraud. To resolve the problems associated with performance evaluation, faculty members put forward suggestions such as reducing teaching loads, providing institutional support for academic efforts, allocating a research budget to faculty members through YÖK, and employing more administrative staff. Although there are differing demands about the criteria that should be included in performance evaluation, establishing performance monitoring and an effective evaluation system based on multiple types of data is considered highly important for raising the quality of higher education and making systematic improvements.
Based on the results of this research, it is recommended that higher education institutions increase objectivity and effectiveness in the performance evaluation process and establish human resources services within faculties. It is further recommended that these institutions design sustainable, robust performance plans, use a holistic evaluation cycle, offer faculty members, students, and internal stakeholders advisory services on how performance can be improved, prepare clear and objective guidelines for performance evaluators, and develop an institutional culture that presents feedback as valuable rather than judgemental.
References | |
Açan, B., & Saydan, R. (2009). Öğretim elemanlarının akademik kalite özelliklerinin değerlendirilmesi: | |
Kafkas Üniversitesi İİBF örneği. Atatürk Üniversitesi Sosyal Bilimler Enstitüsü Dergisi, 13 (2), 226-227. | |
Arnăutu, E., & Panc, I. (2015). Evaluation criteria for performance appraisal of faculty members. Procedia - Social and Behavioral Sciences, 203, 386-392.
Başbuğ, G., & Ünsal, P. (2010). Kurulacak bir performans değerlendirme sistemi hakkında akademik | |
personelin görüşleri: Bir kamu üniversitesinde yürütülen anket çalışması. İstanbul Üniversitesi Psikoloji | |
Çalışmaları Dergisi, 29(1), 1-24. | |
Batool, Z., Qureshi, R. H., & Raouf, A. (2010). Performance evaluation standards for the HEIs. Higher | |
Education Commission Islamabad, Pakistan. Retrieved October 12, 2019 from | |
https://au.edu.pk/Pages/QEC/Manual_Doc/Performance_Evaluation_Standards_for_HEIs.pdf | |
Bianchini, S., Lissoni, F., & Pezzoni, M. (2013) Instructor characteristics and students’ evaluation of | |
teaching effectiveness: Evidence from an Italian engineering school. European Journal of Engineering | |
Education, 38 (1),38-57. | |
Bogt, H. J., & R. W. Scapens. (2012). Performance management in universities: Effects of the transition to | |
more quantitative measurement systems. European Accounting Review, 21 (3), 451–97 | |
Braunstein, D. N., & Benston, G. J. (1973). Student and department chairman views of the performance | |
of university professors. Journal of Applied Psychology, 58(2), 244. | |
Creswell, J.W., & Plano Clark, V.L. (2014). Designing and conducting mixed methods research. Thousand | |
Oakes, CA, Sage Publications. | |
Çakıroğlu, J., Aydın, Y., & Uzuntiryaki, E. (2009). Üniversitelerde öğretim performansının değerlendirilmesi. | |
Orta Doğu Teknik Üniversitesi Eğitim Fakültesi Raporu. | |
Çalışkan, G. (2006). Altı sigma ve toplam kalite yönetimi. Elektronik Sosyal Bilimler Dergisi, 5(17), 60-75. | |
Douglas, A. S. (2013). Advice from the professors in a university social sciences department on the | |
teaching-research nexus. Teaching in Higher Education, 18 (4), 377–88. | |
Elton, L. (1999). New ways of learning in higher education: managing the change. Tertiary Education and | |
Management, 5(3), 207-225. | |
Esen, M., & Esen, D. (2015). Öğretim üyelerinin performans değerlendirme sistemine yönelik tutumlarının | |
araştırılması. Yükseköğretim ve Bilim Dergisi, 5(1), 52-67.
Etzkowitz, H., Webster, A., Gebhardt C., & Terra., B.R.C. (2000). The future of the university and the | |
university of the future: evolution of ivory tower to entrepreneurial paradigm. Research Policy, 29(2), | |
313-330. | |
Filipe, H. P., Silva, E. D., Stulting, A. A., & Golnik, K. C. (2014). Continuing professional development: Best | |
practices. Middle East African journal of ophthalmology, 21(2), 134. | |
Glaser, S., Halliday, M. I., & Eliot, G. R. (2003). Üniversite mi? Çeşitlilik mi? Bilgideki önemli ilerlemeler | |
üniversitenin içinde mi, yoksa dışında mı gerçekleşiyor?. N. Babüroğlu (Ed.), Eğitimin Geleceği | |
Üniversitelerin ve Eğitimin Değişen Paradigması (ss. 167-178). İstanbul: Sabancı Üniversitesi Yayını. | |
Hamid, S., Leen, Y. M., Pei, S. H., & Ijab, M. T. (2008). Using e-balanced scorecard in managing the | |
performance and excellence of academicians. PACIS 2008 Proceedings, 256. | |
Higher Education Authority (2013). Towards a performance evaluation framework: Profiling Irish Higher | |
Education. Dublin: HEA | |
Hornstein, H. A. (2017). Student evaluations of teaching are an inadequate assessment tool for evaluating | |
faculty performance. Cogent Education, 4(1), 1304016. | |
Hopwood, A. G. (2008). Changing pressures on the research process: on trying to research in an age when | |
curiosity is not enough. European Accounting Review, 17 (1), 87–96. | |
Kalaycı, N. (2009). Yüksek öğretim kurumlarında akademisyenlerin öğretim performansını değerlendirme | |
sürecinde kullanılan yöntemler. Kuram ve Uygulamada Egitim Yönetimi Dergisi, 15(4), 625-656. | |
Kalaycı N., & Çimen O. (2012). Yükseköğretim kurumlarında akademisyenlerin öğretim performansını | |
değerlendirme sürecinde kullanılan anketlerin incelenmesi. Kuram ve Uygulamada Eğitim Bilimleri, | |
12(2), 1-22 | |
Kim, H. B., Myung, S. J., Yu, H. G., Chang, J. Y., & Shin, C. S. (2016). Influences of faculty evaluating system | |
on educational performance of medical school faculty. Korean Journal Of Medical Education, 28(3), | |
289-294. | |
Latham, G. P., & Pinder, C. C. (2005). Work motivation theory and research at the dawn of the twenty-first | |
century. Annu. Rev. Psychol., 56, 485-516. | |
Miles, M. B., & Huberman, A. M. (1994). Qualitative data analysis: An expanded sourcebook (2nd ed.).
California: SAGE Publications. | |
Mokkink, L. B., Terwee, C. B., Gibbons, E., Stratford, P. W., Alonso, J., Patrick, D. L., & de Vet, H. C. (2010). | |
Inter-rater agreement and reliability of the COSMIN Checklist. BMC Medical Research Methodology, | |
10, 82. | |
O'Connor, M., Patterson, V., Chantler, A., & Backert, J. (2013). Towards a performance evaluation | |
framework: profiling Irish higher education. NCVER's free international Tertiary Education Research. | |
Retrieved September 8, 2019 from http://hea.ie/assets/uploads/2017/06/Towards-a-PerformanceEvaluation-Framework-Profiling-Irish-Higher-Education.pdf. | |
Özgüngör, S., & Duru, E. (2014). Öğretim elemanları ve ders özelliklerinin öğretim elemanlarının | |
performanslarına ilişkin değerlendirmelerle ilişkileri. Hacettepe Üniversitesi Eğitim Fakültesi Dergisi, | |
29 (29-2), 175-188. | |
Paige, R. M. (2005). Internationalization of higher education: Performance assessment and indicators. | |
Nagoya Journal of Higher Education, 5(8), 99-122. | |
Shao, L. P., Anderson, L. P., & Newsome, M. (2007). Evaluating teaching effectiveness: Where we are and | |
where we should be. Assessment & Evaluation in Higher Education, 32(3), 355-371. | |
Stonebraker, R. J., & Stone, G. S. (2015). Too old to teach? The effect of age on college and university | |
professors. Research in Higher Education, 56(8), 793-812. | |
T. C. Resmi Gazete. (2015). Akademik teşvik ödeneği yönetmeliği. Karar Sayısı: 2015/8305. Kabul tarihi: | |
14/12/2015. Yayımlandığı tarih: 18 Aralık 2015. Sayı: 29566. | |
Tan, S., Lau, E., Ting, H., Cheah, J. H., Simonetti, B., & Lip, T. H. (2019). How do students evaluate | |
instructors’ performance? Implication of teaching abilities, physical attractiveness and psychological | |
factors. Social Indicators Research, 1-16. | |
Tezsürücü, D., & Bursalıoğlu, S. A. (2013). Yükseköğretimde değişim: kalite arayışları. Kahramanmaraş | |
Sütçü İmam Üniversitesi Sosyal Bilimler Dergisi, 10 (2), 97-108. | |
Tonbul, Y. (2008). Öğretim üyelerinin performansının değerlendirilmesine ilişkin öğretim üyesi ve öğrenci | |
görüşleri. Kuram ve Uygulamada Eğitim Yönetimi, 56 (56), 633-662. | |
Turpen, C., Henderson, C., & Dancy, M. (2012, January). Faculty perspectives about instructor and
institutional assessments of teaching effectiveness. In AIP conference proceedings, 1413 (1), 371-374. | |
UNESCO (2004), Higher Education in a Globalized Society. UNESCO Education Position Paper, France | |
Ünver, G. (2012). Öğretim elemanlarının öğretimin öğrencilerce değerlendirilmesine önem verme | |
düzeyleri. Hacettepe Üniversitesi Eğitim Fakültesi Dergisi, 43, 472-484. | |
Vidovich, L., & Slee, R. (2001). Bringing universities to account? Exploring some global and local policy
tensions. Journal of Education Policy, 16(5), 431-453. | |
Vincent, T. N. (2010). A constructive model for performance evaluation in higher education institutions. | |
Retrieved from https://ssrn.com/abstract=1877598
Journal of University Teaching & Learning Practice, Volume 18, Issue 8 (Standard Issue 4), Article 14, 2021
Preservice teachers’ perceptions of feedback: The importance of timing, | |
purpose, and delivery | |
Christina L. Wilcoxen | |
University of Nebraska, United States of America
Jennifer Lemke | |
University of Nebraska, United States of America
Recommended Citation | |
Wilcoxen, C. L., & Lemke, J. (2021). Preservice teachers’ perceptions of feedback: The importance of | |
timing, purpose, and delivery. Journal of University Teaching & Learning Practice, 18(8). https://doi.org/10.53761/1.18.8.14
Preservice teachers’ perceptions of feedback: The importance of timing, purpose, | |
and delivery | |
Abstract | |
If the purpose of feedback is to reduce the discrepancy between the established goal and what is | |
recognized, then how can this discrepancy be minimized through support and guidance? Feedback is | |
instrumental to a preservice teacher's development during their teacher preparation program. This
qualitative study examines 31 first year teachers’ previous experiences with feedback during their | |
undergraduate practicums. The two research questions addressed: What can be learned from PSTs’ | |
perceptions of feedback practices utilized in teacher preparation programs? and What modifications or | |
adaptations can be made to current feedback practices and structures in teacher preparation programs to | |
enhance teacher efficacy and classroom readiness? Semi-structured interviews provided a comparison of qualitative data and an opportunity for open-ended questioning. Using descriptive analysis, researchers
discovered that current feedback loops and structures can inhibit pre-service teachers’ ability to make | |
meaning from the information and move their learning and instruction forward. As teacher preparation | |
programs work to establish more dialogic approaches to feedback that provide pre-service teachers with | |
multiple opportunities to reflect individually and collaboratively with university faculty, timing, purpose, | |
and delivery are important components to consider. Although this article is written based on preservice | |
teacher perceptions, the implications pertain to multiple fields, and the authors share a universal framework
for feedback. | |
Practitioner Notes | |
1. The goal of teacher preparation is simple: create teachers who are well equipped with the | |
knowledge and skills to positively impact PK-12 students. Field experiences are | |
embedded throughout teacher preparation programs to provide pre-service teachers | |
(PSTs) with meaningful opportunities to develop their ability and knowledge of effective | |
instructional practices. | |
2. As teacher preparation programs work to establish more dialogic approaches to feedback | |
that provide pre-service teachers with multiple opportunities to reflect individually and | |
collaboratively with university faculty, timing, purpose, and delivery are necessary | |
considerations. | |
3. What is the timing of the delivery? The timing of the delivery of feedback must be | |
considered. Frequency plays a large role in how PSTs view and utilize feedback. | |
4. Do receivers of the feedback understand the purpose? Ties to evaluation and the need for | |
directive solutions impact preservice teachers' understanding of the purpose behind the feedback. One way to support this need is to strengthen PSTs' assessment feedback
literacy. | |
5. Does the delivery clarify the content and support reflection? As university faculty continue | |
to explore how to provide explicit feedback, delivery methods that support reflection and | |
pre-service teacher’s growth are important to consider. With the purpose of feedback | |
being to help reduce the discrepancy between the intended goal and outcome, pre-service | |
teachers must have easy access and retrieval of feedback. | |
Keywords | |
Preservice teaching, feedback literacy, assessment, teacher preparation | |
Preservice Teachers’ Perceptions of Feedback: The Importance of | |
Timing, Purpose, and Delivery | |
The goal of teacher preparation is simple: create teachers who are well equipped with the | |
knowledge and skills to positively impact preschool through high school students. Field | |
experiences are embedded throughout teacher preparation programs to provide pre-service | |
teachers (PSTs) with meaningful opportunities to develop their ability and knowledge of effective | |
instructional practices. Practicum experiences in classrooms give PSTs opportunities to practice | |
specific pedagogies with students and refine their abilities in real time (Cheng, et al., 2012). It is | |
critically important for PSTs to experience the teaching process to develop pedagogical and | |
reflective skills as well as teacher efficacy (Darling-Hammond, 2012; Liakopoulou, 2012; | |
McGlamery & Harrington, 2007). These structured experiences can bridge understanding of how to apply feedback and make connections in the context of a school setting (Flushman et al., 2019).
This practice builds confidence in effectively delivering instruction and managing challenges that | |
occur in the learning environment. | |
If the purpose of feedback is to reduce the discrepancy between the established goal and what is | |
recognized (Hattie and Timperley, 2007), then how can this discrepancy be minimized through | |
support and guidance? Feedback is instrumental to a PSTs development during their teacher | |
preparation program and learning is optimized “when they receive systematic instruction, have | |
multiple practice opportunities and receive feedback that is immediate, positive, corrective and | |
specific (Scheeler et al., 2004, p. 405). It is important to guide PSTs to interpret their experiences | |
in authentic settings (Schwartz et al., 2018) and to support the development of effective teaching | |
practices (Hammerness et al., 2005). Constructive feedback coupled with reflective opportunities | |
allow the PST to distinguish effective classroom practices from those that are not (Hudson, 2014; | |
Pena & Almaguer, 2007). “Good quality external feedback is information that helps students | |
troubleshoot their own performance and self-correct: that is, it helps students take action to reduce | |
the discrepancy between their intentions and the resulting effects” (Nicol & Macfarlane-Dick, | |
2006, p. 208). For feedback to be integrated effectively, it needs to be timely, specific, and | |
accessible to encourage the individual to apply what they learned in future teaching opportunities | |
(Van Rooij et al., 2019). This correlates with self-efficacy.
Feedback can also be a significant source of self-efficacy in pre-service teachers (Mulholland & | |
Wallace, 2001; Mahmood et al., 2021; Schunk & Pajares, 2009). Though feedback can come in a | |
variety of formats, Rots et al. (2007) found that quality feedback and supervision provided by | |
university faculty correlated to higher levels of self-efficacy in pre-service teachers. Efficacy | |
increases when university faculty use prompts to encourage PSTs to focus on what went well and | |
build upon the strengths of the lesson (Nicol & Macfarlane-Dick, 2006). Timing, purpose, and | |
delivery play an important role in how faculty use feedback practices with pre-service teachers. | |
In many current teacher preparation program models, PSTs spend more time working in the field | |
than they do in coursework (National Council for Accreditation of Teacher Education [NCATE], | |
2010). With such an emphasis placed on practicum experiences (American Association of | |
Colleges of Teacher Education [AACTE], 2018; Lester & Lucero, 2017) and the critical role these | |
play in the development of pre-service teachers, one must consider if current feedback practices | |
and structures positively contribute to higher levels of teacher efficacy and classroom readiness. | |
The role of university faculty is to acknowledge and clearly articulate the strengths and | |
weaknesses of the lesson to promote productive behaviors that will positively contribute to student | |
learning (Fletcher, 2000). The existing research, however, leaves a gap around preservice teacher perceptions. Therefore, it is imperative to consider the perceptions of pre-service teachers regarding
their experiences with feedback, how these experiences align with high quality feedback practices, | |
and how they are designed for students who experience them (Smith and Lowe, 2021). | |
This qualitative study examines first year teachers’ previous experiences with feedback during | |
their undergraduate practicums. The study is expected to contribute to a deeper understanding of | |
what feedback practices pre-service teachers determine as beneficial and their interpretation of the | |
context, in addition to what action steps or modifications teacher preparation programs can take to | |
maximize feedback practices within practicum experiences. | |
The Purpose of Feedback | |
Feedback has often functioned as a punisher or reinforcer, a guide or rule, or served as a | |
discriminating or motivating stimulus for individuals (Mangiapanello & Hemmes, 2015). | |
Historically feedback has been a one-way transmission of information (Ajjawi & Boud, 2017), but | |
contemporary views on feedback recognize it as a reciprocal exchange between individuals | |
focused on knowledge building versus the arbitrary delivery of information (Archer 2010). | |
Daniels & Bailey (2014) defined performance feedback as, “information about performance that | |
allows a person to change his/her behavior” (p. 157). Studies show organizations that establish | |
strong feedback environments exhibit better outcomes in terms of employee performance | |
(Steelman et al., 2004). Constructive feedback, in the presence of a well-built feedback hierarchy, builds employees' intrinsic motivation (Cusella, 2017; The Employers Edge, 2018). With that explanation, appropriate and meaningful feedback is essential in ensuring that good practices are
rewarded, ineffective practices corrected and pathways to improvement and success identified | |
(Cleary & Walter, 2010). | |
A key purpose of feedback in teacher preparation programs is to enhance pre-service teachers’ | |
knowledge and skills (AACTE, 2018). Feedback serves as one component within complex | |
structures and interactions to support PSTs’ development (Evans, 2013). Through feedback, PSTs | |
realize their strengths and weaknesses, gain understanding of instructional methods, and develop a | |
repertoire of strategies to enhance their performance and student learning (Nicol & Macfarlane-Dick, 2006). With this knowledge and understanding, PSTs have opportunities to act upon the
received feedback to improve their performance and enhance student learning (Carless et al., | |
2011). Feedback allows PSTs to define effective teaching practices and determine what | |
instructional methods are valued in specific learning environments. | |
Feedback is also meant to stimulate PST’s self-reflection. Feedback allows the pre-service teacher | |
to deconstruct and reconstruct instructional methods and practices with guidance from university | |
faculty. Specific feedback and reflective dialogue contribute to the pre-service teacher’s ability to | |
critically reflect on their performance individually and use this understanding and knowledge to | |
regulate future teaching experiences (Tulgar, 2019). These reflective opportunities to identify | |
strengths and weaknesses create pathways to improvement. | |
Feedback can also serve as a way for university faculty to monitor, evaluate and track pre-service | |
teacher’s progress and performance (Price et al., 2010). Many teacher preparation programs use | |
feedback as a measure in evaluating PST performance during practicums or other field-based | |
components. This feedback, often documented through rubrics or other assessment criteria, is | |
useful in helping establish measurable goals and effective teaching practices across a teacher | |
preparation program. When the feedback or assessment tools reflect the objectives and goals of the | |
program, they can strengthen the connection between theory and practice, thereby increasing PST | |
learning (Ericsson, 2002; Grossman, et al., 2008; Vasquez, 2004). PSTs rely on experienced | |
individuals such as university faculty to articulate, model and provide high quality feedback | |
through practicums (Darling-Hammond & MacDonald, 2000). This guidance increases | |
connections between coursework and the classroom. | |
With research suggesting that pre-service teachers welcome constructive feedback and the | |
opportunity to learn (Chaffin & Manfredo, 2009; Chesley & Jordan, 2012), university faculty must | |
seek collaborative opportunities to provide effective feedback that positively contributes to the | |
development of PSTs. A major role of university faculty is to guide the PST in setting goals for | |
practicum that foster their development and growth as an educator. When university faculty | |
clearly articulate the strengths and weaknesses of the lesson and assist the PST in identifying their | |
next actions, outcomes can be achieved faster. | |
Components of Effective Feedback | |
Effective feedback provides the learner with a clear understanding of how the task is being | |
accomplished or performed and offers support and direction in increasing their efforts to achieve | |
the desired outcome (Hattie and Timperley, 2007). This model reinforces the need for feedback to | |
be timely, content-specific, and delivered to meet the needs of the individual receiving it.
Timing | |
The timing of feedback plays an essential role in shaping PSTs' understanding of effective teaching
practices and effective instructional methods. Feedback can be provided to PSTs in a variety of | |
structures and formats. Deferred feedback refers to notes or qualitative data collected while observing and shared with the teacher upon completion of the lesson (Scheeler et al., 2009). Deferred
feedback is less intrusive because it allows the teacher to deliver the lesson without disruption. | |
Immediate feedback refers to when university faculty stop the lesson or instructional activity being | |
observed to provide corrective feedback and/or modeling when a problem is noted (Scheeler et al., | |
2009). Scheeler et al. (2004) found “targeted teaching behaviors were acquired faster and more | |
efficiently when feedback was immediate” (p. 403). Immediate feedback also reduced the | |
likelihood of teachers continuing ineffective teaching practices. | |
Explicit, Quality Feedback | |
Corrective feedback that identifies errors and ineffective teaching methods with targeted ways to | |
correct them is one of the most influential means of feedback (Chan et al., 2014; Van Houten, | |
1980). Studies found that desired teacher behaviors resulted from feedback that was both positive | |
and corrective, focused on specific teaching behaviors and practices, and provided concise | |
suggestions for change (Scheeler et al., 2004; Woolfolk, 1993). Feedback that is individualized | |
and centered on the needs of the individual yields more effective outcomes for learning (Cimen & Cakmak, 2020; Pinger et al., 2018). When this aligns with the goals and objectives of the specific
lesson, it provides valuable insight as to where the PST is in relation to the goal (Bloomberg & | |
Pitchford, 2017). This type of feedback increases self-efficacy as it allows the PST to see growth | |
over time. | |
Delivery | |
The delivery of observational feedback may vary depending on the development and readiness of | |
the PST. Although the goal is for teachers to engage in self-directed reflection, some teachers may | |
need more support and guidance as they maneuver through the dimensions and complexities of | |
teaching. A variety of differentiated coaching strategies have been researched over the years | |
regarding instructional practice and student learning (Aguilar, 2013; Costa & Garmston, 2002; | |
Knight, 2016; Sweeney, 2010). These include both conversational and written feedback between | |
the PST and university faculty. | |
The New Teacher Center (2017) outlines three differentiated dialogic coaching approaches:
instructive, collaborative, and facilitative. Instructive coaching is directive and guided by the | |
university faculty who analyze performance and lead conversations. Collaborative coaching is less | |
directive and both the PST and university faculty have an equal voice in the conversation. | |
Facilitative coaching allows the teacher to lead the reflective conversation, while university | |
faculty provides feedback with probing questions to facilitate critical thinking and problem | |
solving. These conversations contain minimal feedback from university faculty and topics for | |
discussion are often directed by the teacher. | |
While oral feedback is a powerful tool in constructing relationships between the PST and | |
university faculty, written feedback is just as important as it provides pre-service teachers with | |
formal documentation of clearly articulated strengths and weaknesses. Written comments are far | |
more effective than a grade or evaluation (Black & Wiliam, 1998; Crooks, 1988) and provide both | |
the university faculty and the PST with a record of performance in response to learning needs | |
(Flushman et al., 2019). Conversation and dialogue include the thoughts and beliefs of the PST | |
and provide faculty an opportunity to gauge their depth of understanding. Written support | |
provides documentation and a reference for PSTs. | |
Methodology | |
This study seeks to uncover how university faculty can effectively integrate high-quality feedback
practices into practicum experiences. Specifically, what can be learned from PSTs’ perceptions of | |
feedback practices utilized in teacher preparation programs? What modifications or adaptations | |
can be made to current feedback practices and structures in teacher preparation programs to | |
enhance teacher efficacy and classroom readiness? In the context of this study, not only were | |
PSTs’ experiences with feedback considered, but also how these experiences and perceptions align | |
with high quality feedback practices. | |
Design and Participants | |
Researchers used semi-structured interviews to provide a comparison of qualitative data and an | |
opportunity for open-ended questioning (Yin, 2016). The 30-minute interviews were recorded and
transcribed for analysis in Fall 2020. Participation was voluntary and researchers used purposeful | |
sampling (Yin, 2016) from a pool of participants in their first year of teaching. Researchers | |
selected beginning teachers because, as recent graduates, they are closest to their practicum experiences. Additionally, all participants experienced the same interruptions in teaching
during March 2020. Researchers sought a range of participant perspectives; therefore, the study | |
consisted of 31 beginning teachers who spanned seven school districts and 24 schools within a | |
midwestern metropolitan environment. All teachers held a bachelor’s degree and teaching | |
certification from a 4-year university or college. Representation included two private institutions | |
and three public institutions. All participants except one were female. Grade levels spanned preschool through eighth grade, with five special education perspectives spanning grades preschool through sixth grade. The school districts are in one state and serve approximately one-third of their state's total student population (over 100,000 students). Demographic information is
presented in Table 1. | |
Table 1
Characteristics of Participants

Teaching Endorsement (Teachers, N = 31)
    PreK-K               5
    First - Third        10
    Fourth - Sixth       8
    Middle School        3
    Special Education    5

Teaching Environment: District Representation (N = 7 districts, 24 schools)
    Suburban             51%
    Rural                6%
    Urban                42%
Data Collection & Analysis | |
Questions asked during the interviews addressed previous experiences with feedback during | |
practicums. Application was also addressed in reference to how it influenced teaching behaviors | |
and actions. More than one researcher took part in the collection, analysis, and interpretation of the | |
data. Both researchers were involved in the preparation of the questions and in the data analysis. | |
Using descriptive analysis to interpret the data obtained from the semi-structured interviews, researchers identified themes using the following process to construct theory: 1) review of the transcribed interviews, 2) open coding, 3) identification of categories and/or themes, and 4) data abstraction (Lawrence & Tar, 2013). Since researcher one conducted the interviews, researcher two reviewed all the transcripts to familiarize themselves with the content. Next, open coding
determined themes in participant answers. Patterns in the data showed consistency in ideas | |
(Eisenhardt, 1989; Orlikowski, 1993) and researchers identified overall themes amongst the | |
answers. Once established, researchers coded the remaining transcripts independently. Since coding semi-structured interviews involves determining the intent or meaning behind the questions answered, researchers also addressed intercoder reliability and agreement (Campbell et al., 2013). Both noted the same themes with only 20% discrepancy, or 80% agreement. Researchers then adjudicated the coding disagreements through negotiated agreement to reach concordance. After reconciling the initial disagreements, researchers coded the transcripts using the identified themes. Inter-rater reliability was 97%.
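As a minimal illustration of the agreement figures reported above (a sketch only: the segment codes and theme labels below are hypothetical and are not the study's transcripts), percent agreement between two coders can be computed by comparing their theme codes segment by segment in Python:

# Minimal sketch of intercoder percent agreement; all data below are hypothetical.
def percent_agreement(codes_a, codes_b):
    """Share of coded segments on which two coders assigned the same theme."""
    assert len(codes_a) == len(codes_b)
    matches = sum(a == b for a, b in zip(codes_a, codes_b))
    return matches / len(codes_a)

# Ten illustrative segments coded with three themes:
# T = timing, E = explicit/quality feedback, D = delivery.
coder_one = ["T", "T", "E", "D", "T", "E", "D", "T", "E", "D"]
coder_two = ["T", "T", "E", "D", "E", "E", "D", "T", "E", "T"]

print(f"Initial agreement: {percent_agreement(coder_one, coder_two):.0%}")  # 80%
# Disagreements would then be resolved through discussion (negotiated agreement)
# before the reconciled set of themes is applied to the remaining transcripts.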
Results | |
Results indicated three themes. All stemmed from participant perspectives of beneficial practices | |
and what they found valuable or wanted more of during their PST experiences. Out of 31
participants, 29 were coded with at least one of the three themes. Participants who mentioned | |
more than one theme were counted as part of each theme mentioned; 11 of the 31 mentioned more | |
than one identified theme. See Table 2. | |
Table 2
Themes Found in the Feedback

Beneficial Practice: Frequency and structure of the feedback
Percent (n = 44): 40%
Example Comment: This respondent reflected on the difference between a few visits and multiple. "Let me come observe you and give you tips here and there" as compared to someone providing feedback multiple times a week.

Beneficial Practice: The need for explicit and quality feedback
Percent (n = 44): 30%
Example Comment: This respondent reflected on how grace and time are not always the most beneficial. My institution "just gave a lot of grace and comfort and even during student teaching … I really enjoy getting told what I can improve on because there's always room for improvement and I like the different ideas."

Beneficial Practice: The need for conversation linked to feedback
Percent (n = 44): 30%
Example Comment: The respondent believed that "conversations more focused on do you think the students understood the concept? How do you feel that it went?" would help PSTs engage in daily reflective practice and goal setting.
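As a small arithmetic aside (an illustrative sketch, not part of the study), the respondent shares cited in the subsections below can be reproduced from the counts given there (17 and 13 of the 31 participants); the theme percentages in Table 2 are computed the same way over the n = 44 coded mentions:

# Sketch of the respondent-share arithmetic; counts come from the Results text below.
respondents_per_theme = {
    "Frequency and structure of the feedback": 17,
    "The need for explicit and quality feedback": 13,
    "The need for conversation linked to feedback": 13,
}
total_participants = 31

for theme, count in respondents_per_theme.items():
    share = count / total_participants
    print(f"{theme}: {count}/{total_participants} = {share:.0%}")
# Prints 55%, 42%, and 42%, matching the figures reported in the Results.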
Timing | |
Frequency was the most cited need at 40% and noted by 55% (n = 17) of respondents. | |
Overwhelmingly, participants referred to the feedback received as pre-service teachers as | |
“minimal”. Other phrases included “too spaced out”, “lumped together at the end” and “few”. | |
Multiple participants mentioned having been provided feedback following an observation only once or twice. Even when the feedback provided the next steps towards improvement,
participants still felt it was too late. “It’s like … now I can’t implement that until next semester” or | |
“Here’s the feedback. Remember when you get a job.” Participants felt the timing of the feedback | |
negatively affected the implementation. They wanted more consistency with small tips in real time | |
throughout the experience. | |
Explicit, Quality Feedback | |
A need for explicit and quality feedback was cited next at 30% and noted by 42% (n = 13) of | |
respondents. “I always like it straight forward. I want all of the feedback that I can get because I | |
feel like that's going to help me grow”. Another noted that they wanted specific feedback on areas | |
to improve instead of "a lot of grace and comfort." They additionally noted that building confidence without the skills to back it does not lead to improvement. Another commented that university
faculty was “really really nice but the feedback was all positive like she was kind of scared to give | |
constructive feedback.” One commented how she thought the feedback would provide her things | |
to work on, but instead the feedback was “you’re doing what you’re supposed to be doing.” | |
Participants wanted feedback to provide more direction and insight to enhance instructional | |
performance. Feedback only highlighting the positive aspects or acknowledging “no room for | |
growth" was not useful or beneficial. One respondent noted, "I hardly ever sat down to discuss how I was doing. It was more in passing that the feedback took place." This led to the third theme.
Delivery | |
A need for conversation linked to feedback was cited next at 30% and noted by 42% (n = 13) of | |
responders. Tied to this conversation was the need for explicit feedback mentioned above. | |
Participants struggled with the broad categories on rubrics which highlight multiple behaviors. “I | |
feel like not all rubric feedback is accurate”. This led some to request more specific targets. They | |
felt this could be reached through reflective conversations. One noted the importance of the | |
conversation when helping PSTs reflect on practice and setting goals. The respondent believed | |
that “conversations more focused on do you think the students understood the concept? How do | |
you feel that it went?” would help PSTs engage in daily reflective practice and goal setting. Others | |
noted how conversations allowed for “collaboration and brainstorming” and how conversations | |
better support the reflection process. Dialogue can be beneficial in the moment and authentic, | |
although it was noted that written conversation and feedback can be just as powerful when open-ended and used as a communication tool.
Participants noted the importance of written feedback as it provided opportunities to reflect and | |
respond. Also, it gave participants insight and context as to what was happening while they were | |
teaching. “I don't realize everything good that I'm doing or what I need to improve on. So, when | |
university faculty take notes, it really helps me see what I'm actually doing.” Another talked about | |
university faculty keeping a notebook. The two used it as a communication tool for written | |
conversations which the participant “thought was really helpful because … I can look back and | |
see what she wrote, and I feel like it was a little more immediate.” | |
The results indicated that PSTs believe that timely and explicit feedback is beneficial in both
goal setting and enhancing their instructional performance. Results also indicated that PSTs find | |
both dialogue and written feedback to be useful reflective tools. As teacher preparation programs | |
consider feedback structures and the levels of support, these are important implications to consider | |
when creating meaningful practicum experiences. | |
Discussion | |
Reflection is an expectation in teacher preparation (Brookfield, 1995; Darling-Hammond, 2006; | |
Liu, 2013). The link between reflection and learning is not new (Dewey, 1933; Schön, 1983; Zeichner, 1996), as studies highlight that reflection involves emotions and is a context-dependent
process impacted by social constructs. PSTs are expected to recognize when adjustments are | |
needed and make them to effectively meet the needs of the students they serve. A cycle of | |
observation, action, and reflection can help PSTs adjust their teaching. This is most effective when | |
the cycle is individualized, collaborative, and embeds frequent opportunities to make meaning of | |
the information for future use (Vartuli, et al., 2014). Current feedback loops and structures can | |
inhibit PSTs' ability to make meaning from the information and move their learning and | |
instruction forward. As teacher preparation programs work to establish more dialogic approaches | |
to feedback that provide PSTs with multiple opportunities to reflect individually and | |
collaboratively with university faculty, timing, purpose, and delivery are necessary components to | |
consider. See Figure 1. | |
Figure 1 | |
Feedback Structure for Pre-service Teachers | |
What is the timing of the delivery? | |
When considering the results, frequency plays a large role in how PSTs view and utilize feedback. | |
It was clear that PSTs desire more frequent, immediate feedback to enhance their instructional | |
performance. Immediate feedback results in quicker acquisition of effective teacher behaviors and | |
greater overall accuracy in the implementation of those behaviors than when delayed feedback is | |
provided (Coulter & Grossen, 1997; O'Reilly et al., 1992, 1994). Though some question whether immediate feedback might interfere with the learning environment and reduce instructional
momentum, advancements in technology make the ability to provide immediate feedback both | |
manageable and efficient for both university faculty and pre-service teachers. Devices such as the | |
“bug in the ear” (BIE) have been used to provide immediate feedback in a variety of situations. | |
Results from various studies show these technologies effectively supported university faculty in | |
providing concise, immediate feedback to pre-service teachers to increase their ability to respond | |
to the various needs of students and alter or stop ineffective practices in the moment (Coulter & | |
Grossen, 1997; Scheeler et al., 2009). As teacher preparation programs consider how to increase | |
efforts for university faculty to provide specific, immediate feedback, technical devices have great | |
potential to increase desired teaching behaviors and students’ academic performance. | |
Do receivers of the feedback understand the purpose? | |
Pre-service teachers request explicit, quality feedback, but there is a clear disconnect between this | |
concept and the PSTs' perceptions of the purpose of the feedback provided. The ties to evaluation
and the need for directive solutions will not change, so how can mindsets shift to better understand | |
the purpose? One way to do this is through strengthening PSTs’ assessment feedback literacy. | |
PSTs need opportunities and a repertoire of skills to engage with feedback in authentic ways, | |
make sense of the information provided, and determine how the information can be productively | |
implemented in future lessons (Carless & Boud, 2018; Price et al., 2010; Smith and Lowe, 2021). | |
Feedback literacy can strengthen reflective capacity as students have more opportunities to | |
engage, interact with, and make judgments about their own practice (Carless & Boud, 2018; | |
Sambell, 2011; Smith and Lowe, 2021). To close the feedback loop, PSTs must acquire the ability | |
to process the comments and information received and then act upon the feedback for future | |
instruction. Students must learn to appreciate feedback and their role in the process, develop and | |
refine their ability to make judgements, and develop habits that strive for continuous improvement | |
(Boud & Molloy, 2013). Designing a program curriculum that emphasizes the importance of the | |
feedback process and creates opportunities for pre-service teachers to self-evaluate their practice is | |
crucial in building capacity for them to make sound judgments. Equally as important is creating | |
space for pre-service teachers to co-construct meaning of the feedback and demonstrate how they | |
use the information to inform or enhance future instruction (Carless & Boud, 2018; O’Donovan et | |
al., 2016). Building programs grounded in feedback literacy provides opportunities to critically
reflect on choices and draw clear connections between feedback and its purpose. | |
Does the delivery clarify the content to support reflection? | |
Another consideration worth noting is the need for feedback that prompts both reflection and | |
growth of pre-service teachers. Participants in this study indicated that feedback from university | |
faculty was not always useful because it could not be applied immediately. They also noted the | |
feedback provided did not always prompt reflection that resulted in changes or modifications to | |
their future instructional practices or teaching methods. While this discrepancy could be attributed | |
to the readiness level of the pre-service teacher, it could also be that the feedback loops and | |
structures designed do not create informative pathways that move students' learning forward.
As university faculty continue to explore how to provide explicit feedback, delivery methods that | |
support reflection and pre-service teachers' growth are important to consider. With the purpose of feedback being to help reduce the discrepancy between the intended goal and outcome, pre-service teachers must have easy access to and retrieval of feedback. While we know that reflective coaching
conversations are beneficial in helping pre-service teachers reflect on their teaching practices and | |
to determine alternate methods of instruction that may be more effective, time and availability of | |
university faculty may limit these meaningful interactions from taking place. To overcome this | |
barrier, teacher preparation programs should consider how they might couple traditional forms of | |
written feedback and reflective conversations with digital tools that facilitate collaborative | |
discussion and grant easier access to feedback, allowing pre-service teachers space and opportunity to engage in both collaborative and independent reflection and problem solving. Providing pre-service teachers with multiple sources of feedback can be a way to increase the visibility of
feedback for pre-service teachers and encourage them to consistently revisit the information to | |
make future instructional decisions and professional judgments. | |
Implications | |
Current literature highlights the gap between providing feedback and the receiver’s interpretation | |
(O’Connor & McCurtin, 2021). This gap creates growth limitations when the learner is not | |
gaining what is needed from the feedback. This is especially important in higher education, as institutions develop students for professional careers, such as education, which require lifelong learning, critical thinking and problem solving. Therefore, we propose the following framework
and action steps to support the understanding of and implementation of feedback for PSTs. We | |
also assert that this framework could span multiple disciplines and professional contexts. | |
Figure 2 | |
Framework to Support Pre-service Teacher Capacity Building for Feedback | |
Limitations and Implications for Future Research | |
Although the results of this study provide insight into PSTs' feedback experiences, they must be interpreted within the limitations of the study. The first limitation is that the participants in this study represent only five universities across three states. We recognize that this limitation in our sample
does not represent the scope of teacher preparation programs across the country but believe that | |
the results provide worthwhile insights into PSTs' experiences with feedback in practicum
experiences. Future studies including participants across numerous states and teacher preparation | |
programs would allow for more diverse experiences and perspectives to be represented. | |
Another limitation in this study is that all participants experienced disruptions in their | |
undergraduate practicum experiences. These disruptions likely resulted in condensed or altered | |
experiences which could have impacted the opportunities and quality of feedback provided by | |
university faculty. Future studies that include participants whose experiences consist of traditional | |
structures and timelines of practicum experiences may better reflect PSTs' experiences with feedback and the practices used by university faculty.
Conclusion | |
Teacher preparation institutions need to reevaluate current feedback practices with PSTs. | |
Participants indicated that more frequent conversations would make guidance more explicit and | |
support development of practice and reflection. Although this is based on a limited number of participants in one country, the findings are generalizable to most countries. The concept of feedback literacy needs to be taught, modeled, and practiced by PSTs throughout their course of study for them to better understand the connection between feedback and practice. By
focusing on timing, delivery, and purpose, teacher preparation institutions can take one step closer | |
to developing reflective practitioners who embody the knowledge and skills to positively impact | |
learning for every student. | |
References | |
Aguilar, E. (2013). The art of coaching: Effective strategies for school transformation. Wiley. | |
Ajjawi, R., & Boud, D. J. (2017). Researching feedback dialogue: an interactional analysis | |
approach. Assessment and Evaluation in Higher Education, 42(2), 252–265. | |
https://doi.org/10.1080/02602938.2015.1102863 | |
American Association of Colleges of Teacher Education [AACTE] Clinical Practice Commission | |
(2018). A pivot toward clinical practice, its lexicon, and the renewal of teacher | |
preparation. Retrieved from https://aacte.org/resources/clinical-practice-commission#related-resources
Archer, J. C. (2010). State of the science in health professional education: Effective feedback. | |
Medical Education, 44(1), 101–108. https://doi.org/10.1111/j.1365-2923.2009.03546.x. | |
Black, P., & Wiliam, D. (1998). Assessment and classroom learning. Assessment in Education: | |
Principles, Policy & Practice, 5(1), 7-74. https://doi.org/10.1080/0969595980050102 | |
Bloomberg, P., & Pitchford, B. (2017). Leading impact teams: Building a culture of efficacy. | |
Corwin. | |
Boud, D., & Molloy, E. (Eds.). (2013). Feedback in higher and professional education: | |
understanding it and doing it well. Routledge. | |
Brookfield, S. D. (1995). Becoming a critical reflective teacher. Jossey-Bass Publishers. | |
Campbell, J. L, Quincy, C., Osserman, J., & Pedersen, O. K. (2013). Coding in-depth semi | |
structured interviews: Problems of unitization and intercoder reliability and agreement. | |
Sociological Methods & Research, 42(3), 294-320. | |
https://doi.org/10.1177/0049124113500475
Carless, D., Salter, D., Yang, M., & Lam, J. (2011). Developing sustainable feedback practices.
Studies in Higher Education, 36(4), 395-407. | |
https://doi.org/10.1080/03075071003642449 | |
Carless, D., & Boud, D. (2018). The development of student feedback literacy: enabling uptake of | |
feedback. Assessment & Evaluation in Higher Education, 43(8), 1315-1325. | |
https://doi.org/10.1080/02602938.2018.1463354 | |
Chaffin C., & Manfredo J. (2009). Perceptions of preservice teachers regarding feedback and | |
guided reflection in an instrumental early field experience. Journal of Music Teacher | |
Education, 19(2), 57-72. https://doi.org/10.1177/1057083709354161 | |
Chan, P. E., Konrad, M., Gonzalez, V., Peters, M. T., & Ressa, V. A. (2014). The critical role of | |
feedback in formative instructional practices. Intervention in School and Clinic, 50(2), | |
96-104. https://doi.org/10.1177/1053451214536044 | |
Chesley, G. M., & Jordan, J. (2012). What’s missing from teacher prep. Educational Leadership, | |
69(8), 41-45. | |
Cheng, M. M., Tang, S. Y., & Cheng, A. Y. (2012). Practicalising theoretical knowledge in | |
student teachers' professional learning in initial teacher education. Teaching and Teacher | |
Education, 28(6), 781-790. https://doi.org/10.1016/j.tate.2012.02.008 | |
Cimen, O., & Cakmak, M. (2020). The effect of feedback on preservice teachers’ motivation and | |
reflective thinking. Elementary Education Online, 19(2), 932-943. https://doi.org/10.17051/ilkonline.2020.695828
Cleary, M. L., & Walter, G. (2010). Giving feedback to learners in clinical and academic settings: | |
Practical considerations. The Journal of Continuing Education in Nursing, 41(4), 153-154. https://doi.org/10.3928/00220124-20100326-10
Costa, A. L., & Garmston, R. (2002). Cognitive coaching: A foundation for renaissance schools. | |
Christopher-Gordon Publishers. | |
Coulter, G. A., & Grossen, B. (1997). The effectiveness of in-class instructive feedback versus | |
after-class instructive feedback for teachers learning direct instruction teaching behaviors. | |
Effective School Practices, 16(4), 21–35. | |
Crooks, T. J. (1988). The impact of classroom evaluation practices on students. Review of | |
Educational Research, 58(4), 438-481. https://doi.org/10.3102/00346543058004438 | |
Cusella, L., (2017). The effects of feedback on intrinsic motivation: A propositional | |
extension of cognitive evaluation theory from an organizational communication | |
perspective. Annals of the International Communication Association, 4(1), 367-387. | |
https://doi.org/10.1080/23808985.1980.11923812 | |
Daniels, A. C., & Bailey, J. S. (2014). Performance management: Changing behavior that drives | |
organizational effectiveness (5th ed.). Atlanta, GA: Performance Management | |
Publications. | |
Darling-Hammond, L. (2012). Powerful teacher education: Lessons from exemplary programs. | |
John Wiley & Sons. | |
Darling-Hammond, L. (2006). Powerful teacher education. San Francisco: Jossey-Bass. | |
Darling-Hammond, L., & MacDonald, M. (2000). Where there is learning there is hope: The | |
preparation of teachers at the Bank Street College of Education. In L. Darling-Hammond | |
(Ed.), Studies of excellence in teacher education: Preparation at the graduate level (1-95). | |
American Association of Colleges for Teacher Education. | |
Dewey, J. (1933). How we think: A restatement of the relation of reflective thinking to | |
the educative process. Henry Regnery. | |
Eisenhardt, K. M. (1989). Building theories from case study research. Academy of Management
Review, 14(4), 532-550. www.jstor.org/stable/258557 | |
Ericsson, K. A. (2002). Attaining excellence through deliberate practice: Insights from the study | |
of expert performance. In M. Ferrari (Ed.), The pursuit of excellence in education (pp. | |
21-55). Erlbaum. | |
Evans, C. (2013). Making sense of assessment feedback in higher education. Review of | |
Educational Research, 83(1), 70-120. https://doi.org/10.3102/0034654312474350 | |
Fletcher, S. (2000). Mentoring in schools: A handbook of good practice. Kogan Page. | |
Flushman, T., Guise, M., & Hegg, S. (2019). Improving supervisor written feedback: Exploring | |
the what and why of feedback provided to pre-service teachers. Issues in Teacher | |
Education, 28(2), 46–66. | |
Grossman, P., Hammerness, K., & McDonald, M. (2008). Redefining teaching, re-imagining | |
teacher education. Teachers and Teaching: Theory and Practice, 15(2), 273-289. | |
https://doi.org/10.1080/13540600902875340 | |
Hammerness, K., Darling-Hammond, L., Bransford, J., Berliner, D., Cochran-Smith, M., | |
McDonald, M., & Zeichner, K. (2005). How teachers learn and develop. In L. Darling-Hammond & J. Bransford (Eds.), Preparing teachers for a changing world: What teachers should learn and be able to do (pp. 358-389). Jossey-Bass.
Hattie, J., & Timperley, H. (2007). The power of feedback. Review of Educational Research, 77 | |
(1), 81-112. https://doi.org/10.3102/003465430298487 | |
Hudson, P. (2014). Feedback consistencies and inconsistencies: Eight mentors’ observations on | |
one preservice teacher’s lesson. European Journal of Teacher Education, 37(1), 63–73. | |
https://doi.org/10.1080/02619768.2013.801075 | |
Killion, J. (2015). Attributes of an effective feedback process. In: The feedback process: | |
Transforming feedback for professional learning. Oxford, Ohio: Learning Forward. | |
Knight, J. (2016). Better conversations: Coaching ourselves and each other to be more credible, | |
caring, and connected. Corwin. | |
Lawrence, J., & Tar, U. (2013). The use of grounded theory technique as a practical tool for | |
qualitative data collection and analysis. The Electronic Journal of Business Research | |
Methods, 11(1), 29-40. | |
Liakopoulou, M. (2012). The role of field experience in the preparation of reflective teachers. | |
Australian Journal of Teacher Education, 37(6), 42-54. | |
https://doi.org/10.14221/ajte.2012v37n6.4 | |
Liu, K. (2013). Critical reflection as a framework for transformative learning in teacher | |
education. Educational Review, 67(2), 135-157. | |
https://doi.org/10.1080/00131911.2013.839546 | |
Lester, A., & Lucero, R. (2017). Clinical practice commission shares proclamations, tenets at | |
AACTE forum. Ed Prep Matters. http://edprepmatters.net/2017/04/clinical-practice-commission-shares-proclamations-tenets-at-aacte-forum/
Mahmood, S., Mohamed, O., Mustafa, S. M. B. S., & Noor, Z. M. (2021). The influence of | |
demographic factors on teacher-written feedback self-efficacy in Malaysian secondary | |
school teachers. Journal of Language and Linguistic Studies, 17(4). | |
Mangiapanello, K., & Hemmes, N. (2015). An analysis of feedback from a behavior analytic | |
perspective. The Behavior Analyst, 38(1), 51–75. doi:10.1007/s40614-014-0026-x. | |
McGlamery, S., & Harrington, J. (2007). Developing reflective practitioners: The importance of | |
field experience. The Delta Kappa Gamma Bulletin, 73(3), 33-45. | |
Mulholland, J., & Wallace, J. (2001). Teacher induction and elementary science teaching: | |
Enhancing self-efficacy. Teaching and Teacher Education, 17(2), 243–261. | |
https://doi.org/10.1016/s0742-051x(00)00054-8 | |
National Council for Accreditation of Teacher Education. (2010). Transforming | |
teacher education through clinical practice: A national strategy to prepare effective | |
teachers. Retrieved from | |
http://www.ncate.org/LinkClick.aspx?fileticket=zzeiB1OoqPk%3d&tabid=715 | |
New Teacher Center (2017). Instructional Mentoring. Retrieved from | |
https://newteachercenter.org/. | |
Nicol, D. J., & Macfarlane-Dick, D. (2006). Formative assessment and self-regulated learning: A
model and seven principles of good feedback practice. Studies in Higher Education, | |
31(2), 199-218. https://doi.org/10.1080/03075070600572090 | |
O'Connor, A., & McCurtin, A. (2021). A feedback journey: Employing a constructivist approach to the development of feedback literacy among health professional learners. BMC Medical Education, 21, 486. https://doi.org/10.1186/s12909-021-02914-2
O’Donovan, B., Rust, C., & Price, M. (2016). A scholarly approach to solving the feedback | |
dilemma in practice. Assessment & Evaluation in Higher Education, 41(6), 938-949. | |
https://doi.org/10.1080/02602938.2015.1052774 | |
O'Reilly, M. F., Renzaglia, A., & Lee, S. (1994). An analysis of acquisition, generalization and | |
maintenance of systematic instruction competencies by preservice teachers using | |
behavioral supervision techniques. Education and Training in Mental Retardation and Developmental Disabilities, 29(1), 22-33. https://www.jstor.org/stable/23879183
O'Reilly, M. F., Renzaglia, A., Hutchins, M., Koterba-Buss, L., Clayton, M., Halle, J. W., & Izen, | |
C. (1992). Teaching systematic instruction competencies to special education student | |
teachers: An applied behavioral supervision model. Journal of the Association for | |
Persons with Severe Handicaps, 17(2), 104-111. | |
https://doi.org/10.1177/154079699201700205 | |
Orlikowski, W. J. (1993). CASE tools as organizational change: Investigating incremental and | |
radical changes in systems development. MIS Quarterly, 17(3), 309-340. | |
https://doi.org/10.2307/249774 | |
Rots, I., Aelterman, A., Vlerick, P., & Vermeulen, K. (2007). Teacher education, graduates’ | |
teaching commitment and entrance into the teaching profession. Teaching and Teacher | |
Education, 23(5), 543–556. https://doi.org/10.1016/j.tate.2007.01.012 | |
Pena, C., & Almaguer, I. (2007). Asking the right questions: online mentoring of student teachers. | |
International Journal of Instructions Media, 34(1), 105-113. | |
Pinger, P., Rakoczy, K., Besser, M., & Klieme, E. (2018). Implementation of formative assessment
– effects of quality of programme delivery on students’ mathematics achievement and | |
interest. Assessment in Education: Principles, Policy & Practice, 25(2), 160-182. | |
https://doi.org/10.1080/0969594x.2016.1170665 | |
Price, M., Handley, K., Millar, J. & O’Donovan, B. (2010). Feedback: all that effort, but what is | |
the effect? Assessment & Evaluation in Higher Education, 35(3), 277-289. | |
https://doi.org/10.1080/02602930903541007 | |
Sambell, K. (2011). Rethinking feedback in higher education. ESCalate. | |
Scheeler, M. C., Ruhl, K. L., & McAfee, J. K. (2004). Providing performance feedback to | |
teachers: A review. Teacher Education and Special Education: The Journal of the | |
Teacher Education Division of the Council for Exceptional Children, 27(4), 396-407. | |
Scheeler, M. C., Bruno, K., Grubb, E., & Seavey, T. L. (2009). Generalizing teaching techniques | |
from university to K-12 classrooms: Teaching preservice teachers to use what they learn. | |
Journal of Behavioral Education, 18(3), 189-210. https://doi.org/10.1007/s10864-009-9088-3
Schön, D. A. (1983). The reflective practitioner. Basic Books. | |
Schwartz, C., Walkowiak, T. A., Poling, L., Richardson, K., & Polly, D. (2018). The nature of | |
feedback given to elementary student teachers from university supervisors after | |
observations of mathematics lessons. Mathematics Teacher Education & Development, | |
20(1), 62–85. | |
Schunk, D., & Pajares, F. (2009). Self-efficacy theory. In Handbook of Motivation at School (pp.
35–54). New York: Routledge. | |
Smith, M., & Lowe, C. (2021). DIY assessment feedback: Building engagement, trust and | |
transparency in the feedback process. Journal of University Teaching and Learning | |
Practice, 18(3), 9-14. https://doi.org/10.53761/1.18.3.9 | |
Steelman, L., Levy, P., & Snell, A. (2004). The feedback environment scale: Construct definition, measurement and validation. Educational and Psychological Measurement, 64(1), 165-184.
Sweeney, D. R. (2010). Student-centered coaching: A guide for K-8 coaches and principals. | |
SAGE Publications. | |
The Employers Edge. (2018). Feedback to boost motivation. Retrieved from http://www.theemployersedge.com/providing-feedback/
Tulgar, A. (2019). Four Shades of Feedback: The Effects of Feedback in Practice Teaching on | |
Self-Reflection and Self-Regulation. Alberta Journal of Educational Research, 65(3),
258-277. | |
Van Houten, R. (1980). Learning through feedback. Human Sciences Press. | |
Van Rooij, E. C. M., Fokkens-Bruinsma, M., & Goedhart, M. (2019). Preparing science
undergraduates for a teaching career: Sources of their teacher self-efficacy. The Teacher | |
Educator, 54(3), 270-294. https://doi.org/10.1080/08878730.2019.1606374 | |
Vartuli, S., Bolz, C., & Wilson, C. (2014). A learning combination: Coaching with CLASS and | |
the project approach. Early Childhood Research & Practice, 16(1), 1. | |
Vasquez, C. (2004). “Very carefully managed”: Advice and suggestions in post observation | |
meetings. Linguistics and Education, 15(1-2), 33-58. | |
https://doi.org/10.1016/j.linged.2004.10.004 | |
Woolfolk, A. (1993). Educational psychology. Allyn & Bacon. | |
Yang, M., & Carless, D. (2013). The feedback triangle and the enhancement of dialogic feedback | |
processes. Teaching in Higher Education, 18(3), 285–297. | |
Yin, R. K. (2016). Qualitative research from start to finish (2nd ed.). The Guilford Press.
Zeichner, K. (1996). Teachers as reflective practitioners and the democratization of school reform. | |
In K. Zeichner, S. Melnick, & M. L. Gomez (Eds.), Currents of reform in preservice | |
teacher education (pp. 199-214). Teachers College Press. | |
University faculty's perceptions and practices of student-centered learning in Qatar: Alignment or gap?
Saed Sabah
Department of Educational Sciences, College of Education, Qatar University, Doha, Qatar, and Hashemite University, Zarqa, Jordan
Xiangyun Du
Department of Educational Sciences, College of Education, Qatar University, Doha, Qatar, and UNESCO Center for PBL, Aalborg University, Aalborg, Denmark
Received 10 November 2017; Revised 27 December 2017; Accepted 11 February 2018
Abstract | |
Purpose – Although student-centered learning (SCL) has been encouraged for decades in higher education, to | |
what extent instructors are practicing SCL strategies remains in question. The purpose of this paper is to investigate
a university faculty’s understanding and perceptions of SCL, along with current instructional practices in Qatar. | |
Design/methodology/approach – A mixed-method research design was employed including quantitative | |
data from a survey of faculty reporting their current instructional practices and qualitative data on how these | |
instructors define SCL and perceive their current practices via interviews with 12 instructors. Participants of | |
the study are mainly from the science, technology, engineering and mathematics (STEM) fields.
Findings – Study results show that these instructors have rather inclusive definitions of SCL, which range from | |
lectures to student interactions via problem-based teamwork. However, a gap between the instructors’ perceptions | |
and their actual practices was identified. Although student activities are generally perceived as effective teaching | |
strategies, the interactions observed were mainly in the form of student–content or student-teacher, while | |
student–student interactions were limited. Prevailing assessment methods are summative, while formative | |
assessment is rarely practiced. Faculty attributed this lack of alignment between how SCL could and should | |
be practiced and the reality to external factors, including students’ lack of maturity and motivation due to the | |
Middle Eastern culture, and institutional constraints such as class time and size. | |
Research limitations/implications – The study is limited in a few ways. First, regarding methodological justification, the data collection methods chosen in this study were mainly focused on the faculty's self-reporting. Second,
the limited number of participants restricts this study’s generalizability because the survey was administered | |
in a volunteer-based manner and the limited number of interview participants makes it difficult to establish | |
clear patterns. Third, researching faculty members raises concerns in the given context wherein extensive | |
faculty assessments are regularly conducted. | |
Practical implications – A list of recommendations is provided here as inspiration for institutional support | |
and faculty development activities. First, faculty need deep understanding of SCL through experiences as learners | |
so that they can become true believers and implementers. Second, autonomy is needed for faculty to adopt | |
appropriate assessment methods that are aligned with their pedagogical objectives and delivery methods. Input | |
on how faculty can adapt instructional innovation to tailor it to the local context is very important for its long-term effectiveness (Hora and Ferrare, 2014). Third, an inclusive approach to faculty evaluation by encouraging
faculty from STEM backgrounds to be engaged in research on their instructional practice will not only sustain | |
the practice of innovative pedagogy but will also enrich the research profiles of STEM faculty and their institutes. | |
Journal of Applied Research in Higher Education, Vol. 10 No. 4, 2018, pp. 514-533. Emerald Publishing Limited, ISSN 2050-7003. DOI 10.1108/JARHE-11-2017-0144
© Saed Sabah and Xiangyun Du. Published by Emerald Publishing Limited. This article is published | |
under the Creative Commons Attribution (CC BY 4.0) licence. Anyone may reproduce, distribute, | |
translate and create derivative works of this article (for both commercial and non-commercial
purposes), subject to full attribution to the original publication and authors. The full terms of this | |
licence may be seen at http://creativecommons.org/licences/by/4.0/legalcode | |
The authors would like to thank the participants in the study: the authors’ colleagues, who | |
supported this study. | |
Social implications – The faculty’s understanding and perceptions of implementing student-centered | |
approaches were closely linked to their prior experiences—experiencing SCL as a learner may better shape | |
the understanding and guide the practice of SCL as an instructor. | |
Originality/value – SCL is not a new topic; however, the reality of its practice is constrained to certain social | |
and cultural contexts. This study contributes with original and valuable insights into the gap between | |
ideology and reality in implementation of SCL in a Middle Eastern context. | |
Keywords Qatar, Assessment, Student-centered learning, Instructional practices, STEM faculty | |
Paper type Research paper | |
1. Introduction | |
In general, higher education (HE) faces challenges in providing students with reasoning and | |
critical thinking skills, problem formulation and solving skills, collaborative skills and the | |
competencies required to cope with the complexity and uncertainty of modern professions | |
(Henderson et al., 2010; Seymour and Hewitt, 1997; Martin et al., 2007; Smith et al., 2009). HE | |
research often reports that traditional lecture-centered education does not provide | |
satisfactory solutions to these challenges (Du et al., 2013; Smith et al., 2009), thereby failing | |
to facilitate students’ meaningful learning of their subjects (Henderson et al., 2010). In some | |
cases, it has resulted in a deficit of university graduates from certain fields, in particular, | |
science, technology, engineering and mathematics (STEM) fields (Graham et al., 2013; | |
Seymour and Hewitt, 1997; Watkins and Mazur, 2013). A change in instructional practices is | |
believed to be necessary to provide students with the requisite skills and competencies, and | |
could potentially serve as a retention strategy in these particular fields such as STEM | |
(Graham et al., 2013; Seymour and Hewitt, 1997; Watkins and Mazur, 2013). Therefore, it is | |
essential to innovate the pedagogical methods and practices used in these fields (American | |
Association for the Advancement of Science (AAAS), 2013; Henderson et al., 2010). | |
Instructional change has resulted in a variety of pedagogical reform initiatives that have | |
been encouraged in STEM classroom practices, including active learning, inquiry-based | |
learning, collaborative learning in teams, interactive learning, technology-enhanced learning, | |
and peer instruction. A substantial body of literature has reported research results regarding | |
how these innovative instructional strategies affect student learning (Graham et al., 2013; | |
Henderson et al., 2010; Watkins and Mazur, 2013). Despite a worldwide trend in instructional | |
change toward student-centered learning (SCL), to what extent university instructors are | |
implementing or practicing these strategies and how they perceive this change is still in | |
question. The international literature has reported that lecture remains the prevailing | |
instructional practice in STEM classrooms despite the waves of pedagogical innovation | |
encouraged at an institutional level (Hora and Ferrare, 2014; Froyd et al., 2013; Prince and | |
Felder, 2006; Walczyk and Ramsey, 2003). In addition, STEM faculty may discontinue their practice of certain types of instructional innovation at certain stages of innovation diffusion for various reasons, including institutional challenges such as a heavy workload and large class sizes, and a lack of individual interest (Henderson and Dancy, 2009). Furthermore, the
fidelity of the implementation of SCL approaches is also in question (Borrego et al., 2013). | |
Therefore, this study aims to investigate how faculty who work as instructors in STEM | |
undergraduate programs report their instructional practices and how they perceive the | |
implementation of SCL instructional strategies in their situated contexts. | |
2. Literature review | |
Over the past few decades, a global movement has emerged calling for a new model of learning for the twenty-first century, highlighting several key elements including solving complex problems, communication, collaboration, critical thinking, creativity, responsibility, empathy, and management, among others (NEA, 2010; Scott, 2015). Following this trend, university teaching and learning has transformed from being lecture-based and teacher-centered to focusing more on engaging and enhancing student learning (Barr and Tagg, 1995;
Kolmos et al., 2008; Slavich and Zimbardo, 2012). In the process of this transformation, SCL | |
has become a well-used concept. Defined as an approach that “allows students to shape their | |
own learning paths and places upon them the responsibility to actively participate in making | |
their educational process a meaningful one” (Attard et al., 2010, p. 9), SCL is focused on | |
providing an active-learning environment in flexible curricula with the use of learning | |
outcomes to understand student achievements (pp. 10-12). Rooted in a constructivist approach | |
that moves beyond mere knowledge transmission, such learning is conceived as a process | |
whereby learners search for meaning and generate meaningful knowledge based on prior | |
experiences (Biggs and Tang, 2011; Dewey, 1938). | |
In the STEM fields, instructional practices of instructors are changing from teacher-directed approaches to student-centered approaches to improve the quality of undergraduate education (Justice et al., 2009). A substantial number of studies have
reported the positive effects of a variety of approaches to student-centered pedagogy in | |
STEM HE, such as active learning (Felder et al., 2000; Freeman et al., 2014), small-group | |
learning (Felder et al., 2000; Freeman et al., 2014; Springer et al., 1999; Steinemann, 2003), | |
and inquiry-based pedagogy (Anderson, 2002; Curtis and Ventura-Medina, 2008; Duran | |
and Dökme, 2016; Ketpichainarong et al., 2010; Martin et al., 2007; Simsek and Kabapinar, | |
2010). Furthermore, problem- and project-based pedagogy has been well documented as | |
an effective way to help students not only construct subject knowledge meaningfully, but | |
also develop the skills necessary for many professions, including critical thinking, | |
problem solving, communication, management and collaboration (Bilgin et al., 2015; | |
Du et al., 2013; He et al., 2017; Kolmos et al., 2008; Lehmann et al., 2008; Steinemann, 2003; | |
Zhao et al., 2017). | |
Definitions of these terminologies vary and the term SCL in particular is not always used | |
with consistent meaning. However, a few points of agreement can be summarized (Rogers, | |
2002): who the learners are, the context, the learning activities and the processes. Weimer | |
(2002) identifies five key areas for change in the process of transformation from teacher-centered to learner-centered classrooms: the balance of power, the function of content, the
role of the teacher, the responsibility for learning, and the purpose and process of evaluation. | |
In relation to the practice and implementation of a student-centered approach, Brook (1999) | |
provides a list of guiding principles for the development of constructivist teachers who | |
prioritize SCL strategies in HE. These are: using problems that are relevant to students, | |
constructing learning around principal concepts, eliciting and appreciating students’ | |
perspectives, adjusting the curriculum and syllabus to address students’ prior experience, | |
and linking assessment to learning goals and student learning. | |
A wide range of perspectives has been addressed in previous studies on SCL in HE. Brook | |
(1999), Rogers (2002) and Weimer (2002) provide a synthesis of guiding principles suggesting | |
three dimensions of focus: instructors (how they understand and perceive the instructional | |
innovation they are expected to adopt), student activity and interaction, and assessment. | |
The instructor represents an important and challenging aspect of instructional change, | |
particularly, regarding innovative pedagogy and SCL (Ejiwale, 2012; Kolmos et al., 2008; | |
Weimer, 2002). In a teacher-centered environment, instructors play the dominant role in | |
defining objectives, content, student activities and assessment. Whereas in an SCL | |
environment, instructors facilitate learning via providing opportunities for students to be | |
involved in decision-making regarding goals, content, activities and assessment. | |
Nevertheless, in the reality of instructional practice, instructors face the dilemma of, on | |
the one hand, giving students the freedom to make decisions on their own, and on the other | |
hand, retaining control of classroom activities (Du and Kirkebæk, 2012). In addition, how | |
instructors handle the changes in their relationships with students is a determining factor in | |
the extent to which SCL can be established. In a meta-analysis of student-teacher relationships in a student-centered environment, Cornelius-White (2007) suggests that positive teaching relationship variables, such as empathy, warmth, encouragement and motivation, are positively associated with learner participation, critical thinking, satisfaction,
drop-out prevention, positive motivation and social connection. In their proposal for | |
developing pedagogical change strategies in STEM, Henderson et al. (2010) emphasize that | |
the beliefs and behaviors of individual instructors should be targeted because they are | |
essential to any strategy for changing the classroom practices and environment. In general, | |
the existing literature agrees that for pedagogical change strategy development, it is | |
essential to work with the instructors and to understand their current instructional practices | |
as well as their perceptions of the change. | |
A student-centered approach emphasizes providing students with opportunities to | |
participate and engage in activities while interacting with the subject matter, the teacher | |
and each other. Student responsibility and ownership of their own learning is regarded as | |
essential in facilitating classroom interactions. Self-governance of the interactions can be | |
enhanced through collaborative group work when students are expected to negotiate and | |
reach consensus on how to work and learn together. Instead of meeting an objective set by | |
the instructors, students should take responsibility for organizing learning activities in | |
order to reach goals they themselves set (Du et al., 2013; Weimer, 2002). The function of | |
teaching content lies in aiding students in learning how to learn, rather than in the | |
transmission of factual knowledge (Du and Kirkebæk, 2012). | |
Student-centered instructional strategies and practices require a change of assessment | |
methods. Formative assessment, which refers to assessment methods that are intended to | |
generate feedback on learner performance to improve learning, is often used to facilitate self-regulated learning (Nicol and Macfarlane-Dick, 2006). In their review of formative
assessment, Black and William (1998) summarize the effectiveness of this method in relation | |
to different types of outcomes, educational levels and disciplines. As they emphasize, the | |
essential aspect that defines the success of formative assessment is the quality of the | |
feedback provided to learners, both formally and informally. Furthermore, in formative
assessment, the process of learning through feedback and dialogue between teachers and | |
students and among students is highly accentuated. Various formative assessment methods | |
have been reported as additional or alternative methods to the prevailing summative | |
assessment methods in STEM in order to align assessment constructively with the | |
implementation of SCL (Downey et al., 2006; Prince and Felder, 2006). | |
To plan and implement meaningful initiatives for improving undergraduate instruction, | |
it is important to collect data on the instructors’ instructional practices (Williams et al., | |
2015). Nevertheless, the existing literature has mainly focused on students’ attitudes, | |
performance and feedback on SCL. A limited number of studies have examined the | |
outcomes of faculty development activities that encourage research-based instructional | |
strategies for SCL. These studies report a good level of faculty knowledge and awareness of | |
various alternative instructional strategies in the fields of physics education (Dancy and | |
Henderson, 2010; Henderson et al., 2012) and engineering and science education (Brawner | |
et al., 2002; Borrego et al., 2013; Froyd et al., 2013). However, instructors’ adoption of | |
teaching strategies varies according to individual preferences and beliefs, the contexts of | |
disciplines, and institutional policy (Borrego et al., 2013; Froyd et al., 2013), and their | |
persistence in the adoption and current use of these strategies (Hora and Ferrare, 2014; | |
Henderson and Dancy, 2009; Walter et al., 2016) and their fidelity (how closely the | |
implementation follows its original plan) (Borrego et al., 2013) are still in question. | |
Therefore, there is a need for additional studies addressing instructors’ understanding, | |
beliefs and perceptions about practicing SCL that impact their instructional design for | |
classroom interactions, and how they construct assessment methods to align with their | |
adoption of instructional strategies. Further research should examine how instructors | |
perceive their roles and experiences in the process of instructional change. | |
3. Present study | |
The state of Qatar has the vision of transforming itself into a knowledge-producing economy | |
(General Secretariat for Development Planning, 2008; Rubin, 2012). Accordingly, advancement in | |
the fields of science and technology is a critical goal, as is promoting pedagogical practices that | |
support engagement in science and technology education (Dagher and BouJaoude, 2011). Qatar | |
University (QU) is the country’s foremost institution of HE and aims to become a leader in | |
economic and social development in Qatar. In its strategic plan for 2013–2016 (Qatar University | |
(QU), 2012), the leadership of QU has called for instructional innovation toward SCL by developing | |
“the skills necessary in the 21st century such as leadership, teamwork, communications, | |
problem-solving, and promoting a healthy lifestyle” (QU, 2012, p. 13). It is expected that these | |
initiatives will be implemented at the university level, particularly in the STEM fields. | |
Research on general university instructional practices in Qatar remains sparse, with little | |
information available on current instructional practices and to what extent student-centered | |
teaching and learning strategies are being implemented. In a recent study, the first on | |
university instructional practices in Qatar, Al-Thani et al. (2016) reported that across | |
disciplines, instructors prioritized lecture-based and teacher-centered instructional practices.
For example, most participants stressed lecture and content clarity as the most important and | |
effective practices. In contrast, student–student interaction, the integration of technology and instructional variety received less attention, according to the perceptions of the participants. However, little is known about either actual classroom
practices or the instructors’ perception of SCL, in particular in STEM fields. | |
To develop feasible change strategies that could be applied in the Qatar context with the | |
aim of facilitating innovation in HE in general and STEM education in particular, it is | |
essential to understand current instructional practices and how instructors perceive SCL, as | |
well as what strategies are being implemented (Henderson et al., 2010). Therefore, this study | |
aims to investigate STEM faculty's perceptions and instructional practices of SCL in
Qatar. The purpose is to generate knowledge on the research-based evaluation of STEM | |
faculty’s instructional practices. The study formulates the following research questions: | |
RQ1. What are the instructional practices of STEM faculty in Qatar? | |
RQ2. To what extent are instructors’ current practices student-centered? | |
RQ3. How do STEM faculty perceive SCL, possibilities for implementation and | |
challenges in classroom practice? | |
4. Research methods | |
4.1 Research design | |
Ideally, the study of STEM instructional practices involves the use of multiple techniques. The | |
methods commonly used to investigate university teaching practices include interviews with | |
instructors and students, portfolios written by instructors, surveys of instructors and | |
students, and observations in educational settings (AAAS, 2013). However, in reality, research
conditions limit the choice of data collection methods (Creswell, 2013). Although classroom | |
observation and portfolios are widely practiced in schools and can be a potential method for | |
improving university teaching and learning, these rarely occur in practice except in cases of | |
faculty promotion, evaluation or professional development requests (AAAS, 2013). In addition, | |
peer and protocol-based observations demand significant resources of human labor, materials, | |
equipment and physical conditions, which makes them challenging to implement on a larger | |
scale (Walter et al., 2016). Therefore, a mixed-methods research design combining the | |
strengths of quantitative and qualitative data – surveys and interviews – was employed as | |
the major data generation method in this study (Creswell, 2002). | |
4.2 Participants | |
An open invitation was sent to the entire faculty in the science, engineering, mathematics and | |
health sciences fields, asking them to consider participating on a voluntary basis. A sample of | |
65 faculty members (23.4 percent female and 76.4 percent male) completed the questionnaire. | |
4.3 Data generation methods | |
Survey and instruments. A self-reported questionnaire survey is one of the most efficient | |
ways to gain information due to its accessibility, convenience to administer and relative
time efficiency (AAAS, 2013, p. 7). Despite the common concern that the faculty may | |
inaccurately self-report their teaching practices, recent literature reports that some | |
aspects of instruction can be accurately reported by instructors (Smith et al., 2014); this | |
approach helps to identify instructional practices that are otherwise difficult to observe | |
(Walter et al., 2016). | |
The Postsecondary Instructional Practices Survey (PIPS) (Walter et al., 2016) is a newly | |
developed instrument aimed at investigating university teaching practices cross-disciplinarily | |
from the perspective of instructors. The PIPS was developed on the basis of a conceptual | |
framework constructed from a critical analysis of existing survey instruments (Walter et al., | |
2015), the observation codes of the Teaching Dimensions Observational Protocol (Hora et al., | |
2012), and the Reformed Teaching Observation Protocol (Piburn et al., 2000). The PIPS has | |
been proven to be valid and reliable while providing measurable variables, and results from | |
initial studies have shown that PIPS self-reported data are compatible with the results of | |
several Teaching Dimensions Observational Protocol codes (Walter et al., 2016). | |
The PIPS includes 24 statements about instructional practice, together with demographic questions on items such as gender, rank and academic title. An intuitive,
proportion-based scoring convention is used to calculate the scores. Two models are used | |
for the supporting analysis – a two-factor or five-factor solution. Factors in the five-factor | |
model include: six items for student–student interactions, four items for content delivery, | |
four items for formative assessment, five items for student–content engagement and four | |
items for summative assessment. Factors in the two-factor model include: nine items for | |
instructor-centered practices and 13 items for student-centered practices. The responses | |
from participants were coded as (0) not at all descriptive of my teaching, (1) minimally | |
descriptive of my teaching, (2) somewhat descriptive of my teaching, (3) mostly descriptive | |
of my teaching and (4) very descriptive of my teaching. | |
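As a rough illustration of how such coded responses can be turned into factor scores, the following Python sketch assumes simple item averaging over the 0–4 codes and uses only a hypothetical subset of the item-to-factor mapping; the actual PIPS proportion-based scoring convention is specified by Walter et al. (2016) and may differ in detail.
# Minimal illustrative sketch (not the official PIPS scoring routine): it assumes
# simple averaging of the 0-4 coded responses per factor, using only the subset of
# items listed in Table II; the hypothetical response values are made up.
import pandas as pd

# Rows are respondents; columns are PIPS items; values are the 0 ("not at all
# descriptive of my teaching") to 4 ("very descriptive of my teaching") codes.
responses = pd.DataFrame(
    {"P10": [2, 3], "P12": [3, 2], "P13": [1, 2], "P14": [2, 1],   # student-student interaction
     "P01": [3, 4], "P03": [4, 3], "P05": [3, 3], "P11": [3, 4]},  # content delivery
    index=["instructor_1", "instructor_2"],
)

factors = {
    "student_student_interaction": ["P10", "P12", "P13", "P14"],
    "content_delivery": ["P01", "P03", "P05", "P11"],
}

# Per-respondent factor score = mean of that factor's items; averaging these scores
# across respondents gives a factor "grand mean" of the kind reported in Table II.
factor_scores = pd.DataFrame(
    {name: responses[items].mean(axis=1) for name, items in factors.items()}
)
print(factor_scores)
print(factor_scores.mean())       # grand mean per factor
print(factor_scores.std(ddof=1))  # standard deviation per factor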
In-depth interviews. An interview can provide opportunities to explore teaching practices | |
through interactions with the participants. It can also provide the space for in-depth | |
questions on specific teaching practices as well as perceptions, beliefs, opinions and | |
potentially unexpected findings (Creswell, 2013). During the interviews (interview | |
guidelines see Appendix), participants were asked questions about their understanding of | |
and past experiences with SCL, their perceptions of the effectiveness of practicing SCL in | |
general and in their current environments in particular, what challenges and barriers they | |
had experienced, and what institutional support is needed. | |
4.4 Procedure | |
The questionnaire was sent to all participants in early spring 2017 and was administered by | |
Qualtrics. An explanation of the goals of the survey, namely, to understand their current | |
practices without intention of assessment, was provided to the participants. A pilot test was | |
conducted with several colleagues who were not participants to ensure that the questions
were unambiguous and addressed the goals. | |
A sample of 65 faculty members (23.4 percent female and 76.4 percent male) completed | |
the questionnaire. These were from the schools of sciences, health sciences, pharmacy | |
and engineering. The average HE teaching experience of the participants was 14.5 years. | |
About 15.6 percent of participants were full professors, 39.1 percent were associate | |
professors, 31.3 percent were assistant professors and 14 percent were instructors or | |
lecturers. About 58.6 percent of the participants did not have a leadership role (e.g. head of | |
department, chair of curriculum committee). | |
In total, 12 (4 female and 8 male) of the 65 faculty members who completed the | |
questionnaire responded positively to the individual interview request. The interview | |
participants included a representative range of STEM faculty members by academic title (three professors, three associate professors and six assistant professors) and gender (four female and eight male). Table I shows details of the interview participants' background information.
5. Analyses and findings | |
5.1 Quantitative data analysis and results | |
To answer the first research question, the mean and standard deviation of each item were | |
calculated to identify the practices that best describe STEM faculty teaching in the given context. | |
Table I. Interview participants' background information (note: all names are anonymous), listing each interviewee's gender, academic rank and previous pedagogical experiences. All interviewees had been students in lecture-based learning environments (a few also in problem-based settings); their teaching experience spanned lecture-based, active-learning, inquiry-based, project-based and problem-based environments, in some cases across several countries.
The grand mean for each factor was also calculated. The descriptive statistics for participants’ | |
responses to the PIPS are presented in Table II. | |
The participants reported that the items of factor 2 (F2), content-delivery practices, were mostly descriptive of their teaching (x̄ = 3.14). That is, the items stating that their syllabus contains the specific topics that will be covered in every class session (x̄ = 3.58), they structure the class session to give students good notes (x̄ = 3.18), and they guide students as they listen and take notes (x̄ = 2.89) were mostly descriptive of their content delivery.
The grand mean of student–content engagement (F4) was relatively high (x̄ = 3.07). This means that, for example, instructors frequently ask students to respond to questions during class time (x̄ = 3.49) and frequently structure problems so that students are able to consider multiple approaches to finding a solution.
Table II. The descriptive statistics for participants' responses to the PIPS survey – five-factor model analysis (the table reports the mean and SD for each item; factor grand means are noted below).
Factor 1: student–student interaction (grand mean x̄ = 2.18):
P10. I structure class so that students explore or discuss their understanding of new concepts before formal instruction
P12. I structure class so that students regularly talk with one another about course concepts
P13. I structure class so that students constructively criticize one another's ideas
P14. I structure class so that students discuss the difficulties they have with this subject with other students
Factor 2: content delivery practices (grand mean x̄ = 3.14):
P01. I guide students through major topics as they listen and take notes
P03. My syllabus contains the specific topics that will be covered in every class session
P05. I structure my course with the assumption that most of the students have little useful knowledge of the topics
P11. My class sessions are structured to give students a good set of notes
Factor 3: formative assessment (grand mean x̄ = 2.62):
P06. I use student assessment results to guide the direction of my instruction during the semester
P08. I use student questions and comments to determine the focus and direction of classroom discussion
P18. I give students frequent assignments worth a small portion of their grade
P20. I provide feedback on student assignments without assigning a formal grade
Factor 4: student–content engagement (grand mean x̄ = 3.07):
P02. I design activities that connect course content to my students' lives and future work
P07. I frequently ask students to respond to questions during class time
P09. I have students use a variety of means (models, drawings, graphs, symbols, simulations, etc.) to represent phenomena
P16. I structure problems so that students consider multiple approaches to finding a solution
P17. I provide time for students to reflect about the processes they use to solve problems
Factor 5: summative assessment (grand mean x̄ = 2.35; excluding P24, x̄ = 2.84):
P21. My test questions focus on important facts and definitions from the course
P22. My test questions require students to apply course concepts to unfamiliar situations
P23. My test questions contain well-defined problems with one correct solution
P24. I adjust student scores (e.g. curve) when necessary to reflect a proper distribution of grades
As to the student–student interaction factor (F1), the grand mean (x̄ = 2.18) was relatively low compared to the other factors. The item means ranged from 1.9 to 2.51, with
the maximum possible value being 4. Compared with the other items of this factor, item P13 ("I structure class so that students constructively criticize one another's ideas") had the lowest mean (x̄ = 1.9), which indicates that this practice is somewhat, but not mostly or very much, descriptive of instructors' practices. The item concerning structuring the class so that students discuss the difficulties they have with the subject matter with other students also had a low mean (x̄ = 2.06).
The formative assessment factor (F3) also had a relatively low grand mean (x̄ = 2.62). The mean of item P20 was 1.82, indicating that providing feedback on student assignments without assigning a formal grade was not very descriptive of QU instructors' practices. The means for the rest of the items ranged from 2.7 to 2.98. Using student comments and questions to determine the direction of classroom discussions (x̄ = 2.95) and using student assessment results to guide the direction of their instruction (x̄ = 2.98) were mostly descriptive of QU instructors' practices, as reported by participants.
The summative assessment factor (F5) had a low grand mean (x̄ = 2.35). This relatively
low mean was greatly impacted by item P24 (“I adjust student scores [e.g. curve] when | |
necessary to reflect a proper distribution of the grades”). In the given context, instructors are | |
not allowed to adjust student scores, so the result of this item reflects university policy | |
rather than individual instructor’s preference. An analysis excluding item P24 shows a | |
different picture: the mean of the summative assessment factor without item P24 becomes | |
2.84. Thus, the student–student interaction factor and the formative assessment factor | |
represent the lowest means in this study. | |
To answer the second research question, a paired samples t-test was conducted to compare the mean of the student-centered items (P02, P04, P06-10, P12-16, P18-20) with the mean of the instructor-centered items (P01, P03, P05, P11, P17, P21-24). The mean of the student-centered factors is 2.69 and the mean of the instructor-centered factors is 2.76. The results of the paired samples t-test found no statistically significant difference (α = 0.05) between the student-centered mean and the instructor-centered mean (t = −1.00, df = 64). However, when item 24 is excluded, the mean of the instructor-centered items becomes 2.99. A significant difference (α = 0.05) was found between the student-centered mean and the new (excluding item 24) instructor-centered mean (t = −4.15, df = 64).
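For illustration only, a paired-samples comparison of this kind can be sketched in Python with SciPy; the per-instructor scores below are simulated around the reported means and are not the study's data.
# Illustrative sketch of the paired-samples t-test (hypothetical scores, not the
# study's data): each instructor contributes a student-centered mean and an
# instructor-centered mean, so the comparison is within-subjects (df = n - 1).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 65  # number of survey respondents reported in the study

# Simulated per-instructor means on the 0-4 PIPS response scale, centered on the
# reported group means (2.69 student-centered vs 2.99 instructor-centered,
# excluding item P24).
student_centered = rng.normal(loc=2.69, scale=0.5, size=n).clip(0, 4)
instructor_centered = rng.normal(loc=2.99, scale=0.5, size=n).clip(0, 4)

t_stat, p_value = stats.ttest_rel(student_centered, instructor_centered)
print(f"t = {t_stat:.2f}, df = {n - 1}, p = {p_value:.4f}")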
An alignment was identified between the results of the five-factor model analysis and the | |
two-factor model analysis. Quantitative analysis results did not show a correlation between | |
instructional practices and demographic factors such as academic rank or years of teaching. | |
However, the results identified significant differences in using student-centered | |
instructional practices according to the gender of the participant. Based on the data | |
reported by participants, the mean of using student-centered instructional practices was | |
2.81 for male participants and 2.37 for female participants. A one-way ANOVA found a | |
statistically significant difference (α = 0.05) between the student-centered mean of male participants and that of female participants (F = 7.64, p = 0.008).
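Similarly, a minimal sketch of the gender comparison with simulated scores (group sizes approximated from the reported percentages) is shown below; it is not the authors' analysis script.
# Illustrative sketch of the one-way ANOVA by gender (simulated scores, not the
# study's data); with two groups, the F statistic equals the square of the
# corresponding independent-samples t statistic.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
male = rng.normal(loc=2.81, scale=0.5, size=50).clip(0, 4)    # ~76 percent of 65
female = rng.normal(loc=2.37, scale=0.5, size=15).clip(0, 4)  # ~23 percent of 65

f_stat, p_value = stats.f_oneway(male, female)
print(f"F = {f_stat:.2f}, p = {p_value:.3f}")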
5.2 Qualitative data analysis and results | |
The qualitative analysis provides answers to the third research question. All interviews | |
were transcribed before being coded and analyzed. The analysis used an integrated | |
approach combining guiding principles on SCL by Brook (1999), Rogers (2002) and | |
Weimer (2002), and Kvale and Brinkmann’s (2009) meaning condensation method. The | |
analysis identified emerging themes from instructors' accounts of their opinions,
experiences and reflections. | |
Instructors’ definitions and perceptions of their roles in SCL. Although all interviewed | |
instructors believed they were using SCL strategies in their classrooms, they defined the | |
term SCL in various ways. Three categories of definitions were identified; these are | |
explained below. Interview data also found a consistency between instructors’ definitions | |
and their perceptions of their roles in an SCL environment: | |
• Category 1: there were three instructors, one professor and two assistant professors, all male, who believed lecturing to be the best way of teaching and learning.
According to them, a good lecturer is keen to motivate and encourage students to be | |
free thinkers. When students choose to enter a university, they should be sufficiently | |
mature and willing to work hard enough to progress through their education. | |
Therefore, the university “should be student-centric by definition” (Burhan). This | |
definition was supported by the following remark: | |
I believe that in our university every instructor is doing SCL in their own way […] but instead of | |
standing there reading slides, I think it makes it more student-centered by providing an interesting | |
lecture so that when they leave the room you will hear them say, “Wow, this is inspiring and | |
interesting.” (Mohammad) | |
All three of the instructors interviewed conceived of their role as to “inspire and attract | |
students.” As Abdullah commented: | |
It is the responsibility of the instructors to find a way to bring in highly interesting lectures to make | |
students interested […] to do that, we should prioritize research, so we have something really | |
interesting to bring to the class. | |
• Category 2: instructors in this category included one female associate professor, one male assistant professor, two male associate professors and one male professor. They
believed that in an SCL environment, the instructor should provide activities for | |
students to learn hands-on skills and relate theories to certain practices, and that | |
students should acquire deep knowledge in the field by working together actively on | |
classroom activities. As Ihab commented, “[I]t is so boring to just fill the class with | |
me talking and lecturing. It is fun to plan some activities so students can work in a | |
team so that they can practice the theories; students like these [activities].” In such an | |
environment, the instructor should play the role of “providing” activities and | |
“guiding” students to learn the requested, relevant knowledge through these | |
activities, as most of them suggested. | |
• Category 3: this category included two female assistant professors, one female professor and one male assistant professor. They believed students should work in
small groups, with no more than ten people per team, on certain targets, such as | |
solving a problem. Students should be responsible for organizing study activities and | |
should make decisions on their own to prepare for the requirements of their future | |
professions. They should also be allowed to make mistakes and should receive help | |
with reflecting on these mistakes in order to improve. As Faris commented: | |
I did not like my own student experiences which were filled with lectures and lab work, | |
I appreciated my past experienced of working in a more student-centered learning environment, | |
which offered me tools to provide what I think as better learning environment now to my students. | |
These four instructors used a few different metaphors to describe their roles: “leaders” – “leading | |
students to work towards their targets” (Sara and Iman), “observers” – “observing students from | |
a distance and only interfering when they got off-track” (Faris), and “facilitators” – “having | |
patience when students made mistakes” (Faris), “providing rich resources to students in need of | |
help and redirecting students when they were in trouble” (Sara and Iman), “assisting students to | |
be able to make their own decisions on learning goals, what to learn and how to learn it, and | |
critically evaluate and reflect on their own learning” (Duaa). | |
The interview data did not reveal any patterns in teachers’ definitions and perceptions | |
according to their academic ranking or gender. However, past experiences with SCL seemed | |
to make a difference in their understanding and choice of strategies. For example, | |
participants from category 1 mainly experienced lectures as the major source of learning | |
and form of teaching in their past student and teaching experiences. Those from category 2 | |
experienced different types of SCL environments due to their previous work experiences but | |
not during their student experiences. Two participants from category 3 experienced SCL in | |
the form of problem-based learning (PBL) in their past student experiences, and the other | |
two participants had experiences with SCL both as learners and as instructors prior to their | |
current jobs. A participant’s past experiences, particularly as a learner, seem to have a close | |
link to their current instructional practice. As Sara remarked: | |
Having experienced the Problem-Based Learning in my college time, I truly it is the best way to | |
learn. Working in team offered us great opportunities to help each other and support each other. | |
This means a lot in particular for us female in Arabic culture. We never went out to talk with others | |
before and in such a environment we learned how to interact with others and how to behave | |
professionally […] we increased our self-confidence and it was very empowering. | |
Although all three groups mentioned that students should take responsibility for their own | |
learning, when asked to what level students should be involved in deciding what to learn | |
and how to learn it, and even how to assess what they have learned, only one instructor (Ali) | |
said it would be ideal to involve students in these decisions. However, he had neither | |
experienced this himself nor had he observed any such practice in his immediate | |
environment. Out of the 12 interviewees, 10 believed that instructors should decide which | |
activities to provide, what materials to use and how to structure student activity time and | |
form, and should also ensure students reach “the correct” answers. | |
While the data are too broad to draw any strong conclusions, the majority of the | |
classroom activities that the interviewed instructors exemplified focused on students | |
working in groups to fulfill an assignment designed by the instructor or students answering | |
questions from the instructor in a teacher-student one-to-one form. The roles described by | |
all the instructors involved offering directions and structures. As most of the instructors
mentioned, given the time pressure to deliver all the required content for their courses, they | |
had to ensure students progressed through the mandated learning checkpoints. | |
Assessment. The interviewed instructors agreed that assessment played an essential role | |
in evaluating student learning. One instructor said, an exam “is the best way to engage them | |
to learn because they work so hard just before it” (Ibrahim). With the exception of one | |
instructor, the respondents gave multiple-choice questions plus short-answer questions as the | |
major forms of assessment they used. However, their opinions on what should be included in | |
and what should be the focus of the assessment diverged. The instructors provided examples | |
that included: "To prepare [students] for their future profession, exams in universities should
focus on lots of hand-on skills” (Alia); “More writing skills are needed for the exam” (Amin); | |
and “Students need to be posed exams that can question their thinking skills” (Faris). | |
Two major reasons for the choice of assessment were provided. First, the assessment | |
committee within a college or across colleges defined the assessments as exams for some | |
undergraduate courses, particularly general courses. This limited the options for instructors | |
to design exams different from the common exams used in these classes. Second, when instructors did have the freedom to design exams for their courses, it was most convenient to
use assessment forms that can “examine the knowledge students have mastered” and are | |
the “least time-consuming” for grading purposes, as 8 out of the 12 interview participants | |
expressed. As one participant said, “It takes a few hours to grade multiple-choice question | |
exams. With the busy schedule we have, you don’t want to spend several days to grade and | |
provide feedback for a few hundred essays” (Ihab). | |
Two of the interviewed instructors (Faris and Duaa) expressed their views on how | |
formative assessment should be further enhanced in order to better facilitate SCL, although only one of them had enhanced their assessments in daily practice. As Duaa commented:
Real SCL should involve students not only in deciding on what activities they take in the classroom, | |
but also in defining assessment methods, but I can see the students are shocked when I invite them | |
to give opinions on how they should be assessed […] it will take more time before more people | |
understand that involving students in defining assessment is to motivate them to be more | |
responsible instead of cheating. | |
Given this challenge, this instructor mainly relied in practice on asking students to identify | |
and structure their own projects and problems. | |
Challenges. The majority of the instructors believed that students are the most | |
challenging factor in implementing ideal SCL in the given context. A major reason cited | |
for this is the Arabic culture. Out of the 12 interviewed instructors, 11 believed that most | |
students were raised in Arabic families deeply rooted in Middle Eastern culture, where | |
family plays an important role in one’s daily life, meaning that most teenagers do not have | |
opportunities to live alone and make decisions independently. In addition, their high | |
school experiences did not help them become independent learners, as in that setting they | |
are used to lectures, completing assigned tasks without asking any "why" questions, and sitting exams that are mostly in the form of multiple-choice questions testing their memories. Students are familiar with being provided with information and instruction and having their time arranged, and they even prefer it that way. As an
instructor said, “This is how the students grow up; they are used to it and they cannot take | |
responsibilities on their own. They are not motivated to do things independently, no | |
matter how the instructor works hard to push them, they are not really ready for a true | |
SCL” (Alia). | |
Large classroom sizes were identified as another major challenge for implementing | |
student–student activities because the students easily slip into a chaotic and “out of control” | |
mode, according to some teachers. Interestingly, this was used as an argument for “offering a | |
really interesting lecture as an effective approach to provide SCL,” as Abdullah commented. | |
Finally, the busy schedule of university faculty remains a factor limiting what they do:
“if we don’t have so much teaching load, we may have more time to do what could have | |
been more student-centered strategies such as letting students identify problems and | |
learning needs on their own” (Ali). Although teaching plays an important role in the | |
appraisal system at QU, research products, such as publications, remain the major tool to | |
evaluate faculty performance. Ibrahim mentioned “when we apply for promotion, which is | |
particularly crucial for assistant professors, all what is to be evaluated is the publication | |
in one’s own field, as long as we can prove we are able to teach, it is not highly critical how | |
we teach.” | |
Support needed. Three participants expressed their desire for an institutionalized | |
approach to changing the assessment system, allowing for more faculty autonomy to design | |
assessment methods that are appropriate for their courses. Most of the suggestions for | |
support referred to actions focusing on faculty and students. In total, 11 participants | |
suggested more workshops and training sessions for faculty to gain the necessary skills to | |
facilitate SCL. Five participants suggested student tutoring programs to help first-year | |
undergraduate students learn personal responsibility and to “grow up by following | |
suggestions from experienced students” (Faris). One participant even suggested that | |
attention to pedagogy should be reduced for now because “We give too much attention to | |
the students, nearly like spoon-feeding, worrying too much about whether they are happy or | |
not in studying […] students should stand on their own feet, and sometimes they learn by | |
being thrown into the deep sea” (Burhan). | |
6. Discussion | |
In this section, we compare the qualitative data findings and the quantitative study results | |
and discuss them in relation to the three dimensions of focus in SCL previously summarized in | |
this paper: instructors’ perceptions and roles, student activity and interaction and assessment. | |
This is followed by a discussion of STEM instructors’ views on challenges to implementation. | |
6.1 STEM instructors' understanding and perceptions of SCL
Improving the quality of teaching and learning in the STEM fields necessitates exploring | |
the conceptions that faculty instructors hold regarding the learning environment and the | |
context of teaching since teaching approaches are strongly influenced by the underlying | |
beliefs of the teacher (Kember, 1997). The participants in this study hold different beliefs | |
about and attitudes toward SCL strategies. Connections can be identified between the | |
participants’ understandings and perceptions of SCL and their prior experiences with it. | |
Those who had experienced SCL as learners tended to make more of an effort to implement | |
the strategies effectively in their own teaching practice. This finding echoes previous | |
studies suggesting that in order to maximize their capability of facilitating PBL, faculty
should be provided with opportunities to experience PBL as learners (Kolmos et al., 2008). | |
Comparing results from the quantitative and qualitative data, this study identifies gaps | |
between what the instructors consider to be SCL and what they actually practice. | |
As suggested by Paris and Combs (2006), the broad and wide-ranging definitions of SCL | |
legitimize the instructors’ actual practices. This gap can serve as an alert when a large-scale | |
change initiative is being implemented in the given context. As Henderson et al. (2011) | |
note, awareness and knowledge of SCL strategies cannot guarantee their actual practice. | |
6.2 Student activity and interaction | |
This study reported that instructors have a general awareness of using student-centered | |
strategies. Student activities are regarded as essential in instructional practices. | |
Nevertheless, this study also shows that, in the given context, most classroom | |
interactions are in the form of student–content and student-teacher interactions whereas | |
student–student interactions remain limited. In practice, a generally low level of SCL can be | |
concluded, according to the PIPS instrument (Walter et al., 2016) and the definition of SCL in | |
previous studies (Brook, 1999; Rogers, 2002; Weimer, 2002). Student interaction with the | |
content and instructor may be directly related to the common concept of instruction and | |
may reflect a lecture-centered pedagogic approach. This finding is in line with the report | |
from a previous study showing that instructors in Qatar tend to focus on content delivered | |
through lectures as an efficient way of teaching (Al-Thani et al., 2016). Previous studies | |
(Borrego et al., 2013; Henderson and Dancy, 2009; Walter et al., 2016) also report that the | |
levels of implementing instructional practices vary according to different aspects; for | |
example, STEM faculties reported limited use of certain strategies such as group work and | |
solving problems collaboratively in daily practice despite their high level of knowledge and | |
awareness. Instructors' lack of professional vision regarding collaborative group work can lead to a lack of practice (Modell and Modell, 2017). An often-reported reason is that instructors
give priority to content delivery due to limited class time (Hora and Ferrare, 2014; Walter | |
et al., 2016). Another explanation may be instructors’ lack of confidence in letting students | |
take full responsibility for organizing their own learning activities outside of instructors’ | |
control (Du and Kirkebæk, 2012). | |
Student–student interaction received relatively less attention and consideration from the | |
participants in this study. Previous studies have found that the length of classes and class | |
size were often the most important barrier for the implementation of student-centered | |
instructional practices (Froyd et al., 2013). In the context of this study, this may be one of | |
the factors limiting the possibility of using student interaction in the classroom. In the | |
undergraduate programs, the length of classes is 1 h and 15 minutes, which is counted as a | |
two-study-hour class. This limits instructors’ confidence in their ability to deliver heavy | |
curriculum content while also providing opportunities to engage students with interactive | |
activities. Another possible reason is bias in the instructors' knowledge regarding SCL
strategies; some instructors believe it is sufficient to deliver SCL by simply asking students | |
to do something that is different from lecturing (Paris and Combs, 2006; Shu-Hui and Smith, | |
2008). Linking the instructors' definitions of SCL to their perceived teaching roles, as described in the interviews, the instructors also lack the
belief that interactive student activities can lead to actual learning. Participants consider it | |
important that instructors maintain control of classroom activities. For example, Borrego | |
et al. (2013) found a strong correlation between instructors’ beliefs regarding problem | |
solving and the time students spent on collaborative activities, such as discussing problems. | |
6.3 Unaligned assessment | |
Although the participants demonstrated an awareness of SCL in general and willingness to | |
implement certain SCL strategies, they reported limited critical reflection on assessment | |
systems in the given context. Their limited understanding and practice of formative | |
assessment is an impediment to aligning instruction with assessment and, thus, to practicing SCL effectively. Instructional innovation demands changes not only in classroom practices
but also, more importantly, in assessment methods. Williams et al. (2015) noted that | |
formative assessment is a factor that is often ignored or forgotten, even by many of the | |
researchers who have developed instruments to describe instructional practices. This study | |
similarly found that the summative-oriented prevailing assessment methods at the | |
university level remain unchallenged by the instructors. This may be due to their lack of | |
knowledge and experience of formative assessment, or due at least in part to the | |
convenience of using what they are asked to use as well as what they are accustomed to. Changing
teaching methods without a constructive alignment with assessment methods will limit the | |
effectiveness of any instructional innovation (Biggs and Tang, 2011). | |
6.4 Factors that make a difference | |
Previous studies (Dancy and Henderson, 2007, 2010; Froyd et al., 2013; Henderson | |
and Dancy, 2009; Henderson et al., 2012) have reported that a faculty member’s use of | |
student-centered strategies is often related to demographic factors such as gender, academic | |
rank and years of teaching. The results of this study only identified a correlation between | |
instructional practices and gender. In contrast to the findings of previous studies, namely, | |
that female instructors tend to use student-centered methods more often than male | |
instructors and that younger instructors tend to show more interest in adopting new | |
pedagogical initiatives, the quantitative data of this study showed that male participants reported
higher levels of employing student-centered approaches than female participants, but found | |
no patterns regarding academic rank and years of teaching. A major reason may be the | |
small number of participants in this study. A possible reason for the gender difference may | |
be the imbalanced gender ratio among the overall participants in this study (the proportion | |
of female participants was 23.4 percent). Nevertheless, qualitative data did not identify any | |
patterns due to gender and academic rank, but rather, identified a connection between the | |
instructor’s prior experience with SCL and their understanding, perception and practices, as | |
previously discussed. | |
6.5 Challenges | |
Two categories of instructor concerns and barriers to their sustainable use of instructional | |
innovation were identified. Students’ lack of maturity, motivation and responsibility was | |
considered the major challenge by most of the interviewed participants, except for those | |
who had experienced SCL as a student. Regarding students as the source of the problem and | |
blaming students for their own poor performance can be seen as another symptom associated with a lecturer-centered approach.
Another major challenge is institutional constraints such as the insufficiency of | |
classroom time. Instructors tend to have different opinions regarding the amount of time it | |
takes to include interactive student–student activities. Large class size is often a barrier for | |
instructors hoping to use interactive student–student activities. Female faculty members | |
and younger faculty members have been found to have a higher rate of innovative instruction use
and continuation. | |
6.6 Recommendations | |
As previous studies (Froyd et al., 2013) have suggested, when an instructional strategy is | |
adopted at a low level, it means that it is either not mature or will never achieve full | |
adoption. Institutionalized faculty development and support are essential for the further | |
implementation of innovative instructional strategies and the persistence and continuation | |
of the implementation, as Dancy and Henderson (2007) pointed out, while institutional | |
barriers can limit instructional innovations when structures have been set up to function | |
well with traditional instruction. The following list of recommendations is provided as | |
inspiration for institutional support and faculty development activities: | |
• First, faculty members need to develop a deep understanding of SCL through experiences as learners so that they can become true believers and implementers.
• Second, autonomy is needed for faculty to adopt appropriate assessment methods that are aligned with their pedagogical objectives and delivery methods. Input on how faculty can adapt instructional innovation to tailor it to the local context is very important for its long-term effectiveness (Hora and Ferrare, 2014).
• Third, an inclusive approach to faculty evaluation that encourages faculty from STEM backgrounds to engage in research on their instructional practice will not only sustain the practice of innovative pedagogy but will also enrich the research profiles of STEM faculty and their institutes.
7. Conclusion | |
This study examined university STEM instructors’ understanding and perceptions of SCL | |
as well as their self-reported current practices. Results of the study provide insights on how | |
institutional strategies of instructional change are continually practiced. The study | |
identified a lack of alignment between instructors’ perceptions and their actual practices of | |
SCL. Despite agreement on perceiving SCL as an effective teaching strategy, the instructors’ | |
actual practices prioritize content delivery, the teachers’ role in classroom control, and | |
defining student learning activities as well as summative assessment. Student–student | |
interactions and formative assessment are limited. The participants tended to blame the | |
limited use of SCL on the lack of motivation and readiness among students and on | |
institutional constraints. Another perspective to explain this gap may be the diverse yet | |
inclusive definitions of SCL espoused by faculty, which tend to legitimate their practices, | |
reflecting a rather low level of implementation compared to the literature. This study also | |
suggests that faculty’s understanding and perceptions of implementing student-centered | |
approaches were closely linked to their prior experiences – experiencing SCL as a learner | |
may better shape the understanding and guide the practice of SCL as an instructor. | |
Thereafter, recommendations are provided for faculty development activities at an | |
institutional level for sustainable instructional innovation. | |
The study has a few limitations. First, regarding methodological justification, the data collection methods chosen in this study relied mainly on the faculty's self-reporting. Although such methods are frequently employed for studying faculty beliefs, perceptions and instructional practices (Borrego et al., 2013), data from other sources, such as
observation, may offer information from new perspectives for instructional development | |
(Henry et al., 2007). Second, the limited number of participants restricts this study’s | |
generalizability because the survey was administered on a voluntary basis and
the limited number of interview participants makes it difficult to establish clear patterns. | |
Third, researching faculty members raises concerns in the given context, wherein | |
extensive faculty assessments are regularly conducted. Although special considerations | |
regarding ethical concerns were taken in this study – for example, participants were | |
provided with a clear explanation of the goals and consequences of the study and | |
were shown that it had no relation to the university’s annual faculty performance | |
assessment – the potential sensitivity may have caused a certain amount of reservation | |
among participants regarding sharing further information; this may have limited the | |
results of the study. | |
In conclusion, the results reported in this paper provide a first impression of the present | |
instructional practices in the STEM field in the context of Qatar. Findings of the study, | |
although limited to the given context, may have implications for other countries in the Gulf | |
Region and Arabic-speaking contexts, and potentially even broader contexts, since
instructional change toward SCL in STEM classrooms remains a general challenge | |
worldwide (Hora and Ferrare, 2014; Froyd et al., 2013). The results imply that more attention | |
should be given to faculty development programs to enhance instructor awareness, | |
knowledge and skills related to student–student interaction and formative assessment. This
study contributes to further instructional change implementation by introducing a roadmap | |
toward change on broader levels, such as strategies of institutional change for instructional | |
innovation, as well as toward the establishment of a research-based and evidence-based | |
approach to faculty development and institutional change. | |
References | |
Al-Thani, A.M., Al-Meghaissib, L.A.A.A. and Nosair, M.R.A.A. (2016), “Faculty members’ views of | |
effective teaching: a case study of Qatar University”, European Journal of Education Studies, | |
Vol. 2 No. 8, pp. 109-139. | |
American Association for the Advancement of Science (AAAS) (2013), “Describing and measuring | |
STEM teaching practices: a report from a national meeting on the measurement of | |
undergraduate science, technology, engineering, and mathematics (STEM) teaching”, American | |
Association for the Advancement of Science, Washington, DC, available at: http://ccliconference. | |
org/files/2013/11/Measuring-STEM-Teaching-Practices.pdf (accessed November 15, 2016).
Anderson, R.D. (2002), “Reforming science teaching: what research says about inquiry”, Journal of | |
Science Teacher Education, Vol. 13 No. 1, pp. 1-12. | |
Attard, A., Di Loio, E., Geven, K. and Santa, R. (2010), Student Centered Learning: An Insight into | |
Theory and Practice, Partos Timisoara, Bucharest. | |
Barr, R.B. and Tagg, J. (1995), “From teaching to learning: a new paradigm for undergraduate | |
education”, Change: The Magazine of Higher Learning, Vol. 27 No. 6, pp. 12-26. | |
Biggs, J.B. and Tang, C. (2011), Teaching for Quality Learning at University: What the Student Does, | |
McGraw-Hill Education, Berkshire. | |
Bilgin, I., Karakuyu, Y. and Ay, Y. (2015), “The effects of project-based learning on undergraduate | |
students’ achievement and self-efficacy beliefs towards science teaching”, Eurasia Journal of | |
Mathematics, Science & Technology Education, Vol. 11 No. 3, pp. 469-477. | |
Black, P. and William, D. (1998), “Assessment and classroom learning”, Assessment in Education: | |
Principles, Policy & Practice, Vol. 5 No. 1, pp. 7-74. | |
Brawner, C.E., Felder, R.M., Allen, R. and Brent, R. (2002), “A survey of faculty teaching practices and | |
involvement in faculty development activities”, Journal of Engineering Education, Vol. 91 No. 4, | |
p. 393. | |
Borrego, M., Froyd, J.E., Henderson, C., Cutler, S. and Prince, M. (2013), “Influence of engineering | |
instructors’ teaching and learning beliefs on pedagogies in engineering science courses”, | |
International Journal of Engineering Education, Vol. 29 No. 6, pp. 1456-1471. | |
Brook, J.G. (1999), In Search of Understanding: The Case for Constructivist Classrooms, Association for | |
Supervision & Curriculum Development, Alexandria. | |
Cornelius-White, J. (2007), "Learner-centered teacher-student relationships are effective: a meta-analysis", Review of Educational Research, Vol. 77 No. 1, pp. 113-143.
Creswell, J.W. (2002), Educational Research: Planning, Conducting, and Evaluating Quantitative and | |
Qualitative Research, Pearson Education, Upper Saddle River, NJ. | |
Creswell, J.W. (2013), Qualitative Inquiry and Research Design: Choosing among Five Approaches, Sage. | |
Curtis, R. and Ventura-Medina, E. (2008), An Enquiry-Based Chemical Engineering Design Project for | |
First-Year Students, University of Manchester, Centre for Excellence in Enquiry-Based | |
Learning, Manchester. | |
Dagher, Z. and BouJaoude, S. (2011), “Science education in Arab states: bright future or status quo?”, | |
Studies in Science Education, Vol. 47, pp. 73-101. | |
Dancy, M. and Henderson, C. (2007), “Framework for articulating instructional practices and | |
conceptions”, Physical Review Special Topics: Physics Education Research, Vol. 3 No. 1, pp. 1-12. | |
Dancy, M. and Henderson, C. (2010), “Pedagogical practices and instructional change of physics | |
faculty”, American Journal of Physics, Physics, Vol. 78 No. 10, pp. 1056-1063. | |
Dewey, J. (1938), Experience and Education, Collier and Kappa Delta Phi, New York, NY. | |
Downey, G.L., Lucena, J.C., Moskal, B.M., Parkhurst, R., Bigley, T., Hays, C. and Lehr, J.L. (2006), “The | |
globally competent engineer: working effectively with people who define problems differently”, | |
Journal of Engineering Education, Vol. 95 No. 2, pp. 107-122. | |
Du, X.Y. and Kirkebæk, M.J. (2012), “Contextualizing task-based PBL”, Exploring Task-Based PBL in | |
Chinese Teaching and Learning, pp. 172-185. | |
Du, X.Y., Su, L. and Liu, J. (2013), “Developing sustainability curricula using the PBL method in a | |
Chinese context”, Journal of Cleaner Production, Vol. 61 No. 15, pp. 80-88. | |
Duran, M. and Dökme, İ. (2016), “The effect of the inquiry-based learning approach on students’ critical-thinking skills”, Eurasia Journal of Mathematics, Science & Technology Education, Vol. 12 No. 12.
Ejiwale, J.A. (2012), “Facilitating teaching and learning across STEM fields”, Journal of STEM | |
Education: Innovations and Research, Vol. 13 No. 3, pp. 87-94. | |
Felder, R.M., Woods, D.R., Stice, J.E. and Rugarcia, A. (2000), “The future of engineering education II. | |
Teaching methods that work”, Chemical Engineering Education, Vol. 34 No. 1, pp. 26-39. | |
Freeman, S., Eddy, S.L., McDonough, M., Smith, M.K., Okoroafor, N., Jordt, H. and Wenderoth, M.P. | |
(2014), “Active learning increases student performance in science, engineering, and | |
mathematics”, Proceedings of the National Academy of Sciences, Vol. 111 No. 23, pp. 8410-8415. | |
Froyd, J., Borrego, M., Cutler, S., Henderson, C. and Prince, M. (2013), “Estimates of use of | |
research-based instructional strategies in core electrical or computer engineering courses”, IEEE | |
Transactions on Education, Vol. 56 No. 4, pp. 393-399. | |
General Secretariat for Development Planning (2008), Qatar National Vision 2030, General Secretariat | |
for Development Planning, Doha, available at: http://qatarus.com/documents/qatar-nationalvision-2030/ (accessed November 15, 2016). | |
Graham, M.J., Frederick, J., Byars-Winston, A., Hunter, A.B. and Handelsman, J. (2013), “Increasing | |
persistence of college students in STEM”, Science, Vol. 341 No. 6153, pp. 1455-1456. | |
He, Y., Du, X., Toft, E., Zhang, X., Qu, B., Shi, J. and Zhang, H. (2017), “A comparison between the | |
effectiveness of PBL and LBL on improving problem-solving abilities of medical students using | |
questioning”, Innovations in Education and Teaching International, Vol. 55 No. 1, pp. 44-54, | |
available at: https://doi.org/10.1080/14703297.2017.1290539 | |
Henderson, C. and Dancy, M. (2009), “The impact of physics education research on the teaching of | |
introductory quantitative physics in the United States”, Physical Review Special Topics: Physics | |
Education Research, Vol. 5 No. 2, pp. 1-15. | |
Henderson, C., Beach, A. and Finkelstein, N. (2011), “Facilitating change in undergraduate STEM | |
instructional practices: an analytic review of the literature”, Journal of Research in Science | |
Teaching, Vol. 48 No. 8, pp. 952-984. | |
Henderson, C., Dancy, M. and Niewiadomska-Bugaj, M. (2012), “The use of research-based instructional | |
strategies in introductory physics: where do faculty leave the innovation-decision process?”, | |
Physical Review Special Topics – Physics Education Research, Vol. 8 No. 2, pp. 1-9. | |
Henderson, C., Finkelstein, N. and Beach, A. (2010), “Beyond dissemination in college science teaching: | |
an introduction to four core change strategies”, Journal of College Science Teaching, Vol. 39 No. 5, | |
pp. 18-25. | |
Henry, M.A., Murray, K.S. and Phillips, K.A. (2007), Meeting the Challenge of STEM Classroom | |
Observation in Evaluating Teacher Development Projects: A Comparison of Two Widely Used | |
Instruments, Henry Consulting, St Louis, MA. | |
Hora, M.T. and Ferrare, J.J. (2014), “Remeasuring postsecondary teaching: how singular categories of | |
instruction obscure the multiple dimensions of classroom practice”, Journal of College Science | |
Teaching, Vol. 43 No. 3, pp. 36-41. | |
Hora, M.T., Oleson, A. and Ferrare, J.J. (2012), Teaching Dimensions Observation Protocol (TDOP) | |
User’s Manual, Wisconsin Center for Education Research, University of Wisconsin–Madison, | |
Madison, WI. | |
Justice, C., Rice, J., Roy, D., Hudspith, B. and Jenkins, H. (2009), “Inquiry-based learning in higher | |
education: administrators’ perspectives on integrating inquiry pedagogy into the curriculum”, | |
Higher Education, Vol. 58 No. 6, pp. 841-855. | |
Kember, D. (1997), “A reconceptualisation of the research into university academics’ conceptions of | |
teaching”, Learning and Instruction, Vol. 7 No. 3, pp. 255-275. | |
Ketpichainarong, W., Panijpan, B. and Ruenwongsa, P. (2010), “Enhanced learning of biotechnology | |
students by an inquiry-based cellulose laboratory”, International Journal of Environmental & | |
Science Education, Vol. 5 No. 2, pp. 169-187. | |
Kolmos, A., Du, X.Y., Dahms, M. and Qvist, P. (2008), “Staff development for change to problem-based | |
learning”, International Journal of Engineering Education, Vol. 24 No. 4, pp. 772-782. | |
Kvale, S. and Brinkmann, S. (2009), Interviews: Learning the Craft of Qualitative Research, SAGE, | |
Thousand Oaks, CA. | |
Lehmann, M., Christensen, P., Du, X. and Thrane, M. (2008), “Problem-oriented and project-based | |
learning (POPBL) as an innovative learning strategy for sustainable development in engineering | |
education”, European Journal of Engineering Education, Vol. 33 No. 3, pp. 283-295. | |
Martin, T., Rivale, S.D. and Diller, K.R. (2007), “Comparison of student learning in challenge-based and | |
traditional instruction in biomedical engineering”, Annals of Biomedical Engineering, Vol. 35 | |
No. 8, pp. 1312-1323. | |
Modell, M.G. and Modell, M.G. (2017), “Instructors’ professional vision for collaborative learning | |
groups”, Journal of Applied Research in Higher Education, Vol. 9 No. 3, pp. 346-362. | |
NEA (2010), “Preparing 21st Century students for a global society: an educator’s guide to ‘the four Cs’ ”, | |
National Education Association, Washington, DC, available at: www.nea.org/tools/52217 | |
(accessed December 20, 2017). | |
Nicol, D.J. and Macfarlane-Dick, D. (2006), “Formative assessment and self-regulated learning: a model | |
and seven principles of good feedback practice”, Studies in Higher Education, Vol. 31 No. 2, | |
pp. 199-218. | |
Paris, C. and Combs, B. (2006), “Lived meanings: what teachers mean when they say they are learner-centered”, Teachers & Teaching: Theory and Practice, Vol. 12 No. 5, pp. 571-592.
Piburn, M., Sawada, D., Falconer, K., Turley, J., Benford, R. and Bloom, I. (2000), Reformed Teaching | |
Observation Protocol (RTOP), Arizona Collaborative for Excellence in the Preparation of | |
Teachers, Tempe. | |
Prince, M.J. and Felder, R.M. (2006), “Inductive teaching and learning methods: definitions, | |
comparisons, and research bases”, Journal of Engineering Education, Vol. 95 No. 2, pp. 123-138. | |
Qatar University (QU) (2012), “Qatar university strategic plan 2013–2016”, available at: www.qu.edu. | |
qa/static_file/qu/About/documents/qu-strategic-plan-2013-2016-en.pdf (accessed June 10, 2017). | |
Rogers, A. (2002), Teaching Adults, 3rd ed., Open University Press, Philadelphia, PA. | |
Rubin, A. (2012), “Higher education reform in the Arab world: the model of Qatar”, available at: www. | |
mei.edu/content/higher-education-reform-arab-world-model-qatar (accessed December 15, 2016). | |
Scott, L.C. (2015), “The futures of learning 2: what kind of learning for the 21st century?”, UNESCO | |
Educational Research and Foresight Working Papers, available at: http://unesdoc.unesco.org/ | |
images/0024/002429/242996E.pdf (accessed December 22, 2017). | |
Seymour, E. and Hewitt, N.M. (1997), Talking About Leaving: Why Undergraduates Leave the Sciences, | |
Westview, Boulder, CO. | |
Shu-Hui, H.C. and Smith, R.A. (2008), “Effectiveness of interaction in a learner-centered paradigm | |
distance education class based on student satisfaction”, Journal of Research on Technology in | |
Education, Vol. 40 No. 4, pp. 407-426. | |
Simsek, P. and Kabapinar, F. (2010), “The effects of inquiry-based learning on elementary students’ | |
conceptual understanding of matter, scientific process skills and science attitudes”, Procedia - Social and Behavioral Sciences, Vol. 2 No. 2, pp. 1190-1194.
Slavich, G.M. and Zimbardo, P.G. (2012), “Transformational teaching: theoretical underpinnings, basic | |
principles, and core methods”, Educational Psychology Review, Vol. 24 No. 4, pp. 569-608. | |
Smith, K.A., Douglas, T.C. and Cox, M. (2009), “Supportive teaching and learning strategies in STEM | |
education”, in Baldwin, R. (Ed.), Improving the Climate for Undergraduate Teaching in STEM | |
Fields. New Directions for Teaching and Learning, Vol. 117, Jossey-Bass, San Francisco, CA, | |
pp. 19-32. | |
Smith, M.K., Vinson, E.L., Smith, J.A., Lewin, J.D. and Stetzer, K.R. (2014), “A campus-wide study of | |
STEM courses: new perspectives on teaching practices and perceptions”, CBE Life Sciences | |
Education, Vol. 13, pp. 624-635. | |
Springer, L., Stanne, M.E. and Donovan, S.S. (1999), “Effects of small-group learning on | |
undergraduates in science, mathematics, engineering, and technology: a meta-analysis”, | |
Review of Educational Research, Vol. 69 No. 1, pp. 21-51. | |
Steinemann, A. (2003), “Implementing sustainable development through problem-based learning: | |
pedagogy and practice”, Journal of Professional Issues in Engineering Education and Practice, | |
Vol. 129 No. 4, pp. 216-224. | |
Walczyk, J.J. and Ramsey, L.L. (2003), “Use of learner-centered instruction in college science and | |
mathematics classrooms”, Journal of Research in Science Teaching, Vol. 40 No. 6, pp. 566-584. | |
Walter, E.M., Beach, A.L., Henderson, C. and Williams, C.T. (2015), “Measuring postsecondary teaching | |
practices and departmental climate: the development of two new surveys”, in Weaver, G.C., Burgess, D.,
Childress, A.L. and Slakey, L. (Eds), Transforming Institutions: Undergraduate STEM in the 21st
Century, Purdue University Press, West Lafayette, IN, pp. 411-428.
Walter, E.M., Henderson, C.R., Beach, A.L. and Williams, C.T. (2016), “Introducing the Postsecondary | |
Instructional Practices Survey (PIPS): a concise, interdisciplinary, and easy-to-score survey”, | |
CBE – Life Sciences Education, Vol. 15 No. 4, pp. 1-11. | |
Watkins, J. and Mazur, E. (2013), “Retaining students in science, technology, engineering, and | |
mathematics (STEM) majors”, Journal of College Science Teaching, Vol. 42 No. 5, pp. 36-41. | |
Weimer, M. (2002), Learner-Centered Teaching: Five Key Changes to Practice, Jossey-Bass, | |
San Francisco, CA. | |
Williams, C.T., Walter, E.M., Henderson, C. and Beach, A.L. (2015), “Describing undergraduate STEM | |
teaching practices: a comparison of instructor self-report instruments”, International Journal of | |
STEM Education, Vol. 2 No. 18, pp. 1-14, doi: 10.1186/s40594-015-0031-y. | |
Zhao, K., Zhang, J. and Du, X. (2017), “Chinese business students’ changes in beliefs and strategy use in | |
a constructively aligned PBL course”, Teaching in Higher Education, Vol. 22 No. 7, pp. 785-804, | |
doi: 10.1080/13562517.2017.1301908. | |
Appendix | |
Interview guidelines | |
(1) How do you understand/define SCL? What are important characteristics of SCL in your | |
opinion? | |
(2) What are your past experiences of using SCL? | |
(3) How do you see the role of instructor in an SCL environment, and in which ways is this role | |
descriptive of your current practice? | |
(4) What are your preferred assessment methods within your current teaching practices and | |
why? | |
(5) What should be the ideal assessment methods in an SCL environment? | |
(6) What are the challenges of practicing SCL in your current environment? | |
(7) In your opinion, what institutional supports are needed to implement SCL in Qatar? | |
Corresponding author | |
Saed Sabah can be contacted at: [email protected] | |
Enhancing Quality of Teaching in the Built Environment | |
Higher Education, UK | |
Muhandiramge Kasun Samadhi Gomis | |
School of Architecture and Built Environment, University of Wolverhampton, | |
Wolverhampton, UK, | |
Mandeep Saini | |
School of Architecture and Built Environment, University of Wolverhampton, | |
Wolverhampton, UK, | |
Chaminda Pathirage | |
School of Architecture and Built Environment, University of Wolverhampton, | |
Wolverhampton, UK, | |
Mohammed Arif | |
Architecture, Technology and Engineering, University of Brighton, Brighton, UK | |
Abstract | |
Purpose – Issues in the current Built Environment Higher Education (BEHE) curricula point to a critical need to enhance the quality of teaching. This paper aims to identify the need for best practice in teaching within BEHE curricula and to recommend a set of drivers to enhance current teaching practices in Built Environment (BE) education. The study focused on section one of the National Student Survey (NSS), “Teaching on my course”, with a core focus on improving student satisfaction, making the subject interesting, creating an intellectually stimulating environment, and challenging learners.
Methodology – The study used a mixed method: (1) a document analysis of feedback from undergraduate students, and (2) a closed-ended questionnaire administered to academics in the BEHE context. More than 375 items of student feedback were analysed to understand teaching practices in BE, and the findings fed forward into the development of the closed-ended questionnaire for 23 academics, including a Head of School, a Principal Lecturer, Subject Leads and Lecturers. The data were collected from the Architecture, Construction Management, Civil Engineering, Quantity Surveying and Building Surveying disciplines, representing the BE context. The data obtained from both instruments were analysed with content analysis to develop 24 drivers for enhancing the quality of teaching. These drivers were then modelled using the Interpretive Structural Modelling (ISM) method to identify their correlation and criticality to the NSS section one themes.
Findings – The study revealed 10 independent, 11 dependent and 3 autonomous drivers that facilitate best teaching practice in BEHE. The study further recommends that the drivers be implemented as illustrated in the level partitioning diagrams under each NSS section one theme to enhance the quality of teaching in BEHE.
Practical implications – The recommended set of drivers and the level partitioning can serve as a guideline for academics and academic institutions seeking to enhance the quality of teaching. They could further be used to improve student satisfaction and overall NSS results, and thereby the rankings of academic institutions.
Originality/Value – The ISM analysis and level partitioning diagrams of the recommended drivers offer new knowledge to assist academics and academic institutions in developing the quality of teaching.
Keywords – Enhancing Teaching Quality, Built Environment Higher Education, Learning in | |
post-COVID, National Student Survey (NSS), Teaching on my course. | |
Introduction | |
The United Kingdom’s Higher Education (HE) sector is focused on improving the | |
quality of teaching (Santos et al., 2020; Tsiligiris and Hill, 2019; Matthews and Kotzee, 2019). | |
HE providers continuously attempt to enhance learning standards by assuring teaching | |
developments within courses. Hence, knowledge providers make considerable efforts to | |
develop pedagogy within BE academia (Van Schaik et al., 2019). However, developing teaching within a specific discipline is challenging (Ovbiagbonhia et al., 2020; McKnight et al., 2016). Moreover, Tsiligiris and Hill (2019) and Welzant (2015) identified a notable knowledge gap in enhancing quality within current HE curricula. The global COVID
pandemic has exacerbated the challenges related to teaching and learning within higher | |
education (Allen et al., 2020). Both the learners and academics face challenges in maintaining | |
quality in HE, especially within the current focus on digitised and Virtual Learning | |
Environment (VLE) teaching (Arora and Srinivasan, 2020; Bao, 2020). This study explores | |
best practices to improve the quality of teaching across the Built Environment Higher | |
Education (BEHE). Thus, the study investigates section one of the NSS questionnaire, namely “The Teaching on my Course”. The main emphasis is on its four central themes, which ask whether “the staff is good at explaining things”, whether staff have “made the subject interesting”, whether “the course is intellectually stimulating”, and whether “the course has challenged students to achieve the best work”. Many contemporary learning and
teaching strategies are present in curriculum development (Tsiligiris and Hill, 2019). However, | |
a significant knowledge gap is present in identifying the best use of each theme under NSS | |
Section one and developing a best practice to enhance quality of teaching. The data obtained | |
by section one of NSS in 2019, 2020 and 2021 highlights the need to enhance teaching in the | |
BE curricula. The NSS records that satisfaction with “teaching on my course” fell by 6% against the average minimum scoring criteria in 2021 (Office for Students, 2020). It further shows that the 2021 average for NSS section one was 84% across all subjects, whereas BE scored only 79%. This score provides insight into how BE performs
compared to other subjects within the UK's HE context. Issues in teaching and the COVID | |
pandemic may have influenced the significant reduction in NSS score (Arora and Srinivasan, | |
2020; Allen et al., 2020). Therefore, this study aims to identify best practices and enhance the | |
quality of teaching in BEHE. | |
1.0 Literature Review | |
1.1 Explaining the subject | |
Increasing understanding in an area of expertise is vital to pedagogical education. The literature (Ferguson, 2012) suggests that teaching helps learners recognise cognition within human behaviour and gain insight into relevant information by relating it to exposure and experience of the subject area across various levels of learning. These levels of learning, in turn, lead students to consider their own academic development. Findings from Gollub
(2002) suggest that a better understanding of learning is facilitated around the concepts and | |
principles of the subject matter. Moreover, Andersson et al. (2013) highlight that students tend to generate more knowledge by acquiring prerequisite knowledge and utilising it to increase their understanding of the subject. In addition, Andersson et al. (2013) suggest that learners draw on prior learning to engage better with interactive learning. However, in a
classroom context, the multi-disciplinary orientation of BE makes it challenging to address | |
prerequisite knowledge and provide in-depth understanding to learners (Waheed et al., 2020; | |
Dieh et al., 2015). Thus, BE academics need to devise module delivery that aligns with the subject area while building on learners' previous knowledge to enhance it.
Moreover, Lai (2011) and McKnight et al. (2016) stressed the importance of interactive | |
learning within pedagogical education. These studies highlight that learners find knowledge more constructive when it is peer-reviewed; providing an environment that supports this is therefore essential to enhancing BE understanding. Moreover, Guo and Shi (2014) explain how collaboration increases understanding through active strategies. However, Guo and Shi (2014) overlook the way innovation embedded in learning effectively brings collaboration and utilises modern approaches within the classroom context.
Furthermore, the current pandemic has encouraged active strategies such as blended learning | |
and digitised technologies (Allen et al., 2020) within a VLE. However, challenges were | |
identified in the definitive use of VLE, which did not advocate sub-teaching concepts such as | |
interactive learning and context-based knowledge (Waheed et al., 2020). The "silent" | |
classrooms are not appropriate for the transfer and sharing of technical knowledge in BEHE. | |
Ultimately, the prospect of better teaching rests on innovative approaches and on the extent to which VLEs are used to make a subject interesting, fostering interactive learning through the co-creation of knowledge and promoting a clear explanation of a BE subject.
1.2 Making the Subject Interesting | |
Students do not engage in situations where they no longer see value or interest in the content taught (Fraser, 2019). Lozano et al. (2012) state that analytical competency is achieved when the theory taught is relevant to industrial capacity, creating a platform for students to participate in learner engagement. Both Fraser (2019) and Lozano et al. (2012) suggest that collaboration between academics and learners is significant to active learner engagement and to developing interest in the subjects learnt. Therefore, engagement and collaboration are considered the most critical challenges in an active learning environment (Hue and Li, 2008; Scott, 2020). However, a knowledge gap exists in measuring collaboration that shows competitive
learning and the cooperation of learners with the academic. The social, psychological, and | |
academic characteristics build learners’ perception of collaborative work (Uchiyama and | |
Radin, 2008). Of these, Hmelo-Silver et al. (2008) established the importance of the social dimension of collaboration, suggesting that the benefits of social support come from establishing a positive atmosphere within collaborative learning. Implementing a collaborative approach to learning also enhances diversity within BEHE.
Furthermore, engagement benefits learners' psychological well-being, which is reflected in academic performance and mental health (Clough and Strycharczyk, 2012). It signifies student-centric education that develops students' self-esteem, thus increasing interest in the subject. Secondary elements in BE teaching, such as site visits, guest lectures and other innovative concepts, can serve as examples (Van Schaik, 2019). Although Clough and Strycharczyk (2012) consider psychological characteristics, their study does not recognise the prominence of the critical thinking gained from collaboration. Bye et al. (2007) imply that critical thinking is needed to make content more meaningful and collaborative. However, collaborative teaching methods received limited consideration in teaching during the COVID pandemic (Blundell et al., 2020); thus, more research is needed to identify means of developing learner engagement in VLEs. In addition, this study identifies that learner engagement and fostering collaboration demand stimulating learners and making the subject interesting. However, a significant knowledge gap exists in applying these findings to make a subject interesting in the current BEHE context.
1.3 Intellectual stimulation of learners | |
Studies identify that learners become stimulated when the subject is interesting and are motivated to overcome the challenging nature of the course structure (Bolkan and Goodboy,
2010). Moreover, student motivation and intellectual stimulation increase when subject matter | |
reflects learner interests (Baeten et al., 2010). Furthermore, intellectual stimulation improves | |
when academics provide authentic, current, industry-related practices relevant to learners’ | |
academic learning. Bolkan et al. (2011) suggested implementing active learning to enhance | |
learners’ intellectual effort. Thus, intellectual stimulation needs to be integrated through | |
problem-solving teaching methods, context-based learning, realistic case studies and setting | |
clear expectations and motivation for student excellence. | |
Chickering and Gamson (1999) suggested that summarising ideas, reviewing problems, | |
assessing the level of understanding and concluding on learning outcomes at the end of a | |
learning session stimulates learners. Furthermore, Tirrell and Quick (2012) outlined | |
opportunities to direct learners by contrasting fundamental theories and applying theory to real | |
life. However, the researchers overlook the fact that stimulation could be provided outside the | |
learning environment. Current practice within BE academia involves guest lectures and site visits to extend learning beyond the classroom and stimulate learners (Chen and Yang, 2019). Furthermore, the Educational Development Association (2013) highlights the influence of Professional Standards and Regulatory Bodies (PSRBs) within BE learning. The involvement of PSRBs helps ensure that the knowledge delivered is industry-appropriate. In addition to making the subject interesting, PSRBs further stimulate the learner to develop academic skills and competencies. Thus, learners come to see the industry standard reflected in the theories taught, which supports intellectual stimulation. Nonetheless, these strategies have been disrupted by the COVID pandemic's measures for virtual module delivery (Allen et al., 2020), so they have had to be integrated into digitised platforms and VLE teaching methods. A measure of best practice is therefore needed when considering how VLE strategies can address the COVID situation and further development of the BEHE curriculum.
The stimulation provided at the elementary level in BE learning is vital for interaction | |
between the learners (Jabar and Albion 2016). The collaboration between learners and | |
knowledge providers is vital for intellectual stimulation, and the use of concepts such as VLE | |
further promotes stimulation (Block, 2018; Marshalsey and Madeleine, 2018). However, | |
identifying fundamental digitisation approaches and innovative teaching methods such as | |
blended learning or flipped classroom will signify the commitment toward stimulated learning. | |
Stimulation through quizzes and experimental studies will improve the clarity of knowledge | |
provided through VLE. In addition, stimulation in a VLE through various digital learning | |
strategies for students can promote challenging learners. However, some views on the current | |
teaching practices in the COVID era denote that VLE is not the perfect solution for academic | |
development (Bao, 2020). Academics need to know to what extent VLE should be integrated | |
and how the best practice in BE teaching should be developed. | |
1.4 Challenging Learners | |
Knowledge providers who promote intellectual stimulation create a challenging | |
learning environment that empowers the learners and promotes cognitive and affective learning | |
(Bolkan and Goodboy, 2010). Kohn Rådberg et al. (2018) argue that intellectual stimulation depends on the intrinsic motivation to be challenged in critical learning contexts. Thus, learners need encouragement to recognise the intellectual stimulus in the knowledge gained through HE curricula. Altomonte et al. (2016) explain how learners persist in their
learning process much longer in a challenging environment than in a traditional learning | |
environment. A plethora of more contemporary literature (Avargil et al., 2011; Chen and Yang, | |
2019) addresses specific learning strategies such as project-based and context-based learning, | |
which acts as a stimulus in developing challenging environments in the current BE learning | |
context. | |
A study carried out by Han and Ellis (2019) offers a detailed account of deep approaches to learning and 'higher learning outcomes'. However, it fails to identify the relationship between challenging learners and the impact of such challenge on academic and cognitive learning
strategies. Learners often respond more to challenges made via competitive elements such as | |
quizzes, polls, and other simpler assessments in module delivery (Chen and Yang, 2019). It is | |
vital to understand that a challenging learning environment is not a mere self-testing method | |
for assessment in curricula but rather an instrument for continuous academic improvement | |
(Darling-Hammond et al., 2019). Further, learners will benefit from self-preparing concerning | |
the knowledge content discussed in the classroom. It further influences advanced knowledge | |
gained through research rather than knowledge transmission provided in the classroom. | |
Challenging learners creates more opportunities to collaborate and increases intellectual stimulation (Boud et al., 2018; Gomis et al., 2021). However, Boud et al. overlook the counter-motivation that challenge can create in learners, which can itself result in innovation. Furthermore, challenging students helps them to recognise stimulation and to make informed judgments about their academic experience. By challenging the learner, the academic can evaluate aptitude and growth (Hamari et al., 2016). The current practice in academia during
the COVID pandemic deemed the use of VLE in setting out quizzes and other evaluation | |
methods to stimulate and challenge learners (Block, 2018; Bao, 2020). Hence, using digitised | |
platforms in an active learning environment is paramount in advancing teaching in BE. | |
However, these VLE instruments could be further integrated with the module delivery plan to | |
optimise challenging learners and enhance academic development. | |
2.0 Methodology | |
2.1 Participants & Materials | |
‘Teaching on my course’ of the NSS questionnaire emphasises four questions related | |
to ‘explain things, make the subject interesting, create an intellectually stimulating | |
environment, and challenge the learners’. Documental analysis and questionnaire surveys with | |
separate samples were identified as the research tools best suited to the study. Document analysis was adopted to analyse a sample of 375 Mid-Module Reviews (MMRs) from students at levels three to six, in light of the findings from the literature on the four questions in NSS section one. The documental data were categorised into themes in which students identified how the teaching helped them and the key elements that were positive about the module. The analysis uses 375 samples, assuming a confidence level of 95% and a margin of error of 5%.
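The stated confidence level and margin of error can be sanity-checked with the standard Cochran sample-size formula. The following is a minimal Python sketch; the cohort size passed to the function is purely hypothetical, since the paper does not report the size of the underlying student population.

import math

def cochran_sample_size(z=1.96, margin=0.05, p=0.5, population=None):
    # Infinite-population estimate: n0 = z^2 * p * (1 - p) / e^2 (about 385 at 95% / 5%).
    n0 = (z ** 2) * p * (1 - p) / margin ** 2
    if population is None:
        return math.ceil(n0)
    # Finite-population correction for a cohort of known size.
    return math.ceil(n0 / (1 + (n0 - 1) / population))

print(cochran_sample_size())                  # 385 with no population cap
print(cochran_sample_size(population=15000))  # 375 for an assumed cohort of 15,000 students

Under these assumptions, a 375-review sample is consistent with a 95% confidence level and a 5% margin of error once a finite student population is taken into account.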
The themes identified from the documental analysis were used to identify and develop | |
the survey framework and questionnaire conducted for the academics. The closed-ended | |
questionnaire survey refined the documental data findings and established the gap between the | |
existing and best practices. Departments of Architecture, Construction Management, Civil | |
Engineering, Quantity Surveying, and Building Surveying were selected to represent the BE discipline and to obtain valid and reliable data, giving a survey sample of 20 academics. Four academics were selected from each discipline based on their title: a Professor/Reader, two Senior Lecturers and a Lecturer. This approach yielded four participants from each BE discipline. Additionally, three participants,
a Head of the school, a Principal lecturer, and a Subject lead, were included, bringing the | |
sample size to 23 participants. A critical focus of the latter three participants was to eliminate | |
unconscious bias in feedback received from students and endorse validity, reliability and | |
transferability of the data collected and modelled through ISM analysis. The data obtained from | |
the questionnaire assisted in developing the drivers in enhancing the best practice of teaching | |
in the BEHE context. | |
2.2 Research Procedure | |
A systematic approach to data collection incorporating the literature review, document | |
analysis, and questionnaire survey has allowed an in-depth understanding of current BEHE | |
teaching and learning. The substantial data collected from documental analysis and | |
questionnaire survey needed to be correlated with the NSS theme establishing relationships on | |
improving BEHE teaching and learning. Thus, the data was modelled using the Interpretive | |
Structural Modelling (ISM) tool to find critical drivers and correlation of each driver to the | |
theme of NSS section one. The drivers identified through the data analysis were used in the | |
ISM analysis. Afterwards, a reachability matrix was developed from modelling the drivers | |
through a “Structural Self-Interaction Matrix” (SSIM). A “Matrice d’Impacts Croisés Multiplication Appliquée à Classement” (MICMAC) analysis was then developed to identify which factors need to be emphasised in enhancing teaching strategies, ascertaining the degree of the
relationships between the drivers found through SSIM. The MICMAC enabled categorising | |
data obtained into independent, dependent and autonomous clusters to establish a best practice | |
framework for teaching enhancement in BEHE. The data derived from each analysis was | |
factored in when developing the level partitioning of each driver. Moreover, the ISM level | |
partitioning illustrated a critical correlation of each driver under NSS themes and emphasised | |
implications in the BEHE context. Finally, this study's general conclusions are drawn from the | |
level partitioning and presented as the recommended strategies for developing teaching | |
enhancement in BEHE. | |
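The ISM steps described above can be illustrated with a short Python sketch. The four-driver matrix below is invented purely for illustration (the study itself modelled 24 drivers), and splitting the MICMAC axes at half the driver count is one common convention rather than necessarily the authors' exact rule.

import numpy as np

# Invented initial reachability matrix for four hypothetical drivers
# (entry [i, j] = 1 means "driver i influences driver j"); the study used 24 drivers.
initial = np.array([
    [1, 1, 1, 0],
    [0, 1, 1, 1],
    [0, 0, 1, 0],
    [0, 0, 1, 1],
])

# Final reachability matrix: keep adding transitive links (i -> k and k -> j
# implies i -> j) until nothing changes.
reachability = initial.copy()
while True:
    updated = (((reachability @ reachability) > 0).astype(int)) | reachability
    if np.array_equal(updated, reachability):
        break
    reachability = updated

driving_power = reachability.sum(axis=1)  # how many drivers each driver reaches
dependence = reachability.sum(axis=0)     # how many drivers reach each driver

# MICMAC clustering; both axes are split at half the driver count here.
midpoint = reachability.shape[0] / 2
for idx, (drv, dep) in enumerate(zip(driving_power, dependence), start=1):
    if drv > midpoint and dep <= midpoint:
        cluster = "independent"  # strong driver, weakly driven by others
    elif drv <= midpoint and dep > midpoint:
        cluster = "dependent"    # mostly driven by other drivers
    elif drv <= midpoint and dep <= midpoint:
        cluster = "autonomous"   # weakly connected to the system
    else:
        cluster = "linkage"      # both driving and driven
    print(f"D{idx}: driving power {drv}, dependence {dep} -> {cluster}")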
3.0 Analysis | |
Three hundred and seventy-five (375) MMRs (Mid-Module Reviews) were examined. Students were asked three questions: how the module is progressing, what is good or bad, and suggestions to improve module delivery. Academics made a subjective evaluation of the reviews provided, and themes were identified in the students' suggestions. This
evaluation identifies 24 drivers directly influencing the teaching practices highlighted by the | |
four NSS questions. The identified drivers were collated and categorised into the specific NSS | |
questions/themes, and an ISM analysis was carried out. A pair-wise relationship is mapped to | |
the Structural self-interaction matrix (SSIM) using a binary matrix based on the above data | |
gathered through the closed-ended questionnaire survey from the teaching staff. The binary | |
matrix was used to create the MICMAC graph in recognising the influential drivers that | |
enhance HE teaching. Furthermore, a level partitioning was carried out to find the interrelationship of each driver and recognise the sequential order of implications within the BEHE | |
context. Drivers with the characteristics of the independent cluster are considered fundamental to the system and extremely important for enhancing teaching in BEHE. Drivers with the characteristics of the dependent cluster are considered necessary for accommodating the independent drivers; thus, dependent drivers directly influence planning and module development rather than being fundamental to teaching. Drivers with the characteristics of the autonomous cluster are considered fundamentally unimportant to the system.
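The level partitioning mentioned above is the standard ISM step of repeatedly extracting drivers whose reachability set, restricted to the drivers still unassigned, is contained in their antecedent set. A minimal sketch, again using an invented reachability relation rather than the study's 24-driver data:

# Driver -> set of drivers it reaches (including itself); invented for illustration only.
reachability_sets = {
    "D1": {"D1", "D2", "D3", "D4"},
    "D2": {"D2", "D3", "D4"},
    "D3": {"D3"},
    "D4": {"D3", "D4"},
}

levels = []
remaining = set(reachability_sets)
while remaining:
    # Antecedent set: every remaining driver that reaches this one.
    antecedent = {d: {a for a in remaining if d in reachability_sets[a]} for d in remaining}
    # A driver is assigned to the current level when its (remaining) reachability
    # set is contained in its antecedent set.
    current = {d for d in remaining
               if (reachability_sets[d] & remaining) <= antecedent[d]}
    levels.append(sorted(current))
    remaining -= current

for number, drivers in enumerate(levels, start=1):
    print(f"Level {number}: {', '.join(drivers)}")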
The study reveals that critical emphasis needs to be given to promote active learning | |
and provide in-depth understanding when the academic explains module content. Promoting | |
collaboration, student engagement and focussing on student-centric approaches occurred in the | |
independent cluster to make the subject interesting. Promoting intellectual stimulation by | |
enhancing interaction between the learner and the academic was considered fundamental in | |
enhancing active learner stimulation. Challenging the learner by providing motivation, | |
promoting self-assessment for continuous improvement, challenging learning culture through | |
learner motivation and helping the learner develop an action plan for career progression were illustrated in the independent cluster, so these drivers are deemed fundamental. Thus,
implementing these drivers would facilitate the best practice in HE teaching. | |
Furthermore, dependent drivers identified through the study will be beneficial in | |
facilitating the independent drivers mentioned above. An interim assessment opportunity and | |
guidance given through a formative feedback session were recognised as dependent drivers in | |
explaining the module content. Use of various media in explaining the subject content, | |
executing cognitive approaches, arranging site visits (where applicable) or site walk-throughs, | |
guest lecturers, augmentation in lecture material, and presenting real-world examples in | |
lectures were identified as dependent drivers in making the subject interesting. Intellectual | |
stimulation by challenging learners in problem-based learning and assessment guidance | |
through assessment rubrics and question-based learning were identified under the dependent | |
cluster. Contrary to widespread belief, revisiting previous knowledge and reflecting on module | |
content with the pathway provided by PSRB in explaining module content and reflecting more | |
on the industry-led practices in intellectually stimulating students were in the autonomous | |
cluster. However, this is not because these drivers have little influence on the system, but because they are facilitated by other (both dependent and independent) drivers.
To generalise the critical findings from the MICMAC analysis, the following Table 1 | |
illustrates the fundamental drivers (independent), facilitating drivers (dependent), and non-influential/already accommodated drivers (autonomous) in enhancing teaching in HE. The
drivers are categorised into the four performance indicators depicted by Section 1 of the NSS | |
to clarify and ease interpretation. Thus, academics and academic institutions can implement | |
these drivers to promote teaching practices within BEHE. | |
Table 1: Categorisation of Drivers
Section 1: The teaching on my course
(MICMAC categorisation is shown in parentheses after each driver.)

Q1 – Staff is good at explaining things
D1 - Promoting active learning (Independent)
D2 - Providing an in-depth understanding (Independent)
D3 - Revisiting previous knowledge (Autonomous)
D4 - Interim assessment opportunity (Dependent)
D5 - Guidance given through formative feedback session (Dependent)
D6 - Reflecting module content with the pathway provided by PSRB (Autonomous)

Q2 – Staff have made the subject interesting
D7 - Promoting collaboration (Independent)
D8 - Focussing on student-centric approaches (Independent)
D9 - Promoting student engagement (Independent)
D10 - Use of a variety of media in explaining the subject content (Dependent)
D11 - Executing cognitive approaches (Dependent)
D12 - Arranging site visits (where applicable) or site walk-throughs (Dependent)
D13 - Guest lecturers (Dependent)
D14 - Augmentation in lecture material (Dependent)
D15 - Presenting real-world examples in lectures (Dependent)

Q3 – The course is intellectually stimulating
D16 - Promoting intellectual stimulation (Independent)
D17 - Enhance interaction between the learner and the academic (Independent)
D18 - Reflecting more on industry-led practices (Autonomous)
D19 - Challenging learners in problem-based learning (Dependent)

Q4 – My course has challenged me to achieve my best work
D20 - Promoting self-assessment for continuous improvement (Independent)
D21 - Challenging learning culture through learner motivation (Independent)
D22 - Assessment guidance through assessment rubrics (Dependent)
D23 - Question-based learning (Dependent)
D24 - Having an action plan on career progression (Independent)

SSIM coordinates (i, j): 10, 17, 2, 11, 13, 24, 6, 6, 13, 10, 10, 7, 10, 13, 6, 10, 21, 14, 15, 15, 5, 11, 18, 17, 4, 4, 18, 2, 19, 4, 9, 19, 8, 15, 11, 9, 13, 11, 9, 13, 7, 19, 12, 16, 6, 11, 5, 20
4.0 Discussion and Recommendations | |
This study recognises the significant need to enhance quality of teaching in BEHE. | |
Both the literature and primary data collection recognised a substantial number of suggestions | |
for enhancing teaching practices. The strategies/drivers obtained from primary and secondary | |
data are categorised into themes and analysed according to their influence/driver capability | |
with questions put forth by NSS section 1. The outcome of the discussion will be the level | |
partitioning of the identified drivers, which illustrates how they can be implemented to increase the quality of HE teaching. The section below discusses the identified drivers and their correspondence with the NSS themes under section one.
4.1 Explaining the subject
Explaining the subject ultimately depends on how the learner clarifies the knowledge criteria. Gollub (2002), Ferguson (2012), and McKnight et al. (2016) show that active learning
is highly dependent on the levels of understanding. Providing a higher understanding of the | |
subject matter, the context of knowledge transferred, revisiting the experience learnt and | |
promoting interactive learning are critical academic performance enhancers (McKnight et al., | |
2016; Guo and Shi, 2014; Eames and Birdsall, 2019). The level partitioning developed from | |
the research findings shown in figure 1 below identifies that revisiting knowledge (D3) and | |
reflecting on the PSRB pathway (D6) were the lowest priority, at Level III. Even though they are at Level III, they will aid other drivers with in-depth understanding (D2) to better explain
module content. Both literature (Lozano et al., 2012; Ovbiagbonhia et al., 2020) and data state | |
that the module leader needs to identify how to merge academic and professional competency | |
gaps in providing an in-depth understanding of BE curricula. However, the research findings | |
highlight the importance of the availability of interim assessment guidance. The use of interim | |
assessment opportunities (D4) and guidance given through formative feedback (D5) should be | |
considered significant in developing the module. The emphasis is on module leaders and academics, who need to develop and deliver module content that facilitates formative
assessment/feedback. The study identifies that promoting active learning and in-depth | |
understanding are fundamental, sitting at Level I in enhancing knowledge delivery. Current studies (Allen et al., 2020) regard pedagogic theories and platforms such as the VLE, which promote active learning by using quizzes and other media to engage students, as the best strategies for enhancing active learning.
Figure 1: Level partitioning of Drivers on NSS Q1 - Staff is good at explaining the subject | |
4.2 Making the subject interesting
The literature establishes that the learning culture of the modern-day classroom has | |
evolved. Hue and Li (2008) and Hmelo-Silver et al. (2008) identified the core context of | |
collaboration and its effect on subject engagement. The widespread view that the current pedagogical paradigm of digitised practices promotes collaboration (Siew, 2018; Hamari et al., 2016) shapes authentic, industry-related content, especially within the BE curricula.
Moreover, the literature review identifies that BE knowledge providers promote digitised | |
learning concepts in HE. Findings from primary data also recognise approaches in | |
accommodating augmented concepts and focusing on digitised learning environments | |
facilitating such learning. The level partitioning developed from the research findings shown | |
in figure 2 below illustrates both facilitating drivers and fundamental drivers. The facilitating | |
drivers are: execute cognitive approaches (D11), arrange site visits or site walk-throughs (D12), | |
guest lecturers (D13), augmentation/digitisation in lecture material (D14), and presenting real-world examples in lectures (D15). Since these drivers are positioned at Level II, they (D10 to D15) are considered to facilitate the fundamental drivers of module delivery. However, it
is identified that D13, D10 and D14 facilitate each other and help facilitate D11 and D15, which | |
facilitate D7 and D9, respectively. The study further strengthens the argument that promoting | |
student collaborations (D7), engagement (D9) and focussing on student-centric approaches | |
(D8) are fundamental in making the subject content interesting. It further revealed that both D7 | |
and D9 facilitated D8 in making the subject interesting. The ISM level partitioning positioned | |
them at Level I due to their fundamental influence in making the subject interesting. | |
A critical finding from the study is that using a variety of media (D10) to explain the | |
subject brings innovation to the classroom. The research findings signify that digitisation must | |
be considered a key facilitator but not a fundamental element in pedagogic development. | |
Further to the evidence of earlier studies, blended learning and flipped classroom techniques | |
are considered paramount in carrying out collaborative knowledge in group learning (Allen et | |
al., 2020). Documental analysis insists on combining traditional and digitised media to deliver | |
module content. Findings from documental analysis reveal that students prefer traditional | |
module delivery aligning with digitised recordings for revisiting knowledge. Thus, digitisation | |
needs to be a facilitator rather than being promoted to a fundamental driver in teaching HE. It | |
is further applicable to the current COVID learning context, where online learning has | |
dominated pedagogical implementation (Bao, 2020). This study presents critical evidence that digitisation is not in itself the answer to enhancing teaching practices but rather an opportunity to facilitate the independent drivers in enhancing HE learning.
Figure 2: Level partitioning of Drivers on NSS Q2 - Staff have made the subject interesting | |
4.3 Intellectual stimulation of learners
Baeten et al. (2010), Bolkan et al. (2011), and Jabar and Albion (2016) identify that intellectual stimulation is critical to HE student progression. Both the literature (Baeten et al., 2010; Bolkan et al., 2011) and the research findings reveal that straightforward 'lecturing', where knowledge is pushed to the learner with little reflection or context, is considered adverse to academic progress and performance. The data and the literature (Van Schaik, 2019) do not support adopting industry-led practices (D18) as the way to deliver module content, thus positioning this driver at Level III. The findings reveal that this is because drivers such as site visits, guest lectures, and a focus on real-world context were already adopted to make the subject interesting. However, these drivers are prominent in challenging learners through problem-based (D19) and industry-led contexts in learning. Tirrell and Quick (2012) and Jabar and
Albion (2016) further emphasised innovative teaching and effective teaching methods, such as | |
problem-based learning (D19). However, the research findings emphasise that such practice is | |
not fundamental but crucial in increasing intellectual stimulation since it is positioned at Level | |
II in the ISM level partitioning. Nevertheless, the analysis recognises the influence of D19 in facilitating both
D16 and D17. The study emphasises intellectual stimulation (D16) in module development and | |
that enhancing learner-academic interaction (D17) is fundamental and is self-facilitating to | |
make the course intellectually stimulating. The ISM level partitioning has positioned them in | |
Level I, which denotes fundamental influence over intellectual stimulation. The findings | |
further show the benefits of utilising digitised tools or in-class activities to promote intellectual | |
stimulus, especially within the COVID pandemic (Arora and Srinivasan 2020) and for | |
disciplines such as BE, where a vast knowledge content (e.g. architectural, engineering, | |
surveying and management) needs to be reflected. | |
Figure 3: Level partitioning of Drivers on NSS Q3 - The course is intellectually stimulating | |
4.4 Challenging Learners
The literature review (Darling-Hammond et al., 2019; Boud et al., 2018) identifies that challenging students could increase the probability of academic progression. However,
Kohn Rådberg et al. (2018) stressed the deficiencies in academic progression regarding the | |
lack of motivation and drivers, which does not aid intellectual stimulation. The literature | |
provides many strategies for promoting a challenging culture within the learning environment; | |
however, the surplus of theories makes the implementation complicated and time-consuming | |
(Boud et al., 2018; Bolkan, 2010). Assessment guidance through assessment rubrics (D22) and | |
question-based learning (D23) sit at Level II in the ISM level partitioning. Contradicting the
literature (Ellis and Hogard, 2018), the research findings illustrate that D22 and D23 were not | |
fundamental to challenging students but influential in facilitating D21 in enabling students to | |
achieve their best work. Also, this could be due to digitalisation being a prominent aspect in | |
enabling these drivers within the HE curriculum. This study identifies the fundamental drivers as promoting self-assessment opportunities (D20), motivating the student through a challenging culture of knowledge provision (D21), and developing an action plan for career progression/continuous improvement (D24), all positioned at Level I in the ISM analysis. It further
highlights that D21 and D24 facilitate D20, promoting continuous student improvement. Thus, | |
the analysis deems that the module leader/lead academic needs to consider the self-assessment | |
techniques, challenging learning culture, and action plan for career development in developing | |
the module and enhancing teaching in HE. | |
Figure 4: Level partitioning of Drivers on NSS Q4 - Course has challenged to achieve the best work | |
6.0 Conclusions | |
This study establishes drivers to enhance the quality of teaching in BEHE across the range of students, reflecting on the results of section 1 of the NSS. The findings are novel as the
study discusses drivers and illustrates implementation to improve quality of teaching within | |
the four NSS themes. The main findings from the literature review reveal significant room for improvement in teaching and pedagogy to enhance student performance in BEHE. The
practical implications of this study are that the identified drivers could help academics and | |
students increase understanding in conjunction with the lectures that deliver in-depth | |
knowledge through practical sessions. As illustrated in the figures, the level partitioning will | |
enable academics to focus on significant pedagogical themes and enforce strategies. As the | |
theme refers to the NSS guidelines, the drivers developed could assist HE institutions in | |
obtaining better results for the NSS survey. Finally, the combined set of figures could form a | |
framework for enhancing quality of teaching within HE curricula. | |
The suggestions for student engagement, developing a stimulating learning | |
environment, and challenging students need various collaborative online and face-to-face | |
teaching approaches. The literature played another critical part in providing context on module
background and content. Drivers further reinforced that promoting active learning and in-depth | |
understanding was fundamental in improving teaching in the BEHE context. Moreover, the | |
study's primary data showed that teaching and learning, resources, standards, and assessments could provide a better understanding to students and could be further facilitated by the above-mentioned independent drivers.
In contrast, this interpretation shows that implementing innovative practices in knowledge transfer, such as blended learning, the flipped classroom and group learning, is vital for stimulating
learners. Promoting collaboration, student engagement, and focusing on student-centric | |
approaches were considered independent, but these drivers facilitate other drivers in making | |
the subject interesting. Moreover, promoting intellectual stimulation, enhancing interaction | |
between the learner and the academic, promoting self-assessment for continuous improvement, | |
challenging learning culture through learner motivation, and having an action plan on career | |
progression are recognised as independent drivers in advancing teaching in BEHE. | |
The study identified several dependent factors, such as aligning the module content | |
with the PSRB requirements and emphasising personal and career development benefits. | |
However, the current learning practices need to be integrated with the online delivery platforms | |
to provide knowledge and challenge learners for better learning practice. Enforcing quizzes | |
and real-world examples through a digital platform proves vital in helping independent drivers | |
for intellectual stimulation and challenging the learner for an active learning atmosphere. | |
Finally, a unique finding is that online delivery in the current situation (COVID-19) brings more challenges since lectures are either blended or delivered online. All the independent and dependent drivers for engaging students, increasing understanding, inspiring
and challenging learners remain unchanged. The current situation also demands training for | |
the lecturers on various tools that can help engage, challenge, stimulate, and increase the | |
learners' understanding. However, the lecturers may now need to use multimedia tools to | |
accommodate the suggestions from this study and facilitate the independent drivers to enhance | |
quality of teaching in BEHE. Further research could be carried out by involving a higher | |
sample from different HE institutes around the globe to develop a global framework. Also, | |
further research is needed to reflect on how quality of teaching influences student learning | |
opportunities, assessment and feedback, academic support, and learning resources. | |
Acknowledgement | |
The data obtained for this paper were based on a project guided by a steering
committee within the University of Wolverhampton, chaired by Professor Mohammed Arif. | |
Among the committee members, credit needs to be given to Dr David Searle, Dr Alaa Hamood | |
and Dr Louise Gyoh for their significant input on the data collection. Furthermore, the student | |
and academic participants at the University of Wolverhampton need recognition for their | |
insightful comments. | |
7.0 Reference List | |
Allen, J., Rowan, L. and Singh, P., 2020. Teaching and teacher education in the time of COVID-19. | |
Asia-Pacific Journal of Teacher Education, 48(3), pp.233-236. | |
Altomonte, S., Logan, B., Feisst, M., Rutherford, P. and Wilson, R. (2016). Interactive and situated | |
learning in education for sustainability. International Journal of Sustainability in Higher | |
Education, 17(3), pp.417-443. | |
Andersson, P., Fejes, A. and Sandberg, F., 2013. Introducing research on recognition of prior learning. | |
International Journal of Lifelong Education, 32(4), pp.405-411. | |
Arora, A. and Srinivasan, R., 2020. Impact of Pandemic COVID-19 on the Teaching – Learning | |
Process: A Study of Higher Education Teachers. Prabandhan: Indian Journal of Management, | |
13(4), p.43. | |
Avargil, S., Herscovitz, O. and Dori, Y., 2011. Teaching Thinking Skills in Context-Based Learning: | |
Teachers’ Challenges and Assessment Knowledge. Journal of Science Education and | |
Technology, 21(2), pp.207-225. | |
Baeten, M., Kyndt, E., Struyven, K. and Dochy, F., 2010. Using student-centred learning environments | |
to stimulate deep approaches to learning: Factors encouraging or discouraging their | |
effectiveness. Educational Research Review, 5(3), pp.243-260. | |
Bao, W., 2020. COVID ‐19 and online teaching in higher education: A case study of Peking University. | |
Human Behavior and Emerging Technologies, 2(2), pp.113-115. | |
Block, B., 2018. Digitalization in engineering education research and practice. 2018 IEEE Global | |
Engineering Education Conference (EDUCON). | |
Blundell, C., Lee, K. and Nykvist, S., 2020. Moving beyond enhancing pedagogies with digital | |
technologies: Frames of reference, habits of mind and transformative learning. Journal of | |
Research on Technology in Education, 52(2), pp.178-196. | |
Bolkan, S. and Goodboy, A. (2010). Transformational Leadership in the Classroom: The Development | |
and Validation of the Student Intellectual Stimulation Scale. Communication Reports, 23(2), | |
pp.91-105. | |
Bolkan, S., Goodboy, A., and Griffin, D. (2011). Teacher Leadership and Intellectual Stimulation: | |
Improving Students' Approaches to Studying through Intrinsic Motivation. Communication | |
Research Reports, 28(4), 337-346. doi: 10.1080/08824096.2011.615958 | |
Boud, D., Ajjawi, R., Dawson, P. and Tai, J. (2018). Developing Evaluative Judgement in Higher | |
Education. 1st ed. London: Routledge. | |
Bowen, T. (2017). Assessing visual literacy: a case study of developing a rubric for identifying and | |
applying criteria to undergraduate student learning. Teaching in Higher Education, 22(6), | |
pp.705-719. | |
Bye, D., Pushkar, D., and Conway, M. (2007). Motivation, Interest, and Positive Affect in Traditional | |
and Nontraditional Undergraduate Students. Adult Education Quarterly, 57(2), 141-158. doi: | |
10.1177/0741713606294235 | |
Chen, C. and Yang, Y., 2019. Revisiting the effects of project-based learning on students’ academic | |
achievement: A meta-analysis investigating moderators. Educational Research Review, 26, | |
pp.71-81. | |
15 | |
Chickering, A. W., & Gamson, Z. F. (1999). Development and adaptations of the seven principles for | |
good practice in undergraduate education. New Directions for Teaching and Learning, 80, 75– | |
81. | |
Clough, P., and Strycharczyk, D. (2012). Developing mental toughness (1st ed.). London: KoganPage. | |
Darling-Hammond, L., Flook, L., Cook-Harvey, C., Barron, B. and Osher, D., 2019. Implications for | |
educational practice of the science of learning and development. Applied Developmental | |
Science, 24(2), pp.97-140. | |
Dieh, M., Lindgren, J. and Leffler, E., 2015. The Impact of Classification and Framing in | |
Entrepreneurial Education: Field Observations in Two Lower Secondary Schools. Universal | |
Journal of Educational Research, 3(8), pp.489-501. | |
Eames, C., and Birdsall, S. (2019). Teachers’ perceptions of a co-constructed tool to enhance their | |
pedagogical content knowledge in environmental education. Environmental Education | |
Research, 1-16. doi: 10.1080/13504622.2019.1645445 | |
Ellis, R. and Hogard, E., 2018. Handbook of Quality Assurance for University Teaching, Routledge, | |
London. | |
Ferguson, R. (2012). Learning analytics: drivers, developments and challenges. International Journal of | |
Technology Enhanced Learning, 4(5/6), 304. doi: 10.1504/ijtel.2012.051816 | |
Fram, S., and Margolis, E. (2011). Architectural and built environment discourses in an educational | |
context: the Gottscho and Schleisner Collection. Visual Studies, 26(3), 229-243. doi: | |
10.1080/1472586x.2011.610946 | |
Fraser, S., 2019. Understanding innovative teaching practice in higher education: a framework for | |
reflection. Higher Education Research & Development, 38(7), pp.1371-1385. | |
French, A. and O'Leary, M. (2017). Teaching Excellence in Higher Education:|b Challenges, Changes | |
and the Teaching Excellence Framework. Bingley: Emerald Publishing Limited. | |
Gollub, J. (2002). Learning and understanding. Washington, DC: National Academy Press. | |
Gomis, K., Saini, M., Pathirage, C. and Arif, M., 2021. Enhancing learning opportunities in higher | |
education: best practices that reflect on the themes of the national student survey, UK. Quality | |
Assurance in Education, 29(2/3), pp.277-292. | |
Guo, F. and Shi, J. (2014). The relationship between classroom assessment and undergraduates' learning | |
within Chinese higher education system. Studies in Higher Education, 41(4), pp.642-663. | |
Hamari, J., Shernoff, D., Rowe, E., Coller, B., Asbell-Clarke, J. and Edwards, T., 2016. Challenging | |
games help students learn: An empirical study on engagement, flow and immersion in gamebased learning. Computers in Human Behavior, 54, pp.170-179. | |
Han, F. and Ellis, R. (2019). Identifying consistent patterns of quality learning discussions in blended | |
learning. The Internet and Higher Education, 40, pp.12-19. | |
Hmelo-Silver, C., Chernobilsky, E. and Jordan, R., 2008. Understanding collaborative learning | |
processes in new learning environments. Instructional Science, 36(5-6), pp.409-430. | |
Hue, M., and Li, W. (2008). Classroom Management: Creating a Positive Learning Environment (Hong | |
Kong teacher education). Hong Kong: Hong Kong University Press, HKU. | |
16 | |
Jabar, S. and Albion, P., 2016. Assessing the Reliability of Merging Chickering & Gamson’s Seven | |
Principles for Good Practice with Merrill’s Different Levels of Instructional Strategy | |
(DLISt7). ERIC Online Learning, 20(2). | |
Kohn Rådberg, K., Lundqvist, U., Malmqvist, J. and Hagvall Svensson, O. (2018). From CDIO to | |
challenge-based learning experiences – expanding student learning as well as societal impact?. | |
European Journal of Engineering Education, 45(1), pp.22-37. | |
Lai, K. (2011). Digital technology and the culture of teaching and learning in higher education. | |
Australasian Journal of Educational Technology, 27(8). doi: 10.14742/ajet.892 | |
Lozano, J., Boni, A., Peris, J. and Hueso, A., 2012. Competencies in Higher Education: A Critical | |
Analysis from the Capabilities Approach. Journal of Philosophy of Education, 46(1), pp.132147. | |
Marshalsey, L, and Madeleine S. (2018). “Critical Perspectives of Technology-Enhanced Learning in | |
Relation to Specialist Communication Design Studio Education Within the UK and Australia.” | |
Research in Comparative and International Education 13 (1): 92–116. doi: | |
10.1177/1745499918761706 | |
Matthews, A. and Kotzee, B., 2019. The rhetoric of the UK higher education Teaching Excellence | |
Framework: a corpus-assisted discourse analysis of TEF2 provider statements. Educational | |
Review, pp.1-21. | |
McKnight, K., O'Malley, K., Ruzic, R., Horsley, M., Franey, J. and Bassett, K. (2016). Teaching in a | |
Digital Age: How Educators Use Technology to Improve Student Learning. Journal of Research | |
on Technology in Education, 48(3), pp.194-211. | |
Moore, D. and Fisher, T., 2017. Challenges of Motivating Postgraduate Built Environment Online | |
Teaching and Learning Practice Workgroups to Adopt Innovation. International Journal of | |
Construction Education and Research, 13(3), pp.225-247. | |
Office for Students, 2020. National Student Survey Results 2020. London, UK. | |
Ovbiagbonhia, A., Kollöffel, B. and Den Brok, P., 2020. Teaching for innovation competence in higher | |
education Built Environment engineering classrooms: teachers’ beliefs and perceptions of the | |
learning environment. European Journal of Engineering Education, 45(6), pp.917-936. | |
Santos, G., Marques, C., Justino, E. and Mendes, L., 2020. Understanding social responsibility’s | |
influence on service quality and student satisfaction in higher education. Journal of Cleaner | |
Production, 256, p.120597. | |
Scott, L. (2020). Engaging Students' Learning in the Built Environment Through Active Learning. | |
Claiming Identity Through Redefined Teaching in Construction Programs, pp.1-25. | |
Staff and Educational Development Association, (2013). Measuring The Impact Of The UK | |
Professional Standards Framework For Teaching And Supporting Learning (UKPSF). Higher | |
Education Academy. | |
Tirrell, T., and Quick, D. (2012). Chickering's Seven Principles of Good Practice: Student Attrition in | |
Community College Online Courses. Community College Journal of Research and Practice, | |
36(8), 580-590. doi: 10.1080/10668920903054907 | |
Tsiligiris, V. and Hill, C., 2019. A prospective model for aligning educational quality and student | |
experience in international higher education. Studies in Higher Education, 46(2), pp.228-244. | |
Uchiyama, K. and Radin, J., 2008. Curriculum Mapping in Higher Education: A Vehicle for | |
Collaboration. Innovative Higher Education, 33(4), pp.271-280. | |
17 | |
Van Schaik, P., Volman, M., Admiraal, W., and Schenke, W. (2019). Approaches to co-construction of | |
knowledge in teacher learning groups. Teaching And Teacher Education, 84, 30-43. doi: | |
10.1016/j.tate.2019.04.019 | |
Waheed, H., Hassan, S., Aljohani, N., Hardman, J., Alelyani, S. and Nawaz, R., 2020. Predicting | |
academic performance of students from VLE big data using deep learning models. Computers | |
in Human Behavior, 104, p.106189. | |
Welzant, H., Schindler, L., Puls-Elvidge, S., & Crawford, L. (2015). Definitions of quality in higher | |
education: A synthesis of the literature. Higher Learning Research Communications, 5 (3). | |
doi:10.18870/hlrc.v5i3.244 | |
18 | |
English Language Teaching; Vol. 11, No. 1; 2018
ISSN 1916-4742  E-ISSN 1916-4750
Published by Canadian Center of Science and Education
The Affection of Student Ratings of Instruction toward EFL Instructors
Yingling Chen
Center for General Education, Oriental Institute of Technology, New Taipei City, Taiwan
Correspondence: Yingling Chen, Center for General Education, Oriental Institute of Technology, New Taipei City, Taiwan. Tel: 886-909-301-288. E-mail: [email protected]
Received: October 27, 2017    Accepted: December 3, 2017    Online Published: December 5, 2017
doi: 10.5539/elt.v11n1p52    URL: http://doi.org/10.5539/elt.v11n1p52
Abstract | |
Student ratings of instruction can be a valuable indicator of teaching because the quality measurement of | |
instruction identifies areas where improvement is needed. Student ratings of instruction are expected to evaluate | |
and enhance the teaching strategies. Evaluation of teaching effectiveness has been officially implemented in | |
Taiwanese higher education since 2005. Therefore, this research investigated Taiwanese EFL university | |
instructors’ perceptions toward student ratings of instruction and the impact of student ratings of instruction on | |
EFL instructors' classroom teaching. The data for this quantitative study were collected through a 21-item questionnaire, and 32
qualified participants were selected from ten universities in the northern part of Taiwan. The results indicate
that EFL instructors' perceptions and experiences toward student ratings of instruction affect their approach to
teaching, but EFL instructors do not prepare lessons based on the results of student ratings of instruction.
Keywords: student ratings of instruction, EFL, instruction | |
1. Introduction | |
The Ministry of Education (MOE) authorizes universities and colleges to determine whom to hire in the college | |
system according to the Taiwanese College Regulation 21. Moreover, the MOE (2005) concluded that | |
developing a system for teacher evaluation is necessary in each college and university. As a result, schools have | |
more power in deciding the qualification of educators. Wolfer and Johnson (2003) emphasized that one must be | |
clear about the purpose of a course evaluation feedback since it may determine the kind of data required. | |
Moreover, teacher evaluation should include the key element for not only promotion, tenure, and reward, but | |
also performance review and teaching improvement. In addition, student ratings of instruction have become an
essential element in evaluating teachers' success and ensuring the quality of teaching. Students' opinions are
fundamental sources for informing the quality of instruction in higher education. Murray (2005) stated that more
than 90% of U.S. colleges and universities pay attention to student evaluation of teachers in order to assess
teaching. In addition, about 70% of college instructors recognize the need for student input in assessing their
classroom instruction (Obenchain, Abernathy, & Wiest, 2001). Teacher decision making toward curriculum
design and teacher expectancy of student achievement have a significant influence on the results of curricular
and instructional decisions. However, most of the research focuses on how to assist and improve students' learning
through SRI, how to improve teaching effectiveness through SRI, issues with SRI, or student achievement in relation to
SRI; few studies address how instructors use the feedback from SRI or how instructors improve teaching
through the results of SRI (Beran, Violato, Kline, & Frideres, 2005). Accordingly, instructors' perceptions of
student ratings offer valuable insight for improving teacher performance, because
understanding how instructors are affected by SRI is essential.
1.1 Literature | |
1.1.1 The Use of Student Ratings of Instruction | |
The implementation of SRI at colleges and universities has not only been employed for the purpose of improving
teaching effectiveness but has also been used for personnel decisions such as tenure. SRI is widely practiced in
colleges and universities across Canada and the United States (Greenwald, 2002). In fact, student ratings are not a
new topic in higher education. Researchers Remmers and Brandenburg published their first research studies on
student ratings at Purdue University in 1927. Also, Guthrie (1954) stated that students at the University of
Washington filled out the first student rating forms seventy-five years ago. Nevertheless, SRI is a pertinent topic | |
for researchers to study because students still fill out the evaluation forms which produce vital information on | |
teaching quality. Administrators take SRI into consideration to determine the effectiveness of instruction and | |
personnel promotions as well. Sixty-eight percent of American colleges reported using student ratings in Seldin's
1983 survey, while 86 percent of American colleges reported using student rating surveys in 1993 (Seldin, 1993a).
Seldin's (1993b) surveys reflected the growing use of student ratings as an instrument for teaching evaluation in
higher education.
1.1.2 Student Rating of Instruction in Higher Education in Taiwan | |
“During the 1990s, most education systems in the English-speaking world moved towards some notion of | |
performance management” (West-Burnham, O’Neill, & Bradbury, 2001, p. 6). The widespread use of the | |
performance management concept contributes to the education system, which focuses on specific measurement | |
of classroom instruction delivery. The quality of teaching influences students not only academically, but also | |
psychologically. With regard to the value of teacher evaluation, the Taiwanese Ministry of Education has | |
mandated that colleges and universities monitor the quality of teaching because the quality of teachers and | |
instructions impact students’ academic achievement and the reputation of the school. Chang (2002) declared that | |
approximately 76 percent of public universities and 85 percent of private universities have implemented SRI in | |
Taiwan. As a result, teacher evaluation has become an instrument for examining instructors’ classroom | |
presentation. Liu (2011) stated that teachers’ classroom presentation is equivalent to teacher appraisal and | |
teacher performance. Furthermore, Liu (2011) found the following: | |
Since 28th December 1995, the 21st Regulation of the University Act stated that a college should formulate a | |
teacher evaluation system that decides on teacher promotion, and continues or terminates employment based on | |
college teachers' achievement in teaching, research and so forth. (p. 4) SRI has been widely accepted by
universities and colleges in Taiwan and has become a practical tool for enhancing teaching performance and
an effective trigger for examining factors that relate to educational improvement.
SRI stimulates organizational-level effects by providing information from evaluation practice, such as diagnosing
organizational problems. SRI also raises environmental-level effects such as hiring, retention, and dismissal, which are
highly public acts justified through the evaluation process (Cross, Dooris, & Weinstein, 2004).
1.2 State Hypotheses and Their Correspondence to Research Design | |
1.2.1 Null Hypotheses | |
The independent variable in this study was SRI. The dependent variables were northern Taiwanese EFL
university instructors' perceptions and the influence of SRI on their classroom instruction. The
null hypotheses were designed to test the association between EFL instructors' perceptions and SRI and the
association between SRI and classroom instruction. A Chi-Square test was used to test the
associations of the null hypotheses, and a Chi-Square probability of .05 or less was used to reject them.
The following hypotheses addressed the research questions:
1.2.2 Research Questions | |
1). What are Taiwanese EFL university instructors’ perceptions toward SRI? | |
H10: No association exists between EFL university instructors’ perceptions and SRI | |
(at the .05 level of significance). | |
2). What impact does SRI have on EFL university instructors’ classroom instructions? | |
H20: No association exists between the impact of SRI and classroom instruction | |
(at the .05 level of significance). | |
2. Method | |
2.1 Participant | |
All participating EFL instructors held master's or doctoral degrees from foreign universities or local
Taiwanese universities. The subjects' ages ranged from thirty-five to seventy years old, and each participating
experienced instructor had received at least three years of SRI results.
2.2 Sampling Procedures | |
The researcher used a random sampling strategy to recruit participants from 10 universities in the northern part of
Taiwan for the quantitative data. The key to random sampling is that each university in the population has an
equal probability of being selected in the sample (Teddlie & Yu, 2007). Using a random sampling strategy helped
the researcher prevent biases from being introduced in the sampling process by drawing names or numbers.
Thirty-two Taiwanese university EFL instructors were recruited from ten universities in the northern part of Taiwan.
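As an illustration only (not part of the original study's procedure), the equal-probability draw described above can be sketched in a few lines of Python; the sampling frame and the seed below are hypothetical placeholders.

import random

# Hypothetical sampling frame: the 28 northern universities offering an
# English or applied foreign language major (placeholder names).
sampling_frame = [f"University_{i:02d}" for i in range(1, 29)]

# Draw 10 universities without replacement; each university in the frame
# has the same probability of being selected (cf. Teddlie & Yu, 2007).
random.seed(2017)  # arbitrary seed so the draw can be reproduced
selected_universities = random.sample(sampling_frame, k=10)
print(selected_universities)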
2.3 Sample | |
The target participants for the quantitative phase were thirty-two Chinese speaking English instructors from 10 | |
northern universities. All participating EFL instructors held master's or doctoral degrees from foreign
universities or local Taiwanese universities, and each participating experienced instructor had received at least
three years of SRI results.
2.4 Measurement
The quantitative data were collected through a demographic survey and a questionnaire on EFL instructors'
perceptions of SRI. The perception questionnaire and the demographic questionnaire together were used to
explain the results of the quantitative data.
2.5 Research Design | |
The researcher randomly selected, by drawing from twenty-eight schools, ten northern universities that offer an
English or applied foreign language major.
2.6 Data Analysis
The first step of data analysis was analyzing the quantitative data. The researcher assigned codes to all
questionnaires so that the confidentiality of participants' information was ensured. Then, the information was
transferred into the Statistical Package for the Social Sciences (SPSS 21.0). The researcher entered the
quantitative data into SPSS and ran a Cronbach's alpha test to create internally consistent, reliable, and valid
tests and questionnaires, enhancing the accuracy of the survey. Furthermore, a Chi-Square test, a non-parametric
test, was implemented for testing the hypotheses. Cooper and Schindler (2006) stated that non-parametric tests
are used to test the significance of ordinal and nominal data. A Chi-Square test was used to compare SRI to the
dependent variables and to determine whether an association exists between SRI and EFL instructors' perceptions.
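For readers who do not use SPSS, a minimal sketch of the two computations described above (a Cronbach's alpha estimate and a Chi-Square test of association) is given below in Python. The response matrix and the contingency table are invented placeholders, not the study's data, and numpy and scipy are assumed to be available.

import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical Likert-scale responses: rows = participants, columns = items.
responses = np.array([
    [2, 3, 3, 4, 3, 2],
    [1, 2, 3, 3, 4, 2],
    [2, 2, 4, 3, 3, 1],
    [3, 3, 3, 4, 4, 2],
])

# Cronbach's alpha: k/(k-1) * (1 - sum of item variances / variance of total scores).
k = responses.shape[1]
item_variances = responses.var(axis=0, ddof=1)
total_variance = responses.sum(axis=1).var(ddof=1)
alpha = (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

# Chi-Square test of association on a hypothetical contingency table
# (e.g., perception category by response category); reject H0 when p <= .05.
contingency = np.array([[10, 15, 7],
                        [5, 20, 7]])
chi2, p, dof, expected = chi2_contingency(contingency)
print(f"alpha = {alpha:.2f}, chi2 = {chi2:.2f}, p = {p:.3f}, reject H0: {p <= 0.05}")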
3. Results | |
The results are reported in two main parts: (1) background information on the quantitative survey participants,
and (2) a Chi-Square test comparing SRI to the dependent variables.
3.1 Gender and Age | |
Table 1 showed the distribution of gender and age for participants who taught in the department of English and | |
Applied Foreign Language in the universities. Among the 32 EFL university instructor participants, 57% (n= 19) | |
of the participants were female and 43% (n=13) of the participants were male. In addition, 3% (n=1) of
participants were between 25-29 years old, 29% of the participants (n=9) were between 30-39 years old, 37% of | |
the participants (n=12) were between 40-49 years old, 25% of the participants (n=8) were between 50-59 years | |
old, and 6% of the participants (n=2) were between 60-69 years old. | |
Table 1. Frequency distribution of gender and age

Gender    Frequency    Percentage
Female    19           57%
Male      13           43%
Total     32           100%

Age       Frequency    Percentage
25-29     1            3%
30-39     9            29%
40-49     12           37%
50-59     8            25%
60-69     2            6%
70+       0            0
Total     32           100%

Note. n=32.
3.2 Years of Teaching | |
Table 2 reported the distribution of years of teaching for the participants who taught in the department of English | |
and applied foreign language in the universities under EFL settings. The years of teaching varied from
participant to participant. The distribution was as follows: 1 participant with 1-3 years of experience, 2 with 4-6
years, 7 with 7-10 years, 5 with 11-15 years, 6 with 16-20 years, 4 with 21-25 years, 6 with 26-30 years, and 1
with more than 30 years of teaching experience.
Table 2. Frequency distribution of years of teaching

Years of Teaching     Frequency    Percentage
Less than 1 year      0            0
1-3 years             1            3.1%
4-6 years             2            6.2%
7-10 years            7            21.9%
11-15 years           5            16%
16-20 years           6            19%
21-25 years           4            12%
26-30 years           6            19%
More than 30 years    1            3.1%
Total                 32           100%

Note. n=32.
3.3 EFL Instructors’ Highest Level of Education | |
Table 3 showed the distribution of EFL instructors' highest level of education. Among the 32 participants, 24
held doctoral degrees and 8 held master's degrees. Furthermore, 26 participants earned their highest level of
formal education in a foreign country and 6 earned theirs in Taiwan.
Table 3. Frequency distribution of the educational background

Highest Degree     Frequency    Percentage
Master Degree      8            25%
Doctoral Degree    24           75%
Total              32           100%

Foreign Degree     26           81.25%
Domestic Degree    6            18.75%
Total              32           100%

Note. n=32.
3.4 Employment Status | |
Table 4 showed the employment status of the 32 participants. Twelve participants (38%) held permanent
employment with on-going contracts without fixed end-points before the age of retirement, 10 (31%) held
fixed-term contracts for a period of more than one school year, and 10 (31%) held fixed-term contracts for a
period of one school year or less. In the meantime, 12% of the participants (n=4) were part-time instructors and
88% of the participants (n=28) were full-time instructors.
Table 4. Descriptive statistics for participants' employment status

Employment status (1)                                Participants/Count    Percentage
Permanent employment                                 12                    37.5%
Fixed term contract of more than one school year     10                    31.25%
Fixed term contract of one school year or less       10                    31.25%
Total                                                32                    100%

Employment status (2)                                Participants/Count    Percentage
Part-time employment                                 8                     25%
Full-time employment                                 24                    75%
Total                                                32                    100%

Note. n=32.
3.5 Personal Development | |
Table 5 showed the personal development status of the 32 participants. Seven participants (22%) who held
master's degrees were pursuing doctoral degrees related to their professional fields in Taiwan at the time of the
study, while 25 participants (78%) held their original degrees without pursuing further degrees.
Table 5. Descriptive statistics for personal development status

Personal development status              Participants/Count                                         Percentage
Pursuing a doctoral degree at present    7 (in Education, TESL, Linguistics, and English fields)    22%
Holding the original degree              25                                                         78%
Total                                    32                                                         100%

Note. n=32.
3.6 Internal Reliability | |
The first section contained six Likert-scale items (items 1-6). The researcher assessed internal reliability with a
pilot test of item analysis to obtain the Cronbach's alpha coefficient. Cronbach's alpha was used to determine the
reliability of the 21 items measuring Taiwanese EFL university instructors' perceptions toward student ratings of
instruction. The subscales were (1) EFL instructors' perceptions toward SRI (six items, Cronbach's alpha .71)
and (2) the influence of SRI on EFL instructors' classroom instruction (fifteen items, Cronbach's alpha .74)
(see Table 6). During data collection, participants were verified as part-time or full-time EFL university
instructors. The survey packet was distributed at the office, and after each participant had completed the
questionnaires, the researcher reviewed the packet for completeness. Fraenkel and Wallen (2003) defined
validity as the degree to which data support any inferences that a researcher makes based on the evidence
collected with a specific instrument. Content validity is defined as the level to which an instrument can be
duplicated under the same conditions with the same participants (Sproull, 2002).
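For reference, the Cronbach's alpha coefficients reported here follow the standard textbook formula (this is the general definition, not a formula stated by the author): for k items with item variances \sigma^2_{Y_i} and total-score variance \sigma^2_X,

\alpha = \frac{k}{k-1}\left(1 - \frac{\sum_{i=1}^{k} \sigma^2_{Y_i}}{\sigma^2_X}\right)

so the reported values of .71 and .74 summarise how consistently the six-item and fifteen-item subscales behave as single scales.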
Table 6. Reliability statistics of pilot SRI

Variables                                                         N of Items    Cronbach's Alpha
EFL instructors' perceptions toward SRI                           6             .71
The influence of SRI on EFL instructors' classroom instruction    15            .74
3.7 Rating of Instructions | |
A preliminary analysis was executed to determine Taiwanese EFL university instructors’ perceptions toward | |
student rating of instruction. Based on primary analysis in Table 7, item 1 reported that 25% of the participants | |
strongly disagreed and 69% of the participants disagreed with the positive attitude toward SRI; 6% of the | |
participants were neutral. In item 2, 59% of the participants disagreed with holding enthusiastic and confident | |
perceptions about the results of SRI. Twenty-two percent of the participants were neutral; 16% of the participants | |
agreed and 3% of the participants strongly agreed with having enthusiasm and confidence toward the result of | |
SRI. In item 3, 41% of the participants disagreed that they spend more time preparing their classes according to | |
SRI results. Fifty-three percent of the participants were neutral and 6% of the participants agreed that they spent | |
more time preparing courses based on SRI results. Additionally, in item 4, 31% of the participants disagreed that | |
being open to students’ opinions would help receive more positive results of SRI. Forty-four percent of the | |
participants were neutral; 22% of the participants agreed and 3% of the participants strongly agreed that being | |
open to students’ opinions would help receive more positive result of SRI. In item 5, 6% of the participants | |
disagreed that they care about the quality of SRI; 41% of the participants were neutral; 53% of the participants
agreed and 16% of the participants strongly agreed that they cared about the quality of
SRI. In item 6, 6% of the participants strongly disagreed and 47% of the participants disagreed that they were | |
always satisfied with the results of SRI. Forty-one percent of the participants were neutral and 6% of the | |
participants agreed that they were always satisfied with the result of SRI. | |
Table 7. Mean, standard deviation, and percentage of Taiwanese EFL university instructors' perceptions toward SRI
1. I have positive attitude toward SRI: Strongly Disagree 25%, Disagree 68.8%, Neutral 6.3%, Agree 0%, Strongly Agree 0%; M=1.81, SD=.535
2. I am enthusiastic and confident about the result of SRI: Strongly Disagree 0%, Disagree 59.4%, Neutral 21.9%, Agree 15.6%, Strongly Agree 3.1%; M=2.68, SD=.871
3. I spend more time preparing my class according to SRI results: Strongly Disagree 0%, Disagree 40.6%, Neutral 53.1%, Agree 6.3%, Strongly Agree 0%; M=2.66, SD=.602
4. I think if I am more open to students' opinions, the result will be more positive: Strongly Disagree 0%, Disagree 31.3%, Neutral 43.8%, Agree 21.9%, Strongly Agree 3.1%; M=2.97, SD=.822
5. I care about the quality of SRI: Strongly Disagree 0%, Disagree 6.3%, Neutral 40.6%, Agree 53.1%, Strongly Agree 0%; M=3.47, SD=.621
6. I am always satisfied with the result of SRI: Strongly Disagree 6.3%, Disagree 46.9%, Neutral 40.6%, Agree 6.3%, Strongly Agree 0%; M=2.47, SD=.718
Note. M=Mean; SD=Standard Deviation.
3.8 Descriptive Analyses of the Influence of SRI on Taiwanese University EFL Instructors | |
According to the analysis in Table 8, for item 7, 6% of the instructors strongly disagreed and 43.8% disagreed
that SRI was an effective instrument for improving English instructional delivery; 41% were neutral, and 9%
agreed that SRI was an effective instrument for improving English instructional delivery. In item 8, 16% of the
participants strongly disagreed and the majority (56%) disagreed that SRI provides authentic information in
developing effective English lessons, while 28% were neutral.
Furthermore, in item 9, 56% of the instructors strongly disagreed and 34% disagreed that they became more
supportive in assisting students' learning after receiving the result of EFL SRI; 9% were neutral. In item 10, 13%
of the instructors strongly disagreed and 41% disagreed that the result of SRI provided positive encouragement
for their classes; 31% were neutral and 6% agreed that the results of SRI provided positive encouragement for
their classes. Moreover, item 11 was worded in reverse: 9% of the participants strongly disagreed and 50%
disagreed that criticism from the SRI did not influence their English teaching performance; 25% were neutral
and 9% agreed that criticism from the SRI did not influence their English teaching performance.
In item 12, 6% of the participants strongly disagreed and 59% disagreed that EFL SRI was an efficient
communicative bridge between their students and them; 25% were neutral, and only 9% agreed that EFL SRI is
an efficient communicative bridge between their students and them. In item 13, 6% of the participants disagreed
that students' feedback gave them ideas for teaching students with special needs, 56% were neutral, and 37%
agreed.
In item 14, 34% of the participants disagreed that students' feedback improves their English classroom
management, 37.5% were neutral, and 28% agreed. Moreover, item 15 was worded in reverse: 3% of the
participants strongly disagreed and 34% disagreed that they would not change their knowledge and
understanding of English instructional practices after receiving the results of EFL SRI; 46% were neutral and
16% agreed.
In item 16, 13% of the participants strongly disagreed and the majority (63%) disagreed that students provided
trustworthy information when evaluating the effectiveness of English classroom instruction; 22% were neutral,
and only 3% agreed. In item 17, 41% of the participants disagreed that students' academic achievements
influenced the result of SRI; 31% were neutral, 25% agreed, and 3% strongly agreed.
In item 18, 28% of the instructors disagreed that if they improved the quality of their English teaching, they
received higher ratings from students; 56% were neutral and 16% agreed. In item 19, 13% of the instructors
disagreed that if they received unpleasant rating scores in the past, they changed their English teaching
strategies; 56% were neutral, 25% agreed, and 6% strongly agreed.
In item 20, 9% of the participants disagreed that after they changed their English teaching strategies, they
received better scores of EFL SRI; 81% were neutral and 9% agreed. In addition, item 21 was worded in
reverse: 6% of the participants strongly disagreed and 25% disagreed that unpleasant scores of EFL SRI would
not decrease their passion toward teaching; 13% were neutral, while 25% agreed and 6% strongly agreed.
Table 8. Mean, standard deviation and percentage of the influence of SRI on Taiwanese university EFL instructors' classroom instruction
7. EFL SRI is an effective instrument for improving English instructional delivery: Strongly Disagree 6.3%, Disagree 43.8%, Neutral 40.6%, Agree 9.4%, Strongly Agree 0%; M=2.53, SD=.761
8. Overall, EFL SRI provides me authentic information in developing effective English lessons: Strongly Disagree 15.6%, Disagree 56.3%, Neutral 28.1%, Agree 0%, Strongly Agree 0%; M=2.13, SD=.660
9. I become more supportive in assisting student learning after receiving the result of EFL SRI: Strongly Disagree 56.3%, Disagree 34.4%, Neutral 9.4%, Agree 0%, Strongly Agree 0%; M=2.53, SD=.671
10. The result of EFL SRI provides positive encouragement for my class: Strongly Disagree 12.5%, Disagree 40.6%, Neutral 31.3%, Agree 15.6%, Strongly Agree 0%; M=2.50, SD=.916
11. Criticism from the SRI does not influence my English teaching performance: Strongly Disagree 9.4%, Disagree 50.4%, Neutral 25.0%, Agree 9.4%, Strongly Agree 0%; M=2.44, SD=.840
12. EFL SRI is an efficient communicative bridge between my students and me: Strongly Disagree 6.3%, Disagree 59.4%, Neutral 25.0%, Agree 9.4%, Strongly Agree 0%; M=2.38, SD=.751
13. Students' feedback gives me ideas for teaching students with special needs: Strongly Disagree 0%, Disagree 6.3%, Neutral 56.3%, Agree 37.5%, Strongly Agree 0%; M=3.31, SD=.592
14. Students' feedback improves my English classroom management: Strongly Disagree 0%, Disagree 34.4%, Neutral 37.5%, Agree 28.1%, Strongly Agree 0%; M=2.94, SD=.801
15. I will not change the knowledge and understanding of English instructional practices after receiving the result of EFL SRI: Strongly Disagree 3.1%, Disagree 34.4%, Neutral 45.9%, Agree 15.6%, Strongly Agree 0%; M=2.75, SD=.762
16. Students provide trustworthy information when evaluating the effectiveness of English classroom instruction: Strongly Disagree 50%, Disagree 40.6%, Neutral 9.4%, Agree 0%, Strongly Agree 0%; M=1.59, SD=.665
17. Students' academic achievements influence the result of SRI: Strongly Disagree 0%, Disagree 28.1%, Neutral 56.3%, Agree 15.6%, Strongly Agree 3.1%; M=2.91, SD=.893
18. If I improve the quality of my English teaching, I will receive higher ratings from students: Strongly Disagree 0%, Disagree 28.1%, Neutral 46.3%, Agree 21.9%, Strongly Agree 3.1%; M=2.88, SD=.660
19. I received an unpleasant rating score in the past, so I changed my English teaching strategies: Strongly Disagree 0%, Disagree 12.5%, Neutral 56.3%, Agree 25%, Strongly Agree 6.3%; M=3.00, SD=.803
20. After I changed my English teaching strategies, I received better scores of EFL SRI: Strongly Disagree 0%, Disagree 63.4%, Neutral 50.3%, Agree 3.4%, Strongly Agree 9.4%; M=3.47, SD=.761
21. Unpleasant scores of EFL SRI will not decrease my passion toward teaching: Strongly Disagree 18.8%, Disagree 56.3%, Neutral 18.8%, Agree 3.1%, Strongly Agree 3.1%; M=2.16, SD=.884
Note. M=Mean; SD=Standard Deviation.
3.9 The Frequency Distribution of Years of Teaching Experience in Four Groups
Table 9 presents the frequency distribution of years of teaching experience in four groups. The researcher
divided the participants into four groups based on their years of teaching experience. Group 1 represented
participants who had been teaching English for 1-6 years (n=3). Group 2 comprised participants who had been
teaching English for 7-15 years (n=12). Group 3 comprised participants who had been teaching English for
16-25 years (n=10). Group 4 comprised participants who had been teaching English for more than 26 years
(n=7).
Table 9. Frequency distribution of years of teaching experience in four groups

Group                    Frequency    Percentage    Valid Percentage    Cumulative Percent
1 (1-6 years)            3            9.4           9.4                 9.4
2 (7-15 years)           12           37.5          37.5                46.9
3 (16-25 years)          10           31.3          31.3                78.1
4 (26 years and more)    7            21.9          21.9                100.0
Total                    32           100.0         100.0

Note. n=32.
3.10 The Means of the Influences of SRI on Taiwanese EFL University Instructors Based on Their Years of | |
Teaching Experience | |
Four open-ended interview questions (Q1, Q3, Q7, and Q8) reflected the first part of the six quantitative survey
items, which were designed to investigate EFL instructors' perceptions toward SRI. The survey items were: (1)
in general, I have a positive attitude toward SRI; (2) I am enthusiastic and confident about the result of SRI; (3)
I spend more time preparing my class according to SRI results; (4) I think if I am more open to students'
opinions, the results will be more positive; (5) I care about the quality of SRI; and (6) I am always satisfied with
the result of SRI. Based on the analysis of participants' interview transcripts, two themes, four subthemes, and
four issues emerged to answer the first research question. The findings for the first research question are
structured in Table 10.
Table 10. Structure of the qualitative findings: Research Question 1

Themes:
Theme 1: The university EFL instructors' perceptions of SRI
Theme 2: The role of SRI

Subthemes and issues:
Experiences of receiving the results of SRI (issue: negative)
Implementation of SRI in EFL classroom (issue: objective)
Opinions after receiving the result of EFL SRI (issue: the purpose of SRI)
Suggestions after receiving the result of EFL SRI (issue: the real situation of SRI in universities in Taiwan)
3.11 Quantitative Findings: Null Hypothesis 1
H10: No association exists between EFL university instructors’ perceptions and SRI. | |
Table 11 reported that the researcher failed to reject the first null hypothesis which stated that there was not an | |
association between EFL university instructors’ perceptions and student rating of instructions based on a | |
significance level of .149 in item 4 (EFL instructors who become more open to SRI receive better ratings). The
significance level of .804 in item 5 (EFL instructors care about the quality of SRI) accepted the first null | |
hypothesis. Besides, the first null hypothesis, which stated that there was not an association between EFL | |
university instructors’ perceptions and student rating of instructions was rejected based on a significance level | |
of .000 in item 1 (EFL instructors have positive attitude toward SRI). The significance level of .000 in item 2 | |
(EFL instructors are confident in the results of SRI) rejected the first null hypothesis. Also, the first null | |
hypothesis was rejected based on the significance level of .003 in item 3 (EFL instructors prepare lessons based | |
on the results of SRI). The significance level of .000 in item 6 (EFL instructors are satisfied with the results of SRI)
rejected the null hypothesis. As hypothesized, Cillessen and Lafontana (2002) stated that teachers’ perceptions | |
affect their behavior and classroom practices. The more teachers learn about their students, the more they are | |
able to design effective experiences that elicit real learning. Borg (2006) noted that understanding teacher | |
perception is central to the process of understanding teaching. Research also indicated that teachers who are | |
willing to develop their teaching skills were open-minded in listening to feedback from their students (Chang, | |
Wang, & Yong, 2003). | |
Table 11. The summary of chi-square testing for Null Hypothesis 1
1. SRI is an effective instrument for EFL instructors to improve instructional delivery. Sig = .000; Reject.
2. The results of SRI provide EFL instructors authentic information in developing lessons. Sig = .000; Reject.
3. EFL instructors become more supportive in students' learning after receiving the results of SRI. Sig = .003; Reject.
4. The results of SRI provide positive encouragement for EFL instructors. Sig = .149; Accept.
5. Criticism from SRI does not influence EFL instructors' teaching performance. Sig = .804; Accept.
6. SRI is an effective communicative bridge between EFL instructors and students. Sig = .000; Reject.
Note. A P-value of .05 or less was used to reject the null hypotheses. | |
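The accept/reject column in Tables 11 and 12 follows mechanically from the decision rule stated in the note (reject when p is .05 or less). A minimal Python illustration of that mapping, using the six significance levels reported in Table 11, is:

# Significance levels reported in Table 11 (items 1-6, in order).
p_values = [.000, .000, .003, .149, .804, .000]
decisions = ["Reject" if p <= 0.05 else "Accept" for p in p_values]
print(decisions)  # ['Reject', 'Reject', 'Reject', 'Accept', 'Accept', 'Reject']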
3.12 Quantitative Findings: Null Hypothesis 2 | |
H20: No association exists between the impact of SRI and classroom instruction. | |
Table 12 reported the summary of the Chi-Square test of second null hypothesis, which stated that there was no | |
association between the impact of SRI and classroom instruction. The researcher failed to reject the second null | |
hypothesis which stated that there was not an association between SRI and classroom instruction based on a | |
significance level of .080 in item 10 (The results of SRI provide positive encouragement for EFL instructors) and | |
a significance level of .102 in item 14 (SRI improves EFL instructors’ classroom management). The second null | |
hypothesis, which stated that there was not an association between SRI and classroom instruction was rejected | |
based on a significance level of .002 in item 7 (EFL instructors have positive attitude toward SRI), a significance | |
level of .016 in item 8, a significance level of .005 in item 9, a significance level of .004 in item 11, a | |
significance level of .000 in item 12, a significance level of .002 in item 13 (SRI gives EFL instructors ideas for | |
teaching students with special needs), a significance level of .002 in item 15 (EFL instructors will not change the
knowledge and understanding of instructional practices after receiving the results of SRI), a significance level | |
of .001 in item 16 (The results of SRI provide trustworthy information for EFL instructors), a significance level | |
of .021 in item 17 (Students’ achievements influence the results of SRI), a significance level of .016 in item 18 | |
(If I improve the quality of the English instruction, I will receive higher ratings from students), a significance | |
level of .006 in item 19 (I received an unpleasant rating score in the past, so I changed my English teaching | |
strategies), a significance level of .001 in item 20 (After I changed English teaching strategies, I received better | |
results of SRI), and a significance level of .000 in item 21 (Unpleasant scores of SRI will not decrease my | |
passion toward English teaching). | |
The current findings concurred with the hypothesis that an association existed between the influence of SRI and | |
classroom instruction. Teacher evaluation provided information to faculty about teaching effectiveness (Biggs, | |
2003; Ramsdem, 2003; Yorke, 2003) and to students about how they can improve their learning and how well | |
they are doing in the course (Carless et al., 2007; Gibbs 2006). Liu (2011) stated that teachers’ classroom | |
presentation is equivalent to teacher appraisal and teacher performance. Furthermore, “since 28th December | |
1995, the 21st Regulation of the University Act states that a college should formulate a teacher evaluation system | |
that decides on teacher promotion, and continues or terminates employment based on college teachers’ | |
achievement in teaching, research and so forth” (Liu, 2011, p. 4). “Universities started to formulate school | |
regulations based on the University Act and began executing teacher education. According to the official | |
documentation, 60% of the colleges stipulate that teachers have to pass the evaluation before receiving a | |
promotion” (Liu, 2011, p. 4). EFL instructors’ perceptions and experiences toward SRI will affect their approach | |
to teaching. In other words, assessment attitudes and experiences by EFL students will also influence their way | |
of learning. | |
Table 12. The results of chi-square testing for Null Hypothesis 2
7. SRI is an effective instrument for EFL instructors to improve instructional delivery. Sig = .002; Reject.
8. The results of SRI provide EFL instructors authentic information in developing lessons. Sig = .016; Reject.
9. EFL instructors become more supportive in students' learning after receiving the results of SRI. Sig = .005; Reject.
10. The results of SRI provide positive encouragement for EFL instructors. Sig = .080; Accept.
11. Criticism from SRI does not influence EFL instructors' teaching performance. Sig = .004; Reject.
12. SRI is an effective communicative bridge between EFL instructors and students. Sig = .000; Reject.
13. SRI gives instructors ideas for teaching students with special needs. Sig = .002; Reject.
14. SRI improves EFL instructors' classroom management. Sig = .102; Accept.
15. EFL instructors will not change the knowledge and understanding of instructional practices after receiving the results of SRI. Sig = .002; Reject.
16. SRI provides trustworthy information for EFL instructors. Sig = .001; Reject.
17. Students' achievements influence the results of SRI. Sig = .021; Reject.
18. If I improve the quality of the English instruction, I will receive higher ratings from students. Sig = .016; Reject.
19. I received an unpleasant rating score in the past, so I changed my English teaching strategies. Sig = .006; Reject.
20. After I changed English teaching strategies, I received better results of SRI. Sig = .001; Reject.
21. Unpleasant scores of SRI will not decrease my passion toward English teaching. Sig = .000; Reject.
Note. A P-value of .05 or less was used to reject the null hypotheses. | |
4. Discussions | |
The results uncovered that EFL instructors' teaching attitudes and motivation were diminished simply because
teachers overwhelmingly expressed that SRI did not provide them with useful feedback on their classroom
performance. EFL instructors were not willing to take risks in assigning work, carrying out tests, or addressing
needs in supporting students' learning. EFL instructors hardly used the results of SRI to make important
decisions for improving the quality of instruction. In fact, SRI was considered an indicator of instructors'
performance when it came time to dismiss them. The findings highlighted the northern Taiwanese EFL
instructors' perceptions toward SRI and the influence of SRI on EFL instructors' classroom
instruction. Faculty members were more likely to disagree on the effectiveness of SRI and pointed out its
growing issues. Broadly negative feedback, accompanied by only a small amount of objective feedback, may
provide indicators of the different value perceptions and influences adopted by northern Taiwanese EFL
university instructors. As the quantitative results showed, 87% of the items from the second part of the survey,
on the influence of SRI on EFL university instructors, had associations between SRI and classroom instruction.
It was interesting to note that EFL instructors seemed to distrust the results of SRI. A possible explanation of the
negative perceptions is that EFL instructors were sensitive to the fact that the results of SRI were considered for
tenure, promotion, and employment status, which reflects Cross, Dooris and Weinstein's (2004) theory that SRI
raises environmental-level effects such as hiring, retention, and dismissal, which are highly public acts justified
through the evaluation process. Students' perceptions of SRI may differ from those of faculty members because
students may not realize how the results of teacher evaluation may be used by administrators; as a result,
students may not know the consequences their ratings have for teachers. Administrators and educators need to
understand the factors that influence EFL instructors' classroom instruction so that they will be able to develop
a reasonable environment for merit raises, promotion, and tenure decisions.
References | |
American Psychological Association. (1972). Ethical standards of psychologists. Washington, DC: American | |
Psychological Association. | |
Anderson, C. A., Gentile, D. A., & Buckley, K. E. (2007). Violent video game effects on children and adolescents: | |
Theory, research and public policy. | |
Beran, T., Violato, C., Kline, D., & Frideres, J. (2005). The utility of student ratings of instruction for students, | |
faculty, and administrators: a consequential validity study. The Canadian Journal of Higher Education, 2, | |
49-70. | |
Biggs, J. (2003). Teaching for quality learning at university (2nd ed.). Buckingham: Society for Research into | |
Higher Education/Open University Press. | |
Borg, S. (2006). Teacher cognition and language education: Research and practice. London: Continuum. | |
Carless, D., Joughin, G., & Mok, M. M. C. (2007). Learning-oriented assessment: Principles and practice.
Assessment & Evaluation in Higher Education, 31, 395-398. | |
Chang, J. L, Wang, W. Z., & Yong, H. (2003). Measurement of Fracture Toughness of Plasma-Sprayed Al2O3 | |
Coatings Using a Tapered Double Cantilever Beam Method. Journal of the American Ceramic Society, | |
86(8), 1437-1439. https://doi.org/10.1111/j.1151-2916.2003.tb03491.x | |
Chang, T-S. (2002). Student ratings of instruction. Taipei, Taiwan: Yung Zhi. | |
Cillessen, A. H. N., & Lafontana, K.M. (2002). Children’s perceptions of popular and unpopular peers: A | |
multimethod | |
assessment. | |
Developmental | |
Psychology, | |
38(5), | |
635-647. | |
https://doi.org/10.1037/0012-1649.38.5.635 | |
Fraenkel, J. R., & Wallen, N. E. (2003). How to design and evaluate research in education (5th ed.). Boston: | |
McGraw-Hill. | |
Gibbs, G. (2006). How assessment frames student learning. In C. Bryan, & K. Clegg (Eds.), Innovative | |
Assessment in Higher Education (pp. 23-36). London: Routledge. | |
Cooper, D., & Schindler, P. S. (2006). Business research methods (9th ed.). New York: McGraw-Hill Companies, | |
Inc. | |
Greenwald, A. G. (2002). Constructs in student ratings of instructors. In H. I. Braun, D. N. Jackson, & D. E. | |
Wiley (Eds.), The role of constructs in psychological and educational measurement, 24(3), 193-202, New | |
York: Erlbaum. | |
Guthrie, E. R. (1954). The evaluation of teaching: A progress report. Seattle: University of Washington. | |
Liu, C-W. (2011). The implementation of teacher evaluation for professional development in primary education | |
in Taiwan. (Doctoral dissertation). Retrieved from Dissertation.com, Boca Raton, Florida. | |
Ministry of Education (Taiwan) (MOE). (2005). Ministry of Education News: College law. Retrieved October
31, 2016, from http://tece.heeact.edu.tw/main.php
Murray, H. G. (2005). Student evaluation of teaching: has it made a difference? In the Annual meeting of the | |
society for teaching and learning in higher education, June 2005 (pp.1-15). Charlottetown, Prince Edward | |
63 | |
elt.ccsenet.org | |
English Language Teaching | |
Vol. 11, No. 1; 2018 | |
Island, Canada. | |
Obenchain, K. M., Abernathy, T. V., & Wiest, L. R. (2001). The reliability of students’ ratings of faculty teaching | |
effectiveness. College Teaching, 49(3), 100-104. https://doi.org/10.1080/87567550109595859 | |
Ramsden, P. (2003). Learning to teach in higher education (2nd ed.). London: Routledge. | |
Seldin, P. (1993a). How colleges evaluate professors: 1983 versus 1993. AAHE Bulletin, 12, 6-8.
Seldin, P. (1993b). The use and abuse of student ratings of professors. Bolton, MA: Anker.
Sproull, J. (2002). Personal communication with authors, University of Edinburgh. | |
Teddlie, C., & Yu, F. (2007). Mixed methods sampling: a typology with examples. Journal of Mixed Methods | |
Research, 1(1), 77-100. https://doi.org/10.1177/2345678906292430 | |
West-Burnham, J., O’Neill, J., & Bradbury, I. (Eds.) (2001). Performance management in schools: How to lead | |
and manage staff for school improvement. London, UK: Person Education. | |
Wolfer, T., & Johnson, M. (2003). Re-evaluating student evaluation of teaching: The Teaching Evaluation Form. | |
Journal of Social Work Education, 39, 111-121. | |
Yorke, M. (2003). Formative assessment in higher education: Moves towards theory and enhancement of | |
pedagogic practice. Higher Education, 45, 477-501. https://doi.org/10.1023/A:1023967026413 | |
Advances in Engineering Education | |
FALL 2020 VOLUME 8 NUMBER 4 | |
Supportive Classroom Assessment for Remote Instruction | |
RENEE M. CLARK | |
MARY BESTERFIELD-SACRE | |
AND | |
APRIL DUKES | |
University of Pittsburgh | |
Pittsburgh, PA | |
ABSTRACT | |
During the summer 2020, when remote instruction became the norm for universities due to | |
COVID-19, expectations were set at our school of engineering for interactivity and activity within | |
synchronous sessions and for using technology for engaging asynchronous learning opportunities. | |
Instructors were asked to participate in voluntary assessment of their instructional techniques, and | |
this “supportive” assessment was intended to enable growth in remote teaching as well as demonstrate excellence in the School’s instruction. Preliminary results demonstrated what is possible when voluntary assessment has a “support” focus – namely, instructor willingness to participate and encouragement in the use of desirable teaching practices.
Key words: Assessment, COVID-19, remote learning | |
INTRODUCTION AND BACKGROUND | |
For many faculty, the last five weeks of the spring 2020 semester represented a time of “persisting through” to the end of the semester after a largely unforeseen, rapid change from ordinary campus life and learning to remote education. At the University of Pittsburgh’s Swanson School of
Engineering, there were different expectations, however, for the summer 2020 semester, as the | |
Associate Dean for Academic Affairs established a “new norm” for remote instruction by setting | |
expectations regarding interactivity and activity in synchronous classroom sessions as well as the | |
use of technology for creating engaging, high-quality asynchronous learning resources. These expectations were supported by multiple synchronous training sessions for faculty prior to the start | |
of the summer semester. In addition, instructors were asked to participate in voluntary assessment | |
of their summer instruction via interviews with and classroom observation by the School’s Assessment Director. This voluntary activity had a two-fold purpose, namely 1) to perform “supportive,” | |
as opposed to summative, assessment, to enable growth and development in remote online teaching, and 2) to demonstrate to others excellence in the School’s instruction. The authors believe this | |
voluntary program was particularly noteworthy because it was considered an assessment program; | |
however, a very supportive aspect was also involved, namely upfront planning assistance (via an | |
instructional checklist developed via faculty discussions), in-class coaching and observation, and | |
follow-up formative verbal and written feedback. Thus, this voluntary “assessment” program had | |
concomitant supportive aspects. | |
This supportive assessment program consisted of both 1) one-on-one instructional planning and | |
coaching intended to encourage participation, and 2) formative assessment and feedback. This | |
program was rooted in previous work by the Assessment Director (AD), in which she had used an | |
individualized, social-based approach involving instructional coaching to propagate active learning | |
within the engineering school [1]. Her previous work was based on the writings of Charles Henderson, | |
Dancy, and colleagues, which advanced the idea that educational change may best occur through | |
socially-driven and personalized practices, such as informal communication, interpersonal networks, | |
collegial conversations, faculty communities, and support provided during change and implementation [2–4]. The AD’s previous work was also grounded in the professional development literature | |
indicating that adult professional learning must be personalized, including support with upfront | |
planning, during classroom implementation, and via evaluation [5–7]. Classroom observation is one | |
such form of support during classroom implementation [6–11]. | |
METHODS | |
In the two weeks prior to the start of the summer semester, synchronous training and information sessions via Zoom video conferencing were held for instructors to promote desired teaching | |
techniques and approaches in the remote online environment. The training and information sessions, which were one hour in length and conducted during the lunch hour, covered the following | |
topics: 1) Online Classroom Organization and Communication, 2) Using Zoom for Active Learning, | |
3) Active Learning with Classroom Assessment Techniques (CATs), 4) Inclusive Online Teaching, | |
and 5) Voluntary Supportive Assessment. | |
During the information session on voluntary assessment, the Assessment Director described the | |
plan shown in Table 1, which was based on the framework discussed in Introduction & Background. | |
Table 1. Voluntary Assessment Program. | |
1. Individual interview with instructor (e.g., Zoom, phone, email) | |
a. Review Planning and Observational Checklist | |
b. Discuss plans for classroom observation (if applicable and desired) | |
c. Discuss plans for other support or review (e.g., review of course materials) if desired | |
2. Observe class session if applicable | |
a. Provide written feedback to instructor | |
3. Provide other review or support as desired | |
a. Provide written feedback to instructor | |
4. Provide acknowledgment of instructor participation to Associate Dean | |
5. Future discussion, interview, or email communications with instructor (as follow-up) | |
6. Create concise written summary (e.g., table/template) whereby excellence in teaching can be demonstrated | |
Thus, the assessment program was socially-based and involved one-on-one discussions with each | |
instructor about his/her instructional plans, classroom observation using the COPUS observational | |
protocol [12], determination of additional types of review or support desired, provision of written | |
feedback to the instructor, and future follow-up communications with the instructor. The initial | |
interview/discussion with the instructor was guided by a customized checklist created by a faculty | |
team to assist the instructor with his/her planning as well as enable the Assessment Director to | |
document actual practices observed or otherwise determined. The various sections of the checklist | |
are as follows: 1) Synchronous instruction and methods for interactivity, activity, and “changing up” | |
of lecture, 2) Asynchronous instruction, including flipped instruction, and methods such as videos, | |
readings, accountability quizzes, and in-class exercises, 3) Learning Management System (LMS) use | |
and organization, 4) Communication methods with students, 5) Assessment of learning approaches, | |
submission methods, and student feedback plans, and 6) Academic integrity promotion. | |
Given that the program was voluntary, each instructor’s participation was acknowledged to the | |
Associate Dean in a weekly bulk email. This email described desirable practices witnessed during | |
assessment activity with the instructor that week (e.g., via classroom observation). Each instructor discussed in the email was cc’d to drive community among the participants, with the hope of | |
potentially creating small learning communities. | |
PRELIMINARY RESULTS | |
Of the 31 summer instructors, 16 (52%) volunteered to participate in the assessment following the | |
information session. We believe this participation metric was noteworthy given the program was | |
one of voluntary assessment. This “supportive” assessment began immediately at
the start of the summer semester. At approximately five weeks into the summer semester, an initial | |
interview, classroom observation, and/or “other review” had occurred with 15 instructors, so the assessment was formative and supportive rather than summative. A plan was made to observe the
remaining instructor later in the summer given the schedule of the course. The following examples of | |
desirable instructional practices, which were communicated to the Associate Dean, were observed | |
by the Assessment Director: | |
• Not only did Instructor 1 create a classroom in which the expectation was activity and | |
engagement, but his flipped classroom was notable for the positive environment in which he | |
thanked students for their responses, randomly asked students if they would mind answering | |
questions, and always provided positive feedback on the responses. The classroom execution | |
was flawless, including circulation among 11 breakout rooms for group work. | |
• Instructor 2 made use of the Top Hat software and simple classroom assessment techniques | |
(CATs), such as the Minute Paper, to drive interactivity and engagement. He also desired to | |
use Zoom for this purpose (i.e., Polling or Chat window). | |
• Instructor 3 created an asynchronous class design using Panopto videos with embedded | |
accountability quizzes and reflective questions, all exceptionally laid out for students in | |
Canvas. She held a live Zoom Q&A session to highlight the week’s material, pose questions, and answer questions. The students responded to questions and asked their own | |
questions. | |
• Instructor 4 ran a blended classroom, in which he conducted both synchronous Zoom lecture | |
sessions and provided content videos via Panopto. Students took a quiz in Canvas to drive | |
accountability with the videos during class. There was interactive lecture, in which students | |
were highly responsive by asking and answering questions via chat and verbally. | |
These sample results demonstrate what is possible with a voluntary assessment program with | |
a “support” focus given strong leadership that provides learning and training opportunities for | |
instructors – namely instructor willingness to participate as well as support for desirable teaching | |
practices. An anonymous survey distributed to the instructors near the end of the semester indicated an average rating of 3.88 on a 5-point scale regarding the helpfulness and usefulness of the | |
classroom observation and other formative feedback offered (57% response rate). In the words of | |
one participant, “I got a professional review of my strategy for remote teaching, and a check on my | |
early implementation. Assessment provided me with a positive reinforcement that gave me assurance | |
and encouraged me to move forward. I was offered a broad range of helpful support that reassured | |
me that I could rely on opportune help when needed. I do appreciate it very much!” In the words of | |
another, “…Also, just the act of being evaluated makes me reflect more on my teaching methods.” | |
NEXT STEPS AND FUTURE PLANS | |
Given the relatively larger number of courses in the fall semester, this assessment program will | |
be continued on an “as requested” basis for instructors. It is worth noting that there was a time | |
commitment by the Assessment Director and that, in general, individualized coaching is costly in terms of time [13]. However, evidence suggests that the effectiveness of professional development
for instructors, including coaching, is positively associated with the intensity of the support [14]. | |
Thus, seeing what was possible with this supportive voluntary assessment program in the summer | |
suggests that committing the right resources (i.e., both in number and supportiveness) may be an | |
avenue to propelling remote instruction to higher levels. | |
REFERENCES | |
1. Clark, R., Dickerson, S., Bedewy, M., Chen, K., Dallal, A., Gomez, A., Hu, J., Kerestes, R., & Luangkesorn, L. (2020). Social-Driven Propagation of Active Learning and Associated Scholarship Activity in Engineering: A Case Study. International
Journal of Engineering Education, 36(5), 1–14. | |
2. Dancy, M., Henderson, C., & Turpen, C. (2016). How faculty learn about and implement research-based instructional | |
strategies: The case of peer instruction. Physical Review Physics Education Research, 12(1), 010110.
3. Dancy, M., & Henderson, C. (2010). Pedagogical practices and instructional change of physics faculty. American | |
Journal of Physics, 78(10), 1056–1063. | |
4. Foote, K., Neumeyer, X., Henderson, C., Dancy, M., & Beichner, R. (2014). Diffusion of research-based instructional | |
strategies: the case of SCALE-UP. International Journal of STEM Education, 1(1), 1–18. | |
5. Rodman, A. (2019). Personalized Professional Learning: A Job-Embedded Pathway for Elevating Teacher Voice. | |
Alexandria, VA: ASCD, pp. 1–9. | |
6. Desimone, L. M., & Pak, K. (2017). Instructional coaching as high-quality professional development. Theory Into | |
Practice, 56(1), 3–12. | |
7. Rhodes, C., Stokes, M., & Hampton, G. (2004). A practical guide to mentoring, coaching and peer-networking: Teacher | |
professional development in schools and colleges. London: Routledge, pp. 25, 29–30. | |
8. Braskamp, L., & Ory, J. (1994). Assessing Faculty Work. San Francisco: Jossey-Bass Inc., 202. | |
9. Keig, L., & Waggoner, M. (1994). Collaborative peer review: The role of faculty in improving college teaching. | |
ASHE-ERIC Higher Education Report No. 2. Washington, DC: The George Washington University, School of Education | |
and Human Development, 41–42. | |
10. Reddy, L. A., Dudek, C. M., & Lekwa, A. (2017). Classroom strategies coaching model: Integration of formative | |
assessment and instructional coaching. Theory Into Practice, 56(1), 46–55. | |
11. Gallucci, C., Van Lare, M., Yoon, I., & Boatright, B. (2010). Instructional coaching: Building theory about the role | |
and organizational support for professional learning. American Educational Research Journal, 47(4), 919–963. | |
12. Smith, M., Jones, F., Gilbert, S., & Wieman, C. (2013). The classroom observation protocol for undergraduate | |
STEM (COPUS): A new instrument to characterize university STEM classroom practices. CBE-Life Sci. Educ., 12(4), 618–627. | |
FALL 2020 VOLUME 8 NUMBER 4 | |
5 | |
ADVANCES IN ENGINEERING EDUCATION | |
Supportive Classroom Assessment for Remote Instruction | |
13. Connor, C. (2017). Commentary on the special issue on instructional coaching models: Common elements of | |
effective coaching models. Theory into Practice, 56(1), 78–83. | |
14. Devine, M., Houssemand, C., & Meyers, R. (2013). Instructional coaching for teachers: A strategy to implement new | |
practices in the classrooms. Procedia-Social and Behavioral Sciences, 93, 1126–1130. | |
AUTHORS | |
Renee M. Clark is Research Assistant Professor of Industrial Engineering and | |
Director of Assessment for the Swanson School of Engineering at the University | |
of Pittsburgh. Dr. Clark’s research focuses on assessment of active learning and | |
engineering professional development initiatives. Her research has been funded | |
by the NSF and the University of Pittsburgh’s Office of the Provost. | |
Mary Besterfield-Sacre is Nickolas A. DeCecco Professor, Associate Dean for | |
Academic Affairs, and Director of the Engineering Education Research Center in | |
the Swanson School of Engineering at the University of Pittsburgh. Dr. Sacre’s | |
principal research is in engineering education assessment, which has been | |
funded by the NSF, Department of Education, Sloan Foundation, Engineering | |
Information Foundation, and VentureWell. | |
April Dukes is the Faculty and Future Faculty Program Director for the | |
Engineering Education Research Center in the Swanson School of Engineering | |
at the University of Pittsburgh. Dr. Dukes facilitates professional development | |
on instructional best practices for current and future STEM faculty for both | |
synchronous online and in-person environments. | |
http://wje.sciedupress.com | |
World Journal of Education | |
Vol. 11, No. 3; 2021 | |
Timeless Principles for Effective Teaching and Learning: A Modern | |
Application of Historical Principles and Guidelines | |
R. Mark Kelley 1,*, Kim Humerickhouse 2, Deborah J. Gibson 3 & Lori A. Gray 1
1 School of Interdisciplinary Health Programs, Western Michigan University, Kalamazoo, MI, USA
2 Department of Teacher Education, MidAmerica Nazarene University, Olathe, KS, USA
3 Department of Health and Human Performance, University of Tennessee at Martin, Martin, TN, USA
*Correspondence: School of Interdisciplinary Health Programs, Western Michigan University, 1903 W. Michigan | |
Ave., Kalamazoo, MI, 49008, USA. Tel: 1-269-387-1097. E-mail: [email protected] | |
Received: February 13, 2021    Accepted: May 23, 2021    Online Published: June 2, 2021
doi:10.5430/wje.v11n3p1    URL: https://doi.org/10.5430/wje.v11n3p1
Abstract | |
The purpose of this study is twofold: (a) to assess the perceived relevance of the Seven Timeless Principles and | |
guidelines posited by Gregory (1886) for current educators and educators-in-training and (b) to develop and pilot test | |
the instrument needed to accomplish the former. The “Rules for Teachers” Gregory attributes to each of these laws | |
were used as guidelines to develop an assessment instrument. Eighty-four educators and future educators across three | |
universities participated in an online survey using a 4-point Likert scale to evaluate the consistency of Gregory’s | |
guidelines with modern best-teaching practices. Responses were framed within the Timeless Principles, providing a | |
measure of pedagogical universality. Total mean scores for all principles and guidelines were greater than 3.0, | |
suggesting that Gregory had indeed identified foundational principles of teaching and learning that maintain | |
relevance across academic disciplines and in a variety of settings in which learning occurs. | |
Keywords: teaching and learning, principles of teaching, historical pedagogy, educational principles | |
1. Introduction | |
In 1886, John Milton Gregory published a book entitled The Seven Laws of Teaching that offered a set of principles | |
to support and strengthen teachers’ capabilities systemically and comprehensively. The primary purpose of this study | |
was to explore whether Gregory’s principles are consistent with faculty and student perceptions of 21st century best | |
teaching practices. To accomplish the primary purpose, a secondary goal of the study was to pilot and provide | |
evidence of reliability and validity of an instrument based on Gregory’s principles and guidelines. The study | |
evaluated the value and relevance of these 19th century principles to modern teachers via a researcher-developed | |
instrument using the guidelines established within each of Gregory’s principles and then presented results to validate | |
concept transferability. After examining the basic structure of The Seven Laws of Teaching in the context of modern | |
approaches, we suggest that these seven laws represent Timeless Principles of the science and art of teaching. | |
1.1 Background | |
Discussions of foundational principles that frame effective teaching are not unique to Gregory, and the educational | |
literature contains an abundance of suggested principles, strategies, and guidelines. Thorndike (1906) identified three | |
essential principles that included readiness, exercise, and effect. The Law of Readiness suggested that a child must be | |
ready to learn in order to learn most efficiently. It is the responsibility of the teacher to develop the readiness to learn | |
in the student. The Law of Exercise is further divided into the Law of Use and Law of Disuse. Repetition strengthens | |
understanding, and practice makes perfect. Conversely, if one does not “use it,” they tend to “lose it.” It is the | |
responsibility of the teacher to ensure practice is interesting and meaningful in order to enhance learning. | |
Thorndike’s Law of Effect suggests that: (a) actions that elicit feelings of pleasure and satisfaction enhance effective | |
learning, (b) any action met with frustration and annoyance will likely be avoided, and (c) success breeds success | |
and failure leads to further failure. | |
Rosenshine and Furst (1971) conducted what is considered the first literature review of the research addressing | |
principles for effective teaching. They outlined five “most important” teacher-effectiveness variables, which include: | |
clarity, variability, enthusiasm, task-oriented behavior, and student opportunity to learn criterion material. Almost 30 | |
years later, Walls (1999) posited four similar criteria, including outcomes, clarity, engagement, and enthusiasm. Walls | |
stressed that it is important for students to understand the direction in which the teacher is guiding their | |
learning—and the teacher’s intentions for going there—by providing clear goals and related learning outcomes. It is | |
vital to build upon what students already know while making material as clear as possible. | |
In 1987, Chickering and Gamson posited seven principles that they argued are representative of good practice in | |
undergraduate education: (a) encourage contacts between students and faculty, (b) develop reciprocity and | |
cooperation among students, (c) use active learning techniques, (d) give prompt feedback, (e) emphasize time on task, | |
(f) communicate high expectations, and (g) respect diverse talents and ways of learning. These seven principles are | |
“intended as guidelines for faculty members, students, and administrators to improve teaching” (Chickering & | |
Gamson, 1987, p. 3). | |
Walls (1999) agreed with Thorndike (1906) that students must be engaged to learn, stressing the importance of active | |
learning, which encompasses aspects of Thorndike’s laws. Students must be engaged to learn, as people learn what | |
they practice (Law of Exercise). Both the student and the teacher should be enthusiastic about the learning (Law of | |
Effect); if the teacher does not enjoy the teaching, how can students be expected to enjoy the learning? | |
More recently, distinct approaches have offered an element of novelty but ultimately integrated pre-existing | |
principles. Perkins (2008) used baseball as a metaphor to depict his principles of teaching. The principles set the | |
stage for what Perkins further referred to as conditions and principles of transfer. The principles include: (a) play the | |
whole game (develop capability by utilizing holistic work); (b) make the game worth playing (engage students | |
through meaningful content); (c) work on the hard parts (develop durable skills through practice, feedback, and | |
reflection); (d) play out of town (increase transfer of knowledge with diverse application of experiences); (e) play the | |
hidden game (sustain active inquiry); (f) learn from the team (encourage collaborative learning); and (g) learn the | |
game of learning (students taking an active role in their learning). | |
Tomlinson’s (2017) differentiation emphasized the need for teachers to respond dynamically within a given | |
classroom by varying (“differentiating”) instruction to meet student needs. Conceptually, Tomlinson identified | |
respectful tasks, ongoing assessment and adjustment, and flexible grouping as general principles driving | |
differentiation while identifying the primary domains of the teacher (content, process, and product) and the student | |
(readiness, interests, and learning profile). | |
Beyond the contributions of individual approaches, the past 20 years have also seen an increase in collaborative, | |
research-based recommendations for educational principles that draw upon the experiences of educators, researchers, | |
and policymakers. Workforce entry and academic preparation for college have been the primary aspects of these | |
recommendations. The InTASC Model Core Teaching Standards delineated competencies based on key principles | |
that are intended to be mastered by the teacher (Council of Chief State School Officers, 2011). It is anticipated that | |
proficiency in these standards supports sufficient preparation for K-12 students to succeed in college and to obtain | |
the skill sets needed for a future workplace. Preparing 21st Century Students for a Global Society set forth four skills | |
found to be most important, including critical thinking, communication, collaboration, and creativity, and stated, | |
“What was considered a good education 50 years ago, however, is no longer enough for success in college, career, | |
and citizenship in the 21st century” (National Education Association, 2012, p. 3). | |
In specific academic disciplines, similar discussions and statements have been made. For example, in the field of | |
health education and promotion, Auld and Bishop (2015) stated that “given today’s rapid pace of change and health | |
challenges, we are called to identify, adapt and improve key elements that make teaching and learning about health | |
and health promotion successful” (p. 5). Pruitt and Epping-Jordan (2005) discussed the need to develop a new | |
approach to training for the 21st century global healthcare workforce. Regardless of approach or discipline, there is a | |
clear desire among educators to identify a universal set of principles to guide effective teaching. | |
1.2 Overview of the Seven Laws of Teaching | |
Gregory (1886) drew upon the metaphor of examining natural laws or phenomena to define the foundational | |
principles that govern effective teaching. In step with what is now recognized as a positivist paradigm, Gregory | |
believed that in order to understand such laws, one must subject the phenomenon to scientific analysis and identify | |
its individual components. Gregory (1886) posited that the essential elements of “any complete act of teaching” are | |
composed of: | |
Seven distinct elements or factors: (1) two personal factors—a teacher and a learner; (2) two mental factors—a | |
common language or medium of communication, and a lesson or truth or art to be communicated; and (3) three | |
functional acts or processes—that of the teacher, that of the learner, and a final or finishing process to test and | |
fix the result. (p. 3) | |
Further, he argued that regardless of whether that which is to be learned is a single fact requiring a few minutes or a
complex concept requiring a lesson of many hours, all seven of these factors must be present if learning is to occur; | |
none can be missing. For the purposes of this article, the concept of a “law” of teaching as expressed by Gregory | |
(1886) has been re-termed to be a “principle.” We also embraced Gregory’s general grouping of these elements as | |
key dimensions of the Seven Principles (i.e., actors, mental factors, functional processes, and finishing acts). | |
1.2.1 The Seven Principles Stated | |
There are a variety of ways that these seven principles can be expressed. Gregory (1886) first stated the overarching | |
principles, then expressed them as direct statements for teachers to follow in their pursuits. Below are the principles | |
exactly as Gregory wrote them (emphasis his own): | |
1) The Principle of the Teacher: A teacher must be one who KNOWS the lesson or truth or art to be taught... [As | |
expressed to teachers:] Know thoroughly and familiarly the lesson you wish to teach,—teach from a full mind and a | |
clear understanding. | |
2) The Principle of the Learner: A learner is one who ATTENDS with interest to the lesson given.… [As expressed | |
to teachers:] Gain and keep the attention and interest of the pupils upon the lesson. Do not try to teach without | |
attention. | |
3) The Principle of the Language: The language used as a MEDIUM between teacher and learner must be | |
COMMON to both... [As expressed to teachers:] Use words understood in the same way by the pupils and | |
yourself—language clear and vivid to both. | |
4) The Principle of the Lesson: The lesson to be mastered must be explicable in terms of truth already known by the | |
learner—the UNKNOWN must be explained by means of the KNOWN… [As expressed to teachers:] Begin with | |
what is already well known to the pupil upon the subject and with what [they themselves] experienced,—and | |
proceed to the new material by single, easy, and natural steps, letting the known explain the unknown. | |
5) The Principle of the Teaching Process: Teaching is AROUSING and USING the pupil’s mind to grasp the | |
desired thought... [As expressed to teachers:] Stimulate the pupil’s own mind to action. Keep [their] thoughts as | |
much as possible ahead of your expression, placing [them] in the attitude of a discoverer, an anticipator.
6) The Principle of the Learning Process: Learning is THINKING into one’s own UNDERSTANDING a new idea | |
or truth… [As expressed to teachers:] Require the pupil to reproduce in thought the lesson [they are] | |
learning—thinking it out in its parts, proofs, connections and applications till [they] can express it in [their] own | |
language. | |
7) The Principle of Review: The test and proof of teaching done—the finishing and fastening process—must be a | |
REVIEWING, RETHINKING, RE-KNOWING, REPRODUCING, and APPLYING of the material that has been | |
taught… [As expressed to teachers:] Review, review, REVIEW, reproducing correctly the old, deepening its | |
impression with new thought, linking it with added meanings, finding new applications, correcting any false views, | |
and completing the true. (Gregory, 1886, pp. 5-7) | |
1.2.2 Essentials of Successful Teaching Using the Seven Principles | |
There are a variety of understandings that are essential for applying these Seven Principles to effective teaching. The | |
first understanding is that the Seven Principles are both necessary and sufficient for effective teaching. Gregory | |
(1886) stated that “these rules, and the laws which they outline and presuppose, underlie and govern all successful | |
teaching. If taken in their broadest meaning, nothing need be added to them; nothing can be safely taken away” (p. 7). | |
He posited that when these principles are used in conjunction with “good order,” no teacher need be concerned about | |
failing as a teacher, provided each principle is paired with effective behavior management. Thus, Gregory indicated | |
that profound understanding and consistent application of these principles forms the foundation for all successful | |
teaching and learning experiences. | |
Another understanding essential for successful teaching with the principles is the deceptiveness of their simplicity. At | |
first review, it is easy for the reader to conclude that these principles “seem at first simple facts, so obvious as | |
scarcely to require such formal statement, and so plain that no explanation can make clearer their meaning” (Gregory, | |
1886, p. 8). As one begins to examine the applications and effects of these principles, it becomes apparent that while | |
there is constancy, there is also opportunity for variation as each teacher finds their personal expression of each | |
principle. | |
The functionality of the principles is not temporally constrained; the principles are as applicable for the 21st century | |
teacher as they were for teachers of the 19th century. For example, while the language of the learners of the 1800s | |
was likely to have been substantially different from the language of the learners of the 2000s, teachers must prepare | |
their lesson with the language of their learners in mind regardless of the century in which they taught or are teaching. | |
Gregory’s (1886) principles offer a basis for modern strategies and theories of teaching and learning that is consistent | |
with broader philosophies of education. For this reason, we will refer to them as the Seven Timeless Principles. | |
The ubiquitous nature of these Seven Timeless Principles needs to be understood in order for the principles to be | |
applied in effective teaching. Gregory (1886) stated that the laws “cover all teaching of all subjects and in all grades, | |
since they are the fundamental conditions on which ideas may be made to pass from one mind to another, or on | |
which the unknown can become known” (p. 8). In this way, he suggested that the principles are just as applicable to | |
the elementary school teacher as they are to the college professor, equally important to the music teacher as to the | |
health teacher. | |
Associated with each principle were what Gregory (1886) described as “Rules for Teachers” (p. 31). These rules | |
will subsequently be referred to herein as guidelines. These guidelines detail the core components that shape each
principle. For example, a guideline under the Teacher Principle would be: “Prepare each lesson by fresh study. Last | |
year’s knowledge has necessarily faded somewhat” (Gregory, 1886, p. 20). A guideline posited for the Learner | |
Principle: “Adapt the length of the class exercise to the ages of the pupils: the younger the pupils the briefer the | |
lesson” (Gregory, 1886, p. 30). | |
1.3 Significance and Study Objective | |
Gregory’s (1886) original work has been recognized as making valuable contributions to the teaching and learning | |
process in some circles (Stephenson, 2014; Wilson, 2014). In a recent reprint of Gregory’s first edition text, | |
Stephenson (2014) provided supplemental materials that included study questions, self-assessment, and a sample | |
teacher observation form. In the same book, Wilson (2014) argued that one of the essential elements of effective | |
teaching is that teachers understand the distinction between the methods of teaching and the principles of teaching. | |
Wilson (2014) stated, “Methods change. They come and go. In the ancient world, students would use wax tablets to | |
take notes, and now they use another kind of tablet, one with microchips inside” (p. 4). Wilson suggested that a | |
teacher using the methods of wax or stone needed to know what was going to be said and why just as much as a | |
teacher using the methods of a smart board or computer in today’s classroom. The purpose of this study is twofold: (a) | |
to assess the perceived relevance of the Seven Timeless Principles and guidelines posited by Gregory (1886) for | |
current educators and educators-in-training and (b) to develop and pilot test the instrument needed to accomplish the | |
former. The research hypothesis of this study is that the principles and guidelines posited by Gregory are affirmed as | |
relevant by current and future educators. The approach is to translate Gregory’s guidelines into a survey instrument | |
capable of providing evidence of the value of the overarching principles. | |
2. Method | |
2.1 Research Design | |
This research was an exploratory study with a cross-sectional design that used a convenience sample. Research sites | |
were chosen because of their accessibility to the researchers. The research protocol was approved by the institutional | |
review boards (IRBs) of all of the institutions with which the authors are affiliated. | |
2.2 Sample and Participant Selection | |
The participants for this study consisted of current educators and educators-in-training. The current educators were | |
higher education professors from three universities ranging in size from small- to mid-sized: one in the South, one in | |
the Midwest, and one in the North. The educators-in-training were students enrolled in the undergraduate
teacher education programs at two of the universities. Recruitment for all participants was conducted via an email or | |
in-class invitation to participate in the research project by completing the survey. | |
Student participants were recruited from two classes: an introduction to teacher education course and a senior-level | |
course. Surveys were taken by students prior to participating in their student teaching experience, and bonus points | |
were offered for participation. Faculty participants were recruited through the faculty development process, though | |
participation in the process was not required to participate in the survey. All participants voluntarily completed the | |
survey after reading and acknowledging the informed consent form. | |
2.3 Data Collection and Analysis | |
Participant invitations and all surveys were administered in the 2018 spring and fall academic semesters using | |
Google Forms, from which aggregated data were downloaded. Statistics for descriptive and reliability analyses were | |
generated using SPSS Version 26 software. Means and standard deviations were calculated for all 43 guidelines, | |
including all aggregate groupings for principles and dimensions. To affirm reliability of the instrument and the | |
subscales, Cronbach’s alphas were computed on the total scale and on each of the principle subscales. | |
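For reference, the internal-consistency statistic used here is standard and not specific to this study: for a scale or subscale made up of k items, Cronbach's alpha is
\alpha = \frac{k}{k-1}\left(1 - \frac{\sum_{i=1}^{k} s_i^{2}}{s_X^{2}}\right)
where s_i^2 is the variance of responses to item i and s_X^2 is the variance of the summed total score over the k items; values close to 1 indicate that the items vary together, i.e., that the scale is internally consistent.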
2.4 Institutional Approvals and Ethical Considerations | |
The protocol of this project was approved by the IRBs of Western Michigan University, Mid-America Nazarene | |
University, and University of Tennessee at Martin. Prior to completing the electronic survey, each potential | |
participant reviewed an IRB-approved informed consent form online. Potential participants who agreed to participate | |
clicked on the “proceed to survey” button, which led them to the initial questions of the survey. The informed | |
consent notified participants that they could discontinue participation at any time. | |
Participant confidentiality and anonymity were protected through the security of the Google survey management | |
system and the encrypted, password-protected security of the investigators’ university computers. There is limited | |
psychometric risk to participation in an online survey. No prior psychometric data were available for the instrument, | |
as one purpose of this study was to pilot its use. | |
2.5 Instrument Development: Assessment and Measures | |
The instrument used in this research was developed by the authors and is based upon Gregory’s (1886) Seven | |
Timeless Principles. The instrument contains two basic components. The first component of the survey was basic | |
demographic information, including: age, binary gender, race, level of involvement in teaching, and primary | |
academic discipline. No identifying information beyond the above-mentioned variables was collected. | |
The second component of the instrument was developed directly from the guidelines for teachers described by | |
Gregory (1886) to measure teacher perception of the guidelines’ modern relevance. Each guideline was used as an | |
item on the instrument. Evidence of face validity was obtained by a panel of education professionals who reviewed | |
each of the guidelines for its relevance to the principle with which it was associated. In some instances, minor | |
changes were made to the language of Gregory’s guidelines in order to present the content in more modern language.
Care was taken to ensure that each statement accurately reflected its original meaning. | |
The final instrument consisted of five demographic items and 43 items related to the guidelines for effective teaching, | |
creating a Timeless Principles Scale. The items (guidelines) associated with each of the seven principles were | |
combined into subscales comprised of n items (i.e., Principle of the Teacher [n = 6], Principle of the Learner [n = 6], | |
Principle of the Language [n = 6], Principle of the Lesson [n = 6], Principle of the Teaching Process [n = 9], Principle | |
of the Learning Process [n = 4], and Principle of Review and Application [n = 6]). | |
Using a 4-point Likert scale, participants affirmed or rejected the perceived relevance of each item (guideline) as it | |
relates to teacher best practices in 21st century educational settings (1 = strongly disagree to 4 = strongly agree). | |
Means and standard deviations were computed for the total scale, for each of the subscales, and for each of the 43 | |
items. Responses and mean scores of 3.0 or greater were considered affirming of the relevance of the principle and/or | |
guideline for current teaching and learning. | |
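The authors report generating these statistics in SPSS Version 26. Purely as an illustrative sketch of the same aggregation (the file name, the q1-q43 column names, and the item-to-subscale mapping below are assumptions inferred from the subscale sizes listed above, not part of the original study), the subscale means, standard deviations, and Cronbach's alphas could be reproduced in Python along the following lines:

import pandas as pd

# Hypothetical wide-format data: one row per respondent, columns q1..q43
# holding responses coded 1 (strongly disagree) to 4 (strongly agree).
responses = pd.read_csv("timeless_principles_responses.csv")

# Item-to-subscale mapping assumed from the subscale sizes reported above
# (6, 6, 6, 6, 9, 4, and 6 items, in instrument order).
subscales = {
    "Teacher": [f"q{i}" for i in range(1, 7)],
    "Learner": [f"q{i}" for i in range(7, 13)],
    "Language": [f"q{i}" for i in range(13, 19)],
    "Lesson": [f"q{i}" for i in range(19, 25)],
    "Teaching Process": [f"q{i}" for i in range(25, 34)],
    "Learning Process": [f"q{i}" for i in range(34, 38)],
    "Review and Application": [f"q{i}" for i in range(38, 44)],
}

def cronbach_alpha(items):
    # alpha = (k / (k - 1)) * (1 - sum of item variances / variance of summed score)
    k = items.shape[1]
    item_variances = items.var(axis=0, ddof=1)
    total_variance = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

for name, cols in subscales.items():
    sub = responses[cols]
    person_means = sub.mean(axis=1)  # each respondent's subscale mean on the 1-4 scale
    print(f"{name}: M = {person_means.mean():.2f}, "
          f"SD = {person_means.std(ddof=1):.2f}, alpha = {cronbach_alpha(sub):.3f}")

# Total scale across all 43 guidelines; a mean of 3.0 or higher is read as
# affirming relevance under the scoring rule described above.
all_items = responses[[c for cols in subscales.values() for c in cols]]
print(f"Total scale: M = {all_items.mean(axis=1).mean():.2f}, "
      f"alpha = {cronbach_alpha(all_items):.3f}")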
3. Results | |
3.1 Demographics | |
Of the 84 educators and education students who participated in the study, 86.9% identified as White and 9.6% | |
identified as African American/Black, Hispanic, Asian, or Native American; 3.6% did not identify race. The majority | |
of participants were female (57.1%), with 39.3% identifying as male and 3.6% not identifying
gender. With regard to primary discipline, health sciences was most common (25.0%), followed by physical sciences | |
(15.5%), behavioral sciences (13.1%), and social sciences (11.9%). Humanities, language arts, music or fine arts, and | |
physical education represented 8.3%, 9.5%, 4.8%, and 2.4% of disciplines, respectively. | |
The majority of respondents were educators in higher education settings (70.2%). Education students represented | |
28.6% of participant responses, and other workforce professionals represented 1.2% of the sample. Most participants | |
in higher education reported employment at a full-time level (44% of total), with 4.8% reporting a part-time teaching | |
position. Of the 59 total teachers/professors, 50% stated that their education included training and course work in | |
effective teaching practices. | |
3.2 Total Scale | |
Mean and standard deviation scores for the total scale, the subscales, and for each item are presented in Table 1. The | |
mean total score for the Timeless Principles Scale (consisting of all 43 items on the instrument) was 3.37 with a | |
standard deviation of 0.348 (see Table 1). This result indicates that participants agreed overall, and were inclined to | |
strongly agree, with the guidelines and principles identified by Gregory’s (1886) laws. Cronbach’s alpha calculated | |
for the total scale was 0.954, indicating a high level of internal consistency and that the total scale is reliable. | |
Table 1. Mean, Standard Deviation, and Cronbach’s Alpha Scores for the Timeless Principles Scale
Timeless Principles: Total Scale (M = 3.37, SD = .348, α = .954)
Principle of the Teacher (M = 3.30, SD = .385, α = .667) - An effective teacher should:
1) Prepare each lesson by fresh study. Last year’s knowledge has necessarily faded somewhat. (M = 3.02, SD = .640)
2) Find the connection of the lesson to the lives and duties of the learners. Its practical value lies in these connections. (M = 3.48, SD = .611)
3) Keep in mind that complete mastery of a few things is better than an ineffective smattering of many. (M = 3.23, SD = .704)
4) Have a plan of study, but do not hesitate, when necessary, to study beyond the plan. (M = 3.39, SD = .560)
5) Make use of all good books and resources available to you on the subject of the lesson. (M = 3.33, SD = .627)
6) Get the help of the best scholars and thinkers on the topic at hand to solidify your own thoughts. (M = 3.30, SD = .561)
Principle of the Learner (M = 3.52, SD = .376, α = .754) - To enhance student engagement, an effective teacher should:
7) Never exhaust wholly the learner's power of attention. Stop or change activities when signs of attention fatigue appear. (M = 3.39, SD = .602)
8) Adapt the length of the class exercise to the ages of the pupils: The younger the pupils the briefer the lesson. (M = 3.43, SD = .556)
9) Appeal whenever possible to the interests of your learners. (M = 3.62, SD = .513)
10) Prepare beforehand thought-provoking questions. Be sure that these are not beyond the ages and attainments of your learners. (M = 3.64, SD = .530)
11) Make your lesson as attractive as possible, using illustrations and all legitimate devices and technologies. Do not, however, let these devices or technologies be so prominent as to become sources of distraction. (M = 3.60, SD = .518)
12) Maintain in yourself enthusiastic attention to and the most genuine interest in the lesson at hand. True enthusiasm is contagious. (M = 3.45, SD = .629)
Principle of the Language (M = 3.31, SD = .414, α = .800) - In order to ensure a common language, an effective teacher should:
13) Secure from the learners as full a statement as possible of their knowledge of the subject, to learn both their ideas and their mode of expressing them, and to help them correct their knowledge. (M = 3.27, SD = .588)
14) Rephrase the thought in more simple language if the learner fails to understand the meaning. (M = 3.54, SD = .525)
15) Help the students understand the meanings of the words by using illustrations. (M = 3.37, SD = .533)
16) Give the idea before the word, when it is necessary to teach a new word. (M = 3.11, SD = .581)
17) Test frequently the learner's sense of the words she/he uses to make sure they attach no incorrect meaning and that they understand the true meaning. (M = 3.30, SD = .636)
18) Should not be content to have the learners listen in silence very long at a time since the acquisition of language is one of the most important objects of education. Encourage them to talk freely. (M = 3.25, SD = .641)
Principle of the Lesson (M = 3.51, SD = .415, α = .828) - In order to create an effective lesson, an effective teacher should:
19) Find out what your students know of the subject you wish to teach to them; this is your starting point. This refers not only to textbook knowledge but to all information they may possess, however acquired. (M = 3.37, SD = .655)
20) Relate each lesson as much as possible with prior lessons, and with the learner's knowledge and experience. (M = 3.65, SD = .503)
21) Arrange your lesson so that each step will lead naturally and easily to the next; the known leading to the unknown. (M = 3.61, SD = .515)
22) Find illustrations in the most common and familiar objects suitable for the purpose. (M = 3.46, SD = .525)
23) Lead the students to find fresh illustrations from their own experience. (M = 3.48, SD = .571)
24) Urge the learners to use their own knowledge to find or explain other knowledge. Teach them that knowledge is power by showing how knowledge really helps solve problems. (M = 3.51, SD = .570)
Principle of the Teaching Process (M = 3.37, SD = .407, α = .858) - To create an effective teaching process, the effective teacher should:
25) Select and/or develop lessons and problems that relate to the environment and needs of the learner. (M = 3.49, SD = .549)
26) Excite the learner's interest in the lesson when starting the lesson, by some question or statement that will awaken inquiry. Develop a hook to awaken their interest. (M = 3.57, SD = .521)
27) Place yourself frequently in the position of a learner among learners, and join in the search for some fact or principle. (M = 3.42, SD = .587)
28) Repress the impatience which cannot wait for the student to explain themselves, and which takes the words out of their mouth. They will resent it, and feel that they could have answered had you given them sufficient time. (M = 3.29, SD = .654)
29) Count it your chief duty to awaken the minds of the learners and do not rest until each learner shows their mental activity by asking questions. (M = 3.06, SD = .766)
30) Repress the desire to tell all you know or think upon the lesson or subject; and if you tell something to illustrate or explain, let it start a fresh question. (M = 3.30, SD = .555)
31) Give the learner time to think, after you are sure their mind is actively at work, and encourage them to ask questions when puzzled. (M = 3.48, SD = .548)
32) Do not answer the questions asked too promptly, but restate them, to give them greater force and breadth, and often answer with new questions to secure deeper thought. (M = 3.34, SD = .590)
33) Teach learners to ask What? Why? and How? in order to better learn the nature, cause, and method of every fact, idea, or principle observed or taught them: also, Where? When? By whom? and What of it? - the place, time, actors, and consequences. (M = 3.37, SD = .599)
The Principle of the Learning Process (M = 3.40, SD = .462, α = .684) - In order to facilitate an effective learning process, the effective teacher should:
34) Ask the learner to express, in their own words, the meaning as they understand it, and to persist until they have the whole thought. (M = 3.35, SD = .674)
35) Let the reason why be perpetually asked until the learner is brought to feel that they are expected to give a reason for their opinion. (M = 3.29, SD = .721)
36) Aim to make the learner an independent investigator - a student of nature, a seeker of truth. Cultivate in them a fixed and constant habit of seeking accurate information. (M = 3.58, SD = .542)
37) Seek constantly to develop a profound regard for truth as something noble and enduring. (M = 3.37, SD = .638)
The Principle of Review and Application (M = 3.37, SD = .348, α = .852) - To affirm the learning that has occurred and apply it, the effective teacher should:
38) Have a set time for reviews. At the beginning of each lesson take a brief review of the preceding lesson. (M = 3.25, SD = .618)
39) Glance backward, at the close of each lesson, to review the material that has been covered. Almost every good lesson closes with a summary. It is good to have the learners know that any one of them may be called upon to summarize the lesson at the end of the class. (M = 3.30, SD = .619)
40) Create all new lessons to bring into review and application, the material of former lessons. (M = 3.12, SD = .722)
41) The final review, which should never be omitted, should be searching, comprehensive, and masterful, grouping all parts of the subject learned as on a map, and giving the learner the feeling of a familiar mastery of it all. (M = 3.28, SD = .668)
42) Seek as many applications as possible for the subject studied. Every thoughtful application involves a useful and effective review. (M = 3.33, SD = .627)
43) An interesting form of review is to allow members of the class to ask questions on previous lessons. (M = 3.23, SD = .533)
Note. N = 84. Survey questions (“guidelines”) were aggregated by subscales representing Gregory’s (1886) Seven | |
Laws (“Principles”) of Teaching. Values were calculated from 4-point Likert scale responses (1 = strongly disagree, | |
2 = disagree, 3 = agree, 4 = strongly agree). | |
3.3 Principles and Guidelines | |
The mean and standard deviation for each of the principle subscales were computed as follows: Principle of the | |
Teacher (M = 3.30, SD = 0.385), Principle of the Learner (M = 3.52, SD = 0.376), Principle of the Language (M = | |
3.31, SD = 0.414), Principle of the Lesson (M = 3.51, SD = 0.415), Principle of the Teaching Process (M = 3.37, SD | |
= 0.407), Principle of the Learning Process (M = 3.40, SD = 0.462), and the Principle of Review and Application (M | |
= 3.37, SD = 0.348). This represents affirmation of each of the seven principles as relevant to current educational | |
settings. | |
Cronbach’s alpha for each of the subscales were as follows: Principle of the Teacher (α = 0.667), Principle of the | |
Learner (α = 0.754), Principle of the Language (α = 0.800), Principle of the Lesson (α = 0.828), Principle of the | |
Teaching Process (α = 0.858), Principle of the Learning Process (α = 0.684), and Principle of Review and | |
Application (α = 0.852). These values affirm the internal consistency of each of the subscales. | |
The mean scores of each of the 43 items (guidelines) were above 3.0. The item mean scores ranged from 3.02 to 3.65 | |
with standard deviations ranging from 0.503 to 0.766. These results reflect that each individual guideline was | |
affirmed as being relevant to current educational settings. | |
4. Discussion | |
4.1 Implications | |
In this paper, we examined the relevance of the principles (laws) presented in 1886 by John Milton Gregory in The | |
Seven Laws of Teaching. We presented evidence that these principles may indeed represent enduring Timeless | |
Principles of effective teaching that, while their application in the 21st century may look different than it did in the | |
19th century, encapsulate the necessary elements to facilitate effective learning. The results of this exploratory study | |
indicate that educators and educators-in-training affirm the current relevance of these principles.
The results of the study also affirm the perception of applicability of the guidelines—or as Gregory (1886) described | |
them, rules for teachers—for faculty members of institutions of higher education as well as prospective K-12 | |
teachers. However, neither we nor Gregory posit that the guidelines presented in the study represent a comprehensive, | |
exhaustive list of appropriate guidelines. For example, one could envision a guideline such as “Learn students’ names | |
to help them feel connected to the learning community” as an element of effective teaching. However, this statement | |
could easily be considered as a fit for the Principle of the Learner, as the feeling of being connected to the learning | |
community certainly contributes to learner engagement. It is reasonable and should be expected that other guidelines | |
for teachers would be consistent with one of the seven principles. | |
The mean score of respondents to each guideline statement was above 3.0 on a 4-point Likert scale in which a 3 | |
represented agree and a 4 represented strongly agree (lowest M = 3.02, highest M = 3.65). In addition, the mean scores for the subscales representing each principle ranged from 3.30 to 3.52, reflecting strong
affirmation of the current relevance of each of the Seven Timeless Principles. | |
The enduring nature of these Seven Principles may be a result of their consistency with research-based practices | |
whose impact has been shown since Gregory (1886) described his Laws for Teachers. For example, the concept of | |
cognitive load theory (Atkinson & Shiffrin, 1968) is consistent with both the Principle of the Learner and the
Principle of the Lesson. In addition, elements of self-determination theory (Ryan & Deci, 2000) are clearly consistent | |
with the guidelines in the Principle of the Teaching Process, and spaced-retrieval practice (Karpicke & Roediger, | |
2007) easily fits within the Principle of Review, the reviewing, rethinking, re-knowing, and reproducing of the | |
learning. Eyler’s (2018) description of curiosity as one the fundamental elements of how humans learn contains | |
many elements that overlap with and are similar to the language used by Gregory to describe the Principle of the | |
Learner. In order for learning to occur, the learner must actively engage in the learning process and must demonstrate | |
curiosity toward that which is to be learned. | |
As Wilson (2014) indicated, “highly effective teachers will understand the profound differences between methods of | |
teaching and principles of teaching” (p. 3). For example, lesson plan development is a common method used in | |
teacher preparation programs to emphasize the importance of comprehensive understanding of the lesson to be taught. | |
The lesson plan includes objectives, a review of previous lessons, a summary of the content, and identification of | |
activities that will be used to facilitate the learning. These activities represent methods that are consistent with the | |
Principle of the Lesson. The teacher must have a clear understanding of what is to be learned in this class and how | |
the content to be learned builds upon previous lessons or classes. | |
Additionally, in the higher education arena, institutions and accreditation bodies have a variety of methods designed | |
to be consistent with the Principle of the Teacher. A teacher must be one who knows the lesson or truth to be taught. | |
Potential faculty are evaluated on the relevance of their degrees, research, and experiences to the classes to be taught, | |
all of which is done in an attempt to demonstrate that the instructor knows the lesson or truth to teach. | |
There is danger in too great a focus on methods rather than on the principles. For example, the actions of some
accrediting bodies in higher education imply that the only way an instructor can learn about a particular content area | |
is to take courses at a university or college. However, it is easy to elicit examples of respected experts who developed | |
their expertise outside the traditional classroom. Another example can be easily observed in the developing role of | |
the digital classroom. While the methods of developing and maintaining the engagement of students are likely to be | |
quite distinct from a face-to-face classroom versus an online or hybrid classroom, the Principle of the Learner is | |
equally relevant in both settings. | |
4.2 Limitations and Future Work | |
While embracing convenience sampling and incentivizing student participation increases reliability and power | |
associated with sample size, it also influences who accepts the invitation to participate. This increases the potential | |
non-response bias of the study. Similarly, while adhering to Gregory’s (1886) language closely was a primary | |
component of identifying transferability, the structure of the instrument may increase desirability and acquiescence | |
biases. Such response biases are possible when evaluating a series of statements without embedded item controls. | |
While a highly controlled instrument was outside the scope of this work, future studies can leverage an in-depth | |
analysis of specific principles and guidelines using survey techniques designed to mitigate bias. | |
The sample size, while sufficient for the statistical purposes of the study, is not necessarily sufficient to make an | |
argument that it is representative of a national population of educators or future educators. However, we believe the | |
sample is strengthened by the diversity of academic disciplines that are represented in it. Additional replications of | |
the study with larger, more representative samples will be necessary to extrapolate the results to a larger population; | |
this will be a focus of continued research. | |
Additional efforts are needed to examine each of the Seven Timeless Principles in-depth and to provide insights into | |
the application in 21st century education. This includes more detailed research involving a larger and more diverse | |
sample, as well as the addition of mixed methods for a more comprehensive portrayal of data. Further, future efforts | |
will attempt to demonstrate that current-day teaching theories and methods, as well as modern policies and | |
regulations that are considered innovative, are founded in these Timeless Principles. In addition, there is potential to | |
create a framework for the teaching and learning process that assists teachers at all levels of education to clearly | |
associate their strategies and methods of teaching with the Timeless Principles. | |
References | |
Atkinson, R. C., & Shiffrin, R. M. (1968). Human memory: A proposed system and its control processes. Psychology
of Learning and Motivation, 2, 89-195. https://doi.org/10.1016/S0079-7421(08)60422-3 | |
Auld, M. E., & Bishop, K. (2015). Striving for excellence in health promotion pedagogy. Pedagogy in Health | |
Promotion, 1(1), 5-7. https://doi.org/10.1177/2373379915568976 | |
Chickering, A. W., & Gamson, Z. F. (1987). Seven principles for good practice in undergraduate education. AAHE | |
Bulletin, 39(7), 3-6. Retrieved from https://aahea.org/articles/sevenprinciples1987.htm | |
Council of Chief State School Officers. (2011). InTASC model core teaching standards: A resource for state dialogue. | |
Washington, DC: Author. Retrieved from https://ccsso.org/resource-library/intasc-model-core-teaching-standards
Eyler, J. R. (2018). How humans learn: The science and stories behind effective college teaching. Morgantown, WV:
West Virginia University Press.
Gregory, J. M. (1886). The seven laws of teaching. Boston, MA: Congregational Sunday-School and Publishing | |
Society. | |
Karpicke, J. D., & Roediger, H. L. III. (2007). Expanding retrieval practice promotes short-term retention, but | |
equally spaced retrieval enhances long-term retention. Journal of Experimental Psychology: Learning, Memory, | |
and Cognition, 33(4), 704-719. https://doi.org/10.1037/0278-7393.33.4.704
National Education Association. (2012). Preparing 21st century students for a global society: An educator’s guide to | |
the “four Cs.” Alexandria, VA: Author. | |
Perkins, D. (2008). Making learning whole: How seven principles of teaching can transform education. San | |
Francisco, CA: Jossey-Bass. | |
Pruitt, S. D., & Epping-Jordan, J. E. (2005). Preparing the 21st century global healthcare workforce. BMJ, 330, 637. | |
https://doi.org/10.1136/bmj.330.7492.637 | |
Rosenshine, B., & Furst, N. (1971). Research on teacher performance criteria. In B. O. Smith (Ed.), Research in | |
teacher education (pp. 37-72). Englewood Cliffs, NJ: Prentice Hall.
Ryan, R. M., & Deci, E. L. (2000). Self-determination theory and the facilitation of intrinsic motivation, social | |
development, and well-being. American Psychologist, 55(1), 68-78. https://doi.org/10.1037/0003-066X.55.1.68 | |
Stephenson, L. (2014). Appendices. In J. M. Gregory, The seven laws of teaching (1st ed. reprint; pp. 129-144). | |
Moscow, ID: Canon Press. | |
Thorndike, E. L. (1906). The principles of teaching: Based on psychology. London, England: Routledge. | |
Tomlinson, C. A. (2017). How to differentiate instruction in academically diverse classrooms (3rd ed.). Alexandria, | |
VA: ASCD. | |
Walls, R. T. (1999). Psychological foundations of learning. Morgantown, WV: West Virginia University International | |
Center for Disability Information. | |
Wilson, D. (2014). Foreword: The seven disciplines of highly effective teachers. In J. M. Gregory, The seven laws of | |
teaching (1st ed. reprint; pp. 1-9). Moscow, ID: Canon Press. | |
Copyrights | |
Copyright for this article is retained by the author(s), with first publication rights granted to the journal. | |
This is an open-access article distributed under the terms and conditions of the Creative Commons Attribution | |
license (http://creativecommons.org/licenses/by/4.0/). | |
Developing Peer Review of Instruction | |
in an Online Master Course Model | |
John Haubrick | |
Deena Levy | |
Laura Cruz | |
The Pennsylvania State University | |
Abstract | |
In this study we looked at how participation in a peer-review process for online Statistics courses | |
utilizing a master course model at a major research university affects instructor innovation and | |
instructor presence. We used online, anonymous surveys to collect data from instructors who | |
participated in the peer-review process, and we used descriptive statistics and qualitative analysis | |
to analyze the data. Our findings indicate that space for personal pedagogical agency and | |
innovation is perceived as limited because of the master course model. However, responses | |
indicate that participating in the process was overall appreciated for the sense of community it | |
helped to build. Results of the study highlight the blurred line between formative and summative | |
assessment when using peer review of instruction, and they also suggest that innovation and | |
presence are difficult to assess through short term observation and through a modified version of | |
a tool (i.e., the Quality Matters rubric) intended for the evaluation of an online course rather than | |
the instruction of that course. The findings also suggest that we may be on the cusp of a second | |
stage for peer review in an online master course model, whether in-person or online. Our findings | |
also affirm the need for creating a sense of community online for the online teaching faculty. The | |
experiences of our faculty suggest that peer review can serve as an integral part of fostering a | |
departmental culture that leads to a host of intangible benefits including trust, reciprocity, | |
belonging, and, indeed, respect. | |
Keywords: Peer review, online teaching, teaching evaluation, master course model, statistics | |
education, instructor presence | |
Haubrick, J., Levy, D., & Cruz, L. (2021). Developing peer review of instruction in an online
master course model. Online Learning, 25(3), 313-328. doi:10.24059/olj.v25i3.2428 | |
Peer review has a long history in academia, originating in the professional societies of the | |
early Enlightenment. The practice first arose to address the need for an evaluative metric of the
quality of research in an era replete with amateur scientists. In this same context,
peer review also functioned as a foundation for establishing collective expertise that was not | |
dependent on the approval of an external body, whether political fiat or divine consecration. The | |
present study examines one way in which this long-standing practice of peer review has evolved | |
to embrace new professional modes (i.e., teaching), new modalities of instruction (i.e., online), | |
and new roles for instructors within the current context of higher education. | |
Literature Review | |
Peer review had long been the gold standard for academic research, but it was not until | |
the learning-centered revolution, begun in the 1970s, that the practice found application in | |
education. At first, peer review was confined largely to volunteers who were experimenting with | |
pedagogical changes stemming from recent developments in learning science research. As one | |
leading scholar writes, there was “a general sense…that teaching would benefit from the kinds of | |
collegial exchange and collaboration that faculty seek out as researchers” (Hutchings, 1996). | |
Further, contrary to the conservative bias often attributed to the peer review of research (Roy & | |
Ashburn, 2001), peer review of teaching (PRT) has increasingly proven to foster both personal | |
empowerment and teaching transformation (Chism, 2005; Hutchings, 1996; Lomas & Nicholls, | |
2005; Smith, 2014; Trautman, 2009). As one set of scholars state, “the value of formative peer | |
assessment is promoted in the exhortative literature…justified in the theoretical literature…and | |
supported by reports of experimental and qualitative research” (Kell & Annetts, 2009; Hyland et
al., 2018; Thomas et al., 2014). | |
Those early experiments led to dramatic breakthroughs in evidence-based practice in | |
teaching and learning and, by extension, changes in how these activities are evaluated. Since the | |
early 2000s, universities have responded to a growing imperative to assess teaching | |
effectiveness, both as a means of evaluating work performance and as a way of demonstrating | |
collective accountability for the student learning experience. An increasing number of studies | |
have linked effective instruction to desired institutional outcomes, including recruitment, | |
persistence, and graduation rates, upon the latter of which many funding models rest. Because | |
the drive towards accountability is fueled by student interests, it is perhaps not surprising that the | |
most common strategy for evaluating teaching is the use of student evaluations of instruction (SETs). At
a typical U.S. university today, students are asked to complete an electronic survey at the end of
each semester, consisting of a series of scaled survey items along with a handful of open-ended
questions. | |
Over the years, the use of SETs as a measure of teaching effectiveness has been both | |
affirmed and disputed (Seldin, 1993). The reliability of the practice has been strengthened | |
through increasing sophistication of both the design of the questions and the analysis of the | |
results. At the same time, however, it has also been questioned as the basis of personnel | |
decisions (Nilson, 2012; Nilson, 2013). | |
Although not definitively proven, there is a persistent perception that SETs are biased, | |
particularly in the case of faculty members from under-represented populations, including those | |
for whom English is a second language and, in some disciplines, women (Calsamiglia & | |
Loviglio, 2019; Zipser & Mincieli, 2018). Other scholars have called the validity of the results
into question, suggesting that students are not always capable of assessing their own learning | |
accurately or appropriately, leading to claims that SETs are more likely to measure popularity | |
rather than effectiveness (Schneider, 2013; Uttl et al., 2017). Perhaps the only safe and definitive | |
conclusion to draw is that the implications of the practice are complex and contested. | |
Higher education institutions have navigated these stormy waters in multiple ways, most | |
by encouraging the use of multiple forms of measurement for teaching effectiveness, often in the | |
form of a portfolio, or similar collection tool (Chism, 1999; Seldin et al., 2010). This practice is | |
supported by the research literature, which aligns the practice with the multi-faceted nature of | |
teaching as well as the importance of direct (e.g., not self-reported) measures of student learning. | |
To potentially counterbalance the limitations of SETs, practitioners have suggested the use of | |
PRT, which places disciplinary experts, rather than amateur students, in the driver’s seat. In this | |
evaluative mode, PRT typically takes the form of either peer review of instructional materials | |
and/or peer observation of teaching. | |
While PRT may appear to be a neat solution to a pervasive issue, the practice had | |
previously been used largely for formative purposes on a voluntary basis. The transition to | |
compulsory (or strongly encouraged) evaluative practice has proven to be fraught with dangers, | |
both philosophical and practical (Blackmore, 2005; Edström, 2019; Keig, 2006; McManus,
2001). Practically speaking, the PRT process requires a considerable investment of time, energy,
and attention, not only to conducting the reviews but also to developing shared standards and | |
practices. Philosophically, several scholars have predicted that several of the primary benefits of | |
PRT as a developmental tool might suffer when transposed into a summative context (Cavanagh, | |
1996; Gosling, 2002; Kell & Annetts, 2009; Morley, 2003; Peel, 2005). It has proven to be
difficult to substantiate these fears, however, as one of the downsides of utilizing summative | |
assessment is the challenges it presents to research. | |
The PRT problem is confounded by the rise of new modes of instruction, especially | |
online and hybrid modalities (Bennett & Barp, 2008; Jones & Gallen, 2016). Since its inception, | |
online education has carried with it a burden of accountability that traditional in-person | |
instruction has not, and the onus rests with online instructors to prove that the virtual learning | |
experience is of comparable quality to other modalities (Esfijani, 2018; Shelton, 2011). This has,
in turn, led to the development and refinement of shared quality standards for online courses | |
(notably, the Quality Matters (QM) rubric), the application and evaluation of which often rely on | |
the collective expertise of other online instructors, i.e., pedagogical (rather than disciplinary) | |
peers (Shattuck et al., 2014). The QM peer-review process, for example, designates two reviewer | |
roles, a subject matter expert and online pedagogy practitioner, the latter of whom undergoes a | |
QM-administered certification process. | |
The proliferation of online courses, however, has been accompanied by design and | |
implementation changes. Because it takes time and sustained engagement to master the | |
techniques and approaches needed to meet the quality standards for online courses, the role of | |
the instructional designer (ID) as expert in these areas has become increasingly commonplace. A | |
typical role for an ID might be to collaborate closely with faculty members to design and develop | |
online courses that effectively deliver content in a manner that meets (or exceeds) quality | |
standards. Once created, it is certainly possible for the same course to be taught by multiple | |
faculty members. | |
In a typical ID-faculty scenario, the faculty member often has considerable input on the | |
design as it evolves and provides primary instruction, but peer review of instruction is | |
complicated both by the medium and the role of the third party (the ID) (Drysdale, 2019). For | |
example, the observation protocols developed for the classroom may not apply to a virtual space, | |
at least not to the same degree, and a review of instructional strategies, as reflected in artifacts | |
such as the syllabus, may be the product of the ID, the faculty member, or both. It is perhaps
for these reasons that peer review of online instruction has tended to focus on the course rather | |
than the instructor. The Quality Matters rubric, for example, emphasizes attributes of course | |
design rather than teaching effectiveness. Yet, the need for evaluative measures of instruction | |
and instructor persists, perhaps even more so as trends point to a growing number of adjunct | |
faculty teaching online courses for whom such measures can provide both accountability and | |
professional development (Barnett, 2019; Taylor, 2017).
The challenge is further compounded by the emergence of instructional standards and/or | |
competencies for online (or hybrid) courses that are distinctive to the virtual environment, both | |
in form and context (Baran et al., 2011). The popular community of inquiry model, for example, | |
differentiates between cognitive presence (content and layout), social presence (engagement), | |
and teaching presence in online courses; all are facets of instruction that are less emphasized in | |
in-person instruction. These insights have led to the development of several exemplary protocols | |
specifically intended for reviewing online instruction (McGahan et al., 2015; Tobin et al., 2015). | |
Each of these tools is firmly grounded in an extensive body of evidence-based practice for
online teaching, but still, the handful of studies that have been conducted on the PRT process | |
itself have tended to be limited to case studies and/or action research (Barnard et al., 2015; | |
Swinglehurst et al., 2008; Sharma & Ling, 2018; Wood & Friedel, 2009). As one researcher put
it, it is simply “difficult to find quantitative evidence due to its nature and context” (Bell, 2002; | |
Peel, 2002). | |
The challenge of peer review of teaching is even further complicated by the increasing | |
use of the master course model (Hanbing & Mingzhuo, 2012; Knowles & Kalata, 2007). For | |
courses in which stakes are higher and student populations larger, such as gateway or barrier | |
courses, an institution may choose to adopt a master course model in which an already designed | |
course is provided to all instructors, thereby ensuring a consistent experience for all students | |
(Parscal & Riemer, 2010). In this scenario, instructors have little to no control over the content, | |
design, and, in many cases, delivery of the course, all of which serve as major components of | |
most peer review of instruction models, whether for online or in-person courses. However, even | |
within a master course model, instruction varies and opportunities remain to provide both | |
formative (for individual improvement) and summative (for performance evaluation) feedback. | |
Yet, the question of how to evaluate teaching within these boundaries is a subject that has | |
received less attention in both research and practice. Our study explores the implementation of a | |
peer review of teaching process for an online statistics program that uses master courses at a | |
large, public, research-intensive university. | |
Methods | |
Context | |
The Pennsylvania State University is a public research university located in the | |
northeastern part of the United States. The statistics program offers 24 online courses, with | |
approximately 1500 enrollments per semester, including those for its online graduate program | |
and two undergraduate service courses. Statistics courses have been identified as barrier courses | |
at many institutions, including this one. Therefore, the program at The Pennsylvania State | |
University bears the responsibility for high standards of instruction that contribute to student
success, especially persistence. | |
Each of the program’s 24 courses is based on a master template of objectives, content, | |
and assessments. The courses are delivered through two primary systems, the learning | |
management system (LMS) and the content management system (CMS). Each section has its | |
own unique LMS space for each iteration of the course. Students and instructors use the LMS for | |
announcements, communication/email, assessments, grading, discussion and any other | |
assignments or interactions. The lesson content for each course is delivered through a CMS, | |
which in this case has a public website whose content is classified as open educational resources | |
under a creative commons license. The CMS is unique to the course and is not personalized or | |
changed from semester to semester. Similarly, the lesson content, developed and written by | |
program faculty members, does not change from semester to semester, aside from minor fixes | |
and/or planned revisions. | |
Instructor agency in the LMS context varies depending on the course taught, how long | |
the instructor has taught it, and how many sections are offered in that semester. Instructors who | |
are teaching a course that has only one section have more agency to change appearance and | |
interactions within the LMS than instructors who are teaching a course with multiple sections. In | |
this statistics department, only one section of most of the online graduate courses is offered per | |
semester, while more than one section of undergraduate courses is typically offered. The largest | |
of these undergraduate courses is a high enrollment, general education requirement course that | |
runs 10-12 sections per semester. Courses with multiple sections use the same CMS as well as | |
the same master template in the LMS to maintain consistency in the student experience. | |
Therefore, in a single section course the instructor could modify the design of their course space | |
within the LMS by choosing their home page, setting the navigation, and organizing the modules | |
while still delivering the content and objectives as defined by the department for that course. | |
Such modifications are less likely to occur in multi-section courses. The following table | |
highlights the level of agency possessed by the instructor in both the CMS and LMS according to | |
the varied teaching contexts in this department. | |
Table 1
Levels of Instructor Agency in Various Course Types Offered

If the instructor teaches...         Content Management System (CMS)   Learning Management System (LMS)
Undergraduate, single section        Low                                High
Graduate, single section             Low                                High
Undergraduate, multiple sections     Low                                Low
Graduate, multiple sections          Low                                Low
During the fall 2019 semester, the faculty members in the department who teach online courses | |
consisted of full-time teaching professors (n=13), tenure-track professors (n=6), and
adjuncts (n=10). Peer review of instruction has been practiced since the onset of the program. In | |
its current iteration, the process takes place annually over an approximately three-week period in | |
the fall semester. The primary purpose of the peer-review process is to offer formative feedback | |
to the instructors, but the results are shared with the assistant program director and faculty | |
members are permitted (though not required) to submit the results as part of their reappointment, | |
promotion, and tenure dossiers. For the fall 2019 semester, 27 of the 29 (93%) faculty members | |
participated in the peer-review process. | |
Peer Review of Instruction Model | |
In the fall of 2018, the instructional designer for these statistics courses piloted a new | |
peer-review rubric, which is a modification of the well-known Quality Matters Higher Ed rubric. | |
In this modification, 21 out of 42 review standards were determined to be applicable to the | |
instructors in the master course context. The rubric serves as the centerpiece of a two-part | |
process, in keeping with identified best practices (Eskey & Roehrich, 2013). First, the faculty
member completes a pre-observation survey and the reviewer, who is added to the course as an | |
instructor, evaluates the course according to each of the twenty-one standards in the rubric. The | |
observation is followed by a virtual, synchronous meeting with the peer-review partner. Faculty | |
members are paired across various teaching ranks and course levels, and the pairings are rotated | |
from year to year. Both the observation and the peer meeting are guided by materials created by | |
the instructional designer, who provides both the instructor intake form and two guiding | |
questions for discussion. | |
In keeping with evidence-based practice for online instruction, the first discussion prompt | |
addresses how the faculty establish social, cognitive, and teaching presence within their course. | |
Along with the prompt, definitions and examples of each type of presence are provided to the | |
instructor. | |
Discussion prompt 1 in the online statistics program peer-review guide: | |
Prompt #1: Share with your peer how you establish these three types of presence in your | |
course. | |
Notes: How does your peer establish these three types of presence in their course? | |
The second prompt provides an opportunity for the instructors to share changes or innovations | |
they have implemented within the past year. | |
Discussion prompt 2 in the online statistics program peer-review guide: | |
Prompt #2: Share with your peer if you are trying anything new this semester (or year)?
If yes, share your innovation or change you’ve made this semester (or year). | |
• Has the innovation or change been successful? | |
• What challenges have you had to work through? | |
• How could others benefit from what you’ve learned? | |
• What advice would you share with a colleague who is interested in trying | |
this or something similar? | |
Notes: What has your peer done this semester (or year) that is innovative or new for | |
them? | |
The process seeks to evaluate and promote not only quality standards through the rubric, but also | |
collegial discussion around innovation, risk-taking, and instructor presence. | |
Study Design | |
The IRB-approved study was originally intended to be a mixed methods study, in which | |
input from participating instructors, collected in the form of a survey, would be supplemented | |
with an analysis of the peer-review artifacts, especially the instructor intake form and the peer-review rubric (which includes the 2 discussion prompts). The instructors provided mixed
responses to the requests for use of their identifiable artifacts, which limits their inclusion in the | |
study, but the majority did choose to participate in the anonymous survey (14 out of 27, 54%) | |
which was administered in the Fall semester of 2019. The online survey, sent to instructors by a | |
member of the research team not associated with the statistics department, consisted of 11 | |
questions, comprised of 1 check all that apply, 8 five-point Likert scale, 1 yes/no, and 3 open-ended questions.
Results | |
Quantitative Results | |
With the small sample size (n=13) we are limited to basic descriptive statistics to analyze | |
the results of the Likert questions. The least frequently chosen category on the Likert scale of
this survey was “neither agree nor disagree” (n=10), while “somewhat agree” (n=37) was the | |
most frequently chosen. In looking at the responses to specific prompts, we note that the | |
statement with the highest score was The steps of the peer-review process were clear. For this | |
statement, 13/13 responded with somewhat agree or strongly agree (mode = “strongly agree”). | |
Consistent with our qualitative findings, the next highest scoring statement was The peer-review | |
process was collegial, where 12/13 responded with somewhat agree or strongly agree and one | |
responded as neither agree nor disagree (mode = “strongly agree”). The statement The peer-review process was beneficial to my teaching received the third highest rating with 10/13
respondents saying that they somewhat agree (n=7) or strongly agree (n=3) (mode = “somewhat | |
agree”). | |
We do want to note that consistent with best survey design practice, one of the statements | |
was purposely designed as a negative statement: The peer-review process was not worth the time | |
spent on doing it. For this prompt, 8/13 responded with strongly disagree or somewhat disagree, | |
while 3/13 somewhat agreed with that statement and 2 chose neither agree nor disagree (mode = | |
“strongly disagree”). | |
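As an illustration of the descriptive tabulation reported above, the following minimal sketch (not the study's code) counts the responses per Likert category and reports the modal category for each statement; the statement names and response counts are hypothetical.

# Illustrative sketch only: hypothetical responses, not the study's data.
import pandas as pd

LEVELS = ["Strongly disagree", "Somewhat disagree",
          "Neither agree nor disagree", "Somewhat agree", "Strongly agree"]

# One row per respondent, one column per survey statement (column names are invented).
responses = pd.DataFrame({
    "steps_were_clear": ["Strongly agree"] * 9 + ["Somewhat agree"] * 4,
    "process_was_collegial": ["Strongly agree"] * 8 + ["Somewhat agree"] * 4
                             + ["Neither agree nor disagree"],
})

for statement in responses.columns:
    counts = responses[statement].value_counts().reindex(LEVELS, fill_value=0)
    mode = responses[statement].mode().iloc[0]      # most frequently chosen category
    print(statement, dict(counts), "mode:", mode)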
Qualitative Results | |
The findings suggest that the participants operated under several constraints. When asked | |
how they assess student learning in the intake form, for example, the majority indicated that the | |
assessments are part of the master class and largely outside of their control, e.g. All… sections | |
have weekly graded discussion forums (might not be the same question), same HWs and same | |
exams. All instructors contribute for exams and HWs. Assessment of learning outcomes mainly | |
occur through these. This was evident both in the content and tone of their responses, with | |
passive voice predominating, e.g., quiz and exam questions are linked to lesson learning | |
objectives. The presence of constraint also came to the fore in the survey questions about | |
changes; for those who did make changes (6/11), these largely took the form of micro-innovations (e.g., so far just little things, small modifications), tweaks primarily focused on course policies (e.g., new late policy), enhancing instructor presence (e.g., try new introductions; I am using announcements more proactively), or fostering community (e.g., increasing discussion board posts, add netiquette statement).
Space for personal pedagogical agency and innovation is perceived as limited because of | |
the master course model employed in this context. This sentiment is evidenced by the tone of the
survey responses related to assessments, as discussed above. On the other hand, the instructor
intake form shows that instructors can innovate and experiment with those course components | |
that can be characterized broadly as relating to instructor presence, particularly regarding | |
communication in the course. There is a marked shift in the tone of response when asked, for | |
example, Please describe the nature and purpose of the communications between students and | |
instructors in this course. Responses to this question show agency and active involvement on the | |
part of the instructor in this aspect of the course: | |
I post announcements regularly and am in constant communication with the class. The | |
discussion forums have a fair bit of chatter and I have replied with video and images as | |
well there with positive feedback. | |
I respond very quickly to student correspondence. I use the course announcements | |
feature very often and check Canvas multiple times a day. | |
I would like to promote the use of the Discussion Boards more, but students still do not | |
use those as much as I would like them to. | |
In this last example, we see that the instructor is forward-looking and discusses changes that he
or she would like to make in the future. The data suggest that instructors are trying to make
space for their own unique contribution to the course and for more personalized choices in their | |
interactions with students. They are also eager to get feedback from their peers on practices that | |
fall into this space of agency: | |
I would appreciate any feedback on my use of course announcements. Do you feel that | |
they are appropriate in both content, frequency, and timing? | |
Our findings indicate that many of these instructors are operating within the constraints of a | |
master course model, as discussed earlier, and they are most enthusiastic in their responses and | |
innovative in their teaching when they can identify areas over which they can exert some degree | |
of control in the course design and delivery process. | |
As evidenced in the quantitative findings previously discussed, these qualitative findings | |
also tell us that instructors who participated in the survey appreciate the collegiality of the | |
process. Their open-ended responses indicate an appreciation of the collegiality and connection, | |
the informal learning, that the peer-review process afforded them. For example, one instructor | |
comments, “I have enjoyed the opportunity to discuss teaching ideas and strategies with other | |
online faculty. As a remote faculty member, I particularly value that interaction.” Responses | |
primarily indicate that participating in the process was overall appreciated for the sense of | |
community it helped to build. What we see emerge is another space—a space where instructors | |
can negotiate together the limitations for innovation that exist in this sequence of Statistics | |
courses, and where they can also share experiences. As one participant comments, The direct | |
communication with the peer is great for sharing positive and negative experiences with different | |
courses. As we see in our findings, faculty members clearly find value in the process, regardless | |
of the product. This insight suggests the presence of a lesser known third model, distinct from | |
either formative or evaluative formats, called collaborative PRT (Gosling, 2002; Keig & | |
Waggoner, 1995). In collaborative PRT, the end goal is to capture the benefits of turning | |
teaching from a private into a more public, collaborative activity (Hutchings, 1996).
Discussion | |
Our findings should not be overstated. This study was conducted for a single program at a | |
single university over the course of one semester; as such, the results may or may not be | |
replicable elsewhere. Replication may also be hindered by the challenges inherent in studying | |
peer review as a process. Because the results of peer review in this case may be used for | |
summative or evaluative purposes, any evidence generated is considered part of a personnel file | |
and, as such, subject to higher degrees of oversight in the ethical review process. The ethical | |
review board at The Pennsylvania State University, for example, did not classify this study as | |
exempt research, but rather put the proposal through full (rather than expedited or exempt) board | |
review, and required additional accountability measures. The evaluative nature of those
documents also contributed to low faculty participation (n=3) in the first stage of our study, | |
where we asked to include copies of their peer-review documents (an intake form, review rubric, | |
and meeting notes). There is a reason why there are comparatively few studies on peer review as | |
a process. | |
In the case of the statistics program, the primary rationale for establishing a peer review | |
of teaching process was intended to be formative assessment, i.e., providing feedback to | |
instructors so that they might improve the teaching and learning in online statistics courses. In | |
practice, however, the boundaries between formative and summative assessment blurred. While | |
instructors were not required or compelled to disclose the results of their peer review, many did | |
choose to include comments and/or ratings in their formal appointment portfolios, especially | |
when the only other evidence of teaching effectiveness (a primary criterion) available is student
evaluations of instruction (SETs). At The Pennsylvania State University, SETs are structured so | |
that students provide feedback on both the instructor and the course, at times separately and, at | |
other times, together. In a master course model, however, instructors have limited control over | |
many components of the course, making the results of student evaluations challenging to parse | |
out and potentially misleading if treated nominally or comparatively. | |
The distinction between formative and evaluative assessment is not the only blurred line | |
that arose from this study. In this case, peer review of instruction was accomplished with a | |
modified version of a tool (the QM rubric) intended to be used for the evaluation of an online | |
course. The modification of the QM rubric took the form of removing questions or sections | |
pertaining to course components deemed to be outside the control of the master course | |
instructors. In addition to the modified QM rubric, two supplemental items—open-ended | |
questions—were added to the review process. These items focused on presence and innovation, | |
which are difficult to assess through short-term observation. Our results suggest that this strategy | |
has led to partial success, i.e., the majority (10/13) of faculty members who responded to our | |
survey strongly or somewhat agreed that the process was beneficial, but its impact on teaching | |
practice has been limited. This may be partially a result of the limited scope of the study (one | |
academic year) which may or may not be an appropriate time frame for capturing changes to | |
teaching practice, but it may also stem from limitations in the current iteration of the peer-review | |
process itself. | |
If we look back over the history of peer review of instruction for online courses, a pattern | |
emerges in which first, an existing tool, developed for a different purpose or context, is | |
imported and adapted into a new environment. This occurred, for example, when peer
evaluation tools designed for in-person courses were adapted to suit online courses. In the next | |
stage, the adaption process reveals limitations of the existing tool which, in turn, spur the | |
development of new instruments or processes that are specifically designed for the context in | |
which they are being used. The creation of the QM Rubric is a clear example of this latter step. | |
The findings of our study suggest that we may be on the cusp of this second stage for | |
peer review of teaching in online master courses, which constitutes a quite different teaching | |
environment than other types of courses, whether in-person or online. In the case of master | |
courses, there is a distinctive division of labor where, primarily, instructional designers work | |
with authors to develop courses, course leads manage content, and instructors serve as the | |
primary point of contact with students. It may be time to develop a new rubric (or similar tool) | |
that takes this increasingly popular configuration more into consideration. | |
Adoption of the master course model is fueled by the need for both efficiency and | |
consistency in the student learning experience, and both experience and research suggest that it | |
has been effective in serving these goals. That being said, like all models, it also has its | |
limitations. Our study suggests that one of those tradeoffs may be that the model constricts both | |
the space for and the drivers of change. Without being able to make changes to the master course | |
itself, the faculty in our study tried to find ways to make small changes, i.e., micro-improvements | |
in those areas over which they held agency. Larger or more long-term changes, on the other | |
hand, would need to come from instructional designers and program managers, who may be one | |
or even two steps removed from the direct student experience. Although instructors frequently | |
make suggestions for course improvements, large changes to courses are not frequently | |
implemented. In other words, the division of labor needed to support the master course model | |
also divides agency, and the challenge remains to find systematic ways to re-integrate that | |
agency in the service of continuous improvement. | |
The limitations on faculty agency inherent in the master course model have led some | |
institutions to further devalue the role, replacing faculty-led courses with lower-paid, less
recognized, and more easily interchangeable instructor roles (Barnett, 2019). Such a path would
be at odds with the culture of The Pennsylvania State University, but it does suggest the need for | |
faculty development, i.e., for finding ways to support and treat even part-time instructors as | |
valued and recognized members of the community of teaching and learning, even in conditions | |
where they may not be able to meet in person. It could be said that our findings affirm the
need for creating a sense of community online, both inside and outside of the courses, for the faculty
members who teach them. The experiences of our faculty members suggest that peer review can | |
be an integral part of departmental culture that supports faculty peer to peer engagement, leading | |
to a host of intangible benefits including trust, reciprocity, belonging, and, indeed, respect. | |
References | |
Baran, E., Correia, A. P., & Thompson, A. (2011). Transforming online teaching practice: | |
Critical analysis of the literature on the roles and competencies of online teachers. Distance | |
Education, 32(3), 421-439. | |
Barnard, A., Nash, R., McEvoy, K., Shannon, S., Waters, C., Rochester, S., & Bolt, S. (2015). | |
LeaD-in: a cultural change model for peer review of teaching in higher education. Higher | |
Education Research & Development, 34(1), 30-44. | |
Barnett, D. E. (2019). Full-range leadership as a predictor of extra effort in online higher | |
education: The mediating effect of job satisfaction. Journal of Leadership Education, 18(1). | |
Bennett, S., & Barp, D. (2008). Peer observation–a case for doing it online. Teaching in Higher | |
Education, 13(5), 559-570. | |
Blackmore, J. A. (2005). A critical evaluation of peer review via teaching observation within | |
higher education. International Journal of Educational Management, 19(3), 218-232. | |
Calsamiglia, C., & Loviglio, A. (2019). Grading on a curve: When having good peers is not | |
good. Economics of Education Review, 73(C). | |
Cavanagh, R. R. (1996). Formative and summative evaluation in the faculty peer review of | |
teaching. Innovative higher education, 20(4), 235-240. | |
Chism, N. V. N. (1999). Peer review of teaching. A sourcebook. Bolton, MA: Anker. | |
Drysdale, J. (2019). The collaborative mapping model: Relationship-centered instructional | |
design for higher education. Online Learning, 23(3), 56-71. | |
Eskey, M. T., & Roehrich, H. (2013). A faculty observation model for online instructors:
Observing faculty members in the online classroom. Online Journal of Distance Learning | |
Administration, 16 (2). | |
http://www.westga.edu/~distance/ojdla/summer162/eskey_roehrich162.html | |
Edström, K., Levander, S., Engström, J., & Geschwind, L. (2019). Peer review of teaching merits | |
in academic career systems: A comparative study. In Research in Engineering Education | |
Symposium. | |
Esfijani, A. (2018). Measuring quality in online education: A meta-synthesis. American Journal | |
of Distance Education, 32(1), 57-73. | |
Gosling, D. (2002). Models of peer observation of teaching. Report. LTSN Generic Center. | |
https://www.researchgate.net/profile/David_Gosling/publication/267687499_Models_of_Peer_Observation_of_Teaching/links/545b64810cf249070a7955d3.pdf
Graham, C., Cagiltay, K., Lim, B. R., Craner, J., & Duffy, T. M. (2001). Seven principles of | |
effective teaching: A practical lens for evaluating online courses. The Technology Source, 30(5), | |
50. | |
Hanbing, Y., & Mingzhuo, L. (2012). Research on master-teachers’ management model in online | |
course by integrating learning support. Journal of Distance Education, 5(10), 63-67. | |
Hutchings, P. (1996). Making teaching community property: A menu for peer collaboration and | |
peer review. AAHE Teaching Initiative. | |
Hutchings, P. (1996). The peer review of teaching: Progress, issues and prospects. Innovative | |
Higher Education, 20(4), 221-234. | |
Hyland, K. M., Dhaliwal, G., Goldberg, A. N., Chen, L. M., Land, K., & Wamsley, M. (2018). | |
Peer review of teaching: Insights from a 10-year experience. Medical Science Educator, 28(4), | |
675-681. | |
Johnson, G., Rosenberger, J., & Chow, M. (October 2014) The importance of setting the stage: | |
Maximizing the benefits of peer review of teaching. eLearn, 2014 (10). | |
https://doi.org/10.1145/2675056.2673801 | |
Jones, M. H., & Gallen, A. M. (2016). Peer observation, feedback and reflection for development | |
of practice in synchronous online teaching. Innovations in Education and Teaching | |
International, 53(6), 616-626. | |
Keig, L. (2000). Formative peer review of teaching: Attitudes of faculty at liberal arts colleges | |
toward colleague assessment. Journal of Personnel Evaluation in Education, 14(1), 67-87. | |
Keig, L. W., & Waggoner, M. D. (1995). Peer review of teaching: Improving college instruction | |
through formative assessment. Journal on Excellence in College Teaching, 6(3), 51-83. | |
Kell, C., & Annetts, S. (2009). Peer review of teaching embedded practice or policy‐holding | |
complacency? Innovations in Education and Teaching International, 46(1), 61-70.
Knowles, E., & Kalata, K. (2007). A model for enhancing online course development. Innovate: | |
Journal of Online Education, 4(2). | |
Lomas, L., & Nicholls, G. (2005). Enhancing teaching quality through peer review of teaching. | |
Quality in Higher Education, 11(2), 137-149. | |
Mayes, R. (2011, March). Themes and strategies for transformative online instruction: A review | |
of literature. In Global Learn (pp. 2121-2130). Association for the Advancement of Computing | |
in Education (AACE). | |
McGahan, S. J., Jackson, C. M., & Premer, K. (2015). Online course quality assurance: | |
Development of a quality checklist. InSight: A Journal of Scholarly Teaching, 10, 126-140. | |
McManus, D. A. (2001). The two paradigms of education and the peer review of teaching. | |
Journal of Geoscience Education, 49(5), 423-434. | |
Nilson, L. B. (2012). 14: Time to raise questions about student ratings. To improve the academy, | |
31(1), 213-227. | |
Nilson, L. B. (2013). 17: Measuring student learning to document faculty teaching effectiveness. | |
To Improve the Academy, 32(1), 287-300. | |
Nogueira, I. C., Gonçalves, D., & Silva, C. V. (2016). Inducing supervision practices among | |
peers in a community of practice. Journal for Educators, Teachers and Trainers, 7, 108-119. | |
Parscal, T., & Riemer, D. (2010). Assuring quality in large-scale online course development. | |
Online Journal of Distance Learning Administration, 13(2). | |
Peel, D. (2005). Peer observation as a transformatory tool? Teaching in Higher Education, 10(4), | |
489 - 504. | |
Roy, R., & Ashburn, J. R. (2001). The perils of peer review. Nature, 414(6862), 393-394. | |
Schneider, G. (2013, March). Student evaluations, grade inflation and pluralistic teaching: | |
Moving from customer satisfaction to student learning and critical thinking. Forum for Social
Economics, 42(1), 122-135.
Seldin, P. (1993). The use and abuse of student ratings of professors. Chronicle of Higher | |
Education, 39(46), A40-A40. | |
Seldin, P., Miller, J. E., & Seldin, C. A. (2010). The teaching portfolio: A practical guide to | |
improved performance and promotion/tenure decisions. John Wiley & Sons. | |
Sharma, M., & Ling, A. (2018). Peer review of teaching: What features matter? A case study | |
within STEM faculties. Innovations in Education and Teaching International, 55(2), 190-200.
Shattuck, K., Zimmerman, W. A., & Adair, D. (2014). Continuous improvement of the QM | |
Rubric and review processes: Scholarship of integration and application. Internet Learning | |
Journal, 3(1). | |
Shelton, K. (2011). A review of paradigms for evaluating the quality of online education | |
programs. Online Journal of Distance Learning Administration, 4(1), 1-11.
Smith, S. L. (2014). Peer collaboration: Improving teaching through comprehensive peer review. | |
To Improve the Academy, 33(1), 94-112. | |
Swinglehurst, D., Russell, J., & Greenhalgh, T. (2008). Peer observation of teaching in the online | |
environment: an action research approach. Journal of Computer Assisted Learning, 24, 383-393. | |
Taylor, A. H. (2017). Intrinsic and extrinsic motivators that attract and retain part-time online | |
teaching faculty at Penn State (Doctoral dissertation, The Pennsylvania State University). | |
Thomas, S., Chie, Q. T., Abraham, M., Jalarajan Raj, S., & Beh, L. S. (2014). A qualitative | |
review of literature on peer review of teaching in higher education: An application of the SWOT | |
framework. Review of Educational Research, 84(1), 112-159. | |
Tobin, T. J., Mandernach, B. J., & Taylor, A. H. (2015). Evaluating online teaching: | |
Implementing best practices. San Francisco, CA: John Wiley & Sons. | |
Trautmann, N. M. (2009). Designing peer review for pedagogical success. Journal of College | |
Science Teaching, 38(4). | |
Uttl, B., White, C. A., & Gonzalez, D. W. (2017). Meta-analysis of faculty’s teaching | |
effectiveness: Student evaluation of teaching ratings and student learning are not related. Studies | |
in Educational Evaluation, 54, 22-42. | |
Wood, D., & Friedel, M. (2009). Peer review of online learning and teaching: Harnessing | |
collective intelligence to address emerging challenges. Australasian Journal of Educational | |
Technology, 25(1). | |
Zipser, N., & Mincieli, L. (2018). Administrative and structural changes in student evaluations of | |
teaching and their effects on overall instructor scores. Assessment & Evaluation in Higher | |
Education, 43(6), 995-1008. | |
Appendix A | |
Anonymous Survey Questions | |
Likert Questions [1-8] | |
Answer Options | |
Strongly disagree | |
Somewhat disagree | |
Neither agree nor disagree | |
Somewhat agree | |
Strongly agree | |
1. The peer-review process was beneficial to my teaching.
2. The peer-review process was beneficial to my career development.
3. The peer-review process was not worth the time spent on doing it.
4. The peer-review process was collegial.
5. The peer-review process provided me with new insight into my teaching practice.
6. The peer-review process inspired me to try new things related to my teaching.
7. The steps of the peer-review process were clear.
8. I have little to no prior experience with peer review of online teaching.
Open-ended Questions [9-11] | |
9. Did you make (or do you plan to make) changes to your instruction based on your participation in this peer-review process (e.g., feedback you received, conversations with your peers, rubrics, etc.)? Y/N
a. If Y, please describe the change(s) you plan to make to your instruction based on the feedback you received through the peer-review process.
10. Please describe at least two insights gained from your participation in the peer-review process. | |
11. What changes, if any, would you suggest should be made to enhance the benefits of the peer-review | |
process? | |
Appendix B | |
Instructor Intake Form Questions | |
Your information | |
1. What is your name? | |
2. What is your e-mail address? | |
3. Who is your assigned peer reviewer? | |
Your Online Course | |
4. What is your course name, number & section (e.g., STAT 500 001)? | |
5. What is the title of your course (e.g., Applied Statistics)? | |
6. What is the Canvas link to your course? | |
7. What is the link to the online notes in your course? | |
Context | |
8. How many semesters have you taught this course? Choose: (0-3) (4-6) (6 or more) | |
9. Does your course have multiple sections? | |
10. If yes, are all sections based on a single master (or another instructor’s) course? | |
11. If yes, roughly what percentage of the course do you change or personalize from the master? | |
12. How do you know if students are meeting the learning outcomes of your course? | |
13. Is there a specific part of the course content or design for which you would like the reviewer to | |
provide feedback? | |
14. Please describe the nature and purpose of the communications between students and instructors in | |
this course. | |
15. Are you trying anything new this semester based on prior student or peer feedback, professional | |
development, or your own experiences? | |
16. If yes, please explain. | |
Canvas Communication | |
17. Please identify other communications among students and instructors about which the reviewer should be aware, but which are not available for review at the sites listed above.
18. Does the course require any synchronous activities (same time, same place)? | |
___Yes | |
___No | |
19. If yes, please describe: | |
20. Is there any other information you would like to share with your peer before they review your | |
course? | |
Does peer feedback for teaching GPs improve student evaluation of general practice attachments? A pre-post analysis
Abstract | |
Objectives: The extent of university teaching in general practice is increasing and is in part realised with attachments in resident general | |
practices. The selection and quality management of these teaching | |
practices pose challenges for general practice institutes; appropriate | |
instruments are required. The question of the present study is whether | |
the student evaluation of an attachment in previously poorly evaluated | |
practices improves after teaching physicians have received feedback | |
from a colleague. | |
Methods: Students in study years 1, 2, 3 and 5 evaluated their experiences in general practice attachments with two 4-point items (professional competence and recommendation for other students). Particularly | |
poorly evaluated teaching practices were identified. A practising physician with experience in teaching and research gave these practices personal feedback on their evaluation results (peer feedback), mainly in the form of individual discussions in the practice (peer visit). After
this intervention, further attachments took place in these practices. The | |
influence of the intervention (pre/post) on student evaluations was | |
calculated in generalised estimating equations (cluster variable practice). | |
Results: Of 264 teaching practices, 83 had a suboptimal rating. Of | |
these, 27 practices with particularly negative ratings were selected for | |
the intervention, of which 24 have received it so far. There were no post-evaluations for 5 of these practices, so data from 19 practices
(n=9 male teaching physicians, n=10 female teaching physicians) were | |
included in the present evaluation. The evaluations of these practices | |
were significantly more positive after the intervention (by n=78 students) | |
than before (by n=82 students): odds ratio 1.20 (95% confidence interval 1.10-1.31; p<.001). | |
Conclusion: The results suggest that university institutes of general | |
practice can improve student evaluation of their teaching practices via | |
individual collegial feedback. | |
Michael Pentzek1, Stefan Wilm1, Elisabeth Gummersbach1
1 Heinrich Heine University Düsseldorf, Medical Faculty, Centre for Health and Society (chs), Institute of General Practice (ifam), Düsseldorf, Germany
Keywords: general practice, teacher training, feedback, medical | |
students, undergraduate medical education, evaluation | |
Introduction | |
The German “Master Plan Medical Studies 2020” | |
provides for a strengthening of the role of general practice | |
in the curriculum [1]. One form of implementation desired | |
by students and teachers is attachments in practices | |
early and continuously in the course of studies [2]. Beyond | |
pure learning effects, the experiences that students have in
these attachments can help shape a professional orientation. Good experiences in attachments can increase interest in general practice as a discipline and profession | |
[3], [4]. | |
In accordance with the Medical Licensing Regulations | |
[https://www.gesetze-im-internet.de/_appro_2002/ | |
BJNR240500002.html], students in the Düsseldorf | |
medical curriculum complete an attachment in general | |
practices lasting a total of six weeks in the academic | |
years 1, 2, 3 and 5 [https://www.medizinstudium.hhu.de]. | |
The requirements of the attachments build on each other | |
in terms of content; initially the focus is on anamnesis | |
and physical examination, later more complex medical | |
contexts and considerations for further diagnostics and | |
therapy are added. Under the supervision of the resident | |
teaching general practitioners (GPs), the students can gain
experience in doctor-patient interaction. An important | |
and therefore repeatedly emphasised factor for a positive | |
student perception of the attachments is the fact that | |
the students are given the opportunity to work independently with patients during the attachment in order to be | |
able to directly experience themselves in the provider | |
role [2], [5]. The attitude and qualifications of the teaching | |
physicians continue to play an important role in the didactic success of the attachments [3]. About 2/3 of the | |
teaching practices are positively evaluated by the students, but about 1/3 are not. Due to the increasing demand for attachments in general practices since the installation of the new curriculum, many teaching practices | |
have been newly recruited; a feedback culture is now | |
being established. A first step was the possibility for | |
teaching practices to actively request their written evaluation results, but this was almost never taken up. The | |
next step of establishing a feedback strategy is reported | |
here: One way to improve teaching performance is to receive feedback from an experienced colleague (peer | |
feedback) [6]. This can generate insights that student | |
evaluations alone cannot provide and is increasingly recognised as a complement to student feedback. In personal peer feedback, ideas can be exchanged, problems
discussed, strategies identified and concrete approaches | |
to improvement found [7]. Potential effects include increased awareness and focus of the teaching physician | |
on the teaching situation in practice, more information | |
about what constitutes good teaching, motivation to be | |
more interactive and student-centred, and inspiration to | |
use new teaching methods [8]. Pedram et al. found positive effects on teacher behaviour after peer feedback, | |
especially in terms of shaping the learning atmosphere | |
and interest in student understanding [9]. The application | |
of peer feedback to the setting described here has not | |
yet been investigated. The research question of the | |
present study is whether the student attachment evaluation of previously poorly rated GPs improves after peer | |
feedback has been conducted. | |
Methods | |
Teaching practices | |
The data were collected during the 4 attachments in | |
GP practices [https://www.uniklinik-duesseldorf.de/ | |
patienten-besucher/klinikeninstitutezentren/institut-fuerallgemeinmedizin/lehre], all of which take place in | |
teaching practices coordinated by the Institute of General | |
Practice. Before starting their teaching practice, all | |
teaching GPs are informed verbally and in writing about | |
the collection of student evaluations and a personal interview with an institute staff member in case of poor evaluation results. | |
Interested doctors take part in a 2-3 hour information | |
session led by the institute director (SW) before taking | |
up a teaching GP position, in which they are first informed | |
about the prerequisites for teaching students in their | |
practices. These include, among other things, the planning | |
of time resources for supervising students in the attachments, enthusiasm for working as a GP, acceptance of | |
the university’s teaching objectives in general practice | |
(in particular that interns are allowed to work independently with patients) and participation in at least two of | |
the eight didactic trainings offered annually by the institute (with the commencement of the teaching activity, | |
the institute assumes the acceptance of these prerequisites on the part of the teaching physician, but does not | |
formally check that they are met). This is followed by detailed information on the structure of the curriculum, the | |
position of the attachments, the contents and requirements of the individual attachments and basic didactic | |
aspects of 1:1 teaching. Information about the student | |
evaluation of the attachment is provided verbally and in | |
writing, combined with the offer to actively request both | |
an overall evaluation and the individual evaluation by email. There is no unsolicited feedback of the evaluation | |
results to the practices. After the information event, a | |
folder with corresponding written information is handed | |
out. | |
Before each attachment, the teaching physicians are sent | |
detailed material so that they can orient themselves once | |
again. This contains information on the exact course of | |
the attachment, on the current learning status of the | |
students, including copies of or references to the underlying
didactic materials, on the tasks to be worked on during | |
the attachment and the associated learning objectives, | |
on the relevance of practising on patients as well as a | |
note on the attitude of wanting to convey a positive image | |
of the GP profession to the students. | |
In addition, each student receives a cover letter to the | |
teaching physician in which the most important points | |
mentioned above are summarised once again. | |
Evaluation | |
Student evaluation as a regular element of teaching | |
evaluation [https://www.medizin.hhu.de/studium-undlehre/lehre] was carried out by independent student | |
groups before and after the intervention. It consisted, | |
among other things, of the opportunity for free-text comments, an indication of the number of patients personally | |
examined and the items “How satisfied were you with the | |
professional supervision by your teaching physician?” | |
and “Would you recommend this teaching practice to | |
other fellow students?” (both with a positively ascending | |
4-point scale). | |
Selection of practices for the intervention | |
Since most practices received a very good evaluation | |
(skewed distribution), three groups were identified as | |
follows. From all the institute’s teaching practices involved | |
in the attachments, those were first selected that had a | |
lower than very good evaluation (=“suboptimal”): rated | |
<2 at least once on at least one of the two above-mentioned items or repeatedly received negative free text | |
comments. From this group of suboptimal (=less than | |
very good) practices, those with more than two available | |
student evaluations, continued teaching and particularly | |
negative evaluations were selected: at least twice with | |
<2 on at least one of the two items or repeated negative | |
free text comments. Of the 27 practices, 24 practices | |
GMS Journal for Medical Education 2021, Vol. 38(7), ISSN 2366-5017 | |
2/14 | |
Pentzek et al.: Does peer feedback for teaching GPs improve student ... | |
(88.9%) have so far received an intervention to improve | |
their teaching from a peer (n=3 not yet due to the pandemic), and 19 practices (70.4%) provided evaluation | |
results from post-intervention attachments (n=5 had no | |
attachments after the intervention). To characterise the | |
three groups of very well, suboptimal and poorly evaluated | |
(=selected) practices, an analysis of variance including | |
post-hoc Scheffé tests was calculated with the factor | |
group and the dependent variable evaluation result. | |
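Read as a procedure, this two-stage selection is a simple filter over per-practice evaluation summaries. A rough pandas sketch follows; the column names and toy data are assumptions for illustration, and "repeatedly" is interpreted here as at least twice, which the paper does not state explicitly.

import pandas as pd

# Hypothetical per-evaluation data: one row per student evaluation of a practice.
ev = pd.DataFrame({
    "practice_id":        [1, 1, 1, 2, 2, 3, 3, 3, 3],
    "item_supervision":   [4, 1, 2, 4, 4, 1, 1, 3, 2],
    "item_recommend":     [3, 1, 1, 4, 4, 2, 1, 3, 1],
    "negative_free_text": [0, 1, 0, 0, 0, 1, 1, 0, 1],
    "still_teaching":     [1, 1, 1, 1, 1, 1, 1, 1, 1],
})

# An evaluation counts as "low" if at least one of the two items is rated below 2.
ev["low"] = (ev["item_supervision"] < 2) | (ev["item_recommend"] < 2)

per_practice = ev.groupby("practice_id").agg(
    n_evals=("low", "size"),
    n_low=("low", "sum"),
    n_negative=("negative_free_text", "sum"),
    still_teaching=("still_teaching", "last"),
)

# Stage 1: "suboptimal" = at least one low rating or repeated negative comments.
suboptimal = per_practice[(per_practice.n_low >= 1) | (per_practice.n_negative >= 2)]

# Stage 2: selected for the intervention = more than two evaluations, continued
# teaching, and particularly negative ratings or repeated negative comments.
selected = suboptimal[
    (suboptimal.n_evals > 2)
    & (suboptimal.still_teaching == 1)
    & ((suboptimal.n_low >= 2) | (suboptimal.n_negative >= 2))
]
print(selected)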
Intervention
Peer feedback was implemented as part of the | |
didactic concept in particularly negatively evaluated | |
teaching practices [https://www.uniklinik-duesseldorf.de/ | |
patienten-besucher/klinikeninstitutezentren/institut-fuerallgemeinmedizin/didaktik-fortbildungen]: A GP staff | |
member of the Institute of General Practice (EG) known | |
to the teaching physicians and experienced in practice | |
and teaching reported back to the teaching physicians | |
their student evaluations. The primary mode was a personal visit to the practice (peer visit) [10]. For organisational reasons, group discussions with several teaching | |
physicians and written feedback occasionally had to be | |
offered as alternative solutions. Peer visits and group | |
discussions were both aimed at reflecting on one's own | |
teaching motivation and problems. This was followed by | |
a discussion of the personal evaluation in order to enter | |
into a constructive exchange between teaching GP and | |
university with regard to teaching and dealing with students in the practice. Peer visits and group discussions | |
were recorded. The opening question was “Why are you | |
a teaching doctor?”, followed by questions about personal | |
experiences: “Can you tell me about your experiences? | |
What motivates you to be a teaching physician? Are there | |
any problems from your point of view?” Then the (bad) | |
feedback was addressed and discussed, followed by the | |
question “What can we do to support you?”. The written | |
feedback consisted of an uncommented feedback of the | |
student evaluation results (scores and free texts). | |
Analyses | |
Due to a strong correlation of the two evaluation items | |
(Spearman's rho=0.79), these were averaged into an | |
overall evaluation for the present analyses. In order to | |
determine multivariable influences on this student evaluation, a generalised estimating equation (GEE) was | |
calculated with the cluster variable “practice”, due to the | |
lack of a normal distribution (Kolmogorov-Smirnov test | |
p<.001) with a gamma distribution and log link. The
following were included as potential influence variables: | |
Intervention effect (pre/post), intervention mode (peer | |
visit vs. group/written), time of attachment (study year), | |
number of patients seen in person per week. In parallel | |
to this analysis, the intervention effect on the number of | |
personally supervised patients was examined in a second | |
GEE. | |
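The model is described only in prose here. Purely as an orientation, a comparable specification in Python with a recent version of statsmodels might look like the sketch below; the data frame, column names and simulated values are illustrative assumptions, not the study's data.

import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
from scipy.stats import spearmanr

# Hypothetical long-format data: one row per student evaluation of an attachment.
rng = np.random.default_rng(1)
n = 160
df = pd.DataFrame({
    "practice_id": rng.integers(0, 19, n),
    "post": rng.integers(0, 2, n),                       # 0 = before, 1 = after intervention
    "mode": rng.choice(["visit", "group_or_written"], n),
    "study_year": rng.choice([1, 2, 3, 5], n),
    "patients_per_week": rng.poisson(8, n),
    "item_supervision": rng.integers(1, 5, n),           # 4-point items
    "item_recommend": rng.integers(1, 5, n),
})

# As in the paper, the two strongly correlated items are averaged into one rating.
rho, _ = spearmanr(df["item_supervision"], df["item_recommend"])
df["overall_rating"] = df[["item_supervision", "item_recommend"]].mean(axis=1)

# GEE with practices as clusters, gamma distribution and log link.
model = smf.gee(
    "overall_rating ~ post + C(mode) + C(study_year) + patients_per_week",
    groups="practice_id",
    data=df,
    family=sm.families.Gamma(link=sm.families.links.Log()),
    cov_struct=sm.cov_struct.Exchangeable(),
)
result = model.fit()
print(result.summary())
print(np.exp(result.params))  # exponentiated coefficients (ratio-scale effects)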
The free texts in the student evaluations as well as the teacher comments in the peer visits and group discussions were processed qualitatively using content analysis in order to outline the underlying problems and the teacher reactions to the feedback in addition to the pure numbers. For this purpose, inductive category development was carried out on the material [11]. The numbers of negative student comments before and after the intervention were also compared quantitatively.
Results
Teaching practices and pre-evaluations
264 teaching practices with a total of 1648 attachments | |
were involved. Of these, 181 practices (68.6%) with 1036 | |
attachments were rated very good (student evaluation | |
mean 3.8±standard deviation 0.2), 56 practices (21.2%) | |
with 453 attachments were rated suboptimal (3.3±0.4) | |
and 27 practices (10.2%) with 159 attachments were | |
rated very poor (2.8±0.4). The overall comparison of the three groups shows significant differences (F(df=2)=205.1; p<.001), with significant differences in all post-hoc comparisons (all p<.001): very good vs. suboptimal (mean difference 0.51; standard error 0.04); very good vs. poor (1.09; 0.06); suboptimal vs. poor (0.58; 0.07).
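For orientation, the small Python sketch below runs the same kind of comparison, a one-way ANOVA followed by pairwise Scheffé tests; the group values are simulated from the means and standard deviations reported above and are not the study's raw data.

import numpy as np
from scipy import stats

# Simulated ratings drawn from the reported group summaries
# (very good 3.8±0.2, suboptimal 3.3±0.4, poor 2.8±0.4).
rng = np.random.default_rng(0)
groups = {
    "very good": rng.normal(3.8, 0.2, 1036),
    "suboptimal": rng.normal(3.3, 0.4, 453),
    "poor": rng.normal(2.8, 0.4, 159),
}

# One-way ANOVA across the three groups.
F, p = stats.f_oneway(*groups.values())
print(f"ANOVA: F={F:.1f}, p={p:.3g}")

# Scheffé post-hoc test: reject a pairwise difference if
# (mean_i - mean_j)^2 / (MS_within * (1/n_i + 1/n_j)) > (k-1) * F_crit.
k = len(groups)
n = {g: len(v) for g, v in groups.items()}
N = sum(n.values())
ms_within = sum(((v - v.mean()) ** 2).sum() for v in groups.values()) / (N - k)
f_crit = stats.f.ppf(0.95, k - 1, N - k)

pairs = [("very good", "suboptimal"), ("very good", "poor"), ("suboptimal", "poor")]
for a, b in pairs:
    diff = groups[a].mean() - groups[b].mean()
    scheffe = diff**2 / (ms_within * (1 / n[a] + 1 / n[b]))
    print(f"{a} vs {b}: mean diff {diff:.2f}, significant={scheffe > (k - 1) * f_crit}")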
Table 1 describes the analysis sample of n=19 out of the | |
27 poorly rated practices in more detail. | |
Reasons for a poor evaluation according to free texts of | |
the student evaluation can be presented in five categories. For example, the lack of opportunity to practise on | |
patients was criticised. | |
“Unfortunately, I did not have the opportunity to examine many patients myself during my last patient attachment, although I requested this on several occasions.” | |
(about practice ID 1) | |
There were also comments about lack of appreciation | |
and difficult communication: | |
“The teaching doctor has little patience especially | |
with foreign patients who cannot understand anatomical or medical terms. She makes insulting and ironic | |
statements. With some patients I was left alone for | |
30 minutes while with others only 2 and afterwards | |
she got annoyed when I was not done with the examination/anamnesis.” (about practice ID 14) | |
Some teaching physicians were commented on with regard to their didactic competence: | |
“[…] as a teaching doctor, I experienced him as little | |
to not at all competent and also very disinterested. | |
He had no idea of what PA1 [Patient Attachment 1] | |
was supposed to teach us and even after several approaches to him on my part, he understood little of | |
what I was about or what I was supposed to learn | |
there.” (about practice ID 22) | |
Practice procedures and structures were mentioned | |
which, according to the students, made it difficult to carry | |
out the attachment efficiently: | |
Table 1: Characteristics of the analysis sample | |
“From 8-11 am only patients come for blood collection, fixed appointments are not scheduled during | |
that time. As I was not allowed to take blood or vaccinate, there was nothing for me to do during that | |
time.” (about practice ID 10) | |
In some practices with primarily non-German-speaking | |
patients and also staff (incl. teaching physician), the | |
language barrier turned out to be a problem in the evaluations. | |
“As the teaching doctor is [nationality XY], about 70% | |
of the consultations were in [language XY].” (about | |
practice ID 2) | |
Intervention | |
In the protocols of the peer visits and group discussions | |
with the teaching physicians, four categories of problems | |
emerge, which partly mirror the student comments mentioned above: for example, the teaching physicians reported concerns about letting students work alone with patients. (The following are quotes from the protocols of the intervening peer doctor.)
“He finds it difficult to leave students alone. [...] He | |
thinks the patients don’t like it that way, although his | |
experience is actually different. Also has many patients from management. “Students are also too short | |
in practice.”” (reg. ID 17) | |
A sceptical attitude towards lower semester students in | |
particular was also expressed. | |
“Can’t do anything with the 2nd semesters, “they can’t | |
do anything, there's no point in letting them listen to | |
the heart if they don't know the clinical pictures”. [...] | |
“The problem is also that they are always very young | |
girls now.”” (reg. ID 24) | |
Some teaching physicians were not familiar with the didactic concepts and materials of the practical courses. | |
“He has no knowledge of teaching, doesn't read | |
through anything. Doesn’t know he is being evaluated | |
either.” (reg. ID 6) | |
In some cases, a self-image as a teaching general practitioner leads to the definition of one’s own attachment | |
content, neglecting or devaluing the learning objectives | |
set by the university. | |
““I’ve made a commitment to general practice and I | |
want to pass that on”. Explains a lot to students, but | |
doesn’t let them do much. “I show young people the | |
right way. Nobody else does it (the university certainly | |
doesn’t), so I do it.”” (reg. ID 4) | |
“However, clearly wants to show the students | |
everything, repeatedly mentions ultrasound, blood | |
sampling, does not know teaching content, makes | |
his own teaching content: “I show them everything of | |
interest””. (reg. ID 22) | |
At several points, the teaching physicians expressed intentions to change their behaviour, e.g. according to the | |
minutes, “wants to guide students more to examination” | |
or “says he wants to read through the handouts in future”. | |
The majority of the teaching physicians showed a basic | |
interest in and commitment to supervising the students.
Most were able to reflect on the points of criticism. | |
Pre-post analysis | |
The intervention effect on the student evaluation is significant and independent of the (also significant) influence | |
of the number of patients (see table 2). | |
The intervention effect on the number of patients personally cared for by students also persisted in a GEE (odds | |
ratio 1.41; 95% confidence interval 1.21-1.64; p<.001), | |
regardless of the type of intervention and study year | |
(analysis not shown). | |
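As a side note on the reporting, the ratios and confidence intervals quoted here are exponentiated coefficients from the log-link GEE. Working backwards from the reported 1.41 (95% CI 1.21-1.64), the underlying log-scale estimate is roughly ln(1.41) ≈ 0.34 with a standard error of about 0.08; the short check below illustrates the conversion.

import math

# Reported ratio-scale effect for personally supervised patients: 1.41 (95% CI 1.21-1.64).
ratio, lo, hi = 1.41, 1.21, 1.64

# Back out the log-scale coefficient and standard error implied by the interval.
b = math.log(ratio)
se = (math.log(hi) - math.log(lo)) / (2 * 1.96)
print(f"log-scale coefficient ~ {b:.3f}, SE ~ {se:.3f}")

# Forward check: exponentiating b +/- 1.96*SE reproduces the reported interval.
print(f"ratio {math.exp(b):.2f}, "
      f"95% CI {math.exp(b - 1.96 * se):.2f}-{math.exp(b + 1.96 * se):.2f}")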
Table 2: Multivariable influences on the dependent variable “student evaluation of GP attachment” (generalised estimating | |
equation (GEE) with cluster variable practice) | |
Table 3: Number of students’ comments on attachments in 19 poorly evaluated GP teaching practices | |
The proportion of critical comments in the student free-text comments decreases overall and in four of the five
categories mentioned (see table 3). | |
Discussion | |
In a pre-post comparison of poorly evaluated teaching | |
physicians who supervised students in the context of GP | |
attachments, peer feedback by a general practitioner had | |
a positive effect on student evaluation and on the number | |
of patients personally examined by students during the | |
attachment. This is reflected in the evaluation scores and | |
also in the fact that corresponding negative free-text | |
comments by the students were less frequent after the | |
intervention. | |
In line with the literature, it was crucial for student evaluation that students were given the opportunity to work | |
independently with patients in order to experience | |
themselves directly in the provider role [2], [5]. Also independent of the number of patients, student evaluation | |
improved after the intervention: The qualitative results | |
provide evidence that the teaching physicians may have | |
been more closely engaged with the meaning of the attachments, the learning objectives and didactic materials
after the intervention. This in turn also seemed to have | |
had positive effects on the exchange and relationship | |
between the teaching physician and the student (possibly | |
in the sense of an alignment of mutual expectations), which are also important elements of a positive attachment experience [3], [12]. The qualitative results on didactic competence and attitude indicate that, at least for the small
group of previously poorly evaluated teaching physicians | |
studied here, a more intensive consideration of their | |
teaching assignment and repeated interaction between | |
the university and the teaching practice is required in | |
order to internalise contents and concepts and to implement them in the attachments for students in a recognisable and consistent manner. The fact that it is precisely | |
the poorly evaluated teaching physicians who tend to | |
rarely attend the meetings at the university (offered eight | |
times a year in Düsseldorf) is an experience also reported | |
by many other locations. The formal review of the prerequisites and criteria for an appropriate teaching GP | |
position would involve an enormous amount of effort | |
given the high number of teaching practices required – | |
especially in a curriculum constructed along the lines of | |
longitudinal general practice. However, it must be weighed | |
up whether more resources should be invested in the | |
selection and qualification of practices interested in | |
teaching or in quality control and training of practices | |
already teaching. | |
A strength of this study is the evaluations by independent | |
student groups pre-post, so that biases due to repeated | |
exposure of students to a practice (e.g. response shift | |
bias, habituation, observer drift) are excluded. The | |
weakness associated with the pre-post design without a | |
control group and the focus on poorly evaluated practices | |
is, among other things, the phenomenon of regression | |
to the mean, which presumably accounts for part of the | |
positive intervention effect. The primary research question | |
of this study is formulated and answered quantitatively; | |
we report only limited qualitative results. These allow only | |
partial hypothesis-generating insights into the exact | |
mechanisms of peer feedback [13]. In the present study, | |
several modes of delivering peer feedback were used. Since the analyses do not indicate different effects of the personnel- and time-intensive peer visit on
the one hand and the more efficient methods of group | |
discussion and written feedback on the other, further | |
studies are necessary to differentiate before a broader | |
implementation. For example, Rüsseler et al. [14] found | |
that written peer feedback – albeit in relation to lecturers | |
– had positive effects on the design of the course. | |
Conclusions
It makes sense to further consider the effects of teaching physician feedback in both research and teaching. The comprehensive GMA recommendations provide a robust framework for teaching [15] and the didactic qualification of teaching physicians [16]. Embedded in this, collegial peer feedback for poorly rated teaching physicians represents a possible tool for quality management of general practice teaching.
Competing interests
The authors declare that they have no competing interests.
References
1. Bundesministerium für Bildung und Forschung. Masterplan Medizinstudium 2020. Berlin: Bundesministerium für Bildung und Forschung; 2017. Zugänglich unter/available from: https://www.bmbf.de/files/2017-03-31_Masterplan%20Beschlusstext.pdf
2. Wiesemann A, Engeser P, Barlet J, Müller-Bühl U, Szecsenyi J. Was denken Heidelberger Studierende und Lehrärzte über frühzeitige Patientenkontakte und Aufgaben in der Hausarztpraxis? Gesundheitswesen. 2003;65(10):572-578. DOI: 10.1055/s-2003-42999
3. Grunewald D, Pilic L, Bödecker AW, Robertz J, Althaus A. Die praktische Ausbildung des medizinischen Nachwuchses - Identifizierung von Lehrpraxen-Charakteristika in der Allgemeinmedizin. Gesundheitswesen. 2020;82(07):601-606. DOI: 10.1055/a-0894-4556
4. Böhme K, Sachs P, Niebling W, Kotterer A, Maun A. Macht das Blockpraktikum Allgemeinmedizin Lust auf den Hausarztberuf? Z Allg Med. 2016;92(5):220-225. DOI: 10.3238/zfa.2016.0220–0225
5. Gündling PW. Lernziele im Blockpraktikum Allgemeinmedizin - Vergleich der Präferenzen von Studierenden und Lehrärzten. Z Allg Med. 2008;84:218-222. DOI: 10.1055/s-2008-1073148
6. Steinert Y, Mann K, Centeno A, Dolmans D, Spencer J, Gelula M, Prideaux D. A systematic review of faculty development initiatives designed to improve teaching effectiveness in medical education: BEME Guide No. 8. Med Teach. 2006;28(6):497-526. DOI: 10.1080/01421590600902976
7. Garcia I, James RW, Bischof P, Baroffio A. Self-Observation and Peer Feedback as a Faculty Development Approach for Problem-Based Learning Tutors: A Program Evaluation. Teach Learn Med. 2017;29(3):313-325. DOI: 10.1080/10401334.2017.1279056
8. Gusic M, Hageman H, Zenni E. Peer review: a tool to enhance clinical teaching. Clin Teach. 2013;10(5):287-290. DOI: 10.1111/tct.12039
9. Pedram K, Brooks MN, Marcelo C, Kurbanova N, Paletta-Hobbs L, Garber AM, Wong A, Qayyum R. Peer Observations: Enhancing Bedside Clinical Teaching Behaviors. Cureus. 2020;12(2):e7076. DOI: 10.7759/cureus.7076
10. O'Brien MA, Rogers S, Jamtvedt G, Oxman AD, Odgaard-Jensen J, Kristoffersen DT, Forsetlund L, Bainbridge D, Freemantle N, Davis DA, Haynes RB, Harvey EL. Educational outreach visits: effects on professional practice and health care outcomes. Cochrane Database Syst Rev. 2007;2007(4):CD000409. DOI: 10.1002/14651858.CD000409.pub2
11. Kruse J. Qualitative Interviewforschung. 2. Aufl. Weinheim: Beltz Juventa; 2015.
12. Koné I, Paulitsch MA, Ravens-Taeuber G. Blockpraktikum Allgemeinmedizin: Welche Erfahrungen sind für Studierende relevant? Z Allg Med. 2016;92(9):357-362. DOI: 10.3238/zfa.2016.0357-0362
13. Raski B, Böhm M, Schneider M, Rotthoff T. Influence of the personality factors rigidity and uncertainty tolerance on peer feedback. In: 5th International Conference for Research in Medical Education (RIME 2017), 15.-17. March 2017, Düsseldorf, Germany. Düsseldorf: German Medical Science GMS Publishing House; 2017. P15. DOI: 10.3205/17rime46
14. Ruesseler M, Kalozoumi-Paizi F, Schill A, Knobe M, Byhahn C, Müller MP, Marzi I, Walcher F. Impact of peer feedback on the performance of lecturers in emergency medicine: a prospective observational study. Scand J Trauma Resusc Emerg Med. 2014;22:71. DOI: 10.1186/s13049-014-0071-1
15. Huenges B, Gulich M, Böhme K, Fehr F, Streitlein-Böhme I, Rüttermann V, Baum E, Niebling WB, Rusche H. Recommendations for Undergraduate Training in the Primary Care Sector - Position Paper of the GMA-Primary Care Committee. GMS Z Med Ausbild. 2014;31(4):Doc35. DOI: 10.3205/zma000927
16. Böhme K, Streitlein-Böhme I, Baum E, Vollmar HC, Gulich M, Ehrhardt M, Fehr F, Huenges B, Woestmann B, Jendyk R. Didactic qualification of teaching staff in primary care medicine - a position paper of the Primary Care Committee of the Society for Medical Education. GMS J Med Educ. 2020;37(5):Doc53. DOI: 10.3205/zma001346
Corresponding author: | |
PD Dr. rer. nat. Michael Pentzek
Heinrich Heine University Düsseldorf, Medical Faculty, Centre for Health and Society (chs), Institute of General Practice (ifam), Moorenstr. 5, Building 17.11, D-40225 Düsseldorf, Germany, Phone: +49 (0)211/81-16818
[email protected] | |
Please cite as | |
Pentzek M, Wilm S, Gummersbach E. Does peer feedback for teaching | |
GPs improve student evaluation of general practice attachments? A | |
pre-post analysis. GMS J Med Educ. 2021;38(7):Doc122. | |
DOI: 10.3205/zma001518, URN: urn:nbn:de:0183-zma0015182 | |
This article is freely available from | |
https://www.egms.de/en/journals/zma/2021-38/zma001518.shtml | |
Received: 2021-03-03 | |
Revised: 2021-08-12 | |
Accepted: 2021-08-17 | |
Published: 2021-11-15 | |
Copyright | |
©2021 Pentzek et al. This is an Open Access article distributed under | |
the terms of the Creative Commons Attribution 4.0 License. See license | |
information at http://creativecommons.org/licenses/by/4.0/. | |
Feedback | |
OPEN ACCESS | |
This is the German version. | |
The English version starts at p. 1. | |
Artikel | |
Verbessert Peer-Feedback für Lehrärzte die studentische | |
Bewertung von Hausarztpraktika? Ein Prä-Post-Vergleich | |
Zusammenfassung | |
Zielsetzung: Die allgemeinmedizinische Lehre an den Universitäten | |
nimmt zu und wird u.a. mit Praktika bei niedergelassenen Hausärzten | |
realisiert. Auswahl und Qualitätsmanagement dieser Lehrpraxen stellen | |
die allgemeinmedizinischen Institute vor Herausforderungen; entsprechende Instrumente sind gefragt. Die Fragestellung der vorliegenden | |
Studie lautet, ob sich die studentische Bewertung eines Praktikums in | |
bislang schlecht evaluierten Hausarztpraxen verbessert, nachdem die | |
hausärztlichen Lehrärzte eine Rückmeldung durch eine Kollegin erhalten | |
haben. | |
Methodik: Studierende der Studienjahre 1, 2, 3 und 5 bewerteten ihre | |
Erfahrungen in hausärztlichen Praktika mit zwei 4-stufigen Items | |
(fachliche Betreuung und Empfehlung für andere Kommilitonen). Besonders schlecht evaluierte Lehrpraxen wurden identifiziert. Eine | |
praktisch tätige und lehr-erfahrene Hausärztin und wissenschaftliche | |
Mitarbeiterin führte mit diesen eine persönliche Rückmeldung der | |
Evaluationsergebnisse durch (Peer-Feedback), überwiegend in Form | |
von Einzelgesprächen in der Praxis (peer visit). Nach dieser Intervention | |
wurden in diesen Praxen weiter Praktika durchgeführt. Der Einfluss der | |
Intervention (prä/post) auf die studentischen Evaluationen wurde in | |
verallgemeinerten Schätzungsgleichungen (Clustervariable Praxis) berechnet. | |
Ergebnisse: Von insgesamt 264 Lehrpraxen hatten 83 eine suboptimale | |
Bewertung. Davon wurden 27 besonders negativ bewertete Praxen für | |
die Intervention ausgewählt, von denen in bislang 24 die Intervention | |
umgesetzt werden konnte. Für 5 dieser Praxen gab es keine post-Evaluationen, so dass in die vorliegende Auswertung die Daten von 19 | |
Praxen (n=9 männliche Lehrärzte, n=10 weibliche Lehrärztinnen) eingingen. Die Evaluationen dieser Praxen waren nach der Intervention | |
(durch n=78 Studierende) signifikant positiver als vorher (durch n=82 | |
Studierende): Odds Ratio 1.20 (95% Konfidenzintervall 1.10-1.31; | |
p<.001). | |
Schlussfolgerung: Die Ergebnisse deuten darauf hin, dass allgemeinmedizinische Universitätsinstitute die studentische Bewertung ihrer | |
Lehrpraxen über individuelle kollegiale Rückmeldungen verbessern | |
können. | |
Michael Pentzek1 | |
Stefan Wilm1 | |
Elisabeth | |
Gummersbach1 | |
1 Heinrich-Heine-Universität | |
Düsseldorf, Medizinische | |
Fakultät, Centre for Health | |
and Society (chs), Institut für | |
Allgemeinmedizin (ifam), | |
Düsseldorf, Deutschland | |
Schlüsselwörter: Allgemeinmedizin, Ausbildung von Lehrkräften, | |
Feedback, Medizinstudenten, medizinische Ausbildung im | |
Grundstudium, Evaluation | |
Einleitung | |
Der „Masterplan Medizinstudium 2020“ sieht eine Stärkung der Rolle der Allgemeinmedizin im Curriculum vor | |
[1]. Eine von Studierenden und Lehrenden gewünschte | |
Form der Umsetzung besteht in Praktika in Hausarztpraxen bereits früh und kontinuierlich im Studienverlauf [2]. | |
Die Erfahrungen, die Studierende in diesen Praktika machen, können –über reine Lerneffekte hinaus– eine be- | |
rufliche Orientierung mitformen; gute Erfahrungen in | |
Praktika können das Interesse an Allgemeinmedizin und | |
am Hausarztberuf steigern [3], [4]. | |
Im Einklang mit der ärztlichen Approbationsordnung | |
[https://www.gesetze-im-internet.de/_appro_2002/ | |
BJNR240500002.html] absolvieren die Studierenden im | |
Düsseldorfer Modellstudiengang in den Studienjahren 1, | |
2, 3 und 5 jeweils ein Praktikum in Hausarztpraxen mit | |
insgesamt | |
sechs | |
Wochen | |
Dauer | |
[https:// | |
www.medizinstudium.hhu.de]. Die Anforderungen der | |
Praktika bauen inhaltlich aufeinander auf; zunächst liegt | |
GMS Journal for Medical Education 2021, Vol. 38(7), ISSN 2366-5017 | |
8/14 | |
Pentzek et al.: Verbessert Peer-Feedback für Lehrärzte die studentische ... | |
der Schwerpunkt auf Anamnese und körperlicher Untersuchung, später kommen komplexere medizinische Zusammenhänge und Überlegungen zu weiterführender | |
Diagnostik und Therapie hinzu. Unter Supervision der | |
niedergelassenen Lehrärzte können die Studierenden | |
hier Erfahrungen in der Arzt-Patienten-Interaktion sammeln. Ein wichtiger und deshalb immer wieder betonter | |
Faktor für eine positive studentische Wahrnehmung der | |
Praktika ist die Tatsache, dass den Studierenden im | |
Praktikum die Möglichkeit gegeben wird, selbstständig | |
mit Patienten zu arbeiten, um sich unmittelbar selbst in | |
der ärztlichen Rolle erleben zu können [2], [5]. Für den | |
didaktischen Erfolg der Praktika spielen weiterhin die | |
Haltung und Qualifikation der Lehrärzte eine wichtige | |
Rolle [3]. Ungefähr 2/3 der Lehrpraxen werden von den | |
Studierenden sehr gut bewertet, ca. 1/3 jedoch nicht. | |
Aufgrund des seit Installation des Modellstudiengangs | |
steigenden Bedarfs an Praktikumsplätzen in Hausarztpraxen wurden viele Lehrpraxen neu gewonnen; eine Feedback-Kultur wird nun aufgebaut. Ein erster Schritt bestand | |
in der Möglichkeit für Lehrpraxen, aktiv ihre schriftlichen | |
Evaluationsergebnisse einzufordern, was aber fast nie in | |
Anspruch genommen wurde. Über den nächsten Schritt | |
der Etablierung einer Feedback-Strategie wird hier berichtet: Eine Möglichkeit zur Verbesserung der Lehrperformanz ist die Rückmeldung durch einen erfahrenen Kollegen „auf Augenhöhe“ (Peer-Feedback) [6]. Dies kann | |
Einsichten generieren, die studentische Evaluationen allein nicht erreichen und wird zunehmend als Ergänzung | |
zur Studierendenrückmeldung anerkannt. Insbesondere | |
in persönlichen Peer-Feedbacks können Ideen ausgetauscht, Probleme diskutiert, Strategien aufgezeigt und | |
konkrete Verbesserungsansätze gefunden werden [7]. | |
Zu den möglichen Effekten gehören ein größeres Bewusstsein und eine stärkere Fokussierung des Lehrarztes auf | |
die Lehrsituation in der Praxis, mehr Information über | |
das, was gutes Lehren ausmacht, die Motivation zu verstärkter Interaktivität und Studierendenzentriertheit sowie | |
eine Inspiration zur Anwendung neuer Lehrmethoden [8]. | |
Pedram et al. fanden nach einem Peer-Feedback positive | |
Effekte auf das Verhalten der Lehrenden, insbesondere | |
hinsichtlich der Gestaltung der Lernatmosphäre und des | |
Interesses am Studierendenverständnis [9]. Die Anwendung von Peer-Feedback auf das hier beschriebene Setting wurde bislang nicht untersucht. Die Fragestellung | |
der vorliegenden Studie lautet, ob sich die studentische | |
Praktikumsevaluation bislang schlecht bewerteter Hausarztpraxen nach Durchführung eines Peer-Feedback | |
verbessert. | |
Methoden | |
Lehrpraxen | |
Die Daten wurden im Rahmen der 4 Praktika in | |
Hausarztpraxen [https://www.uniklinik-duesseldorf.de/ | |
patienten-besucher/klinikeninstitutezentren/institut-fuerallgemeinmedizin/lehre] erhoben, die alle in vom Institut | |
für Allgemeinmedizin koordinierten hausärztlichen Lehrpraxen stattfinden. Vor Aufnahme der Lehrarzttätigkeit | |
werden alle Lehrärzte mündlich und schriftlich über die | |
Erhebung studentischer Evaluationen und ein persönliches Gespräch mit einem oder einer Institutsmitarbeiter/in im Falle schlechter Evaluationsergebnisse informiert. | |
Interessierte Ärzte nehmen vor Aufnahme einer Lehrarzttätigkeit an einer 2-3-stündigen Informationsveranstaltung | |
unter Leitung des Institutsdirektors (SW) teil, in der sie | |
zunächst über die Voraussetzungen für die Lehrarzttätigkeit informiert werden; dazu gehören u.a. die Planung | |
zeitlicher Ressourcen für die Betreuung der Studierenden | |
in den Praktika, Begeisterung für die Arbeit als Hausarzt, | |
die Akzeptanz des universitären allgemeinmedizinischen | |
Lehrzielkataloges (insbesondere dass Praktikanten | |
selbstständig mit Patienten arbeiten dürfen) und die | |
Teilnahme an mindestens zwei der acht jährlich angebotenen allgemeinmedizinisch-didaktischen Fortbildungen | |
des Instituts. (Mit Aufnahme der Lehrtätigkeit geht das | |
Institut von der Akzeptanz dieser Voraussetzungen seitens | |
des Lehrarztes aus, überprüft das Vorliegen jedoch nicht | |
formal.) Es folgen ausführliche Informationen über den | |
Aufbau des Curriculums, die Verortung der Praktika, die | |
Inhalte und Anforderungen der einzelnen Praktika und | |
grundlegende didaktische Aspekte des 1:1-Unterrichts. | |
Über die Studierendenevaluation des Praktikums wird | |
mündlich und schriftlich aufgeklärt, verbunden mit dem | |
Angebot, sowohl eine Gesamtauswertung als auch die | |
individuelle Evaluation per E-Mail aktiv anfordern zu | |
können. Eine unaufgeforderte Rückmeldung der Evaluationsergebnisse an die Praxen gibt es nicht. Nach der Informationsveranstaltung wird eine Mappe mit entsprechenden schriftlichen Informationen ausgehändigt. | |
Vor jedem Praktikum wird den Lehrärzten ausführliches | |
Material zugeschickt, damit sie sich noch einmal orientieren können. Dieses enthält Hinweise zum genauen Ablauf | |
des Praktikums, zum aktuellen Lernstand der Studierenden inkl. Beilage der bzw. Verweis auf die zugrundeliegenden didaktischen Materialien, zu den im Praktikum zu | |
bearbeitenden Aufgaben und den damit verbundenen | |
Lernzielen, zur Relevanz des Übens am Patienten sowie | |
einen Hinweis zur Haltung, den Studierenden ein positives | |
Bild des Hausarztberufs vermitteln zu wollen. | |
Außerdem erhält jeder Studierende ein Anschreiben an | |
den Lehrarzt, in dem die wichtigsten o.g. Punkte noch | |
einmal zusammengefasst sind. | |
Evaluation | |
Die studentische Praktikumsevaluation als reguläres Element | |
der Lehrevaluation [https://www.medizin.hhu.de/studiumund-lehre/lehre.html] wurde in den untersuchten Praxen | |
vor und nach der Intervention durch unabhängige Studierendengruppen durchgeführt und bestand u.a. aus der | |
Möglichkeit für Freitext-Kommentare, einer Angabe der | |
Anzahl persönlich betreuter Patienten und den Items „Wie | |
zufrieden waren Sie mit der fachlichen Betreuung durch | |
Ihre Lehrärztin/Ihren Lehrarzt?“ und „Würden Sie anderen | |
GMS Journal for Medical Education 2021, Vol. 38(7), ISSN 2366-5017 | |
9/14 | |
Pentzek et al.: Verbessert Peer-Feedback für Lehrärzte die studentische ... | |
KommilitonInnen diese Lehrpraxis empfehlen?“, beide | |
aufsteigend positiv 4-stufig skaliert. | |
Auswahl der Praxen für die Intervention | |
Da die meisten Praxen eine sehr gute Bewertung erhielten | |
(schiefe Verteilung), wurden wie folgt drei Gruppen identifiziert: Aus allen an den Praktika beteiligten Lehrpraxen | |
des Instituts wurden zunächst diejenigen ausgewählt, die | |
eine geringere als sehr gute Evaluation aufwiesen | |
(=„suboptimal“): mindestens einmal mit <2 auf mind. | |
einem der beiden o.g. Items bewertet oder wiederholt | |
negative Freitextkommentare. Aus dieser Gruppe der | |
suboptimal (=geringer als sehr gut) bewerteten Praxen | |
wurden nun die mit mehr als zwei vorliegenden Studierendenbewertungen, weiterhin bestehender Lehrarzttätigkeit und besonders negativen Bewertungen ausgewählt: mindestens zweimal mit <2 auf mind. einem der | |
beiden Items bewertet oder wiederholt negative Freitextkommentare. Von den 27 Praxen erhielten bislang 24 | |
Praxen (88.9%) eine Intervention zur Verbesserung ihrer | |
Lehre von Seiten einer hausärztlich tätigen Allgemeinmedizinerin (n=3 pandemiebedingt noch nicht), und 19 | |
Praxen (70.4%) lieferten Evaluationsergebnisse aus | |
Praktika nach der Intervention (n=5 hatten nach der Intervention keine Praktikanten mehr). Zur Charakterisierung der drei Gruppen der sehr gut, suboptimal und | |
schlecht evaluierten (=ausgewählten) Praxen wurde eine | |
Varianzanalyse inkl. post-hoc Scheffé-Tests mit dem | |
Faktor Gruppe und der abhängigen Variable Evaluationsergebnis gerechnet. | |
thematisiert und besprochen, gefolgt von der Frage „Was | |
können wir tun, um Sie zu unterstützen?“. Das schriftliche | |
Feedback bestand aus einer unkommentierten Rückmeldung der studentischen Evaluationsergebnisse (Scores | |
und Freitexte). | |
Analysen | |
Aufgrund einer starken Korrelation der beiden Evaluationsitems (Spearman’s rho=0.79) wurden diese für die | |
vorliegenden Analysen zu einer Gesamtbewertung gemittelt. Um multivariable Einflüsse auf diese studentische | |
Bewertung zu ermitteln, wurde eine verallgemeinerte | |
Schätzungsgleichung (GEE) mit der Clustervariable „Praxis“ gerechnet, aufgrund fehlender Normalverteilung | |
(Kolmogorow-Smirnow-Test p<.001) mit Gamma-Verteilung und Log-Verknüpfung. Als potenzielle Einflussvariablen flossen ein: Interventionseffekt (prä/post), Interventionsmodus (peer visit vs. Gruppe/schriftlich), Praktikumszeitpunkt (Studienjahr), Anzahl der persönlich betreuten | |
Patienten pro Woche. Parallel zu dieser Analyse wurde | |
in einer zweiten GEE der Interventionseffekt auf die Anzahl der persönlich betreuten Patienten untersucht. | |
Die Freitexte in den Studierendenevaluationen sowie die | |
Lehrarztkommentare in den peer visits und Gruppendiskussionen wurden qualitativ inhaltsanalytisch aufgearbeitet, um neben den reinen Zahlen auch die dahinterliegenden Probleme und die Lehrarztreaktionen auf das Feedback zu skizzieren. Dazu wurde eine induktive Kategorienbildung am Material vorgenommen [11]. Die Anzahlen | |
negativer Studierendenkommentare vor und nach der | |
Intervention wurden zudem quantitativ gegenübergestellt. | |
Intervention | |
Das Peer-Feedback wurde als Teil des didaktischen | |
Konzepts bei besonders negativ evaluierten Lehrpraxen | |
realisiert | |
[https://www.uniklinik-duesseldorf.de/ | |
patienten-besucher/klinikeninstitutezentren/institut-fuerallgemeinmedizin/didaktik-fortbildungen]: Eine den | |
Lehrärzten bekannte und in Praxis und Lehre erfahrene | |
hausärztliche Mitarbeiterin des Instituts für Allgemeinmedizin (EG) meldete den Lehrärzten deren studentischen | |
Evaluationen zurück. Der vorrangige Modus war ein persönlicher Besuch in der Praxis (peer visit) [10]. Aus organisatorischen Gründen mussten gelegentlich Gruppendiskussionen mit mehreren Lehrärzten sowie ein schriftliches | |
Feedback als Ausweichlösungen angeboten werden. Peer | |
visit und Gruppendiskussion hatten beide eine Reflexion | |
der eigenen Lehrarztmotivation, der Probleme sowie eine | |
Diskussion der persönlichen Evaluation zum Ziel, um | |
darüber in einen konstruktiven Austausch zwischen | |
Lehrarzt und Universität in Bezug auf die Lehre und den | |
Umgang mit Studierenden in der Praxis zu gelangen. Peer | |
visits und Gruppendiskussionen wurden protokolliert. Die | |
Eingangsfrage lautete „Warum sind Sie Lehrarzt/Lehrärztin?“, gefolgt von Fragen zu persönlichen Erfahrungen: | |
„Können Sie mir über Ihre Erfahrungen berichten? Was | |
motiviert Sie zu der Lehrarzttätigkeit? Gibt es aus Ihrer | |
Sicht Probleme?“. Dann wurde das (schlechte) Feedback | |
Ergebnisse | |
Lehrpraxen und Präevaluationen | |
264 Lehrpraxen mit insgesamt 1648 Praktika waren beteiligt. Davon wurden 181 Praxen (68.6%) mit 1036 | |
Praktika sehr gut bewertet (Mittelwert der Studierendenevaluation 3.8 ± Standardabweichung 0.2), 56 Praxen | |
(21.2%) mit 453 Praktika suboptimal (3.3±0.4) und 27 | |
Praxen (10.2%) mit 159 Praktika sehr schlecht (2.8±0.4). | |
Der übergeordnete Vergleich der drei Gruppen ergibt signifikante Unterschiede (F(df=2)=205.1; p<.001), mit | |
jeweils signifikanten Unterschieden in allen post-hocVergleichen (alle p<.001): sehr gut vs. suboptimal (mittlere Differenz 0.51; Standardfehler 0.04); sehr gut vs. | |
schlecht (1,09; 0.06); suboptimal vs. schlecht (0.58; | |
0.07). | |
In Tabelle 1 ist die Analysestichprobe der n=19 aus den | |
27 schlecht bewerteten Praxen näher beschrieben. | |
Gründe für eine schlechte Bewertung laut Freitexten der | |
Studierendenevaluation lassen sich in fünf Kategorien | |
darstellen. So wurde die mangelnde Gelegenheit zum | |
Einüben praktischer Fertigkeiten am Patienten kritisiert. | |
„Leider hatte ich während meines letzten Patientenpraktikums nicht die Möglichkeit, viele Patienten ei- | |
GMS Journal for Medical Education 2021, Vol. 38(7), ISSN 2366-5017 | |
10/14 | |
Pentzek et al.: Verbessert Peer-Feedback für Lehrärzte die studentische ... | |
Tabelle 1: Merkmale der Analysestichprobe | |
genständig zu untersuchen, obwohl ich dies zu mehreren Gelegenheiten eingefordert habe.“ (über Praxis | |
ID 1) | |
Weiterhin gab es Kommentare über mangelnde Wertschätzung und schwierige Kommunikation: | |
„Die Lehrärztin hat wenig Geduld insbesondere mit | |
ausländischen Patienten, die anatomische oder medizinische Begriffe nicht verstehen können. Sie macht | |
beleidigende und ironische Aussagen. Mit einigen | |
Patienten wurde ich 30 Minuten lang alleine gelassen, | |
während mit anderen nur 2 und danach hat sie sich | |
darüber geärgert, wenn ich mit der Untersuchung/Anamnese noch nicht fertig war.“ (über Praxis ID 14) | |
Einige Lehrärzte wurden hinsichtlich ihrer didaktischen | |
Kompetenz kommentiert: | |
„[…] als Lehrarzt hab ich ihn als wenig bis gar nicht | |
kompetent erlebt und auch sehr desinteressiert. Er | |
hatte keine Ahnung von dem, das PP1 [Patientenpraktikum 1] uns lehren soll und hat auch nach mehrmaligem Herantreten an ihn meinerseits wenig verstanden, worum es mir ging bzw. was ich dort lernen sollte.“ (über Praxis ID 22) | |
Genannt wurden Praxisabläufe und –strukturen, die laut | |
Studierenden eine effiziente Praktikumsdurchführung | |
erschwerten: | |
„Von 8-11 Uhr kommen nur Patienten zur Blutentnahme, feste Termine sind in der Zeit nicht geplant. Da | |
ich weder Blut abnehmen noch impfen durfte, war in | |
der Zeit nichts für mich zu tun.“ (über Praxis ID 10) | |
In einigen Praxen mit primär nicht-deutschsprachigem | |
Patientenklientel und auch Personal (inkl. Lehrarzt) | |
stellte sich in den Evaluationen die Sprachbarriere als | |
Problem heraus. | |
„Da die Lehrärztin [Nationalität XY] ist, verliefen ca. | |
70% der Konsultationen auf [Sprache XY].“ (über | |
Praxis ID 2) | |
Intervention | |
In den Protokollen der peer visits und Gruppendiskussionen mit den Lehrärzten zeigen sich vier Kategorien von | |
Problemen, die teilweise die genannten Studierendenkommentare spiegeln: So berichteten die Lehrärzte von | |
Bedenken, Studierende allein mit Patienten arbeiten zu | |
lassen. (Im folgenden Zitate aus den Protokollen der intervenierenden Peer-Ärztin.) | |
„Es fällt ihm schwer, Studierende allein zu lassen. […] | |
Er meint, die Patienten mögen das nicht so, obwohl | |
seine Erfahrungen eigentlich anders sind. Hat auch | |
viele Patienten aus dem Management. „Die Studierenden sind auch zu kurz in der Praxis.““ (zu ID 17) | |
Auch eine skeptische Haltung vor allem Studierenden | |
niedriger Semester gegenüber wurde geäußert. | |
„Kann mit den 2. Semestern nichts anfangen, „die | |
können nichts, es hat keinen Sinn, sie das Herz abhören zu lassen, wenn sie die Krankheitsbilder nicht | |
kennen.“ […] „Das Problem ist auch, dass es jetzt | |
immer ganz junge Mädchen sind.““ (zu ID 24) | |
Einige Lehrärzte waren nicht vertraut mit den didaktischen | |
Konzepten und Materialien der Praktika. | |
„Er hat keine Kenntnis von der Lehre, liest sich nichts | |
durch. Weiß auch nicht, dass er evaluiert wird.“ (zu | |
ID 6) | |
Teils führt ein Selbstverständnis als allgemeinmedizinischer Lehrarzt zur Definition eigener Praktikumsinhalte | |
unter Vernachlässigung oder Abwertung der universitär | |
vorgegebenen Lernziele. | |
„„Ich habe mich zur Allgemeinmedizin bekannt und | |
will das weiterreichen.“ Erklärt den Studierenden viel, | |
lässt aber nicht viel machen. „Ich zeige jungen Menschen den rechten Weg. Sonst macht es ja keiner (die | |
Uni schon gar nicht), also mach ich es.““ (zu ID 4) | |
GMS Journal for Medical Education 2021, Vol. 38(7), ISSN 2366-5017 | |
11/14 | |
Pentzek et al.: Verbessert Peer-Feedback für Lehrärzte die studentische ... | |
Tabelle 2: Multivariable Einflüsse auf die abhängige Variable ‚studentische Bewertung des Praktikums‘ (verallgemeinerte | |
Schätzungsgleichung (GEE) mit Clustervariable Praxis) | |
Tabelle 3: Anzahl der Kommentare von Studierenden zu Praktika in 19 schlecht evaluierten hausärztlichen Lehrpraxen | |
„Möchte allerdings eindeutig den Studierenden alles | |
zeigen, erwähnt wiederholt Ultraschall, Blutabnahmen, kennt Lehrinhalte nicht, macht sich eigene | |
Lehrinhalte: „Ich zeig denen alles Interessante““. (zu | |
ID 22) | |
An mehreren Stellen äußerten die Lehrärzte Intentionen | |
zur Verhaltensänderung, laut Protokollen z.B. „will Studierende mehr zum Selbst-Untersuchen anleiten“ oder „sagt, | |
er wolle sich zukünftig die Handouts durchlesen“. Die | |
Mehrzahl der besuchten Lehrärzte zeigte sich im Gespräch grundsätzlich interessiert und engagiert in der | |
Betreuung der Studierenden. Die meisten waren in der | |
Lage, die Kritikpunkte zu reflektieren. | |
Prä-post-Analyse: Der Interventionseffekt auf die studentische Bewertung ist deutlich und unabhängig vom | |
(ebenfalls signifikanten) Einfluss der Patientenanzahl | |
(siehe Tabelle 2). | |
Auch der Interventionseffekt auf die Anzahl persönlich | |
durch die Studierenden betreuter Patienten bleibt in einer | |
GEE bestehen (Odds Ratio 1.41; 95% Konfidenzintervall | |
1.21-1.64; p<.001), unabhängig von der Art der Intervention und dem Studienjahr (Analyse nicht gezeigt). | |
Der Anteil kritischer Anmerkungen in den studentischen | |
Freitextkommentaren nimmt insgesamt und in vier der | |
fünf genannten Kategorien deutlich ab (siehe Tabelle 3). | |
Diskussion | |
Ein Peer-Feedback durch eine hausärztlich tätige Allgemeinmedizinerin wirkte sich in einer Stichprobe schlecht | |
evaluierter Lehrärzte, die im Rahmen der hausärztlichen | |
Praktika Studierende betreuten, im prä-post-Vergleich | |
positiv auf die studentische Evaluation und auf die Anzahl | |
der im Praktikum von Studierenden persönlich betreuten | |
Patienten aus. Dies zeigt sich in den Evaluationsscores | |
und auch darin, dass entsprechend negative Freitextkommentare der Studierenden nach der Intervention seltener | |
waren. | |
In line with the literature, it was decisive for the student rating that students were given the opportunity to work with patients on their own, so that they could directly experience themselves in the physician's role [2], [5]. But the student evaluation also improved after the intervention independently of the number of patients: the qualitative results suggest that, after the intervention, the teaching GPs may have engaged more closely with the purpose of the attachments, the learning objectives, and the didactic materials. This, in turn, also appeared to have had positive effects on the exchange and the relationship between teaching GP and student (possibly in the sense of aligning mutual expectations), likewise important elements of a positive attachment experience [3], [12]. The qualitative results on didactic competence and attitude indicate that, at least for the small group of previously poorly evaluated teaching GPs examined here, a more intensive engagement with their teaching role and repeated interaction between the university institute of general practice and the teaching practice are needed before content and concepts are internalized and implemented in the attachments in a way that is recognizable and consistent for students. That it is precisely the poorly evaluated teaching GPs who rarely attend the meetings at the university (offered eight times per year in Düsseldorf) is an experience also reported by many other sites. A formal review of the prerequisites and criteria for appropriate teaching activity would involve enormous effort, given the high number of teaching practices required, especially in a curriculum constructed to be longitudinal, general practice based, and close to everyday practice. It must be weighed, however, whether more resources should be invested in the selection and qualification of practices interested in teaching, or rather in quality control and training for practices that already teach.
A strength of this study is that the pre and post ratings came from independent groups of students, so that distortions caused by students' repeated exposure to a practice (e.g., response shift bias, habituation, observer drift) are ruled out. The weakness associated with the pre-post design without a control group and with the focus on poorly evaluated practices lies, among other things, in the phenomenon of regression to the mean, which presumably accounts for part of the positive intervention effect. The primary research question of this study is framed and answered quantitatively; we report qualitative results only to a limited extent. Here they allow only partial, hypothesis-generating insights into the precise mechanisms by which peer feedback works [13]. In the present study, several modes of delivering peer feedback were implemented. Since the analyses do not point to differential effects of the personnel- and time-intensive peer visit on the one hand and the more efficient methods of group discussion and written feedback on the other, further studies differentiating between them are needed before broader implementation. Rüsseler et al. [14], for example, found that written peer feedback, in their case directed at lecturers, had positive effects on how the teaching sessions were designed.
Conclusions
It makes sense to continue examining the effects of feedback for teaching GPs both in research and in teaching. The comprehensive GMA recommendations provide a robust framework for teaching [15] and for the didactic qualification of teaching GPs [16]. Embedded within that framework, collegial peer feedback for poorly rated teaching GPs is a possible tool for the quality management of undergraduate general practice teaching.
Competing interests
The authors declare that they have no competing interests in connection with this article.
References
1. Bundesministerium für Bildung und Forschung. Masterplan Medizinstudium 2020. Berlin: Bundesministerium für Bildung und Forschung; 2017. Available from: https://www.bmbf.de/files/2017-03-31_Masterplan%20Beschlusstext.pdf
2. Wiesemann A, Engeser P, Barlet J, Müller-Bühl U, Szecsenyi J. Was denken Heidelberger Studierende und Lehrärzte über frühzeitige Patientenkontakte und Aufgaben in der Hausarztpraxis? Gesundheitswesen. 2003;65(10):572-578. DOI: 10.1055/s-2003-42999
3. Grunewald D, Pilic L, Bödecker AW, Robertz J, Althaus A. Die praktische Ausbildung des medizinischen Nachwuchses - Identifizierung von Lehrpraxen-Charakteristika in der Allgemeinmedizin. Gesundheitswesen. 2020;82(07):601-606. DOI: 10.1055/a-0894-4556
4. Böhme K, Sachs P, Niebling W, Kotterer A, Maun A. Macht das Blockpraktikum Allgemeinmedizin Lust auf den Hausarztberuf? Z Allg Med. 2016;92(5):220-225. DOI: 10.3238/zfa.2016.0220-0225
5. Gündling PW. Lernziele im Blockpraktikum Allgemeinmedizin - Vergleich der Präferenzen von Studierenden und Lehrärzten. Z Allg Med. 2008;84:218-222. DOI: 10.1055/s-2008-1073148
6. Steinert Y, Mann K, Centeno A, Dolmans D, Spencer J, Gelula M, Prideaux D. A systematic review of faculty development initiatives designed to improve teaching effectiveness in medical education: BEME Guide No. 8. Med Teach. 2006;28(6):497-526. DOI: 10.1080/01421590600902976
7. Garcia I, James RW, Bischof P, Baroffio A. Self-Observation and Peer Feedback as a Faculty Development Approach for Problem-Based Learning Tutors: A Program Evaluation. Teach Learn Med. 2017;29(3):313-325. DOI: 10.1080/10401334.2017.1279056
8. Gusic M, Hageman H, Zenni E. Peer review: a tool to enhance clinical teaching. Clin Teach. 2013;10(5):287-290. DOI: 10.1111/tct.12039
9. Pedram K, Brooks MN, Marcelo C, Kurbanova N, Paletta-Hobbs L, Garber AM, Wong A, Qayyum R. Peer Observations: Enhancing Bedside Clinical Teaching Behaviors. Cureus. 2020;12(2):e7076. DOI: 10.7759/cureus.7076
10. O'Brien MA, Rogers S, Jamtvedt G, Oxman AD, Odgaard-Jensen J, Kristoffersen DT, Forsetlund L, Bainbridge D, Freemantle N, Davis DA, Haynes RB, Harvey EL. Educational outreach visits: effects on professional practice and health care outcomes. Cochrane Database Syst Rev. 2007;2007(4):CD000409. DOI: 10.1002/14651858.CD000409.pub2
11. Kruse J. Qualitative Interviewforschung. 2nd ed. Weinheim: Beltz Juventa; 2015.
12. Koné I, Paulitsch MA, Ravens-Taeuber G. Blockpraktikum Allgemeinmedizin: Welche Erfahrungen sind für Studierende relevant? Z Allg Med. 2016;92(9):357-362. DOI: 10.3238/zfa.2016.0357-0362
13. Raski B, Böhm M, Schneider M, Rotthoff T. Influence of the personality factors rigidity and uncertainty tolerance on peer-feedback. In: 5th International Conference for Research in Medical Education (RIME 2017), 15.-17. March 2017, Düsseldorf, Germany. Düsseldorf: German Medical Science GMS Publishing House; 2017. P15. DOI: 10.3205/17rime46
14. Ruesseler M, Kalozoumi-Paizi F, Schill A, Knobe M, Byhahn C, Müller MP, Marzi I, Walcher F. Impact of peer feedback on the performance of lecturers in emergency medicine: a prospective observational study. Scand J Trauma Resusc Emerg Med. 2014;22:71. DOI: 10.1186/s13049-014-0071-1
15. Huenges B, Gulich M, Böhme K, Fehr F, Streitlein-Böhme I, Rüttermann V, Baum E, Niebling WB, Rusche H. Recommendations for Undergraduate Training in the Primary Care Sector - Position Paper of the GMA-Primary Care Committee. GMS Z Med Ausbild. 2014;31(4):Doc35. DOI: 10.3205/zma000927
16. Böhme K, Streitlein-Böhme I, Baum E, Vollmar HC, Gulich M, Ehrhardt M, Fehr F, Huenges B, Woestmann B, Jendyk R. Didactic qualification of teaching staff in primary care medicine - a position paper of the Primary Care Committee of the Society for Medical Education. GMS J Med Educ. 2020;37(5):Doc53. DOI: 10.3205/zma001346
Correspondence address:
PD Dr. rer. nat. Michael Pentzek
Heinrich-Heine-Universität Düsseldorf, Medizinische Fakultät, Centre for Health and Society (chs), Institut für Allgemeinmedizin (ifam), Moorenstr. 5, Gebäude 17.11, 40225 Düsseldorf, Germany, Phone: +49 (0)211/81-16818
[email protected]
Please cite as:
Pentzek M, Wilm S, Gummersbach E. Does peer feedback for teaching GPs improve student evaluation of general practice attachments? A pre-post analysis. GMS J Med Educ. 2021;38(7):Doc122. DOI: 10.3205/zma001518, URN: urn:nbn:de:0183-zma0015182
The article is freely available online at https://www.egms.de/en/journals/zma/2021-38/zma001518.shtml
Submitted: 03.03.2021
Revised: 12.08.2021
Accepted: 17.08.2021
Published: 15.11.2021
Copyright
©2021 Pentzek et al. This is an open-access article distributed under the terms of the Creative Commons Attribution 4.0 License. For license details see http://creativecommons.org/licenses/by/4.0/.
Higher Learning Research Communications | |
2021, Volume 11, Issue 2, Pages 22–39. DOI: 10.18870/hlrc.v11i2.1244 | |
Original Research | |
© The Author(s) | |
Students’ and Teachers’ Perceptions and Experiences | |
of Classroom Assessment: A Case Study of a Public | |
University in Afghanistan | |
Sayed Ahmad Javid Mussawy, PhD Candidate | |
University of Massachusetts Amherst, Amherst, Massachusetts, United States | |
https://orcid.org/0000-0001-9991-6681 | |
Gretchen Rossman, PhD | |
University of Massachusetts Amherst, Amherst, Massachusetts, United States | |
https://orcid.org/0000-0003-1224-4494 | |
Sayed Abdul Qahar Haqiqat, MEd | |
Baghlan University, Pule-khumri, Baghlan, Afghanistan | |
Contact: [email protected] | |
Abstract | |
Objective: The primary goal of the study was to examine students’ perceptions of classroom assessment at a | |
public university in Afghanistan. Exploring current assessment practices, focusing on students' and faculty members' lived experiences, was a secondary goal. The study also sought to collect evidence on whether the new assessment policy was effective in improving student achievement.
Method: The authors used an explanatory sequential mixed-methods design to conduct the study. Initially, we
applied the Students Perceptions of Assessment Questionnaire (SPAQ), translated into Dari/Farsi and | |
validated, to collect data from a random sample of 400 students from three colleges: Agriculture, Education, | |
and Humanities. Response rate was 88.25% (N = 353). Semi-structured interviews were used to collect data | |
from a purposeful sample of 18 students and 7 faculty members. Descriptive statistics, one-way ANOVA, and | |
t-tests were used to analyze quantitative data, and NVivo 12 was used to conduct thematic analysis on | |
qualitative data. | |
Results: The quantitative results suggest that students have positive perceptions of the current assessment | |
practices. However, both students and faculty members were dissatisfied with the grading policy, reinforcing | |
summative over formative assessment. Results support that the policy change regarding assessment has | |
resulted in more students passing the courses compared to in the past. The findings also suggest | |
improvements in faculty professional skills such as assessment and teaching and ways that they engage | |
students in assessment processes. | |
Implication for Policy and Practice: Recommendations include revisiting the grading policy at the national level to allow faculty members to balance the formative and summative assessment and utilizing assessment benchmarks and rubrics to guide formative and summative assessment implementation in practice.
Author note: We would like to thank the students and teachers who participated and assisted with this study.
Keywords: assessment, classroom assessment, higher education, Afghanistan | |
Submitted: March 14, 2021 | Accepted: July 23, 2021 | Published: October 13, 2021 | |
Recommended Citation | |
Mussawy, S. A. J., Rossman, G., & Haqiqat, S. A. Q. (2021). Students’ and teachers’ perceptions and experiences of | |
classroom assessment: A case study of a public university in Afghanistan. Higher Learning Research | |
Communications, 11(2), 22–39. DOI: 10.18870/hlrc.v11i2.1244
Introduction | |
Classroom assessment, an instrumental aspect of teaching and learning, refers to a systematic process of | |
obtaining information about learner progress, understanding, skills, and abilities towards the learning goals | |
(Dhindsa et al., 2007; Goodrum et al., 2001; Klenowski & Wyatt-Smith, 2012; Linn & Miller, 2005). According | |
to Scriven (1967) and Poskitt (2014), educational assessment surfaced in the 20th century to serve two purposes. | |
The first was to improve learning (formative assessment), and the second was to make judgments about student | |
learning (summative assessment). The current literature on assessment emphasizes establishing alignment | |
between educational expectations versus student learning needs (Black et al., 2003; Gulikers et al., 2006; | |
Mussawy, 2009). Therefore, teachers use various forms of assessment to determine where students are and | |
create diverse activities to help them achieve the expected outcomes (Mansell et al., 2020). | |
As most countries have expanded their higher education systems by embracing broader access to higher | |
education, the student population has also become diverse (Altbach, 2007; Salmi, 2015). The more diverse | |
student population suggests that conventional assessment approaches may no longer work. Therefore, | |
alternative assessment approaches need to put students in the center to avoid wasting “learning for drilling | |
students in the things that they [teachers] will be held accountable [for]” (Dhindsa et al., 2007, p. 1262). | |
The concept of classroom assessment has been loosely defined in the higher education sector of Afghanistan. | |
While students and teachers are aware of different assessment approaches, current assessment practices rely | |
heavily on conventional summative assessment (Mussawy, 2009; Noori et al., 2017). Previously, final exams | |
were the only mechanisms to assess student learning (UNESCO-IIEP, 2004). However, higher education | |
reform in Afghanistan in the early 2000s paved the way for introducing mid-term exams and the credit | |
system that replaced the conventional course structure based on the number of subjects (Babury & Hayward, | |
2013; Hayward, 2017). More specifically, in the traditional system, the value of final grades for each subject | |
was the same, irrespective of the number of hours the subject was taught per week. However, in the credit | |
system, the value of grades varies depending on credit hours per week. Further, due to the absence of specific | |
regulations on assessment approaches, faculty members enjoyed immense autonomy in assessing student | |
learning. Since most of the faculty members had not received any training on pedagogy and assessment, they | |
primarily relied on conventional open-ended summative assessment (Darmal, 2009). | |
In 2014, the Ministry of Higher Education (MoHE) in Afghanistan introduced a new assessment policy that | |
centers on (a) transparency through the establishment of assessment committees at the institution and faculty | |
levels and (b) the type and the number of question items in an exam (Ministry of Higher Education (MoHE), | |
2018). The second component, which is the focus of this study, indicates that assessment includes “evaluation | |
of quizzes, mid-term exams, assignments, laboratory projects, class seminars and projects, final exams, and | |
thesis and dissertations” (MoHE, 2018, p. 5). While mid-term and final exams constitute 20% and 60% of | |
students’ grades, respectively, the policy emphasizes “30—40 question items on final exams” and “a minimum | |
of 10 items on mid-terms” (MoHE, 2018, p. 7). The policy also recommends a combination of closed-ended | |
and open-ended questions with a value of “3–5 points” for descriptive and analytic items and “1 point for | |
multiple-choice questions” (MoHE, 2018, p. 5). | |
Although the assessment policy recognizes various approaches, such as quizzes, assignments, student | |
projects, seminars, and mid-term and final exams, formative assessment and class attendance account for | |
only 20% of a student’s grade. Mid-term and final exams, on the contrary, constitute 80% of students’ grades; | |
this indirectly projects more value for summative over formative assessment. Therefore, perceptions of | |
students and faculty members can shed light on current practices and participants’ experiences of classroom | |
assessment. | |
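To make the weighting concrete, here is a small illustrative Python sketch of how a final course grade combines under the 20/60/20 split described above; the function name and the sample scores are hypothetical and are not taken from the MoHE policy document.

def course_grade(midterm, final_exam, formative_and_attendance):
    """Weighted grade: 20% mid-term, 60% final exam, 20% formative work and attendance."""
    return 0.20 * midterm + 0.60 * final_exam + 0.20 * formative_and_attendance

# Example with scores on a 0-100 scale.
print(course_grade(midterm=75, final_exam=68, formative_and_attendance=90))  # 73.8

Because the exam components carry 80% of the weight, even a large difference in the formative component shifts the final grade by at most 20 points, which is consistent with the imbalance that participants describe later in the article.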
Review of Literature | |
The focus of classroom assessment has gradually shifted from assessment of learning—“testing learning” | |
(Birenbaum & Feldman, 1998, p. 92) to assessment for learning—creating diverse opportunities for learners to | |
prosper (Brown, 2005; Wiliam, 2011). This is because research shows that classroom assessment significantly | |
affects the approach students take to learning (Pellegrino & Goldman, 2008). More specifically, new | |
assessment approaches encourage an increase in correspondence between student learning needs and | |
expectations to prosper in a changing environment (Gulikers et al., 2006). Goodrum et al. (2001) argued that, | |
ideally, assessment “enhances learning, provides feedback about student progress, builds self-confidence and | |
self-esteem, and develops skills in evaluation” (p. 2). Nonetheless, Dhindsa et al. (2007) stated that [primary | |
and secondary school] teachers “sacrifice learning for drilling students in the things that they will be held | |
accountable for” (p. 1262). This suggests that teachers use “a very narrow range of assessment strategies” to | |
help students prepare for high-stakes tests, while limited evidence exists to support that “teachers actually use | |
formative assessment to inform planning and teaching” (Goodrum et al., 2001, p. 2). Most importantly, recent | |
research on classroom assessment emphasizes the quality and relevance of assessment activities to help | |
students learn (Ibarra-Saiz et al., 2020). | |
Inquiring into students’ perception of assessment has been an important aspect of the literature on classroom | |
assessment (Koul et al., 2006; Segers et al., 2006; Struyven et al., 2005; Waldrip et al., 2009). Examining | |
their perceptions confirms the assumption that assessment “rewards genuine effort and in-depth learning | |
rather than measuring luck” (Dhindsa et al., 2007, p. 1262). For this reason, recent studies on classroom | |
assessment advocate for student involvement in developing assessment tools (Falchikove, 2004; Waldrip et | |
al., 2014) to make the learning process more valuable to students. With this in mind, Fisher et al. (2005) | |
developed Students Perceptions of Assessment Questionnaire (SPAQ) and confirmed its validity by applying it | |
to a sample consisting of 1,000 participants from 40 science classes in grades 8–10. Following that, Cavanagh | |
et al. (2005) modified and adapted the SPAQ as an analytic tool to study student perceptions of classroom | |
assessment in five specific areas: Congruence with planned learning (CPL), assessment of applied learning | |
(AAL), students’ consultation (SC) types, transparency in assessment (TA), and accommodation of students’ | |
diversity (ASD) in assessment procedures. Cavanagh et al. (2005) used SPAQ to study 8th through 10th grade | |
student perceptions of assessment in Australian science classrooms. Their study showed that student | |
perceptions of assessment in science subjects varied depending on their abilities. | |
Other studies examining students’ perceptions of assessment reveal diverse responses. For instance, Koul et | |
al. (2006) modified, validated, and applied SPAQ on a 4-point Likert scale to study secondary student | |
perceptions of assessment in Australia. Their study shows that the difference between males’ and females’ | |
perceptions of assessment was not statistically significant. However, they reported statistically significant | |
differences in student perceptions of assessment by grade level. Similarly, Dhindsa et al. (2007) used SPAQ to | |
examine high school student perceptions of assessment in Brunei Darussalam and learned that the Student | |
Consultation was rated the lowest of the scales. Their findings suggest that students perceived assessment as | |
transparent and as aligned with learning goals. However, they did find that teachers hardly consulted with | |
students regarding assessment forms. | |
Kwok (2008) also studied student assumptions of peer assessment and reported that, while students | |
perceived peer assessment as substantially important in enhancing self-efficacy, they considered themselves | |
unprepared relative to their teachers who brought years of experience. In another study, Segers et al. (2006) | |
examined college students’ understanding of assignment-based learning versus problem-based learning. Their | |
study showed that students in the assignment-based learning course embraced “more deep learning strategies | |
and less surface-learning strategies than the students in the PBL [problem-based learning] course” (Segers et | |
al., 2006, p. 234). They reported that students in the PBL course showed surface-level learning strategies | |
(Segers et al., 2006, p. 236). Although the context varied, their findings are partly consistent with those of | |
Birenbaum and Feldman (1998), who examined 8th through 10th-grade student attitudes towards open-ended versus closed-ended response assessment. They reported that gender and learning strategies were significantly correlated in that female students leaned towards essay questions while male students favored closed responses. In other words, students who demonstrated the "surface study approach" preferred close-ended question items, as opposed to those with a "deep study approach," who favored open-ended questions
(Birenbaum & Feldman, 1998). | |
However, Beller and Gafni's (2000) study shows that although boys favored multiple-choice question items in mathematics assessment, the difference in performance based on gender was not profound. Their study focused on question format, examining whether multiple-choice versus open-ended questions accounted for gender differences. Their "results challenge the simplistic assertion that girls perform relatively better on OE [open-ended] test items" (Beller & Gafni, 2000, p. 16). On a similar note,
Van de Watering et al. (2008) found no “relationship between students’ perceptions of assessment and their | |
assessment results” (p. 657). They reported that students prefer close-ended question formatting when | |
attending to a “New Learning Environment” (Van de Watering et al., 2008, p. 245). | |
Meanwhile, Struyven et al. (2005) studied the relationship between student perceptions of assessment and | |
their learning approaches. In general, students preferred close-ended questions; however, students with | |
advanced learning abilities and with low test anxiety favored essay exams. Lastly, Ounis (2017) investigated | |
perceptions of classroom assessment among secondary school teachers in Tunisia. The author reported that | |
the teachers “have highly favorable perceptions of assessment and they hold highly the motivational function | |
of assessment” (p. 123). According to Ounis (2017), the teachers emphasized oral assessment as a useful | |
approach to increase learning even though they reported some challenges to implementing the oral | |
assessment. | |
Although assessment in higher education is loosely defined relative to assessment at the primary and | |
secondary education levels, recent literature sheds light on introducing alternative/formative assessment | |
tasks such as portfolios, applied research projects, and others (Bess, 1977; Ibarra-Sáiz et al., 2020; Nicol & | |
Macfarlane-Dick, 2006; Struyven et al., 2005). Further, to date, research on perceptions and experiences of | |
undergraduate students and faculty members in Afghanistan is scarce. For instance, Noori et al. (2017) and | |
Darmal (2009) studied assessment practices of university lecturers in Afghanistan. However, the scope of | |
these research studies is limited. For instance, Darmal’s (2009) study focuses on the experiences of six faculty | |
members involved in the Department of Geography, and Noori et al.’s (2017) research included three lecturers | |
who taught English as a Foreign Language. Since the government has introduced new regulations on | |
assessment with a focus on types and number of questions in mid-term and final exams, exploring the | |
experiences of students and faculty members can shed light on the meaningfulness of classroom assessment | |
and create insight into the policy. | |
Although the existing literature provides mixed findings regarding the student perceptions of assessment | |
based on gender, gender equity has been underscored as a key challenge in the higher education sector of | |
Afghanistan (Babury & Hayward, 2014; Mussawy & Rossman, 2018). According to Babury and Hayward | |
(2014), female students constitute less than 20% of the student population in universities. Since females are | |
underrepresented in the higher education sector, examining students’ perceptions of assessment based on | |
gender will inform whether assessment practices serve male and female students evenly. | |
Study Purpose | |
The primary purpose of the study was to examine student perceptions of classroom assessment at a university | |
in Afghanistan. Exploring current assessment practices focused on student and faculty lived experiences was a | |
secondary purpose. The study also sought to collect evidence on whether the new Afghanistan assessment | |
policy was effective in improving student learning. Cavanagh et al. (2005) suggested two strategies to | |
understand the advantages and disadvantages of classroom assessment on student learning: (a) examining | |
the research on assessment forms that teachers use; and (b) inquiring into students’ perceptions of classroom | |
assessment. This study used both strategies. More specifically, the research questions and hypotheses guiding | |
the study are below. | |
1. What are the perceptions of students about classroom assessment? As part of this research question, gender and academic discipline differences were explored.
• Hypothesis 1: There is no significant difference in student perceptions of classroom assessment based on gender.
• Hypothesis 2: There is no significant difference in student perceptions of classroom assessment based on academic discipline.
2. What are the experiences of students and faculty members concerning classroom assessment? | |
Significance of the Study | |
This study contributes to the literature on classroom assessment. First, the study’s findings provide new | |
insights into how students perceive classroom assessment and whether the assessment outcomes affect their | |
learning. Second, the research explored student and faculty lived experiences with classroom assessment. | |
Specific attention was given to faculty pedagogical skills and assessment literacy. Third, teachers’ challenges | |
concerning the national assessment policy with a focus on grading practices are highlighted. The study also | |
informs the conversation regarding student involvement in assessment processes and the challenges | |
associated with the lack of student preparedness to pursue undergraduate degree programs. | |
Theoretical Framework | |
The study uses formative and summative assessment as an analytic lens to explore perceptions and experiences | |
of classroom assessment among undergraduate students and faculty. Formative and summative assessment | |
approaches are well explored in the literature (Scriven, 1967; Wiliam & Black, 1996; Wiliam & Thompson, | |
2008). Formative assessment in the United States refers to “assessments that are used to provide information on | |
the likely performance of students on state-mandated test—a usage that might better be described as 'early-warning summative'" (Wiliam & Thompson, 2008, p. 60). Other places use formative assessment to provide feedback to students—informing them "which items they got correct and incorrect" (Wiliam & Thompson, 2008, p. 60). Providing feedback to improve learning is a key component of formative assessment that helps students in higher education settings achieve desirable outcomes (Black & Wiliam, 1998; Nicol & Macfarlane-Dick, 2006; Sadler, 1998). In other words, formative assessment allows instructors to help students
engage in their own learning by exhibiting what they know and identifying their needs to move forward (Black & | |
Wiliam, 1998; Mansell et al., 2020; Wiliam, 2011). Formative assessment occurs in formal and informal forms | |
such as quizzes, oral questioning, self-reflection, peer feedback, and think-aloud (Mansell et al., 2020; Wiggins & | |
McTighe, 2007). Formative assessment also influences the quality of teaching and learning while engaging | |
students in self-directed learning (Stiggins & Chappuis, 2005). | |
On the other hand, summative assessment is bound to administrative decisions (Wiliam, 2008). It occurs at the | |
“end of a qualification, unit, module or learning target to evaluate the learning which has taken place towards the | |
required outcomes” (Mansell et al., 2020, p. xxi). Summative assessment, known as assessment of learning, is | |
primarily used “in deciding, collecting and making judgments about evidence relating to the goals of the learning | |
being assessed” (Harlen, 2006, p. 103). Herrera et al. (2007, p. 13) argued that “assessment of achievement has | |
become increasingly standardized, norm-referenced and institutionalized,” which thus negatively affects the | |
quality of teaching (Firestone & Mayrowetz, 2000). For scholars like Stiggins and Chappuis (2005), student | |
roles vary depending on assessment forms, suggesting that summative assessment enforces a passive role while | |
formative assessment engages students in the process as active members. | |
While some studies promote formative assessment over summative assessment (Firestone & Mayrowetz, | |
2000; Harlen, 2006), other studies emphasize the purpose and outcome of assessment activities with a focus | |
on ways to utilize the information to improve the teaching and learning experience (Taras, 2008; Ussher & | |
Earl, 2010). Bloom (1969) also asserted that when assessment is aligned with the process of teaching and | |
learning, it will have “a positive effect on student learning and motivation” (cited in Wiliam, 2008, p. 58). | |
Assessment in general accounts for “supporting learning (formative), certifying the achievement or potential | |
of individuals (summative), and evaluating the quality of educational institutions or programs (evaluative)” | |
(Wiliam, 2008, p. 59). Black and Wiliam (2004) emphasized ways to use the outcomes of formative and | |
summative assessment approaches to improve student learning. Taras (2008) argued that “all assessment | |
begins with summative assessment (which is a judgment) and that formative assessment is, in fact, | |
summative assessment plus feedback which the learner uses” (p. 466). According to Taras (2008), both | |
formative and summative assessments require “making judgments,” which might be implicit or explicit | |
depending on the context (p. 468). In other words, Taras (2008) argued that assessment could not “be | |
uniquely formative without the summative judgment having preceded it” (p. 468). Similarly, Wiggins and | |
McTighe (2007) explained that formative assessment occurs during instruction rather than as a separate | |
activity at the end of a class or unit. The literature on assessment underscores the importance of formative | |
and summative assessment and ways that “assessment… feed into actions in the classroom in order to affect | |
learning” (Wiliam & Thompson, 2008, p. 63). | |
Methods | |
Research Site | |
The study was conducted at a public university in Northern Afghanistan. The university, established in 1993 | |
and re-established in 2003, has seven colleges and 27 departments. The university has approximately 155 full-time faculty members who serve approximately 5,000 students, 20% of whom are female. The faculty–student ratio at the university is 1:35, and the staff–student ratio is 1:70. The university offers only
undergraduate degrees. | |
Procedure and Participants | |
The authors used an explanatory sequential mixed-methods design to collect data from senior, junior, and | |
some sophomore students. We administered the 24-item SPAQ to a random sample of 400 students from the | |
Agriculture, Education, and Humanities colleges and received responses from 353 students (a response rate of 88.25%).
Following the administration of the SPAQ, the authors conducted document analysis (mainly policy | |
documents on assessment) as well as semi-structured interviews with a purposeful sample of 25 individuals, | |
seven faculty members, and 18 undergraduate students to explore their lived experiences concerning current | |
assessment practices. The in-person interviews ranged from 30 to 70 minutes. The notation for this study can | |
be written as QUAN → QUAL (Creswell & Clark, 2017). The authors obtained approval of the Institutional | |
Review Board prior to conducting the study. | |
Instrument | |
We adapted the SPAQ (Cavanagh et al., 2005) to examine students’ perceptions of assessment. As a | |
conceptual model, SPAQ assesses students’ perceptions of assessment in the following five dimensions: | |
1. Congruence with planned learning (CPL)—Students affirm that assessment tasks align with the goals, objectives, and activities of the learning program;
2. [Assessment] Authenticity (AA)—Students affirm that assessment tasks feature real-life situations that are relevant to themselves as learners;
3. Student consultation (SC)—Students affirm that they are consulted and informed about the forms of assessment tasks being employed;
4. [Assessment] Transparency (AT)—The purposes and forms of assessment tasks are affirmed by the students as well-defined and are made clear; and
5. Accommodation to student diversity (ASD)—Students affirm they all have an equal chance of completing assessment tasks (Cavanagh et al., 2005, p. 3).
Since the original instrument was only used to measure science assessment, we adapted and translated it to | |
correspond to other disciplines such as social science, agriculture, and humanities. The Dari/Farsi translation | |
of SPAQ is located in Appendix A. Students’ responses to the SPAQ were recorded on a 4-point Likert scale (4 | |
= Strongly Agree to 1 = Strongly Disagree). | |
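As a rough illustration of how 4-point Likert responses to the 24 SPAQ items are typically aggregated into the five subscale scores listed above, the following Python sketch is offered; the item-to-subscale index ranges and the simulated responses are hypothetical and do not reflect the actual SPAQ scoring key.

import numpy as np

# Hypothetical mapping of the 24 items onto the five SPAQ dimensions.
SUBSCALES = {
    "CPL": range(0, 5),    # Congruence with planned learning
    "AA": range(5, 10),    # Assessment authenticity
    "SC": range(10, 15),   # Student consultation
    "AT": range(15, 20),   # Assessment transparency
    "ASD": range(20, 24),  # Accommodation to student diversity
}

def subscale_means(responses):
    """Mean score per subscale from a (respondents x 24) matrix of 1-4 Likert codes."""
    responses = np.asarray(responses)
    return {name: float(responses[:, list(items)].mean()) for name, items in SUBSCALES.items()}

# Demo with randomly generated responses for 353 respondents.
rng = np.random.default_rng(1)
print(subscale_means(rng.integers(1, 5, size=(353, 24))))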
For the qualitative section of the study, we used a phenomenological approach to explore student and faculty | |
experiences of classroom assessment (Rossman & Rallis, 2016). Using a phenomenological approach in a | |
qualitative study is important in “understanding meaning, for participants in the study, of the events, | |
situations, and actions they are involved with, and of the accounts that they give of their lives and | |
experiences” (Maxwell, 2012, p. 8). The authors used two semi-structured interview protocols (one for | |
students and one for faculty) containing 19 open-ended questions to corroborate the results of the quantitative | |
data. Appendices B and C contain the interview protocols for faculty and students, respectively. These | |
protocols centered on four important themes of classroom assessment—methods, authenticity, transparency, | |
and the use of assessment outcomes to improve learning—that emerged from the literature on perceptions of | |
assessment. | |
Since the SPAQ and interview protocols were developed in English, one of the authors, fluent in English and | |
Dari, used a forward translation approach to translate the instruments into Dari/Farsi. The English and Dari | |
versions were shared with three experts who were fluent in both languages, and the translated versions were | |
revised based on their comments and suggestions. Then, the instruments were pilot tested among senior and | |
junior students and faculty members. The investigators conducted the survey and interviews once the | |
research participants confirmed that the questionnaire and interview protocols were understandable in the | |
local language. | |
Validity and Reliability | |
Previous research confirmed the validity and reliability of SPAQ. For instance, Fisher et al. (2005) developed | |
SPAQ and confirmed its validity by applying it to a sample consisting of 1,000 participants from 40 science | |
classes in grades 8–10. Cavanagh et al. (2006) replicated the study and revised the instrument from 30 to 24 | |
items. Dhindsa et al. (2007) administered the revised SPAQ with 1,028 Bruneian upper-secondary students. | |
They reported Cronbach’s alpha reliability (Cronbach, 1951) as “0.86” for 24 items, while it ranged from “0.64 | |
to 0.77” for subscales (p. 1269). Similarly, Koul et al. (2006) applied the original 30-item instrument and | |
reported that Cronbach’s alpha reliability coefficient for SPAQ subscales ranged from 0.63 to 0.83. Lastly, | |
Mussawy (2009) administered the revised SPAQ at Baghlan Higher Education Institution in Afghanistan and | |
confirmed that the SPAQ was suitable for understanding student perceptions of assessment. Cronbach’s alpha | |
reliability coefficient in that study was 0.89 for all items (24), and it ranged from 0.61 to 0.76 for subscales. | |
Thus, validity and reliability of SPAQ have been confirmed in secondary and tertiary education settings. The | |
investigators used the triangulation technique to increase the study’s validity by collecting data from different | |
sources including the SPAQ, semi-structured interviews, and document analysis. Research methodologists, | |
including Maxwell (2012) and Rossman and Rallis (2016), support that by using triangulation, researchers | |
can reduce the risk of chance associations in the data, or of capturing only one aspect of the phenomenon, that can result when using one particular method. Further, the Cronbach's alpha reliability coefficient was
calculated to determine the extent to which items in each subscale measure the same dimension of students’ | |
perceptions of assessment. | |
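Since Cronbach's alpha is the reliability index reported throughout this section, a brief sketch of the standard computation, alpha = k/(k-1) x (1 - sum of item variances / variance of the summed scale), may help readers check subscale reliabilities; the simulated response matrix below is illustrative only.

import numpy as np

def cronbach_alpha(item_scores):
    """Cronbach's alpha for a (respondents x items) matrix of item scores."""
    x = np.asarray(item_scores, dtype=float)
    k = x.shape[1]                              # number of items in the scale
    item_variances = x.var(axis=0, ddof=1)      # variance of each item
    total_variance = x.sum(axis=1).var(ddof=1)  # variance of the summed scale
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

# Demo: alpha for a hypothetical 5-item subscale answered by 353 students.
rng = np.random.default_rng(2)
latent = rng.normal(size=(353, 1))                      # shared trait
items = latent + rng.normal(scale=1.0, size=(353, 5))   # correlated items
print(round(cronbach_alpha(items), 2))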
Analysis | |
Descriptive analyses address the first research question about students’ overall perceptions about assessment | |
at the university. Two separate statistical analyses were performed to answer the research hypotheses testing | |
whether there are statistical differences in student perceptions of assessment by academic discipline and | |
gender. The investigators performed one-way, between-groups ANOVA to examine whether the difference | |
between students’ perceptions of assessment was statistically significant based on colleges/disciplines. Next, | |
we conducted a t-test to analyze the difference in students’ perceptions of assessment based on gender. | |
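A hedged sketch of how the quantitative analyses described above (Levene's test, a one-way between-groups ANOVA across colleges, and an independent-samples t-test by gender) could be run in Python with scipy is shown below; the data frame, column names, and simulated values are placeholders, not the study's actual data.

import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical data: one overall SPAQ score per student, plus college and gender.
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "college": rng.choice(["Education", "Humanities", "Agriculture"], size=353, p=[0.40, 0.48, 0.12]),
    "gender": rng.choice(["male", "female"], size=353, p=[0.73, 0.27]),
    "spaq_overall": rng.normal(3.16, 0.48, size=353).clip(1, 4),
})

groups = [g["spaq_overall"].to_numpy() for _, g in df.groupby("college")]

# Homogeneity of variances, then the omnibus one-way ANOVA across colleges.
lev_stat, lev_p = stats.levene(*groups)
f_stat, f_p = stats.f_oneway(*groups)

# Independent-samples t-test for gender differences on the overall score.
male = df.loc[df["gender"] == "male", "spaq_overall"]
female = df.loc[df["gender"] == "female", "spaq_overall"]
t_stat, t_p = stats.ttest_ind(male, female)

print(f"Levene W = {lev_stat:.2f}, p = {lev_p:.3f}")
print(f"ANOVA F = {f_stat:.2f}, p = {f_p:.3f}")
print(f"t-test t = {t_stat:.2f}, p = {t_p:.3f}")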
To analyze the qualitative data, initially, the interviews were transcribed and translated into English. Next, the | |
authors organized the data, reviewed it for accuracy, and cross-checked the original translation to ensure the | |
meanings were consistent (Marshall & Rossman, 2016). Then, the authors applied accepted analysis practices | |
such as “immersion in the data, generating case summaries and possible categories and themes, coding the | |
data, offering interpretations through analytic memos, search for alternative understanding, and writing the | |
report” to analyze the data inductively (Marshall & Rossman, 2016, p. 217). We used NVivo 12 to code the | |
data, run queries, and observe overlaps/connections among themes. The process, overall, was very interactive | |
as the authors exchanged perspectives by writing analytic memos and reflections to draw connections between | |
the qualitative themes and to corroborate the quantitative results (Marshall & Rossman, 2016). In short, the | |
qualitative analysis focused on the meaningfulness of classroom assessment based on lived experiences | |
(Rossman & Rallis, 2016). | |
Results | |
Quantitative | |
The Cronbach alpha reliability coefficient for all items in SPAQ was α = 0.89, suggesting strong internal | |
consistency. Among the subscales within SPAQ, Transparency had the highest alpha reliability score of α =
0.75, and Congruence with Planned Learning had the lowest α = 0.64. The instrument reliability for subscales is consistent with previous research (see Dhindsa et al., 2007; Koul et al., 2006; Mussawy, 2009). Given | |
that the alpha reliability results for the subscales of SPAQ were consistently above 0.63, according to Cortina | |
(1993), the use of SPAQ was considered reliable (See Table 1). | |
The descriptive statistics show mean scores ranging from M = 2.99 for the Accommodation to Student Diversity subscale to M = 3.30 for Congruence with Planned Learning on a 4-point Likert scale (4 = strongly
agree—1 = strongly disagree). The high mean scores suggest that students have a very positive perception of | |
classroom assessment. Table 1 provides an illustration of sub-scales mean scores, standard deviations, and | |
Cronbach alpha reliability. | |
Table 1. Sub-Scale Mean, Standard Deviation, and Cronbach Alpha Reliability Coefficient for the SPAQ and its Subscales

SPAQ Scales                            Mean    St. Dev   Alpha Reliability
Congruence with planned learning       3.30    .506      .644
Assessment authenticity                3.19    .540      .694
Student consultation                   3.09    .690      .732
Assessment transparency                3.18    .652      .749
Accommodation to student diversity     2.99    .710      .698
Overall                                3.16    .484      .898
The descriptive statistics associated with students’ perceptions of classroom assessment across three colleges | |
are reported in Table 2. The results show that participants from the College of Humanities were associated | |
with the smallest mean value (M = 3.05, SD =.467); participants from the College of Education were | |
associated with the highest mean value (M = 3.28, SD = .499); and participants from the College of | |
Agriculture were in between (M = 3.19, SD = .397). A one-way, between-groups ANOVA was performed to test | |
the hypothesis that college was associated with perceptions of classroom assessment. The assumption of | |
homogeneity of variance was tested and satisfied based on Levene’s test, F(2, 350) = .59, p = .55. | |
Table 2. Average Scale-Item Mean, Average Item Standard Deviation, and Standard Error Results for College-Level Differences in SPAQ Overall Scores

                                                 95% Confidence Interval for Mean
Colleges       N     M      SD     Std. Error    Lower Bound    Upper Bound
Education      142   3.28   .499   .041          3.20           3.37
Humanities     171   3.05   .467   .035          2.98           3.12
Agriculture     40   3.19   .397   .062          3.06           3.31
Total          353   3.16   .484   .025          3.11           3.21
The independent between-groups ANOVA was statistically significant, F(2, 350) = 9.45, p < .001, η² = .058.
Thus, the null hypothesis of no difference between the mean scores was rejected, and 5.8% of variance was | |
accounted for in the college group. To analyze the differences between the mean scores of the three colleges, | |
we used Fisher’s LSD post-hoc tests. The difference between students’ perceptions from the College of | |
Education and the College of Humanities was statistically significant across Congruence with Planned | |
Learning, Assessment Authenticity, Student Consultation, and Accommodation to Student Diversity | |
subscales. The difference between student perceptions from the Colleges of Education and Agriculture was | |
only statistically significant for the Accommodations to Student Diversity subscale. Finally, the difference | |
between students’ perceptions of assessment from the Colleges of Agriculture and Humanities was not | |
statistically significant across all scales. See Table 3 for further information on means and probability values. | |
Table 3. Average Scale-Item Mean, Average Item Standard Deviation, and ANOVA Results for College Differences in SPAQ Scale Scores

         Education       Humanities      Agriculture     p values
Scale    M      SD       M      SD       M      SD       Edu vs. Hum   Edu vs. Agr   Agr vs. Hum
CLP      3.38   .461     3.15   .559     3.41   .516     .003          .800          .030
AA       3.26   .602     3.04   .540     3.20   .571     .005          .494          .260
SC       3.32   .604     2.97   .707     3.22   .457     .000          .220          .108
AT       3.25   .686     3.11   .639     3.19   .558     .059          .816          .325
ASD      3.21   .621     2.85   .706     2.94   .572     .000          .009          .467

Note. The p-value columns refer to the Education versus Humanities, Education versus Agriculture, and Agriculture versus Humanities comparisons, respectively.
Lastly, an independent sample t-test was performed to determine if the mean scores between male (N = 258) | |
and female (N = 95) students were statistically different. The assumption of homogeneity of variances was | |
tested and satisfied via Levene’s test, F(351) = .551, p = .458. The independent samples t-test was not | |
associated with a statistically significant effect, t(351) = -1.34, p = .17. This suggests that the difference | |
between students’ perceptions of assessment based on gender was not statistically significant, and the null | |
hypothesis was retained. | |
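The effect size reported for the college ANOVA above (η² = .058, about 5.8% of the variance in overall SPAQ scores explained by college) is the ratio of the between-group to the total sum of squares. A minimal Python sketch of that computation, using hypothetical grouped data of the same shape as the study's, is given below.

import numpy as np

def eta_squared(groups):
    """Eta squared (SS_between / SS_total) for a one-way design."""
    all_values = np.concatenate([np.asarray(g, dtype=float) for g in groups])
    grand_mean = all_values.mean()
    ss_between = sum(len(g) * (np.mean(g) - grand_mean) ** 2 for g in groups)
    ss_total = ((all_values - grand_mean) ** 2).sum()
    return ss_between / ss_total

# Demo with three hypothetical college groups (means and sizes mimic Table 2).
rng = np.random.default_rng(3)
demo_groups = [rng.normal(m, 0.48, size=n) for m, n in [(3.28, 142), (3.05, 171), (3.19, 40)]]
print(round(eta_squared(demo_groups), 3))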
Qualitative Section | |
The qualitative results generated insights about important aspects of classroom assessment. Both students | |
and faculty commented that the existing classroom assessment policies and practices favor exams, which | |
center on summative assessment approaches. However, most faculty members reported that they implement | |
both formative and summative assessment. Three themes emerged from the interviews with faculty and | |
students: Improvement in pedagogy and assessment; student involvement in assessment processes; and | |
assessment forms versus the grading policy. Findings suggest that awareness about different forms of | |
assessment is high among the faculty. In addition, both students and faculty reported student involvement in | |
assessment processes at some level. Further, all participants highlighted the restrictions of the grading policy as an important challenge, both for faculty members seeking to institutionalize alternative assessment approaches alongside the existing high-stakes assessment and for students asked to buy into assessment activities that are not
tied to their grades. | |
Improvements in Pedagogy and Assessment Skills | |
Most faculty indicated substantial growth in teaching and assessment competencies due to exposure to | |
modern pedagogies provided at the national and institutional levels. A faculty member explained that | |
universities in Afghanistan follow a cascade model of professional development for faculty. She added that the | |
university has a team of experts facilitating training sessions on "outcome-based learning" and "student-centered instruction." Another faculty member confirmed that the training sessions covered different
assessment approaches. He explained, “I feel confident facilitating student-driven lessons and developing | |
different assessment forms to assess my students." Similarly, a junior faculty member reported that she had learned
ways to create “individualized and collaborative assessment tasks.” For these participants, professional | |
development programs facilitated by the quality assurance office have increased their assessment literacy. | |
While the faculty participants noted improvements in their assessment skills, many students criticized them for | |
failing to design assessment tasks that matched individual student capabilities. “My classmates come from | |
different geographies where access to schools is limited. They have different learning abilities, but assignments | |
and exams are the same for everyone,” said a senior student from the College of Education. He added that not | |
everyone has the same learning style, suggesting that faculty members should pay attention to the individualized | |
needs of students. Nonetheless, students acknowledged assessment transparency and the recurrence of daily | |
assessment during instruction. A senior student described that their “exams consist of simple, medium, and | |
difficult questions.” Nevertheless, a few students were skeptical about merit-based assessment, noting that final | |
exams are sometimes politicized to promote one student group over another. While participants avoided | |
providing specific details, this example flags concerns about assessment ethics centered on "fairness and equity" as
teachers make judgments about student learning (Klenowski & Wyatt-Smith, 2014, p. 7). | |
Student Involvement in Assessment Processes | |
Most of the faculty members who participated in the interviews expressed reluctance to involve students in
assessment tasks, particularly when grading a student’s work. Nevertheless, they were open to the idea of | |
having students review their peers’ work and provide constructive feedback. One faculty member stated that | |
he often encouraged students to make oral comments when their peers presented their projects, but he never | |
asked them to provide written feedback. Other faculty members also recalled instances when they worked | |
with students to solve a problem or discuss applying concepts and theories in practice. For these faculty, | |
assessment and teaching are “inseparable.” For instance, according to one faculty who was teaching writing | |
courses, providing opportunities for students to ask questions and reflect on the lesson was central to her | |
teaching philosophy. She went on to explain, “I usually provide lengthy feedback on students’ papers by | |
explaining the strengths, weaknesses, and ways to improve them.” The faculty member, nonetheless, | |
acknowledged that she had never shared her assessment rubric with students. | |
Student engagement in assessment tasks only occurred in informal settings. Students explained that the | |
faculty usually involve students in assessment when the subject requires them to conduct fieldwork and share | |
their findings with the class. More precisely, a junior student said, “When we present the findings of our | |
fieldwork, our classmates can ask questions or make comments about the presentation.” He went on to say | |
that a few faculty members had specific policies, for example, choosing referees among students to make | |
judgments about student presentations. Another junior stated, “I felt much empowered when it was my turn | |
to evaluate other students’ presentations one day.” He added, “I was a little nervous but so excited to serve as | |
a referee.” However, a few students complained about the purpose of peer assessment when there is no | |
guideline from the instructor. According to a sophomore, “The faculty members should establish the | |
grounding rules when they let students ask questions and assess the presentations. Some students ask difficult | |
questions to challenge their classmates.” While many of the students highlighted the importance of student | |
involvement in assessment processes, the last example illustrates the role of faculty members in managing
assessment. | |
Given that classroom assessment occurs at different intervals, several faculty members complained about the | |
lack of student preparedness for post-secondary education. They criticized secondary schools for failing to | |
prepare students with adequate knowledge and skills to pursue undergraduate programs. For instance, a | |
faculty member who facilitated a freshman course on academic writing described her experience: “Students | |
barely know how to write. I had to revisit my course syllabus to meet their needs.” For this participant and | |
s