title: Are Top School Students More Critical of Their Professors? Mining Comments on RateMyProfessor.com
authors: Tang, Ziqi; Wang, Yutong; Luo, Jiebo
date: 2021-01-23

Student reviews and comments on RateMyProfessor.com reflect the realistic learning experiences of students. Such information provides a large-scale data source for examining the teaching quality of lecturers. In this paper, we propose an in-depth analysis of these comments. First, we partition our data into different comparison groups. Next, we perform exploratory data analysis to delve into the data. Furthermore, we employ Latent Dirichlet Allocation and sentiment analysis to extract topics and understand the sentiments associated with the comments. We uncover interesting insights about the characteristics of both college students and professors. Our study demonstrates that student reviews and comments contain crucial information and can serve as essential references for enrollment in courses and universities.

Since 1983, the U.S. News & World Report has been publishing rankings of the colleges and universities in the United States each fall. These rankings have remarkable impacts on applications, admissions, enrollment decisions, and tuition pricing policies (Monks and Ehrenberg 1999). They are an important reference not only for students and parents, but also for institutions and professors. The ranking methodology measures and weighs a variety of factors, and has been continuously refined over time based on user feedback, discussions with institutions and education experts, literature reviews, and the publisher's own data (Morse and Brooks 2020). The current ranking methodology considers the following factors, with the indicated weights: Graduation and Retention Rates (22%), Undergraduate Academic Reputation (20%), Faculty Resources (20%), Financial Resources (10%), Student Selectivity for the entering class (7%), Graduation Rate Performance (8%), Social Mobility (5%), Graduate Indebtedness (5%), and Alumni Giving Rate (3%). This measurement takes a good number of objective factors into consideration. However, the learning experiences of students are subjective and personal, and cannot readily be represented by the ranking scores. In this regard, a professor rating website such as RateMyProfessors.com is a great resource for uncovering hidden knowledge about the learning experience that the U.S. News rankings cannot account for. Rate My Professor is a website that allows students to anonymously rate their professors and write comments. The website claims that users have added more than 19 million ratings, 1.7 million professors, and over 7,500 schools, and that more than 4 million college students use the website each month (Rat 2020). Such massive text data is a great resource for studying the following topics: features of different universities, learning experiences of students, and course and lecture quality. Past literature has primarily examined the usefulness and validity of these ratings (Otto, Jr, and Ross 2008) and the correlations between the easiness, clarity, and helpfulness of lecturers (Otto, Sanford, and Wagner 2011). Yet the rich data on Rate My Professor contain more hidden information to discover. A unique feature of Rate My Professor is that it has professor reviews from different tiers of universities, such as Ivy League schools, Big Ten schools, and community colleges.
These reviews discuss the same topic: the experience of taking a course from a college professor. This provides an opportunity to conduct a plausible control-variable experiment to learn about the characteristics of students and professors at different universities and colleges. In summary, this study makes several contributions:
1. We conduct a large-scale study of course learning experiences across a broad spectrum of universities and colleges in the United States.
2. We employ exploratory data analysis, topic modeling, and sentiment analysis to mine the behaviors and characteristics of different segments of colleges, students, and professors.
3. We uncover interesting and useful insights that can be used to understand and improve the learning experience.

Rate My Professors data were scraped from the website. We selected about 75 universities based on the 2020 U.S. News college rankings. The rationale of our selection was the following: the eight Ivy League schools represent the top-ranked private universities, ten Big Ten Academic Alliance member universities represent the top-ranked public universities, and the top 15 ranked community colleges in the United States represent the community colleges. In addition, we selected the top 25 ranked universities and those ranked between 100 and 125 in the United States. For each university in our selection, we picked the 60 most-rated professors (not the highest rated), and for each professor page, we scraped the 20 most recent comments. In total, we collected 87,436 data records containing the following attributes: "Professor ID", "Professor Name", "University", "Department", "Course ID", "Quality score", "Difficulty score", and "Comments". Each data record represents a review by a student on a course.

We partitioned the collected data into several datasets. The rationale was the following:
1. Based on the school type, we partition the data into three categories: private (Ivy League), public (Big Ten), and community colleges.
2. Based on the average rating scores of the professors, we calculate the average quality score of each professor and select those professors with an average score above 4.0 and below 2.0 (the full score range is from 1.0 to 5.0) as the high-rating and low-rating professor groups, respectively.
3. Based on the quality score of each comment, we also create datasets for three categories: comments with a score above 4.0, comments with a score below 2.0, and comments with a score in between.
In the end, we have 11 datasets for three types of comparison. Note that these datasets may overlap with each other.
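The partitioning logic can be sketched with a few pandas filters. This is a minimal illustration, assuming column names that mirror the attributes listed above and a hypothetical "School type" label recorded during scraping; it shows the grouping idea rather than the exact pipeline used for the study.

```python
import pandas as pd

# Minimal sketch of the partitioning step; column names are assumptions
# that mirror the attributes listed above.
df = pd.read_csv("rmp_comments.csv")  # one row per student review

# 1. School-type groups: private (Ivy League), public (Big Ten), community colleges.
by_school_type = {name: group for name, group in df.groupby("School type")}

# 2. Professor-level groups by average quality score across a professor's reviews.
prof_avg = df.groupby("Professor ID")["Quality score"].mean()
high_rating_profs = df[df["Professor ID"].isin(prof_avg[prof_avg > 4.0].index)]
low_rating_profs = df[df["Professor ID"].isin(prof_avg[prof_avg < 2.0].index)]

# 3. Comment-level groups by the quality score of each individual review.
high_comments = df[df["Quality score"] > 4.0]
low_comments = df[df["Quality score"] < 2.0]
mid_comments = df[df["Quality score"].between(2.0, 4.0)]
```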
We perform exploratory analysis with the following initial findings:
• The word counts shown in Figure 1 indicate that most comments contain around 60 words. All groups have similar distributions; the only difference is that Ivy League students use short phrases more often than the other groups.
• The punctuation usage shown in Figure 2 demonstrates that the most commonly used punctuation mark is the period. The distributions for all groups are similar as well; the only difference is that community college students use commas, dashes, and exclamation marks more frequently than the other groups.
• Figure 3 shows the distribution of the average quality ratings of all professors, which is left-skewed.
• Figure 4 shows the distribution of quality ratings from all 75 schools.
• From Figure 5, the proportions of quality ratings of the three groups of schools differ. Community college students give more high (5/5) ratings, while Ivy League students give fewer low (1/5) ratings. This answers the question in our title: top school students are not more critical when rating their professors and course quality.
• Figure 6 shows the proportions of difficulty ratings of the three groups of schools, which are very similar.
• The correlations between quality ratings and difficulty ratings for Ivy League, Big Ten, and community colleges are -0.178, -0.424, and -0.515, respectively. All groups have negative correlations, implying that the quality rating decreases as the difficulty rating increases, and vice versa. The Ivy League correlation is closest to zero, which means there is little relationship between quality and difficulty ratings there. Moreover, students from Big Ten schools and community colleges are more likely to give a higher quality rating when the course is easy.

Figure 6: Difficulty rating distributions of Ivy League, Big Ten, and community colleges.

In order to find out what factors influence the quality ratings, we perform Latent Dirichlet Allocation (LDA) to extract topics from the comments. We implement several topic modeling methods: LDA, BiGram LDA, and TriGram LDA using the Gensim library, and traditional LDA and Multi-Grain LDA using the Tomotopy library. Gensim is a well-known Python library for topic modeling, and Tomotopy is a newer library that provides topic modeling functions. The advantages of Tomotopy are its capability of handling large-scale datasets, its significantly faster running time (5 to 10 times faster than Gensim), and its implementation of Multi-Grain LDA. Multi-Grain LDA takes both local topics and global topics into consideration when performing topic modeling. Therefore, we decide to examine the Tomotopy Multi-Grain LDA model in our study. The BiGram, TriGram, and Multi-Grain LDA models are similar algorithms to traditional LDA, but they include an additional step that adds N-gram phrases to increase the model's complexity, which can be useful in boosting performance. In our case, the BiGram model has phrases like "easy-A", "office-hour", and "online-course", and the TriGram model has phrases like "extra-credit-opportunity" and "attendance isn mandatory".

In order to evaluate the performance of all of these models, we use the coherence score, pyLDAvis visualization, log-likelihood, and manual checking as our evaluation metrics. For the LDA, BiGram LDA, and TriGram LDA models using Gensim, the coherence score comparison is shown in Figure 7. Furthermore, we use the pyLDAvis topic modeling visualization tool to analyze the performance of the models. The Tomotopy library does not generate coherence scores for the Multi-Grain LDA model, which is a downside of this library. Therefore, we manually check all the topics these models generate and choose the one that makes the most sense to us.
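As a concrete illustration of the Gensim-based pipeline, the following sketch fits a BiGram LDA model and reports a c_v coherence score. The tokenization, phrase-detection thresholds, and number of passes are illustrative assumptions, not the exact settings behind the reported models.

```python
from gensim.corpora import Dictionary
from gensim.models import CoherenceModel, LdaModel, Phrases

def bigram_lda(token_lists, num_topics=9):
    """Fit a BiGram LDA model on tokenized comments and report c_v coherence.

    token_lists: one list of lowercased, stop-word-filtered tokens per comment.
    """
    # Detect frequent bigram phrases (e.g. "office_hour") and append them to each document.
    bigrams = Phrases(token_lists, min_count=5, threshold=10.0)
    docs = [doc + [tok for tok in bigrams[doc] if "_" in tok] for doc in token_lists]

    dictionary = Dictionary(docs)
    dictionary.filter_extremes(no_below=5, no_above=0.5)
    corpus = [dictionary.doc2bow(doc) for doc in docs]

    lda = LdaModel(corpus=corpus, id2word=dictionary,
                   num_topics=num_topics, passes=10, random_state=0)
    coherence = CoherenceModel(model=lda, texts=docs, dictionary=dictionary,
                               coherence="c_v").get_coherence()
    return lda, coherence

# Example usage (tokenization step not shown):
# lda, score = bigram_lda(tokenized_comments)
# print(lda.show_topics(num_topics=9, num_words=10), score)
```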
Figure 8 shows the resulting topics obtained with the BiGram LDA, TriGram LDA, and Multi-Grain LDA methods. They are all generated from the same dataset (the lower-quality-rating comments for community colleges) and have the same number of top topics selected (nine). A major portion of the topics are similar; however, the TriGram LDA model covers the most topics. For instance, we see the keyword "online" in the result of TriGram LDA. Since this is a community college dataset, we can infer that community colleges tend to offer more online classes than the other groups, which could be a factor that students consider when rating course quality. We also see "accent" for the first time in the result of TriGram LDA. This is a very interesting factor to include, because many students have a difficult time understanding their professors' accents, and the communication experience is an important aspect of the course quality rating.

Higher Ratings (4-5)
The keywords of the topics for higher quality ratings in the three groups are listed in Figure 9. Students mention many factors in the comments when giving higher ratings, for example, school work (homework, tests, exams), extra help (office hours), and the professor's personality (friendly, humorous, entertaining). Meanwhile, some unexpected words stand out in the table: "tough", "boring", and "strict", implying that these do not negatively affect the quality ratings of Ivy League schools and community colleges. In addition, both Big Ten and community college students mention "extra credit" and "grade" more often. The word "friend" appears in the Big Ten topics, perhaps implying that students at Big Ten schools are more likely to get along with their professors like friends.
Figure 9: Topic keywords of higher ratings (4-5) for Ivy League vs. Big Ten vs. community colleges.

Lower Ratings (1-2)
The keywords of the topics for lower quality ratings in the three groups are listed in Figure 10. A number of factors are mentioned by students in the comments with lower ratings, for example, school work (homework, tests, exams), organization of the content (unclear, disorganized, useless), and the professor's attitude (manner, rude, arrogant). One thing to point out is that "cost" is a common factor across all schools, as the cost of textbooks, supplies, and software has a significantly negative effect on quality ratings.
Figure 10: Topic keywords of lower ratings (1-2) for Ivy League vs. Big Ten vs. community colleges.

Middle Ratings (2-4)
The topic keywords of middle quality ratings for the three groups are listed in Figure 11. The middle-rating comments are usually not extreme. We note that "accent" appears under the Big Ten topics here, and under the community college topics for lower ratings. This suggests that Big Ten students may have a higher tolerance for professors' accents than community college students.
Figure 11: Topic keywords of middle ratings (2-4) for Ivy League vs. Big Ten vs. community colleges.

The keywords in the comments for professors with an average quality rating above 4 and below 2 are listed in Figure 12. One thing to notice is that the coherence score for the higher-rating professors is lower, which means the topics of these comments are more dissimilar. Factors that affect the average ratings of professors include grades, difficulty, organization of the content, personality, extra help, passion, knowledge, and fairness. In contrast, "voice" and "recitation" appear only in the lower-rating professor category. This implies that communication is critical to students' experience in classes, and that professors teaching science classes (Physics, Chemistry, Biology) that have recitation sections tend to receive lower average ratings.
Figure 12: Topic keywords of professors with high average ratings vs. professors with low average ratings.

The LIWC2015 toolkit includes the main text analysis module along with a group of predefined internal lexicons. The text analysis module compares each word in the text against the dictionary and identifies which words are associated with which psychologically relevant categories (Pennebaker et al. 2015). It has been used in previous studies for sentiment analysis on text data from social media (e.g., Chen et al. 2020). LIWC2015 provides about a hundred psychologically relevant categories, from which we select around 20 for our analysis.
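LIWC2015 is a commercial tool with proprietary dictionaries, so the following is only a minimal sketch of the dictionary-matching idea it implements: the fraction of a comment's words that fall into each category lexicon. The category names and word lists below are toy placeholders, not the actual LIWC2015 dictionaries.

```python
import re
from collections import Counter

# Toy category lexicon standing in for the proprietary LIWC2015 dictionaries;
# the real tool ships validated word lists, and these entries are illustrative only.
LEXICON = {
    "posemo": {"great", "love", "amazing", "helpful", "awesome"},
    "anx": {"stress", "worried", "nervous", "anxious"},
    "achieve": {"goal", "succeed", "effort", "skill", "learn"},
    "male": {"he", "him", "his"},
    "female": {"she", "her", "hers"},
}

def liwc_style_scores(comment: str) -> dict:
    """Percentage of words in a comment that fall into each category."""
    words = re.findall(r"[a-z']+", comment.lower())
    if not words:
        return {cat: 0.0 for cat in LEXICON}
    counts = Counter(words)
    return {cat: 100.0 * sum(counts[w] for w in vocab) / len(words)
            for cat, vocab in LEXICON.items()}
```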
After we obtain the LIWC scores for each data record, we calculate the average scores and standard deviations. Figure 13 shows the LIWC results for our first comparison group (Group A). Some interesting categories stand out: positive emotion, anxiety, achievement, sexual, and gender references. We run the t-test on these categories and the LIWC grand average scores; a sketch of this comparison is given after the list of findings below. The two-tailed P values for Positive Emotion, Achievement, and Male Reference were all below 0.001 (P < 0.001); by conventional criteria, these differences are statistically significant. We make the following observations:
1. The positive emotion scores for college students are overall higher than the average. The Ivy League students' score is not only higher than the grand average, but also higher than those of the other groups. This indicates that students from top-ranked private schools do not tend to criticize professors more; instead, they praise their professors more often than the other groups.
2. The Achievement score for community college students is higher than for the other groups. Our interpretation is that community college students may have held jobs previously and decided to attend community college because they want to receive more education and learn more skills. They possibly have clearer motivation and goals than the other groups, and therefore tend to discuss achievement-related topics more often in their comments.
3. The Female Reference score for community colleges and the Male Reference score for Ivy League schools stand out. The gender reference scores measure how often students mention gender-related words such as "he" or "she". Because the Rate My Professor website does not record the gender of professors, and we collect a fixed number of comments from each professor, the score generated from gender reference words is the most reliable way available to infer whether professors are male or female. Our analysis indicates that there are more male professors at Ivy League schools and more female lecturers at community colleges. Our interpretation is that at research-oriented institutions, such as Ivy League and Big Ten schools, professors are required to perform research and teach full-time, whereas community colleges have more part-time lecturer positions. For female professors who might have to take care of family and children at the same time, teaching part-time at a community college may be a good option.
4. The anxiety scores are statistically insignificant. Based on the literature, our expectation was that students attending top-ranked private colleges are more likely to feel depression and pressure (Deresiewicz 2014). However, the LIWC results show that the students did not express pressure or anxiety in their reviews. Our interpretation is that these comments were mostly written after the final exams or projects, so the students no longer felt anxious at the time they posted them.
5. The sexual scores are statistically insignificant. The sexual category contains phrases that describe the appearance of the professors, so it could indicate whether appearance affects student ratings and comments. Our study shows no evidence of a connection between appearance and student ratings.
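The significance checks can be reproduced with a standard two-sample t-test; the sketch below uses SciPy's ttest_ind on per-comment category scores from two groups. The Welch variant (equal_var=False) is an assumption about the exact test configuration, while the 0.001 threshold follows the criterion reported above.

```python
from scipy.stats import ttest_ind

def compare_category(scores_a, scores_b, alpha=0.001):
    """Two-tailed two-sample t-test on per-comment LIWC category scores.

    scores_a / scores_b: 1-D arrays, e.g. the Positive Emotion scores of
    Ivy League comments vs. Big Ten comments. Returns the t statistic,
    the two-tailed p-value, and a significance flag at the chosen alpha.
    """
    t_stat, p_value = ttest_ind(scores_a, scores_b, equal_var=False)  # Welch's variant
    return t_stat, p_value, p_value < alpha
```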
Similarly, after obtaining the LIWC scores for each data record, we calculate the average scores and standard deviations. Figure 13 also shows the LIWC results for our second comparison group (Group B). Two interesting categories stand out: Achievement and gender references. We run the t-test on these categories and the LIWC grand average scores. The two-tailed P values for Achievement and Gender Reference were both below 0.01 (P < 0.01); by conventional criteria, the differences are statistically significant. The specific findings are:
1. The Achievement score for high-rating professors is higher than that for low-rating professors. This may indicate that, apart from the general impressions people have of a good professor, students think a good professor also needs to know how to motivate them.
2. The Female Reference score is higher for low-rating professors, while the Male Reference score is higher for high-rating professors. This shows that there are more low-rating female professors and more high-rating male professors, and may imply that students are more critical of female professors than of male professors.

In this paper, we have presented a framework for evaluating the learning experiences of college students from a more subjective perspective. We first partition the data scraped from RateMyProfessor.com into different groups and apply several LDA models to understand the topics of the comments. Furthermore, we perform sentiment analysis using LIWC2015. We discover a number of interesting findings that may help improve the college learning experience for all parties involved, including students, professors, and administrators. There are three possible directions for future work. First, we can investigate a fine-grained partition strategy that divides the data by departments, subjects, or courses. Second, we can track the comments over time. Our current dataset contains comments from 2018 to 2020, and most of the comments are posted in May and December, at the ends of the spring and fall semesters. With more data over time, we may study changes in individual professors' teaching styles and look at this problem from a temporal point of view. Lastly, many in-person lectures have been switched to online lectures due to COVID-19 and quarantine. A valuable study would be to first determine which courses were transformed from in-person to online and then understand the changes in the students' experiences.

References
Rate My Professors About Page.
Feature and Opinion Mining for Customer Review Summarization.
How Instructor Immediacy Behaviors Affect Student Satisfaction and Learning in Web-Based Courses.
Latent Dirichlet Allocation.
Dynamic Topic Models.
In the Eyes of the Beholder: Analyzing Social Media Use of Neutral and Controversial Terms for COVID-19.
Don't Send Your Kid to the Ivy League.
Unsupervised Named-Entity Extraction from the Web: An Experimental Study.
Mining and Summarizing Customer Reviews.
Introduction to WordNet: An On-line Lexical Database.
The Impact of US News and World Report College Rankings on Admission Outcomes and Pricing Decisions at Selective Private Institutions.
How U.S. News Calculated the 2021 Best Colleges Rankings.
Distributed Algorithms for Topic Models.
Does ratemyprofessor.com really rate my professor? Assessment & Evaluation in Higher Education.
Analysis of Online Student Ratings of University Faculty.
The Development and Psychometric Properties of LIWC2015. University of Texas.
Rate My Professor: Online Evaluations of Psychology Instructors.
Modeling Online Reviews with Multi-grain Topic Models.