I will be creating a machine learning model that predicts whether a comment is negative or positive, and what grade that comment earned. Multiple ML algorithms and libraries will be used to achieve this. I will go over how I retrieved, cleaned, structured, and fit the data to the model to get a good accuracy rating, and I will attempt to deploy the model online so users can input a comment and get a rating as output. This is my first time messing around with ML, so I am learning as I go. The task is text analysis/classification, using a large database of comments and ratings. We first need to get our data, clean our data, analyze our data, classify our data, and label our data, then figure out which ML (machine learning) algorithm we need to use.
WORKFLOW: 1) Research Question and Getting Data 2) Clean Data and Analyze 3) NLP and Insights 4) Machine Learning Model
1) Research Question and Getting Data
The best way to answer these questions is to directly ask the students, but that is rather time-consuming.
So why not utilize other sources to get our secondary data? Fortunately, we live in the age of the internet, and there is one popular website that students nationwide use to rate professors and schools, and to decide which professor to choose when registering based on those ratings.
Ratemyprofessor.com, where pools of students share their experiences with each professor, is a good source to answer our research question.
This is the website we will be scraping for the comment prediction model (sentiment analysis) and for analysis.
This is a typical comment format for a professor. The student gives them a quality rating of 1-5 (5 being higher quality) and a difficulty rating of 1-5 (5 being the most difficult). They can also leave other information such as the grade received, whether they would take them again, whether the professor was “Awesome”, “Average”, or “Awful”, some attributes they can select (Respected, Inspirational, Amazing Lectures, etc.), and the comment itself.
This is good data to answer our questions, so we need to extract ALL the comments for each GMU professor. We will be using a Python script to scrape the data and export it to an Excel file, because otherwise it would take forever to copy and paste… and besides, we live in the 21st century.
We need to write a bot to scrape the website and extract the data to an excel (or JSON) file. Before we write the bot, we need to inspect the website and its HTML code to get an understanding of how the content of the website is laid out, so we can give proper direction to our bot to navigate the website.
This is the HTML portion of the website, I will not cover how to navigate the HTML.
We will be using Python and its libraries to write up the bot script: requests, BeautifulSoup, re (regular expressions), and pandas.
BeautifulSoup – “Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating.”
Regular Expression (RE) – “is a sequence of characters that define a search pattern. Usually such patterns are used by string-searching algorithms for “find” or “find and replace” operations on strings.”
Pandas – “software library written for the Python programming language for data manipulation and analysis. It offers data structures and operations for manipulating numerical tables and time series.” Basically, like an Excel table.
We will be grabbing all the highlighted fields and ONLY top 5 comments.
This is the Python scrape code, broken down into 10 segments, with a rough explanation of what each segment of code does.
1) We import the required libraries needed to write the code.
2) The headers variable is what data the website (Ratemyprofessor) receives when our bot visits it. The page variable is what we will use to generate the links the bot will use to extract data. This block of code appends page numbers to the end of the link and puts each link in a list for our bot to use, generating 885 links from which the bot will extract each GMU professor's individual link.
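The link-generation step above can be sketched as follows. The real search URL and its query parameters aren't shown in this post, so `BASE_URL` below is a hypothetical placeholder:

```python
# Sketch of the paginated-link generation step. "BASE_URL" is a hypothetical
# placeholder; the real site's URL format may differ.
BASE_URL = "https://www.ratemyprofessors.com/search/?query=gmu&page="

def build_page_links(num_pages):
    """Return one search-results URL per page number (1..num_pages)."""
    return [BASE_URL + str(page) for page in range(1, num_pages + 1)]

links = build_page_links(885)
print(len(links))   # 885 links for the bot to visit
```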
3) The bot then loops through our list of 885 links and visits each one.
4) In each link it visits, it extracts each professor's individual link and stores it in another list.
5) Now that we have each professor's individual link, we will create a DataFrame with column names such as ID, Value, Quality, etc. This DataFrame will store all the data we extract from each professor's link.
6) This block tells the bot to loop through each professor's individual link, visit it, and extract the values we are looking for. After each profile visit and data extraction, the bot saves the DataFrame as an Excel file on our desktop. The bot is also instructed to wait 2 seconds between visits so as not to overwhelm the website we are visiting (it's good practice to not be a jerk).
7) This section extracts the grade reported in each comment.
8) This section extracts the quality and difficulty values.
Here is the full script in HTML format: SCRAPE_SCRIPT
9) This section grabs all the remaining fields and stores the data in its respective column in the DataFrame created earlier.
10) This piece of code is a fail-safe that lets the bot move on to the next link if the link it visited is broken or unavailable.
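Several of the segments above (the main loop, the 2-second wait, and the fail-safe) can be sketched together. This is only an illustration of the shape of the loop: the network call is factored into a `fetch` callable so the fail-safe logic can be demonstrated without hitting the real site, and `parse_profile` is a hypothetical stand-in for the extraction code:

```python
import time
import pandas as pd

def scrape_all(links, fetch, parse_profile, delay=2):
    """Visit each link, extract a row of data, skip broken links."""
    rows = []
    for url in links:
        try:
            html = fetch(url)               # e.g. requests.get(url, headers=...).text
            rows.append(parse_profile(html))
        except Exception:
            continue                        # fail-safe: skip broken/unavailable links
        time.sleep(delay)                   # be polite: wait between visits
    return pd.DataFrame(rows)

# Tiny demonstration with stub functions (no network access):
fake_pages = {"ok1": "<html>1</html>", "ok2": "<html>2</html>"}

def fetch(url):
    return fake_pages[url]                  # raises KeyError for a "broken" link

def parse_profile(html):
    return {"ID": html, "Quality": 5}

df = scrape_all(["ok1", "broken", "ok2"], fetch, parse_profile, delay=0)
print(len(df))   # 2 — the broken link was skipped
```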
Now we have a structured dataset in the form of an Excel file that contains the extracted fields. We now must transpose this dataset so we can get each comment and its attributes.
So, the only columns that we will keep are [ID, Value, Quality, Difficulty, Grade, Comment, Link].
NOTE: We got the values for the ID column by splitting the LINK column.
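The ID-from-link step can be sketched as below. The real link format isn't shown in this post, so the example assumes a trailing numeric ID (a common pattern like `...?tid=12345`):

```python
import pandas as pd

# Sketch of deriving the ID column by splitting the Link column.
# The "tid=" pattern is an assumption about the link format.
df = pd.DataFrame({"Link": [
    "https://www.ratemyprofessors.com/ShowRatings.jsp?tid=12345",
    "https://www.ratemyprofessors.com/ShowRatings.jsp?tid=67890",
]})
df["ID"] = df["Link"].str.split("tid=").str[-1]
print(df["ID"].tolist())   # ['12345', '67890']
```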
Here is the DATASET_RATINGS_RAW
Now we have the data in an Excel file. The ID column is the ID of each profile, the Value column is the rating (Awesome, Average, Awful), and so on. We even have the link for each profile in case we need to go back and double-check some information. This Excel file has 137,222 entries; note this is raw data. So now we need to clean it using Python, but before we do that, we will analyze the raw, uncleaned data to get an understanding of what the data contains, so we know what to clean and how to clean it.
Here is the dataset: DATASET_COMMENT_RAW
Just by scanning through the raw data, you can see that some values are missing and that comments range from a single word to a paragraph, and even a smiley face. So we will analyze it using Python.
The chart on the left is the count of values (Awesome, Average, Awful, and empty values): N = 114,629, not including the 27,500 empty values. Since we are making an ML model that predicts whether the text is Awesome, Average, or Awful, we won't include empty values in our data. As you can see, the value proportion is imbalanced: 64.36% of the data has the value Awesome, and only 11.65% has the value Awful. This can create a huge bias in our model. The chart on the right is the count of grades (A+, A, A-, B+, B, B-, C+, C, C-, D+, D, D-, F, Pass, Fail). Note that if we are to create another model that predicts the grade for a given comment, then we must address the biases in our grading distribution and drop the “Pass”, “Fail”, and empty values, since Pass and Fail are redundant.
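The filtering described above (dropping empty labels and the redundant Pass/Fail grades) can be sketched with pandas. The toy data below is illustrative only; the real counts are given in the text:

```python
import pandas as pd

# Sketch of dropping rows that can't be used: empty Value labels and
# the redundant "Pass"/"Fail" grade values. Toy data for illustration.
df = pd.DataFrame({
    "Value": ["Awesome", "Awful", None, "Average", "Awesome", "Awful"],
    "Grade": ["A", "A-", "B", "B+", "Pass", "Fail"],
})
df = df.dropna(subset=["Value"])               # drop empty Value labels
df = df[~df["Grade"].isin(["Pass", "Fail"])]   # drop redundant grade values
print(df["Value"].value_counts())
```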
Now let's measure the length of each comment by counting its characters. As you can see, there is a big difference between comments in terms of character count. This matters when we are trying to classify the text: we need to be somewhat consistent in character length when feeding our data to the model for training.
Now, looking at the distribution of comment length, we can see most comments fall between 150 and 350 characters. In the left chart (a treemap), the biggest block is 300-350 (the max), and the smallest is 600+ (the min). This is important to know because we need to decide which comments to keep based on length. But first, let's look at the comment length distribution by label (Awesome, Average, and Awful).
Counting the characters for each comment in Awesome, Average, and Awful separately, we can see the distributions are fairly equal, peaking between 251 and 350 characters. But in terms of the frequency of each group, Awesome has the majority of the count, as expected, since 64.36% of the data has the value Awesome. We must keep this in mind when cleaning our data.
There are 4,074,858 total words in these comments, and 31,801 unique words (duplicates removed). The WordCloud image above shows the words that appeared most often in the comments. (Download File Here)
2) Clean & Analyze Data
Now the next step is to clean the data into a standard format for further and accurate analysis.
We lowercase all the comments and then remove all symbols, digits, duplicates, and empty values from the comment column. Then we remove values in the grade column that are too vague to give us a grade, dropping “Pass” and “Fail”. This gets rid of redundant and useless data. We also remove extra white space in the comments, which cleans the text and reduces character-count noise.
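A minimal sketch of that cleaning pass, using pandas and regular expressions (the exact patterns used in the original script aren't shown, so these are reasonable stand-ins):

```python
import re
import pandas as pd

# Sketch of the comment-cleaning pass: lowercase, strip symbols and digits,
# collapse extra whitespace, then drop duplicates and empty values.
def clean_comment(text):
    text = text.lower()
    text = re.sub(r"[^a-z\s]", "", text)      # remove symbols and digits
    text = re.sub(r"\s+", " ", text).strip()  # collapse extra whitespace
    return text

df = pd.DataFrame({"Comment": ["Great   teacher!!! 10/10", "great teacher", ""]})
df["Comment"] = df["Comment"].apply(clean_comment)
df = df[df["Comment"] != ""].drop_duplicates(subset="Comment")
print(df["Comment"].tolist())   # ['great teacher']
```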
We then check the length of the comments and their frequency (the same way we did for the dirty data). We still see an uneven distribution of the count of values (Awesome, Average, and Awful).
There is also an imbalanced proportion among the grades as well.
We go through the data frame, get the length of each comment and its frequency, and store them in a dictionary (a key–value mapping). Then we plot the dictionary using pandas and display the MODE (the value that appears most often). This will help us balance our data when creating our prediction model.
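That length-to-frequency step can be sketched like this (toy comments for illustration; the real data's mode of 350 is reported below):

```python
import pandas as pd

# Sketch of building the length -> frequency dictionary and finding the mode.
comments = ["short one", "a" * 350, "b" * 350, "c" * 120]
lengths = pd.Series([len(c) for c in comments])
freq = lengths.value_counts().to_dict()   # {length: how many comments have it}
mode = lengths.mode()[0]
print(mode)   # 350
```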
Y-axis = Length of comment
X-axis = Frequency
Mode = 350
TOP = raw data frequency distribution
BOTTOM = clean data charts
Even after cleaning, the comment-length distribution is about the same as for the raw data. We will remove any comment UNDER 84 characters or ABOVE 345.
Since the “Average” value has the fewest entries, we will remove a total of 35,064 rows with the value “Awesome” and 8,523 with the value “Awful”, for a total of 43,587 rows removed from our dataset. This evens out the proportion of each value to avoid a biased prediction model. We will use Python for this.
Python script to slice dataframe
Dataframe is now proportionate
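The downsampling step can be sketched with pandas: sample each class down to the size of the smallest class. The toy data below stands in for the real counts given above:

```python
import pandas as pd

# Sketch of downsampling the majority classes so each label has as many
# rows as the smallest class ("Average"). Toy data for illustration.
df = pd.DataFrame({"Value": ["Awesome"] * 6 + ["Average"] * 2 + ["Awful"] * 3})
n = df["Value"].value_counts().min()                 # size of the smallest class
balanced = df.groupby("Value").sample(n=n, random_state=0)
print(balanced["Value"].value_counts().to_dict())    # each class now has n rows
```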
The chart below is an analysis of how students rated the professor based on their grades. Green columns are the Quality rating (1 = low, 5 = high), and red columns are the Difficulty (1 = easy, 5 = very difficult). Students who got a favorable grade rate the professor higher in Quality and lower in Difficulty, and students with lower grades rate the professor lower in Quality and higher in Difficulty. (Sample Data)
The chart above is a frequency area chart of students who rated “Awesome”, “Average”, or “Awful” based on the grade received. Students who got favorable grades rated the professor more positively, as opposed to students who got lower grades. Note the dramatic drop in the green area as grades decrease. (Sample Data)
Quality goes down as Difficulty goes up. Professors with amazing lectures are perceived as higher quality and lower difficulty, and tough-grading professors are considered lower quality and higher difficulty.
3) NLP and Insights
We will be using spaCy in Python. spaCy is an open-source software library for advanced natural language processing.
We concatenate all 19,489 comments, run them through spaCy, and generate word clouds for Dates, Locations, Persons, Languages, and Organizations. spaCy is smart enough to recognize these entities.
You can see that spaCy recognizes entities: Shakespeare is recognized as a Person. So now we have our dataset of 9,195 entries. We will now use a word cloud (basically, the frequency of each entry) to get a better visualization of the content of the comments.
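The entity-to-frequency step can be sketched as below. The spaCy call itself requires a downloaded model, so it is shown in comments, and `entities` is a small hypothetical sample of its output rather than real results:

```python
from collections import Counter

# The spaCy extraction (assuming the small English model is installed) would be:
#   import spacy
#   nlp = spacy.load("en_core_web_sm")
#   doc = nlp(all_comments_text)
#   entities = [(ent.text, ent.label_) for ent in doc.ents]
# Below, `entities` is a hypothetical sample of that output.
entities = [
    ("Shakespeare", "PERSON"), ("China", "GPE"),
    ("India", "GPE"), ("Chegg", "ORG"), ("China", "GPE"),
]
# Keep only the entity types we care about, then count frequencies for the cloud.
keep = {"PERSON", "GPE", "ORG", "DATE", "LANGUAGE"}
freq = Counter(text for text, label in entities if label in keep)
print(freq.most_common(2))   # [('China', 2), ('Shakespeare', 1)]
```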
Word Cloud for people mentioned in the comments
George Mason University is a very diverse school with majority of students from China and India. So, it’s not surprising that these countries have a high frequency count in the comments.
Not surprising that Chegg is mentioned a lot in the comments, since everyone uses that website to get homework help.
Using spaCy in Python, we looped through all the comments and found words that are either misspelled or unrecognized (mostly nouns). Interestingly, COVID had the most weight. Download Image
4) Machine (Supervised) Learning
Now we will create a text classification model using scikit-learn in Python. This model will predict whether a student's review of a professor is negative or positive (Awesome, Average, or Awful).
We used scikit-learn in Python to train our text classification model. Click Here to view the Python script in HTML format and see how we trained it.
The model accuracy is ~0.698, almost 70%. Good enough for now.
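The author's actual training script is in the linked file; as a minimal sketch, one common scikit-learn approach for this kind of three-class text classification is TF-IDF features plus logistic regression. The toy comments below only illustrate the fit/predict shape, not the real dataset:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical training data; the real model trains on ~19k cleaned comments.
comments = [
    "amazing lectures, best professor ever",
    "decent class, average workload",
    "terrible grader, avoid this class",
    "great teacher, learned a lot",
    "okay professor, nothing special",
    "awful experience, very unfair exams",
]
labels = ["Awesome", "Average", "Awful", "Awesome", "Average", "Awful"]

# TF-IDF turns each comment into a weighted word-frequency vector;
# logistic regression then learns one decision boundary per class.
model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
model.fit(comments, labels)
print(model.predict(["best lectures ever"]))
```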