I will be creating a Machine Learning (ML) model that predicts whether a comment is negative or positive, and what grade that comment earned. Multiple ML algorithms and libraries will be used to achieve this. I will go over how I retrieved, cleaned, structured, and fit the data to the model to get a good accuracy rating, and I will attempt to deploy the model online so users can input a comment and get the rating as an output. This is my first time working with ML, so I am learning as I go. The task is text analysis/classification, using a large database of comments and ratings. We first need to get our data, clean it, analyze it, classify and label it, and then figure out which ML algorithm to use.

WORKFLOW: 1) Research Question and Getting Data 2) Clean Data and Analyze 3) NLP and Insights 4) Machine Learning Model

1) Research Question and Getting Data

The best way to answer these questions is to ask the students directly, but that is rather time-consuming. So why not use other sources to get secondary data? Fortunately, we live in the age of the internet, and there is one popular website that students nationwide use to rate professors and schools, and to decide which professor to register with based on those ratings. Ratemyprofessor.com, where pools of students share their experiences with each professor, is a good source to answer our research question.

This is a typical comment format for a professor. The student gives them a quality rating of 1-5 (5 being the highest quality) and a difficulty rating of 1-5 (5 being the most difficult). They can also leave other information, such as the grade received, whether they would take the professor again, whether the professor was “Awesome”, “Average”, or “Awful”, some selectable attributes (Respected, Inspirational, Amazing Lectures, etc.), and the comment itself.

This is good data to answer our questions, so we need to extract ALL the comments for each GMU professor. We will use a Python script to scrape the data and export it to an Excel file because otherwise it would take forever to copy and paste… and besides, we live in the 21st century.

We need to write a bot to scrape the website and extract the data to an Excel (or JSON) file. Before we write the bot, we need to inspect the website's HTML to understand how its content is laid out, so we can give our bot proper directions for navigating it.

This is the HTML portion of the website; I will not cover how to navigate the HTML.
We will use Python and its libraries to write the bot script: requests, BeautifulSoup, regular expressions, and pandas.

BeautifulSoup – “Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree.”

Regular Expression (RE) – “is a sequence of characters that define a search pattern. Usually such patterns are used by string-searching algorithms for “find” or “find and replace” operations on strings.”

Pandas – “software library written for the Python programming language for data manipulation and analysis. It offers data structures and operations for manipulating numerical tables and time series.” Basically, it's like an Excel table.

This is the Python scrape code broken down into 10 segments, where I give a rough explanation of what each segment of code does.

1) We import the required libraries needed to write the code.
  - The headers variable is the data the website (Ratemyprofessor) receives when our bot visits it.
  - The page variable is what we use to generate the links the bot will visit to extract data.

2) This block of code appends page numbers to the end of the link and puts each link in a list for our bot to use. It generates 885 links, which our bot will use to extract each GMU professor's individual link.
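The pagination step can be sketched as follows. The base URL below is illustrative only (the real URL pattern and query parameter come from inspecting the site):

```python
# Build the list of paginated listing URLs the bot will visit.
# NOTE: this base URL is a stand-in; the real pattern comes from
# inspecting Ratemyprofessor's own pagination links.
BASE_URL = "https://www.ratemyprofessors.com/search.jsp?query=GMU&page="

def build_page_links(num_pages):
    """Return one listing URL per page, for pages 1..num_pages."""
    return [BASE_URL + str(page) for page in range(1, num_pages + 1)]

page_links = build_page_links(885)
print(len(page_links))   # 885
```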

3) The bot then loops through our list of 885 links and visits each one.

4) In each link it visits, it extracts each professor's own individual link and stores it in another list.
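This extraction step might look like the sketch below. The HTML string is a stand-in for a real listing page (the site's actual markup and class names will differ), so the example stays offline:

```python
import re
from bs4 import BeautifulSoup

# Stand-in for the HTML returned by one listing page.
listing_html = """
<ul class="listings">
  <li><a href="/ShowRatings.jsp?tid=12345">Prof A</a></li>
  <li><a href="/ShowRatings.jsp?tid=67890">Prof B</a></li>
  <li><a href="/about">Not a professor link</a></li>
</ul>
"""

def extract_professor_links(html):
    """Pull each professor's individual profile link out of a listing page."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    # Professor profile links carry a tid=<number> query parameter here.
    for a in soup.find_all("a", href=re.compile(r"tid=\d+")):
        links.append("https://www.ratemyprofessors.com" + a["href"])
    return links

professor_links = extract_professor_links(listing_html)
print(professor_links)
```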

5) Now that we have each professor's individual link, we create a DataFrame with column names such as ID, Value, Quality, etc. This DataFrame will store all the data we extract from the professors' links.

6) This block tells the bot to loop through each professor's individual link, visit it, and extract the values we are looking for. After each profile visit and data extraction, the bot saves the DataFrame as an Excel file on our desktop. The bot also waits 2 seconds between visits so as not to overwhelm the website (it's good practice not to be a jerk).
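A rough sketch of that loop is below. The `fetch` function is a placeholder for the real per-profile requests + BeautifulSoup parsing, and the column names come from the DataFrame described above; the demo uses a stub fetcher so it runs without touching the network:

```python
import time
import pandas as pd

COLUMNS = ["ID", "Value", "Quality", "Difficulty", "Grade", "Comment", "Link"]

def scrape_profiles(links, fetch, delay=2, save_path=None):
    """Visit each professor link and accumulate one row per profile.

    `fetch` is a function(link) -> dict of extracted fields; in the real
    script it wraps requests + BeautifulSoup. Sleeping between visits
    keeps the load on the site polite.
    """
    rows = []
    for link in links:
        rows.append(fetch(link))
        time.sleep(delay)
    df = pd.DataFrame(rows, columns=COLUMNS)
    if save_path:                       # e.g. "Data.xlsx" on the desktop
        df.to_excel(save_path, index=False)
    return df

# Demo with a stub fetcher (no network, no delay):
stub = lambda link: {"ID": 1, "Value": "Awesome", "Quality": 5,
                     "Difficulty": 2, "Grade": "A",
                     "Comment": "Great professor!", "Link": link}
df = scrape_profiles(["u1", "u2"], stub, delay=0)
print(df.shape)   # (2, 7)
```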

7) This section grabs all the grades for each comment.

8) This section grabs the quality and difficulty values.

9) This section grabs all the rest and stores the data in its respective column in the DataFrame created earlier.

10) This piece of code is a fail-safe that lets the bot move on to the next link if the link it visited is broken or unavailable.
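The fail-safe is essentially a try/except wrapped around each visit. A minimal sketch (the `demo_fetch` function is hypothetical, standing in for the real request-and-parse step):

```python
def safe_fetch(link, fetch):
    """Return the extracted row, or None if the link is broken/unavailable."""
    try:
        return fetch(link)
    except Exception as exc:   # network errors, parse errors, missing fields...
        print(f"Skipping {link}: {exc}")
        return None

def demo_fetch(link):
    """Hypothetical fetcher that fails on one link, to show the fail-safe."""
    if link == "bad":
        raise ValueError("broken link")
    return {"Link": link}

rows = [row for row in (safe_fetch(l, demo_fetch) for l in ["good", "bad"]) if row]
print(len(rows))   # 1 -- the broken link was skipped, not fatal
```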

Now we have the data in an Excel file (the data file). The ID column is the ID of each profile, the Value column is the rating (Awesome, Average, Awful), and so on. We even have the link for each profile in case we need to go back and double-check some information. This Excel data file has 137,222 entries; note this is raw data. Now we need to clean it using Python, but before we do that, we will analyze the raw, uncleaned data to understand what it contains, so we know what to clean and how to clean it.

As you can see by just scanning through the raw data, some values are missing, and comments range from a single word to a paragraph to a smiley face. So we will analyze it using Python.

The chart on the left is the count of values (Awesome, Average, Awful, and empty values). N = 114,629, not including the 27,500 empty values. Since we are making an ML model that predicts whether the text is Awesome, Average, or Awful, we won't include empty values in our data. As you can see, the value proportions are imbalanced: 64.36% of the data has the value Awesome, and only 11.65% has the value Awful. This can create a huge bias in our model. We will attend to this later on.
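Checking that imbalance is a one-liner in pandas. The toy series below stands in for the real Value column (which has 114,629 rows), with made-up counts chosen to mirror the rough 64% / 24% / 12% split:

```python
import pandas as pd

# Toy stand-in for the Value column; the real column has N = 114,629 rows.
values = pd.Series(["Awesome"] * 64 + ["Average"] * 24 + ["Awful"] * 12)

counts = values.value_counts()
share = values.value_counts(normalize=True) * 100   # percentage per label
print(counts.to_dict())
print(share.round(2).to_dict())   # Awesome dominates, mirroring the imbalance
```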

The chart on the right is the count of grades (A+, A, A-, B+, B, B-, C+, C, C-, D+, D, D-, F, Pass, Fail). Note that if we are to create another model that predicts the grade for a given comment, we must address the biases in our grading distribution and drop the “Pass”, “Fail”, and empty values, since Pass and Fail are redundant.
Now let's measure the length of each comment by counting its characters. As you can see, there are big differences between comments in terms of character count. This matters when we are classifying the text: we need to be somewhat consistent in character length when feeding our data to the model for training.
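Measuring comment length is straightforward with pandas string methods; a small sketch with made-up comments:

```python
import pandas as pd

comments = pd.DataFrame({"Comment": ["Great!", "Terrible lectures, avoid.", ":)"]})
comments["Length"] = comments["Comment"].str.len()   # characters per comment
print(comments["Length"].tolist())   # [6, 25, 2]
```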
Looking at the distribution of comment lengths, we can see most comments fall between 150 and 350 characters. In the left chart (a treemap), the biggest block is 300-350, making it the most common range, and the smallest is 600+. This is important because we need to decide which comments to keep based on length. But first, let's look at the comment length distribution for each label (Awesome, Average, and Awful).
Counting the characters for each comment in Awesome, Average, and Awful separately, we can see the distributions are pretty similar, peaking between 251 and 350 characters. But in terms of frequency, Awesome has the majority of the count, as expected, since Awesome makes up 64.36% of the data. We must keep this in mind when cleaning our data.
There are 4,074,858 total words in these comments and 31,801 unique words (after removing duplicates). The WordCloud image above shows the words that appear most often in the comments. (Download File Here)

2) Clean & Analyze Data

Now the next step is to clean the data into a standard format for further, accurate analysis.
We lowercase all the comments and then remove all symbols, digits, duplicates, and empty values from the comment column. Then we remove values in the grade column that are too vague to give us a grade, dropping the “Pass” and “Fail” values. This gets rid of redundant and useless data. Removing extra white space in the comments also cleans the text and reduces character-count noise.
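Those cleaning steps can be sketched as a single pandas pipeline. The toy DataFrame below is illustrative; the real column names come from the scraped data file:

```python
import pandas as pd

def clean(df):
    """Standardize comments and drop rows we can't grade or classify."""
    df = df.copy()
    df["Comment"] = (df["Comment"]
                     .str.lower()                               # lowercase everything
                     .str.replace(r"[^a-z\s]", "", regex=True)  # strip symbols & digits
                     .str.replace(r"\s+", " ", regex=True)      # collapse extra whitespace
                     .str.strip())
    df = df[~df["Grade"].isin(["Pass", "Fail"])]                # too vague to keep
    df = df.replace("", pd.NA).dropna(subset=["Comment"])       # drop now-empty comments
    df = df.drop_duplicates(subset=["Comment"])                 # drop duplicate comments
    return df

raw = pd.DataFrame({
    "Comment": ["GREAT prof!!! 10/10", "great prof", "great prof", "   ", "Ok."],
    "Grade":   ["A",                   "A",          "B",          "C",   "Pass"],
})
cleaned = clean(raw)
print(cleaned)   # one row survives: "great prof" with grade A
```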

We then check the length of the comments and their frequency (the same way we did for the dirty data). We still see an uneven distribution of the count values (Awesome, Average, and Awful).

There is also an imbalanced proportion for the grades.

We go through the data frame, get the length of each comment and its frequency, and store them in a dictionary (a key-value mapping), then plot the dictionary using pandas and have it display the MODE (the value that appears most often). This will help us balance our data when creating our prediction model.
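A small sketch of that length-frequency dictionary and mode, using made-up comments:

```python
import pandas as pd

comments = pd.Series(["short one", "a medium comment", "a medium comment", "tiny"])
lengths = comments.str.len()

freq = lengths.value_counts().to_dict()   # length -> how often it occurs
mode = lengths.mode()[0]                  # the most common comment length
print(freq, mode)
```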

Y-axis = Length of comment
X-axis = Frequency
Mode = 350

The two top charts show the raw data's frequency distribution, and the bottom two show the clean data's. Before and after cleaning, the distribution for each value remains about the same.

Even after cleaning, the comment lengths are distributed about the same as in the raw data. We will remove any comment UNDER 84 characters or ABOVE 345.
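The length filter is a boolean mask over the DataFrame; a minimal sketch with synthetic comments of 50, 200, and 400 characters:

```python
import pandas as pd

df = pd.DataFrame({"Comment": ["x" * 50, "y" * 200, "z" * 400]})
lengths = df["Comment"].str.len()

# Keep only comments inside the 84-345 character window.
kept = df[(lengths >= 84) & (lengths <= 345)]
print(len(kept))   # 1 -- only the 200-character comment survives
```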

Now our goal is to make sure the weight of each value is proportionate to the others, to avoid a biased model when creating our prediction model. Since the “Average” value has the fewest entries, we will remove 35,064 rows with the “Awesome” value and 8,523 with the “Awful” value, a total of 43,587 rows, which evens out the proportion of each value.

(NOTE: We have to remove comments BASED on the length of the comment – we will use python to achieve this)

Now our data is equally proportioned! All we did was slice the DataFrame.
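One way to do that slicing is to downsample every label to the size of the smallest one; this sketch uses a toy DataFrame, and the real script additionally chose which rows to drop based on comment length:

```python
import pandas as pd

df = pd.DataFrame({"Value": ["Awesome"] * 6 + ["Average"] * 2 + ["Awful"] * 3})

# Downsample every label to the size of the smallest one ("Average" here)
# so each label carries equal weight when the model trains.
n = df["Value"].value_counts().min()
balanced = df.groupby("Value", group_keys=False).head(n)
print(balanced["Value"].value_counts().to_dict())
```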

You can find some insights HERE

3) NLP & Insights

Using spaCy in Python, we looped through all the comments and found words that are either misspelled or unrecognized (mostly nouns). Interestingly, COVID had the most weight. Download Image
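The original pass used spaCy's vocabulary; a simplified, spaCy-free sketch of the same idea, flagging tokens missing from a reference word list, looks like this. The tiny `vocab` set and sample comments below are made up for illustration:

```python
import re
from collections import Counter

# Stand-in vocabulary; the real check relied on spaCy's language model.
vocab = {"the", "professor", "was", "great", "online", "class", "during"}

comments = ["The professor was graet", "Online class during covid", "COVID covid"]

unknown = Counter()
for comment in comments:
    for token in re.findall(r"[a-z]+", comment.lower()):
        if token not in vocab:          # misspelled or unrecognized word
            unknown[token] += 1

print(unknown.most_common(3))   # 'covid' carries the most weight here
```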

4) Machine Learning Model

Click the slide to play around with the model!