I will be creating a Machine Learning model that predicts whether a comment is negative or positive, and what grade the commenter earned. Multiple ML algorithms and libraries will be used to achieve this. I will go over how I retrieved, cleaned, structured, and fit the data to the model to get a good accuracy rating, and I will attempt to deploy the model online, where users can input a comment and get a rating as output. This is my first time experimenting with ML, so I am learning as I go. The task is text analysis/classification, using a large dataset of comments and ratings. We first need to get our data, clean it, analyze it, classify and label it, and then figure out which ML (Machine Learning) algorithm to use.
WORKFLOW: 1) Research Question and Getting Data 2) Clean Data and Analyze 3) NLP and Insights 4) Machine Learning Model
1) Research Question and Getting Data
The best way to answer these questions is to ask the students directly, but that is rather time-consuming. So why not utilize other sources to get secondary data? Fortunately, we live in the age of the internet, and there is one popular website that students nationwide use to rate professors and schools, and to decide which professor to register with based on those ratings. Ratemyprofessor.com, where pools of students share their experiences with each professor, is a good source for answering our research question.
This is a typical comment format for a professor. The student gives them a quality rating of 1-5 (5 being the highest quality) and a difficulty rating of 1-5 (5 being the most difficult). They can also leave other information such as the grade received, whether they would take the professor again, whether the professor was “Awesome”, “Average”, or “Awful”, some selectable attributes (Respected, Inspirational, Amazing Lectures, etc.), and the comment itself.
This is good data for answering our questions, so we need to extract ALL the comments for each GMU professor. We will use a Python script to scrape the data and export it to an Excel file because otherwise it would take forever to copy and paste… and besides, we live in the 21st century.
This is the HTML portion of the website; I will not cover how to navigate the HTML.
BeautifulSoup – “Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating.”
Regular Expression (RE) – “is a sequence of characters that define a search pattern. Usually such patterns are used by string-searching algorithms for “find” or “find and replace” operations on strings.”
Pandas – “software library written for the Python programming language for data manipulation and analysis. It offers data structures and operations for manipulating numerical tables and time series.” Basically, like an Excel table.
- We import the required libraries needed to write the code
- The headers variable holds the request information the website (Ratemyprofessor) receives when our bot visits it.
- The page variable is what we will use to generate the links the bot will use to extract data.
- This block of code adds page numbers to the end of the link and puts each link in a list for our bot to use. It generates 885 links that our bot will use to extract each GMU professor's individual link.
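The link-generation step above can be sketched as follows. This is a minimal illustration, not the original script: the base URL and the `page` query parameter are assumptions about how the site's listing pages are numbered, and the real GMU listing URL would go in the placeholder.

```python
# Placeholder base URL (assumption for illustration): the real GMU
# school-listing URL on Ratemyprofessor goes here.
BASE_URL = "https://www.ratemyprofessors.com/SCHOOL_LISTING"

# The headers variable: request info the website receives from our bot.
# A browser-style User-Agent string is a common choice.
HEADERS = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}

def build_page_links(base_url: str, n_pages: int) -> list:
    """Append page numbers 1..n_pages to the base link, one link per page."""
    return [f"{base_url}&page={i}" for i in range(1, n_pages + 1)]

links = build_page_links(BASE_URL, 885)
print(len(links))  # 885 listing pages for the bot to crawl
```

Each generated link points at one page of the paginated professor listing; the bot will walk this list in order.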
3) The bot then loops through our list of 885 links and visits each one.
4) In each link it visits, it extracts each professor's own individual link and stores it in another list.
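The per-page extraction in steps 3–4 can be illustrated with a small parsing function. The post uses BeautifulSoup for this; to keep the sketch self-contained, a regular expression over the raw HTML shows the same idea, and the `/professor/<id>` href pattern is an assumption about the site's markup, not the confirmed structure.

```python
import re

def extract_professor_links(html: str) -> list:
    """Pull each professor's profile link out of one listing page's HTML.
    The href pattern /professor/<digits> is an assumed example pattern."""
    return re.findall(r'href="(/professor/\d+)"', html)

# In the real script the bot would fetch each of the 885 listing pages,
# call a routine like this on the response, and append the results to
# one big list of professor links.
```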
5) Now that we have each individual professor link, we will create a DataFrame with column names such as ID, Value, Quality, etc. This DataFrame will store all the data we extract from each professor's link.
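Creating that empty DataFrame looks like this. The exact column names beyond ID, Value, and Quality are assumptions based on the fields described earlier (grade, difficulty, the comment itself).

```python
import pandas as pd

# Column names: ID, Value, and Quality come from the text above; the
# rest are assumed from the fields a Ratemyprofessor comment carries.
columns = ["ID", "Value", "Quality", "Difficulty", "Grade", "Comment"]
df = pd.DataFrame(columns=columns)
print(df.shape)  # (0, 6): an empty table, ready to be filled row by row
```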
6) This block tells the bot to loop through each professor's individual link, visit it, and extract the values we are looking for. After each profile visit and data extraction, the bot is instructed to save the DataFrame as an Excel file on our desktop. The bot is also instructed to wait 2 seconds between visits so as not to overwhelm the website (it's good practice to not be a jerk).
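The shape of that main loop can be sketched as below. This is not the original script: the `fetch` and `parse` callables are stand-ins for the real `requests.get(...)` call and BeautifulSoup parsing, injected so the loop itself is easy to follow (and to test) without touching the network.

```python
import time
import pandas as pd

def crawl(links, fetch, parse, delay=2.0):
    """Visit each professor link, parse the page into a row dict, and
    collect everything into a DataFrame. delay is the 2-second pause
    between visits so we don't overwhelm the site."""
    rows = []
    for url in links:
        rows.append(parse(fetch(url)))
        time.sleep(delay)  # politeness pause between visits
    return pd.DataFrame(rows)

# In the real script, fetch would be roughly
#   lambda u: requests.get(u, headers=HEADERS).text
# parse would be a BeautifulSoup routine, and the result would be saved
# after each visit with something like df.to_excel("ratings.xlsx").
```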
7) This section extracts the grade for each comment.
8) And this section extracts the quality and difficulty values.
9) This section grabs all the rest and stores the data in its respective column in the DataFrame created earlier.
10) This piece of code is a fail-safe that lets the bot move on to the next link if the link it visited is broken or unavailable.
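A minimal version of that fail-safe is a try/except wrapper around each visit: a broken or unavailable link is logged and skipped instead of crashing the whole scrape. The function name and the bare `Exception` catch are illustrative; the real script would more likely catch `requests.RequestException`.

```python
def safe_visit(url, fetch):
    """Fail-safe wrapper: return the page if the visit succeeds,
    otherwise log the problem and return None so the loop moves on."""
    try:
        return fetch(url)
    except Exception as exc:  # requests.RequestException in a real scraper
        print(f"Skipping broken/unavailable link {url}: {exc}")
        return None
```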
As you can see by just scanning through the raw data, some values are missing and the comments range from a single word, to a paragraph, to even a smiley face. So we will analyze the data using Python.
The chart on the right is the count of grades (A+, A, A-, B+, B, B-, C+, C, C-, D+, D, D-, F, Pass, Fail). Note that if we are to create another model that predicts the grade for a given comment, we must address the biases in our grade distribution and drop the “Pass”, “Fail”, and empty values, since Pass and Fail are redundant.
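Dropping those unusable grade rows is one line of pandas filtering. A sketch, assuming the grade column is named `Grade` (the real column name may differ):

```python
import pandas as pd

def drop_unusable_grades(df: pd.DataFrame) -> pd.DataFrame:
    """Drop rows whose Grade is "Pass", "Fail", or empty, keeping only
    letter grades for the grade-prediction model."""
    keep = df["Grade"].notna() & ~df["Grade"].isin(["Pass", "Fail"])
    return df[keep].reset_index(drop=True)
```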
2) Clean & Analyze Data
There is also an imbalance in the grade distribution.
Comment-length chart: Y-axis = length of comment, X-axis = frequency; mode = 350.
The top two charts show the frequency distributions of the raw data, and the bottom two show the cleaned data. Before and after cleaning, the distribution of each value remains about the same.
Even after cleaning the data, the distribution of comment lengths is about the same as in the raw data. We will remove any comment UNDER 84 characters and ABOVE 345.
(NOTE: We have to remove comments BASED on the length of the comment – we will use python to achieve this)
Now our data is equally proportioned! All we did was slice the DataFrame.
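That slice can be sketched as a length filter over the comment column, using the 84/345 cutoffs chosen above. The column name `Comment` and the inclusive bounds are assumptions; the post only says comments UNDER 84 and ABOVE 345 characters go.

```python
import pandas as pd

def slice_by_length(df: pd.DataFrame, lo=84, hi=345, col="Comment"):
    """Keep only comments whose length falls between lo and hi characters."""
    lengths = df[col].str.len()
    return df[(lengths >= lo) & (lengths <= hi)].reset_index(drop=True)
```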