These are the Python libraries that we will be importing.
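A sketch of the imports this walkthrough relies on; the original notebook's exact list may differ, and the NLTK imports shown in the comment are the usual choice for the stopword removal and stemming steps described below.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.metrics import accuracy_score

# For stopword removal and stemming, NLTK is typically used:
# import nltk
# from nltk.corpus import stopwords
# from nltk.stem import PorterStemmer
```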

We will open the comment dataset Excel file into a pandas DataFrame.
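The load step would look like the commented line below; the filename and column names ("Comment", "Value") are assumptions. To keep this sketch self-contained, a tiny stand-in DataFrame with the same shape is built instead.

```python
import pandas as pd

# The original notebook reads the dataset from Excel, e.g.:
# df = pd.read_excel("comments.xlsx")  # hypothetical filename
# Stand-in data with the two columns the walkthrough uses:
df = pd.DataFrame({
    "Comment": ["I am beginning to love this",
                "It was an average experience",
                "Awful service and a long wait"],
    "Value": ["Awesome", "Average", "Awful"],
})
print(df.shape)  # (3, 2)
```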

We will clean our comments by removing stopwords such as "and", "I", and "a". These words don't add anything of value to our model. We will also stem each word down to its root, so "beginning" will be shortened to "begin".

Now we will loop through our dataset, remove the stopwords, stem each word, and store the result in a new column called "Cleaned".
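A minimal sketch of the cleaning step. The column names, the small stopword list, and the crude suffix-stripping stemmer are all stand-ins; the original likely uses NLTK's stopword corpus and PorterStemmer.

```python
import re
import pandas as pd

# Illustrative stopword list; a real one (e.g. NLTK's) is much longer.
stop_words = {"and", "i", "a", "the", "it", "was", "to", "this", "an", "am"}

def crude_stem(word):
    # Toy stand-in for a real stemmer such as NLTK's PorterStemmer:
    # strip a few common suffixes so "beginning" becomes "begin".
    for suffix in ("ning", "ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def clean(text):
    words = re.findall(r"[a-z]+", text.lower())
    return " ".join(crude_stem(w) for w in words if w not in stop_words)

df = pd.DataFrame({"Comment": ["I am beginning to love this"]})
df["Cleaned"] = df["Comment"].apply(clean)
print(df["Cleaned"][0])  # "begin love"
```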

Now we will create our training and test data by splitting the cleaned column of the dataset. We will use 30% of our data for testing, and the remaining 70% for training. The features will be the comments in the Cleaned column, and the target vector will be the Value column (Awesome, Average, Awful).
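The split described above can be sketched with scikit-learn's train_test_split. The stand-in data and random_state are assumptions; in the real notebook the Cleaned and Value columns come from the earlier steps.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Stand-in data; the real notebook uses the Cleaned and Value columns.
df = pd.DataFrame({
    "Cleaned": [f"comment {i}" for i in range(10)],
    "Value": ["Awesome", "Average", "Awful", "Awesome", "Average",
              "Awful", "Awesome", "Average", "Awful", "Awesome"],
})
# test_size=0.3 gives the 70/30 train/test split from the text.
X_train, X_test, y_train, y_test = train_test_split(
    df["Cleaned"], df["Value"], test_size=0.3, random_state=42)
print(len(X_train), len(X_test))  # 7 3
```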

Now we will build our model. In the first line of code, the features (comments) are broken down word by word into a bag of words. These words are then given a weight based on their frequency: the more comments a word appears in, the less weight it gets, while rarer words are given a higher weight because they are more discriminative. ngram_range tells the model how many consecutive words to consider as a single feature; in our case we have set it to (1, 2), so both single words and two-word phrases become features. k=10,000 tells the feature selector to keep the 10,000 best features.
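A sketch of the model described above, assembled as a scikit-learn Pipeline. The classifier choice (LogisticRegression) and the chi2 scoring function for SelectKBest are assumptions; ngram_range=(1, 2) and k=10,000 come from the text.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

model = Pipeline([
    # TF-IDF bag of words over single words and two-word phrases.
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),
    # Keep the 10,000 highest-scoring features.
    ("select", SelectKBest(chi2, k=10_000)),
    # Assumed classifier; the original may use a different one.
    ("clf", LogisticRegression(max_iter=1000)),
])
```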

Now we will train the model on our training data.
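Training is a single fit call on the pipeline. The toy corpus below stands in for the real training split, and k="all" replaces k=10,000 only because the toy vocabulary is tiny.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Tiny stand-in training data (already "cleaned" in the earlier sense).
X_train = pd.Series(["love it great", "okay fine averag", "aw terribl bad",
                     "great love", "fine okay", "bad aw"])
y_train = pd.Series(["Awesome", "Average", "Awful",
                     "Awesome", "Average", "Awful"])

model = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),
    ("select", SelectKBest(chi2, k="all")),  # k=10_000 on the full dataset
    ("clf", LogisticRegression(max_iter=1000)),
])
model.fit(X_train, y_train)
print(model.predict(["love great"])[0])
```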

Now our model is complete. Let's check its accuracy score. The accuracy is not that good; there are ways we can improve it, but roughly 70% is sufficient for now.
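Scoring can be sketched with accuracy_score on the held-out test split. The toy data below stands in for the real split, so its accuracy says nothing about the ~70% figure reported in the text.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.pipeline import Pipeline

# Toy stand-in for the real 70/30 split.
X_train = ["love it great", "okay fine averag", "aw terribl bad"]
y_train = ["Awesome", "Average", "Awful"]
X_test = ["great love it", "terribl bad"]
y_test = ["Awesome", "Awful"]

model = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),
    ("select", SelectKBest(chi2, k="all")),
    ("clf", LogisticRegression(max_iter=1000)),
])
model.fit(X_train, y_train)
acc = accuracy_score(y_test, model.predict(X_test))
print(acc)
```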