These are the Python libraries that we will be importing:
import json as j
import pandas as pd
import re
import numpy as np
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectKBest, chi2
We will load the comment dataset Excel file into a pandas DataFrame.
df = pd.read_excel('DATASET_COMMENT_CLEAN.xlsx')
df.head()
| | ID | value | quality | difficulty | grade | comment | link |
|---|---|---|---|---|---|---|---|
| 0 | 2318671 | awesome | 5.0 | 1.0 | B | quizzes paper and presentation and final ... | /ShowRatings.jsp?tid=2318671 |
| 1 | 1792829 | awesome | 4.0 | 4.0 | EMPTY | a fair amount of reading lots of blog post o... | https://www.ratemyprofessors.com/ShowRatings.j... |
| 2 | 561105 | awesome | 5.0 | 4.0 | B | a little rough around the edges but once you g... | /ShowRatings.jsp?tid=561105 |
| 3 | 1193391 | awesome | 5.0 | 3.0 | EMPTY | a lot of work but very easy he is clear and v... | /ShowRatings.jsp?tid=1193391 |
| 4 | 802611 | awesome | 4.5 | 3.0 | EMPTY | a lot of work but it was the best class i hav... | /ShowRatings.jsp?tid=802611 |
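Before going further, it can be helpful to see how the comments are spread across the three labels. A quick check (just a sketch, using the value column shown above):

# How many comments fall into each category (awesome / average / awful)?
print(df['value'].value_counts())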
We will remove stopwords from the comments, words such as "and", "I", and "a" that don't add anything of value to our model. We will also stem each word down to its root, so "beginning" becomes "begin".
# Snowball stemmer and the NLTK English stopword list
# (the stopwords corpus must be downloaded once with nltk.download('stopwords'))
stemmer = SnowballStemmer('english')
words = stopwords.words("english")
Now we will clean every comment: strip out non-letter characters, remove stopwords, and stem the remaining words, storing the result in a new column called "cleaned".
# Strip non-letters, lowercase, drop stopwords, and stem each remaining word
df['cleaned'] = df['comment'].apply(lambda x: " ".join(stemmer.stem(i) for i in re.sub("[^a-zA-Z]", " ", x).lower().split() if i not in words))
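To get a feel for what the cleaning step does, here is a small illustrative check (the example words below are ours, not taken from the dataset):

print(stemmer.stem("beginning"))             # begin
print("and" in words, "professor" in words)  # True False
print(df['cleaned'].iloc[0])                 # the first comment after cleaning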
Now we will create our training and test data by splitting the cleaned column of the dataset: 30% of the data becomes test data and the remaining 70% is training data. The features are the comments in the cleaned column, and the target vector is the value column (awesome, average, awful).
X_train, X_test, y_train, y_test = train_test_split(df['cleaned'], df.value, test_size=0.3)
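As a quick sanity check on the split, the test set should hold roughly 30% of the rows:

print(len(X_train), len(X_test))
print(len(X_test) / len(df))   # should be close to 0.3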
Now we will build our model as a pipeline. The first step, TfidfVectorizer, breaks the comments into a bag of words and assigns each term a TF-IDF weight: terms that appear in many comments get a lower weight, while rarer terms get a higher weight because they are more informative. ngram_range tells the vectorizer how many consecutive words to treat as a single feature; with ngram_range=(1, 2) it uses both single words and two-word phrases. The second step, SelectKBest with k=15000, keeps the 15,000 features with the highest chi-squared scores. The final step is a LinearSVC classifier.
pipeline = Pipeline([('vect', TfidfVectorizer(ngram_range=(1, 2), stop_words="english", sublinear_tf=True)),
                     ('chi', SelectKBest(chi2, k=15000)),
                     ('clf', LinearSVC(C=1.0, penalty='l1', max_iter=5000, dual=False))])
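To make ngram_range=(1, 2) concrete, here is a tiny standalone example (the two sentences are made up for illustration) showing that the vectorizer builds features from both single words and two-word phrases:

demo = TfidfVectorizer(ngram_range=(1, 2), stop_words="english")
demo.fit(["great professor easy exams", "boring lectures hard exams"])
print(sorted(demo.vocabulary_))   # unigrams like 'exams' plus bigrams like 'easy exams'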
Now we are training the model with our training data.
model = pipeline.fit(X_train, y_train)
Now our model is complete. Let's check its accuracy on the test set. At roughly 70% the accuracy is not great; there are ways to improve it, but it is sufficient for now.
print("accuracy score: " + str(model.score(X_test, y_test)))
accuracy score: 0.69813579613477
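We can also use the trained pipeline to classify a brand new comment. The comment below is made up for illustration; it just needs the same cleaning as the training data before being passed to the model:

new_comment = "The lectures were clear and the exams were fair"
cleaned = " ".join(stemmer.stem(i) for i in re.sub("[^a-zA-Z]", " ", new_comment).lower().split() if i not in words)
print(model.predict([cleaned]))   # predicts one of: awesome, average, awful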