
NLP Classification: Part 2


In this section, the data collected from the subreddits is cleaned and prepared for modeling, and several models are compared to see which one performs best.

These are the imports used for this section:

import pandas as pd
import regex as re
import nltk
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
import collections
from collections import Counter
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import RegexpTokenizer, word_tokenize
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from bs4 import BeautifulSoup
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline, Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC, LinearSVC
from sklearn.ensemble import GradientBoostingClassifier, AdaBoostClassifier, VotingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

Read in the data that was saved at the end of the previous section and drop the created_utc column, since it is no longer needed. Using inplace=True makes the change permanent.

cocktails = pd.read_csv('../data/cocktails_50comments.csv')
cocktails.drop(columns=['created_utc'], inplace=True)
cocktails.head()
Resulting DataFrame

I wanted to see how many words were in the body column before I cleaned the data. Using the following function, the count came to 1,461,519 words.

def word_count(series):
    '''Lower-cases every comment in a series and returns all the text as one string.'''
    list_tokens = [w.lower() for w in series]
    string_tokens = str(list_tokens)
    tokens = BeautifulSoup(string_tokens, 'html.parser').get_text()
    return tokens

len(word_count(cocktails['body']).split())

Now it’s time to clean the text. The function was adapted from a General Assembly lecture by Matt Brems. I kept stop_words outside of the function so I could easily add more words to the list.

This function removes HTML, strips out characters that are not letters, lower-cases all the words, splits them into individual tokens without using an official tokenizer, and removes the stopwords. I tried both lemmatizing and stemming the words; I commented out the stemmer because I ended up using the lemmatizer.

stop_words = stopwords.words('english')
stop_words.append('like')
stop_words.append('one')

lemmatizer = WordNetLemmatizer()
stemmer = PorterStemmer()

def status_words(status):
    '''Takes a single comment (a string) and cleans the text data.'''

    # Remove HTML
    review_text = BeautifulSoup(status, 'html.parser').get_text()

    # Remove characters that are not letters
    letters_only = re.sub('[^a-zA-Z]', ' ', review_text)

    # Lower-case and split into tokens without an official tokenizer
    words = letters_only.lower().split()

    # Convert the stopword list to a set for faster membership checks
    stops = set(stop_words)

    # Drop stopwords and lemmatize what remains
    meaningful_words = [lemmatizer.lemmatize(w) for w in words if w not in stops]
    # meaningful_words = [stemmer.stem(w) for w in words if w not in stops]
    # tried both the stemmer and the lemmatizer; kept the lemmatizer

    return ' '.join(meaningful_words)

I used this function to add a column of cleaned text to the DataFrame.

cocktails['body_clean'] = cocktails['body'].map(status_words)

Now I can compare a row before and after.

cocktails['body'][0]
before cleaning
cocktails['body_clean'][0]
after cleaning
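
Since the before-and-after screenshots don't carry much detail here, here is a quick illustration of what status_words does to a made-up comment (the comment text below is hypothetical, not a real row from the data):

status_words('I really like this <b>Old Fashioned</b>, one of the best I have made!')
# roughly: 'really old fashioned best made' (HTML tags, punctuation, stopwords,
# and the added words 'like' and 'one' are all stripped out)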

I wanted to see the top 10 most common words before and after. I utilized a Counter to set up this function.

def top_10(series):
    '''Returns the 10 most common words in a series of comments.'''
    clean_tokens = word_count(series)
    count = Counter(clean_tokens.split())
    return count.most_common(10)
Top 10 words before cleaning
Top 10 words after cleaning
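
The lists above came from calling the helper on the raw and cleaned columns, roughly like this:

top_10(cocktails['body'])        # 10 most common words before cleaning
top_10(cocktails['body_clean'])  # 10 most common words after cleaning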

I wanted to see how many words there were after the cleaning. There are now 819,173 words, down from the 1,461,519 words the data started with.

len(word_count(cocktails['body_clean']).split())

Save this DataFrame. Always save at a checkpoint like this, even if you plan to keep working right away; if something goes wrong, the data is still available. It is not a huge deal in this scenario, but when something takes hours or days to run, you want to be able to get the data back immediately if things go sideways. Now all the text is clean and ready to model.

cocktails.to_csv('../data/clean_50cocktail.csv', index=None)

After all the cleaning, there are now some rows with null values: some comments were just an emoji or nothing but stopwords, so cleaning left them empty. This is another reason I wanted to add the clean text to the DataFrame as a new column rather than replacing the original column altogether; it makes comparison easy. After examining the null values, I felt good about dropping them. The cleaning process removed 309 rows of the initial 50,000.

cocktails.loc[cocktails['body_clean'].isnull()]
cocktails.dropna(inplace=True)
Looking at null values in the clean data

This process was also done to df_wine and now they can be combined.

df = pd.concat([cocktails, df_wine], axis=0)
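
One small caveat that is my addition rather than part of the original walkthrough: pd.concat keeps each frame's original index, so the combined DataFrame ends up with duplicate index labels. Resetting the index keeps later row lookups predictable.

df.reset_index(drop=True, inplace=True)
df.shape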

To run models on this data, everything has to be numeric. The comment text is handled later by the vectorizers; here the target column is mapped so that wine = 1 and cocktails = 0.

df['subreddit'] = df['subreddit'].map({'wine': 1, 'cocktails': 0})
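
It is also worth checking the class balance, since it sets the baseline accuracy the models need to beat (this check is my addition, not from the original post):

df['subreddit'].value_counts(normalize=True)
# if both subreddits contributed a similar number of comments, this will be
# close to 50/50, which puts the accuracy baseline around 0.5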

The next step is assigning X and y, then splitting them into a training set for the model to learn from and a testing set to evaluate performance.

X = df['body_clean']
y = df['subreddit']
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    stratify=y,
                                                    test_size=0.25,
                                                    random_state=42)

I tried several combinations of models (Logistic Regression, MultinomialNB, LinearSVC) and turned words into numbers using both CountVectorizer and TfidfVectorizer; a sketch of that comparison loop is shown below. The combination that performed best follows the sketch, scoring 0.863 on train and 0.857 on test, meaning the model classified comments correctly 86.3% of the time on the training set and 85.7% of the time on unseen data.
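
Here is a rough sketch of how that comparison can be looped with cross-validation. The exact combinations and settings I ran may have differed; max_iter=1000 on LogisticRegression is just a precaution against convergence warnings on text features.

for vec in [CountVectorizer(), TfidfVectorizer()]:
    for model in [LogisticRegression(max_iter=1000), MultinomialNB(), LinearSVC()]:
        combo = make_pipeline(vec, model)
        scores = cross_val_score(combo, X_train, y_train, cv=5, n_jobs=-1)
        print(type(vec).__name__, '+', type(model).__name__,
              'CV accuracy:', scores.mean().round(3))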

pipe = make_pipeline(CountVectorizer(), MultinomialNB())
pipe_use = pipe.fit(X_train, y_train)
print('Baseline Train', pipe.score(X_train, y_train))
print('Baseline Test', pipe.score(X_test, y_test))
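
Once fit, the pipeline can classify brand-new text directly. The example comments below are made up for illustration:

examples = ['smoky mezcal negroni with a twist of orange',
            'this vintage has great tannins and a long finish']
pipe.predict(examples)
# returns an array of 0/1 labels; ideally array([0, 1]) here, with the first
# comment read as cocktails and the second as wine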

Using GridSearchCV, I was able to get a train score of 0.954 and a test score of 0.875. The test score did improve, so this model is more accurate, but the training score is so much higher than the test score that the model clearly has high variance and is extremely overfit.

pipe = Pipeline([
    ('cvec', CountVectorizer()),
    ('nb', MultinomialNB())
])

params = {
    'cvec__min_df': [1, 2],
    'cvec__max_df': [.1, .25, .5, 1.0],
    'cvec__ngram_range': [(1, 1), (1, 2)],
    'nb__alpha': [.25, .5, 1]
}

gs = GridSearchCV(pipe, param_grid=params, n_jobs=-1, cv=5)
gs.fit(X_train, y_train)

print('Best Score:', gs.best_score_)
print('Best Params:', gs.best_params_)
print('Train:', gs.score(X_train, y_train))
print('Test:', gs.score(X_test, y_test))
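
As a quick sanity check (my addition, not part of the original write-up), the refit best estimator can be cross-validated on the training data; the mean should land close to gs.best_score_:

cv_scores = cross_val_score(gs.best_estimator_, X_train, y_train, cv=5, n_jobs=-1)
print('Mean CV accuracy:', cv_scores.mean().round(3))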
