Spam Classifier: A Natural Language Processing Project
What is Natural Language Processing?
Natural Language Processing (NLP) is the set of techniques that let a computer interpret human language and act on it. Voice assistants such as Alexa and Siri are familiar examples.
Let’s start with the Spam Classifier:
The spam classifier predicts whether a received message is ham (a legitimate message) or spam.
Let’s start with the dataset: it consists of 5572 messages, each labeled either “ham” or “spam”.
import pandas as pd
messages = pd.read_csv("SMSSpamClassifier", sep="\t", names=["label", "message"])
Now the labels need to be converted to 0 and 1, which can be done with the get_dummies() function of the pandas library.
# One-hot encode the labels, then keep only the "spam" column (ham -> 0, spam -> 1).
y = pd.get_dummies(messages['label'])
y = y.iloc[:, 1].astype(int).values

Here, y will contain 0 for “ham” labels and 1 for “spam” labels.
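The encoding can be checked on a toy label column. The sketch below is only an illustration: the four labels are made up, and the names labels, dummies, y_toy and y_mapped are not part of the original project. get_dummies creates one indicator column per class in sorted order, so column 0 is "ham" and column 1 is "spam". Recent pandas versions return booleans from get_dummies by default, so the values are cast to int to get 0 and 1.

```python
import pandas as pd

# Hypothetical toy labels, for illustration only.
labels = pd.Series(["ham", "spam", "ham", "spam"])

# One indicator column per class, in sorted order: 'ham', then 'spam'.
dummies = pd.get_dummies(labels)

# Keep the 'spam' column and cast it to integers (ham -> 0, spam -> 1).
y_toy = dummies.iloc[:, 1].astype(int).values

# An equivalent, more explicit encoding using a mapping.
y_mapped = labels.map({"ham": 0, "spam": 1}).values

print(list(dummies.columns))
print(y_toy.tolist())
print(y_mapped.tolist())
```

The explicit mapping is arguably easier to read, because it does not depend on the column order that get_dummies happens to produce.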
Now let’s look at the independent variable, x. First we have to clean the message text: replace non-letter characters, lowercase the string, remove stopwords, and reduce different forms of the same word to a common base form. For the last step we will use WordNetLemmatizer. The main reason for using a lemmatizer instead of a stemmer is that it returns meaningful dictionary words, whereas a stemmer can leave truncated stems.
The code for it is:
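import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# The stopword list and WordNet data must be downloaded once before first use.
nltk.download('stopwords')
nltk.download('wordnet')

lemmatizer = WordNetLemmatizer()
# Build the stopword set once; a set lookup is much faster than scanning the list for every word.
stop_words = set(stopwords.words('english'))
corpus = []
for i in range(len(messages)):
    # Replace every non-letter character with a space.
    review = re.sub('[^a-zA-Z]', ' ', messages['message'][i])
    review = review.lower()
    review = review.split()
    # Drop stopwords and lemmatize the remaining words.
    review = [lemmatizer.lemmatize(word) for word in review if word not in stop_words]
    review = ' '.join(review)
    corpus.append(review)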
Here, corpus holds all the cleaned messages. The code above lowercases each message, removes the stopwords, and keeps the words that matter for prediction. Now we use Term Frequency-Inverse Document Frequency (TF-IDF), via TfidfVectorizer, to turn each message into a vector of numbers. Each entry of a TF-IDF vector reflects how important a word is to that message relative to the whole corpus.
from sklearn.feature_extraction.text import TfidfVectorizer
cv = TfidfVectorizer(max_features=5000)
x = cv.fit_transform(corpus).toarray()
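To get a feel for what the vectorizer computes, here is a minimal pure-Python sketch of TF-IDF on a made-up three-message corpus. It follows what I understand to be scikit-learn's default settings (raw term counts, smoothed IDF of ln((1 + n) / (1 + df)) + 1, and L2 normalization of each row), but it is a teaching aid and not the library's implementation: tokenization is a plain whitespace split, there is no max_features cap, and the function name tfidf and the toy corpus are assumptions for illustration.

```python
import math
from collections import Counter


def tfidf(corpus):
    """Return (vocabulary, rows) of L2-normalized TF-IDF weights for a list of strings."""
    docs = [doc.split() for doc in corpus]
    vocab = sorted({word for doc in docs for word in doc})
    n_docs = len(docs)
    # Document frequency: in how many messages does each word appear?
    df = Counter(word for doc in docs for word in set(doc))
    # Smoothed inverse document frequency: rarer words get a larger weight.
    idf = {w: math.log((1 + n_docs) / (1 + df[w])) + 1 for w in vocab}
    rows = []
    for doc in docs:
        counts = Counter(doc)
        weights = [counts[w] * idf[w] for w in vocab]
        # Scale each row to unit length so long and short messages are comparable.
        norm = math.sqrt(sum(v * v for v in weights)) or 1.0
        rows.append([v / norm for v in weights])
    return vocab, rows


# Hypothetical, already-cleaned toy messages.
toy_corpus = ["free prize win", "free lunch today", "lunch meeting today"]
vocab, rows = tfidf(toy_corpus)
for word, weight in zip(vocab, rows[0]):
    print(f"{word:8s} {weight:.3f}")
```

In the first message, "prize" ends up with a higher weight than "free" because "free" also appears in another message, and words that do not occur in the message get a weight of zero.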
The features are now prepared in x and we can use them to train our model. Naive Bayes is a fast and widely used baseline for text classification and tends to work well on word-frequency features, so we will use MultinomialNB to train our model.
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size = 0.20, random_state=0)
spam_detect_model = MultinomialNB().fit(X_train, y_train)
y_pred = spam_detect_model.predict(X_test)
print(accuracy_score(y_test,y_pred))
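The model gives an accuracy of around 98%. To predict a new input, clean the message the same way as the training data and then call spam_detect_model.predict(cv.transform([user_input]).toarray()). Note that transform expects a list of strings, so a single message has to be wrapped in a list.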
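The prediction step can be sketched end to end on a tiny, made-up training set. Everything here is hypothetical and only for illustration: the four training messages, and the names toy_cv, toy_model and predict_message. In the real project you would reuse the fitted cv and spam_detect_model from above and pass the new message through the same cleaning loop first.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical, already-cleaned toy messages (1 = spam, 0 = ham).
train_texts = [
    "win free prize now",
    "free cash prize claim now",
    "are we meeting for lunch today",
    "see you at home tonight",
]
train_labels = [1, 1, 0, 0]

# Fit the vectorizer and the classifier on the training messages.
toy_cv = TfidfVectorizer()
toy_model = MultinomialNB().fit(toy_cv.fit_transform(train_texts), train_labels)


def predict_message(text):
    """Return 1 if the toy model labels the text as spam, else 0."""
    # transform (not fit_transform) reuses the vocabulary learned during training,
    # and it expects an iterable of strings, hence the one-element list.
    features = toy_cv.transform([text])
    return int(toy_model.predict(features)[0])


print(predict_message("free prize"))
print(predict_message("lunch today"))
```

Calling fit_transform on the new message instead of transform would rebuild the vocabulary from that single message, and the resulting features would no longer line up with what the model was trained on.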
All resources and code are available at:
Darkshadow9799/Sms-Spam-Classifier



Spam Classifier: A Natural Language Processing Project was originally published in Chatbots Life on Medium, where people are continuing the conversation by highlighting and responding to this story.