Spam Classifier: A Natural Language Processing Project

What is Natural Language Processing?

NLP is a way for computers to interpret human language and perform tasks with it. Voice assistants such as Alexa and Siri are examples.

Let’s start with the Spam Classifier:

The spam classifier predicts whether a received message is ham (a legitimate message) or spam.

The dataset consists of 5572 messages and their labels, each of which is either “ham” or “spam”.

import pandas as pd

# The file is tab-separated and has no header row, so the column names are supplied here.
messages = pd.read_csv('SMSSpamClassifier', sep='\t', names=['label', 'message'])

Now the labels need to be converted to 0 and 1, which can be done using the get_dummies() function of the pandas library.

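# One indicator column per label, in alphabetical order: 'ham', then 'spam'.
y = pd.get_dummies(messages['label'])
# Keep the 'spam' column; astype(int) gives 0/1 even where pandas returns booleans.
y = y.iloc[:, 1].astype(int).values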

Here, y will contain 0 for “ham” labels and 1 for “spam” labels, because the second column of the get_dummies() output is the “spam” indicator.
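To make the column selection concrete, here is a small sketch on a made-up three-row frame. The sample rows and the names toy, dummies and y_toy are illustrative assumptions, not part of the original dataset or code. It shows that get_dummies() produces one indicator column per label in alphabetical order, so the second column marks spam.

```python
import pandas as pd

# Toy data for illustration only; the real dataset has 5572 rows.
toy = pd.DataFrame({
    "label": ["ham", "spam", "ham"],
    "message": ["see you at lunch", "win a free prize", "call me later"],
})

# One indicator column per distinct label, sorted alphabetically.
dummies = pd.get_dummies(toy["label"])
print(list(dummies.columns))

# Take the second column (the "spam" indicator) as integer 0/1 values.
y_toy = dummies.iloc[:, 1].astype(int).values
print(y_toy)
```

With these three rows, the columns come out as ham and spam, and y_toy marks only the middle message as spam.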

Now let’s look at the independent variable, x. First we have to clean the message data: remove stopwords, lowercase the text, and group the different forms of the same word. For that last step we will use WordNetLemmatizer. The main reason for using a lemmatizer instead of a stemmer is that it produces meaningful dictionary words.

Now the code for it is:

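import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# One-time downloads of the NLTK data used below.
nltk.download('stopwords')
nltk.download('wordnet')

lemmatizer = WordNetLemmatizer()
# Build the stopword set once instead of reloading the list for every word.
stop_words = set(stopwords.words('english'))
corpus = []

for i in range(len(messages)):
    # Replace everything except letters with spaces, then lowercase and split into words.
    review = re.sub('[^a-zA-Z]', ' ', messages['message'][i])
    review = review.lower()
    review = review.split()
    # Drop stopwords and reduce each remaining word to its lemma.
    review = [lemmatizer.lemmatize(word) for word in review if word not in stop_words]
    review = ' '.join(review)
    corpus.append(review)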

Here, corpus holds all the cleaned messages. The code above strips non-letter characters, lowercases the text, removes the stopwords, and lemmatizes what is left, keeping the words that matter for prediction. Now we use Term Frequency-Inverse Document Frequency (TF-IDF), i.e. TfidfVectorizer, to turn each message into a vector. Each entry of a TF-IDF vector is a weight that reflects how important a word is to that message relative to the whole corpus.


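from sklearn.feature_extraction.text import TfidfVectorizer

# Keep only the 5000 most frequent terms in the vocabulary.
cv = TfidfVectorizer(max_features=5000)
# Learn the vocabulary and weights, then convert the sparse matrix to a dense array.
x = cv.fit_transform(corpus).toarray()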
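To see what the TF-IDF weights represent, the following is a minimal hand computation on a made-up three-message corpus. It is a sketch, not the library's implementation: the formula used is the smoothed variant that scikit-learn documents as its default, idf = ln((1 + n) / (1 + df)) + 1 followed by L2 normalization, and the names docs, idf and tfidf_vector are illustrative assumptions.

```python
import math
from collections import Counter

# Toy corpus of already-cleaned messages (illustrative only).
docs = ["free prize free", "lunch tomorrow", "free lunch"]
tokenized = [d.split() for d in docs]
vocab = sorted({w for doc in tokenized for w in doc})
n_docs = len(tokenized)

# Document frequency: in how many messages does each word occur?
df = {w: sum(w in doc for doc in tokenized) for w in vocab}

# Smoothed inverse document frequency: rarer words get a larger value.
idf = {w: math.log((1 + n_docs) / (1 + df[w])) + 1 for w in vocab}

def tfidf_vector(tokens):
    """Weight each vocabulary word by count * idf, then scale to unit length."""
    counts = Counter(tokens)
    raw = [counts[w] * idf[w] for w in vocab]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm if norm else 0.0 for v in raw]

vectors = [tfidf_vector(doc) for doc in tokenized]
for doc, vec in zip(docs, vectors):
    print(doc, [round(v, 3) for v in vec])
```

In this toy corpus, "prize" and "tomorrow" each appear in one message while "free" and "lunch" appear in two, so the rarer words receive the higher idf value.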

The data is prepared in ‘x’ and now we can use it for training our model. Naïve Bayes is a simple algorithm that often performs well on text classification, so we will use it for training our model.

from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

# Hold out 20% of the messages for testing; random_state makes the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.20, random_state=0)

# Train the classifier, then score its predictions on the held-out messages.
spam_detect_model = MultinomialNB().fit(X_train, y_train)
y_pred = spam_detect_model.predict(X_test)
print(accuracy_score(y_test, y_pred))
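For intuition about what a multinomial Naïve Bayes classifier computes, here is a minimal pure-Python sketch with Laplace smoothing. It is not scikit-learn's implementation, it works on raw word counts rather than TF-IDF weights, and the functions train_nb and predict_nb and the four training messages are hypothetical names and data made up for illustration.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels, alpha=1.0):
    """Fit class priors and smoothed per-class word probabilities."""
    vocab = sorted({w for d in docs for w in d.split()})
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc.split())
    model = {}
    for c in class_counts:
        total = sum(word_counts[c].values())
        log_prior = math.log(class_counts[c] / len(labels))
        # Laplace smoothing: alpha is added to every word count.
        log_likelihood = {
            w: math.log((word_counts[c][w] + alpha) / (total + alpha * len(vocab)))
            for w in vocab
        }
        model[c] = (log_prior, log_likelihood)
    return model

def predict_nb(model, doc):
    """Return the class with the highest log posterior score."""
    scores = {}
    for c, (log_prior, log_likelihood) in model.items():
        # Words never seen in training are skipped.
        scores[c] = log_prior + sum(
            log_likelihood[w] for w in doc.split() if w in log_likelihood
        )
    return max(scores, key=scores.get)

# Toy training data (illustrative only): 1 = spam, 0 = ham.
train_docs = ["win free prize", "free cash win", "see lunch", "meeting tomorrow"]
train_labels = [1, 1, 0, 0]
nb_model = train_nb(train_docs, train_labels)

print(predict_nb(nb_model, "free prize"))      # words seen only in spam
print(predict_nb(nb_model, "lunch tomorrow"))  # words seen only in ham
```

The classifier picks the class whose prior and word probabilities together give the message the highest score, which is why words that appear only in spam messages push a new message toward the spam label.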

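The model gives an accuracy of around 98% on the test split. To predict a new input, clean it the same way as the training messages and call spam_detect_model.predict(cv.transform([user_input]).toarray()). Note that transform() expects a list of strings, so a single message is wrapped in a list.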
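The prediction step can be sketched end to end on a tiny made-up corpus. The names toy_corpus, toy_cv, toy_model and classify are illustrative assumptions, and the four training messages stand in for the real cleaned corpus. In practice a new message would first go through the same cleaning and lemmatizing steps as the training data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy, already-cleaned training messages (illustrative only).
toy_corpus = ["win free prize", "free cash win", "see lunch", "meeting tomorrow"]
toy_labels = [1, 1, 0, 0]  # 1 = spam, 0 = ham

toy_cv = TfidfVectorizer()
toy_x = toy_cv.fit_transform(toy_corpus).toarray()
toy_model = MultinomialNB().fit(toy_x, toy_labels)

def classify(message):
    """Vectorize one message with the fitted vectorizer and label it."""
    # transform() expects an iterable of strings, so wrap the message in a list.
    features = toy_cv.transform([message]).toarray()
    return "spam" if toy_model.predict(features)[0] == 1 else "ham"

print(classify("free prize"))
print(classify("lunch tomorrow"))
```

The important detail is that the new message is passed through transform(), not fit_transform(), so that it is mapped onto the vocabulary learned from the training data.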

All resources and code are available at:

Darkshadow9799/Sms-Spam-Classifier



Spam Classifier: A Natural Language Processing Project was originally published in Chatbots Life on Medium, where people are continuing the conversation by highlighting and responding to this story.
