GitHub → https://github.com/varshav0119/spam-ham-classifier

Web App → https://spam-ham-classifier.vercel.app/


The given problem statement is to develop an application that:

The process followed to solve the problem is discussed below in detail.

Research

Understanding Spam Classification

The first step was to understand the difference between spam (unsolicited messages) and ham (legitimate messages). The articles below were then explored to get a grounding in spam classification.

| Summary | URL |
| --- | --- |
| Explains SpamAssassin (an email spam classifier) and the attributes on which it bases its numerical spam scores. Positive scores indicate spam; negative scores indicate the email is unlikely to be spam. | https://www.emailonacid.com/blog/article/email-development/emailology_avoiding_the_assassin/#:~:text=The SpamAssassin score explained,likelihood an email is junk |
| A simple method to classify SMS messages as spam or ham, using TF-IDF and an SVM classifier. | https://towardsdatascience.com/spam-or-ham-introduction-to-natural-language-processing-part-2-a0093185aebd |
| Understanding the impact of false positives vs. false negatives in spam detection: we control for false positives (ham messages should not be misclassified as spam). | https://towardsdatascience.com/the-case-of-precision-v-recall-1d02fe0ac40f |
| The core idea of SVM is to find a maximum marginal hyperplane (MMH) that best divides the dataset into classes. | https://www.datacamp.com/community/tutorials/svm-classification-scikit-learn-python |
| How Logistic Regression works, with a discussion of overfitting. | https://realpython.com/logistic-regression-python/#when-do-you-need-classification |
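
To make the false-positive point concrete, here is a minimal sketch of precision and recall for the spam class, computed from scratch. The `precision_recall` helper and the toy labels are assumptions made for illustration; they are not taken from the project code.

```python
def precision_recall(y_true, y_pred, positive="spam"):
    """Return (precision, recall) for the `positive` class."""
    pairs = list(zip(y_true, y_pred))
    # True positives: spam correctly flagged as spam.
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    # False positives: ham wrongly flagged as spam (the costly error here).
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    # False negatives: spam that slipped through as ham.
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Toy labels, made up for illustration only.
y_true = ["ham", "ham", "spam", "spam", "ham", "spam"]
y_pred = ["ham", "spam", "spam", "ham", "ham", "spam"]

precision, recall = precision_recall(y_true, y_pred)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Every ham message misclassified as spam lowers precision on the spam class, so controlling for false positives amounts to favouring high spam precision, possibly at some cost in recall.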

Exploratory Data Analysis

The given dataset contains SMS messages labelled as “spam” or “ham”.

<aside> 💬 The total number of SMS messages (data points) is 5572.

</aside>

<aside> 💬 The average number of words per SMS message is 15.71, and the average number of stopwords in each message (as per the NLTK corpus) is 4.97.

</aside>
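
Averages like these can be computed with a few lines of Python. The sketch below is illustrative only: it assumes whitespace tokenisation and uses a small hand-picked stopword set in place of the full NLTK corpus, and the `message_stats` helper and sample messages are hypothetical.

```python
# Illustrative subset of English stopwords (assumption); the figures quoted
# above were computed against the full NLTK stopword corpus.
STOPWORDS = {"a", "an", "the", "is", "are", "to", "in", "of", "for", "you", "i", "and"}


def message_stats(messages, stopwords=STOPWORDS):
    """Return (average words per message, average stopwords per message)."""
    if not messages:
        return 0.0, 0.0
    word_counts = []
    stopword_counts = []
    for message in messages:
        words = message.split()  # whitespace tokenisation (assumption)
        word_counts.append(len(words))
        stopword_counts.append(sum(1 for w in words if w.lower() in stopwords))
    n = len(messages)
    return sum(word_counts) / n, sum(stopword_counts) / n


# Made-up sample messages, not rows from the dataset.
sample = [
    "Are you free for lunch",
    "WINNER claim the prize now",
]
avg_words, avg_stopwords = message_stats(sample)
print(avg_words, avg_stopwords)
```

The exact numbers depend on the tokeniser and the stopword list, so a different choice of either would shift the averages.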

Word clouds generated separately for spam and ham messages show a clear difference in topics and tone between the two classes.
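
A word cloud is essentially a rendering of per-class word frequencies, so the underlying comparison can be sketched with `collections.Counter`. The `top_words` helper and the sample messages below are hypothetical, and the tokenisation (lowercasing plus whitespace split) is an assumption.

```python
from collections import Counter


def top_words(messages, labels, target, n=3):
    """Most common lowercase tokens among messages carrying the `target` label."""
    counts = Counter()
    for message, label in zip(messages, labels):
        if label == target:
            counts.update(message.lower().split())
    return counts.most_common(n)


# Made-up messages for illustration, not rows from the dataset.
messages = [
    "free prize claim your free prize now",
    "see you at lunch",
    "claim free entry now",
    "lunch at noon then",
]
labels = ["spam", "ham", "spam", "ham"]

spam_top = top_words(messages, labels, "spam")
ham_top = top_words(messages, labels, "ham")
print(spam_top)
print(ham_top)
```

On real data, common stopwords would likely dominate both lists, so they may need to be filtered out before the frequencies are compared or rendered.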