GitHub → https://github.com/varshav0119/spam-ham-classifier
Web App → https://spam-ham-classifier.vercel.app/
Technologies Used
Deployment
The given problem statement is to develop an application that:
The process followed to solve the problem is discussed below in detail.
The first step was to understand the difference between spam and ham. The articles below were explored to get an overview of spam classification.
| Summary | URL |
|---|---|
| Explains SpamAssassin (an email spam classifier) and the attributes it uses to assign numerical spam scores. Positive scores indicate spam, and negative scores indicate that the email is unlikely to be spam. | https://www.emailonacid.com/blog/article/email-development/emailology_avoiding_the_assassin/#:~:text=The SpamAssassin score explained,likelihood an email is junk |
| A simple method to classify SMS messages as spam or ham, using TF-IDF and SVM classifier. | https://towardsdatascience.com/spam-or-ham-introduction-to-natural-language-processing-part-2-a0093185aebd |
| Understanding the impact of false positives vs. false negatives in spam detection: we control for false positives (ham messages should not be misclassified as spam). | https://towardsdatascience.com/the-case-of-precision-v-recall-1d02fe0ac40f |
| The core idea of SVM: find a maximum marginal hyperplane (MMH) that best divides the dataset into classes. | https://www.datacamp.com/community/tutorials/svm-classification-scikit-learn-python |
| How logistic regression works, and a discussion of overfitting. | https://realpython.com/logistic-regression-python/#when-do-you-need-classification |
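
The false positive vs. false negative trade-off noted in the table can be made concrete with a small calculation. The sketch below is illustrative only: the helper `precision_recall` and the label lists are hypothetical and are not taken from the project's code. With spam as the positive class, a ham message flagged as spam is a false positive, so keeping false positives low corresponds to keeping precision high.

```python
def precision_recall(y_true, y_pred, positive="spam"):
    """Return (precision, recall) for the chosen positive class."""
    pairs = list(zip(y_true, y_pred))
    # True positives: spam correctly flagged as spam.
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    # False positives: ham wrongly flagged as spam (the costly error here).
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    # False negatives: spam that slipped through as ham.
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical labels, invented purely for illustration.
y_true = ["ham", "ham", "spam", "spam", "spam", "ham"]
y_pred = ["ham", "spam", "spam", "spam", "ham", "ham"]

precision, recall = precision_recall(y_true, y_pred)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this toy example one ham message is misclassified as spam, which lowers precision; a classifier tuned to avoid that error would accept somewhat lower recall in exchange.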
The given dataset contains SMS messages labelled as “spam” or “ham”.

<aside> 💬 The total number of SMS messages (data points) is 5572.
</aside>

<aside> 💬 The average number of words per SMS message is 15.71, and the average number of stopwords in each message (as per the NLTK corpus) is 4.97.
</aside>
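
Statistics like these could be computed along the following lines. This is a minimal sketch under stated assumptions: words are split on whitespace, the small `STOPWORDS` set is a hypothetical stand-in for the NLTK stopword corpus, and the two sample messages are invented. The project's actual tokenization may differ.

```python
# Hypothetical stand-in for the NLTK English stopword list (assumption).
STOPWORDS = {"a", "the", "is", "to", "you", "in", "are", "i", "for"}


def message_stats(messages):
    """Return (average words per message, average stopwords per message)."""
    # Assumption: a "word" is any whitespace-separated token.
    word_counts = [len(m.split()) for m in messages]
    stop_counts = [
        sum(1 for w in m.lower().split() if w in STOPWORDS) for m in messages
    ]
    n = len(messages)
    return sum(word_counts) / n, sum(stop_counts) / n


# Invented sample messages, not drawn from the dataset.
sample = [
    "Free entry in a weekly prize draw",
    "Are you coming to the party",
]

avg_words, avg_stopwords = message_stats(sample)
print(avg_words, avg_stopwords)
```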
Word clouds made for spam and ham messages show a clear difference in the topics and tone of voice between the two.
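
A word cloud sizes each word by how often it occurs, so the underlying comparison is a per-class word frequency count. The sketch below shows that counting step only, using the standard library. The helper `top_words`, the stopword set, and the sample messages are all hypothetical and are not the project's code or data.

```python
from collections import Counter


def top_words(messages, stopwords, k=3):
    """Return the k most frequent non-stopword tokens across the messages."""
    counts = Counter(
        word
        for message in messages
        for word in message.lower().split()
        if word not in stopwords
    )
    return counts.most_common(k)


# Invented examples for illustration only.
stopwords = {"a", "you", "at", "now"}
spam_messages = ["win a free prize now", "free cash prize"]
ham_messages = ["see you at lunch", "lunch at noon"]

spam_top = top_words(spam_messages, stopwords, k=2)
ham_top = top_words(ham_messages, stopwords, k=1)
print(spam_top, ham_top)
```

The resulting frequencies are what a word cloud renderer would take as input, one set per class.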