This is a text corpus of over 5,500 English SMS messages, about 13% of which are labeled as spam. The text file contains one message per line with two columns: the label ("ham" or "spam") and the raw text of the message. Messages labeled "ham" are legitimate, non-spam messages.
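As a minimal sketch of reading this layout, the snippet below parses a few made-up lines in the label-then-text format and computes the spam share. It assumes the two columns are separated by a tab; the sample messages and the `parse_messages` helper are illustrative and are not taken from the corpus or the project code.

```python
import csv
import io
from collections import Counter

# Illustrative sample only (not real corpus rows); assumes tab-separated columns.
SAMPLE = (
    "ham\tAre we still meeting at 6?\n"
    "spam\tWINNER! Claim your free prize now, text WIN to 80000\n"
    "ham\tHe said \"call me later\", ok?\n"
    "ham\tRunning late, see you soon\n"
)


def parse_messages(handle):
    """Return a list of (label, text) pairs from a tab-separated file object."""
    # QUOTE_NONE keeps quote characters inside messages as literal text.
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    rows = []
    for row in reader:
        if len(row) < 2:
            continue  # skip blank or malformed lines
        # Rejoin any extra fields in case the message itself contains a tab.
        label, text = row[0], "\t".join(row[1:])
        if label not in ("ham", "spam"):
            raise ValueError(f"unexpected label: {label!r}")
        rows.append((label, text))
    return rows


messages = parse_messages(io.StringIO(SAMPLE))
counts = Counter(label for label, _ in messages)
spam_share = counts["spam"] / len(messages)
print(counts, f"spam share: {spam_share:.0%}")
```

To read the real file, pass an open file handle instead of the `io.StringIO` sample. Checking the class balance early matters here because the corpus is skewed toward ham, so raw accuracy alone can look good even for a weak filter.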
Background: You work for a telecom company launching a new messaging app. The company's previous spam filters are out of date and no longer effective. It has asked whether you can use the new data it supplied to accurately distinguish spam from regular messages. It has also stressed that regular messages must rarely, if ever, be categorized as spam.
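One way to make the last requirement measurable (an interpretation, not something the brief states) is to track precision on the spam class and the false positive rate on ham, rather than accuracy alone. The sketch below computes both in plain Python; the function names and the toy labels are hypothetical.

```python
def spam_confusion(y_true, y_pred):
    """Count outcomes, treating "spam" as the positive class."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == "spam" and truth == "spam":
            tp += 1
        elif pred == "spam" and truth == "ham":
            fp += 1  # a regular message wrongly flagged as spam
        elif pred == "ham" and truth == "ham":
            tn += 1
        else:
            fn += 1  # a spam message that slipped through
    return tp, fp, tn, fn


def spam_precision(tp, fp):
    """Share of messages flagged as spam that really are spam."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def ham_false_positive_rate(fp, tn):
    """Share of regular messages that were wrongly flagged as spam."""
    return fp / (fp + tn) if (fp + tn) else 0.0


# Toy labels for illustration only.
y_true = ["ham", "ham", "spam", "spam", "ham", "spam", "ham", "ham"]
y_pred = ["ham", "spam", "spam", "ham", "ham", "spam", "ham", "ham"]

tp, fp, tn, fn = spam_confusion(y_true, y_pred)
precision = spam_precision(tp, fp)
fpr = ham_false_positive_rate(fp, tn)
print(f"precision={precision:.2f}, ham false positive rate={fpr:.2f}")
```

Under this reading, a model with slightly lower recall but near-zero false positives on ham would be preferred, since a missed spam message is less costly than a blocked legitimate one.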
Objective: Build a Streamlit web app that detects spam accurately.
Techniques Used: Exploratory Data Analysis, Data Visualization, Predictive Modeling, Web Frameworks, RESTful APIs, Containerization.
Type of Problem: Binary Classification
Language, Libraries, and Technologies Used: Python, pandas, Matplotlib, Seaborn, NumPy, wordcloud, string, NLTK, scikit-learn, pickle, Docker, Flask, Streamlit.
Source of the dataset: This corpus was created by Tiago A. Almeida and José María Gómez Hidalgo.
Citations:
- Almeida, T.A., Gómez Hidalgo, J.M., Yamakami, A. Contributions to the Study of SMS Spam Filtering: New Collection and Results. Proceedings of the 2011 ACM Symposium on Document Engineering (DOCENG'11), Mountain View, CA, USA, 2011.
- Gómez Hidalgo, J.M., Almeida, T.A., Yamakami, A. On the Validity of a New SMS Spam Collection. Proceedings of the 11th IEEE International Conference on Machine Learning and Applications (ICMLA'12), Boca Raton, FL, USA, 2012.
- Almeida, T.A., Gómez Hidalgo, J.M., Silva, T.P. Towards SMS Spam Filtering: Results under a New Dataset. International Journal of Information Security Science (IJISS), 2(1), 1-18, 2013.