This repository contains code for a spam email classifier developed as part of my Machine Learning internship at Digital Empowerment Network. The goal of this project is to classify emails as spam or ham (not spam) using machine learning techniques.
The dataset used for this project is the mail_data.csv file, which contains the following columns:
- Category: The label of the email, either 'spam' or 'ham'.
- Message: The content of the email.
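As a minimal sketch of loading the file and checking it against the two columns described above, the snippet below uses pandas. The function name `load_mail_data` and the column check are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

# Columns documented for mail_data.csv.
EXPECTED_COLUMNS = ["Category", "Message"]


def load_mail_data(path: str = "mail_data.csv") -> pd.DataFrame:
    """Load the dataset and verify that the documented columns are present.

    `load_mail_data` is a hypothetical helper written for this example.
    """
    df = pd.read_csv(path)
    missing = [col for col in EXPECTED_COLUMNS if col not in df.columns]
    if missing:
        raise ValueError(f"{path} is missing expected columns: {missing}")
    # Keep only the two documented columns, in a fixed order.
    return df[EXPECTED_COLUMNS]
```

Calling `load_mail_data()["Category"].value_counts()` afterwards shows how many rows carry each label, which is worth checking before training.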
The project involves the following steps:
- Cleaning the data and preparing it for model training.
- Transforming the text data into numerical features using TF-IDF vectorization.
- Training a Logistic Regression model to classify emails.
- Evaluating the model's performance using accuracy, classification report, and confusion matrix.
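The cleaning, vectorization, and training steps above can be sketched with scikit-learn roughly as follows. This is an illustration under stated assumptions rather than the notebook's exact code: the helper names `clean` and `train`, the 80/20 split, the random seed, the `spam=1` / `ham=0` encoding, and the vectorizer settings are all choices made for this example.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split


def clean(df: pd.DataFrame) -> pd.DataFrame:
    """Fill missing messages and encode labels as integers (assumed: spam=1, ham=0)."""
    out = df.copy()
    out["Message"] = out["Message"].fillna("").astype(str)
    out["label"] = (
        out["Category"].astype(str).str.strip().str.lower().map({"ham": 0, "spam": 1})
    )
    # Drop rows whose label is neither 'spam' nor 'ham'.
    out = out.dropna(subset=["label"]).copy()
    out["label"] = out["label"].astype(int)
    return out


def train(df: pd.DataFrame, test_size: float = 0.2, seed: int = 42):
    """Split the data, fit TF-IDF on the training split, and train the classifier."""
    data = clean(df)
    X_train, X_test, y_train, y_test = train_test_split(
        data["Message"],
        data["label"],
        test_size=test_size,
        random_state=seed,
        stratify=data["label"],  # keep the spam/ham ratio equal in both splits
    )
    # Fit the vectorizer on training messages only, so the test split stays unseen.
    vectorizer = TfidfVectorizer(lowercase=True, stop_words="english")
    X_train_tfidf = vectorizer.fit_transform(X_train)
    X_test_tfidf = vectorizer.transform(X_test)

    model = LogisticRegression(max_iter=1000)
    model.fit(X_train_tfidf, y_train)
    return model, vectorizer, (X_train_tfidf, y_train), (X_test_tfidf, y_test)
```

To classify a new message with the returned objects, transform it with the same fitted vectorizer first, for example `model.predict(vectorizer.transform(["Claim your free prize"]))`.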
The Logistic Regression model achieved the following results:
- Accuracy on training data: 0.967
- Accuracy on test data: 0.965
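Figures like these can be computed with scikit-learn's metric functions. The sketch below is one possible way to do it, not necessarily how the notebook does. The helper name `evaluate` is hypothetical, and it assumes labels are encoded as integers with `0` for ham and `1` for spam.

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix


def evaluate(model, X, y, label_names=("ham", "spam")):
    """Return accuracy, a text classification report, and the confusion matrix.

    Assumes `model` has a `predict` method and labels are 0 (ham) / 1 (spam).
    """
    y_pred = model.predict(X)
    return {
        "accuracy": accuracy_score(y, y_pred),
        "report": classification_report(
            y, y_pred, labels=[0, 1], target_names=list(label_names), zero_division=0
        ),
        # Rows are true classes, columns are predicted classes, ordered [ham, spam].
        "confusion_matrix": confusion_matrix(y, y_pred, labels=[0, 1]),
    }
```

Running `evaluate` once on the training split and once on the test split gives the two accuracy values side by side. The report and confusion matrix also show per-class errors that a single accuracy number hides.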
To run the code, follow these steps:
- Clone this repository to your local machine.
- Navigate to the directory containing the code.
- Ensure that the mail_data.csv file is in the same directory as the code.
- Open the notebook in Jupyter and run all cells: jupyter notebook spam_email_classifier.ipynb
This project demonstrates the process of building a spam email classifier using TF-IDF features and Logistic Regression. The model reaches about 96.5% accuracy on the test data, close to its 96.7% training accuracy, which suggests little overfitting. Future improvements could include experimenting with different models and techniques to further improve accuracy.
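As a starting point for trying other models, the sketch below compares a few scikit-learn classifiers on the same TF-IDF features using cross-validation. The candidate list, the helper name `compare_models`, and the fold count are assumptions chosen for illustration. No results are implied for this dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC


def compare_models(messages, labels, cv=5):
    """Return the mean cross-validated accuracy for a few candidate classifiers."""
    candidates = {
        "logistic_regression": LogisticRegression(max_iter=1000),
        "naive_bayes": MultinomialNB(),
        "linear_svm": LinearSVC(),
    }
    scores = {}
    for name, clf in candidates.items():
        # The vectorizer lives inside the pipeline so it is refit on each training fold.
        pipeline = make_pipeline(TfidfVectorizer(stop_words="english"), clf)
        fold_scores = cross_val_score(pipeline, messages, labels, cv=cv, scoring="accuracy")
        scores[name] = fold_scores.mean()
    return scores
```

Here `messages` is a list or Series of raw email texts and `labels` holds the matching 0/1 classes. The function returns a dictionary mapping each model name to its mean accuracy.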