SpamSlayer - Email Spam Detection System

SpamSlayer is a machine learning tool written in Python that detects spam email using the Naive Bayes and Logistic Regression algorithms.

Features

Detects spam email with high accuracy (roughly 0.98 in the sample results reported below)
Compares multiple algorithms (Naive Bayes vs Logistic Regression)
Automatic text preprocessing (removes URLs, email addresses, numbers, and special characters)
Saves and loads trained models
Tests against sample emails
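
The text preprocessing feature listed above can be sketched with a few regular expressions. This is a minimal sketch, not the project's actual code: the function name clean_text and the exact patterns are assumptions made for illustration.

```python
import re

# Hypothetical helper; the project's real preprocessing may differ.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
EMAIL_RE = re.compile(r"\S+@\S+")
NUMBER_RE = re.compile(r"\d+")
NON_LETTER_RE = re.compile(r"[^a-z\s]")
SPACE_RE = re.compile(r"\s+")


def clean_text(text: str) -> str:
    """Lowercase the text, then strip URLs, email addresses, numbers and special characters."""
    text = text.lower()
    text = URL_RE.sub(" ", text)         # URLs first, before punctuation is removed
    text = EMAIL_RE.sub(" ", text)       # then email addresses
    text = NUMBER_RE.sub(" ", text)      # then digits
    text = NON_LETTER_RE.sub(" ", text)  # finally anything that is not a-z or whitespace
    return SPACE_RE.sub(" ", text).strip()


print(clean_text("WIN $1000 now!!! Visit http://spam.example.com or mail bob@example.com"))
# prints: win now visit or mail
```

URLs and addresses are removed before the generic special-character pass; otherwise fragments such as "http" and "com" would survive as ordinary words.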

System requirements

Python 3.7+
pip (Python package manager)

Installation

1. Clone or download the project

git clone <repository-url>
cd SpamSlayer

2. Install dependencies

pip install -r requirements.txt

Or install them manually:

pip install pandas scikit-learn

3. Create the dataset directory

mkdir dataset

Data preparation

Download the Enron Spam dataset

# Download directly
wget https://github.com/MWiechmann/enron_spam_data/raw/refs/heads/master/enron_spam_data.zip -O dataset/enron_spam_data.zip

# Extract into dataset/ while staying in the project root
unzip dataset/enron_spam_data.zip -d dataset

Data format

The dataset must be a CSV file with the following columns:

Subject: Email subject line
Message: Email body
Spam/Ham: Label (spam or ham)
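
As an illustration of how a file in this format might be read, the sketch below loads the three columns with pandas and joins subject and body into one text field. The function name load_emails, the choice to concatenate Subject and Message, and the 1/0 label encoding are assumptions; the project's own load_data may work differently.

```python
import io

import pandas as pd


def load_emails(source):
    """Read the CSV and return (texts, labels); labels are 1 for spam, 0 for ham."""
    df = pd.read_csv(source)
    # Subject or Message can be empty, so replace missing values with "" before joining.
    texts = df["Subject"].fillna("") + " " + df["Message"].fillna("")
    labels = (df["Spam/Ham"].str.lower() == "spam").astype(int)
    return texts.tolist(), labels.tolist()


# Tiny in-memory stand-in for dataset/enron_spam_data.csv (invented rows).
sample = io.StringIO(
    "Subject,Message,Spam/Ham\n"
    "Meeting,See you at 10,ham\n"
    "Free prize,,spam\n"
)
texts, labels = load_emails(sample)
print(texts)
print(labels)
```

Filling missing values first matters because a single empty subject or body would otherwise turn the whole concatenated text into a missing value.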

Usage

1. Train a basic model

from spam_detector import SpamDetector

# Initialize the detector
detector = SpamDetector()

# Load the data and train
X, y = detector.load_data('dataset/enron_spam_data.csv')
results = detector.train_models(X, y)

# Save the model
detector.save_model('my_spam_model.pkl')

2. Run the full pipeline

python main.py

3. Use a saved model

from spam_detector import SpamDetector

# Load the saved model
detector = SpamDetector()
detector.load_model('best_spam_detector.pkl')

# Predict
email = "Congratulations! You've won $1000! Click here now!"
prediction = detector.predict(email)
probability = detector.predict_proba(email)

print(f"Prediction: {prediction[0]}")
print(f"Spam probability: {probability[0][1]:.2%}")
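
The index [0][1] above reads the second probability column of the first email. Assuming SpamDetector wraps a scikit-learn classifier (the README does not show its internals), the column order follows the classifier's classes_ attribute, so it can be safer to look the spam column up rather than hard-code it. A minimal sketch on invented toy data:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy data, invented for illustration only.
texts = [
    "win free money now",
    "free prize click here",
    "meeting at noon tomorrow",
    "project report attached",
]
labels = ["spam", "spam", "ham", "ham"]

pipeline = make_pipeline(TfidfVectorizer(), MultinomialNB())
pipeline.fit(texts, labels)

# classes_ is sorted, so "ham" is column 0 and "spam" is column 1 here.
spam_index = list(pipeline.classes_).index("spam")
proba = pipeline.predict_proba(["free money prize"])
print(pipeline.classes_, f"{proba[0][spam_index]:.2%}")
```

If the labels were encoded differently (for example 0 and 1 with spam as 0), the spam column would move, which is why the lookup is worth the extra line.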

Project structure

SpamSlayer/
├── main.py                 # Entry point that runs the pipeline
├── spam_detector.py        # SpamDetector class
├── utils.py                # Utility functions (test_detector)
├── data_loader.py          # Data loading and processing
├── requirements.txt        # Dependencies
├── README.md               # This guide
├── dataset/                # Data directory
│   └── enron_spam_data.csv
└── models/                 # Saved models
    └── best_spam_detector.pkl

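Customization

Change the algorithms

In spam_detector.py, edit models_config:

from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC

models_config = {
    'Naive Bayes': MultinomialNB(),
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'SVM': SVC(probability=True)  # Add SVM; probability=True enables predict_proba
}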
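
A dictionary of named estimators like this lends itself to a simple train-and-compare loop. The sketch below shows one way such a loop might look; it is not the project's train_models implementation, the toy data is invented, and the variable names scores and best_name are assumptions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB

# Toy data, invented for illustration only (1 = spam, 0 = ham).
train_texts = [
    "win free money now", "free prize click here", "cheap pills free offer",
    "meeting at noon tomorrow", "project report attached", "lunch with the team",
]
train_labels = [1, 1, 1, 0, 0, 0]
test_texts = ["free money prize", "team meeting report"]
test_labels = [1, 0]

models_config = {
    "Naive Bayes": MultinomialNB(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
}

# Fit the vectorizer on the training texts only, then reuse it for the test texts.
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_texts)
X_test = vectorizer.transform(test_texts)

scores = {}
for name, model in models_config.items():
    model.fit(X_train, train_labels)
    scores[name] = accuracy_score(test_labels, model.predict(X_test))

# Pick the model with the highest accuracy on the held-out texts.
best_name = max(scores, key=scores.get)
print(scores, "best:", best_name)
```

Any estimator with fit and predict can be added to the dictionary without touching the loop.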

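Customize TF-IDF

from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(
    stop_words='english',
    ngram_range=(1, 3),      # Change the n-gram range (unigrams to trigrams)
    max_features=100000,     # Raise the cap on the number of features
    min_df=2,                # Minimum document frequency (at least 2 documents)
    max_df=0.95              # Maximum document frequency (at most 95% of documents)
)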
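
To see what ngram_range and min_df do in practice, the sketch below fits a vectorizer on three invented documents and prints the vocabulary that survives. The documents and the smaller parameter values are chosen only to keep the output readable.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["free money now", "free money today", "meeting today"]

# ngram_range=(1, 2) adds two-word phrases; min_df=2 keeps only terms
# that appear in at least two documents.
tfidf = TfidfVectorizer(ngram_range=(1, 2), min_df=2)
X = tfidf.fit_transform(docs)

print(sorted(tfidf.vocabulary_))  # ['free', 'free money', 'money', 'today']
print(X.shape)                    # (3, 4)
```

Terms such as "now" and "meeting" occur in a single document and are dropped by min_df=2, while the bigram "free money" is kept because it occurs in two.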

Sample results

Model comparison results:

              Model  Accuracy  Precision  Recall  F1-Score
        Naive Bayes    0.9756     0.9654  0.9724    0.9689
Logistic Regression    0.9834     0.9789  0.9801    0.9795

Best model: Logistic Regression
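
For reference, the four metrics in the table can be computed from true and predicted labels as sketched below. The README does not say whether its figures are for the spam class or averaged over both classes; this sketch assumes spam is the positive class, and the labels are invented.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Return accuracy, precision, recall and F1 for the positive (spam) class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)  # spam caught
    fp = sum(t != positive and p == positive for t, p in pairs)  # ham flagged as spam
    fn = sum(t == positive and p != positive for t, p in pairs)  # spam missed
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Invented labels: 1 = spam, 0 = ham.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
print(binary_metrics(y_true, y_pred))
# prints: {'accuracy': 0.75, 'precision': 0.75, 'recall': 0.75, 'f1': 0.75}
```

For a spam filter, precision reflects how often a flagged email really is spam, and recall reflects how much of the spam is caught.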

Testing with Google Colab

See the file Bản sao của SpamSlayer.ipynb to run the project on Google Colab with data from Google Drive.

Troubleshooting

"File not found" error

Check the path to the CSV file
Make sure the dataset file has been downloaded

Import errors

pip install --upgrade scikit-learn pandas

Encoding errors

import pandas as pd
df = pd.read_csv('dataset/enron_spam_data.csv', encoding='utf-8')
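
If the file is not valid UTF-8, one possible workaround is to try a second encoding when the first fails. The sketch below is an assumption-laden illustration, not part of the project: the helper name read_csv_with_fallback and the choice of latin-1 as the fallback are hypothetical.

```python
import os
import tempfile

import pandas as pd


def read_csv_with_fallback(path, encodings=("utf-8", "latin-1")):
    """Try each encoding in turn and return the first DataFrame that decodes."""
    last_error = None
    for encoding in encodings:
        try:
            return pd.read_csv(path, encoding=encoding)
        except UnicodeDecodeError as error:
            last_error = error  # remember the failure and try the next encoding
    raise last_error


# Demo: a file written as Latin-1 is not valid UTF-8, so the fallback is used.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.csv")
    with open(path, "wb") as handle:
        handle.write("Subject,Message,Spam/Ham\ncafé,hello,ham\n".encode("latin-1"))
    df = read_csv_with_fallback(path)

print(df.loc[0, "Subject"])
```

Because latin-1 can decode any byte sequence, it never raises, but it may produce wrong characters if the file actually uses some other encoding, so the result is worth a quick visual check.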

Contributing

Fork the project
Create a feature branch (git checkout -b feature/AmazingFeature)
Commit your changes (git commit -m 'Add some AmazingFeature')
Push the branch (git push origin feature/AmazingFeature)
Open a Pull Request

License

Distributed under the MIT License. See LICENSE for more information.

Contact

Author: Thanh Toàn
Email: lathanhtoan01@gmail.com
