Phishing model and data updates #462

Merged · 10 commits · Nov 18, 2022
4 changes: 2 additions & 2 deletions examples/data/email.jsonlines
Git LFS file not shown
4 changes: 2 additions & 2 deletions examples/data/email_with_addresses.jsonlines
Git LFS file not shown
10 changes: 5 additions & 5 deletions models/README.md
@@ -61,19 +61,19 @@ Well-Read Students Learn Better: On the Importance of Pre-training Compact Models

 ## Phishing Email Detection
 ### Model Overview
-Phishing email detection is a binary classifier differentiating between phishing and non-phishing emails.
+Phishing email detection is a binary classifier differentiating between phishing/spam and non-phishing/spam emails and SMS messages.
 ### Model Architecture
 BERT-base uncased transformer model
 ### Training
-Training consisted of fine-tuning the original pretrained [model from Google](https://huggingface.co/bert-base-uncased). The labeled training dataset is around 20,000 emails from three public datasets ([CLAIR](https://www.kaggle.com/datasets/rtatman/fraudulent-email-corpus), [SPAM_ASSASSIN](https://spamassassin.apache.org/old/publiccorpus/readme.html), [Enron](https://www.cs.cmu.edu/~./enron/)).
+Training consisted of fine-tuning the original pretrained [model from Google](https://huggingface.co/bert-base-uncased). The labeled training dataset is around 5,000 SMS messages from a public dataset, the [SMS Spam Collection](https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection).
 ### How To Use This Model
-This model is an example of customized transformer-based phishing email detection. It can be further fine-tuned for specific detection needs and customized to the emails of your enterprise using the fine-tuning scripts in the repo.
+This model is an example of customized transformer-based phishing email detection. It can be retrained for specific detection needs and customized to the emails of your enterprise using the training scripts in the repo.
 #### Input
 Entire email as a string
 #### Output
-Binary sequence classification as phishing or non-phishing
+Binary sequence classification as phishing/spam or non-phishing/spam
 ### References
 - Radev, D. (2008), CLAIR collection of fraud email, ACL Data and Code Repository, ADCR2008T001, http://aclweb.org/aclwiki
 - SMS Spam Collection, https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection
 - Devlin, J. et al. (2018), BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, https://arxiv.org/abs/1810.04805
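
For context, the flow the updated README describes maps onto the `BinarySequenceClassifier` API shown in the classifier diff below. A minimal sketch, assuming the CLX package layout from the docstrings; the message text is made up, and the `max_seq_len`/`batch_size`/`threshold` values are simply the defaults from `predict`:

```python
import cudf
from clx.analytics.binary_sequence_classifier import BinarySequenceClassifier

sc = BinarySequenceClassifier()
sc.init_model("bert-base-uncased")  # or a path to a locally fine-tuned checkpoint

# Input spec: the entire email (or SMS) body as a string.
messages = cudf.Series([
    "Congratulations, you have won a prize! Reply with your bank details.",
    "Hi team, tomorrow's meeting is moved to 3pm.",
])

# Output spec: boolean phishing/spam labels plus positive-class probabilities.
preds, probs = sc.predict(messages, max_seq_len=128, batch_size=32, threshold=0.5)
```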

3 changes: 0 additions & 3 deletions models/phishing-models/phishing-bert-20211006.onnx

This file was deleted.

3 changes: 0 additions & 3 deletions models/phishing-models/phishing-bert-20211006.pt

This file was deleted.

3 changes: 3 additions & 0 deletions models/phishing-models/phishing-bert-20221115.onnx
Git LFS file not shown
3 changes: 3 additions & 0 deletions models/phishing-models/phishing-bert-20221115.pt
Git LFS file not shown
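
The refreshed `phishing-bert-20221115` artifacts replace the 2021 checkpoints one-for-one. A quick, hedged sanity check that the new ONNX export loads; the graph's input names and shapes depend on the export settings, so inspect rather than assume:

```python
import onnxruntime as ort

sess = ort.InferenceSession("models/phishing-models/phishing-bert-20221115.onnx")
for inp in sess.get_inputs():
    print(inp.name, inp.shape, inp.type)  # typically input_ids / attention_mask for BERT
```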
@@ -38,17 +38,13 @@ class BinarySequenceClassifier(SequenceClassifier):
     def init_model(self, model_or_path):
         """
         Load model from huggingface or locally saved model.
-
         :param model_or_path: huggingface pretrained model name or directory path to model
         :type model_or_path: str
-
         Examples
         --------
         >>> from clx.analytics.binary_sequence_classifier import BinarySequenceClassifier
         >>> sc = BinarySequenceClassifier()
-
         >>> sc.init_model("bert-base-uncased")  # huggingface pre-trained model
-
         >>> sc.init_model(model_path)  # locally saved model
         """
         self._model = AutoModelForSequenceClassification.from_pretrained(model_or_path)
@@ -65,7 +61,6 @@ def init_model(self, model_or_path):
     def predict(self, input_data, max_seq_len=128, batch_size=32, threshold=0.5):
         """
         Predict the class with the trained model
-
         :param input_data: input text data for prediction
         :type input_data: cudf.Series
         :param max_seq_len: Limits the length of the sequence returned by tokenizer. If tokenized sentence is shorter
@@ -78,7 +73,6 @@ def predict(self, input_data, max_seq_len=128, batch_size=32, threshold=0.5):
         :type threshold: float
         :return: predictions, probabilities: predictions are labels (0 or 1) based on minimum threshold
         :rtype: cudf.Series, cudf.Series
-
         Examples
         --------
         >>> from cuml.preprocessing.model_selection import train_test_split
@@ -95,20 +89,23 @@ def predict(self, input_data, max_seq_len=128, batch_size=32, threshold=0.5):
         predict_dataset = Dataset(predict_gdf)
         predict_dataloader = DataLoader(predict_dataset, batchsize=batch_size)
 
-        preds = cudf.Series()
-        probs = cudf.Series()
+        preds_l = []
+        probs_l = []
 
         self._model.eval()
         for df in predict_dataloader.get_chunks():
             b_input_ids, b_input_mask = self._bert_uncased_tokenize(df["text"], max_seq_len)
             with torch.no_grad():
                 logits = self._model(b_input_ids, token_type_ids=None, attention_mask=b_input_mask)[0]
                 b_probs = torch.sigmoid(logits[:, 1])
-                b_preds = b_probs.ge(threshold)
+                b_preds = b_probs.ge(threshold).type(torch.int8)
 
             b_probs = cudf.io.from_dlpack(to_dlpack(b_probs))
-            b_preds = cudf.io.from_dlpack(to_dlpack(b_preds))
-            preds = preds.append(b_preds)
-            probs = probs.append(b_probs)
+            b_preds = cudf.io.from_dlpack(to_dlpack(b_preds)).astype("boolean")
+            preds_l.append(b_preds)
+            probs_l.append(b_probs)
 
+        preds = cudf.concat(preds_l)
+        probs = cudf.concat(probs_l)
+
         return preds, probs
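
Two details of this hunk are worth calling out for reviewers: per-batch results are now collected in Python lists and concatenated once with `cudf.concat`, replacing the deprecated `cudf.Series.append` pattern, and the threshold mask is cast to `int8` before the DLPack hand-off (presumably because DLPack lacked a boolean dtype at the time) and then restored to cudf's nullable `"boolean"` dtype. A standalone sketch of both patterns with made-up logits (requires a GPU):

```python
import cudf
import torch
from torch.utils.dlpack import to_dlpack

preds_l = []
for _ in range(3):  # stand-in for the per-batch prediction loop
    probs = torch.sigmoid(torch.randn(4, device="cuda"))
    # Cast the bool mask to int8 so it survives the DLPack transfer...
    mask = probs.ge(0.5).type(torch.int8)
    # ...then restore a nullable boolean dtype on the cudf side.
    preds_l.append(cudf.io.from_dlpack(to_dlpack(mask)).astype("boolean"))

preds = cudf.concat(preds_l)  # one concatenation instead of repeated append calls
print(preds)
```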
