Phishing model and data updates #462

Merged · 10 commits · Nov 18, 2022
4 changes: 2 additions & 2 deletions examples/data/email.jsonlines
Git LFS file not shown
4 changes: 2 additions & 2 deletions examples/data/email_with_addresses.jsonlines
Git LFS file not shown
10 changes: 5 additions & 5 deletions models/README.md
@@ -61,19 +61,19 @@ Well-Read Students Learn Better: On the Importance of Pre-training Compact Models

 ## Phishing Email Detection
 ### Model Overview
-Phishing email detection is a binary classifier differentiating between phishing and non-phishing emails.
+Phishing email detection is a binary classifier differentiating between phishing/spam and non-phishing/spam emails and SMS messages.
 ### Model Architecture
 BERT-base uncased transformer model
 ### Training
-Training consisted of fine-tuning the original pretrained [model from Google](https://huggingface.co/bert-base-uncased). The labeled training dataset is around 20,000 emails from three public datasets ([CLAIR](https://www.kaggle.com/datasets/rtatman/fraudulent-email-corpus), [SPAM_ASSASSIN](https://spamassassin.apache.org/old/publiccorpus/readme.html), [Enron](https://www.cs.cmu.edu/~./enron/)).
+Training consisted of fine-tuning the original pretrained [model from Google](https://huggingface.co/bert-base-uncased). The labeled training dataset is around 5,000 SMS messages from a public dataset, the [SMS Spam Collection](https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection).
 ### How To Use This Model
-This model is an example of customized transformer-based phishing email detection. It can be further fine-tuned for specific detection needs and customized to the emails of your enterprise using the fine-tuning scripts in the repo.
+This model is an example of customized transformer-based phishing email detection. It can be retrained for specific detection needs and customized to the emails of your enterprise using the training scripts in the repo.
 #### Input
 Entire email as a string
 #### Output
-Binary sequence classification as phishing or non-phishing
+Binary sequence classification as phishing/spam or non-phishing/spam
 ### References
 - Radev, D. (2008), CLAIR collection of fraud email, ACL Data and Code Repository, ADCR2008T001, http://aclweb.org/aclwiki
 - SMS Spam Collection, https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection
 - Devlin, J. et al. (2018), BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, https://arxiv.org/abs/1810.04805
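
For context, the flow the updated README describes maps onto the `BinarySequenceClassifier` API shown in the classifier diff below. A minimal sketch, assuming the CLX package layout from the docstrings; the message text is made up, and the `max_seq_len`/`batch_size`/`threshold` values are simply the defaults from `predict`:

```python
import cudf
from clx.analytics.binary_sequence_classifier import BinarySequenceClassifier

sc = BinarySequenceClassifier()
sc.init_model("bert-base-uncased")  # or a path to a locally fine-tuned checkpoint

# Input spec: the entire email (or SMS) body as a string.
messages = cudf.Series([
    "Congratulations, you have won a prize! Reply with your bank details.",
    "Hi team, tomorrow's meeting is moved to 3pm.",
])

# Output spec: boolean phishing/spam labels plus positive-class probabilities.
preds, probs = sc.predict(messages, max_seq_len=128, batch_size=32, threshold=0.5)
```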

3 changes: 0 additions & 3 deletions models/phishing-models/phishing-bert-20211006.onnx

This file was deleted.

3 changes: 0 additions & 3 deletions models/phishing-models/phishing-bert-20211006.pt

This file was deleted.

3 changes: 3 additions & 0 deletions models/phishing-models/phishing-bert-20221115.onnx
Git LFS file not shown
3 changes: 3 additions & 0 deletions models/phishing-models/phishing-bert-20221115.pt
Git LFS file not shown
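
The refreshed `phishing-bert-20221115` artifacts replace the 2021 checkpoints one-for-one. A quick, hedged sanity check that the new ONNX export loads; the graph's input names and shapes depend on the export settings, so inspect rather than assume:

```python
import onnxruntime as ort

sess = ort.InferenceSession("models/phishing-models/phishing-bert-20221115.onnx")
for inp in sess.get_inputs():
    print(inp.name, inp.shape, inp.type)  # typically input_ids / attention_mask for BERT
```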
@@ -38,17 +38,13 @@ class BinarySequenceClassifier(SequenceClassifier):
     def init_model(self, model_or_path):
         """
         Load model from huggingface or locally saved model.
-
         :param model_or_path: huggingface pretrained model name or directory path to model
         :type model_or_path: str
-
         Examples
         --------
         >>> from clx.analytics.binary_sequence_classifier import BinarySequenceClassifier
         >>> sc = BinarySequenceClassifier()
-
         >>> sc.init_model("bert-base-uncased")  # huggingface pre-trained model
-
         >>> sc.init_model(model_path)  # locally saved model
         """
         self._model = AutoModelForSequenceClassification.from_pretrained(model_or_path)
@@ -65,7 +61,6 @@ def init_model(self, model_or_path):
     def predict(self, input_data, max_seq_len=128, batch_size=32, threshold=0.5):
         """
         Predict the class with the trained model
-
         :param input_data: input text data for prediction
         :type input_data: cudf.Series
         :param max_seq_len: Limits the length of the sequence returned by tokenizer. If tokenized sentence is shorter
@@ -78,7 +73,6 @@ def predict(self, input_data, max_seq_len=128, batch_size=32, threshold=0.5):
         :type threshold: float
         :return: predictions, probabilities: predictions are labels (0 or 1) based on minimum threshold
         :rtype: cudf.Series, cudf.Series
-
         Examples
         --------
         >>> from cuml.preprocessing.model_selection import train_test_split
@@ -95,20 +89,23 @@ def predict(self, input_data, max_seq_len=128, batch_size=32, threshold=0.5):
         predict_dataset = Dataset(predict_gdf)
         predict_dataloader = DataLoader(predict_dataset, batchsize=batch_size)
 
-        preds = cudf.Series()
-        probs = cudf.Series()
+        preds_l = []
+        probs_l = []
 
         self._model.eval()
         for df in predict_dataloader.get_chunks():
             b_input_ids, b_input_mask = self._bert_uncased_tokenize(df["text"], max_seq_len)
             with torch.no_grad():
                 logits = self._model(b_input_ids, token_type_ids=None, attention_mask=b_input_mask)[0]
                 b_probs = torch.sigmoid(logits[:, 1])
-                b_preds = b_probs.ge(threshold)
+                b_preds = b_probs.ge(threshold).type(torch.int8)
 
             b_probs = cudf.io.from_dlpack(to_dlpack(b_probs))
-            b_preds = cudf.io.from_dlpack(to_dlpack(b_preds))
-            preds = preds.append(b_preds)
-            probs = probs.append(b_probs)
+            b_preds = cudf.io.from_dlpack(to_dlpack(b_preds)).astype("boolean")
+            preds_l.append(b_preds)
+            probs_l.append(b_probs)
 
+        preds = cudf.concat(preds_l)
+        probs = cudf.concat(probs_l)
+
         return preds, probs
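
Two details of this hunk are worth calling out for reviewers: per-batch results are now collected in Python lists and concatenated once with `cudf.concat`, replacing the deprecated `cudf.Series.append` pattern, and the threshold mask is cast to `int8` before the DLPack hand-off (presumably because DLPack lacked a boolean dtype at the time) and then restored to cudf's nullable `"boolean"` dtype. A standalone sketch of both patterns with made-up logits (requires a GPU):

```python
import cudf
import torch
from torch.utils.dlpack import to_dlpack

preds_l = []
for _ in range(3):  # stand-in for the per-batch prediction loop
    probs = torch.sigmoid(torch.randn(4, device="cuda"))
    # Cast the bool mask to int8 so it survives the DLPack transfer...
    mask = probs.ge(0.5).type(torch.int8)
    # ...then restore a nullable boolean dtype on the cudf side.
    preds_l.append(cudf.io.from_dlpack(to_dlpack(mask)).astype("boolean"))

preds = cudf.concat(preds_l)  # one concatenation instead of repeated append calls
print(preds)
```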
