Create generative model #146

Open
3 of 8 tasks
david-thrower opened this issue Jan 11, 2024 · 1 comment

david-thrower (Owner) commented Jan 11, 2024

Based on the provided information, here's a step-by-step guide to address the issues and enhance the model:

  1. Create a new branch from the specified commit:
  2. Duplicate the GPT-2 based phishing classification notebook for this project.
  3. Modify the final layer of the model:
    1. Replace the Dense(1) layer with a Dense(VOCAB_SIZE) layer (softmax activation) and change the loss function to sparse_categorical_crossentropy, since the labels will be integer token ids; see the first sketch after this list.
  4. Handle de-tokenization without clashing with the crossentropy loss (see the second sketch after this list):
  5. Prepare the training dataset: In progress
  6. Create a function that takes a prompt and an expected response and generates a list of text sequences, each with one more word of the expected response cumulatively appended:
    1. E.g. the dataset {"prompt": "Write a haiku about integrity: ", "response": "Integrity is great\nIntegrity is nice.\nIntegrity will get you far"} will become these data and labels: data = ["Write a haiku about integrity:", "Write a haiku about integrity: Integrity", "Write a haiku about integrity: Integrity is", ... ]; labels = ["Integrity", "is", "great", ... ]
  7. Create a benchmark training set, structured like the above.
  8. Prototype a benchmark training loop (see the first sketch below).
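
For step 3.1 and the benchmark loop in step 8, a minimal sketch, assuming a small Keras stand-in model; VOCAB_SIZE, SEQ_LEN, base_model, and the random tensors are placeholders rather than the notebook's actual objects:

import numpy as np
import tensorflow as tf

VOCAB_SIZE = 5000  # placeholder: use the tokenizer's actual vocabulary size
SEQ_LEN = 32       # placeholder: padded input length

# Stand-in for the GPT-2 based feature extractor in the phishing notebook
base_model = tf.keras.Sequential([
    tf.keras.layers.Embedding(VOCAB_SIZE, 128),
    tf.keras.layers.GlobalAveragePooling1D(),
])

model = tf.keras.Sequential([
    base_model,
    # The classifier's Dense(1) becomes Dense(VOCAB_SIZE) with softmax,
    # so the model predicts a distribution over the next token.
    tf.keras.layers.Dense(VOCAB_SIZE, activation="softmax"),
])

# Labels are integer token ids, hence sparse_categorical_crossentropy.
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

# Step 8: benchmark training loop on random stand-in data.
x = np.random.randint(0, VOCAB_SIZE, size=(64, SEQ_LEN))
y = np.random.randint(0, VOCAB_SIZE, size=(64,))
model.fit(x, y, epochs=2, batch_size=16)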
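
For step 4, one approach that avoids clashing with the loss: keep training on integer ids and only map back to words at inference time. A sketch assuming a Keras TextVectorization layer stands in for the tokenizer (any tokenizer that exposes its vocabulary works the same way):

import numpy as np
import tensorflow as tf

# Hypothetical tokenizer; the notebook's actual tokenizer would replace this.
# Note: TextVectorization reserves index 0 for padding and 1 for OOV,
# which is worth accounting for when computing VOCAB_SIZE.
vectorizer = tf.keras.layers.TextVectorization()
vectorizer.adapt(["Integrity is great", "Integrity is nice"])
vocab = vectorizer.get_vocabulary()  # token id -> token string


def detokenize(probs: np.ndarray) -> str:
    """Greedy decode: argmax over the softmax output, then look up the word."""
    token_id = int(np.argmax(probs))
    return vocab[token_id]
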
david-thrower (Owner, Author) commented Jan 12, 2024

A prototype of packaging the text samples:

import nltk

# TextSample: container for one prompt/response pair


class TextSample:
    def __init__(self, prompt: str, response: str):
        self.prompt = prompt
        self.response = response

# Example data
sample1 = TextSample(prompt="Tell me all about the capital of France",
                     response="Paris is known as the city of love")
sample2 = TextSample(prompt="Write a haiku about life",
                     response="Life blows.\nYou go to school.\nYou go to work")
sample3 = TextSample(prompt="Write an ode to Silence:",
                     response="Silence is awesome. Silence is rare. Silence is beauty. Silence is nowhere.")
samples = [sample1, sample2, sample3]

# Accumulators for data and labels (may want to change to a dict for scalability)
data = []
labels = []


def split_string(text: str) -> list:
    """Tokenize text into words, downloading punkt on first use if needed."""
    try:
        words = nltk.word_tokenize(text)
    except LookupError as err:
        print(f"Looks like punkt is missing:\n {err}\n"
              "Downloading punkt to try resolving this:")
        nltk.download('punkt')
        words = nltk.word_tokenize(text)

    return words


def create_data_and_labels(samples):
    for sample in samples:
        response_words = split_string(sample.response)
        # Start each sequence from the bare prompt, then cumulatively
        # append one response word per step; the label is the next word.
        # (The condition must be on the word index j, not the sample index i.)
        sequence = sample.prompt
        for j, word in enumerate(response_words):
            if j > 0:
                sequence += f" {response_words[j - 1]}"
            data.append(sequence)
            labels.append(word)

# Test case:
create_data_and_labels(samples)

for i, (data_i, label_i) in enumerate(zip(data, labels)):
    print(f"Sample: {i}")
    print(data_i)
    print(f"Label {i}:")
    print(label_i)
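
For reference, with the word-index fix in create_data_and_labels, the first two printed samples should look like this (tokenization follows nltk.word_tokenize):

Sample: 0
Tell me all about the capital of France
Label 0:
Paris
Sample: 1
Tell me all about the capital of France Paris
Label 1:
is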
