FalsettoAI/Hydra

Read This Before Editing Data

This file contains specifics for organizing and editing data. Please read it carefully as this is very important for creating good models.

Data Guidelines

  • YOU MUST LOOK THROUGH EVERY SENTENCE PUT INTO THE MODEL. Don't just generate 150 sentences and call it good without checking them against the guidelines below.
  • If a sentence you put into one intent could also go into another, you must instead create a new, specific label that captures the intent. For example, prompted-name sentences can go into every order and reservation intent, so we created the add-info intent.
  • Do not input wacky sentences generated by an LLM that nobody will ever say. The odd language and extra words can fuck up the model.
    • "yo dont trip fam lemme get a table for this wily crew" - Claude (yes it actually wrote that)
  • However, broken-English sentences are very good for training. These tend not to fuck with the model, and they help it understand the general notion of the intent.
    • "i need table for tonight"
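
One way to catch sentences that sit in more than one intent is to look for repeats across the intent folders. The sketch below is only an illustration: it assumes one sentence per line in `.txt` files, with the parent folder name acting as the intent label. The function name `find_cross_intent_duplicates` and that layout are hypothetical, not something taken from the existing helper files. It only flags identical sentences (ignoring case and extra spaces), so it does not replace reading every sentence yourself.

```python
from collections import defaultdict
from pathlib import Path


def normalize(sentence: str) -> str:
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(sentence.lower().split())


def find_cross_intent_duplicates(data_root: str) -> dict[str, set[str]]:
    """Return sentences that appear under two or more intent folders.

    Assumed (hypothetical) layout: data_root/<Intent-Folder>/*.txt,
    one sentence per line.
    """
    seen: dict[str, set[str]] = defaultdict(set)
    for txt_file in Path(data_root).glob("*/*.txt"):
        intent = txt_file.parent.name
        for line in txt_file.read_text(encoding="utf-8").splitlines():
            key = normalize(line)
            if key:  # skip blank lines
                seen[key].add(intent)
    # Keep only sentences found in more than one intent.
    return {s: intents for s, intents in seen.items() if len(intents) > 1}
```

Anything this returns is a candidate for a more specific label, the same way prompted-name sentences ended up in add-info.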

Organization

I am going to go through each folder and its purpose. The first three are unique; the rest hold sentence data points grouped by intent.

  1. processing
    • Stores all Python files used for processing the data.
    • processing.py contains functions that write the data out in the format required by both the intent and NER models.
    • data_helper_functions.py contains every other function we use to process data.
      • For example, deleting duplicates or removing any line that contains a specific phrase.
      • If you ever need something like this done, check this file for the function first. If it is not there, write your own function and add an explanation of it so others can use it.
  2. Filler-Data
    • Holds files with filler data for dynamic sentences. Our processing files automatically insert random lines from these files into the empty labels in dynamic sentences.
  3. Final-Datasets
    • Stores the final .json output for each new model.
  4. Out_of_Scope
    • Out-of-scope intent sentences.
    • data_full.json is a file of sentences I found online. out_of_scope_processing.py removes any intents in it that could interfere with other files and prints the necessary sentences into a .txt file.
    • If anything is incorrectly classified as out of scope, check this file for conflicting sentences.
  5. Add-Info
    • Prompted inputs such as name, date, and time.
    • Allows these inputs to contribute to multiple intent pipelines.
  6. Change-Info
    • If a user wants to change the name on their order or reservation, the change needs to apply to both pipelines. Since we don't have multi-intent, a new label is necessary.
  7. Confirm-Deny and Greeting-Farewell
    • Pretty self-explanatory.
  8. Inquiry, Order, and Reservation
    • Groupings of intent data.
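
To make the Filler-Data idea concrete, here is a rough sketch of filling the empty labels in a dynamic sentence. The real format used by the processing files is not described here, so everything specific is an assumption: the `{label}` placeholder syntax, the `fillers` mapping from label to filler lines, and the function name `fill_dynamic_sentence` are all hypothetical.

```python
import random
import re

# Hypothetical placeholder syntax for an empty label: "{name}", "{time}", ...
PLACEHOLDER = re.compile(r"\{(\w+)\}")


def fill_dynamic_sentence(sentence, fillers, rng=random):
    """Replace each {label} with a random line from fillers[label].

    Placeholders without a filler entry are left untouched so they
    are easy to spot when reviewing the output.
    """
    def pick(match):
        options = fillers.get(match.group(1))
        return rng.choice(options) if options else match.group(0)

    return PLACEHOLDER.sub(pick, sentence)
```

For example, with `fillers = {"name": ["Travis"], "time": ["7 pm"]}`, calling `fill_dynamic_sentence("table for {name} at {time}", fillers)` gives `"table for Travis at 7 pm"`. Passing a seeded `random.Random` as `rng` makes the output repeatable.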

Compiling Data For a New Model

  • Creating and organizing data is important, but if you mess up compiling it, none of that matters. Pay very close attention to what you are doing, and remember the data guidelines at the top of this file.
  • Gather the data from the separate files in the relevant intent folders and the model type folder, combine it all into Final.txt, and run it through processing.py to convert it into a usable format. You can store both Final.txt and the resulting usable data in the Final-Datasets folder of the model type.
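
The gathering step can be sketched roughly as follows. This is not the repo's own script: it assumes the source data lives in `.txt` files with one sentence per line, and the function name `compile_final_txt` and its parameters are hypothetical. It stops at writing Final.txt and does not call processing.py, since that file's interface is not described here.

```python
from pathlib import Path


def compile_final_txt(source_dirs, output_path, banned_phrases=()):
    """Merge every .txt under source_dirs into one file, one sentence per line.

    Drops blank lines, exact duplicates (first occurrence wins), and any
    line containing one of banned_phrases (case-insensitive).
    Returns the number of lines written.
    """
    banned = [p.lower() for p in banned_phrases]
    seen = set()
    kept = []
    for folder in source_dirs:
        # sorted() keeps the output order stable between runs
        for txt_file in sorted(Path(folder).rglob("*.txt")):
            for line in txt_file.read_text(encoding="utf-8").splitlines():
                sentence = line.strip()
                if not sentence or sentence in seen:
                    continue
                if any(p in sentence.lower() for p in banned):
                    continue
                seen.add(sentence)
                kept.append(sentence)
    Path(output_path).write_text("\n".join(kept) + "\n", encoding="utf-8")
    return len(kept)
```

Keep the output file outside the folders listed in `source_dirs`, otherwise a second run would read the previous Final.txt back in as source data.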

That's the end for now. If you have any questions, please bring them up with me (Travis) as soon as possible so I can amend the document.

Again, this is very important. Unorganized data will lead to sloppier models, wasted time, and wasted money.

About

Stores data files and processing scripts to be used later.
