I finally got perfect labels (classification task) via prompting : r/LocalLLaMA #643
Related issues
- #456: Baseline benchmark for 17 coding models : r/LocalLLaMA (similarity 0.88)
- #309: openai/human-eval: Code for the paper "Evaluating Large Language Models Trained on Code" (similarity 0.86)
- #134: marker: Convert PDF to markdown quickly with high accuracy (similarity 0.86)
- #369: "You are a helpful AI assistant" : r/LocalLLaMA (similarity 0.86)
- #499: marella/ctransformers: Python bindings for the Transformer models implemented in C/C++ using GGML library (similarity 0.85)
- #626: classifiers/README.md at main · blockentropy/classifiers (similarity 0.85)
TITLE
I finally got perfect labels (classification task) via prompting : r/LocalLLaMA
DESCRIPTION
"I finally got perfect labels (classification task) via prompting
Tutorial | Guide
It took me weeks of trial and error, but here are my biggest lessons:
For some reason, fine-tuned/DPO models tend to HELLA over-fit for false positives and improvements were marginal compared to Mixtral
Split your prompt into 3 sections:
Below is the plug-n-play template I finalized/am using:
Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.
Instruction:
Label the text based on this question: "{task}" Below are example labeled comments, w/ the reason behind their labels as context. Learn from the examples and think step by step before responding. Start your response by printing a "Yes/No" statement first as the label.
(Hint: {common mistakes you see after trial and error})
Text: {few-shot example}
Reason for Label: {explanation}
Label: {correct label}
Input:
Text: {Text for it to label}
Label (Print Yes/No Only):
Response:
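For reference, here is a minimal Python sketch of how the template above might be assembled programmatically. The variable values and the single few-shot example are hypothetical placeholders, not part of the original post.

```python
# Hypothetical values; swap in your own task, hint, and few-shot examples.
task = "Does this comment share personal details, like how friends might talk to each other?"
hint = "If a comment merely expresses an opinion or admiration without personal context, label it as 'No'."
few_shot = [
    ("Wow, you are so beautiful.",
     "Admiration alone does not disclose personal details.",
     "No"),
]
text_to_label = "When he comes up?"

examples = "\n\n".join(
    f"Text: {t}\nReason for Label: {r}\nLabel: {l}" for t, r, l in few_shot
)

prompt = f"""Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

Instruction:
Label the text based on this question: "{task}" Below are example labeled comments, w/ the reason behind their labels as context. Learn from the examples and think step by step before responding. Start your response by printing a "Yes/No" statement first as the label.
(Hint: {hint})

{examples}

Input:
Text: {text_to_label}

Label (Print Yes/No Only):

Response:"""
```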
For experimentation, I found that discrepancies are your best friend. My setup was:
Comparison between M8x7b-t0-s1000.csv and M8x7b-t1-s1000.csv:
Same: 900, Different: 100
Number of times M8x7b-t0 said "Yes" and M8x7b-t1 said "No": 100
Number of times M8x7b-t0 said "No" and M8x7b-t1 said "Yes": 0
That was actually the result of my first test, where I increased the number of few-shot examples from 5 to 19. Looking at this, I could tell that the update led to more negative labels. After checking, there were some correct labels but mostly just false negatives. This was super helpful because it's more feasible to examine 100 outputs than 1,000... or 1 million...
Eventually I got it down to this:
Comparison between M8x7b-t1-s1000.csv and M8x7b-t2-s1000.csv:
Same: 972, Different: 28
Number of times M8x7b-t1 said "Yes" and M8x7b-t2 said "No": 2
Number of times M8x7b-t1 said "No" and M8x7b-t2 said "Yes": 26
When I reviewed the output, filtering for these cases, it turned out that the second round of testing corrected all of the mislabels.
Now is this perfect? After sampling instances where they agreed, it seems to be in order. I think there is something really special about this approach - by forcing overfitting, we can turn that into a feature instead of a bug. Working with the flaws of a model is a lot easier than trying to blindly iterate. At least here, we have a way to measure outputs against each other.
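A rough sketch of how this discrepancy check between two labeling runs could be scripted is below. It assumes each run was saved as a CSV with a "label" column holding "Yes"/"No" values in the same row order, which is not spelled out in the post.

```python
# Compare two labeling runs and surface only the disagreements for manual review.
import pandas as pd

a = pd.read_csv("M8x7b-t1-s1000.csv")["label"]   # assumed column name
b = pd.read_csv("M8x7b-t2-s1000.csv")["label"]

same = (a == b).sum()
diff = (a != b).sum()
yes_to_no = ((a == "Yes") & (b == "No")).sum()
no_to_yes = ((a == "No") & (b == "Yes")).sum()

print(f"Same: {same}, Different: {diff}")
print(f'Number of times t1 said "Yes" and t2 said "No": {yes_to_no}')
print(f'Number of times t1 said "No" and t2 said "Yes": {no_to_yes}')

# Export just the 28 (or however many) rows that changed, so review stays small.
df = pd.DataFrame({"t1": a, "t2": b})
df[df["t1"] != df["t2"]].to_csv("disagreements.csv", index=False)
```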
aichiusagi
• 19d ago
• Edited 18d ago
For some reason, fine-tuned/DPO models tend to HELLA over-fit for false positives and improvements were marginal compared to Mixtral
I ran into this too. When fine-tuning, what you need to do is provide some subset of training data where you explicitly return nothing for false positives. In my data, I set this to about ~10% of the total and the problem disappeared.
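A hedged sketch of what building such a fine-tuning set could look like; the field names and example strings are illustrative assumptions, with roughly 10% of the rows carrying an explicitly empty completion.

```python
# Mix in ~10% examples whose target output is empty, so the fine-tuned model
# learns that returning nothing is a valid answer for false-positive lookalikes.
# Field names and example content are hypothetical.
import json
import random

positives = [
    {"prompt": "Comment: I moved to Lisbon last year and love it.", "completion": "Yes"},
    # ... real labeled examples
]
false_positive_lookalikes = [
    {"prompt": "Comment: Wow, you are so beautiful.", "completion": ""},  # explicitly return nothing
    # ... comments the model tends to mislabel as positive
]

n_neg = max(1, int(0.10 * len(positives)))  # roughly 10% of the total
dataset = positives + random.sample(
    false_positive_lookalikes, k=min(n_neg, len(false_positive_lookalikes))
)
random.shuffle(dataset)

with open("train.jsonl", "w") as f:
    for row in dataset:
        f.write(json.dumps(row) + "\n")
```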
GeeBrain
• 19d ago
Oh very interesting, what did this look like exactly? Could you give me an example? I’m thinking about fine-tuning BERT for classification after this round, since using Mixtral takes forever and is unrealistic when I want to process millions of data points
Can you please provide an example of an actual prompt?
GeeBrain
• 18d ago
It's literally the template + whatever you want in the {}. But here ya go...
Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.
Instruction: Label the comment based on this question: "Does this comment share personal details, like how friends might talk to each other, and share from little to big things in their lives?" Below are example labeled comments, w/ the reason behind their labels as context. Learn from the examples and think step by step before responding. Start your response by printing a "Yes/No" statement first as the label.
(Hint: If a comment merely expresses an opinion or admiration without any personal context or experience, label it as ‘No’. But if the comment shares additional context about the commenter’s life, it should be labeled as ‘Yes’. The level of detail matters!)
Comment: Wow, you are so beautiful.
Reason for Label: Sharing simple statements of admiration or opinions does not count as disclosing personal details; the commenter needs to express something about their personal life, habits, or experiences.
Label: No
.... (More examples)
Input:
Comment: "When he comes up?"
Label (Print Yes/No Only):
Response:
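For completeness, a small sketch of sending an assembled prompt like the one above to a locally hosted model. The endpoint URL and model name are assumptions; any OpenAI-compatible server (llama.cpp server, vLLM, etc.) exposes a similar completions route.

```python
# Send the assembled prompt to a local OpenAI-compatible completions endpoint.
import requests

prompt = "..."  # paste the assembled template from above here

resp = requests.post(
    "http://localhost:8000/v1/completions",   # hypothetical local server
    json={
        "model": "mixtral-8x7b-instruct",     # hypothetical model name
        "prompt": prompt,
        "max_tokens": 3,                       # only "Yes"/"No" is needed
        "temperature": 0.0,                    # deterministic labels
    },
    timeout=60,
)
label = resp.json()["choices"][0]["text"].strip()
print(label)  # expected: "Yes" or "No"
```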
trapping_rainwater
• 18d ago
What's your production use case for something like this?
GeeBrain
• 18d ago
My project is around building an ML model that measures trust — kinda like a fandom score.
But in general, I can see this type of setup being really helpful when you have a lot of unlabeled data and wanna get really close with it.
Even though I'll likely end up fine-tuning BERT models in the future for production, this has helped me understand so much about the data space. Pretty fun"
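As a hedged sketch of the plan mentioned in the thread (distill the LLM-produced Yes/No labels into a small BERT classifier for cheap bulk inference), something along these lines could work; the file name and the "text"/"label" column names are assumptions.

```python
# Fine-tune a BERT sequence classifier on the labels produced by the prompting pipeline.
import pandas as pd
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

df = pd.read_csv("mixtral_labels.csv")            # hypothetical output of the labeling runs
df["label"] = (df["label"] == "Yes").astype(int)  # map Yes/No -> 1/0

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

ds = Dataset.from_pandas(df[["text", "label"]]).train_test_split(test_size=0.2)
ds = ds.map(
    lambda batch: tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128),
    batched=True,
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="bert-labeler", num_train_epochs=3, per_device_train_batch_size=16),
    train_dataset=ds["train"],
    eval_dataset=ds["test"],
)
trainer.train()
```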
URL
Suggested labels