This is a collection of AI-related notebooks I used for various tasks.
LLM | en->hu BLEU | Spell error % | HuLU avg | GLUE avg |
---|---|---|---|---|
gemma-2-27b-it-Q5_K_L.gguf | 0.1364 | 3.3% | 0.478 | 0.799 |
google_gemma-3-27b-it-Q5_K_L.gguf | 0.1327 | 3.3% | 0.487 | 0.804 |
SambaLingo-Hungarian-Chat-Q5_K_M.gguf | 0.1302 | 1.8% | 0.415 | 0.339 |
salamandra-7b-instruct.Q6_K.gguf | 0.1157 | 2.9% | 0.437 | 0.588 |
Meta-Llama-3.1-70B-Instruct-Q2_K.gguf | 0.1141 | 4.5% | 0.452 | 0.723 |
PULI-LlumiX-32K-Instruct-Q4_K_M.gguf | 0.1132 | 2.9% | 0.426 | 0.499 |
gemma-2-9b-it-Q6_K_L.gguf | 0.1125 | 3.2% | 0.495 | 0.799 |
Mistral-Small-24B-Instruct-2501-Q6_K.gguf | 0.1036 | 5.2% | 0.482 | 0.811 |
phi-4-Q6_K.gguf | 0.0981 | 3.4% | 0.468 | 0.791 |
Llama-3.3-70B-Instruct-Q2_K.gguf | 0.0954 | 9.3% | 0.462 | 0.788 |
mixtral-8x7b-instruct-v0.1.Q5_K_M.gguf | 0.0946 | 3.6% | 0.450 | 0.762 |
Meta-Llama-3.1-8B-Instruct-Q6_K_L.gguf | 0.0870 | 3.5% | 0.447 | 0.740 |
Qwen2.5-32B-Instruct-Q4_K_L.gguf | 0.0705 | 8.3% | 0.456 | 0.811 |
solar-10.7b-instruct-v1.0.Q6_K.gguf | 0.0673 | 8.0% | 0.478 | 0.699 |
Ministral-8B-Instruct-2410-Q6_K_L.gguf | 0.0672 | 6.8% | 0.416 | 0.654 |
c4ai-command-r-v01.i1-Q4_K_S.gguf | 0.0667 | 6.1% | 0.436 | 0.704 |
gemma-2-2b-it-Q6_K_L.gguf | 0.0619 | 4.5% | 0.426 | 0.624 |
salamandra-2b-instruct_Q6_K.gguf | 0.0600 | 8.6% | 0.425 | 0.316 |
Llama-3.2-3B-Instruct-Q6_K_L.gguf | 0.0534 | 4.1% | 0.419 | 0.614 |
Mistral-NeMo-Minitron-8B-Instruct-Q6_K_L.gguf | 0.0497 | 7.9% | 0.432 | 0.728 |
Yi-1.5-34B-Chat-Q4_K_M.gguf | 0.0462 | 12.3% | 0.439 | 0.809 |
llama-2-7b-32k-instruct.Q5_K_M.gguf | 0.0450 | 13.5% | 0.412 | 0.641 |
Phi-3-medium-4k-instruct-Q6_K_L.gguf | 0.0373 | 8.4% | 0.413 | 0.716 |
gpt-35-turbo-instruct | 0.0264 | 8.0% | 0.480 | --- |
OLMoE-1B-7B-0924-Instruct-Q6_K_L.gguf | 0.0201 | 10.6% | 0.374 | 0.574 |
Phi-3-mini-4k-instruct-q4.gguf | 0.0186 | 11.6% | 0.405 | 0.659 |
DeepSeek-R1-Distill-Qwen-32B-Q4_K_L.gguf | 0.0013 | 55.2% | 0.452 | 0.816 |
falcon-mamba-7b-instruct.Q6_K.gguf | 0.0000 | 54.4% | 0.365 | --- |
Note 1:
- HuLU: Hungarian text comprehension tests
- GLUE: English text comprehension tests
- en->hu BLEU: English to Hungarian translation tests, scored with BLEU; hunspell is used to count spelling errors (see the sketch below).
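A minimal sketch of what such a BLEU + hunspell check can look like (not the exact code from eval-BLEU-en-hu.ipynb); it assumes the sacrebleu package and the pyhunspell bindings with the Hungarian hu_HU dictionary installed:

```python
# Sketch only: corpus BLEU plus a hunspell-based spelling-error rate.
# Dictionary paths are assumptions; adjust them to your hunspell install.
import re
import sacrebleu
import hunspell

def bleu_and_spell_error(hypotheses, references,
                         dic="/usr/share/hunspell/hu_HU.dic",
                         aff="/usr/share/hunspell/hu_HU.aff"):
    # sacrebleu reports BLEU on a 0-100 scale; the table above uses 0-1.
    bleu = sacrebleu.corpus_bleu(hypotheses, [references]).score / 100.0

    # Spelling-error rate: fraction of word tokens the hu_HU dictionary rejects.
    checker = hunspell.HunSpell(dic, aff)
    words = [w for h in hypotheses for w in re.findall(r"\w+", h)]
    errors = sum(1 for w in words if not checker.spell(w))
    return bleu, errors / max(len(words), 1)

hyps = ["A macska a szőnyegen ül."]   # model translations
refs = ["A macska a szőnyegen ül."]   # reference translations
print(bleu_and_spell_error(hyps, refs))
```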
Note 2:
SambaLingo-Hungarian-Chat is a further-trained version of llama-2-7b. It translates to Hungarian much better (0.1302 / 1.8% vs 0.0450 / 13.5%), yet its HuLU score barely changed (a negligible improvement, 0.415 vs 0.412), while its GLUE score collapsed (0.339 vs 0.641). Most probably what we observe here is catastrophic forgetting.
Note 3:
falcon-mamba-7b was so bad that it practically output gibberish. Its spelling-error rate is only 54% because the other 46% of tokens were numbers. It is also very slow and had a very high request error rate on classification tasks, so I stopped the GLUE eval mid-run.
Note 4:
Even though gpt-35-turbo-instruct has one of the highest HuLU scores, its translation and Hungarian spelling capabilities are very poor.
Note 5:
DeepSeek-R1-Distill-Qwen-32B uses a different output format: it first "thinks", then "responds". The unmodified en->hu BLEU evaluation therefore also scores the preceding English "thinking" output, which makes the results substantially worse, even though it has the highest GLUE score and a midrange HuLU score (slightly worse than the original Qwen2.5-32B). A sketch of stripping the thinking block is shown below.
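A hedged sketch of how such output could be cleaned before BLEU scoring; it assumes the distill wraps its reasoning in `<think>...</think>` tags (the usual DeepSeek-R1 convention) and is not part of the current eval-BLEU-en-hu.ipynb:

```python
# Hypothetical post-processing: drop an R1-style <think>...</think> block
# (including unclosed ones) and keep only the final answer for BLEU scoring.
import re

def strip_thinking(raw_output: str) -> str:
    cleaned = re.sub(r"<think>.*?(</think>|$)", "", raw_output, flags=re.DOTALL)
    return cleaned.strip()

sample = "<think>The user wants the Hungarian translation...</think>A macska alszik."
print(strip_thinking(sample))  # -> "A macska alszik."
```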
- HUN_Book_scraping.ipynb: a scraper that downloads most PDF files and their metadata from the MEK (Magyar Elektronikus Könyvtár) collection of OSzK (Országos Széchenyi Könyvtár)
- HUN_Book_statistics.ipynb: builds statistics from the downloaded PDFs and writes them to CSV for further analysis in Excel
- eval-GLUE.ipynb: a simple evaluator for a locally running, Koboldcpp-hosted LLM on the GLUE validation dataset (see the request sketch after this list)
- eval-HULU.ipynb: a simple evaluator for a locally running, Koboldcpp-hosted LLM on the HuLU validation dataset
- gen-hunglish-testset.ipynb: generates the Hunglish evaluation dataset for BLEU scoring
- eval-BLEU-en-hu.ipynb: a simple evaluator for a locally running, Koboldcpp-hosted LLM on the hunglish-BLEU.json dataset
- LLM_Eval.xlsx: the results of the LLM evaluations I ran
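For reference, a minimal sketch of how the eval-* notebooks can query a locally running Koboldcpp instance; it uses the KoboldAI-compatible API that Koboldcpp exposes, but the port, sampling parameters, and prompt are placeholder assumptions, not the notebooks' exact settings:

```python
# Sketch: send a prompt to a local Koboldcpp server and read back the completion.
import requests

KOBOLD_URL = "http://localhost:5001/api/v1/generate"  # default Koboldcpp port (assumption)

def generate(prompt: str, max_length: int = 64) -> str:
    resp = requests.post(
        KOBOLD_URL,
        json={
            "prompt": prompt,
            "max_length": max_length,
            "temperature": 0.1,  # near-greedy sampling for classification-style evals
        },
        timeout=300,
    )
    resp.raise_for_status()
    return resp.json()["results"][0]["text"]

# Example: a GLUE-style yes/no prompt.
print(generate("Is the following sentence grammatically acceptable? Answer yes or no.\n"
               "Sentence: The cat sat on the mat.\nAnswer:"))
```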