Pretrained Llama tokenizers don't yield the expected tokenization of "\n" #1019

@JulienVig

System Info

TypeScript 5.5.4
transformers.js 3.0.2
Node.js v20.17.0

Environment/Platform

  • Website/web-app
  • Browser extension
  • Server-side (e.g., Node.js, Deno, Bun)
  • Desktop app (e.g., Electron)
  • Other (e.g., VSCode extension)

Description

Some pretrained tokenizers don't tokenize "\n" or " \n" the same way Tiktokenizer or Xenova's playground do.

For example, Xenova/llama-3-tokenizer tokenizes "\n" as [198] and " \n" as [720].
In both playgrounds (selecting Llama 3 in Xenova's playground and meta-llama/Meta-Llama-3-8B in Tiktokenizer), the Llama 3 tokenizer tokenizes "\n" as [1734] and " \n" as [1144, 77].

Similarly for Llama 2, Xenova/llama-tokenizer tokenizes "\n" as [1, 29871, 13] while Xenova's playground yields [1, 320, 29876].

Reproduction

import { AutoTokenizer } from "@huggingface/transformers";

const tokenizer = await AutoTokenizer.from_pretrained("Xenova/llama-3-tokenizer");

const tokens = tokenizer("\n", { return_tensor: false }).input_ids;
console.log(tokens); // prints [198] while [1734] is expected

Similar issue with Xenova/llama-tokenizer.

Labels: bug (Something isn't working)