Pretrained Llama tokenizers don't yield the expected tokenization of "\n" #1019

@JulienVig

System Info

TypeScript 5.5.4
transformers.js 3.0.2
Node.js v20.17.0

Environment/Platform

  • Website/web-app
  • Browser extension
  • Server-side (e.g., Node.js, Deno, Bun)
  • Desktop app (e.g., Electron)
  • Other (e.g., VSCode extension)

Description

Some pretrained tokenizers don't tokenize "\n" or " \n" the same way Tiktokenizer or Xenova's playground do.

For example, Xenova/llama-3-tokenizer tokenizes "\n" as [198] and " \n" as [720].
In both playgrounds (selecting Llama 3 in Xenova's playground and meta-llama/Meta-Llama-3-8B in Tiktokenizer), the Llama 3 tokenizer tokenizes "\n" as [1734] and " \n" as [1144, 77].

Similarly for Llama 2, Xenova/llama-tokenizer tokenizes "\n" as [1, 29871, 13] while Xenova's playground yields [1, 320, 29876].

Reproduction

import { AutoTokenizer } from "@huggingface/transformers";

const tokenizer = await AutoTokenizer.from_pretrained("Xenova/llama-3-tokenizer");

const tokens = tokenizer("\n", { return_tensor: false }).input_ids;
console.log(tokens); // prints [198] while [1734] is expected

Similar issue with Xenova/llama-tokenizer.

Labels: bug (Something isn't working)