
Improve time complexity of '_byte_pair_merge' and add test #29


Closed
michaelgiba wants to merge 2 commits

Conversation

michaelgiba

I saw the comment in the byte_pair_merge code about improving the time complexity and thought it sounded interesting. I'm curious to see how the overall throughput of tiktoken changes with this change; I'll work on benchmarking it when I get a chance.
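To make the idea concrete, here is a rough pure-Python sketch of the heap-based approach (the actual change here is to the Rust _byte_pair_merge; the names below are purely illustrative): keep every adjacent pair in a min-heap keyed by merge rank, always pop the lowest-ranked valid pair, and lazily discard stale entries, so each merge costs O(log n) rather than rescanning the whole piece.

import heapq

def bpe_merge_heap(piece: bytes, ranks: dict[bytes, int]) -> list[bytes]:
    # Hypothetical illustration only; tiktoken's real _byte_pair_merge lives in Rust.
    # parts[i] is a (start, end) slice of `piece`, or None once merged away.
    parts = [(i, i + 1) for i in range(len(piece))]
    prev = list(range(-1, len(parts) - 1))   # doubly linked list over part indices
    nxt = list(range(1, len(parts) + 1))

    def rank_of(i, j):
        # Rank of the token formed by joining parts i and j, if it is in the vocab.
        return ranks.get(piece[parts[i][0]:parts[j][1]])

    heap = []  # entries are (rank, left_part_index, right_part_index)
    for i in range(len(parts) - 1):
        r = rank_of(i, i + 1)
        if r is not None:
            heapq.heappush(heap, (r, i, i + 1))

    while heap:
        r, i, j = heapq.heappop(heap)
        # Lazily skip stale entries: a side was merged away, the parts are no
        # longer adjacent, or the pair's bytes (and hence rank) have changed.
        if parts[i] is None or parts[j] is None or nxt[i] != j or rank_of(i, j) != r:
            continue
        parts[i] = (parts[i][0], parts[j][1])  # merge j into i
        parts[j] = None
        nxt[i] = nxt[j]
        if nxt[j] < len(parts):
            prev[nxt[j]] = i
        # Only the pairs touching the merged part can have new ranks.
        if prev[i] >= 0 and (pr := rank_of(prev[i], i)) is not None:
            heapq.heappush(heap, (pr, prev[i], i))
        if nxt[i] < len(parts) and (nr := rank_of(i, nxt[i])) is not None:
            heapq.heappush(heap, (nr, i, nxt[i]))

    return [piece[p[0]:p[1]] for p in parts if p is not None]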

@hauntsaninja
Collaborator

Thanks for looking into this! I'd played around with it too a long time back and found the constant factor was dominant in practice. Curious to see the benchmarking results here; we could also consider conditioning on piece length.

One quick note about the tests: I haven't gotten around to open sourcing most of tiktoken's tests, so apologies if that made your life harder.

@michaelgiba
Author

michaelgiba commented Feb 8, 2023 via email

@hauntsaninja
Collaborator

There is some more internal benchmarking, but it looks a lot like https://github.com/openai/tiktoken/blob/main/scripts/benchmark.py

The main thing is just finding a documents: list[str] to feed to it. I used some internal-only datasets for this purpose, but maybe openwebtext or something would work. 100MB of data should be plenty for a meaningful benchmark (at low thread count).

Piece lengths are typically very small (about word length). That said, there are some degenerate cases. The best case scenario for heap merges is probably something like:

import base64, random

def base64_noise_documents(n_docs: int) -> list[str]:
    # Deterministic pseudo-random documents: each is 100 to 10,000 random bytes,
    # base64-encoded (random.Random.randbytes needs Python 3.9+). Intended as the
    # degenerate, heap-friendly case described above.
    rand = random.Random(217)
    documents = []
    for _ in range(n_docs):
        documents.append(base64.b64encode(rand.randbytes(rand.randint(100, 10_000))).decode())
    return documents
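A minimal harness in the spirit of scripts/benchmark.py could then look something like the following. This is only a sketch, not the internal benchmark; it assumes tiktoken's public get_encoding / encode_ordinary_batch APIs and the generator above, and reports throughput in bytes per second.

import time

import tiktoken

def bench_throughput(enc: tiktoken.Encoding, documents: list[str], num_threads: int) -> float:
    # One timed pass over `documents`; returns bytes encoded per second.
    num_bytes = sum(len(doc.encode("utf-8")) for doc in documents)
    start = time.perf_counter()
    enc.encode_ordinary_batch(documents, num_threads=num_threads)
    return num_bytes / (time.perf_counter() - start)

enc = tiktoken.get_encoding("cl100k_base")
enc.encode("warmup")  # force lazy initialization outside the timed region
documents = base64_noise_documents(1_000)
for threads in (1, 2, 4, 8):
    print(f"{threads} threads: {bench_throughput(enc, documents, threads) / 1e6:.2f} MB/s")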

@michaelgiba
Author

OK, I had a chance to run some simple tests. I ran with 1, 3, 5, 7, 9, and 11 threads on both the original version and the heap version, against documents created using your random base64 example above and 100 document samples from this Ubuntu dialogue dataset I found: https://github.com/rkadlec/ubuntu-ranking-dataset-creator

[Screenshot: benchmark results, 2023-02-08]

Surprisingly, the performance looks decent. I'll try to run something more rigorous when I have free time.

@michaelgiba
Author

michaelgiba commented Feb 13, 2023

Haven't gotten around to testing this, but I'm closing it out in favor of #31, which looks great!

@hauntsaninja
Collaborator

Thanks for experimenting with it though! :-)
