
Improve time complexity of '_byte_pair_merge' and add test #29


Closed
michaelgiba wants to merge 2 commits

Conversation

michaelgiba

I saw the comment in the byte_pair_merge code about improving the time complexity and thought it sounded interesting. I'm curious to see how the overall throughput of tiktoken changes with this change; I'll work on benchmarking it when I get a chance.
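To make the idea concrete, here is a rough pure-Python sketch of the heap-based approach (the actual change here is to the Rust _byte_pair_merge; the names below are purely illustrative): keep every adjacent pair in a min-heap keyed by merge rank, always pop the lowest-ranked valid pair, and lazily discard stale entries, so each merge costs O(log n) rather than rescanning the whole piece.

import heapq

def bpe_merge_heap(piece: bytes, ranks: dict[bytes, int]) -> list[bytes]:
    # Hypothetical illustration only; tiktoken's real _byte_pair_merge lives in Rust.
    # parts[i] is a (start, end) slice of `piece`, or None once merged away.
    parts = [(i, i + 1) for i in range(len(piece))]
    prev = list(range(-1, len(parts) - 1))   # doubly linked list over part indices
    nxt = list(range(1, len(parts) + 1))

    def rank_of(i, j):
        # Rank of the token formed by joining parts i and j, if it is in the vocab.
        return ranks.get(piece[parts[i][0]:parts[j][1]])

    heap = []  # entries are (rank, left_part_index, right_part_index)
    for i in range(len(parts) - 1):
        r = rank_of(i, i + 1)
        if r is not None:
            heapq.heappush(heap, (r, i, i + 1))

    while heap:
        r, i, j = heapq.heappop(heap)
        # Lazily skip stale entries: a side was merged away, the parts are no
        # longer adjacent, or the pair's bytes (and hence rank) have changed.
        if parts[i] is None or parts[j] is None or nxt[i] != j or rank_of(i, j) != r:
            continue
        parts[i] = (parts[i][0], parts[j][1])  # merge j into i
        parts[j] = None
        nxt[i] = nxt[j]
        if nxt[j] < len(parts):
            prev[nxt[j]] = i
        # Only the pairs touching the merged part can have new ranks.
        if prev[i] >= 0 and (pr := rank_of(prev[i], i)) is not None:
            heapq.heappush(heap, (pr, prev[i], i))
        if nxt[i] < len(parts) and (nr := rank_of(i, nxt[i])) is not None:
            heapq.heappush(heap, (nr, i, nxt[i]))

    return [piece[p[0]:p[1]] for p in parts if p is not None]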

@hauntsaninja
Collaborator

Thanks for looking into this! I'd played around with it too a long time back and found the constant factor was dominant in practice. Curious to see the benchmarking results here; we could also consider conditioning on piece length.

One quick note about the tests: I haven't gotten around to open sourcing most of tiktoken's tests, so apologies if that made your life harder.

@michaelgiba
Author

michaelgiba commented Feb 8, 2023 via email

@hauntsaninja
Collaborator

There is some more internal benchmarking, but it looks a lot like https://github.com/openai/tiktoken/blob/main/scripts/benchmark.py

The main thing is just finding a documents: list[str] to feed to it. I used some internal-only datasets for this purpose, but maybe openwebtext or something would work. 100MB of data should be plenty for a meaningful benchmark (at low thread count).

Piece lengths are typically very small (about word length). That said, there are some degenerate cases. The best case scenario for heap merges is probably something like:

import base64, random

def base64_noise_documents(n_docs: int) -> list[str]:
    # Deterministic pseudo-random documents: each is 100 to 10,000 random bytes,
    # base64-encoded (random.Random.randbytes needs Python 3.9+). Intended as the
    # degenerate, heap-friendly case described above.
    rand = random.Random(217)
    documents = []
    for _ in range(n_docs):
        documents.append(base64.b64encode(rand.randbytes(rand.randint(100, 10_000))).decode())
    return documents
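A minimal harness in the spirit of scripts/benchmark.py could then look something like the following. This is only a sketch, not the internal benchmark; it assumes tiktoken's public get_encoding / encode_ordinary_batch APIs and the generator above, and reports throughput in bytes per second.

import time

import tiktoken

def bench_throughput(enc: tiktoken.Encoding, documents: list[str], num_threads: int) -> float:
    # One timed pass over `documents`; returns bytes encoded per second.
    num_bytes = sum(len(doc.encode("utf-8")) for doc in documents)
    start = time.perf_counter()
    enc.encode_ordinary_batch(documents, num_threads=num_threads)
    return num_bytes / (time.perf_counter() - start)

enc = tiktoken.get_encoding("cl100k_base")
enc.encode("warmup")  # force lazy initialization outside the timed region
documents = base64_noise_documents(1_000)
for threads in (1, 2, 4, 8):
    print(f"{threads} threads: {bench_throughput(enc, documents, threads) / 1e6:.2f} MB/s")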

@michaelgiba
Author

OK, I had a chance to run some simple tests. I ran with 1, 3, 5, 7, 9, and 11 threads on both the original version and the heap version, against documents created using your random base64 example above and 100 document samples from this Ubuntu dialogue dataset I found: https://github.com/rkadlec/ubuntu-ranking-dataset-creator

[Screenshot: benchmark results, 2023-02-08]

Surprisingly, the performance looks decent. I'll try to run something more rigorous when I have free time.

@michaelgiba
Author

michaelgiba commented Feb 13, 2023

Haven't gotten around to testing this, but I'm closing it out in favor of #31, which looks great!

@hauntsaninja
Collaborator

Thanks for experimenting with it though! :-)
