
Memory map load vectors_from_file if requested #592

Merged · 20 commits · Oct 24, 2024

Conversation

daxpryce (Contributor) commented Oct 17, 2024

This commit adds basic memmap support to the diskannpy.vectors_from_file utility function. You can now request a memory-mapped, np.ndarray-conformant view (an np.memmap object) instead of requiring the file to be loaded fully into memory (a plain np.ndarray).

  • Does this PR have a descriptive title that could go in our release notes?
  • Does this PR add any new dependencies?
  • Does this PR modify any existing APIs?
    • Is the change to the API backwards compatible?
  • Should this result in any changes to our documentation, either updating existing docs or adding new ones?

What does this implement/fix? Briefly explain your changes.

Any other comments?

Note that I set the default to mode="r", not "r+" (the numpy default). I would prefer we don't mess with mutating the vectors prior to handing them to diskann for build; I have no idea if that even works. I also wanted to leave the question open until we figure out whether it does.
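For reference, the behavior being added can be sketched in plain numpy. This is a hypothetical illustration, not the actual diskannpy implementation: the load_vectors helper, its use_memmap flag, and the assumed file layout (an int32 point count and an int32 dimension count, followed by row-major float32 data) are all assumptions made for the sake of the example. Note how mode="r" keeps the mapped view read-only, while "r+" would write any mutation back to the file.

```python
import numpy as np


def load_vectors(path: str, use_memmap: bool = False, mode: str = "r") -> np.ndarray:
    """Load a binary vector file laid out as: int32 npoints, int32 dims,
    then npoints * dims float32 values in row-major order.

    With use_memmap=True, return a read-only (mode="r") np.memmap view
    instead of materializing the whole array in memory.
    """
    with open(path, "rb") as f:
        npoints, dims = np.fromfile(f, dtype=np.int32, count=2)
    if use_memmap:
        # offset=8 skips the two int32 header fields
        return np.memmap(path, dtype=np.float32, mode=mode,
                         offset=8, shape=(int(npoints), int(dims)))
    return np.fromfile(path, dtype=np.float32, offset=8).reshape(int(npoints), int(dims))
```

Both return values are np.ndarray-conformant, so downstream code that only reads the vectors should not need to care which one it received.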

…from_file` utility function. You can now return a memory mapped np.array conformant view vs. requiring it be loaded fully into memory.
@daxpryce daxpryce requested a review from gopalrs October 17, 2024 00:01
daxpryce (Contributor, Author)

(note: I'll want to rebase on main after the mostly-fixed-the-build PR is merged)

gopalrs (Contributor) left a comment


Should we update the dependencies? (Do we even have a requirements.txt?)

gopalrs (Contributor) commented Oct 21, 2024

Sorry for the delay - GH notifications go to my gmail account, and I don't look at it often enough.

daxpryce (Contributor, Author)

> Should we update the dependencies? (Do we even have a requirements.txt?)

Dependencies are in pyproject.toml, which is where you want dependencies defined for a library you're publishing. requirements.txt is more for applications that don't get published and thus won't be imported.

Anyway, the only strong requirement in our deps is numpy, which is actually a big challenge: we use the numpy headers at cxx compile time, so we need to create compiled wheels for every version of numpy, and those in turn can't really be published to PyPI because there's no structural way to keep "compiled for python 3.10 and numpy 1.25" separate from "compiled for python 3.11 and numpy 2.0.2". That's what conda is for, really. See #544 for more details.


Continuing on this topic (but also slightly tangential): I have absolutely seen that pyarrow and polars ship native (read: something was compiled) support for numpy without pinning it to a fixed version. I have no idea how they've done it, but it may very well be possible, and we could be more loosey goosey with our strict numpy requirements.


Which then dovetails into the actual answer to your question: we can absolutely upgrade the requirements, but they're basically all build-time requirements, not runtime; numpy is the only runtime requirement. Long answer shortened: "maybe, but not right now; we should look into it a bit deeper first to find a more optimal way of handling the numpy dep that isn't 'change the strict numpy 1.25 requirement into a strict numpy 2.0 requirement', because that's all sorts of crap for a bunch of reasons" :)
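To make the compile-time coupling concrete, here is a simplified, hypothetical runtime guard of the sort a compiled wheel could use to fail loudly on a mismatch. The check_numpy_compat helper and its major-version rule are illustrative assumptions only, not how diskannpy actually handles this (real ABI compatibility between numpy releases is more nuanced than a major-version comparison):

```python
import numpy as np


def check_numpy_compat(built_against: str) -> None:
    """Illustrative guard: an extension compiled against numpy's C headers
    raises ImportError when the installed numpy's major version differs
    from the one it was built against.

    This is a deliberate simplification for the example; it is not a
    faithful model of numpy's actual ABI compatibility guarantees.
    """
    built_major = int(built_against.split(".")[0])
    runtime_major = int(np.__version__.split(".")[0])
    if runtime_major != built_major:
        raise ImportError(
            f"extension was built against numpy {built_against}, "
            f"but numpy {np.__version__} is installed"
        )
```

The point of the sketch is the failure mode: without something like conda keeping "built for numpy 1.25" and "built for numpy 2.0.2" wheels separate, the mismatch only surfaces at import time.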

* Update push-test.yml

When the upload-artifact action was updated to v4, its behavior changed such that you can no longer build up a unified artifact by adding files to an existing artifact name. Instead, every upload requires a unique artifact name.

This should fix the current workflow build errors.

* Trying to fix the workflow to handle the name collision problem that starts in upload-artifact@v4

* Trying to get the cibuildwheel tool to work with the changes to AlmaLinux8 (why the heck is this docker image still shipping with an old gpg key anyway)

* Dynamic and Dynamic Labels were both trying to write to the  artifact
…e line above made that clear and I somehow didn't notice it.
…ager context. I should have known better than to try to force this through last night so late :D
…still in scope and hasn't been marked for collection yet? Let's try atexit instead. I hate doing this but it's either this or leave it in the temp directory
…ts around filtered indices is sporadically failing. This is new! And exciting! But also unexpected!
daxpryce (Contributor, Author)

@gopalrs

I'm getting occasional failures in one of the unit tests, seemingly around filtered memory indices. In short, my test compares the results of an index search without filters against an index search with filters (where every element has the same label). The expectation is that the two result sets will be identical, but that does not seem to be the case.

I've added a comment to the unit test (with your name in it) to showcase what I'm doing, but because this is (a) sporadic and (b) seemingly the result of changes in the cxx itself (nothing in the python has changed), I wanted to flag it as a test case that maybe the cxx tests aren't catching or covering.

@@ -306,6 +306,8 @@ def test_exhaustive_validation(self):
with_filter_but_label_all, _ = index.search(
query_vectors[0], k_neighbors=k*2, complexity=128, filter_label="all"
)
# this has begun failing randomly. gopalrs - this *shouldn't* fail, but for some reason the filtered search
daxpryce (Contributor, Author)

@gopalrs this is the comment I'm talking about. It works sometimes, other times it doesn't; the inconsistency is going to make it hard to track down, but it should work 100% of the time (it did before, at least!)
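The invariant that unit test checks can be reproduced outside diskann with a brute-force search. This is a hypothetical sketch, not the actual diskannpy test: the knn helper and the random data are assumptions for illustration. The property being asserted is that when every element carries the same label, a search filtered on that label must return exactly the unfiltered results.

```python
import numpy as np


def knn(data: np.ndarray, query: np.ndarray, k: int,
        labels=None, filter_label=None) -> np.ndarray:
    """Brute-force k-nearest-neighbor search; when filter_label is given,
    restrict candidates to rows whose label matches (mirroring a filtered
    index search)."""
    candidates = np.arange(len(data))
    if filter_label is not None:
        candidates = candidates[np.asarray(labels)[candidates] == filter_label]
    dists = np.linalg.norm(data[candidates] - query, axis=1)
    return candidates[np.argsort(dists)[:k]]


# Invariant under test: with a universal label, filtered == unfiltered.
rng = np.random.default_rng(0)
data = rng.normal(size=(100, 8)).astype(np.float32)
query = rng.normal(size=8).astype(np.float32)
labels = ["all"] * len(data)

unfiltered = knn(data, query, k=10)
filtered = knn(data, query, k=10, labels=labels, filter_label="all")
assert np.array_equal(unfiltered, filtered)
```

In the brute-force case the equality holds by construction, which is exactly why a sporadic divergence in the real index points at the filtered search path rather than at the test itself.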

@daxpryce daxpryce merged commit e6afdbb into main Oct 24, 2024
36 of 38 checks passed