Memory map load vectors_from_file if requested #592
Conversation
…from_file` utility function. You can now return a memory-mapped np.array conformant view vs. requiring it be loaded fully into memory.
(note: I'll want to rebase on main after the mostly-fixed-the-build PR is merged)
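For context, a minimal sketch of the memmap idea, assuming DiskANN's usual .bin layout (two int32 header values, point count and dimension, followed by row-major data); the helper name is hypothetical, not the PR's actual code:

import numpy as np

def memmap_vectors(path: str, dtype: np.dtype) -> np.memmap:
    # Read the two-int32 header: (number of points, dimensions).
    points, dims = np.fromfile(path, dtype=np.int32, count=2)
    # Map the rest of the file lazily instead of loading it all into RAM;
    # offset=8 skips the two 4-byte header ints.
    return np.memmap(path, dtype=dtype, mode="r", offset=8, shape=(int(points), int(dims)))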
Should we update the dependencies? (Do we even have a requirements.txt?)
Sorry for the delay - GH notifications go to my gmail account, and I don't look at it often enough.
Dependencies are in pyproject.toml, which is where you want dependencies defined for a library you're publishing; requirements.txt is more for applications that don't get published and thus won't be imported.

Anyway, the only strong requirement in our deps is numpy, which is actually a big challenge: we use the numpy headers at cxx compile time, so we need to create compiled wheels for every version of numpy. Those in turn can't really be published to pypi, as there's no structural way to keep "compiled for python 3.10 and numpy 1.25" separate from "compiled for python 3.11 and numpy 2.0.2". That's what conda is for, really. See #544 for more details.

Continuing on this topic, but also slightly tangentially: I have absolutely seen that pyarrow and polars have native (read: something was compiled) support for numpy without it being a fixed version. I have no idea how they've done it, but it may very well be possible, and we could be more loosey-goosey with our strict numpy requirement.

Which dovetails into the actual answer to your question: we can absolutely upgrade the requirements, but they're basically all build-time requirements, not runtime. Numpy is the only one that is a runtime requirement.

Long answer shortened: "maybe, but not right now; we should look into it a bit deeper first to find a more optimal way of handling the numpy dep that isn't 'change the strict numpy 1.25 requirement into a strict numpy 2.0 requirement', because that's all sorts of crap for a bunch of reasons :)"
* Update push-test.yml: when the upload-artifact action was updated to v4, its behavior changed such that you cannot create a unified "artifact" by having it add files to an existing artifact name. Instead, every upload requires a unique artifact name. This should fix the current workflow build errors.
* Trying to fix the workflow to handle the name-collision problem that starts in upload-artifact@v4
* Trying to get the cibuildwheel tool to work with the changes to AlmaLinux 8 (why the heck is this docker image still shipping with an old gpg key anyway)
* Dynamic and Dynamic Labels were both trying to write to the same artifact
…arrays, not all() the python function
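A quick illustration of the distinction that commit message refers to, i.e. numpy's own array comparisons vs. Python's built-in all():

import numpy as np

a = np.array([[1, 2], [3, 4]])
b = np.array([[1, 2], [3, 4]])

# Python's built-in all() iterates the outer axis, so on a 2-D boolean array
# it raises "truth value of an array ... is ambiguous" instead of comparing.
# numpy's own reductions handle any shape:
assert (a == b).all()          # elementwise comparison, then reduce
assert np.array_equal(a, b)    # or, clearer still: shape check plus elementwise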
…e line above made that clear and I somehow didn't notice it.
…time in our unit test.
…till made the mistake. Wee!
…ager context. I should have known better than to try to force this through last night so late :D
…still in scope and hasn't been marked for collection yet? Let's try atexit instead. I hate doing this, but it's either this or leave it in the temp directory.
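For illustration, a minimal sketch of the atexit approach being described here (names are illustrative, not the actual test code):

import atexit
import shutil
import tempfile

# Create the temp directory once, then register a cleanup hook that runs at
# interpreter shutdown, rather than relying on a context manager whose owning
# object stays in scope for the whole test run.
_tmp_dir = tempfile.mkdtemp()
atexit.register(shutil.rmtree, _tmp_dir, ignore_errors=True)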
…ts around filtered indices is sporadically failing. This is new! And exciting! But also unexpected!
I'm getting occasional failures in one of the unit tests that seems to be around filtered memory indices. In short, my test compares the results of an index search without filters vs. an index search with filters (but where every element has the same label). The expectation is that the two result sets will be identical, but that does not seem to be the case. I've added a comment to the unit test (with your name in it) to showcase what I'm doing, but because this is a) sporadic and b) seemingly a result of some changes in the cxx itself (nothing in the python has changed), I wanted to flag it as a test case that maybe the cxx tests aren't catching or covering.
@@ -306,6 +306,8 @@ def test_exhaustive_validation(self):
        with_filter_but_label_all, _ = index.search(
            query_vectors[0], k_neighbors=k*2, complexity=128, filter_label="all"
        )
        # this has begun failing randomly. gopalrs - this *shouldn't* fail, but for some reason the filtered search
@gopalrs this is the comment I'm talking about. It works sometimes, other times it doesn't; the inconsistency is going to make it hard to track down, but it should work 100% of the time (it did before, at least!)
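For clarity, a sketch of the invariant the failing test checks (index, query_vectors, and k come from the test fixture; assuming search returns an (ids, distances) pair, as in the diff above):

import numpy as np

# When every vector carries the same label, a search filtered to that label
# should return exactly the same neighbors as an unfiltered search.
without_filter, _ = index.search(query_vectors[0], k_neighbors=k * 2, complexity=128)
with_filter_but_label_all, _ = index.search(
    query_vectors[0], k_neighbors=k * 2, complexity=128, filter_label="all"
)
assert np.array_equal(without_filter, with_filter_but_label_all)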
What does this implement/fix? Briefly explain your changes.

This commit adds some basic memmap support to the diskannpy.vectors_from_file utility function. You can now return a memory-mapped np.array conformant view (np.memmap object) vs. requiring it be loaded fully into memory (np.ndarray style).
Any other comments?
Note that I set the default to be mode="r", not "r+" (which is the numpy default). I would prefer we don't mess with mutating the vectors prior to handing them to diskann for build; I have no idea if that would even work. But I also wanted to leave that option open until we figure out whether it does.
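For reference, a sketch of the intended usage; the use_memmap keyword below is an illustrative assumption, not necessarily the final parameter name:

import numpy as np
import diskannpy

# Hypothetical call shape: with memmapping requested, data is paged in lazily
# from disk instead of the whole file being read up front.
vectors = diskannpy.vectors_from_file(
    "vectors.bin", dtype=np.float32, use_memmap=True, mode="r"
)
print(type(vectors))  # np.memmap, an np.ndarray subclass; read-only with mode="r"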