
Memory map load vectors_from_file if requested #592

Merged · 20 commits · Oct 24, 2024

Conversation

daxpryce (Contributor) commented Oct 17, 2024

This commit adds basic memmap support to the diskannpy.vectors_from_file utility function. You can now request a memory-mapped, np.ndarray-conformant view (an np.memmap object) instead of requiring the file to be loaded fully into memory (a plain np.ndarray).

  • Does this PR have a descriptive title that could go in our release notes?
  • Does this PR add any new dependencies?
  • Does this PR modify any existing APIs?
    • Is the change to the API backwards compatible?
  • Should this result in any changes to our documentation, either updating existing docs or adding new ones?

What does this implement/fix? Briefly explain your changes.

Any other comments?

Note that I set the default to mode="r", not "r+" (the numpy default). I would prefer we don't mess with mutating the vectors prior to handing them to diskann for build; I have no idea if that even works. I also wanted to leave the question open until we figure out whether it does.
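For reference, the behavior being added can be sketched in plain numpy. This is a hypothetical illustration, not the actual diskannpy implementation: the load_vectors helper, its use_memmap flag, and the assumed file layout (an int32 point count and an int32 dimension count, followed by row-major float32 data) are all assumptions made for the sake of the example. Note how mode="r" keeps the mapped view read-only, while "r+" would write any mutation back to the file.

```python
import numpy as np


def load_vectors(path: str, use_memmap: bool = False, mode: str = "r") -> np.ndarray:
    """Load a binary vector file laid out as: int32 npoints, int32 dims,
    then npoints * dims float32 values in row-major order.

    With use_memmap=True, return a read-only (mode="r") np.memmap view
    instead of materializing the whole array in memory.
    """
    with open(path, "rb") as f:
        npoints, dims = np.fromfile(f, dtype=np.int32, count=2)
    if use_memmap:
        # offset=8 skips the two int32 header fields
        return np.memmap(path, dtype=np.float32, mode=mode,
                         offset=8, shape=(int(npoints), int(dims)))
    return np.fromfile(path, dtype=np.float32, offset=8).reshape(int(npoints), int(dims))
```

Both return values are np.ndarray-conformant, so downstream code that only reads the vectors should not need to care which one it received.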

…from_file` utility function. You can now return a memory mapped np.array conformant view vs. requiring it be loaded fully into memory.
@daxpryce daxpryce requested a review from gopalrs October 17, 2024 00:01
daxpryce (Contributor, Author)

(note: I'll want to rebase on main after the mostly-fixed-the-build PR is merged)

gopalrs (Contributor) left a comment


Should we update the dependencies? (Do we even have a requirements.txt?)

gopalrs (Contributor) commented Oct 21, 2024

Sorry for the delay - GH notifications go to my gmail account, and I don't look at it often enough.

daxpryce (Contributor, Author)

> Should we update the dependencies? (Do we even have a requirements.txt?)

Dependencies are in pyproject.toml, which is where you want dependencies defined for a library you're publishing. requirements.txt is more for applications that don't get published and thus won't be imported.

Anyway, the only strong requirement in our deps is numpy, which is actually a big challenge: we use the numpy headers at cxx compile time, so we need to create compiled wheels for every version of numpy, and those in turn can't really be published to PyPI because there's no structural way to keep "compiled for python 3.10 and numpy 1.25" separate from "compiled for python 3.11 and numpy 2.0.2". That's what conda is for, really. See #544 for more details.


Continuing on this topic (but also slightly tangential): I have absolutely seen that pyarrow and polars ship native (read: something was compiled) support for numpy without pinning it to a fixed version. I have no idea how they've done it, but it may very well be possible, and we could be more loosey goosey with our strict numpy requirements.


Which then dovetails into the actual answer to your question: we can absolutely upgrade the requirements, but they're basically all build-time requirements, not runtime; numpy is the only runtime requirement. Long answer shortened: "maybe, but not right now; we should look into it a bit deeper first to find a more optimal way of handling the numpy dep that isn't 'change the strict numpy 1.25 requirement into a strict numpy 2.0 requirement', because that's all sorts of crap for a bunch of reasons" :)
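To make the compile-time coupling concrete, here is a simplified, hypothetical runtime guard of the sort a compiled wheel could use to fail loudly on a mismatch. The check_numpy_compat helper and its major-version rule are illustrative assumptions only, not how diskannpy actually handles this (real ABI compatibility between numpy releases is more nuanced than a major-version comparison):

```python
import numpy as np


def check_numpy_compat(built_against: str) -> None:
    """Illustrative guard: an extension compiled against numpy's C headers
    raises ImportError when the installed numpy's major version differs
    from the one it was built against.

    This is a deliberate simplification for the example; it is not a
    faithful model of numpy's actual ABI compatibility guarantees.
    """
    built_major = int(built_against.split(".")[0])
    runtime_major = int(np.__version__.split(".")[0])
    if runtime_major != built_major:
        raise ImportError(
            f"extension was built against numpy {built_against}, "
            f"but numpy {np.__version__} is installed"
        )
```

The point of the sketch is the failure mode: without something like conda keeping "built for numpy 1.25" and "built for numpy 2.0.2" wheels separate, the mismatch only surfaces at import time.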

* Update push-test.yml

When the upload-artifact action was updated to v4, its behavior changed such that you can no longer build up a unified artifact by adding files to an existing artifact name. Instead, every upload requires a unique artifact name.

This should fix the current workflow build errors.

* Trying to fix the workflow to handle the name collision problem that starts in upload-artifact@v4

* Trying to get the cibuildwheel tool to work with the changes to AlmaLinux8 (why the heck is this docker image still shipping with an old gpg key anyway)

* Dynamic and Dynamic Labels were both trying to write to the  artifact
…e line above made that clear and I somehow didn't notice it.
…ager context. I should have known better than to try to force this through last night so late :D
…still in scope and hasn't been marked for collection yet? Let's try atexit instead. I hate doing this but it's either this or leave it in the temp directory
…ts around filtered indices is sporadically failing. This is new! And exciting! But also unexpected!
daxpryce (Contributor, Author)

@gopalrs

I'm getting occasional failures in one of the unit tests, seemingly around filtered memory indices. In short, my test compares the results of an index search without filters against an index search with filters (where every element has the same label). The expectation is that the two result sets will be identical, but that does not seem to be the case.

I've added a comment to the unit test (with your name in it) to showcase what I'm doing, but because this is (a) sporadic and (b) seemingly the result of changes in the cxx itself (nothing in the python has changed), I wanted to flag it as a test case that maybe the cxx tests aren't catching or covering.

@@ -306,6 +306,8 @@ def test_exhaustive_validation(self):
with_filter_but_label_all, _ = index.search(
query_vectors[0], k_neighbors=k*2, complexity=128, filter_label="all"
)
# this has begun failing randomly. gopalrs - this *shouldn't* fail, but for some reason the filtered search
daxpryce (Contributor, Author)

@gopalrs this is the comment I'm talking about. It works sometimes, other times it doesn't; the inconsistency is going to make it hard to track down, but it should work 100% of the time (it did before, at least!)
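The invariant that unit test checks can be reproduced outside diskann with a brute-force search. This is a hypothetical sketch, not the actual diskannpy test: the knn helper and the random data are assumptions for illustration. The property being asserted is that when every element carries the same label, a search filtered on that label must return exactly the unfiltered results.

```python
import numpy as np


def knn(data: np.ndarray, query: np.ndarray, k: int,
        labels=None, filter_label=None) -> np.ndarray:
    """Brute-force k-nearest-neighbor search; when filter_label is given,
    restrict candidates to rows whose label matches (mirroring a filtered
    index search)."""
    candidates = np.arange(len(data))
    if filter_label is not None:
        candidates = candidates[np.asarray(labels)[candidates] == filter_label]
    dists = np.linalg.norm(data[candidates] - query, axis=1)
    return candidates[np.argsort(dists)[:k]]


# Invariant under test: with a universal label, filtered == unfiltered.
rng = np.random.default_rng(0)
data = rng.normal(size=(100, 8)).astype(np.float32)
query = rng.normal(size=8).astype(np.float32)
labels = ["all"] * len(data)

unfiltered = knn(data, query, k=10)
filtered = knn(data, query, k=10, labels=labels, filter_label="all")
assert np.array_equal(unfiltered, filtered)
```

In the brute-force case the equality holds by construction, which is exactly why a sporadic divergence in the real index points at the filtered search path rather than at the test itself.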

@daxpryce daxpryce merged commit e6afdbb into main Oct 24, 2024
36 of 38 checks passed