Enable kernel & memcpy overlapping in IVF index building #230
Conversation
/ok to test
@tfeher, do you have any suggestions on how we should expose the overlapping feature to users? Currently, users need to create a stream pool with one stream in it (separate from the main stream) so that we can overlap kernels and copies. But this model is not intuitive. Supporting overlapping in the ANN benchmarks is not difficult, but for end users, do we want to add a flag to our API (specifically `build_params`)?
Thanks Rui for the PR, this is an important improvement for the build time of IVF methods! In general it looks good, but there are a few small things that we shall clarify.
This is a good question. In other words, to enable overlapping of clustering and H2D copies, we have to do the following:

    handle = DeviceResources(n_streams=1)
    index = ivf_pq.build(build_params, dataset, resources=handle)

The question is whether we would instead like an extra build option, like:

    build_params.overlap_h2d_copies = True
    index = ivf_pq.build(build_params, dataset)

I would prefer the first option: if we have the resources (a stream pool), then we use that automatically. But we would need to update the docstring of `build`.
Hi @cjnolet, could you comment on the two choices Tamas summarized here?
Force-pushed from 08dde39 to 8b7a116
Re-requesting reviews. Added documentation and exposed the overlapping option to users. Users need to set up a stream pool in the resource handle and pass that to the build function.
@abc99lr just to add my response here quickly: I do agree with Tamas' suggestion that we should use the infrastructure already available in the handle to drive the user-configurable behavior of algorithms as much as possible. That being said, the expectations around these configurations do need to be communicated more clearly to users through the docs, so +1 to improving the docstring.
Thanks for the review @cjnolet. Could you let me know if there is anything specific we should add to the docstring? I added more documentation in my changes last Friday.
@abc99lr the doc description itself looks okay, but it would be helpful to users to add a line or two to the usage example below the description that sets a stream pool on the handle. I would also add a small code comment to the snippet that says something like “create a cuda stream pool to …”. Thanks again for the contribution!
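For reference, the kind of usage example being discussed might look roughly like this (a sketch assembled from the snippets earlier in the thread; the import paths, `IndexParams(n_lists=...)`, and the `resources=` keyword are assumptions rather than the final docstring text):

```python
import numpy as np
from pylibraft.common import DeviceResources
from cuvs.neighbors import ivf_pq

# Create a CUDA stream pool with one extra stream so that H2D copies of the
# dataset can overlap with the clustering kernels during index build.
handle = DeviceResources(n_streams=1)

dataset = np.random.random_sample((10000, 64)).astype(np.float32)
build_params = ivf_pq.IndexParams(n_lists=1024)
index = ivf_pq.build(build_params, dataset, resources=handle)
```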
Thanks @abc99lr for the updates! It looks good, let's just add a small improvement to the docstring code examples.
/ok to test
Thanks Rui for the update, the PR looks good to me!
/ok to test
Thanks, my pre-commit hooks were installed in my previous conda env. I had to rebuild the env and forgot to add them back. Thanks for the instructions. I think we should be good now.
/ok to test
/merge
Currently, in IVF index building (both IVF-Flat and IVF-PQ), the large dataset is usually kept in pageable host memory or an mmap-ed file. In both cases, after the cluster centers are trained, the entire dataset needs to be copied to the GPU twice: once for assigning vectors to clusters, and once for copying vectors into their clusters. Both copies are done with `batch_load_iterator` in a chunk-by-chunk fashion. Since the source buffer is in pageable memory, the current `batch_load_iterator` implementation does not overlap kernels with memory copies. This PR adds support for prefetching with `cudaMemcpyAsync` on pageable memory: we achieve kernel/copy overlapping by launching the kernel first, followed by the prefetch of the next chunk.

We benchmarked the change on an L40S GPU. The results show a 3%-21% speedup on index building without impacting search recall (the recall differences of about 1-2% are similar to run-to-run variance).

algo | dataset | model | with prefetching (s) | without prefetching (s) | speedup
-- | -- | -- | -- | -- | --
IVF-PQ | deep-100M | d64b5n50K | 97.3547 | 100.36 | 1.03
IVF-PQ | wiki-all-10M | d64-nlist16K | 14.9763 | 18.1602 | 1.21
IVF-Flat | deep-100M | nlist50K | 78.8188 | 81.4461 | 1.03

This PR is related to the issue submitted to RAFT: rapidsai/raft#2106

Authors:
- Rui Lan (https://github.com/abc99lr)
- Corey J. Nolet (https://github.com/cjnolet)

Approvers:
- Tamas Bela Feher (https://github.com/tfeher)

URL: rapidsai#230
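To make the overlap pattern concrete, the following is a minimal double-buffered prefetch sketch in Python with CuPy. It is not the actual `batch_load_iterator` change (that lives in the C++ layer); it only illustrates the idea of launching the compute for chunk i first and then issuing the H2D copy of chunk i + 1 so the copy can run while the kernel executes:

```python
import cupy as cp
import numpy as np

# Pageable host dataset, processed on the GPU chunk by chunk.
n_rows, dim, chunk = 400_000, 64, 100_000
host_data = np.random.random_sample((n_rows, dim)).astype(np.float32)

main_stream = cp.cuda.Stream(non_blocking=True)   # compute stream
copy_stream = cp.cuda.Stream(non_blocking=True)   # prefetch stream

# Two device buffers: while chunk i is processed out of one buffer,
# chunk i + 1 is copied into the other.
buffers = [cp.empty((chunk, dim), dtype=cp.float32) for _ in range(2)]
n_chunks = n_rows // chunk

def prefetch(i):
    # H2D copy of chunk i on the side stream (cudaMemcpyAsync under the hood).
    buffers[i % 2].set(host_data[i * chunk:(i + 1) * chunk], stream=copy_stream)

prefetch(0)
for i in range(n_chunks):
    copy_stream.synchronize()      # chunk i is now resident on the GPU
    with main_stream:
        # Stand-in for the real work (cluster assignment / vector copy kernels).
        norms = cp.linalg.norm(buffers[i % 2], axis=1)
    if i + 1 < n_chunks:
        prefetch(i + 1)            # overlaps with the kernel on main_stream
    main_stream.synchronize()      # buffer i % 2 can safely be reused next round
```

In the real build path the prefetch would be issued on a stream taken from the handle's stream pool, which is why enabling the overlap requires creating the resource handle with a stream pool containing at least one stream.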