feat(gpu): improve full propagation in sum and sub #1763
base: main
Conversation
Force-pushed from 6b37fe7 to 44cb537
Hey @guillermo-oyarzun! Thanks a lot for this PR 🙏 Here comes a first review. I don't know the details of the implementation, so it's hard for me to go through all the logic; maybe if you walk me through it, it would help.
    uint32_t gpu_count,
    int8_t **mem_ptr_void);

void scratch_cuda_integer_overflowing_sub_kb_64_inplace(
Are you sure there aren't some cuda_integer_radix_overflowing_sub calls still left in some places?
I removed some unnecessary headers that were still there
int_radix_lut(cudaStream_t const *streams, uint32_t const *gpu_indexes,
              uint32_t gpu_count, int_radix_params params, uint32_t num_luts,
              uint32_t num_radix_blocks, uint32_t lut_count,
              bool allocate_gpu_memory) {
It's a bit strange to have both num_luts and lut_count; how about we rename lut_count to num_many_lut? (That renaming would have to be applied everywhere lut_count is used.)
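As a sketch only, here is what the constructor could look like with the suggested rename applied (num_many_lut is just the proposed name, not the current API):

```cpp
// Sketch of the proposed rename lut_count -> num_many_lut; all other
// parameters are copied from the declaration quoted above.
int_radix_lut(cudaStream_t const *streams, uint32_t const *gpu_indexes,
              uint32_t gpu_count, int_radix_params params, uint32_t num_luts,
              uint32_t num_radix_blocks, uint32_t num_many_lut,
              bool allocate_gpu_memory);
```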
multi_gpu_alloc_lwe_async(streams, gpu_indexes, active_gpu_count,
                          lwe_after_ks_vec, num_radix_blocks,
                          params.small_lwe_dimension + 1);
multi_gpu_alloc_many_lwe_async(streams, gpu_indexes, active_gpu_count,
Maybe this function could be renamed: multi_gpu_alloc_lwe_many_lut_output_async?
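For illustration, the allocation call above under the suggested name (the remaining arguments are truncated in the excerpt, so they are left elided here):

```cpp
// Rename sketch only: the arguments after active_gpu_count are not shown in
// the excerpt above, so they are kept as a placeholder comment.
multi_gpu_alloc_lwe_many_lut_output_async(streams, gpu_indexes, active_gpu_count,
                                          /* ...remaining arguments unchanged... */);
```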
};

// needed for the division to update the lut indexes
void update_lut_indexes(cudaStream_t const *streams,
I'm not sure having the function written this way is the most readable option. Maybe it would be better to write the whole logic for the index update in the division itself instead?
I moved most of the logic to the division now
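As a rough sketch of what moving that logic looks like, assuming hypothetical names (the lut_indexes member and the cuda_memcpy_async_to_gpu helper are assumptions about the backend, not the PR's actual code):

```cpp
// Hypothetical sketch: the division host code picks its own index table and
// writes it into the LUT structure directly, instead of calling a generic
// update_lut_indexes() member on int_radix_lut.
template <typename Torus>
void set_division_lut_indexes(cudaStream_t stream, uint32_t gpu_index,
                              int_radix_lut<Torus> *luts,
                              Torus const *h_new_indexes,
                              uint32_t num_radix_blocks) {
  // Copy the host-side index table onto the device buffer read by the
  // LUT-application kernel (member/helper names are assumptions).
  cuda_memcpy_async_to_gpu(luts->lut_indexes, (void *)h_new_indexes,
                           num_radix_blocks * sizeof(Torus), stream, gpu_index);
}
```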
luts_array_second_step->release(streams, gpu_indexes, gpu_count);

if (use_sequential_algorithm_to_resolver_group_carries) {
  seq_group_prop_mem->release(streams, gpu_indexes, gpu_count);
delete is missing
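Based only on the lines quoted above, the fix would look roughly like this:

```cpp
// Release the device-side buffers, then also free the host-side object;
// the `delete` line is the part flagged as missing.
if (use_sequential_algorithm_to_resolver_group_carries) {
  seq_group_prop_mem->release(streams, gpu_indexes, gpu_count);
  delete seq_group_prop_mem;
}
```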
cudaStream_t const *streams, uint32_t const *gpu_indexes,
uint32_t gpu_count, Torus *lwe_array, int_radix_params params,
int_shifted_blocks_and_states_memory<Torus> *mem, void *const *bsks,
Torus *const *ksks, uint32_t num_blocks, uint32_t lut_stride,
Maybe num_blocks -> num_radix_blocks
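Applied to the parameters quoted above, the suggestion would read as follows (the tail of the parameter list is truncated in the excerpt and left as-is):

```cpp
// Rename sketch only: num_blocks -> num_radix_blocks; everything else is
// copied from the excerpt, which cuts off after lut_stride.
cudaStream_t const *streams, uint32_t const *gpu_indexes,
uint32_t gpu_count, Torus *lwe_array, int_radix_params params,
int_shifted_blocks_and_states_memory<Torus> *mem, void *const *bsks,
Torus *const *ksks, uint32_t num_radix_blocks, uint32_t lut_stride,
```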
tfhe/src/core_crypto/gpu/mod.rs (outdated)
    message_modulus,
);
}
I don't think we need to add this to core crypto, do we?
@@ -227,6 +285,18 @@ impl CudaServerKey {
        streams.synchronize();
    }

    pub fn unchecked_add_assign_with_packing<T: CudaIntegerRadixCiphertext>(
Do we need to have this entry point on the Rust side?
It is something needed for the signed overflowing add/sub, but it's not actually tested in this PR. I could remove it here and just include it in the other PR we will have.
///
/// - `streams` __must__ be synchronized to guarantee computation has finished, and inputs must
///   not be dropped until streams is synchronized
pub(crate) unsafe fn new_propagate_single_carry_assign_async<T>(
Couldn't we name this one propagate_single_carry_assign_async and remove the old one?
where
    T: CudaIntegerRadixCiphertext,
{
    self.propagate_fast_single_carry_assign_async(ct, streams, input_carry, requested_flag)
Couldn't we keep only the fast version on the Rust side, and remove the old one?
Force-pushed from 09770dd to 935588f
Force-pushed from 935588f to 9af4fde
closes: please link all relevant issues
PR content/description
Check-list: