feat(gpu): improve full propagation in sum and sub #1763

Open
wants to merge 2 commits into main from go/refactor/improve-full-propagation-and-sum-algorithms
Conversation

guillermo-oyarzun (Member)

closes: please link all relevant issues

PR content/description

Check-list:

  • Tests for the changes have been added (for bug fixes / features)
  • Docs have been added / updated (for bug fixes / features)
  • Relevant issues are marked as resolved/closed, related issues are linked in the description
  • Check for breaking changes (including serialization changes) and add them to commit message following the conventional commit specification

guillermo-oyarzun self-assigned this on Nov 8, 2024
cla-bot added the cla-signed label on Nov 8, 2024
guillermo-oyarzun force-pushed the go/refactor/improve-full-propagation-and-sum-algorithms branch from 6b37fe7 to 44cb537 on November 8, 2024 at 11:56
agnesLeroy (Contributor) left a comment:

Hey @guillermo-oyarzun! Thanks a lot for this PR 🙏 Here comes a first review. I don't know the details of the implementation, so it's hard for me to go through all the logic. Maybe if you walk me through it, that would help.

uint32_t gpu_count,
int8_t **mem_ptr_void);

void scratch_cuda_integer_overflowing_sub_kb_64_inplace(
Contributor:

Are you sure there aren't some cuda_integer_radix_overflowing_sub left over in some places?

guillermo-oyarzun (Member, Author):

I removed some unnecessary headers that were still there

int_radix_lut(cudaStream_t const *streams, uint32_t const *gpu_indexes,
uint32_t gpu_count, int_radix_params params, uint32_t num_luts,
uint32_t num_radix_blocks, uint32_t lut_count,
bool allocate_gpu_memory) {
agnesLeroy (Contributor), Nov 8, 2024:

It's a bit strange to have both num_luts and lut_count. How about we rename lut_count to num_many_lut? (That renaming would have to be applied everywhere lut_count is used.)
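
For illustration, the constructor declaration from the hunk above would then read roughly as follows. This is just the same parameter list with lut_count renamed to num_many_lut, a sketch of the suggestion rather than a final signature:

int_radix_lut(cudaStream_t const *streams, uint32_t const *gpu_indexes,
              uint32_t gpu_count, int_radix_params params, uint32_t num_luts,
              uint32_t num_radix_blocks, uint32_t num_many_lut,
              bool allocate_gpu_memory) {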

multi_gpu_alloc_lwe_async(streams, gpu_indexes, active_gpu_count,
lwe_after_ks_vec, num_radix_blocks,
params.small_lwe_dimension + 1);
multi_gpu_alloc_many_lwe_async(streams, gpu_indexes, active_gpu_count,
agnesLeroy (Contributor), Nov 8, 2024:

Maybe this function could be renamed: multi_gpu_alloc_lwe_many_lut_output_async?

};

// needed for the division to update the lut indexes
void update_lut_indexes(cudaStream_t const *streams,
Contributor:

I'm not sure having the function written this way is the most readable approach. Maybe it would be better to put the whole index-update logic in the division itself instead?

guillermo-oyarzun (Member, Author):

I moved most of the logic to the division now

luts_array_second_step->release(streams, gpu_indexes, gpu_count);

if (use_sequential_algorithm_to_resolver_group_carries) {
seq_group_prop_mem->release(streams, gpu_indexes, gpu_count);
Contributor:

delete is missing
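
Presumably the fix being asked for is something like the following (a sketch only, assuming seq_group_prop_mem was heap-allocated with new; the flag name is copied verbatim from the hunk above):

if (use_sequential_algorithm_to_resolver_group_carries) {
  // release frees the GPU-side buffers; the host-side object itself
  // still needs to be deleted to avoid leaking it
  seq_group_prop_mem->release(streams, gpu_indexes, gpu_count);
  delete seq_group_prop_mem;
}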

cudaStream_t const *streams, uint32_t const *gpu_indexes,
uint32_t gpu_count, Torus *lwe_array, int_radix_params params,
int_shifted_blocks_and_states_memory<Torus> *mem, void *const *bsks,
Torus *const *ksks, uint32_t num_blocks, uint32_t lut_stride,
Contributor:

Maybe num_blocks -> num_radix_blocks

message_modulus,
);
}

Contributor:

I don't think we need to add this to core crypto, do we?

@@ -227,6 +285,18 @@ impl CudaServerKey {
streams.synchronize();
}

pub fn unchecked_add_assign_with_packing<T: CudaIntegerRadixCiphertext>(
Contributor:

Do we need to have this entry point on the Rust side?

guillermo-oyarzun (Member, Author):

It is needed for the signed overflowing add/sub, which isn't actually tested in this PR. I could remove it here and just include it in the other PR we will have.

///
/// - `streams` __must__ be synchronized to guarantee computation has finished, and inputs must
/// not be dropped until streams is synchronized
pub(crate) unsafe fn new_propagate_single_carry_assign_async<T>(
Contributor:

Couldn't we name this one propagate_single_carry_assign_async and remove the old one?

where
T: CudaIntegerRadixCiphertext,
{
self.propagate_fast_single_carry_assign_async(ct, streams, input_carry, requested_flag)
Contributor:

Couldn't we keep only the fast version on the Rust side, and remove the old one?

guillermo-oyarzun force-pushed the go/refactor/improve-full-propagation-and-sum-algorithms branch 9 times, most recently from 09770dd to 935588f, on November 13, 2024 at 14:52
guillermo-oyarzun force-pushed the go/refactor/improve-full-propagation-and-sum-algorithms branch from 935588f to 9af4fde on November 13, 2024 at 16:59