Skip to content

Conversation

@viclafargue
Copy link
Contributor

The SNMG ANN build spawns threads with #pragma omp parallel for, but CAGRA build also uses OpenMP internally. With nested parallelism disabled by default, CAGRA's internal parallel regions ran serially on a single thread, causing severe performance degradation.

  • Enable nested parallelism with omp_set_nested(1)
  • Limit outer loop to num_ranks threads (one per GPU)
  • Inside each rank thread, allocate threads_per_rank for CAGRA's internal parallelism
  • Restore original thread count after parallel region

@viclafargue viclafargue requested a review from a team as a code owner November 10, 2025 16:15
@viclafargue viclafargue changed the title SNMG ANN build OpenMP nested parallelism SNMG ANN build with OpenMP nested parallelism Nov 10, 2025
@viclafargue viclafargue added bug Something isn't working non-breaking Introduces a non-breaking change labels Nov 11, 2025
}

// Restore original thread count
cuvs::core::omp::set_num_threads(saved_omp_threads);
Copy link
Contributor

@tarang-jain tarang-jain Nov 12, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this restoration is a bit confusing. the pool of threads is set inside the for loop above, but its restored after the for loop? Can you verify if this logic is correct?

Copy link
Contributor Author

@viclafargue viclafargue Nov 13, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree. This is actually not necessary. My worry was that the main thread could be one of the threads in the loop. But, after checking it appears like OpenMP is designed in such a way that omp_set_num_threads calls inside of a parallel region only affect how many threads each thread can use in nested parallel region. The main thread is unaffected, and its number of threads is actually preserved. I made sure that each thread sees its number of thread restored back to the global maximal number of threads within the parallel region.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also added the same OpenMP usage for the extend function.

@viclafargue viclafargue changed the base branch from main to release/25.12 November 17, 2025 17:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working non-breaking Introduces a non-breaking change

Development

Successfully merging this pull request may close these issues.

4 participants