@Jonas-Heinrich Jonas-Heinrich commented Sep 3, 2023

Hi,

While comparing QPL to TUM Umbra's related functionality, I noticed that the former has a large overhead for small numbers of tuples. After investigating with VTune and the microbenchmark below, I believe I found a typo that leads to an accidental copy of the function pointer table in input_stream_t::initialize_sw_kernels. Here's the benchmark:

#include "qpl/qpl.h"
#include <benchmark/benchmark.h>

#include <cstdint>
#include <iostream>
#include <vector>

#include "../DataDistribution.hpp"
#include "../Utils.hpp"

int main() {
   // pointer wrapper with destructor
   ExecutionContext fixture(qpl_path_software);
   qpl_job* job = reinterpret_cast<qpl_job*>(fixture.getExecutionContext());

   uint32_t numTuples = 1;
   double selectivity = 0.5;
   uint32_t inBitWidth = 8;
   qpl_out_format outFormat = qpl_out_format::qpl_ow_32;

   // private utilities, does what you would expect
   LittleEndianBufferBuilder builder(inBitWidth, numTuples);
   Dataset dataset = UniformDistribution::generateDataset(
      builder,
      inBitWidth,
      (1ul << inBitWidth) - 1,
      numTuples,
      umbra::VectorizedFunctions::Mode::Eq,
      selectivity);
   std::vector<uint8_t> destination;
   destination.resize(divceil(32 * numTuples, 8));

   for (size_t i = 0; i < 1'000'000'000; i++) {
      if (i % 1'000'000 == 0) {
         std::cout << i << std::endl;
      }

      // Parameterize jobs.
      {
         job->parser = builder.getQPLParser();
         job->next_in_ptr = dataset.data.data();
         job->available_in = dataset.data.size();
         job->next_out_ptr = destination.data();
         job->available_out = static_cast<uint32_t>(destination.size());
         job->op = map_umbra_to_qpl_op(umbra::VectorizedFunctions::Mode::Eq);
         job->src1_bit_width = inBitWidth;
         job->num_input_elements = numTuples;
         job->out_bit_width = outFormat;
         auto [param_low, param_high] = dataset.predicateArguments->template getValues<uint32_t, 2>();
         job->param_low = param_low;
         job->param_high = param_high;
         job->flags = static_cast<uint32_t>(QPL_FLAG_OMIT_CHECKSUMS | QPL_FLAG_OMIT_AGGREGATES);
      }

      qpl_status status = qpl_execute_job(job);
      if (status != QPL_STS_OK) {
         ERROR("An error occurred during job execution: " << status);
      }

      const auto indicesByteSize = job->total_out;
      const auto bytesPerHit = divceil(32, 8);
      const auto qplHits = indicesByteSize / bytesPerHit;
      if (outFormat != qpl_ow_nom && qplHits != dataset.predicateHits) {
         ERROR("Result does not fit expectations: " << qplHits << " != " << dataset.predicateHits);
      } else if (outFormat == qpl_ow_nom && job->total_out != divceil(numTuples, 8)) {
         ERROR("Result does not fit expectations: " << job->total_out << " != " << divceil(numTuples, 8));
      }

      benchmark::DoNotOptimize(job->total_out);
      benchmark::DoNotOptimize(status);
      benchmark::DoNotOptimize(destination);
      benchmark::ClobberMemory();
   }
}

The benchmark was run for 15 s on an Intel i9-13900K. VTune summary before the PR:

[Screenshot: VTune summary, 2023-09-03 12:45:58]

After the PR:

[Screenshot: VTune summary, 2023-09-03 13:39:15]

@Jonas-Heinrich Jonas-Heinrich (Author) commented Sep 3, 2023

After further investigation, I noticed that the issue is also present in other locations. The second force-push now fixes those locations as well (found by searching for core_sw::dispatcher::kernels_dispatcher::get_instance()).
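
For anyone auditing their own tree for the same slip, a search along these lines surfaces every dispatcher lookup so each call site can be checked for reference binding. A sketch (the sample file is fabricated for demonstration, and the member name get_unpack_table is illustrative, not necessarily QPL's API):

```shell
# Create a tiny sample source file containing a by-value lookup, then
# locate every dispatcher call site; each hit should bind the result by
# reference (auto& / const auto&), not by value (plain auto).
demo_dir=$(mktemp -d)
cat > "$demo_dir/sample.cpp" <<'EOF'
auto table = core_sw::dispatcher::kernels_dispatcher::get_instance().get_unpack_table();
EOF
grep -rn "kernels_dispatcher::get_instance()" "$demo_dir"
```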

@mzhukova mzhukova self-assigned this Sep 5, 2023
@mzhukova mzhukova (Contributor) commented
Hi @Jonas-Heinrich, thank you for the contribution!
The team will review and do thorough testing on our side in the upcoming weeks.

@mzhukova mzhukova added the enhancement New feature or request label Sep 18, 2023