parallelize numbuf memcpy and plasma object hash construction #366

atumanov · 2017-03-13T09:59:30Z

No description provided.

AmplabJenkins · 2017-03-13T17:18:31Z

Merged build finished. Test FAILed.

AmplabJenkins · 2017-03-13T17:18:31Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/279/
Test FAILed.

pcmoritz · 2017-03-13T22:37:51Z

src/numbuf/python/src/pynumbuf/memory.h

@@ -44,7 +50,11 @@ class FixedBufferStream : public arrow::io::OutputStream,
    DCHECK(position_ + nbytes <= size_) << "position: " << position_
                                        << " nbytes: " << nbytes << "size: " << size_;
    uint8_t* dst = data_ + position_;
-    memcpy(dst, data, nbytes);
+    if (nbytes >= (1<<20)) {


Let's make this a constant (and for the code that computes the hash too)

will do, where's a good place to put it? It's worth defining things like #define MB (1<<20) and using that for better readability.
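A minimal sketch of what such a shared constant could look like, assuming a common header that both the memcpy path and the hashing path can include. The names kBytesInMB, kParallelThreshold and use_parallel_path are hypothetical and are not taken from the PR; a later revision of the diff refers to a macro called BYTES_IN_MB instead.

```cpp
#include <cstdint>

// Hypothetical names for illustration; the thread only proposes a macro
// such as MB (1<<20).
constexpr int64_t kBytesInMB = int64_t{1} << 20;

// Assumed policy: buffers of at least 1 MB take the multithreaded path,
// mirroring the `nbytes >= (1<<20)` test in the diff above.
constexpr int64_t kParallelThreshold = kBytesInMB;

inline bool use_parallel_path(int64_t nbytes) {
  return nbytes >= kParallelThreshold;
}
```

Keeping the threshold in one place means the copy and hash code cannot drift apart if the cutoff is tuned later.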

pcmoritz · 2017-03-13T22:53:56Z

src/plasma/plasma_client.cc

+  // Start the prefix thread.
+  threads.push_back(std::thread(
+      compute_block_hash, data, prefix, &threadhash[0]));
+  for (int i = 1; i <= numthreads; i++) {


Let's try to make NUMTHREADS the real number of threads here: i = 0 for the prefix, i = 1, ..., NUMTHREADS-2 for the middle chunks, and i = NUMTHREADS-1 for the suffix.

@pcmoritz, I could do that, but when we get aligned, well-behaved input we won't even have to start the prefix and suffix threads, so in the general case the actual number of threads lies in [numthreads, numthreads+2]. Alternatively, I could issue the prefix and suffix memcopy in the main thread, without spawning a thread for it. Would that be better? I felt that having a guaranteed tight bound on the number of threads is sufficient.
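The thread-count argument above can be sketched as follows. This is an illustration only, assuming the buffer is split into an optional unaligned prefix, num_threads equally sized block-aligned chunks, and an optional suffix. Chunk and split_aligned are hypothetical names, block_size is assumed to be a power of 2, and num_threads is assumed to be positive.

```cpp
#include <cstdint>
#include <vector>

struct Chunk {
  uint64_t offset;  // byte offset from the start of the buffer
  uint64_t length;  // number of bytes in this chunk
};

// Split [address, address + nbytes) into an optional prefix, num_threads
// aligned chunks, and an optional suffix. Returns the chunks in order.
std::vector<Chunk> split_aligned(uint64_t address, uint64_t nbytes,
                                 uint64_t num_threads, uint64_t block_size) {
  std::vector<Chunk> chunks;
  // First and last block-aligned addresses inside the buffer.
  const uint64_t left = (address + block_size - 1) & ~(block_size - 1);
  const uint64_t right = (address + nbytes) & ~(block_size - 1);
  if (left >= right) {
    // Too small to contain a whole aligned block: one chunk, one thread.
    if (nbytes > 0) chunks.push_back({0, nbytes});
    return chunks;
  }
  const uint64_t prefix = left - address;
  if (prefix > 0) chunks.push_back({0, prefix});
  const uint64_t num_blocks = (right - left) / block_size;
  const uint64_t chunk_size = (num_blocks / num_threads) * block_size;
  uint64_t offset = prefix;
  if (chunk_size > 0) {
    for (uint64_t i = 0; i < num_threads; ++i) {
      chunks.push_back({offset, chunk_size});
      offset += chunk_size;
    }
  }
  // Leftover blocks plus any unaligned tail go into the suffix.
  if (offset < nbytes) chunks.push_back({offset, nbytes - offset});
  return chunks;
}
```

With an aligned start and a length that divides evenly, only the num_threads middle chunks are produced; an unaligned start or a ragged tail adds at most one chunk each, which gives the upper bound of num_threads + 2.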

robertnishihara · 2017-03-13T23:02:47Z

src/numbuf/thirdparty/build_thirdparty.sh

@@ -24,5 +24,6 @@ echo "building arrow"
 cd $TP_DIR/arrow/cpp
 mkdir -p $TP_DIR/arrow/cpp/build
 cd $TP_DIR/arrow/cpp/build
-cmake -DCMAKE_BUILD_TYPE=Release -DCMAKE_C_FLAGS="-g" -DCMAKE_CXX_FLAGS="-g" -DARROW_BUILD_TESTS=OFF ..
+cmake -DCMAKE_BUILD_TYPE=Release -DCMAKE_C_FLAGS="-g -lpthread" -DCMAKE_CXX_FLAGS="-g -lpthread" -DARROW_BUILD_TESTS=OFF ..
+make clean


we should consider removing this before merging, since we probably don't want to rebuild arrow normally

yes, i'll remove make clean before merging.

pcmoritz · 2017-03-13T23:05:41Z

src/plasma/plasma_client.cc

+  const uint64_t numthreads = NUMTHREADS;
+  uint64_t threadhash[numthreads+2];
+  //CHECK(numthreads > 0);
+  const uint64_t blocksz = 64; // cache block alignment (alternative: page size)


Let's not use abbreviations here (block_size, data_begin maybe, data_end or names like this)

it's going to be unbelievably verbose. The more verbose things are, the more difficult it is to read code. Using "sz" suffix is standard in the linux kernel. See here for example:
https://github.com/torvalds/linux/blob/5924bbecd0267d87c24110cbe2041b5075173a25/arch/microblaze/include/asm/mmu.h#L101

Hm, we need to have a consistent style here. We are using extremely few abbreviations in the code right now (basically just abbreviating id for identifier and db for database). I think that will make it more understandable to people. I'm with you on trying to be concise with code, but let's not trade off readability. It will be good if our code is readable by many different people.

pcmoritz · 2017-03-13T23:13:11Z

src/plasma/plasma_client.cc

+  //CHECK(numthreads > 0);
+  const uint64_t blocksz = 64; // cache block alignment (alternative: page size)
+  // Calculate the first and last aligned positions in the data stream.
+  unsigned char *databp = (unsigned char *)(((uint64_t)data + blocksz-1) & ~(blocksz-1));


hm, can you explain how this works? any way to simplify this?

This is meant to work for any blocksz that is a power of 2. The only way to simplify it is to assume a specific alignment and substitute the constants, so let's assume 64-byte alignment. The first term on the right-hand side, data + blocksz-1, pushes the pointer past the next aligned memory position unless it is already aligned. The second term, ~(blocksz-1), is the bitwise complement of the alignment minus 1, so it leaves all bits on except the bits that correspond to the bit-width of the alignment (in this case, the 6 least significant bits are zero). A bitwise AND of the first term with the second zeros out the 6 least significant bits, which rounds the address up to the next 64-byte aligned memory position. If the address is already aligned, it stays the same.

There are a few ways this could be simplified, such as shifting right and then left by the alignment bit-width, but that can behave differently across architectures if you are not careful, because the shift can be with or without the sign bit and with or without carry. It would also require us to add a variable that holds the alignment bit-width of the specified block size (i.e. log base 2 of blocksz).

Ok cool thanks, that makes sense, let's keep it the way it is and add some comments that describe what you just said
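The rounding trick explained above can be written as a small standalone helper with a few worked values. This is a sketch rather than the code from the PR: align_up and align_down are hypothetical names, and the alignment is assumed to be a power of 2.

```cpp
#include <cstdint>

// Round address up to the next multiple of alignment.
// Adding alignment - 1 carries into the higher bits unless the low bits
// are already zero; the mask then clears the low bits.
constexpr uint64_t align_up(uint64_t address, uint64_t alignment) {
  return (address + alignment - 1) & ~(alignment - 1);
}

// Round address down to the previous multiple of alignment.
constexpr uint64_t align_down(uint64_t address, uint64_t alignment) {
  return address & ~(alignment - 1);
}

// Worked examples for 64-byte alignment (mask clears the 6 low bits).
static_assert(align_up(100, 64) == 128, "100 rounds up to 128");
static_assert(align_up(128, 64) == 128, "aligned input is unchanged");
static_assert(align_up(129, 64) == 192, "129 rounds up to 192");
static_assert(align_down(129, 64) == 128, "129 rounds down to 128");
```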

pcmoritz

We should also have a C unit test that checks that the parallel memcpy is correct.
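A rough sketch of what such a check could look like, written in C++ with std::thread for brevity. parallel_memcpy here is a hypothetical stand-in and not the helper from the PR (which uses a thread pool); the check simply compares the destination against the source after the copy.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <thread>
#include <vector>

// Hypothetical stand-in: copy nbytes from src to dst using up to
// num_threads worker threads, each copying one contiguous slice.
void parallel_memcpy(uint8_t* dst, const uint8_t* src, size_t nbytes,
                     size_t num_threads) {
  if (num_threads == 0) num_threads = 1;
  const size_t chunk = (nbytes + num_threads - 1) / num_threads;  // ceiling
  std::vector<std::thread> workers;
  for (size_t i = 0; i < num_threads; ++i) {
    const size_t begin = std::min(i * chunk, nbytes);
    const size_t end = std::min(begin + chunk, nbytes);
    if (begin == end) break;  // nothing left to copy
    workers.emplace_back(
        [=] { std::memcpy(dst + begin, src + begin, end - begin); });
  }
  for (std::thread& worker : workers) worker.join();
}

// Fill a source buffer with a simple pattern, copy it in parallel, and
// report whether the destination matches byte for byte.
bool parallel_memcpy_matches(size_t nbytes, size_t num_threads) {
  std::vector<uint8_t> src(nbytes), dst(nbytes, 0);
  for (size_t i = 0; i < nbytes; ++i) {
    src[i] = static_cast<uint8_t>(i * 31 + 7);
  }
  parallel_memcpy(dst.data(), src.data(), nbytes, num_threads);
  return src == dst;
}
```

Sizes worth covering include zero, sizes smaller than the thread count, and sizes that are not a multiple of the chunk size, since those exercise the slice boundaries.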

AmplabJenkins · 2017-03-14T10:09:58Z

Merged build finished. Test FAILed.

AmplabJenkins · 2017-03-14T10:09:59Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/289/
Test FAILed.

AmplabJenkins · 2017-03-14T10:34:59Z

Merged build finished. Test FAILed.

AmplabJenkins · 2017-03-14T10:35:00Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/290/
Test FAILed.

AmplabJenkins · 2017-03-20T12:11:50Z

Merged build finished. Test PASSed.

AmplabJenkins · 2017-03-20T12:11:50Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/353/
Test PASSed.

AmplabJenkins · 2017-03-20T18:39:47Z

Merged build finished. Test PASSed.

AmplabJenkins · 2017-03-20T18:39:47Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/356/
Test PASSed.

wesm · 2017-03-20T19:49:28Z

src/numbuf/python/src/pynumbuf/memory.h

 class FixedBufferStream : public arrow::io::OutputStream,
                          public arrow::io::ReadableFileInterface {
 public:
  virtual ~FixedBufferStream() {}

  explicit FixedBufferStream(uint8_t* data, int64_t nbytes)
-      : data_(data), position_(0), size_(nbytes) {}
+      : data_(data), position_(0), size_(nbytes), threadpool_(THREADPOOL_SIZE) {}


It'd be great if you could contribute this code to arrow/io. I opened JIRAs for both the FixedBufferStream and multithreaded memcpy

Hey Wes,
We're refactoring this into a standalone class. Should be easier to contribute to arrow as well when we're done with it.
Thanks,
Alexey

AmplabJenkins · 2017-03-20T21:25:12Z

Merged build finished. Test PASSed.

AmplabJenkins · 2017-03-20T21:25:13Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/364/
Test PASSed.

pcmoritz · 2017-03-20T22:25:49Z

src/numbuf/python/src/pynumbuf/memory.h

-      memcpy(dst, data, nbytes);
-    }
+    memcpy_helper.memcopy(dst, data, nbytes);
+//    if (nbytes >= BYTES_IN_MB) {


Let's get rid of these comments

pcmoritz · 2017-03-20T22:27:17Z

src/plasma/plasma_client.cc

+  *hash = XXH64_digest(&hash_state);
+}
+
+inline bool compute_object_hash_parallel(XXH64_state_t *hash_state,


Let's do the same naming convention here as in ParallelMemcpy

AmplabJenkins · 2017-03-21T00:14:45Z

Build finished. Test PASSed.

AmplabJenkins · 2017-03-21T00:14:45Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/370/
Test PASSed.

AmplabJenkins · 2017-03-21T01:00:31Z

Build finished. Test PASSed.

AmplabJenkins · 2017-03-21T01:00:32Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/372/
Test PASSed.

AmplabJenkins · 2017-03-21T01:32:25Z

Build finished. Test PASSed.

AmplabJenkins · 2017-03-21T01:32:26Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/374/
Test PASSed.

robertnishihara · 2017-03-21T01:50:17Z

src/plasma/plasma_client.cc

+  const uint64_t numthreads = THREADPOOL_SIZE;
+  uint64_t threadhash[numthreads + 2];
+  const uint64_t block_size = BLOCK_SIZE;
+  // Calculate the first and last aligned positions in the data stream.


we should make the comment style consistent

AmplabJenkins · 2017-03-21T02:46:09Z

Build finished. Test PASSed.

AmplabJenkins · 2017-03-21T02:46:09Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/378/
Test PASSed.

AmplabJenkins · 2017-03-21T03:10:15Z

Build finished. Test PASSed.

AmplabJenkins · 2017-03-21T03:10:16Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/379/
Test PASSed.

AmplabJenkins · 2017-03-21T05:45:13Z

Merged build finished. Test PASSed.

AmplabJenkins · 2017-03-21T05:45:14Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/382/
Test PASSed.

robertnishihara · 2017-03-21T06:10:36Z

src/plasma/plasma_store.cc

@@ -231,7 +232,8 @@ int create_object(Client *client_context,
    return PlasmaError_OutOfMemory;
  }
  /* Allocate space for the new object */
-  uint8_t *pointer = (uint8_t *) dlmalloc(data_size + metadata_size);
+  uint8_t *pointer =
+      (uint8_t *) dlmemalign(BLOCK_SIZE, data_size + metadata_size);


we should document why we are doing this, specifically that 64-byte alignment is REQUIRED by compute_object_hash_parallel
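For readers unfamiliar with aligned allocation, here is a small sketch of the idea. It uses posix_memalign as a stand-in because dlmemalign belongs to the dlmalloc allocator bundled with the project; allocate_aligned, is_block_aligned and kBlockSize are hypothetical names, and the 64-byte value mirrors the alignment mentioned in the comment above.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdlib.h>  // posix_memalign and free on POSIX systems

// Assumed cache block size, matching the 64-byte alignment discussed here.
constexpr size_t kBlockSize = 64;

// Allocate nbytes starting on a kBlockSize boundary. Returns nullptr on
// failure. The caller releases the memory with free().
uint8_t* allocate_aligned(size_t nbytes) {
  void* pointer = nullptr;
  if (posix_memalign(&pointer, kBlockSize, nbytes) != 0) {
    return nullptr;
  }
  return static_cast<uint8_t*>(pointer);
}

// True if pointer sits on a kBlockSize boundary.
bool is_block_aligned(const void* pointer) {
  return reinterpret_cast<uintptr_t>(pointer) % kBlockSize == 0;
}
```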

robertnishihara · 2017-03-21T06:11:15Z

src/plasma/plasma_client.cc

+  const uint64_t block_size = BLOCK_SIZE;
+  /* Calculate the first and last aligned positions in the data stream. */
+  const uint64_t data_address = reinterpret_cast<uint64_t>(data);
+  uint64_t left_address = (data_address + block_size - 1) & ~(block_size - 1);


I think it should be possible to simplify a bunch of the code here, e.g., left_address = data_address. Is that right?

robertnishihara · 2017-03-21T06:11:58Z

src/plasma/plasma_client.cc

+static inline bool compute_object_hash_parallel(XXH64_state_t *hash_state,
+                                                const unsigned char *data,
+                                                int64_t nbytes) {
+  const uint64_t numthreads = THREADPOOL_SIZE;


We should check that data % 64 == 0 and document why we are requiring this (explain that we don't want it to straddle multiple cache blocks)

robertnishihara · 2017-03-21T06:13:05Z

src/numbuf/python/src/pynumbuf/memory.h

+      double elapsed =
+          ((tv2.tv_sec - tv1.tv_sec) * 1000000 + (tv2.tv_usec - tv1.tv_usec)) / 1000000.0;
+      // TODO: replace this with ARROW_LOG(ARROW_INFO) or better equivalent.
+      printf("Copied %llu bytes in time = %8.4f MBps=%8.4f\n", nbytes, elapsed,


Is this printf still happening?

AmplabJenkins · 2017-03-21T06:17:10Z

Merged build finished. Test PASSed.

AmplabJenkins · 2017-03-21T06:17:11Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/384/
Test PASSed.

AmplabJenkins · 2017-03-21T06:40:37Z

Merged build finished. Test PASSed.

AmplabJenkins · 2017-03-21T06:40:38Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/385/
Test PASSed.

AmplabJenkins · 2017-03-21T18:50:04Z

Merged build finished. Test PASSed.

AmplabJenkins · 2017-03-21T18:50:05Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/388/
Test PASSed.

AmplabJenkins · 2017-03-21T19:06:08Z

Merged build finished. Test PASSed.

AmplabJenkins · 2017-03-21T19:06:09Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/389/
Test PASSed.

robertnishihara · 2017-03-21T19:31:12Z

src/plasma/plasma_client.cc

+   * be faster if the blocks that we divide the data into do not straddle extra
+   * cache blocks. The incoming addresses are 64-byte aligned because we
+   * allocate them with dlmemalign in create_object in plasma_store.cc. */
+  CHECK(data_address % 64 == 0);


The call to dlmemalign is in the plasma store. This check here is in the plasma client, so the check only makes sense if the alignment is preserved by memory mapping.

this check is not necessary. The code above should correctly compute the hash regardless of alignment. The reason for this is that we always start the first chunk at the given data_address. The invariant is that, given any alignment, with fixed numthreads and blocksz, the chunks produced for each thread will be exactly the same. This is a correctness property for deterministically computing the object hash.
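The invariant described above can be illustrated with a small sketch. It is not the PR's implementation: FNV-1a stands in for XXH64, the chunks are hashed sequentially instead of on separate threads, and chunked_hash and hash_is_alignment_independent are hypothetical names. The point is that chunk boundaries are computed as offsets from the start of the data, so the same bytes produce the same hash at any address.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// FNV-1a, used here only as a simple stand-in for XXH64.
uint64_t fnv1a(const uint8_t* data, size_t nbytes) {
  uint64_t hash = 14695981039346656037ULL;
  for (size_t i = 0; i < nbytes; ++i) {
    hash ^= data[i];
    hash *= 1099511628211ULL;
  }
  return hash;
}

// Hash num_chunks equally sized chunks (each a whole number of blocks)
// plus a suffix, then hash the per-chunk results into one value.
// Chunk boundaries depend only on nbytes, num_chunks and block_size.
uint64_t chunked_hash(const uint8_t* data, size_t nbytes, size_t num_chunks,
                      size_t block_size) {
  const size_t chunk_size = (nbytes / num_chunks) / block_size * block_size;
  std::vector<uint64_t> partial;
  size_t offset = 0;
  if (chunk_size > 0) {
    for (size_t i = 0; i < num_chunks; ++i) {
      partial.push_back(fnv1a(data + offset, chunk_size));
      offset += chunk_size;
    }
  }
  partial.push_back(fnv1a(data + offset, nbytes - offset));  // suffix
  return fnv1a(reinterpret_cast<const uint8_t*>(partial.data()),
               partial.size() * sizeof(uint64_t));
}

// Hash the same bytes at two different alignments and compare.
bool hash_is_alignment_independent(size_t nbytes) {
  std::vector<uint8_t> storage(nbytes + 64);
  for (size_t i = 0; i < nbytes; ++i) {
    storage[i] = static_cast<uint8_t>(i * 13 + 5);
  }
  const uint64_t first = chunked_hash(storage.data(), nbytes, 4, 64);
  std::memmove(storage.data() + 3, storage.data(), nbytes);  // shift by 3
  const uint64_t second = chunked_hash(storage.data() + 3, nbytes, 4, 64);
  return first == second;
}
```

Under this scheme alignment only affects speed, since an unaligned start makes each chunk straddle one extra cache block, but it does not affect the resulting hash.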

AmplabJenkins · 2017-03-21T19:39:25Z

Merged build finished. Test PASSed.

AmplabJenkins · 2017-03-21T19:39:25Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/390/
Test PASSed.

atumanov · 2017-03-21T20:27:53Z

src/plasma/plasma_client.cc

+   * be faster if the blocks that we divide the data into do not straddle extra
+   * cache blocks. The incoming addresses are 64-byte aligned because we
+   * allocate them with dlmemalign in create_object in plasma_store.cc. */
+  CHECK(data_address % 64 == 0);


this check is not necessary. The code above should correctly compute the hash regardless of alignment. The reason for this is that we always start the first chunk at the given data_address. The invariant is that, given any alignment, with fixed numthreads and blocksz, the chunks produced for each thread will be exactly the same. This is a correctness property for deterministically computing the object hash.

atumanov · 2017-03-21T20:28:19Z

src/plasma/plasma_store.cc

+   * order to align the allocated region to a 64-byte boundary. This is not
+   * strictly necessary, but it is an optimization that speeds up the
+   * computation of a hash of the data (see compute_object_hash_parallel in
+   * plasma_client.cc). */
  uint8_t *pointer =
      (uint8_t *) dlmemalign(BLOCK_SIZE, data_size + metadata_size);


if this helps, I can change the code back to dlmalloc for the purposes of this PR.

AmplabJenkins · 2017-03-21T22:50:08Z

Merged build finished. Test PASSed.

AmplabJenkins · 2017-03-21T22:50:09Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/391/
Test PASSed.

pcmoritz reviewed Mar 13, 2017

robertnishihara reviewed Mar 13, 2017

pcmoritz reviewed Mar 13, 2017

pcmoritz force-pushed the parallel-objecthash-merge branch from 19c609b to a658e50 Compare March 20, 2017 10:37

wesm reviewed Mar 20, 2017

pcmoritz reviewed Mar 20, 2017

robertnishihara reviewed Mar 21, 2017

parallelizing memcopy and object hash construction in numbuf/plasma

8cb2633

atumanov force-pushed the parallel-objecthash-merge branch from 1aa4c33 to 8cb2633 Compare March 21, 2017 05:28

clang format

6a62c09

robertnishihara reviewed Mar 21, 2017

whitespace

e120ec6

atumanov added 2 commits March 21, 2017 11:32

refactoring compute object hash: get rid of the prefix chunk

9425418

clang format

05622be

Document performance optimization.

7cfdc59

robertnishihara reviewed Mar 21, 2017

atumanov commented Mar 21, 2017

Remove check for 64-byte alignment, since it may not be guaranteed.

549c0b7

robertnishihara merged commit a3d5860 into ray-project:master Mar 21, 2017

robertnishihara deleted the parallel-objecthash-merge branch March 21, 2017 23:17
