
Multi-threading stops after certain input sizes. #13062

Open
jzakiya opened this issue Feb 10, 2023 · 5 comments
Labels
kind:bug A bug in the code. Does not apply to documentation, specs, etc. topic:multithreading topic:stdlib:concurrency

Comments

@jzakiya

jzakiya commented Feb 10, 2023

This is a follow-up to an issue I raised in #12943.

This issue is not tied to the Crystal version, and it exists on both my Intel i7-6700HQ and AMD Ryzen 9 5900HX laptops.

There is a consistent, reproducible problem with multi-threading that depends on the type of memory used in the threads.

The code below uses a Slice of 64-bit (UInt64) memory for the seg array in the twins_sieve method.

https://gist.github.com/jzakiya/2b65b609f091dcbb6f792f16c63a8ac4

def twins_sieve(r_hi, kmin, kmax, ks, start_num, end_num, modpg, primes, resinvrs)
  s = 6                                                # shift value for 64 bits
  bmask = (1 << s) - 1                                 # bitmask val for 64 bits
  sum, ki, kn  = 0_u64, kmin-1, ks                     # init these parameters
  hi_tp, k_max = 0_u64, kmax                           # max twinprime|resgroup
  seg = Slice(UInt64).new(((ks - 1) >> s) + 1)         # seg array of ks resgroups
  ki    += 1 if r_hi - 2 < (start_num - 2) % modpg + 2 # ensure lo tp in range
  k_max -= 1 if r_hi > (end_num - 2) % modpg + 2       # ensure hi tp in range
  nextp = nextp_init(r_hi, ki, modpg, primes,resinvrs) # init nextp array
  while ki < k_max                       # for ks size slices upto kmax
    kn = k_max - ki if ks > (k_max - ki) # adjust kn size for last seg
    primes.each_with_index do |prime, j| # for each prime r0..sqrt(N)
                                         # for lower twinpair residue track
      k = nextp.to_unsafe[j * 2]         # starting from this resgroup in seg
      while k < kn                       # mark primenth resgroup bits prime mults
        seg.to_unsafe[k >> s] |= 1_u64 << (k & bmask)
        k += prime  end                  # set resgroup for prime's next multiple
      nextp.to_unsafe[j * 2] = k - kn    # save 1st resgroup in next eligible seg
                                         # for upper twinpair residue track
      k = nextp.to_unsafe[j * 2 | 1]     # starting from this resgroup in seg
      while k < kn                       # mark primenth resgroup bits prime mults
        seg.to_unsafe[k >> s] |= 1_u64 << (k & bmask)
        k += prime  end                  # set resgroup for prime's next multiple
      nextp.to_unsafe[j * 2 | 1]= k - kn # save 1st resgroup in next eligible seg
    end                                  # set as nonprime unused bits in last seg[n]
                                         # so fast, do for every seg[i]
    seg.to_unsafe[(kn - 1) >> s] |= ~1u64 << ((kn - 1) & bmask)
    cnt = 0                              # count the twinprimes in the segment
    seg[0..(kn - 1) >> s].each { |m| cnt += (~m).popcount }
    if cnt > 0                           # if segment has twinprimes
      sum += cnt                         # add segment count to total range count
      upk = kn - 1                       # from end of seg, count back to largest tp
      while seg.to_unsafe[upk >> s] & (1_u64 << (upk & bmask)) != 0; upk -= 1 end
      hi_tp = ki + upk                   # set its full range resgroup value
    end
    ki += ks                             # set 1st resgroup val of next seg slice
    seg.fill(0) if ki < k_max            # set next seg to all primes if in range
  end                                    # when sieve done, numerate largest twin
                                         # for ranges w/o twins set largest to 1
  hi_tp = (r_hi > end_num || sum == 0) ? 1 : hi_tp * modpg + r_hi
  {hi_tp.to_u64, sum.to_u64}             # return largest twinprime|twins count
end
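
For clarity, here is a minimal standalone sketch (my illustration, not code from the gist) of the bit addressing the Slice version relies on: with s = 6 and bmask = 63, a resgroup index k is split into a UInt64 word index (k >> s) and a bit offset (k & bmask).

# Minimal standalone sketch of the Slice version's bit addressing.
# With s = 6 (2**6 = 64 bits per word) and bmask = 63, resgroup index k
# maps to word k >> s and bit k & bmask within that UInt64.
s     = 6
bmask = (1 << s) - 1                 # 63
seg   = Slice(UInt64).new(4)         # 4 words hold 256 resgroup flags

k = 130                              # mark resgroup 130 nonprime
seg[k >> s] |= 1_u64 << (k & bmask)  # word 2, bit 2

p seg[2].to_s(2)                     # => "100" (bit 2 set)
p seg[k >> s].bit(k & bmask)         # => 1 (read the flag back)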

This version uses a BitArray for the seg array in twins_sieve,

https://gist.github.com/jzakiya/806cddb0dc5967d567bcb47f89d58e60

def twins_sieve(r_hi, kmin, kmax, ks, start_num, end_num, modpg, primes, resinvrs)
  # Return the last twinprime|sum for the range for this twinpair residues.
  sum, ki, kn  = 0_u64, kmin-1, ks                     # init these parameters
  hi_tp, k_max = 0_u64, kmax                           # max twinprime|resgroup
  seg = BitArray.new(ks)                               # seg array of ks resgroups
  ki    += 1 if r_hi - 2 < (start_num - 2) % modpg + 2 # ensure lo tp in range
  k_max -= 1 if r_hi > (end_num - 2) % modpg + 2       # ensure hi tp in range
  nextp = nextp_init(r_hi, ki, modpg, primes,resinvrs) # init nextp array
  while ki < k_max                       # for ks size slices upto kmax
    kn = k_max - ki if ks > (k_max - ki) # adjust kn size for last seg
    primes.each_with_index do |prime, j| # for each prime r0..sqrt(N)
                                         # for lower twinpair residue track
      k = nextp.to_unsafe[j * 2]         # starting from this resgroup in seg
      while k < kn                       # until end of seg
        seg.unsafe_put(k, true)          # mark primenth resgroup bits prime mults
        k += prime  end                  # set resgroup for prime's next multiple
      nextp.to_unsafe[j * 2] = k - kn    # save 1st resgroup in next eligible seg
                                         # for upper twinpair residue track
      k = nextp.to_unsafe[j * 2 | 1]     # starting from this resgroup in seg
      while k < kn                       # until end of seg
        seg.unsafe_put(k, true)          # mark primenth resgroup bits prime mults
        k += prime  end                  # set resgroup for prime's next multiple
      nextp.to_unsafe[j * 2 | 1]= k - kn # save 1st resgroup in next eligible seg
    end                                   
    cnt = seg[...kn].count(false)        # count|store twinprimes in segment
    if cnt > 0                           # if segment has twinprimes
      sum += cnt                         # add segment count to total range count
      upk = kn - 1                       # from end of seg, count back to largest tp
      while seg.unsafe_fetch(upk); upk -= 1 end
      hi_tp = ki + upk                   # set its full range resgroup value
    end
    ki += ks                             # set 1st resgroup val of next seg slice
    seg.fill(false) if ki < k_max        # set next seg to all primes if in range
  end                                    # when sieve done, numerate largest twin 
                                         # for ranges w/o twins set largest to 1
  hi_tp = (r_hi > end_num || sum == 0) ? 1 : hi_tp * modpg + r_hi
  {hi_tp.to_u64, sum.to_u64}             # return largest twinprime|twins count
end
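
For comparison, here is a minimal sketch (my illustration, not from the gist) of the same marking done through BitArray, which packs its bits into machine words internally and so stores the same information as the manual Slice(UInt64) packing:

require "bit_array"

# Same marking as above, but letting BitArray do the word/bit packing.
ks  = 256
seg = BitArray.new(ks)        # all false == all resgroups presumed prime

k = 130
seg.unsafe_put(k, true)       # equivalent to the manual |= in the Slice version

p seg[k]                      # => true
p seg[...200].count(false)    # => 199, the counting style twins_sieve uses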

The Slice version runs multi-threaded with inputs up to at least 60_000_000_000_000,
but reverts to a single thread with inputs of 70_000_000_000_000 and above.

This runs multi-threaded.

$ CRYSTAL_WORKERS=16 ./twinprimes_ssoz.172  60_000_000_000_000

This runs single-threaded.

$ CRYSTAL_WORKERS=16 ./twinprimes_ssoz.172  70_000_000_000_000

You can use [h|t]op to view thread activity.
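
For anyone reproducing this: the parallelism comes from Crystal's -Dpreview_mt scheduler, which distributes fibers created with spawn across CRYSTAL_WORKERS threads. A minimal sketch of that fan-out pattern (my illustration; the gist's actual driver passes more arguments to twins_sieve):

# Build: crystal build -Dpreview_mt --release twinprimes_ssoz.cr
# Run:   CRYSTAL_WORKERS=16 ./twinprimes_ssoz 70_000_000_000_000

# Sketch of the spawn-per-residue fan-out; the residue list and the
# stubbed send are placeholders, not values from the gists.
results  = Channel({UInt64, UInt64}).new
residues = [13, 23, 37]                   # placeholder twinpair residues

residues.each do |r_hi|
  spawn do
    # results.send twins_sieve(r_hi, ...)  # the real call takes more args
    results.send({r_hi.to_u64, 0_u64})     # stub so the sketch runs
  end
end

residues.size.times { p results.receive }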

The problem does not exist when using the BitArray version.

I thought it might be an indexing issue with the Slices as the input ranges got bigger,
but the seg array sizes don't change dramatically, and they are the same values
used in the BitArray version. The serial (single-thread) version of each runs with no problems
for all the inputs I tried, so that doesn't seem to be the cause of this behavior.

So at first glance, there seems to be some connection between the Slice sizes and multi-threading?
That's my best guess at the moment.

(The behavior doesn't exist in any other language version I've implemented this algorithm in.)

jzakiya added the kind:bug label on Feb 10, 2023
@straight-shoota
Member

Hm, the program seems to run multiple threads with a 70_000_000_000_000 input for me:

  PID USER      PRI  NI  VIRT   RES   SHR S CPU%▽MEM%   TIME+  Command
 4233 johannes   20   0  175G  288M  3700 R 1590  1.8  8:38.54 ./twinprimes-slice 70_000_000_000_000
 4249 johannes   20   0  175G  288M  3700 R 100.  1.8  0:32.45 ./twinprimes-slice 70_000_000_000_000
 4250 johannes   20   0  175G  288M  3700 R 99.3  1.8  0:32.19 ./twinprimes-slice 70_000_000_000_000
 4251 johannes   20   0  175G  288M  3700 R 99.3  1.8  0:32.41 ./twinprimes-slice 70_000_000_000_000
 4252 johannes   20   0  175G  288M  3700 R 100.  1.8  0:32.37 ./twinprimes-slice 70_000_000_000_000
 4253 johannes   20   0  175G  288M  3700 R 100.  1.8  0:32.41 ./twinprimes-slice 70_000_000_000_000
 4254 johannes   20   0  175G  288M  3700 R 100.  1.8  0:32.47 ./twinprimes-slice 70_000_000_000_000
 4255 johannes   20   0  175G  288M  3700 R 100.  1.8  0:32.41 ./twinprimes-slice 70_000_000_000_000
 4256 johannes   20   0  175G  288M  3700 R 100.  1.8  0:32.42 ./twinprimes-slice 70_000_000_000_000
 4257 johannes   20   0  175G  288M  3700 R 99.3  1.8  0:32.38 ./twinprimes-slice 70_000_000_000_000
 4258 johannes   20   0  175G  288M  3700 R 99.3  1.8  0:32.39 ./twinprimes-slice 70_000_000_000_000
 4259 johannes   20   0  175G  288M  3700 R 99.3  1.8  0:32.45 ./twinprimes-slice 70_000_000_000_000
 4261 johannes   20   0  175G  288M  3700 R 99.3  1.8  0:32.38 ./twinprimes-slice 70_000_000_000_000
 4262 johannes   20   0  175G  288M  3700 R 98.7  1.8  0:32.39 ./twinprimes-slice 70_000_000_000_000
 4263 johannes   20   0  175G  288M  3700 R 99.3  1.8  0:32.38 ./twinprimes-slice 70_000_000_000_000
 4260 johannes   20   0  175G  288M  3700 R 99.3  1.8  0:32.34 ./twinprimes-slice 70_000_000_000_000

@jzakiya
Author

jzakiya commented Feb 13, 2023

You may need to use larger numbers on your system.

I monitor the threads with htop; you can split your terminal into separate screens to watch it alongside the program.

When I start the program it will run multi-threaded initially, and then very quickly drop to just one thread and stay there.

@straight-shoota
Member

straight-shoota commented Feb 13, 2023

I still can't reproduce that behaviour, even with an input value an order of magnitude larger. All 16 threads are still running at 100%.

Crystal 1.7.2 (2023-01-23)

LLVM: 14.0.6
Default target: x86_64-unknown-linux-gnu

i7-11800H

What does "drop to just one thread" mean? Is there actually just one thread and the others are terminated? Or are there still 16 threads but only one is doing work?

@jzakiya
Author

jzakiya commented Feb 13, 2023

Yes. It starts all 16 threads, then very quickly (maybe ~1 sec of run time) all 16 threads stop and 1 thread remains.
But that 1 thread is stuck: it isn't processing data and moving on to the next (as seen with htop).
No matter how long I let it run, no data is processed, so I have to manually abort the program.

I've been running it over and over and get the same behavior. And the BitArray version has no problems at all.

This is on my AMD Ryzen 9 5900HX system, but when I get home I'll run it on my older i7-5700HQ laptop.

@jzakiya
Author

jzakiya commented Feb 14, 2023

I compiled both source code versions with 1.7.2 on my i7-6700HQ laptop and get the same behavior on it.

So that seems to rule out some unique hardware issue (though both systems run the PCLinuxOS distro).
