Description
I recently found a synchronisation issue (#770). Although a workaround exists (#772), it reveals some problems in the design of the work packet system:
- When can more buckets be opened?
- How do we prevent new buckets from being opened while more work packets are still being added?
Currently, buckets form a linear (totally ordered) relation. A bucket can be opened if (1) all workers are idle, and (2) the coordinator doesn't have pending work, and (3) all previous buckets are empty.
Condition (1) exists because a bucket is only considered finished when all of its work packets are finished, but a worker may still be executing one of its work packets.
Conditions (1) and (2) exist because a work packet (coordinator packet or ordinary packet) may add more work packets to buckets. If a subsequent bucket is opened before work packets are added to a prior bucket, it is an error. That's why we use the counter `pending_coordinator_packets` to prevent buckets from being opened while there are pending coordinator packets.
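As a sketch (not the actual mmtk-core code), the current opening rule boils down to the following boolean check, with illustrative parameter names:

```rust
/// A sketch of the current rule for opening a bucket.  The parameter names
/// are illustrative stand-ins, not mmtk-core's actual API.
fn can_open_bucket(
    all_workers_parked: bool,           // (1) all workers are idle
    pending_coordinator_packets: usize, // (2) the coordinator has no pending work
    previous_buckets_drained: bool,     // (3) all previous buckets are empty
) -> bool {
    all_workers_parked && pending_coordinator_packets == 0 && previous_buckets_drained
}
```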
Conditions (1), (2) and (3) are reasonably sound, but they are too complicated to test. As can be seen in the `GCWorkScheduler::poll_slow` function, the condition variable `worker_monitor.1` is not guarded by any concrete variables. This is very different from typical `Condvar` use cases like the following:
```rust
let mut guard = mutex.lock();
while !can_open_more_buckets {
    guard = condvar.wait(guard);
}
```
Actually, all the lines of code in `poll_slow` other than the `worker_monitor.1.wait()` call itself are computing the condition "can_open_more_buckets". Those lines of code include (see the sketch after this list):

- Getting schedulable work (`self.poll_schedulable_work(worker)`)
- Getting worker-specific work (`self.worker_group.has_designated_work()`)
- Checking if any open bucket has sentinels (`self.schedule_sentinels()`)
- Checking if the opening condition of any bucket is met (`self.update_buckets()`). Currently, the opening condition is `scheduler.are_buckets_drained(&cur_stages)`, where `cur_stages` includes all previous stages.
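Put together, the last parked worker effectively evaluates a single disjunction. The sketch below illustrates its shape; the closures stand in for the scheduler methods listed above, and this is not the actual mmtk-core code:

```rust
/// A sketch of the condition that the last parked worker effectively computes
/// in `poll_slow` before deciding whether to wait on the condition variable again.
fn can_open_more_buckets(
    has_schedulable_work: impl Fn() -> bool, // ~ self.poll_schedulable_work(worker)
    has_designated_work: impl Fn() -> bool,  // ~ self.worker_group.has_designated_work()
    scheduled_sentinels: impl Fn() -> bool,  // ~ self.schedule_sentinels()
    opened_new_buckets: impl Fn() -> bool,   // ~ self.update_buckets()
) -> bool {
    has_schedulable_work() || has_designated_work() || scheduled_sentinels() || opened_new_buckets()
}
```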
All of these need to be checked every time the last worker has parked. Even with such care, we still encounter synchronisation bugs like #770.
I think the root problem is that the opening condition is far more complicated than it should be.
A previous issue (#546) identified that the synchronisation is too complicated, and proposed a major refactoring where there is only one `BinaryHeap` to hold all poll-able work.
Here I am making an orthogonal proposal about when to open new buckets.
Proposal
Each bucket now has the following states (see the sketch after this list):

- For closing a bucket:
  - `pending_packets: AtomicUsize`: the number of work packets that have been added to the bucket but have not yet finished.
  - `sentinel: Option<GCWork>`: the sentinel of the bucket (if it exists).
  - `finished: bool`: a boolean variable that explicitly records whether this bucket has finished.
- For opening a bucket:
  - `unfinished_predecessors: usize`: how many immediate predecessors need to finish before this bucket can be opened. (Only immediate predecessors are counted.)
  - `is_open: bool`: a boolean variable that explicitly records whether this bucket is open.
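A minimal sketch of this state in Rust follows. This is not mmtk-core's actual type: `Packet` stands in for the real work-packet type, and representing the flags and counters as atomics (rather than fields behind a mutex) is an assumption made for illustration.

```rust
use std::sync::atomic::{AtomicBool, AtomicUsize};
use std::sync::Mutex;

/// Per-bucket state as described in the list above (illustrative only).
pub struct BucketState<Packet> {
    // For closing the bucket:
    pub pending_packets: AtomicUsize,         // packets added but not yet finished (the sentinel counts too)
    pub sentinel: Mutex<Option<Packet>>,      // the sentinel of the bucket, if any
    pub finished: AtomicBool,                 // set once pending_packets drops to zero
    // For opening the bucket:
    pub unfinished_predecessors: AtomicUsize, // immediate predecessors that have not yet finished
    pub is_open: AtomicBool,                  // set once unfinished_predecessors drops to zero
}
```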
Every time a packet is added to a bucket, increment `pending_packets` by one. Every time a work packet is removed from a bucket, do not decrement anything! Every time a worker finishes executing a work packet, decrement `pending_packets` by one and test whether it has dropped to zero. When it drops to zero, set `finished` to true. The sentinel is counted in `pending_packets`: when setting the sentinel, increment the `pending_packets` counter by one; when the bucket is empty but `pending_packets` is still 1, there must be a sentinel.
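Continuing the `BucketState` sketch above, the counter updates could look like this (the method names are hypothetical):

```rust
use std::sync::atomic::Ordering;

impl<Packet> BucketState<Packet> {
    /// Adding a packet (or setting the sentinel) increments the counter.
    pub fn on_packet_added(&self) {
        self.pending_packets.fetch_add(1, Ordering::SeqCst);
    }

    /// Removing a packet from the queue for execution decrements nothing;
    /// only *finishing* the packet does.
    pub fn on_packet_finished(&self) {
        if self.pending_packets.fetch_sub(1, Ordering::SeqCst) == 1 {
            // The counter has just dropped to zero: the bucket is finished.
            self.finished.store(true, Ordering::SeqCst);
        }
    }
}
```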
To prevent `pending_packets` from being incremented from zero again after the bucket has been opened, work packets may only be added to buckets that are either not yet opened or the same bucket as the currently executing packet. It is forbidden to add more work packets to finished buckets. It is always possible to add packets to the special "Unconstrained" bucket because it is never finished.
When a bucket is finished, decrement the `unfinished_predecessors` counter of all successors by one. When a bucket's `unfinished_predecessors` is decremented to 0, open that bucket.
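Opening could then be expressed as below, again continuing the hypothetical sketch (how a bucket enumerates its successors is left out):

```rust
impl<Packet> BucketState<Packet> {
    /// Called on a successor bucket when one of its immediate predecessors finishes.
    pub fn on_predecessor_finished(&self) {
        use std::sync::atomic::Ordering::SeqCst;
        if self.unfinished_predecessors.fetch_sub(1, SeqCst) == 1 {
            // All immediate predecessors have finished: open this bucket.
            self.is_open.store(true, SeqCst);
        }
    }
}
```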
By doing this, we no longer require all workers to park before opening new buckets.
This solves the "opening buckets" problem. To wake up parked workers, we can use a global `pending_packets` counter, and the condition variable that workers wait on can be notified when the global `pending_packets` counter is incremented from 0.
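Below is a minimal sketch of such a global counter, assuming std's `Mutex` and `Condvar` (the `GlobalPending` name and its methods are made up for illustration). Note that, unlike the current `worker_monitor.1`, the condition the workers wait for is now a concrete value guarded by the mutex:

```rust
use std::sync::{Condvar, Mutex};

/// A global count of pending packets that parked workers can wait on.
pub struct GlobalPending {
    count: Mutex<usize>,
    cond: Condvar,
}

impl GlobalPending {
    /// Called whenever packets are added to any open bucket.
    pub fn add_packets(&self, n: usize) {
        let mut count = self.count.lock().unwrap();
        let was_zero = *count == 0;
        *count += n;
        if was_zero && n > 0 {
            // Only wake parked workers when the counter leaves zero.
            self.cond.notify_all();
        }
    }

    /// Called by a worker that finds no work: park until the counter is non-zero.
    pub fn wait_for_work(&self) {
        let mut count = self.count.lock().unwrap();
        while *count == 0 {
            count = self.cond.wait(count).unwrap();
        }
    }
}
```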
Extending beyond the linear bucket structure
With a predecessor count attached to each bucket, it is possible to express the dependencies between work buckets in a non-linear manner.
Some buckets have unusual opening conditions. For example:

- The `Prepare` bucket is opened when all mutators have stopped.
- The `Closure` bucket is opened after the `StopMutators` coordinator work is finished. Currently, we prevent the `Closure` bucket from opening by requiring the `pending_coordinator_packets` counter to be zero. I think we can express the same thing as "the `Closure` bucket requires not only the `Prepare` bucket to finish, but also the `StopMutators` coordinator packet to finish". We can express this by setting the `unfinished_predecessors` of the `Closure` bucket to 2: one is the `Prepare` bucket and the other is the `StopMutators` coordinator packet.
- The `ScheduleCollection` coordinator packet basically prevents all other work buckets from opening. We can express this by setting the `unfinished_predecessors` of `Prepare` to 2: one is "mutators stopped", and the other is "the `ScheduleCollection` coordinator packet is finished".
So the actual work bucket dependencies of our current system are as follows (simplified, omitting the built-in weak-processing buckets and the mark-compact buckets):
```
┌──────────────────┐
│ScheduleCollection│
│packet finished   ├──┐
└──────────────────┘  │
                      │
┌───────────────┐     │   ┌───────┐
│mutator stopped├─────┴───►Prepare│
└───────────────┘         └──┬────┘
                             │
                             │
┌───────────────┐         ┌──▼────┐
│StopMutators   ├─────────►Closure│
│packet finished│         └──┬────┘
└───────────────┘            │
                             │
                          ┌──▼─────────┐
                          │VMRefClosure│
                          └──┬─────────┘
                             │
                             │
                          ┌──▼────┐
                          │Release│
                          └──┬────┘
                             │
                             │
                          ┌──▼──┐
                          │Final│
                          └─────┘
```
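To make the graph concrete, the initial `unfinished_predecessors` counts implied by this (simplified) graph would be as below. This is only a sketch: the stage names are written as strings for brevity, and "mutators stopped" and the coordinator packets are modelled as predecessors in the same way as ordinary buckets.

```rust
/// Initial predecessor counts implied by the dependency graph above (sketch).
fn initial_unfinished_predecessors(stage: &str) -> usize {
    match stage {
        // ScheduleCollection packet finished + mutators stopped
        "Prepare" => 2,
        // Prepare bucket finished + StopMutators packet finished
        "Closure" => 2,
        // Each of these has a single predecessor in the linear chain
        "VMRefClosure" | "Release" | "Final" => 1,
        _ => 0,
    }
}
```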