Core ROSS RNG streams; Event Time Signature Paradigm; Deterministic Tiebreaker #180

Closed
wants to merge 2 commits into from

nmcglo
Member

@nmcglo nmcglo commented Nov 17, 2020

This is a big PR. Sorry, there's a lot to unpack here.

Starting with the easiest part: this PR adds a new array of RNGs to ROSS LPs. These RNG streams are "Core RNGs", and access to them should be reserved exclusively for the ROSS engine itself. Models should not touch these RNG streams at all; doing so will break any determinism expected by the ROSS patterns that use them.
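To make the separation concrete, here is a minimal sketch of an LP that carries two disjoint stream arrays. Everything in it is hypothetical: the `sketch_` names, the stream counts, the seeding scheme, and the 64-bit LCG standing in for a ROSS RNG stream are assumptions for illustration, not the types or generator this PR uses.

```c
#include <stdint.h>

#define SKETCH_MODEL_STREAMS 2   /* assumed counts, for illustration only */
#define SKETCH_CORE_STREAMS  1

/* Hypothetical stand-in for a ROSS RNG stream: a 64-bit LCG. */
typedef struct {
    uint64_t state;
} sketch_rng;

/* Hypothetical LP layout with two disjoint stream arrays. */
typedef struct {
    sketch_rng model_rng[SKETCH_MODEL_STREAMS]; /* drawn from by model code */
    sketch_rng core_rng[SKETCH_CORE_STREAMS];   /* reserved for the engine  */
} sketch_lp;

/* Advance one stream and return its new state. */
static uint64_t sketch_rng_next(sketch_rng *g)
{
    g->state = g->state * 6364136223846793005ULL + 1442695040888963407ULL;
    return g->state;
}

/* Give every stream on the LP its own reproducible starting state. */
static void sketch_lp_seed(sketch_lp *lp, uint64_t lp_id)
{
    for (int i = 0; i < SKETCH_MODEL_STREAMS; i++)
        lp->model_rng[i].state = lp_id * 1000u + (uint64_t)i;
    for (int i = 0; i < SKETCH_CORE_STREAMS; i++)
        lp->core_rng[i].state = lp_id * 1000u + 500u + (uint64_t)i;
}
```

Because the core array has its own state, any number of model draws on `model_rng` leaves the next `core_rng` value unchanged. That independence is the property the engine relies on.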

Until now, nothing in ROSS would use such a stream. This PR therefore also includes an implementation of an event tie-breaking mechanism, which lets ROSS handle event ties (events that occur on the same LP at the same virtual time) deterministically and consistently between simulations, regardless of event delivery order.

Deterministic tie-breaking can be implemented by generating a random value when an event is created. This value is encoded into the ROSS event struct and is used to break any event ties (same destination LP at the same time). Because this separate RNG is only accessed by ROSS, it can be rolled back if the event is RC'd (reverse-computed) or cancelled. Because the stream is deterministic, any ordering that results from this tiebreaker will be consistent across simulation runs regardless of event delivery order or stragglers. If a regular model-accessed LP RNG were used for this purpose, the tiebreaking sequence would be subject to interference from the model's own draws.

The deterministic tiebreaker itself is rather simple. When an event is created, a random value is generated from a ROSS core RNG stream. When that event is RC'd or cancelled, that RNG draw is also reversed. Because this stream is used only by the tiebreaking mechanism, the sequence of tiebreaker values it produces is deterministic across simulations. When an LP receives two events with the same timestamp, the tiebreaker value decides, deterministically, which is processed first.
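The draw-on-create, reverse-on-cancel cycle can be sketched as follows. This is only an illustration under stated assumptions: the `sketch_` names are hypothetical, and a reversible 64-bit LCG stands in for the actual ROSS generator. The point is that undoing the draw returns the core stream to its earlier state, so the next event created receives the same tiebreaker the cancelled one had.

```c
#include <stdint.h>

#define SKETCH_MUL 6364136223846793005ULL  /* odd, so invertible mod 2^64 */
#define SKETCH_INC 1442695040888963407ULL

/* Hypothetical stand-in for a core RNG stream. */
typedef struct {
    uint64_t state;
} sketch_rng;

/* Hypothetical event carrying a timestamp and its tiebreaker. */
typedef struct {
    double   recv_ts;
    uint64_t tiebreaker;
} sketch_event;

/* Multiplicative inverse of an odd number modulo 2^64 (Newton iteration). */
static uint64_t sketch_inverse(uint64_t a)
{
    uint64_t x = a;               /* correct to 3 bits for any odd a */
    for (int i = 0; i < 5; i++)   /* each pass doubles the correct bits */
        x *= 2 - a * x;
    return x;
}

/* Step the stream forward and return the new state. */
static uint64_t sketch_rng_next(sketch_rng *g)
{
    g->state = g->state * SKETCH_MUL + SKETCH_INC;
    return g->state;
}

/* Undo exactly one forward step. */
static void sketch_rng_reverse(sketch_rng *g)
{
    g->state = (g->state - SKETCH_INC) * sketch_inverse(SKETCH_MUL);
}

/* Creating an event consumes one value from the core stream. */
static void sketch_event_create(sketch_event *e, sketch_rng *core, double ts)
{
    e->recv_ts = ts;
    e->tiebreaker = sketch_rng_next(core);
}

/* Cancelling (or reverse-computing) the event gives that value back. */
static void sketch_event_cancel(sketch_event *e, sketch_rng *core)
{
    (void)e;                      /* the event itself is simply discarded */
    sketch_rng_reverse(core);
}
```

In this sketch the reversals have to happen in the opposite order of the draws for the states to line up, which is the same discipline reverse handlers already follow for model RNG calls.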

While the concept itself is simple, and adding the tiebreaker to the event struct is similarly simple, getting this tiebreaker to work with GVT and rollbacks is not.

A better way to think about this tiebreaking mechanism is as a guarantee that there is no such thing as an event tie. This paradigm shift means that "when" GVT happens is no longer a single TW_STIME value. There is now an event signature struct which contains a timestamp and a tiebreaker value. This signature is all that is necessary to determine the ordering of two events in the simulation. Thus, the time of the last GVT is no longer just the one-dimensional virtual timestamp; it also includes a tiebreaker value. Events that share GVT's primary timestamp are deterministically separated by their own tiebreaker values into "before GVT" and "after GVT".
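A minimal sketch of such a signature and its ordering, assuming a `double` timestamp and a 64-bit integer tiebreaker (the struct name, field names, and field types here are hypothetical, not taken from the PR):

```c
#include <stdint.h>

/* Hypothetical event time signature: primary timestamp plus tiebreaker. */
typedef struct {
    double   recv_ts;     /* primary virtual timestamp */
    uint64_t tiebreaker;  /* value drawn from the core RNG stream */
} sketch_event_sig;

/*
 * Lexicographic ordering on (recv_ts, tiebreaker).
 * Returns <0 if a orders before b, >0 if after, 0 only if both fields match.
 */
static int sketch_sig_compare(const sketch_event_sig *a,
                              const sketch_event_sig *b)
{
    if (a->recv_ts < b->recv_ts) return -1;
    if (a->recv_ts > b->recv_ts) return 1;
    if (a->tiebreaker < b->tiebreaker) return -1;
    if (a->tiebreaker > b->tiebreaker) return 1;
    return 0;
}
```

With a GVT signature of `{5.0, 100}`, an event at `{5.0, 40}` compares as "before GVT" and one at `{5.0, 900}` as "after GVT", even though all three share the primary timestamp 5.0.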

Thus, rollbacks also no longer go back to a single timestamp value, but to a two-dimensional time value consisting of the primary timestamp and an event tiebreaker.

As complex as this system is, it does have its benefits:

  1. Comparing events in the Splay and AVL trees to determine ordering requires fewer comparisons and thus less compute time.
  2. If primary timestamp ties are numerous in a model, rolling back one event at a tied timestamp no longer requires rolling back all events with the same timestamp, only those whose tiebreaker values place them "after" the event that prompted the rollback.
  3. Event ties are statistically almost impossible to force. Because the tiebreaker value is generated using its own independent RNG stream with an extremely long period, two events at the same primary timestamp ALSO generating an identical random value is nearly impossible. This also means that model developers no longer have to generate their own small noise to add to their event timestamps to prevent ties. That significantly reduces administrative code complexity, and with it the likelihood that a developer will forget to roll back an RNG used for noise and plunge their entire model into non-determinism.
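The second benefit can be illustrated by counting which already-processed events a straggler forces back under each rule. This is a sketch only: the `sketch_` names and the flat array of processed events are hypothetical, and the real engine walks its own processed-event queue rather than an array.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical event time signature. */
typedef struct {
    double   recv_ts;
    uint64_t tiebreaker;
} sketch_event_sig;

/* Timestamp-only rule: every processed event at or after the
 * straggler's timestamp has to be rolled back. */
static size_t sketch_rollbacks_ts_only(const sketch_event_sig *done, size_t n,
                                       const sketch_event_sig *straggler)
{
    size_t count = 0;
    for (size_t i = 0; i < n; i++)
        if (done[i].recv_ts >= straggler->recv_ts)
            count++;
    return count;
}

/* Signature rule: only processed events whose (timestamp, tiebreaker)
 * pair orders after the straggler's have to be rolled back. */
static size_t sketch_rollbacks_signature(const sketch_event_sig *done, size_t n,
                                         const sketch_event_sig *straggler)
{
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        int later_ts = done[i].recv_ts > straggler->recv_ts;
        int same_ts_later_tb = done[i].recv_ts == straggler->recv_ts &&
                               done[i].tiebreaker > straggler->tiebreaker;
        if (later_ts || same_ts_later_tb)
            count++;
    }
    return count;
}
```

For processed events `{1.0,5}`, `{2.0,10}`, `{2.0,50}`, `{2.0,90}`, `{3.0,7}` and a straggler at `{2.0,60}`, the timestamp-only rule rolls back four events while the signature rule rolls back two.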

This feature has been walled off behind a CMake define variable: USE_RAND_TIEBREAKER. Set this value to ON during CMake configuration of ROSS, and all code enabling the tiebreaker value generation and the timestamp-to-time-signature paradigm shift will be switched on by preprocessor #ifdefs.
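As a rough picture of what that gating looks like in C, assuming the CMake variable ends up as a preprocessor macro of the same name (the struct and function below are hypothetical, not the PR's code):

```c
#include <stdint.h>

/* Hypothetical event struct: the tiebreaker field exists only
 * when the option is enabled. */
typedef struct {
    double recv_ts;
#ifdef USE_RAND_TIEBREAKER
    uint64_t tiebreaker;
#endif
} sketch_event;

/* Returns 1 if a is ordered strictly before b, otherwise 0. */
static int sketch_event_before(const sketch_event *a, const sketch_event *b)
{
    if (a->recv_ts != b->recv_ts)
        return a->recv_ts < b->recv_ts;
#ifdef USE_RAND_TIEBREAKER
    return a->tiebreaker < b->tiebreaker;  /* ties resolved by tiebreaker */
#else
    return 0;                              /* a genuine tie: neither is first */
#endif
}
```

With standard CMake usage the option would typically be set on the configure line, for example `cmake -DUSE_RAND_TIEBREAKER=ON ..`. When it is left off, the struct and comparison reduce to the timestamp-only form.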

Ultimately, it may be beneficial to make this event time signature the primary mechanism by which the ROSS engine operates, but I didn't want to make such a major change in this PR. There is probably a cleaner way to implement it as well (possibly using the TW_STIME API?).


If this merge represents a feature addition to ROSS, the following items must be completed before the branch will be merged:

  • Document the feature on the blog (See the website Contributing guide).
    Include a link to your blog post in the Pull Request.
  • Builds should cleanly compile with -Wall and -Wextra.
  • One or more TravisCI tests should be created (and they should pass)
  • Through the TravisCI tests, coverage should increase
  • Test with CODES to ensure everything continues to work

This commit adds a separate array of RNG streams on each LP that are
not meant to be used by models. The ROSS engine can use these
separate streams to apply its deterministic RNG handling toward
its own goals.

Notable example use for this: Deterministic Tiebreaking
This commit adds the functionality of the deterministic tiebreaker
mentioned in an earlier commit which added the core ROSS engine
exclusive RNGs.


codecov bot commented Nov 17, 2020

Codecov Report

Merging #180 (e638cf6) into develop (cc6ec2a) will decrease coverage by 0.28%.
The diff coverage is 78.21%.

@@             Coverage Diff             @@
##           develop     #180      +/-   ##
===========================================
- Coverage    58.17%   57.88%   -0.29%     
===========================================
  Files           33       33              
  Lines         3565     3588      +23     
===========================================
+ Hits          2074     2077       +3     
- Misses        1491     1511      +20     
Impacted Files               Coverage Δ
core/ross-inline.h           40.00% <ø> (ø)
core/ross-kernel-inline.h    61.53% <ø> (ø)
core/tw-eventq.h             90.90% <ø> (ø)
core/tw-lp.c                 52.45% <0.00%> (ø)
core/avl_tree.c              66.07% <42.85%> (-3.74%) ⬇️
core/network-mpi.c           76.70% <52.94%> (-3.55%) ⬇️
core/tw-event.c              76.27% <76.47%> (+1.95%) ⬆️
core/tw-sched.c              82.27% <77.27%> (-0.37%) ⬇️
core/queue/splay.c           97.59% <87.50%> (-1.22%) ⬇️
core/rand-clcg4.c            84.72% <88.88%> (-0.04%) ⬇️
... and 10 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@nmcglo nmcglo requested a review from gonsie November 17, 2020 20:03
@nmcglo nmcglo marked this pull request as draft January 13, 2021 04:45
@nmcglo
Member Author

nmcglo commented Jan 13, 2021

Converted to draft PR. There are some technical limitations of this new paradigm that I want to fully understand, and to find a workaround for if possible. Notably, zero-offset event timestamps will break this system. I understand why this happens, and I have a feeling there may not be a workaround, but I want to make sure this doesn't get merged until it's fully explored.

@nmcglo nmcglo marked this pull request as ready for review June 7, 2021 02:39
@nmcglo
Member Author

nmcglo commented Jun 7, 2021

This has been thoroughly run through the wringer for my last two papers this year. In the ROSS-only paper, there were no issues. In the CODES paper, there were a couple of determinism issues, but I'm 100% certain that these are CODES issues.

@nmcglo
Member Author

nmcglo commented Jun 7, 2021

This is actually way out of date. Creating a new pull request with the better branch.
