Fix embargo timeout in dandelion++ #9295

Open · wants to merge 1 commit into master
Conversation

@vtnerd (Contributor) commented Apr 20, 2024

Summary

@Boog900 pointed out that the embargo duration in Dandelion++ was incorrect - it was using a Poisson distribution instead of an exponential distribution. I don't recall why I used a Poisson distribution, other than that it takes an "average" parameter, which I took to mean the average embargo timeout. This is not the distribution meant in the Dandelion++ paper.

The primary difference is that the average embargo timeout will drop from ~39s to ~7s. There shouldn't be any loss in privacy as a result of this, because the propagation time to 10 nodes is roughly 1.75s.

Additionally @Boog900 discovered that the paper stated log but almost certainly meant ln (which helps bring down the average fluff time too).

Fluff Probability

Is once again 10%, which should result in longer stem phases. Since the distribution is now much shorter for the embargo timeout, this shouldn't result in longer flood times.

Fallout

I'm not aware of any fingerprinting that can be done on the existing implementation. The randomized duration should still make it difficult to determine which node in the stem-set fluffed first. Perhaps @Boog900 can share some thoughts on this topic.

Fluff Timers

I reduced the average of the Poisson distribution for the fluff delay from 5s to 1s. This is an arbitrary change, made because of the new reality of much shorter embargo timeouts. @Boog900, thoughts on this portion of the code? Dandelion++ doesn't really specify a randomized flush interval for fluff mode; this comes from inspecting the Bitcoin code.

Poisson Distribution

Poisson is still being used in a few places, but I am not aware of any issues right now. I will dig deeper to see whether these need changing:

  • The delay when "forwarding" from i2p/tor to p2p/clearnet uses a Poisson distribution
  • The dandelion++/noise epoch has a minimum time, with a randomized Poisson duration added
  • The fluff timers use a Poisson distribution for flushing

I'm not aware of these timers violating the Dandelion++ paper (again read above about fluff timers).

Future

I expect some feedback from @Boog900 and possibly others as to the additional changes that need to be made.

@vtnerd (Contributor, Author) commented Apr 20, 2024

I should also mention that in unlucky cases where a blackhole occurs after just one hop, this could result in longer delays than with a Poisson distribution (where the overwhelming majority of values are around 39s).

@Boog900 (Contributor) commented Apr 20, 2024

> I should also mention this does mean in unlucky cases where a blackhole occurs after just one hop, could result in longer delays than with a poisson distribution

This does bring up an interesting point, using the exponential distribution could make it easier to estimate how many hops the transaction did before it reached the black hole.

If the attacker keeps track of the time it receives a tx, and the time it takes for the tx to be broadcasted, then it could calculate the probability of that happening for different amounts of hops.

For example, if the tx gets blackholed after one hop then the average time for that tx to get diffused is 75s, whereas a tx that makes it 9 hops will have an average time of 8.3s. So if the tx takes 300s to get diffused, we can say that is much more likely to have happened with 1 hop than 9. The paper seemingly doesn't mention this.

Fallout

The problem with using the poisson distribution is that it is not memoryless, so nodes earlier in the stem phase are slightly more likely to fluff first under a black hole attack. How much more likely? I don't know exactly but just off the top of my head I can't imagine it being significant.

Fluff Timers

I feel 1 second is too low; although the previous average was 5 seconds, it was 2.5 seconds for outgoing connections:

constexpr const fluff_duration fluff_average_out{fluff_duration{fluff_average_in} / 2};

This will change it to half a second. I would rather be on the safe side here.

@vtnerd (Contributor, Author) commented Apr 21, 2024

> For example if the tx gets blackholed after one hop then the average time for that tx to get diffused is 75s whereas a tx that makes it 9 hops will have an average time of 8.3s, so if the tx takes 300s to get diffused then we can say that is much more likely to happen with 1 hops than 9. The paper seemingly doesn't mention this.

I'm wondering whether my parameters are too high - we previously lowered the parameters so that diffusion came quicker. Should I do the same here? The worst-case scenario is both more likely and longer than with the existing Poisson method.

> This does bring up an interesting point, using the exponential distribution could make it easier to estimate how many hops the transaction did before it reached the black hole.

This doesn't reveal the origin IP address though. So I think it's still better to go with the paper here.

> The problem with using the poisson distribution is that it is not memoryless, so nodes earlier in the stem phase are slightly more likely to fluff first under a black hole attack. How much more likely? I don't know exactly but just off the top of my head I can't imagine it being significant.

Poisson distribution is also considered memoryless - but it may have different properties making it less suitable.

> I feel 1 second is too low, although the previous was 5 seconds it was 2.5 for outgoing connections:

Revert back to 5 seconds? I didn't want to overlap with the blackhole timeout.

@selsta (Collaborator) commented Apr 21, 2024

In the past we had a lot of sybil nodes that were intentionally blackholing transactions, a significantly longer average time to diffusion would be bad for user experience.

I don't know if these sybil nodes are still there.

@Boog900 (Contributor) commented Apr 22, 2024

> I'm wondering whether my parameters are too high - we previously lowered the parameters so that the diffusion came quicker. Should I do the same here? The worst case scenario is more likely and longer than the existing poisson method.

I think so, especially if we have had problems with black holes in the past.

If we were to choose a time under which we would want a chosen percentage of txs to be fluffed, if they were to be immediately black holed, we could find the highest k value possible for a certain ep.

For example, if we were to say we want 90% of txs to be fluffed under 60s with ep=0.1 in a black hole attack where the tx gets dropped immediately, the highest k value we can use is 6, with on average 91% of txs having a value less than 60s.

I think we could get away with k=8; with ep=0.1 this means our fluff probability would be 0.125. Using this value means ~85% of txs will get fluffed under 90s if they were to be immediately black holed. This is reasonable IMO, considering block time is ~2 mins and this will only affect txs which get immediately black holed.

With k=10, 70% of txs that get immediately black holed will be fluffed under 90s, and with k=9, ~78%.

> This doesn't reveal the origin IP address though. So I think it's still better to go with the paper here.

True, just wanted to mention.

> Poisson distribution is also considered memoryless

The time between events in a Poisson process is memoryless, it can be modeled with the exponential distribution, but I don't think the Poisson distribution itself is memoryless.

> Revert back to 5 seconds? I didn't want to overlap with the blackhole timeout.

I think so, I don't think overlapping is too big a concern due to how variable the output of the exponential distribution is.

@vtnerd (Contributor, Author) commented Apr 22, 2024

New force push has the parameters recommended by @Boog900 . I'm a little worried the new timeout may not be aggressive enough - but I'm leaning towards it being acceptable.

@Boog900 (Contributor) commented Apr 25, 2024

We could go lower but 8 should be fine, more numbers:

Txs fluffed under 180s when immediately black holed:

  • k=9, 95%
  • k=8, 97.9%
  • k=7, 99.4%

This means if an attacker managed to black hole every transaction immediately with k=8 85% would be fluffed under 90s and ~98% under 180s. For safety we could add an upper bound on the timer, to prevent an unlucky situation.

@vtnerd (Contributor, Author) commented Jun 20, 2024

I added a 180s embargo max to the logic (as per @Boog900's suggestion).

@Rucknium commented
In random_exponential_duration, @vtnerd, you wrote this comment:

> Note this always rounds down to nearest whole number. if std::lround
> was used instead, then 0 seconds would be used less frequently. Not sure
> which is better, since we cannot broadcast on sub-seconds intervals.

Why is there this restriction to broadcasting only in integer-second intervals? When you take the floor of an exponential distribution, you get a geometric distribution (see here). The geometric distribution is memoryless like the exponential distribution, but the substitution might affect the privacy properties of Dandelion++.

I have been looking at whether the fluff-phase timer should also be changed from Poisson to exponential. The Dandelion++ paper doesn't explicitly say that the fluff timers should be exponential, but it strongly hints that way IMHO. Algorithm 5 "Dandelion++ Spreading at node v" in Fanti et al. (2018) ends with Diffusion(X ,v, H). The paper says "Bitcoin Core, the most popular Bitcoin implementation, adopted a protocol called diffusion, where each node spreads transactions with independent, exponential delays to its neighbors on the P2P graph." Fanti & Viswanath have an earlier paper about the privacy properties of bitcoin's transaction broadcast system. It describes diffusion: "In diffusion spreading, each source or relay node transmits the message to each of its uninfected neighbors with an independent, exponential delay of rate λ. We assume a continuous-time system, in which a node starts the exponential clocks as soon as it receives (or creates) a message."

@Boog900 brought up the possibility that the total RAM load on nodes would increase if the fluff timer was switched from Poisson to exponential. The Poisson and exponential have the same mean (when you specify the mean to be the same), but the exponential distribution has much higher variance with our parameters. That means that there may be a higher probability of occasionally having a much higher number of transactions loaded in the node's per-connection fluff queues.

I wrote a simulation to test this hypothesis. In the end, the total RAM load is not much different between the Poisson and exponential timers.

  1. Let the number of connections be 100.
  2. Let tx arrivals be a Poisson process with rate 1/3 (one every 3 seconds on average). Produce 100,000 txs (about 3 days of "data" at the rate of Monero's current transaction volume).
  3. Use this procedure to set a timer: Initially, no flush timer is set. When the node gets a fluff-phase tx, a timer is set. If the node gets another fluff-phase tx before the timer expires, the node just adds the tx to the queue. Then the timer expires and all txs in the queue are sent to the peer. The node waits until a new tx is received to set a new timer.
  4. Set timers by monerod's procedure (option 2). Let the timers be Poisson and exponential, both with mean 5 seconds, to compare them.
  5. Record the maximum simultaneous number of aggregate txs in all of the peer queues.
  6. Run this simulation 100 times.

When the timer is Poisson, the maximum simultaneous number of aggregate txs in all of the peer queues is an average of 843 across the 100 simulations. When the timer is exponential, it is 851. This is not a big difference IMHO. When I set the number of transactions to be lower, e.g. 10,000, the averages for the Poisson and exponential timers are farther apart, suggesting the two numbers would be even closer if the simulated time period were extended further. The R simulation code is below.

# install.packages(c("data.table", "zoo", "parallelly", "future", "future.apply"))
# Install these packages if not already installed

library(data.table)
library(zoo)


# timer.method <- "set_when_previous_timer_expired"
timer.method <- "set_when_new_tx"


n.tx <- 100000
n.peers <- 100
n.monte.carlo.sims <- 100

do.multithreaded <- FALSE
# Multithread will use more RAM

if (do.multithreaded) {
  n.workers <- floor(parallelly::availableCores()/2)
  future::plan(future::multicore, workers = n.workers)
} else {
  future::plan(future::sequential)
}



random.txs <- function(n) { rexp(n, 1/3) }
# Distribution of arrival times between transactions is exponential with
# rate parameter 1/3. This is 60^2*24/3 = 28800 transactions per day

stopifnot(timer.method %in% c("set_when_previous_timer_expired", "set_when_new_tx"))


if (timer.method == "set_when_previous_timer_expired") {
  set.timers <- function(tx.arrival, random.flush) {
    y <- random.flush(length(tx.arrival) * 2)
    
    while ( sum(y) <= max(tx.arrival) ) {
      y <- c(y, random.flush(length(tx.arrival)))
    }
    # In case the time period of the flush timers
    # do not completely cover the time of the tx arrivals,
    # add more flush timers.
    
    y <- cumsum(y)
    y <- y[ y <= max(tx.arrival) ]
    y
  }
}

if (timer.method == "set_when_new_tx") {
  set.timers <- function(tx.arrival, random.flush) {
    y <- vector("numeric", length(tx.arrival) + 1)
    j <- 1
    
    while (j <= length(tx.arrival)) {
      y[j] <- tx.arrival[j] + random.flush(1)
      # Add a random flush timer to the tx arrival time. The flush timer may
      # expire before any new txs arrive or may expire after a few more txs.
      # We need to figure out which tx arrives after the timer expires so we
      # can set the next timer.
      shortcut.length <- 100
      # The shortcut.length is the number of elements of tx.arrival to evaluate
      # to find how many transactions will be broadcast in the queue before
      # the flush timer expires. It is shorter than the total length of tx.arrival
      # to speed up computation.
      while (TRUE) {
        increment <- which(tx.arrival[ j:min(c(j + shortcut.length, length(tx.arrival))) ] > y[j])[1]
        if (! is.na(increment)) { break }
        if (j + shortcut.length < length(tx.arrival)) {
          shortcut.length <- shortcut.length + 1000
          # When which() does not have a TRUE element, it will return NA.
          # If the shortcut did not search to the end of the tx.arrival
          # vector, then add to shortcut.length and try again
        } else {
          break
        }
      }
      j <- j - 1 + increment
      if (is.na(j)) { break }
    }
    
    y <- y[y != 0]
    y
  }
}




set.seed(314)

final.results <- list()


for (timer.distribution in c("exp", "pois")) {
  
  
  stopifnot(timer.distribution %in% c("exp", "pois"))
  
  if (timer.distribution == "exp") {
    random.flush <- function(n) { rexp(n, 1/5) }
  }
  
  if (timer.distribution == "pois") {
    random.flush <- function(n) { rpois(n, 20)/4 }
  }
  
  max.results <- vector("numeric", n.monte.carlo.sims)
  
  for (k in 1:n.monte.carlo.sims) {
    
    tx.arrival <- cumsum(random.txs(n.tx))
    
    peer.timers <- future.apply::future_replicate(n.peers, {
      
      peer.queues <- set.timers(tx.arrival, random.flush)
      
      peer.queues <- setdiff(peer.queues, tx.arrival)
      # Cannot have tx arrive and flush at same time. This is rare because
      # the tx arrival is exp-distributed (i.e. continuous). This
      # could occur if the flush timer is zero, which would occur rarely 
      # with the Poisson distribution. setdiff() will also remove any
      # duplicates in peer.queues
      # This error will occur on c() below if any elements are at
      # the same time:
      # Error in rbind.zoo(...) : indexes overlap
      
      tx.added <- zoo(rep(1, length(tx.arrival)), tx.arrival)
      tx.flushed <- zoo(rep(0, length(peer.queues)), peer.queues)
      # Create zoo time objects
      
      all.events <- sort(c(tx.added, tx.flushed))
      
      peer.queues.filled <- data.table(all.events = all.events)[,
        all.events := cumsum(all.events), .(cumsum(coredata(all.events) == 0))]$all.events
      # https://stackoverflow.com/questions/65335978/how-to-perform-cumsum-with-reset-at-0-in-r
      # When we encounter a "1" from tx.added, add it to the running total.
      # When we encounter a "0" from tx.flushed, reset the counter to zero.
      
      peer.queues.filled <- data.table(master = index(peer.queues.filled), 
        x = coredata(peer.queues.filled))
      
      data.table::setkey(peer.queues.filled, master)
      
      list(peer.queues = peer.queues, peer.queues.filled = peer.queues.filled)
    },
      simplify = FALSE,
      future.globals = c("set.timers", "tx.arrival", "random.flush"),
      future.packages = c("data.table", "zoo"))
    
    peer.queues <- lapply(peer.timers, FUN = function(x) {x$peer.queues})
    peer.queues.filled <- lapply(peer.timers, FUN = function(x) {x$peer.queues.filled})
    rm(peer.timers)
    
    peer.queues.all <- data.table(master = sort(unique(c(tx.arrival, unlist(peer.queues)))))
    # Create a master table of the time of all events.
    # This table will be merged with each connection's running totals.
    
    rm(tx.arrival, peer.queues)
    
    data.table::setkey(peer.queues.all, master)
    
    peer.queues.all <- future.apply::future_lapply(peer.queues.filled, 
      FUN = function(y) {
        
        y <- merge(peer.queues.all, y, by = "master", all = TRUE)
        
        y[, master := NULL]
        
        y[, x := data.table::nafill(x, "locf")]
        # "locf" means "last observation carried forward".
        y[, x := data.table::nafill(x, fill = 0)]
        # The observations in the beginning will still be NA. Fill with 0.
        y
        
      },
      future.globals = c("peer.queues.all"),
      future.packages = c("data.table")
    )
    
    peer.queues.all <- do.call(cbind, peer.queues.all)
    
    results <- rowSums(peer.queues.all)
    # The sum of each row is the aggregate number of txs in all
    # queues at the time of each event
    
    max.results[k] <- max(results)
    
    rm(peer.queues.all, peer.queues.filled, results)
    
    gc()
    
    cat(base::date(), ", Flush timer distribution: ", timer.distribution, ", Iteration: ", k, "\n", sep = "")
    
  }
  
  final.results[[timer.distribution]] <- max.results
  
}


summary(final.results$exp)
summary(final.results$pois)

t.test(final.results$exp, final.results$pois)

References

Fanti, G., Venkatakrishnan, S. B., Bakshi, S., Denby, B., Bhargava, S., Miller, A., & Viswanath, P. (2018). "Dandelion++: Lightweight cryptocurrency networking with formal anonymity guarantees."

Fanti, G., & Viswanath, P. (2017). "Anonymity Properties of the Bitcoin P2P Network."

@vtnerd (Contributor, Author) commented Aug 5, 2024

> Why is there this restriction to broadcast in only integer second intervals?

This refers to the blackhole timeout only, the fluff timers are different:

  • The timeout hijacked a uint64_t field which was storing a time_t value. For compatibility reasons I kept everything in seconds (although a uint64_t is large enough to store nanoseconds).
  • The timeout check occurs via on_idle timers which run at one-second intervals. This could be updated to add a new timer just for the tx_pool scheduled with nanosecond expiration, but timer execution is still approximate anyway (a free io_service thread is needed).
  • I didn't think sub-second expirations mattered much here. With the lower expected average timeout maybe it's time to re-think this, but 7s is still pretty long.

@Boog900 (Contributor) commented Aug 17, 2024

Using the current pseudo-geometric distribution increases the percent of txs we would expect to not make it all the way through the stem stage before an embargo timer fires.

If we assume a tx takes 0.175s to pass through a node (the number we already used when calculating the embargo rate), then with the current code 2% of txs will be fluffed before reaching the first hop, whereas the expected value should be 0.37%.

If my maths is correct, I expect the number of txs not making it all the way to the 8th node to be 17.5%, whereas we should be targeting ep, which we have set at 10%.
