Segfault running SimNetwork under verilator #1885
Comments
Oh, I also meant to mention: I found someone on the mailing list with a similar-looking symptom ( https://groups.google.com/g/chipyard/c/i0pNR4t8HFA/m/NBMP4fcsAQAJ ), but given that they reported 1) the crash occurring in what appears to be an initial block rather than the nba driving the network_tick, and 2) solving the problem by changing the name of the connected bus, I suspect that is a different issue, though perhaps still somewhere in SimNetwork.cc & friends.
I wonder if adding a lock to the network_tick and network_init functions would be sufficient.
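Something like this is what I have in mind (a sketch only; the existing SimNetwork.cc argument lists are elided, and whether serializing the calls is actually enough is an open question):

```cpp
// Sketch: serialize the two DPI entry points with a single mutex.
// Only the locking shape is shown; the real bodies and arguments are elided.
#include <mutex>

static std::mutex network_dpi_lock;

extern "C" void network_init(/* ... existing arguments ... */) {
  const std::lock_guard<std::mutex> guard(network_dpi_lock);
  // ... existing initialization body ...
}

extern "C" void network_tick(/* ... existing arguments ... */) {
  const std::lock_guard<std::mutex> guard(network_dpi_lock);
  // ... existing tick body ...
}
```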
I think it depends on how you mean. If you mean "instead of using a ucontext/coroutine thing, set up a cond-with-lock", that's more of a replacement than a repair. Between repair and replace, I'm personally leaning towards first trying to figure out what the fesvr's context_t machinery actually assumes.
I suspect the solution to the problem does not require messing around in context_t. I believe several other devices use FESVR's context_t as well.
Perhaps! I did see that other devices made use of context_t. Two especially relevant details I've noticed so far:
I think it's the combination of these two that's causing the behavior I'm seeing: when the scheduler happens to place the init and tick calls on different OS threads, the tick runs against a thread-local that was set up elsewhere. If I'm further right in speculating that any run where init and tick land on different threads will crash, would perturbing the scheduler (say, by adding another device) change the outcome?
Hmm, well, yes and no to that last question. Adding a spike tile did perturb the scheduler enough that the simulation worked at least once, and failed on other runs.
But adding/removing cores doesn't change which threads do the initialization.
And experimenting also provided a counterexample to my speculation that any run where init and tick land on different threads must crash.
It seems like the spike tile is in fact the odd one out; since it's driven from a single DPI-C entrypoint that inits itself on demand, it's not possible(?) for it to be scheduled onto two different threads by the Verilator microtask scheduler.
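Something as simple as the following is enough to see which OS thread each DPI call lands on (a sketch; the entry-point argument lists are elided):

```cpp
// Sketch of instrumentation to record which OS thread runs each DPI call.
#include <cstdio>
#include <sstream>
#include <thread>

static void log_os_thread(const char* what) {
  std::ostringstream id;
  id << std::this_thread::get_id();
  std::fprintf(stderr, "%s ran on OS thread %s\n", what, id.str().c_str());
}

extern "C" void network_init(/* ... */) {
  log_os_thread("network_init");
  // ... existing body ...
}

extern "C" void network_tick(/* ... */) {
  log_os_thread("network_tick");
  // ... existing body ...
}
```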
Ah, one speculative code change later and, oh, hello:
init and tick ran in different OS threads, and yet no crash! That result came via noticing that the crash was very different in the case where someone had previously populated the thread-local. In reading the code to try and identify how the control flow ought to return, I noticed that both the device and the switch held a pointer into another thread's local storage (they were each storing a context pointer captured back in network_init), so I tried re-capturing it on every tick:

```diff
   if (!netdev || !netsw) {
     fprintf(stderr, "You forgot to call network_init!");
     exit(1);
   }
+  netdev->target = context_t::current();
+  netsw->main = context_t::current();
   netdev->tick(out_valid, out_data, out_last);
   netdev->switch_to_host();
   netsw->distribute();
   netsw->switch_to_worker();
```

I'm still not quite sure what all this means yet, nor what course of action it suggests, but I did find that result very interesting.
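To make the thread-local mismatch concrete, here's a minimal stand-in (not the real SimNetwork code; Ctx, Device, and current_ctx are placeholders for context_t, the netdev/netsw objects, and the thread-local behind context_t::current()):

```cpp
#include <cstdio>
#include <thread>

struct Ctx { std::thread::id owner; };

thread_local Ctx current_ctx;   // stands in for the thread-local behind context_t::current()

struct Device {
  Ctx* return_to = nullptr;     // stands in for netdev->target / netsw->main
};

int main() {
  Device dev;

  // "network_init": capture an anchor on whichever thread happens to run it.
  current_ctx.owner = std::this_thread::get_id();
  dev.return_to = &current_ctx;

  // "network_tick" on a different OS thread: the captured anchor belongs to
  // the init thread's thread-local, not this thread's.
  std::thread tick_thread([&dev] {
    current_ctx.owner = std::this_thread::get_id();
    std::printf("anchor belongs to the ticking thread: %s\n",
                dev.return_to == &current_ctx ? "yes" : "no");  // prints "no"
    dev.return_to = &current_ctx;  // re-capture, like the two '+' lines above
  });
  tick_thread.join();
  return 0;
}
```

The re-capture on the last line of the lambda is the same move as the two '+' lines, just stripped of the coroutine machinery.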
Ok, so I've experimented a bit more, and I've come up with three potentially useful perspectives here:
I'd be lying if I said I didn't see the last as the most pragmatic option. I also feel like it's a bit of a loss: I've really enjoyed learning about the context_t approach. I believe I can fill out any of those three directions into a full-fledged PR to the upstream(s) in question here (IceNet & testchipip, or riscv-isa-sim). Do any of them especially call to you, @jerryz123? I see that you've done a lot of this work, so I suspect you'd have a better sense for the overall context (pun intended).
While I agree with your reasoning here, I don't think it's reasonable to expect/require SimDevice implementations to be thread-safe, where the constructor/tick functions can be called from distinct threads. As far as I can tell, this quirk only appears with Verilator multi-threading. The other simulators don't do this, even with multithreading enabled.
Does this generalize to systems with multiple contexts? IMO it's better to require the programmer to explicitly specify the next context to execute. There are use cases of context_t which have multiple contexts (not just target/host).
Another example is the tick function for some of the other DPI-driven devices, which rely on the same pattern.
I couldn't think of a way to make that system work without context_t. Thank you for digging into this; I've learned quite a bit about the subtleties of context_t behavior in a multi-threaded system from your analysis.
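For comparison, raw ucontext already has that shape: every transfer names both the context being saved and the one being resumed, so nothing leans on a thread-local "current" (a minimal sketch, not fesvr's actual context.cc):

```cpp
#include <ucontext.h>
#include <cstdio>

static ucontext_t host_ctx, worker_ctx;
static char worker_stack[64 * 1024];

static void worker_body() {
  std::printf("worker: first step\n");
  swapcontext(&worker_ctx, &host_ctx);   // explicitly resume the host
  std::printf("worker: second step\n");
  swapcontext(&worker_ctx, &host_ctx);
}

int main() {
  getcontext(&worker_ctx);
  worker_ctx.uc_stack.ss_sp = worker_stack;
  worker_ctx.uc_stack.ss_size = sizeof(worker_stack);
  worker_ctx.uc_link = &host_ctx;        // fall back to the host if worker_body returns
  makecontext(&worker_ctx, worker_body, 0);

  swapcontext(&host_ctx, &worker_ctx);   // explicitly hand control to the worker
  std::printf("host: worker yielded once\n");
  swapcontext(&host_ctx, &worker_ctx);
  std::printf("host: done\n");
  return 0;
}
```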
Yeah, I hear you about not wanting the {runtime,complexity} overhead of generalized thread-safety in the simulated devices. I want to note that the situation here calls for a much narrower "kind" of thread safety: it's the difference between what Rust calls Send and what it calls Sync. The devices only need to tolerate being handed from one thread to another between calls, not being touched by two threads at once.
To be fully transparent, I have only a few dozen hours' worth of experience with any of the commercial verilog implementations, and none to the depth I've gotten here with verilator. I do wonder if it's by accident or by design that the other simulators don't encounter this behavior: do you know if there's some verilog standard (implicit or explicit) that verilator is violating here by evaluating a module's initial and always blocks on different threads?
I appreciate the examples! I'm glad to have the benefit of your experience here. I'd agree at this point that "avoid use of context_t" isn't a realistic resolution. Unfortunately, I'm not sure there's a general resolution that doesn't at least involve looking at the other use sites: neither threading nor non-local control flow are famous for composing well. I haven't identified any answers that reside entirely within SimNetwork.cc, either.
It's a good idea; on-demand initialization from a single entrypoint appears to be how the spike tile (and, perhaps, other devices like it) sidesteps the problem.

I'm also not entirely sure how durable it would be as a solution: the verilator documentation on task scheduling suggests that they tried both static and dynamic scheduling and went for static for performance (rather than correctness) reasons. I suspect a dynamically scheduled runtime (e.g. one based on work-stealing) would probably cause even more of this. My guess is that's a decision that's unlikely to be reversed any time soon ("efficient dynamic scheduling" notwithstanding), and you're in a much better position than I am to know whether "all DPI-C that uses context_t must stay on a single thread" is a constraint worth imposing.

I think my plan at this point is to continue experimenting with ucontexts to get a better understanding of what it means to nest them (& how else they're used by htif & friends), and whether there's somewhere besides a thread-local to pass task-local sideband data to try and break the coupling there (a sketch of what I mean follows below). Pursuing a definition of a task that didn't care what thread it was scheduled on (as long as it wasn't scheduled more than once) seems like it offers a resolution that's relatively low-impact and high-durability to me.

Thank you for reading, and for the feedback! I'm glad you've found it helpful; I've very much enjoyed learning about all these fine details as well 😄
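As one sketch of what "somewhere besides a thread-local" could look like at the DPI boundary (hypothetical names and signatures; not the current SimNetwork interface, and the SystemVerilog side would need matching changes):

```cpp
// Sketch: thread the state through the DPI calls as an opaque handle
// (a SystemVerilog chandle maps to void* here), so it no longer matters
// which OS thread Verilator schedules each call onto.
struct NetworkState {
  // netdev, netsw, and any context bookkeeping would live here.
};

extern "C" void* network_init_with_handle(/* ... existing init arguments ... */) {
  return new NetworkState{};   // handed back to the SV side as a chandle
}

extern "C" void network_tick_with_handle(void* handle /* , ... existing tick arguments ... */) {
  NetworkState* state = static_cast<NetworkState*>(handle);
  (void)state;                 // ... use state instead of the thread-local ...
}
```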
I think I finally understand the problem here well enough that I feel confident about what's happening. Much of the clarity came when I got curious about the question "why is re-assigning target and main on every tick enough to avoid the crash?" I found that even in single-threaded mode the code is leaning on the init and tick calls sharing an OS thread; it just always holds there.

I understand the desire to move responsibility for that to the verilog implementation, but I don't see how to do so effectively. In this case there are three non-lexical scopes that all need to line up (the thread-local, the OS thread the scheduler picks for each call, and the ucontext that's currently live).

So, I'd like to pitch a three step plan:
What do you think? Would you be willing to accept a change like 1 and piecemeal updates for 2?
Background Work
Chipyard Version and Hash
Release: N/A
Hash: ef71dfd
OS Setup
(partial) `printenv`
`conda list`
Other Setup
Followed the "setting up the repository" guide, and added the TapNICRocketConfig used below to PeripheralDeviceConfigs.scala.
Current Behavior
When I run:
make -C sims/verilator CONFIG=TapNICRocketConfig VERILATOR_THREADS=1
with any number of threads larger than 1, I end up with a simulator program that dies with a SIGSEGV almost as soon as I launch it.
As a consequence (and to address what I'm really after), running the pingd.c test results in a ~2s ping on my system. That's longer than the default interval of a second, which means that running ping with no arguments against the single-threaded RTL simulation is an effective DoS strategy, as it sends ICMP echo requests at about 2x the throughput the simulator can maintain.
Expected Behavior
Simply, I expected to be able to "throw more threads at it," as most of the time seems to be going to front-end stalls due to icache misses, something that multiple threads address nicely by way of expanding the effective available icache space.
More broadly, I suppose I expected there to be a way to get to workable performance of the RTL model for functional simulation of simple network nodes without custom hardware, proprietary software, or an FPGA-equipped cloud instance. Perhaps my expectation that such a path exists through verilator is worth discussing here, too?
Other Information
coredumpctl debug suggests this is because the thread-local context_t isn't fully initialized in all threads.

I'm not familiar with the usercontext patterns/types used by the fesvr to implement what appears to be "green threads" (really, co-routines), but I do see in the verilator docs that in multithreaded verilator it's the verilated model which creates and manages all N-1 threads except for whatever called eval.

The latter bit suggests to me that while under some conditions the DPI functions may be called from the same threads, I see no guarantee that DPI from the always blocks in a module will be called using the same thread as the initial blocks (which the current implementation implicitly assumes). I suspect that one more degree of refinement around "the model's threads ought to be more compatible with the fesvr's coroutine implementation backing the DPI SimNetwork implementation" would help here, not least in identifying whether this is even a crash that chipyard itself has any leverage over.

I'm opening this here because I believe it's the right home for this issue: even if verilator or fesvr exposed a callback or config option that would affect the outcome, there'd still need to be a change here in chipyard to take advantage of it. But I'm very new to this space and would welcome your guidance.
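For reference, the usual shape of a multithreaded Verilator harness is roughly the following (a sketch; VTestHarness stands in for whatever model class chipyard generates). Only the thread that calls eval() is under the harness's control; the other worker threads are created and scheduled by the model itself, and DPI imports can land on any of them.

```cpp
#include "verilated.h"
#include "VTestHarness.h"   // placeholder name for the generated model header

int main(int argc, char** argv) {
  VerilatedContext context;
  context.commandArgs(argc, argv);

  VTestHarness top(&context);
  while (!context.gotFinish()) {
    top.eval();             // DPI calls (network_init/network_tick) fire in here
    context.timeInc(1);
  }
  top.final();
  return 0;
}
```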