-
Notifications
You must be signed in to change notification settings - Fork 15
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Future Plans: better throughput, more measurements/second #22
Comments
Hi John - let me try to answer some parts.
For the solutions:
The key to speeding up the measurement cycle it to reduce the time the 2560 spends in the compute/format loop. As long as the transmit buffers stay below 64 bytes, I think the serial Tx is outside the critical path (i.e. handled by interrupts).
-- Tom, N5EG |
The Mega has 4 hardware UARTs that are all the same. (That's one of its big advantages over the smaller boards). The USB-to-serial conversion is handled on a separate chip that's wired up to UART0; the CPU doesn't see any of it. In fact if you want to ignore USB and use serial you can do it right now by unplugging the USB, applying power to Gnd and 5V (or Gnd and Vin), and hooking up TTL serial to pins 0 and 1. And I agree with Tom about Serial.print, it's not likely to be your bottleneck. In fact, I wouldn't be surprised at all if the most expensive thing is simply all of the 64-bit arithmetic (especially the divisions) in print_signed_picos_as_seconds. Not sure how much improvement is available there, but I think it's at least some. Another thing worth trying is eliminating all of the digitalRead and digitalWrite in the measurement loop. Those functions have a bunch of overhead due to checks needed for beginner-friendliness (look up the port number and mask in a table of pin numbers, make sure it's a valid pin number, see if it's doing PWM, etc.). It's possible to bypass all that for performance, and may be worth it here. |
Thanks, Andrew.
You're confirming what I gathered after the bit of research I was able
to do.
I have two ideas that separately or together might get around the 64 bit
bottleneck, but haven't had time to fully think through either of them:
1. Don't do math or print results at 1 picosecond resolution, but at 10
ps. That reduces scaling at the small end by an order of magnitude, and
shouldn't cost anything given 60-70 ps of noise. Then maybe int32 is
big enough for practical purposes.
2. Process and print the integer and fractional part of the result
separately, each as a native 32 bit, e.g.,
Serial.print(integer_part);Serial.print(".");Serial.println(fractional.part)
That requires converting PICstop (count of 100us ticks) into two
variables: integer seconds, and n*0.0001, then subtracting the
time-of-flight variable and deal with any rollovers that result. Can
all that be done with less overhead than the current approach? I have
no idea.
John
----
…On 02/25/2017 03:22 PM, Andrew Rodland wrote:
The Mega has 4 hardware UARTs that are all the same. (That's one of its
big advantages over the smaller boards). The USB-to-serial conversion is
handled on a separate chip that's wired up to UART0; the CPU doesn't see
any of it. In fact if you want to ignore USB and use serial you can do
it right now by unplugging the USB, applying power to Gnd and 5V (or Gnd
and Vin), and hooking up TTL serial to pins 0 and 1. And I agree with
Tom about Serial.print, it's not likely to be your bottleneck.
In fact, I wouldn't be surprised at all if the most expensive thing is
simply all of the 64-bit arithmetic (especially the divisions) in
print_signed_picos_as_seconds. Not sure how much improvement is
available there, but I think it's at least some.
Another thing worth trying is eliminating all of the digitalRead and
digitalWrite in the measurement loop. Those functions have a bunch of
overhead due to checks needed for beginner-friendliness (look up the
port number and mask in a table of pin numbers, make sure it's a valid
pin number, see if it's doing PWM, etc.). It's possible to bypass all
that for performance, and may be worth it here.
—
You are receiving this because you authored the thread.
Reply to this email directly, view it on GitHub
<#22 (comment)>, or mute
the thread
<https://github.com/notifications/unsubscribe-auth/ANonv3QQF4zwhPs8W9w7tCI3VkCvfVWYks5rgI2ZgaJpZM4J4jpm>.
|
Option 2 is very likely to help if done right. Division is pretty much the most expensive operation on a microcontroller, and it gets exponentially slower the further you go beyond the native word size (which is only 8 bits for AVR). If you can stop the values from getting so large in the first place, even at the cost of having more variables, and even at the cost of doing a division earlier on (but with smaller quantities) you can easily win. |
Attached is sample code for an int32-based printout routine. Not
tested, but would appreciate review for dumb logic/math errors.
This moves a little more work into the ISR, but the benefit is there is
no division required at all, at least to do the printing.
Note that we will still need to multiply and divide at 64 bits in the
tdc read function in order to get sufficient resolution in the
time-of-flight calculation.
The timestamp() function returns integer picoseconds in an int64_t
because the time interval and period routines depend on timestamp math.
But we're still left with the problem of printing those int64 results,
so more thought is needed. But at least this is a start.
Is this on the right track?
John
----
…On 02/26/2017 07:15 PM, Andrew Rodland wrote:
Option 2 is very likely to help if done right. Division is pretty much
the most expensive operation on a microcontroller, and it gets
exponentially slower the further you go beyond the native word size
(which is only 8 bits for AVR). If you can stop the values from getting
so large in the first place, even at the cost of having more variables,
and even at the cost of doing a division earlier on (but with smaller
quantities) you can easily win.
—
You are receiving this because you authored the thread.
Reply to this email directly, view it on GitHub
<#22 (comment)>, or mute
the thread
<https://github.com/notifications/unsubscribe-auth/ANonv6CZkBvjtFveh7951k86GH8O8afWks5rghWxgaJpZM4J4jpm>.
|
I think this number is off, if PICstop_usec is actually 100us ticks, you want to multiply by 1e8, not 1e6. Either way, you need an int64_t at this point, or you will exceed the range of int32_t (+/- about 2e9). Picoseconds are tough, we can't even fit one second worth of them in a 32-bit int. But as demonstrated by tof, the number of picos in a 100us tick does fit comfortably in an int32, and it's a nice decimal division (meaning you can be clever with print routines; instead of multiplying, adding, and printing, you can just print the two values back-to-back if you make sure to zero-pad the second one). So I would suggest dividing things up along those lines, as a I admit it seems a bit complicated, but it could be done up pretty tidily, and the next simpler option that I see (that of I'm willing to make an attempt at doing it, if you think it's something you would want to maintain. It's the sort of thing I enjoy playing with anyhow. |
I'd love it if you'd take a whack at that, Andrew!
And you're right about 1e8, of course.
Thanks!
John
----
…On 3/1/2017 1:58 AM, Andrew Rodland wrote:
|int32_t frac = PICstop_usec * 1000000; // mult by 1e6 to make picos |
I think this number is off, if PICstop_usec is actually 100us ticks, you
want to multiply by 1e8, not 1e6.
Either way, you need an int64_t at this point, or you will exceed the
range of int32_t (+/- about 2e9). Picoseconds are tough, we can't even
fit one second worth of them in a 32-bit int.
But as demonstrated by tof, the number of picos in a 100us tick /does/
fit comfortably in an int32, and it's a nice decimal division (meaning
you can be clever with print routines; instead of multiplying, adding,
and printing, you can just print the two values back-to-back if you make
sure to zero-pad the second one). So I would suggest dividing things up
along those lines, as a |struct timestamp { int32_t sec; uint16_t ticks;
uint32_t ps; }| (80 bits, with the decimal point located convenently
between two quantities, time range over a hundred years). Adding and
subtracting them would be done through (relatively cheap) library
functions that add/subtract corresponding parts and deal with
carry/overflow.
I admit it seems a bit complicated, but it could be done up pretty
tidily, and the next simpler option that I see (that of |struct {
int32_t ticks; uint32_t ps; }|) only covers a range of about 5 days
which would become inconvenient during long measurement runs.
I'm willing to make an attempt at doing it, if you think it's something
you would want to maintain. It's the sort of thing I enjoy playing with
anyhow.
—
You are receiving this because you authored the thread.
Reply to this email directly, view it on GitHub
<#22 (comment)>, or mute
the thread
<https://github.com/notifications/unsubscribe-auth/ANonv4blZ4USSlikjQsokSbOZoxpUT-vks5rhRcTgaJpZM4J4jpm>.
|
It's been mentioned that baud rate imposes a basic limit on throughput. I've been using 115200 as that is the maximum speed of the Arduino IDE terminal window, but the hardware apparently supports up to 2mb. I just did a test of the current code with the "DETAIL_TIMING" flag set in TICC.ino, and found that 230400 worked just fine and 576000 worked with some garbles. But neither one resulted in reduced execution time. So I think that at present we are compute, and not wire, bound. |
Andrew, not sure I replied to this, but I did do macros for LED control.
Don't know if they make much difference over digitalWrite(), but I'll
take what I can get.
John
----
…On 02/25/2017 03:22 PM, Andrew Rodland wrote:
The Mega has 4 hardware UARTs that are all the same. (That's one of its
big advantages over the smaller boards). The USB-to-serial conversion is
handled on a separate chip that's wired up to UART0; the CPU doesn't see
any of it. In fact if you want to ignore USB and use serial you can do
it right now by unplugging the USB, applying power to Gnd and 5V (or Gnd
and Vin), and hooking up TTL serial to pins 0 and 1. And I agree with
Tom about Serial.print, it's not likely to be your bottleneck.
In fact, I wouldn't be surprised at all if the most expensive thing is
simply all of the 64-bit arithmetic (especially the divisions) in
print_signed_picos_as_seconds. Not sure how much improvement is
available there, but I think it's at least some.
Another thing worth trying is eliminating all of the digitalRead and
digitalWrite in the measurement loop. Those functions have a bunch of
overhead due to checks needed for beginner-friendliness (look up the
port number and mask in a table of pin numbers, make sure it's a valid
pin number, see if it's doing PWM, etc.). It's possible to bypass all
that for performance, and may be worth it here.
—
You are receiving this because you authored the thread.
Reply to this email directly, view it on GitHub
<#22 (comment)>, or mute
the thread
<https://github.com/notifications/unsubscribe-auth/ANonv3QQF4zwhPs8W9w7tCI3VkCvfVWYks5rgI2ZgaJpZM4J4jpm>.
|
Status update: I found some time to work on this over the weekend. I refactored timestamps the way I suggested in my earlier comment, got everything working, and ended up with execution times that were almost exactly the same as before. I'll investigate more to try to figure out where the time is really being spent. |
That's interesting...
I just made a change (version 20170308.1) that might help with
profiling. The new version I just pushed to main adds a "Null Output"
mode that goes through all the motions but doesn't format or output the
result. Setting that mode, plus uncommenting DETAIL_TIMING, should
provide better ability to see the speed of the loop without serial
output overhead.
A very quick test indicates that with a single channel active, time to
get and calculate a result is about 650ms, but the total time through
the loop is about 1100ms -- almost half the time is spent not doing
useful work. I haven't had a chance yet to look for where that time is
being spent.
…On 03/07/2017 01:07 PM, Andrew Rodland wrote:
Status update: I found some time to work on this over the weekend. I
refactored timestamps the way I suggested in my earlier comment, got
everything working, and ended up with execution times that were almost
exactly the same as before. I'll investigate more to try to figure out
where the time is really being spent.
—
You are receiving this because you authored the thread.
Reply to this email directly, view it on GitHub
<#22 (comment)>, or mute
the thread
<https://github.com/notifications/unsubscribe-auth/ANonv5kTQS4cmnPgi87fPfBedGWpnIFkks5rjZz1gaJpZM4J4jpm>.
|
Serial.print() statements in the DETAIL_TIMING debug code were swamping the actual execution time. I just pushed 20170309.1 which just prints total execution time (micros) with nothing else and that cut down (but did not completely eliminate) the overhead. That in conjunction with the "NULL" measurement mode should allow better profiling -- toggling between NULL and TIMESTAMP mode should show the result formatting/output overhead. |
That's great! I'll rebase my work on top of this and see what comes out. |
I suppose this is dead, it's been over a year. The changes I made worked, but contrary to expectation, made absolutely no difference to the performance of the code, and I couldn't get to the bottom of why. I'll try to find my code and push it up in case anyone wants to look into it in the future. |
Without any optimizing (and using i64_t way more than is probably necessary, and Arduino that's really slow), the time inside the TICC measurement loop from "got data" to end of calculation is around 500us.
However, printing over USB takes 1 to 5ms. From some Googling I've learned that the following may be bottlenecks:
I've also found or thought of a couple of possible solutions:
Thoughts?
The text was updated successfully, but these errors were encountered: