
Future Plans: better throughput, more measurements/second #22

Open
n8ur opened this issue Sep 8, 2016 · 15 comments

Comments

@n8ur
Member

n8ur commented Sep 8, 2016

Without any optimizing (and using int64_t far more than is probably necessary, on an Arduino that's really slow), the time inside the TICC measurement loop from "got data" to the end of calculation is around 500 us.

However, printing over USB takes 1 to 5ms. From some Googling I've learned that the following may be bottlenecks:

  1. The "Serial.print" function is very inefficient. Using "Serial.write" might be faster, but we have to send it the data character-by-character.
  2. USB packetizes data and sends variable length packets of up to 4K once per millisecond. So there's latency waiting for the next slot, and if the data crosses packets there's further delay.

I've also found or thought of a couple of possible solutions:

  1. Use Serial.write -- probably not a big win if you still have USB latency.
  2. Buffer the print requests so that we don't send on each measurement, but build up data until we can send a burst. We don't need real-time data output; if a burst arrives every 10 seconds, I don't think that's a problem. (A rough sketch of this idea follows below.)
  3. Use one of the other UART ports on the Mega CPU for output. This would require an added TTL-to-RS-232 converter. It would get rid of USB latency, but may have other overhead or bottlenecks (I'm not clear whether only serial port 1 is a hardware UART and whether the others involve more bit-banging).
  4. Use SPI to send data to another Arduino, which would buffer it to send out in large packets and could also do other cool things like drive an LCD front panel. I kind of like this idea as presumably the SPI would be able to move the data out pretty darn quickly, and we already have the SPI code in the program.

Thoughts?
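As a strawman for option 2, something like this might work (an untested sketch; the buffer size and flush interval are made-up numbers, not anything currently in TICC.ino) -- accumulate formatted lines in RAM and hand them to Serial in one burst:

const size_t OUT_BUF_SIZE = 512;              // illustrative size
const unsigned long FLUSH_INTERVAL_MS = 1000; // illustrative interval

char outBuf[OUT_BUF_SIZE];
size_t outLen = 0;
unsigned long lastFlush = 0;

// Queue a formatted measurement line instead of printing it immediately.
void queueLine(const char *line) {
  size_t n = strlen(line);
  if (outLen + n >= OUT_BUF_SIZE) {           // buffer full -- flush early
    Serial.write((const uint8_t *)outBuf, outLen);
    outLen = 0;
  }
  memcpy(outBuf + outLen, line, n);
  outLen += n;
}

// Called from loop(); sends everything queued since the last burst.
void maybeFlush() {
  if (outLen > 0 && millis() - lastFlush >= FLUSH_INTERVAL_MS) {
    Serial.write((const uint8_t *)outBuf, outLen);
    outLen = 0;
    lastFlush = millis();
  }
}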

@Tom-McDermott
Contributor

Tom-McDermott commented Sep 11, 2016

Hi John - let me try to answer some parts.

  1. When passed a string, Serial.print basically just calls Serial.write. There's not much difference performance-wise. Printing a float is where Serial.print is inefficient. Our code creates a string out of the uint64_t by using printf. I suspect that printf and Serial.print are roughly similarly inefficient.
  2. USB 2.0 packetizes 1024 bytes/frame. However, I don't think the Arduino has any added processing, as the buffer is passed to the 16U2, which queues it and completes the USB transfer independently from the 2560 processor. Any delay here only affects when the characters reach the host computer (latency, which we don't care about).

For the solutions:

  1. All 4 of the serial interfaces on the Mega2560 have only a one-byte hardware buffer. The Arduino serial library puts the transmit characters into a 64-byte ring buffer and then returns; it uses interrupts to clock out the contents of that buffer. If you try to put more than 64 bytes into the buffer, the Serial.write (or print) call blocks, and all other processing stops until the buffer drops below 64 bytes again.
  2. I don't think things improve by choosing a different serial output, as the 16U2 offloads most or all of the USB work. The other ports also have the drawback that the host computer would need a serial interface.

The key to speeding up the measurement cycle is to reduce the time the 2560 spends in the compute/format loop. As long as the transmit buffer stays below 64 bytes, I think the serial Tx is outside the critical path (i.e. handled by interrupts).
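One way to make sure of that (a hypothetical guard, not something in the current code) is to check the free space in the transmit ring buffer before handing a line to the library, so the write can never block the measurement loop:

// Only queue the line if it fits in the 64-byte TX ring buffer.
void printNonBlocking(const char *line) {
  size_t n = strlen(line);
  if ((size_t)Serial.availableForWrite() >= n) {
    Serial.write((const uint8_t *)line, n);  // queued; the TX interrupt drains it
  }
  // else: buffer nearly full -- caller must retry later or drop the line
}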

  1. We could use SPI to another Arduino. The question is: does that actually reduce the time in the critical processing loop? I'm not convinced that it would, but it might. If we just sent the 8-byte uint64_t to the 2nd Arduino via SPI, and had that 2nd Arduino do all the formatting into ASCII, it would remove some time from the critical-path loop of the 1st Arduino (rough sketch below).
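On the 1st Arduino it could look roughly like this (illustrative only -- the slave-select pin and clock rate are assumptions, the pin is assumed to be set as OUTPUT in setup(), and the 2nd Arduino's receive/format side isn't shown):

#include <SPI.h>

const int SLAVE_SELECT_PIN = 49;   // assumed wiring

// Ship the raw 8-byte measurement to the second Arduino; it does the ASCII work.
void sendRawMeasurement(uint64_t value) {
  SPI.beginTransaction(SPISettings(4000000, MSBFIRST, SPI_MODE0));
  digitalWrite(SLAVE_SELECT_PIN, LOW);
  for (uint8_t i = 0; i < 8; i++) {
    SPI.transfer((uint8_t)(value >> (8 * i)));   // least-significant byte first
  }
  digitalWrite(SLAVE_SELECT_PIN, HIGH);
  SPI.endTransaction();
}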

-- Tom, N5EG

@arodland

The Mega has 4 hardware UARTs that are all the same (that's one of its big advantages over the smaller boards). The USB-to-serial conversion is handled on a separate chip that's wired up to UART0; the CPU doesn't see any of it. In fact, if you want to ignore USB and use serial you can do it right now by unplugging the USB, applying power to Gnd and 5V (or Gnd and Vin), and hooking up TTL serial to pins 0 and 1. And I agree with Tom about Serial.print: it's not likely to be your bottleneck.

In fact, I wouldn't be surprised at all if the most expensive thing is simply all of the 64-bit arithmetic (especially the divisions) in print_signed_picos_as_seconds. Not sure how much improvement is available there, but I think it's at least some.

Another thing worth trying is eliminating all of the digitalRead and digitalWrite in the measurement loop. Those functions have a bunch of overhead due to checks needed for beginner-friendliness (look up the port number and mask in a table of pin numbers, make sure it's a valid pin number, see if it's doing PWM, etc.). It's possible to bypass all that for performance, and may be worth it here.
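For example (pin choices here are made up for illustration, not the TICC wiring): on the Mega 2560, digital pins 22 and 23 are bits 0 and 1 of PORTA, so hot-loop pin access can go straight to the registers, which compiles down to a couple of AVR instructions:

// digitalWrite(22, HIGH/LOW) replacement
inline void fastSetPin22(bool high) {
  if (high) PORTA |= _BV(PA0);
  else      PORTA &= ~_BV(PA0);
}

// digitalRead(23) replacement
inline uint8_t fastReadPin23() {
  return (PINA & _BV(PA1)) ? 1 : 0;
}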

@n8ur
Member Author

n8ur commented Feb 26, 2017 via email

@arodland

Option 2 is very likely to help if done right. Division is pretty much the most expensive operation on a microcontroller, and it gets dramatically slower the further you go beyond the native word size (which is only 8 bits for AVR). If you can stop the values from getting so large in the first place -- even at the cost of having more variables, and even at the cost of doing a division earlier on (but with smaller quantities) -- you can easily win.
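A toy illustration of the idea (units assumed here: 100 us ticks and picosecond remainders; none of this is the current TICC code):

// 64-bit route: one big picosecond count, one expensive 64-bit divide.
int32_t secondsFromPicos(int64_t total_ps) {
  return (int32_t)(total_ps / 1000000000000LL);   // slow on an 8-bit AVR
}

// 32-bit route: keep the tick count and sub-tick remainder separate, so the
// divide that recovers whole seconds stays 32-bit.
int32_t secondsFromTicks(int32_t ticks_100us) {
  return ticks_100us / 10000;                     // 10,000 x 100 us per second
}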

@n8ur
Member Author

n8ur commented Feb 27, 2017 via email

@n8ur
Member Author

n8ur commented Feb 27, 2017

print.txt

@arodland

arodland commented Mar 1, 2017

int32_t frac = PICstop_usec * 1000000; // mult by 1e6 to make picos

I think this number is off: if PICstop_usec is actually 100 us ticks, you want to multiply by 1e8, not 1e6.

Either way, you need an int64_t at this point, or you will exceed the range of int32_t (+/- about 2e9). Picoseconds are tough; we can't even fit one second's worth of them in a 32-bit int.

But as demonstrated by tof, the number of picos in a 100 us tick does fit comfortably in an int32, and it's a nice decimal division (meaning you can be clever with print routines: instead of multiplying, adding, and printing, you can just print the two values back-to-back if you make sure to zero-pad the second one). So I would suggest dividing things up along those lines, as a struct timestamp { int32_t sec; uint16_t ticks; uint32_t ps; } (80 bits, with the decimal point located conveniently between two of the fields, and a time range of over a hundred years). Adding and subtracting them would be done through (relatively cheap) library functions that add/subtract corresponding parts and deal with carry/overflow.

I admit it seems a bit complicated, but it could be done up pretty tidily, and the next simpler option I see (struct { int32_t ticks; uint32_t ps; }) only covers a range of about 5 days, which would become inconvenient during long measurement runs.
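To make that concrete, a rough, untested sketch of what I have in mind (the constants and helper names are placeholders):

// ticks are 100 us units; ps is the remainder within a tick
struct timestamp {
  int32_t  sec;     // whole seconds; range is well over a hundred years
  uint16_t ticks;   // 0..9999 (100 us ticks within the second)
  uint32_t ps;      // 0..99,999,999 (picoseconds within the tick)
};

const uint32_t PS_PER_TICK   = 100000000UL;   // 100 us in picoseconds
const uint16_t TICKS_PER_SEC = 10000;

// Add two timestamps field by field, propagating carries; no 64-bit math.
timestamp ts_add(const timestamp &a, const timestamp &b) {
  timestamp r;
  r.ps    = a.ps + b.ps;
  r.ticks = a.ticks + b.ticks;
  r.sec   = a.sec + b.sec;
  if (r.ps >= PS_PER_TICK)      { r.ps -= PS_PER_TICK;      r.ticks++; }
  if (r.ticks >= TICKS_PER_SEC) { r.ticks -= TICKS_PER_SEC; r.sec++;   }
  return r;
}

// Printing uses the decimal split: zero-padded ticks and ps concatenated
// give the 12-digit picosecond fraction with no multiplication needed.
void ts_print(const timestamp &t) {
  char buf[32];
  sprintf(buf, "%ld.%04u%08lu", (long)t.sec, t.ticks, (unsigned long)t.ps);
  Serial.print(buf);
}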

I'm willing to make an attempt at doing it, if you think it's something you would want to maintain. It's the sort of thing I enjoy playing with anyhow.

@n8ur
Member Author

n8ur commented Mar 1, 2017 via email

@n8ur
Member Author

n8ur commented Mar 2, 2017

It's been mentioned that baud rate imposes a basic limit on throughput. I've been using 115200 because that's the maximum speed of the Arduino IDE terminal window, but the hardware apparently supports up to 2 Mbps. I just did a test of the current code with the "DETAIL_TIMING" flag set in TICC.ino, and found that 230400 worked just fine and 576000 worked with some garbles. But neither one resulted in reduced execution time, so I think that at present we are compute-bound, not wire-bound.
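For rough scale (assuming a ~30-character output line, which is just an estimate): 115200 baud is about 11,520 characters per second, so one line takes on the order of 2.6 ms on the wire; but because transmission is interrupt-driven, that time only shows up in the measurement loop if the 64-byte buffer fills and a write blocks.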

@n8ur
Member Author

n8ur commented Mar 3, 2017 via email

@arodland

arodland commented Mar 7, 2017

Status update: I found some time to work on this over the weekend. I refactored timestamps the way I suggested in my earlier comment, got everything working, and ended up with execution times that were almost exactly the same as before. I'll investigate more to try to figure out where the time is really being spent.

@n8ur
Member Author

n8ur commented Mar 9, 2017 via email

@n8ur
Member Author

n8ur commented Mar 9, 2017

Serial.print() statements in the DETAIL_TIMING debug code were swamping the actual execution time. I just pushed 20170309.1, which prints only the total execution time (micros) and nothing else; that cut down (but did not completely eliminate) the overhead. Used in conjunction with the "NULL" measurement mode, this should allow better profiling -- toggling between NULL and TIMESTAMP modes should show the result formatting/output overhead.
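The pattern is roughly this (a hypothetical stand-in; the real DETAIL_TIMING code may differ, and processMeasurement() is just a placeholder for the work being timed):

void processMeasurement();             // placeholder for the measurement/format path

void timedMeasurementPass() {
  unsigned long t_start = micros();

  processMeasurement();                // the code under test

  Serial.println(micros() - t_start);  // one short number per pass, so the
                                       // debug print itself stays cheap
}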

@arodland

That's great! I'll rebase my work on top of this and see what comes out.

@arodland

I suppose this is dead; it's been over a year. The changes I made worked, but contrary to expectation they made absolutely no difference to the performance of the code, and I couldn't get to the bottom of why. I'll try to find my code and push it up in case anyone wants to look into it in the future.
