-
Notifications
You must be signed in to change notification settings - Fork 11
Description
Hi,
I've been digging into checksum errors that we have been getting while attempting to queue messages. This issue was found while receiving unsolicited messages at a fairly high rate of one every 5 seconds and also trying to queue messages . This issue was intermittent so it was difficult to track down. However, the error was caught during debugging:
15:31:12.588 -> sendCommandWithResponse: ====>
15:31:12.588 -> sendCommand: Command: $TD HD=34819200,"1673299804+9999+9999+9999+9999+9999+999+9999+9999+999+999+999+999+9+408563-966562+367+2+SH2_01;1673299865+9999+9999+9999+9999+9999+999+9999+9999+999+999+9"*4a
15:31:12.588 ->
15:31:12.635 -> waitForResponse: $TD OK,4601487328959*2c
15:31:12.635 ->
15:31:12.635 -> checksum = 50
15:31:12.635 -> expectedChecksum = 2C
15:31:12.635 -> checkChecksum: invalid checksum
15:31:12.635 -> sendCommandWithResponse: <====
'checksum' and 'expectedChecksum' were variables printed from the 'checkChecksum()' function in the 'SparkFun_Swarm_Satellite_Arduino_Library.cpp' file. 'checksum' is the value as calculated from the string and 'expectedChecksum' is the value parsed from the string. With this example, we can see that the checksum found on the response is 2C and the expectedChecksum matches this but the checksum does not match.
Looking at a larger number of communications also helps to give us a picture of what may be happening:
15:16:58.946 -> checkUnsolicitedMsg: event(s) found! ===>
15:16:58.946 -> checkUnsolicitedMsg: start of event: $RT RSSI=-109*15
15:16:58.946 -> checksum = 15
15:16:58.946 -> expectedChecksum = 15
15:16:58.994 -> checkUnsolicitedMsg: end of event
15:16:58.994 -> checkUnsolicitedMsg: <=== end of event(s)!
15:17:07.835 -> sendCommandWithResponse: ====>
15:17:07.880 -> sendCommand: Command: $TD HD=34819200,"1673298960+9999+9999+9999+9999+9999+999+9999+9999+999+999+999+999+9+408564-966562+359+2+SH2_01;1673299020+9999+9999+9999+9999+9999+999+9999+9999+999+999+9"*4b
15:17:07.880 ->
15:17:07.926 -> waitForResponse: $TD OK,4601487328952*27
15:17:07.926 ->
15:17:07.926 -> checksum = 15
15:17:07.926 -> expectedChecksum = 15
15:17:07.926 -> sendCommandWithResponse: <====
15:17:07.926 -> sendCommandWithResponse: ====>
15:17:07.926 -> sendCommand: Command: $MT C=U*12
15:17:07.926 ->
15:17:08.535 -> waitForResponse: $MT 1*08
15:17:08.535 ->
15:17:08.535 -> checksum = 15
15:17:08.535 -> expectedChecksum = 15
15:17:08.535 -> sendCommandWithResponse: <====
15:17:08.535 -> The message has been added to the transmit queue. The message ID is 4601487328952
15:17:08.582 -> Waiting for the $TD SENT
15:17:09.562 -> checkUnsolicitedMsg: backlog found! backlog length is 41
15:17:09.562 -> checkUnsolicitedMsg: event(s) found! ===>
15:17:09.562 -> checkUnsolicitedMsg: start of event: $RT RSSI=-109*15
15:17:09.562 -> checksum = 15
15:17:09.609 -> expectedChecksum = 15
15:17:09.609 -> checkUnsolicitedMsg: end of event
15:17:09.609 -> checkUnsolicitedMsg: start of event: $TD OK,4601487328952*27
15:17:09.609 -> checksum = 27
15:17:09.609 -> expectedChecksum = 27
15:17:09.609 -> checkUnsolicitedMsg: end of event
15:17:09.609 -> checkUnsolicitedMsg: start of event: $RT RSSI=-110*1d
15:17:09.609 -> checksum = 1D
15:17:09.609 -> expectedChecksum = 1D
15:17:09.657 -> checkUnsolicitedMsg: end of event
15:17:09.657 -> checkUnsolicitedMsg: <=== end of event(s)!
Here, we can see that the checksum for the first receive test unsolicited message is calculated correctly. Then, both the checksum and expectedChecksum values do not match up with the checksum from the return after the queue message is sent. The received string shows a checksum of 27 but they both stay as 15. The same is true for the response to the get unsent messages count command. It has a checksum of 08 but again we are stuck at 15. The values do still match, so no error is detected. Then when the backlog is being pruned, we find that the checksums begin to match up with the message strings again.
This looked to me like the string being passed into the 'checkChecksum()' function was not being updated. So I looked into where this was and found it in the function below. If we look at the bottom where the checksum is actually checked, we see that the '_swarmBacklog' buffer is passed into the function at the index 'responseStartedAt'. However, when we look at where that index is found, we see that the 'responseDest' buffer is the one being iterated through to find our response. This means that the index we are finding probably won't match with the location of the response in the _swarmBacklog, especially if we are receiving unsolicited messages which are put in there. It is most likely corresponding to the first message in the _swarmBacklog which in this case is probably a receive test message. This will usually result in the wrong message string being checked for a correct checksum which will normally still find that the checksums match. However, it can cause unexpected results.
Swarm_M138_Error_e SWARM_M138::waitForResponse(const char *expectedResponseStart, const char *expectedErrorStart,
char *responseDest, size_t destSize, unsigned long timeout)
{
unsigned long timeIn;
bool found = false;
size_t destIndex = 0;
bool responseStartSeen = false, errorStartSeen = false;
int responseIndex = 0, errorIndex = 0;
size_t responseStartedAt = 0, errorStartedAt = 0;
Swarm_M138_Error_e err = SWARM_M138_ERROR_ERROR;
bool printedSomething = false;
timeIn = millis();
while ((!found) && ((timeIn + timeout) > millis()))
{
size_t hwAvail = hwAvailable();
if (hwAvail > 0) //hwAvailable can return -1 if the serial port is NULL
{
if ((destIndex + hwAvail) < destSize) // Check there is room to store the response (with a null on the end!)
{
size_t bytesRead = hwReadChars((char *)&responseDest[destIndex], hwAvail);
if (_printDebug == true)
{
if (printedSomething == false)
{
_debugPort->print(F("waitForResponse: "));
printedSomething = true;
}
_debugPort->print((const char *)&responseDest[destIndex]);
}
// Check each character to see if it is the expected resonse or error
for (size_t chrPtr = destIndex; chrPtr < (destIndex + bytesRead); chrPtr++)
{
char c = responseDest[chrPtr]; // Check each character
if (c == expectedResponseStart[responseIndex])
{
if ((responseIndex == 0) && (responseStartSeen == false))
responseStartedAt = chrPtr; // Record where the (possible) response started
if (++responseIndex == (int)strlen(expectedResponseStart))
{
responseStartSeen = true;
}
}
else
{
responseIndex = 0;
}
if (expectedErrorStart != NULL)
{
if (c == expectedErrorStart[errorIndex])
{
if ((errorIndex == 0) && (errorStartSeen == false))
errorStartedAt = chrPtr; // Record where the (possible) error started
if (++errorIndex == (int)strlen(expectedErrorStart))
{
errorStartSeen = true;
}
}
else
{
errorIndex = 0;
}
}
// Set found to true if we have seen expectedResponseStart or expectedErrorStart _and_ have received a \n
if ((responseStartSeen || errorStartSeen) && (c == '\n'))
found = true;
}
// Now also copy the response into the backlog, if there is room
// _swarmBacklog is a global array that holds the backlog of any events
// that came in while waiting for response. To be processed later within checkUnsolicitedMsg().
// Note: the expectedResponse or expectedError will also be added to the backlog.
// Everything in the backlog is 'printable'. So it is OK to use strlen - so long as there is a
// \0 at the end of the buffer!
size_t backlogLength = strlen((const char *)_swarmBacklog);
if ((backlogLength + bytesRead) < _RxBuffSize) // Is there room to store the new data?
{
memcpy((char *)&_swarmBacklog[backlogLength], (char *)&responseDest[destIndex], bytesRead);
}
else
{
if (_printDebug == true)
{
if (printedSomething == true)
{
_debugPort->println();
printedSomething = false;
}
_debugPort->println(F("waitForResponse: Panic! _swarmBacklog is full!"));
}
}
destIndex += bytesRead; // Increment destIndex
}
else
{
if (_printDebug == true)
{
if (printedSomething == true)
{
_debugPort->println();
printedSomething = false;
}
_debugPort->println(F("waitForResponse: Panic! responseDest is full!"));
}
}
}
else
delay(1);
}
if (_printDebug == true)
if (printedSomething)
_debugPort->println();
if (found == true)
{
if (errorStartSeen) // Error needs priority over response as response is often the beginning of error!
{
// if (_printDebug == true)
// {
// _debugPort->print(F("waitForResponse: errorStart: "));
// _debugPort->println((char *)&_swarmBacklog[errorStartedAt]);
// }
err = checkChecksum((char *)&_swarmBacklog[errorStartedAt]);
if (err == SWARM_M138_ERROR_SUCCESS)
{
extractCommandError((char *)&_swarmBacklog[errorStartedAt]);
err = SWARM_M138_ERROR_ERR;
}
}
else if (responseStartSeen)
{
// if (_printDebug == true)
// {
// _debugPort->print(F("waitForResponse: responseStart: "));
// _debugPort->println((char *)&_swarmBacklog[responseStartedAt]);
// }
err = checkChecksum((char *)&_swarmBacklog[responseStartedAt]);
}
}
else
err = SWARM_M138_ERROR_TIMEOUT;
pruneBacklog(); // Prune any incoming non-actionable URC's and responses/errors from the backlog
return (err);
}
As a temporary fix, I changed the calls to the checkChecksum function here to look like this:
err = checkChecksum((char *)&responseDest[responseStartedAt]);;
After implementing this change, the checksums seemed to match up to those on the received messages and after testing this with a large number of messages, we have not seen the error again.
Anyway, that's just what I found. Please take a look and see if you guys end up with similar conclusions.
Best,
Hayden