
DDS-related processes crash after system time adjustment #3836

Open

xjzer opened this issue Sep 9, 2023 · 4 comments
Labels
bug Issue to report a bug

Comments

@xjzer

xjzer commented Sep 9, 2023

Is there an already existing issue for this?

  • I have searched the existing issues

Expected behavior

Expected normal operation of the process

Current behavior

1. After the controller powers up, the DDS-related processes start normally; then the system time is adjusted, at which point the processes occasionally crash.
2. Parsing the core files with gdb shows that both crashes occur after an exception is thrown from the destructor ~RTPSMessageGroup().
3. After removing the try/catch/throw from the destructor ~RTPSMessageGroup() and reproducing the problem a couple more times, the gdb backtraces are as follows.

  • try/catch/throw removed:

```cpp
RTPSMessageGroup::~RTPSMessageGroup() noexcept(false)
{
    //try
    //{
        send();
    //}
    //catch (...)
    //{
    //    if (!internal_buffer_)
    //    {
    //        participant_->return_send_buffer(std::move(send_buffer_));
    //    }
    //    throw;
    //}

    if (!internal_buffer_)
    {
        participant_->return_send_buffer(std::move(send_buffer_));
    }
}
```
  • added log:

```cpp
std::chrono::time_point<std::chrono::steady_clock> start_steady_clock = std::chrono::steady_clock::now();
std::chrono::time_point<std::chrono::system_clock> start_system_clock = std::chrono::system_clock::now();
if (!sender_->send(msgToSend, max_blocking_time_is_set_ ? max_blocking_time_point_
                                                        : (std::chrono::steady_clock::now() + std::chrono::hours(24))))
{
    std::chrono::time_point<std::chrono::steady_clock> end_steady_clock = std::chrono::steady_clock::now();
    std::chrono::time_point<std::chrono::system_clock> end_system_clock = std::chrono::system_clock::now();

    std::time_t start_c = std::chrono::system_clock::to_time_t(start_system_clock);
    std::time_t end_c = std::chrono::system_clock::to_time_t(end_system_clock);
    std::cerr << "max_blocking_time_is_set_ = " << max_blocking_time_is_set_ << std::endl;
    std::cerr
        << "max_blocking_time_point_ = "
        << std::chrono::duration_cast<std::chrono::milliseconds>(max_blocking_time_point_.time_since_epoch()).count()
        << std::endl;
    std::cerr << "start steady_clock = "
              << std::chrono::duration_cast<std::chrono::milliseconds>(start_steady_clock.time_since_epoch()).count()
              << ", end steady_clock = "
              << std::chrono::duration_cast<std::chrono::milliseconds>(end_steady_clock.time_since_epoch()).count()
              << std::endl;
    std::cerr << "start system_clock = " << std::put_time(std::localtime(&start_c), "%F %T") << "."
              << std::setfill('0') << std::setw(3)
              << (std::chrono::duration_cast<std::chrono::milliseconds>(start_system_clock.time_since_epoch()) -
                  std::chrono::duration_cast<std::chrono::seconds>(start_system_clock.time_since_epoch()))
                     .count()
              << std::endl;

    std::cerr << "end system_clock = " << std::put_time(std::localtime(&end_c), "%F %T") << "." << std::setfill('0')
              << std::setw(3)
              << (std::chrono::duration_cast<std::chrono::milliseconds>(end_system_clock.time_since_epoch()) -
                  std::chrono::duration_cast<std::chrono::seconds>(end_system_clock.time_since_epoch()))
                     .count()
              << std::endl;

    std::cerr << "== typeid(*sender_) = " << typeid(*sender_).name() << std::endl;
    std::cout << "dynamic_cast<LocatorSelectorSender *> " << dynamic_cast<LocatorSelectorSender *>(sender_)
              << std::endl;
    std::cout << "dynamic_cast<ReaderLocator *> " << dynamic_cast<ReaderLocator *>(sender_) << std::endl;
    std::cout << "dynamic_cast<DirectMessageSender *> " << dynamic_cast<DirectMessageSender *>(sender_) << std::endl;
    std::cout << "dynamic_cast<WriterProxy *> " << dynamic_cast<WriterProxy *>(sender_) << std::endl;
    throw timeout();
}
```
  • log 1 (screenshot)

  • log 2 (screenshot)

  • gdb backtrace 1 (screenshot)

  • gdb backtrace 2 (screenshot)

  • gdb backtrace 3 (screenshot)

Steps to reproduce

  1. After the controller powers up, the DDS-related processes start normally; then the system time is adjusted, at which point the processes occasionally crash.
  2. The problem is episodic and no reliable reproduction steps have been found.

Fast DDS version/commit

2.5.0

Platform/Architecture

Other. Please specify in Additional context section.

Transport layer

Default configuration, UDPv4 & SHM

Additional context

  1. aarch64
  2. Linux
  3. UDP Transport

XML configuration file

use fastddsgen

Relevant log output

No response

Network traffic capture

No response

@xjzer xjzer added the triage Issue pending classification label Sep 9, 2023
@JLBuenoLopez JLBuenoLopez added need more info Issue that requires more info from contributor and removed triage Issue pending classification labels Sep 11, 2023
@JLBuenoLopez
Contributor

Hi @xjzer

Thanks for opening the ticket. I am afraid that Fast DDS v2.5.x is no longer being maintained. Can you please check if it is happening in some of the currently maintained branches? You can find more information in Fast DDS Release support page.

Also, is this crash happening when the system time adjustment moves the time forward or rewinds it?

@xjzer
Author

xjzer commented Sep 11, 2023

> Hi @xjzer
>
> Thanks for opening the ticket. I am afraid that Fast DDS v2.5.x is no longer being maintained. Can you please check if it is happening in some of the currently maintained branches? You can find more information in Fast DDS Release support page.
>
> Also, is this crash happening when the system time adjustment moves the time forward or rewinds it?

Hi @JLBuenoLopez-eProsima

  1. Can you recommend a relatively stable version? The latest is not necessarily the most stable.
  2. Every time the controller starts there is a time jump, for example from 2020 to 2023, but the process does not necessarily crash. On the other hand, although the crashes are occasional, each crash coincides exactly with a time adjustment. In short: when the time is adjusted, the process does not necessarily crash; but when the process crashes, there has always been a time adjustment.

@JLBuenoLopez JLBuenoLopez added bug Issue to report a bug and removed need more info Issue that requires more info from contributor labels Sep 12, 2023
@JLBuenoLopez
Contributor

@xjzer,

  1. Every Fast DDS version is released as a stable version. Nevertheless, patch releases are frequently made to fix detected bugs. Currently, Fast DDS v2.10.x and Fast DDS v2.11.x are the officially supported versions. Both of them are in their second patch release, so they can be considered quite stable.
  2. Thanks for the info. We will need to take a look into the reported issue.

@xjzer
Author

xjzer commented Sep 12, 2023

@JLBuenoLopez-eProsima

Some more clues, based on the backtrace provided earlier:

  1. The problem occurs because sender_->send() inside RTPSMessageGroup::send() returns false.

  2. According to the dynamic_cast output, there are two possibilities for sender_ when the failure happens: one is LocatorSelectorSender and the other is WriterProxy.

  3. So far I have found that send() in both derived classes calls RTPSParticipantImpl::sendSync(), and that function returns false in only one case: when try_lock_until returns false. I added a log there to catch the problem; the log when the problem occurs is as follows.

  • function call graph (screenshot: func_call)

  • log added at the try_lock_until failure (screenshot: try_lock_until_add_fail_log)

  • coredump 1 log (screenshot: core_log_1)

  • coredump 1 backtrace (screenshot: gdb_core_1)

  • coredump 2 log (screenshot: core_log_2)

  • coredump 2 backtrace (screenshot: gdb_core_2)

  4. Given the information above, it can be confirmed that RTPSMessageGroup::send() throws an exception because try_lock_until fails, yet try_lock_until did not return false because of a timeout. How should this be troubleshot or handled? Is it triggering the "fail spuriously" behavior mentioned here?
