
Startup freeze with multiple motors #8

Open
Cdfghglz opened this issue Oct 27, 2022 · 9 comments
@Cdfghglz

Hi

About 70% of the time, the first motor addressed from EthercatMaster's startup() gets stuck in state 4 (SAFE_OP).

There are no errors on the motor, but it does not react to any subsequent state change attempts. It also seems to break the communication, as the working counter is too low afterwards.

The issue is present with >12 Maxon controllers; I cannot reproduce it with <12 motors.

Any idea what could be causing this or how to debug it? Thanks!

Also linking leggedrobotics/maxon_epos_ethercat_sdk#6

@JohannesPankert

That sounds like a timing issue. Do you run your update thread with the RT_PRIO scheduler and a high thread priority? Do you allow the process to set the thread priority?
To do so, you need to add the following lines to your /etc/security/limits.conf file:

* hard rtprio 99
* soft rtprio 99

@Cdfghglz

Cdfghglz commented Nov 7, 2022

Thank you for the answer.

The template I used only sets up RT priority for the PDO communication after the startup phase. Once startup succeeds, all slaves seem satisfied with the real-time behavior from then on in my case.

Nevertheless I also tried to set the RT prio in the main thread, from which the startup is performed, to no avail.

One more note on the timing: The issue happens on a TX2. Interestingly, I was not able to reproduce on a desktop gen 11 i9.

@JohannesPankert

OK, have you checked whether the thread priority is actually set? This should work if you run the executable with sudo rights or if you add the lines mentioned above to your limits.conf file.

@tfabi has a similar issue with a Jetson computer. If you cannot resolve it with thread priorities, you might want to try reducing the update rate to a level that your computer can handle.

@Cdfghglz

Cdfghglz commented Nov 7, 2022

Yeah, I forgot to confirm the config to you. And yes, I definitely run with sudo, always confirmed by Setting RT Priority: successful.

As far as I can tell, the update rate again only concerns the PDO r/w thread, not the startup process.

@JohannesPankert

You could try to lock the thread to a certain CPU core

bool setRealtimePriority(int priority = 99, int cpu_core = -1) const;

I am not an expert on Jetson TX2 boards, but don't they have different types of CPU cores? Maybe you could select a more performant core here.

@JohannesPankert

It has been a while since I worked with the SDK, but if I remember correctly, the state machine of the drives advances in the update thread even when the configuration is not yet complete.

@Cdfghglz

Cdfghglz commented Nov 7, 2022

Doubling the update period makes no difference.

Despite setting the cpu_core argument to 1 (the faster of the two CPU types, the ~2.4 GHz Denver 2), the thread keeps being rescheduled by the OS. I would have to play around with isolcpus, I guess.

Given the assumption of insufficient compute, I wonder whether the issue can be addressed by using the underlying SOEM interface in a more "gentle" way. It seems odd to me that the issue would be tied to CPU switch latencies or the clock speed. I could try to set up a Raspberry Pi, which has specs comparable to the slower TX2 cores.
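
For reference, isolcpus is a kernel boot parameter, so it goes into the bootloader config rather than a runtime setting. A hedged sketch, assuming the cores to isolate are CPUs 1 and 2 (on a stock TX2 those are commonly the Denver 2 cores, but verify with lscpu first):

```shell
# On a desktop: /etc/default/grub, then `sudo update-grub` and reboot.
# On a Jetson: append to the APPEND line in /boot/extlinux/extlinux.conf.
# Isolated CPUs are skipped by the general scheduler; only threads pinned
# there explicitly (taskset / pthread_setaffinity_np) will run on them.
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash isolcpus=1,2"
```

With that in place, pinning the EtherCAT update thread to an isolated core removes competition from other processes entirely.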

@firesurfer

Hi, I am one of the original developers of that package.

From my experience, EtherCAT tends to be very timing-sensitive (that is what comes with real-time capability). I could imagine that with a larger number of drives, the SOEM implementation is not able to keep up the required timing (and timing stability). That might be why this works on a faster CPU (it could also be that the x86 kernel code is more optimized). By setting the real-time priority and pinning the process to one core, we try to minimize timing jitter caused by the scheduler. (By the way, the reason you still see the thread hopping between cores might be that a different scheduler is used on the Jetson; from user space, all you can do is give the scheduler hints.)

This is also where Johannes' point comes in: the allowed amount of jitter depends on the actual EtherCAT device. For most drives, the tolerated jitter scales with the update rate, so a lower update rate may allow more jitter in your timing (though depending on the drive there may also be hard upper and lower limits).

I have the following suggestions for you:

- Take a look at running a real-time kernel (and make sure you compile SOEM the right way then).
- If you do not want to run a real-time kernel, it might be worth a shot to split the drives across two EtherCAT buses running separate masters. (By the way, we have also found that USB-to-Ethernet adapters can influence the timing quite a bit.)

In the end, there is still the possibility that maxon_epos_ethercat_sdk does some things rather inefficiently. What you can try is to use SOEM directly and just bring the EtherCAT state machine to SAFE_OP for that number of drives. That should point you in the right direction. If it works, you might want to check whether the maxon SDK uses SOEM in a suboptimal way.
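
Such a SOEM-only SAFE_OP check could look roughly like this (a hedged sketch: the interface name is an assumption, it needs linking against libsoem and root rights, and it relies on ec_config_map requesting SAFE_OP for all slaves, as SOEM's own simple_test does):

```cpp
#include <cstdio>
#include "ethercat.h"  // SOEM header

static char IOmap[4096];

int main(int argc, char *argv[]) {
    const char *ifname = (argc > 1) ? argv[1] : "eth0";  // assumption
    if (!ec_init(ifname)) {
        std::printf("ec_init on %s failed (root rights?)\n", ifname);
        return 1;
    }
    if (ec_config_init(FALSE) <= 0) {
        std::printf("no slaves found\n");
        ec_close();
        return 1;
    }
    std::printf("%d slaves found\n", ec_slavecount);

    // maps the PDOs and requests SAFE_OP for all slaves
    ec_config_map(&IOmap);
    ec_statecheck(0, EC_STATE_SAFE_OP, EC_TIMEOUTSTATE * 4);

    // state 0x04 is SAFE_OP; anything else marks the stuck drive
    for (int i = 1; i <= ec_slavecount; i++)
        std::printf("slave %d state: 0x%02x\n", i, ec_slave[i].state);
    ec_close();
    return 0;
}
```

If all drives reach 0x04 here but not through the SDK, the problem is in how the SDK drives SOEM; if they get stuck here too, it is a timing/hardware issue independent of the SDK.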

@JohannesPankert

And just to make sure: you build everything with -O3 optimization (Release in catkin), right?
