1.Sync0 unstable problem and 2. state machine error reset problem #557

Closed
WillyTuring opened this issue Oct 9, 2021 · 15 comments

@WillyTuring

Hi SOEM community,
We are among the lucky users of the great SOEM community. First of all, thanks to all of you for your contributions; we've learned a lot here.
As our implementation proceeded, we ran into a few problems that we struggled with for quite a while, so we decided to ask for some professional opinions here. Let me briefly describe our situation and problems below.
1.Current Situation:
We've managed to establish communication with the EtherCAT slave drivers and can write the control word to change the driver state machine and move the motor.
We are using a Chinese servo driver called Cooldrive, model R6. The ESI file and datasheet will be attached as well.
We are running in Sync0 mode.
We configured the Sync0 cycle time to 4,000,000 ns, the EtherCAT communication cycle time to 4,000,000 ns, the SM2 event generation cycle to 4,000,000 ns, and the copy and calc time to 2,000 ns.
Our initialization sequence is: ① call init (ec_config_init); ② call ec_configdc; ③ call ec_dcsync0; ④ do the mapping (ec_config_map).
2.Problem 1 and Observation:
2.1. Problem 1
Whenever we start to send real data, for example in CSP mode where a loop constantly sends the position value to the slave, the synchronization becomes unstable. After the motor has run for a while, some of the slaves are forced out of operational mode back to safe-operational mode and enter the error state.
2.2. Observation
① While running the code we also use the test software provided by the manufacturer. Whenever we are forced into the error state, it only reports that the sync communication is abnormal, with no more useful information than that.
② Through trial and error we noticed that if we add any delay inside our position-sending loop, the motor makes a loud abnormal noise and draws a lot of current, up to 7-8 A, whereas when it runs normally without load at the same test speed it draws only around 0.7 A. So we had to delete all the delays inside the loop.
③ We added code to the error detection thread to read the error code (registers 0x0300, 0x0308) and the system time difference (0x092C). We found that: 1. for the error code, the Forward Rx Error value sometimes goes up to around 500; 2. the system time difference fluctuates a lot, by up to 2 s between consecutive cycles. I will attach our code report as well.
④ At the beginning we read registers 0x0910 (local copy of reference time), 0x0918 (local time), and 0x0920 (system offset time). We found that the local copy of the reference time differs between the 6 slaves, and that local time ≠ local copy of reference time + system offset time - propagation delay.
Also, when we read the system time difference register (0x092C), the value is 0 until we send any real data (like changing the control mode).
⑤ I looked at the data captured in Wireshark. The frames are sent in a sequence matching our code. One small observation: during the dcsync0 and mapping stages, each slave goes through dcsync0 and mapping completely, and only when it is done does the next slave start.
And only one or a very few incorrect WKC values show up.
⑥ From the data captured in Wireshark we can confirm that our slaves use a 64-bit clock.
CodeReport1

2.Problem 2:
We tried to follow what ArthurKetels suggested in #487 about how to reach a stable sync time for the network. But whenever we change the sequence, for example putting the mapping before the DC configuration, the motor won't work.
3.Problem 3:
Whenever the slaves go to the error state in the state machine, all we can do for now is reboot the driver to bring it back to the Power disabled state. We tried writing 0x08 to the control word and writing a value to register 0x0040 to reset; nothing seems to work for us at the moment.
4.Problem 4:
We configured EtherCAT communication cycle time = SM2 event generation cycle = Sync0 cycle time = 4,000,000 ns, but whenever we start sending real data (like the position value to the slave), the SM2 event generation cycle changes to 250 us and fluctuates quite noticeably.
But we cannot add any delay inside the position-sending loop, as that would cause the abnormal motor noise and high current draw while running.
//
So above are the four problems we have at this moment. We are seeking any suggestions and advice to solve them; every word is welcome and appreciated. Thanks in advance for your kindness.
And below are some questions I personally have about the communication that I am eager to understand as well:
1.Question 1:
When accessing an ESC register, how do we decide which command to use?
2.Question 2:
Shouldn't the copy of System Time (register 0x0910) be the same value for all slaves?
3.Question 3:
Why doesn't Wireshark show data such as changing the control mode or sending the position value to the slave?
4.Question 4:
I've noticed in Wireshark that in a lot of cases a "write" command only writes a zero at first, and then it may or may not write some other data to the same register later on.
Wish all the best here.

Willy
@WillyTuring
Author

[Uploading ESI_Wirshark_CodeReports.zip…](

@ArthurKetels
Contributor

Thanks for the extended report. I am waiting for the report file.
What I can say from your screen capture of register 0x092C is that you interpret the data wrong. From the datasheet:

Table 103: Register System Time Difference (0x092C:0x092F)
Bit    Description
30:0   Mean difference between local copy of System Time and received System Time values
       Difference = Received System Time – local copy of System Time
31     0: Local copy of System Time less than received System Time
       1: Local copy of System Time greater than or equal to received System Time

Your "large" values are just small negative numbers.
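
As an illustration of the table above, here is a minimal sketch of decoding the raw 32-bit value of register 0x092C into a signed difference. The helper names `dc_time_diff_ns` and `le32` are hypothetical and not part of SOEM; the sketch assumes the register was read as 4 little-endian bytes and that the magnitude is in nanoseconds.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assemble a 32-bit value from 4 little-endian bytes as they appear on the wire. */
static uint32_t le32(const uint8_t b[4])
{
    assert(b != NULL);
    return (uint32_t)b[0] | ((uint32_t)b[1] << 8) |
           ((uint32_t)b[2] << 16) | ((uint32_t)b[3] << 24);
}

/* Decode ESC register 0x092C (System Time Difference).
 * Bits 30:0 hold the magnitude. Bit 31 is set when the local copy of System
 * Time is >= the received System Time; since the difference is defined as
 * "received - local", a set bit 31 means the result is zero or negative. */
static int32_t dc_time_diff_ns(uint32_t raw)
{
    int32_t magnitude = (int32_t)(raw & 0x7FFFFFFFu);
    return (raw & 0x80000000u) ? -magnitude : magnitude;
}
```

For example, a raw value of 0x80000005 printed as an unsigned integer is 2147483653, which looks like roughly 2 s in nanoseconds, while `dc_time_diff_ns(0x80000005u)` gives -5.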

@WillyTuring
Author

Hi Arthur, thank you for such a swift response; it is much appreciated.
Oh, got it: the most significant bit is just an indication of whether the local copy is greater or less than the received system time, right? So if we take it out of the equation, the value is actually really small, which is why we only see the last few digits change.
What kind of report file or other information do you need to understand our situation better?

@ArthurKetels
Contributor

I need the minimal version of the code (smallest complete code that explains the problem), and the complete wireshark trace of that code. In this case you can get the slave operational so the slave documentation is not necessary at this time.

The big advantage of posting that minimal complete code is that others in the future might learn from it. Too often only snippets of code are posted, and that just doesn't tell the whole story. Also when the issues are resolved post the complete corrected code again. So others might benefit.

@WillyTuring
Author

OK, I will upload the entire code. I remember uploading the Wireshark capture, which was taken 2 hours ago; please let me know if you can see and download it. The code I am about to upload matches the captured data.

@WillyTuring
Author

WillyTuring commented Oct 9, 2021

main.zip
Hi Arthur, this is our code. Please take a look, and let me know if any further information is needed. Thanks in advance.

@ArthurKetels
Contributor

Thanks for the code. The Wireshark data did not come through. You can check this for yourself: a click on main.zip will download the code, but [Uploading ESI_Wirshark_CodeReports.zip…]( is just text, with no data attached to it.

As for the code, well, it is wrong in so many ways that I do not really know how to give a cohesive answer. You should follow the advice I have in #487. Your code in no way follows these steps.

The largest problem is that you try to do process data exchange in your main loop while at the same time running the real-time ethercat thread that also does PDO exchange. Again ALL PDO transfers belong in the real-time thread.
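
One common way to keep the PDO exchange on a fixed period inside a single real-time thread, without sprinkling relative delays through the code, is to sleep until an absolute deadline that advances by one cycle each iteration. The following is a minimal sketch of the deadline arithmetic only; `add_ns` is a hypothetical helper, and the loop in the comment assumes a POSIX system with `clock_nanosleep` and the `dorun` flag mentioned in this thread.

```c
#include <assert.h>
#include <stdint.h>
#include <time.h>

#define NSEC_PER_SEC 1000000000L

/* Advance an absolute deadline by ns nanoseconds, keeping tv_nsec in [0, 1e9). */
static void add_ns(struct timespec *ts, int64_t ns)
{
    assert(ts != NULL && ns >= 0);
    int64_t nsec = (int64_t)ts->tv_nsec + ns;
    ts->tv_sec += (time_t)(nsec / NSEC_PER_SEC);
    ts->tv_nsec = (long)(nsec % NSEC_PER_SEC);
}

/* Intended use inside the real-time thread (sketch only, not compiled here):
 *
 *   struct timespec next;
 *   clock_gettime(CLOCK_MONOTONIC, &next);
 *   while (dorun) {
 *       add_ns(&next, 4000000);  // 4 ms cycle, as configured in this thread
 *       clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, &next, NULL);
 *       ec_send_processdata();
 *       ec_receive_processdata(EC_TIMEOUTRET);
 *   }
 */
```

Because the deadline is absolute, time spent in the loop body does not accumulate as drift the way a relative sleep would.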

Only start the threads AFTER you have a valid mapping, so AFTER ec_config_map(&IOmap);

You have to send clock sync packets (at least 10000) before enabling dorun=1.

A PDO / SYNC0 offset of only 2us is not enough by a long shot. Please try first with half the cycle time, or 2ms.
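
A small sketch of the half-cycle suggestion, assuming the shift is passed as the last argument of SOEM's `ec_dcsync0(slave, act, CyclTime, CyclShift)`. The helper name `sync0_shift_ns` is hypothetical, and the right shift for a given network still has to be verified by measurement.

```c
#include <assert.h>
#include <stdint.h>

/* Place SYNC0 half a cycle after the start of the cycle, so the process data
 * frame has time to reach every slave before the SYNC0 event fires. */
static int32_t sync0_shift_ns(uint32_t cycle_ns)
{
    assert(cycle_ns > 0 && cycle_ns <= (uint32_t)INT32_MAX);
    return (int32_t)(cycle_ns / 2u);
}

/* Possible use with SOEM (sketch only, not compiled here):
 *   ec_dcsync0(slave, TRUE, cycle_ns, sync0_shift_ns(cycle_ns));
 */
```

With the 4,000,000 ns cycle used in this thread, the helper returns 2,000,000 ns, i.e. the 2 ms mentioned above.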

When doing inter-task communication, please take care of data ownership and prevent races.
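
One possible way to hand a single setpoint from the application task to the real-time thread without a data race is a C11 atomic variable. This is only a sketch under the assumption that one 32-bit value is exchanged; the names `g_target_position`, `set_target_position` and `get_target_position` are hypothetical. Larger structures need a different mechanism, such as a mutex with priority inheritance or a double buffer.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* The real-time thread must never block, so require lock-free int atomics. */
static_assert(ATOMIC_INT_LOCK_FREE == 2, "int atomics must be lock-free");

/* Hypothetical shared setpoint: written by the application task,
 * read once per cycle by the real-time EtherCAT thread. */
static _Atomic int32_t g_target_position = 0;

/* Application side: publish a new target position. */
static void set_target_position(int32_t counts)
{
    atomic_store_explicit(&g_target_position, counts, memory_order_release);
}

/* Real-time side: take one snapshot per cycle before filling the output PDO. */
static int32_t get_target_position(void)
{
    return atomic_load_explicit(&g_target_position, memory_order_acquire);
}
```

With this split, only the real-time thread touches the IOmap, and the application task only ever writes the atomic variable.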

That you could get your slaves to OP is pure luck, not because of proper programming.

@WillyTuring
Author

Uploading ClassicCapture.zip…
Check this out, this should work if the format type is correct.

@WillyTuring
Author

We are optimizing the code right now following your kind advice. When we are done we will test whether anything changes. Please feel free to critique our code structure whenever something comes to mind; all critiques are very welcome.
Thanks so much for the help. Please take a look at the Wireshark data while we do the optimization. We will keep everyone updated here, and when it's done we will definitely upload the code as well.

@WillyTuring
Author

@ArthurKetels Hi Arthur, hope you are having a good day. We modified the code as you suggested and it works pretty well now; it always reaches the target position. I will upload the updated code here and hope you can take a look.
I don't want to be someone who only shows up to ask for help with our specific case, but I still have some questions and concerns that I hope you can shed some light on.
1.Concern:
My biggest concern is this: we have learned a great deal about EtherCAT from the whole community and now have a pretty good grasp of the protocol, but our hunch tells us this is still far from enough to give us REAL confidence in the reliability of our implementation. We are going to keep working on improvements, so we would be very pleased if you could point us in a direction that helps build this confidence.
2.Question 1:
When you were saying this : "You have to send clock sync packets (at least 10000) before enabling dorun=1."
My understanding is that from the moment we do the DC configuration, all slaves have "the same" global clock, which then starts to drift; from the moment we call dcsync for the slaves, all slaves calculate the sync interrupt start time and start generating sync interrupts cyclically; and the master sends packets through the SM event while the slaves read the packet on every sync interrupt. So from the quote above, I imagine we need to call send_processdata 10000 times. Is that what you meant? Please correct me if my understanding above is wrong.
3.Question 2:
When accessing an ESC register, how do we decide which command to use?
4.Question 3:
Shouldn't the copy of System Time (register 0x0910) be the same value for all slaves?
5.Question 4:
Why doesn't Wireshark show data such as changing the control mode or sending the position value to the slave?
6.Question 5:
I've noticed in Wireshark that in a lot of cases a "write" command only writes a zero at first, and then it may or may not write some other data to the same register later on.
main.zip

@ArthurKetels
Contributor

Thanks for the updates. I am happy you got it working.
To answer your questions (or at least to some extent):

  1. Your hunch is right. I do not see enough quality code that proves you can deliver robust solutions. This has nothing to do with EtherCAT but more with programming principles in general. Real-time programming for servo tasks is more involved than many might think. However, this is not the place to discuss those things. My suggestion is to look for books / advisors that deal with this specific set of programming skills. SOEM is built for those with skills, not for beginners.
  2. There is a specific EtherCAT packet to distribute clocks. Read the documentation about the EtherCAT protocol. But in your case the easiest thing to do is to call the PDO exchange 10.000 times.

     for(int i = 0 ; i < 10000 ; i++)
     {
         ec_send_processdata();
         ec_receive_processdata(EC_TIMEOUTRET); /* timeout in microseconds */
     }

     Calling ec_configdc() 10.000 times is useless.
     The slave to slave synchronization works by reading the clock of the reference slave and copying it to all other slaves. The other slaves then observe the reference clock and compare and adjust their local clocks. This is handled by sending a special packet. SOEM attaches this packet automatically to the PDO transfer once ec_configdc() has been called.
  3. To read an ESC register you can use the FPRD command. Just read the SOEM source, where you see it used many times. Only when you want to read a register of all slaves at the same time do you use BRD. The results of all registers are combined with a bitwise OR.
  4. Read the datasheet of the ESC, which explains it in detail. You could also download the big EtherCAT poster:
     https://www.ethercat.org/download/documents/EtherCAT_Device_Protocol_Poster.pdf
  5. It is captured; you just have to parse the content of the EtherCAT frames yourself. Wireshark shows you the protocol packets; the content of the packet is up to you to translate. The process data is simply an exchange of virtual memory space that is configured by SOEM and programmed in each slave. This is the 'mapping' part.
  6. This question is too generic to answer. Please be specific. A practical example will do.
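
To make the point about parsing frame content concrete, here is a minimal sketch of decoding one EtherCAT datagram from raw bytes, such as those copied out of a Wireshark capture. It assumes the standard datagram layout (10-byte header, data, 2-byte working counter) and that the buffer starts at the datagram, after the Ethernet header and the 2-byte EtherCAT frame header. The type and function names are hypothetical and not part of SOEM.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { DG_CMD_FPRD = 4, DG_CMD_BRD = 7, DG_CMD_LRW = 12 }; /* a few command codes */

typedef struct {
    uint8_t        cmd;     /* command type, e.g. 12 = LRW (cyclic process data) */
    uint8_t        idx;     /* index the master uses to match replies */
    uint32_t       address; /* logical address for LRD/LWR/LRW; ADP and ADO otherwise */
    uint16_t       length;  /* number of data bytes (11-bit field) */
    int            more;    /* 1 if another datagram follows in the same frame */
    const uint8_t *data;    /* points into the caller's buffer */
    uint16_t       wkc;     /* working counter that follows the data */
} dgram_view;

/* Parse one datagram at buf. Returns bytes consumed, or 0 if the buffer is too short. */
static size_t parse_datagram(const uint8_t *buf, size_t avail, dgram_view *out)
{
    assert(buf != NULL && out != NULL);
    if (avail < 12u) return 0;                 /* header (10) + WKC (2) minimum */
    uint16_t len_flags = (uint16_t)(buf[6] | (buf[7] << 8));
    uint16_t len = (uint16_t)(len_flags & 0x07FFu);
    size_t total = 10u + (size_t)len + 2u;
    if (avail < total) return 0;
    out->cmd = buf[0];
    out->idx = buf[1];
    out->address = (uint32_t)buf[2] | ((uint32_t)buf[3] << 8) |
                   ((uint32_t)buf[4] << 16) | ((uint32_t)buf[5] << 24);
    out->length = len;
    out->more = (len_flags >> 15) & 1;
    out->data = buf + 10;
    out->wkc = (uint16_t)(buf[10 + len] | (buf[11 + len] << 8));
    return total;
}
```

In an LRW datagram, `data` is the process image itself. Where a given slave's control word or target position sits inside it depends on the mapping, so the offsets have to be taken from the configured mapping rather than from Wireshark.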

@WillyTuring
Author

Thank you SO MUCH for such a detailed response.
Since the initial, primitive test code works fine for now, we are going to restructure the code into classes and organize it better, while writing it to be as REAL-TIME as possible and following coding conventions.
Although we still have some questions and curiosities that are not totally clear to us at the moment, we will try our best to figure them out ourselves first and then maybe have a discussion with you and the community later on.
Again, thank you SO MUCH for the timely and unselfish help. If there's any way we could repay you, directly or indirectly, please let us know without hesitation; we shall do our best.
You are the MAN here, Looking forward to the next discussion!

@WillyTuring
Author

@ArthurKetels Hi Arthur, hope you are well. Recently we encountered a new problem: our slave state machine always goes to fault after the command is executed when we try to go back from operational mode to safe-operational mode. Do you have any clue what the problem might be?
Our current shutdown sequence is:
1. First we return the slave's state machine from the "Normal Operation" state to the "Do not connect main power" state, which is the default state when we power on the driver.
2. Then we tried all different orders of ① setting dorun = 0; ② disabling dcsync0; ③ returning the slave to pre-operational mode. We swapped the order between them but none of them worked.
We have run out of guesses at this point and hope you can shed some light on this. Thanks in advance, and have a good day.
Willy

@nakarlsson
Contributor

@WillyTuring, can this issue be closed

@WillyTuring
Author

@WillyTuring, can this issue be closed

I'm sorry, but yes. I have run into another problem lately, for which I will open a new issue later. Thanks.
