"Lazy pirate" client pattern: socket instance EFSM error & memory inconsistency when binding through dll #3495
Comments
I don't quite understand your comment regarding the workaround. I assume you based your code on http://zguide.zeromq.org/cpp:lpclient. In the original code, the program exits once it regards the server as offline, and the socket is never used again. If you incorporate this code into a program that keeps running afterwards, the REQ socket must be closed and reopened, just as in the retry case. The REQ socket type has a strict state model that requires sending and receiving to alternate. If you don't follow this, you get an EFSM error like the one you encountered (with cppzmq it is thrown as an exception, which must be caught). The socket cannot know that there will be no reply.
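To make the state model concrete, here is a minimal sketch in standard C++. It is a toy model, not libzmq: `ToyReqSocket` and both helper functions are hypothetical names invented for illustration, and assigning a fresh object merely stands in for the close, reopen and reconnect sequence on a real REQ socket.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Toy model of the REQ state machine (NOT libzmq): send and recv must alternate.
class ToyReqSocket {
public:
    // Sending is only legal when no reply is outstanding.
    void send(const std::string& msg) {
        if (awaiting_reply_)
            throw std::logic_error("EFSM: send while a reply is still expected");
        last_request_ = msg;
        awaiting_reply_ = true;
    }
    // Receiving is only legal after a send.
    void recv() {
        if (!awaiting_reply_)
            throw std::logic_error("EFSM: recv without a pending request");
        awaiting_reply_ = false;
    }
    bool awaiting_reply() const { return awaiting_reply_; }

private:
    bool awaiting_reply_ = false;
    std::string last_request_;
};

// True if a second send on the same socket is rejected after a lost reply.
bool second_send_fails_without_reset() {
    ToyReqSocket sock;
    sock.send("request 1");      // the reply never arrives (server offline)
    try {
        sock.send("request 2");  // illegal: the socket still expects a reply
    } catch (const std::logic_error&) {
        return true;
    }
    return false;
}

// True if replacing the socket (close + reopen) makes sending legal again.
bool send_succeeds_after_reopen() {
    ToyReqSocket sock;
    sock.send("request 1");
    assert(sock.awaiting_reply());
    sock = ToyReqSocket();       // stands in for closing and reopening the socket
    try {
        sock.send("request 2");
    } catch (const std::logic_error&) {
        return false;
    }
    return true;
}
```

In this model both helpers return true: the stale socket rejects the second send, while the replaced one accepts it.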
Thank you, now it's very clear to me what caused the EFSM error.
Which MQL4 binding are you referring to? I found https://github.com/dingmaotu/mql-zmq but that doesn't use cppzmq.
Yes, I'm using that binding. It seems to use the C binding, since the code imports C functions, or at least it appears to:
#import "libzmq.dll"
//+------------------------------------------------------------------+
//| Sockets |
//+------------------------------------------------------------------+
intptr_t zmq_socket(intptr_t context,int type);
int zmq_close(intptr_t s);
int zmq_bind(intptr_t s,const char &addr[]);
int zmq_connect(intptr_t s,const char &addr[]);
int zmq_unbind(intptr_t s,const char &addr[]);
int zmq_disconnect(intptr_t s,const char &addr[]);
int zmq_send(intptr_t s,const uchar &buf[],size_t len,int flags);
int zmq_send_const(intptr_t s,const uchar &buf[],size_t len,int flags);
int zmq_recv(intptr_t s,uchar &buf[],size_t len,int flags);
int zmq_socket_monitor(intptr_t s,const char &addr[],int events);
//+------------------------------------------------------------------+
//| Message |
//+------------------------------------------------------------------+
int zmq_msg_send(zmq_msg_t &msg,intptr_t s,int flags);
int zmq_msg_recv(zmq_msg_t &msg,intptr_t s,int flags);
//+------------------------------------------------------------------+
//| I/O multiplexing |
//+------------------------------------------------------------------+
int zmq_poll(PollItem &items[],int nitems,long timeout);
//+------------------------------------------------------------------+
//| Message proxying |
//+------------------------------------------------------------------+
int zmq_proxy(intptr_t frontend_ref,intptr_t backend_ref,intptr_t capture_ref);
int zmq_proxy_steerable(intptr_t frontend_ref,intptr_t backend_ref,intptr_t capture_ref,intptr_t control_ref);
#import
When I hit this problem, I had started coding in MQL with this binding as is, and then it became unstable.
The LPclient is based on czmq, which is a higher-level abstraction around libzmq. The MQL binding, however, like cppzmq, sits at the abstraction level of libzmq itself.
So, if I want to use the czmq abstractions in MQL, will I need to compile czmq as a DLL and then write a binding for it?
SOLUTION-1
I have wrapped the lazy pirate pattern into a .dll and made an MQL binding for it.
It can be found here
The readme describes what is provided.
If I find a way to make the libzmq.dll calls correctly without crashing MetaTrader under Wine, I will update this.
UPDATE (unconfirmed hypothesis)
I will cover this in more detail in the next update.
This problem seems to be related to "the art" of passing strings and objects across DLL boundaries as imported function parameters, and to the classic null-terminated C string issue: the terminating null is cut off when an incorrect size parameter is used to copy the contents.
As you may remember from the strcpy() documentation:
Also, the buffer for a string returned by the DLL must be allocated in the caller's memory before the call crosses the DLL boundary, with one extra byte for the null terminator.
Failing to do so causes memory inconsistencies that lead to read/write access violations on both sides: in the DLL internals and in the caller's main thread.
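The caller-allocates convention described above can be sketched in standard C++ as follows. This is only an illustration under stated assumptions: `copy_reply` and `fetch_reply` are hypothetical names, not part of libzmq or the MQL binding, and the reply is passed in as a parameter purely to keep the example self-contained (a real DLL would hold it internally).

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical DLL-side export: copies the reply into a caller-owned buffer.
// Returns the number of bytes needed INCLUDING the null terminator, so the
// caller can detect truncation. Always null-terminates when out_cap > 0.
std::size_t copy_reply(const std::string& reply, char* out, std::size_t out_cap) {
    const std::size_t needed = reply.size() + 1;  // +1 for '\0'
    if (out != nullptr && out_cap > 0) {
        // Copy at most out_cap - 1 bytes so the terminator always fits.
        const std::size_t n =
            (reply.size() < out_cap - 1) ? reply.size() : out_cap - 1;
        std::memcpy(out, reply.data(), n);
        out[n] = '\0';
    }
    return needed;
}

// Caller side: query the size, allocate in caller memory, then copy.
std::string fetch_reply(const std::string& reply) {
    const std::size_t needed = copy_reply(reply, nullptr, 0);
    std::vector<char> buf(needed);  // payload plus one byte for '\0'
    const std::size_t written = copy_reply(reply, buf.data(), buf.size());
    assert(written == needed);      // the buffer was large enough
    return std::string(buf.data());
}
```

The two-call shape (first ask for the size, then pass a buffer of that size) keeps allocation and deallocation on the caller's side of the boundary, which is the point made above.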
SOLUTION-0
This is going to be long, but I want to share my experience, as it might be useful to others with similar issues.
The short answer is:
That was a program logic issue, not a ZMQ issue.
Also, I'm not sure, but there seems to be a better, more modern approach to I/O multiplexing: using the zmq_poller API instead of calling zmq_poll directly.
The lazy pirate pattern client here is implemented as a finite state machine.
In this approach all pattern internals are encapsulated in one object, whose constructor accepts a ZMQ address string (specifying the worker address to connect to) and which performs the message transaction with the method "string sendTX(string)".
That method returns a string with the body of the message received from the server, or an empty string on communication failure (for simplicity).
sendTX() reuses the code from the C++ Lazy Pirate client in the reliable REQ/REP patterns of the guide examples.
Originally, the program had 4 exit states depending on the transaction status.
The main issue here is the unhandled "clear exit" case for the "abort, retries exceeded" exit condition.
It caused an EFSM error on the reused socket when a new message was queued for transmission after the failed exit.
According to sigiesec's comment, there is no way for a socket to know that a message won't arrive.
So it is up to you to decide what to do in that case.
So, the sendTX() method should reset the socket state when it gives up after exceeding the retry limit, just as it does when retrying after a failure.
It also needs to decide what to do when a malformed reply arrives, and all exceptions must be caught.
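As a rough sketch of that fix, the retry loop below resets the socket state on every timeout, including the last one. It uses only the standard library: `LazyPirateClient`, the `Exchange` callback and `reset_socket()` are hypothetical stand-ins for the real send, poll and close/reopen steps, not the code from the linked sample, and malformed-reply handling is left out.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <utility>

// Toy lazy pirate client (NOT libzmq). The Exchange callback stands in for
// "send the request, poll with a timeout, receive the reply": it returns true
// and fills `reply` on success, or false on timeout.
class LazyPirateClient {
public:
    using Exchange = std::function<bool(const std::string&, std::string&)>;

    LazyPirateClient(Exchange exchange, int max_retries)
        : exchange_(std::move(exchange)), max_retries_(max_retries) {}

    // Returns the reply body, or an empty string on communication failure.
    std::string sendTX(const std::string& request) {
        // A REQ socket still awaiting a reply would reject this send with EFSM.
        assert(!awaiting_reply_);
        int retries_left = max_retries_;
        while (retries_left > 0) {
            awaiting_reply_ = true;       // request is now in flight
            std::string reply;
            if (exchange_(request, reply)) {
                awaiting_reply_ = false;  // reply received, socket is reusable
                return reply;
            }
            --retries_left;
            // Reset on EVERY timeout, including the last one (the abandon
            // case), so the next sendTX() starts from a clean state.
            reset_socket();
        }
        return std::string();             // abandoned: retries exceeded
    }

    int resets() const { return resets_; }

private:
    // Stands in for closing the socket and creating and connecting a new one.
    void reset_socket() {
        awaiting_reply_ = false;
        ++resets_;
    }

    Exchange exchange_;
    int max_retries_;
    bool awaiting_reply_ = false;
    int resets_ = 0;
};
```

With this shape, a call that gives up still leaves the object ready for the next transaction, which the version described in this issue did not.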
There is working sample code without exception-handling logic (for simplicity): multiworker_lppclient.cpp
It can be built directly with g++ from the command line on Linux.
Now, the same code, ported to MQL (the standard language of MetaTrader) and related to this issue, shows unexpected behaviour somewhere between socket creation and the socket.poll() and socket.send() operations.
I suspect something in the binding causes it to throw an unhandled exception, but in this case exceptions can't be caught, because the language doesn't support try/catch/finally constructs, or at least I don't know how to catch them.
To make things worse, the ZMQ API is imported from a Windows DLL into the EA, which runs as a separate thread while the agent executes; when it crashes, it also crashes the main thread, exiting the app and leaving no logs about what kind of exception was thrown in the child process.
UPDATE-1
See sigiesec's comment below regarding the strict state model of REQ sockets.
UPDATE-0
I found a workaround for the problem.
When linger is set to 0, the socket is configured not to wait at close time.
Then the socket object instance seems to "delete itself" after a while, even when zmq_close() isn't called explicitly.
Subsequent REQ/REP exchanges on that instance lead to unexpected behaviour.
I worked around this by instantiating a new socket object from the context before calling send_s(), once per "transaction".
The zmq_setsockopt documentation gives no clear information about this behaviour.
Here is the apparently working solution.
Issue description
I'm trying to extend the "lazy pirate" client pattern to do a reliable REQ/REP.
If I use plain synchronous REQ/REP communication, everything works fine, but when I incorporate the lazy pirate behaviour into the client and test it against the sample lazy pirate server from the examples, things go wrong.
When requests do not time out, it seems to do the job, but when a reply is delayed long enough, this piece of code crashes at random.
This sample code does nothing in the final "abandon, retries exceeded" part of the algorithm; it simply returns control to the main loop and tries again.
The lpclient C++ REQ/REP routine has been copied here with minor modifications.
Environment
g++ (Ubuntu 7.4.0-1ubuntu1~18.04) 7.4.0
I compiled the provided example directly from the command line:
Then I ran it with:
./multiworker_lpclient
libzmq.dll version 4.3.2.0 compiled in VS2019 (see this issue)
Minimal test code / Steps to reproduce the issue
multiworker_lpclient.cpp
What's the actual result? (include assertion message & call stack if applicable)
gdb "run" output
Stack Trace:
What's the expected result?
Reliable communication when one or both workers have delayed replies.
Reliable communication when one or both workers crash and some external procedure restarts them.
Stable application control, when things don't go as expected.