
"Lazy pirate" client pattern: socket instance EFSM error & memory inconsistency when binding through dll #3495

VideoMem opened this issue May 7, 2019 · 6 comments


VideoMem commented May 7, 2019

Solution-1

I have wrapped the Lazy Pirate pattern into a .dll and made an MQL binding for it.
It can be found here.

What is provided is described in the readme.
If I find a way to make the correct libzmq.dll calls without crashing MetaTrader under Wine, I will update this.

UPDATE (unconfirmed hypothesis)

I will cover this in more detail in next update.
This problem seems to be related to "the art" of passing strings and objects across DLL boundaries via imported function parameters, and to the classic null-terminated C string issue: the terminating null gets truncated when an incorrect size parameter is used while copying the contents.

As the strcpy() documentation reminds us:

To avoid overflows, the size of the array pointed by destination shall be long enough to contain the same C string as source (including the terminating null character), and should not overlap in memory with source.

Also, the buffer for a string returned from the DLL must be allocated in the caller's memory before the call crosses the DLL boundary, with one extra byte for the null terminator.
Failure to do so causes memory inconsistencies that lead to read/write access violations on both sides: inside the DLL and in the caller's main thread.

SOLUTION-0

This is going to be a long read, but I want to share my experience, as it might be useful to others with similar issues.
The short answer (TL;DR) is:
it was a program-logic issue, not a ZMQ issue.

Also (I'm not sure about this), there seems to be a better, more modern approach to I/O multiplexing using the zmq_poller API instead of calling zmq_poll directly.

The lazy pirate client here is implemented as a finite state machine.

In this approach, all pattern internals are encapsulated in one object that takes a ZMQ address string in its constructor (the worker address to connect to) and performs the message transaction with the method string sendTX(string).
That method returns the body of the message received from the server, or an empty string on communication failure (for simplicity).
sendTX() reuses the code from the Lazy Pirate client in the C++ reliable REQ/REP examples from the guide.
Originally, the program had four exit states depending on the transaction status:

  • It behaves as a standard synchronous REQ/REP client when all responses arrive in time.
  • It resets the internal state of the REQ socket when a transaction times out and is retried.
  • It prints an error when a malformed message arrives, according to some filter criteria.
  • It EXITs the program when the maximum number of transaction retries has been exceeded (not the case here).

The main issue here is the unhandled "clean exit" case on the "abandon, retries exceeded" condition.
It caused an EFSM error on the recycled socket when a new message was queued for transmission after the failed exit.
According to sigiesec's comment, there is no way for a socket to know that a message won't arrive.
So it is up to you to decide what to do in that case.

So the sendTX() method should reset the socket state when abandoning after the retry limit, just as it does when retrying after a failure.
It also needs to decide what to do when a malformed reply arrives, and all exceptions must be caught.

There is working sample code without the exception logic (for simplicity): multiworker_lpclient.cpp

It can be built directly with g++ from the command line on Linux:

$ g++ -ansi multiworker_lpclient.cpp -lzmq -o multiworker_lpclient
$ ./multiworker_lpclient

Now, the same code ported to MQL (the MetaTrader language) and linked as in this issue shows unexpected behaviour somewhere between socket creation and the socket.poll() and socket.send() operations.
I suspect something in the binding causes it to throw an unhandled exception, but in this case exceptions cannot be caught: the language has no try/catch/finally, or at least I don't know how to catch them.
To make things worse, the ZMQ API is imported from a Windows DLL into the EA, which runs in a separate thread during execution of the agent; when it crashes, it takes down the main thread as well, exiting the app and leaving no logs about what kind of exception was thrown in the child process.

UPDATE-1

See sigiesec's comment below regarding the strict state model of REQ sockets.

UPDATE-0

I found a workaround for the problem.
When linger is set to 0, the socket is configured not to wait at close time.
Then the socket instance appears to "delete itself" after a while, even when zmq_close() isn't called explicitly.
Subsequent REQ/REPs on that instance lead to unexpected behaviour.
My workaround is to instantiate a new socket object from the context before every s_send() call, i.e. for every "transaction".
The zmq_setsockopt documentation has no clear information about this behaviour.

Here is the apparently working solution.

Issue description

I'm trying to extend the "lazy pirate" client pattern to do reliable REQ/REP.
If I use synchronous REQ/REP as the base, everything works fine, but when I incorporate the lazy pirate behaviour into the client and test it against the sample lazy pirate server from the examples, everything goes weird.
When requests do not time out it seems to do the job, but when a reply is delayed long enough, this piece of code crashes more or less at random.
This sample code does nothing in the final "abandon, retries exceeded" part of the algorithm; it simply returns control to the main loop and tries again.
The lpclient C++ REQ/REP routine has been copied here with minor modifications.

Environment

g++ (Ubuntu 7.4.0-1ubuntu1~18.04) 7.4.0

I compiled the example provided directly from command line:

g++ -ansi multiworker_lpclient.cpp -lzmq -o multiworker_lpclient

Then I run it with:
./multiworker_lpclient

  • libzmq version : 4.1.2
    libzmq.dll version 4.3.2.0 compiled in VS2019 (see this issue)
  • OS: linux 5.0.7, ubuntu bionic

Minimal test code / Steps to reproduce the issue

multiworker_lpclient.cpp

//
//  Lazy Pirate client
//  Use zmq_poll to do a safe request-reply
//  To run, start lpserver and then randomly kill/restart it
//  This example needs two workers on different addresses
//
#include "zhelpers.hpp"
#include <sstream>
using namespace std;

#define REQUEST_TIMEOUT     2500    //  msecs, (> 1000!)
#define REQUEST_RETRIES     3       //  Before we abandon

class WorkerClientBase {
    protected:
        string zmq_address; 
        zmq::socket_t *client;
        zmq::context_t *context;
        void init();
        void close();
        void connect();
    public:
        virtual string getName() { return "WorkerClientBase"; }
        void setAddr(string addr) { zmq_address = addr; }
        string sendTX(string payload);
        WorkerClientBase() { context = new zmq::context_t(1); }
        ~WorkerClientBase();
};


class WorkerA: public WorkerClientBase {
    public:
        WorkerA(string addr);
        string getName() { return "WorkerA"; }
        void txSomething();
};

class WorkerB: public WorkerClientBase {
    public:
        WorkerB(string addr);
        string getName() { return "WorkerB"; }
        void txSomething();

};

void WorkerClientBase::init() {
    client = new zmq::socket_t (*context, ZMQ_REQ);
}

void WorkerClientBase::close() {
    delete client;
}

WorkerClientBase::~WorkerClientBase() {
    close();
    delete context;
}

void WorkerClientBase::connect() {
    cout << "I: connecting to server…" << endl;
    init();
    client->connect (zmq_address);

    //  Configure socket to not wait at close time
    int linger = 0;
    client->setsockopt (ZMQ_LINGER, &linger, sizeof (linger));
}

string WorkerClientBase::sendTX(string payload) {
    int retries_left = REQUEST_RETRIES;
    string reply = "";

    while (retries_left) {
        stringstream request;
        request << payload;
        s_send (*client, request.str());
        sleep (1);

        bool expect_reply = true;
        while (expect_reply) {
            //  Poll socket for a reply, with timeout
            zmq::pollitem_t items[] = { { *client, 0, ZMQ_POLLIN, 0 } };
            zmq::poll (&items[0], 1, REQUEST_TIMEOUT);

            //  If we got a reply, process it
            if (items[0].revents & ZMQ_POLLIN) {
                //  We got a reply from the server, must match sequence
                reply = s_recv (*client);
                if (reply.size() > 0) {
                    cout << "I: server replied OK (" << reply.size() << ") bytes" << endl;
                    retries_left = 0;
                    expect_reply = false;
                }
                else {
                    cout << "E: malformed reply from server: " << reply << endl;
                }
            }
            else
            if (--retries_left == 0) {
                cout << "E: server seems to be offline, abandoning" << endl;
                expect_reply = false;
                break;
            }
            else {
                cout << "W: no response from server, retrying…" << endl;
                //  Old socket will be confused; close it and open a new one
                close();
                connect();
                //  Send request again, on new socket
                s_send (*client, request.str());
            }
        }
    }
    return reply;
}

WorkerA::WorkerA(string addr):WorkerClientBase() {
    setAddr(addr);
    connect();
}

WorkerB::WorkerB(string addr):WorkerClientBase() {
    setAddr(addr);
    connect();
}

void WorkerA::txSomething() {
    string payload = "#this is a R script";
    printf("reply body: %s\n", sendTX(payload).c_str());
}

void WorkerB::txSomething() {
    string payload = "{ \"some_json_log\": { \"SYMBOL\": \"EURUSD\", \"MAGIC\": \"42\", \"etc ...\":\"etc\" }}";
    printf("reply body: %s\n", sendTX(payload).c_str());
}

WorkerA* workerA;
WorkerB* workerB;

void OnInit() {
   workerA = new WorkerA("tcp://localhost:5555");
   workerB = new WorkerB("tcp://localhost:5566");
}

int tickC = 0;
void OnTick() {
    printf("On tick loop number %d:\n", tickC);
    workerA->txSomething();
    workerB->txSomething();
    tickC++;
}

int main () {
    OnInit();
    while(true) {
        OnTick();    
    }
    return 0;
}

What's the actual result? (include assertion message & call stack if applicable)

gdb "run" output

Starting program: /home/user/Develop/zmqTests/multiworker_lpclient 
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
I: connecting to server…
[New Thread 0x7ffff6005700 (LWP 20700)]
[New Thread 0x7ffff5804700 (LWP 20701)]
I: connecting to server…
On tick loop number 0:
I: server replied OK (#this is a R script)
reply body: #this is a R script
I: server replied OK ({ "some_json_log": { "SYMBOL": "EURUSD", "MAGIC": "42", "etc ...":"etc" }})
reply body: { "some_json_log": { "SYMBOL": "EURUSD", "MAGIC": "42", "etc ...":"etc" }}
On tick loop number 1:
W: no response from server, retrying…
I: connecting to server…
W: no response from server, retrying…
I: connecting to server…
E: server seems to be offline, abandoning
reply body: 
I: server replied OK ({ "some_json_log": { "SYMBOL": "EURUSD", "MAGIC": "42", "etc ...":"etc" }})
reply body: { "some_json_log": { "SYMBOL": "EURUSD", "MAGIC": "42", "etc ...":"etc" }}
On tick loop number 2:
terminate called after throwing an instance of 'zmq::error_t'
  what():  Operation cannot be accomplished in current state

Thread 1 "multiworker_lpc" received signal SIGABRT, Aborted.
__GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
51	../sysdeps/unix/sysv/linux/raise.c: No such file or directory.

Stack Trace:

#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
#1  0x00007ffff71ea801 in __GI_abort () at abort.c:79
#2  0x00007ffff783f8b7 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#3  0x00007ffff7845a06 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#4  0x00007ffff7845a41 in std::terminate() ()
   from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#5  0x00007ffff7845c74 in __cxa_throw ()
   from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#6  0x00005555555584e8 in zmq::socket_t::send(zmq::message_t&, int) ()
#7  0x00005555555566b3 in s_send(zmq::socket_t&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) ()
#8  0x00005555555572eb in WorkerClientBase::sendTX(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) ()
#9  0x0000555555557869 in WorkerA::txSomething() ()
#10 0x0000555555557c17 in OnTick() ()
#11 0x0000555555557c77 in main ()

What's the expected result?

Reliable communication when one or both workers have delayed replies.
Reliable communication when one or both workers crash and some external procedure restarts them.
Stable application control when things don't go as expected.

sigiesec (Member) commented May 8, 2019

I don't quite understand your comment regarding the workaround.

I assume you based your code on http://zguide.zeromq.org/cpp:lpclient. In the original code, the program exits after regarding the server as offline, and the socket is no longer used.

If you incorporate this code into a program that continues afterwards, the REQ socket must be closed and reopened, as in the case of retrying.

The REQ socket type has a strict state model requiring alternating sending and receiving of a message. If you don't follow this, you get an EFSM error as you encountered (which is thrown as an exception when using cppzmq, which must be caught). The socket cannot know that there will be no reply.

VideoMem (Author) commented May 8, 2019

Thank you, now it's very clear to me what caused the EFSM error.
The workaround I described is close to that, but not exactly the same: in it, I'm closing and reopening the socket regardless of the state of the previous REQ.
Now I'm fixing and testing it.
I encountered this while dealing with the MQL4 binding of this library.
In MQL4 there is no way to catch an exception; it simply "crashes silently" without any further debug information.
I will update this when done; it might be helpful to others.

sigiesec (Member) commented

What mql4 binding do you refer to? I found https://github.com/dingmaotu/mql-zmq but that doesn't use cppzmq.

VideoMem (Author) commented

Yes, I'm using that binding. It seems to use the C API, as the code imports C functions, or at least it appears to.

#import "libzmq.dll"
//+------------------------------------------------------------------+
//| Sockets                                                          |
//+------------------------------------------------------------------+
intptr_t zmq_socket(intptr_t context,int type);
int zmq_close(intptr_t s);
int zmq_bind(intptr_t s,const char &addr[]);
int zmq_connect(intptr_t s,const char &addr[]);
int zmq_unbind(intptr_t s,const char &addr[]);
int zmq_disconnect(intptr_t s,const char &addr[]);
int zmq_send(intptr_t s,const uchar &buf[],size_t len,int flags);
int zmq_send_const(intptr_t s,const uchar &buf[],size_t len,int flags);
int zmq_recv(intptr_t s,uchar &buf[],size_t len,int flags);
int zmq_socket_monitor(intptr_t s,const char &addr[],int events);
//+------------------------------------------------------------------+
//| Message                                                          |
//+------------------------------------------------------------------+
int zmq_msg_send(zmq_msg_t &msg,intptr_t s,int flags);
int zmq_msg_recv(zmq_msg_t &msg,intptr_t s,int flags);
//+------------------------------------------------------------------+
//| I/O multiplexing                                                 |
//+------------------------------------------------------------------+
int zmq_poll(PollItem &items[],int nitems,long timeout);
//+------------------------------------------------------------------+
//| Message proxying                                                 |
//+------------------------------------------------------------------+
int zmq_proxy(intptr_t frontend_ref,intptr_t backend_ref,intptr_t capture_ref);
int zmq_proxy_steerable(intptr_t frontend_ref,intptr_t backend_ref,intptr_t capture_ref,intptr_t control_ref);
#import

When I hit this problem, I had started to code in MQL with this binding "as is", and it went unstable.
I prefer C++ over C for finite-state-machine kinds of code, so I switched to cppzmq to write the same program and test the concept at the C++ level.
I found that the same program in C++ works pretty well, so I changed plans and wrapped this program in a C++ DLL that can be included in MT4.
Now it is working, at least as stably as when I started coding in MQL with synchronous REQ/REP.
During the development of the "Lazy pirate" DLL binding for MQL, I faced the "passing arguments issue" at the DLL boundary, with the same symptoms.
So I hypothesized that the same thing was happening in the original MQL binding as well.
But now I've reviewed the C version of the lpclient and found it uses zsocket_destroy (ctx, client);. There is no socket_destroy in the MQL binding.
I suspect I missed that very important part of this whole thing.

sigiesec (Member) commented

The lpclient is based on czmq, which is a higher-level abstraction around libzmq. The mql binding, however (like cppzmq), is at the abstraction level of libzmq.

VideoMem (Author) commented May 15, 2019

So, if I want to use the czmq abstractions from MQL, will I need to compile czmq as a DLL and then write a binding for it?
I found that MQL is a little tricky with the types passed through DLL boundaries. Strings work well when passed as a pointer to a char array (as shown above, const uchar &buf[]), together with an integer length parameter to avoid overflows when the DLL code does a strcpy.
The null terminator is sometimes not passed, so the contents need to be copied with strncpy and the null termination added afterwards (I have not confirmed this, but if it is true, there is a major concern here).
When a char array returns from the DLL, the same applies at the MQL level, so all returns should carry a length parameter to avoid overflows. You cannot rely on finding a null terminator in the returned char array.
Integers seem to be int32_t on x86 and int64_t on x64.
Structures seem to suffer byte-misalignment issues between x86 and x64, as the integer types change.
The string type seems to map to wchar_t and needs marshalling.
Character encoding? The references say it is UTF-16LE, but it seems to depend on the MQL locale setting, and this needs to be taken into account when converting MQL strings to char arrays. Right now I avoid messing with it by copying the strings at byte level in the DLL and dealing with encoding in the server code.
I have not explored the czmq interface enough to see whether it fits these restrictions, but I'll give it a try; it looks very nice.
