Description
Since we upgraded our app from 0.10.38 to v6, we have experienced a lot of problems with IPC messaging.
Essentially, we have a web application with a few workers to handle all the requests. The workers contain caches to speed up requests, and these caches are synchronized between processes with IPC messaging. We were also using log4js as a logging library with the clustered appender, which uses IPC to send all child logs back to the master so that a single process handles the logs.
All was working fine under 0.10.38, but when we upgraded to 6.0.0 (and then 6.2.0) our app kept crashing under various circumstances.
We soon realized that if we sent too much data (or sent it too fast) through IPC, our application froze.
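One way to avoid sending "too much, too fast" is to hand the next message to the channel only after the previous send has completed. The sketch below assumes the optional callback of `process.send` (available in Node 4 and later, not in 0.10) as the completion signal. `createThrottledSender` is a hypothetical helper, not something from our app, and I have not verified that it prevents the freeze described here.

```javascript
'use strict';

// Hypothetical helper: queues messages and passes the next one to `send`
// only after the callback for the previous one has fired.
// `send` is any function (message, callback) that calls `callback`
// asynchronously, e.g. a thin wrapper around process.send.
function createThrottledSender(send) {
  var queue = [];
  var inFlight = false;

  function pump() {
    if (inFlight || queue.length === 0) return;
    inFlight = true;
    send(queue.shift(), function () {
      inFlight = false;
      pump();
    });
  }

  return {
    // Queue a message; it is sent immediately if nothing is in flight.
    push: function (msg) { queue.push(msg); pump(); },
    // Number of messages still waiting to be sent.
    pending: function () { return queue.length; }
  };
}
```

In a worker it could be wired as `createThrottledSender(function (msg, cb) { process.send(msg, cb); })`. The queue still grows without bound if messages are produced faster than they drain, so `pending()` would need to be watched.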
We began refactoring our entire app to reduce its use of IPC to the strict minimum:
- We created a custom logging process that receives logs over TCP instead of IPC.
- We refactored our entire master/worker process so the workers could load all the information on their own, and restricted IPC messages to "trigger" messages instead of sending all the data.
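The "trigger" idea from the second point can be sketched roughly as follows. The message shape (`type`, `key`) and the `loadFromSource` function are assumptions made for illustration, not our actual code: the point is only that the key travels over IPC while each worker reloads the data on its own.

```javascript
'use strict';

// Hypothetical message shape: only the cache key goes over IPC,
// never the cached data itself.
function makeTrigger(key) {
  return { type: 'cache:reload', key: key };
}

// Worker side: on a trigger, reload the entry locally.
// `cache` is a plain object; `loadFromSource` is a placeholder for however
// a worker fetches the data by itself (database, file, HTTP...).
function handleTrigger(msg, cache, loadFromSource) {
  if (!msg || msg.type !== 'cache:reload') return false;
  cache[msg.key] = loadFromSource(msg.key);
  return true;
}

// Wiring inside a worker (not called here).
function attachToWorker(proc, cache, loadFromSource) {
  proc.on('message', function (msg) {
    handleTrigger(msg, cache, loadFromSource);
  });
}
```

The master would then broadcast something like `worker.send(makeTrigger('user:1'))` to each worker, so every IPC message stays a few bytes long regardless of the size of the cached data.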
All those changes are good for our application, since they reduced the dependencies between master and workers and gave a better separation of responsibilities. But I still see this as a flaw in Node.js: IPC is a fairly simple communication mechanism for exchanging information between workers, yet it seems so fragile now that we are afraid of using it.
I attached a simple script that reproduces the problem. It is not a real scenario, just a test case I created to reproduce the problem of the application that stops responding.
On my laptop, the app crashes at startup (or before the first log) with 5 forks (maybe because I have 4 physical cores).
At first I tested with 3 workers and it froze after 5-10 minutes (CPU usage of all processes goes down to 0 and there is no more log output).
- If I remove the "bacon ipsum" from the worker message, it works (might freeze after a while).
- If I increase the message interval from 1ms to 10ms, it works (might freeze after a while).
- If I spawn only 4 workers, it works (will probably freeze after 5-10 minutes).
- If I execute it with 0.10.38, it works (for as long as I ran it).
So if you play with the timings, size of messages and/or number of forks, you should be able to reproduce the problem.
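For readers without the attachment, the three variables above (message size, interval, number of forks) can be sketched as below. This is a stand-in written for illustration, not the attached script: the constants, the `--run` flag and the `buildMessage` helper are all assumptions, and I make no claim that this exact version freezes.

```javascript
'use strict';
var cluster = require('cluster');

// Assumed knobs, mirroring the variables discussed above.
var WORKERS = 5;        // number of forks
var INTERVAL_MS = 1;    // delay between messages, per worker
var FILLER = 'Bacon ipsum dolor amet '; // stand-in for the "bacon ipsum" text

// Builds one worker message; `repeat` controls the payload size.
function buildMessage(id, seq, repeat) {
  return { worker: id, seq: seq, text: new Array(repeat + 1).join(FILLER) };
}

// Master: fork the workers, count incoming messages, log once per second.
function runMaster() {
  var received = 0;
  for (var i = 0; i < WORKERS; i++) {
    cluster.fork().on('message', function () { received++; });
  }
  setInterval(function () {
    console.log(new Date().toISOString(), 'messages received:', received);
  }, 1000);
}

// Worker: send a message to the master every INTERVAL_MS.
function runWorker() {
  var seq = 0;
  setInterval(function () {
    process.send(buildMessage(cluster.worker.id, seq++, 50));
  }, INTERVAL_MS);
}

// Starts only when requested explicitly: node repro.js --run
// (cluster passes the same arguments on to the forked workers).
if (process.argv.indexOf('--run') !== -1) {
  if (cluster.isMaster) runMaster(); else runWorker();
}
```

A freeze would show up as the once-per-second log line stopping while the processes are still alive.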
One thing I observed is that IPC messaging seems to have improved in performance big time from 0.10 to 6. If I run the test with 3 workers for 10 seconds with 0.10.38, the master handles only 1902 messages; with 6.3.0, in the same 10 seconds, the master handles 25514 messages.
I also tested it with 4.4.7: it freezes at startup with 5 forks and after 4 minutes with 4 forks.
My specs:
- Node.js 6.3.0, Windows, 64-bit (bug)
- Node.js 6.2.0, Windows, 64-bit (bug)
- Node.js 4.4.7, Windows, 64-bit (bug)
- Node.js 0.10.38, Windows, 64-bit (OK)