
WIP - Improved HA stability #9854

Closed
wants to merge 25 commits into 3.1.x from 3.1/ha_db_stability

Conversation

timw
Contributor

@timw timw commented Aug 14, 2022

What does this PR do?

This is an in-progress set of changes we're working on to increase the stability of OrientDB in HA/distributed deployments.
We're currently testing these fixes in pre-production before moving them to production, but we're opening them up now for discussion, both to see whether there are better ways to fix these issues and to allow broader testing to catch any other issues they might introduce (as we don't exhaustively test all features/APIs).

Motivation

We have encountered multiple outages in production due to various stability issues related to distributed operation.
These issues are numerous and somewhat interacting, and required the development of a dedicated stress test suite to replicate in development.

A full list of the unique issues encountered during testing is too long to enumerate, covering many (many) different exceptions, but a general overview of the core issues follows:

  • there is no way to remotely (via API) verify the HA status of the servers/databases in the community edition. This is remedied by a small update to the HA STATUS -servers query so that it effectively provides what the enterprise agent APIs can achieve.
  • database opening is not coordinated with the distributed plugin status. If there is client transaction load on the database during node startup, some internal structures (notably the shared context) can be initialised in the window between server startup and the distributed plugin completing its startup, resulting in inconsistent initialisation and subsequent transaction failures.
  • database opening does not effectively check that the distributed server/database status is online, resulting in access to inconsistent/uninitialised state and subsequent exceptions during transactions.
  • remote full database install sometimes fails with corrupted zip streams. We suspect this is related to a stream-copying bug, which is patched here and appears to have fixed that issue.
  • construction of the DistributedDatabaseImpl was not thread-safe, and allowed uninitialised object structures to be accessed, causing NPEs at runtime.
  • database creation suffers from similar issues to database opening when there is load on the server, so checks have been added to prevent database creation before the HA status is stable (a minimal sketch of this kind of guard follows this list).
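To make the last point concrete, here is a minimal sketch of the kind of guard described above. The class and method names (DistributedStatusGuard, awaitOnlineOrFail) are hypothetical, not the code in this branch; the online check is supplied by the caller:

```java
// Minimal sketch, not the PR's actual code: DistributedStatusGuard and its method are
// hypothetical names. It only shows the shape of the guard - refuse to open or create a
// database until the caller-supplied check reports the distributed plugin as ONLINE.
import java.util.concurrent.TimeUnit;
import java.util.function.BooleanSupplier;

public final class DistributedStatusGuard {

  private final BooleanSupplier distributedOnline; // e.g. plugin installed && node/database ONLINE
  private final long timeoutMillis;

  public DistributedStatusGuard(BooleanSupplier distributedOnline, long timeoutMillis) {
    this.distributedOnline = distributedOnline;
    this.timeoutMillis = timeoutMillis;
  }

  /** Blocks until the distributed status is ONLINE, or fails after the timeout. */
  public void awaitOnlineOrFail(String databaseName) {
    final long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
    while (!distributedOnline.getAsBoolean()) {
      if (System.nanoTime() - deadline > 0) {
        throw new IllegalStateException(
            "Distributed plugin not ONLINE for '" + databaseName + "'; refusing to open/create");
      }
      try {
        Thread.sleep(50); // simple polling back-off; a real fix would hook plugin status callbacks
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        throw new IllegalStateException("Interrupted while waiting for distributed status", e);
      }
    }
  }
}
```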

Related issues

There's a patch in this PR that artificially widens the distributed plugin startup time - we found that this makes the production issues we observe much easier to reproduce. The root cause is that the distributed plugin makes TCP connections to remote nodes during startup; in our production case these are cross-AZ in AWS and thus have higher latencies than a local data centre, which widens the window in which some of the startup issues occur.

During testing/resolving these issues, multiple enhancements were made to logging/tracing:

  • a facility to name and trace async tasks has been added, allowing failed background tasks to be identified precisely in logs and making it much easier to debug error conditions (a minimal sketch follows this list).
  • the log format used when an exception is logged has been corrected - currently it omits the leading newline and the formatting, which causes error messages to be smashed into the preceding log message.
  • various individual cases of incorrect logging of errors were fixed.
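A minimal sketch of the task-naming idea (the NamedTasks helper below is hypothetical, not one of the PR's classes): each submitted task carries an ID that is logged if the task fails, so the failure can be traced back to its origin:

```java
// Minimal sketch of naming async tasks so failures can be correlated in logs.
// NamedTasks is a hypothetical helper, not part of this PR or OrientDB.
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.logging.Level;
import java.util.logging.Logger;

public final class NamedTasks {
  private static final Logger LOG = Logger.getLogger(NamedTasks.class.getName());

  /** Wraps a task so any failure is logged with the supplied task ID before propagating. */
  public static Runnable named(String taskId, Runnable task) {
    return () -> {
      try {
        task.run();
      } catch (RuntimeException | Error e) {
        LOG.log(Level.SEVERE, "Async task '" + taskId + "' failed", e);
        throw e;
      }
    };
  }

  public static void main(String[] args) {
    ExecutorService executor = Executors.newSingleThreadExecutor();
    executor.execute(named("db=mydb op=full-sync", () -> {
      throw new IllegalStateException("simulated failure"); // logged together with the task ID
    }));
    executor.shutdown();
  }
}
```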

The stress test tool we have developed is now available at https://github.com/indexity-io/orientdb-stress. It's open-source licensed, but we've kept it distinct from OrientDB because it needs to run across multiple versions/patches, which doesn't work well in-tree. It currently requires some of the patches in this branch to run successfully.

Additional Notes

There is an additional class of issues that this branch does not currently fix, related to storage being closed during database installation while in-flight transactions still have the database open. This causes transaction errors due to closed in-memory and on-disk structures, and often leads to cascading database installs, failed updates and (in rarer situations) lost updates.
We have some fixes designed for this issue, but are debating whether it's worth developing them further, as the problems are not observed with the enterprise agent deployed (the full database sync in the enterprise edition does not close storage for backup/remote install, and so does not encounter them).

I've ported these changes to 3.2 and tested them to the point that I'm fairly confident they can be reproduced on that branch and solve the same issues - 3.2 already had some changes that 3.1 did not which try to address some of these issues, but they fail under stress testing without the fixes in this branch. I've paused that work for now until the 3.1 changes can be discussed and made stable.
There are additional issues in 3.2 that will need to be addressed (creating databases currently fails in a distributed cluster soon after startup) and that cause problems for the stress test tool, and 3.2 also suffers from the issues with database storage closure under load (in particular, I've observed lost updates on some occasions).

Checklist

[x] I have run the build using mvn clean package command
[x] My unit tests cover both failure and success scenarios

@timw timw changed the base branch from develop to 3.1.x August 15, 2022 09:54
@joao-carimo

@timw if OrientDB supported database aliases, full database restoration would be made easier:
1. Restore the desired backup (using the enterprise plugin) into a temporary database, with a temporary alias
2. Freeze the current database
3. Swap the database alias between the restored and the current database
4. Release the database
5. Drop the previous database as soon as the backup restoration is confirmed

This means that a database could be accessed through its alias name.
Under the enterprise plugin, a command such as:
ALTER DATABASE set ALIAS=<'alias name'>;
would define a database alias.
Otherwise the alias would default to the same name as the database (which would guarantee compatibility).

Swapping the database alias between databases would be the most effective way to restore a database, as the temporary database could be tested before being accepted or revoked.

In distributed mode, the restored database should be propagated across the nodes.

What are your thoughts about this?
Currently, with the Enterprise edition plugin, we need to restart the application server, which means that users would lose information and the application would be unavailable for a period of time.

Another thing:
We have been struggling to run the database under a load-balanced configuration (round robin).
Have you experienced the same issue?

Kind regards,

Joao

@tglman
Member

tglman commented Aug 18, 2022

Hi @timw,

Your observations and fixes are spot on, and this work looks amazing. There are additional refactors we are slowly working on on our side to make everything more stable as well, like unifying the ODistributedDatabaseImpl lifecycle with the OSharedContext lifecycle, which should help make sure that opening/closing/installing a database works in a more reliable way!

This pull request has a lot of changes inside, and it would probably be better to apply some of them one by one to reduce risk and complexity, so if you can, could you please split this PR into multiple PRs?

I will list some commits that I saw in this PR that could be merged straight away in independent PRs:

1:
79e6b4d
2:
6d1c5c2
a09415f
3:
4116f42
4:
2abb66a
5:
19c3ca7
6:
3860d38

If you could open a PR for each of these commits, I could merge them straight away; then, for the rest of the commits that require additional checking and testing, we can proceed later on with review and merges. Also, after merging some of the listed commits, this PR will become smaller.

Regards

@timw
Contributor Author

timw commented Aug 19, 2022

Happy to do some separate PRs for the independent bits (part of the reason for this WIP was to have that discussion).
Part of the rationale for them was the overall context of improving the HA stability, so it's good to see some of them in-sequence.

@timw
Contributor Author

timw commented Aug 23, 2022

@tglman - I've created the separate PRs now.
If/when those get merged, I'll rebase this PR.

@timw timw force-pushed the 3.1/ha_db_stability branch 2 times, most recently from de16b79 to 8b6c929 August 24, 2022 21:19
@timw
Contributor Author

timw commented Aug 24, 2022

I've rebased on the head of 3.1.x, with the separately merged commits now removed.

Split the loading of enabled server plugins from the starting of plugins, to allow the presence of a distributed server manager to be detected before network listeners are established and storage is opened.
This allows guarding of constructs that require the distributed plugin to be present and running, which currently experience a race condition between the network listeners starting and the distributed plugin fully starting.
The current usages of openNoAuthenticate include cases (like DB delta/full syncs) that need to bypass not only auth checks but distributed online status.
Errors in unbound tasks in executors that are launched from common points (e.g. OrientDBEmbedded#execute) are hard to trace.
This change allows a task ID to be associated with each execution, which will be reported on any exception, and if debug logging is enabled, a full stack trace identifying the launching call site will be attached.
This allows improved logging and tracing consistency over general use of new Thread().
This prevents storage tasks that require the distributed status to be online from accessing distributed lifecycle objects that have not yet been set up (which shows up as NPEs during execution).
This avoids accesses to uninitialised distributed state during initial database setup from cluster.
Prefer attempting to cancel the task before execution, before waiting.
Also removes double logging of the execution exception, and avoids the problem where get cannot be called after cancel.
Provide tracing overrides to aid in tracking async errors.
Allows registration of live updates to succeed when the distributed plugin is not online.
View update uses distributed state, which can break if view update occurs during a distributed state change, breaking the update loop.
ODistributedDatabaseImpl construction registered the instance, leaking the this reference, and shut down the previous instance if present.
The previous instance may not have been fully constructed, however, so shutdown could NPE, resulting in the construction of the current instance aborting with uninitialised state, which would then be picked up by other threads finding it registered in the message service.

This change externalises the construction into an atomic operation in the message service, and makes the state in the distributed database impl final (a minimal sketch of this registration pattern follows these notes).

The warning about needing registration because of use "further in the call chain" appears to be spurious.
…eated.

If the plugin isn't online, initialisation of a newly created database will fail, resulting in a partially initialised database that will break when used (usually because the schema hasn't been loaded).
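As an illustration of that registration pattern (the class names here are hypothetical stand-ins, not OrientDB's actual types), the registry rather than the constructor publishes the instance, so other threads can never observe a partially constructed object:

```java
// Minimal sketch: construction is performed atomically inside the registry via
// computeIfAbsent, the constructor never leaks 'this', and the instance state is final.
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public final class MessageService {

  public static final class DistributedDb {
    private final String name; // all state assigned before the instance becomes visible

    DistributedDb(String name) {
      this.name = name; // no self-registration here: 'this' must not escape the constructor
    }

    public String name() {
      return name;
    }
  }

  private final Map<String, DistributedDb> databases = new ConcurrentHashMap<>();

  /** Atomically constructs and registers the database; concurrent callers see one fully built instance. */
  public DistributedDb getOrCreate(String name) {
    return databases.computeIfAbsent(name, DistributedDb::new);
  }
}
```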
@tglman
Member

tglman commented Nov 14, 2022

Hi,

I saw that you updated this PR with new commits, so I went through it again. Here is a list of commits that I would accept straight away in independent PRs:

#1
1694e6f

#2 view fixes
3c53320
ede8fbf

#3
3d6d5e5

#4
99a0bfb

#5
5899101

#6
4ec87b1

Could you send these independently?

Regards

@timw
Contributor Author

timw commented Mar 27, 2023

@tglman - sorry, I didn't notice your last update, and got distracted with other work for a while so only just checked it today.
I've created the requested PRs (I note 3c53320 has already been cherry picked).

It would also be good to get some guidance on what to do with the remaining work in this PR, once those other minor items are removed.

We've been running this in production for some months now, and have had no issues or outages, so are pretty confident on stability.
3.1 is a dormant branch though, so I'd like to start porting the fixes to 3.2, but there are some changes made in 3.2 around the database opening code paths that will require some rework.

@tglman
Member

tglman commented Apr 4, 2023

Hi,

Thank you for creating the specific PRs. I merged some of them and also ported the changes to 3.2.x (all the previously merged changes have already been ported to 3.2.x as well).

I see that 3 main sets of changes are left in this PR:

  • Introduction of new executors classes
  • Refactors around views scheduling + execution
  • Changes to the initialization and flow of the plugins, specifically the distributed one

For the executors, what is the scope of this specialized executor? Looking at the code, it seems that it adds additional logging to correlate an error with the source caller - am I getting that right? This is cool, but it is not free, so I'm in favour of it provided the tracing can be turned off. I'm also happy to have this in 3.2.x, even though in that version all the executors are in the OrientDB context and should be turned off with it, and all the Orient global executors have been removed - but I guess it is not a problem to use this tracing executor there as well.

For the view changes, specifically for 3.1.x, I think we could merge them, but in 3.2.x there has been a big refactor of that logic which should already resolve the problem these changes are fixing, so I don't think the view changes need to be ported to 3.2.x.

For the plugin loading change, I see it tries to make sure that detection of the distributed plugin is done earlier via the "distributedPluginEnabled" flag. I understand why this is done - I had more than a few problems with database creation at initialization and with detection of the distributed environment at startup. This could be OK in 3.1.x, but here too I think we managed to solve this problem with a more structural approach in 3.2.x, so I do not know whether this change is worth porting.

Thank you for this work anyway, it is impressively good!

@timw
Contributor Author

timw commented May 10, 2023

I'll try to find some time to fix up the tracing executors soon - as you note we can avoid the callable construction when tracing is disabled (which is already detected in the prepareTrace).
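A minimal sketch of that optimisation, using a stand-in prepareTrace method and a hypothetical tracingEnabled flag rather than the branch's actual code: when tracing is disabled the task is returned untouched, so no wrapper callable or call-site stack trace is ever constructed.

```java
// Minimal sketch of skipping the tracing wrapper entirely when tracing is disabled.
// TracingSubmit, prepareTrace and tracingEnabled are hypothetical stand-ins.
import java.util.concurrent.Callable;
import java.util.logging.Level;
import java.util.logging.Logger;

public final class TracingSubmit {
  private static final Logger LOG = Logger.getLogger(TracingSubmit.class.getName());
  private static volatile boolean tracingEnabled = false; // e.g. driven by a global configuration

  public static <T> Callable<T> prepareTrace(String taskId, Callable<T> task) {
    if (!tracingEnabled) {
      return task; // fast path: no wrapper allocation, no stack capture
    }
    // Capture the launching call site now, while still on the submitting thread.
    final Exception callSite = new Exception("Launch site of task '" + taskId + "'");
    return () -> {
      try {
        return task.call();
      } catch (Exception e) {
        LOG.log(Level.SEVERE, "Task '" + taskId + "' failed; launched from:", callSite);
        throw e;
      }
    };
  }
}
```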

For the view and lifecycle changes, I'll need to re-test against the current 3.2 head to be sure (I see a lot of changes in the database/session/distributed area that I'll have to understand as well), but at the time of creating this PR my load test app could reliably break 3.2 in a lot of the same ways that 3.1 broke.

Given we're pretty stable on our 3.1 fork for now, the best way forward might be to re-do the stress testing on 3.2 and then port the still relevant fixes over to 3.2 and look at that in a separate PR.

@tglman
Member

tglman commented Jan 22, 2024

Hi,

I have manually ported the executor tracing to 3.2.x and added a global configuration to enable it for debugging purposes. The rest of this PR has already been solved in 3.2.x, so I'm going to close this.

Regards

@tglman tglman closed this Jan 22, 2024
@timw
Contributor Author

timw commented Jan 22, 2024

Thanks for that.
We will be upgrading to 3.2 around mid-2024 I expect, so I'll repeat the stress testing process on 3.2 then.
I think the architectural changes make porting the remaining patches tricky, so it's probably best to start from first principles again and see if there are outstanding/new issues.
