feat: add beta support to scale Stateful Executors with consensus using RAFT #5564

Merged
merged 165 commits into from
May 10, 2023

JoanFM
@JoanFM JoanFM commented Jan 2, 2023

Goals:
Provide a PoC of how stateful Executors could be handled with consensus.
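
As background for this goal, here is a minimal sketch of the replicated-state-machine idea that Raft builds on: every replica applies the same committed log entries in the same order, so all replicas reach the same state. The `ToyFSM` class, the `entry` helper, and the JSON entry format are hypothetical and exist only for illustration. The three methods mirror the shape of the `Apply`, `Snapshot`, and `Restore` hooks of the hashicorp/raft FSM interface. This is not the code from this PR.

```python
import json
from typing import Dict, List


class ToyFSM:
    """Toy state machine exposing apply/snapshot/restore hooks.

    The class, its fields, and the JSON entry format are hypothetical and
    only mirror the shape of hashicorp/raft's FSM interface.
    """

    def __init__(self) -> None:
        self.docs: Dict[str, str] = {}

    def apply(self, entry: bytes) -> None:
        """Apply one committed log entry; fail loudly on unknown operations."""
        op = json.loads(entry)
        if op["op"] == "index":
            self.docs[op["id"]] = op["text"]
        elif op["op"] == "delete":
            self.docs.pop(op["id"], None)
        else:
            raise ValueError(f"unknown operation: {op['op']!r}")

    def snapshot(self) -> bytes:
        """Serialize the full state so older log entries can be discarded."""
        return json.dumps(self.docs, sort_keys=True).encode()

    def restore(self, blob: bytes) -> None:
        """Replace the current state with a previously taken snapshot."""
        self.docs = json.loads(blob)


def entry(op: str, doc_id: str, text: str = "") -> bytes:
    """Build one log entry in the hypothetical JSON format used above."""
    return json.dumps({"op": op, "id": doc_id, "text": text}).encode()


log: List[bytes] = [
    entry("index", "a", "hello"),
    entry("index", "b", "world"),
    entry("delete", "a"),
    entry("index", "c", "raft"),
]

# Two replicas that apply the same log in the same order converge.
replica_1, replica_2 = ToyFSM(), ToyFSM()
for e in log:
    replica_1.apply(e)
    replica_2.apply(e)

# A late joiner restores a snapshot taken after two entries, then replays the rest.
early = ToyFSM()
for e in log[:2]:
    early.apply(e)
late_joiner = ToyFSM()
late_joiner.restore(early.snapshot())
for e in log[2:]:
    late_joiner.apply(e)
```

The late joiner ends up with the same documents as the two replicas that saw the whole log, which is the property that "proper Snapshot and Restore" needs to preserve.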

Implementation:

  • Handle raftadmin add voters from jina orchestrate
  • Handle terminate signals on the raft node
  • SIGTERM Shutdown handling
  • Logger in raft naming editing (if they merge and release)
  • Check replica retrying for indexing (see "In Local, same request is not guaranteed to try every replica when retries= -1", #5601)
  • Make it work with shards (workspace handling)
  • Expose RAFT options in parser
  • Panic if FSM Apply fails (see "What happens when FSM.Apply fails?", hashicorp/raft#307). We currently have some failure reasons, such as gRPC connectivity, that should not be there.
  • Implement proper Snapshot and Restore
  • Properly create workspace folders and log/data files, without needing to set the raft_bootstrap parameter explicitly
  • Improve AddVoter calls (we do 10 for now until it works)
  • Improve the readiness check (currently the node reports ready before restore is done). The RAFT node should expose a health check service that proxies the Executor health check service, and Pod.wait_for_ready should use it.
  • Implement building wheels for different platforms (look at ANNLite for wheel building)
  • HANDLE LICENSE (since we do not change the code, no need to do anything)
  • Handle ContainerPod
  • Clean logs
  • RPC should implement streaming endpoint
  • Currently, ports are assigned nondeterministically. The snapshots store information about each node in the cluster, so when a cluster is restored we have to make sure it is spawned with the same ports assigned to the replicas.
  • Fix raft library compilation for emulated python on mac (will build to arm architecture and therefore can't be linked to x86_64 emulated python)
  • Handle better reduction in Shards (Head)
  • (OPTIONAL) Implement ApplyBatch in the FSM that will help on reusing client and connection
  • (OPTIONAL) Make jina.pb.go independent of docarray proto. (Maybe assume DocArray is just binary?)
  • (OPTIONAL) Study if we can call WorkerRequestHandler directly from Golang
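
One way to approach the port item in the list above is to persist the replica-to-port mapping in the workspace and reuse it on restart. The sketch below assumes a hypothetical `ports.json` file and a hypothetical `replica-<n>` naming scheme. Neither is something Jina is known to write, and this is not the approach taken in this PR.

```python
import json
import socket
from pathlib import Path
from typing import Dict


def _free_port() -> int:
    """Ask the OS for a currently unused TCP port on localhost."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind(("127.0.0.1", 0))
        return sock.getsockname()[1]


def load_or_assign_ports(workspace: Path, replicas: int) -> Dict[str, int]:
    """Return a stable replica -> port mapping persisted under `workspace`.

    `ports.json` and the `replica-<n>` keys are hypothetical names.
    """
    ports_file = workspace / "ports.json"
    mapping: Dict[str, int] = {}
    if ports_file.exists():
        # Reuse whatever was assigned on a previous start.
        mapping = json.loads(ports_file.read_text())

    used = set(mapping.values())
    for idx in range(replicas):
        key = f"replica-{idx}"
        if key in mapping:
            continue
        port = _free_port()
        while port in used:  # never hand the same port to two replicas
            port = _free_port()
        mapping[key] = port
        used.add(port)

    workspace.mkdir(parents=True, exist_ok=True)
    ports_file.write_text(json.dumps(mapping, indent=2))
    return mapping
```

Calling `load_or_assign_ports` twice with the same workspace returns the same mapping, so the addresses recorded in a snapshot stay valid across restarts. A persisted port may have been taken by another process in the meantime, which this sketch does not handle.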

Testing:

  • Unit test golang module
  • Integration test
  • Try killing nodes, see leadership changes
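
When planning the node-killing test, it helps to know how many voters can be lost before the cluster can no longer elect a leader. Raft requires a strict majority of voters, so the arithmetic can be sketched as below. The function names are illustrative only.

```python
def quorum_size(voters: int) -> int:
    """Smallest number of voters that forms a strict majority."""
    if voters < 1:
        raise ValueError("a cluster needs at least one voter")
    return voters // 2 + 1


def tolerated_failures(voters: int) -> int:
    """How many voters can be down while a majority is still reachable."""
    return voters - quorum_size(voters)


for n in (1, 3, 4, 5):
    print(n, quorum_size(n), tolerated_failures(n))
```

With 3 replicas, one node can be killed and a new leader can still be elected. Killing a second node leaves the remaining node without a majority, so no leader should emerge until another node comes back.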

Further understanding:

  • How to handle a leader timeout during snapshotting (right now it fails; can we give up leadership, or run the snapshot in a new process (copy-on-write)?)
  • Understand restore behavior
  • Understand vote, bootstrap, etc ...
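
The copy-on-write idea raised in the snapshotting question can be sketched with `os.fork` on POSIX systems: the child process sees the state as it was at the moment of the fork and writes it out, while the parent keeps serving requests. This is only an illustration of the technique, not what this PR does. The function names and the JSON format are assumptions. Forking a process that runs threads (for example gRPC servers) has its own hazards, and CPython's reference counting reduces the memory savings of copy-on-write.

```python
import json
import os
import tempfile
from typing import Dict


def snapshot_in_child(state: Dict[str, int], path: str) -> int:
    """Fork and let the child write its view of `state` to `path`.

    Returns the child's PID so the caller can wait for it. POSIX only.
    """
    pid = os.fork()
    if pid == 0:
        # Child: its memory is a copy-on-write image taken at fork time.
        status = 1
        try:
            tmp_path = path + ".tmp"
            with open(tmp_path, "w") as handle:
                json.dump(state, handle)
            os.replace(tmp_path, path)  # atomic rename on POSIX
            status = 0
        finally:
            os._exit(status)  # never return into the parent's code path
    return pid


def demo() -> Dict[str, int]:
    """Mutate the state after forking and return what the snapshot holds."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "snapshot.json")
        state = {"indexed_docs": 1}
        pid = snapshot_in_child(state, path)
        state["indexed_docs"] = 2  # the parent keeps working during the snapshot
        _, wait_status = os.waitpid(pid, 0)
        if not (os.WIFEXITED(wait_status) and os.WEXITSTATUS(wait_status) == 0):
            raise RuntimeError("snapshot child failed")
        with open(path) as handle:
            return json.load(handle)
```

`demo()` returns the state as it was when the fork happened, even though the parent changed it afterwards. Whether this would avoid the leader timeout depends on how long the fork itself takes for a large process.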

@github-actions
Copy link

github-actions bot commented Jan 2, 2023

This PR exceeds the recommended size of 1000 lines. Please make sure you are NOT addressing multiple issues with one PR. Note this PR might be rejected due to its size.

@github-actions github-actions bot added the area/core, area/network, component/client, and component/proto labels Jan 2, 2023
@github-actions github-actions bot added the area/entrypoint and area/helper labels Jan 2, 2023
@github-actions github-actions bot added the area/cli and area/docs labels Jan 2, 2023
@codecov
Copy link

codecov bot commented Jan 2, 2023

Codecov Report

Merging #5564 (e10211c) into master (87fa2db) will decrease coverage by 3.11%.
The diff coverage is 29.98%.

@@            Coverage Diff             @@
##           master    #5564      +/-   ##
==========================================
- Coverage   84.34%   81.24%   -3.11%     
==========================================
  Files         141      142       +1     
  Lines       11333    12095     +762     
==========================================
+ Hits         9559     9826     +267     
- Misses       1774     2269     +495     
Flag    Coverage Δ
jina    81.24% <29.98%> (-3.11%) ⬇️

Flags with carried forward coverage won't be shown.

Impacted Files                                 Coverage Δ
jina/checker.py                                94.44% <ø> (-0.16%) ⬇️
jina/orchestrate/flow/base.py                  89.79% <ø> (ø)
jina/orchestrate/pods/helper.py                82.69% <ø> (-0.33%) ⬇️
jina/proto/docarray_v1/pb/jina_pb2.py          0.00% <0.00%> (ø)
jina/proto/docarray_v1/pb/jina_pb2_grpc.py     0.00% <0.00%> (ø)
jina/proto/docarray_v2/pb/jina_pb2.py          0.00% <0.00%> (ø)
jina/proto/docarray_v2/pb/jina_pb2_grpc.py     0.00% <0.00%> (ø)
jina/proto/docarray_v2/pb2/jina_pb2.py         0.00% <0.00%> (ø)
jina/proto/docarray_v2/pb2/jina_pb2_grpc.py    0.00% <0.00%> (ø)
jina/serve/networking/utils.py                 86.31% <ø> (ø)
... and 23 more

... and 8 files with indirect coverage changes
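
The totals in the coverage diff can be cross-checked with a few lines of arithmetic, since project coverage is hits divided by lines. The sketch below uses only the numbers reported above. The helper name is illustrative. The 29.98% diff coverage is measured over the changed lines and cannot be derived from these totals.

```python
def coverage_pct(hits: int, lines: int) -> float:
    """Project coverage as a percentage."""
    return 100.0 * hits / lines


# Totals taken from the Codecov report in this PR.
base_hits, base_misses, base_lines = 9559, 1774, 11333   # master
head_hits, head_misses, head_lines = 9826, 2269, 12095   # this PR

base = coverage_pct(base_hits, base_lines)
head = coverage_pct(head_hits, head_lines)
delta = head - base

print(f"master: {base:.4f}%  PR: {head:.4f}%  delta: {delta:.4f}")
```

The computed values agree with the reported 84.34%, 81.24%, and -3.11% to within rounding, and hits plus misses equals lines on both sides.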

@JoanFM JoanFM requested a review from alexcg1 May 9, 2023 17:25
Signed-off-by: Joan Fontanals Martinez <joan.martinez@jina.ai>
@JoanFM JoanFM closed this May 10, 2023
@JoanFM JoanFM reopened this May 10, 2023
@github-actions
Copy link

📝 Docs are deployed on https://jina-v-raft--jina-docs.netlify.app 🎉

@JoanFM JoanFM closed this May 10, 2023
@JoanFM JoanFM reopened this May 10, 2023
@JoanFM JoanFM merged commit 71db6a0 into master May 10, 2023
@JoanFM JoanFM deleted the jina-v-raft branch May 10, 2023 15:27
Labels

  • area/cicd: This issue/PR affects the cicd pipeline
  • area/cli: This issue/PR affects the command line interface
  • area/core: This issue/PR affects the core codebase
  • area/docker: This issue/PR affects the docker functionality
  • area/docs: This issue/PR affects the docs
  • area/entrypoint: This issue/PR affects the entrypoint codebase
  • area/helper: This issue/PR affects the helper functionality
  • area/housekeeping: This issue/PR is housekeeping
  • area/network: This issue/PR affects network functionality
  • area/setup: This issue/PR affects setting up Jina
  • area/testing: This issue/PR affects testing
  • component/proto
  • size/XL

Development

Successfully merging this pull request may close these issues.

In Local, same request is not guaranteed to try every replica when retries= -1
4 participants