fixed formatting issues #192

Merged 2 commits on Mar 20, 2020
246 changes: 181 additions & 65 deletions discussions/performance.md
3.11.20 -- Performance Workstream

Wednesday, March 11, 2020

**Performance Goals:**

Current HW system achieving stable 1k TPS, peak 5k TPS, and proven horizontal scalability (more instances = more performance, almost linearly)

**POCs:**

Test the impact of a direct replacement of the MySQL DB with a shared in-memory network service such as Redis, using the Redlock algorithm if locks are required (a lock sketch follows this list)

Test a different method of sharing state, using a lightweight version of event-driven architecture with some CQRS (a domain-event sketch follows this list)
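
A minimal sketch of the locking primitive the first POC leans on, assuming the `ioredis` npm client and Node's built-in `crypto.randomUUID`; the key name, TTL, and the `withPositionLock` helper are illustrative, not existing Mojaloop code. Full Redlock acquires the same lock on a majority of independent Redis nodes; this shows only the single-node building block (SET NX PX to acquire, a compare-and-delete script to release).

```typescript
// Sketch only: single-node Redis lock (the primitive that Redlock runs across
// several independent nodes). Assumes the `ioredis` client; the key name and
// the `withPositionLock` helper are illustrative.
import Redis from 'ioredis';
import { randomUUID } from 'crypto';

const redis = new Redis(); // connection details omitted

// Release atomically: delete the key only if we still own it.
const RELEASE_SCRIPT = `
if redis.call("get", KEYS[1]) == ARGV[1] then
  return redis.call("del", KEYS[1])
else
  return 0
end`;

async function withPositionLock<T>(
  participant: string,
  ttlMs: number,
  critical: () => Promise<T>
): Promise<T> {
  const key = `locks:position:${participant}`;
  const token = randomUUID();

  // SET key token NX PX ttl: acquire only if the key does not already exist.
  const acquired = await redis.set(key, token, 'PX', ttlMs, 'NX');
  if (acquired !== 'OK') {
    throw new Error(`position lock busy for ${participant}`);
  }
  try {
    return await critical(); // e.g. read position, apply the transfer, write back
  } finally {
    await redis.eval(RELEASE_SCRIPT, 1, key, token);
  }
}

// Usage: serialize position updates for one DFSP without a shared MySQL row lock.
// await withPositionLock('dfsp-1', 2000, async () => { /* update position in Redis */ });
```

The `redlock` npm package implements the multi-node version of this same pattern.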

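For the second POC, a rough sketch of what "a lightweight version of event-driven with some CQRS" could look like: handlers append domain events (for example to a Kafka topic) instead of updating a shared MySQL row under a lock, and each DFSP's position becomes a read model projected from the event stream. Every name here (`TransferPrepared`, `EventLog`, `PositionProjection`) is an illustrative assumption, not an existing Mojaloop interface.

```typescript
// Sketch only: lightweight event-driven/CQRS shape for sharing state.
// All names (TransferPrepared, PositionProjection, etc.) are illustrative.

type DomainEvent =
  | { type: 'TransferPrepared'; transferId: string; payerFsp: string; payeeFsp: string; amount: number }
  | { type: 'TransferFulfilled'; transferId: string }
  | { type: 'TransferAborted'; transferId: string };

// Write side: handlers append events (e.g. to a Kafka topic) instead of
// updating a shared MySQL row under a lock.
interface EventLog {
  append(event: DomainEvent): Promise<void>;
  subscribe(handler: (event: DomainEvent) => void): void;
}

// Read side: the per-DFSP position is projected from the event stream.
class PositionProjection {
  private positions = new Map<string, number>();
  private pending = new Map<string, { payerFsp: string; amount: number }>();

  apply(event: DomainEvent): void {
    switch (event.type) {
      case 'TransferPrepared':
        // Reserve against the payer position when the transfer is prepared.
        this.pending.set(event.transferId, { payerFsp: event.payerFsp, amount: event.amount });
        this.adjust(event.payerFsp, event.amount);
        break;
      case 'TransferFulfilled':
        this.pending.delete(event.transferId); // the reservation becomes final
        break;
      case 'TransferAborted': {
        const reserved = this.pending.get(event.transferId);
        if (reserved) {
          this.adjust(reserved.payerFsp, -reserved.amount); // roll back the reservation
          this.pending.delete(event.transferId);
        }
        break;
      }
    }
  }

  positionOf(fsp: string): number {
    return this.positions.get(fsp) ?? 0;
  }

  private adjust(fsp: string, delta: number): void {
    this.positions.set(fsp, this.positionOf(fsp) + delta);
  }
}
```
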
**Resources:**

Slack Channel: perf-engineering

Mid-PI performance presentation:
https://github.com/mojaloop/documentation-artifacts/tree/master/presentations/March2020-PI9-MidPI-Review

Setting up the monitoring components
https://github.com/mojaloop/helm/tree/master/monitoring

**Action/Follow-up Items**

• What Kafka metrics (client & server side) should we be reviewing? - Confluent to assist (see the consumer-statistics sketch after this list)

• Explore Locking and position settlement - Sybrin to assist

1. Review RedLock - pessimistic locking vs automatic locking

2. Remove the shared DB in the middle (automatic locking on Redis)

• Combine prepare/position handler w/ distributed DB

• Review the Node.js client and how it impacts Kafka; the configuration of Node and, ultimately, the Kafka client - Nakul

• Turn back on tracing to see how latency and applications are behaving

• Ensure the call counts have been rationalized (at a deeper level)

• Validate the processing times on the handlers and confirm we are hitting the cache

• Async patterns in Node (see the parallel-await sketch after this list)

• Missing someone who is excellent on MySQL and Percona

1. Are we leveraging this correctly

2. What cache layer are we using (in memory)

• Review the event modeling implementation - identify the domain events

• Node.js/kubernetes -

• Focus on application issues, not so much on arch issues

• How we are doing async technology - review this (Node.js is the larger issue); threading models need to be optimized - Nakul
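
On the Kafka metrics and Node.js client items above, one low-effort starting point on the client side is librdkafka's built-in statistics, which the node-rdkafka client emits as an `event.stats` event when `statistics.interval.ms` is set. A sketch under those assumptions; the broker address, group id, topic name, and the fields pulled out of the stats payload are placeholders, and the exact payload shape should be checked against the node-rdkafka docs (swap in the equivalent settings if a different client library is in use).

```typescript
// Sketch only: surfacing librdkafka client-side statistics from node-rdkafka.
// Broker address, group id, and topic name are placeholders.
import * as Kafka from 'node-rdkafka';

const consumer = new Kafka.KafkaConsumer(
  {
    'metadata.broker.list': 'localhost:9092',
    'group.id': 'perf-poc',
    // Emit an 'event.stats' event with librdkafka statistics every 15 s.
    'statistics.interval.ms': 15000,
    'enable.auto.commit': false,
  },
  { 'auto.offset.reset': 'earliest' }
);

consumer.on('event.stats', (stats: any) => {
  // Assumption: the statistics JSON arrives as a string on stats.message.
  const parsed = JSON.parse(stats.message);
  // Client-side numbers worth watching: per-partition consumer lag,
  // fetch queue sizes, and broker round-trip latency.
  console.log('rdkafka stats', {
    ts: parsed.ts,
    rxBytes: parsed.rx_bytes,
    topics: Object.keys(parsed.topics ?? {}),
  });
});

consumer.on('ready', () => {
  consumer.subscribe(['transfer-prepare']); // illustrative topic name
  consumer.consume();
});

consumer.on('data', (message) => {
  // Per-message handler work goes here; keeping this path fast is what the
  // call-count, caching, and async-pattern action items are about.
});

consumer.connect();
```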

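On the "Async patterns in Node" item, one concrete pattern worth reviewing (an illustration, not a decision from the meeting): lookups inside a handler that do not depend on each other can be awaited together instead of one at a time, so total latency approaches the slowest round trip instead of the sum. The lookup functions below are hypothetical stand-ins for MySQL or cache calls.

```typescript
// Sketch only: parallelizing independent awaits inside a handler.
// getParticipant / getPosition / getSettlementModel are hypothetical lookups,
// each standing in for a MySQL or cache round trip.
declare function getParticipant(id: string): Promise<unknown>;
declare function getPosition(id: string): Promise<unknown>;
declare function getSettlementModel(currency: string): Promise<unknown>;

async function prepareHandlerSequential(payerId: string, currency: string) {
  // Three round trips back to back: the latencies add up per message.
  const participant = await getParticipant(payerId);
  const position = await getPosition(payerId);
  const model = await getSettlementModel(currency);
  return { participant, position, model };
}

async function prepareHandlerParallel(payerId: string, currency: string) {
  // The lookups do not depend on each other, so issue them together;
  // total latency is roughly the slowest call instead of the sum.
  const [participant, position, model] = await Promise.all([
    getParticipant(payerId),
    getPosition(payerId),
    getSettlementModel(currency),
  ]);
  return { participant, position, model };
}
```
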
**Meeting Notes/Details**

• History

1. Technology was put in place in the hope that the design solves an enterprise problem

2. The community effort did not prioritize making the slices of the system enterprise grade or cheap to run

3. OSS technology choices

• Goals

1. Optimize current system

2. Make it cheaper to run

3. Make it scalable to 5K TPS

4. Ensure value added services can effectively and securely access transaction data

• Testing Constraints

1. Only tested the golden transfer - the transfer leg

2. Flow of transfer

3. Simulators (legacy and advanced) - using the legacy one for continuity

4. Disabled the timeout handler

5. 8 DFSPs (participant organizations); with more DFSPs we would be able to scale further

• Process

1. JMeter initiates the payer request

2. The legacy simulator receives the fulfill notify callback

3. The legacy simulator handles payee processing and initiates the fulfillment callback

4. A record is written in the positions table for each DFSP

• Partial algorithm where locking is done to reserve the funds, do the calculations, and do the final commits (see the position-handler sketch after these notes)

• Position handler is processing one record at a time

1. A future algorithm would process in bulk

2. One transfer is handled by one position handler

• Transfers are all pre-funded

1. Reduced settlement costs

2. Can control how fast DFSPs respond to the fulfill request (complete the transfers committed first before handling new requests)

3. The system needs to time out transfers that take longer than 30 seconds

• Any redesign of the DBs

• Test Cases

1. Financial transaction

• End-to-end

• Prepare-only

• Fulfil only

2. Individual Mojaloop Characterization

• Services & Handlers

• Streaming Arch & Libraries

• Database

• What changed: 150 to 300 TPS

1. How we process the messages

2. Position handler (run in mixed mode, random)

• Latency Measurement

1. 5 sec for DB to process, X sec for Kafka to process

2. How to measure this?

• Targets

1. High enough that the system has to function well

2. Crank the system up to add scale (x DFSPs addition)

3. Suspicious cases for investigations

4. Observing contentions around the DB

5. Shared DB, 600 ms without any errors

• Contention is fully on the DB

• Bottleneck is the DB (distribute systems so they run independently)

1. 16 databases run end to end

2. GSMA - 500 TPS

3. What is the optimal design?

• Contentions

1. System handler contention

• Where the system can be scaled

1. If there are arch changes that we need to make we can explore this

• Consistency for each DFSP

• Threading of info flows - open question

1. Skewed results of a single DB for all DFSPs

2. The challenge is where we get to with additional HW

• What are the limits of the application design

1. Financial transfers (in and out of the system)

• Audit systems

• Settlement activity

• Grouped into DB solves some issues

• Confluent feedback

1. Shared DB issues, multiple DBs

2. Application design level issues

3. Seen situations where we ran a bunch of simulators/sandboxes

• Need to rely on tracers and scans once this gets into production

• Miguel states we disable tracing for now

• Known Issues

1. CPU load on boxes (Node waiting around) - re-optimize the code

2. Processing times increase over time

• Optimization

1. Distributed monolithic - PRISM - getting rid of redundant reads

2. Combine the handlers - Prepare+Position & Fulfil+Position

• What are we trying to fix?

1. Can we scale the system?

2. What does it cost to do this? (scale unit cost)

3. Need to understand how to do this at both a small and a large scale

4. Optimize the resources

5. 2.5 sprints

6. Need to scale horizontally

7. Add audit and repeatability -
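
To make the locking and contention points above concrete, here is a rough sketch of the reserve-and-commit step the position handler performs per transfer, written with the `knex` query builder; the table and column names (`participantPosition`, `value`) and the net-debit-cap check are illustrative assumptions, not the actual central-ledger schema. Because every transfer for a participant serializes on this row lock in the shared MySQL DB, this step is where the contention shows up, and it is what the Redis/Redlock POC, bulk processing, and the combined Prepare+Position handler aim to relieve.

```typescript
// Sketch only: per-transfer reserve-and-commit in the position handler.
// Uses the knex query builder; table/column names are illustrative.
import knex from 'knex';

const db = knex({ client: 'mysql2', connection: { /* omitted */ } });

async function reservePosition(payerFsp: string, amount: number, netDebitCap: number) {
  return db.transaction(async (trx) => {
    // Lock the payer's position row: concurrent transfers for the same
    // participant serialize here, which is where the DB contention comes from.
    const row = await trx('participantPosition')
      .where({ participant: payerFsp })
      .forUpdate()
      .first();
    if (!row) {
      throw new Error(`no position row for ${payerFsp}`);
    }

    const newPosition = Number(row.value) + amount;
    if (newPosition > netDebitCap) {
      throw new Error(`transfer would exceed the net debit cap for ${payerFsp}`);
    }

    // Final commit of the calculated position inside the same transaction.
    await trx('participantPosition')
      .where({ participant: payerFsp })
      .update({ value: newPosition });

    return newPosition;
  });
}
```

Bulk processing ("a future algorithm would process in bulk") and the combined Prepare+Position handler discussed above both aim to reduce how often this round trip happens per transfer.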

Attendees:

• Don, Jordon (newly hired perf expert) - Coil

• Sam, Miguel, Roman, Valentine, Warren, Bryan, Rajiv - ModusBox

• Pedro - Crosslake

• Rhys, Nakul Mishra - Confluent

• Miller - BGMF

• In-person: Lewis (CL), Rob (MB), Roland (Sybrin), Greg (Sybrin), Megan (V), Simeon (V), Kim (CL)