
2482 privacy requests approved in quick succession error when not running worker #2489

Conversation

adamsachs
Contributor

@adamsachs adamsachs commented Feb 2, 2023

Closes #2482
Closes #2486

Code Changes

  • create a new db Session per celery task invocation
  • ensure a common Engine and sessionmaker are reused across invocations to keep db connection resources in check (see the sketch below this list)
  • update worker "main" module to ensure saas request overrides are added to worker runtime
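
A minimal sketch of the session-management pattern in the first two bullets (illustrative only, not the actual fides implementation; helper names like task_session are made up):

```python
# One Engine/sessionmaker per celery process, a fresh Session per task invocation.
from contextlib import contextmanager

from celery import Celery
from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker

app = Celery("tasks")

_ENGINE = None
_SESSIONMAKER = None


def _get_sessionmaker():
    """Lazily build a single Engine + sessionmaker, reused by every invocation."""
    global _ENGINE, _SESSIONMAKER
    if _SESSIONMAKER is None:
        # placeholder connection string and pool settings
        _ENGINE = create_engine(
            "postgresql://fides:fides@db:5432/fides", pool_size=5, max_overflow=10
        )
        _SESSIONMAKER = sessionmaker(bind=_ENGINE)
    return _SESSIONMAKER


@contextmanager
def task_session():
    """Hand each task invocation its own short-lived Session."""
    session = _get_sessionmaker()()
    try:
        yield session
    finally:
        session.close()


@app.task
def run_privacy_request(privacy_request_id: str) -> None:
    with task_session() as db:
        ...  # load the privacy request by id and execute it with this Session
```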

Steps to Confirm

  • for the #2482 fix (privacy requests approved in quick succession error when not running worker), run without a worker (e.g. nox -s 'fides_env(dev)') and approve multiple privacy requests in quick succession using multiple browser tabs in the admin UI.

    • ensure privacy requests are executing concurrently, i.e. a second request is approved before the first request has gone to completion. you may need to add longer-running connectors to your local environment to get into this state easily, or you can use a script to send "approve" API calls in rapid succession (a reference script will likely be uploaded shortly; a rough sketch of such a script also follows this list)
  • for the #2486 fix (saas overrides are not loaded into celery worker runtime), run with a worker (e.g. nox -s dev -- worker ui) and switch celery.task_always_eager = false in fides.toml

    • you'll also need to configure a connector that uses request overrides; examples are ACS (Adobe Campaign Standard) and Firebase
    • run a privacy request. before this fix, you'd get errors immediately. with this fix, privacy request execution should proceed normally on the worker.
  • see individual issues for more detailed steps to repro
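
For the #2482 steps above, a rough sketch of the kind of approval script mentioned (the base URL, endpoint path, payload shape, and auth token are all placeholders rather than the exact fides API; a reference script is attached further down the thread):

```python
# Fire "approve" calls in parallel to force concurrent privacy request execution.
from concurrent.futures import ThreadPoolExecutor

import requests

API_URL = "http://localhost:8080/api/v1"   # placeholder base URL
TOKEN = "<access token>"                    # placeholder auth token
PRIVACY_REQUEST_IDS = ["pri_1", "pri_2", "pri_3", "pri_4", "pri_5"]  # placeholder ids


def approve(privacy_request_id: str) -> int:
    # placeholder route and payload -- the real approval endpoint may differ
    resp = requests.patch(
        f"{API_URL}/privacy-request/administrate/approve",
        json={"request_ids": [privacy_request_id]},
        headers={"Authorization": f"Bearer {TOKEN}"},
    )
    return resp.status_code


# submit all approvals at (nearly) the same time
with ThreadPoolExecutor(max_workers=len(PRIVACY_REQUEST_IDS)) as pool:
    print(list(pool.map(approve, PRIVACY_REQUEST_IDS)))
```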

Pre-Merge Checklist

Description Of Changes

These are two distinct fixes that are coupled together here because they touch similar parts of the codebase and each came up while testing the other.

Still pending some local testing to ensure db connections are kept in check even with the changes to session management here.

  • we still maintain a single Engine and sessionmaker, shared across celery task invocations
  • move worker "main" into its own module to avoid dependency conflicts
@adamsachs adamsachs added bug Something isn't working do not merge Please don't merge yet, bad things will happen if you do labels Feb 2, 2023
@adamsachs adamsachs requested a review from pattisdr February 2, 2023 21:46
@adamsachs adamsachs self-assigned this Feb 2, 2023
@codecov

codecov bot commented Feb 2, 2023

Codecov Report

Base: 88.45% // Head: 88.45% // Increases project coverage by +0.00% 🎉

Coverage data is based on head (8206c04) compared to base (82330c8).
Patch coverage: 80.00% of modified lines in pull request are covered.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #2489   +/-   ##
=======================================
  Coverage   88.45%   88.45%           
=======================================
  Files         327      328    +1     
  Lines       15957    15959    +2     
  Branches     4431     4433    +2     
=======================================
+ Hits        14115    14117    +2     
  Misses       1688     1688           
  Partials      154      154           
Impacted Files Coverage Δ
src/fides/cli/commands/util.py 59.21% <0.00%> (ø)
src/fides/api/ops/worker/__init__.py 57.14% <57.14%> (ø)
.../ops/service/messaging/message_dispatch_service.py 74.74% <100.00%> (ø)
.../service/privacy_request/request_runner_service.py 85.65% <100.00%> (ø)
src/fides/api/ops/tasks/__init__.py 82.35% <100.00%> (+5.42%) ⬆️


@adamsachs
Contributor Author

running some tests locally, both with and without a worker, and both with and without requiring request approval, i feel reasonably confident that we haven't introduced significant regressions in terms of database connection resources, especially when comparing against the test results that @pattisdr had gathered in ethyca/fidesops#944

i tested locally by using the attached testing script, tweaking the number of requests and the approval concurrency. while running the tests, i had another terminal running SELECT sum(numbackends) FROM pg_stat_database; against the app postgres db, to track how many open connections we had at any given moment.
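
a small polling loop like the one below gives the same reading without keeping a psql session open by hand (a hedged sketch; the connection string is a placeholder for whatever the local app postgres uses):

```python
# Print the total number of open backend connections once a second.
# Note: the loop itself briefly holds one connection while it queries.
import time

from sqlalchemy import create_engine, text

engine = create_engine("postgresql://postgres:fides@localhost:5432/fides")  # placeholder URI

while True:
    with engine.connect() as conn:
        count = conn.execute(text("SELECT sum(numbackends) FROM pg_stat_database")).scalar()
    print(f"open connections: {count}")
    time.sleep(1)
```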

here were my general findings:

  • running a worker, we stayed between 6 and 8 connections consistently, even executing 20 requests at a time many times over. everything was stable.
  • without workers, we need to turn on manual request approval in order to get any sort of concurrency - otherwise, requests just get queued up naturally when created in bulk.
    • consistently approving batches of 5-10 requests one after the other, our active connections did rise up to 25, but they also would drop down after requests finished processing.
    • there was no evidence of a significant connection "leak" like the one found in the testing for ethyca/fidesops#944 (Reduce # of clients connected to the application db [#810])
    • after enough batches of 10 concurrent approvals, i did start seeing QueuePool limit of size 5 overflow 10 reached errors pop up in the server logs
      • when this happened, privacy requests that were already processing continued to completion successfully, while some never actually got "approved" and stayed in a "new"/pending state; they could be approved and then processed successfully once the server got back to normal functioning
    • my reading of this is that we haven't introduced a problem, but are just running into a limit on our connection pool. looking more closely at the stack trace, the problem actually seems to be occurring not in our task's Engine, but in the Engine used generally by the API endpoints (stack trace attached)
    • before this fix, we were not able to run any concurrent requests when not running a worker - we'd error almost immediately. in that sense, it's hard to really determine whether this is a regression or we're just hitting new edge cases, because we couldn't even get more than one request running in parallel before this fix
    • in my testing so far, these symptoms seem a lot easier to deal with than the problem we were facing before, and from what i can tell this seems to be a separate problem
  • if i hit the privacy request approval endpoint with 20 clients at the same time, it actually does cause the server to lock up pretty consistently. this does not seem to occur with normal application use, but did occur if i sent 20 approval requests in parallel from my testing script.
    • i confirmed that this occurs both when running a worker and when not running a worker. it also occurs even without any of the changes in this PR, so i think it is a separate issue that likely needs to be investigated - it may have the same root cause as the connection pool limit being reached with batches of ~10 concurrent approval requests, i.e. what's mentioned above.

QueuePool_limit_stack.txt
execute_requests_concurrently.py.txt

@adamsachs adamsachs marked this pull request as ready for review February 3, 2023 14:24
@pattisdr
Contributor

pattisdr commented Feb 3, 2023

Starting review...

@adamsachs adamsachs removed the do not merge Please don't merge yet, bad things will happen if you do label Feb 3, 2023
Contributor

@pattisdr pattisdr left a comment


This seems like a necessary change for customers not running workers: it doesn't even seem workable without this. As an aside, it would be useful for us to better understand what is actually happening when task_always_eager is True. In my opinion, the preferred mode for running fides should be with workers. If customers are not doing that, we need to investigate whether task_always_eager=True is the best way to execute privacy requests without workers and whether there are other side effects we're not aware of.
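
For context on task_always_eager: in celery, setting it to True makes .delay()/.apply_async() run the task inline and synchronously in the calling process instead of enqueueing it for a worker, which is why privacy requests end up executing inside the API server. A tiny standalone example (the broker URL and task are illustrative):

```python
from celery import Celery

app = Celery("example", broker="redis://localhost:6379/0")  # illustrative broker URL
app.conf.task_always_eager = True  # tasks run inline in the caller; no worker needed


@app.task
def run_privacy_request(privacy_request_id: str) -> str:
    return f"processed {privacy_request_id}"


# With task_always_eager=True this executes synchronously in the calling process;
# with it set to False, the task is enqueued for a celery worker to pick up.
result = run_privacy_request.delay("pri_123")
print(result.get())  # eager results are available immediately
```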


Background from last PR

PR ethyca/fidesops#944 assumed we were running celery workers, and it makes sense to me why this is starting to break down when we're not using workers.

The biggest benefit from that PR was sharing engines across a celery process. I also added some smaller changes that seemed to improve session management alongside it, like passing along the same session for execution logs; but sharing a session wasn't a requirement, so it makes sense to revisit that here.

Testing setup:

I used Adam's script (thanks for attaching!) to "Make 20 requests with concurrency 5 to approve privacy requests", comparing with a worker versus without a worker, and sharing a sessionmaker (new behavior) versus sharing a session (previous behavior). I also increased the pool_size and max_overflow to 50 each to avoid hitting these limits early and be able to compare properly. I had two connectors set up to increase the duration of a privacy request. I monitored the number of connections open and verified privacy request statuses and execution logs.
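
For anyone reproducing this setup, the pool limits mentioned above map onto SQLAlchemy's create_engine arguments; a sketch of the tweak used for this comparison (the URI is a placeholder):

```python
from sqlalchemy import create_engine

# raise pool_size / max_overflow so the pool limit isn't hit early while testing
engine = create_engine(
    "postgresql://fides:fides@localhost:5432/fides",  # placeholder URI
    pool_size=50,      # persistent connections kept in the pool
    max_overflow=50,   # extra connections allowed beyond pool_size under load
)
```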

Takeaways

  • It would be useful to add a separate PR sooner rather than later to allow the engine pool_size and max_overflow to be configurable.
  • I don't know if task_always_eager=True should even be used in production. Sean says that this makes code run locally, synchronously, which was not the intent at all. Part of the reason for having celery was to have queues that don't consume too many resources at once and execute privacy requests at a reasonable rate.
  • Running a worker: Didn't see huge differences between sharing a session versus only sharing the sessionmaker. In my previous testing in the last PR, I had noticed slightly better session management when sharing a session, but wasn't seeing that in this particular test setup.
  • No worker: Confirmed huge improvements when not sharing a session. With the pool_size and max_overflow increased, session management seemed similar (just looking at connections used, max idle connections, etc.). However, sharing a session was causing major errors in privacy requests being able to complete successfully: 97% of privacy requests were failing with Session-related errors, like "'NoneType' object has no attribute 'dispatch'", "Instance '<PrivacyRequest at 0xffff7d23eb60>' is not persistent within this Session", or "'NoneType' object has no attribute 'twophase'".

@adamsachs adamsachs merged commit 2a6a3ed into main Feb 3, 2023
@adamsachs adamsachs deleted the 2482-privacy-requests-approved-in-quick-succession-error-when-not-running-worker branch February 3, 2023 18:20
@adamsachs
Contributor Author

It would be useful to add a separate PR sooner rather than later to allow the engine pool_size and max_overflow to be configurable.

created #2507 for follow up
