
2482 privacy requests approved in quick succession error when not running worker #2489

Conversation

adamsachs
Contributor

@adamsachs adamsachs commented Feb 2, 2023

Closes #2482
Closes #2486

Code Changes

  • create a new db Session per celery task invocation
  • ensure a common Engine and sessionmaker are reused across invocations to keep db connection resources in check (see the sketch below this list)
  • update worker "main" module to ensure saas request overrides are added to worker runtime
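
A minimal sketch of the session-management pattern in the first two bullets (illustrative only, not the actual fides implementation; helper names like task_session are made up):

```python
# One Engine/sessionmaker per celery process, a fresh Session per task invocation.
from contextlib import contextmanager

from celery import Celery
from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker

app = Celery("tasks")

_ENGINE = None
_SESSIONMAKER = None


def _get_sessionmaker():
    """Lazily build a single Engine + sessionmaker, reused by every invocation."""
    global _ENGINE, _SESSIONMAKER
    if _SESSIONMAKER is None:
        # placeholder connection string and pool settings
        _ENGINE = create_engine(
            "postgresql://fides:fides@db:5432/fides", pool_size=5, max_overflow=10
        )
        _SESSIONMAKER = sessionmaker(bind=_ENGINE)
    return _SESSIONMAKER


@contextmanager
def task_session():
    """Hand each task invocation its own short-lived Session."""
    session = _get_sessionmaker()()
    try:
        yield session
    finally:
        session.close()


@app.task
def run_privacy_request(privacy_request_id: str) -> None:
    with task_session() as db:
        ...  # load the privacy request by id and execute it with this Session
```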

Steps to Confirm

  • for the #2482 fix (privacy requests approved in quick succession error when not running worker), run without a worker (e.g. nox -s 'fides_env(dev)') and approve multiple privacy requests in quick succession using multiple browser tabs in the admin UI.

    • ensure privacy requests are executing concurrently, i.e. a second request is approved before the first request has gone to completion. you may need to add longer-running connectors to your local environment to get into this state easily, or you can use a script to send "approve" API calls in rapid succession (a reference script will likely be uploaded shortly; a rough sketch of such a script also follows this list)
  • for the #2486 fix (saas overrides are not loaded into celery worker runtime), run with a worker (e.g. nox -s dev -- worker ui) and switch celery.task_always_eager = false in fides.toml

    • you'll also need to configure a connector that uses request overrides; examples are ACS (Adobe Campaign Standard) and Firebase
    • run a privacy request. before this fix, you'd get errors immediately. with this fix, privacy request execution should proceed normally on the worker.
  • see individual issues for more detailed steps to repro
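
For the #2482 steps above, a rough sketch of the kind of approval script mentioned (the base URL, endpoint path, payload shape, and auth token are all placeholders rather than the exact fides API; a reference script is attached further down the thread):

```python
# Fire "approve" calls in parallel to force concurrent privacy request execution.
from concurrent.futures import ThreadPoolExecutor

import requests

API_URL = "http://localhost:8080/api/v1"   # placeholder base URL
TOKEN = "<access token>"                    # placeholder auth token
PRIVACY_REQUEST_IDS = ["pri_1", "pri_2", "pri_3", "pri_4", "pri_5"]  # placeholder ids


def approve(privacy_request_id: str) -> int:
    # placeholder route and payload -- the real approval endpoint may differ
    resp = requests.patch(
        f"{API_URL}/privacy-request/administrate/approve",
        json={"request_ids": [privacy_request_id]},
        headers={"Authorization": f"Bearer {TOKEN}"},
    )
    return resp.status_code


# submit all approvals at (nearly) the same time
with ThreadPoolExecutor(max_workers=len(PRIVACY_REQUEST_IDS)) as pool:
    print(list(pool.map(approve, PRIVACY_REQUEST_IDS)))
```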

Pre-Merge Checklist

Description Of Changes

These are two distinct fixes that are coupled together here because they touch similar parts of the codebase and each came up while testing the other.

Still pending some local testing to ensure db connections are kept in check even with the changes to session management here.

  • we still maintain a single Engine and sessionmaker, shared across celery task invocations
  • move worker "main" into its own module to avoid dependency conflicts
@adamsachs adamsachs added bug Something isn't working do not merge Please don't merge yet, bad things will happen if you do labels Feb 2, 2023
@adamsachs adamsachs requested a review from pattisdr February 2, 2023 21:46
@adamsachs adamsachs self-assigned this Feb 2, 2023
@codecov

codecov bot commented Feb 2, 2023

Codecov Report

Base: 88.45% // Head: 88.45% // Increases project coverage by +0.00% 🎉

Coverage data is based on head (8206c04) compared to base (82330c8).
Patch coverage: 80.00% of modified lines in pull request are covered.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #2489   +/-   ##
=======================================
  Coverage   88.45%   88.45%           
=======================================
  Files         327      328    +1     
  Lines       15957    15959    +2     
  Branches     4431     4433    +2     
=======================================
+ Hits        14115    14117    +2     
  Misses       1688     1688           
  Partials      154      154           
Impacted Files Coverage Δ
src/fides/cli/commands/util.py 59.21% <0.00%> (ø)
src/fides/api/ops/worker/__init__.py 57.14% <57.14%> (ø)
.../ops/service/messaging/message_dispatch_service.py 74.74% <100.00%> (ø)
.../service/privacy_request/request_runner_service.py 85.65% <100.00%> (ø)
src/fides/api/ops/tasks/__init__.py 82.35% <100.00%> (+5.42%) ⬆️


@adamsachs
Contributor Author

running some tests locally, both with and without a worker, and both with and without requiring request approval, i feel reasonably confident that we haven't introduced significant regressions in terms of database connection resources, especially when comparing against the test results that @pattisdr had gathered in ethyca/fidesops#944

i tested locally by using the attached testing script, tweaking the number of requests and the approval concurrency. while running the tests, i had another terminal running SELECT sum(numbackends) FROM pg_stat_database; against the app postgres db, to track how many open connections we had at any given moment.
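
a small polling loop like the one below gives the same reading without keeping a psql session open by hand (a hedged sketch; the connection string is a placeholder for whatever the local app postgres uses):

```python
# Print the total number of open backend connections once a second.
# Note: the loop itself briefly holds one connection while it queries.
import time

from sqlalchemy import create_engine, text

engine = create_engine("postgresql://postgres:fides@localhost:5432/fides")  # placeholder URI

while True:
    with engine.connect() as conn:
        count = conn.execute(text("SELECT sum(numbackends) FROM pg_stat_database")).scalar()
    print(f"open connections: {count}")
    time.sleep(1)
```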

here were my general findings:

  • running a worker, we stayed between 6 and 8 connections consistently, even executing 20 requests at a time many times over. everything was stable.
  • without workers, we need to turn on manual request approval in order to get any sort of concurrency - otherwise, requests just get queued up naturally when created in bulk.
    • consistently approving batches of 5-10 requests one after the other, our active connections did rise up to 25, but they also would drop down after requests finished processing.
    • there was no evidence of a significant connection "leak" like the one found in the testing for ethyca/fidesops#944 (Reduce # of clients connected to the application db [#810])
    • after enough batches of 10 concurrent approvals, i did start seeing QueuePool limit of size 5 overflow 10 reached errors pop up in the server logs
      • when this happened, privacy requests that were already processing continued to completion successfully, while some never actually got "approved" and stayed in a "new"/pending state; they could be approved and then processed successfully once the server got back to normal functioning
    • my reading of this is that we haven't introduced a problem, but are just running into a limit on our connection pool. looking more closely at the stack trace, the problem actually seems to be occurring not in our task's Engine, but in the Engine used generally by the API endpoints (stack trace attached)
    • before this fix, we were not able to run any concurrent requests when not running a worker - we'd error almost immediately. in that sense, it's hard to really determine whether this is a regression or we're just hitting new edge cases, because we couldn't even get more than one request running in parallel before this fix
    • in my testing so far, these symptoms seem a lot easier to deal with than the problem we were facing before, and from what i can tell this seems to be a separate problem
  • if i hit the privacy request approval endpoint with 20 clients at the same time, it actually does cause the server to lock up pretty consistently. this does not seem to occur with normal application use, but did occur if i sent 20 approval requests in parallel from my testing script.
    • i confirmed that this occurs both when running a worker and when not running a worker. it also occurs even without any of the changes in this PR, so i think it is a separate issue that likely needs to be investigated - it may have the same root cause as the connection pool limit being reached with batches of ~10 concurrent approval requests, i.e. what's mentioned above.

QueuePool_limit_stack.txt
execute_requests_concurrently.py.txt

@adamsachs adamsachs marked this pull request as ready for review February 3, 2023 14:24
@pattisdr
Contributor

pattisdr commented Feb 3, 2023

Starting review...

@adamsachs adamsachs removed the do not merge Please don't merge yet, bad things will happen if you do label Feb 3, 2023
Contributor

@pattisdr pattisdr left a comment


This seems like a necessary change for customers not running workers: it doesn't even seem workable without this. As an aside, it would be useful for us to better understand what is actually happening when task_always_eager is True. In my opinion, the preferred mode for running fides should be with workers. If customers are not doing that, we need to investigate whether task_always_eager=True is the best way to execute privacy requests without workers and whether there are other side effects we're not aware of.
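
For context on task_always_eager: in celery, setting it to True makes .delay()/.apply_async() run the task inline and synchronously in the calling process instead of enqueueing it for a worker, which is why privacy requests end up executing inside the API server. A tiny standalone example (the broker URL and task are illustrative):

```python
from celery import Celery

app = Celery("example", broker="redis://localhost:6379/0")  # illustrative broker URL
app.conf.task_always_eager = True  # tasks run inline in the caller; no worker needed


@app.task
def run_privacy_request(privacy_request_id: str) -> str:
    return f"processed {privacy_request_id}"


# With task_always_eager=True this executes synchronously in the calling process;
# with it set to False, the task is enqueued for a celery worker to pick up.
result = run_privacy_request.delay("pri_123")
print(result.get())  # eager results are available immediately
```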


Background from last PR

PR ethyca/fidesops#944 assumed we were running celery workers, and it makes sense to me why this is starting to break down when we're not using workers.

The biggest benefit from that PR was sharing engines across a celery process. I also added some smaller changes that seemed to improve session management alongside it, like passing along the same session for execution logs; but sharing a session wasn't a requirement, so it makes sense to revisit that here.

Testing setup:

I used Adam's script (thanks for attaching!) to "Make 20 requests with concurrency 5 to approve privacy requests", comparing with a worker versus without a worker, and sharing a sessionmaker (new behavior) versus sharing a session (previous behavior). I also increased the pool_size and max_overflow to 50 each to avoid hitting these limits early and be able to compare properly. I had two connectors set up to increase the duration of a privacy request. I monitored the number of connections open and verified privacy request statuses and execution logs.
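
For anyone reproducing this setup, the pool limits mentioned above map onto SQLAlchemy's create_engine arguments; a sketch of the tweak used for this comparison (the URI is a placeholder):

```python
from sqlalchemy import create_engine

# raise pool_size / max_overflow so the pool limit isn't hit early while testing
engine = create_engine(
    "postgresql://fides:fides@localhost:5432/fides",  # placeholder URI
    pool_size=50,      # persistent connections kept in the pool
    max_overflow=50,   # extra connections allowed beyond pool_size under load
)
```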

Takeaways

  • It would be useful to add a separate PR sooner rather than later to allow the engine pool_size and max_overflow to be configurable.
  • I don't know if task_always_eager=True should even be used in production. Sean says that this makes code run locally, synchronously, which was not the intent at all. Part of the reason for having celery was to have queues that don't consume too many resources at once and execute privacy requests at a reasonable rate.
  • Running a worker: Didn't see huge differences between sharing a session versus only sharing the sessionmaker. In my previous testing in the last PR, I had noticed slightly better session management when sharing a session, but wasn't seeing that in this particular test setup.
  • No worker: Confirmed huge improvements when not sharing a session. With the pool_size and max_overflow increased, session management seemed similar (just looking at connections used, max idle connections, etc.). However, sharing a session was causing major errors in privacy requests being able to complete successfully: 97% of privacy requests were failing with Session-related errors, like "'NoneType' object has no attribute 'dispatch'", "Instance '<PrivacyRequest at 0xffff7d23eb60>' is not persistent within this Session", or "'NoneType' object has no attribute 'twophase'".

@adamsachs adamsachs merged commit 2a6a3ed into main Feb 3, 2023
@adamsachs adamsachs deleted the 2482-privacy-requests-approved-in-quick-succession-error-when-not-running-worker branch February 3, 2023 18:20
@adamsachs
Contributor Author

It would be useful to add a separate PR sooner rather than later to allow the engine pool_size and max_overflow to be configurable.

created #2507 for follow up
