Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[APM] Fix missing code_owner cases in Datadog #609

Open
2 tasks
robrap opened this issue Apr 26, 2024 · 3 comments
Open
2 tasks

[APM] Fix missing code_owner cases in Datadog #609

robrap opened this issue Apr 26, 2024 · 3 comments

Comments

@robrap
Copy link
Contributor

robrap commented Apr 26, 2024

A/C:

  • All successful traces in Datadog have the code owner attribute
    • Query: env:prod service:edx-edxapp-lms operation_name:django.request -@http.status_code:(4* OR 5*) -@code_owner_squad:*
    • Add @session_email_mismatch:* (any custom attribute we always attach to the root span) to filter out false positives from Disabling NR APM causes trace concatenation in Datadog #692 (which causes us to attach the code owner attribute to the wrong span).
    • Add @http.status_code:* to filter out traces where an HTTP status code was never even attached.
  • Spot-check shows that NR and DD have similar counts for code owners over the same time span

The current implementation of code-owner tagging in LMS uses _get_module_from_request_path to determine what Python module is handling the request. However, some requests fail to identify the module and fall back to _get_module_from_current_transaction, which contains New Relic specific code. We'll need to find a way to make this work with Datadog.

Identifying where fallbacks occur

  • In production: Query New Relic for code_owner_transaction_name. (We only get code_owner_path_error if both path-based and transaction-based module lookups fail.)
  • In devstack: newrelic is not enabled, so these requests fall all the way through to the end of _get_module_from_request.

Known problematic views

Over the past month I observed that about 1% of requests in prod LMS had required a transaction-based module lookup, but only for a very small list of transaction types:

from Transaction select count(*) facet code_owner_transaction_name
where code_owner_transaction_name is not null and appName = 'prod-edx-edxapp-lms'
since 1 hour ago

Occurrences in production, from most to least frequent, along with how to reproduce them:

  • [FIXED] openedx.core.djangoapps.contentserver.middleware:StaticContentServer - Course assets should be served by a view rather than a middleware openedx/edx-platform#34702
  • django.core.handlers.wsgi:WSGIHandler._get_response
    • A variety of non-200 responses, mostly under /api, such as a 301 from /api/discussion/v1/comments
  • lms.djangoapps.mobile_api.middleware:AppVersionUpgrade
    • 426s from /api/mobile/v2/users/.../course_enrollments
  • corsheaders.middleware:CorsMiddleware
    • 200s from URLs like /api/courseware/course/course-v1:..., /api/grades/v1/gradebook/course-v1:.../bulk-updatehistory/, /api/course_home/course_metadata/course-v1:ArmEducationX (latter is possibly someone probing the API)
  • openedx.core.djangoapps.safe_sessions.middleware:SafeSessionMiddleware
    • 401s and 302s from a collection of endpoints

Other observations

  • These are all middleware. In fact, there are no results for code_owner_module like '%middleware%' and code_owner_transaction_name is null -- requests that are handled by a middleware never have a request-path-based answer for code owner module.
    • That aligns with not being able to use resolve.

Assorted notes

  • I think the implementation of _get_module_from_request_path could be improved, because I think the view_func might come in different places for different requests. There is auth related test code that does this. (TODO: find code example)
  • We need to ensure that we know how to find the spans with the new attribute, and that error spans have this attribute, or that there is some way to actually set up performance and error monitoring using this attribute.
  • Or if DD has other suggestions, we can certainly take them.
@timmc-edx timmc-edx self-assigned this Apr 29, 2024
@timmc-edx
Copy link
Member

Interestingly, there's a property django.resolver_match that we might be able to use instead of _get_module_from_request_path.

@timmc-edx timmc-edx changed the title [APM] Enable edx-platform split-owner monitoring in DD (code_owner tag) [APM] Fix missing code_owner cases in Datadog May 7, 2024
@timmc-edx timmc-edx removed their assignment May 7, 2024
@robrap
Copy link
Contributor Author

robrap commented Aug 21, 2024

@timmc-edx:

@timmc-edx
Copy link
Member

I've updated the ticket with a few notes (a query to find these traces, course assets middleware was fixed).

Running env:prod service:edx-edxapp-lms operation_name:django.request -@http.status_code:(4* OR 5*) -@code_owner_squad:* @session_email_mismatch:* again I see:

  • 56 spans total in the past day
  • Adding @http.status_code:* removes about half -- these are incomplete traces (interrupted requests?) with no status code at all, across a small random set of views. Not worried about these.
  • The remainder are GET 404 (meaning, couldn't be resolved to a view, probably handled by a middleware) with an HTTP 301 response. From a bit of poking around, I think these are all URLs with a missing trailing slash, and we're redirecting to add that slash before the code owner can be attached. (Probably Django is doing this.)

So probably we could get down to zero (and re-establish our monitor) if we:

  • Excluded missing-status-code traces
  • Excluded 301 responses with a missing trailing slash in the path

On the django.resolver_match question, there likely would be a performance improvement, but we'd have to side-by-side test the options to see if they agreed closely enough.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: Backlog
Development

No branches or pull requests

2 participants