Possible Memory Issue #3880

Open
bmalinconico opened this issue Aug 30, 2024 · 5 comments
Assignees: ivoanjo
Labels: bug (Involves a bug), community (Was opened by a community member)

bmalinconico commented Aug 30, 2024

Current behaviour
I have found that enabling auto-compaction in the Ruby GC causes what appear to be random memory-related bugs. The problem manifests in the following ways:

Manifestation 1

TypeError: wrong argument type XXX (expected PG::Connection)

/usr/src/app/vendor/bundle/ruby/3.3.0/gems/datadog-2.3.0/lib/datadog/tracing/contrib/pg/instrumentation.rb:27:in `exec': wrong argument type Set (expected PG::Connection) (TypeError)
	from /usr/src/app/vendor/bundle/ruby/3.3.0/gems/datadog-2.3.0/lib/datadog/tracing/contrib/pg/instrumentation.rb:27:in `block in exec'
	from /usr/src/app/vendor/bundle/ruby/3.3.0/gems/datadog-2.3.0/lib/datadog/tracing/contrib/pg/instrumentation.rb:145:in `block in trace'
	from /usr/src/app/vendor/bundle/ruby/3.3.0/gems/datadog-2.3.0/lib/datadog/tracing/trace_operation.rb:206:in `block in measure'

Here XXX can be any built-in or app-specific class; it varies from one failure to the next.

This is raised on a call to exec or exec_params in the DD Postgres instrumentation.

Manifestation 2
This error is produced by the reproduction steps I will provide later.

PG::InvalidDatetimeFormat: ERROR:  invalid input syntax for type date: ""
CONTEXT:  unnamed portal parameter $1 = ''

I patched the DD PG instrumentation for exec_params and rescued the error with a pry session. The params array contained no empty values, and a retry of the block succeeded.
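
For anyone who wants to repeat that check without an interactive pry session, here is a minimal sketch of the same idea. It is not the patch described above: `ExecParamsProbe` and `FlakyConnection` are hypothetical names, and the stand-in connection exists only so the sketch runs without a database. The probe records which parameter positions were blank when `exec_params` raised, then re-raises.

```ruby
# Hypothetical diagnostic wrapper; not part of the datadog or pg gems.
module ExecParamsProbe
  FAILURES = []

  def exec_params(sql, params = [], *rest)
    super
  rescue StandardError => e
    # Record which parameter positions were nil or empty when the call failed.
    blank = params.each_index.select { |i| params[i].to_s.empty? }
    FAILURES << { error: e.class.name, blank_indexes: blank, param_count: params.size }
    raise
  end
end

# Stand-in for a real connection: fails on the first call, succeeds afterwards.
class FlakyConnection
  def initialize
    @calls = 0
  end

  def exec_params(_sql, _params = [], *_rest)
    @calls += 1
    raise ArgumentError, 'simulated failure' if @calls == 1

    :ok
  end
end

FlakyConnection.prepend(ExecParamsProbe)
```

In a real application the equivalent would presumably be `PG::Connection.prepend(ExecParamsProbe)`; that has not been tested against this issue.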

Manifestation 3
Occasional segfaults.

All of these errors feel like something is holding a memory reference that is being moved, resulting in random garbage getting passed down the stack and occasionally referencing a freed memory location.

Expected behaviour
Not an error!

Steps to reproduce
I was unable to reproduce this on my local machine, but a local containerized environment may be able to reproduce it. I was only able to reproduce it in a container running on EC2; that machine is Linux x86_64.

Dockerfile to reproduce this image

FROM ruby:3.3.4
RUN apt-get update && apt-get install -y libjemalloc2 && rm -rf /var/lib/apt/lists/*
ENV LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so.2

My running application reproduces this error easily when the compacting garbage collector is enabled, due to the volume of activity. Reproducing it in a shell is much more time-consuming, since you (presumably) need to wait for a compaction run.

I'll also acknowledge this may not be datadog, but I've tried to narrow it down as much as I can.

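ENV['DATABASE_URL'] = 'setme' # placeholder: point this at a reachable Postgres instance
require 'date'
require 'pg'
require 'datadog'

# Allow the GC to move objects during major collections.
GC.auto_compact = true

Datadog.configure do |c|
  c.tracing.instrument :pg
end

conn = PG.connect(ENV.fetch('DATABASE_URL', nil))
loop do
  # One statement with 1,664 date-typed placeholders and as many Date params.
  conn.exec_params("SELECT #{1_664.times.map { |i| "$#{i + 1}::date as f_#{i}" }.join(',')}", 1_664.times.map { Date.today })
  print '.'
end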
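
To make the shape of that statement easier to see, here is a small sketch that builds the same query and parameter list for an arbitrary number of placeholders. `wide_date_query` and `wide_date_params` are hypothetical helper names, not part of the reproducer. The value 1_664 presumably matches PostgreSQL's limit on target-list entries, but the report does not say so, so treat that as an assumption.

```ruby
require 'date'

# Builds the same statement shape as the reproducer for n placeholders.
def wide_date_query(n)
  columns = n.times.map { |i| "$#{i + 1}::date as f_#{i}" }
  "SELECT #{columns.join(',')}"
end

# Builds n separate Date objects, as the reproducer does.
def wide_date_params(n)
  Array.new(n) { Date.today }
end

puts wide_date_query(2)
```

Starting with a small n and growing it may help show whether the failure depends on the number of parameters; that has not been checked here.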

I'm going to reiterate that reproducing this is annoying, since it takes no small amount of luck to get a compacting GC run to trigger at the right time. Doing the above in concurrent fibers increased the odds of it happening (probably due to increased memory churn); however, I am providing the smallest repro I can.
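
One way to depend less on luck might be to request compaction explicitly between iterations instead of waiting for auto-compaction. The sketch below is an untested idea, not something verified against this issue: `churn_with_compaction` is a hypothetical helper that runs a block repeatedly, calls `GC.compact` every few iterations, and returns how many compaction runs the VM reported through `GC.stat(:compact_count)`.

```ruby
# Hypothetical helper: runs the block `iterations` times and compacts every `every` calls.
# Returns the number of compaction runs reported by the VM during the loop.
def churn_with_compaction(iterations:, every: 10)
  unless GC.respond_to?(:compact)
    raise NotImplementedError, 'GC.compact is unavailable on this platform'
  end

  before = GC.stat(:compact_count)
  iterations.times do |i|
    yield i
    GC.compact if ((i + 1) % every).zero?
  end
  GC.stat(:compact_count) - before
end

# In the reproducer the block would wrap the query, for example:
#   churn_with_compaction(iterations: 1_000) { conn.exec_params(sql, params) }
```

Manual compaction between calls is not the same as auto-compaction firing in the middle of a call, so a clean run with this helper would not rule the bug out.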

Environment

  • datadog version: Currently 2.3.0 but I was upgrading to 1.2.3 and enabling the profiler. When I downgraded to 1.2.3 the issue was still present if I turned on auto_compact
  • Configuration block (Datadog.configure ...):
  • Ruby version: ruby 3.3.4 (2024-07-09 revision be1089c8ec) +YJIT [x86_64-linux]
  • Operating system: Linux
  • Relevant library versions: PG - 1.5.6

bmalinconico added the bug (Involves a bug) and community (Was opened by a community member) labels Aug 30, 2024
bmalinconico changed the title from "Possible Memory Corruption Issue" to "Possible Memory Issue" Aug 30, 2024
bmalinconico reopened this Aug 30, 2024
ivoanjo self-assigned this Aug 30, 2024

ivoanjo (Member) commented Aug 30, 2024

Hey @bmalinconico thanks for reaching out and sorry you ran into this.

I was able to see the PG::InvalidDatetimeFormat showing up with your reproducer script 👀 on version 2.3.0 of the gem a few times (not very often). I wasn't able to trigger the other situations.

I did a few experiments with this reproducer, and the instrumentation does suspiciously seem to be the "straw that breaks the camel's back"; that is, this looks somewhat like a bug in pg that the extra memory pressure of tracing triggers... But I don't have convincing evidence either way yet.

Just to confirm, you mentioned:

> Currently 2.3.0 but I was upgrading to 1.2.3 and enabling the profiler. When I downgraded to 1.2.3 the issue was still present if I turned on auto_compact

Did you mean that you were upgrading from 1.23 and then downgraded again?

> Occasional segfaults.

Can you share the output from the Ruby VM crash? It may help in tracking the issue down.

Or, even better, if you're able to get a core dump, it would be a really useful tool to help track this down.
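
As a rough sketch of one prerequisite for getting a core dump: the process's core-file size limit has to allow it. The snippet below raises the soft limit to the hard limit from inside Ruby, using `Process.getrlimit` and `Process.setrlimit`. `enable_core_dumps` is a hypothetical helper name. Whether a core file is actually written, and where, also depends on the kernel's core pattern and the container configuration, neither of which is covered here.

```ruby
# Hypothetical helper: raises the soft core-file size limit to the hard limit
# for this process (child processes inherit it). Returns [soft, hard] afterwards.
def enable_core_dumps
  _soft, hard = Process.getrlimit(Process::RLIMIT_CORE)
  Process.setrlimit(Process::RLIMIT_CORE, hard, hard)
  Process.getrlimit(Process::RLIMIT_CORE)
end

soft, hard = enable_core_dumps
warn "core limit: soft=#{soft} hard=#{hard}"
```

If the hard limit is 0, this changes nothing, and the limit would have to be raised from outside the process instead.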

bmalinconico (Author) commented

@ivoanjo thanks for confirming. The PG::InvalidDatetimeFormat is the only manifestation I've been able to reproduce on demand. All of these manifestations were collected from production failures and correlated through slow A/B testing of settings.

Yes, I first encountered this when I upgraded to 2.3.0 from 1.2.3 (among other changes). In order to isolate the issue I rolled everything back and started piecing it back together. GC compaction was triggering this even on 1.2.3 (same PG version).

Our test suites have occasionally triggered the segfault when auto compaction was on. I'll see if I can pull the crash log from that.

ivoanjo (Member) commented Sep 2, 2024

> Yes, I first encountered this when I upgraded to 2.3.0 from 1.2.3 (among other changes). In order to isolate the issue I rolled everything back and started piecing it back together. GC compaction was triggering this even on 1.2.3 (same PG version).

I'm a bit confused about this part -- you've mentioned version 1.2.3 in a few places above. Is this version 1.23.something? 🤔 1.2.0 is quite old (July 2022) and there were no other point releases in that series.

> Our test suites have occasionally triggered the segfault when auto compaction was on. I'll see if I can pull the crash log from that.

That would be great! 👀

bmalinconico (Author) commented

Sorry, I was typing it out from memory. This was encountered when upgrading from 1.22.0 -> 2.3.0. When we encountered the error, we rolled back to 1.22 and the issue was still present.

-    ddtrace (1.22.0)
-      datadog-ci (~> 0.8.1)
+    datadog (2.3.0)
       debase-ruby_core_source (= 3.3.1)
-      libdatadog (~> 7.0.0.1.0)
+      libdatadog (~> 11.0.0.1.0)
       libddwaf (~> 1.14.0.0.0)
       msgpack

I was not able to find a CI run with the output in it. I've got a branch with auto_compact enabled and I'm trying to get the failure to occur.

ivoanjo (Member) commented Sep 4, 2024

> I've got a branch with auto_compact enabled and I'm trying to get the failure to occur.

Thanks! Hopefully that will help shine some light on this.
