
[Bug] JobManager frequent GC causes Yarn container memory overflow #2995

13301891422 opened this issue Aug 30, 2023 · 1 comment

Search before asking

  • I had searched in the issues and found no similar issues.

Java Version

1.8.0_212

Scala Version

2.12.x

StreamPark Version

2.0.0

Flink Version

1.15.4

Deploy Mode

yarn-application

What happened

When I submit a Flink on YARN (yarn-application mode) job using StreamPark, the JobManager's memory parameters look like this:

jobmanager.memory.heap.size 469762048b
jobmanager.memory.jvm-metaspace.size 268435456b
jobmanager.memory.jvm-overhead.max 201326592b
jobmanager.memory.jvm-overhead.min 201326592b
jobmanager.memory.off-heap.size 134217728b
jobmanager.memory.process.size 1024mb
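
For reference, these values follow Flink's JobManager memory model, in which the process size is the sum of heap, off-heap, metaspace, and JVM overhead. Converting the byte values above to MiB as a sanity check:

jobmanager.memory.heap.size            448 MiB
jobmanager.memory.off-heap.size        128 MiB
jobmanager.memory.jvm-metaspace.size   256 MiB
jobmanager.memory.jvm-overhead.min/max 192 MiB
jobmanager.memory.process.size        1024 MiB (total)

So the YARN container is sized at exactly 1 GB, matching the "1.0 GB of 1 GB physical memory used" in the diagnostics below; any native or direct memory used beyond these budgets would push the container over its limit.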

After the job runs for a period of time (about 3 to 20 days), the container running the JobManager is always killed by the ResourceManager. I then enabled GC logging for the JobManager and found that the JobManager process performs a young-generation GC roughly every 2 minutes, as follows:

2023-08-30T13:56:57.694+0800: [GC (Allocation Failure) [PSYoungGen: 149956K->1673K(150528K)] 315127K->166876K(456704K), 0.0138514 secs] [Times: user=0.54 sys=0.05, real=0.02 secs]
2023-08-30T13:59:17.558+0800: [GC (Allocation Failure) [PSYoungGen: 150141K->1636K(150528K)] 315344K->166871K(456704K), 0.0285263 secs] [Times: user=1.20 sys=0.11, real=0.03 secs]
...
2023-08-30T14:47:54.412+0800: [GC (Allocation Failure) [PSYoungGen: 148425K->1700K(150016K)] 314796K->168135K(456192K), 0.0258613 secs] [Times: user=0.96 sys=0.06, real=0.03 secs]
2023-08-30T14:50:12.434+0800: [GC (Allocation Failure) [PSYoungGen: 149138K->1156K(150016K)] 315573K->167607K(456192K), 0.0233593 secs] [Times: user=0.77 sys=0.07, real=0.03 secs]
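
The log above is in the standard JDK 8 -XX:+PrintGCDetails format; one way to enable it for the JobManager is via Flink's env.java.opts.jobmanager option (a minimal sketch, the log path is just a placeholder):

env.java.opts.jobmanager: "-XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:/path/to/jobmanager-gc.log"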

To understand the cause of the JobManager's frequent GC, I dumped the objects in the JobManager's Java heap to a local file and opened it in VisualVM for analysis. Char[] occupies the largest share of the heap, as shown in the following figure:

[Screenshots: VisualVM heap dump analysis (Snipaste_Heap_Dump, Snipaste_Heap_Dump_2)]
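
A heap dump like this can be taken with standard JDK tooling; a minimal sketch of the commands (jps/jmap from the same JDK 8, with a placeholder dump path):

jps -m                                                  # locate the JobManager (ClusterEntrypoint) PID
jmap -dump:live,format=b,file=/tmp/jm-heap.hprof <pid>  # write a binary heap dump, then open it in VisualVM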

What could cause this? If we submit the job from the command line with $FLINK_HOME/bin/flink run -t yarn-per-job, it does not generate nearly as many Char[] instances, and that JobManager (configured with exactly the same parameters as above) performs a GC only about once every 40 minutes, which seems normal. A sketch of that command-line submission follows.
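
For comparison, the command-line submission looks roughly like this (a sketch; the jar path and entry class are placeholders, and the memory setting is passed explicitly so both runs use identical parameters):

$FLINK_HOME/bin/flink run -t yarn-per-job \
  -Djobmanager.memory.process.size=1024mb \
  -c com.example.MyJob /path/to/my-job.jar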

As for the containers being killed repeatedly, we plan to set jobmanager.memory.enable-jvm-direct-memory-limit = true to avoid exceeding the memory limit. Does anyone know whether this parameter actually helps prevent the container from being killed for exceeding its memory limit?
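
For context, this is the setting we intend to try, shown as it would appear in flink-conf.yaml (a sketch; the larger jvm-overhead values are only an illustrative alternative, not something we have verified):

# Enforces -XX:MaxDirectMemorySize on the JobManager JVM; the intent is that direct-memory
# overuse surfaces as an in-JVM OutOfMemoryError rather than a YARN container kill.
jobmanager.memory.enable-jvm-direct-memory-limit: true

# Alternative: give the JVM more native-memory headroom within the same process size.
jobmanager.memory.jvm-overhead.min: 384mb
jobmanager.memory.jvm-overhead.max: 384mb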

Error Exception

Failing this attempt. Diagnostics: [2023-08-22 08:49:10.443] Container [pid=77475, containerID=container_e08_1683881703260_1165_01_000001] is running 9510912B beyond the 'PHYSICAL' memory limit. Current usage: 1.0 GB of 1 GB physical memory used; 3.2 GB of 2.1 GB virtual memory used. Killing container.

Screenshots

No response

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Code of Conduct

wolfboys (Member) commented Sep 2, 2023

Thank you for the detailed feedback. StreamPark is merely a platform for managing and submitting Flink jobs. You can look into the Flink job itself to investigate further.
