
[Bug] JobManager frequent GC causes Yarn container memory overflow #2995

13301891422 opened this issue Aug 30, 2023 · 1 comment

Search before asking

  • I had searched in the issues and found no similar issues.

Java Version

1.8.0_212

Scala Version

2.12.x

StreamPark Version

2.0.0

Flink Version

1.15.4

Deploy Mode

yarn-application

What happened

When I submit a Flink on YARN (yarn-application mode) job using StreamPark, the JobManager's memory parameters look like this:

jobmanager.memory.heap.size 469762048b
jobmanager.memory.jvm-metaspace.size 268435456b
jobmanager.memory.jvm-overhead.max 201326592b
jobmanager.memory.jvm-overhead.min 201326592b
jobmanager.memory.off-heap.size 134217728b
jobmanager.memory.process.size 1024mb
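
For reference, these values follow Flink's JobManager memory model, in which the process size is the sum of heap, off-heap, metaspace, and JVM overhead. Converting the byte values above to MiB as a sanity check:

jobmanager.memory.heap.size            448 MiB
jobmanager.memory.off-heap.size        128 MiB
jobmanager.memory.jvm-metaspace.size   256 MiB
jobmanager.memory.jvm-overhead.min/max 192 MiB
jobmanager.memory.process.size        1024 MiB (total)

So the YARN container is sized at exactly 1 GB, matching the "1.0 GB of 1 GB physical memory used" in the diagnostics below; any native or direct memory used beyond these budgets would push the container over its limit.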

After the job runs for a period of time (about 3 to 20 days), the container running the JobManager is always killed by the ResourceManager. I then enabled GC logging for the JobManager and found that the JobManager process performs a young-generation GC roughly every 2 minutes, as follows:

2023-08-30T13:56:57.694+0800: [GC (Allocation Failure) [PSYoungGen: 149956K->1673K(150528K)] 315127K->166876K(456704K), 0.0138514 secs] [Times: user=0.54 sys=0.05, real=0.02 secs]
2023-08-30T13:59:17.558+0800: [GC (Allocation Failure) [PSYoungGen: 150141K->1636K(150528K)] 315344K->166871K(456704K), 0.0285263 secs] [Times: user=1.20 sys=0.11, real=0.03 secs]
...
2023-08-30T14:47:54.412+0800: [GC (Allocation Failure) [PSYoungGen: 148425K->1700K(150016K)] 314796K->168135K(456192K), 0.0258613 secs] [Times: user=0.96 sys=0.06, real=0.03 secs]
2023-08-30T14:50:12.434+0800: [GC (Allocation Failure) [PSYoungGen: 149138K->1156K(150016K)] 315573K->167607K(456192K), 0.0233593 secs] [Times: user=0.77 sys=0.07, real=0.03 secs]
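
The log above is in the standard JDK 8 -XX:+PrintGCDetails format; one way to enable it for the JobManager is via Flink's env.java.opts.jobmanager option (a minimal sketch, the log path is just a placeholder):

env.java.opts.jobmanager: "-XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:/path/to/jobmanager-gc.log"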

To understand the cause of the JobManager's frequent GC, I dumped the objects in the JobManager's Java heap to a local file and opened it in VisualVM for analysis. Char[] occupies the largest share of the heap, as shown in the following figure:

[Screenshots: VisualVM heap dump analysis (Snipaste_Heap_Dump, Snipaste_Heap_Dump_2)]
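
A heap dump like this can be taken with standard JDK tooling; a minimal sketch of the commands (jps/jmap from the same JDK 8, with a placeholder dump path):

jps -m                                                  # locate the JobManager (ClusterEntrypoint) PID
jmap -dump:live,format=b,file=/tmp/jm-heap.hprof <pid>  # write a binary heap dump, then open it in VisualVM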

What could cause this? If we submit the job from the command line with $FLINK_HOME/bin/flink run -t yarn-per-job, it does not generate nearly as many Char[] instances, and that JobManager (configured with exactly the same parameters as above) performs a GC only about once every 40 minutes, which seems normal. A sketch of that command-line submission follows.
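
For comparison, the command-line submission looks roughly like this (a sketch; the jar path and entry class are placeholders, and the memory setting is passed explicitly so both runs use identical parameters):

$FLINK_HOME/bin/flink run -t yarn-per-job \
  -Djobmanager.memory.process.size=1024mb \
  -c com.example.MyJob /path/to/my-job.jar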

As for the containers being killed repeatedly, we plan to set jobmanager.memory.enable-jvm-direct-memory-limit = true to avoid exceeding the memory limit. Does anyone know whether this parameter actually helps prevent the container from being killed for exceeding its memory limit?
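
For context, this is the setting we intend to try, shown as it would appear in flink-conf.yaml (a sketch; the larger jvm-overhead values are only an illustrative alternative, not something we have verified):

# Enforces -XX:MaxDirectMemorySize on the JobManager JVM; the intent is that direct-memory
# overuse surfaces as an in-JVM OutOfMemoryError rather than a YARN container kill.
jobmanager.memory.enable-jvm-direct-memory-limit: true

# Alternative: give the JVM more native-memory headroom within the same process size.
jobmanager.memory.jvm-overhead.min: 384mb
jobmanager.memory.jvm-overhead.max: 384mb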

Error Exception

Failing this attempt. Diagnostics: [2023-08-22 08:49:10.443] Container [pid=77475, containerID=container_e08_1683881703260_1165_01_000001] is running 9510912B beyond the 'PHYSICAL' memory limit. Current usage: 1.0 GB of 1 GB physical memory used; 3.2 GB of 2.1 GB virtual memory used. Killing container.

Screenshots

No response

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Code of Conduct

wolfboys (Member) commented Sep 2, 2023

Thank you for the detailed feedback. StreamPark is merely a platform for managing and submitting Flink jobs. You can look into the Flink job itself to investigate further.
