
AWS: OOM during execution #64

Closed
@sparshev

Description

The Fish node was recently killed by the kernel OOM killer while serving AWS workloads. The memory leak seems tied to the AWS driver, but it may be in the Fish logic itself as well.

Steps to Reproduce

  1. Run a Fish node and load it with requests to Allocate/Deallocate AWS instances
  2. Wait: after about a month the process runs out of memory and is killed
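Because the failure takes about a month to appear, it may help to sample the process's resident memory while the load is running and estimate the growth rate early. The following is a minimal sketch, not part of Fish: the helper names are hypothetical, and it assumes a Linux host where `/proc/<pid>/status` exposes a `VmRSS:` line in kB.

```python
def parse_vmrss_kb(status_text):
    """Extract the VmRSS value (kB) from the text of /proc/<pid>/status."""
    for line in status_text.splitlines():
        if line.startswith("VmRSS:"):
            # Line looks like: "VmRSS:\t   1234 kB"
            return int(line.split()[1])
    return None  # kernel threads have no VmRSS line


def read_vmrss_kb(pid):
    """Read the current resident set size (kB) of a running process."""
    with open(f"/proc/{pid}/status") as status:
        return parse_vmrss_kb(status.read())


def growth_kb_per_hour(samples):
    """Average growth between the first and last (unix_seconds, rss_kb) sample."""
    (t0, rss0), (t1, rss1) = samples[0], samples[-1]
    if t1 <= t0:
        return 0.0
    return (rss1 - rss0) * 3600.0 / (t1 - t0)
```

Calling `read_vmrss_kb` with the aquarium-fish PID every few minutes and feeding the collected pairs to `growth_kb_per_hour` gives a rough slope. A steadily positive value under a constant Allocate/Deallocate load would point to a leak long before the host runs out of memory.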

Platform and Version

  • Ubuntu 20.04.6 LTS
  • Aquarium Fish v0.7.1 (231111.070935)

Logs taken while reproducing problem

Fish log:

May 03 15:27:34 host aquarium-fish[2535560]: INFO:        Fish: Start executing Application 12050150-63b1-463f-a0c0-340be13ab1bc NEW
May 03 15:27:34 host aquarium-fish[2535560]: INFO:        Fish: Allocate the resource using the driver aws
May 03 15:27:34 host aquarium-fish[2535560]: INFO:        AWS: Selected security group: sg-0c23fd513c198d0c9 fish-14e08c41499b
May 03 15:27:34 host aquarium-fish[2535560]: INFO:        AWS: Selected snapshot: snap-03ad771fb3259b032 fish-14e08c41499b
May 03 15:27:35 host systemd[1]: aquarium-fish.service: Main process exited, code=killed, status=9/KILL
May 03 15:27:35 host systemd[1]: aquarium-fish.service: Failed with result 'signal'.
May 03 15:27:35 host systemd[1]: aquarium-fish.service: Scheduled restart job, restart counter is at 1.
May 03 15:27:35 host systemd[1]: Stopped Run aquarium-fish node service as unprevileged user.
May 03 15:27:35 host systemd[1]: Started Run aquarium-fish node service as unprevileged user.
May 03 15:27:36 host aquarium-fish[3339182]: INFO:        Aquarium Fish v0.7.1 (231111.070935)
May 03 15:27:36 host aquarium-fish[3339182]: INFO:        Fish init TLS...
May 03 15:27:36 host aquarium-fish[3339182]: INFO:        Fish starting ORM...
...

Dmesg OOM:

[17237754.236655] aquarium-fish invoked oom-killer: gfp_mask=0x100dca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), order=0, oom_score_adj=0
[17237754.236662] CPU: 0 PID: 2535578 Comm: aquarium-fish Tainted: P            E     5.4.0-121-generic #137-Ubuntu
[17237754.236663] Hardware name: OpenStack Foundation OpenStack Nova, BIOS 1.10.2-1.1~u16.04+mcp2 04/01/2014
[17237754.236675] Call Trace:
[17237754.237606]  dump_stack+0x6d/0x8b
[17237754.238009]  dump_header+0x4f/0x1eb
[17237754.238011]  oom_kill_process.cold+0xb/0x10
[17237754.238288]  out_of_memory+0x1cf/0x4d0
[17237754.238426]  __alloc_pages_slowpath+0xd5e/0xe50
[17237754.238670]  ? x2apic_send_IPI_mask+0x13/0x20
[17237754.238672]  __alloc_pages_nodemask+0x2d0/0x320
[17237754.238674]  alloc_pages_vma+0x7f/0x220
[17237754.238677]  wp_page_copy+0x45b/0xa00
[17237754.238678]  do_wp_page+0x94/0x6a0
[17237754.238683]  ? __switch_to_asm+0x34/0x70
[17237754.238684]  ? __switch_to_asm+0x40/0x70
[17237754.238686]  __handle_mm_fault+0x771/0x7a0
[17237754.238687]  handle_mm_fault+0xca/0x200
[17237754.238690]  do_user_addr_fault+0x1f9/0x450
[17237754.238691]  __do_page_fault+0x58/0x90
[17237754.238969]  ? exit_to_usermode_loop+0x8f/0x160
[17237754.238970]  do_page_fault+0x2c/0xe0
[17237754.238972]  do_async_page_fault+0x39/0x70
[17237754.238974]  async_page_fault+0x34/0x40
[17237754.238988] RIP: 0033:0xa035d7
[17237754.238991] Code: 24 18 48 8b 4c 24 20 48 c7 c5 80 00 00 00 f3 0f 6f 00 f3 0f 6f 0b f3 0f 6f 11 f3 0f 6f 1a 66 0f ef c1 66 0f ef c2 66 0f ef c3 <f3> 0f 7f 02 48 83 c0 10 48 83 c3 10 48 83 c1 10 48 83 c2 10 48 83
[17237754.238992] RSP: 002b:000000c000604e80 EFLAGS: 00010202
[17237754.238993] RAX: 000000c3dd110c00 RBX: 000000c3d99dec00 RCX: 000000c000604f18
[17237754.238994] RDX: 000000c3dd111000 RSI: 000000c000604f18 RDI: 000000c000605318
[17237754.238994] RBP: 0000000000000080 R08: 000000c3c07d4000 R09: 000000001c93cc00
[17237754.238995] R10: 00000000000124f4 R11: 000000000000dcc7 R12: 000000c000604f08
[17237754.238995] R13: 0000000000080000 R14: 000000c00060f860 R15: 0000000000000003
[17237754.239068] Mem-Info:
[17237754.239089] active_anon:3638021 inactive_anon:364553 isolated_anon:543
                   active_file:118 inactive_file:187 isolated_file:0
                   unevictable:0 dirty:0 writeback:7 unstable:0
                   slab_reclaimable:13933 slab_unreclaimable:23964
                   mapped:209 shmem:546 pagetables:9136 bounce:0
                   free:33675 free_pcp:19 free_cma:0
[17237754.239092] Node 0 active_anon:14552084kB inactive_anon:1458212kB active_file:472kB inactive_file:748kB unevictable:0kB isolated(anon):2172kB isolated(file):0kB mapped:836kB dirty:0kB writeback:28kB shmem:2184kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 11415552kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
[17237754.239092] Node 0 DMA free:15908kB min:64kB low:80kB high:96kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15992kB managed:15908kB mlocked:0kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
[17237754.239239] lowmem_reserve[]: 0 2991 15993 15993 15993
[17237754.239241] Node 0 DMA32 free:64208kB min:12576kB low:15720kB high:18864kB active_anon:2995624kB inactive_anon:40kB active_file:84kB inactive_file:88kB unevictable:0kB writepending:0kB present:3129192kB managed:3063656kB mlocked:0kB kernel_stack:0kB pagetables:3316kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
[17237754.239243] lowmem_reserve[]: 0 0 13001 13001 13001
[17237754.239244] Node 0 Normal free:54584kB min:54940kB low:68672kB high:82404kB active_anon:11556980kB inactive_anon:1458172kB active_file:388kB inactive_file:464kB unevictable:0kB writepending:28kB present:13631488kB managed:13313812kB mlocked:0kB kernel_stack:4496kB pagetables:33228kB bounce:0kB free_pcp:88kB local_pcp:0kB free_cma:0kB
[17237754.239246] lowmem_reserve[]: 0 0 0 0 0
[17237754.239247] Node 0 DMA: 1*4kB (U) 0*8kB 0*16kB 1*32kB (U) 2*64kB (U) 1*128kB (U) 1*256kB (U) 0*512kB 1*1024kB (U) 1*2048kB (M) 3*4096kB (M) = 15908kB
[17237754.239252] Node 0 DMA32: 73*4kB (UM) 58*8kB (UM) 22*16kB (UM) 14*32kB (UM) 5*64kB (UM) 2*128kB (UM) 0*256kB 1*512kB (M) 2*1024kB (UM) 1*2048kB (U) 14*4096kB (ME) = 64084kB
[17237754.239256] Node 0 Normal: 4563*4kB (UMEH) 2862*8kB (UMEH) 676*16kB (UMEH) 96*32kB (UME) 1*64kB (M) 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 55100kB
[17237754.239280] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[17237754.239280] 2099 total pagecache pages
[17237754.239284] 1219 pages in swap cache
[17237754.239285] Swap cache stats: add 135784, delete 134572, find 10926656/10927674
[17237754.239285] Free swap  = 0kB
[17237754.239286] Total swap = 483800kB
[17237754.239286] 4194168 pages RAM
[17237754.239286] 0 pages HighMem/MovableOnly
[17237754.239287] 95824 pages reserved
[17237754.239290] 0 pages cma reserved
[17237754.239290] 0 pages hwpoisoned
[17237754.239291] Tasks state (memory values in pages):
[17237754.239291] [  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
----8<---->8-----
[17237754.239313] [    452]     0   452     2424       61    53248        3             0 cron
[17237754.239315] [    453]   103   453     1930      242    53248       12          -900 dbus-daemon
[17237754.239319] [    462]     0   462    20473       36    61440       39             0 irqbalance
[17237754.239324] [    463]     0   463     8364     2220    94208       69             0 networkd-dispat
[17237754.239325] [    470]     0   470    12679     1234   126976     4145             0 salt-minion
[17237754.239327] [    474]     0   474     4342      124    73728      113             0 systemd-logind
[17237754.239329] [    500]     0   500     2127        0    49152       30             0 agetty
[17237754.239331] [    522]     0   522     2184       24    49152        4             0 agetty
[17237754.239332] [    535]   109   535     1205       66    45056       13             0 chronyd
[17237754.239334] [    546]   109   546     1172       29    45056       16             0 chronyd
[17237754.239340] [    567]     0   567    58412      309    86016       42             0 polkitd
[17237754.239341] [    645]     0   645    27711     1159   110592      755             0 unattended-upgr
----8<---->8-----
[17237754.239347] [    799]     0   799      622        1    45056       17             0 none
----8<---->8-----
[17237754.239355] [ 500099]     0 500099     4846       91    57344      192         -1000 systemd-udevd
[17237754.239357] [ 500230]   106 500230     2725       13    53248       30             0 uuidd
[17237754.239359] [ 500334]     0 500334   376580     2488   307200      948          -999 containerd
[17237754.239361] [ 500418]     0 500418    59639       74    94208      134             0 accounts-daemon
[17237754.239363] [ 500744]   100 500744     8900       40    81920      202             0 systemd-network
[17237754.239365] [ 500800]   101 500800     6106       43    86016      909             0 systemd-resolve
[17237754.239366] [ 500811]     0 500811   100004      281   790528        1          -250 systemd-journal
[17237754.239368] [ 501322]     0 501322   384968     1578   397312     4592          -500 dockerd
[17237754.239373] [ 507353]     0 507353     3046       52    65536      180         -1000 sshd
[17237754.239375] [ 509726]   104 509726    56079      248    77824      113             0 rsyslogd
[17237754.239377] [2535560]  1002 2535560  4722347  3929746 32657408   102647             0 aquarium-fish
[17237754.239382] [3228804]  1001 3228804    72099    14784   528384      805             0 splunkd
[17237754.239384] [3228834]  1001 3228834    19410     1117   139264      942             0 splunkd
[17237754.239397] [3339170]     0 3339170     4846       30    53248      257             0 systemd-udevd
[17237754.239399] [3339171]     0 3339171     4846       54    53248      233             0 systemd-udevd
[17237754.239400] [3339172]     0 3339172     4846       65    53248      222             0 systemd-udevd
[17237754.239402] [3339173]     0 3339173     4846      120    53248      167             0 systemd-udevd
[17237754.239403] [3339174]     0 3339174     4846       97    53248      190             0 systemd-udevd
[17237754.239405] [3339175]     0 3339175     4846       99    53248      188             0 systemd-udevd
[17237754.239406] [3339176]     0 3339176     4846      134    53248      153             0 systemd-udevd
[17237754.239407] [3339177]     0 3339177     4846        2    53248      286             0 systemd-udevd
[17237754.239411] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/system.slice/aquarium-fish.service,task=aquarium-fish,pid=2535560,uid=1002
[17237754.239476] Out of memory: Killed process 2535560 (aquarium-fish) total-vm:18889388kB, anon-rss:15718984kB, file-rss:0kB, shmem-rss:0kB, UID:1002 pgtables:31892kB oom_score_adj:0
[17237754.493164] oom_reaper: reaped process 2535560 (aquarium-fish), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
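The "Tasks state" table reports memory in pages, while the kill line reports kB, so the two are easy to misread side by side. The sketch below converts the aquarium-fish row and compares it with the host's total RAM. It assumes 4 KiB pages (the usual x86-64 default, not stated in the log), and the function name is made up for illustration.

```python
PAGE_SIZE_KB = 4  # assumption: 4 KiB pages, the x86-64 default

# Values copied from the dmesg output above.
FISH_ROW = ("[17237754.239377] [2535560]  1002 2535560  4722347  3929746 "
            "32657408   102647             0 aquarium-fish")
TOTAL_RAM_PAGES = 4194168  # from the "pages RAM" line


def parse_task_row(line):
    """Parse one row of the OOM 'Tasks state' table into a dict."""
    # Drop the "[timestamp]" prefix, then split off the "[ pid ]" column.
    after_timestamp = line.split("]", 1)[1]
    pid_part, rest = after_timestamp.split("]", 1)
    pid = int(pid_part.strip().lstrip("[").strip())
    # The name is last, so limit the split in case it contains spaces.
    uid, tgid, total_vm, rss, pgtables, swapents, adj, name = rest.split(None, 7)
    return {
        "pid": pid,
        "uid": int(uid),
        "total_vm_kb": int(total_vm) * PAGE_SIZE_KB,
        "rss_kb": int(rss) * PAGE_SIZE_KB,
        "swap_kb": int(swapents) * PAGE_SIZE_KB,
        "oom_score_adj": int(adj),
        "name": name.strip(),
    }


fish = parse_task_row(FISH_ROW)
total_ram_kb = TOTAL_RAM_PAGES * PAGE_SIZE_KB
rss_share = fish["rss_kb"] / total_ram_kb

print(f"{fish['name']}: rss={fish['rss_kb']} kB, "
      f"total_vm={fish['total_vm_kb']} kB, "
      f"share of RAM={rss_share:.1%}")
```

Under the 4 KiB assumption, the converted `rss` and `total_vm` match the `anon-rss` and `total-vm` values on the "Killed process" line, and the single aquarium-fish process accounts for roughly 94% of the host's RAM at the time of the kill.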
