Environment
- Apache Ignite 2.18.0 vs 2.17.0 (identical config and JVM args)
- JDK 21.0.6, -Xmx1G, persistent data region
- Workload: continuous create + destroy of many short-lived REPLICATED persisted caches (each its own cache group, default 1024 partitions). Every create/destroy triggers a topology exchange.
Description
After upgrading from 2.17.0 to 2.18.0 with no configuration change, server nodes exhaust -Xmx1G and hit OutOfMemoryError under a workload that repeatedly creates and destroys short-lived REPLICATED caches. 2.17.0 runs the same workload indefinitely with a stable sawtooth heap; 2.18.0 grows monotonically until OOM.
Heap-dump analysis (Eclipse MAT, merge-shortest-paths-to-GC-roots) shows ~930 MB retained through the exchange-history path:
exchange-worker thread
-> GridCachePartitionExchangeManager.exchFuts (ExchangeFutureSet)
-> ~596 GridDhtPartitionsExchangeFuture
-> finishState (FinishState)
-> msg (GridDhtPartitionsFullMessage)
-> partsSizes : HashMap<Integer, IntLongMap>
-> IntLongMap (34,033 instances, ~893 MB)
IntLongMap (formerly PartitionSizesMap) implements Message but is backed by a boxed HashMap<Integer, Long>. Retaining one full message per historical exchange therefore pins a fully-exploded, boxed per-partition map for every retained topology version, producing ~17.4M HashMap$Node + ~13M Integer in the dump.
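As a rough cross-check, the dump figures quoted above can be turned into per-object averages. The sketch below only divides the reported numbers. The class and method names are hypothetical, and it assumes 1 MB = 10^6 bytes.

```java
/** Hypothetical helper: averages derived from the heap-dump figures in this report. */
final class DumpAverages {
    // Figures as reported above; MB is assumed to mean 10^6 bytes.
    static final long INT_LONG_MAP_BYTES = 893_000_000L;
    static final long INT_LONG_MAP_INSTANCES = 34_033L;
    static final long RETAINED_FUTURES = 596L;

    /** Average retained bytes per IntLongMap instance. */
    static double bytesPerMap() {
        return (double) INT_LONG_MAP_BYTES / INT_LONG_MAP_INSTANCES;
    }

    /** Average number of IntLongMap instances pinned by one retained exchange future. */
    static double mapsPerFuture() {
        return (double) INT_LONG_MAP_INSTANCES / RETAINED_FUTURES;
    }

    /** Average IntLongMap bytes pinned by one retained exchange future. */
    static double bytesPerFuture() {
        return (double) INT_LONG_MAP_BYTES / RETAINED_FUTURES;
    }
}
```

With these inputs the averages come out to roughly 26 KB per IntLongMap, about 57 maps per retained future, and about 1.5 MB of IntLongMap data per retained future.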
Suspected cause (regression introduced in 2.18.0)
The message serialization rework (umbrella IGNITE-25490, "new ser/der scheme") converted the partition sub-maps of GridDhtPartitionsFullMessage from @GridDirectTransient live objects paired with a compact serialized byte[] twin (which was GC-able after marshalling) into first-class @Order Message fields whose live boxed object is now the permanently retained state:
- IGNITE-26517 (92074fd): renamed PartitionSizesMap → IntLongMap and made partsSizes an @Order(11) field. This owns the ~893 MB. Primary contributor.
- IGNITE-26839 (f935f99): partCntrs promoted to an @Order(12) Message field. Secondary (~21 MB long[] in this dump).
Combined with the retained exchange history (IGNITE_EXCHANGE_HISTORY_SIZE, default 1000; ~596 futures live at OOM), the node now keeps the fully-expanded boxed maps for every historical topology version instead of a few KB of serialized bytes as in 2.17.0.
Steps to reproduce
- Start a node with a persistent data region and -Xmx1G.
- In a loop, create then destroy REPLICATED persisted caches, each in its own cache group with 1024 partitions.
- Observe heap growth; take a heap dump and inspect the retained size of GridDhtPartitionsExchangeFuture → finishState.msg.partsSizes.
Expected vs actual
- Expected: Retained heap of exchange history stays bounded and small (as in 2.17.0).
- Actual: Retained heap grows with topology-version count × groups × partitions until OOM.
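The growth described under "Actual" can be written as a small upper-bound estimate. This is a sketch, not Ignite code: the class and parameter names are hypothetical, and it assumes every retained full message carries one boxed entry per partition for every cache group, with the number of retained futures capped by the exchange history size.

```java
/** Hypothetical estimator for boxed map entries pinned by exchange history. */
final class RetainedHistoryEstimate {
    /**
     * Upper bound on retained entries: retained futures x groups x partitions.
     * Retained futures are capped at historySize (IGNITE_EXCHANGE_HISTORY_SIZE).
     */
    static long retainedEntries(long exchanges, int historySize, int groups, int partitions) {
        long retainedFutures = Math.min(exchanges, (long) historySize);
        return retainedFutures * groups * partitions;
    }
}
```

For illustrative inputs (not taken from the dump) of 600 exchanges, a history size of 1000, 50 groups and 1024 partitions, the estimate is 30,720,000 entries. Once the exchange count passes the history size, the estimate stops growing, which is why the cap matters for the workaround.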
Suggested fixes (any of)
- Clear/null partsSizes and partCntrs on GridDhtPartitionsFullMessage once it is stored into FinishState for history (they are not needed after the exchange completes).
- Back IntLongMap with primitive storage (e.g. an int→long primitive map) instead of boxed HashMap<Integer, Long>.
- Retain only a compact serialized form on historical futures, rehydrating on demand (restore the 2.17.0 behavior).
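To illustrate the primitive-storage option, here is a minimal sketch of an immutable int→long map backed by two parallel arrays. CompactIntLongMap is a hypothetical class, not Ignite's IntLongMap, and it ignores the Message serialization contract. It only shows how the per-entry HashMap$Node, Integer and Long objects could be avoided.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical compact int -> long map; illustrative only, not Ignite's IntLongMap. */
final class CompactIntLongMap {
    private final int[] keys;  // sorted ascending
    private final long[] vals; // vals[i] belongs to keys[i]

    private CompactIntLongMap(int[] keys, long[] vals) {
        this.keys = keys;
        this.vals = vals;
    }

    /** Copies a boxed map (no null keys or values) into primitive arrays sorted by key. */
    static CompactIntLongMap copyOf(Map<Integer, Long> src) {
        TreeMap<Integer, Long> sorted = new TreeMap<>(src);
        int[] k = new int[sorted.size()];
        long[] v = new long[sorted.size()];
        int i = 0;
        for (Map.Entry<Integer, Long> e : sorted.entrySet()) {
            k[i] = e.getKey();
            v[i] = e.getValue();
            i++;
        }
        return new CompactIntLongMap(k, v);
    }

    /** Binary search over the sorted keys; returns dflt when the key is absent. */
    long getOrDefault(int key, long dflt) {
        int idx = Arrays.binarySearch(keys, key);
        return idx >= 0 ? vals[idx] : dflt;
    }

    int size() {
        return keys.length;
    }

    /** Payload bytes of the two arrays, excluding array headers. */
    long payloadBytes() {
        return (long) keys.length * (Integer.BYTES + Long.BYTES);
    }
}
```

For 1024 partitions the payload is 12 bytes per entry, about 12 KB per map plus two array headers. Because partition IDs are dense, a plain long[] indexed by partition ID would be a simpler alternative if every partition is always present.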
Workaround
Setting -DIGNITE_EXCHANGE_HISTORY_SIZE=100 -DIGNITE_AFFINITY_HISTORY_SIZE=50 caps the retained history (bounding the leak to roughly 1/6 of what the ~596 futures retained in this dump), and reducing the partition count shrinks each IntLongMap.
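For embedded nodes, the same workaround could be applied programmatically. This is a sketch under assumptions: the helper class is hypothetical, and it assumes the properties are set before the Ignite node starts, since they are passed above as JVM system properties. The second method shows where the ~1/6 figure comes from.

```java
/** Hypothetical helper for applying the history-size workaround in embedded mode. */
final class ExchangeHistoryWorkaround {
    /** Sets both history caps as JVM system properties; call before starting the node. */
    static void apply(int exchangeHistory, int affinityHistory) {
        System.setProperty("IGNITE_EXCHANGE_HISTORY_SIZE", Integer.toString(exchangeHistory));
        System.setProperty("IGNITE_AFFINITY_HISTORY_SIZE", Integer.toString(affinityHistory));
    }

    /** Fraction of the observed retained futures that remains after capping the history. */
    static double retainedFraction(int cappedHistory, int observedFutures) {
        return Math.min(1.0, (double) cappedHistory / observedFutures);
    }
}
```

With a cap of 100 against the ~596 futures observed at OOM, the retained fraction is about 0.17, which matches the ~1/6 estimate. This only bounds the retained heap and does not remove the per-future cost.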