
Excessive memory usage and out-of-memory exception #502

Closed
FreakyBytes opened this issue Oct 28, 2019 · 12 comments

Comments

@FreakyBytes

Hi, we're currently evaluating Tile38 as an in-memory geo-index for monitoring a worldwide fleet.

We ingest ~1500 events/min, most of them updates to existing objects.
Objects are stored as GeoJSON points with some additional JSON attributes and 6 fields, with a tile38_avg_point_size of 22'242.
Inserts happen in batches of 500 events, and all objects have a TTL of 24h.

This setup runs fine with ~100'000 to 130'000 objects in the database, consuming 2-5 GB of RAM,
but the server then crashes after about 26h with an out-of-memory exception.
The machine running the server is a 32 GB Hetzner VM, so the non-linear memory consumption seems a bit excessive.
I also tried the collection-optz branch, which worked slightly better but still crashed.

Are there any limitations I might not be aware of? Could the problem be related to expiring objects?
Would it be possible to add a fail-safe mechanism when the memory limit is reached, so it does not crash the entire server?

@tidwall
Owner

tidwall commented Oct 28, 2019

Hi,

Thanks for reporting this issue.

It's pretty standard to use Tile38 in the way you describe, but I generally see TTLs in the 30-second to 1-hour range, which makes me think the longer 24-hour TTL might be the problem.

I'm going to do some testing on my side, but I want to make sure I use GeoJSON objects that are similar to yours. Could you share an example object?

@FreakyBytes
Author

Hi, really appreciate the fast response :)

The points we store usually look like this:

{
  "fields": ["spd", "mov", "ts", "hdg", "type", "length"],
  "objects": [
    {
      "id": "v13070664",
      "object": {
        "type": "Point",
        "coordinates": [17.950000762939453, 62.641666412353516],
        "name": "SVK05 EJDERN",
        "dim": [12, 8, 3, 1]
      },
      "fields": [0, 0, 1572271144, 115, 3, 20]
    },
    {
      "id": "v3570833",
      "object": {
        "type": "Point",
        "coordinates": [17.950000762939453, 62.63999938964844],
        "dim": [null, null, null, null]
      },
      "fields": [0, 0, 1572269281, 11, 0, 0]
    },
    {
      "id": "v2020934",
      "object": {
        "type": "Point",
        "coordinates": [17.93959617614746, 62.6373405456543],
        "name": "RESCUE SXK VASTKUST",
        "dim": [7, 1, 2, 1]
      },
      "fields": [0, 0, 1572241763, 183, 3, 8]
    }
  ]
}

Here are the commands to insert those points:


SET vessel v13070664 EX 86400 FIELD spd 0 FIELD mov 0 FIELD ts 1572271144 FIELD hdg 115 FIELD type 3 FIELD length 20 OBJECT {"type": "Point", "coordinates": [17.950000762939453, 62.641666412353516], "name": "SVK05 EJDERN", "dim": [12, 8, 3, 1]}
SET vessel v3570833 EX 86400 FIELD spd 0 FIELD mov 0 FIELD ts 1572269281 FIELD hdg 11 FIELD type 0 FIELD length 0 OBJECT {"type": "Point", "coordinates": [17.950000762939453, 62.63999938964844], "dim": [null, null, null, null]}
SET vessel v2020934 EX 86400 FIELD spd 0 FIELD mov 0 FIELD ts 1572241763 FIELD hdg 183 FIELD type 3 FIELD length 8 OBJECT {"type": "Point", "coordinates": [17.93959617614746, 62.6373405456543], "name": "RESCUE SXK VASTKUST", "dim": [7, 1, 2, 1]}
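
As a concrete illustration of the batched inserts described above, here is a minimal sketch in Go using the redigo client (github.com/gomodule/redigo/redis). Tile38 speaks the Redis protocol, so the whole batch can be pipelined in one round trip; the address, the helper type, and the one-vessel batch are illustrative assumptions, not code from the actual ingester:

// Hypothetical batching helper, sketched with the redigo client.
// Queue every SET with Send, flush once, then read all replies back:
// one network round trip per batch.
package main

import (
	"log"

	"github.com/gomodule/redigo/redis"
)

type vessel struct {
	id, object                       string
	spd, mov, ts, hdg, vtype, length int
}

func main() {
	// 9851 is Tile38's default port.
	conn, err := redis.Dial("tcp", "localhost:9851")
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	// One example object from the report; a real batch holds up to 500.
	batch := []vessel{{
		id: "v13070664", spd: 0, mov: 0, ts: 1572271144, hdg: 115, vtype: 3, length: 20,
		object: `{"type":"Point","coordinates":[17.950000762939453,62.641666412353516],"name":"SVK05 EJDERN","dim":[12,8,3,1]}`,
	}}

	for _, v := range batch {
		conn.Send("SET", "vessel", v.id, "EX", 86400,
			"FIELD", "spd", v.spd, "FIELD", "mov", v.mov, "FIELD", "ts", v.ts,
			"FIELD", "hdg", v.hdg, "FIELD", "type", v.vtype, "FIELD", "length", v.length,
			"OBJECT", v.object)
	}
	if err := conn.Flush(); err != nil {
		log.Fatal(err)
	}
	for range batch {
		if _, err := conn.Receive(); err != nil {
			log.Fatal(err)
		}
	}
}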

Regarding the TTL length: I ran a small trial with a TTL of 1 hour, which ran smoothly over the weekend.
Currently, I'm recording some more metrics over the course of the day.

@tidwall
Owner

tidwall commented Oct 28, 2019

The fact that the 1 hour ran smoothly is a good hint. Thanks for sharing the geojson. I'll goof around and see what I can dig up.

@tidwall
Owner

tidwall commented Oct 29, 2019

I’ve been able to reproduce the issue and it’s absolutely related to long TTLs. I know the cause and plan to have a fix out in the next day or so. I’ll keep you posted.

@FreakyBytes
Author

You're awesome! Thanks! :)

@tidwall
Owner

tidwall commented Oct 29, 2019

I just pushed a fix.

tidwall added a commit that referenced this issue Oct 29, 2019
This commit fixes an issue where Tile38 was using lots of extra
memory to track objects that are marked to expire. This was
creating problems for applications that set big TTLs.

How it worked before:

Every collection had a unique hashmap that stored expiration
timestamps for every object in that collection. Along with
the hashmaps, there was also one big server-wide list that was
appended to every time a new SET+EX was performed.

A background routine looped over this list at least 10 times
per second, randomly searching for candidates that might need
expiring. The routine removed those entries from the list and
tested whether the matching objects had actually expired; if
so, the objects were deleted from the database. When at least
25% of the 20 candidates were deleted, the loop continued
immediately; otherwise it backed off with a 100ms pause.

Why this was a problem:

The list grew by one entry for every SET+EX. When TTLs were
long, 24 hours or more, it took at least that much time before
an entry was removed. So for databases whose objects use TTLs
and are updated often, this could lead to a very large list.

How it was fixed:

The list was removed and the hashmap is now searched randomly.
This required a new hashmap implementation, as the built-in Go
map does not provide an operation for randomly getting entries.
The chosen implementation is a robinhood hash because it uses
open addressing, which makes for simple random bucket selection.

Issue #502
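
To make the fix concrete, below is a minimal sketch in Go, not Tile38's actual implementation, of random expiration sampling over an open-addressed table. The 20-candidate, 25%, and 100ms figures are quoted from the commit message above; all names and sizes are illustrative:

// Why open addressing helps: entries live directly in the bucket array,
// so picking a random occupied bucket is a random index plus a short
// forward scan.
package main

import (
	"fmt"
	"math/rand"
	"time"
)

type bucket struct {
	key      string
	expires  time.Time
	occupied bool
}

type expiryMap struct{ buckets []bucket }

// randomOccupied picks a random index and scans forward (wrapping around)
// to the next occupied bucket. The built-in Go map offers no equivalent
// operation, hence the custom robinhood hashmap in the fix.
func (m *expiryMap) randomOccupied() (int, bool) {
	n := len(m.buckets)
	start := rand.Intn(n)
	for i := 0; i < n; i++ {
		if j := (start + i) % n; m.buckets[j].occupied {
			return j, true
		}
	}
	return 0, false
}

// sweep tests 20 random candidates and reports whether the caller should
// loop again immediately (at least 25% expired) or back off for 100ms.
func (m *expiryMap) sweep(now time.Time) (again bool) {
	expired := 0
	for i := 0; i < 20; i++ {
		j, ok := m.randomOccupied()
		if !ok {
			break
		}
		if m.buckets[j].expires.Before(now) {
			// Simplified delete; a real robinhood map must also
			// repair its probe chain.
			m.buckets[j] = bucket{}
			expired++
		}
	}
	return expired >= 5 // 25% of 20 candidates
}

func main() {
	m := &expiryMap{buckets: make([]bucket, 1024)}
	m.buckets[42] = bucket{key: "v13070664", expires: time.Now().Add(-time.Minute), occupied: true}
	for m.sweep(time.Now()) {
		// >= 25% of candidates expired: sweep again immediately.
	}
	// Otherwise a real sweeper would sleep 100ms and repeat.
	fmt.Println("expired entry removed:", !m.buckets[42].occupied)
}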
@tidwall
Owner

tidwall commented Oct 29, 2019

And here's the test program I used to monitor the growing heap.
https://gist.github.com/tidwall/1505193d06de4ecd347da912f6c1860e
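
The gist is the authoritative version; for a sense of shape, here is a rough sketch of the monitoring half, assuming Tile38 listens on its default port 9851 and that the SERVER command's JSON reply carries heap_size under stats:

// Poll Tile38's SERVER stats once per second and log the reported heap.
package main

import (
	"encoding/json"
	"fmt"
	"log"
	"time"

	"github.com/gomodule/redigo/redis"
)

func main() {
	conn, err := redis.Dial("tcp", "localhost:9851")
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	// Switch the connection to JSON replies so the stats are easy to decode.
	if _, err := conn.Do("OUTPUT", "json"); err != nil {
		log.Fatal(err)
	}

	for range time.Tick(time.Second) {
		raw, err := redis.Bytes(conn.Do("SERVER"))
		if err != nil {
			log.Fatal(err)
		}
		var reply struct {
			Stats struct {
				HeapSize int64 `json:"heap_size"`
			} `json:"stats"`
		}
		if err := json.Unmarshal(raw, &reply); err != nil {
			log.Fatal(err)
		}
		fmt.Printf("heap: %.2f GB\n", float64(reply.Stats.HeapSize)/(1<<30))
	}
}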

@FreakyBytes
Author

This looks really awesome!
My test node has now been running for 4h with 140'000 points, using only 0.19 GB of heap memory, compared to 19 GB with a similar number of points previously.

I'll let it run for some time and will report back on Friday. But your fix looks excellent at the moment. Thanks!

[screenshot: memory usage graph]
These are the results so far. The dip marks the point where I restarted and deployed your fix.

@tidwall
Owner

tidwall commented Oct 30, 2019

That's great news!

@FreakyBytes
Author

Alright, sorry for being a day late. The results so far are:

  • Memory consumption is far down (around 320 MB for 183'000 objects, compared to multiple GB before)
  • There is a slight increase in memory consumption over time (on the order of ~15 MB over 24h)
  • No noticeable or measurable change in insert or query time
  • Object expiration works and keeps the number of objects relatively stable (subject only to natural fluctuation)
  • The test has now been running for 59h with no sign of the earlier memory leak :)

Here are some graphs from the monitor's output:
[three screenshots of the monitor's graphs, captured 2019-11-02]

In summary: Thank you so much for Tile38 and the fix :)
To me it looks perfect and seems like it fixes the original issue.

@tidwall
Owner

tidwall commented Nov 2, 2019

I'm happy to help. I think everything looks in order now. I'll release a new version shortly.

@tidwall
Owner

tidwall commented Nov 2, 2019

I just released a new version that includes this fix. 🎉

@tidwall tidwall closed this as completed Nov 2, 2019