
Excessive memory usage and out-of-memory exception #502

Closed
FreakyBytes opened this issue Oct 28, 2019 · 12 comments

Comments

@FreakyBytes

Hi, we're currently evaluating Tile38 as an in-memory geo-index for monitoring a worldwide fleet.

We ingest ~1500 events/min, most of them updates to existing objects.
Objects are stored as GeoJSON points with some additional JSON attributes and 6 fields, with a tile38_avg_point_size of 22'242.
Inserts happen in batches of 500 events, and all objects have a TTL of 24h.

This setup runs fine with ~100'000 to 130'000 objects in the database, consuming 2-5 GB of RAM,
but the server then crashes after about 26h with an out-of-memory exception.
The machine running the server is a 32 GB Hetzner VM, so the non-linear memory consumption seems a bit excessive.
I also tried the collection-optz branch, which worked slightly better but still crashed.

Are there any limitations I might not be aware of? Could the problem be related to expiring objects?
Would it be possible to add a fail-safe mechanism when the memory limit is reached, so it does not crash the entire server?

@tidwall
Owner

tidwall commented Oct 28, 2019

Hi,

Thanks for reporting this issue.

It's pretty standard to use Tile38 in the way you describe, but I generally see TTLs in the 30-second to 1-hour range, which makes me think the longer 24-hour TTL might be the problem.

I'm going to do some testing on my side, but I want to make sure I use GeoJSON objects that are similar to yours. Could you share an example object?

@FreakyBytes
Author

Hi, really appreciate the fast response :)

The points we store usually look like this:

{
  "fields": ["spd", "mov", "ts", "hdg", "type", "length"],
  "objects": [
    {
      "id": "v13070664",
      "object": {
        "type": "Point",
        "coordinates": [17.950000762939453, 62.641666412353516],
        "name": "SVK05 EJDERN",
        "dim": [12, 8, 3, 1]
      },
      "fields": [0, 0, 1572271144, 115, 3, 20]
    },
    {
      "id": "v3570833",
      "object": {
        "type": "Point",
        "coordinates": [17.950000762939453, 62.63999938964844],
        "dim": [null, null, null, null]
      },
      "fields": [0, 0, 1572269281, 11, 0, 0]
    },
    {
      "id": "v2020934",
      "object": {
        "type": "Point",
        "coordinates": [17.93959617614746, 62.6373405456543],
        "name": "RESCUE SXK VASTKUST",
        "dim": [7, 1, 2, 1]
      },
      "fields": [0, 0, 1572241763, 183, 3, 8]
    }
  ]
}

Here are the commands to insert those points:


SET vessel v13070664 EX 86400 FIELD spd 0 FIELD mov 0 FIELD ts 1572271144 FIELD hdg 115 FIELD type 3 FIELD length 20 OBJECT {"type": "Point", "coordinates": [17.950000762939453, 62.641666412353516], "name": "SVK05 EJDERN", "dim": [12, 8, 3, 1]}
SET vessel v3570833 EX 86400 FIELD spd 0 FIELD mov 0 FIELD ts 1572269281 FIELD hdg 11 FIELD type 0 FIELD length 0 OBJECT {"type": "Point", "coordinates": [17.950000762939453, 62.63999938964844], "dim": [null, null, null, null]}
SET vessel v2020934 EX 86400 FIELD spd 0 FIELD mov 0 FIELD ts 1572241763 FIELD hdg 183 FIELD type 3 FIELD length 8 OBJECT {"type": "Point", "coordinates": [17.93959617614746, 62.6373405456543], "name": "RESCUE SXK VASTKUST", "dim": [7, 1, 2, 1]}
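
As a concrete illustration of the batched inserts described above, here is a minimal sketch in Go using the redigo client (github.com/gomodule/redigo/redis). Tile38 speaks the Redis protocol, so the whole batch can be pipelined in one round trip; the address, the helper type, and the one-vessel batch are illustrative assumptions, not code from the actual ingester:

// Hypothetical batching helper, sketched with the redigo client.
// Queue every SET with Send, flush once, then read all replies back:
// one network round trip per batch.
package main

import (
	"log"

	"github.com/gomodule/redigo/redis"
)

type vessel struct {
	id, object                       string
	spd, mov, ts, hdg, vtype, length int
}

func main() {
	// 9851 is Tile38's default port.
	conn, err := redis.Dial("tcp", "localhost:9851")
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	// One example object from the report; a real batch holds up to 500.
	batch := []vessel{{
		id: "v13070664", spd: 0, mov: 0, ts: 1572271144, hdg: 115, vtype: 3, length: 20,
		object: `{"type":"Point","coordinates":[17.950000762939453,62.641666412353516],"name":"SVK05 EJDERN","dim":[12,8,3,1]}`,
	}}

	for _, v := range batch {
		conn.Send("SET", "vessel", v.id, "EX", 86400,
			"FIELD", "spd", v.spd, "FIELD", "mov", v.mov, "FIELD", "ts", v.ts,
			"FIELD", "hdg", v.hdg, "FIELD", "type", v.vtype, "FIELD", "length", v.length,
			"OBJECT", v.object)
	}
	if err := conn.Flush(); err != nil {
		log.Fatal(err)
	}
	for range batch {
		if _, err := conn.Receive(); err != nil {
			log.Fatal(err)
		}
	}
}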

Regarding the TTL length: I ran a small trial with a TTL of 1 hour, which ran smoothly over the weekend.
Currently, I'm recording some more metrics over the course of the day.

@tidwall
Owner

tidwall commented Oct 28, 2019

The fact that the 1 hour ran smoothly is a good hint. Thanks for sharing the geojson. I'll goof around and see what I can dig up.

@tidwall
Owner

tidwall commented Oct 29, 2019

I’ve been able to reproduce the issue and it’s absolutely related to long TTLs. I know the cause and plan to have a fix out in the next day or so. I’ll keep you posted.

@FreakyBytes
Author

You're awesome! Thanks! :)

@tidwall
Owner

tidwall commented Oct 29, 2019

I just pushed a fix.

tidwall added a commit that referenced this issue Oct 29, 2019
This commit fixes an issue where Tile38 was using lots of extra
memory to track objects that are marked to expire. This was
creating problems for applications that set big TTLs.

How it worked before:

Every collection had a unique hashmap that stored expiration
timestamps for every object in that collection. Along with
the hashmaps, there was also one big server-wide list that was
appended to every time a new SET+EX was performed.

A background routine looped over this list at least 10 times
per second, randomly searching for candidates that might need
expiring. The routine removed those entries from the list and
tested whether the matching objects had actually expired; if
so, the objects were deleted from the database. When at least
25% of the 20 candidates were deleted, the loop continued
immediately; otherwise it backed off with a 100ms pause.

Why this was a problem:

The list grew by one entry for every SET+EX. When TTLs were
long, 24 hours or more, it took at least that much time before
an entry was removed. So for databases whose objects use TTLs
and are updated often, this could lead to a very large list.

How it was fixed:

The list was removed and the hashmap is now searched randomly.
This required a new hashmap implementation, as the built-in Go
map does not provide an operation for randomly getting entries.
The chosen implementation is a robinhood hash because it uses
open addressing, which makes for simple random bucket selection.

Issue #502
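
To make the fix concrete, below is a minimal sketch in Go, not Tile38's actual implementation, of random expiration sampling over an open-addressed table. The 20-candidate, 25%, and 100ms figures are quoted from the commit message above; all names and sizes are illustrative:

// Why open addressing helps: entries live directly in the bucket array,
// so picking a random occupied bucket is a random index plus a short
// forward scan.
package main

import (
	"fmt"
	"math/rand"
	"time"
)

type bucket struct {
	key      string
	expires  time.Time
	occupied bool
}

type expiryMap struct{ buckets []bucket }

// randomOccupied picks a random index and scans forward (wrapping around)
// to the next occupied bucket. The built-in Go map offers no equivalent
// operation, hence the custom robinhood hashmap in the fix.
func (m *expiryMap) randomOccupied() (int, bool) {
	n := len(m.buckets)
	start := rand.Intn(n)
	for i := 0; i < n; i++ {
		if j := (start + i) % n; m.buckets[j].occupied {
			return j, true
		}
	}
	return 0, false
}

// sweep tests 20 random candidates and reports whether the caller should
// loop again immediately (at least 25% expired) or back off for 100ms.
func (m *expiryMap) sweep(now time.Time) (again bool) {
	expired := 0
	for i := 0; i < 20; i++ {
		j, ok := m.randomOccupied()
		if !ok {
			break
		}
		if m.buckets[j].expires.Before(now) {
			// Simplified delete; a real robinhood map must also
			// repair its probe chain.
			m.buckets[j] = bucket{}
			expired++
		}
	}
	return expired >= 5 // 25% of 20 candidates
}

func main() {
	m := &expiryMap{buckets: make([]bucket, 1024)}
	m.buckets[42] = bucket{key: "v13070664", expires: time.Now().Add(-time.Minute), occupied: true}
	for m.sweep(time.Now()) {
		// >= 25% of candidates expired: sweep again immediately.
	}
	// Otherwise a real sweeper would sleep 100ms and repeat.
	fmt.Println("expired entry removed:", !m.buckets[42].occupied)
}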
@tidwall
Owner

tidwall commented Oct 29, 2019

And here's the test program I used to monitor the growing heap.
https://gist.github.com/tidwall/1505193d06de4ecd347da912f6c1860e
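
The gist is the authoritative version; for a sense of shape, here is a rough sketch of the monitoring half, assuming Tile38 listens on its default port 9851 and that the SERVER command's JSON reply carries heap_size under stats:

// Poll Tile38's SERVER stats once per second and log the reported heap.
package main

import (
	"encoding/json"
	"fmt"
	"log"
	"time"

	"github.com/gomodule/redigo/redis"
)

func main() {
	conn, err := redis.Dial("tcp", "localhost:9851")
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	// Switch the connection to JSON replies so the stats are easy to decode.
	if _, err := conn.Do("OUTPUT", "json"); err != nil {
		log.Fatal(err)
	}

	for range time.Tick(time.Second) {
		raw, err := redis.Bytes(conn.Do("SERVER"))
		if err != nil {
			log.Fatal(err)
		}
		var reply struct {
			Stats struct {
				HeapSize int64 `json:"heap_size"`
			} `json:"stats"`
		}
		if err := json.Unmarshal(raw, &reply); err != nil {
			log.Fatal(err)
		}
		fmt.Printf("heap: %.2f GB\n", float64(reply.Stats.HeapSize)/(1<<30))
	}
}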

@FreakyBytes
Author

This looks really awesome!
My test node has now been running for 4h with 140'000 points, using only 0.19 GB of heap memory, compared to 19 GB with a similar number of points previously.

I'll let it run for some time and will report back on Friday. But your fix looks excellent at the moment. Thanks!

[screenshot: memory usage graph]
These are the results so far. The dip marks the point where I restarted and deployed your fix.

@tidwall
Owner

tidwall commented Oct 30, 2019

That's great news!

@FreakyBytes
Author

Alright, sorry for being a day late. The results so far are:

  • Memory consumption is far down (around 320 MB for 183'000 objects, compared to multiple GB before)
  • There is a slight increase in memory consumption over time (on the order of ~15 MB over 24h)
  • No noticeable or measurable change in insert or query time
  • Object expiration works and keeps the number of objects relatively stable (subject only to natural fluctuation)
  • The test has now been running for 59h with no sign of the earlier memory leak :)

Here are some graphs from the monitor's output:
[three screenshots of the monitor's graphs, captured 2019-11-02]

In summary: Thank you so much for Tile38 and the fix :)
To me it looks perfect and seems like it fixes the original issue.

@tidwall
Owner

tidwall commented Nov 2, 2019

I'm happy to help. I think everything looks in order now. I'll release a new version shortly.

@tidwall
Owner

tidwall commented Nov 2, 2019

I just released a new version that includes this fix. 🎉

@tidwall tidwall closed this as completed Nov 2, 2019