-
Notifications
You must be signed in to change notification settings - Fork 801
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
merge history db file into a common db file to avoid db file booming #518
base: master
Are you sure you want to change the base?
Conversation
…ming Signed-off-by: lixiaoyu <lixiaoyu2@taou.com>
I don't believe this is safe, nothing stops something else with that pid coming along while this is running. In addition there's nothing stopping this being run concurrently. |
|
#443 this issue has the same problem with me. I used multi process mode a few days and I found the endpoint for scrape latency was becoming more and more slow, so I checked the prometheus_multiproc_dir, and found there were 1000+ db files, because sometimes the worker would exit, so the number of db files increased dramatically |
We can't make any assumptions about how this code is used, and given it's in a multi-process environment it must avoid any race conditions. How would this work for gauges? |
As you can see in the code, we use the same method to merge metrics as they are being scraped. I have use this commit in my production environment, and it works well, the latency of scrape endpoint has been reduced significantly. We can use file lock to avoid race condition. |
We should point out that this code should only be used in the multiprocess mode |
I have added process lock code which is from django.core.files.locks |
MmapedDict(merge_file).close() | ||
|
||
# do merge, here we use the same method to merge | ||
metrics = MultiProcessCollector.merge(files + merge_files, accumulate=False) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@brian-brazil here I use the same method to merge metrics
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe this will cause deadlock: mark_process_dead
is already holding lock and calling merge
will try to acquire the same lock again.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't believe it will -- it looks like lock acquisition via flock
is reentrant safe, but unlocking via flock
isn't, so I think this will prematurely release the lock after the merge
call here, at least on POSIX systems. I don't know what the locking behaviour on NT will be.
Signed-off-by: lixiaoyu <lixiaoyu2@taou.com>
What happens if a process exists, sets a gauge, exits, and a new process comes along with the same pid? |
for s in metric.samples:
name, labels, value, timestamp, exemplar = s
if metric.type == 'gauge':
without_pid_key = (name, tuple([l for l in labels if l[0] != 'pid']))
if metric._multiprocess_mode == 'min':
current = samples_setdefault(without_pid_key, value)
if value < current:
samples[without_pid_key] = value
elif metric._multiprocess_mode == 'max':
current = samples_setdefault(without_pid_key, value)
if value > current:
samples[without_pid_key] = value
elif metric._multiprocess_mode == 'livesum':
samples[without_pid_key] += value
else: # all/liveall
samples[(name, labels)] = value
|
It's a certainty though at scale. |
Yes,Gauge is really a problem in multi process mode, developer should think it carefully. Do you have a good idea to reduce the latency in multiprocess mode when the nums of db files increase? |
Currently the main approach is avoiding creating lots of files in the first place. |
I'm also hitting this issue (uWSGI restarted often, resulting in lots of files which cause /metrics endpoints to become too slow). I'm interested and willing to help to see this PR being merged. With the exception of test that fail due to:
What block this PR ?
|
That's not the case, mark_process_dead runs in a different process. |
I' m using this feature in my production env now and there is no problem at present |
mark_process_dead runs when process is going to exit, and it only do its own merge work, using a lock will make this progress serialized. |
In uWSGI case, mark_process_dead will probably run at process exit (so with own PID). At least it's own I did it (using Python atexit handler). But in Gunicorn, mark_process_dead will be called by master process for another PID. This indeed open a (small) race-condition: if mark_process_dead is run with the PID of a terminated worker AND a new worker start using the same PID. The lock won't help there, since the new worker kept an open file descriptor to the file that mark_process_dead will remove. This should however be quite rare as it require the OS to re-assign the PID of a recently terminated process. But nothing forbid this to happen. |
BTW, this race-condition is not new. Gauge livesum and liveall already had this issue (new worker open the file, then mark_process_dead run by master remove them). Or did I miss something that avoid this race condition which could be reused ? |
I think we can use pid+process_create_time to name the file, this could avoid 2 process with same pid to have same file name |
There's nothing stopping two processes being created and terminated in the same second with the same PID.
True, we should look at fixing that. |
In psutil, the create_time has more precision that in our case (it's a float number of seconds). But I agree that this event look like a very low probability (and in my opinion it's acceptable. maybe the merge should be an opt-in option ?) The best solution I can think of that don't have "low probability" error is to add a new method that must be called by worker itself at exit, not by the master. It will merge files. User may or not register this method if the merge is wanted or not. |
I doubt that our users will have that level of control. This needs to work with all forms of multiprocess setups, including gunicorn and multiprocessing - and handling things like a segfault of a child correctly. |
why gunicorn worker cannot register this method to ateixt? |
Atexit is not reliably generally, plus things might happen before it got registered. |
I still dont understand how to handle the growing number of db files. When is it safe to remove them? The documentation states that the files should be cleared on every deploy which suggests that they can be removed without any special hesitation unless a process is currently writing to them. So what happens if I remove files? Do I risk losing data? Then how come it works to purge the files on a hard restart? Wouldnt an easy solution be to just have some kind of cleanup cron that just removes old files? I feel that the documentation regarding this issue is a bit lacking. |
The documentation says that you should delete them when the process restarts, there's nothing more to it. Anything else is unsafe. This issue is also not the place for usage questions, please take those to the prometheus-users mailing list. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
def collect(self)
should be guarded by lock too. The list and subsequent merge can have different view on directory content otherwise.
The issue from 2017 #204 is still not resolved, suggesting this client library is designed to serve metrics with multi-second latency. I appreciate @brian-brazil is so careful about unlikely race conditions that may happen, but it means this library is in no way production-ready. |
Have you guys looked into a shared database as storage? For example, local redis or memcached? Their semantics when it comes to missing values align with Prometheus' (new/expired initializes to 0). The documentation correctly states that in Python, multiprocessing is the dominant solution to handling concurrency. Unfortunately, it doesn't also mention that due to how Python manages memory, another dominant solution is rotating workers after a certain number (+/- jitter) requests have been handled. This causes the orphaned pidfiles issue to be more visible. The suggested integration (that removes the db pidfile) suffers from inflating counter values. When only one process gets rotated and its' metrics file purged, Prometheus will scrape counters with lower values (by the number stored in the removed file). Consequently, it will assume the new counter value is a full increment. This, usually, is not true. We also have no way to determine how much the counter had increased since the purge (if it's below the previously observed value). Shared database complicates the deployment slightly (you need to maintain local redis/memcached). For cases where you already have local redis/memcached, this might be a good option, though. |
No description provided.