Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cleanup checkpoints for file source #1427

Closed
anton-ryzhov opened this issue Dec 24, 2019 · 3 comments · Fixed by #4899
Closed

Cleanup checkpoints for file source #1427

anton-ryzhov opened this issue Dec 24, 2019 · 3 comments · Fixed by #4899
Labels
have: should We should have this feature, but is not required. It is medium priority. meta: good first issue Anything that is good for new contributors. source: file Anything `file` source related type: enhancement A value-adding code change that enhances its existing functionality. type: tech debt A code change that does not add user value.

Comments

@anton-ryzhov
Copy link
Contributor

As I understand checkpoint files are never removed from the data_dir. This could cause denial of service because all inodes will be in use.

This even couldn't be handled externally because the application always rewrites these files with newer timestamps — impossible to remove old ones.

Vector should remove checkpoints for nonexistent files after some time.

@binarylogic
Copy link
Contributor

Thanks! I agree. We'll get this implemented.

@binarylogic binarylogic added source: file Anything `file` source related type: enhancement A value-adding code change that enhances its existing functionality. meta: good first issue Anything that is good for new contributors. labels Dec 24, 2019
@anton-ryzhov
Copy link
Contributor Author

I also don't really get why do you store offsets this way.

According to strace you're doing lots of syscalls. mkdir then open+close for each file to store checkpoints. And then lstat, open and getdents64 the directory, unlink for every file to remove the folder before saving new checkpoints.

That creates lots of IO requests, inodes are used more and more.

Isn't it better to store checkpoints in one file (as json or whatever) and update its content. For that case garbage collection won't be so necessary, only one inode will be used.

@binarylogic
Copy link
Contributor

@anton-ryzhov that's very possible. I agree, how we store offsets is not ideal and we absolutely will be improving this. We've also discussed using leveldb since we include that for disk buffers. I like your idea of a single JSON file as well.

@lukesteensen lukesteensen added the type: tech debt A code change that does not add user value. label Feb 9, 2020
@binarylogic binarylogic added the have: should We should have this feature, but is not required. It is medium priority. label Aug 7, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
have: should We should have this feature, but is not required. It is medium priority. meta: good first issue Anything that is good for new contributors. source: file Anything `file` source related type: enhancement A value-adding code change that enhances its existing functionality. type: tech debt A code change that does not add user value.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants