
Reconsider the query log format #2290

Open
ainar-g opened this issue Nov 10, 2020 · 19 comments
Assignees
Labels
enhancement research Looking up ways to improve the product in the future.
Milestone

Comments

@ainar-g
Contributor

ainar-g commented Nov 10, 2020

The current way in which we handle query logs has several issues: it's racy, files can get pretty large, the search algorithm is too complex, and we can only do basic filtering and very little aggregation. But we also don't have a lot of alternatives. Ideally we would like to have something like an SQL database, à la SQLite, but that would require cgo, which comes with its own bag of issues. Package modernc.org/sqlite seems interesting, but it's a work-in-progress and currently doesn't support all our platforms.

I will update the issue when we think of something better. Perhaps SQLite is not completely out of the question if we find a way to cross-compile it without too much pain for the developers.

@ainar-g ainar-g added enhancement research Looking up ways to improve the product in the future. labels Nov 10, 2020
@ainar-g ainar-g self-assigned this Nov 10, 2020
@ainar-g
Contributor Author

ainar-g commented Nov 19, 2020

Package modernc.org/ql is also an interesting alternative.

@ameshkov
Member

Tbh, there are a couple of advantages to using simple text-based storage:

  1. Logs can be easily rotated (not in the current implementation, though).
  2. Logs can be loaded and analyzed by other software (ELK, Splunk, etc.).

They should also be considered when we choose what to do next.

@ZeroClover

Perhaps consider using Blevesearch to index BoltDB, which would allow richer filtering and faster queries.

Blevesearch is written in pure Go.

However, it still does not help with write performance, and it consumes more disk space.

Using a TSDB may help performance, but it can add significant complexity.

@ainar-g
Contributor Author

ainar-g commented Jun 2, 2021

@ameshkov,

I benchmarked modernc.org/ql on a Linux/ARMv7 machine. Single writes are somewhat slow, circa a few milliseconds, but when you use prepared statements and commit many data changes in a single transaction, which the author of the database says is the proper way to use it, any single write only takes about 20 µs or so, which is quite good, I think. Especially considering that we already have an in-memory buffer in the current implementation of our query log.

We'll still need to do a lot of testing, obviously, and the architecture needs to be thought out carefully, but having an actual, proper, pure Go database that allows GROUP BY, COUNT, etc. would be great for the kinds of statistics we could show.

@apnagaev

The ELK idea is very interesting; MySQL also comes to mind. It would be nice to have a couple of servers with a single point of history storage. I would like an infrastructure like waf/haproxy->adguard, but if there are two servers behind the load balancer, each AdGuard instance keeps its own client query history. I tried simply mapping one work folder onto both containers, but sometimes statistics get lost (possibly due to concurrent write access to the file; I haven't tried debugging it). Having two instances with a single config (that works fine) and a single query history would be very convenient.

@gfxlabs

gfxlabs commented Jun 3, 2022

hi there! i had a question about this issue

i've been working on this personally, since i want it as a feature for myself.

my issue comes with the query log file being giant. i have ~0.5 to 1 million DNS requests per day, and the file is pretty big. i don't really care about the searching, and am trying to minimize storage space while keeping a similar feature set to what currently exists

not that it will necessarily change what i'm doing for myself, but do you care if the storage format is human-readable (since it's just JSON right now)? it looks like you're considering SQL-like interfaces (i guess mostly for searching), so it seems not, but i wanted to check in

thanks!

@ainar-g
Contributor Author

ainar-g commented Jun 3, 2022

@gfxlabs, the most likely approach that we will take will involve some sort of binary database storage. Possibly with an optional plain-text JSON stream, but we're still thinking about that.

I don't know if that would bring any space savings, as we're still fairly far from doing any meaningful benchmarking, but it would definitely simplify keeping the query log at a particular size, i.e. removing older records more eagerly.

@bnkas

bnkas commented Aug 13, 2022

@ainar-g I am currently taking an interest in log aggregation via Telegraf, InfluxDB, Grafana Loki, etc. with other network devices and services, and it currently works well because the log format is JSON and it is relatively easy for other tools to parse.

While the logs have room for improvement, as you and others stated, I would like to add my voice that the final design approach for logs should maintain (or improve) compatibility with external log aggregation/visualization tools and platforms (Telegraf, Loki, etc.). Maybe that can be done via a configuration option (where the user can choose the log format: a DB, JSON, etc.) or via a DB that offers a way for a client tool to query records and push them to an external platform.

@Zheaoli

Zheaoli commented Aug 16, 2022

> (quoting @bnkas's comment above in full)

@ainar-g I think this is a good suggestion. I think we can refactor the querylog module and support pushing the DNS record to other services.

@Zheaoli

Zheaoli commented Aug 16, 2022

And I think modernc.org/sqlite is good enough to use as the query log backend.

I'd be glad to work on this (it's important to me).

@ainar-g
Contributor Author

ainar-g commented Aug 16, 2022

@bnkas, @Zheaoli, I understand your request, but please keep the topic of discussion in this issue related to the on-disk format. Collection of logs by external services is a broad and deep topic, and, if I recall correctly, there is already a feature request about that. See, for example, #3389.

@emlimap

emlimap commented Oct 5, 2023

This might be of interest to you

Sqinn-Go is a Go (Golang) library for accessing SQLite databases in pure Go. It uses Sqinn https://github.com/cvilsmeier/sqinn under the hood. It starts Sqinn as a child process (os/exec) and communicates with Sqinn over stdin/stdout/stderr. The Sqinn child process then does the SQLite work.

If you want SQLite but do not want cgo, Sqinn-Go can be a solution.

https://github.com/cvilsmeier/sqinn-go

Also this https://gitlab.com/cznic/sqlite

some benchmarks: https://datastation.multiprocess.io/blog/2022-05-12-sqlite-in-go-with-and-without-cgo.html

source: https://news.ycombinator.com/item?id=37770454

@ameshkov
Member

ameshkov commented Oct 5, 2023

Huh, looks interesting

@ameshkov
Member

ameshkov commented Oct 5, 2023

On the other hand, the pure Go SQLite port is getting some traction and is being very actively developed. @ainar-g argued that we should migrate to it, and I am starting to think that it sounds very reasonable.

https://gitlab.com/cznic/sqlite

@ainar-g
Contributor Author

ainar-g commented Oct 5, 2023

Using the Go SQLite port is still my position as well, although they don't currently support all platforms that AGH supports, which may be an issue.

@sky96111

sky96111 commented Jan 31, 2024

chaiSQL also seems like an interesting choice once it's production-ready.
https://github.com/chaisql/chai

@heffneil

heffneil commented Feb 8, 2024

Has anything happened here? It's been years since this was first discussed, and writing to a JSON file is not a modern solution. I have had several crashes of AdGuard in my environment in a short period of time. I have disabled all logging and statistics in the meantime, but having a reputable option for storing data would be nice.

@sky96111

sky96111 commented Feb 26, 2024

I agree with that.
Refactoring to a DB-based query and storage method is very important.
Especially for a DNS server, the amount of log data is always very large, which results in a serious slowdown of log operations.

@antoonhuiskens

From a minimalist point of view, being able to send the query logs to any syslog server would be straightforward, I suppose. With the query log in syslog, there would be a plethora of tooling available to get the data into promtail/loki, influxDB, elastic, and many others.
