Design: LogQL Fields #1848
Using sort operator with metric queries will result in an error (e.g. `count_over_time({job="prod/query-frontend"} | logfmt | sort latency desc [5m])`).

> __Warning:__ Sorting is dependent on how Grafana is currently ordering lines. Even if we re-order our responses there is no guarantee that Grafana will rely on it. And we might have to remove uncommon labels to keep only one stream to ensure ordering across field streams. This means we will need a new API (`/v2`) to support ordering, although with some compromise we might get something to work with the `v1` API.
@davkal What do you think about this ?
I think some feedback would be helpful here. How do other datasources (if any) implement upstream-defined sorting?
CloudWatch does support this I think, @kaydelaney? This is a big document, sorry ;)
quantile_over_time(.99, {job="nginx"} | glob "* -- * \"* /*<path> *\" *<status> *<latency> *" | histogram latency linear(.25, 2, 5) by (path, status) [15m])

## API Changes
@davkal Do you think this is a good approach? I really wanted to make sure that most people, even if they don't update, can have this new syntax supported.
I'd also like some thoughts from @davkal on this.
This is fantastic and really exciting! One thing that's standing out to me is how the syntax for each new operator is specialized for that specific operator. It might be worth it to consider a well-formed syntax for how operators are declared, such as a function-like syntax with named arguments:
This example is way more to type and probably isn't the best syntactic alternative to what you've already considered, but I'm trying to show how it does the following:
Definitely not opposed to it! This is where I was searching for feedback (on the syntax itself), let's see what others think about this too. I also thought about CLI flags.
### Questions

- Should we allow implicit field extraction, such that `{job="prod/app"} | logfmt` and `{job="prod/app"} | json` would extract all fields? I really like the simplicity of it.
- In the previous document the `as field_name1, field_name2` field renaming notation was also interesting; should we use it over `ip=client_ip`? For now I've opted for the latter, which can be omitted when there is no renaming.
Can I combine implicit all and renaming? Or would `| json ip=client_ip` only return that single field?
Good question! No, as soon as you add a rename only explicit fields are extracted. I would love to have an option to keep every field while renaming one, but there are two issues:

- We need a meaningful syntax for this.
- By default we should not be greedy, at least IMO.

How do you think we can solve that? Select suffers from the same issue.
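To make the trade-off concrete, here is a small sketch of the two behaviours being discussed, using the proposed (and still unsettled) syntax:

```logql
{job="prod/app"} | json
{job="prod/app"} | json ip=client_ip
```

The first form would implicitly extract every field; the second would extract only `client_ip` (renamed to `ip`), which is why "keep all fields but rename one" would need a dedicated syntax.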
### Glob

The glob parser (as described in previous design) `{job="prod/nginx"} | glob "*<client_ip> - - [ * ] \"* *<path> "` could be used to extract fields from nginx logs.
Should glob be able to call other parsers? It's not too uncommon in other tools that something like this is available, to handle the case of some custom header followed by structured information.
Also, is glob really necessary if you have regexp?
Yes it is IMO, this is inspired by AWS. It will be much faster and much easier to use than regexp. Regexp will be the last resort if you can't parse your logs with the rest.

Regexp matching will be very slow.
> should glob be able to call other parsers?

I see the use case, not sure if there is a nice way to incorporate this in the current syntax. Definitely not part of the MVP, but yeah, I would like to leave the door open. If you have some suggestions let me know.

Maybe we have to draw a line on what we support; this seems to say we will support multiple formats in a single log line. Do you have a log example?

Promtail is able to do multiple parsing for extracted fields. But the syntax is YAML and we are less worried about performance there.
I get the urge to solve this in LogQL especially from a UX perspective but I'm not sure this is really worth the risks and baggage it brings with it. Also, this is a ton of features; I would recommend trying and implementing one of them at a time should you decide to do so. I'm particularly curious what the reasoning is to favor this over, let's say, integration points to other systems or data processing pipelines to allow these types of workflows?
This is a lot of new features, and I'm concerned that if they start being used on hosted instances, they will negatively impact CPU, memory and query times. I would definitely use many of these myself though.
### Sort

The sort operation, noted `| sort`, allows ordering field entries by a field value instead of the field entry time.
Sorting can use a lot of memory, and should probably have some limits in place to avoid killing Loki.
So I'm pretty sure I tried to explain that but it won't sort the whole time range. It will sort only the resulting logs/fields once filtered.
You get the first x sorted.
This is mostly for your eyes: once you receive the result, show me the slowest first.
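For illustration, a sketch of what such a query could look like with the proposed operators (the exact syntax and the limit behaviour described above are still proposals, not final):

```logql
{job="prod/query-frontend"} | logfmt | filter latency > 10 | sort latency desc
```

With a query limit of, say, 1000 entries, the intent described above is that only the returned entries are ordered by `latency`, not the whole time range.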
### Uniq

The uniq operator allows filtering out duplicates. Without any parameter (`| uniq`) each field combination must be unique from one entry to another. However
This can only be implemented efficiently if data going to the `uniq` filter is already sorted. (The command-line `uniq` tool only compares adjacent lines as well.)
BTW, can we just use a CLI tool instead of LogQL features?
> This can only be implemented efficiently if data going to the uniq filter is already sorted. (Command line uniq tool only compares adjacent lines as well)

Again this will basically return lines and filter the ones you already have, it won't deduplicate the whole time range. I'm not sure I understand your concern @pstibrany.

> BTW, can we just use a CLI tool instead of LogQL features?

Of course it could, but what's interesting server side is that any new language feature will be supported by logcli, Grafana and any other tool; I don't expect everyone to use CLI tooling.

However for the CLI I have other plans: currently we don't support streaming logs, and this is a great use case for the CLI as you could pipe and stream to other tools.
> Again this will basically return lines and filter the ones you already have, it won't deduplicate the whole time range. I'm not sure I understand your concern @pstibrany.

I think I'm missing the distinction you're making about time ranges. My concern is that to do general deduplication you need to remember all previous values. But it seems that is not an issue here?
Loki is scanning from the start or the end. The uniq operator will just ensure that the first 1000 lines returned are unique, so it will keep scanning until we have that. It won't dedupe all lines from start to end and then return the first 1000 lines.
I often use logcli with much higher limits than 1000 ;-)
Probably because you have duplicates :). I think we do have improvements to make to logcli to stream logs and so enable all the nice Unix tools, but that's another story.
### Glob

The glob parser (as described in previous design) `{job="prod/nginx"} | glob "*<client_ip> - - [ * ] \"* *<path> "` could be used to extract fields from nginx logs.
"Glob" pattern matching as implemented in shells matches filenames. `*` will match `a.txt`, but also `a a.txt` and `a a a.txt`, files containing spaces in the filename. That doesn't seem to be the case here.
Yes, it will be a special glob; I used "glob" as it's the closest language but you're right, it's not exactly that.

See the AWS parse example:

    FIELDS @message
    | PARSE @message "* [*] *" as loggingTime, loggingType, loggingMessage
    | FILTER loggingType IN ["ERROR", "INFO"]
    | DISPLAY loggingMessage, loggingType = "ERROR" as isError
Nice proposal 👍

- I tend to agree with @brancz here:

  > I get the urge to solve this in LogQL especially from a UX perspective but I'm not sure this is really worth the risks and baggage it brings with it. Also this is a ton of features; I would recommend trying and implementing one of them at a time should you decide to do so. I'm particularly curious what the reasoning is to favor this over, let's say, integration points to other systems or data processing pipelines to allow these types of workflows?

  I am exactly missing that part of this proposal: some section called `# Alternatives` where we can explore some alternatives, for example:
  - Not do it (: And allow already mature and fully flexible client-side CLI tools like `tr`, `awk`, `grep`, `uniq`, `jq`. This has some tradeoffs obviously; it would be nice if the design would outline those.
  - Allow server-side integrations with more programmable APIs using some streaming, serialized methods like protobuf, gRPC (:

  Some arguments and pros & cons would be amazing, especially for the latter alternative. I feel like with those features you are putting on yourself, as the Loki Team, a huge baggage of features, maintenance, and support burden. What worries me particularly is many ways of doing the same thing, e.g. `regexp` and `glob`, or `json` and `logfmt`. Of course, do as you wish, but this is some minor suggestion from the maintainers of other big projects, where we know what support burden means. 🔥 This is why we removed some features from Thanos, just to be clear and have one way of doing the thing (: (e.g. problematic gossip)

- Introducing one by one from the most important would be actually nice, so some `# Work plan` section would be amazing in this proposal 🤗 Thanks to small iterations you can gather some data on whether users need it long term (:
### Uniq

The uniq operator allows filtering out duplicates. Without any parameter (`| uniq`) each field combination must be unique from one entry to another. However
> BTW, can we just use a CLI tool instead of LogQL features?

Of course, the plan is to work operator by operator. I'm pretty confident about the implementation.

The problem we're seeing is that people want to index to search, like the level of your logs. Indexing this causes a lot of problems with the current storage design, and allowing it at query time solves this issue; they won't even search for how to cheat the system anymore. Good feedback @brancz, can you detail the integration points part more, with an example?
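To illustrate the point about not indexing things like the log level: a hedged sketch of the query-time alternative with the proposed operators (the `level` field name and the string comparison in `filter` are assumptions for illustration, not specified syntax):

```logql
{job="prod/app"} | logfmt level | filter level = "error"
```

Instead of turning `level` into an indexed label at ingestion time, it would be extracted and filtered only at query time.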
First of all, good job @cyriltovena going through such a detailed proposal 👍

I've read and reviewed this proposal without reading any other comment (unbiased) and just looking at it from the final user perspective.

My overall feedback is that I see value in this language extension and I would use it. I'm the kind of user that wants to be able to query logs through Grafana Explore and is currently limited a lot by the query language. This extension would allow running way more expressive queries.

On the other hand, the query syntax is getting quite complicated, but it's also true that this is an extension over the current LogQL (I can continue to run a query with only a log selector + filter expression), so I don't see it as a big problem because you can still iteratively learn the new syntax (starting from the operator you need).
In this proposal I introduce multiple new operators in the LogQL language that will allow transforming a log stream into a field stream which can then be aggregated, compared, and sorted.

You'll see that all those new operators (except aggregations over time) are chain together with the `|` pipe operator. This is because we believe it's the most intuitive way to write queries that process data in the Linux world as opposed to the function format. (e.g `cat my.log | grep -e 'foo' | jq .field | wc -l` is easier to read than `wc(jq(grep(cat my.log,'foo'),.field),-l)`)
💯 agree.
### Questions

- Should we allow implicit field extraction, such that `{job="prod/app"} | logfmt` and `{job="prod/app"} | json` would extract all fields? I really like the simplicity of it.
Definitely yes.
### Questions

- Should we allow implicit field extraction, such that `{job="prod/app"} | logfmt` and `{job="prod/app"} | json` would extract all fields? I really like the simplicity of it.
- In the previous document the `as field_name1, field_name2` field renaming notation was also interesting; should we use it over `ip=client_ip`? For now I've opted for the latter, which can be omitted when there is no renaming.
The new `ip=client_ip` notation looks more natural to me.
- Should we allow implicit field extraction, such that `{job="prod/app"} | logfmt` and `{job="prod/app"} | json` would extract all fields? I really like the simplicity of it.
- In the previous document the `as field_name1, field_name2` field renaming notation was also interesting; should we use it over `ip=client_ip`? For now I've opted for the latter, which can be omitted when there is no renaming.
- Do we prefer `|json` or `| json` notation?
Why should we choose one? Can't both be supported? If not, I would pick `| json` because `|something` is currently used for the filter expression (i.e. `|=`) while here we're defining subsequent processing stages.
Yep we can, I think that would be likely.
- In the previous document the `as field_name1, field_name2` field renaming notation was also interesting; should we use it over `ip=client_ip`? For now I've opted for the latter, which can be omitted when there is no renaming.
- Do we prefer `|json` or `| json` notation?

## LogQL Fields Operators
What if I specify a label as a field in an operator? Would that work? I assume yes, but I haven't found a clear statement about it (or I missed it).
I wasn't planning to, but that could be stretched.
Whatever the final decision will be, I believe this should be clearly documented, because it could be a frequent doubt.
For example: `{job="prod/query-frontend"} | json | sort latency desc path asc` will sort by latency descending first, then path ascending.

Using sort operator with metric queries will result in an error (e.g. `count_over_time({job="prod/query-frontend"} | logfmt | sort latency desc [5m])`).
I personally find the ending `[5m]` very confusing. This is an extension of the current syntax (i.e. `count_over_time({job="prod/query-frontend"} != "something"[5m])`) which is already confusing, but further complicated by the chain of pipes.

It's not specific to the sort operator but the syntax in general.
We support this too: `count_over_time({job="prod/query-frontend"} [5m] != "something")`

And this will also be the same for this document. The range `[5m]` can be at the beginning or the end. I recently realized that at the beginning was more natural for most people.
Very interesting! I wasn't aware of that. Is this documented? I probably missed it.
I don't think we have it documented 😢 but I'll add this now !
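For reference, a sketch of the two equivalent range placements described above (the grammar is still being designed):

```logql
count_over_time({job="prod/query-frontend"} != "something" [5m])
count_over_time({job="prod/query-frontend"} [5m] != "something")
```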
This is why select is important to use for large queries. Select will allow users to remove data (columns) in their logs that are useless, repetitive and not interesting for the current investigation.

The select operation will also support the [go templating language](https://golang.org/pkg/text/template/) when supplied with a string argument instead of fields.
To me, the select operator looks like the wrong operator to add templating support to. I would expect the select operator to just select fields and not transform them. I'm personally dubious about supporting the go templating language at all in any operator, but if so I would suggest using a dedicated operator.
### Uniq

The uniq operator allows filtering out duplicates. Without any parameter (`| uniq`) each field combination must be unique from one entry to another. However the filtering can be forced on a specific field, for example `| uniq status_code`; this will return one entry per status code.
How do you decide which entry to keep if the matching is done only on a subset of the log entry (ie. a single field)?
It means it will keep an entry only if the field has not been seen before. This would allow you to get the list of unique IP addresses, as explained by @slim-bean: `{app="nginx"} | glob ... | uniq address_ip`

I did a poor job at explaining all those use cases. Gonna do another run to explain that for each operator.
{job="prod/app"} | json .route client_ip=.client.ip
{job="prod/app"} | json .route .client.ip as route, client_ip
It may be less beneficial to expose both `=` and `as` for assigning keys to values. Unless I'm missing something, I think it'd be better to choose one.
Yeah, I'm asking for feedback: which one is better?

Also, do you prefer `| json .route, .client` or `|json .route .client`?
`| json`
{job="prod/app"} | json

> `{job="prod/app"} |json |select route client_ip` is an implicit extraction, the list of extracted fields is inferred from the full query.
If we're inferring the required field, can we remove explicit extraction?
At query time yes that's the plan.
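A sketch of the inference being discussed: under the proposal, the two queries below would extract the same fields, because `route` and `client_ip` can be inferred from the `select`:

```logql
{job="prod/app"} | json .route client_ip=.client.ip | select route client_ip
{job="prod/app"} | json | select route client_ip
```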
### Questions

- Should we allow implicit field extraction, such that `{job="prod/app"} | logfmt` and `{job="prod/app"} | json` would extract all fields? I really like the simplicity of it.
- In the previous document the `as field_name1, field_name2` field renaming notation was also interesting; should we use it over `ip=client_ip`? For now I've opted for the latter, which can be omitted when there is no renaming.
I'd prefer `<ip=>client_ip` because as an operator it's easier to change queries. Using `as` syntax requires that we also keep track of the index of the matches as well as requiring explicitly naming the field. `a, b as _, other_b` could be used to minimize this, but I still prefer the `=` syntax -- I can easily add a new key value pair in the middle of my selections without causing off-by-one issues in keying the selected values.
Good point, however `=` or `=>`?
Example: `{job="prod/query-frontend"} | json | filter latency > 10 or status >= 400`

When field value and literal types are different, a conversion will be attempted prior to testing, and the value will be translated into the type's zero value in case of failure.
I'm hesitant about this - I think failing in the case of invalid conversions would be a more intuitive/safer way to go. If needed, we could support a way to use a default value if parsing is unsuccessful
Yes, but then I realized: what if only one log line doesn't have the value in the correct format? Erroring seems harsh.
Can we ignore the line instead?
We could, yes! Noted.
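To make the options concrete, consider this query from the document:

```logql
{job="prod/query-frontend"} | json | filter latency > 10
```

For a line where `latency` is not numeric, the alternatives discussed in this thread are: fall back to the zero value (so the line is silently dropped by `> 10`), fail the whole query, or skip just that line as suggested above. This is only a summary of the thread, not a decided behaviour.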
### Select

The Select operator (`| select`) allows you to select and reduce the fields that should be returned.
I think that we should only use inferred selections if possible. The go templating use case is independent AFAICT and we could use a `gotpl` option to do this sort of mapping I think. Thoughts @cyriltovena?
Yep, we might be able to infer all the time. I need to verify this.

And maybe a template operator is better, especially if we can just chain them, but that doesn't improve brevity.
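As a purely hypothetical sketch of the two shapes discussed here (neither the template placeholders nor a separate `template` operator are settled syntax):

```logql
{job="prod/query-frontend"} | json | select path latency
{job="prod/query-frontend"} | json | select "{{.path}} took {{.latency}}"
{job="prod/query-frontend"} | json | select path latency | template "{{.path}} took {{.latency}}"
```

The first line only selects fields, the second is the "select with a string argument" templating form from the proposal, and the third is the chained-operator alternative raised in this thread.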
Incredible job @cyriltovena. I had a few comments but overall I'm very impressed with this so far. I think @slim-bean nailed the reasoning for why we want to support these features in Loki rather than deferring to CLIs.
1/ Love it. Right now my organization can't use Loki because of the lack of structured logging support; this would definitely be the missing feature that would make the teams adopt it. Knowing how people work can, maybe, drive you toward the first steps, so here it is: our current solution is C# -> Serilog -> Fluentd -> ClickHouse -> Grafana, where Serilog generates JSON, Fluentd inserts it into ClickHouse with nested fields, and where Grafana dashboards full of ClickHouse (SQL-like) queries, more or less linked with Prometheus labels (when possible) and Grafana variables, reuse the fields/labels set up in our C# code. So very often we do something like this: in C#: … in Grafana: … Depending on the domain, we can display messages or sometimes only some fields, it really depends. The key is: the same team maintains the C# code and the related Grafana dashboard. Everything in between is generic.

2/ Agree with @slim-bean on the reasoning.

3/ My concern is bringing in yet another new language. Piping is nice, ok, but could we stick to something that already exists in the Linux commands world? I have no clue what that would be; I think for JSON there are already many options to parse, filter and select.

4/ So to me, what is the MVP of this long set of features that could be accepted and implemented rather quickly to get some feedback? As you understand, for me: JSON parser + field extraction and light filtering would be a first step. :-)
- Show me the first x unique IPs within a time range.
- Reformat a log line to be more readable.

However those use cases are out of scope:
The following use cases are kept out of scope:
> Implicit field extraction could mean trouble in terms of performance. However in most cases we will be able to infer all used fields from the full query and so limit the extraction per line.

To cover a wide range of structured and unstructured logs, we will start with those four parsers: `glob`, `logfmt`, `json` and `regexp`.
What about Grok expressions? Quite a lot of usage in the logging world.
Is there no way of doing the field extraction in Promtail, in order to keep the cost of that close to the production of logs instead of making a more expensive query?
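For reference, a hedged sketch of what a query using each of the four proposed parsers could look like. The `glob` and `json` forms are taken from this document; the `logfmt` and `regexp` forms are assumptions about the eventual syntax:

```logql
{job="prod/nginx"} | glob "*<client_ip> - - [ * ] \"* *<path> "
{job="prod/app"} | logfmt latency path
{job="prod/app"} | json .route client_ip=.client.ip
{job="prod/nginx"} | regexp "(?P<status>\\d{3})"
```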
High-level comments:

- This should be a GDoc (already agreed)
- The mix between `(){}`, `| foo | bar`, `|foo .bar` and `as` & `=` is highly confusing to me. Perl's curse is that it allows you to do things in more than one way, which leads to bikeshedding
- As per our call: Regex are more powerful than globbing, so I would argue that regex are a must, globs are optional. As such, regex should be done first.

Potentially contentious: Should we make/keep LogQL fully LISP-like and use Flux in Loki for anyone who prefers piping?
Based on the feedback of the first design and our own usage, it’s pretty clear that we want to support extracting fields out of the log line for detailed analysis. [Amazon Cloud Watch is a good example](https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/CWL_QuerySyntax-examples.html) of such language features.

For instance some users are currently extracting labels using Promtail pipeline stages as a way to filter and aggregate logs but this causes issues with our storage system. (Causes the creation of a lot of small chunks, putting pressure on the filesystem when trying to aggregated over a long period of time).
Suggested change:
- For instance some users are currently extracting labels using Promtail pipeline stages as a way to filter and aggregate logs but this causes issues with our storage system. (Causes the creation of a lot of small chunks, putting pressure on the filesystem when trying to aggregated over a long period of time).
+ For instance some users are currently extracting labels using Promtail pipeline stages as a way to filter and aggregate logs but this causes issues with our storage system. This causes the creation of a lot of small chunks, putting pressure on the filesystem when trying to aggregate over a long period of time.
In this proposal I introduce multiple new operators in the LogQL language that will allow transforming a log stream into a field stream which can then be aggregated, compared, and sorted.

You'll see that all those new operators (except aggregations over time) are chain together with the `|` pipe operator. This is because we believe it's the most intuitive way to write queries that process data in the Linux world as opposed to the function format. (e.g `cat my.log | grep -e 'foo' | jq .field | wc -l` is easier to read than `wc(jq(grep(cat my.log,'foo'),.field),-l)`)
Suggested change: "… are chain together with the `|` pipe operator …" → "… are chained together with the `|` pipe operator …"
## Use cases

The first use case is to allow users to extract data from the log content itself and use them as labels or sample values.
One of the biggest benefits will be that user won't need to index log content anymore for the only purpose of filtering or creating metrics.
Suggested change:
- One of the biggest benefits will be that user won't need to index log content anymore for the only purpose of filtering or creating metrics.
+ One of the biggest benefits is that users won't need to index log content just for filtering or creating metrics any more.
- Show me the logs of requests slower than 5s for a given domain.
- Show me the log with only IP address, latency and path. (Remove redundant noise within logs)
- I have many redundant logs, show me only unique subset of them.
Suggested change: "show me only unique subset of them" → "show me only the unique subset of them"
- Show me the first x unique IPs within a time range.
- Reformat a log line to be more readable.

However those use cases are out of scope:
Suggested change: "However those use cases are out of scope:" → "The following use cases are out of scope:"
However those use cases are out of scope:

- What's the latency xth quantile over a year? We don't expect time ranges to be higher than a couple of days max.
Higher-level note, I would be surprised if people didn't threaten to pay us to support a month at least. Would it make sense to at least think about how to extend this time frame?
Agree, with our current logging solution most investigations/use will be ranges of ~1-2 days, but some of them (~5-10%) will be up to 1-2 months.
"We don't expect time range to be higher than couple of day max."
1-2 months feels like a lot of data in many cases. I don't think it's unreasonable though to allow such a query, with the expectation that it might take minutes or more to complete. Additionally, if someone had multiple Loki installations, they could spin up and run a large cluster for a short amount of time to perform such operations (like how Elastic Map Reduce jobs work).
Exactly !
However those use cases are out of scope:

- What's the latency xth quantile over a year? We don't expect time ranges to be higher than a couple of days max.
- Replace metrics instrumentation with logs. We want to ease new analysis but we don't want to push users to use this instead of instrumenting their application with Prometheus or any other metric client.
Suggested change: "- Replace metrics instrumentation with logs. We want to ease new analysis …" → "- Replace metrics instrumentation with logs. - We want to ease new analysis …"
- What's the latency xth quantile over a year? We don't expect time ranges to be higher than a couple of days max.
- Replace metrics instrumentation with logs. We want to ease new analysis but we don't want to push users to use this instead of instrumenting their application with Prometheus or any other metric client.
- Big data analysis, this whole new feature is mainly for ad-hoc use cases.
Suggested change:
- Big data analysis, this whole new feature is mainly for ad-hoc use cases.
+ Big data analysis. - This design doc mainly covers ad-hoc use cases.
## LogQL Parsers

A parser operator can only be applied to __log__ streams (a set of entries attached to a given set of label pairs) and will transform (or extract) them into __field__ streams.
Suggested change: "will transform (or extract) them into __field__ streams" → "will transform or extract them into __field__ streams"
```logql
{job="prod/app"} | json .route client_ip=.client.ip
{job="prod/app"} | json .route .client.ip as route, client_ip
{job="prod/app"} | json |select route client_ip
```
It's not obvious why I use `.route` but `|select`. Together with the `{}` syntax, we are at three models now.
I thought I had submitted this, but GH being wonky today...

- I think we should move to GDoc again (already agreed)
- There's currently a mix of `(){}`, `| select`, `.route`, and also `=` and `as`. This is confusing
- Regex are more powerful than globs, so regex MUST be implemented, globbing MAY be implemented
- Should we consider keeping LogQL like PromQL with `(){}` only and support Flux in Loki for pipe fans?
Good proposal overall, I like it! 🏅

Great that you've spent time thinking about the real use-cases, e.g. a developer troubleshooting an issue and needing to grep through logs. That's what we use our logging for 95% of the time.

I assume this will occur at query time, so it'll be CPU intensive.

- Do we want to support lazy evaluation? Or think about that when designing the API? Something similar to Ruby's lazy evaluation (see below).
- Do we want to give the user some progress indicator, e.g. query has now searched through X / Y documents.
# example of lazy evaluation in ruby
# This code will "hang" and you'll have to ctrl-c to exit
(1..Float::INFINITY).reject { |i| i.odd? }.map { |i| i*i }.first(5)
(1..Float::INFINITY).lazy.reject { |i| i.odd? }.map { |i| i*i }.first(5)
#=> [4, 16, 36, 64, 100]
http://patshaughnessy.net/2013/4/3/ruby-2-0-works-hard-so-you-can-be-lazy
https://www.honeybadger.io/blog/using-lazy-enumerators-to-work-with-large-files-in-ruby/
The json parser allows parsing a JSON log line and selecting fields to extract. If the line is not a valid JSON document the line will be skipped.

We could support those notations:
Perhaps we could sync with e.g. `jq`, or use a sub-set of the `jq` supported syntax? Nothing major though, fine with coming up with our own too.

We'd avoid reinventing the wheel + it may make copy-paste easier.
Yes, totally agree, probably a subset, because jq does not have a Go library (at least without cgo) and it would be too hard to support it all.
👍 Subset is totally fine and probably a better idea than supporting everything.
> Yes totally agree, probably a subset, cause jq does not have a go library (at least without cgo) and that would be too hard to support it all.

https://github.com/itchyny/gojq

If you wanted to use cgo bindings: https://github.com/jzelinskie/faq/tree/master/pkg/jq but I wouldn't recommend it. It's a huge pain to work with and you'll have a bunch of security issues I'd bet.

Idk how good gojq is though.
### Faceting

As explained in the [parsers section](#logql-parsers), some parsers (`json` and `logfmt`) will support implicit extraction. However, when the user wants to reduce fields by selecting them, I think it would be useful to provide an API that returns all field names for a given field stream selector and a time range. This way Grafana or any other integration will be able to provide auto-completion for the fields to select.
Very good idea!
@cyriltovena Looking forward to LogQL! I've added two more thoughts on this proposal, let me know what you think.
Field names are declared by each parser; some parsers will allow implicit field extraction for __brevity__, like the `json` and `logfmt` parsers.

Since field names can conflict with labels during aggregation, we will support for each parser the ability to rename a field. However, in case of conflict, fields will take precedence over labels.
To avoid conflicts between labels and fields, how about using different symbols for them? For example `@myLabel` and `$myField`?

Should also make documentation etc. easier and help avoid confusion between the two.
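A small sketch of the renaming escape hatch described above (the specific label and field names are made up, and it assumes `logfmt` supports the same `new_name=field` renaming as `json`):

```logql
{cluster="prod", status="active"} | logfmt status_code=status | filter status_code >= 500
```

Here the extracted `status` field is renamed to `status_code` so it no longer collides with the `status` stream label during aggregation.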
127.0.0.1 - - [19/Jun/2012:09:16:22 +0100] "GET /GO.jpg HTTP/1.1" 499 0 "http://domain.com/htm_data/7/1206/758536.html" "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; .NET CLR 1.1.4322; .NET CLR 2.0.50727; .NET CLR 3.0.4506.2152; .NET CLR 3.5.30729; SE 2.X MetaSr 1.0)"

would be transformed into those fields:
Are field key/value pairs only strings? Obviously keys are strings, but can values be something else, e.g. arrays?

We are doing something similar to Canonical Log Lines (ref 1, ref 2), but with all messages for a request nested under a `messages` key, in an array. So basically every HTTP request is one log line.

It looks like this:
{
"request-id" : "2d94738b-0324-a1fc-2916-644716f3ea36",
"ip-address" : "127.0.0.1",
"user-name" : "Jane Doe",
"http-method" : "get",
"http-response" : "200",
"duration" : 125,
"service" : "authentication",
"messages" : [
"Attempted to authenticate user Jane",
"Connecting to internal IAM system",
"Internal IAM system responded with unknown user id",
"Provided user id is unknown, cannot verify user, returning response to user",
"…",
"Requested completed"
]
}
There are two fields here that aren't strings: `duration` (an integer) and `messages` (an array).

Would be great if LogQL would support querying against array fields like this.
@cyriltovena let me know what you think!
I haven't thought about the full json syntax yet.
First, want to voice my support for this, as it evolves Loki into a worthy OSS competitor to Humio, which occupies a similar "no indexing" place in the logging spectrum and sports a pretty expressive query language. Would like to raise a few concerns with the proposed syntax:
Thanks @yurishkuro, feels great to hear you support this. Do note that I'm actively working on this and it's a work in progress, especially around the syntax. Once this is more stable I'll share more.

One of the big advantages of using the same Prometheus syntax is the ability to switch between metrics and logs.

This is a great idea. I don't think we will make this implicit from the start, but it's definitely a great improvement we will be able to do later on.

Yes, you will always be able to grep filter first, allowing you to reduce the parsing; then you will be able to do field/label based filtering.

I did entertain this idea for a long time. Again thanks for your feedback, I'll keep you posted.
Thanks, @cyriltovena.

This is simultaneously a disadvantage that couples Loki with Prometheus. The alternative syntax I showed is just as easy to translate back into a PromQL snippet, but allows Loki to be used in non-Prom installations (e.g. at Uber we use M3QL, not PromQL).

Regarding the pipe syntax, one very intriguing proposition is combining it with M3QL for metrics, which is already pipe-based. The expression can start with a query on logs, followed by some aggregation operation like count() or groupBy, which results in a set of time series that can be further manipulated by a pipe-based metrics QL. But this is just a wish list so far.
As a sysadmin I realize I am not the primary use case for Loki, but I am really excited to see this direction for LogQL; it will dramatically enhance the usability of Loki (and, once it supports it, Grafana) as a log visualization aid for me and my cohorts, and I am very excited to get my hands on this functionality.

The only use case I see that isn't really addressed, as far as I can tell, is having different log line formats in the same query. In my use case, I can definitely see that querying for a particular IP address would return two or more log line formats, and I would love to be able to add an external classifier to determine the format of the log line and return field specifications that differ per line. Something along the lines of `| classify classifier.sh`, with an external script taking care of the parsing for that specific query, before splitting up and prettifying the output in Grafana.
In sumologic, most queries start with a
Similarly, I am an SRE and I want to be able to see logs from a combination of things in the same view, such as application logs (JSON) but also logs from, say, an Envoy proxy (plaintext). Being able to manipulate how the query shows the result should be as simple as structuring your query with a |
The updated LogQL design doc is wonderful reading. It feels like the missing part of Loki. The doc is now several months old and seemingly still rather incomplete, particularly in regards to implementation plans. Is this doc still under consideration or has there been some sort of pause to progress?
Closing it, thanks everyone for all the great feedback, it was such an inspiration! I'm proud of the result. A great example of community effort. See #2769
This is a design document explaining how we could extend the current LogQL language. This is based on the previous document from @joe-elliott. I've talked to @mattmendick, @joe-elliott, @slim-bean and @owen-d about this design and those discussions drove me to this current version.
I believe this will be easy to implement with the current code base and in small PRs. I'm curious to see what you think about the syntax but also about the metrics part. There's an alternative here where we allow instant vector selection but IMO this first set is far more useful.
There are definitely some performance concerns but I think it's best to keep that for once we have everything in place. It is most likely that parsers will be a significant part of the CPU usage and the root of most memory allocations.
If you find any typo or have a better wording feel free to directly commit to this PR. Let's keep comments for design discussion or questions.
/cc @bwplotka @jtlisi @brancz
Feedback is welcome !