
Design: LogQL Fields #1848

Closed
wants to merge 21 commits

Conversation

cyriltovena
Contributor

@cyriltovena cyriltovena commented Mar 25, 2020

This is a design document explaining how we could extend the current LogQL language. This is based on the previous document from @joe-elliott. I've talked to @mattmendick, @joe-elliott, @slim-bean and @owen-d about this design and those discussions drove me to this current version.

I believe this will be easy to implement with the current code base and in small PRs. I'm curious to see what you think about the syntax but also about the metrics part. There's an alternative here where we allow instant vector selection but IMO this first set is far more useful.

There are definitely some performance concerns, but I think it's best to keep that for once we have everything in place. It is most likely that parsers will be a significant part of the CPU usage and the root of most memory allocations.

If you find any typos or have better wording, feel free to commit directly to this PR. Let's keep comments for design discussion or questions.

/cc @bwplotka @jtlisi @brancz

Feedback is welcome !


Using the sort operator with metric queries will result in an error (e.g. `count_over_time({job="prod/query-frontend"} | logfmt | sort latency desc [5m])`).

> __Warning:__ Sorting depends on how Grafana currently orders lines. Even if we re-order our responses there is no guarantee that Grafana will rely on that order, and we might have to remove uncommon labels, keeping only the stream labels, to ensure ordering across field streams. This means we will need a new API (`/v2`) to support ordering, although with compromises we might get something to work with the `v1` API.
Contributor Author

@davkal What do you think about this ?

Member

@owen-d owen-d Mar 31, 2020

I think some feedback would be helpful here. How do other datasources (if any) implement upstream-defined sorting?

Contributor Author

CloudWatch does support this I think, @kaydelaney? This is a big document, sorry ;)

quantile_over_time(.99, {job="nginx"} | glob "* -- * \"* /*<path> *\" *<status> *<latency> *" | histogram latency linear(.25, 2, 5) by (path, status) [15m])
```

## API Changes
Contributor Author

@davkal Do you think this is a good approach? I really wanted to make sure that most people, even if they don't update, can have this new syntax supported.

Member

I'd also like some thoughts from @davkal on this.

@rfratto rfratto added the proposal A design document to propose a large change to Promtail label Mar 25, 2020
@rfratto
Member

rfratto commented Mar 25, 2020

This is fantastic and really exciting!

One thing that's standing out to me is how the syntax for each new operator is specialized for that specific operator. It might be worth it to consider a well-formed syntax for how operators are declared, such as a function-like syntax with named arguments:

{job="prod/query-frontend"} | logfmt | sort(field=duration,dir=desc) | limit(25) | select(fields=[query,duration])

This example is way more to type and probably isn't the best syntactic alternative to what you've already considered, but I'm trying to show how it does the following:

  • Allows arguments to be provided in any order
  • Allows us to add arguments without breaking backwards compatibility or reluctantly just always appending new arguments
  • Helps bring consistency to how operators are used
  • Adds an opportunity for optional arguments regardless of parameter number
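For comparison, here is roughly the same query written in the positional pipe style used elsewhere in the proposal and in the named-argument style sketched above (both are illustrative sketches of syntax still under discussion, not final grammar):

```logql
# Positional style, following the proposal's sort/select examples
{job="prod/query-frontend"} | logfmt | sort duration desc | select query duration

# Function-like style with named arguments (adapted from the suggestion above)
{job="prod/query-frontend"} | logfmt | sort(field=duration,dir=desc) | select(fields=[query,duration])
```

The named-argument form is longer to type but leaves room for optional arguments without positional ambiguity.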

@cyriltovena
Contributor Author

cyriltovena commented Mar 25, 2020

One thing that's standing out to me is how the syntax for each new operator is specialized for that specific operator. It might be worth it to consider a well-formed syntax for how operators are declared, such as a function-like syntax with named arguments:

{job="prod/query-frontend"} | logfmt | sort(field=duration,dir=desc) | limit(25) | select(fields=[query,duration])

Definitely not opposed to it! This is where I was looking for feedback (on the syntax itself), let's see what others think about this too. I also thought about CLI flags: `sort --field=duration --dir=desc` 🤣.

### Questions

- Should we allow implicit field extraction, such that `{job="prod/app"} | logfmt` and `{job="prod/app"} | json` would extract all fields? I really like the simplicity of it.
- In the previous document the `as field_name1, field_name2` field renaming notation was also interesting. Should we use it over `ip=client_ip`? For now I've opted for the latter, which can be omitted when there is no renaming.
Member

Can I combine implicit all and renaming? Or would | json ip=client_ip only return that single field?

Contributor Author

Good question! No, as soon as you add a rename, only explicit fields are extracted. I would love to have an option to keep every field while renaming one, but there are two issues:

  • We need a meaningful syntax for this.
  • By default we should not be greedy, at least IMO.

How do you think we can solve that ? Select suffers from the same issue.


### Glob

The glob parser (as described in the previous design) `{job="prod/nginx"} | glob "*<client_ip> - - [ * ] \"* *<path> "` could be used to extract fields from nginx logs.

Should glob be able to call other parsers? It's not too uncommon in other tools that something like this is available, to handle the case of some custom header followed by structured information.


Also is glob really necessary if you have regexp?

Contributor Author

@cyriltovena cyriltovena Mar 26, 2020

Yes it is IMO, this is inspired by AWS. It will be much faster and much easier to use than regexp. Regexp will be the last resort if you can't parse your logs with the rest.

Regexp matching will be very slow.
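As a rough illustration of the difference (a sketch only: the proposal does not fix the regexp parser's grammar, so the named-capture-group form below is an assumption):

```logql
# Proposed glob parser from the design doc: `*<name>` positions become fields
{job="prod/nginx"} | glob "*<client_ip> - - [ * ] \"* *<path> "

# Hypothetical regexp parser doing the same extraction, assuming named capture groups
{job="prod/nginx"} | regexp "(?P<client_ip>\\S+) - - \\[.*\\] \"\\S+ (?P<path>\\S+)"
```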

Contributor Author

@cyriltovena cyriltovena Mar 26, 2020

should glob be able to call other parsers?

I see the use case; I'm not sure there is a nice way to incorporate this in the current syntax. Definitely not part of the MVP, but yeah, I would like to leave the door open. If you have some suggestions let me know.

Maybe we have to draw a line on what we support; this seems to say we will support multiple formats in a single log line. Do you have a log example?

Promtail is able to do multiple parsing passes on extracted fields. But the syntax is YAML and we are less worried about performance there.

@brancz

brancz commented Mar 26, 2020

I get the urge to solve this in LogQL, especially from a UX perspective, but I'm not sure this is really worth the risks and baggage it brings with it. Also, this is a ton of features; I would recommend trying and implementing one of them at a time, should you decide to do so.

I’m particularly curious what the reasoning is to favor this over let’s say integration points to other systems or data processing pipelines to allow these types of workflows?

Member

@pstibrany pstibrany left a comment

This is a lot of new features, and I'm concerned that if they start being used on hosted instances, they will negatively impact CPU, memory and query times. I would definitely use many of these myself though.


### Sort

The sort operation, noted `| sort`, allows ordering field entries by a field value instead of the entry time.
Member

Sorting can use a lot of memory, and should probably have some limits in place to avoid killing Loki.

Contributor Author

@cyriltovena cyriltovena Mar 26, 2020

So I'm pretty sure I tried to explain that but it won't sort the whole time range. It will sort only the resulting logs/fields once filtered.

You get the first x sorted.

Contributor Author

This is mostly for your eyes: once you receive the result, show me the slowest first.


### Uniq

The uniq operator allows filtering out duplicates. Without any parameter (`| uniq`) each field combination must be unique from one entry to the next. However
Member

This can only be implemented efficiently if the data going to the uniq filter is already sorted. (The command-line uniq tool only compares adjacent lines as well.)


BTW, can we just use a CLI tool instead of LogQL features?

Contributor Author

This can only be implemented efficiently if the data going to the uniq filter is already sorted. (The command-line uniq tool only compares adjacent lines as well.)

Again, this will basically return lines and filter the ones you already have; it won't deduplicate the whole time range. I'm not sure I understand your concern @pstibrany.

BTW, can we just use a CLI tool instead of LogQL features?

Of course it could, but what's interesting server side is that any new language feature will be supported by logcli, Grafana and any other tool; I don't expect everyone to use CLI tooling.

However for the CLI I have other plans: currently we don't support streaming logs, and this is a great use case for the CLI as you could pipe and stream to other tools.

Member

Again, this will basically return lines and filter the ones you already have; it won't deduplicate the whole time range. I'm not sure I understand your concern @pstibrany.

I think I'm missing the distinction you're making about time ranges. My concern is that to do general deduplication you need to remember all previous values. But it seems that is not an issue here?

Contributor Author

Loki scans from the start or the end. The uniq operator will just ensure that the first 1000 lines returned are unique, so it will keep scanning until we have that. It won't dedupe all lines from start to end and then return the first 1000 lines.
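A concrete sketch of that behaviour, reusing the nginx glob example from the doc (limits and exact semantics are still open):

```logql
# Scan until the first N returned entries have distinct client_ip values,
# rather than deduplicating the whole time range up front
{job="prod/nginx"} | glob "*<client_ip> - - [ * ] \"* *<path> " | uniq client_ip
```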

Member

I often use logcli with much higher limits than 1000 ;-)

Contributor Author

Probably because you have duplicates :). I think we do have improvements to make to logcli to stream logs and so enable all the nice unix tools, but that's another story.


### Glob

The glob parser (as described in the previous design) `{job="prod/nginx"} | glob "*<client_ip> - - [ * ] \"* *<path> "` could be used to extract fields from nginx logs.
Member

"glob" pattern matching as implemented in shells matches filenames. "*" will match a.txt, but also a a.txt and a a a.txt files containing spaces in the filename. That doesn't seem to be the case here.

Contributor Author

@cyriltovena cyriltovena Mar 26, 2020

Yes, it will be a special glob; I used "glob" as it's the closest term, but you're right, it's not exactly that.

See the AWS parse example:

FIELDS @message
| PARSE @message "* [*] *" as loggingTime, loggingType, loggingMessage
| FILTER loggingType IN ["ERROR", "INFO"]
| DISPLAY loggingMessage, loggingType = "ERROR" as isError
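For comparison, a very rough LogQL analogue of that CloudWatch Insights query under the proposed syntax might look like the sketch below; the field names, the string comparison in `filter`, and the multi-line layout are all illustrative assumptions, not part of the proposal:

```logql
# illustrative only: field names and string equality in filter are assumptions
{job="prod/app"}
  | glob "*<logging_time> [*<logging_type>] *<logging_message>"
  | filter logging_type = "ERROR" or logging_type = "INFO"
  | select logging_message logging_type
```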

@bwplotka bwplotka left a comment

Nice proposal 👍

  1. I tend to agree with @brancz here.

I get the urge to solve this in LogQL, especially from a UX perspective, but I'm not sure this is really worth the risks and baggage it brings with it. Also, this is a ton of features; I would recommend trying and implementing one of them at a time, should you decide to do so.
I’m particularly curious what the reasoning is to favor this over let’s say integration points to other systems or data processing pipelines to allow these types of workflows?

I am exactly missing that part in this proposal: some section called `# Alternatives` where we can explore some alternatives, for example:

  • Not do it (: and rely on already mature and fully flexible client-side CLI tools like tr, awk, grep, uniq, jq. This has some tradeoffs obviously; it would be nice if the design outlined those.
  • Allow server side integrations with more programmable APIs using some streaming, serialized methods like protobuf, gRPC (:

Some arguments and pros & cons would be amazing.

Especially the latter alternative. I feel like with those features you are putting on yourselves, as the Loki team, a huge baggage of features, maintenance, and support burden. What worries me particularly is many ways of doing the same thing, e.g. regexp and glob, or json and logfmt. Of course, do as you wish, but this is a minor suggestion from maintainers of other big projects, where we know what support burden means. 🔥 This is why we removed some features from Thanos, just to be clear and have one way of doing the thing (: (e.g. the problematic gossip)

  2. Introducing them one by one, starting from the most important, would actually be nice, so some `# Work plan` section would be amazing in this proposal 🤗 Thanks to small iterations you can gather some data on whether a feature is needed or not long term (:


### Glob

The glob parser (as described in the previous design) `{job="prod/nginx"} | glob "*<client_ip> - - [ * ] \"* *<path> "` could be used to extract fields from nginx logs.


Also is glob really necessary if you have regexp?


### Uniq

The uniq operator allows filtering out duplicates. Without any parameter (`| uniq`) each field combination must be unique from one entry to the next. However


BTW, can we just use a CLI tool instead of LogQL features?

@cyriltovena
Contributor Author

I get the urge to solve this in LogQL, especially from a UX perspective, but I'm not sure this is really worth the risks and baggage it brings with it. Also, this is a ton of features; I would recommend trying and implementing one of them at a time, should you decide to do so.

Of course the plan is to work operator by operator. I'm pretty confident on the implementation.

I’m particularly curious what the reasoning is to favor this over let’s say integration points to other systems or data processing pipelines to allow these types of workflows?

The problem we're seeing is that people want to index in order to search, like the level of your logs. Indexing this causes a lot of problems with the current storage design; allowing it at query time solves this issue, and they won't even look for ways to cheat the system anymore.

Good feedback @brancz, can you detail the integration points part more, with an example?

Contributor

@pracucci pracucci left a comment

First of all, good job @cyriltovena going through such a detailed proposal 👍

I've read and reviewed this proposal without reading any other comment (unbiased) and just looking at it from the final user perspective.

My overall feedback is that I see value in this language extension and I would use it. I'm the kind of user that wants to be able to query logs through Grafana Explore and is currently limited a lot by the query language. This extension would allow running way more expressive queries.

On the other hand, the query syntax is getting quite complicated, but it's also true that this is an extension over the current LogQL (I can continue to run a query with only a log selector + filter expression), so I don't see it as a big problem because you can still learn the new syntax iteratively (starting from the operator you need).


In this proposal I introduce multiple new operators in the LogQL language that will allow transforming a log stream into a field stream which can be then aggregated, compared, and sorted.

You'll see that all those new operators (except aggregations over time) are chain together with the `|` pipe operator. This is because we believe it's the most intuitive way to write queries that process data in the Linux world as opposed to the function format. (e.g `cat my.log | grep -e 'foo' | jq .field | wc -l` is easier to read than `wc(jq(grep(cat my.log,'foo'),.field),-l)`)
Contributor

💯 agree.


### Questions

- Should we allow implicit field extraction, such that `{job="prod/app"} | logfmt` and `{job="prod/app"} | json` would extract all fields? I really like the simplicity of it.
Contributor

Definitely yes.

### Questions

- Should we allow implicit field extraction, such that `{job="prod/app"} | logfmt` and `{job="prod/app"} | json` would extract all fields? I really like the simplicity of it.
- In the previous document the `as field_name1, field_name2` field renaming notation was also interesting. Should we use it over `ip=client_ip`? For now I've opted for the latter, which can be omitted when there is no renaming.
Contributor

The new `ip=client_ip` notation looks more natural to me.


- Should we allow implicit field extraction, such that `{job="prod/app"} | logfmt` and `{job="prod/app"} | json` would extract all fields? I really like the simplicity of it.
- In the previous document the `as field_name1, field_name2` field renaming notation was also interesting. Should we use it over `ip=client_ip`? For now I've opted for the latter, which can be omitted when there is no renaming.
- Do we prefer `|json` or `| json` notation?
Contributor

Why should we choose one? Can't both be supported? If not, I would pick `| json`, because `|something` is currently used for the filter expression (i.e. `|=`) while here we're defining subsequent processing stages.

Contributor Author

@cyriltovena cyriltovena Mar 31, 2020

Yep we can, I think that's likely what we would do.

- In the previous document the `as field_name1, field_name2` field renaming notation was also interesting. Should we use it over `ip=client_ip`? For now I've opted for the latter, which can be omitted when there is no renaming.
- Do we prefer `|json` or `| json` notation?

## LogQL Fields Operators
Contributor

What if I specify a label as a field in an operator? Would that work? I assume yes, but I haven't found a clear statement about it (or I missed it).

Contributor Author

I wasn't planning on it, but that could be stretched.

Contributor

Whatever the final decision will be, I believe this should be clearly documented, because it could be a frequent source of doubt.


For example: `{job="prod/query-frontend"} | json | sort latency desc path asc` will sort by latency descending first, then path ascending.

Using the sort operator with metric queries will result in an error (e.g. `count_over_time({job="prod/query-frontend"} | logfmt | sort latency desc [5m])`).
Contributor

I personally find the ending `[5m]` very confusing. This is an extension of the current syntax (i.e. `count_over_time({job="prod/query-frontend"} != "something"[5m])`), which is already confusing, but further complicated by the chain of pipes.

It's not specific to the sort operator but to the syntax in general.

Contributor Author

We support this too:

count_over_time({job="prod/query-frontend"} [5m] != "something")

And this will also be the same for this document. The range `[5m]` can be at the beginning or the end. I recently realized that the beginning was more natural for most people.
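Concretely, both of these placements are meant to be accepted (a sketch based on the behaviour described above):

```logql
# Range at the end, as currently documented
count_over_time({job="prod/query-frontend"} != "something" [5m])

# Range directly after the stream selector, which reads more naturally to many people
count_over_time({job="prod/query-frontend"} [5m] != "something")
```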

Contributor

Very interesting! I wasn't aware of that. Is this documented? I probably missed it.

Contributor Author

I don't think we have it documented 😢 but I'll add this now !


This is why select is important to use for large queries. Select will allow users to remove data (columns) in their logs that is useless, repetitive and not interesting for the current investigation.

The select operation will also support the [go templating language](https://golang.org/pkg/text/template/) when supplied with a string argument instead of fields.
Contributor

To me, the select operator looks the wrong operator to add templating support. I would expect the select operator to just select fields and not transform them. I'm personally dubious about supporting the go templating language at all in any operator but if so I would suggest to use a dedicated operator.

### Uniq

The uniq operator allows filtering out duplicates. Without any parameter (`| uniq`) each field combination must be unique from one entry to the next. However
the filtering can be forced on a specific field, for example `| uniq status_code`, which will return one entry per status code.
Contributor

How do you decide which entry to keep if the matching is done only on a subset of the log entry (ie. a single field)?

Contributor Author

It means it will keep an entry only if the field value has not been seen before. This would allow you to get the list of unique IP addresses, as explained by @slim-bean: `{app="nginx"} | glob ... | uniq address_ip`

I did a poor job at explaining all those use cases. Gonna do another pass to explain that for each operator.

Comment on lines +146 to +147
{job="prod/app"} | json .route client_ip=.client.ip
{job="prod/app"} | json .route .client.ip as route, client_ip
Member

It may be less beneficial to expose both = and as for assigning keys to values. Unless I'm missing something, I think it'd be better to choose one.

Contributor Author

Yeah, I'm asking for feedback on which one is better.

Also, do you prefer `| json .route, .client` or `|json .route .client`?

Member

| json

{job="prod/app"} | json
```

> `{job="prod/app"} |json |select route client_ip` is an implicit extraction, the list of extracted field is inferred from the full query.
Member

If we're inferring the required field, can we remove explicit extraction?

Contributor Author

At query time yes that's the plan.
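Putting the forms discussed in this thread side by side (taken or adapted from the doc's examples; which ones survive is still an open question):

```logql
# Explicit extraction, with an optional rename
{job="prod/app"} | json .route client_ip=.client.ip

# Implicit extraction: the fields to extract are inferred from the select that follows
{job="prod/app"} | json | select route client_ip
```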

### Questions

- Should we allow implicit field extraction, such that `{job="prod/app"} | logfmt` and `{job="prod/app"} | json` would extract all fields? I really like the simplicity of it.
- In the previous document the `as field_name1, field_name2` field renaming notation was also interesting. Should we use it over `ip=client_ip`? For now I've opted for the latter, which can be omitted when there is no renaming.
Member

I'd prefer <ip=>client_ip because as an operator it's easier to change queries. Using the `as` syntax requires that we also keep track of the index of the matches as well as requiring explicitly naming the field. `a, b as _, other_b` could be used to minimize this, but I still prefer the `=` syntax -- I can easily add a new key-value pair in the middle of my selections without causing off-by-one issues in keying the selected values.

Contributor Author

Good point, however `=` or `=>`?


Example: `{job="prod/query-frontend"} | json | filter latency > 10 or status >= 400`

When field value and literal types are different, conversion will be attempted prior to testing, and values will be translated into zero values in case of failure.
Member

I'm hesitant about this - I think failing in the case of invalid conversions would be a more intuitive/safer way to go. If needed, we could support a way to use a default value if parsing is unsuccessful

Contributor Author

Yes, but then I realized: what if only one log line doesn't have the value in the correct format? Erroring seems harsh.

Member

@owen-d owen-d Apr 1, 2020

can we ignore the line instead?

Contributor Author

We could yes ! Noted.
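To make the discussion concrete, here is the filter example from the doc again; the two commented behaviours are the options being weighed above, not decided semantics:

```logql
# Lines where `latency` or `status` cannot be converted to a number would either
# be treated as zero values (current draft) or be skipped entirely (option discussed above)
{job="prod/query-frontend"} | json | filter latency > 10 or status >= 400
```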


### Select

The Select operator (`| select`) allows you to select and reduce the fields that should be returned.
Member

I think that we should only use inferred selections if possible. The go templating use case is independent AFAICT and we could use a gotpl option to do this sort of mapping I think. Thoughts @cyriltovena?

Contributor Author

Yep we might be able to infer all the time. I need to verify this.

And maybe a template operator is better, especially if we can just chain them, but that doesn't improve brevity.
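A sketch of the two shapes being discussed; the template string format and the dedicated `template` stage are both assumptions here, only the idea that select accepts a string argument comes from the doc:

```logql
# Templating through select, as in the current draft (string argument instead of field names)
{job="prod/query-frontend"} | logfmt | select "{{.path}} took {{.latency}}ms"

# Hypothetical dedicated stage, if templating were split out of select
{job="prod/query-frontend"} | logfmt | select path latency | template "{{.path}} took {{.latency}}ms"
```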

@owen-d
Member

owen-d commented Mar 31, 2020

Incredible job @cyriltovena. I had a few comments but overall I'm very impressed with this so far. I think @slim-bean nailed the reasoning for why we want to support these features in loki rather than deferring to CLIs.

@alexvaut

alexvaut commented Apr 8, 2020

1/ Love it. Right now my organization can't use Loki because of the lack of structured logging support. This would definitely be the missing feature that would make the teams adopt it. Knowing how people work can, maybe, drive you toward the first steps, so here it is:

Our current solution is C# -> serilog -> fluentd -> ClickHouse -> Grafana, where serilog generates JSON, fluentd inserts it into ClickHouse with nested fields, and where Grafana dashboards full of ClickHouse (SQL-like) queries, more or less linked with Prometheus labels (when possible) and Grafana variables, reuse the fields/labels set up in our C# code. So very often we do something like this:

in C#:
log.Info("{field1} {field2} {field3} {label1} {label2}",f1,f2,f3,l1,l2)

in Grafana:
SELECT fields['field1'], fields['field2'] FROM logs WHERE labels['label1']='xxx' AND labels['label2']='yyy' AND fields['field3']='zzz'
and display it in tables

Depending on the domain, we can display messages or sometimes only some fields, it really depends. The key is: the same team maintains the C# code and the related Grafana dashboard. Everything in between is generic.

2/ Agree with @slim-bean on the reasoning.

3/ My concern is bringing in yet another new language. Piping is nice, OK, but could we stick to something that already exists in the Linux commands world? I have no clue what that would be; I think for JSON there are already many options to parse, filter and select.

4/ So, to me, what is the MVP of this long set of features that could be accepted and implemented rather quickly to get some feedback? As you understand, for me: JSON parser + field extraction and light filtering would be a first step. :-)

- Show me the first x unique IPs within a time range.
- Reformat a log line to be more readable.

However those use cases are out of scope:
Member

The following use cases are kept out of scope:


> Implicit field extraction could mean trouble in terms of performance. However, in most cases we will be able to infer all used fields from the full query and so limit the extraction per line.

To cover a wide range of structured and unstructured logs, we will start with those four parsers: `glob`, `logfmt`, `json` and `regexp`.
Member

@wardbekker wardbekker Apr 16, 2020

What about Grok expressions? They see quite a lot of usage in the logging world.

@ptman

ptman commented Apr 17, 2020

Is there no way of doing the field extraction in promtail in order to keep the cost of that close to the production of logs instead of making a more expensive query?

Member

@RichiH RichiH left a comment

High-level comments:

  • This should be a GDoc (already agreed)
  • The mix between `(){}`, `| foo | bar`, `|foo .bar` and `as` & `=` is highly confusing to me. Perl's curse is that it allows you to do things in more than one way, which leads to bikeshedding
  • As per our call: Regex are more powerful than globbing, so I would argue that regex are a must, globs are optional. As such, regex should be done first.

Potentially contentious: Should we make/keep LogQL fully LISP-like and use Flux in Loki for anyone who prefers piping?


Based on the feedback of the first design and our own usage, it’s pretty clear that we want to support extracting fields out of the log line for detailed analysis. [Amazon Cloud Watch is a good example](https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/CWL_QuerySyntax-examples.html) of such language features.

For instance some users are currently extracting labels using Promtail pipeline stages as a way to filter and aggregate logs but this causes issues with our storage system. (Causes the creation of a lot of small chunks, putting pressure on the filesystem when trying to aggregated over a long period of time).
Member

Suggested change
For instance some users are currently extracting labels using Promtail pipeline stages as a way to filter and aggregate logs but this causes issues with our storage system. (Causes the creation of a lot of small chunks, putting pressure on the filesystem when trying to aggregated over a long period of time).
For instance some users are currently extracting labels using Promtail pipeline stages as a way to filter and aggregate logs but this causes issues with our storage system. This causes the creation of a lot of small chunks, putting pressure on the filesystem when trying to aggregate over a long period of time.


In this proposal I introduce multiple new operators in the LogQL language that will allow transforming a log stream into a field stream which can be then aggregated, compared, and sorted.

You'll see that all those new operators (except aggregations over time) are chain together with the `|` pipe operator. This is because we believe it's the most intuitive way to write queries that process data in the Linux world as opposed to the function format. (e.g `cat my.log | grep -e 'foo' | jq .field | wc -l` is easier to read than `wc(jq(grep(cat my.log,'foo'),.field),-l)`)
Member

Suggested change
You'll see that all those new operators (except aggregations over time) are chain together with the `|` pipe operator. This is because we believe it's the most intuitive way to write queries that process data in the Linux world as opposed to the function format. (e.g `cat my.log | grep -e 'foo' | jq .field | wc -l` is easier to read than `wc(jq(grep(cat my.log,'foo'),.field),-l)`)
You'll see that all those new operators (except aggregations over time) are chained together with the `|` pipe operator. This is because we believe it's the most intuitive way to write queries that process data in the Linux world as opposed to the function format. (e.g `cat my.log | grep -e 'foo' | jq .field | wc -l` is easier to read than `wc(jq(grep(cat my.log,'foo'),.field),-l)`)

## Use cases

The first use case is to allow users to extract data from the log content itself and use them as labels or sample values.
One of the biggest benefits will be that user won't need to index log content anymore for the only purpose of filtering or creating metrics.
Member

Suggested change
One of the biggest benefits will be that user won't need to index log content anymore for the only purpose of filtering or creating metrics.
One of the biggest benefits is that users won't need to index log content just for filtering or creating metrics any more.


- Show me the logs of requests slower than 5s for a given domain.
- Show me the log with only IP address, latency and path. (Remove redundant noise within logs)
- I have many redundant logs, show me only unique subset of them.
Member

Suggested change
- I have many redundant logs, show me only unique subset of them.
- I have many redundant logs, show me only the unique subset of them.

- Show me the first x unique IPs within a time range.
- Reformat a log line to be more readable.

However those use cases are out of scope:
Member

Suggested change
However those use cases are out of scope:
The following use cases are out of scope:


However those use cases are out of scope:

- What's the latency xth quantile over a year, we don't expect time range to be higher than couple of day max.
Member

Higher-level note, I would be surprised if people didn't threaten to pay us to support a month at least. Would it make sense to at least think about how to extend this time frame?


Agree, with our current logging solution most investigations/use will be ranges of ~1-2 days, but some of them (~5-10%) will be up to 1-2 months.

"We don't expect time range to be higher than couple of day max."


1-2 months feels like in many cases a lot of data. I don't think it's unreasonable though to allow such a query, with the expectation that it might take minutes or more to complete. Additionally, if someone had multiple Loki installations, they could spin up, and run a large cluster for a short amount of time to perform such operations (like how Elastic Map Reduce jobs work)

Contributor Author

Exactly !

However those use cases are out of scope:

- What's the latency xth quantile over a year, we don't expect time range to be higher than couple of day max.
- Replace metrics instrumentation with logs. We want to ease new analysis but we don't want to push users to use this instead of instrumenting their application with Prometheus or any other metric client.
Member

Suggested change
- Replace metrics instrumentation with logs. We want to ease new analysis but we don't want to push users to use this instead of instrumenting their application with Prometheus or any other metric client.
- Replace metrics instrumentation with logs. - We want to ease new analysis but we don't want to push users to use this instead of instrumenting their application with Prometheus or any other metric client.


- What's the latency xth quantile over a year, we don't expect time range to be higher than couple of day max.
- Replace metrics instrumentation with logs. We want to ease new analysis but we don't want to push users to use this instead of instrumenting their application with Prometheus or any other metric client.
- Big data analysis, this whole new feature is mainly for ad-hoc use cases.
Member

Suggested change
- Big data analysis, this whole new feature is mainly for ad-hoc use cases.
- Big data analysis. - This design doc mainly covers ad-hoc use cases.


## LogQL Parsers

A parser operator can only be applied to __log__ streams (a set of entries attached to a given set of label pairs) and will transform (or extract) them into __field__ streams.
Member

Suggested change
A parser operator can only be applied to __log__ streams (a set of entries attached to a given set of label pairs) and will transform (or extract) them into __field__ streams.
A parser operator can only be applied to __log__ streams (a set of entries attached to a given set of label pairs) and will transform or extract them into __field__ streams.

```logql
{job="prod/app"} | json .route client_ip=.client.ip
{job="prod/app"} | json .route .client.ip as route, client_ip
{job="prod/app"} | json |select route client_ip
Member

It's not obvious why I use `.route` but `|select`. Together with the `{}` syntax, we are at three models now.

Member

@RichiH RichiH left a comment

I thought I had submitted this, but GH being wonky today...

  • I think we should move to GDoc again (already agreed)
  • There's currently a mix of `(){}`, `| select`, `.route`, and also `=` and `as`. This is confusing
  • Regex are more powerful than globs, so regex MUST be implemented, globbing MAY be implemented
  • Should we consider keeping LogQL like PromQL with (){} only and support Flux in Loki for pipe fans?

@sandstrom sandstrom left a comment

Good proposal overall, I like it! 🏅

Great that you've spent time thinking about the real use-cases, e.g. a developer troubleshooting an issue and needing to grep through logs. That's what we use our logging for 95% of the time.

I assume this will occur at query-time, so it'll be CPU intensive.

  1. Do we want to support lazy evaluation? Or think about that when designing the API? Something similar to ruby's lazy evaluation (see below).

  2. Do we want to give the user some progress indicator, e.g. query has now searched through X / Y documents.

# example of lazy evaluation in ruby

# This code will "hang" and you'll have to ctrl-c to exit
(1..Float::INFINITY).reject { |i| i.odd? }.map { |i| i*i }.first(5)

(1..Float::INFINITY).lazy.reject { |i| i.odd? }.map { |i| i*i }.first(5)
#=> [4, 16, 36, 64, 100]

http://patshaughnessy.net/2013/4/3/ruby-2-0-works-hard-so-you-can-be-lazy
https://www.honeybadger.io/blog/using-lazy-enumerators-to-work-with-large-files-in-ruby/


However those use cases are out of scope:

- What's the latency xth quantile over a year, we don't expect time range to be higher than couple of day max.


Agree, with our current logging solution most investigations/use will be ranges of ~1-2 days, but some of them (~5-10%) will be up to 1-2 months.

"We don't expect time range to be higher than couple of day max."


The json parser allows parsing a JSON log line and selecting fields to extract. If the line is not a valid JSON document the line will be skipped.

We could support those notations:


Perhaps we could sync with e.g. jq, or use a subset of the jq-supported syntax? Nothing major though, fine with coming up with our own too.

We'd avoid reinventing the wheel + it may make copy-paste easier.
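For what it's worth, the dot-path selectors already in the doc read like a small jq subset, e.g. the same path you would pass to `jq '.client.ip'` on the command line (a sketch; the syntax is not final):

```logql
{job="prod/app"} | json .route .client.ip
```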

Contributor Author

Yes, totally agree, probably a subset, because jq does not have a Go library (at least without cgo) and it would be too hard to support it all.


👍 Subset is totally fine and probably a better idea than supporting everything.

Contributor

Yes, totally agree, probably a subset, because jq does not have a Go library (at least without cgo) and it would be too hard to support it all.

https://github.com/itchyny/gojq

If you wanted to use cgo bindings: https://github.com/jzelinskie/faq/tree/master/pkg/jq but I wouldn't recommend it. It's a huge pain to work with and you'll have a bunch of security issues I'd bet.

Idk how good gojq is though.

### Faceting

As explained in the [parsers section](#logql-parsers), some parsers (`json` and `logfmt`) will support implicit extraction. However, when users want to reduce fields by selecting them, I think it would be useful to provide an API that returns all field names for a given field stream selector and a time range. This way Grafana or any other integration will be able to provide auto-completion for the fields to select.


Very good idea!

@sandstrom sandstrom left a comment

@cyriltovena Looking forward to LogQL! I've added two more thoughts on this proposal, let me know what you think.


Field names are declared by each parser; some parsers will allow implicit field extraction for __brevity__, like the `json` and `logfmt` parsers.

Since field names can conflict with labels during aggregation, we will support for each parser the ability to rename a field. However, in case of conflict, fields will take precedence over labels.


To avoid conflicts between labels and fields, how about using different symbols for them? For example @myLabel and $myField?

Should also make documentation etc. easier and help avoid confusion between the two.

127.0.0.1 - - [19/Jun/2012:09:16:22 +0100] "GET /GO.jpg HTTP/1.1" 499 0 "http://domain.com/htm_data/7/1206/758536.html" "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; .NET CLR 1.1.4322; .NET CLR 2.0.50727; .NET CLR 3.0.4506.2152; .NET CLR 3.5.30729; SE 2.X MetaSr 1.0)"
```

would be transformed into those fields:


Are field key/value pairs only strings? Obviously keys are strings, but can values be something else, e.g. arrays?

We are doing something similar to Canonical Log Lines (ref 1, ref 2), but with all messages for a request nested under a `messages` key, in an array. So basically every HTTP request is one log line.

It looks like this:

{
  "request-id" : "2d94738b-0324-a1fc-2916-644716f3ea36",
  "ip-address" : "127.0.0.1",
  "user-name" : "Jane Doe",
  "http-method" : "get",
  "http-response" : "200",
  "duration" : 125,
  "service" : "authentication",
  "messages" : [
    "Attempted to authenticate user Jane",
    "Connecting to internal IAM system",
    "Internal IAM system responded with unknown user id",
    "Provided user id is unknown, cannot verify user, returning response to user",
    "",
    "Requested completed"
  ]
}

There are two fields here that aren't strings, duration (an integer) and messages (an array).

Would be great if LogQL would support querying against array fields like this.

@cyriltovena let me know what you think!

Contributor Author

I haven't thought about the full json syntax yet.

@yurishkuro

First, want to voice my support for this, as it evolves Loki into a worthy OSS competitor to Humio, which occupies a similar "no indexing" place in the logging spectrum and sports a pretty expressive query language.

Would like to raise a few concerns with the proposed syntax:

  • Requiring Prom-like syntax {job=“appname”} at the beginning is unnecessarily inconsistent with the rest of the pipe-based QL. Please consider supporting an alternate syntax for meta fields, like #job=appname or @job=appname
  • Requiring a parser seems like unnecessary friction to user experience. It may be necessary in some cases, but just as likely (especially for people migrating from ELK stack) that the logs are already known to be in a certain format like json. I suggest making this optional and potentially deriving this setting from configuration somewhere.
  • it was not clear how one could combine field-based filter with full grep. Humio does it pretty elegantly - any freeform text between || is grep.
  • Perhaps the biggest concern is that Range Vector Operations are still using functional notation instead of pipe-based language, which goes against the main motivation provided in the intro section. Could they not be written as just additional stages in the pipeline, like filter a=b | groupBy city district ?

@cyriltovena
Contributor Author

First, want to voice my support for this, as it evolves Loki into a worthy OSS competitor to Humio, which occupies a similar "no indexing" place in the logging spectrum and sports a pretty expressive query language.

Thanks @yurishkuro, feels great to hear you support this.

Do note that I'm actively working on this and it's a work in progress especially around the syntax. Once this is more stable I'll share more.

Would like to raise a few concerns with the proposed syntax:

  • Requiring Prom-like syntax {job=“appname”} at the beginning is unnecessarily inconsistent with the rest of the pipe-based QL. Please consider supporting an alternate syntax for meta fields, like #job=appname or @job=appname

One of the big advantages of using the same Prometheus syntax is the ability to switch between metrics and logs.

  • Requiring a parser seems like unnecessary friction to user experience. It may be necessary in some cases, but just as likely (especially for people migrating from ELK stack) that the logs are already known to be in a certain format like json. I suggest making this optional and potentially deriving this setting from configuration somewhere.

This is a great idea, I don't think we will make this implicit from the start but definitely a great improvement we will be able to do later on.

  • it was not clear how one could combine field-based filter with full grep. Humio does it pretty elegantly - any freeform text between || is grep.

Yes, you will always be able to grep filter first, allowing you to reduce the parsing; then you will be able to do field/label based filtering.

  • Perhaps the biggest concern is that Range Vector Operations are still using functional notation instead of pipe-based language, which goes against the main motivation provided in the intro section. Could they not be written as just additional stages in the pipeline, like filter a=b | groupBy city district ?

I did entertain this idea for a long time: `{app="foo"} |= "err" | rate 5m`. For now we have settled on using functions for metric queries; this way it is easier to learn if you come from Prometheus. Once we have this in, I think we can bring back this discussion. Tom Wilkie also suggested this at some point (e.g. supporting both the function and pipe syntax).

Again thanks for your feedback, I'll keep you posted.
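For readers following along, the two shapes mentioned in that last point look like this (the pipe form is only an idea that may be revisited, not part of the proposal):

```logql
# Function form, consistent with PromQL, which the proposal settles on for now
rate({app="foo"} |= "err" [5m])

# Pipe form that was entertained and may come back later
{app="foo"} |= "err" | rate 5m
```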

@yurishkuro

Thanks, @cyriltovena

One of the big advantages of using the same Prometheus syntax is the ability to switch between metrics and logs.

This is simultaneously a disadvantage that couples Loki with Prometheus. The alternative syntax I showed is just as easy to translate back into a PromQL snippet, but allows Loki to be used in non-Prom installations (e.g. at Uber we use M3QL, not PromQL).

Regarding the pipe syntax, one very intriguing proposition is combining it with M3QL for metrics, which is already pipe based. The expression can start with a query on logs, followed by some aggregation operation like count() or groupBy, which results in a set of time series that can be further manipulated by a pipe-based metrics QL. But this is just a wish list so far.

@Wnthr

Wnthr commented Jul 2, 2020

As a sysadmin I realize I am not the primary use case for Loki, but I am really excited to see this direction for LogQL. It will dramatically enhance the usability of Loki (and, once it supports it, Grafana) as a log visualization aid for me and my cohorts. I am very excited to get my hands on this functionality.

The only use case I see that isn't really addressed, as far as I can tell, is having different log line formats in the same query. In my use case, I can definitely see that querying for a particular IP address would return two or more log line formats, and I would love to be able to add an external classifier to determine the format of the log line and return field specifications that differ per line. Something along the lines of `| classify classifier.sh`, with an external script taking care of the parsing for that specific query, before splitting up and prettifying the output in Grafana.

@ghostsquad

ghostsquad commented Jul 3, 2020

As a sysadmin I realize I am not the primary use case for Loki, but I am really excited to see this direction for LogQL. It will dramatically enhance the usability of Loki (and, once it supports it, Grafana) as a log visualization aid for me and my cohorts. I am very excited to get my hands on this functionality.

The only use case I see that isn't really addressed, as far as I can tell, is having different log line formats in the same query. In my use case, I can definitely see that querying for a particular IP address would return two or more log line formats, and I would love to be able to add an external classifier to determine the format of the log line and return field specifications that differ per line. Something along the lines of `| classify classifier.sh`, with an external script taking care of the parsing for that specific query, before splitting up and prettifying the output in Grafana.

In Sumo Logic, most queries start with a source, which is essentially a specific application or logfile or something like that. So, you can compose queries/filters using something like:

((source=foo | ...) OR (source=bar | ...)) | <more filters here>

Similarly, I am an SRE and I want to be able to see logs from a combination of things in the same view, such as application logs (JSON) but also logs from, say, an Envoy proxy (plaintext). Being able to manipulate how the query shows the result should be as simple as structuring your query with an OR and then, at some point, using parentheses to "combine" these results to be used by further filters.
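The proposal doesn't cover per-stream parsers yet; the closest approximation with the existing selector syntax is matching several streams with a regex label matcher and filtering on the raw line, as in this sketch (the job names and the IP are illustrative):

```logql
# Select both the JSON app stream and the plaintext envoy stream, then grep the raw lines;
# applying a different parser per stream within one query is still an open question
{job=~"prod/(app|envoy)"} |= "10.0.0.1"
```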

@afletch

afletch commented Aug 21, 2020

The updated LogQL design doc is wonderful reading. It feels like the missing part of Loki. The doc is now several months old and seemingly still rather incomplete, particularly in regard to implementation plans. Is this doc still under consideration, or has there been some sort of pause in progress?

@cyriltovena
Contributor Author

Closing it, thanks everyone for all the great feedback, it was such an inspiration ! I'm proud of the result. A great example of community effort.

see #2769
