
LID: introduce Coprocessor #8616

Closed, wants to merge 1 commit.
113 changes: 113 additions & 0 deletions docs/sources/lids/0003-Coprocessor.md
---
title: "0003: Coprocessor"
description: "introduce a coprocessor mechanism, inspired by Google’s BigTable coprocessor and the HBase coprocessor"
---

# 0003: Coprocessor

**Author:** liguozhong (fuling.lgz@alibaba-inc.com)

**Date:** 02/2023

**Sponsor(s):** @jeschkies

**Type:** Feature

**Status:** Draft

**Related issues/PRs:** https://github.com/grafana/loki/issues/8559 and https://github.com/grafana/loki/issues/91

**Thread from [mailing list](https://groups.google.com/forum/#!forum/lokiproject):** N/A

---

## Background

A typical traceID query looks like:

```logql
{log_type="service_metrics"} |= "ee74f4ee-3059-4473-8ba6-94d8bfe03272"
```

We analyzed the source distribution of our LogQL queries: 85% of the Grafana Explore log queries are traceID searches.
The logs for a given traceID generally span about 10 minutes (trace time = start~end).

## Problem Statement

But because users do not know the traceID's start and end time, they usually search a 7-day log range.
In fact, the "7d-10m" portion of that time range is an invalid search.

## Goals
> **@jeschkies (Contributor, Feb 24, 2023):** If I understood the HBase use case, they basically created a plug-in system with some hooks. This seems to be the goal here. How would we implement it? How are we shipping the data? We could use HashiCorp's go-plugin, but there's probably a serialization overhead. As I understood the HBase design, the coprocessor is started on the host where the data is and thus can read the data directly. Not sure we can do that here.

> **@liguozhong (Author):** Yes. My suggestion is that we should not restrict the language. Through HTTP + protobuf, more developers can work together on Loki. Prometheus remote read and the Kubernetes HPA implement their plugin mechanisms through HTTP + protobuf, and this has been very successful. The HBase implementation is limited to JVM languages, so a coprocessor can only be written in languages such as Java or Scala; I don't think that's a good idea. Will HashiCorp's go-plugin limit the development language?

> **@jeschkies (Contributor):** No, they also use gRPC but make it a little simpler for plugins in Go.

> **@liguozhong (Author, Feb 28, 2023):** It would be great if there were no restriction on the implementation language. Can we refer to the OpenTelemetry Collector, which supports both HTTP and gRPC protocols? The SREs I know don't actually know much about gRPC, but they can easily build an HTTP server.
>
> ex: 😱
>
> ```shell
> protoc -I ./vendor/github.com/gogo/protobuf:./vendor:./pkg/logqlmodel/stats:./pkg/logproto --gogoslick_out=plugins=grpc,Mgoogle/protobuf/any.proto=github.com/gogo/protobuf/types,:pkg/logproto/ pkg/logproto/logproto.proto -I ./
> ```

> **@jeschkies (Contributor):** I don't think the plug-in is restricted to Go. One could use gRPC.

> **@liguozhong (Author):** Great.

So we hope to introduce an auxiliary capability to eliminate this "7d-10m" invalid search.

We found that in the database field, such features have been implemented very maturely.
> **@jeschkies (Contributor):** Where in the database? Which database do you mean?


Our team implemented a preQuery Coprocessor, and it worked very well.

Through this feature, we solved the problem that "Loki + traceID search is very slow".

## Non-Goals

Over the past six months we tried to introduce a KV system, a reverse-index text system, and a Bloom filter to speed up traceID LogQL queries, but the machine cost was too high and we finally gave up.
So we do not want to introduce another cost-intensive solution to the slow traceID search problem.

## Proposals

### Proposal 0: Do nothing

If a user relies heavily on traceID log search, they cannot migrate to Loki from indexing-based log systems such as ELK.

### Proposal 1: Query Coprocessor

This proposal follows Google’s BigTable coprocessor and the HBase coprocessor (see the [HBase coprocessor introduction](https://blogs.apache.org/hbase/entry/coprocessor_introduction)).
The idea of HBase coprocessors was inspired by Google’s BigTable coprocessors; Jeff Dean gave a talk at LADIS ’09 (http://www.scribd.com/doc/21631448/Dean-Keynote-Ladis2009, pages 66-67).

**HBase Coprocessor**

The RegionObserver interface provides callbacks for:

- `preOpen`, `postOpen`: called before and after the region is reported as online to the master.
- `preFlush`, `postFlush`: called before and after the memstore is flushed into a new store file.
- `preGet`, `postGet`: called before and after a client makes a Get request.
- `preExists`, `postExists`: called before and after the client tests for existence using a Get.
- `prePut`, `postPut`: called before and after the client stores a value.
- `preDelete`, `postDelete`: called before and after the client deletes a value.
- etc.


**Loki Coprocessor**

The QuerierObserver interface provides callbacks for:

- `preQuery`: called before the querier executes a query. The querier passes three parameters (logql, start, end) to the Coprocessor, and the Coprocessor judges whether the querier actually needs to execute this query.

For example, for a traceID search with a 7d query range and `split_queries_by_interval: 2h`, the LogQL query is split into 84 sub-requests. 83 of them are invalid; only one 2h sub-request can find the logs for the traceID.
We have tried two types of Coprocessors for this scenario.

**traceID Coprocessor 1, simple text analysis:** if the traceID comes from X-Ray or OpenTelemetry (see 《Change default trace-id format to be similar to AWS X-Ray (use timestamp) #1947》), the traceID itself carries a timestamp. Given the longest duration a trace is allowed to run, the Coprocessor can combine that embedded timestamp with the logql start and end times to quickly judge whether the query can match.

> **@jeschkies (Contributor):** So this works only if we're using the X-Ray format?
>
> **@liguozhong (Author):** Yes, our team's exploration is only for this purpose now.

> **@jeschkies (Contributor):** Ah. So the coprocessor parses the timestamp. Did you try doing it in LogQL?

**traceID Coprocessor 2, based on a tracing system:** if the trace exists in a tracing system, the Coprocessor can query that system once for the traceID, and judge whether the logql query is necessary based on the time distribution in the returned result and the start and end time of the logql query.

- `preGetChunk`: ...do something.
- `preGetIndex`: ...do something.
- etc.

The IngesterObserver interface provides callbacks for:

- `preFlush`, `postFlush`: ...do something.
- etc.

## Other Notes

This feature would also allow Loki to provide a Kafka-like log consumption feature in the future, because we could expose hooks such as distributor.PostSendIngestSuccess() and ingester.flushChunk().

It would also give more Loki operators the opportunity to participate in development and to build ecosystem components around Loki, similar to the Prometheus/Thanos relationship.

The possibilities are unlimited, but they may also make Loki less focused on being a low-cost logging system, so this feature is also somewhat dangerous.
I noticed that Loki once considered becoming a log exporter for OpenTelemetry but later rejected the idea; Loki prefers to focus on developing the Loki kernel.
For example, the TSDB index implementation is a very exciting and good result.