Add spans to clickhouse exporter #10300
Comments
@hanjm do you have plans to implement this in the future?
@bputt-e I plan to, but I haven't had much time recently. Would you like to contribute it? 😄 grafana-clickhouse-datasource supports querying ClickHouse logs in the Explore UI; it would also be useful for searching spans. grafana/clickhouse-datasource#23
@hanjm possibly, though we are considering the following, which may make more sense for our own code separate from this (we're interested in multitenancy).
I suspect 3 is doable using a Kafka topic per client; tradeoffs, I suppose :)
@hanjm what are your thoughts on the following table schemas? (Ignore the fact that there are no indexes or materialized columns for things like http_status.) We're going to test option 4, since converting everything to strings will have a storage cost; for comparison operations (things like >, <, >=), I expect people can apply functions to the column in the WHERE clause.
Option 1
Option 2
Option 3
Option 4
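The four schema options referenced above were not preserved in this thread. As a rough illustration of the "option 4" direction being tested (string and numeric attributes in separate typed maps, rather than coercing everything to strings), a spans table might look like the following sketch. All column names, types, and ordering keys here are assumptions, not the schemas actually posted:

```sql
-- Hypothetical sketch of an "option 4"-style spans table.
-- Numeric attributes keep their native type, so comparison
-- operations (>, <, >=) need no per-row casting in WHERE clauses.
CREATE TABLE spans
(
    trace_id        String,
    span_id         String,
    parent_span_id  String,
    service_name    LowCardinality(String),
    operation_name  LowCardinality(String),
    start_time      DateTime64(9),
    duration_ns     UInt64,
    attrs_string    Map(LowCardinality(String), String),
    attrs_number    Map(LowCardinality(String), Float64)
)
ENGINE = MergeTree
ORDER BY (service_name, start_time);

-- Numeric predicate without casting a stringified value:
-- SELECT count() FROM spans WHERE attrs_number['http.request.size'] > 1024;
```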
@bputt-e we can only use string and number (float) for map values: the string type for searchable tags, the number type for comparison operations (>=, <=). Complex types (map/array) could be stored as a JSON string (https://github.com/open-telemetry/opentelemetry-collector/blob/main/pdata/internal/common.go#L466); ClickHouse also provides useful JSON functions.
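As a sketch of the suggestion above, a complex attribute stored as a JSON string can be unpacked at query time with ClickHouse's JSON functions. The table and attribute names here are hypothetical:

```sql
-- Hypothetical: a map/array attribute serialized to a JSON string
-- in a Map(String, String) column, queried with JSON functions.
SELECT JSONExtractString(attrs_string['process.tags'], 'region') AS region
FROM spans
WHERE JSONHas(attrs_string['process.tags'], 'region');
```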
It may be helpful to explore storing attributes using the JSON object type with dynamic subcolumns: https://clickhouse.com/docs/en/guides/developer/working-with-json/json-semi-structured That way the data should be stored in the optimal format with the inferred type, and queries can just use dot notation like select event.tag.db.host from spans instead of the string array notation.
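A minimal sketch of that semi-structured approach, assuming the experimental JSON object type available at the time of this thread (table name hypothetical; the dot-notation query mirrors the example above):

```sql
-- The JSON object type was experimental and had to be enabled explicitly.
SET allow_experimental_object_type = 1;

CREATE TABLE spans_json
(
    timestamp DateTime64(9),
    event     JSON
)
ENGINE = MergeTree
ORDER BY timestamp;

-- Dot notation on inferred subcolumns instead of string/array lookups:
SELECT event.tag.db.host FROM spans_json;
```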
It is an exciting experimental feature. Currently its advantage is extracting JSON fields into dedicated columns, stored as separate native columns, which is useful for creating a primary/order key on a specific column. But it cannot have a secondary index created on it the way a map can; the map type can use a bloom filter secondary index.
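The bloom filter secondary index mentioned above can be attached to a map's keys and values. A sketch, assuming the hypothetical `spans` table and `attrs_string` map column from earlier in the thread:

```sql
-- Hypothetical: bloom filter data-skipping indexes over map keys
-- and values, which the JSON object type could not have at the time.
ALTER TABLE spans
    ADD INDEX idx_attr_keys mapKeys(attrs_string)
    TYPE bloom_filter(0.01) GRANULARITY 1;

ALTER TABLE spans
    ADD INDEX idx_attr_vals mapValues(attrs_string)
    TYPE bloom_filter(0.01) GRANULARITY 1;
```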
we're experimenting with map atm to determine its long-term usefulness. We're staying away from JSON atm because it isn't supported for distributed tables, which is what something like this would benefit from.

A potential problem with JSON: if you're dealing with multiple customers and one decides to name their tag xxx, accidentally puts in a numeric value, and it gets generated as a new column, how would you handle cases where they later fix the value to be a string, or where another customer comes in and their tag xxx is of another type?

We're using a materialized column to set a "tenantId" based on the auth information passed to otel, so this affords us the ability to use the same table in the db. Maps also let us separate string/numeric fields upfront.
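The tenant-separation idea above could be sketched as a materialized column derived from an attribute the collector sets from auth information. The attribute key and column name here are assumptions, not the poster's actual schema:

```sql
-- Hypothetical: derive a tenant identifier from a collector-set
-- attribute so all tenants can share one table.
ALTER TABLE spans
    ADD COLUMN tenant_id LowCardinality(String)
    MATERIALIZED attrs_string['tenant.id'];
```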
According to the ClickHouse documentation, the type inference will do type coercion: "This sort of coercion is supported for most types that have variable representation e.g. Int, Float. If necessary, ClickHouse will unify to the higher bit type that allows all current values to be represented. If necessary, converting to a String represents the least precise definition."

I am not sure how putting this into a map is more flexible. If the data type is inconsistent, you would have to coerce to the higher type and do type casting at the query level; you would need the same thing for the map.

The main advantages of the JSON approach are simplified queries and better storage and performance. Since each key:value attribute is actually stored as a database column, it gets the performance benefit of the columnar database that ClickHouse is: if you are retrieving only a specific key (column), only that column's data has to be scanned, whereas with a map the entire map has to be scanned. It should also outperform bloom filters. In a recent video, Altinity posted some performance numbers: about 5x to 6x better query performance using the JSON data type versus a non-native type. https://www.youtube.com/watch?v=FWUKxLS5slc

As for not supporting distributed tables, that is a problem and hopefully will be fixed soon; not sure of the ETA, though.
BTW, for those who understand Mandarin Chinese, there is a video by BiliBili on how they are using ClickHouse at large scale for logs (traces are very similar): https://www.youtube.com/watch?v=nSfzu8Q0B4I They use implicit maps, which basically flatten every key/value pair into separate columns, which is essentially what the JSON data type does. Their slides include some numbers showing the performance improvements.
Do you have any concerns about null values, since it's likely each column won't always have a value for every row?
It should be easy to just filter out null values in the query. I wrote a Python program to load some of my trace spans from Elasticsearch into ClickHouse using the JSON schema, and queries with attributes are quite natural. The query looks something like this: SELECT
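The body of that query was not preserved in this thread. A hedged guess at its shape, using the JSON dot notation described earlier and entirely hypothetical attribute names, with the null filtering the comment mentions:

```sql
-- Hypothetical reconstruction; the actual query was lost.
SELECT
    timestamp,
    event.name,
    event.attributes.http.status_code
FROM spans_json
WHERE event.attributes.http.status_code IS NOT NULL
ORDER BY timestamp DESC
LIMIT 100;
```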
The null concern was around performance, because the number of columns would increase with the number of distinct attributes... I suppose it might be a non-issue, though; ClickHouse prefers an empty string over null.
I don't think null or empty values will cause much storage bloat: it is a columnar DB, and columns can use run-length encoding or other compression to compress empty/null values. Query performance with separate columns (even with lots of null values) will be much better than a map. We are using auto instrumentation (Java), so I don't expect a lot of different attribute keys. I guess this would be more of a concern if applications do custom instrumentation and introduce arbitrary attributes.
Wondering if this should omit certain attributes (KvlistValue, ArrayValue, BytesValue), similar to Elastic's exporter (opentelemetry-collector-contrib/exporter/elasticexporter/internal/translator/elastic/utils.go, line 31 at 4a72901).
Resolved by #13442 |
Is your feature request related to a problem? Please describe.
We're interested in storing spans in ClickHouse instead of using something like ES or Loki/Tempo, as it reduces architectural complexity and gives us more freedom to implement wrappers around the database for visualizations within tools like Grafana.
Describe the solution you'd like
Looking to add spans as a supported option for the ClickHouse exporter.
Describe alternatives you've considered
We're utilizing Loki + Tempo; ideally we could replace those with just ClickHouse.