
Add util package for timeseries data compression using streamvbyte #7810

Merged: bduffany merged 2 commits into master from timeseries-util on Oct 28, 2024

Conversation

@bduffany (Member)

No description provided.

@tylerwilliams (Member) left a comment

I feel like we could simplify this a bit, both the protos and the library. Since we're storing absolute timestamps, we don't need any chunking; you can just slam more data into the array.

Proto API could look like:

message TimeSeries {
  // A streamvbyte delta encoded list of int64 timestamps in usec since the epoch.
  // Use server/util/timeseries to transcode.
  bytes timestamps = 1;
  // A streamvbyte delta encoded list of int64 sample values.
  // Use server/util/timeseries to transcode.
  bytes samples = 2;
  // The length of this timeseries. Necessary for decoding.
  int64 length = 3;
}

and then library code could look like:

func Encode(timestamps, samples []int64) *TimeSeries {}
func Decode(ts *TimeSeries) ([]int64, []int64) {}

Then the calling code can be real simple and dumb:

func main() {
  timestamps := make([]int64, 0)
  samples := make([]int64, 0)

  for i := 0; i < 100_000; i++ {
    if len(timestamps) > 1e6 { // whatever your max memory you want to use for this. 
      break
    }
    timestamps = append(timestamps, time.Now().UnixMicro())
    samples = append(samples, observeSampleValue())
  }

  ts := timeseries.Encode(timestamps, samples)
  task.observedCPU = ts
}
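
For concreteness, here is a minimal sketch of how that Encode/Decode pair could work. It uses stdlib varint delta coding (encoding/binary) as a stand-in for streamvbyte, since no particular Go streamvbyte binding is named in this thread, and a hand-written TimeSeries struct in place of the generated proto type; error handling is omitted:

package timeseries

import "encoding/binary"

// TimeSeries is a stand-in for the generated proto message above.
type TimeSeries struct {
  Timestamps []byte
  Samples    []byte
  Length     int64
}

// deltaEncode stores each value as a varint-encoded delta from its predecessor.
func deltaEncode(values []int64) []byte {
  out := make([]byte, 0, len(values))
  tmp := make([]byte, binary.MaxVarintLen64)
  prev := int64(0)
  for _, v := range values {
    n := binary.PutVarint(tmp, v-prev)
    out = append(out, tmp[:n]...)
    prev = v
  }
  return out
}

// deltaDecode reverses deltaEncode given the number of encoded values.
func deltaDecode(data []byte, length int64) []int64 {
  out := make([]int64, 0, length)
  prev := int64(0)
  for i := int64(0); i < length; i++ {
    d, n := binary.Varint(data)
    data = data[n:]
    prev += d
    out = append(out, prev)
  }
  return out
}

func Encode(timestamps, samples []int64) *TimeSeries {
  return &TimeSeries{
    Timestamps: deltaEncode(timestamps),
    Samples:    deltaEncode(samples),
    Length:     int64(len(timestamps)),
  }
}

func Decode(ts *TimeSeries) (timestamps, samples []int64) {
  return deltaDecode(ts.Timestamps, ts.Length), deltaDecode(ts.Samples, ts.Length)
}

Delta coding works well here because consecutive timestamps (and most counters) differ by small amounts, so most deltas fit in one or two bytes.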

@bduffany (Member, Author) commented Oct 28, 2024

@tylerwilliams I did chunking so that the executor can compress the data incrementally rather than having to keep the full uncompressed dataset in memory while the task is running. Similarly, when the apps decompress the data, they can do so in a streaming fashion.

I guess we can deal with this by just limiting the number of samples to 100K or something though?

Re: the proto API, I think it's redundant to duplicate the timestamps in each timeseries. The timestamps for memory, IO, etc. are all going to be the same because we scrape all of the cgroup files together. Maybe not a big deal, but it does double the amount of storage required to add a new timeseries.

Also, I don't know how we'd support []int64 when streamvbyte only supports int32. Not sure if that was an essential detail of your comment or not. We could maybe have two separate hi and lo streams for the high and low 32-bit halves that are compressed separately (see the sketch below). I kind of prefer just using int32, though, because it encourages lower-resolution data, which will compress better anyway.
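
For illustration, a rough sketch of that hi/lo split idea, assuming each int64 is divided into two int32 halves that would then be compressed as separate streams:

// splitInt64 separates each int64 into sign-extended high and raw low 32-bit halves.
func splitInt64(values []int64) (hi, lo []int32) {
  hi = make([]int32, len(values))
  lo = make([]int32, len(values))
  for i, v := range values {
    hi[i] = int32(v >> 32)   // upper 32 bits (carries the sign)
    lo[i] = int32(uint32(v)) // lower 32 bits
  }
  return hi, lo
}

// joinInt64 reassembles the original values from the two halves.
func joinInt64(hi, lo []int32) []int64 {
  out := make([]int64, len(hi))
  for i := range hi {
    out[i] = int64(hi[i])<<32 | int64(uint32(lo[i]))
  }
  return out
}

For slowly varying values, the hi stream would be nearly constant and compress to almost nothing.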

@tylerwilliams (Member) commented


Ack, I don't think this is worth the added complexity in the protos or the code right now. It's something we could come back and add if we really need it, but performance- and memory-wise you can keep 3+ hours of 10ms-resolution data in less than 20MB and decompress it at ~GB/sec.
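
(Checking that figure: 3 hours at 10ms resolution is 3 × 3600 × 100 = 1,080,000 samples; at 16 bytes per timestamp/value pair, that is roughly 17MB even before compression, so the 20MB bound holds with room to spare.)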

Similar for the timestamps: it's not a ton of data, and if you gzipped the proto the duplication would disappear anyway. I think it's best to keep this simple here.

@bduffany (Member, Author) commented

@tylerwilliams I reworked this based on your comments and what we chatted about offline. PTAL

@bduffany merged commit a021f08 into master on Oct 28, 2024. 15 checks passed.
@bduffany deleted the timeseries-util branch on October 28, 2024 at 20:57.