
Add util package for timeseries data compression using streamvbyte #7810

Merged: bduffany merged 2 commits into master from timeseries-util on Oct 28, 2024

Conversation

@bduffany (Member)

No description provided.

@tylerwilliams (Member) left a comment

I feel like we could simplify this a bit, both the protos and the library. Since we're storing absolute timestamps, we don't need any chunking; you can just slam more data into the array.

Proto API could look like:

message TimeSeries {
  // A streamvbyte delta encoded list of int64 timestamps in usec since the epoch.
  // Use server/util/timeseries to transcode.
  bytes timestamps = 1;
  // A streamvbyte delta encoded list of int64 sample values.
  // Use server/util/timeseries to transcode.
  bytes samples = 2;
  // The length of this timeseries. Necessary for decoding.
  int64 length = 3;
}

and then library code could look like:

func Encode(timestamps, samples []int64) *TimeSeries {}
func Decode(ts *TimeSeries) ([]int64, []int64) {}

Then the calling code can be real simple and dumb:

func main() {
  timestamps := make([]int64, 0)
  samples := make([]int64, 0)

  for i := 0; i < 100_000; i++ {
    if len(timestamps) > 1e6 { // whatever your max memory you want to use for this. 
      break
    }
    timestamps = append(timestamps, time.Now().UnixMicro())
    samples = append(samples, observeSampleValue())
  }

  ts := timeseries.Encode(timestamps, samples)
  task.observedCPU = ts
}
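
For concreteness, here is a minimal sketch of how that Encode/Decode pair could work. It uses stdlib varint delta coding (encoding/binary) as a stand-in for streamvbyte, since no particular Go streamvbyte binding is named in this thread, and a hand-written TimeSeries struct in place of the generated proto type; error handling is omitted:

package timeseries

import "encoding/binary"

// TimeSeries is a stand-in for the generated proto message above.
type TimeSeries struct {
  Timestamps []byte
  Samples    []byte
  Length     int64
}

// deltaEncode stores each value as a varint-encoded delta from its predecessor.
func deltaEncode(values []int64) []byte {
  out := make([]byte, 0, len(values))
  tmp := make([]byte, binary.MaxVarintLen64)
  prev := int64(0)
  for _, v := range values {
    n := binary.PutVarint(tmp, v-prev)
    out = append(out, tmp[:n]...)
    prev = v
  }
  return out
}

// deltaDecode reverses deltaEncode given the number of encoded values.
func deltaDecode(data []byte, length int64) []int64 {
  out := make([]int64, 0, length)
  prev := int64(0)
  for i := int64(0); i < length; i++ {
    d, n := binary.Varint(data)
    data = data[n:]
    prev += d
    out = append(out, prev)
  }
  return out
}

func Encode(timestamps, samples []int64) *TimeSeries {
  return &TimeSeries{
    Timestamps: deltaEncode(timestamps),
    Samples:    deltaEncode(samples),
    Length:     int64(len(timestamps)),
  }
}

func Decode(ts *TimeSeries) (timestamps, samples []int64) {
  return deltaDecode(ts.Timestamps, ts.Length), deltaDecode(ts.Samples, ts.Length)
}

Delta coding works well here because consecutive timestamps (and most counters) differ by small amounts, so most deltas fit in one or two bytes.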

@bduffany (Member, Author) commented Oct 28, 2024

@tylerwilliams I did chunking so that the executor can compress the data incrementally rather than having to keep the full uncompressed dataset in memory while the task is running. Similarly, when the apps decompress the data, they can do so in a streaming fashion.

I guess we can deal with this by just limiting the number of samples to 100K or something though?

Re: the proto API, I think it's redundant to duplicate the timestamps in each timeseries. The timestamps for memory, IO, etc. are all going to be the same because we scrape all of the cgroup files together. Maybe not a big deal, but it does double the amount of storage required to add a new timeseries.

Also, I don't know how we'd support []int64 when streamvbyte only supports int32. Not sure if that was an essential detail of your comment or not. We could maybe have two separate hi and lo streams for the high and low 32-bit halves that are compressed separately (see the sketch below). I kind of prefer just using int32, though, because it encourages lower-resolution data, which will compress better anyway.
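
For illustration, a rough sketch of that hi/lo split idea, assuming each int64 is divided into two int32 halves that would then be compressed as separate streams:

// splitInt64 separates each int64 into sign-extended high and raw low 32-bit halves.
func splitInt64(values []int64) (hi, lo []int32) {
  hi = make([]int32, len(values))
  lo = make([]int32, len(values))
  for i, v := range values {
    hi[i] = int32(v >> 32)   // upper 32 bits (carries the sign)
    lo[i] = int32(uint32(v)) // lower 32 bits
  }
  return hi, lo
}

// joinInt64 reassembles the original values from the two halves.
func joinInt64(hi, lo []int32) []int64 {
  out := make([]int64, len(hi))
  for i := range hi {
    out[i] = int64(hi[i])<<32 | int64(uint32(lo[i]))
  }
  return out
}

For slowly varying values, the hi stream would be nearly constant and compress to almost nothing.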

@tylerwilliams (Member) commented


Ack, I don't think this is worth the added complexity in the protos or the code right now. It's something we could come back and add if we really need it, but performance- and memory-wise you can keep 3+ hours of 10ms-resolution data in less than 20MB and decompress it at ~GB/sec.
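
(Checking that figure: 3 hours at 10ms resolution is 3 × 3600 × 100 = 1,080,000 samples; at 16 bytes per timestamp/value pair, that is roughly 17MB even before compression, so the 20MB bound holds with room to spare.)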

Similar for the timestamps: it's not a ton of data, and if you gzipped the proto the duplication would disappear anyway. I think it's best to keep this simple here.

@bduffany (Member, Author) commented

@tylerwilliams I reworked this based on your comments and what we chatted about offline. PTAL

@bduffany merged commit a021f08 into master on Oct 28, 2024. 15 checks passed.
@bduffany deleted the timeseries-util branch on October 28, 2024 at 20:57.