Add reservoir sampling #11554

@brancz

Description

Is your feature request related to a problem or challenge?

We have a large sample of statistical data, but all we need is a subset that maintains statistical significance. Returning that much smaller result to users, with insignificantly small values left out, results in much lower latency.

Describe the solution you'd like

Add the ability to (statistically) sample rows. We've done this using reservoir sampling before. I imagine statistical sampling is widely used enough that it should be supported as a first-class feature.
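For readers unfamiliar with the technique, here is a minimal sketch of classic reservoir sampling (often called Algorithm R) in plain Python. It is only an illustration of the idea, not DataFusion code or a proposed API; the function name `reservoir_sample` and its parameters are made up for this example.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(
    rows: Iterable[T], k: int, rng: Optional[random.Random] = None
) -> List[T]:
    """Return a uniform random sample of up to k rows from a stream.

    Hypothetical helper for illustration only. The stream is read once and
    only k rows are kept in memory at any time.
    """
    if k < 0:
        raise ValueError("k must be non-negative")
    rng = rng or random.Random()
    reservoir: List[T] = []
    for i, row in enumerate(rows):
        if i < k:
            # Fill the reservoir with the first k rows.
            reservoir.append(row)
        else:
            # Pick a slot in [0, i]; the row survives only if the slot
            # falls inside the reservoir, i.e. with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = row
    return reservoir


# Example: keep 10 rows out of a stream of 1,000,000 without buffering them.
sample = reservoir_sample(range(1_000_000), 10, random.Random(42))
```

Each input row ends up in the result with equal probability, and the input length does not need to be known up front, which is why this fits a streaming query engine.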

Describe alternatives you've considered

I don't know enough about DataFusion to know whether this is possible via a UDF. In the past, we've had issues when sampling records that are pushed into the query layer: the underlying record is still held onto, because immediately materializing each sampled row would produce tiny, inefficient 1-row records. Eventually, though, the sampled rows need to be materialized, as otherwise memory explodes.
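To make the memory concern concrete, here is a rough sketch of one way to handle it: the reservoir stores `(batch, row index)` references, and once the sampled rows pin more than a threshold number of distinct batches, they are copied into a single new batch so the originals can be freed. Everything here is an assumption for illustration: plain Python lists stand in for record batches, and `BatchReservoir`, `compact_threshold`, and the method names are hypothetical, not DataFusion or Arrow APIs.

```python
import random
from typing import List, Optional, Sequence, Tuple

Batch = Sequence[object]  # stand-in for a columnar record batch


class BatchReservoir:
    """Reservoir sampler that references rows lazily and compacts on demand."""

    def __init__(
        self, k: int, compact_threshold: int, rng: Optional[random.Random] = None
    ) -> None:
        self.k = k
        self.compact_threshold = compact_threshold  # max distinct batches to pin
        self.rng = rng or random.Random()
        self.seen = 0  # number of rows observed so far
        self.slots: List[Tuple[Batch, int]] = []  # (batch, row index) references

    def push_batch(self, batch: Batch) -> None:
        for idx in range(len(batch)):
            if self.seen < self.k:
                self.slots.append((batch, idx))
            else:
                j = self.rng.randint(0, self.seen)
                if j < self.k:
                    self.slots[j] = (batch, idx)
            self.seen += 1
        # Avoid 1-row materialization per sample, but do not let the sampled
        # rows keep an unbounded number of source batches alive either.
        if self.retained_batches() > self.compact_threshold:
            self.compact()

    def retained_batches(self) -> int:
        """Number of distinct batches currently kept alive by the reservoir."""
        return len({id(batch) for batch, _ in self.slots})

    def compact(self) -> None:
        """Copy the sampled rows into one new batch and drop the old references."""
        merged = [batch[idx] for batch, idx in self.slots]
        self.slots = [(merged, i) for i in range(len(merged))]

    def rows(self) -> List[object]:
        return [batch[idx] for batch, idx in self.slots]
```

The threshold trades copy cost against memory: a low value materializes often but keeps few batches alive, while a high value copies rarely but lets more source batches linger.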

Additional context

No response

Metadata

Assignees

No one assigned

    Labels

    enhancement: New feature or request
