
Support global limits in FreshnessSamplingStrategy and NewDataStrategy again #179

Open
MaxiBoether opened this issue Mar 3, 2023 · 2 comments

@MaxiBoether (Contributor):

By introducing partitioning in the selector, the meaning of limit has shifted: the limit is currently applied per partition, not globally. With a limit of 2 we can therefore still end up with many data points if there are millions of partitions.
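A tiny plain-Python sketch of the problem (the numbers are made up for illustration):

```python
# A limit applied per partition instead of globally: every partition
# contributes up to `limit` samples, so the total scales with the
# number of partitions instead of being capped at `limit`.
limit = 2
num_partitions = 1000
rows_per_partition = 50  # hypothetical partition size

per_partition_total = sum(min(limit, rows_per_partition) for _ in range(num_partitions))
global_total = limit  # what a global limit should give us

print(per_partition_total)  # 2000 samples instead of 2
print(global_total)         # 2
```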

We should make the limit a global setting again. This is not straightforward, since we somehow need to sample across multiple partitions. One way might be to generate indices that map into all partitions (i.e., count globally) and then, before yielding a partition, keep only the samples whose global indices are in our pre-generated list.

@MaxiBoether (Contributor, Author):

Pre-generating a list of keys does not work. However, we could count the number of potential rows that the query would select, generate indices in the range 0..len(result), and then fetch only those rows to avoid materializing the full result.
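A minimal sketch of that idea in plain Python (the helper name and in-memory partitions are stand-ins for the real query): count the candidate rows first, draw global row indices, then keep only the matching offsets while streaming each partition.

```python
import random

def sample_globally(partitions, limit, seed=42):
    """Draw `limit` rows uniformly across all partitions: pick global row
    indices up front, then keep only the rows at those offsets while
    streaming each partition, so the full result is never materialized."""
    total = sum(len(p) for p in partitions)  # SELECT COUNT(*) in practice
    chosen = set(random.Random(seed).sample(range(total), min(limit, total)))

    offset = 0
    for partition in partitions:
        # enumerate with the running offset gives each row its global index
        yield [row for i, row in enumerate(partition, start=offset) if i in chosen]
        offset += len(partition)

# Example: 3 partitions of 10 rows each, global limit of 4.
parts = [list(range(0, 10)), list(range(10, 20)), list(range(20, 30))]
sampled = [row for chunk in sample_globally(parts, 4) for row in chunk]
print(len(sampled))  # 4
```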

@francescodeaglio (Collaborator):

The solution we want seems to be TABLESAMPLE, which, however, is not implemented in all SQL dialects.

A fast and globally valid solution looks like this (filtering inside the subquery so the random sample respects the WHERE clause):

SELECT sample_id
FROM (
    SELECT sample_id
    FROM table
    WHERE ...
    ORDER BY RANDOM()
    LIMIT 100
)
ORDER BY timestamp

In SQLAlchemy, that can be implemented in the following way:

from sqlalchemy import func, select

# Randomly pick target_size keys for this pipeline...
subq = (
    select(SelectorStateMetadata.sample_key)
    .filter(SelectorStateMetadata.pipeline_id == self._pipeline_id)
    .order_by(func.random())
    .limit(target_size)
    .alias()
)
# ...then join back to the base table so the sampled keys can be
# streamed in timestamp order, yield_per rows at a time.
stmt = (
    select(SelectorStateMetadata.sample_key)
    .execution_options(yield_per=self._maximum_keys_in_memory)
    .join(subq, SelectorStateMetadata.sample_key == subq.c.sample_key)
    .order_by(SelectorStateMetadata.timestamp)
)

This remains to be implemented in these selection policies.
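The same random-subquery-plus-join pattern can be exercised end to end with plain SQLite (stdlib `sqlite3`); the table and column names below are stand-ins for SelectorStateMetadata:

```python
import sqlite3

# In-memory stand-in for the selector metadata table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE selector_state_metadata "
    "(sample_key INTEGER, pipeline_id INTEGER, timestamp INTEGER)"
)
conn.executemany(
    "INSERT INTO selector_state_metadata VALUES (?, ?, ?)",
    [(k, 1, 1000 - k) for k in range(500)],
)

target_size = 100
# Inner query: randomly pick target_size keys for the pipeline.
# Outer query: join back to the base table and stream in timestamp order.
rows = conn.execute(
    """
    SELECT m.sample_key
    FROM selector_state_metadata m
    JOIN (
        SELECT sample_key
        FROM selector_state_metadata
        WHERE pipeline_id = ?
        ORDER BY RANDOM()
        LIMIT ?
    ) sub ON m.sample_key = sub.sample_key
    ORDER BY m.timestamp
    """,
    (1, target_size),
).fetchall()

print(len(rows))  # 100
```

Since each row's timestamp here is 1000 minus its key, ordering by timestamp means the sampled keys come back in descending key order, which makes the global ordering easy to verify.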

Pointers for TABLESAMPLE:

  • use the clause

    .with_hint(
        SelectorStateMetadata,
        text("TABLESAMPLE SYSTEM(:sample_size)"),
        "postgresql",
    ).params(bindparam("sample_size", sample_size))
