
Support global limits in FreshnessSamplingStrategy and NewDataStrategy again #179

Open
MaxiBoether opened this issue Mar 3, 2023 · 2 comments

@MaxiBoether (Contributor):

By introducing partitioning in the selector, the meaning of limit has shifted: the limit is currently applied per partition, not globally. With a limit of 2 we can therefore still end up with many data points if there are millions of partitions.
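A tiny plain-Python sketch of the problem (the numbers are made up for illustration):

```python
# A limit applied per partition instead of globally: every partition
# contributes up to `limit` samples, so the total scales with the
# number of partitions instead of being capped at `limit`.
limit = 2
num_partitions = 1000
rows_per_partition = 50  # hypothetical partition size

per_partition_total = sum(min(limit, rows_per_partition) for _ in range(num_partitions))
global_total = limit  # what a global limit should give us

print(per_partition_total)  # 2000 samples instead of 2
print(global_total)         # 2
```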

We should make the limit a global setting again. This is not straightforward, since we somehow need to sample across multiple partitions. One way might be to generate indices that map into all partitions (i.e., count globally) and then, before yielding a partition, keep only the samples whose global indices are in our pre-generated list.

@MaxiBoether (Contributor, Author):

Pre-generating a list of keys does not work. However, we could count the number of potential rows that the query would select, generate indices in the range 0..len(result), and then fetch only those rows to avoid materializing the full result.
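A minimal sketch of that idea in plain Python (the helper name and in-memory partitions are stand-ins for the real query): count the candidate rows first, draw global row indices, then keep only the matching offsets while streaming each partition.

```python
import random

def sample_globally(partitions, limit, seed=42):
    """Draw `limit` rows uniformly across all partitions: pick global row
    indices up front, then keep only the rows at those offsets while
    streaming each partition, so the full result is never materialized."""
    total = sum(len(p) for p in partitions)  # SELECT COUNT(*) in practice
    chosen = set(random.Random(seed).sample(range(total), min(limit, total)))

    offset = 0
    for partition in partitions:
        # enumerate with the running offset gives each row its global index
        yield [row for i, row in enumerate(partition, start=offset) if i in chosen]
        offset += len(partition)

# Example: 3 partitions of 10 rows each, global limit of 4.
parts = [list(range(0, 10)), list(range(10, 20)), list(range(20, 30))]
sampled = [row for chunk in sample_globally(parts, 4) for row in chunk]
print(len(sampled))  # 4
```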

@francescodeaglio (Collaborator):

The solution we want seems to be TABLESAMPLE, which, however, is not implemented in all SQL dialects.

A fast and globally valid solution looks like this (filtering inside the subquery so the random sample respects the WHERE clause):

SELECT sample_id
FROM (
    SELECT sample_id
    FROM table
    WHERE ...
    ORDER BY RANDOM()
    LIMIT 100
)
ORDER BY timestamp

In SQLAlchemy, that can be implemented in the following way:

from sqlalchemy import func, select

# Randomly pick target_size keys for this pipeline...
subq = (
    select(SelectorStateMetadata.sample_key)
    .filter(SelectorStateMetadata.pipeline_id == self._pipeline_id)
    .order_by(func.random())
    .limit(target_size)
    .alias()
)
# ...then join back to the base table so the sampled keys can be
# streamed in timestamp order, yield_per rows at a time.
stmt = (
    select(SelectorStateMetadata.sample_key)
    .execution_options(yield_per=self._maximum_keys_in_memory)
    .join(subq, SelectorStateMetadata.sample_key == subq.c.sample_key)
    .order_by(SelectorStateMetadata.timestamp)
)

This remains to be implemented in these selection policies.
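The same random-subquery-plus-join pattern can be exercised end to end with plain SQLite (stdlib `sqlite3`); the table and column names below are stand-ins for SelectorStateMetadata:

```python
import sqlite3

# In-memory stand-in for the selector metadata table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE selector_state_metadata "
    "(sample_key INTEGER, pipeline_id INTEGER, timestamp INTEGER)"
)
conn.executemany(
    "INSERT INTO selector_state_metadata VALUES (?, ?, ?)",
    [(k, 1, 1000 - k) for k in range(500)],
)

target_size = 100
# Inner query: randomly pick target_size keys for the pipeline.
# Outer query: join back to the base table and stream in timestamp order.
rows = conn.execute(
    """
    SELECT m.sample_key
    FROM selector_state_metadata m
    JOIN (
        SELECT sample_key
        FROM selector_state_metadata
        WHERE pipeline_id = ?
        ORDER BY RANDOM()
        LIMIT ?
    ) sub ON m.sample_key = sub.sample_key
    ORDER BY m.timestamp
    """,
    (1, target_size),
).fetchall()

print(len(rows))  # 100
```

Since each row's timestamp here is 1000 minus its key, ordering by timestamp means the sampled keys come back in descending key order, which makes the global ordering easy to verify.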

Pointers for TABLESAMPLE:

  • use the clause

    .with_hint(
        SelectorStateMetadata,
        text("TABLESAMPLE SYSTEM(:sample_size)"),
        "postgresql",
    ).params(bindparam("sample_size", sample_size))
