Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sql: sampling selects #7030

Open
tbg opened this issue Jun 3, 2016 · 12 comments
Open

sql: sampling selects #7030

tbg opened this issue Jun 3, 2016 · 12 comments
Labels
A-sql-execution Relating to SQL execution. A-sql-optimizer SQL logical planning and optimizations. A-sql-semantics C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) T-sql-queries SQL Queries Team

Comments

@tbg
Copy link
Member

tbg commented Jun 3, 2016

Postgres in 9.5 introduced the TABLESAMPLE clause:

A TABLESAMPLE clause after a table_name indicates that the specified sampling_method should be used to retrieve a subset of the rows in that table. This sampling precedes the application of any other filters such as WHERE clauses. The standard PostgreSQL distribution includes two sampling methods, BERNOULLI and SYSTEM, and other sampling methods can be installed in the database via extensions.

The BERNOULLI and SYSTEM sampling methods each accept a single argument which is the fraction of the table to sample, expressed as a percentage between 0 and 100. This argument can be any real-valued expression. (Other sampling methods might accept more or different arguments.) These two methods each return a randomly-chosen sample of the table that will contain approximately the specified percentage of the table's rows. The BERNOULLI method scans the whole table and selects or ignores individual rows independently with the specified probability. The SYSTEM method does block-level sampling with each block having the specified chance of being selected; all rows in each selected block are returned. The SYSTEM method is significantly faster than the BERNOULLI method when small sampling percentages are specified, but it may return a less-random sample of the table as a result of clustering effects.

The optional REPEATABLE clause specifies a seed number or expression to use for generating random numbers within the sampling method. The seed value can be any non-null floating-point value. Two queries that specify the same seed and argument values will select the same sample of the table, if the table has not been changed meanwhile. But different seed values will usually produce different samples. If REPEATABLE is not given then a new random sample is selected for each query. Note that some add-on sampling methods do not accept REPEATABLE, and will always produce new samples on each use.

Prior to 9.5, similar things could be done manually: https://www.periscopedata.com/blog/how-to-sample-rows-in-sql-273x-faster.html https://stackoverflow.com/questions/8674718/best-way-to-select-random-rows-postgresql

Implementing something like TABLESAMPLE is likely relatively difficult, but we could check that a manual query which performs something similar is available and gets a somewhat decent query plan.

Opened this issue because I was asked about it in a recent tech talk.

Jira issue: CRDB-6181

@tbg tbg added the C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) label Jun 3, 2016
@petermattis petermattis modified the milestone: Later Jul 11, 2016
@knz
Copy link
Contributor

knz commented May 9, 2018

@petermattis @RaduBerinde what do you suggest we do about this?

@knz knz added A-sql-semantics A-sql-optimizer SQL logical planning and optimizations. labels May 9, 2018
@knz knz added the A-sql-execution Relating to SQL execution. label May 9, 2018
@petermattis petermattis removed this from the Later milestone Oct 5, 2018
@AlexandreK38
Copy link

Will this be added to cockroach, and if so when? It's been 3 years since the issue was reported, so just wondering

@knz
Copy link
Contributor

knz commented Aug 13, 2019

deferring to @RaduBerinde @rytaft for comments

@rytaft
Copy link
Collaborator

rytaft commented Aug 13, 2019

I am not aware of this feature being on the roadmap (cc @awoods187), but it wouldn't be very hard to implement the BERNOULLI method described above given that we're already doing something very similar for table statistics collection. Adding a REPEATABLE option should be relatively easy, but I'm not sure it would be that useful since any changes in data distribution could change the result, even without changes to the data itself.

Implementing the BERNOULLI method would actually be simpler than what we're already doing for CREATE STATISTICS because we wouldn't need to maintain a sample reservoir (that is only necessary to collect a pre-defined number of samples). We'd need to write a new DistSQL processor, but it would be VERY simple, consisting of a single random number generator to decide whether to keep or discard each row.

Adding a SYSTEM sampling method would require a different approach that is aware of how data is stored in RocksDB.

@awoods187
Copy link
Contributor

@Kumamon38 could you tell me a little bit more about how and why you'd like to use this potential feature?

@AlexandreK38
Copy link

Hi and thanks everyone who answered and explained!
I am currently using Nakama solution for my game and I was looking for the best way to pick random rows from the database. I saw on different posts on stack overflow several ways to do that and for Postgres the best one so far after years seemed to be the tablesample system since postgres 9.5, so I had a look if you did something similar and found this post.
For now I managed to simplify how I pick my players without a random so it’s fine. I was just wondering if you will add it some day, not asking to add it :)
Thanks anyway!

@github-actions
Copy link

We have marked this issue as stale because it has been inactive for
18 months. If this issue is still relevant, removing the stale label
or adding a comment will keep it active. Otherwise, we'll close it in
10 days to keep the issue queue tidy. Thank you for your contribution
to CockroachDB!

@bladefist
Copy link

bladefist commented Jul 20, 2021

Enterprise customer here, would like to throw our hat in this ring, we would be interested in this feature. Thank you.

@rytaft
Copy link
Collaborator

rytaft commented Jul 26, 2021

cc @vy-ton for visibility

@mgartner mgartner moved this to New Backlog in SQL Queries Jul 24, 2023
@github-actions
Copy link

We have marked this issue as stale because it has been inactive for
18 months. If this issue is still relevant, removing the stale label
or adding a comment will keep it active. Otherwise, we'll close it in
10 days to keep the issue queue tidy. Thank you for your contribution
to CockroachDB!

@github-actions github-actions bot closed this as not planned Won't fix, can't repro, duplicate, stale Oct 9, 2023
@github-project-automation github-project-automation bot moved this from New Backlog to Done in SQL Queries Oct 9, 2023
@rytaft
Copy link
Collaborator

rytaft commented Dec 4, 2023

still relevant

@rytaft rytaft reopened this Dec 4, 2023
@github-project-automation github-project-automation bot moved this from Done to Triage in SQL Queries Dec 4, 2023
@rytaft rytaft removed the X-stale label Dec 4, 2023
@rytaft rytaft moved this from Triage to New Backlog in SQL Queries Dec 4, 2023
@israellot
Copy link

Still relevant, I would benefit from this feature.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-sql-execution Relating to SQL execution. A-sql-optimizer SQL logical planning and optimizations. A-sql-semantics C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) T-sql-queries SQL Queries Team
Projects
Status: Backlog
Development

No branches or pull requests