-
Notifications
You must be signed in to change notification settings - Fork 34
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Question about query with random DISTINCT #23
Comments
Hello, Interesting, The The DISTINCT is here on purpose to give you a non-duplicate list of actors, since your For example, you would definitely have duplicated Tom Hardy actors through your table Sadly, I don't have a solution yet over this performance problem, except using I'm going to do some research on it and post below during the week-end. If you have an equivalent query which is working great and fast, I would be please if you can share it here. |
Also, please paste me here the EXPLAIN ANALYZE of both the query, so we can sort-out why PG doesn't like the |
Hahah I need to watch Legend! That makes sense with the distinct, though how does it know which needs to be unique if the ID, created_at, and updated_at are all different? The primary key is already unique, and our data is actually bad so we have 4 tom hardy records already 😛 Here's the explain on both of those. Kind of interesting to see that the distinct returns more records than the not distinct:
|
Ok, I've made my own little experiment on a prod-like db, where there's around 7M rows in states, 2M in tasks. This is the worst case scenario, as no data are duplicated here (in short: the distinction is useless in this case). SELECT COUNT(s.*)
FROM project_task_states s, project_tasks t
WHERE (t.id = s.project_task_id AND t.project_id = 1)
-- Rows count: 36071 Basically both inner join and double from clause create the same plan: Basic selects (duplicates possible)SELECT s.*
FROM project_task_states s, project_tasks t
WHERE (t.id = s.project_task_id AND t.project_id = 1) Returns:
232.905ms for the basic select query (non distinct). DISTINCT ONEXPLAIN ANALYSE(
SELECT DISTINCT ON (s.id) s.*
FROM project_task_states s
INNER JOIN project_tasks t ON (t.id = s.project_task_id)
WHERE t.project_id = 1
)
Results: 292.115ms, using SORT+UNIQUE Using DISTINCTEXPLAIN ANALYSE(
SELECT DISTINCT s.*
FROM project_task_states s
INNER JOIN project_tasks t ON (t.id = s.project_task_id)
WHERE t.project_id = 1
)
Results: 269.531 ms, using SORT+UNIQUE Interesting enough, make usage of GROUP BY and HASH AGGREGATE, which is faster (no sorting). GROUP BY primary keyEXPLAIN ANALYSE(
SELECT s.*
FROM project_task_states s
INNER JOIN project_tasks t ON (t.id = s.project_task_id)
WHERE t.project_id = 1
GROUP BY s.id)
Here we are even fast thanks to simpler hash computation, over only one field. Basically, the overhead is about 30 to 60ms; simpler HASH AGGREGATE works faster than SORT because sorting the data is complexity ( O(N log(N)) ) while hash access is complexity O(N) Maybe I should use |
Yeah, I think having the The first way is how it's running currently which loads all of the columns twice.
|
I have this schema that's sort of complex, but the main part is every time I run this one method, I get a query that looks like
Running a timing on this
I'm not actually calling distinct anywhere. Here's the broken down code https://gist.github.com/jwoertink/9efbe3afd19973bbe15812e7bb2ded54
Is there a way I can call this without having it call that distinct? Or is this maybe a bug in clear?
The text was updated successfully, but these errors were encountered: