
Question about query with random DISTINCT #23

Closed
jwoertink opened this issue Aug 2, 2018 · 5 comments
@jwoertink
Contributor

I have this schema that's sort of complex, but the main part is that every time I run this one method, I get a query that looks like:

SELECT DISTINCT actors.*, actors.* FROM actors ....

Running a timing on this:

with that DISTINCT: Time: 76.873 ms
without DISTINCT: Time: 2.814 ms

I'm not actually calling distinct anywhere. Here's the broken-down code: https://gist.github.com/jwoertink/9efbe3afd19973bbe15812e7bb2ded54

Is there a way I can call this without having it call that distinct? Or is this maybe a bug in clear?

@anykeyh
Owner

anykeyh commented Aug 3, 2018

Hello,

Interesting,

The DISTINCT actors.* followed by actors.* is a bug; I'll reproduce it in the specs and fix it 👍.

The DISTINCT is there on purpose, to give you a non-duplicated list of actors: since your video table has many performances, the join can produce many duplicated actors.

For example, you would definitely get duplicated Tom Hardy actors through your performances table if you select the movie Legend, as T. Hardy plays two brothers in this incredible performance (unless you've got the Liverpool accent down, put subtitles on, but I digress 😉 ...).

Sadly, I don't have a solution for this performance problem yet, except using DISTINCT ON ($pkey) (a Unique op) instead of DISTINCT (a Sort/compare op).
On the other hand, using GROUP BY $pkey will force the planner to use HashAgg, but this would yield only a small improvement.
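The join-duplication and dedupe-on-primary-key ideas can be sketched outside PostgreSQL; below, sqlite3 stands in for it (sqlite has no DISTINCT ON, so only DISTINCT and GROUP BY are shown), with a toy actors/performances schema mirroring the one discussed and entirely made-up data:

```python
# Sketch: a join duplicates the actor row; DISTINCT or GROUP BY pkey dedupes it.
# sqlite3 stands in for PostgreSQL here; schema and data are hypothetical.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE actors (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE performances (id INTEGER PRIMARY KEY, actor_id INTEGER, video_id INTEGER);
    INSERT INTO actors VALUES (1, 'Tom Hardy');
    -- Tom Hardy plays both Kray brothers in "Legend" (video_id = 1),
    -- so he has two performances for the same video.
    INSERT INTO performances VALUES (1, 1, 1), (2, 1, 1);
""")

join_sql = """
    SELECT actors.id, actors.name FROM actors
    INNER JOIN performances ON performances.actor_id = actors.id
    WHERE performances.video_id = 1
"""

plain = con.execute(join_sql).fetchall()
print(plain)  # [(1, 'Tom Hardy'), (1, 'Tom Hardy')] -- duplicated by the join

distinct_rows = con.execute("SELECT DISTINCT " + join_sql[len("    SELECT "):]).fetchall()
print(distinct_rows)  # [(1, 'Tom Hardy')] -- DISTINCT collapses identical rows

grouped = con.execute(join_sql + " GROUP BY actors.id").fetchall()
print(grouped)  # [(1, 'Tom Hardy')] -- GROUP BY on the primary key dedupes too
```

Both strategies return one actor per primary key; the difference discussed in this thread is only in which plan node (Sort/Unique vs. Hash Aggregate) the planner picks to get there.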

I'm going to do some research on it and post below during the weekend. If you have an equivalent query which works great and fast, I would be pleased if you could share it here.

@anykeyh
Owner

anykeyh commented Aug 3, 2018

Also, please paste here the EXPLAIN ANALYZE of both queries, so we can sort out why PG doesn't like the DISTINCT here.

@jwoertink
Contributor Author

Hahah I need to watch Legend!

That makes sense with the distinct, though how does it know which rows need to be unique if the id, created_at, and updated_at are all different? The primary key is already unique, and our data is actually bad, so we already have 4 Tom Hardy records 😛
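On the "how does it know what's unique" question: plain DISTINCT compares every column in the select list, so rows that differ in any column (like the id) are all kept, and only the exact duplicates the join produces are removed. A quick sqlite3 sketch with hypothetical data:

```python
# DISTINCT dedupes over the whole select list: two "Tom Hardy" rows with
# different ids are NOT collapsed, because they are not identical rows.
# sqlite3 stands in for PostgreSQL; data is made up.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE actors (id INTEGER PRIMARY KEY, name TEXT)")
con.executemany("INSERT INTO actors VALUES (?, ?)",
                [(1, "Tom Hardy"), (2, "Tom Hardy")])

all_columns = con.execute("SELECT DISTINCT id, name FROM actors").fetchall()
print(sorted(all_columns))  # [(1, 'Tom Hardy'), (2, 'Tom Hardy')] -- both kept

name_only = con.execute("SELECT DISTINCT name FROM actors").fetchall()
print(name_only)  # [('Tom Hardy',)] -- deduped once id leaves the select list
```

So four "bad data" Tom Hardy records with distinct primary keys will all survive a DISTINCT actors.*; only the join-induced copies of the same row get removed.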

Here's the explain for both of those. Kind of interesting to see that the distinct plan outputs more lines than the non-distinct one:

primetime=# explain analyze SELECT DISTINCT actors.*, actors.* FROM actors INNER JOIN performances ON ((performances.actor_id = actors.id)) WHERE (performances.video_id = 1234);
                                                                                                                                                                     QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Unique  (cost=385.80..385.84 rows=1 width=254) (actual time=4.530..4.530 rows=0 loops=1)
   ->  Sort  (cost=385.80..385.80 rows=1 width=254) (actual time=4.529..4.529 rows=0 loops=1)
         Sort Key: actors.id, actors.name, actors.cached_slug, actors.created_at, actors.updated_at, actors.attachment_file_name, actors.attachment_file_size, actors.attachment_content_type, actors.age, actors.name_starts_with
         Sort Method: quicksort  Memory: 25kB
         ->  Nested Loop  (cost=0.28..385.79 rows=1 width=254) (actual time=4.405..4.405 rows=0 loops=1)
               ->  Seq Scan on performances  (cost=0.00..377.48 rows=1 width=4) (actual time=4.405..4.405 rows=0 loops=1)
                     Filter: (video_id = 1234)
                     Rows Removed by Filter: 21078
               ->  Index Scan using actors_pkey on actors  (cost=0.28..8.30 rows=1 width=127) (never executed)
                     Index Cond: (id = performances.actor_id)
 Planning time: 8.125 ms
 Execution time: 5.141 ms
(12 rows)

primetime=# explain analyze SELECT actors.* FROM actors INNER JOIN performances ON ((performances.actor_id = actors.id)) WHERE (performances.video_id = 1234);
                                                  QUERY PLAN
--------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=0.28..385.79 rows=1 width=127) (actual time=2.732..2.732 rows=0 loops=1)
   ->  Seq Scan on performances  (cost=0.00..377.48 rows=1 width=4) (actual time=2.732..2.732 rows=0 loops=1)
         Filter: (video_id = 1234)
         Rows Removed by Filter: 21078
   ->  Index Scan using actors_pkey on actors  (cost=0.28..8.30 rows=1 width=127) (never executed)
         Index Cond: (id = performances.actor_id)
 Planning time: 0.414 ms
 Execution time: 2.812 ms
(8 rows)

@anykeyh
Owner

anykeyh commented Aug 4, 2018

Ok, I've made my own little experiment on a prod-like DB, with around 7M rows in states and 2M in tasks. This is the worst-case scenario, as no rows are duplicated here (in short: the deduplication is useless in this case).

SELECT COUNT(s.*)
FROM project_task_states s, project_tasks t 
WHERE (t.id = s.project_task_id AND t.project_id = 1)
-- Rows count: 36071 

Basically, both the INNER JOIN and the double FROM clause produce the same plan:

Basic selects (duplicates possible)

SELECT s.*
FROM project_task_states s, project_tasks t 
WHERE (t.id = s.project_task_id AND t.project_id = 1)

Returns:

Nested Loop  (cost=0.86..66791.68 rows=9441 width=127) (actual time=0.051..230.768 rows=36071 loops=1)
  ->  Index Scan using index_project_tasks_on_project_id on project_tasks t  (cost=0.43..13378.66 rows=4506 width=4) (actual time=0.037..70.839 rows=14428 loops=1)
        Index Cond: (project_id = 1)
  ->  Index Scan using index_project_task_states_on_project_task_id on project_task_states s  (cost=0.43..11.82 rows=3 width=127) (actual time=0.006..0.010 rows=3 loops=14428)
        Index Cond: (project_task_id = t.id)
Planning time: 0.457 ms
Execution time: 232.905 ms

232.905 ms for the basic select query (non-distinct).

DISTINCT ON

EXPLAIN ANALYSE(
SELECT DISTINCT ON (s.id) s.*
FROM project_task_states s
INNER JOIN project_tasks t ON (t.id = s.project_task_id)
WHERE t.project_id = 1
)
Unique  (cost=67415.01..67462.22 rows=9441 width=127) (actual time=269.591..283.293 rows=36071 loops=1)
  ->  Sort  (cost=67415.01..67438.61 rows=9441 width=127) (actual time=269.590..276.223 rows=36071 loops=1)
        Sort Key: s.id
        Sort Method: external sort  Disk: 2064kB
        ->  Nested Loop  (cost=0.86..66791.68 rows=9441 width=127) (actual time=0.054..234.925 rows=36071 loops=1)
              ->  Index Scan using index_project_tasks_on_project_id on project_tasks t  (cost=0.43..13378.66 rows=4506 width=4) (actual time=0.046..71.097 rows=14428 loops=1)
                    Index Cond: (project_id = 1)
              ->  Index Scan using index_project_task_states_on_project_task_id on project_task_states s  (cost=0.43..11.82 rows=3 width=127) (actual time=0.006..0.010 rows=3 loops=14428)
                    Index Cond: (project_task_id = t.id)
Planning time: 0.589 ms
Execution time: 292.115 ms

Results: 292.115ms, using SORT+UNIQUE

Using DISTINCT

EXPLAIN ANALYSE(
SELECT DISTINCT s.*
FROM project_task_states s
INNER JOIN project_tasks t ON (t.id = s.project_task_id)
WHERE t.project_id = 1
)
HashAggregate  (cost=67004.10..67098.51 rows=9441 width=127) (actual time=256.158..267.949 rows=36071 loops=1)
  Group Key: s.id, s.state, s.user_id, s.project_task_id, s.created_at, s.starts_at, s.ends_at, s.data, s.total_time
  ->  Nested Loop  (cost=0.86..66791.68 rows=9441 width=127) (actual time=0.046..231.255 rows=36071 loops=1)
        ->  Index Scan using index_project_tasks_on_project_id on project_tasks t  (cost=0.43..13378.66 rows=4506 width=4) (actual time=0.038..70.515 rows=14428 loops=1)
              Index Cond: (project_id = 1)
        ->  Index Scan using index_project_task_states_on_project_task_id on project_task_states s  (cost=0.43..11.82 rows=3 width=127) (actual time=0.006..0.010 rows=3 loops=14428)
              Index Cond: (project_task_id = t.id)
Planning time: 0.468 ms
Execution time: 269.531 ms

Results: 269.531 ms, using HASH AGGREGATE

Interestingly enough, plain DISTINCT makes use of a HASH AGGREGATE here (grouping on every column), which is faster than SORT+UNIQUE (no sorting).

GROUP BY primary key

EXPLAIN ANALYSE(
SELECT s.*
FROM project_task_states s
INNER JOIN project_tasks t ON (t.id = s.project_task_id)
WHERE t.project_id = 1
GROUP BY s.id)
HashAggregate  (cost=66815.28..66909.69 rows=9441 width=127) (actual time=249.756..261.451 rows=36071 loops=1)
  Group Key: s.id
  ->  Nested Loop  (cost=0.86..66791.68 rows=9441 width=127) (actual time=0.051..229.116 rows=36071 loops=1)
        ->  Index Scan using index_project_tasks_on_project_id on project_tasks t  (cost=0.43..13378.66 rows=4506 width=4) (actual time=0.037..70.548 rows=14428 loops=1)
              Index Cond: (project_id = 1)
        ->  Index Scan using index_project_task_states_on_project_task_id on project_task_states s  (cost=0.43..11.82 rows=3 width=127) (actual time=0.006..0.010 rows=3 loops=14428)
              Index Cond: (project_task_id = t.id)
Planning time: 0.380 ms
Execution time: 262.943 ms

Here we are even faster, thanks to a simpler hash computation over only one field.

Basically, the overhead is about 30 to 60 ms; the simpler HASH AGGREGATE works faster than SORT because sorting the data is O(N log N), while hash access is O(N).
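The complexity argument can be illustrated with a toy pure-Python sketch (hypothetical data, outside the planner): sort-based deduplication pays for an O(N log N) sort before collapsing equal neighbours, while hash-based deduplication (what a Hash Aggregate does) is a single O(N) pass:

```python
# Toy comparison of the two dedupe strategies discussed above.
import itertools
import timeit

rows = [i % 20_000 for i in range(200_000)]  # lots of duplicated "primary keys"

def dedupe_sort(rows):
    # Sort + Unique strategy: O(N log N) sort, then collapse equal neighbours.
    return [key for key, _group in itertools.groupby(sorted(rows))]

def dedupe_hash(rows):
    # Hash Aggregate strategy: one O(N) pass through a hash table,
    # keeping first-seen order.
    return list(dict.fromkeys(rows))

# Both strategies yield the same set of unique keys.
assert sorted(dedupe_hash(rows)) == dedupe_sort(rows)

print("sort-based:", timeit.timeit(lambda: dedupe_sort(rows), number=5))
print("hash-based:", timeit.timeit(lambda: dedupe_hash(rows), number=5))
```

The absolute numbers depend on the machine, but the shape matches what the plans above show: the hash pass avoids the sort entirely.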

Maybe I should use the GROUP BY primary_key strategy, but in any case I want to keep the deduplication, as it makes sense: e.g. "give me all the actors from these movies" should not return duplicate actors.

@jwoertink
Contributor Author

Yeah, I think having the DISTINCT is OK; it's having DISTINCT table.*, table.* that kills it.

The first query below is how it's currently running, which loads all of the columns twice.

primetime=# SELECT DISTINCT actors.*, actors.* FROM actors INNER JOIN performances ON ((performances.actor_id = actors.id)) WHERE (performances.video_id = 1234);
(0 rows)

Time: 29.869 ms

primetime=# SELECT DISTINCT actors.* FROM actors INNER JOIN performances ON ((performances.actor_id = actors.id)) WHERE (performances.video_id = 1234);
(0 rows)

Time: 2.874 ms

primetime=# SELECT actors.* FROM actors INNER JOIN performances ON ((performances.actor_id = actors.id)) WHERE (performances.video_id = 1234);
(0 rows)

Time: 3.464 ms

primetime=# SELECT DISTINCT ON (actors.id) actors.* FROM actors INNER JOIN performances ON ((performances.actor_id = actors.id)) WHERE (performances.video_id = 1234);
(0 rows)

Time: 3.270 ms

@anykeyh anykeyh closed this as completed in 8d08219 Aug 4, 2018