Skip to content

Low perf on NATURAL JOINs #560

Closed
Closed
@campoy

Description

@campoy

I have these two requests which should perform basically the same:

SELECT f.repository_id, COUNT(*) as n
FROM   files AS f
       JOIN commit_files cf ON
            f.repository_id=cf.repository_id AND
            f.file_path=cf.file_path AND
            f.blob_hash=cf.blob_hash AND
            f.tree_hash=cf.tree_hash
       JOIN refs ON
            cf.repository_id = refs.repository_id AND
            cf.commit_hash = refs.commit_hash
WHERE  ref_name = 'HEAD'
GROUP BY f.repository_id
ORDER BY n DESC

and its NATURAL JOIN equivalent

SELECT f.repository_id, COUNT(*) as n
FROM   files AS f
       NATURAL JOIN commit_files cf
       NATURAL JOIN refs
WHERE  ref_name = 'HEAD'
GROUP BY f.repository_id
ORDER BY n DESC

Unfortunately, while the first one finishes after a couple of seconds, the second one takes double that.
I analyzed their EXPLAIN output and saw there's a tiny difference and wonder whether this could be the culprit.

For the first JOIN ON version, the plan is:

Sort(n DESC)
 └─ Project(files.repository_id, COUNT(*) as n)
     └─ GroupBy
         ├─ Aggregate(files.repository_id, COUNT(*))
         ├─ Grouping(files.repository_id)
         └─ Exchange(parallelism=96)
             └─ SquashedTable(refs, commit_files, files)
                 ├─ Columns
                 │   ├─ Column(repository_id, TEXT, nullable=false)
                 │   ├─ Column(file_path, TEXT, nullable=false)
                 │   ├─ Column(blob_hash, TEXT, nullable=false)
                 │   ├─ Column(tree_hash, TEXT, nullable=false)
                 │   ├─ Column(tree_entry_mode, TEXT, nullable=false)
                 │   ├─ Column(blob_content, BLOB, nullable=false)
                 │   ├─ Column(blob_size, INT64, nullable=false)
                 │   ├─ Column(repository_id, TEXT, nullable=false)
                 │   ├─ Column(commit_hash, TEXT, nullable=false)
                 │   ├─ Column(file_path, TEXT, nullable=false)
                 │   ├─ Column(blob_hash, TEXT, nullable=false)
                 │   ├─ Column(tree_hash, TEXT, nullable=false)
                 │   ├─ Column(repository_id, TEXT, nullable=false)
                 │   ├─ Column(ref_name, TEXT, nullable=false)
                 │   └─ Column(commit_hash, TEXT, nullable=false)
                 └─ Filters
                     ├─ commit_files.repository_id = refs.repository_id
                     ├─ commit_files.commit_hash = refs.commit_hash
                     ├─ files.repository_id = commit_files.repository_id
                     ├─ files.file_path = commit_files.file_path
                     ├─ files.blob_hash = commit_files.blob_hash
                     ├─ files.tree_hash = commit_files.tree_hash
                     └─ refs.ref_name = "HEAD"

While for the one with NATURAL JOIN:

Sort(n DESC)
 └─ Project(files.repository_id, COUNT(*) as n)
     └─ GroupBy
         ├─ Aggregate(files.repository_id, COUNT(*))
         ├─ Grouping(files.repository_id)
         └─ Exchange(parallelism=96)
             └─ Project(files.repository_id, commit_files.commit_hash, files.file_path, files.blob_hash, files.tree_hash, files.tree_entry_mode, files.blob_content, files.blob_size, refs.ref_name)
                 └─ Filter(files.repository_id = refs.repository_id)
                     └─ SquashedTable(refs, commit_files, files)
                         ├─ Columns
                         │   ├─ Column(repository_id, TEXT, nullable=false)
                         │   ├─ Column(file_path, TEXT, nullable=false)
                         │   ├─ Column(blob_hash, TEXT, nullable=false)
                         │   ├─ Column(tree_hash, TEXT, nullable=false)
                         │   ├─ Column(tree_entry_mode, TEXT, nullable=false)
                         │   ├─ Column(blob_content, BLOB, nullable=false)
                         │   ├─ Column(blob_size, INT64, nullable=false)
                         │   ├─ Column(repository_id, TEXT, nullable=false)
                         │   ├─ Column(commit_hash, TEXT, nullable=false)
                         │   ├─ Column(file_path, TEXT, nullable=false)
                         │   ├─ Column(blob_hash, TEXT, nullable=false)
                         │   ├─ Column(tree_hash, TEXT, nullable=false)
                         │   ├─ Column(repository_id, TEXT, nullable=false)
                         │   ├─ Column(ref_name, TEXT, nullable=false)
                         │   └─ Column(commit_hash, TEXT, nullable=false)
                         └─ Filters
                             ├─ commit_files.commit_hash = refs.commit_hash
                             ├─ files.repository_id = commit_files.repository_id
                             ├─ files.file_path = commit_files.file_path
                             ├─ files.blob_hash = commit_files.blob_hash
                             ├─ files.tree_hash = commit_files.tree_hash
                             └─ refs.ref_name = "HEAD"

Is it possible that the extra Project and Filter right above the SquashedTable can cause such a change in performance?

Metadata

Metadata

Labels

performancePerformance improvementsquestionFurther information is requested

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions