-
Notifications
You must be signed in to change notification settings - Fork 79
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Can btree index be used to speed up sorting operations of columnar table? #187
Comments
In short, this are known limitation of columnar storage with index scans. |
I've been measuring BTree index scans in Hydra tables vs heap tables, and I've found that it can be improved by lowering the In my results, with 10k rows per stripe, for a bigint column, heap outperforms by 10x. Not so bad for Hydra but this point by @mkaruza could explain why it doesn't match heap table performance:
My table has 20 columns. This seems like an unfourtunate interaction between btree indexes and hydra, and I'd have some follow up questions on it. How can I tell if this is happening? Is there some DB config I can set to prevent the index scan from doing this? Could Hydra itself implement an optimization for this or are you blocked by limitations in PG core? |
you can enable column decompression caching, which might help with random querying like you are doing. you can read more about enabling it and tuning it at https://docs.hydra.so/concepts/optimizing-query-performance#column-cache |
Thanks, though from my experiments my impression is that having the disk page in the OS or PG buffers cache is of more impact than having it on the decompression cache, decompression didn't seem like a major bottleneck. My questions were more specific to that situation where there is column overfetching on index scan. |
the compression cache is meant to help when you are actively decompressing the same chunk repeatedly, since the pg/os caches only hold the compressed chunk. depends entirely on how often those same chunks get decompressed. if you enable it, those counts should show up in the explain analyze, giving you the ability to determine whether quickly if it is worth it to enable or now. |
I'm on Hydra Cloud so that must already be on. The cache statistics don't show in |
the cache is only for the |
Hi @leoyvens
There is no config to control this and it is always happening, index scan node request tuple from storage (heap, columnar, something else). For heap it cheap since it locate row tuple and returns , while for columnar we need to fetch all columns (with each default of 10k values), decompress them just to return single tuple requested by index scan. |
addressed in #205. please give it a try when you can and let us know if it helps. You will need to |
What's wrong?
By reading the documentation, we can know that column tables currently support btree and hash index.
I have a columnar table with little more than six million data volumes. The
id
field of table is primary key (which auto generate btree index). But when I perform an id sort query on this table, although the query plan shows that index scanning is performed, the query is still extremely slow. It seems that the root cause isBuffers: shared hit=65743485 read=4259
And When I remove the
order by id
clause, the query (which use ColumnarScan) is fast.Thanks for your reply.
The text was updated successfully, but these errors were encountered: