Skip to content

Add support for disabling hash aggregate with hll #65

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Dec 11, 2018

Conversation

velioglu
Copy link
Contributor

@velioglu velioglu commented Dec 5, 2018

This PR adds support for disabling hash aggregate for hll aggregate functions. In order to disable cost of the path with hash aggregate is maximized, so the planner is forced to select group aggregate. Client can utilize this functionality by setting hll.disable_hashagg to ON at session level.

Remaining items:

  • Ensure that each version passes the test
  • Add more tests with (especially with hll_add_agg)
  • Check the support with Citus
  • Get function oids from cache

Copy link
Contributor

@byucesoy byucesoy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had a very quick look. I'll have a more detailed look after the caching part is completed.

src/hll.c Outdated
RegisterHLLConfigVariables(void)
{
DefineCustomBoolVariable(
"hll.disable_hashagg",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

somehow enable_hashagg felt better here, postgres also uses it.

src/hll.c Outdated
foreach(cacheEntryCell, hllAggregateCache)
{
HllAggregateCacheEntry *cacheEntry = (HllAggregateCacheEntry *)lfirst(cacheEntryCell);
if (strcmp(aggregateName, cacheEntry->aggregateName) == 0 &&
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sequential scan on a list + text comparison. Could this get us into trouble ? Maybe compare argument count first

src/hll.c Outdated
oldContext = MemoryContextSwitchTo(CacheMemoryContext);
matchingCacheEntry = palloc0(sizeof(HllAggregateCacheEntry));
matchingCacheEntry->aggregateId = aggregateId;
matchingCacheEntry->aggregateName = aggregateName;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we are storing a pointer to char whose lifetime is not know. We should copy instead.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it appears they are constants, then do we need to store text at all ? can we store an integer perhaps so that we can easily check the id ?

src/hll.c Outdated
heap_close(rel, AccessShareLock);

return result;
/* *INDENT-ON* */
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

INDENT-ON ?

Copy link
Member

@mtuncer mtuncer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor requests

Please re-evaluate function cache design

Copy link
Member

@mtuncer mtuncer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🚢

@mtuncer
Copy link
Member

mtuncer commented Dec 10, 2018

spoke too soon

Travis fails left and right, you will need to add hll to shared_preload_libraries
see https://github.com/citusdata/tools/blob/develop/travis/config_and_start_cluster#L17

@mtuncer mtuncer closed this Dec 10, 2018
@mtuncer
Copy link
Member

mtuncer commented Dec 10, 2018

accidentally closed the PR

@mtuncer mtuncer reopened this Dec 10, 2018
@velioglu velioglu force-pushed the disable_hashagg branch 2 times, most recently from a321592 to d22970b Compare December 10, 2018 19:31
src/hll.c Outdated
HLL_UNION_AGG_ID = 1
} HllAggregationId;

bool ForceGroupagg = false;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ForceGroupagg -> ForceGroupAgg

src/hll.c Outdated
foreach(cacheEntryCell, hllAggregateCache)
{
HllAggregateCacheEntry *cacheEntry = (HllAggregateCacheEntry *)lfirst(cacheEntryCell);
if (argumentCount == cacheEntry->argumentCount &&
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This might be slightly OCD but can we change the order to be like this;

if (cacheEntry->argumentCount == argumentCount && cacheEntry->hllAggregationId == hllAggregationId)

Personally, I would love to have hllAggregationId comparison first, but with the other comment below, having count comparison first makes more sense.

src/hll.c Outdated
Oid hllId = get_extension_oid(HLL_EXTENSION_NAME, false);
Oid hllSchemaOid = get_extension_schema(hllId);
const char *hllSchemaName = get_namespace_name(hllSchemaOid);
char *aggregateName = hllAggregationId == HLL_ADD_AGG_ID ?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the end, we are still using aggregate names, so I think we don't need an extra level of abstraction (i.e. enum and aggregation ids). Without the enum, the code would be much simpler and only drawback I can see is we had to do string comparison in the loop above, which is mostly OK especially if we first compare argument counts.

src/hll.c Outdated
// Support disabling hash aggregation functionality for PG > 9.6
#if PG_VERSION_NUM >= 90600

#define HLL_EXTENSION_NAME "hll"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need for HLL prefix, this is HLL extension anyway. We don't use HLL keyword anywhere in the codebase except functions exposed to user space.

src/hll.c Outdated
RelOptInfo *input_rel, RelOptInfo *output_rel)
#endif
{
Oid hllId = InvalidOid;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hllId -> extension_id

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

src/hll.c Outdated
{
Aggref *aggref = (Aggref *) var;
int argCount = list_length(aggref->args);
Oid addFunctionId = LookupAggregatorOidByName(HLL_ADD_AGG_ID, argCount);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm starting to think, that it would be better to have two static variables (like add_aggregate_oid[] and union_aggregate_id) for function oids which are populated at pg_init. This would save us from all caching logic.

@velioglu velioglu force-pushed the disable_hashagg branch 2 times, most recently from 62615dd to 535c945 Compare December 11, 2018 13:55
HAVING hll_cardinality(hll_union_agg(sd)) > 1;

-- Test hll_add_agg, first set work_mem to 256MB in order to use hash aggregate
-- befire forcing group aggregate
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

befire -> before

Copy link
Member

@mtuncer mtuncer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🚢 after minor changes

-- ----------------------------------------------------------------
-- Regression tests for disabling hash aggregation
-- ----------------------------------------------------------------
SELECT hll_set_output_version(1);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

all disable hash_agg expectations look the same. it would be better if we put some sort of PG version information that caused us to make this differentiation

src/hll.c Outdated
RelOptInfo *input_rel, RelOptInfo *output_rel)
#endif
{
Oid hllId = InvalidOid;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

src/hll.c Outdated

/* Initialize HLL_ADD_AGG with different signatures */
aggregateName = ADD_AGG_NAME;
for ( ; addAggArgumentCounter < HLL_AGGREGATE_COUNT ; addAggArgumentCounter++)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

odd use of for loop

@velioglu velioglu merged commit b536369 into master Dec 11, 2018

if(HllAggregateOid(aggref->aggfnoid))
{
path->total_cost = INT_MAX;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

INT_MAX is a pretty arbitrary value.

total_cost is not an int. If it was then setting it to the max would be a bad idea because it would likely overflow when other costs are added, now it's just arbitrary, which is probably fine?

@byucesoy byucesoy deleted the disable_hashagg branch September 2, 2019 10:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants