New function LISTAGG #8689

ChudaykinAlex · 2025-08-06T08:45:21Z

Purpose
The current implementation has an aggregate function, LIST, which concatenates field values from multiple rows into a BLOB. The SQL standard defines a similar function called LISTAGG. The major difference is that LISTAGG also supports ordered concatenation.
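To make "ordered concatenation" concrete, here is a minimal C++ sketch of the behavior. It is a toy model and not the engine code: `Row`, `listagg` and `checkCol2ByCol4` are names invented for this illustration, and plain byte-wise string comparison stands in for a real collation.

```cpp
#include <algorithm>
#include <cassert>
#include <optional>
#include <string>
#include <vector>

// One input row: the value to concatenate and the key to order by.
// std::nullopt models an SQL NULL in the aggregated column.
struct Row
{
    std::optional<std::string> value;
    std::string sortKey;
};

// Toy model of LISTAGG(value, separator) WITHIN GROUP (ORDER BY sortKey).
std::string listagg(std::vector<Row> rows, const std::string& separator = ",")
{
    // stable_sort keeps the input order of rows whose keys compare equal
    std::stable_sort(rows.begin(), rows.end(),
        [](const Row& a, const Row& b) { return a.sortKey < b.sortKey; });

    std::string result;
    bool first = true;

    for (const Row& row : rows)
    {
        if (!row.value)
            continue;   // NULL input is skipped, as in other aggregates

        if (!first)
            result += separator;

        result += *row.value;
        first = false;
    }

    return result;
}

// COL2 ordered by COL4, using the four TEST_T rows from the examples in this PR.
void checkCol2ByCol4()
{
    const std::vector<Row> rows = {
        {"A", "J"}, {"B", "I"}, {"C", "L"}, {"D", "K"}
    };
    assert(listagg(rows) == "B,A,D,C");
}
```

The check mirrors the `LISTAGG (ALL COL2) WITHIN GROUP (ORDER BY COL4)` example from this description, where the separator defaults to a comma.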
Syntax and rules

<listagg set function> ::=
LISTAGG <left paren> [ <set quantifier> ] <character value expression> <comma> <listagg separator> [ <listagg overflow clause> ] <right paren> <within group specification>

<listagg separator> ::= 
<character string literal>

<listagg overflow clause> ::=
ON OVERFLOW <overflow behavior>

<overflow behavior> ::=
ERROR | TRUNCATE [ <listagg truncation filler> ] <listagg count indication>

<listagg truncation filler> ::=
<character string literal>

<listagg count indication> ::=
WITH COUNT | WITHOUT COUNT

<within group specification> ::=
WITHIN GROUP <left paren> ORDER BY <sort specification list> <right paren>

The legacy LIST syntax is preserved for backward compatibility; LISTAGG is added to cover the standard features.

The standard defines a <listagg overflow clause> rule, which is intended to raise an error when the output value overflows. Since the LIST function always returns a BLOB, this rule was judged to be meaningless. It is not implemented and is silently ignored if specified.

If DISTINCT is specified for LISTAGG, then ORDER BY <sort specification list> must fully match <character value expression>.

If DISTINCT is specified, a WITHIN GROUP clause must obey this restriction; it does not otherwise affect the result.

Examples:

CREATE TABLE TEST_T
	(COL1 INT, COL2 VARCHAR(2), COL3 VARCHAR(2), COL4 VARCHAR(2), COL5 BOOLEAN, COL6 VARCHAR(2)
	CHARACTER SET WIN1251);
COMMIT;
INSERT INTO TEST_T values(1, 'A', 'A', 'J', false, 'П');
INSERT INTO TEST_T values(2, 'B', 'B', 'I', false, 'Д');
INSERT INTO TEST_T values(3, 'C', 'A', 'L', true,  'Ж');
INSERT INTO TEST_T values(4, 'D', 'B', 'K', true,  'Й');
COMMIT;

SELECT LISTAGG (ALL COL4, ':') FROM TEST_T;
=======
J:I:L:K

SELECT LISTAGG (DISTINCT COL4, ':') FROM TEST_T;
========
I:J:K:L

SELECT LISTAGG (DISTINCT COL3, ':') FROM TEST_T;
====
A:B

SELECT LISTAGG (DISTINCT COL3, ':') WITHIN GROUP (ORDER BY COL2) FROM TEST_T;
====
A:B

SELECT LISTAGG (DISTINCT COL3, ':') WITHIN GROUP (ORDER BY COL2 DESCENDING) FROM TEST_T;
====
A:B

SELECT LISTAGG (COL2, ':') WITHIN GROUP (ORDER BY COL2 DESCENDING) FROM TEST_T;
=======
D:C:B:A

SELECT LISTAGG (COL4, ':') WITHIN GROUP (ORDER BY COL3 DESC) FROM TEST_T;
=======
I:K:J:L

SELECT LISTAGG (COL3, ':') WITHIN GROUP (ORDER BY COL5 ASCENDING) FROM TEST_T;
=======
A:B:A:B

SELECT LISTAGG (COL4, ':') WITHIN GROUP (ORDER BY COL3 ASC) FROM TEST_T;
=======
J:L:I:K

SELECT LISTAGG (ALL COL2) WITHIN GROUP (ORDER BY COL4) FROM TEST_T;
=======
B,A,D,C

SELECT LISTAGG (COL2, ':') WITHIN GROUP (ORDER BY COL3 DESC, COL4 ASC) FROM TEST_T;
=======
B:D:A:C

SELECT LISTAGG (COL2, ':') WITHIN GROUP (ORDER BY COL3 DESC, COL4 DESC) FROM TEST_T;
=======
D:B:C:A

SELECT LISTAGG (COL2, ':') WITHIN GROUP (ORDER BY COL3 ASC, COL4 DESC) FROM TEST_T;
=======
C:A:D:B

SELECT LISTAGG (ALL COL6, ':') FROM TEST_T;
=======
П:Д:Ж:Й

SELECT LISTAGG (ALL COL6, ':') WITHIN GROUP (ORDER BY COL2 DESC) FROM TEST_T;
=======
Й:Ж:Д:П

SELECT LISTAGG (ALL COL2, ':') WITHIN GROUP (ORDER BY COL6) FROM TEST_T;
=======
B:C:D:A

INSERT INTO TEST_T values(5, 'E', NULL, NULL, NULL, NULL);
INSERT INTO TEST_T values(6, 'F', 'C', 'N', true, 'К');

SELECT LISTAGG (ALL COL2, ':') WITHIN GROUP (ORDER BY COL3) FROM TEST_T;
===========
E:A:C:B:D:F

SELECT LISTAGG (ALL COL2, ':') WITHIN GROUP (ORDER BY COL3 NULLS LAST) FROM TEST_T;
===========
A:C:B:D:F:E

SELECT LISTAGG (ALL COL2, ':') WITHIN GROUP (ORDER BY COL6 NULLS FIRST) FROM TEST_T;
===========
E:B:C:D:F:A

SELECT LISTAGG (DISTINCT COL3, ':') WITHIN GROUP (ORDER BY COL2) FROM TEST_T;
========
Statement failed, SQLSTATE = 42000
SQL error code = -104
-Invalid command
-Sort-key of the ORDER BY specification must match the argument list
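The NULLS FIRST and NULLS LAST examples above can be reproduced with a small comparator. This is a sketch under assumptions: `KeyedRow`, `NullsPlacement` and `listaggWithNulls` are hypothetical names, the sort is stable so that rows with equal keys keep their insertion order, and byte-wise comparison replaces the column collation.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <optional>
#include <string>
#include <vector>

enum class NullsPlacement { First, Last };

struct KeyedRow
{
    std::string value;                    // the aggregated column (COL2)
    std::optional<std::string> sortKey;   // the ordering column (COL3); nullopt models NULL
};

std::string listaggWithNulls(std::vector<KeyedRow> rows, const std::string& separator,
    NullsPlacement nulls)
{
    std::stable_sort(rows.begin(), rows.end(),
        [nulls](const KeyedRow& a, const KeyedRow& b)
        {
            const bool aNull = !a.sortKey.has_value();
            const bool bNull = !b.sortKey.has_value();

            if (aNull != bNull)     // exactly one key is NULL
                return (nulls == NullsPlacement::First) ? aNull : bNull;

            if (aNull)              // both NULL: neither sorts before the other
                return false;

            return *a.sortKey < *b.sortKey;
        });

    std::string result;
    for (std::size_t i = 0; i < rows.size(); ++i)
    {
        if (i != 0)
            result += separator;
        result += rows[i].value;
    }
    return result;
}

// The six TEST_T rows (COL2, COL3) after the two extra INSERTs.
void checkNullPlacement()
{
    const std::vector<KeyedRow> rows = {
        {"A", "A"}, {"B", "B"}, {"C", "A"}, {"D", "B"},
        {"E", std::nullopt}, {"F", "C"}
    };
    assert(listaggWithNulls(rows, ":", NullsPlacement::First) == "E:A:C:B:D:F");
    assert(listaggWithNulls(rows, ":", NullsPlacement::Last) == "A:C:B:D:F:E");
}
```

Note that the NULL here is in the sort key, not in the aggregated value, so row E still appears in the output; only its position changes.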

sim1984 · 2025-08-06T10:07:48Z

This PR implements one of the aggregate functions (LISTAGG), which depends on the order of the input stream records. The full list can be seen here: #7632

asfernandes · 2025-10-07T01:24:28Z

doc/sql.extensions/README.listagg

+INSERT INTO TEST_T values(4, 'D', 'B', 'K', true,  'Й');
+COMMIT;
+
+SELECT LISTAGG (ALL COL4, ':') AS FROM TEST_T;


In the syntax format section, <within group specification> is mandatory but does not exist in some examples.

The SQL specification declares <within group specification> as mandatory. However, IMHO it's quite restrictive and neither Oracle nor DB2 follows that rule; both make it optional. Given that LIST and LISTAGG share the same syntax in this PR, we've also made <within group specification> optional. So the easiest solution is to fix the README ;-)

Or we may go the standard way and separate the legacy LIST (leave it with the current grammar, without ordering) from LISTAGG (which is strictly standard-compliant) at the parser level. But IMHO it would be annoying for users to select either of them depending on whether you need ordering or not. So personally I'd keep everything "as is" and just fix the docs.

Other opinions?

I agree, LIST and LISTAGG should be complete synonyms.

I would also remove the mention of ON OVERFLOW from the documentation. It's standard, but we don't support it. It could be mentioned if it were simply ignored, but at the moment specifying it leads to errors.

SELECT LISTAGG(TRIM(RDB$RELATION_NAME), ';' ON OVERFLOW ERROR) WITHIN GROUP(ORDER BY RDB$RELATION_NAME) AS REL_NAMES FROM RDB$RELATIONS;

Invalid token. Dynamic SQL Error. SQL error code = -104. Token unknown - line 2, column 52. ERROR. ---------------------------------- SQLCODE: -104 SQLSTATE: 42000 GDSCODE: 335544569

SELECT LISTAGG(TRIM(RDB$RELATION_NAME), ';' ON OVERFLOW TRUNCATE '...' WITHOUT COUNT) WITHIN GROUP(ORDER BY RDB$RELATION_NAME) AS REL_NAMES FROM RDB$RELATIONS;

Invalid token. Dynamic SQL Error. SQL error code = -104. Token unknown - line 2, column 52. TRUNCATE. ---------------------------------- SQLCODE: -104 SQLSTATE: 42000 GDSCODE: 335544569

Strange. I need to check it out

If we remove ON OVERFLOW from the docs, then I believe we should remove it from the parser too.

ON OVERFLOW can be kept if it will not cause errors.

asfernandes · 2025-10-07T01:25:15Z

doc/sql.extensions/README.listagg

+=======
+C:A:D:B
+
+SELECT LISTAGG (ALL COL6, ':')FROM TEST_T;


Suggested change

SELECT LISTAGG (ALL COL6, ':')FROM TEST_T;

SELECT LISTAGG (ALL COL6, ':') FROM TEST_T;

asfernandes · 2025-10-07T01:27:28Z

src/dsql/AggNodes.cpp

+			for (auto& nodeOrder : sort->expressions)
+			{
+				dsc toDesc = *(descOrder++);
+				toDesc.dsc_address = data + (IPTR)toDesc.dsc_address;


Suggested change

toDesc.dsc_address = data + (IPTR)toDesc.dsc_address;

toDesc.dsc_address = data + (IPTR) toDesc.dsc_address;

asfernandes · 2025-10-07T01:27:57Z

src/dsql/AggNodes.cpp

+					if (IS_INTL_DATA(fromDsc))
+						INTL_string_to_key(tdbb, INTL_TEXT_TO_INDEX(fromDsc->getTextType()),
+							fromDsc, &toDesc, INTL_KEY_UNIQUE);


Suggested change

if (IS_INTL_DATA(fromDsc))
	INTL_string_to_key(tdbb, INTL_TEXT_TO_INDEX(fromDsc->getTextType()),
		fromDsc, &toDesc, INTL_KEY_UNIQUE);

if (IS_INTL_DATA(fromDsc))
{
	INTL_string_to_key(tdbb, INTL_TEXT_TO_INDEX(fromDsc->getTextType()),
		fromDsc, &toDesc, INTL_KEY_UNIQUE);
}

asfernandes · 2025-10-07T01:28:36Z

src/dsql/AggNodes.cpp

+			}
+
+			dsc toDesc = asb->desc;
+			toDesc.dsc_address = data + (IPTR)toDesc.dsc_address;


Suggested change

toDesc.dsc_address = data + (IPTR)toDesc.dsc_address;

toDesc.dsc_address = data + (IPTR) toDesc.dsc_address;

asfernandes · 2025-10-07T01:28:56Z

src/dsql/AggNodes.cpp

+			if (distinct)
+				desc.dsc_address = data + (asb->intl ? asb->keyItems[1].getSkdOffset() : 0);
+			else
+				desc.dsc_address = data + (IPTR)asb->desc.dsc_address;


Suggested change

desc.dsc_address = data + (IPTR)asb->desc.dsc_address;

desc.dsc_address = data + (IPTR) asb->desc.dsc_address;

asfernandes · 2025-10-07T01:31:25Z

src/dsql/AggNodes.cpp

+	if (sort && distinct)
+	{
+		ValueExprNode* const sortNode = *sort->expressions.begin();
+		if (!arg->sameAs(sortNode, false) || sort->expressions.getCount() > 1)


Should they be identical? Why?
I think it should follow rules like those for GROUP BY.

Good point. Currently, we sort only once and this is a good bonus. If we follow your suggestion and allow slightly different expressions, then we should either sort twice or ignore the user-specified ordering after DISTINCT. BTW, in this PR it also seems to be ignored -- LISTAGG(DISTINCT COL) WITHIN GROUP (ORDER BY COL DESC) would produce ASC-ordered output. But I suppose it should work the same way as for plain SELECT DISTINCT(COL) FROM T ORDER BY COL DESC, i.e. respect the ORDER BY ordering and optimize the sorts (merge two sorts into one) only if they fully (expressions / directions / NULLs placement) match each other.

Why are direction and NULL placement important for DISTINCT?

Nulls placement is not important, I agree, as NULLs are skipped by all aggregate functions.

By standard, DISTINCT eliminates duplicates in the ordered (if specified) result set, it should not change the user-defined ordering. Why do you think LISTAGG should behave differently?

It shouldn't, but I don't quite understand why you said:

i.e. respect the ORDER BY ordering and optimize the sorts (merge two sorts into one) only if they fully (expressions / directions / NULLs placement) match each other.

BTW, IMHO, sort->expressions.getCount() > 1 condition here is not needed as it is fine for distinct to use only first sorting segment.

Ah, sorry, my bad. Surely, direction for the combined sort should be taken from ORDER BY -- like we already do in Optimizer::checkSorts().

Nulls placement is not important, I agree, as NULLs are skipped by all aggregate functions.

With my suggestion to use the same existing rules, the DISTINCT LISTAGG expression may be something like COALESCE(field, 'z') with an ORDER BY Z.

Given that we support queries like:

select distinct emp_no from employee order by last_name

I'm not sure we need to introduce the GROUP BY rules inside LISTAGG (DISTINCT ...) ... WITHIN GROUP. I suppose DISTINCT + ORDER BY should work the same way as either a standalone construct or within the LISTAGG group, i.e. without any restrictions. That said, I see no practical point in the sort behind the ORDER BY <other expression> clause because its result is not going to be user-visible anyway. It basically means that if DISTINCT is present, we may simply ignore the ORDER BY clause -- except when ORDER BY mentions the aggregated expression in the very beginning of its <sort specification list> -- then its sort direction should be taken into account inside DISTINCT.

Am I missing anything?
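The single-sort idea discussed in this thread can be sketched as follows: one sort both removes duplicates and fixes the output order, with the direction taken from ORDER BY when its first item is the aggregated expression. This is only a model of the proposed semantics, not of the engine's sort; `listaggDistinct` and its `descending` flag are names assumed for the illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Toy model of LISTAGG(DISTINCT value, separator) WITHIN GROUP (ORDER BY value [DESC]).
std::string listaggDistinct(std::vector<std::string> values, const std::string& separator,
    bool descending)
{
    // A single sort, in the direction requested by ORDER BY.
    if (descending)
        std::sort(values.begin(), values.end(), std::greater<std::string>());
    else
        std::sort(values.begin(), values.end());

    // After sorting, duplicates are adjacent, so std::unique can drop them.
    values.erase(std::unique(values.begin(), values.end()), values.end());

    std::string result;
    for (std::size_t i = 0; i < values.size(); ++i)
    {
        if (i != 0)
            result += separator;
        result += values[i];
    }
    return result;
}

// COL3 of the four original TEST_T rows.
void checkDistinct()
{
    const std::vector<std::string> col3 = {"A", "B", "A", "B"};
    assert(listaggDistinct(col3, ":", false) == "A:B");
    assert(listaggDistinct(col3, ":", true) == "B:A");
}
```

An ORDER BY on some other expression has no counterpart in this model, which matches the observation above that such an ordering would not be visible in the result anyway.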

asfernandes · 2025-10-07T01:34:57Z

src/jrd/optimizer/Optimizer.cpp

+	const auto keyCount = aggNode->sort->expressions.getCount() * 2;
+	sort_key_def* sortKey = asb->keyItems.getBuffer(keyCount);
+
+	auto const* direction = aggNode->sort->direction.begin();


const auto* for consistency, please.

asfernandes · 2025-10-07T01:36:12Z

src/dsql/AggNodes.cpp

 	// per function.
 	return aggInfo.blr == o->aggInfo.blr && aggInfo.name == o->aggInfo.name &&
-		distinct == o->distinct && dialect1 == o->dialect1;
+		distinct == o->distinct && dialect1 == o->dialect1 && sort == o->sort;;


Comparing pointer addresses for sort here does not look correct.

@ChudaykinAlex IMO, the sort node should be added to the list by ListAggNode::getChildren() -- this way you may remove the sort comparison in dsqlMatch() and also remove doPass2(sort) in pass2(), as the sort node will be processed by the inherited methods automagically.
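To illustrate the concern about `sort == o->sort`: two sort clauses that were built separately are distinct objects, so their addresses differ even when their contents are the same. The sketch below uses a hypothetical `ToySortNode`, not the engine's node classes, and `sameSort` is an invented helper showing what a structural comparison looks like.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for a sort clause node; not the engine's SortNode.
struct ToySortNode
{
    std::vector<std::string> expressions;
    std::vector<bool> descending;
};

// Structural equality: two clauses match when their contents match.
bool sameSort(const ToySortNode* a, const ToySortNode* b)
{
    if (!a || !b)
        return a == b;  // equal only when both clauses are absent

    return a->expressions == b->expressions && a->descending == b->descending;
}

void checkPointerVersusStructure()
{
    const ToySortNode first{{"COL2"}, {true}};
    const ToySortNode second{{"COL2"}, {true}};

    assert(&first != &second);          // an address comparison says "different"
    assert(sameSort(&first, &second));  // a content comparison says "same"
}
```

Registering the sort node as a child, as suggested above, would presumably let the inherited matching logic perform this kind of comparison without a hand-written check.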

ChudaykinAlex · 2025-10-07T07:04:36Z

Thanks for the recommendations, I'll fix it soon.

Adjustments have been made to the README. Fixed a bug with the <listagg overflow clause> behavior; it is now silently ignored. The dsqlMatch function has been redesigned, as has the behavior with DISTINCT. Multiple elements are now allowed in the ORDER BY, and the sort direction is now taken into account.

dyemanov · 2025-10-25T16:04:06Z

src/dsql/AggNodes.cpp

+	ListAggNode* node = FB_NEW_POOL(pool) ListAggNode(pool,	(blrOp == blr_agg_list_distinct));
 	node->arg = PAR_parse_value(tdbb, csb);
 	node->delimiter = PAR_parse_value(tdbb, csb);
+	node->sort = PAR_sort(tdbb, csb, blr_sort, true);


This will fail while parsing the BLR restored from the previous versions, because blr_sort will be missing.
One solution is to add blr_agg_list2 and use it only when the ordering is specified. Another could be something like this:

if (csb->csb_blr_reader.peekByte() == blr_sort) node->sort = PAR_sort(tdbb, csb, blr_sort, true);

In this case, I'd also change genBlr() below to call GEN_sort() only if the ordering is specified -- this way we keep generating a compatible BLR for the legacy LIST syntax, making it possible to downgrade the database if required.
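The peek-before-parse approach can be sketched with a toy byte stream. Everything here is an assumption made for illustration: `ToyReader`, `ToyListAgg`, the stream layout, and the marker values `TOY_SORT` and `TOY_END` are invented and are not Firebird's real BLR codes or classes. Only the idea is taken from the comment above: read the sort clause only if its marker is next, and write it only if an ordering was specified.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Made-up marker bytes for this sketch; they are NOT Firebird's real BLR codes.
constexpr std::uint8_t TOY_SORT = 0xF0;
constexpr std::uint8_t TOY_END = 0xFF;

class ToyReader
{
public:
    explicit ToyReader(std::vector<std::uint8_t> bytes) : data(std::move(bytes)) {}

    std::uint8_t peekByte() const
    {
        assert(pos < data.size());
        return data[pos];
    }

    std::uint8_t getByte()
    {
        assert(pos < data.size());
        return data[pos++];
    }

private:
    std::vector<std::uint8_t> data;
    std::size_t pos = 0;
};

struct ToyListAgg
{
    std::uint8_t arg = 0;
    std::uint8_t delimiter = 0;
    bool hasSort = false;
    std::uint8_t sortKey = 0;
};

// Assumed layout: arg, delimiter, [TOY_SORT, sortKey], TOY_END
ToyListAgg parseListAgg(ToyReader& reader)
{
    ToyListAgg node;
    node.arg = reader.getByte();
    node.delimiter = reader.getByte();

    // Streams written before the sort clause existed go straight to TOY_END.
    if (reader.peekByte() == TOY_SORT)
    {
        reader.getByte();   // consume the marker
        node.hasSort = true;
        node.sortKey = reader.getByte();
    }

    const std::uint8_t end = reader.getByte();
    assert(end == TOY_END);
    (void) end;
    return node;
}

// Emits the sort clause only when one is present, so legacy nodes keep the legacy layout.
std::vector<std::uint8_t> generateListAgg(const ToyListAgg& node)
{
    std::vector<std::uint8_t> out = {node.arg, node.delimiter};
    if (node.hasSort)
    {
        out.push_back(TOY_SORT);
        out.push_back(node.sortKey);
    }
    out.push_back(TOY_END);
    return out;
}
```

With this shape, a stream without the sort marker parses into a node with `hasSort == false` and is regenerated byte for byte, which is the downgrade-friendly property the comment asks for.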

ChudaykinAlex added 3 commits August 1, 2025 14:15

Adding an implementation of the new LISTAGG function

640b352

Merge branch 'master' into work/listagg

9f34905

Add README

72a977f

dyemanov self-requested a review August 21, 2025 06:56

dyemanov approved these changes Oct 6, 2025

asfernandes reviewed Oct 7, 2025

ChudaykinAlex added 2 commits October 16, 2025 10:13

Merge branch 'master' into work/listagg

dbf81d6

dyemanov reviewed Oct 25, 2025

ChudaykinAlex added 2 commits October 28, 2025 11:43

dyemanov comments have been corrected.

0e3ec8f

Merge branch 'master' into work/listagg

26e9d42
