
Improve generated ast.ts regarding multiple languages #1979


Open · wants to merge 1 commit into main

Conversation

@JohannesMeierSE (Contributor) commented Jul 4, 2025

Summary

This PR reworks the content of the generated ast.ts for projects with multiple languages (as defined in the langium-config.json).

Motivation

In a recent multi-language project, we wanted to write a formatter for a single language (not for all languages). To let the TypeScript compiler statically ensure that we wrote definitions for all AST types of that language, we used the generated XXXAstType = { Type1: Type1, Type2: Type2, ... }. But this forced us to define formatting rules for all types of the project, including types from other languages. Therefore this PR contributes (among other things) a XXXAstType for each language.
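For illustration, here is a minimal sketch of the pattern we used (all type and rule names are made up, the formatter wiring is omitted):

// Hypothetical AST interfaces as they would appear in the generated ast.ts.
interface Requirement { name: string }
interface TestCase { name: string }

// Per-language map as proposed by this PR (for a fictitious "requirements" language) ...
type RequirementsAstType = {
    Requirement: Requirement
}
// ... and the project-wide map that is generated today, merging all languages.
type ProjectAstType = {
    Requirement: Requirement
    TestCase: TestCase
}

// A mapped type forces exactly one formatting rule per key of the chosen AstType map.
type FormattingRules<T> = {
    [K in keyof T]: (node: T[K]) => void
};

// With only the project-wide map, the compiler would also demand a rule for TestCase,
// although that type belongs to another language:
const requirementsRules: FormattingRules<RequirementsAstType> = {
    Requirement: node => { /* format requirement nodes */ },
};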

Contributions in more detail

The ast.ts consists of:

  1. terminals
  2. keywords
  3. type definitions (export interface YYY extends langium.AstNode { ... })
  4. XXXAstType (as described in the motivation)
  5. reflection

So far, all of this content is merged across all files and languages. This PR splits the generated code for terminals (1), keywords (2) and XXXAstType (4), while type definitions (3) and reflection (5) remain unchanged.
In particular for the type definitions (3), it is important to have only one version of them:

  • Otherwise, imports in TS files become non-unique during coding.
  • Type definitions might otherwise be "incomplete": the same type (identified by its name) can be inferred with different properties by different parser rules, possibly belonging to different languages, and these properties are merged into the final type definition (see the sketch below).
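A made-up example of such a merge (rule names, property names and types are purely illustrative):

// Two parser rules, possibly located in different grammar files/languages,
// both infer a type 'Person' but contribute different properties:
//   PersonWithAge  infers Person: name=ID 'aged' age=INT;
//   PersonWithMail infers Person: name=ID 'mail' email=STRING;
// The generator merges them into one interface; a per-language or per-file copy
// of this type definition would therefore be incomplete:
export interface Person extends langium.AstNode {
    name: string
    age?: number
    email?: string
}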

For each language, one instance of its reachable terminals & keywords and a XXXAstType covering all of its available types is created; these are nested into a namespace to improve the structure and readability of the generated ast.ts.
Additionally, one instance of terminals, keywords and XXXAstType is created for the whole project.
Some explanatory comments are generated into the ast.ts as well.
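As a rough sketch, the generated ast.ts for a hypothetical project "RequirementsAndTests" with two languages could take the following shape (all names and contents are illustrative, not the exact generated output):

import type * as langium from 'langium';

// The single, merged type definitions (3) stay as they are today:
export interface Requirement extends langium.AstNode { name: string }
export interface TestCase extends langium.AstNode { name: string }

// Language-specific definitions, nested into one namespace per language:
export namespace RequirementsLanguage {
    // reachable terminals and keywords of this language only
    export const Terminals = { ID: /[_a-zA-Z][\w_]*/ };
    export type TerminalNames = keyof typeof Terminals;
    export type KeywordNames = 'contact' | 'req';
    // all AST types available in this language
    export type AstType = { Requirement: Requirement };
}
export namespace TestsLanguage {
    export const Terminals = { ID: /[_a-zA-Z][\w_]*/ };
    export type TerminalNames = keyof typeof Terminals;
    export type KeywordNames = 'testFile' | 'tests' | 'tst';
    export type AstType = { TestCase: TestCase };
}

// Project-wide definitions keep their previous names, so existing code keeps compiling:
export type RequirementsAndTestsAstType = RequirementsLanguage.AstType & TestsLanguage.AstType;
export type RequirementsAndTestsKeywordNames = RequirementsLanguage.KeywordNames | TestsLanguage.KeywordNames;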

Discussion of (non) breaking changes

  • The ast.ts for a single-language project remains unchanged (try npm run langium:generate on the whole Langium repo on your own)!
  • The terminals, keywords and XXXAstType created for the whole project have the same names as before this PR; therefore they can be used exactly as before.

Additional contributions

  • Added test cases for the CLI generator for multi-file and multi-language projects (previously, only single-file projects were tested)

@JohannesMeierSE added this to the v4.0.0 milestone on Jul 4, 2025
@JohannesMeierSE added the labels bug (Something isn't working), cli (CLI related issue) and ast (AST structure related issue) on Jul 4, 2025
@sailingKieler (Contributor)

First of all thank you very much for the detailed PR description, that's been a lot of effort! 🙏

@sailingKieler left a comment

I have some conceptual doubts/questions/remarks, see below.

The generated code with the file-based separation of tokens, keywords, and types feels a bit puffy and noisy to me, especially in cases like the Langium grammar itself, which now comes with an accompanying .langium file that only contributes the outsourced AST types.
Of course, that's my personal taste.

I would also prefer to stick to the existing generation behavior regarding reachable keywords within a language.


export type RequirementsAndTestsLanguageRequirementsTerminalNames = keyof typeof RequirementsAndTestsLanguageRequirementsTerminals;

export type RequirementsAndTestsLanguageRequirementsKeywordNames = RequirementsAndTestsFileCommonKeywordNames | RequirementsAndTestsFileRequirementsKeywordNames;
Contributor

Has it been like that previously in terms of keywords being contributed by non-called parser rules?

I assume that File-based types include all keywords that occur in the entire file, but I do not agree that the union of those types is equal to the union of keywords that are actually part of a language.

Instead, I would assume that non-reachable keywords are not included here. What was the established approach for that?

@@ -10,7 +10,7 @@
         },
         {
             "name": "keyword.control.tests-lang",
-            "match": "\\b(applicable|contact|for|testFile|tests|tst)\\b"
+            "match": "\\b(applicable|contact|environment|for|req|testFile|tests|tst)\\b"
Contributor

I think this change reflects my consideration from above.
Here it won't hurt, I guess.
In general, however, keywords that are 'inherited' from other grammar files will be mixed with IDs in highlighting and other features, even though they are not reachable via the entry rule.

await writeWithFail(path.resolve(updateLangiumInternalAstPath(output, config), 'ast.ts'), genAst, options);
// ast.ts
const isSingleLanguage = config.languages.length === 1;
const isSingleFile = allGrammars.length === 1 && isSingleLanguage /* handles a special case: multiple languages use the same *.langium file */;
Contributor

I don't get the sense of the comment 🤔

Contributor Author

If there are two languages with the same entry grammar, then it is a multi-language project with a single *.langium file.

Contributor

Yes, but the property is true for the (1, 1) case, so the context does not fit for me...

Contributor Author

We have two languages in the config => isSingleLanguage is false (line 334)

Both language configs point to the same *.langium file => allGrammars.length === 1 is true => without the additional check && isSingleLanguage, isSingleFile is true as well (line 335)

Since we check the single-file case first (line 337), this (strange) multi-language project is treated as a single-file project, i.e. the language-specific parts in the ast.ts are skipped, but it should be treated as a multi-language project. Therefore we need && isSingleLanguage.
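A small sketch of the scenario (values are illustrative):

// Two language configs in langium-config.json point to the same grammar file:
const config = {
    languages: [
        { id: 'lang-a', grammar: 'shared.langium' },
        { id: 'lang-b', grammar: 'shared.langium' },
    ],
};
const allGrammars = ['shared.langium']; // grammar files are deduplicated

const isSingleLanguage = config.languages.length === 1;            // false
const isSingleFile = allGrammars.length === 1 && isSingleLanguage; // false, only because of the extra check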

    let startIndex = typesFileContent.indexOf(start);
    for (let i = 0; i < startCount; i++) {
        startIndex = typesFileContent.indexOf(start, startIndex + start.length);
    }
    const relevantPart = typesFileContent.substring(startIndex, typesFileContent.indexOf(end)).trim();
    expect(relevantPart).toEqual(expectedPart);
}

describe('Ast generator (with multiple *.langium files)', () => {
Contributor

Considering that these additional new tests double the length of the file, I would prefer to create the test projects on disc rather than in memory based on inlined code. That would also simplify the update of the expected parts.

I just reviewed the tests in VS Code and it is okay there; here on GitHub with the side-by-side view it's basically unreadable.

Contributor Author

That would also simplify the update of the expected parts.

I don't agree, since you need to change only one place with the current approach.

What do you think about splitting the ast-generator.test.ts file into two files, maybe ast-generator-single.test.ts and ast-generator-multi.test.ts?

Contributor

It is still okay for me to keep it, but it's quite on the edge.

The files-on-disc approach has the advantage that you can run the langium-cli to update the files if necessary (probably with some kind of setup), which is much easier for others who didn't write the tests initially. In general, there's no obligation to do it manually, in contrast to the in-code approach.

Contributor Author

I got your point. Do we have such "on disc"-test cases somewhere else in the Langium project?

Otherwise the complexity of the set-up grows unnecessarily. Since we usually don't change the AST generator, it is not worth the effort.

Contributor Author

BTW: Is there an easy way to solve the indentation problem, i.e. to get rid of the strange expandToStringIndented solution?

await writeWithFail(path.resolve(updateLangiumInternalAstPath(output, config), 'ast.ts'), genAst, options);
// ast.ts
const isSingleLanguage = config.languages.length === 1;
const isSingleFile = allGrammars.length === 1 && isSingleLanguage /* handles a special case: multiple languages use the same *.langium file */;
Contributor

Did you think about excluding grammar files that define interfaces & types only, like langium-types.langium, at least while considering keywords, token types and such parts?

I advocate that, as otherwise many language projects are considered multi-file projects although they only contain a single file defining parser & terminal rules, like the Langium grammar 'project'.

Contributor Author

We need to ensure that the complexity of the ast.ts generator does not explode. Therefore I would not embed/inline grammar files containing only type definitions in general.

But we could interpret projects with one grammar file containing rules and additional grammar files containing only type definitions as "single-file projects" as well. That would include the "Langium grammar project" itself.

@JohannesMeierSE (Contributor Author)

Thanks @sailingKieler for your review and discussing some critical points!

You are right regarding reachability, I focused on completeness in this PR.

I assume that File-based types include all keywords that occur in the entire file

yes

I do not agree that the union of those types is equal to the union of keywords that are actually part of a language.

You can argue in that way, yes.

This is a fundamental decision we need to make.

I just checked: the previous AST generator ignores unused rules (and therefore unused keywords), but it does consider unused types!

I see the following approaches (the project part would remain unchanged in all approaches):

  1. Keep the current implementation of this PR, i.e. generate the complete set of elements, not the set of reachable elements.
    • Open question: Do we want the complete or the reachable elements?
  2. Skip the file-based parts completely, i.e. generate code only for languages and the whole project.
    • Pro: reachable elements are easy
    • Contra: If the project uses common.langium files which are included by multiple languages, you don't have the elements of common.langium in order to define stuff only once and reuse it for multiple languages.
  3. Keep the file-based parts as they are and generate the complete set of reachable elements for each language (without referring to the files).
    • Contra: unnecessary code clones
  4. Generate the complete set of reachable elements for each language (without referring to the files), generate the complete set of elements for each file, and calculate the reachable elements for each file afterwards.
    • Pro: We have reachable elements for both files and languages
    • Contra: I am not sure whether this approach is possible for terminals. The complexity grows.

I sketched the last approach for the keywords:

// the three *.langium files with all contained keywords
export type KeywordsF1All = // f1.langium
    | 'F1used'
    | 'F1unused';
export type KeywordsF2All = // f2.langium
    | 'F2used'
    | 'F2unused';
export type KeywordsFCommonAll = // fcommon.langium
    | 'FCommonBoth'
    | 'FCommon1'
    | 'FCommon2'
    | 'FCommonNone';

// two languages with their reachable keywords
export type KeywordsL1Reachable = // language 1 uses f1.langium and fcommon.langium
    | 'F1used'
    | 'FCommonBoth'
    | 'FCommon1';
export type KeywordsL2Reachable = // language 2 uses f2.langium and fcommon.langium
    | 'F2used'
    | 'FCommonBoth'
    | 'FCommon2';

// the reachable keywords of the three *.langium files
export type KeywordsF1Reachable = KeywordsF1All & KeywordsL1Reachable;
export type KeywordsF2Reachable = KeywordsF2All & KeywordsL2Reachable;
export type KeywordsFCommonReachable = KeywordsFCommonAll & (KeywordsL1Reachable | KeywordsL2Reachable);

@spoenemann (Contributor)

I agree that the file-based generation of additional types and constants is very noisy. I worry that it will cause confusion – the generated names are long and you have to look very closely to choose the correct one.

Idea: WDYT of wrapping all definitions generated for a particular file in a namespace? That might make the generated code more understandable.

Additionally, we should start adding comments to the generated definitions to make them easier to grasp.

@JohannesMeierSE (Contributor Author)

Decisions in the Langium dev meeting 2025-07-09 (@sailingKieler @spoenemann @msujew @JohannesMeierSE)

  • We generate elements for each language, but not for each file, in order to reduce complexity in the AST generator => the ast.ts for all single-language projects remains unchanged
  • We generate only the used/reachable keywords, but all types.
  • We try to structure the ast.ts for multi-language projects with namespaces and comments for better readability

@JohannesMeierSE will update this PR accordingly

@cdietrich (Contributor)

Looking forward to what this does to our 12-language multi-language project.

@JohannesMeierSE removed the bug (Something isn't working) label on Jul 11, 2025
@JohannesMeierSE force-pushed the jm/multi-language-ast-files branch from 31d3582 to 399fa23 on July 11, 2025 12:32
@JohannesMeierSE changed the title from "Improve generated ast.ts regarding multiple files and languages" to "Improve generated ast.ts regarding multiple languages" on Jul 11, 2025
@JohannesMeierSE (Contributor Author)

@sailingKieler @spoenemann @msujew This PR is ready for another review

@JohannesMeierSE (Contributor Author)

There are some follow-up ideas for improvements, WDYT?

  • Terminals which are called only by other terminals are not listed in the terminals section: Why? I didn't get the reason for that ...
  • For the type definitions, first all unions and second all interfaces are generated: We could think about mixing union and interface types and sort them by name.
  • The terminals are not sorted by name so far.

@cdietrich (Contributor)

I tried to mess with this in the requirements example:

Reading config from /Users/dietrich/git/langium/examples/requirements/langium-config.json
Warning: Warning: No LINE_BREAKS Found.
        This Lexer has been defined to track line and column information,
        But none of the Token Types can be identified as matching a line terminator.
        See https://chevrotain.io/docs/guide/resolving_lexer_errors.html#LINE_BREAKS 
        for details.
Error: Undefined rule: Contact at 4:13
grammar Requirements2

import "./common"
import "./requirements"

entry Dully:
    "model" x=Environment2;

Environment2 infers Environment:
    'environment2' name=ID ':' description=STRING;
{
        "id": "requirements2-lang",
        "grammar": "src/language-server/requirements2.langium",
        "fileExtensions": [".req2"],
        "textMate": {
            "out": "syntaxes/requirements2.tmLanguage.json"
        }
    }

fails only on your branch

@cdietrich self-requested a review on July 15, 2025 07:58
@cdietrich left a comment

Can you please have a look at the problem I reported?

@JohannesMeierSE (Contributor Author)

I will look into it, but we are very busy this week

@sailingKieler left a comment

Thank you for the update and the additional effort. 🙏
I have one major concern that probably is related to the finding of @cdietrich.

Apart from that, there are a bunch of places where you could simplify the templates, though nothing critical.

@@ -159,20 +156,21 @@ function mapGrammarElements(grammars: Grammar[], visited: Set<string> = new Set(
     grammar,
     (grammar.rules as GrammarElement[])
         .concat(grammar.types)
-        .concat(grammar.interfaces)
+        .concat(grammar.interfaces) // a new array is created here!
Contributor

What about this instead of the preceding 3 lines:

Suggested change
.concat(grammar.interfaces) // a new array is created here!
[ ...grammar.rules, ...grammar.types, ...grammar.interfaces ]

log('error', options, chalk.red(parserAnalysis.toString()));
return buildResult(false);
configMap.set(entryGrammar, languageConfig);
const embeddedGrammar = await embedGrammars([entryGrammar], configMap, config, options, grammarServices); // this changes the 'entryGrammar' and the 'configMap' in-place!
@sailingKieler (Contributor) Jul 18, 2025

In the former version, all configs were visited once and configMap.set(entryGrammar, languageConfig) was performed before the first configMap.get(...) was executed. That's not the case in your version, see the impl of embedGrammars. Does that yield the issue @cdietrich experienced?

BTW: In your test you do it 'the old way', i.e. populating configMap completely before calling embedGrammars for the first time. Maybe this entire function (runGenerator(...)) could be tested in one of your tests (or an additional one)?

export type ${name}AstType = ${joinToNode(
identifiers,
identifier => `${identifier}.AstType`,
{ separator: ' & ' }
Contributor

I would prefer a one-liner version here, too, like in l.99, as the result will also be a one-liner.

${joinToNode(typeNames, name => name + ': ' + name, { appendNewLineIfNotEmpty: true })}
}
`.appendNewLine().appendNewLine();
Contributor

I suggest dropping the appendNewLine()... here and adding empty lines in the calling template.
This will eliminate superfluous trailing newlines in the ast.ts of the requirements project.

identifier => `${identifier}.AstType`,
{ separator: ' & ' }
)}
`.appendNewLine().appendNewLine();
Contributor

I suggest dropping the appendNewLine()... here and adding empty lines in the calling template.

`,
expectedAstContentParts: [
expandToStringIndented(2, expandToNode`
A: {
Contributor

Regarding your question above:
Yes, there is. You can tell the evaluator to consider less indentation as template white space by inserting a substitution that contributes nothing. That also works with the substitution and A on the same line, but it looks a bit uglier. The proposed version expects a leading line break before A, which is totally fine and not an issue here.

Suggested change
A: {
${''}
A: {

path: string;
languageID: string | undefined; // If not undefined, this grammar file is used as entry point for a language with the given value as 'id'
grammarContent: string;
expectedAstContentParts: string[];
Contributor

Some minor polishing suggestions:

Suggested change
expectedAstContentParts: string[];
expectedAstContentParts: Generated[];

grammarContent: string;
expectedAstContentParts: string[];
}
async function testMultiProject(grammarFiles: GrammarFile[], moreExpectedParts: string[]): Promise<void> {
Contributor

A vararg parameter allows skipping the array brackets above:

Suggested change
async function testMultiProject(grammarFiles: GrammarFile[], moreExpectedParts: string[]): Promise<void> {
async function testMultiProject(grammarFiles: GrammarFile[], ...moreExpectedParts: Generated[]): Promise<void> {

languages: [], // will be filled later
};
const documents: LangiumDocument[] = [];
const expectedParts: string[] = [];
Contributor

Suggested change
const expectedParts: string[] = [];
const expectedParts: Generated[] = [];

}

// check the generated content
expectedParts.forEach(part => expect(generated).toContain(part));
Contributor

Suggested change
expectedParts.forEach(part => expect(generated).toContain(part));
expectedParts.forEach(part => expect(generated).toContain(toString(part)));

@sailingKieler (Contributor)

There are some follow-up ideas for improvements, WDYT?

  • Terminals which are called only by other terminals are not listed in the terminals section: Why? I didn't get the reason for that ...

Because referenced terminals are expanded/inlined before the relevant terminals are registered in the lexer and the regexes are generated into ast.ts. You might argue that generating sub-terminals might still be useful. However, the value converter is always asked to convert terminals that are accepted by an expanded terminal regex. Thus, using the expanded terminal definition and doing finer evaluations by checking for matched groups according to the actual terminal regex is the natural way to go. And I recently contributed an improvement of that 😉, see #1966.
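A plain-TypeScript illustration of that point (this is not the Langium value-converter API; the terminal names and regexes are made up):

// Suppose a terminal NUMBER references a sub-terminal FRACTION in the grammar.
// Before lexing, FRACTION is inlined, so only the expanded regex is registered:
const NUMBER = /(\d+)((?:\.\d+)?)/;

// A value converter therefore always sees text matched by the expanded regex;
// finer distinctions are made by inspecting the capture groups:
function convertNumber(input: string): number {
    const match = NUMBER.exec(input);
    if (!match) {
        throw new Error(`not a NUMBER: ${input}`);
    }
    const hasFraction = match[2].length > 0; // group stemming from the former sub-terminal
    return hasFraction ? parseFloat(input) : parseInt(input, 10);
}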

BTW: terminal fragments are entirely ignored in the aforementioned activities.

  • For the type definitions, first all unions and second all interfaces are generated: We could think about mixing union and interface types and sort them by name.

That's probably an artifact of the early days. I'm undecided about that.

  • The terminals are not sorted by name so far.

Hence the order within the grammar is applied? That sounds fine to me, as users probably group them somehow and that grouping is then reflected in the generated parts. Since the number of terminal rules is usually rather small, being a bit laxer here is probably not an issue.
