
Improve generated ast.ts regarding multiple languages #1979


Open · wants to merge 1 commit into main

Conversation

@JohannesMeierSE (Contributor) commented Jul 4, 2025

Summary

This PR reworks the content of the generated ast.ts for projects with multiple languages (as defined in the langium-config.json).

Motivation

In a recent multi-language project, we wanted to write a formatter for a single language (not for all languages). To let the TypeScript compiler statically ensure that we wrote definitions for all AST types of that language, we used the generated XXXAstType = { Type1: Type1, Type2: Type2, ... }. But this forced us to define formatting rules for all types of the project, including types from other languages. Therefore this PR contributes (among other things) a XXXAstType for each language.
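For illustration, here is a minimal sketch of the pattern we used (all type and rule names are made up, the formatter wiring is omitted):

// Hypothetical AST interfaces as they would appear in the generated ast.ts.
interface Requirement { name: string }
interface TestCase { name: string }

// Per-language map as proposed by this PR (for a fictitious "requirements" language) ...
type RequirementsAstType = {
    Requirement: Requirement
}
// ... and the project-wide map that is generated today, merging all languages.
type ProjectAstType = {
    Requirement: Requirement
    TestCase: TestCase
}

// A mapped type forces exactly one formatting rule per key of the chosen AstType map.
type FormattingRules<T> = {
    [K in keyof T]: (node: T[K]) => void
};

// With only the project-wide map, the compiler would also demand a rule for TestCase,
// although that type belongs to another language:
const requirementsRules: FormattingRules<RequirementsAstType> = {
    Requirement: node => { /* format requirement nodes */ },
};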

Contributions in more detail

The ast.ts consists of:

  1. terminals
  2. keywords
  3. type definitions (export interface YYY extends langium.AstNode { ... })
  4. XXXAstType (as described in the motivation)
  5. reflection

So far, all of this content is merged across all files and languages. This PR splits the generated code for terminals (1), keywords (2) and XXXAstType (4), while type definitions (3) and reflection (5) remain unchanged.
In particular for the type definitions (3), it is important to have only one version of them:

  • Otherwise, imports in TS files become non-unique during coding.
  • Type definitions might otherwise be "incomplete": the same type (identified by its name) can be inferred with different properties by different parser rules, possibly belonging to different languages, and these properties are merged into the final type definition (see the sketch below).
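A made-up example of such a merge (rule names, property names and types are purely illustrative):

// Two parser rules, possibly located in different grammar files/languages,
// both infer a type 'Person' but contribute different properties:
//   PersonWithAge  infers Person: name=ID 'aged' age=INT;
//   PersonWithMail infers Person: name=ID 'mail' email=STRING;
// The generator merges them into one interface; a per-language or per-file copy
// of this type definition would therefore be incomplete:
export interface Person extends langium.AstNode {
    name: string
    age?: number
    email?: string
}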

For each language, one instance of its reachable terminals & keywords and a XXXAstType covering all of its available types is created; these are nested into a namespace to improve the structure and readability of the generated ast.ts.
Additionally, one instance of terminals, keywords and XXXAstType is created for the whole project.
Some explanatory comments are generated into the ast.ts as well.
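As a rough sketch, the generated ast.ts for a hypothetical project "RequirementsAndTests" with two languages could take the following shape (all names and contents are illustrative, not the exact generated output):

import type * as langium from 'langium';

// The single, merged type definitions (3) stay as they are today:
export interface Requirement extends langium.AstNode { name: string }
export interface TestCase extends langium.AstNode { name: string }

// Language-specific definitions, nested into one namespace per language:
export namespace RequirementsLanguage {
    // reachable terminals and keywords of this language only
    export const Terminals = { ID: /[_a-zA-Z][\w_]*/ };
    export type TerminalNames = keyof typeof Terminals;
    export type KeywordNames = 'contact' | 'req';
    // all AST types available in this language
    export type AstType = { Requirement: Requirement };
}
export namespace TestsLanguage {
    export const Terminals = { ID: /[_a-zA-Z][\w_]*/ };
    export type TerminalNames = keyof typeof Terminals;
    export type KeywordNames = 'testFile' | 'tests' | 'tst';
    export type AstType = { TestCase: TestCase };
}

// Project-wide definitions keep their previous names, so existing code keeps compiling:
export type RequirementsAndTestsAstType = RequirementsLanguage.AstType & TestsLanguage.AstType;
export type RequirementsAndTestsKeywordNames = RequirementsLanguage.KeywordNames | TestsLanguage.KeywordNames;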

Discussion of (non) breaking changes

  • The ast.ts for a single-language project remains unchanged (try npm run langium:generate on the whole Langium repo on your own)!
  • The terminals, keywords and XXXAstType created for the whole project have the same names as before this PR; therefore they can be used exactly as before.

Additional contributions

  • Added test cases for the CLI generator for multi-file and multi-language projects (previously, only single-file projects were tested)

@JohannesMeierSE added this to the v4.0.0 milestone on Jul 4, 2025
@JohannesMeierSE added the labels bug (Something isn't working), cli (CLI related issue) and ast (AST structure related issue) on Jul 4, 2025
@sailingKieler (Contributor)

First of all thank you very much for the detailed PR description, that's been a lot of effort! 🙏

@sailingKieler left a comment

I have some conceptual doubts/questions/remarks, see below.

The generated code with the file-based separation of tokens, keywords, and types feels a bit puffy and noisy to me, especially in cases like the Langium grammar itself, which now comes with an accompanying .langium file that only contributes the outsourced AST types.
Of course, that's my personal taste.

I would also prefer to stick to the existing generation behavior regarding reachable keywords within a language.


export type RequirementsAndTestsLanguageRequirementsTerminalNames = keyof typeof RequirementsAndTestsLanguageRequirementsTerminals;

export type RequirementsAndTestsLanguageRequirementsKeywordNames = RequirementsAndTestsFileCommonKeywordNames | RequirementsAndTestsFileRequirementsKeywordNames;
Contributor

Has it been like that previously in terms of keywords being contributed by non-called parser rules?

I assume that File-based types include all keywords that occur in the entire file, but I do not agree that the union of those types is equal to the union of keywords that are actually part of a language.

Instead, I would assume that non-reachable keywords are not included here. What was the established approach for that?

@@ -10,7 +10,7 @@
         },
         {
             "name": "keyword.control.tests-lang",
-            "match": "\\b(applicable|contact|for|testFile|tests|tst)\\b"
+            "match": "\\b(applicable|contact|environment|for|req|testFile|tests|tst)\\b"
Contributor

I think this change reflects my consideration from above.
Here it won't hurt, I guess.
In general, however, keywords that are 'inherited' from other grammar files will be mixed with IDs in highlighting and other features, even though they are not reachable via the entry rule.

await writeWithFail(path.resolve(updateLangiumInternalAstPath(output, config), 'ast.ts'), genAst, options);
// ast.ts
const isSingleLanguage = config.languages.length === 1;
const isSingleFile = allGrammars.length === 1 && isSingleLanguage /* handles a special case: multiple languages use the same *.langium file */;
Contributor

I don't get the sense of the comment 🤔

Contributor Author

If there are two languages with the same entry grammar, then it is a multi-language project with a single *.langium file.

Contributor

Yes, but the property is true for the (1, 1) case, so the context does not fit for me...

Contributor Author

We have two languages in the config => isSingleLanguage is false (line 334)

Both language configs point to the same *.langium file => allGrammars.length === 1 is true => without the additional check && isSingleLanguage, isSingleFile is true as well (line 335)

Since we check the single-file case first (line 337), this (strange) multi-language project is treated as a single-file project, i.e. the language-specific parts in the ast.ts are skipped, but it should be treated as a multi-language project. Therefore we need && isSingleLanguage.
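A small sketch of the scenario (values are illustrative):

// Two language configs in langium-config.json point to the same grammar file:
const config = {
    languages: [
        { id: 'lang-a', grammar: 'shared.langium' },
        { id: 'lang-b', grammar: 'shared.langium' },
    ],
};
const allGrammars = ['shared.langium']; // grammar files are deduplicated

const isSingleLanguage = config.languages.length === 1;            // false
const isSingleFile = allGrammars.length === 1 && isSingleLanguage; // false, only because of the extra check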

    let startIndex = typesFileContent.indexOf(start);
    for (let i = 0; i < startCount; i++) {
        startIndex = typesFileContent.indexOf(start, startIndex + start.length);
    }
    const relevantPart = typesFileContent.substring(startIndex, typesFileContent.indexOf(end)).trim();
    expect(relevantPart).toEqual(expectedPart);
}

describe('Ast generator (with multiple *.langium files)', () => {
Contributor

Considering that these additional new tests double the length of the file, I would prefer to create the test projects on disc rather than in memory based on inlined code. That would also simplify the update of the expected parts.

I just reviewed the tests in VS Code and it is okay there; here on GitHub with the side-by-side view it's basically unreadable.

Contributor Author

That would also simplify the update of the expected parts.

I don't agree, since you need to change only one place with the current approach.

What do you think about splitting the ast-generator.test.ts file into two files, maybe ast-generator-single.test.ts and ast-generator-multi.test.ts?

Contributor

It is still okay for me to keep it, but it's quite on the edge.

The files-on-disc approach has the advantage that you can run the langium-cli to update the files if necessary (probably with some kind of setup), which is much easier for others who didn't write the tests initially. In general, there's no obligation to do it manually, in contrast to the in-code approach.

Contributor Author

I got your point. Do we have such "on disc"-test cases somewhere else in the Langium project?

Otherwise the complexity of the set-up grows unnecessarily. Since we usually don't change the AST generator, it is not worth the effort.

Contributor Author

BTW: Is there an easy way to solve the indentation problem, i.e. to get rid of the strange expandToStringIndented solution?

await writeWithFail(path.resolve(updateLangiumInternalAstPath(output, config), 'ast.ts'), genAst, options);
// ast.ts
const isSingleLanguage = config.languages.length === 1;
const isSingleFile = allGrammars.length === 1 && isSingleLanguage /* handles a special case: multiple languages use the same *.langium file */;
Contributor

Did you think about excluding grammar files that define interfaces & types only, like langium-types.langium, at least while considering keywords, token types and such parts?

I advocate that, as otherwise many language projects are considered multi-file projects although they only contain a single file defining parser & terminal rules, like the Langium grammar 'project'.

Contributor Author

We need to ensure that the complexity of the ast.ts generator does not explode. Therefore I would not embed/inline grammar files containing only type definitions in general.

But we could interpret projects with one grammar file containing rules and additional grammar files containing only type definitions as "single-file projects" as well. That would include the "Langium grammar project" itself.

@JohannesMeierSE (Contributor Author)

Thanks @sailingKieler for your review and discussing some critical points!

You are right regarding reachability, I focused on completeness in this PR.

I assume that File-based types include all keywords that occur in the entire file

yes

I do not agree that the union of those types is equal to the union of keywords that are actually part of a language.

You can argue in that way, yes.

This is a fundamental decision we need to make.

I just checked: the previous AST generator ignores unused rules (and therefore unused keywords), but it does consider unused types!

I see the following approaches (the project part would remain unchanged in all approaches):

  1. Keep the current implementation of this PR, i.e. generate the complete set of elements, not the set of reachable elements.
    • Open question: Do we want the complete or the reachable elements?
  2. Skip the file-based parts completely, i.e. generate code only for languages and the whole project.
    • Pro: reachable elements are easy
    • Contra: If the project uses common.langium files which are included by multiple languages, you don't have the elements of common.langium in order to define stuff only once and reuse it for multiple languages.
  3. Keep the file-based parts as they are and generate the complete set of reachable elements for each language (without referring to the files).
    • Contra: unnecessary code clones
  4. Generate the complete set of reachable elements for each language (without referring to the files), generate the complete set of elements for each file, and calculate the reachable elements for each file afterwards.
    • Pro: We have reachable elements for both files and languages
    • Contra: I am not sure whether this approach is possible for terminals. The complexity grows.

I sketched the last approach for the keywords:

// the three *.langium files with all contained keywords
export type KeywordsF1All = // f1.langium
    | 'F1used'
    | 'F1unused';
export type KeywordsF2All = // f2.langium
    | 'F2used'
    | 'F2unused';
export type KeywordsFCommonAll = // fcommon.langium
    | 'FCommonBoth'
    | 'FCommon1'
    | 'FCommon2'
    | 'FCommonNone';

// two languages with their reachable keywords
export type KeywordsL1Reachable = // language 1 uses f1.langium and fcommon.langium
    | 'F1used'
    | 'FCommonBoth'
    | 'FCommon1';
export type KeywordsL2Reachable = // language 2 uses f2.langium and fcommon.langium
    | 'F2used'
    | 'FCommonBoth'
    | 'FCommon2';

// the reachable keywords of the three *.langium files
export type KeywordsF1Reachable = KeywordsF1All & KeywordsL1Reachable;
export type KeywordsF2Reachable = KeywordsF2All & KeywordsL2Reachable;
export type KeywordsFCommonReachable = KeywordsFCommonAll & (KeywordsL1Reachable | KeywordsL2Reachable);

@spoenemann (Contributor)

I agree that the file-based generation of additional types and constants is very noisy. I worry that it will cause confusion – the generated names are long and you have to look very closely to choose the correct one.

Idea: WDYT of wrapping all definitions generated for a particular file in a namespace? That might make the generated code more understandable.

Additionally, we should start adding comments to the generated definitions to make them easier to grasp.

@JohannesMeierSE (Contributor Author)

Decisions in the Langium dev meeting 2025-07-09 (@sailingKieler @spoenemann @msujew @JohannesMeierSE)

  • We generate elements for each language, but not for each file, in order to reduce complexity in the AST generator => the ast.ts for all single-language projects remains unchanged
  • We generate only the used/reachable keywords, but all types.
  • We try to structure the ast.ts for multi-language projects with namespaces and comments for better readability

@JohannesMeierSE will update this PR accordingly

@cdietrich (Contributor)

Looking forward to what this does to our 12-language multi-language project.

@JohannesMeierSE removed the bug (Something isn't working) label on Jul 11, 2025
@JohannesMeierSE force-pushed the jm/multi-language-ast-files branch from 31d3582 to 399fa23 on July 11, 2025 12:32
@JohannesMeierSE changed the title from "Improve generated ast.ts regarding multiple files and languages" to "Improve generated ast.ts regarding multiple languages" on Jul 11, 2025
@JohannesMeierSE (Contributor Author)

@sailingKieler @spoenemann @msujew This PR is ready for another review

@JohannesMeierSE (Contributor Author)

There are some follow-up ideas for improvements, WDYT?

  • Terminals which are called only by other terminals are not listed in the terminals section: Why? I didn't get the reason for that ...
  • For the type definitions, first all unions and second all interfaces are generated: We could think about mixing union and interface types and sort them by name.
  • The terminals are not sorted by name so far.

@cdietrich (Contributor)

I tried to mess with this in the requirements example:

Reading config from /Users/dietrich/git/langium/examples/requirements/langium-config.json
Warning: Warning: No LINE_BREAKS Found.
        This Lexer has been defined to track line and column information,
        But none of the Token Types can be identified as matching a line terminator.
        See https://chevrotain.io/docs/guide/resolving_lexer_errors.html#LINE_BREAKS 
        for details.
Error: Undefined rule: Contact at 4:13
grammar Requirements2

import "./common"
import "./requirements"

entry Dully:
    "model" x=Environment2;

Environment2 infers Environment:
    'environment2' name=ID ':' description=STRING;
{
        "id": "requirements2-lang",
        "grammar": "src/language-server/requirements2.langium",
        "fileExtensions": [".req2"],
        "textMate": {
            "out": "syntaxes/requirements2.tmLanguage.json"
        }
    }

fails only on your branch

@cdietrich self-requested a review on July 15, 2025 07:58
@cdietrich left a comment

Can you please have a look at the problem I reported?

@JohannesMeierSE (Contributor Author)

I will look into it, but we are very busy this week

@sailingKieler left a comment

Thank you for the update and the additional effort. 🙏
I have one major concern that probably is related to the finding of @cdietrich.

Apart from that, there are a bunch of places where you could simplify the templates, though nothing critical.

@@ -159,20 +156,21 @@ function mapGrammarElements(grammars: Grammar[], visited: Set<string> = new Set(
     grammar,
     (grammar.rules as GrammarElement[])
         .concat(grammar.types)
-        .concat(grammar.interfaces)
+        .concat(grammar.interfaces) // a new array is created here!
Contributor

What about this instead of the preceding 3 lines:

Suggested change
.concat(grammar.interfaces) // a new array is created here!
[ ...grammar.rules, ...grammar.types, ...grammar.interfaces ]

log('error', options, chalk.red(parserAnalysis.toString()));
return buildResult(false);
configMap.set(entryGrammar, languageConfig);
const embeddedGrammar = await embedGrammars([entryGrammar], configMap, config, options, grammarServices); // this changes the 'entryGrammar' and the 'configMap' in-place!
@sailingKieler (Contributor) Jul 18, 2025

In the former version, all configs were visited once and configMap.set(entryGrammar, languageConfig) was performed before the first configMap.get(...) was executed. That's not the case in your version, see the impl of embedGrammars. Does that yield the issue @cdietrich experienced?

BTW: In your test you do it 'the old way', i.e. populating configMap completely before calling embedGrammars for the first time. Maybe this entire function (runGenerator(...)) could be tested in one of your tests (or an additional one)?

export type ${name}AstType = ${joinToNode(
identifiers,
identifier => `${identifier}.AstType`,
{ separator: ' & ' }
Contributor

I would prefer a one-liner version here, too, like in l.99, as the result will also be a one-liner.

${joinToNode(typeNames, name => name + ': ' + name, { appendNewLineIfNotEmpty: true })}
}
`.appendNewLine().appendNewLine();
Contributor

I suggest dropping the appendNewLine()... here and adding empty lines in the calling template.
This will eliminate superfluous trailing newlines in the ast.ts of the requirements project.

identifier => `${identifier}.AstType`,
{ separator: ' & ' }
)}
`.appendNewLine().appendNewLine();
Contributor

I suggest dropping the appendNewLine()... here and adding empty lines in the calling template.

`,
expectedAstContentParts: [
expandToStringIndented(2, expandToNode`
A: {
Contributor

Regarding your question above:
Yes, there is. You can tell the evaluator to consider less indentation as template white space by inserting a substitution that contributes nothing. That also works with the substitution and A on the same line, but it looks a bit uglier. The proposed version expects a leading line break before A, which is totally fine and not an issue here.

Suggested change
A: {
${''}
A: {

path: string;
languageID: string | undefined; // If not undefined, this grammar file is used as entry point for a language with the given value as 'id'
grammarContent: string;
expectedAstContentParts: string[];
Contributor

Some minor polishing suggestions:

Suggested change
expectedAstContentParts: string[];
expectedAstContentParts: Generated[];

grammarContent: string;
expectedAstContentParts: string[];
}
async function testMultiProject(grammarFiles: GrammarFile[], moreExpectedParts: string[]): Promise<void> {
Contributor

A vararg parameter allows skipping the array brackets above:

Suggested change
async function testMultiProject(grammarFiles: GrammarFile[], moreExpectedParts: string[]): Promise<void> {
async function testMultiProject(grammarFiles: GrammarFile[], ...moreExpectedParts: Generated[]): Promise<void> {

languages: [], // will be filled later
};
const documents: LangiumDocument[] = [];
const expectedParts: string[] = [];
Contributor

Suggested change
const expectedParts: string[] = [];
const expectedParts: Generated[] = [];

}

// check the generated content
expectedParts.forEach(part => expect(generated).toContain(part));
Contributor

Suggested change
expectedParts.forEach(part => expect(generated).toContain(part));
expectedParts.forEach(part => expect(generated).toContain(toString(part)));

@sailingKieler (Contributor)

There are some follow-up ideas for improvements, WDYT?

  • Terminals which are called only by other terminals are not listed in the terminals section: Why? I didn't get the reason for that ...

Because referenced terminals are expanded/inlined before the relevant terminals are registered in the lexer and the regexes are generated into ast.ts. You might argue that generating sub-terminals might still be useful. However, the value converter is always asked to convert terminals that are accepted by an expanded terminal regex. Thus, using the expanded terminal definition and doing finer evaluations by checking for matched groups according to the actual terminal regex is the natural way to go. And I recently contributed an improvement of that 😉, see #1966.
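A plain-TypeScript illustration of that point (this is not the Langium value-converter API; the terminal names and regexes are made up):

// Suppose a terminal NUMBER references a sub-terminal FRACTION in the grammar.
// Before lexing, FRACTION is inlined, so only the expanded regex is registered:
const NUMBER = /(\d+)((?:\.\d+)?)/;

// A value converter therefore always sees text matched by the expanded regex;
// finer distinctions are made by inspecting the capture groups:
function convertNumber(input: string): number {
    const match = NUMBER.exec(input);
    if (!match) {
        throw new Error(`not a NUMBER: ${input}`);
    }
    const hasFraction = match[2].length > 0; // group stemming from the former sub-terminal
    return hasFraction ? parseFloat(input) : parseInt(input, 10);
}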

BTW: terminal fragments are entirely ignored in the aforementioned activities.

  • For the type definitions, first all unions and second all interfaces are generated: We could think about mixing union and interface types and sort them by name.

That's probably an artifact of the early days. I'm undecided about that.

  • The terminals are not sorted by name so far.

Hence the order within the grammar is applied? That sounds fine to me, as users probably group them somehow and that grouping is then reflected in the generated parts. Since the number of terminal rules is usually rather small, being a bit laxer here is probably not an issue.
