-
Notifications
You must be signed in to change notification settings - Fork 81
Improve generated ast.ts
regarding multiple languages
#1979
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
First of all thank you very much for the detailed PR description, that's been a lot of effort! 🙏 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have some conceptual doubts/questions/remarks, see below.
The generated code with the file-based separation of tokens, keywords, and types feels a bit puffy and noisy to me, especially in cases like for the langium grammar that now comes with an accompanying .langium
file just contributing the outsourced AST types.
Of course, that's my personal taste.
I would also prefer to stick to the existing generation behavior regarding reachable keywords within a language.
|
||
export type RequirementsAndTestsLanguageRequirementsTerminalNames = keyof typeof RequirementsAndTestsLanguageRequirementsTerminals; | ||
|
||
export type RequirementsAndTestsLanguageRequirementsKeywordNames = RequirementsAndTestsFileCommonKeywordNames | RequirementsAndTestsFileRequirementsKeywordNames; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Has it been like that previously in terms of keywords being contributed by non-called parser rules?
I assume that File
-based types include all keywords that occur in the entire file, but I do not agree that the union of those types is equal to the union of keywords that are actually part of a language.
Instead, I would assume that non reachable keywords are not included here. What was the established approach in that?
@@ -10,7 +10,7 @@ | |||
}, | |||
{ | |||
"name": "keyword.control.tests-lang", | |||
"match": "\\b(applicable|contact|for|testFile|tests|tst)\\b" | |||
"match": "\\b(applicable|contact|environment|for|req|testFile|tests|tst)\\b" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this change reflects my consideration from above.
Here it won't not hurt, I guess.
In general, however, keywords will be mixed with IDs in the highlighting and other features, if keywords are 'inherited' from grammar files, although they are not reachable via the entry rule.
packages/langium-cli/src/generate.ts
Outdated
await writeWithFail(path.resolve(updateLangiumInternalAstPath(output, config), 'ast.ts'), genAst, options); | ||
// ast.ts | ||
const isSingleLanguage = config.languages.length === 1; | ||
const isSingleFile = allGrammars.length === 1 && isSingleLanguage /* handles a special case: multiple languages use the same *.langium file */; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't get the sense of the comment 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If there are two languages with the same entry grammar, then it is a multi-language project with a single *.langium
file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, but the property is true
for the (1, 1)
case, so the context does not fit for me...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We have two languages in the config => isSingleLanguage
is false
(line 334)
Both language configs point to the same *.langium
file => allGrammars.length === 1
is true
=> without the additional check && isSingleLanguage
, isSingleFile
is true
as well (line 335)
Since we check the single-file case first (line 337), this (strange) multi-language project is treated as a single-file project, i.e. the language-specific parts in the ast.ts
are skipped, but it should be treated as a multi-language project. Therefore we need && isSingleLanguage
.
let startIndex = typesFileContent.indexOf(start); | ||
for (let i = 0; i < startCount; i++) { | ||
startIndex = typesFileContent.indexOf(start, startIndex + start.length); | ||
} | ||
const relevantPart = typesFileContent.substring(startIndex, typesFileContent.indexOf(end)).trim(); | ||
expect(relevantPart).toEqual(expectedPart); | ||
} | ||
|
||
describe('Ast generator (with multiple *.langium files)', () => { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Considering that these additional new tests double the length of the file, I would prefer to create the test projects on disc rather in memory based on inlined code. That would also simplify the update of the expected parts.
Just reviewed the tests in VSCode and it is okay there, here on Github with side by side view it's basically unreadable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That would also simplify the update of the expected parts.
I don't agree, since you need to change only one place with the current approach.
What do you think about splitting the ast-generator.test.ts
file into two files, maybe ast-generator-single.test.ts
and ast-generator-multi.test.ts
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is still okay for me to keep it, but it's quite on the edge.
The files on disc-approach has the advantage that you can run the langium-cli
to update the files if necessary (probably with some kind of setup), which is much easier for others that didn't write the tests initially. In general, there's no obligation to do it manually, in contrast to the in code-approach.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I got your point. Do we have such "on disc"-test cases somewhere else in the Langium project?
Otherwise the complexity of the set-up grows unnecessarily. Since we usually don't change the AST generator, it is not worth the effort.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BTW: Is there an easy way to solve the indentation problem, i.e. to get rid of the strange expandToStringIndented
solution?
packages/langium-cli/src/generate.ts
Outdated
await writeWithFail(path.resolve(updateLangiumInternalAstPath(output, config), 'ast.ts'), genAst, options); | ||
// ast.ts | ||
const isSingleLanguage = config.languages.length === 1; | ||
const isSingleFile = allGrammars.length === 1 && isSingleLanguage /* handles a special case: multiple languages use the same *.langium file */; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did you think about excluding grammar files that define interfaces & types only, like langium-types.langium
, at least while considering keywords, token types and such parts?
I advocate that as otherwise many language projects are considered multi file projects, although they only contain a single file defining parser & terminal rules, like the Langium grammar 'project'.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need to ensure, that the complexity of the ast.ts
-generator does not explode. Therefore I would not embed/inline grammar files containing only type definitions in general.
But we could interpret projects with one grammar file containing rules and additional grammar files containing only type definitions as "single-file projects" as well. That would include the "Langium grammar project" itself.
Thanks @sailingKieler for your review and discussing some critical points! You are right regarding reachability, I focused on completeness in this PR.
yes
You can argue in that way, yes. This is a fundamental decision we need to decide. I just checked, that the previous AST generator ignores unused rules (and therefore unused keywords), but it does consider unused types! I see the following approaches (the project part would remain unchanged in all approaches):
I sketched the last approach for the keywords: // the three *.langium files with all contained keywords
export type KeywordsF1All = // f1.langium
| 'F1used'
| 'F1unused';
export type KeywordsF2All = // f2.langium
| 'F2used'
| 'F2unused';
export type KeywordsFCommonAll = // fcommon.langium
| 'FCommonBoth'
| 'FCommon1'
| 'FCommon2'
| 'FCommonNone';
// two languages with their reachable keywords
export type KeywordsL1Reachable = // language 1 uses f1.langium and fcommon.langium
| 'F1used'
| 'FCommonBoth'
| 'FCommon1';
export type KeywordsL2Reachable = // language 2 uses f2.langium and fcommon.langium
| 'F2used'
| 'FCommonBoth'
| 'FCommon2';
// the reachable keywords of the three *.langium files
export type KeywordsF1Reachable = KeywordsF1All & KeywordsL1Reachable;
export type KeywordsF2Reachable = KeywordsF2All & KeywordsL2Reachable;
export type KeywordsFCommonReachable = KeywordsFCommonAll & (KeywordsL1Reachable | KeywordsL2Reachable); |
I agree that the file-based generation of additional types and constants is very noisy. I worry that it will cause confusion – the generated names are long and you have to look very closely to choose the correct one. Idea: WDYT of wrapping all definitions generated for a particular file in a namespace? That might make the generated code more understandable. Additionally, we should start adding comments to the generated definitions to make them easier to grasp. |
Decisions in the Langium dev meeting 2025-07-09 (@sailingKieler @spoenemann @msujew @JohannesMeierSE)
@JohannesMeierSE will update this PR accordingly |
looking forward what this does to our 12 language multi language project. |
31d3582
to
399fa23
Compare
ast.ts
regarding multiple files and languagesast.ts
regarding multiple languages
@sailingKieler @spoenemann @msujew This PR is ready for another review |
There are some follow-up ideas for improvements, WDYT?
|
i tried to mess with this in the requirements example
fails only on your branch |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you please have a look the problem i reported
I will look into it, but we are very busy this week |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the update and the additional effort. 🙏
I have one major concern that probably is related to the finding of @cdietrich.
Apart from that there're a bunch of places where you could simplify the templates though nothing critical.
@@ -159,20 +156,21 @@ function mapGrammarElements(grammars: Grammar[], visited: Set<string> = new Set( | |||
grammar, | |||
(grammar.rules as GrammarElement[]) | |||
.concat(grammar.types) | |||
.concat(grammar.interfaces) | |||
.concat(grammar.interfaces) // a new array is created here! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about this instead of the preceding 3 lines:
.concat(grammar.interfaces) // a new array is created here! | |
[ ...grammar.rules, ...grammar.types, ...grammar.interfaces ] |
log('error', options, chalk.red(parserAnalysis.toString())); | ||
return buildResult(false); | ||
configMap.set(entryGrammar, languageConfig); | ||
const embeddedGrammar = await embedGrammars([entryGrammar], configMap, config, options, grammarServices); // this changes the 'entryGrammar' and the 'configMap' in-place! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the former version all configs were visited once and configMap.set(entryGrammar, languageConfig)
was performed before the first configMap.get(...)
has been executed. That's not the the case in your version, see impl of embedGrammars
. Does that yield the issue @cdietrich experienced?
BTW: In your test you do it 'the old way', i.e. populating configMap
completely before calling embedGrammars
for the first time. Maybe this entire function (runGenerator(...)
) could be tested in one of your tests (or an additional one)?
export type ${name}AstType = ${joinToNode( | ||
identifiers, | ||
identifier => `${identifier}.AstType`, | ||
{ separator: ' & ' } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would prefer a one-liner version here, too, like in l.99, the the result will also be a one-liner.
${joinToNode(typeNames, name => name + ': ' + name, { appendNewLineIfNotEmpty: true })} | ||
} | ||
`.appendNewLine().appendNewLine(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suggest to drop the appendNewLine()...
here and add empty lines in the calling template.
This will eliminate superfluous trailing newlines in ast.ts
of the requirements
project.
identifier => `${identifier}.AstType`, | ||
{ separator: ' & ' } | ||
)} | ||
`.appendNewLine().appendNewLine(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suggest to drop the appendNewLine()... here and add empty lines in the calling template.
`, | ||
expectedAstContentParts: [ | ||
expandToStringIndented(2, expandToNode` | ||
A: { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Regarding your question above:
Yes, there is. You can tell the evaluator to consider less indentation as template white space by inserting a substitution contributing nothing. That does also work with the substitution and A
on the same line but looks a bit uglier. The proposed version expects a leading line break before A
that is totally fine and not an issue here.
A: { | |
${''} | |
A: { |
path: string; | ||
languageID: string | undefined; // If not undefined, this grammar file is used as entry point for a language with the given value as 'id' | ||
grammarContent: string; | ||
expectedAstContentParts: string[]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some minor polishing suggestions:
expectedAstContentParts: string[]; | |
expectedAstContentParts: Generated[]; |
grammarContent: string; | ||
expectedAstContentParts: string[]; | ||
} | ||
async function testMultiProject(grammarFiles: GrammarFile[], moreExpectedParts: string[]): Promise<void> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A varArg param allows to skip the array brackets above:
async function testMultiProject(grammarFiles: GrammarFile[], moreExpectedParts: string[]): Promise<void> { | |
async function testMultiProject(grammarFiles: GrammarFile[], ...moreExpectedParts: Generated[]): Promise<void> { |
languages: [], // will be filled later | ||
}; | ||
const documents: LangiumDocument[] = []; | ||
const expectedParts: string[] = []; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
const expectedParts: string[] = []; | |
const expectedParts: Generated[] = []; |
} | ||
|
||
// check the generated content | ||
expectedParts.forEach(part => expect(generated).toContain(part)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
expectedParts.forEach(part => expect(generated).toContain(part)); | |
expectedParts.forEach(part => expect(generated).toContain(toString(part))); |
Because referenced terminals are expanded/inlined before registering the relevant terminals in the lexer/generating the regexes to BTW:
That's probably an artifact of the early days. I'm undecided about that.
Hence the order within the grammar is applied? That sounds fine for me, as the users probably group them somehow and that grouping is then reflected in the generated parts. Since the number of terminal rules usually rather small, being a bit laxer here is probably not an issue. |
Summary
This PR reworks the content of the generated
ast.ts
in case of multiple languages inside the project (as defined in thelangium-config.json
).Motivation
In a recent multi-language project, we wanted to write a formatter for a single language (not for all languages). In order to statically ensure by the TypeScript compiler, that we wrote definitions for all AST types in the language, we used the generated
XXXAstType = { Type1: Type1, Type2: Type2, ... }
. But this forced us to define formatting rules for all types of the project including types from other languages. Therefore this PR contributes (among others) a XXXAstType for each language.Contributions in more detail
The
ast.ts
consists of:export interface YYY extends langium.AstNode { ... }
)XXXAstType
(as described in the motivation)While the whole content is merged for all files and languages, this PR splits the generated code for terminals (1), keywords (2) and XXXAstType (4), while type definitions (3) and reflection (5) remain unchanged.
In particular for the type definitions (3) it is important to have only one version of them:
One instance of reachable terminals & keywords and a XXXAstType for all available types is created for each language, which are nested into a namespace to improve structure and readability of the generated
ast.ts
.Additionally, one instance of terminals, keywords and XXXAstType is created for the whole project.
Some comments are generated into the
ast.ts
as well.Discussion of (non) breaking changes
ast.ts
for a single-language project remains unchanged (trynpm run langium:generate
on the whole Langium repo on your own)!Additional contributions