Skip to content

Parse fixes #1

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 7 commits into from
Sep 9, 2024
Merged

Parse fixes #1

merged 7 commits into from
Sep 9, 2024

Conversation

thaliaarchi
Copy link
Contributor

This addresses all the parsing bugs I found when I surveyed your assembler. The largest change is to move to strongly typed argument handling by scanning arguments as consistent tokens, before invoking each opcode's function. I have written more details in the commit messages and each commit is atomic.

I found that any exceptions thrown within the promises cannot be caught by the enclosing block and have context for the line added, since they are not awaited there, thus most validation errors are thrown as UnhandledPromiseRejection. I was tempted to duplicate the try/catch block within both promises, but that would be quite lengthy. Why did you choose the promise approach? The problem is sequential and each promise waits for the result of the previous, so there is no parallelism. I was tempted to refactor that, but thought you might want it.

Opcode mnemonics are compared using String.prototype.toLocaleLowerCase,
which has locale-aware case folding. Thus, in Turkish (LANG=tr) and
Azeri (LANG=az) locales, İ (U+0130, LATIN CAPITAL LETTER I WITH DOT
ABOVE) compares equal to i (U+0069, LATIN SMALL LETTER I). In all
locales, K (U+212A, KELVIN SIGN) compares equal to k (U+006B, LATIN
SMALL LETTER K), but no mnemonics use k. The characters used are
[a-eg-jl-pr-z].

Use String.prototype.toLowerCase, since it has the correct ASCII-only
folding for the characters used.
Currently, the parsing of arguments is left up to each instruction,
making the behavior inconsistent. Required arguments can be omitted and
are substituted with 0. Opcodes that don't take arguments ignore any
arguments that are given. `storestr` mangles a char argument, because it
converts it to the decimal string representation of its code unit.

=> Scan arguments into tokens and check their types before invoking the
function for the opcode. Represent this within the TypeScript type
system using discriminated unions, so it is strongly typed. Throw on
malformed input.

It relies heavily on splitting lines by exactly one space, making it
brittle to other whitespace characters like tab or repeated spaces. At
the start and end of the line, Unicode whitespace may be used due to
`String.prototype.trim`.

=> Consistently use /\s/ while parsing and create a "-quoted space
token, so that strings with spaces can still be represented. Single-word
strings are still allowed.
With the current state of variable semantics, it doesn't make sense to
treat a leading underscore in label or import position as a variable.
@thaliaarchi
Copy link
Contributor Author

Some other things not addressed in this PR:

  • Your breakpoint extension instruction is written debugger for your assembler and printed as dbg by your debugger. Which do you consider canonical? This difference makes the output of the disassembler not able to be fed into the assembler.
  • wsa-tests/count.wsa uses test, presumably from Burghard, but it is not supported by your assembler.

Do not add internal labels to the label map, so a user-supplied label
with the same name as an internal label will receive a different ID. In
such cases, the disassembly in the debugger will still show the same
name for both, but that only affects display.
Copy link
Owner

@voliva voliva left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry I took this long to review it - I wanted to take a deeper look since it's a substantial refactor :D

I like it, it's massively better than what I had before, with a proper tokenization step. Thank you so much!

I've licensed the repo under MIT, I hope that's alright for your contribution.

@voliva
Copy link
Owner

voliva commented Aug 29, 2024

I found that any exceptions thrown within the promises cannot be caught by the enclosing block and have context for the line added, since they are not awaited there, thus most validation errors are thrown as UnhandledPromiseRejection. I was tempted to duplicate the try/catch block within both promises, but that would be quite lengthy. Why did you choose the promise approach? The problem is sequential and each promise waits for the result of the previous, so there is no parallelism. I was tempted to refactor that, but thought you might want it.

Mhh I'm not sure where the UnhandledPromiseRejection happens. In theory Promise.resolve().then(() => { throw "foo" }) should return a promise that rejects (so it can be handled from outside). All of the promises are added into the results array and then awaited through Promise.all, which also handles it correctly. The promise returned by the compile function should be capturing all of the errors.

I didn't add promises for concurrency, but to not constraint the compiling process into a synchronous operation. I know NodeJS has sync APIs to read files, but by abstracting it to the getIncludedStream: (filename: string) => LineStream makes it asynchronous, and in mind this could help later on when building with a web UI. The Promises then are just a way of wrapping and working around this.

Your breakpoint extension instruction is written debugger for your assembler and printed as dbg by your debugger. Which do you consider canonical? This difference makes the output of the disassembler not able to be fed into the assembler.

Good point. I think the canonical should be dbg - Feels a bit more consistent with the compact nature of the others.

I'll probably also change the "debugExtensions" into something which is not just "debug", as that also includes bitwise opcodes. Or maybe split into different flags.

wsa-tests/count.wsa uses test, presumably from Burghard, but it is not supported by your assembler.

Yes, I see it was from very early on while I was still implementing the initial version.

@voliva voliva merged commit 0039336 into voliva:main Sep 9, 2024
@thaliaarchi
Copy link
Contributor Author

Thanks for your review! I see I missed the email notification from two weeks ago; sorry about that.

@thaliaarchi thaliaarchi deleted the parse-fixes branch December 3, 2024 04:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants