
Conversation


@ddeletedaccount ddeletedaccount commented Oct 15, 2022

  • switched to a more detailed character comprehension loop (reimplemented lexer)
  • added a lexer mode for compiler calls (also made it straightforward to add future modes if needed)
  • lexer is now capable of looking ahead of/behind its current position more easily

TODO:

  • comment source
  • add specific token identifiers for operators
  • iron out newline/end-of-file character bugs
  • fix lexical analysis errors relating to strings
  • added compiler mode for lexing compiler calls
  • make line number/column tracing functional
  • implement basic error handling (now deferred to the parser for performance reasons)
  • set up tests

- switched to more detailed character comprehension loop
- added lexer mode for compiler calls
- made it so lexer can look anywhere in source, assisting with error handling later on

TODO:
- implement basic error handling
- iron out newline/end of file characters
- set up tests
- added specific token identifiers
- fixed lexer errors relating to strings
- implemented lexer iteration in class calls
- added compiler mode

* fixme:
- column numbers not showing correctly
- fixed column numbers
- fixed issue where strings were misidentified in identifiers
- fixed operator columns being computed incorrectly
- removed repetitive code, tidied source
- added comments
- fixed issue where unclosed strings corrupted lexical tokens coming after
- fixed issue where loaded sources could inherit trace values from previous source
- made operator types specific
- added lexical analysis for numbers
- added tests
- switched files to .ts
@ddeletedaccount ddeletedaccount marked this pull request as ready for review October 19, 2022 05:48
@ddeletedaccount ddeletedaccount marked this pull request as draft October 19, 2022 06:10
@ddeletedaccount ddeletedaccount marked this pull request as ready for review October 19, 2022 06:17

@HPaulson HPaulson left a comment


Overall this is a great PR -- glad to see so much progress in rebuilding the lexer. I've added a bunch of comments: some are suggestions, some are appreciations, some are bugs, and most are thoughts.

I really like the new structure introduced here, particularly when it comes to file setup & testing. The biggest concern I have is the lack of testing around string parsing. In legacy, this was one of the hardest parts to get right, and its bugs were super tough to get rid of, so it's important to nail this. A good portion of my comments are related to this -- these are the most important & should be given extensive thought. Others are just appreciation for methods, thoughts for the future, and minor changes.

Besides these -- really good work on this. Let's sort out the remaining bugs/issues mentioned so we can move forward 🚀


// loads new .mtl source, avoiding reinitializing class
load(input: string) {
  this.source = input.replace(/(\r\n|\n|\r)/gm, "&eol") + "&eof";
@HPaulson HPaulson Oct 19, 2022

Big fan of this approach for distinguishing eol -- way superior to the legacy method. The only problem is that it doesn't consider strings. We've agreed that strings should be multi-line, and as such, this should do one of two things:

  1. Not include the eol token if the end of a line is in the middle of a string; or
  2. Appropriately handle the eol in string mode, adding a \n whenever this token is discovered in this mode. This is the preferred method and is actually a better way to handle line breaks than in the legacy lexer.

See comment on 101
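
To make option 2 concrete, here's a minimal sketch of string-mode eol handling, assuming the &eol markers produced by load() above (a hedged illustration -- names and shapes are not the PR's actual implementation):

// Sketch: while in string mode, an &eol marker becomes a literal newline in
// the token value instead of producing a standalone eol token.
const EOL = "&eol";

function consumeStringBody(source: string, pos: number): { value: string; next: number } {
  let value = "";
  let i = pos;
  while (i < source.length && source[i] !== '"') {
    if (source.startsWith(EOL, i)) {
      value += "\n"; // option 2: the line break survives inside the string
      i += EOL.length;
    } else {
      value += source[i];
      i += 1;
    }
  }
  return { value, next: i + 1 }; // position after the closing quote
}

// The source has already been through load(), so real newlines are &eol markers:
console.log(consumeStringBody('The quick brown fox&eoljumps over the lazy dog"', 0).value);
// The quick brown fox
// jumps over the lazy dog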

@ddeletedaccount ddeletedaccount (Member Author) Oct 21, 2022

Responding to 1: not sure what is meant by "end of a line is in the middle of a string" 😅. Internal eol tokens are used for keeping track of virtual line placement and possibly for future feature additions in MTL. Could you rephrase?

As a personal thought, I feel we should ditch multi-line strings in favor of chainable single-line strings. Escape sequences such as \n and friends will remain, so loading external files into unsafe blocks won't be inhibited, but multi-line strings are very error-prone, and splitting them up could help with both string layout and error handling.

Examples
note: indentation used is four spaces
With multi-line:

const my_sentence: string {
    "The quick brown fox
jumps over the lazy dog"
}

With single line:

const my_sentence: string {
    "The quick brown fox"
    "jumps over the lazy dog"
}
# or
const my_sentence: string {
    "The quick brown fox\njumps over the lazy dog"
}

Output:

The quick brown fox
jumps over the lazy dog

Continuing, we can also experience weird formatting issues when trying to tidy multi-line strings.
With multi-line:

const my_sentence: string {
    "The quick brown fox
     jumps over the lazy dog"
}

Output:

The quick brown fox
     jumps over the lazy dog

On the topic of error handling: with single-line strings, the lexer is capable of lexing past unclosed strings and finding other co-occurring errors without searching for a closing quotation mark on said strings.

With multi-line:

# in this example, the lexer is stuck in string mode
# and unable to detect token errors, with each character
# being loaded into the string.

const unclosed_string: string {
    "I'm a unclosed
multine string...
}

# -- continued source in file --

With single line:

const unclosed_string: string {
    "I'm a unclosed"
    "multiline string...  # lexer catches unclosed string here
    # lexer marks string as unclosed and continues lexing tokens to discover other syntax errors, allowing multiple errors to be reported
}
# -- continued source in file --
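
To illustrate the recovery just described, a minimal sketch, assuming the &eol/&eof markers from load() and the invalid_string token type this PR already emits (everything else is illustrative):

interface Token {
  type: string;
  value: string;
}

// Sketch: if an &eol or &eof marker appears before the closing quote, the
// string is unclosed -- emit invalid_string with what was collected so far
// and return to default mode so lexing continues past the error.
function lexString(source: string, pos: number): { token: Token; next: number } {
  let value = "";
  let i = pos;
  while (i < source.length && source[i] !== '"') {
    if (source.startsWith("&eol", i) || source.startsWith("&eof", i)) {
      return { token: { type: "invalid_string", value }, next: i };
    }
    value += source[i];
    i += 1;
  }
  return { token: { type: "string", value }, next: i + 1 };
}

console.log(lexString("I'm an unclosed&eol", 0).token);
// { type: "invalid_string", value: "I'm an unclosed" }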

@HPaulson HPaulson Oct 21, 2022

When you initially brought up this idea I really wasn't a fan -- I thought using multi-line was a much better approach that would allow for flexibility within the language. After reading your rationale though, I actually really like this idea. Error handling definitely becomes easier, but even further, I think this adds to our logical syntax design approach... whitespace & spacing become much more visible when reading code using this method (+ whitespace was annoying to implement anyway, so fine by me haha).

Before resolving:

  1. Come up with a method of knowing when someone uses a multiline. Right now the second part of a multiline will be considered an ident, which shouldn't be the case. I'm pretty sure once this bug is fixed, it'll also resolve here, so fix that before worrying about this. Example:
const x: string { "hey
      there" };
  { type: "invalid_string", value: "hey", trace: { line: 1, column: [ 18, 21 ] } },
  { type: "eol", value: "*", trace: { line: 1, column: [ -1, -1 ] } },
  { type: "ident", value: 'there"', trace: { line: 2, column: [ 6, 11 ] } },
  2. Do some testing with whitespace. I think this could have some wonky implications that we aren't thinking of, mainly because it's unconventional, and should be extensively tested to ensure it makes sense and is readable in an actual codebase.

Re: eol comment -- I think that's irrelevant with this new proposal. Ignore.

Besides those couple things -- really good thinking here. Big fan of this 👍

@HPaulson HPaulson

After thinking about this some more today, I have some additional thoughts.

How do we plan to actually parse this syntax? The lexer will produce two string tokens, even though a variable only has one string value, and then what -- do we expect the compiler to append them during codegen? That doesn't make much sense to me.

In your example we had the following src:

const my_sentence: string {
    "The quick brown fox"
    "jumps over the lazy dog"
}

Which resulted in an output of:

The quick brown fox
jumps over the lazy dog

and was equivalent to The quick brown fox\njumps over the lazy dog

So this leads to three important questions: 1. When are these strings appended to each other, 2. Does this break Lydo conventions, and 3. Does this add unnecessary complexity for the purpose of being cute?

From the POV of the programmer, our variable my_sentence should contain a single string: The quick brown fox\njumps over the lazy dog, yet by reading the code it seems that there are two strings, almost like an array of two strings, which isn't an added confusion we want.

Even if that confusion doesn't exist though, Lydo's current structure is to return the very last element in a set of braces. With this in mind, aren't we breaking that convention by returning multiple elements (even if they're appended to end up as one element)? What happens if I have some code like this:

const my_word: string { "hi" }
const my_sentence: string {
    "my word is:"
    @my_word
    "this ^"
}

With the proposed method, we expect my word is:\nhi\nthis ^ -- this is super hard to determine by just reading the src. With our current return structure for vars, it's expected that you can read the very last line and see what the value of my_sentence will be -- here that isn't the case, and that adds a ton of ambiguity.

When you start combining syntax into one data type, things get kinda iffy. And this goes back to the first question -- when are these actually appended? It's expected that my_sentence is only one string, which means our sets of quotes are combined at some point and joined with a \n... if this occurs at codegen time, I feel like it could introduce bugs or generally be very confusing for the end developer (and us while building this) to write code.

Not to mention -- when has this syntax, or anything like it, been used by another language? If you're learning the language for the first time, it would be VERY confusing to imagine two separate strings being combined, and joined with a linebreak, simply because they're next to each other in the code. I imagine someone learning the language saying to themself: "So it's two strings with four quotes, but actually it's one string made up of two appended strings which for some reason are also joined with a linebreak, and it's returned as a single string data type" -- yikes.

I still like the concept, but doubtful about how this would be executed. This was the one comment on this PR that I couldn't get out of my head today. The more I thought about it, the more I remembered Ryan Dahl's quote from his Node Regrets talk: "What I've come to learn now that I'm ageing is that whenever you're designing a program there are things that you think might be cute to add in. You always regret those."

@ddeletedaccount ddeletedaccount (Member Author) Oct 21, 2022

Responding to the thoughts in order:

I definitely should have been more detailed in explaining how this would be handled, as I can see how the confusion came about.

Based on the example:

const my_sentence: string {
    "The quick brown fox"
    "jumps over the lazy dog"
}

In the above source, the lexer will output two string tokens, which isn't much of an issue, as it's very clear to the parser what is implied by strings side by side. Even with explicit operators, the source is quite understandable:

const my_sentence: string {
    "The quick brown fox" +
    "jumps over the lazy dog"
}

As for variables producing two string values... MTL doesn't have the concept of a variable by design, only collections with everything grouped together. At code generation the above code would be transformed into a variable, but that has nothing to do with MTL itself. Appending would be a parser step, transforming the strings into:

The quick brown fox[&newline]jumps over the lazy dog
(note: the \n escape is not used here, as it doesn't exist in all programming languages and is based on the C dialect; [&newline] is a placeholder for parser/AST replacement. The actual parser insert is to be determined during design.)
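
To make that parser step concrete, a hedged sketch of folding adjacent string tokens (token shape and helper name are illustrative, not settled design):

interface Token {
  type: string;
  value: string;
}

// Sketch: collapse runs of adjacent string tokens into one string node,
// joined with the [&newline] placeholder that later stages translate into
// the target language's line break.
function foldAdjacentStrings(tokens: Token[]): Token[] {
  const out: Token[] = [];
  for (const token of tokens) {
    const prev = out[out.length - 1];
    if (token.type === "string" && prev?.type === "string") {
      prev.value += "[&newline]" + token.value;
    } else {
      out.push({ ...token });
    }
  }
  return out;
}

console.log(foldAdjacentStrings([
  { type: "string", value: "The quick brown fox" },
  { type: "string", value: "jumps over the lazy dog" },
]));
// [ { type: "string", value: "The quick brown fox[&newline]jumps over the lazy dog" } ]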

To answer the numbered questions:
1. When are these strings appended to each other?
Strings are appended in the parser/AST tree generation stage. Definitely not something left to later stages, as code generation should not need to focus on details about MTL.

2. Does this break Lydo conventions?
MTL conventions are not formalized as of yet, and there are already contradicting implementations of the "return last line" approach we discussed, visible in collections:

const sample: collection {
    @my_variable_1,
    @my_variable_2,
    @my_variable_3,
}

3. Does this add unnecessary complexity for the purpose of being cute?
Definitely an interesting question, and the one I disagreed with the most. With your approach of multi-line strings and the formatting issue below:

const my_sentence: string {
    "The quick brown fox
     jumps over the lazy dog"
}

The developer ends up needlessly writing code to remove leading line spacing, or just ends up using single-line strings regardless:

# no formatting issue
const my_sentence: string {
    "The quick brown fox" +
     "jumps over the lazy dog"
}

The single-line approach is essentially the same as writing + after each string, just without needing to write the + to get the desired multi-line string formatting.

Going back to the discussion of Lydo conventions, would this not be invalid?

const my_number: number {
    1
    + 1
    - 3 * 6
    -17 + 84
}

At the end of the day, it removes a detrimental issue that could mess up the desired error reporting, avoids unwanted string formatting issues, and adds the simple sugar of not needing a + after each string, while optimizing the output in a simple parser step.

Answering the next concern:
What happens if I have some code like this?

const my_word: string { "hi" }
const my_sentence: string {
    "my word is:"
    @my_word
    "this ^"
}

That is invalid source, as the variable should be placed in a string:


const my_word: string { "hi" }
const my_sentence: string {
    "my word is:"
    "@my_word"
    "this ^"
}

Variables inside a string were a discussion topic we agreed on, and deviating from that could generate confusing syntax.
The above proper example would compile to the following in JS:

const my_word = "hi";

const my_sentence = `my word is:\n${my_word}\nthis ^`;
// or alternatively
const my_sentence = "my word is:" + "\n" + my_word + "\nthis ^";

For the concern of appending happening during codegen: that is 100% not the case. Nothing pre-generation should ever be pushed onto code generation to handle. The lexer interprets tokens, the parser analyzes tokens and builds a code execution tree, IR code (if added for code generation optimizations later on) compiles the parser tree (rinse and repeat the lexer/parser steps for IR), and finally code generation takes a look at the final syntax tree and simply translates it to the target language.

when has this syntax, or anything like it, been used by another language?
Why should it have to? There always has to be a first.

"So it's two strings with four quotes, but actually it's one string made up of two appended strings which for some reason are also joined with a linebreak, and it's returned as a single string data type" -- yikes.

Documentation exists to explain these types of things. It's the same as looking into Rust's abstractions and thinking its abstracting of allocations is terrible design without reading the docs. Like anything, there's a learning curve, but I feel what's intended is very visible. If not, there's a documentation section.

Ending off, there's definitely a divide between "cute" and functional. Error handling is the most important point of the single-line approach.

If this is still disagreed upon, I'm willing to go with the multi-line approach that you have in mind, but I really do think it will sting long term.

-- response is personal thoughts --

@HPaulson HPaulson

Some of these points are really good, but I still think this needs further discussion.

Just to make sure my intentions are clear -- I actually don't have a preference on method. These comments:

the multi-line approach that you have in mind [...] With your approach of multi-line strings

are a little misrepresentative of my intention. I'm more so concerned about the flaws with both of the implementations we've thought of thus far; I don't think either is a great option. Because of this, I want to make sure we discuss this thoroughly and get it right.

We've outlined a bunch of pros and cons, and I think this thread has a good collection of thoughts. Let's have a call and discuss this in further detail. Afterward, we can write a summary in this thread of our decision & its rationale.

Comment on lines +288 to +289
} else {
  type = "ident";


This isn't necessarily true. We can't assume any token value is an identifier just because it doesn't reach any other case. The way I handled this in the legacy lexer was by checking if the previous token was a const or mut keyword.

This method is effective because when defining an identifier, we require either of these keywords; and when calling an identifier we use @, so we can assume every identifier token will follow a keyw token with a const or mut value.
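
A rough sketch of that legacy rule (hedged -- the legacy lexer's actual code may differ):

interface Token {
  type: string;
  value: string;
}

// Sketch: a bare word is an ident only when the previous token is a defining
// keyword (const or mut); anything else can't safely be called an ident.
function isIdent(prev: Token | undefined): boolean {
  return prev !== undefined && prev.type === "keyw" &&
    (prev.value === "const" || prev.value === "mut");
}

isIdent({ type: "keyw", value: "const" }); // true  -> "x" in: const x: string { ... }
isIdent({ type: "string", value: "hey" }); // false -> "bug" in: { "hey" bug }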

@ddeletedaccount ddeletedaccount (Member Author)

As for your suggestion on token checking, much of that is better deferred to the parser; the lexer is naive in nature and only tasked with understanding what each token means, with more detailed analysis left up to the parser. Having this check also ends up repetitive, as the end goal is a concurrent compilation system; there's no performance drawback from leaving it until the parser stage if each token is individually sent.

As for checking references, that's definitely something I missed in token analysis; I'll have the lexer actively check for the @ symbol and break apart chained references (such as @std.write(), for example).
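
Something along these lines could handle the chained-reference breakup (a hedged sketch; the helper name is illustrative):

// Sketch: break a chained reference like @std.write into one segment per
// link so the parser receives the chain pre-split.
function splitReference(raw: string): string[] {
  return raw.replace(/^@/, "").split("."); // "@std.write" -> ["std", "write"]
}

console.log(splitReference("@std.write")); // [ "std", "write" ]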

@HPaulson HPaulson Oct 21, 2022

I think you missed the point I was making with this comment. I'm not saying we should care about the check for the purpose of analysis, but rather for the purpose of being correct. Right now the lexer will say anything is an identifier simply when it doesn't know what it is. Previously we had an unknown token type, and I recommend bringing this back. Here's an example: const x: string { "hey" bug };

This line has two syntax errors: bug & ;. We know bug isn't a call to a variable since it doesn't have an @, and the semicolon simply doesn't exist in the language. These are both invalid segments within the code -- they don't have a valid token type because they shouldn't be there. Rather than realizing this fact, this PR will assume those are identifiers... which doesn't make logical sense. An ident is strictly the name of a variable, function, or another user-generated name -- not a random piece of text within the code.

So my original solution for this was to check if it came after a const or mut. This was effective because it made sure any unknown text that came after a defining keyword was known as an ident -- I really didn't care to analyze what type it was or any of that, I only cared about being correct when calling it an ident. If you have a simpler/faster method for accomplishing this, I'm all for it, but it's important that token types are actually correct. The assumption that any unknown code is an identifier is simply incorrect, so I think a token for unknown/invalid code would be an appropriate addition (& would help with errors in the parser).

Re: breaking up chained refs & introducing @ -- I actually didn't notice these weren't already implemented, otherwise I would've left a separate comment about them too. Definitely should be implemented in this PR! 👍

@ddeletedaccount ddeletedaccount (Member Author)

For token types, I think the best approach is to change the name from "ident" to "unknown", as variable names will be understood by the parser fairly easily. An individual token type specific to variable identifiers would prove useless to the parser.

If you're good with that, I can do the name change 👍

And yeah, references are definitely an important part to miss 😅

@HPaulson HPaulson

Yeah, this is a good approach. Any code segment that isn't known to the lexer, be it an identifier name or some random piece of text, will be an unknown token. The parser can then do the analysis of whether it came after a const or mut, or whether it's truly an unknown piece of text (i.e. a syntax error).

Good compromise -- Add this in your next commit & mark as resolved 👍

break;
}

case Mode.String: {

@HPaulson HPaulson

Currently, there's no section for parsing \n in a string -- this should be added. There are a LOT of cases where invalid_string will fire, even when unintended. I think this is caused by the eol token -- see comment on 36.

Overall I think strings need lots of work, and we should be carefully simplifying them. This was by far the most complex system in the legacy lexer, and that was for a reason -- there's a lot of stuff to check. We'll want to include linebreaks, spaces, escapes, and various other items within strings... this should be thoroughly tested before implementing it.

@ddeletedaccount ddeletedaccount (Member Author)

Lexing of \n is definitely a fix that needs to be implemented, but general understanding of strings by the lexer is unneeded. The lexer should feed in the string data and leave understanding of special escaped characters to later stages, such as the parser or even rudimentary code generation. Escape sequences are not something easy to perfect; I'll go ahead and add naive escape sequence understanding so they're retained in strings 👍.

@HPaulson HPaulson

Agreed. I wasn't saying we should actively understand escaped sections in a string -- leave that to the parser -- but we should definitely include them in the string token. As long as escapes are retained in string tokens, we should be good to go from the lexer's perspective. Side note: make sure to remove the backslashes. For example, "this is \"escaped\"" should create a string token with a value of this is "escaped". In this version of the PR, it gets buggy and will break up the string token. I actually think this may be the same bug as this comment, or at the least they're related.

(Based on your comment here I think this may already be fixed. Feel free to resolve once tested)
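
For reference, one hedged reading of the two comments above as a minimal sketch: the backslash is dropped for escaped quotes (so they don't terminate the string), while other escapes like \n are retained verbatim for the parser:

// Sketch: in string mode, \" contributes a plain " to the token value and
// doesn't terminate the string; every other escape passes through untouched
// so the parser can interpret it later.
function readStringValue(source: string, pos: number): string {
  let value = "";
  let i = pos;
  while (i < source.length && source[i] !== '"') {
    if (source[i] === "\\" && source[i + 1] === '"') {
      value += '"'; // drop the backslash, keep the quote
      i += 2;
    } else {
      value += source[i]; // \n and friends are retained verbatim
      i += 1;
    }
  }
  return value;
}

console.log(readStringValue('this is \\"escaped\\""', 0)); // this is "escaped"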
