-
Notifications
You must be signed in to change notification settings - Fork 740
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add a lexer for untokenised BBC BASIC files #1280
Conversation
Includes full support for ARM BBC BASIC. Some features of BBC BASIC for Windows and BBC BASIC for SDL2.0 are not included.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And now I'm caught up with your whirlwind of submissions! :P
@bavison Looking at this code, I'm a little uncomfortable at the performance cost of the interpolation. There's a lot of it and there's potentially a lot of backtracking as the regex engine goes through trying to find matches. I thought again about that alternative strategy of matching an identifier and then checking whether the match is inside any of the arrays of keywords (be they That would be your conclusion, too, right? |
…iority than built-in function ERR
…ommand introducer This is done by introducing an extra state where CLI commands are not valid, and classifying the control flow keywords according to whether they transition in or out of this state. Not 100% safe yet, since detecting any expression in :root state should mean that * can only be an operator.
There's certainly a lot of Presumably it would be possible to form a decision tree to speed things up, so that for example |
@bavison Yeah, the number of I read through the documentation for Ruby's Regexp class and the best we have is the This is probably another example where we should see how it performs in practice and look at future optimisations as and when required. As you said, trying to do that by hand is possible but fraught with bug potential. |
By using a negative lookahead for when DIM and POINT are statements, rather that a positive lookahead when DIM() and POINT() are used as built-in functions, we can merge state :builtin_function into :expression. As a bonus, since built-in functions are now tested at lower priority than control flow statements, `ERROR` no longer needs its own rule.
This is achieved to inserting `goto :no_further_imperatives` every time we identify a component of an expression, or reach a state where we expect an expression next. Because these will not be balanced by the number of times we return to :root, we can no longer use a state stack. This means that instead of being able to do `mixin :expression` from state `:assembly2`, we need to write out all the same rules again, but without the `goto`s.
Some of those keywords previously listed are not valid as an imperative statement.
I don't think it realle helps to distinguish control flow statements from other statements, especially since they are tokenised the same. Some keywords (like `TINT`) have ended up in control3 even though they have nothing to do with conrol flow, and others (like `OF`) can appear within either sort of statement (`COLOUR OF x` and `CASE x OF`).
* `PROC` is permitted in an `ON` statement * An imperative keyword is permitted immediately after `DEFPROCthing` without a separating colon Added a few more examples to the visual spec
It isn't really a built-in function (since it needs a function name to qualify it, which is lexed as a variable), but it can appear within an expression.
Using a pair of states just wan't working - there are just too many exceptions to the rules to handle cleanly. So instead, here we take the approach of only permitting `*` to be lexed as a command introducer if it is at the start of a line, or follows `:` or one of a restrictive set of other keywords. Unfortunately, it appears that the lookbehind regexp syntax `(?<=)` doesn't work here, so we have to lex the string including the lead up to the `*` as a multi-token rule. This is fine in all cases except when the first line of the file is a `*` command, but hopefully this will be rare enough that nobody cares.
Sorry about the drip-drip of commits. Mostly it's putting the worms back in the can regarding the multiplication operator |
@bavison Thanks for all your work on this one—quite a journey, huh? Oh, and I'm sorry we don't have anything in the documentation explaining that negative lookbehinds don't work in rules. I hope you didn't spend too long before you worked that out! This is a limitation of (really a bug in) the StringScanner class from Ruby's standard library. A patch I submitted was just accepted but the reality is that even when that patch rolls out, it'll only be in the latest version of Ruby and so it'll be a while before it's supported in Rouge :( |
No, it wasn't that big of a proportion of the time. The biggest problem was dealing with a syntax that evolved in a not-entirely-consistent way in an era before common use of other language analysis tools, I think. That's the end of my backlog of lexers that I'd developed - there probably won't be many more to follow. Bugfixes maybe. Thanks very much for all your help! |
I realised I hadn't said this in the closing comment of any of the other PRs but Rouge is on a two-week release cadence at the moment as we work our way through the backlog of submissions. The lexers you've submitted will come out this coming Tuesday as part of v3.8.0. Thanks again! |
This commit adds a lexer for ARM BBC Basic. Some features of BBC BASIC for Windows and BBC BASIC for SDL2.0 are not included.
Includes full support for ARM BBC BASIC. Some features of BBC BASIC for Windows
and BBC BASIC for SDL2.0 are not included.