
Conversation

@rocky
Member

@rocky rocky commented Feb 2, 2026

Refactors the mathics.format.render code to use better style and to be more akin to current expression-evaluation technology.

@rocky rocky marked this pull request as draft February 2, 2026 01:16
@rocky rocky requested a review from mmatera February 2, 2026 01:16
@rocky
Member Author

rocky commented Feb 2, 2026

@mmatera: This is far from complete, but I wanted to get something out earlier for you to look at and think about.

The main thing right now to look at is mathics.format.render.mathml.fractionbox().

This starts to set box properties in the box object, even though until we get other boxes converted, we'll still have to pass things around in options, which, in my opinion, is bad.

First of all, the name is a poor choice. There are other places in the code where "options" refer to the user-settable parameters, like color, style, and width. Here, we are mingling in computed attributes of the box, e.g., indent-level. In the future, we will need many more, like the number of characters in text boxes (its width), number of lines (number of embedded "\n"'s), etc.

So a simple thing to do is rename box.options to box.properties or box.attributes. If we want to keep options for specified user options, that's okay.

(I don't think WMA allows Forms and box forms to take options, but one might imagine things like a way to specify flavors of TeX/LaTeX, or the maximum width in a text box, and what to do if that width is exceeded).

Another "code smell" we've introduced (actually, I introduced this in splitting off the rendering code for asy, svg, etc.) is using **options.

The problem with this is that this kind of thing is an unspecified, unchecked, untyped bag of whatever. And here, we can do better. There is now box_options, but instead of a generic dict, that would be better expressed as a Python dataclass, which is typed.
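To make the dataclass suggestion concrete, here is a minimal sketch. The names BoxAttributes, Box, indent_level, line_count and max_width are hypothetical and chosen only for illustration; they are not the classes in this PR.

```python
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class BoxAttributes:
    """Hypothetical typed container for computed (non-user) box data."""

    indent_level: int = 0  # passed down from the parent
    line_count: int = 1    # embedded "\n" count plus one, computed bottom-up
    max_width: int = 0     # widest line, in characters


@dataclass
class Box:
    """Hypothetical stand-in for a box expression node."""

    # User-settable options (color, style, width, ...) stay in a plain dict.
    options: Dict[str, Any] = field(default_factory=dict)
    # Computed attributes get their own typed object.
    attributes: BoxAttributes = field(default_factory=BoxAttributes)


box = Box(options={"color": "red"})
box.attributes.indent_level = 2
```

With this split, a type checker can verify every use of box.attributes, while options remains free-form.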

There is way more to mention. But I have other stuff right now, so more later...

@rocky
Member Author

rocky commented Feb 2, 2026

Here we have specific code, and presumably that helps to understand better what I'm talking about.

On to the more high-level ideas that we should be following to match conventional expression translation code like this.

Some information propagates from the bottom of the tree upwards. Indented string results work like that. And some information propagates from the parent down. The nesting level works like this.

A problem or feature of the way we were doing this is that a string was returned based on parameter information coming down. To the extent that all one ever needs is the string, okay. But there are situations where we may want to query and use those other pieces of information. Probably not nesting level, but the number of lines accumulated or the maximum width of text characters used are like this.

So what I ask is to think about information and transforming expressions as though the information propagates through the nodes of the tree, rather than as pieces of information that are found in parameters or in return results, even though that is in fact how tree attributes get updated.
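A small sketch of the two directions of flow, under the assumption of a hypothetical Node class (not a Mathics class): indent_level is set by the parent and flows down, while line_count and rendered are computed from the children and flow up. Parameters and return values are still used internally, but the results end up stored on the nodes, where they can be queried later.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Hypothetical tree node, for illustration only."""

    text: str = ""
    children: List["Node"] = field(default_factory=list)
    indent_level: int = 0  # flows down: set by the parent
    line_count: int = 0    # flows up: computed from the children
    rendered: str = ""     # flows up


def render(node: Node) -> None:
    """Fill in the node's bottom-up attributes in place."""
    pad = "  " * node.indent_level
    lines = [pad + node.text] if node.text else []
    for child in node.children:
        child.indent_level = node.indent_level + 1  # push information down
        render(child)
        lines.append(child.rendered)                # pull information up
    node.rendered = "\n".join(lines)
    node.line_count = node.rendered.count("\n") + 1 if lines else 0


tree = Node("frac", [Node("num"), Node("den", [Node("x")])])
render(tree)
```

After the call, tree.line_count and tree.rendered are available to any caller that needs more than the string.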

Right now, I'd like to see the MathML code improved and revised to make it clean and type-annotated. This includes removing the hard-coded characters. And I think that through that activity we will get a better sense of how tree-transformation code should work. This is applicable to the other render or boxing functions as well.

While this MathML code is significantly better than what we had before, there is still a little way to go. And I think it is very beneficial to do this before moving on to other, more complex forms, like 2-Dimensional character-oriented output, etc.

Remove expression box_properties
@rocky
Member Author

rocky commented Feb 2, 2026

In a draft of this PR, I renamed "box_option" to "box_property". In compiler expression manipulation terminology, "property" and "attribute" are somewhat synonymous. However, in Python, "properties" and "attributes" are different in that properties are read-only unless a setter is defined, while plain attributes are read-write. So, I will be renaming "box_properties" to "box_attributes" to reflect the read-only versus read-write aspect.

Later edit. Change now made.
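For readers less familiar with the Python distinction, a minimal sketch follows. The class and member names are hypothetical; the point is only that a property without a setter rejects assignment while a plain attribute accepts it.

```python
class Box:
    """Hypothetical class, only to contrast the two Python notions."""

    def __init__(self) -> None:
        self.indent_level = 0  # plain attribute: read-write
        self._text = "a\nb"

    @property
    def line_count(self) -> int:  # property without a setter: read-only
        return self._text.count("\n") + 1


box = Box()
box.indent_level = 3  # fine: plain attribute

try:
    box.line_count = 10  # no setter defined, so this is rejected
except AttributeError:
    read_only = True
else:
    read_only = False
```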

@mmatera
Contributor

mmatera commented Feb 2, 2026

@mmatera: This is far from complete, but I wanted to get something out earlier for you to look at and think about.

Thanks for tackling this. This refactor is large, and it is good to have another point of view.

The main thing right now to look at is mathics.format.render.mathml.fractionbox().

OK, it is enough to start the discussion.

This starts to set box properties in the box object, even though until we get other boxes converted, we'll still have to pass things around in options, which, in my opinion, is bad.

First of all, the name is a poor choice. There are other places in the code where "options" refer to the user-settable parameters, like color, style, and width. Here, we are mingling in computed attributes of the box, e.g., indent-level. In the future, we will need many more, like the number of characters in text boxes (its width), number of lines (number of embedded "\n"'s), etc.

So a simple thing to do is rename box.options to box.properties or box.attributes. If we want to keep options for specified user options, that's okay.

The point is that in this case, I used the word box_options, because there are BoxExpressions that have options. The main examples are StyleBox, InterpretationBox, PaneBox and GraphicsBox, which have many options that affect the contained object. For example, the result of F[x]//InputForm//MakeBoxes is

InterpretationBox[StyleBox[F[x], ShowStringCharacters -> True, NumberMarks -> True], F[x], Editable -> True, AutoDelete -> True]

InterpretationBox has at least two typical options, Editable and AutoDelete, which are used by the WMA Notebook interface (notice the Rule form used in WMA to specify OptionValues). StyleBox has these two options ShowStringCharacters and NumberMarks, that we use here.

(I don't think WMA allows Forms and box forms to take options, but one might imagine things like a way to specify flavors of TeX/LaTeX, or the maximum width in a text box, and what to do if that width is exceeded).

A Form that takes options is, for example, NumberForm. Again, the options for form and box expressions are usually attributes independent of the final representation. Exceptions are attributes like width and height in GraphicsBox or PaneBox.

BoundingBoxes, and attributes like the indentation level, are not options, because they are determined by the information already available on the container or the structure of the box expression: the indentation level of an element in a box expression is not something that you can specify with an option like

MyOuterExpr[MyInnerExpr[..., IndentationLevel->3 ], IndentationLevel->7]

The IndentationLevel of MyInnerExpr and MyOuterExpr depends on their position, and not on something that we pass as an option.

Still, you could cache certain properties inside the objects. For example, in GraphicsBox, in order to produce the output, some quantities like the size of the bounding boxes and the size occupied by a TextBox must be computed on the fly and used by the containers to make decisions on how to put pieces together, or what the size of the image is when it is not explicitly specified. In this case, it makes sense to store these quantities as attributes computed at render time. These attributes are not options, but could be a part of the object. Examples of this are found in the prepare_elements function, which we use in rendering GraphicsBox.

Another "code smell" we've introduced (actually, I introduced this in splitting off the rendering code for asy, svg, etc.) is using **options.

Then we reach the render part: boxes_to_ functions take optional arguments because, depending on the target format and the box expression, you could need to specify different attributes. For example, certain render functions could require an Evaluation object, because they need to show messages during the rendering. Also, indentation is an attribute that could make sense for a markdown file format, but not for a bitmap format. We could decide to pass a dictionary instead of a kwargs parameter, but Python gives us this feature, so why not use it?

The problem with this is that this kind of thing is an unspecified, unchecked, untyped bag of whatever. And here, we can do better. There is now box_options, but instead of a generic dict, that would be better expressed as a Python dataclass, which is typed.

Again, it is a bag because what we want to share is very general. A more explicit way to specify what that parameter is would be to pass a dictionary instead of keyword arguments. But keyword arguments in Python are dictionaries. They also have the convenience that you do not need to explicitly copy the dictionaries to ensure that a change of one element at a certain level does not affect the content of the dictionary at another level (Python does it by default).
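The copying behaviour of keyword arguments can be checked with a small sketch. The function names and the indent_level key are hypothetical. Note that the copy Python makes is shallow: nested mutable values, such as the style dictionary here, are still shared between caller and callee.

```python
def child(**kwargs):
    # kwargs is a fresh dict built for this call.
    kwargs["indent_level"] = kwargs.get("indent_level", 0) + 1
    return kwargs["indent_level"]


def child_with_dict(opts):
    # opts is the caller's own dict object.
    opts["indent_level"] = opts.get("indent_level", 0) + 1
    return opts["indent_level"]


parent = {"indent_level": 0, "style": {"color": "blue"}}

child(**parent)
after_kwargs = parent["indent_level"]  # caller's dict is untouched

child_with_dict(parent)
after_dict = parent["indent_level"]    # caller's dict was mutated
```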

There is way more to mention. But I have other stuff right now, so more later...

@mmatera
Contributor

mmatera commented Feb 2, 2026

Here we have specific code, and presumably that helps to understand better what I'm talking about.

good

On to the more high-level ideas that we should be following to match conventional expression translation code like this.

Just one thing: I know maybe I sound like the least indicated person to ask this, but in order to follow the discussion, let's try to change the different aspects of the implementation in different PRs.
Until here, I see four main aspects to correct

  • the name of certain identifiers, and their type annotations. Example: (self->boxes) or box_options->box_attributes. In the first we agree, in the second, I am not sure.
  • How special characters are handled at render time.
  • the mechanism to pass information on render time from the function that renders a container to the function that renders the contained objects and backwards.
  • the instantiation of BoxExpressions and their attributes.

Would you mind splitting these changes into different PRs?

@mmatera
Contributor

mmatera commented Feb 2, 2026

A problem or feature of the way we were doing this is that a string was returned based on parameter information coming down. To the extent that all one ever needs is the string, okay. But there are situations where we may want to query and use those other pieces of information. Probably not nesting level, but the number of lines accumulated or the maximum width of text characters used are like this.

Notice that something like this was done in the implementation of GridBox. Again, the number of lines is something useful when the box expression is rendered as text, but not as MathML.

So what I ask is to think about information and transforming expressions as though the information propagates through the nodes of the tree, rather than as pieces of information that are found in parameters or in return results, even though that is in fact how tree attributes get updated.

OK, but are these parameters attributes of the Box Expression object, or of the specific representation used in the render?
For example, in html,

<div class='container' style='text-align:center;color:blue;'>
<p style='color:red;'> a red string</p>
<p> a (default) blue string</p>
<p style='color:green;'> a green string</p>
</div>

the container div has style attributes that propagate to the contained <p> tags. Inside the web browser, there are objects that draw these instructions on the screen, and in these objects, I would expect that <p style='color:red;'> a red string</p> corresponds to an object having the attribute text-align:center. But at the HTML level, <p style='color:red;'> a red string</p> does not have this attribute set: we can move this element to another part of the HTML code and then its attribute could be anything else. This is what happens here with properties computed at render time.
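As a rough illustration of this point, Python's standard xml.etree.ElementTree can parse a simplified version of the markup above. This is only a sketch of how a parsed tree stores attributes, not a model of how a browser computes styles: each node holds only what its own tag declares.

```python
import xml.etree.ElementTree as ET

source = (
    "<div class='container' style='text-align:center;color:blue;'>"
    "<p style='color:red;'>a red string</p>"
    "<p>a (default) blue string</p>"
    "</div>"
)

root = ET.fromstring(source)
red_p, plain_p = root[0], root[1]

own_style = red_p.get("style")    # only what the tag itself declares
inherited = plain_p.get("style")  # None: nothing is copied from the div

# Placing the node under another parent keeps its own attributes only.
other = ET.Element("section")
other.append(red_p)
round_trip = ET.tostring(other, encoding="unicode")
```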

Right now, I'd like to see the MathML code improved and revised to make it clean and type-annotated. This includes removing the hard-coded characters. And I think that through that activity we will get a better sense of how tree-transformation code should work. This is applicable to the other render or boxing functions as well.

This is part of the process..

While this MathML code is significantly better than what we had before, there is still a little way to go. And I think it is very beneficial to do this before moving on to other, more complex forms, like 2-Dimensional character-oriented output, etc.

The time I have to make very large changes is running out (I am going back to work this week). #1643, #1661 and #1663 complete what I can finish before the release. With them in, it is easy to make progress in specific parts (e.g., how SVG or prettyprint are rendered) without making structural changes. Reformulating BuiltinElement, for example, is out of scope for me now. Also, the formatting of infix, prefix and postfix operators needs more work, which I am not sure I will be available to do before my next holidays, but at least those should be more or less localized changes.

@rocky
Member Author

rocky commented Feb 2, 2026

Here we have specific code, and presumably that helps to understand better what I'm talking about.

good

On to the more high-level ideas that we should be following to match conventional expression translation code like this.

Just one thing: I know maybe I sound like the least indicated person to ask this, but in order to follow the discussion, let's try to change the different aspects of the implementation in different PRs. Until here, I see four main aspects to correct

  • the name of certain identifiers, and their type annotations. Example: (self->boxes) or box_options->box_attributes. In the first we agree, in the second, I am not sure.

I don't understand which part you agree on and which part you don't.

  • How special characters are handled at render time.

That I haven't started to address, but it needs to be addressed.

  • the mechanism to pass information on render time from the function that renders a container to the function that renders the contained objects and backwards.
  • the instantiation of BoxExpressions and their attributes.

Would you mind splitting these changes into different PRs?

I don't mind splitting this into different PRs. But we should not move forward with new rendering and boxing work, until the code we currently have both in the PR and in the master, has been cleaned up.

Otherwise, we are proliferating bad patterns. I am sorry I didn't notice and catch this sooner.

So which aspect would you like a PR for?

@mmatera
Contributor

mmatera commented Feb 2, 2026

So which aspect would you like a PR for?

@rocky, from what is in now, I like the change from self to box, and adding annotations in the mathics.format.render modules.
Regarding renaming box_options to box_attributes, I am not sure, but if you think it is a better option, I am OK with that.
Regarding the change from _indent_level to indent_level, I think it is OK since we do not expect to have an option with that name. In any case, what I like most about this is that we can start to discuss design details in a nearly common language.

I don't mind splitting this into different PRs. But we should not move forward with new rendering and boxing work, until the code we currently have both in the PR and in the master, has been cleaned up.

Thanks! And I agree, at this point we need to be on the same page to go forward.

Otherwise, we are proliferating bad patterns. I am sorry I didn't notice and catch this sooner.

So which aspect would you like a PR for?

@rocky
Member Author

rocky commented Feb 2, 2026

@mmatera: This is far from complete, but I wanted to get something out earlier for you to look at and think about.

Thanks for tackling this. This refactor is large, and it is good to have another point of view.

So a simple thing to do is rename box.options to box.properties or box.attributes. If we want to keep options for specified user options, that's okay.

The point is that in this case, I used the word box_options, because there are BoxExpressions that have options. The main examples are StyleBox, InterpretationBox, PaneBox and GraphicsBox, which have many options that affect the contained object. For example, the result of F[x]//InputForm//MakeBoxes is

When something is an option, it is fine and proper to call it an option. Computed properties like indent level and bounding box information are not options; they are box attributes. Storing them into a dictionary called options is a code smell.

The problem with this is that this kind of thing is an unspecified, unchecked, untyped bag of whatever. And here, we can do better. There is now box_options, but instead of a generic dict, that would be better expressed as a Python dataclass, which is typed.

Again, it is a bag because what we want to share is very general. A more explicit way to specify what that parameter is would be to pass a dictionary instead of keyword arguments. But keyword arguments in Python are dictionaries. They also have the convenience that you do not need to explicitly copy the dictionaries to ensure that a change of one element at a certain level does not affect the content of the dictionary at another level (Python does it by default).

Each Box type has very specific attributes that it needs, such as for certain LaTeX and MathML boxes, whether there are multiple lines, and possibly in the future, the maximum width in characters of the lines. This is not bag-like; we know in advance which box attributes are used for which kinds of boxes, and can specify a hierarchy for these. So this stuff should follow conventional OO and Python practice and should be attributes of the box object.

For things like options on built-in commands, which are more varied and can be added at will, sure, the general dictionary mechanism is sometimes very convenient. But we should limit this to when it is needed. It is not needed in specifying box attributes. As has been said many times, a problem with dictionaries when used for things like box attributes is that you lose the ability to check attribute names (called a "key" in dictionary parlance), and you also lose the type information in the value, since the type has to cover all possible values covered by keys.
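A minimal sketch of the name-checking point. BoxAttributes and its fields are hypothetical, and the use of __slots__ is just one way to get a runtime error for an unknown name; a static type checker would flag the same typo on an ordinary dataclass without running the code.

```python
from dataclasses import dataclass


@dataclass
class BoxAttributes:
    """Hypothetical; __slots__ makes unknown attribute names an error."""

    __slots__ = ("indent_level", "max_width")
    indent_level: int
    max_width: int


as_dict = {"indent_level": 0, "max_width": 80}
as_dict["indent_levle"] = 3  # typo silently creates a new key

attrs = BoxAttributes(indent_level=0, max_width=80)
try:
    attrs.indent_levle = 3   # same typo on the typed object
except AttributeError:
    typo_caught = True
else:
    typo_caught = False
```

In the dictionary version the misspelled key goes unnoticed and indent_level keeps its old value; in the typed version the mistake surfaces at the point where it is made.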

@mmatera
Contributor

mmatera commented Feb 2, 2026

When something is an option, it is fine and proper to call it an option. Computed properties like indent level and bounding box information are not options; they are box attributes. Storing them into a dictionary called options is a code smell.

On this part I agree: in the signature of render functions, **options should be called **kwargs, as is typically done in Python. Again, I used options there (not box_options, the attribute of BoxExpression) because it appeared that way in existing code. I would like to make this distinction more explicit.

Each Box type has very specific attributes that it needs, such as for certain LaTeX and MathML boxes, whether there are multiple lines, and possibly in the future, the maximum width in characters of the lines. This is not bag-like; we know in advance which box attributes are used for which kinds of boxes, and can specify a hierarchy for these. So this stuff should follow conventional OO and Python practice and should be attributes of the box object.

First, there are no LaTeX/MathML boxes: there are Boxes that eventually can be rendered as an SVG picture, MathML code or LaTeX code (and eventually, as a PNG picture). Some BoxExpressions have specific options, while others do not: there are box expressions like Graphics that have a lot of options that modify some of their contained objects.

For things like options on built-in commands, which are more varied and can be added at will, sure, the general dictionary mechanism is sometimes very convenient. But we should limit this to when it is needed. It is not needed in specifying box attributes. As has been said many times, a problem with dictionaries when used for things like box attributes is that you lose the ability to check attribute names (called a "key" in dictionary parlance), and you also lose the type information in the value, since the type has to cover all possible values covered by keys.

The check of whether the options received are the right ones happens at evaluation time ($OptionSyntax in Builtin.options defines that behavior). Also, because they are stored in a dictionary, repeated options are not allowed. Also, we can make some extra checks in the __init__ method.

@rocky
Member Author

rocky commented Feb 2, 2026

The time I have to make very large changes is running out (I am going back to work this week). #1643, #1661 and #1663 complete what I can finish before the release. With them in, it is easy to make progress in specific parts (e.g., how SVG or prettyprint are rendered) without making structural changes. Reformulating BuiltinElement, for example, is out of scope for me now. Also, the formatting of infix, prefix and postfix operators needs more work, which I am not sure I will be available to do before my next holidays, but at least those should be more or less localized changes.

Thanks for the information. I just looked at these PRs. Right now, I think we can get the features covered in the PRs merged in soon.

Specifics:

@rocky
Member Author

rocky commented Feb 2, 2026

When something is an option, it is fine and proper to call it an option. Computed properties like indent level and bounding box information are not options; they are box attributes. Storing them into a dictionary called options is a code smell.

On this part I agree: in the signature of render functions, **options should be called **kwargs, as is typically done in Python. Again, I used options there (not box_options, the attribute of BoxExpression) because it appeared that way in existing code. I would like to make this distinction more explicit.

Whether this is called **options or **kwargs, I don't really care. The bigger concern for me is not to put stuff in here that doesn't belong. Specifically, box attributes.

First, there are no LaTeX/MathML boxes:

These routines work off of various kinds of Boxes.

there are Boxes that eventually can be rendered as an SVG picture,
MathML code or LaTeX code (and eventually, as a PNG picture). Some BoxExpressions have specific options, while others do not: there are box expressions like Graphics that have a lot of options that modify some of their contained objects.

Again, if something is an option, it should stay an option. Just don't put bounding box and box attributes into the options dictionary. Instead it is an attribute of the box object.

The check of whether the options received are the right ones happens at evaluation time ($OptionSyntax in Builtin.options defines that behavior). Also, because they are stored in a dictionary, repeated options are not allowed. Also, we can make some extra checks in the __init__ method.

For things that are options, that's great. For things that are box attributes, we should follow Python annotation for type checking. I feel like a broken record repeating stuff: options are options and (box) attributes are attributes. Things that are "computed on the fly" are attributes.

@rocky
Member Author

rocky commented Feb 2, 2026

@mmatera Looking over the discussion so far, the one thing that saddens me the most is that I don't see an acknowledgement or understanding that the compiler design pattern that should be used here is one of thinking of attributes associated with the expression nodes (here, "boxes"), and there is a pattern of information propagating up and down the (expression) tree. Code should be written informed by this principle.

Information passed down the expression tree right now is done via **options or **kwargs, but I believe this single catch-all parameter should be split in two: options of the kind that you've been going on ad nauseam about, versus box (or more generally expression node) attributes.

Strictly speaking, though we don't need to pass two parameters, we can do that in the parent by assigning (often via copy) to each child box from the parent before the call involving child nodes. I did that somewhere in the draft to show how that's done.
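A sketch of that parent-assigns-to-child idea, using hypothetical Attrs and Box classes rather than the code in the draft. The parent copies its attribute object onto each child and adjusts it before recursing, so the render function keeps a single extra parameter for genuine options.

```python
import copy
from dataclasses import dataclass, field
from typing import List


@dataclass
class Attrs:
    """Hypothetical computed attributes of a box."""

    indent_level: int = 0
    multiline: bool = False


@dataclass
class Box:
    """Hypothetical box node."""

    children: List["Box"] = field(default_factory=list)
    attrs: Attrs = field(default_factory=Attrs)


def render(box: Box, **options) -> str:
    parts = []
    for child in box.children:
        # Assign (via copy) from the parent before rendering the child.
        child.attrs = copy.copy(box.attrs)
        child.attrs.indent_level += 1
        parts.append(render(child, **options))
    head = "  " * box.attrs.indent_level + "<box>"
    return head + "".join("\n" + part for part in parts)


root = Box([Box(), Box([Box()])])
out = render(root, color="blue")
```

Because each child gets its own copy, adjusting a child's indent_level never changes the parent's value.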

@mmatera
Contributor

mmatera commented Feb 2, 2026

@mmatera Looking over the discussion so far, the one thing that saddens me the most is that I don't see an acknowledgement or understanding that the compiler design pattern that should be used here is one of thinking of attributes associated with the expression nodes (here, "boxes"), and there is a pattern of information propagating up and down the (expression) tree.

I think I understand and acknowledge the pattern. What I do not agree with is that the information that propagates should be attached to the BoxExpression, just as you would not make a C compiler store the information used in compilation inside the source code. As I see, Box Expressions are the equivalent to the source code, and the MathML output is the object code. The place where I think the information used in compilation should be stored is in the kwargs dictionary. Could be also done in many other ways, I think.

Code should be written informed by this principle.

Information passed down the expression tree right now is done via **options or **kwargs, but I believe this single catch-all parameter should be split in two: options of the kind that you've been going on ad nauseam about, versus box (or more generally expression node) attributes.

OK, I agree with this. The detail is how to do that in a way that plays well with the dispatch table.
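One possible shape for this, as a sketch only: every name here (Box, RowBox, FractionBox, DISPATCH, make_renderer) is a hypothetical stand-in, not the real boxes_to_ machinery. All renderers in the table share one signature, taking the node plus user options, while computed attributes such as indent_level travel on the node itself.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Type


@dataclass
class Box:
    """Hypothetical base node; not the real BoxExpression."""

    elements: List["Box"] = field(default_factory=list)
    indent_level: int = 0  # computed attribute, lives on the node


class RowBox(Box):
    pass


class FractionBox(Box):
    pass


# Every renderer has the same signature: the node plus user options only.
Renderer = Callable[[Box, Dict[str, Any]], str]
DISPATCH: Dict[Type[Box], Renderer] = {}


def render(box: Box, options: Dict[str, Any]) -> str:
    """Look up the renderer for this box type in the dispatch table."""
    return DISPATCH[type(box)](box, options)


def make_renderer(tag: str) -> Renderer:
    def renderer(box: Box, options: Dict[str, Any]) -> str:
        pad = " " * box.indent_level
        inner = []
        for child in box.elements:
            child.indent_level = box.indent_level + 1  # set on the child node
            inner.append(render(child, options))       # options pass unchanged
        if not inner:
            return f"{pad}<{tag}></{tag}>"
        return "\n".join([f"{pad}<{tag}>", *inner, f"{pad}</{tag}>"])

    return renderer


DISPATCH[RowBox] = make_renderer("mrow")
DISPATCH[FractionBox] = make_renderer("mfrac")

doc = render(FractionBox([RowBox(), RowBox()]), {"color": "blue"})
```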

Strictly speaking, though we don't need to pass two parameters, we can do that in the parent by assigning (often via copy) to each child box from the parent before the call involving child nodes. I did that somewhere in the draft to show how that's done.

@rocky
Member Author

rocky commented Feb 2, 2026

I think I understand and acknowledge the pattern.

Good to hear.

What I do not agree with is that the information that propagates should be attached to the BoxExpression, just as you would not make a C compiler store the information used in compilation inside the source code.

A compiler is way more complicated than this. The aspect I am talking about here is part of what is called the "front-end" of a compiler and you do find it in interpreters as well. Think of Python's AST structure.

As I see, Box Expressions are the equivalent to the source code, and the MathML output is the object code.

OK. So take an interpreter like Perl, which doesn't have an AST structure, but has an interpreter tree called Optree, which it runs off of. Again, when doing transformations, structurally, it helps the organization to think of the information as passing through the tree instead of via parameters.

The place where I think the information used in compilation should be stored is in the kwargs dictionary. Could be also done in many other ways, I think.

Although there are always many ways to do things, my personal experience with this kind of transformation with large code bases is that things are more comprehensible when you think about and work with stuff in this node-centric way. I talked about this in https://rocky.github.io/YAPC2018-deparse/#/9 https://rocky.github.io/YAPC2018-deparse/#/9/1 https://rocky.github.io/YAPC2018-deparse/#/10. A really poorly-presented talk I gave on this is https://youtu.be/gREriCbwW8E?si=otz2X-cRBqv3UdPP&t=1001

@mmatera
Contributor

mmatera commented Feb 2, 2026

I think I understand and acknowledge the pattern.

Good to hear.

What I do not agree with is that the information that propagates should be attached to the BoxExpression, just as you would not make a C compiler store the information used in compilation inside the source code.

A compiler is way more complicated than this. The aspect I am talking about here is part of what is called the "front-end" of a compiler and you do find it in interpreters as well. Think of Python's AST structure.

Is Python's AST structure modified during its conversion to bytecode?

As I see, Box Expressions are the equivalent to the source code, and the MathML output is the object code.

OK. So take an interpreter like Perl, which doesn't have an AST structure, but has an interpreter tree called Optree, which it runs off of. Again, when doing transformations, structurally, it helps the organization to think of the information as passing through the tree instead of via parameters.

OK, but I would look not at every compiler/renderer but at the ones with a similar interface. How do HTML/SVG renderers work? How does Python's xml library do the conversion from text to a tree structure and back to text?

The place where I think the information used in compilation should be stored is in the kwargs dictionary. Could be also done in many other ways, I think.

Although there are always many ways to do things, my personal experience with this kind of transformation with large code bases is that things are more comprehensible when you think about and work with stuff in this node-centric way. I talked about this in https://rocky.github.io/YAPC2018-deparse/#/9 https://rocky.github.io/YAPC2018-deparse/#/9/1 https://rocky.github.io/YAPC2018-deparse/#/10. A really poorly-presented talk I gave on this is https://youtu.be/gREriCbwW8E?si=otz2X-cRBqv3UdPP&t=1001

OK, then my question is: can we implement the mathml render in this way, without changing the design of the boxes_to_format methods by dispatch tables? Shall we change the implementation of it? Or can the changes be restricted to the mathics.format.render.mathml module?

@rocky
Member Author

rocky commented Feb 2, 2026 via email

@mmatera
Contributor

mmatera commented Feb 2, 2026

I am telling you that this is a common compiler pattern

OK, and I trust you that it is. I just mention that how Perl's interpreter works does not strike me as the most relevant example.

So, regarding my questions

OK, then my question is: can we implement the mathml render in this way,
without changing the design of the boxes_to_format methods by dispatch
tables? Shall we change the implementation of it? Or can the changes be
restricted to the mathics.format.render.mathml module?

If the changes just require adding some attributes to the BoxExpression subclasses and modifying mathics.format.render.mathml, I will take your implementation without more questions. If it requires more changes, please let me understand these changes before proceeding.
