Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unicode math properties #2425

Merged
merged 10 commits into from
Oct 10, 2024
Merged

Unicode math properties #2425

merged 10 commits into from
Oct 10, 2024

Conversation

brucemiller
Copy link
Owner

Here's an interesting, perhaps controversial, PR: It enhances decodeMathChar to also lookup math properties (role, meaning,...) for the decoded glyph from the Unicode module. This provides some parity between symbols defined "semantically" by bindings and low-level \mathchar, in the simple cases. This covers most of what TeX's plain defines, as it defines them!

@brucemiller brucemiller requested a review from dginev October 8, 2024 20:12
Copy link
Collaborator

@dginev dginev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I actually quite like the direction of the PR. Maybe I don't yet understand enough to register the controversial piece?

I left a few minor comments, and I can't say I understand the full details of the TeX interplay yet, but the grammar-near pieces seem encouraging.

I especially like the changes in the tests, where the math XML markup has become significantly more enhanced.

lib/LaTeXML/Core/Definition/CharDef.pm Show resolved Hide resolved
($local ? Tokens(T_CS('\char'), $value->revert, T_CS('\relax')) : $$self{cs})); }
else { # Else math mode, mathDecode!
my ($glyph, $f, $rev, %props) = LaTeXML::Package::decodeMathChar($nvalue);
if (!defined $props{name}) { # TEMPORARY HACK ?????
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there anything to add to that comment? E.g. what is missing for the hack to go away?

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Clarified; thanks!

$whatsit->setProperty(font => LookupValue('font')->specialize($glyph)) if $glyph;
my ($glyph, $f, $rev, %props) = decodeMathChar($n);
$whatsit->setProperty(glyph => $glyph) if $glyph;
$whatsit->setProperties(%props);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very minor, but you could also guard that setter call:

$whatsit->setProperties(%props) if %props;

DefConstructor('\mathclose Digested', "?#1(<ltx:XMWrap role='CLOSE'>#1</ltx:XMWrap>)()", bounded => 1);
DefConstructor('\mathpunct Digested', "?#1(<ltx:XMWrap role='PUNCT'>#1</ltx:XMWrap>)()", bounded => 1);
DefConstructor('\mathinner Digested', "?#1(<ltx:XMWrap role='ATOM'>#1</ltx:XMWrap>)()", bounded => 1);
our %mathclass_subclass = (
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This nested hash has the feeling of a mini-grammar... Interesting.

How about three other ideas:

  • PUNCT => { PERIOD => 1 }
  • ID => { NUMBER => 1 }
  • BIGOP => { SUMOP => 1, INTOP => 1, LIMITOP => 1, DIFFOP => 1 }

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool!, yep, that's the idea!

my $font = $curfont->merge(%$fontinfo);
$fontinfo = $defn && $defn->isFontDef;
if ($$basefontinfo{size} != $curfont->getSize) { # If we've gotten an explicit font SIZE change; Adjust!
$fontinfo = {%$fontinfo}; $$fontinfo{size} = $curfont->getSize; } }
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a small theoretical chance $fontinfo is undef here, so maybe default it to {} ?

 $fontinfo = ($defn && $defn->isFontDef) || {};

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks

$fontinfo = {%$fontinfo}; $$fontinfo{size} = $curfont->getSize; } }
my $font = $curfont->merge(%$fontinfo);
if ($downsize > 0) { $font = $curfont->merge(scripted => 1); }
if ($downsize > 1) { $font = $curfont->merge(scripted => 1); }
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These merge calls have the same value? Do you intend it to do the same merge twice when >1 ?

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep, twice for scriptscript

#======================================================================
UTF(0x5C) => { role => 'ADDOP', meaning => 'set-minus' }, # \backslash
UTF(0xAC) => { role => 'BIGOP', meaning => 'not' }, # \lnot
UTF(0xAC) => { role => 'BIGOP', meaning => 'not' }, # \neg
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

0xAC is given 3 separate times here? Unsure how that happened. There may be other cases, I see 0x5C is also there twice.

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

editing mistake. fixed.

"ker" => { role => 'OPFUNCTION', meaning => 'kernel' }, # \ker #
"lg" => { role => 'OPFUNCTION' }, # \lg #
"lim" => { role => 'LIMITOP', meaning => 'limit', need_scriptpos => 1 }, # \lim #
"lim inf" => { role => 'LIMITOP', meaning => 'limit-infimum', need_scriptpos => 1 }, # \liminf #
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting, first I assume single characters, then single words (no spaces), but I see some entries even have a space.

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, these don't actually get fired, yet; it would need to be hooked into the math ligature that combines "s,i,n" into "sin"; and then looked up! I suspect "lim inf" will likely fail that test. I kinda left them all in as a reminder...

@brucemiller brucemiller merged commit 9f07a49 into master Oct 10, 2024
26 checks passed
@brucemiller brucemiller deleted the unicode-math-properties branch October 10, 2024 20:53
teepeemm pushed a commit to teepeemm/LaTeXML that referenced this pull request Oct 29, 2024
* New unicode_math_properties(char) gets the math properties (role, meaning, ...) associated with a unicode char

* New unicode_math_properties(char) gets the math properties (role, meaning, ...) associated with a unicode char

* Revise API for decodeMathChar to return the math properties associated with the glyph, after decoding and font adjustment

* Revise CharDef->new API to take only the code, deferring decoding until digestsion; use new decodeMathChar API

* Updates for improved decodeMathChar and CharDef->new API

* Update fontmap for some bigops

* Have \mathop and friends intelligently merge the requested TeX math class with any already assigned grammatical role

* Update tests for more 'semantic' when low-level glyphs are accessed

* Refinements & clarifications suggested by D.Ginev

* Add guard
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants