name-parser: Allow dashes between modifier and weight

[why]
Some fonts have a non-standard (i.e. broken) weight naming scheme:
they put a blank or a dash between the modifier and the weight, for
example "Extra Bold" or "Demi-Condensed" when they mean "ExtraBold"
and "DemiCondensed" respectively.

The former happens with CartographCF, the latter with IBM3270.

[how]
Automatically allow a dash between modifier and weight. That position
shows up as a CamelCase boundary in the token, so insert an optional
dash (r'-?') at each such boundary.

For the subsequent lookup we need to remove the dash from the matched
keyword, if there is one, to get back to the standard naming.

This might break if the font name ends in a modifier, because we
cannot really distinguish between these two readings:
  Font Name Extra Bold Italic
  => Font Name - ExtraBold Italic
  => Font Name Extra - Bold Italic

The known modifiers are 'Demi', 'Ultra', 'Semi', 'Extra'.
It is possible but unlikely that a font name ends in one of these,
for example "Modern Ultra - Bold".

[note]
The question arises whether we should parse the PSname instead of the
Fullname, and stick to the dash there as the boundary. The problem
would be prepatched fonts with broken naming, which would then be
parsed completely wrong. So the current approach may still be the
best, with the caveat given above (font names ending in a modifier).

[note 2]
Funnily enough, the variable allow_regex_token was not used at all :->
Some leftover? Anyhow, we use it now.

[note 3]
We still cannot remove the special handling for IBM3270, because the
font name initially looks like a PSname and is parsed as such, which
breaks the name in the wrong place:
  PSname template  = "Name-StylesWeights"
  Fullname of 3270 = "IBM 3270 Semi-Condensed"

Signed-off-by: Fini Jastrow <ulf.fini.jastrow@desy.de>
ryanoasis · May 26, 2023 · bb21d28
1 parent 30d7317
Showing 1 changed file with 9 additions and 1 deletion.
diff --git a/bin/scripts/name_parser/FontnameTools.py b/bin/scripts/name_parser/FontnameTools.py
@@ -150,7 +150,12 @@ def get_name_token(name, tokens, allow_regex_token = False):
         not_matched = ""
         all_tokens = []
         j = 1
-        regex = re.compile('(.*?)(' + '|'.join(tokens) + ')(.*)', re.IGNORECASE)
+        token_regex = '|'.join(tokens)
+        if not allow_regex_token:
+            # Allow a dash between CamelCase token word parts, i.e. Camel-Case
+            # This allows for styles like Extra-Bold
+            token_regex = re.sub(r'(?<=[a-z])(?=[A-Z])', '-?', token_regex)
+        regex = re.compile('(.*?)(' + token_regex + ')(.*)', re.IGNORECASE)
         while j:
             j = regex.match(name)
             if not j:
@@ -159,6 +164,9 @@ def get_name_token(name, tokens, allow_regex_token = False):
                 sys.exit('Malformed regex in FontnameTools.get_name_token()')
             not_matched += ' ' + j.groups()[0] # Blanc prevents unwanted concatenation of unmatched substrings
             tok = j.groups()[1].lower()
+            if not allow_regex_token:
+                # Remove dashes between CamelCase token words
+                tok = tok.replace('-', '')
             if tok in lower_tokens:
                 tok = tokens[lower_tokens.index(tok)]
             tok = FontnameTools.unify_style_names(tok)
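
The dash handling in the two hunks above can be tried out in isolation. The following is a minimal sketch, not the actual FontnameTools code: `build_token_regex`, `extract_tokens` and the `TOKENS` list are hypothetical names made up for illustration, and the real `get_name_token()` does more (for example `unify_style_names()` and the malformed-regex check). It only shows the two steps this commit adds: inserting `-?` at CamelCase boundaries, and stripping the dash from the match before the lookup.

```python
import re

# Hypothetical token list for illustration only; the real lists live in FontnameTools.py.
TOKENS = ['ExtraBold', 'UltraBold', 'DemiCondensed', 'SemiCondensed', 'Bold', 'Italic']


def build_token_regex(tokens):
    """Join tokens into an alternation, allowing an optional dash at CamelCase boundaries."""
    token_regex = '|'.join(tokens)
    # Zero-width match between a lowercase and an uppercase letter (Extra|Bold),
    # replaced by an optional dash: ExtraBold -> Extra-?Bold
    return re.sub(r'(?<=[a-z])(?=[A-Z])', '-?', token_regex)


def extract_tokens(name, tokens):
    """Simplified stand-in for get_name_token(): returns (unmatched name, found tokens)."""
    regex = re.compile('(.*?)(' + build_token_regex(tokens) + ')(.*)', re.IGNORECASE)
    lower_tokens = [t.lower() for t in tokens]
    not_matched = ''
    found = []
    while True:
        m = regex.match(name)
        if not m:
            break
        not_matched += ' ' + m.group(1)  # blank prevents concatenation of unmatched parts
        # Remove the optional dash so the lookup sees the standard spelling
        tok = m.group(2).lower().replace('-', '')
        found.append(tokens[lower_tokens.index(tok)])
        name = m.group(3)
    not_matched += ' ' + name
    return not_matched.strip(), found


print(build_token_regex(['ExtraBold', 'Bold']))
# Extra-?Bold|Bold
print(extract_tokens('IBM 3270 Semi-Condensed', TOKENS))
# ('IBM 3270', ['SemiCondensed'])
print(extract_tokens('Font Extra-Bold Italic', TOKENS))
# ('Font', ['ExtraBold', 'Italic'])

# The caveat from the commit message: a family actually called "Modern Ultra"
# in weight Bold is indistinguishable from "Modern" in weight UltraBold.
print(extract_tokens('Modern Ultra-Bold', TOKENS))
# ('Modern', ['UltraBold'])
```

The last call shows why the message calls this a caveat: once the dash is optional, the parser always prefers the combined modifier-plus-weight reading.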