Skip to content
Merged
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
98 changes: 98 additions & 0 deletions scielo_classic_website/spsxml/sps_xml_pipes.py
Original file line number Diff line number Diff line change
Expand Up @@ -86,6 +86,7 @@ def _process(document):
XMLStylePipe(),
XMLArticleMetaCountsPipe(),
XMLNormalizeSpacePipe(),
XMLSupToXrefPipe(),
XMLDeleteRepeatedElementWithId(),
XMLDeleteRepeatedTranslations(),
XMLFontFaceSymbolPipe(),
Expand Down Expand Up @@ -645,3 +646,100 @@ def transform(self, data):

ET.strip_tags(xml, "REMOVEFONTFACESYMBOL")
return data


class XMLSupToXrefPipe(plumber.Pipe):
"""
Transforma elementos <sup> com conteúdo numérico em <xref> quando eles
representam referências bibliográficas (bibr) ou notas de rodapé (fn).

Critérios de análise:
- sup deve ter conteúdo numérico
- Verifica se não existem xref[@ref-type='bibr'] (evita duplicação)
Copy link

Copilot AI Dec 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Documentation issue: The docstring states "Verifica se não existem xref[@ref-type='bibr'] (evita duplicação)" but there's no code that actually checks for existing xref elements. This check is mentioned in the PR description but not implemented. Consider adding this check before transforming sup elements, or remove this line from the documentation.

Copilot uses AI. Check for mistakes.
Copy link

Copilot AI Dec 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Documentation inconsistency: The docstring states "Verifica se não existem xref[@ref-type='bibr'] (evita duplicação)" (Checks if bibr xrefs don't exist to avoid duplication), but the code doesn't actually perform this check. There's no XPath query to verify existing xref elements before transformation.

If duplication prevention is intended, the code should include a check like:

existing_xrefs = xml.xpath(".//xref[@ref-type='bibr']")
if existing_xrefs:
    return data  # or handle differently

Otherwise, remove this claim from the documentation.

Copilot uses AI. Check for mistakes.
- Compara valores numéricos com ref/label e fn/label existentes
- Prioriza bibr sobre fn quando há ambiguidade
- Na dúvida, mantém como sup
"""

def transform(self, data):
raw, xml = data

sups = xml.xpath(".//sup")
if not sups:
return data

sup_values = set()
for sup in list(sups):
if sup.find("xref") is not None:
continue
parent = sup.getparent()
if parent.tag == "xref":
continue
text = "".join(sup.itertext()).strip()
if text and text.isdigit():
sup_values.add(text)
if not sup_values:
return data

ids, numeric_labels = self.get_ids_and_labels(xml, "ref", "mixed-citation")
done = self._convert_sup_to_xref(sups, sup_values, ids, numeric_labels, "bibr")
if done:
return data

ids, numeric_labels = self.get_ids_and_labels(xml, "fn")
done = self._convert_sup_to_xref(sups, sup_values, ids, numeric_labels, "fn")
if done:
return data

Copy link

Copilot AI Dec 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Trailing whitespace detected on this line. Please remove the extra space(s) at the end of line 693.

Suggested change

Copilot uses AI. Check for mistakes.
return data
Comment on lines +651 to +694
Copy link

Copilot AI Dec 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing test coverage: The new XMLSupToXrefPipe class lacks test coverage. Given that the test file tests/test_sps_xml_pipes.py contains tests for other pipe classes like XMLFontFaceSymbolPipe, tests should be added for this new pipe to ensure:

  • Correct conversion of numeric <sup> elements to <xref ref-type="bibr">
  • Correct conversion of numeric <sup> elements to <xref ref-type="fn">
  • Elements that should not be converted remain unchanged (non-numeric, already wrapped, ambiguous cases)
  • Edge cases like missing labels, missing IDs, etc.

Copilot uses AI. Check for mistakes.

def get_ids_and_labels(self, xml, from_tag, subtag=None):
elem_ids = {}
numeric_labels = set()
for elem in xml.xpath(f".//{from_tag}"):
node = elem
if subtag:
node = elem.find(subtag)
label = elem.findtext("label") or self._extract_numeric_label_from_node(node)
if label:
numeric_labels.add(label)
elem_ids[label] = elem.get("id")
Comment on lines +704 to +706
Copy link

Copilot AI Dec 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Potential issue with missing IDs: If a ref or fn element lacks an id attribute, elem.get("id") returns None (line 706), which gets stored in the elem_ids dictionary. Later in _convert_sup_to_xref (line 714), this could result in setting rid to None on the xref element (line 720), which would be invalid XML. Consider adding a check:

elem_id = elem.get("id")
if label and elem_id:
    numeric_labels.add(label)
    elem_ids[label] = elem_id
Suggested change
if label:
numeric_labels.add(label)
elem_ids[label] = elem.get("id")
elem_id = elem.get("id")
if label and elem_id:
numeric_labels.add(label)
elem_ids[label] = elem_id

Copilot uses AI. Check for mistakes.
return elem_ids, numeric_labels

def _convert_sup_to_xref(self, sups, sup_values, ids, numeric_labels, ref_type):
total = 0
if sup_values.issubset(numeric_labels):
for sup in list(sups):
sup_text = "".join(sup.itertext()).strip()
rid = ids.get(sup_text)
if not rid:
continue
parent = sup.getparent()
xref = ET.Element("xref")
xref.set("ref-type", ref_type)
xref.set("rid", rid)
xref.append(deepcopy(sup))
parent.replace(sup, xref)
total += 1
Comment on lines +712 to +723
Copy link

Copilot AI Dec 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Logic issue: The _convert_sup_to_xref method doesn't check if the sup element already contains a child xref or if its parent is an xref (as done in lines 673-677). This could lead to incorrect transformations of sup elements that should be skipped. Add the same validation checks here:

for sup in list(sups):
    if sup.find("xref") is not None:
        continue
    parent = sup.getparent()
    if parent.tag == "xref":
        continue
    sup_text = "".join(sup.itertext()).strip()
    # ... rest of the logic

Copilot uses AI. Check for mistakes.
return total

def _extract_numeric_label_from_node(self, node):
"""
Extrai o label numérico inicial de um elemento mixed-citation.

Returns:
str: O label numérico extraído ou string vazia se não encontrado
Copy link

Copilot AI Dec 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The docstring mentions returning "string vazia" (empty string) when not found, but the method actually returns None (line 709, 712). The documentation should be corrected to accurately reflect the return value: str or None: O label numérico extraído ou None se não encontrado.

Suggested change
str: O label numérico extraído ou string vazia se não encontrado
str or None: O label numérico extraído ou None se não encontrado

Copilot uses AI. Check for mistakes.
Comment on lines +728 to +731
Copy link

Copilot AI Dec 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unclear docstring: The docstring says this method extracts numeric labels from "mixed-citation" elements, but the method actually works with any XML node, not just mixed-citation elements. It's called for both mixed-citation nodes (line 681) and fn nodes (line 696).

Consider updating the docstring to be more generic:

"""
Extrai o label numérico inicial de um nó XML.

Returns:
    str: O label numérico extraído ou None se não encontrado
"""
Suggested change
Extrai o label numérico inicial de um elemento mixed-citation.
Returns:
str: O label numérico extraído ou string vazia se não encontrado
Extrai o label numérico inicial de um XML.
Returns:
str: O label numérico extraído ou None se não encontrado

Copilot uses AI. Check for mistakes.
"""
if node is None:
return None
text = "".join(node.itertext()).strip()
if not text or not text[0].isdigit():
return None
Comment on lines +734 to +737
Copy link

Copilot AI Dec 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Documentation inconsistency: The docstring states the method returns "string vazia se não encontrado" (empty string if not found), but the code actually returns None on lines 718 and 721 when no label is found.

Either update the docstring to say "None se não encontrado" or change the code to return empty string "" instead of None for consistency.

Suggested change
return None
text = "".join(node.itertext()).strip()
if not text or not text[0].isdigit():
return None
return ""
text = "".join(node.itertext()).strip()
if not text or not text[0].isdigit():
return ""

Copilot uses AI. Check for mistakes.

label = ""
for char in text:
if char.isdigit():
label += char
else:
break
return label