Skip to content

Conversation

@robertatakenaka
Copy link
Member

@robertatakenaka robertatakenaka commented Dec 3, 2025

Adiciona XMLSupToXrefPipe para converter elementos sup numéricos em xref

Descrição

Esta PR implementa um novo pipe de transformação XML (XMLSupToXrefPipe) que converte automaticamente elementos <sup> com conteúdo numérico em elementos <xref> apropriados quando representam referências bibliográficas ou notas de rodapé.

Motivação

Documentos XML frequentemente contêm elementos <sup> que na verdade são referências bibliográficas ou notas de rodapé, mas não estão marcados corretamente como <xref>. Isso pode afetar o processamento e a renderização adequada desses elementos no sistema.

Alterações implementadas

  • Novo pipe XMLSupToXrefPipe: Transforma elementos <sup> numéricos em <xref ref-type="bibr"> ou <xref ref-type="fn"> baseado na análise do conteúdo
  • Análise inteligente:
    • Identifica elementos <sup> com conteúdo numérico
    • Extrai labels numéricos de elementos <ref> e <fn> existentes
    • Compara valores para determinar o tipo correto de referência
    • Prioriza referências bibliográficas (bibr) sobre notas de rodapé (fn) em casos ambíguos
  • Método auxiliar _extract_numeric_label: Extrai labels numéricos do início de elementos mixed-citation
  • Prevenção de duplicação: Verifica xref existentes antes de aplicar transformações
  • Integração ao pipeline: Pipe adicionado à sequência de processamento XML

Impacto

  • Melhora a qualidade semântica do XML processado
  • Garante marcação correta de referências e notas
  • Mantém compatibilidade com o processamento existente
  • Não altera elementos quando há incerteza sobre o tipo correto

Como testar

  1. Processar documentos XML contendo elementos <sup> numéricos
  2. Verificar se elementos <sup> que correspondem a refs ou fns são convertidos corretamente
  3. Confirmar que elementos <sup> ambíguos ou não-referências permanecem inalterados
  4. Validar que não há duplicação de elementos <xref>

Copilot AI review requested due to automatic review settings December 3, 2025 13:32
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR attempts to add a new XMLSupToXrefPipe transformation pipe that converts <sup> elements with numeric content into <xref> elements when they represent bibliographic references (bibr) or footnotes (fn). However, the implementation is incomplete and contains critical bugs that prevent it from functioning.

Key Issues

  • The transform method contains an incomplete if statement (line 698) causing a syntax error
  • Method naming mismatch: calls to _extract_numeric_label_from_mixed_citation reference a non-existent method
  • Multiple collected variables (sup_values, ref_numeric_labels, fn_numeric_labels, ref_labels, fn_labels, refs, fns) are never used in any transformation logic
Comments suppressed due to low confidence (1)

scielo_classic_website/spsxml/sps_xml_pipes.py:698

  • Syntax Error (in Python 3).
        if 

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Extrai o label numérico inicial de um elemento mixed-citation.

Returns:
str: O label numérico extraído ou string vazia se não encontrado
Copy link

Copilot AI Dec 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The docstring mentions returning "string vazia" (empty string) when not found, but the method actually returns None (line 709, 712). The documentation should be corrected to accurately reflect the return value: str or None: O label numérico extraído ou None se não encontrado.

Suggested change
str: O label numérico extraído ou string vazia se não encontrado
str or None: O label numérico extraído ou None se não encontrado

Copilot uses AI. Check for mistakes.
Comment on lines 651 to 745
class XMLSupToXrefPipe(plumber.Pipe):
"""
Transforma elementos <sup> com conteúdo numérico em <xref> quando eles
representam referências bibliográficas (bibr) ou notas de rodapé (fn).

Critérios de análise:
- sup deve ter conteúdo numérico
- Verifica se não existem xref[@ref-type='bibr'] (evita duplicação)
- Compara valores numéricos com ref/label e fn/label existentes
- Prioriza bibr sobre fn quando há ambiguidade
- Na dúvida, mantém como sup
"""

def transform(self, data):
raw, xml = data

sups = xml.xpath(".//sup")
if not sups:
return data

sup_values = set()
for sup in sups:
if sup.text and sup.text.strip().isdigit():
sup_values.add(sup.text.strip())
if not sup_values:
return data

ref_numeric_labels = []
for item in xml.xpath(".//ref"):
label = item.findtext("label") or self._extract_numeric_label_from_mixed_citation(item)
if label:
ref_numeric_labels.append(label)

fn_numeric_labels = []
for item in xml.xpath(".//fn"):
label = item.findtext("label") or self._extract_numeric_label_from_mixed_citation(item)
if label:
fn_numeric_labels.append(label)

ref_labels = [
(ref.findtext("label"), ref) for ref in xml.xpath(".//ref") if ref.find("label") is not None
]
fn_labels = [
(fn.findtext("label"), fn) for fn in xml.xpath(".//fn") if fn.find("label") is not None
]
refs = xml.xpath(".//ref")
fns = xml.xpath(".//fn")
if
return data

def _extract_numeric_label(self, node):
"""
Extrai o label numérico inicial de um elemento mixed-citation.

Returns:
str: O label numérico extraído ou string vazia se não encontrado
"""
if node is None:
return None
text = "".join(node.itertext()).strip()
if not text or not text[0].isdigit():
return None

label = ""
for char in text:
if char.isdigit():
label += char
else:
break
return label
Copy link

Copilot AI Dec 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new XMLSupToXrefPipe class lacks test coverage. Based on the test patterns in the repository (see test_sps_xml_pipes.py), comprehensive tests should be added to verify the transformation logic, including cases for: numeric sup elements matching bibr/fn labels, non-numeric sup elements, missing labels, edge cases, and the prioritization of bibr over fn.

Copilot uses AI. Check for mistakes.
Comment on lines 671 to 674
sup_values = set()
for sup in sups:
if sup.text and sup.text.strip().isdigit():
sup_values.add(sup.text.strip())
Copy link

Copilot AI Dec 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The sup_values set is collected but never used in the subsequent logic. This variable should likely be used to filter or compare against the ref_numeric_labels and fn_numeric_labels, or it should be removed if not needed.

Copilot uses AI. Check for mistakes.
]
refs = xml.xpath(".//ref")
fns = xml.xpath(".//fn")
if
Copy link

Copilot AI Dec 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Incomplete if statement with no condition. This is a syntax error that will prevent the code from running.

Suggested change
if

Copilot uses AI. Check for mistakes.
if
return data

def _extract_numeric_label(self, node):
Copy link

Copilot AI Dec 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The method _extract_numeric_label_from_mixed_citation is called on lines 680 and 686, but the actual method defined is named _extract_numeric_label. Either the method name should be changed to _extract_numeric_label_from_mixed_citation or the calls should use _extract_numeric_label.

Copilot uses AI. Check for mistakes.
Comment on lines 690 to 698
ref_labels = [
(ref.findtext("label"), ref) for ref in xml.xpath(".//ref") if ref.find("label") is not None
]
fn_labels = [
(fn.findtext("label"), fn) for fn in xml.xpath(".//fn") if fn.find("label") is not None
]
refs = xml.xpath(".//ref")
fns = xml.xpath(".//fn")
if
Copy link

Copilot AI Dec 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Variables ref_labels, fn_labels, refs, and fns are defined but never used. The code appears incomplete as these variables are collected but not utilized in any logic before returning. Either complete the implementation or remove these unused variables.

Suggested change
ref_labels = [
(ref.findtext("label"), ref) for ref in xml.xpath(".//ref") if ref.find("label") is not None
]
fn_labels = [
(fn.findtext("label"), fn) for fn in xml.xpath(".//fn") if fn.find("label") is not None
]
refs = xml.xpath(".//ref")
fns = xml.xpath(".//fn")
if
# Removed unused variables and incomplete logic

Copilot uses AI. Check for mistakes.
Comment on lines 678 to 689
ref_numeric_labels = []
for item in xml.xpath(".//ref"):
label = item.findtext("label") or self._extract_numeric_label_from_mixed_citation(item)
if label:
ref_numeric_labels.append(label)

fn_numeric_labels = []
for item in xml.xpath(".//fn"):
label = item.findtext("label") or self._extract_numeric_label_from_mixed_citation(item)
if label:
fn_numeric_labels.append(label)

Copy link

Copilot AI Dec 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Variables ref_numeric_labels and fn_numeric_labels are collected but never used. Since the code is incomplete, these should likely be incorporated into the missing logic or removed if not needed.

Suggested change
ref_numeric_labels = []
for item in xml.xpath(".//ref"):
label = item.findtext("label") or self._extract_numeric_label_from_mixed_citation(item)
if label:
ref_numeric_labels.append(label)
fn_numeric_labels = []
for item in xml.xpath(".//fn"):
label = item.findtext("label") or self._extract_numeric_label_from_mixed_citation(item)
if label:
fn_numeric_labels.append(label)

Copilot uses AI. Check for mistakes.
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 9 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines 686 to 706
for sup in sups:
sup_text = sup.text.strip()
sup.tag = "xref"
sup.set("ref-type", "bibr")
sup.set("rid", ref_ids[sup_text])
return data

fn_ids = {}
fn_numeric_labels = set()
for item in xml.xpath(".//fn"):
label = item.findtext("label") or self._extract_numeric_label_from_node(item)
if label:
fn_numeric_labels.add(label)
fn_ids[label] = item.get("id")
if sup_values.issubset(ref_numeric_labels):
for sup in sups:
sup_text = sup.text.strip()
sup.tag = "xref"
sup.set("ref-type", "fn")
sup.set("rid", fn_ids[sup_text])
return data
Copy link

Copilot AI Dec 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maintainability: The logic for converting sup elements to xref for bibr (lines 686-691) and fn (lines 701-706) is duplicated. This duplication makes the code harder to maintain and increases the risk of bugs. Consider extracting this into a helper method like _convert_sups_to_xrefs(sups, ref_type, ids_dict) to reduce code duplication.

Copilot uses AI. Check for mistakes.
Comment on lines 686 to 691
for sup in sups:
sup_text = sup.text.strip()
sup.tag = "xref"
sup.set("ref-type", "bibr")
sup.set("rid", ref_ids[sup_text])
return data
Copy link

Copilot AI Dec 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nitpick] Best practices violation: The code modifies elements in the sups list while iterating over it. While this specific case may work because only attributes and tag names are being changed (not adding/removing elements), it's generally a risky pattern. Consider iterating over a copy of the list for better safety: for sup in list(sups): or for sup in sups[:]:.

Copilot uses AI. Check for mistakes.
Comment on lines 686 to 691
for sup in sups:
sup_text = sup.text.strip()
sup.tag = "xref"
sup.set("ref-type", "bibr")
sup.set("rid", ref_ids[sup_text])
return data
Copy link

Copilot AI Dec 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bug: This loop processes ALL sup elements found on line 667, not just those with numeric text. Non-numeric sup elements (e.g., <sup>*</sup>) will cause an AttributeError on line 687 when calling .strip() on None text, or will be incorrectly converted if they have text. The loop should only process sup elements that are in sup_values. Consider: for sup in sups: if sup.text and sup.text.strip() in sup_values:.

Copilot uses AI. Check for mistakes.
Comment on lines 701 to 706
for sup in sups:
sup_text = sup.text.strip()
sup.tag = "xref"
sup.set("ref-type", "fn")
sup.set("rid", fn_ids[sup_text])
return data
Copy link

Copilot AI Dec 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bug: Same issue as lines 686-691 - this loop processes ALL sup elements, not just those with numeric text. Non-numeric sup elements will cause errors or be incorrectly converted. The loop should only process sup elements that are in sup_values. Consider: for sup in sups: if sup.text and sup.text.strip() in sup_values:.

Copilot uses AI. Check for mistakes.
Comment on lines 703 to 705
sup.tag = "xref"
sup.set("ref-type", "fn")
sup.set("rid", fn_ids[sup_text])
Copy link

Copilot AI Dec 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bug: Same KeyError risk as line 690 - if sup_text is not in fn_ids dictionary, this will crash. Add validation: if sup_text in fn_ids: before accessing the dictionary.

Suggested change
sup.tag = "xref"
sup.set("ref-type", "fn")
sup.set("rid", fn_ids[sup_text])
if sup_text in fn_ids:
sup.tag = "xref"
sup.set("ref-type", "fn")
sup.set("rid", fn_ids[sup_text])

Copilot uses AI. Check for mistakes.
Comment on lines 681 to 684
label = item.findtext("label") or self._extract_numeric_label_from_node(item.find("mixed-citation"))
if label:
ref_numeric_labels.add(label)
ref_ids[label] = item.get("id")
Copy link

Copilot AI Dec 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bug: The code adds label to ref_numeric_labels and ref_ids[label] only when label exists, but doesn't check if item.get("id") returns None. If a ref element doesn't have an 'id' attribute, ref_ids[label] will be None, which will cause issues later when setting the 'rid' attribute on line 690. Only add to the dictionaries if both label and id exist: if label and item.get("id"):.

Copilot uses AI. Check for mistakes.
Comment on lines 697 to 699
if label:
fn_numeric_labels.add(label)
fn_ids[label] = item.get("id")
Copy link

Copilot AI Dec 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bug: Same issue as lines 681-684 - doesn't check if item.get("id") returns None. If a fn element doesn't have an 'id' attribute, fn_ids[label] will be None. Only add to the dictionaries if both label and id exist: if label and item.get("id"):.

Suggested change
if label:
fn_numeric_labels.add(label)
fn_ids[label] = item.get("id")
id_ = item.get("id")
if label and id_:
fn_numeric_labels.add(label)
fn_ids[label] = id_

Copilot uses AI. Check for mistakes.

Critérios de análise:
- sup deve ter conteúdo numérico
- Verifica se não existem xref[@ref-type='bibr'] (evita duplicação)
Copy link

Copilot AI Dec 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Documentation issue: The docstring states "Verifica se não existem xref[@ref-type='bibr'] (evita duplicação)" but there's no code that actually checks for existing xref elements. This check is mentioned in the PR description but not implemented. Consider adding this check before transforming sup elements, or remove this line from the documentation.

Copilot uses AI. Check for mistakes.
Comment on lines 688 to 690
sup.tag = "xref"
sup.set("ref-type", "bibr")
sup.set("rid", ref_ids[sup_text])
Copy link

Copilot AI Dec 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bug: The code will raise a KeyError if sup_text is not in ref_ids dictionary. This can happen if a sup element was added to sup_values but doesn't have a corresponding entry in ref_ids (e.g., if the ref lacks an 'id' attribute). Add validation: if sup_text in ref_ids: before accessing the dictionary, or ensure all refs have IDs.

Suggested change
sup.tag = "xref"
sup.set("ref-type", "bibr")
sup.set("rid", ref_ids[sup_text])
if sup_text in ref_ids and ref_ids[sup_text] is not None:
sup.tag = "xref"
sup.set("ref-type", "bibr")
sup.set("rid", ref_ids[sup_text])
else:
logging.warning(f"Sup value '{sup_text}' not found in ref_ids or has no id; skipping.")

Copilot uses AI. Check for mistakes.
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 5 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +712 to +731
Extrai o label numérico inicial de um elemento mixed-citation.

Returns:
str: O label numérico extraído ou string vazia se não encontrado
Copy link

Copilot AI Dec 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unclear docstring: The docstring says this method extracts numeric labels from "mixed-citation" elements, but the method actually works with any XML node, not just mixed-citation elements. It's called for both mixed-citation nodes (line 681) and fn nodes (line 696).

Consider updating the docstring to be more generic:

"""
Extrai o label numérico inicial de um nó XML.

Returns:
    str: O label numérico extraído ou None se não encontrado
"""
Suggested change
Extrai o label numérico inicial de um elemento mixed-citation.
Returns:
str: O label numérico extraído ou string vazia se não encontrado
Extrai o label numérico inicial de um XML.
Returns:
str: O label numérico extraído ou None se não encontrado

Copilot uses AI. Check for mistakes.
Comment on lines 651 to 745
class XMLSupToXrefPipe(plumber.Pipe):
"""
Transforma elementos <sup> com conteúdo numérico em <xref> quando eles
representam referências bibliográficas (bibr) ou notas de rodapé (fn).

Critérios de análise:
- sup deve ter conteúdo numérico
- Verifica se não existem xref[@ref-type='bibr'] (evita duplicação)
- Compara valores numéricos com ref/label e fn/label existentes
- Prioriza bibr sobre fn quando há ambiguidade
- Na dúvida, mantém como sup
"""

def transform(self, data):
raw, xml = data

sups = xml.xpath(".//sup")
if not sups:
return data

sup_values = set()
for sup in sups:
if sup.text and sup.text.strip().isdigit():
sup_values.add(sup.text.strip())
if not sup_values:
return data

ref_ids = {}
ref_numeric_labels = set()
for item in xml.xpath(".//ref"):
label = item.findtext("label") or self._extract_numeric_label_from_node(item.find("mixed-citation"))
if label:
ref_numeric_labels.add(label)
ref_ids[label] = item.get("id")
if sup_values.issubset(ref_numeric_labels):
for sup in sups:
sup_text = sup.text.strip()
sup.tag = "xref"
sup.set("ref-type", "bibr")
sup.set("rid", ref_ids[sup_text])
return data

fn_ids = {}
fn_numeric_labels = set()
for item in xml.xpath(".//fn"):
label = item.findtext("label") or self._extract_numeric_label_from_node(item)
if label:
fn_numeric_labels.add(label)
fn_ids[label] = item.get("id")
if sup_values.issubset(ref_numeric_labels):
for sup in sups:
sup_text = sup.text.strip()
sup.tag = "xref"
sup.set("ref-type", "fn")
sup.set("rid", fn_ids[sup_text])
return data

return data

def _extract_numeric_label_from_node(self, node):
"""
Extrai o label numérico inicial de um elemento mixed-citation.

Returns:
str: O label numérico extraído ou string vazia se não encontrado
"""
if node is None:
return None
text = "".join(node.itertext()).strip()
if not text or not text[0].isdigit():
return None

label = ""
for char in text:
if char.isdigit():
label += char
else:
break
return label
Copy link

Copilot AI Dec 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing test coverage: The new XMLSupToXrefPipe class lacks automated tests. Similar pipes in this module (like XMLFontFaceSymbolPipe) have comprehensive test coverage in tests/test_sps_xml_pipes.py.

Consider adding tests to cover:

  • Converting numeric <sup> to <xref ref-type="bibr"> when matching ref labels
  • Converting numeric <sup> to <xref ref-type="fn"> when matching fn labels
  • Not converting non-numeric <sup> elements
  • Not converting when there's no match with ref or fn labels
  • Edge cases like empty <sup> elements or <sup> without text

Copilot uses AI. Check for mistakes.
Comment on lines +718 to +737
return None
text = "".join(node.itertext()).strip()
if not text or not text[0].isdigit():
return None
Copy link

Copilot AI Dec 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Documentation inconsistency: The docstring states the method returns "string vazia se não encontrado" (empty string if not found), but the code actually returns None on lines 718 and 721 when no label is found.

Either update the docstring to say "None se não encontrado" or change the code to return empty string "" instead of None for consistency.

Suggested change
return None
text = "".join(node.itertext()).strip()
if not text or not text[0].isdigit():
return None
return ""
text = "".join(node.itertext()).strip()
if not text or not text[0].isdigit():
return ""

Copilot uses AI. Check for mistakes.
Comment on lines 702 to 705
sup_text = sup.text.strip()
sup.tag = "xref"
sup.set("ref-type", "fn")
sup.set("rid", fn_ids[sup_text])
Copy link

Copilot AI Dec 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same issue as lines 686-690: This loop iterates over all sup elements without checking if they have numeric text content. If a sup element doesn't have numeric text, calling sup.text.strip() on line 702 could fail if sup.text is None, and fn_ids[sup_text] on line 705 will raise a KeyError.

Apply the same fix as suggested for the bibr case:

for sup in sups:
    if sup.text and sup.text.strip().isdigit():
        sup_text = sup.text.strip()
        sup.tag = "xref"
        sup.set("ref-type", "fn")
        sup.set("rid", fn_ids[sup_text])
Suggested change
sup_text = sup.text.strip()
sup.tag = "xref"
sup.set("ref-type", "fn")
sup.set("rid", fn_ids[sup_text])
if sup.text and sup.text.strip().isdigit():
sup_text = sup.text.strip()
sup.tag = "xref"
sup.set("ref-type", "fn")
sup.set("rid", fn_ids[sup_text])

Copilot uses AI. Check for mistakes.

Critérios de análise:
- sup deve ter conteúdo numérico
- Verifica se não existem xref[@ref-type='bibr'] (evita duplicação)
Copy link

Copilot AI Dec 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Documentation inconsistency: The docstring states "Verifica se não existem xref[@ref-type='bibr'] (evita duplicação)" (Checks if bibr xrefs don't exist to avoid duplication), but the code doesn't actually perform this check. There's no XPath query to verify existing xref elements before transformation.

If duplication prevention is intended, the code should include a check like:

existing_xrefs = xml.xpath(".//xref[@ref-type='bibr']")
if existing_xrefs:
    return data  # or handle differently

Otherwise, remove this claim from the documentation.

Copilot uses AI. Check for mistakes.
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 5 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +712 to +723
for sup in list(sups):
sup_text = "".join(sup.itertext()).strip()
rid = ids.get(sup_text)
if not rid:
continue
parent = sup.getparent()
xref = ET.Element("xref")
xref.set("ref-type", ref_type)
xref.set("rid", rid)
xref.append(deepcopy(sup))
parent.replace(sup, xref)
total += 1
Copy link

Copilot AI Dec 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Logic issue: The _convert_sup_to_xref method doesn't check if the sup element already contains a child xref or if its parent is an xref (as done in lines 673-677). This could lead to incorrect transformations of sup elements that should be skipped. Add the same validation checks here:

for sup in list(sups):
    if sup.find("xref") is not None:
        continue
    parent = sup.getparent()
    if parent.tag == "xref":
        continue
    sup_text = "".join(sup.itertext()).strip()
    # ... rest of the logic

Copilot uses AI. Check for mistakes.
Comment on lines +704 to +706
if label:
numeric_labels.add(label)
elem_ids[label] = elem.get("id")
Copy link

Copilot AI Dec 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Potential issue with missing IDs: If a ref or fn element lacks an id attribute, elem.get("id") returns None (line 706), which gets stored in the elem_ids dictionary. Later in _convert_sup_to_xref (line 714), this could result in setting rid to None on the xref element (line 720), which would be invalid XML. Consider adding a check:

elem_id = elem.get("id")
if label and elem_id:
    numeric_labels.add(label)
    elem_ids[label] = elem_id
Suggested change
if label:
numeric_labels.add(label)
elem_ids[label] = elem.get("id")
elem_id = elem.get("id")
if label and elem_id:
numeric_labels.add(label)
elem_ids[label] = elem_id

Copilot uses AI. Check for mistakes.
return elem_ids, numeric_labels

def _convert_sup_to_xref(self, sups, sup_values, ids, numeric_labels, ref_type):
total = 0
Copy link

Copilot AI Dec 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Inconsistent indentation: This line uses only 7 spaces for indentation while the rest of the method uses 8 spaces (or a proper indentation level). This should be aligned with the rest of the method.

Suggested change
total = 0
total = 0

Copilot uses AI. Check for mistakes.
done = self._convert_sup_to_xref(sups, sup_values, ids, numeric_labels, "fn")
if done:
return data

Copy link

Copilot AI Dec 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Trailing whitespace detected on this line. Please remove the extra space(s) at the end of line 693.

Suggested change

Copilot uses AI. Check for mistakes.
Comment on lines +651 to +694
class XMLSupToXrefPipe(plumber.Pipe):
"""
Transforma elementos <sup> com conteúdo numérico em <xref> quando eles
representam referências bibliográficas (bibr) ou notas de rodapé (fn).

Critérios de análise:
- sup deve ter conteúdo numérico
- Verifica se não existem xref[@ref-type='bibr'] (evita duplicação)
- Compara valores numéricos com ref/label e fn/label existentes
- Prioriza bibr sobre fn quando há ambiguidade
- Na dúvida, mantém como sup
"""

def transform(self, data):
raw, xml = data

sups = xml.xpath(".//sup")
if not sups:
return data

sup_values = set()
for sup in list(sups):
if sup.find("xref") is not None:
continue
parent = sup.getparent()
if parent.tag == "xref":
continue
text = "".join(sup.itertext()).strip()
if text and text.isdigit():
sup_values.add(text)
if not sup_values:
return data

ids, numeric_labels = self.get_ids_and_labels(xml, "ref", "mixed-citation")
done = self._convert_sup_to_xref(sups, sup_values, ids, numeric_labels, "bibr")
if done:
return data

ids, numeric_labels = self.get_ids_and_labels(xml, "fn")
done = self._convert_sup_to_xref(sups, sup_values, ids, numeric_labels, "fn")
if done:
return data

return data
Copy link

Copilot AI Dec 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing test coverage: The new XMLSupToXrefPipe class lacks test coverage. Given that the test file tests/test_sps_xml_pipes.py contains tests for other pipe classes like XMLFontFaceSymbolPipe, tests should be added for this new pipe to ensure:

  • Correct conversion of numeric <sup> elements to <xref ref-type="bibr">
  • Correct conversion of numeric <sup> elements to <xref ref-type="fn">
  • Elements that should not be converted remain unchanged (non-numeric, already wrapped, ambiguous cases)
  • Edge cases like missing labels, missing IDs, etc.

Copilot uses AI. Check for mistakes.
- Implements transformation of <sup> elements with numeric content to <xref>
  when they represent bibliographic references (bibr) or footnotes (fn)
- Analyzes sup elements to identify numeric references
- Extracts numeric labels from ref and fn elements for comparison
- Includes helper method to extract numeric labels from mixed-citation
- Prevents duplication by checking existing xref elements
- Prioritizes bibr over fn when ambiguous, maintains sup when uncertain
- Added pipe to XML processing pipeline
@robertatakenaka robertatakenaka force-pushed the converte_sup_para_xref_bibr_ou_xref_fn branch from c3d6f55 to 6342888 Compare December 3, 2025 17:41
@robertatakenaka robertatakenaka merged commit e491871 into scieloorg:main Dec 3, 2025
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant