-
Notifications
You must be signed in to change notification settings - Fork 3
Add XMLSupToXrefPipe to convert numeric sup elements to xref #129
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add XMLSupToXrefPipe to convert numeric sup elements to xref #129
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR attempts to add a new XMLSupToXrefPipe transformation pipe that converts <sup> elements with numeric content into <xref> elements when they represent bibliographic references (bibr) or footnotes (fn). However, the implementation is incomplete and contains critical bugs that prevent it from functioning.
Key Issues
- The
transformmethod contains an incompleteifstatement (line 698) causing a syntax error - Method naming mismatch: calls to
_extract_numeric_label_from_mixed_citationreference a non-existent method - Multiple collected variables (
sup_values,ref_numeric_labels,fn_numeric_labels,ref_labels,fn_labels,refs,fns) are never used in any transformation logic
Comments suppressed due to low confidence (1)
scielo_classic_website/spsxml/sps_xml_pipes.py:698
- Syntax Error (in Python 3).
if
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| Extrai o label numérico inicial de um elemento mixed-citation. | ||
|
|
||
| Returns: | ||
| str: O label numérico extraído ou string vazia se não encontrado |
Copilot
AI
Dec 3, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The docstring mentions returning "string vazia" (empty string) when not found, but the method actually returns None (line 709, 712). The documentation should be corrected to accurately reflect the return value: str or None: O label numérico extraído ou None se não encontrado.
| str: O label numérico extraído ou string vazia se não encontrado | |
| str or None: O label numérico extraído ou None se não encontrado |
| class XMLSupToXrefPipe(plumber.Pipe): | ||
| """ | ||
| Transforma elementos <sup> com conteúdo numérico em <xref> quando eles | ||
| representam referências bibliográficas (bibr) ou notas de rodapé (fn). | ||
|
|
||
| Critérios de análise: | ||
| - sup deve ter conteúdo numérico | ||
| - Verifica se não existem xref[@ref-type='bibr'] (evita duplicação) | ||
| - Compara valores numéricos com ref/label e fn/label existentes | ||
| - Prioriza bibr sobre fn quando há ambiguidade | ||
| - Na dúvida, mantém como sup | ||
| """ | ||
|
|
||
| def transform(self, data): | ||
| raw, xml = data | ||
|
|
||
| sups = xml.xpath(".//sup") | ||
| if not sups: | ||
| return data | ||
|
|
||
| sup_values = set() | ||
| for sup in sups: | ||
| if sup.text and sup.text.strip().isdigit(): | ||
| sup_values.add(sup.text.strip()) | ||
| if not sup_values: | ||
| return data | ||
|
|
||
| ref_numeric_labels = [] | ||
| for item in xml.xpath(".//ref"): | ||
| label = item.findtext("label") or self._extract_numeric_label_from_mixed_citation(item) | ||
| if label: | ||
| ref_numeric_labels.append(label) | ||
|
|
||
| fn_numeric_labels = [] | ||
| for item in xml.xpath(".//fn"): | ||
| label = item.findtext("label") or self._extract_numeric_label_from_mixed_citation(item) | ||
| if label: | ||
| fn_numeric_labels.append(label) | ||
|
|
||
| ref_labels = [ | ||
| (ref.findtext("label"), ref) for ref in xml.xpath(".//ref") if ref.find("label") is not None | ||
| ] | ||
| fn_labels = [ | ||
| (fn.findtext("label"), fn) for fn in xml.xpath(".//fn") if fn.find("label") is not None | ||
| ] | ||
| refs = xml.xpath(".//ref") | ||
| fns = xml.xpath(".//fn") | ||
| if | ||
| return data | ||
|
|
||
| def _extract_numeric_label(self, node): | ||
| """ | ||
| Extrai o label numérico inicial de um elemento mixed-citation. | ||
|
|
||
| Returns: | ||
| str: O label numérico extraído ou string vazia se não encontrado | ||
| """ | ||
| if node is None: | ||
| return None | ||
| text = "".join(node.itertext()).strip() | ||
| if not text or not text[0].isdigit(): | ||
| return None | ||
|
|
||
| label = "" | ||
| for char in text: | ||
| if char.isdigit(): | ||
| label += char | ||
| else: | ||
| break | ||
| return label |
Copilot
AI
Dec 3, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The new XMLSupToXrefPipe class lacks test coverage. Based on the test patterns in the repository (see test_sps_xml_pipes.py), comprehensive tests should be added to verify the transformation logic, including cases for: numeric sup elements matching bibr/fn labels, non-numeric sup elements, missing labels, edge cases, and the prioritization of bibr over fn.
| sup_values = set() | ||
| for sup in sups: | ||
| if sup.text and sup.text.strip().isdigit(): | ||
| sup_values.add(sup.text.strip()) |
Copilot
AI
Dec 3, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The sup_values set is collected but never used in the subsequent logic. This variable should likely be used to filter or compare against the ref_numeric_labels and fn_numeric_labels, or it should be removed if not needed.
| ] | ||
| refs = xml.xpath(".//ref") | ||
| fns = xml.xpath(".//fn") | ||
| if |
Copilot
AI
Dec 3, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Incomplete if statement with no condition. This is a syntax error that will prevent the code from running.
| if |
| if | ||
| return data | ||
|
|
||
| def _extract_numeric_label(self, node): |
Copilot
AI
Dec 3, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The method _extract_numeric_label_from_mixed_citation is called on lines 680 and 686, but the actual method defined is named _extract_numeric_label. Either the method name should be changed to _extract_numeric_label_from_mixed_citation or the calls should use _extract_numeric_label.
| ref_labels = [ | ||
| (ref.findtext("label"), ref) for ref in xml.xpath(".//ref") if ref.find("label") is not None | ||
| ] | ||
| fn_labels = [ | ||
| (fn.findtext("label"), fn) for fn in xml.xpath(".//fn") if fn.find("label") is not None | ||
| ] | ||
| refs = xml.xpath(".//ref") | ||
| fns = xml.xpath(".//fn") | ||
| if |
Copilot
AI
Dec 3, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Variables ref_labels, fn_labels, refs, and fns are defined but never used. The code appears incomplete as these variables are collected but not utilized in any logic before returning. Either complete the implementation or remove these unused variables.
| ref_labels = [ | |
| (ref.findtext("label"), ref) for ref in xml.xpath(".//ref") if ref.find("label") is not None | |
| ] | |
| fn_labels = [ | |
| (fn.findtext("label"), fn) for fn in xml.xpath(".//fn") if fn.find("label") is not None | |
| ] | |
| refs = xml.xpath(".//ref") | |
| fns = xml.xpath(".//fn") | |
| if | |
| # Removed unused variables and incomplete logic |
| ref_numeric_labels = [] | ||
| for item in xml.xpath(".//ref"): | ||
| label = item.findtext("label") or self._extract_numeric_label_from_mixed_citation(item) | ||
| if label: | ||
| ref_numeric_labels.append(label) | ||
|
|
||
| fn_numeric_labels = [] | ||
| for item in xml.xpath(".//fn"): | ||
| label = item.findtext("label") or self._extract_numeric_label_from_mixed_citation(item) | ||
| if label: | ||
| fn_numeric_labels.append(label) | ||
|
|
Copilot
AI
Dec 3, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Variables ref_numeric_labels and fn_numeric_labels are collected but never used. Since the code is incomplete, these should likely be incorporated into the missing logic or removed if not needed.
| ref_numeric_labels = [] | |
| for item in xml.xpath(".//ref"): | |
| label = item.findtext("label") or self._extract_numeric_label_from_mixed_citation(item) | |
| if label: | |
| ref_numeric_labels.append(label) | |
| fn_numeric_labels = [] | |
| for item in xml.xpath(".//fn"): | |
| label = item.findtext("label") or self._extract_numeric_label_from_mixed_citation(item) | |
| if label: | |
| fn_numeric_labels.append(label) |
5c6ae59 to
f70da64
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
Copilot reviewed 1 out of 1 changed files in this pull request and generated 9 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| for sup in sups: | ||
| sup_text = sup.text.strip() | ||
| sup.tag = "xref" | ||
| sup.set("ref-type", "bibr") | ||
| sup.set("rid", ref_ids[sup_text]) | ||
| return data | ||
|
|
||
| fn_ids = {} | ||
| fn_numeric_labels = set() | ||
| for item in xml.xpath(".//fn"): | ||
| label = item.findtext("label") or self._extract_numeric_label_from_node(item) | ||
| if label: | ||
| fn_numeric_labels.add(label) | ||
| fn_ids[label] = item.get("id") | ||
| if sup_values.issubset(ref_numeric_labels): | ||
| for sup in sups: | ||
| sup_text = sup.text.strip() | ||
| sup.tag = "xref" | ||
| sup.set("ref-type", "fn") | ||
| sup.set("rid", fn_ids[sup_text]) | ||
| return data |
Copilot
AI
Dec 3, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maintainability: The logic for converting sup elements to xref for bibr (lines 686-691) and fn (lines 701-706) is duplicated. This duplication makes the code harder to maintain and increases the risk of bugs. Consider extracting this into a helper method like _convert_sups_to_xrefs(sups, ref_type, ids_dict) to reduce code duplication.
| for sup in sups: | ||
| sup_text = sup.text.strip() | ||
| sup.tag = "xref" | ||
| sup.set("ref-type", "bibr") | ||
| sup.set("rid", ref_ids[sup_text]) | ||
| return data |
Copilot
AI
Dec 3, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[nitpick] Best practices violation: The code modifies elements in the sups list while iterating over it. While this specific case may work because only attributes and tag names are being changed (not adding/removing elements), it's generally a risky pattern. Consider iterating over a copy of the list for better safety: for sup in list(sups): or for sup in sups[:]:.
| for sup in sups: | ||
| sup_text = sup.text.strip() | ||
| sup.tag = "xref" | ||
| sup.set("ref-type", "bibr") | ||
| sup.set("rid", ref_ids[sup_text]) | ||
| return data |
Copilot
AI
Dec 3, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Bug: This loop processes ALL sup elements found on line 667, not just those with numeric text. Non-numeric sup elements (e.g., <sup>*</sup>) will cause an AttributeError on line 687 when calling .strip() on None text, or will be incorrectly converted if they have text. The loop should only process sup elements that are in sup_values. Consider: for sup in sups: if sup.text and sup.text.strip() in sup_values:.
| for sup in sups: | ||
| sup_text = sup.text.strip() | ||
| sup.tag = "xref" | ||
| sup.set("ref-type", "fn") | ||
| sup.set("rid", fn_ids[sup_text]) | ||
| return data |
Copilot
AI
Dec 3, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Bug: Same issue as lines 686-691 - this loop processes ALL sup elements, not just those with numeric text. Non-numeric sup elements will cause errors or be incorrectly converted. The loop should only process sup elements that are in sup_values. Consider: for sup in sups: if sup.text and sup.text.strip() in sup_values:.
| sup.tag = "xref" | ||
| sup.set("ref-type", "fn") | ||
| sup.set("rid", fn_ids[sup_text]) |
Copilot
AI
Dec 3, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Bug: Same KeyError risk as line 690 - if sup_text is not in fn_ids dictionary, this will crash. Add validation: if sup_text in fn_ids: before accessing the dictionary.
| sup.tag = "xref" | |
| sup.set("ref-type", "fn") | |
| sup.set("rid", fn_ids[sup_text]) | |
| if sup_text in fn_ids: | |
| sup.tag = "xref" | |
| sup.set("ref-type", "fn") | |
| sup.set("rid", fn_ids[sup_text]) |
| label = item.findtext("label") or self._extract_numeric_label_from_node(item.find("mixed-citation")) | ||
| if label: | ||
| ref_numeric_labels.add(label) | ||
| ref_ids[label] = item.get("id") |
Copilot
AI
Dec 3, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Bug: The code adds label to ref_numeric_labels and ref_ids[label] only when label exists, but doesn't check if item.get("id") returns None. If a ref element doesn't have an 'id' attribute, ref_ids[label] will be None, which will cause issues later when setting the 'rid' attribute on line 690. Only add to the dictionaries if both label and id exist: if label and item.get("id"):.
| if label: | ||
| fn_numeric_labels.add(label) | ||
| fn_ids[label] = item.get("id") |
Copilot
AI
Dec 3, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Bug: Same issue as lines 681-684 - doesn't check if item.get("id") returns None. If a fn element doesn't have an 'id' attribute, fn_ids[label] will be None. Only add to the dictionaries if both label and id exist: if label and item.get("id"):.
| if label: | |
| fn_numeric_labels.add(label) | |
| fn_ids[label] = item.get("id") | |
| id_ = item.get("id") | |
| if label and id_: | |
| fn_numeric_labels.add(label) | |
| fn_ids[label] = id_ |
|
|
||
| Critérios de análise: | ||
| - sup deve ter conteúdo numérico | ||
| - Verifica se não existem xref[@ref-type='bibr'] (evita duplicação) |
Copilot
AI
Dec 3, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Documentation issue: The docstring states "Verifica se não existem xref[@ref-type='bibr'] (evita duplicação)" but there's no code that actually checks for existing xref elements. This check is mentioned in the PR description but not implemented. Consider adding this check before transforming sup elements, or remove this line from the documentation.
| sup.tag = "xref" | ||
| sup.set("ref-type", "bibr") | ||
| sup.set("rid", ref_ids[sup_text]) |
Copilot
AI
Dec 3, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Bug: The code will raise a KeyError if sup_text is not in ref_ids dictionary. This can happen if a sup element was added to sup_values but doesn't have a corresponding entry in ref_ids (e.g., if the ref lacks an 'id' attribute). Add validation: if sup_text in ref_ids: before accessing the dictionary, or ensure all refs have IDs.
| sup.tag = "xref" | |
| sup.set("ref-type", "bibr") | |
| sup.set("rid", ref_ids[sup_text]) | |
| if sup_text in ref_ids and ref_ids[sup_text] is not None: | |
| sup.tag = "xref" | |
| sup.set("ref-type", "bibr") | |
| sup.set("rid", ref_ids[sup_text]) | |
| else: | |
| logging.warning(f"Sup value '{sup_text}' not found in ref_ids or has no id; skipping.") |
f70da64 to
a04a81f
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
Copilot reviewed 1 out of 1 changed files in this pull request and generated 5 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| Extrai o label numérico inicial de um elemento mixed-citation. | ||
|
|
||
| Returns: | ||
| str: O label numérico extraído ou string vazia se não encontrado |
Copilot
AI
Dec 3, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unclear docstring: The docstring says this method extracts numeric labels from "mixed-citation" elements, but the method actually works with any XML node, not just mixed-citation elements. It's called for both mixed-citation nodes (line 681) and fn nodes (line 696).
Consider updating the docstring to be more generic:
"""
Extrai o label numérico inicial de um nó XML.
Returns:
str: O label numérico extraído ou None se não encontrado
"""| Extrai o label numérico inicial de um elemento mixed-citation. | |
| Returns: | |
| str: O label numérico extraído ou string vazia se não encontrado | |
| Extrai o label numérico inicial de um nó XML. | |
| Returns: | |
| str: O label numérico extraído ou None se não encontrado |
| class XMLSupToXrefPipe(plumber.Pipe): | ||
| """ | ||
| Transforma elementos <sup> com conteúdo numérico em <xref> quando eles | ||
| representam referências bibliográficas (bibr) ou notas de rodapé (fn). | ||
|
|
||
| Critérios de análise: | ||
| - sup deve ter conteúdo numérico | ||
| - Verifica se não existem xref[@ref-type='bibr'] (evita duplicação) | ||
| - Compara valores numéricos com ref/label e fn/label existentes | ||
| - Prioriza bibr sobre fn quando há ambiguidade | ||
| - Na dúvida, mantém como sup | ||
| """ | ||
|
|
||
| def transform(self, data): | ||
| raw, xml = data | ||
|
|
||
| sups = xml.xpath(".//sup") | ||
| if not sups: | ||
| return data | ||
|
|
||
| sup_values = set() | ||
| for sup in sups: | ||
| if sup.text and sup.text.strip().isdigit(): | ||
| sup_values.add(sup.text.strip()) | ||
| if not sup_values: | ||
| return data | ||
|
|
||
| ref_ids = {} | ||
| ref_numeric_labels = set() | ||
| for item in xml.xpath(".//ref"): | ||
| label = item.findtext("label") or self._extract_numeric_label_from_node(item.find("mixed-citation")) | ||
| if label: | ||
| ref_numeric_labels.add(label) | ||
| ref_ids[label] = item.get("id") | ||
| if sup_values.issubset(ref_numeric_labels): | ||
| for sup in sups: | ||
| sup_text = sup.text.strip() | ||
| sup.tag = "xref" | ||
| sup.set("ref-type", "bibr") | ||
| sup.set("rid", ref_ids[sup_text]) | ||
| return data | ||
|
|
||
| fn_ids = {} | ||
| fn_numeric_labels = set() | ||
| for item in xml.xpath(".//fn"): | ||
| label = item.findtext("label") or self._extract_numeric_label_from_node(item) | ||
| if label: | ||
| fn_numeric_labels.add(label) | ||
| fn_ids[label] = item.get("id") | ||
| if sup_values.issubset(ref_numeric_labels): | ||
| for sup in sups: | ||
| sup_text = sup.text.strip() | ||
| sup.tag = "xref" | ||
| sup.set("ref-type", "fn") | ||
| sup.set("rid", fn_ids[sup_text]) | ||
| return data | ||
|
|
||
| return data | ||
|
|
||
| def _extract_numeric_label_from_node(self, node): | ||
| """ | ||
| Extrai o label numérico inicial de um elemento mixed-citation. | ||
|
|
||
| Returns: | ||
| str: O label numérico extraído ou string vazia se não encontrado | ||
| """ | ||
| if node is None: | ||
| return None | ||
| text = "".join(node.itertext()).strip() | ||
| if not text or not text[0].isdigit(): | ||
| return None | ||
|
|
||
| label = "" | ||
| for char in text: | ||
| if char.isdigit(): | ||
| label += char | ||
| else: | ||
| break | ||
| return label |
Copilot
AI
Dec 3, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Missing test coverage: The new XMLSupToXrefPipe class lacks automated tests. Similar pipes in this module (like XMLFontFaceSymbolPipe) have comprehensive test coverage in tests/test_sps_xml_pipes.py.
Consider adding tests to cover:
- Converting numeric
<sup>to<xref ref-type="bibr">when matching ref labels - Converting numeric
<sup>to<xref ref-type="fn">when matching fn labels - Not converting non-numeric
<sup>elements - Not converting when there's no match with ref or fn labels
- Edge cases like empty
<sup>elements or<sup>without text
| return None | ||
| text = "".join(node.itertext()).strip() | ||
| if not text or not text[0].isdigit(): | ||
| return None |
Copilot
AI
Dec 3, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Documentation inconsistency: The docstring states the method returns "string vazia se não encontrado" (empty string if not found), but the code actually returns None on lines 718 and 721 when no label is found.
Either update the docstring to say "None se não encontrado" or change the code to return empty string "" instead of None for consistency.
| return None | |
| text = "".join(node.itertext()).strip() | |
| if not text or not text[0].isdigit(): | |
| return None | |
| return "" | |
| text = "".join(node.itertext()).strip() | |
| if not text or not text[0].isdigit(): | |
| return "" |
| sup_text = sup.text.strip() | ||
| sup.tag = "xref" | ||
| sup.set("ref-type", "fn") | ||
| sup.set("rid", fn_ids[sup_text]) |
Copilot
AI
Dec 3, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same issue as lines 686-690: This loop iterates over all sup elements without checking if they have numeric text content. If a sup element doesn't have numeric text, calling sup.text.strip() on line 702 could fail if sup.text is None, and fn_ids[sup_text] on line 705 will raise a KeyError.
Apply the same fix as suggested for the bibr case:
for sup in sups:
if sup.text and sup.text.strip().isdigit():
sup_text = sup.text.strip()
sup.tag = "xref"
sup.set("ref-type", "fn")
sup.set("rid", fn_ids[sup_text])| sup_text = sup.text.strip() | |
| sup.tag = "xref" | |
| sup.set("ref-type", "fn") | |
| sup.set("rid", fn_ids[sup_text]) | |
| if sup.text and sup.text.strip().isdigit(): | |
| sup_text = sup.text.strip() | |
| sup.tag = "xref" | |
| sup.set("ref-type", "fn") | |
| sup.set("rid", fn_ids[sup_text]) |
|
|
||
| Critérios de análise: | ||
| - sup deve ter conteúdo numérico | ||
| - Verifica se não existem xref[@ref-type='bibr'] (evita duplicação) |
Copilot
AI
Dec 3, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Documentation inconsistency: The docstring states "Verifica se não existem xref[@ref-type='bibr'] (evita duplicação)" (Checks if bibr xrefs don't exist to avoid duplication), but the code doesn't actually perform this check. There's no XPath query to verify existing xref elements before transformation.
If duplication prevention is intended, the code should include a check like:
existing_xrefs = xml.xpath(".//xref[@ref-type='bibr']")
if existing_xrefs:
return data # or handle differentlyOtherwise, remove this claim from the documentation.
a04a81f to
c3d6f55
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
Copilot reviewed 1 out of 1 changed files in this pull request and generated 5 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| for sup in list(sups): | ||
| sup_text = "".join(sup.itertext()).strip() | ||
| rid = ids.get(sup_text) | ||
| if not rid: | ||
| continue | ||
| parent = sup.getparent() | ||
| xref = ET.Element("xref") | ||
| xref.set("ref-type", ref_type) | ||
| xref.set("rid", rid) | ||
| xref.append(deepcopy(sup)) | ||
| parent.replace(sup, xref) | ||
| total += 1 |
Copilot
AI
Dec 3, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Logic issue: The _convert_sup_to_xref method doesn't check if the sup element already contains a child xref or if its parent is an xref (as done in lines 673-677). This could lead to incorrect transformations of sup elements that should be skipped. Add the same validation checks here:
for sup in list(sups):
if sup.find("xref") is not None:
continue
parent = sup.getparent()
if parent.tag == "xref":
continue
sup_text = "".join(sup.itertext()).strip()
# ... rest of the logic| if label: | ||
| numeric_labels.add(label) | ||
| elem_ids[label] = elem.get("id") |
Copilot
AI
Dec 3, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Potential issue with missing IDs: If a ref or fn element lacks an id attribute, elem.get("id") returns None (line 706), which gets stored in the elem_ids dictionary. Later in _convert_sup_to_xref (line 714), this could result in setting rid to None on the xref element (line 720), which would be invalid XML. Consider adding a check:
elem_id = elem.get("id")
if label and elem_id:
numeric_labels.add(label)
elem_ids[label] = elem_id| if label: | |
| numeric_labels.add(label) | |
| elem_ids[label] = elem.get("id") | |
| elem_id = elem.get("id") | |
| if label and elem_id: | |
| numeric_labels.add(label) | |
| elem_ids[label] = elem_id |
| return elem_ids, numeric_labels | ||
|
|
||
| def _convert_sup_to_xref(self, sups, sup_values, ids, numeric_labels, ref_type): | ||
| total = 0 |
Copilot
AI
Dec 3, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Inconsistent indentation: This line uses only 7 spaces for indentation while the rest of the method uses 8 spaces (or a proper indentation level). This should be aligned with the rest of the method.
| total = 0 | |
| total = 0 |
| done = self._convert_sup_to_xref(sups, sup_values, ids, numeric_labels, "fn") | ||
| if done: | ||
| return data | ||
|
|
Copilot
AI
Dec 3, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Trailing whitespace detected on this line. Please remove the extra space(s) at the end of line 693.
| class XMLSupToXrefPipe(plumber.Pipe): | ||
| """ | ||
| Transforma elementos <sup> com conteúdo numérico em <xref> quando eles | ||
| representam referências bibliográficas (bibr) ou notas de rodapé (fn). | ||
|
|
||
| Critérios de análise: | ||
| - sup deve ter conteúdo numérico | ||
| - Verifica se não existem xref[@ref-type='bibr'] (evita duplicação) | ||
| - Compara valores numéricos com ref/label e fn/label existentes | ||
| - Prioriza bibr sobre fn quando há ambiguidade | ||
| - Na dúvida, mantém como sup | ||
| """ | ||
|
|
||
| def transform(self, data): | ||
| raw, xml = data | ||
|
|
||
| sups = xml.xpath(".//sup") | ||
| if not sups: | ||
| return data | ||
|
|
||
| sup_values = set() | ||
| for sup in list(sups): | ||
| if sup.find("xref") is not None: | ||
| continue | ||
| parent = sup.getparent() | ||
| if parent.tag == "xref": | ||
| continue | ||
| text = "".join(sup.itertext()).strip() | ||
| if text and text.isdigit(): | ||
| sup_values.add(text) | ||
| if not sup_values: | ||
| return data | ||
|
|
||
| ids, numeric_labels = self.get_ids_and_labels(xml, "ref", "mixed-citation") | ||
| done = self._convert_sup_to_xref(sups, sup_values, ids, numeric_labels, "bibr") | ||
| if done: | ||
| return data | ||
|
|
||
| ids, numeric_labels = self.get_ids_and_labels(xml, "fn") | ||
| done = self._convert_sup_to_xref(sups, sup_values, ids, numeric_labels, "fn") | ||
| if done: | ||
| return data | ||
|
|
||
| return data |
Copilot
AI
Dec 3, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Missing test coverage: The new XMLSupToXrefPipe class lacks test coverage. Given that the test file tests/test_sps_xml_pipes.py contains tests for other pipe classes like XMLFontFaceSymbolPipe, tests should be added for this new pipe to ensure:
- Correct conversion of numeric
<sup>elements to<xref ref-type="bibr"> - Correct conversion of numeric
<sup>elements to<xref ref-type="fn"> - Elements that should not be converted remain unchanged (non-numeric, already wrapped, ambiguous cases)
- Edge cases like missing labels, missing IDs, etc.
- Implements transformation of <sup> elements with numeric content to <xref> when they represent bibliographic references (bibr) or footnotes (fn) - Analyzes sup elements to identify numeric references - Extracts numeric labels from ref and fn elements for comparison - Includes helper method to extract numeric labels from mixed-citation - Prevents duplication by checking existing xref elements - Prioritizes bibr over fn when ambiguous, maintains sup when uncertain - Added pipe to XML processing pipeline
c3d6f55 to
6342888
Compare
Adiciona XMLSupToXrefPipe para converter elementos sup numéricos em xref
Descrição
Esta PR implementa um novo pipe de transformação XML (
XMLSupToXrefPipe) que converte automaticamente elementos<sup>com conteúdo numérico em elementos<xref>apropriados quando representam referências bibliográficas ou notas de rodapé.Motivação
Documentos XML frequentemente contêm elementos
<sup>que na verdade são referências bibliográficas ou notas de rodapé, mas não estão marcados corretamente como<xref>. Isso pode afetar o processamento e a renderização adequada desses elementos no sistema.Alterações implementadas
XMLSupToXrefPipe: Transforma elementos<sup>numéricos em<xref ref-type="bibr">ou<xref ref-type="fn">baseado na análise do conteúdo<sup>com conteúdo numérico<ref>e<fn>existentes_extract_numeric_label: Extrai labels numéricos do início de elementos mixed-citationImpacto
Como testar
<sup>numéricos<sup>que correspondem a refs ou fns são convertidos corretamente<sup>ambíguos ou não-referências permanecem inalterados<xref>