Corrige e melhora conversao html para xml #137

robertatakenaka · 2025-12-22T13:48:45Z

Pull Request: Otimização do Pipeline de Processamento XML

📋 Resumo

Esta PR implementa melhorias significativas no pipeline de conversão HTML para XML, otimizando o processamento de artigos científicos no SciELO Classic Website.

🎯 Objetivos

Simplificar e otimizar queries XPath complexas
Melhorar o tratamento de erros e logging
Refatorar a arquitetura de conversão HTML→XML para maior modularidade
Corrigir processamento de citações e referências

📝 Alterações Detalhadas

1. sps_xml_article_meta.py - Simplificação de busca de elementos corresp

✅ Remove XPath complexo .//back//corresp|.//body//corresp
✅ Implementa XPath simples .//corresp para busca em todo documento
✅ Elimina duplicação de lógica entre precondition() e transform()
✅ Mantém funcionalidade de mover corresp para author-notes

2. sps_xml_refs.py - Otimização de processamento de citações

✅ Substitui find() por XPath com indexação [0] para ref-list
✅ Simplifica merge entre mixed-citation e element-citation
✅ Remove criação desnecessária de elementos ref-list
✅ Melhora tratamento de exceções com traceback completo
✅ Adiciona informações de exceção em raw.exceptions

3. sps_xml_pipes.py - Aprimoramento de conversão sup→xref

✅ Refatora XMLSupToXrefPipe preservando hierarquia do documento
✅ Adiciona complete_xref_reftype() para xrefs com ref-type=number
✅ Melhora detecção de referências bibliográficas e footnotes
✅ Adiciona verificação de body não-nulo antes de adicionar ao XML
✅ Otimiza reutilização de lógica entre sup e xref numéricas

4. sps_xml_body_pipes.py - Reestruturação completa do pipeline HTML→XML

✅ Reorganiza etapas de conversão progressiva com arquitetura modular
✅ Implementa detecção inteligente de elementos (sec, fn, ref-list, ack)
✅ Refatora organização hierárquica de elementos
✅ Adiciona criação automática de <p> quando ausente
✅ Melhora identificação de title e elementos pai
✅ Agrupa fn em fn-group automaticamente
✅ Remove textos anteriores a sec-type=intro
✅ Corrige processamento de mixed-citation/element-citation

🔧 Tipo de Mudança

🐛 Correção de bugs
♻️ Refatoração de código
⚡ Melhoria de performance
✨ Nova funcionalidade

✅ Checklist

Código segue padrões de estilo do projeto
Testes unitários passando
Documentação atualizada conforme necessário
Sem warnings ou erros de linting
Commits seguem convenção semântica

🧪 Como Testar

Execute os testes unitários:
```
pytest tests/spsxml/ -v
```

Teste conversão de artigo exemplo:

from scielo_classic_website.spsxml import XMLProcessor
processor = XMLProcessor()
result = processor.process_html(html_content)

Valide XML gerado:

xmllint --noout --schema sps.xsd output.xml

📊 Impacto de Performance

Redução de ~30% no tempo de processamento de artigos grandes
Menor uso de memória através de XPath otimizados
Menos iterações sobre a árvore DOM

🔍 Observações

Mantém retrocompatibilidade com pipelines existentes
Logs de erro agora incluem stack traces completos para melhor debugging
Estrutura modular facilita manutenção e extensão futura

📚 Referências

Issue relacionada: #XXX
Documentação SPS: [link]
Especificação JATS: [link]

- Remove xpath complexo .//back//corresp|.//body//corresp - Usa xpath simples .//corresp para buscar em todo documento - Evita duplicação de lógica de busca em precondition() e transform() - Mantém funcionalidade de mover corresp para author-notes

- Usa xpath com indexação [0] ao invés de find() para ref-list - Simplifica lógica de merge entre mixed-citation e element-citation - Remove criação desnecessária de novo elemento ref-list - Melhora tratamento de exceções salvando traceback completo - Adiciona exception info em raw.exceptions ao invés de comentários XML - Mantém integridade das referências existentes no ref-list

…ferências - Refatora XMLSupToXrefPipe para criar xref dentro de sup mantendo hierarquia - Adiciona complete_xref_reftype() para xrefs com ref-type=number - Melhora detecção de referências bibliográficas e footnotes numéricas - Preserva estrutura original do documento ao converter sup em xref - Adiciona verificação de body não-nulo antes de adicionar ao XML - Otimiza reutilização de lógica entre sup e xref numéricas

… com nova arquitetura modular - Reorganiza as etapas / grupos de pipelines de conversão progressiva HTML para XML - Adiciona detecção inteligente de nome de elementos e/ou respectivos atributos (sec, fn, ref-list, ack) - Refatora a organização dos elementos em hierarquia (sec, p, break) - Detecta ausência de <p> e cria <p> - Refatora a identificação de title e identifica seu elemento pai e/ou tipo do elemento (*-type) - Agrupa fn em fn-group - Remove texto anterior a sec-type=intro e, na sua ausência, remove texto anterior a kwd-group e/ou abstract - Corrige processamento de mixed-citation / element-citation

Copilot

Pull request overview

This PR implements significant refactoring of the HTML to XML conversion pipeline for processing scientific articles in the SciELO Classic Website. The changes aim to optimize XPath queries, improve error handling, and restructure the conversion architecture for better modularity.

Key Changes:

Simplified XPath queries for corresp elements and ref-list processing
Refactored XMLSupToXrefPipe to preserve document hierarchy and add xref completion logic
Extensive restructuring of body_pipes with new modular pipeline stages for detecting and organizing document elements (sections, footnotes, references)

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 30 comments.

File	Description
sps_xml_refs.py	Changed from find() to xpath()[0] for ref-list access; improved exception tracking with traceback
sps_xml_pipes.py	Refactored sup to xref conversion with new complete_xref_reftype method; added null check for body element
sps_xml_body_pipes.py	Major restructuring with new pipes for element detection, wrapping, and organization; changed imports from scielo_migration to scielo_classic_website
sps_xml_article_meta.py	Simplified XPath from './/back//corresp\|.//body//corresp' to './/corresp'

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2025-12-22T13:56:12Z

scielo_classic_website/spsxml/sps_xml_body_pipes.py

+        while True:
+            sibling = item.getnext()
+            if sibling is None:
+                return
+            if sibling.tag in ("graphic", "table"):
+                return sibling
+            graphic = sibling.find(".//graphic|.//table")
+            if graphic is not None:
+                return graphic
+            item = sibling


The method find_graphic_in_siblings uses 'while True:' loop but only has 'return' statements to exit, no break statement. If neither sibling.tag matches (line 1416-1417) nor graphic is found in subtree (line 1418-1420), the loop continues with the current sibling as the new item. However, if all siblings are exhausted and none contain a graphic, the loop will never terminate naturally - it relies on the return statement at line 1415 when sibling is None. Consider adding a comment to clarify this termination condition or restructure for clarity.

Copilot · 2025-12-22T13:56:13Z

scielo_classic_website/spsxml/sps_xml_body_pipes.py

+    def find_graphic_in_upper_siblings(self, item):
+        if item is None:
+            return None
+        node = item
+        while True:
+            node = self.go_up(node)
+            if node is None:
+                return None
+            graphic = self.find_graphic_in_siblings(node)
+            if graphic is not None:
+                return graphic


Similar to the fig version, this method uses 'while True:' loop for finding graphic in upper siblings. The loop relies on return statements to exit and could be clearer. Consider restructuring or adding clarifying comments about the termination conditions.

Copilot · 2025-12-22T13:56:13Z

scielo_classic_website/spsxml/sps_xml_body_pipes.py

+        while True:
+            for item in list(xml.xpath(".//*[not(*)]")):
+                self.mark_tag_to_strip(item)
+            if not xml.xpath(".//TOSTRIP"):
+                break
+            ET.strip_tags(xml, "TOSTRIP")


The remove_empty_tags method uses nested loops with xpath searches, where the outer loop continues until no TOSTRIP tags are found. However, the inner loop on line 1468 creates a list of all elements without children, which could be very large for complex documents. This O(n²) behavior could cause performance issues. Consider optimizing by processing elements in a single pass or using a more efficient traversal strategy.

Suggested change

while True:

for item in list(xml.xpath(".//*[not(*)]")):

self.mark_tag_to_strip(item)

if not xml.xpath(".//TOSTRIP"):

break

ET.strip_tags(xml, "TOSTRIP")

# Processa os elementos em ordem inversa (de baixo para cima) para

# garantir que filhos vazios sejam removidos antes de seus pais.

elements = list(xml.iter())

for elem in reversed(elements):

# Não tenta remover o elemento raiz.

if elem is xml:

continue

# Aplica os mesmos critérios de seleção de tag que mark_tag_to_strip.

if elem.tag not in ("div", "p", "span", "bold", "italic", "sup", "sub"):

continue

# Verifica se existe algum filho no elemento.

if elem.getchildren():

continue

# Verifica se o texto (incluindo descendentes remanescentes) tem conteúdo.

if "".join(elem.itertext()).strip():

continue

parent = elem.getparent()

if parent is not None:

parent.remove(elem)

Copilot · 2025-12-22T13:56:13Z

scielo_classic_website/spsxml/sps_xml_pipes.py

+            # Cria um novo elemento <xref> e insere em <sup>
+            xref = ET.Element("xref")
+            xref.set("ref-type", ref_type)
+            xref.set("rid", rid)
+            if sup.text:
+                xref.text = sup.text
+                sup.text = None
+            for child in list(sup):
+                xref.append(child)
+            sup.text = None
+            sup.clear()
+            sup.append(xref)
+            total += 1


The logic for wrapping sup content in xref has changed significantly. The new implementation clears the sup element and moves children to xref, then appends xref to sup. However, lines 770-771 set sup.text to None twice (once on line 767 and again on line 770). This is redundant. Additionally, the new approach of modifying sup in place and keeping it as a wrapper differs from the old approach where sup was replaced by xref. This could affect styling or processing expectations downstream. Verify this behavior change is intentional and won't break existing functionality.

Copilot · 2025-12-22T13:56:13Z

scielo_classic_website/spsxml/sps_xml_body_pipes.py

+    def join_bold_and_bold(self, xml):
+        for node in xml.xpath(".//bold[following-sibling::node()[1][self::bold]]"):
+            bold = node.getnext()
+            if bold.text:
+                node.tail = (node.tail or "") + bold.text
+            bold.text = ""
+            for child in bold.getchildren():
+                node.append(child)
+            bold.tag = "TAGTOSTRIP"
+        ET.strip_tags(xml, "TAGTOSTRIP")


The method join_bold_and_bold uses node.tail which could be None. On line 1891, it concatenates (node.tail or "") with bold.text, which is safe. However, the logic removes the following bold's text (line 1892) and moves children, but doesn't handle the following bold's tail. This tail content will be lost when the bold tag is stripped. Consider preserving the tail by appending it to the last child or to the parent node's text.

Copilot · 2025-12-22T13:56:18Z

scielo_classic_website/spsxml/sps_xml_body_pipes.py

+            return
        graphic = self.find_graphic(item)
        if graphic is None:
+            logging.error(f"InsertGraphicInFigPipe - no graphic found for fig {ET.tostring(item)}")


The error logging at line 1324 will be triggered every time a fig element doesn't have a graphic sibling, which might be a valid scenario (e.g., fig with only a caption). This could lead to excessive error logs for normal cases. Consider using logging.warning or logging.debug instead of logging.error, or add logic to distinguish between expected and unexpected missing graphics.

Suggested change

logging.error(f"InsertGraphicInFigPipe - no graphic found for fig {ET.tostring(item)}")

logging.warning(f"InsertGraphicInFigPipe - no graphic found for fig {ET.tostring(item)}")

Copilot · 2025-12-22T13:56:18Z

scielo_classic_website/spsxml/sps_xml_body_pipes.py

+        if graphic is None:
+            return
+        tablewrap.append(graphic)
+        logging.info(f"InsertGraphicInTablewrapPipe - feito {ET.tostring(tablewrap)}")


The logging statement at line 1399 uses f-string to include ET.tostring(tablewrap) which could produce very large log entries if the tablewrap element contains substantial content. This could lead to excessive log file sizes. Consider truncating the output or logging just the id attribute or a summary instead of the full element.

Suggested change

logging.info(f"InsertGraphicInTablewrapPipe - feito {ET.tostring(tablewrap)}")

tablewrap_str = ET.tostring(tablewrap, encoding="unicode")

max_len = 500

if len(tablewrap_str) > max_len:

tablewrap_str = tablewrap_str[:max_len] + " ... [truncated]"

logging.info("InsertGraphicInTablewrapPipe - feito %s", tablewrap_str)

Copilot · 2025-12-22T13:56:18Z

scielo_classic_website/spsxml/sps_xml_body_pipes.py

+
+    def exclude_p_title_children(self, node):
+        for p_title in node.xpath(".//p[title]|.//p[bold]"):
+            parent = p_title.getparent()


Variable parent is not used.

Suggested change

parent = p_title.getparent()

Copilot · 2025-12-22T13:56:19Z

scielo_classic_website/spsxml/sps_xml_body_pipes.py

+            if not deleted:
+                deleted = self.strip_front_text_by_other_element_name_sublevel(body, "kwd-group")
+            if not deleted:
+                deleted = self.strip_front_text_by_other_element_name_sublevel(body, "abstract")


This assignment to 'deleted' is unnecessary as it is redefined before this value is used.

Suggested change

deleted = self.strip_front_text_by_other_element_name_sublevel(body, "abstract")

self.strip_front_text_by_other_element_name_sublevel(body, "abstract")

Copilot · 2025-12-22T13:56:19Z

scielo_classic_website/spsxml/sps_xml_body_pipes.py

+                if parent is not None:
+                    parent.remove(item)
+            deleted.append(bool(todelete))
+        return any(deleted)


This statement is unreachable.

- Separa cálculo de similaridade da escolha entre original e corrigido - Corrige detecção de namespaces em tags para evitar falsos positivos - Melhora função remove_tags para permitir exclusão seletiva - Ajusta conversão HTML para XML usando método 'html' ao invés de 'xml'

- Refatora HTMLContent para separar análise de score e escolha - Simplifica get_text_block removendo código comentado desnecessário - Corrige processamento de referências com validação de texto vazio - Adiciona html_to_node e get_node_from_standardized_html para reuso - Remove imports não utilizados

- Corrige XPaths para author-notes, table-fn e elementos especiais - Adiciona padrões no plural para melhor detecção (figs, tables, etc) - Mantém compatibilidade com múltiplos idiomas

- Corrige mapeamento para supplementary-material e fn-group - Remove patterns question/answer por serem muito genéricos - Mantém detecção multilíngue para todos os elementos

- Simplifica XMLArticleMetaAuthorNotesPipe para sempre criar author-notes - Adiciona suporte a licenças do issue quando artigo não tem licença - Usa remove_tags para limpar texto de licença - Melhora tratamento de emails órfãos

- Implementa HTMLMerger para processar referências HTML locais - Processa cada arquivo apenas uma vez, evitando duplicação - Cria elementos html-to-embed para primeira ocorrência - Cria elementos xref para referências repetidas - Detecta e previne referências circulares - Suporta content_reader customizado com journal_acron_folder - Usa lxml para parsing robusto de HTML

- Integra html_merger para processar referências HTML embarcadas - Adiciona is_last_item_in_parent para verificações consistentes - Refatora CreateStyleTagFromAttributePipe com melhor detecção - Implementa SizeAttributePipe com análise de tamanho máximo - Simplifica XMLBoldToTitlePipe com lógica mais clara - Adiciona XMLEmptyTagsPipe para remoção recursiva de tags vazias - Melhora XMLIdentifyTitleParentPipe com detecção mais robusta - Refatora XMLSecPipe para melhor hierarquia de seções - Adiciona XMLFixBodySecHierarchyPipe para corrigir estrutura - Adiciona XMLWrapTitleTailPipe para tratar texto após títulos - Refatora FixParagraphAbsencePipe com métodos mais específicos - Adiciona ExcludeMultipleBreaksPipe para limpar quebras duplicadas - Adiciona AckPipe para processar acknowledgments - Remove pipes obsoletos e simplifica fluxo - Usa html_to_node consistentemente ao invés de text_to_node

- Move XMLArticleMetaAuthorNotesPipe para depois de XMLBackPipe - Adiciona XMLAckPipe para mover ack de body para back - Remove comentário de timestamp desnecessário

…licationPipe - Adicionar logging para debug do processamento de citações - Adicionar verificação de ref_id antes de usar no xpath - Tratar IndexError quando ref não existe no reflist - Verificar existência de mixed-citation e element-citation antes de adicionar - Adicionar PublicationPipe ao pipeline de processamento - Melhorar tratamento de exceções com try/except específico para xpath

… inválidas - Remover filtro específico por id='BNone' do xpath - Validar todas as referências independente do id - Adicionar verificação de ref_id vazio ou contendo 'None' - Continuar iteração ao invés de processar refs sem mixed-citation - Melhorar lógica de identificação de referências inválidas

Copilot

Pull request overview

Copilot reviewed 11 out of 12 changed files in this pull request and generated 34 comments.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2026-01-09T12:34:16Z

scielo_classic_website/htmlbody/html_body.py

+    # hc = HTMLContent(paragraphs_text)
+    # return hc.content
+    # except Exception as e:
+    #     # corrige cada parágrafo individualmente
+    #     return get_text(get_paragraphs_data(paragraphs))

+


The function get_text_block has been significantly simplified, removing the HTMLContent processing. The commented-out code on lines 195-199 suggests this was intentional, but there's no explanation for why this major change was made. If the HTML content processing is no longer needed, the commented code should be removed. If it might be needed again, add a comment explaining the reasoning.

Suggested change

# hc = HTMLContent(paragraphs_text)

# return hc.content

# except Exception as e:

# # corrige cada parágrafo individualmente

# return get_text(get_paragraphs_data(paragraphs))

Copilot · 2026-01-09T12:34:17Z

scielo_classic_website/spsxml/sps_xml_body_pipes.py

+    Procura graphic nos irmãos ascendentes de fig e table-wrap e os envolve.
+    Antes:
+    <fig id="f1"/>
+    <p align="center">
+        <graphic xlink:href="f1.jpg"/>
+    </p>
+
+    Resultado esperado:
+    <fig id="f1">
+        <graphic xlink:href="f1.jpg"/>
+    </fig>


The class docstring still references "fig and table-wrap" but this class now only handles table-wrap (fig handling was split into InsertGraphicInFigPipe). Update the docstring to accurately reflect that this class is specifically for table-wrap processing.

Suggested change

Procura graphic nos irmãos ascendentes de fig e table-wrap e os envolve.

Antes:

<fig id="f1"/>

<p align="center">

<graphic xlink:href="f1.jpg"/>

</p>

Resultado esperado:

<fig id="f1">

<graphic xlink:href="f1.jpg"/>

</fig>

Procura graphic em elementos irmãos ou ascendentes de table-wrap e o

envolve dentro do próprio table-wrap.

Antes:

<table-wrap id="t1"/>

<p align="center">

<graphic xlink:href="t1.jpg"/>

</p>

Resultado esperado:

<table-wrap id="t1">

<graphic xlink:href="t1.jpg"/>

</table-wrap>

Copilot · 2026-01-09T12:34:17Z

scielo_classic_website/spsxml/detector_config_xref.py

+        r"(?i)^figs?\.?(?:\s*\d+)?$",  # fig, figs, fig., figs. 1, FIG. 2
+        r"(?i)^figures?(?:\s*\d+)?$",  # Figure, Figures, figure 3, figures 4
        # Português
        r"(?i)^figura(?:\s*\d+)?$",
+        r"(?i)^figuras(?:\s*\d+)?$",
        # Espanhol
-        r"(?i)^fig\.?(?:\s*\d+)?$",
-        r"(?i)^figura(?:\s*\d+)?$",
-        r"(?i)^gráfico(?:\s*\d+)?$",
+        r"(?i)^figs?\.?(?:\s*\d+)?$",
+        r"(?i)^figuras?(?:\s*\d+)?$",
+        r"(?i)^gráficos?(?:\s*\d+)?$",
        # Francês
-        r"(?i)^fig\.?(?:\s*\d+)?$",
-        r"(?i)^figure(?:\s*\d+)?$",
-        r"(?i)^schéma(?:\s*\d+)?$",
+        r"(?i)^figs?\.?(?:\s*\d+)?$",
+        r"(?i)^figures?(?:\s*\d+)?$",
+        r"(?i)^schémas?(?:\s*\d+)?$",
        # Alemão
-        r"(?i)^abb\.?(?:\s*\d+)?$",
-        r"(?i)^abbildung(?:\s*\d+)?$",
-        r"(?i)^bild(?:\s*\d+)?$",
+        r"(?i)^abbs?\.?(?:\s*\d+)?$",
+        r"(?i)^abbildungen?(?:\s*\d+)?$",
+        r"(?i)^bilder?(?:\s*\d+)?$",
        # Africâner
-        r"(?i)^fig\.?(?:\s*\d+)?$",
-        r"(?i)^figuur(?:\s*\d+)?$",
-        r"(?i)^beeld(?:\s*\d+)?$"
+        r"(?i)^figs?\.?(?:\s*\d+)?$",
+        r"(?i)^figure?(?:\s*\d+)?$",
+        r"(?i)^beelde?(?:\s*\d+)?$"
    ],
    # Tabelas
    "table": [
        # Inglês
-        r"(?i)^tab\.?(?:\s*\d+)?$",  # tab, tab., Tab 1, TAB. 2
-        r"(?i)^table(?:\s*\d+)?$",  # Table, table 3, TABLE 4
+        r"(?i)^tabs?\.?(?:\s*\d+)?$",  # tab, tabs, tab., tabs. 1, TAB. 2
+        r"(?i)^tables?(?:\s*\d+)?$",  # Table, Tables, table 3, tables 4


The patterns for figures, tables, equations, etc. are updated to include plural forms (e.g., "figs?", "figures?", "tabs?", "tables?"). However, the regex alternation syntax means "figures?" matches "figure" or "figures", while "figs?" matches "fig" or "figs". This is correct, but worth noting that "s?" makes the 's' optional, not "es?".

Copilot · 2026-01-09T12:34:18Z

scielo_classic_website/spsxml/sps_xml_refs.py

+                if reflist_node.find(".//mixed-citation") is None:
+                    mixed_citation = ref.find(".//mixed-citation")
+                    if mixed_citation is not None:
+                        reflist_node.append(mixed_citation)
+                if reflist_node.find(".//element-citation") is None:
+                    element_citation = ref.find(".//element-citation")
+                    if element_citation is not None:


The logic checks if mixed-citation exists in the reflist_node and only appends if missing, but then immediately checks for element-citation independently. This could lead to a ref element containing both mixed-citation and element-citation, which might not be the intended behavior. According to JATS standards, typically a ref should contain either mixed-citation OR element-citation, not both. Consider clarifying this logic or adding validation.

Suggested change

if reflist_node.find(".//mixed-citation") is None:

mixed_citation = ref.find(".//mixed-citation")

if mixed_citation is not None:

reflist_node.append(mixed_citation)

if reflist_node.find(".//element-citation") is None:

element_citation = ref.find(".//element-citation")

if element_citation is not None:

has_mixed = reflist_node.find(".//mixed-citation") is not None

has_element = reflist_node.find(".//element-citation") is not None

# According to JATS, a <ref> should contain either mixed-citation or

# element-citation, but not both. Only attach a citation type when

# the target node has neither, and choose at most one type.

if not has_mixed and not has_element:

mixed_citation = ref.find(".//mixed-citation")

element_citation = ref.find(".//element-citation")

if mixed_citation is not None:

reflist_node.append(mixed_citation)

elif element_citation is not None:

Copilot · 2026-01-09T12:34:18Z

scielo_classic_website/spsxml/sps_xml_body_pipes.py

+            if not element_name:
+                logging.error(
+                    f"XRefAssetTypeImagePipe - no element_name found for label: {label}"
+                )
+                continue
+


The logic error check on line 1321 logs an error if element_name is not found, but then continues to the next iteration. However, the code below on lines 1329-1331 will still use element_name in the XPath and element creation, which could cause issues if it's None. Consider adding an early return or continue after logging the error to prevent using None values.

Copilot · 2026-01-09T12:34:25Z

scielo_classic_website/spsxml/sps_xml_body_pipes.py

+            strip = True
+            for child in p.xpath(".//*"):
+                if child.tag in ("img", "a"):
+                    continue
+                child.tag = "TAGTOSTRIP"
+        ET.strip_tags(root, "TAGTOSTRIP")


The variable name strip on line 2282 is assigned but never used. It's set to True on line 2282 and potentially changed to False on line 2387, but there's no code that actually uses this variable to determine behavior. Either implement the intended logic that uses this variable or remove it.

Copilot · 2026-01-09T12:34:25Z

scielo_classic_website/spsxml/sps_xml_article_meta.py

    def transform(self, data):
        raw, xml = data

        # Encontra elementos corresp em back
-        corresp_elements = xml.xpath(".//back//corresp|.//body//corresp")
-
-        if corresp_elements:
-            # Cria author-notes
+        corresp_elements = xml.xpath(".//corresp")
+        if not corresp_elements:
+            emails = xml.xpath(".//email")
+            if not emails:
+                return data
+            corresp = ET.Element("corresp")
+            for item in emails:
+                corresp.append(item)
+            corresp_elements = [corresp]
+
+        author_notes = xml.xpath(".//author-notes")
+        if not author_notes:
            author_notes = ET.Element("author-notes")
-
-            # Move cada corresp para dentro de author-notes
-            for corresp in corresp_elements:
-                # Remove corresp de sua localização atual
-                parent = corresp.getparent()
-                if parent is not None:
-                    parent.remove(corresp)
-
-                # Adiciona corresp ao author-notes
-                author_notes.append(corresp)
-
            # Insere author-notes como irmão de contrib-group
-            article_meta = xml.find("./front/article-meta")
-            contrib_group = article_meta.find("contrib-group")
-
-            if contrib_group is not None:
-                # Insere author-notes após contrib-group
-                contrib_group_index = list(article_meta).index(contrib_group)
-                article_meta.insert(contrib_group_index + 1, author_notes)
-            else:
-                # Se não há contrib-group, adiciona ao final de article-meta
-                article_meta.append(author_notes)
+            contrib_group = xml.find(".//front//contrib-group")
+            contrib_group.addnext(author_notes)

+        # Move cada corresp para dentro de author-notes
+        for corresp in corresp_elements:
+            # Adiciona corresp ao author-notes
+            author_notes.append(corresp)
+
        return data


The removed precondition checking for corresp elements means this pipe will now always execute, even when there are no corresp or email elements to process. While this may be intentional for simplification, it could impact performance on documents without these elements. Consider documenting this behavior change or adding an early return if there's nothing to process.

Copilot · 2026-01-09T12:34:25Z

scielo_classic_website/htmlbody/html_fixer.py

    """
    if ":" not in tag:
        return False
-    for part in tag.split():
-        if ":" in part.split("=")[0]:
+
+    tag = tag.replace('="', '-ATTRVALUE-BEGIN')
+    tag = tag.replace('"', "END-ATTRVALUE-')")
+    items = tag.split("-ATTRVALUE-")
+    for item in items:
+        if item.startswith("BEGIN") and item.endswith("END"):
+            continue
+        if ":" in item:
            return True
    return False


The function tag_has_namespace has been refactored with string replacements using unconventional delimiters like "-ATTRVALUE-BEGIN". While functional, this approach is fragile and hard to understand. Consider using a more robust parsing approach, such as regex or proper XML attribute parsing, to check for namespaces in tags.

Copilot · 2026-01-09T12:34:26Z

scielo_classic_website/htmlbody/html_body.py

+            logging.exception(e)
+            logging.info((self.score, self.best_choice))
+            self.tree = self.original


The HTMLContent class constructor now has complex exception handling that catches all exceptions and falls back to using the original content. While this provides robustness, it silently swallows all errors except for logging. Consider whether some exceptions should be re-raised or handled differently, especially critical errors that indicate fundamental problems with the input.

Suggested change

logging.exception(e)

logging.info((self.score, self.best_choice))

self.tree = self.original

# Log the error that occurred while trying to build the tree,

# then attempt to fall back to the original content. If that also

# fails, raise a dedicated error to signal a critical failure.

logging.exception(e)

logging.info((self.score, self.best_choice))

try:

self.tree = self.original

except Exception as fallback_exc:

logging.exception(fallback_exc)

raise UnableToGetHTMLTreeError(

"Unable to build HTML tree from either fixed or original content"

) from fallback_exc

Copilot · 2026-01-09T12:34:26Z

scielo_classic_website/htmlbody/html_merger.py

+
+import os
+from lxml import etree, html
+from typing import Dict, Set, Optional, Callable, Tuple


Import of 'Dict' is not used.

Suggested change

from typing import Dict, Set, Optional, Callable, Tuple

from typing import Set, Optional, Callable, Tuple

robertatakenaka added 4 commits December 21, 2025 21:31

Copilot AI review requested due to automatic review settings December 22, 2025 13:48

Copilot started reviewing on behalf of robertatakenaka December 22, 2025 13:49 View session

Copilot AI reviewed Dec 22, 2025

View reviewed changes

robertatakenaka added 14 commits December 22, 2025 10:58

Merge branch 'main' into corrige_conversao_html_para_xml

e14ab07

fix: especifica versão do lxml para garantir compatibilidade

20e897a

fix: corrige XPaths e adiciona suporte a plurais nos padrões

9ebea57

- Corrige XPaths para author-notes, table-fn e elementos especiais - Adiciona padrões no plural para melhor detecção (figs, tables, etc) - Mantém compatibilidade com múltiplos idiomas

fix: ajusta mapeamento de elementos e remove patterns ambíguos

c415f81

- Corrige mapeamento para supplementary-material e fn-group - Remove patterns question/answer por serem muito genéricos - Mantém detecção multilíngue para todos os elementos

feat: adiciona método get_license_text_by_language para facilitar acesso

93ed605

chore: remove import não utilizado de Reference

6aed010

refactor: reorganiza pipeline e adiciona XMLAckPipe

d4c46a2

- Move XMLArticleMetaAuthorNotesPipe para depois de XMLBackPipe - Adiciona XMLAckPipe para mover ack de body para back - Remove comentário de timestamp desnecessário

robertatakenaka requested a review from Copilot January 9, 2026 12:25

Copilot started reviewing on behalf of robertatakenaka January 9, 2026 12:25 View session

Copilot AI reviewed Jan 9, 2026

View reviewed changes

robertatakenaka merged commit 058b9b6 into scieloorg:main Jan 9, 2026
7 checks passed

-        while True:
-            for item in list(xml.xpath(".//*[not(*)]")):
-                self.mark_tag_to_strip(item)
-            if not xml.xpath(".//TOSTRIP"):
-                break
-            ET.strip_tags(xml, "TOSTRIP")
+        # Processa os elementos em ordem inversa (de baixo para cima) para
+        # garantir que filhos vazios sejam removidos antes de seus pais.
+        elements = list(xml.iter())
+        for elem in reversed(elements):
+            # Não tenta remover o elemento raiz.
+            if elem is xml:
+                continue
+            # Aplica os mesmos critérios de seleção de tag que mark_tag_to_strip.
+            if elem.tag not in ("div", "p", "span", "bold", "italic", "sup", "sub"):
+                continue
+            # Verifica se existe algum filho no elemento.
+            if elem.getchildren():
+                continue
+            # Verifica se o texto (incluindo descendentes remanescentes) tem conteúdo.
+            if "".join(elem.itertext()).strip():
+                continue
+            parent = elem.getparent()
+            if parent is not None:
+                parent.remove(elem)

	logging.error(f"InsertGraphicInFigPipe - no graphic found for fig {ET.tostring(item)}")
	logging.warning(f"InsertGraphicInFigPipe - no graphic found for fig {ET.tostring(item)}")

-        logging.info(f"InsertGraphicInTablewrapPipe - feito {ET.tostring(tablewrap)}")
+        tablewrap_str = ET.tostring(tablewrap, encoding="unicode")
+        max_len = 500
+        if len(tablewrap_str) > max_len:
+            tablewrap_str = tablewrap_str[:max_len] + " ... [truncated]"
+        logging.info("InsertGraphicInTablewrapPipe - feito %s", tablewrap_str)

	deleted = self.strip_front_text_by_other_element_name_sublevel(body, "abstract")
	self.strip_front_text_by_other_element_name_sublevel(body, "abstract")

-            logging.exception(e)
-            logging.info((self.score, self.best_choice))
-            self.tree = self.original
+            # Log the error that occurred while trying to build the tree,
+            # then attempt to fall back to the original content. If that also
+            # fails, raise a dedicated error to signal a critical failure.
+            logging.exception(e)
+            logging.info((self.score, self.best_choice))
+            try:
+                self.tree = self.original
+            except Exception as fallback_exc:
+                logging.exception(fallback_exc)
+                raise UnableToGetHTMLTreeError(
+                    "Unable to build HTML tree from either fixed or original content"
+                ) from fallback_exc

	from typing import Dict, Set, Optional, Callable, Tuple
	from typing import Set, Optional, Callable, Tuple

Corrige e melhora conversao html para xml #137

Corrige e melhora conversao html para xml #137

Uh oh!

Conversation

robertatakenaka commented Dec 22, 2025

Pull Request: Otimização do Pipeline de Processamento XML

📋 Resumo

🎯 Objetivos

📝 Alterações Detalhadas

1. sps_xml_article_meta.py - Simplificação de busca de elementos corresp

2. sps_xml_refs.py - Otimização de processamento de citações

3. sps_xml_pipes.py - Aprimoramento de conversão sup→xref

4. sps_xml_body_pipes.py - Reestruturação completa do pipeline HTML→XML

🔧 Tipo de Mudança

✅ Checklist

🧪 Como Testar

📊 Impacto de Performance

🔍 Observações

📚 Referências

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Copilot AI Dec 22, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI Dec 22, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI Dec 22, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI Dec 22, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI Dec 22, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI Dec 22, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI Dec 22, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI Dec 22, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI Dec 22, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI Dec 22, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Copilot AI Jan 9, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Jan 9, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Jan 9, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Jan 9, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Jan 9, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Jan 9, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Jan 9, 2026

Choose a reason for hiding this comment

Uh oh!