Fix HTML-to-XML conversion for front elements #139

robertatakenaka · 2026-01-27T14:24:30Z

Fixes to XML and HTML processing for scientific articles

Description

This PR fixes problems in HTML processing and SPS XML generation, including improvements to HTML content conversion, handling of translated titles, and correct placement of elements in the final XML.

Changes

html_fixer.py

Adds pretty_print=False to the tostring() call so that the HTML conversion does not introduce unwanted line breaks or extra whitespace into the document structure
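
As an illustration of the effect described above, here is a minimal sketch. It assumes that ET is lxml.etree (as the ET.strip_tags calls in the diff suggest) and uses a hypothetical one-line body as input; it is not the code from html_fixer.py.

```python
from lxml import etree as ET  # assumption: the project's ET alias is lxml.etree

# Hypothetical input with no whitespace between the block-level elements.
html = "<body><p>One <b>bold</b> word</p><p>Second</p></body>"
body = ET.fromstring(html)

# Explicit pretty_print=False: the serializer adds no whitespace of its own.
compact = ET.tostring(body, encoding="unicode", pretty_print=False)

# pretty_print=True: the serializer adds line breaks and indentation between
# elements, which later show up as extra text/tail content in the converted XML.
pretty = ET.tostring(body, encoding="unicode", pretty_print=True)

print(compact == html)   # the compact form round-trips the input
print("\n" in pretty)    # the pretty form contains added line breaks
```

Passing the flag explicitly documents the intent at the call site, even where the serializer's default may already be compact.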

sps_xml_article_meta.py

Simplifies the create_node_with_fixed_html_text function by removing the specific handling of bold tags and adding a strip of residual span tags
Adds validation in XMLArticleMetaTranslatedTitleGroupPipe to skip translated titles that have no language or text defined, avoiding errors on incomplete data
Fixes the placement of the author-notes element in XMLArticleMetaAuthorNotesPipe, using a reverse search over sibling elements to insert it at the correct position according to the SPS specification
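
The translated-title validation can be sketched roughly as follows. The function name build_trans_title_groups and the dict-shaped input with "language" and "text" keys are hypothetical, chosen only for illustration; the actual pipe and its data source may look different.

```python
from lxml import etree as ET  # assumption: ET is lxml.etree

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

def build_trans_title_groups(translated_titles):
    """Build <trans-title-group> nodes, skipping incomplete entries."""
    groups = []
    for item in translated_titles:
        # Extract the values first, then validate them before use.
        language = item.get("language")
        text = item.get("text")
        if not language or not text:
            continue  # incomplete record: skip instead of failing
        group = ET.Element("trans-title-group")
        group.set(XML_LANG, language)
        title = ET.SubElement(group, "trans-title")
        title.text = text
        groups.append(group)
    return groups

# Hypothetical records: only the first one is complete.
records = [
    {"language": "en", "text": "A title"},
    {"language": None, "text": "Missing language"},
    {"language": "es", "text": ""},
    {"text": "No language key"},
]
groups = build_trans_title_groups(records)
print(len(groups))
```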

Motivation

These fixes ensure that the generated XML conforms to the SPS schema and that incomplete data is handled properly without causing failures in the processing pipeline.

…formatting - Specifies pretty_print=False in tostring() to keep the original HTML structure without adding line breaks or extra spaces

create_node_with_fixed_html_text:
- Removes the strip of specific bold tags (b, B, bold), keeping only the generic strip
- Adds strip_tags for span, removing residual spans from the HTML

XMLArticleMetaTranslatedTitleGroupPipe:
- Adds validation to skip items without language or text
- Extracts the language and text values into variables before use

XMLArticleMetaAuthorNotesPipe:
- Fixes the placement of author-notes using a reverse search over siblings
- Inserts after the first element found, in the order: aff, contrib-group, title-group, article-categories, article-id

Copilot

Pull request overview

Fixes the HTML→XML conversion pipeline and the placement of elements in article-meta to produce SPS XML that adheres more closely to the schema and is more tolerant of incomplete data.

Changes:

Adjusts the serialization of <body> in the HTML→XML conversion to avoid inserting unwanted whitespace/pretty-printing.
Hardens the handling of translated titles, skipping entries without language or text.
Changes the insertion/move logic for author-notes to try to position it correctly relative to the sibling elements in article-meta.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| `scielo_classic_website/spsxml/sps_xml_article_meta.py` | Adjustments to HTML sanitization in metadata nodes, validation of translated titles, and repositioning of `author-notes`. |
| `scielo_classic_website/htmlbody/html_fixer.py` | Avoids pretty-printing when serializing `<body>` so that no line breaks/indentation are introduced into the converted HTML. |


Copilot · 2026-01-27T14:29:56Z

scielo_classic_website/spsxml/sps_xml_article_meta.py

    ET.strip_tags(node, "b")
    ET.strip_tags(node, "B")
    ET.strip_tags(node, "bold")
+    ET.strip_tags(node, "span")



create_node_with_fixed_html_text is now stripping all <span> tags. This will also remove span[@name='style_italic'|'style_sup'|...] markers that StylePipe later relies on to convert styling into JATS tags, causing loss of inline formatting in titles/collab (regression vs previous behavior where non-bold spans were preserved). Prefer preserving span[@name^='style_'] (or converting them to their target tags) and only stripping truly residual spans.
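
One possible way to follow this suggestion is sketched below, assuming ET is lxml.etree. The helper name strip_residual_spans and the placeholder tag are hypothetical and not part of the PR. Since ET.strip_tags matches by tag name only, the sketch renames the spans that carry no style_ marker and strips just those.

```python
from lxml import etree as ET  # assumption: ET is lxml.etree

# Hypothetical temporary tag name, used only to single out residual spans.
RESIDUAL_TAG = "residual-span"

def strip_residual_spans(node):
    """Strip <span> tags except those whose name attribute starts with 'style_'."""
    residual = node.xpath(".//span[not(starts-with(@name, 'style_'))]")
    for span in residual:
        span.tag = RESIDUAL_TAG
    # strip_tags drops the tags but keeps their text, tail and children.
    ET.strip_tags(node, RESIDUAL_TAG)
    return node

# Hypothetical title mixing a style marker with a leftover span.
node = ET.fromstring(
    '<article-title>A <span name="style_italic">species</span> and '
    '<span class="x">plain</span> text</article-title>'
)
strip_residual_spans(node)
print(ET.tostring(node, encoding="unicode"))
```

With this approach the style_ spans remain available for a later styling step, while unmarked spans are flattened into the surrounding text.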

Copilot · 2026-01-27T14:29:57Z

scielo_classic_website/spsxml/sps_xml_article_meta.py

        author_notes = xml.xpath(".//author-notes")
        if not author_notes:
            author_notes = ET.Element("author-notes")
-            # Insert author-notes as a sibling of contrib-group
-            contrib_group = xml.find(".//front//contrib-group")
-            contrib_group.addnext(author_notes)

        # Move each corresp into author-notes
        for corresp in corresp_elements:
            # Append corresp to author-notes
            author_notes.append(corresp)


author_notes = xml.xpath('.//author-notes') returns a list; when an <author-notes> already exists this code will append corresp to the Python list (not the XML element) and then pass a list into addnext, which will raise at runtime. Select the first matching element (and ideally scope the search to ./front/article-meta/author-notes) or use xml.find() so author_notes is always an Element.
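
A minimal sketch of the difference, assuming ET is lxml.etree. The sample document and the helper get_or_create_author_notes are hypothetical and only illustrate the point that find() returns an Element (or None) while xpath() returns a list.

```python
from lxml import etree as ET  # assumption: ET is lxml.etree

# Hypothetical document that already contains an <author-notes> element.
xml = ET.fromstring(
    "<article><front><article-meta>"
    "<contrib-group/><author-notes/>"
    "</article-meta></front>"
    "<body><p><corresp>Correspondence: a@example.org</corresp></p></body>"
    "</article>"
)

def get_or_create_author_notes(xml):
    """Return the existing <author-notes> Element, or a new detached one."""
    author_notes = xml.find(".//front/article-meta/author-notes")
    if author_notes is None:
        author_notes = ET.Element("author-notes")
    return author_notes

# xpath() yields a list even for a single match; find() yields the Element.
matches = xml.xpath(".//author-notes")
author_notes = get_or_create_author_notes(xml)

# Appending to the Element moves each <corresp> inside <author-notes>.
for corresp in xml.xpath(".//body//corresp"):
    author_notes.append(corresp)
```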

Copilot · 2026-01-27T14:29:57Z

scielo_classic_website/spsxml/sps_xml_article_meta.py

+        siblings = reversed((
+            "article-id",
+            "article-categories",
+            "title-group",
+            "contrib-group",
+            "aff",
+        ))
+        for sibling_tag in siblings:
+            sibling = xml.find(f".//front/article-meta/{sibling_tag}")
+            if sibling is not None:
+                sibling.addnext(author_notes)
+                break


If none of the listed sibling tags are found under front/article-meta, author_notes is never attached to the document (it remains a detached element), so any collected corresp will be dropped from the output. Consider falling back to appending author_notes directly under ./front/article-meta when no insertion point is found, and/or guarding against inserting a second time when it already has a parent.
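
The fallback and the guard could look roughly like this, assuming ET is lxml.etree. The function attach_author_notes is a hypothetical helper written for illustration, not the code merged in the PR.

```python
from lxml import etree as ET  # assumption: ET is lxml.etree

# Same candidate siblings as in the diff, in document order.
SIBLING_TAGS = (
    "article-id",
    "article-categories",
    "title-group",
    "contrib-group",
    "aff",
)

def attach_author_notes(xml, author_notes):
    """Attach author_notes under front/article-meta; return True if attached."""
    if author_notes.getparent() is not None:
        return True  # already in the tree: do not insert a second time
    article_meta = xml.find(".//front/article-meta")
    if article_meta is None:
        return False  # nowhere to attach; the caller decides what to do
    # Reverse search: prefer the latest sibling in the expected order.
    for tag in reversed(SIBLING_TAGS):
        sibling = article_meta.find(tag)
        if sibling is not None:
            sibling.addnext(author_notes)
            return True
    # Fallback: no known sibling found, so append under article-meta.
    article_meta.append(author_notes)
    return True
```

Returning a boolean lets the caller log or handle the case where article-meta itself is missing, instead of silently dropping the collected corresp elements.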

robertatakenaka added 2 commits January 27, 2026 11:22

Adds pretty_print=False to the HTML conversion to avoid unwanted …

185be44

…formatting - Specifies pretty_print=False in tostring() to keep the original HTML structure without adding line breaks or extra spaces

Copilot AI review requested due to automatic review settings January 27, 2026 14:24

Copilot started reviewing on behalf of robertatakenaka January 27, 2026 14:24

Copilot AI reviewed Jan 27, 2026


robertatakenaka merged commit 174575d into scieloorg:main Jan 28, 2026
7 checks passed
