Skip to content

Conversation

@robertatakenaka
Copy link
Member

PR: Correção do processamento de breaks e parágrafos na conversão HTML→XML

📋 Descrição

Esta PR corrige o processamento inadequado de tags <br> durante a conversão HTML→XML, implementando lógica mais inteligente para identificar quando breaks representam novos parágrafos versus quebras de linha simples.
Esta correção é necessária para inserir corretamenta as figuras / tabelas representadas no html original como link para arquivos. A conversão html para xml, troca o links por xref + o elemento correspondente com graphic do arquivo.

🎯 Problema

O HTML legado frequentemente usa <br> onde deveria usar <p>, especialmente:

  • Double breaks (<br><br>) para separar parágrafos
  • Elementos <span> ou <font> seguidos de breaks que deveriam ser parágrafos
  • Conteúdo sem estrutura de parágrafos adequada

Porém, nem todo <br> representa um novo parágrafo - alguns são quebras de linha legítimas dentro do conteúdo.

✨ Solução implementada

🔄 Refatoração do pipeline step_2

Dividido em três etapas especializadas para melhor controle do processamento:

  • step_2_a: Conversões básicas HTML→XML (preserva estrutura original)
  • step_2_b: [FOCO PRINCIPAL] Correção inteligente de parágrafos:
    • Implementa FixParagraphsAndBreaksPipe com lógica específica
    • Adiciona SizeAttributePipe para identificar títulos
  • step_2_c: Processamento final de links

🆕 Novo pipe: FixParagraphsAndBreaksPipe

Implementa detecção inteligente de parágrafos:

  1. Identifica double breaks (<br><br>) como separadores de parágrafo
  2. Converte spans seguidos de double breaks em <p>
  3. Converte fonts seguidos de breaks em <p> ou <sec> conforme contexto
  4. Preserva breaks simples dentro do conteúdo quando apropriado
def fix_paragraphs_and_breaks(self, parent):
    if parent.xpath(".//p"):
        return  # Já tem estrutura de parágrafos
    
    self.mark_double_breaks(parent)  # <br><br> → indicador de parágrafo
    self.replace_span_followed_by_break_by_p(parent)
    self.replace_font_followed_by_break_by_p_or_sec(parent)

🔧 Melhorias no ReplaceBrByPPipe

Refatorado para processar apenas casos com múltiplos breaks:

  • Ignora elementos com apenas um <br> (quebra simples)
  • Divide conteúdo apenas quando há padrão claro de separação

📊 Casos de uso cobertos

Entrada HTML Saída XML Justificativa
texto<br><br>outro texto <p>texto</p><p>outro texto</p> Double break = novo parágrafo
<span>título</span><br><br>conteúdo <p>título</p><p>conteúdo</p> Span + double break = parágrafo
linha 1<br>linha 2 <p>linha 1<break/>linha 2</p> Break simples = quebra de linha
<font size="+2">Seção</font><br><br> <sec><title>Seção</title></sec> Font grande + break = seção

🐛 Correções adicionais

  • Uso correto de strip_tags() ao invés de parent.remove()
  • Otimizações de performance em verificações desnecessárias
  • Documentação detalhada de todo o pipeline

🧪 Validação

  • Testado com documentos que usam <br> incorretamente no lugar de <p>
  • Verificado que breaks simples dentro de parágrafos são preservados
  • Confirmado que estruturas já corretas (com <p>) não são alteradas
  • Double breaks são corretamente identificados como separadores

💡 Impacto

Esta correção resolve problemas históricos de documentos HTML mal estruturados, garantindo que o XML resultante tenha estrutura semântica correta de parágrafos, mantendo a flexibilidade para quebras de linha legítimas.

📝 Arquivos modificados

  • scielo_classic_website/spsxml/sps_xml_body_pipes.py

…organização

- Divide convert_html_to_xml_step_2 em três etapas distintas (2_a, 2_b, 2_c)
  - step_2_a: conversões básicas HTML para XML (normalização, símbolos, listas, links)
  - step_2_b: correção de estrutura de parágrafos e formatação
  - step_2_c: processamento específico de links href

- Adiciona documentação detalhada com docstrings numpy format para todas as funções

- Implementa novos pipes:
  - FixParagraphsAndBreaksPipe: corrige ausência de parágrafos e processa double breaks
  - SizeAttributePipe: converte atributos size em tags title
  - RemoveEmptyTagPipe: generaliza remoção de tags vazias (p, sec)

- Melhora pipes existentes:
  - ReplaceBrByPPipe: refatora com métodos auxiliares _split_content_by_breaks
  - AHrefPipe: otimiza verificação de journal_acron
  - XRefSpecialInternalLinkPipe: adiciona early return para otimização

- Corrige uso de parent.remove() por strip_tags em:
  - RemoveEmptyRefTagPipe
  - RemoveExcedingBreakTagPipe

- Renomeia funções para nomes mais descritivos:
  - step_0 -> step_0_insert_html_in_cdata
  - step_1 -> step_1_remove_cdata
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR refactors the HTML-to-XML conversion pipeline to better handle <br> tags and paragraph structure in legacy HTML documents. The main change splits the monolithic convert_html_to_xml_step_2 into three specialized steps (2_a, 2_b, 2_c) for better control and introduces intelligent logic to distinguish between double breaks representing paragraph separators versus single breaks representing line breaks.

Key Changes:

  • Refactored step_2 into three focused substeps: step_2_a (basic HTML→XML conversion), step_2_b (paragraph structure correction), and step_2_c (link processing)
  • Introduced FixParagraphsAndBreaksPipe to intelligently detect and convert double breaks (<br><br>) into paragraph separators
  • Added SizeAttributePipe to identify titles based on font size attributes

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines 994 to 998
if size[0] == "-":
continue
s = size[0]
if s[0] == "+":
s = s[1:]
Copy link

Copilot AI Nov 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In line 994, accessing size[0] could raise an IndexError if size is an empty string. After that, on line 997, you access s[0] again which would fail if s is also empty. Consider adding a length check before accessing by index.

Suggested change
if size[0] == "-":
continue
s = size[0]
if s[0] == "+":
s = s[1:]
if not size or len(size) == 0:
continue
if size[0] == "-":
continue
s = size[0]
if s == "+" and len(size) > 1:
s = size[1]

Copilot uses AI. Check for mistakes.
@@ -721,16 +979,31 @@ class CreateStyleTagFromAttributePipe(plumber.Pipe):
def transform(self, data):
raw, xml = data
for node in xml.xpath(".//*[@style]"):
Copy link

Copilot AI Nov 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The removal of the check if not node.get("style"): continue could cause issues. If node.get("style") returns None (when the attribute doesn't exist or is empty), then the loop for style in ("italic", "sup", "sub", "bold", "underline") will check if style in None which will raise a TypeError. The original guard was necessary to prevent this error.

Suggested change
for node in xml.xpath(".//*[@style]"):
for node in xml.xpath(".//*[@style]"):
if not node.get("style"):
continue

Copilot uses AI. Check for mistakes.
Comment on lines +838 to +841
if (break_node.tail or "").strip():
continue
next_sibling = break_node.getnext()
if next_sibling.tail and next_sibling.tail.strip():
Copy link

Copilot AI Nov 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the mark_double_breaks method, you're checking if (break_node.tail or "").strip() on line 838 and if next_sibling.tail and next_sibling.tail.strip() on line 841. However, the logic seems incomplete - if the first break has no text in its tail (empty or whitespace only), you proceed, but if the next sibling has text in its tail, you skip it with continue. This means the first break gets marked as DOUBLEBREAK but its sibling doesn't get marked as TOREMOVE, creating an inconsistent state. Consider whether both conditions should result in a continue or if the logic needs adjustment.

Suggested change
if (break_node.tail or "").strip():
continue
next_sibling = break_node.getnext()
if next_sibling.tail and next_sibling.tail.strip():
next_sibling = break_node.getnext()
if ((break_node.tail or "").strip()) or (next_sibling.tail and next_sibling.tail.strip()):

Copilot uses AI. Check for mistakes.
Comment on lines +1604 to +1605
item.tag = "REMOVE_EMPTY_REF_TAG"
ET.strip_tags(xml, "REMOVE_EMPTY_REF_TAG")
Copy link

Copilot AI Nov 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nitpick] The removal pattern using ET.strip_tags() is more consistent and safer than using parent.remove(). However, since strip_tags() removes the tag but keeps its content (text and tail), you should verify that removing the <ref> tag but keeping its tail doesn't cause issues. If the ref is truly empty (no text), this should be fine, but consider if there are edge cases where an empty ref might have a meaningful tail.

Suggested change
item.tag = "REMOVE_EMPTY_REF_TAG"
ET.strip_tags(xml, "REMOVE_EMPTY_REF_TAG")
parent = item.getparent()
if parent is not None:
# Remove the tail if present
item.tail = None
parent.remove(item)

Copilot uses AI. Check for mistakes.
self.fix_paragraph_absence(parent)

def fix_paragraph_absence(self, parent):
# troca break duplo por DOUBLEBREAK
Copy link

Copilot AI Nov 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nitpick] The comment "troca break duplo por DOUBLEBREAK" (replaces double break with DOUBLEBREAK) is somewhat misleading. The code actually marks the first break as DOUBLEBREAK and the second as TOREMOVE, then strips TOREMOVE tags. Consider updating the comment to reflect the actual two-step process.

Suggested change
# troca break duplo por DOUBLEBREAK
# Marca o primeiro break duplo como DOUBLEBREAK e o segundo como TOREMOVE, depois remove TOREMOVE

Copilot uses AI. Check for mistakes.

journal_acron = raw.journal and raw.journal.acronym
if journal_acron:
journal_acron = f"/{journal_acron}/"
Copy link

Copilot AI Nov 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The change from f"/{journal_acron}/" in the href check to just journal_acron in the href check, combined with now adding the slashes when setting journal_acron (line 1156), changes the behavior. Previously, the pattern f"/{journal_acron}/" in href would match /acronym/ anywhere in the href. Now, line 1139 checks if journal_acron (which includes slashes) is in href.lower(), but journal_acron itself is not lowercased. This could cause case-sensitivity issues where journal_acron = "/ABC/" won't match href = "example.com/abc/page.html". Consider applying .lower() to journal_acron as well.

Suggested change
journal_acron = f"/{journal_acron}/"
journal_acron = f"/{journal_acron.lower()}/"

Copilot uses AI. Check for mistakes.

def mark_empty_tag(self, node):
# Verifica se existe algum filho no node.
if node.getchildren():
Copy link

Copilot AI Nov 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The method getchildren() is deprecated in lxml and will be removed in future versions. Use list(node) or len(node) instead. This applies to line 1542 where you check if node.getchildren(): - replace with if len(node) > 0: or if list(node):.

Suggested change
if node.getchildren():
if len(node) > 0:

Copilot uses AI. Check for mistakes.
Comment on lines +858 to +860
for node in parent.xpath(".//span[DOUBLEBREAK]"):
children = node.xpath(".//*")
if len(children) != 1:
Copy link

Copilot AI Nov 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment "substitui span cujo único filho é doublebreak por p, remove doublebreak" (replaces span whose only child is doublebreak with p, removes doublebreak) is clearer than the function name. However, the implementation checks if len(children) != 1: which verifies there's exactly one child, but doesn't verify it's specifically a DOUBLEBREAK. The XPath .//span[DOUBLEBREAK] finds spans containing DOUBLEBREAK but doesn't guarantee it's the only child. Consider either updating the XPath to .//span[count(*)=1 and DOUBLEBREAK] or adjusting the comment.

Suggested change
for node in parent.xpath(".//span[DOUBLEBREAK]"):
children = node.xpath(".//*")
if len(children) != 1:
for node in parent.xpath(".//span[count(*)=1 and *[1][self::DOUBLEBREAK]]"):
children = node.getchildren()
if len(children) != 1 or children[0].tag != "DOUBLEBREAK":

Copilot uses AI. Check for mistakes.
try:
if int(s) > 1:
node.tag = "title"
except (ValueError, TypeError):
Copy link

Copilot AI Nov 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

'except' clause does nothing but pass and there is no explanatory comment.

Suggested change
except (ValueError, TypeError):
except (ValueError, TypeError):
# Ignore non-integer or malformed size values; not all 'size' attributes are expected to be valid integers.

Copilot uses AI. Check for mistakes.
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@robertatakenaka robertatakenaka merged commit 73b433a into scieloorg:main Nov 24, 2025
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant