-
Notifications
You must be signed in to change notification settings - Fork 3
Correção do processamento de breaks e parágrafos na conversão HTML→XML #124
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Correção do processamento de breaks e parágrafos na conversão HTML→XML #124
Conversation
…organização - Divide convert_html_to_xml_step_2 em três etapas distintas (2_a, 2_b, 2_c) - step_2_a: conversões básicas HTML para XML (normalização, símbolos, listas, links) - step_2_b: correção de estrutura de parágrafos e formatação - step_2_c: processamento específico de links href - Adiciona documentação detalhada com docstrings numpy format para todas as funções - Implementa novos pipes: - FixParagraphsAndBreaksPipe: corrige ausência de parágrafos e processa double breaks - SizeAttributePipe: converte atributos size em tags title - RemoveEmptyTagPipe: generaliza remoção de tags vazias (p, sec) - Melhora pipes existentes: - ReplaceBrByPPipe: refatora com métodos auxiliares _split_content_by_breaks - AHrefPipe: otimiza verificação de journal_acron - XRefSpecialInternalLinkPipe: adiciona early return para otimização - Corrige uso de parent.remove() por strip_tags em: - RemoveEmptyRefTagPipe - RemoveExcedingBreakTagPipe - Renomeia funções para nomes mais descritivos: - step_0 -> step_0_insert_html_in_cdata - step_1 -> step_1_remove_cdata
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR refactors the HTML-to-XML conversion pipeline to better handle <br> tags and paragraph structure in legacy HTML documents. The main change splits the monolithic convert_html_to_xml_step_2 into three specialized steps (2_a, 2_b, 2_c) for better control and introduces intelligent logic to distinguish between double breaks representing paragraph separators versus single breaks representing line breaks.
Key Changes:
- Refactored
step_2into three focused substeps:step_2_a(basic HTML→XML conversion),step_2_b(paragraph structure correction), andstep_2_c(link processing) - Introduced
FixParagraphsAndBreaksPipeto intelligently detect and convert double breaks (<br><br>) into paragraph separators - Added
SizeAttributePipeto identify titles based on font size attributes
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| if size[0] == "-": | ||
| continue | ||
| s = size[0] | ||
| if s[0] == "+": | ||
| s = s[1:] |
Copilot
AI
Nov 24, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In line 994, accessing size[0] could raise an IndexError if size is an empty string. After that, on line 997, you access s[0] again which would fail if s is also empty. Consider adding a length check before accessing by index.
| if size[0] == "-": | |
| continue | |
| s = size[0] | |
| if s[0] == "+": | |
| s = s[1:] | |
| if not size or len(size) == 0: | |
| continue | |
| if size[0] == "-": | |
| continue | |
| s = size[0] | |
| if s == "+" and len(size) > 1: | |
| s = size[1] |
| @@ -721,16 +979,31 @@ class CreateStyleTagFromAttributePipe(plumber.Pipe): | |||
| def transform(self, data): | |||
| raw, xml = data | |||
| for node in xml.xpath(".//*[@style]"): | |||
Copilot
AI
Nov 24, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The removal of the check if not node.get("style"): continue could cause issues. If node.get("style") returns None (when the attribute doesn't exist or is empty), then the loop for style in ("italic", "sup", "sub", "bold", "underline") will check if style in None which will raise a TypeError. The original guard was necessary to prevent this error.
| for node in xml.xpath(".//*[@style]"): | |
| for node in xml.xpath(".//*[@style]"): | |
| if not node.get("style"): | |
| continue |
| if (break_node.tail or "").strip(): | ||
| continue | ||
| next_sibling = break_node.getnext() | ||
| if next_sibling.tail and next_sibling.tail.strip(): |
Copilot
AI
Nov 24, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the mark_double_breaks method, you're checking if (break_node.tail or "").strip() on line 838 and if next_sibling.tail and next_sibling.tail.strip() on line 841. However, the logic seems incomplete - if the first break has no text in its tail (empty or whitespace only), you proceed, but if the next sibling has text in its tail, you skip it with continue. This means the first break gets marked as DOUBLEBREAK but its sibling doesn't get marked as TOREMOVE, creating an inconsistent state. Consider whether both conditions should result in a continue or if the logic needs adjustment.
| if (break_node.tail or "").strip(): | |
| continue | |
| next_sibling = break_node.getnext() | |
| if next_sibling.tail and next_sibling.tail.strip(): | |
| next_sibling = break_node.getnext() | |
| if ((break_node.tail or "").strip()) or (next_sibling.tail and next_sibling.tail.strip()): |
| item.tag = "REMOVE_EMPTY_REF_TAG" | ||
| ET.strip_tags(xml, "REMOVE_EMPTY_REF_TAG") |
Copilot
AI
Nov 24, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[nitpick] The removal pattern using ET.strip_tags() is more consistent and safer than using parent.remove(). However, since strip_tags() removes the tag but keeps its content (text and tail), you should verify that removing the <ref> tag but keeping its tail doesn't cause issues. If the ref is truly empty (no text), this should be fine, but consider if there are edge cases where an empty ref might have a meaningful tail.
| item.tag = "REMOVE_EMPTY_REF_TAG" | |
| ET.strip_tags(xml, "REMOVE_EMPTY_REF_TAG") | |
| parent = item.getparent() | |
| if parent is not None: | |
| # Remove the tail if present | |
| item.tail = None | |
| parent.remove(item) |
| self.fix_paragraph_absence(parent) | ||
|
|
||
| def fix_paragraph_absence(self, parent): | ||
| # troca break duplo por DOUBLEBREAK |
Copilot
AI
Nov 24, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[nitpick] The comment "troca break duplo por DOUBLEBREAK" (replaces double break with DOUBLEBREAK) is somewhat misleading. The code actually marks the first break as DOUBLEBREAK and the second as TOREMOVE, then strips TOREMOVE tags. Consider updating the comment to reflect the actual two-step process.
| # troca break duplo por DOUBLEBREAK | |
| # Marca o primeiro break duplo como DOUBLEBREAK e o segundo como TOREMOVE, depois remove TOREMOVE |
|
|
||
| journal_acron = raw.journal and raw.journal.acronym | ||
| if journal_acron: | ||
| journal_acron = f"/{journal_acron}/" |
Copilot
AI
Nov 24, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The change from f"/{journal_acron}/" in the href check to just journal_acron in the href check, combined with now adding the slashes when setting journal_acron (line 1156), changes the behavior. Previously, the pattern f"/{journal_acron}/" in href would match /acronym/ anywhere in the href. Now, line 1139 checks if journal_acron (which includes slashes) is in href.lower(), but journal_acron itself is not lowercased. This could cause case-sensitivity issues where journal_acron = "/ABC/" won't match href = "example.com/abc/page.html". Consider applying .lower() to journal_acron as well.
| journal_acron = f"/{journal_acron}/" | |
| journal_acron = f"/{journal_acron.lower()}/" |
|
|
||
| def mark_empty_tag(self, node): | ||
| # Verifica se existe algum filho no node. | ||
| if node.getchildren(): |
Copilot
AI
Nov 24, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The method getchildren() is deprecated in lxml and will be removed in future versions. Use list(node) or len(node) instead. This applies to line 1542 where you check if node.getchildren(): - replace with if len(node) > 0: or if list(node):.
| if node.getchildren(): | |
| if len(node) > 0: |
| for node in parent.xpath(".//span[DOUBLEBREAK]"): | ||
| children = node.xpath(".//*") | ||
| if len(children) != 1: |
Copilot
AI
Nov 24, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The comment "substitui span cujo único filho é doublebreak por p, remove doublebreak" (replaces span whose only child is doublebreak with p, removes doublebreak) is clearer than the function name. However, the implementation checks if len(children) != 1: which verifies there's exactly one child, but doesn't verify it's specifically a DOUBLEBREAK. The XPath .//span[DOUBLEBREAK] finds spans containing DOUBLEBREAK but doesn't guarantee it's the only child. Consider either updating the XPath to .//span[count(*)=1 and DOUBLEBREAK] or adjusting the comment.
| for node in parent.xpath(".//span[DOUBLEBREAK]"): | |
| children = node.xpath(".//*") | |
| if len(children) != 1: | |
| for node in parent.xpath(".//span[count(*)=1 and *[1][self::DOUBLEBREAK]]"): | |
| children = node.getchildren() | |
| if len(children) != 1 or children[0].tag != "DOUBLEBREAK": |
| try: | ||
| if int(s) > 1: | ||
| node.tag = "title" | ||
| except (ValueError, TypeError): |
Copilot
AI
Nov 24, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
'except' clause does nothing but pass and there is no explanatory comment.
| except (ValueError, TypeError): | |
| except (ValueError, TypeError): | |
| # Ignore non-integer or malformed size values; not all 'size' attributes are expected to be valid integers. |
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
PR: Correção do processamento de breaks e parágrafos na conversão HTML→XML
📋 Descrição
Esta PR corrige o processamento inadequado de tags
<br>durante a conversão HTML→XML, implementando lógica mais inteligente para identificar quando breaks representam novos parágrafos versus quebras de linha simples.Esta correção é necessária para inserir corretamenta as figuras / tabelas representadas no html original como link para arquivos. A conversão html para xml, troca o links por xref + o elemento correspondente com graphic do arquivo.
🎯 Problema
O HTML legado frequentemente usa
<br>onde deveria usar<p>, especialmente:<br><br>) para separar parágrafos<span>ou<font>seguidos de breaks que deveriam ser parágrafosPorém, nem todo
<br>representa um novo parágrafo - alguns são quebras de linha legítimas dentro do conteúdo.✨ Solução implementada
🔄 Refatoração do pipeline step_2
Dividido em três etapas especializadas para melhor controle do processamento:
step_2_a: Conversões básicas HTML→XML (preserva estrutura original)step_2_b: [FOCO PRINCIPAL] Correção inteligente de parágrafos:FixParagraphsAndBreaksPipecom lógica específicaSizeAttributePipepara identificar títulosstep_2_c: Processamento final de links🆕 Novo pipe:
FixParagraphsAndBreaksPipeImplementa detecção inteligente de parágrafos:
<br><br>) como separadores de parágrafo<p><p>ou<sec>conforme contexto🔧 Melhorias no
ReplaceBrByPPipeRefatorado para processar apenas casos com múltiplos breaks:
<br>(quebra simples)📊 Casos de uso cobertos
texto<br><br>outro texto<p>texto</p><p>outro texto</p><span>título</span><br><br>conteúdo<p>título</p><p>conteúdo</p>linha 1<br>linha 2<p>linha 1<break/>linha 2</p><font size="+2">Seção</font><br><br><sec><title>Seção</title></sec>🐛 Correções adicionais
strip_tags()ao invés deparent.remove()🧪 Validação
<br>incorretamente no lugar de<p><p>) não são alteradas💡 Impacto
Esta correção resolve problemas históricos de documentos HTML mal estruturados, garantindo que o XML resultante tenha estrutura semântica correta de parágrafos, mantendo a flexibilidade para quebras de linha legítimas.
📝 Arquivos modificados
scielo_classic_website/spsxml/sps_xml_body_pipes.py