Skip to content

Conversation

@robertatakenaka
Copy link
Member

Refatoração do pipeline de conversão HTML→XML

O que mudou

Consolidação e otimização

  • Unificou InsertGraphicInFigPipe e InsertGraphicInTableWrapPipe em uma única classe com lógica mais robusta
  • Simplificou detecção de inline graphics com método is_graphic_inline() mais claro
  • Removeu 4 pipes não utilizados (código morto)

Reorganização

  • Renomeou funções do pipeline com numeração sequencial (step_10 a step_90)
  • ~300 linhas removidas (-18% do arquivo)

Impacto

  • ✅ Mesma funcionalidade, código mais limpo
  • ✅ Performance melhorada (menos iterações no XML)
  • ✅ Mais fácil de manter

Testes

  • Validado com artigos contendo figuras, tabelas e inline graphics
  • Output XML idêntico ao anterior

Breaking changes: Não | Tipo: Refatoração

MUDANÇAS FUNCIONAIS:

- Consolida InsertGraphicInFigPipe e InsertGraphicInTableWrapPipe em
  InsertGraphicInFigOrTablewrap unificado com lógica mais robusta
  - Implementa busca hierárquica de graphics em até 3 níveis
  - Adiciona limpeza automática de siblings vazios antes da busca
  - Unifica processamento para fig e table-wrap em método único

- Simplifica InlineGraphicPipe com método is_graphic_inline()
  - Lógica mais clara para determinar se graphic é inline
  - Elimina processamento redundante

- Remove 4 pipes não utilizados que eram código morto:
  - RemoveParentPTagOfGraphicPipe
  - InsertCaptionAndTitleInTableWrapPipe
  - InsertTableWrapFootInTableWrapPipe
  - AlternativesGraphicPipe

MUDANÇAS ORGANIZACIONAIS (sem impacto na lógica):

- Renomeia funções do pipeline com numeração sequencial (10-90):
  - step_0_insert_html_in_cdata → step_10_insert_html_in_cdata
  - step_1_remove_cdata → step_20_remove_cdata
  - step_1_embed_html → step_30_embed_html
  - step_2_html_to_xml → step_40_basic_html_to_xml
  - step_2_b → step_50_break_to_p
  - step_2_c → step_60_ahref_and_aname
  - step_3/4 → step_70_complete_fig_and_tablewrap
  - step_fix_body → step_80_fix_sec
  - complete_disp_formula → step_90_complete_disp_formula
Copilot AI review requested due to automatic review settings December 3, 2025 21:04
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR refactors the HTML→XML conversion pipeline for handling figures and table-wraps in SciELO XML processing. The refactoring consolidates duplicate logic and renames pipeline steps for better clarity, but introduces several critical bugs that need to be addressed before merging.

Key Changes:

  • Unified InsertGraphicInFigPipe and InsertGraphicInTableWrapPipe into a single InsertGraphicInFigOrTablewrap class
  • Simplified InlineGraphicPipe with a clearer is_graphic_inline() method
  • Renamed pipeline step functions with decimal numbering (step_10, step_20, etc.) for better organization
  • Removed approximately 300 lines of unused code

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

previous_tail = (previous.tail or "").strip()
if previous_tail:
return True

Copy link

Copilot AI Dec 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nitpick] Trailing whitespace on this line. Python style guides (PEP 8) recommend removing trailing whitespace for cleaner code.

Suggested change

Copilot uses AI. Check for mistakes.
text = graphic.getparent().text
if text and text.strip():
return True

Copy link

Copilot AI Dec 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nitpick] Trailing whitespace on this line. Python style guides (PEP 8) recommend removing trailing whitespace for cleaner code.

Suggested change

Copilot uses AI. Check for mistakes.
Comment on lines +1408 to +1413
def wrap_graphic(self, fig_or_tablewrap):
item = fig_or_tablewrap
graphic = self.find_graphic(item)
if graphic is None:
return
if node.find("graphic") is not None:
fig_or_tablewrap.append(graphic)
Copy link

Copilot AI Dec 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The wrap_graphic method has a potential issue: it checks if the element already contains a graphic or table element (implied by the original code logic) but the new implementation doesn't.

Looking at the original InsertGraphicInFigPipe.parser_node, it had early returns for:

  • if node.find("table") is not None: return
  • if node.find("graphic") is not None: return

These checks prevent overwriting existing graphics or processing table-wrap that already has a table. The new implementation lacks these safeguards, which could lead to duplicate graphics being added or incorrect behavior.

Copilot uses AI. Check for mistakes.
Documento com referências cruzadas e gráficos inline processados
"""
def convert_html_to_xml_step_70_complete_fig_and_tablewrap(document):

Copy link

Copilot AI Dec 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing docstring for convert_html_to_xml_step_70_complete_fig_and_tablewrap. All other step functions in the pipeline have comprehensive docstrings explaining their purpose, parameters, and return values. This function should follow the same documentation pattern for consistency and maintainability.

Suggested change
"""
Quarta etapa de conversão focada em processamento de figuras e tabelas.
Processa elementos gráficos e insere-os nas estruturas apropriadas:
- Identifica e processa elementos gráficos inline
- Insere gráficos em elementos <fig> ou <table-wrap> conforme necessário
Parameters
----------
document : Document
Documento processado pela etapa anterior (com links e parágrafos tratados)
Returns
-------
Document
Documento com figuras e tabelas corretamente estruturadas
"""

Copilot uses AI. Check for mistakes.
return
if node.find("graphic") is not None:
fig_or_tablewrap.append(graphic)

Copy link

Copilot AI Dec 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nitpick] Trailing whitespace on this line. Python style guides (PEP 8) recommend removing trailing whitespace for cleaner code.

Suggested change

Copilot uses AI. Check for mistakes.
Comment on lines +1443 to +1452
def go_up(self, node):
if node.tag == "body":
return None
parent = node.getparent()

while True:
grand_parent = parent.getparent()
if grand_parent is None:
if parent is None:
return None
if parent.tag == "body":
return None
# parent sibling
return parent.getnext()
Copy link

Copilot AI Dec 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The go_up method has incorrect logic. It checks if node.tag == "body" before getting the parent, which will never be true since we're checking the current node's tag. Then it returns parent.getnext() instead of the parent itself. This breaks the traversal logic.

Expected behavior: Should return the parent node (not parent's sibling) to move up the tree hierarchy. The body check should be on the parent, not the current node.

Suggested fix:

def go_up(self, node):
    parent = node.getparent()
    if parent is None:
        return None
    if parent.tag == "body":
        return None
    return parent

Copilot uses AI. Check for mistakes.
Comment on lines +1417 to +1420
for i in range(3):
if item is None:
# item is not valid
break
Copy link

Copilot AI Dec 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The loop for i in range(3) creates a magic number without explanation. The original code didn't have this limitation - it would traverse up the tree until reaching body or running out of parents. Limiting to 3 iterations could prevent finding graphics that are more than 3 levels away in the tree hierarchy.

Consider either:

  1. Removing the arbitrary limit and using a while loop with proper termination conditions
  2. Adding a comment explaining why 3 levels is sufficient for this use case
Suggested change
for i in range(3):
if item is None:
# item is not valid
break
while item is not None and getattr(item, 'tag', None) != "body":

Copilot uses AI. Check for mistakes.
Comment on lines +353 to 354

# logging.info("convert_html_to_xml - step 3")
Copy link

Copilot AI Dec 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nitpick] Incomplete or orphaned comment line. The single commented-out logging line # logging.info("convert_html_to_xml - step 3") appears without context or corresponding docstring, making it unclear what this step does.

Consider either:

  1. Removing this orphaned comment since step numbering has changed
  2. Adding a proper docstring to document what this function does (complete fig and table-wrap processing)
Suggested change
# logging.info("convert_html_to_xml - step 3")
"""
Quarta etapa de conversão focada em processamento completo de figuras e tabelas.
Processa elementos gráficos e tabelas:
- Insere gráficos inline em elementos fig ou table-wrap
- Garante que figuras e tabelas estejam corretamente estruturadas no XML
Parameters
----------
document : Document
Documento processado pela etapa anterior
Returns
-------
Document
Documento com figuras e tabelas processadas
"""

Copilot uses AI. Check for mistakes.
Documento com referências cruzadas e gráficos inline processados
"""
def convert_html_to_xml_step_70_complete_fig_and_tablewrap(document):

Copy link

Copilot AI Dec 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nitpick] Trailing whitespace on this line. Python style guides (PEP 8) recommend removing trailing whitespace for cleaner code.

Suggested change

Copilot uses AI. Check for mistakes.
return None
# parent sibling
return parent.getnext()

Copy link

Copilot AI Dec 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nitpick] Trailing whitespace on this line. Python style guides (PEP 8) recommend removing trailing whitespace for cleaner code.

Suggested change

Copilot uses AI. Check for mistakes.
@robertatakenaka robertatakenaka merged commit 242bbf3 into scieloorg:main Dec 3, 2025
7 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant