-
Notifications
You must be signed in to change notification settings - Fork 3
refactor: Revisa pipeline de conversão HTML para XML de fig e table-wrap #131
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
refactor: Revisa pipeline de conversão HTML para XML de fig e table-wrap #131
Conversation
MUDANÇAS FUNCIONAIS: - Consolida InsertGraphicInFigPipe e InsertGraphicInTableWrapPipe em InsertGraphicInFigOrTablewrap unificado com lógica mais robusta - Implementa busca hierárquica de graphics em até 3 níveis - Adiciona limpeza automática de siblings vazios antes da busca - Unifica processamento para fig e table-wrap em método único - Simplifica InlineGraphicPipe com método is_graphic_inline() - Lógica mais clara para determinar se graphic é inline - Elimina processamento redundante - Remove 4 pipes não utilizados que eram código morto: - RemoveParentPTagOfGraphicPipe - InsertCaptionAndTitleInTableWrapPipe - InsertTableWrapFootInTableWrapPipe - AlternativesGraphicPipe MUDANÇAS ORGANIZACIONAIS (sem impacto na lógica): - Renomeia funções do pipeline com numeração sequencial (10-90): - step_0_insert_html_in_cdata → step_10_insert_html_in_cdata - step_1_remove_cdata → step_20_remove_cdata - step_1_embed_html → step_30_embed_html - step_2_html_to_xml → step_40_basic_html_to_xml - step_2_b → step_50_break_to_p - step_2_c → step_60_ahref_and_aname - step_3/4 → step_70_complete_fig_and_tablewrap - step_fix_body → step_80_fix_sec - complete_disp_formula → step_90_complete_disp_formula
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR refactors the HTML→XML conversion pipeline for handling figures and table-wraps in SciELO XML processing. The refactoring consolidates duplicate logic and renames pipeline steps for better clarity, but introduces several critical bugs that need to be addressed before merging.
Key Changes:
- Unified
InsertGraphicInFigPipeandInsertGraphicInTableWrapPipeinto a singleInsertGraphicInFigOrTablewrapclass - Simplified
InlineGraphicPipewith a cleareris_graphic_inline()method - Renamed pipeline step functions with decimal numbering (step_10, step_20, etc.) for better organization
- Removed approximately 300 lines of unused code
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| previous_tail = (previous.tail or "").strip() | ||
| if previous_tail: | ||
| return True | ||
|
|
Copilot
AI
Dec 3, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[nitpick] Trailing whitespace on this line. Python style guides (PEP 8) recommend removing trailing whitespace for cleaner code.
| text = graphic.getparent().text | ||
| if text and text.strip(): | ||
| return True | ||
|
|
Copilot
AI
Dec 3, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[nitpick] Trailing whitespace on this line. Python style guides (PEP 8) recommend removing trailing whitespace for cleaner code.
| def wrap_graphic(self, fig_or_tablewrap): | ||
| item = fig_or_tablewrap | ||
| graphic = self.find_graphic(item) | ||
| if graphic is None: | ||
| return | ||
| if node.find("graphic") is not None: | ||
| fig_or_tablewrap.append(graphic) |
Copilot
AI
Dec 3, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The wrap_graphic method has a potential issue: it checks if the element already contains a graphic or table element (implied by the original code logic) but the new implementation doesn't.
Looking at the original InsertGraphicInFigPipe.parser_node, it had early returns for:
if node.find("table") is not None: returnif node.find("graphic") is not None: return
These checks prevent overwriting existing graphics or processing table-wrap that already has a table. The new implementation lacks these safeguards, which could lead to duplicate graphics being added or incorrect behavior.
| Documento com referências cruzadas e gráficos inline processados | ||
| """ | ||
| def convert_html_to_xml_step_70_complete_fig_and_tablewrap(document): | ||
|
|
Copilot
AI
Dec 3, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Missing docstring for convert_html_to_xml_step_70_complete_fig_and_tablewrap. All other step functions in the pipeline have comprehensive docstrings explaining their purpose, parameters, and return values. This function should follow the same documentation pattern for consistency and maintainability.
| """ | |
| Quarta etapa de conversão focada em processamento de figuras e tabelas. | |
| Processa elementos gráficos e insere-os nas estruturas apropriadas: | |
| - Identifica e processa elementos gráficos inline | |
| - Insere gráficos em elementos <fig> ou <table-wrap> conforme necessário | |
| Parameters | |
| ---------- | |
| document : Document | |
| Documento processado pela etapa anterior (com links e parágrafos já tratados) | |
| Returns | |
| ------- | |
| Document | |
| Documento com figuras e tabelas corretamente estruturadas | |
| """ |
| return | ||
| if node.find("graphic") is not None: | ||
| fig_or_tablewrap.append(graphic) | ||
|
|
Copilot
AI
Dec 3, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[nitpick] Trailing whitespace on this line. Python style guides (PEP 8) recommend removing trailing whitespace for cleaner code.
| def go_up(self, node): | ||
| if node.tag == "body": | ||
| return None | ||
| parent = node.getparent() | ||
|
|
||
| while True: | ||
| grand_parent = parent.getparent() | ||
| if grand_parent is None: | ||
| if parent is None: | ||
| return None | ||
| if parent.tag == "body": | ||
| return None | ||
| # parent sibling | ||
| return parent.getnext() |
Copilot
AI
Dec 3, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The go_up method has incorrect logic. It checks if node.tag == "body" before getting the parent, which will never be true since we're checking the current node's tag. Then it returns parent.getnext() instead of the parent itself. This breaks the traversal logic.
Expected behavior: Should return the parent node (not parent's sibling) to move up the tree hierarchy. The body check should be on the parent, not the current node.
Suggested fix:
def go_up(self, node):
parent = node.getparent()
if parent is None:
return None
if parent.tag == "body":
return None
return parent| for i in range(3): | ||
| if item is None: | ||
| # item is not valid | ||
| break |
Copilot
AI
Dec 3, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The loop for i in range(3) creates a magic number without explanation. The original code didn't have this limitation - it would traverse up the tree until reaching body or running out of parents. Limiting to 3 iterations could prevent finding graphics that are more than 3 levels away in the tree hierarchy.
Consider either:
- Removing the arbitrary limit and using a while loop with proper termination conditions
- Adding a comment explaining why 3 levels is sufficient for this use case
| for i in range(3): | |
| if item is None: | |
| # item is not valid | |
| break | |
| while item is not None and getattr(item, 'tag', None) != "body": |
|
|
||
| # logging.info("convert_html_to_xml - step 3") |
Copilot
AI
Dec 3, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[nitpick] Incomplete or orphaned comment line. The single commented-out logging line # logging.info("convert_html_to_xml - step 3") appears without context or corresponding docstring, making it unclear what this step does.
Consider either:
- Removing this orphaned comment since step numbering has changed
- Adding a proper docstring to document what this function does (complete fig and table-wrap processing)
| # logging.info("convert_html_to_xml - step 3") | |
| """ | |
| Quarta etapa de conversão focada em processamento completo de figuras e tabelas. | |
| Processa elementos gráficos e tabelas: | |
| - Insere gráficos inline em elementos fig ou table-wrap | |
| - Garante que figuras e tabelas estejam corretamente estruturadas no XML | |
| Parameters | |
| ---------- | |
| document : Document | |
| Documento processado pela etapa anterior | |
| Returns | |
| ------- | |
| Document | |
| Documento com figuras e tabelas processadas | |
| """ |
| Documento com referências cruzadas e gráficos inline processados | ||
| """ | ||
| def convert_html_to_xml_step_70_complete_fig_and_tablewrap(document): | ||
|
|
Copilot
AI
Dec 3, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[nitpick] Trailing whitespace on this line. Python style guides (PEP 8) recommend removing trailing whitespace for cleaner code.
| return None | ||
| # parent sibling | ||
| return parent.getnext() | ||
|
|
Copilot
AI
Dec 3, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[nitpick] Trailing whitespace on this line. Python style guides (PEP 8) recommend removing trailing whitespace for cleaner code.
Refatoração do pipeline de conversão HTML→XML
O que mudou
Consolidação e otimização
InsertGraphicInFigPipeeInsertGraphicInTableWrapPipeem uma única classe com lógica mais robustais_graphic_inline()mais claroReorganização
Impacto
Testes
Breaking changes: Não | Tipo: Refatoração