-
Notifications
You must be signed in to change notification settings - Fork 3
Fix graphic position in tablewrap #138
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix graphic position in tablewrap #138
Conversation
- Adiciona padrão 'Fig. No' para figuras em TEXT_PATTERNS - Adiciona padrão 'cuadro No' para tabelas em TEXT_PATTERNS - Adiciona mapeamento 'cuadro' em espanhol para table em ID_PATTERNS
- Adiciona tentativa de detecção usando apenas a primeira palavra do texto - Permite identificar referências quando o texto completo não casa com padrões
- Importa detect_from_id para identificar tipo de elemento pelo ID - ANamePipe agora detecta o tipo correto do elemento via ID - Comenta temporariamente convert_html_to_xml_step_90_complete_disp_formula - Adiciona strip_tags para STRIPTAG em XMLBodyCenterPipe
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR enhances cross-reference (xref) detection and element type identification from IDs, improving support for Spanish patterns and adding fallback logic for composite text references.
Changes:
- Added Spanish pattern support for figures (
Fig. No) and tables (cuadro No) - Implemented fallback logic in
analyze_xrefto detect type using only the first word when full text doesn't match patterns - Refactored
ANamePipeto usedetect_from_idfor automatic element type identification
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| scielo_classic_website/spsxml/detector_config_xref.py | Added Spanish patterns for figure/table detection and ID mapping for cuadro prefix |
| scielo_classic_website/spsxml/detector.py | Added fallback logic in analyze_xref to try first word when full text doesn't match |
| scielo_classic_website/spsxml/sps_xml_body_pipes.py | Refactored ANamePipe to use detect_from_id, commented out disp_formula step, added strip_tags in XMLBodyCenterPipe |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| ref_type_text, element_name_text, prefix, number = detect_from_text(text.split()[0]) | ||
| if ref_type_text: | ||
| result["ref_type"] = ref_type_text | ||
| result["element_name"] = element_name_text | ||
| result["prefix"] = prefix | ||
| result["number"] = number | ||
| result["source"] = "text" |
Copilot
AI
Jan 16, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If text is an empty string or contains only whitespace, text.split() will return an empty list, causing an IndexError when accessing index [0]. Add a check to ensure the split result is not empty before accessing the first element.
| ref_type_text, element_name_text, prefix, number = detect_from_text(text.split()[0]) | |
| if ref_type_text: | |
| result["ref_type"] = ref_type_text | |
| result["element_name"] = element_name_text | |
| result["prefix"] = prefix | |
| result["number"] = number | |
| result["source"] = "text" | |
| parts = text.split() | |
| if parts: | |
| ref_type_text, element_name_text, prefix, number = detect_from_text(parts[0]) | |
| if ref_type_text: | |
| result["ref_type"] = ref_type_text | |
| result["element_name"] = element_name_text | |
| result["prefix"] = prefix | |
| result["number"] = number | |
| result["source"] = "text" |
| result["number"] = number | ||
| result["source"] = "text" | ||
|
|
||
|
|
Copilot
AI
Jan 16, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is an extra blank line between the end of the else block and the comment for the next section. This should be removed to maintain consistent code formatting.
| @@ -128,7 +129,7 @@ def convert_html_to_xml(document): | |||
| convert_html_to_xml_step_60_ahref_and_aname, | |||
| convert_html_to_xml_step_70_complete_fig_and_tablewrap, | |||
| convert_html_to_xml_step_80_fix_sec, | |||
Copilot
AI
Jan 16, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The convert_html_to_xml_step_90_complete_disp_formula step is commented out without explanation. Add a comment explaining why this step is temporarily disabled and under what conditions it should be re-enabled.
| convert_html_to_xml_step_80_fix_sec, | |
| convert_html_to_xml_step_80_fix_sec, | |
| # NOTE: Step 90 (`convert_html_to_xml_step_90_complete_disp_formula`) is | |
| # temporarily disabled because it may incorrectly transform some <disp-formula> | |
| # elements and generate invalid SPS XML. Re-enable this step only after the | |
| # underlying issues are fixed and regression tests for complex formula markup | |
| # are in place and passing. |
| @@ -1910,6 +1913,7 @@ def rename_center(self, root): | |||
| center.tag = "p" | |||
| continue | |||
| center.tag = "title" | |||
Copilot
AI
Jan 16, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The addition of ET.strip_tags(root, 'STRIPTAG') in the rename_center method is not explained. Add a comment explaining why STRIPTAG elements need to be stripped at this point in the processing pipeline.
| center.tag = "title" | |
| center.tag = "title" | |
| # Remove os elementos marcados com STRIPTAG, usados apenas como marcadores | |
| # temporários para <center> vazios, para que essas tags artificiais não | |
| # apareçam no XML final. |
| for node in xml.xpath(".//a[@name]"): | ||
| node.tag = "element" | ||
| node.set("id", node.attrib.pop("name")) | ||
| name = node.attrib.pop("name") |
Copilot
AI
Jan 16, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The refactored ANamePipe logic now uses detect_from_id to determine element types, but the existing test only covers the case where the name doesn't match any pattern (expecting <div> as fallback). Add tests for cases where the name matches known patterns like 'f1' (should become <fig>), 't1' (should become <table-wrap>), and 'cuadro1' (should become <table-wrap>).
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Melhora detecção de xref e identificação de elementos por ID
Descrição
Este PR aprimora o sistema de detecção de referências cruzadas (xref) e a identificação de tipos de elementos a partir de IDs, melhorando o suporte a padrões em espanhol e adicionando fallback para textos compostos.
Alterações
Novos padrões de detecção
Fig. No(ex: "Fig. No 1", "Fig. no. 2")cuadro No(ex: "Cuadro No 1", "cuadro no. 2")cuadro→tablepara IDs em espanhol (ex:cuadro1,cuadro2)Melhoria na análise de xref
Refatoração do ANamePipe
detect_from_idpara identificar automaticamente o tipo correto do elemento baseado no ID<a name="f1">agora são convertidos para<fig id="f1">em vez de<element id="f1">Ajustes no pipeline
convert_html_to_xml_step_90_complete_disp_formulastrip_tagspara remover tagsSTRIPTAGemXMLBodyCenterPipeTestes
<a name="cuadro1">para<table-wrap id="cuadro1">