Skip to content

Conversation

@robertatakenaka
Copy link
Member

Melhora detecção de xref e identificação de elementos por ID

Descrição

Este PR aprimora o sistema de detecção de referências cruzadas (xref) e a identificação de tipos de elementos a partir de IDs, melhorando o suporte a padrões em espanhol e adicionando fallback para textos compostos.

Alterações

Novos padrões de detecção

  • Figuras: adiciona padrão Fig. No (ex: "Fig. No 1", "Fig. no. 2")
  • Tabelas: adiciona padrão cuadro No (ex: "Cuadro No 1", "cuadro no. 2")
  • ID de tabelas: adiciona mapeamento cuadrotable para IDs em espanhol (ex: cuadro1, cuadro2)

Melhoria na análise de xref

  • Implementa fallback que tenta detectar o tipo usando apenas a primeira palavra do texto quando o texto completo não casa com nenhum padrão
  • Permite identificar corretamente referências como "Fig. 1 e 2" ou "Tabla 1, 2 y 3"

Refatoração do ANamePipe

  • Utiliza detect_from_id para identificar automaticamente o tipo correto do elemento baseado no ID
  • Elementos <a name="f1"> agora são convertidos para <fig id="f1"> em vez de <element id="f1">

Ajustes no pipeline

  • Comenta temporariamente convert_html_to_xml_step_90_complete_disp_formula
  • Adiciona strip_tags para remover tags STRIPTAG em XMLBodyCenterPipe

Testes

  • Verificar detecção de "Fig. No 1" como figura
  • Verificar detecção de "Cuadro No 1" como tabela
  • Verificar conversão de <a name="cuadro1"> para <table-wrap id="cuadro1">
  • Verificar fallback com textos compostos como "Fig. 1 e 2"

- Adiciona padrão 'Fig. No' para figuras em TEXT_PATTERNS
- Adiciona padrão 'cuadro No' para tabelas em TEXT_PATTERNS
- Adiciona mapeamento 'cuadro' em espanhol para table em ID_PATTERNS
- Adiciona tentativa de detecção usando apenas a primeira palavra do texto
- Permite identificar referências quando o texto completo não casa com padrões
- Importa detect_from_id para identificar tipo de elemento pelo ID
- ANamePipe agora detecta o tipo correto do elemento via ID
- Comenta temporariamente convert_html_to_xml_step_90_complete_disp_formula
- Adiciona strip_tags para STRIPTAG em XMLBodyCenterPipe
Copilot AI review requested due to automatic review settings January 16, 2026 14:10
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR enhances cross-reference (xref) detection and element type identification from IDs, improving support for Spanish patterns and adding fallback logic for composite text references.

Changes:

  • Added Spanish pattern support for figures (Fig. No) and tables (cuadro No)
  • Implemented fallback logic in analyze_xref to detect type using only the first word when full text doesn't match patterns
  • Refactored ANamePipe to use detect_from_id for automatic element type identification

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 6 comments.

File Description
scielo_classic_website/spsxml/detector_config_xref.py Added Spanish patterns for figure/table detection and ID mapping for cuadro prefix
scielo_classic_website/spsxml/detector.py Added fallback logic in analyze_xref to try first word when full text doesn't match
scielo_classic_website/spsxml/sps_xml_body_pipes.py Refactored ANamePipe to use detect_from_id, commented out disp_formula step, added strip_tags in XMLBodyCenterPipe

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +196 to +202
ref_type_text, element_name_text, prefix, number = detect_from_text(text.split()[0])
if ref_type_text:
result["ref_type"] = ref_type_text
result["element_name"] = element_name_text
result["prefix"] = prefix
result["number"] = number
result["source"] = "text"
Copy link

Copilot AI Jan 16, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If text is an empty string or contains only whitespace, text.split() will return an empty list, causing an IndexError when accessing index [0]. Add a check to ensure the split result is not empty before accessing the first element.

Suggested change
ref_type_text, element_name_text, prefix, number = detect_from_text(text.split()[0])
if ref_type_text:
result["ref_type"] = ref_type_text
result["element_name"] = element_name_text
result["prefix"] = prefix
result["number"] = number
result["source"] = "text"
parts = text.split()
if parts:
ref_type_text, element_name_text, prefix, number = detect_from_text(parts[0])
if ref_type_text:
result["ref_type"] = ref_type_text
result["element_name"] = element_name_text
result["prefix"] = prefix
result["number"] = number
result["source"] = "text"

Copilot uses AI. Check for mistakes.
result["number"] = number
result["source"] = "text"


Copy link

Copilot AI Jan 16, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is an extra blank line between the end of the else block and the comment for the next section. This should be removed to maintain consistent code formatting.

Suggested change

Copilot uses AI. Check for mistakes.
@@ -128,7 +129,7 @@ def convert_html_to_xml(document):
convert_html_to_xml_step_60_ahref_and_aname,
convert_html_to_xml_step_70_complete_fig_and_tablewrap,
convert_html_to_xml_step_80_fix_sec,
Copy link

Copilot AI Jan 16, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The convert_html_to_xml_step_90_complete_disp_formula step is commented out without explanation. Add a comment explaining why this step is temporarily disabled and under what conditions it should be re-enabled.

Suggested change
convert_html_to_xml_step_80_fix_sec,
convert_html_to_xml_step_80_fix_sec,
# NOTE: Step 90 (`convert_html_to_xml_step_90_complete_disp_formula`) is
# temporarily disabled because it may incorrectly transform some <disp-formula>
# elements and generate invalid SPS XML. Re-enable this step only after the
# underlying issues are fixed and regression tests for complex formula markup
# are in place and passing.

Copilot uses AI. Check for mistakes.
@@ -1910,6 +1913,7 @@ def rename_center(self, root):
center.tag = "p"
continue
center.tag = "title"
Copy link

Copilot AI Jan 16, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The addition of ET.strip_tags(root, 'STRIPTAG') in the rename_center method is not explained. Add a comment explaining why STRIPTAG elements need to be stripped at this point in the processing pipeline.

Suggested change
center.tag = "title"
center.tag = "title"
# Remove os elementos marcados com STRIPTAG, usados apenas como marcadores
# temporários para <center> vazios, para que essas tags artificiais não
# apareçam no XML final.

Copilot uses AI. Check for mistakes.
for node in xml.xpath(".//a[@name]"):
node.tag = "element"
node.set("id", node.attrib.pop("name"))
name = node.attrib.pop("name")
Copy link

Copilot AI Jan 16, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The refactored ANamePipe logic now uses detect_from_id to determine element types, but the existing test only covers the case where the name doesn't match any pattern (expecting <div> as fallback). Add tests for cases where the name matches known patterns like 'f1' (should become <fig>), 't1' (should become <table-wrap>), and 'cuadro1' (should become <table-wrap>).

Copilot uses AI. Check for mistakes.
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@robertatakenaka robertatakenaka merged commit 20b43ce into scieloorg:main Jan 16, 2026
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant