Fix duplicate insertion of figures represented as links in the original HTML #123

robertatakenaka · 2025-11-22T15:11:47Z

What does this PR do?

Refactors XRef processing in sps_xml_body_pipes.py and fixes a critical bug where multiple references to the same image generated duplicate <fig> elements in the XML.

Why are we doing this?

Critical bug: when multiple references pointed to the same image (e.g. "Fig. 6", "Fig. 6a", "Figs. 6b y c" → /img/revistas/dreh/v11n2/a02i6.jpg), the code created duplicate elements:

<!-- BEFORE: 3 duplicate fig elements ❌ -->
<fig id="f6"><graphic href="a02i6.jpg"/></fig>
<fig id="a02i6"><graphic href="a02i6.jpg"/></fig>

<!-- AFTER: 1 fig element ✅ -->
<fig id="a02i6" filebasename="a02i6">
  <label>Fig. 6</label>
  <graphic href="a02i6.jpg"/>
</fig>
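
The deduplication idea can be sketched as follows. This is a minimal illustration, not the code in sps_xml_body_pipes.py: the function name `build_figs_and_xrefs` and the `(href, text)` input shape are hypothetical, and the choice of the first link text as the label is an assumption based on the "AFTER" example above.

```python
import posixpath
import xml.etree.ElementTree as ET

XLINK_HREF = "{http://www.w3.org/1999/xlink}href"


def build_figs_and_xrefs(links):
    """Hypothetical helper: links is an iterable of (href, text) pairs.

    Returns (figs, xrefs): one <fig> per distinct file basename and one
    <xref> per link, every xref pointing at the shared fig id.
    """
    figs_by_basename = {}  # filebasename -> <fig> element already created
    xrefs = []
    for href, text in links:
        filename = posixpath.basename(href)             # e.g. "a02i6.jpg"
        basename, _ext = posixpath.splitext(filename)   # e.g. "a02i6"

        # Create the <fig> only the first time this basename is seen.
        if basename not in figs_by_basename:
            fig = ET.Element("fig", {"id": basename, "filebasename": basename})
            label = ET.SubElement(fig, "label")
            label.text = text  # assumption: first link text becomes the label
            graphic = ET.SubElement(fig, "graphic")
            graphic.set(XLINK_HREF, filename)
            figs_by_basename[basename] = fig

        # Every link still gets its own <xref>, all sharing the same rid.
        xref = ET.Element("xref", {"ref-type": "fig", "rid": basename})
        xref.text = text
        xrefs.append(xref)
    return list(figs_by_basename.values()), xrefs
```

Keying the lookup on the file basename rather than on the label text is what lets "Fig. 6", "Fig. 6a" and "Figs. 6b y c" collapse into a single element.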

How to test?

# Test 1: duplication bug fixed
html = """
<a href="/img/revistas/dreh/v11n2/a02i6.jpg">Fig. 6</a>
<a href="/img/revistas/dreh/v11n2/a02i6.jpg">Fig. 6a</a>
<a href="/img/revistas/dreh/v11n2/a02i6.jpg">Figs. 6b y c</a>
"""
# Expected: 1 <fig id="a02i6"> element, 3 <xref rid="a02i6">

# Test 2: acronym detection fixed
# ✅ "/{rbef}/" → detected as internal
# ❌ "/myrbefsite/" → not detected (previously a false positive)
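
The difference between the two matching strategies in Test 2 can be shown with a small sketch. The function names and the sample paths below are hypothetical and are not taken from the project; only the `/{acronym}/` rule comes from this PR.

```python
def is_internal_href(href, acronym):
    """Hypothetical helper: match the acronym only as a whole path segment."""
    return f"/{acronym}/" in href


def is_internal_href_substring(href, acronym):
    """Looser substring check, shown only for comparison."""
    return acronym in href


# Illustrative paths (assumed, not from the repository).
journal_path = "/img/revistas/rbef/v30n1/a01f1.jpg"
unrelated_path = "http://example.org/myrbefsite/fig.jpg"
```

With the acronym `rbef`, the segment check accepts `journal_path` and rejects `unrelated_path`, while the substring check accepts both.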

Types of change

Bug fix (fixes issue Duplica as imagens #122)
Refactoring (improves existing code)

Checklist

Code follows the project style
Tests added and passing
Does not break existing functionality

Related issues

Closes #122

Implementation details

Main fixes:

Deduplication: tracking by filebasename prevents duplicate elements
File-based ID: uses the basename as the identifier (e.g. a02i6 from a02i6.jpg)
Journal match fixed: requires /{acronym}/ instead of merely containing the string
Supplementary material support: detects .pdf, .xlsx, etc.

New modular methods:

get_letter_and_number(): validates [letter][numbers] patterns
get_rid_from_href_and_pkg_name(): extracts the rid from the basename
get_element_name(): determines the element type from extension/context
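
As an illustration of the [letter][numbers] validation, here is a minimal regex sketch. It is an assumption about the general shape of such a check, not the actual body of get_letter_and_number(); the name `split_letter_and_number` and the exact pattern (one ASCII letter followed by digits only) are hypothetical.

```python
import re

# Assumed pattern: exactly one ASCII letter followed by one or more digits.
LETTER_AND_NUMBER = re.compile(r"([A-Za-z])(\d+)")


def split_letter_and_number(text):
    """Return (letter, number) when text is one letter plus digits, else None."""
    match = LETTER_AND_NUMBER.fullmatch(text)
    if match is None:
        return None
    return match.group(1), match.group(2)
```

Using `fullmatch` rather than `search` keeps strings such as `fig6` or `6` from being accepted by accident.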

Impact:

Eliminates element duplication
Valid, consistent XML
All XRefs point to the single correct element

…ation
- Add regex-based validation for letter-number patterns in get_letter_and_number()
- Expand element mapping dictionaries (LABEL_INITIAL_TO_ELEMENT, FILENAME_TO_ELEMENT)
- Refactor XRefSpecialInternalLinkPipe with modular methods:
  - get_rid_from_xref_label_and_number(): Generate rid from label text and number
  - get_rid_from_href_and_pkg_name(): Extract rid from href and package name
  - get_label_text_and_number_from_xref_text(): Parse XRef text for label components
  - get_element_name(): Determine element type based on label, rid, or file extension
- Improve element matching using filebasename attribute alongside id
- Add support for supplementary materials based on file extensions
- Sort XRef children by filebasename before insertion
- Fix journal acronym matching to be more precise (/{acronym}/)
- Add safety check in InsertGraphicInTableWrapPipe for parent removal
- Store XRef text in label element for better preservation

Copilot

Pull request overview

This PR addresses a critical bug (#122) where multiple xref references to the same image file generated duplicate <fig> elements in the XML output. The fix introduces filebasename-based deduplication tracking and refactors the XRef processing logic into modular helper methods for improved maintainability.

Key Changes:

Implements filebasename-based deduplication to prevent multiple <fig> elements for the same image
Refactors XRef processing into testable helper methods (get_letter_and_number, get_rid_from_xref_label_and_number, etc.)
Fixes journal acronym detection to require /{acronym}/ pattern instead of substring match

Comments suppressed due to low confidence (1)

scielo_classic_website/spsxml/sps_xml_body_pipes.py:1116

Always creating a <graphic> element (lines 1113-1116) may be incorrect for non-image files. According to line 1057, supplementary materials (PDF, DOC, XLS, etc.) should use <supplementary-material> as the element name, but they would still get a <graphic> child element. For supplementary materials, you should create a different child element or use an appropriate link element instead of <graphic>, which is typically used for images.

                g = ET.Element("graphic")
                g.set("{http://www.w3.org/1999/xlink}href", href)

                new_elem.append(g)
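
One way to address this comment is to branch on the file extension before creating the child element. The sketch below is illustrative only: `build_element_for_href` and `SUPPLEMENTARY_EXTENSIONS` are hypothetical names, the extension set is an assumption, and placing the link directly on `<supplementary-material>` is one possible design rather than what the PR does.

```python
import posixpath
import xml.etree.ElementTree as ET

XLINK_HREF = "{http://www.w3.org/1999/xlink}href"

# Assumed set of non-image extensions; the project's real mapping may differ.
SUPPLEMENTARY_EXTENSIONS = {".pdf", ".doc", ".docx", ".xls", ".xlsx"}


def build_element_for_href(href, rid):
    """Hypothetical helper: pick the element by extension, add <graphic> only for images."""
    ext = posixpath.splitext(href)[1].lower()
    if ext in SUPPLEMENTARY_EXTENSIONS:
        # Non-image file: carry the link on the element itself, no <graphic> child.
        elem = ET.Element("supplementary-material", {"id": rid})
        elem.set(XLINK_HREF, href)
        return elem

    # Image file: keep the <fig><graphic/></fig> structure.
    elem = ET.Element("fig", {"id": rid})
    g = ET.SubElement(elem, "graphic")
    g.set(XLINK_HREF, href)
    return elem
```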


Copilot · 2025-11-22T15:17:13Z