Skip to content

Conversation

@robertatakenaka
Copy link
Member

O que esse PR faz?

Refatora o processamento de XRef em sps_xml_body_pipes.py e corrige bug crítico onde múltiplas referências à mesma imagem geravam elementos <fig> duplicados no XML.

Porque estamos fazendo isso?

Bug crítico: Quando múltiplas referências apontavam para a mesma imagem (ex: "Fig. 6", "Fig. 6a", "Figs. 6b y c" → /img/revistas/dreh/v11n2/a02i6.jpg), o código criava elementos duplicados:

<!-- ANTES: 3 elementos fig duplicados ❌ -->
<fig id="f6"><graphic href="a02i6.jpg"/></fig>
<fig id="a02i6"><graphic href="a02i6.jpg"/></fig>  

<!-- DEPOIS: 1 elemento fig ✅ -->
<fig id="a02i6" filebasename="a02i6">
  <label>Fig. 6</label>
  <graphic href="a02i6.jpg"/>
</fig>

Como testar?

# Teste 1: Bug de duplicação corrigido
html = """
<a href="/img/revistas/dreh/v11n2/a02i6.jpg">Fig. 6</a>
<a href="/img/revistas/dreh/v11n2/a02i6.jpg">Fig. 6a</a>  
<a href="/img/revistas/dreh/v11n2/a02i6.jpg">Figs. 6b y c</a>
"""
# Esperado: 1 elemento <fig id="a02i6">, 3 <xref rid="a02i6">

# Teste 2: Detecção de acrônimos corrigida
# ✅ "/{rbef}/" → detecta como interno
# ❌ "/myrbefsite/" → não detecta (falso positivo)

Tipos de mudança

Checklist

  • Código segue o estilo do projeto
  • Testes adicionados e passando
  • Não quebra funcionalidade existente

Issues relacionados

Closes #122

Detalhes da implementação

Principais correções:

  1. Deduplicação: Track por filebasename previne elementos duplicados
  2. ID baseado no arquivo: Usa basename como identificador (ex: a02i6 de a02i6.jpg)
  3. Match de journal corrigido: Requer /{acronym}/ em vez de apenas conter string
  4. Suporte a materiais suplementares: Detecta .pdf, .xlsx, etc.

Novos métodos modulares:

  • get_letter_and_number(): Valida padrões [letra][números]
  • get_rid_from_href_and_pkg_name(): Extrai rid do basename
  • get_element_name(): Determina tipo por extensão/contexto

Impacto:

  • Elimina duplicação de elementos
  • XML válido e consistente
  • Todos XRefs apontam para elemento único correto

…ation

- Add regex-based validation for letter-number patterns in get_letter_and_number()
- Expand element mapping dictionaries (LABEL_INITIAL_TO_ELEMENT, FILENAME_TO_ELEMENT)
- Refactor XRefSpecialInternalLinkPipe with modular methods:
  - get_rid_from_xref_label_and_number(): Generate rid from label text and number
  - get_rid_from_href_and_pkg_name(): Extract rid from href and package name
  - get_label_text_and_number_from_xref_text(): Parse XRef text for label components
  - get_element_name(): Determine element type based on label, rid, or file extension
- Improve element matching using filebasename attribute alongside id
- Add support for supplementary materials based on file extensions
- Sort XRef children by filebasename before insertion
- Fix journal acronym matching to be more precise (/{acronym}/)
- Add safety check in InsertGraphicInTableWrapPipe for parent removal
- Store XRef text in label element for better preservation
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR addresses a critical bug (#122) where multiple xref references to the same image file generated duplicate <fig> elements in the XML output. The fix introduces filebasename-based deduplication tracking and refactors the XRef processing logic into modular helper methods for improved maintainability.

Key Changes:

  • Implements filebasename-based deduplication to prevent multiple <fig> elements for the same image
  • Refactors XRef processing into testable helper methods (get_letter_and_number, get_rid_from_xref_label_and_number, etc.)
  • Fixes journal acronym detection to require /{acronym}/ pattern instead of substring match
Comments suppressed due to low confidence (1)

scielo_classic_website/spsxml/sps_xml_body_pipes.py:1116

  • Always creating a <graphic> element (lines 1113-1116) may be incorrect for non-image files. According to line 1057, supplementary materials (PDF, DOC, XLS, etc.) should use <supplementary-material> as the element name, but they would still get a <graphic> child element. For supplementary materials, you should create a different child element or use an appropriate link element instead of <graphic>, which is typically used for images.
                g = ET.Element("graphic")
                g.set("{http://www.w3.org/1999/xlink}href", href)

                new_elem.append(g)

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

if label_number.isdigit():
return f"{element_prefix}{label_number}"

if label_number[:-1].isdigit() and label_number[-1].isalpha():
Copy link

Copilot AI Nov 22, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Potential IndexError if label_number is an empty string. Before accessing label_number[:-1] and label_number[-1], you should check that label_number has at least one character. An empty string would cause an IndexError when accessing these indices.

Suggested change
if label_number[:-1].isdigit() and label_number[-1].isalpha():
if len(label_number) > 1 and label_number[:-1].isdigit() and label_number[-1].isalpha():

Copilot uses AI. Check for mistakes.

parts = xref_text.split()

# first character of last part
Copy link

Copilot AI Nov 22, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Potential IndexError when expected_number is an empty string. If for some reason parts[-1] is an empty string, accessing expected_number[0] on line 1042 will raise an IndexError. Consider adding a check: if not expected_number: return parts[0] if parts else None, None before line 1042.

Suggested change
# first character of last part
# first character of last part
if not parts or not parts[-1]:
return parts[0] if parts else None, None

Copilot uses AI. Check for mistakes.
return None, None

parts = xref_text.split()

Copy link

Copilot AI Nov 22, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Potential IndexError when xref_text.split() returns an empty list. If xref_text contains only whitespace, split() will return an empty list, and accessing parts[-1] on line 1041 will raise an IndexError. Consider adding a check: if not parts: return None, None after line 1038.

Suggested change
if not parts:
return None, None

Copilot uses AI. Check for mistakes.
def parse_xref_text(self, xref_text, label_text):
# Tables 1-3
parts = xref_text.split(" ")
filename = filename.replace(pkg_name, "")
Copy link

Copilot AI Nov 22, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using str.replace() on line 1023 could lead to incorrect results if pkg_name appears multiple times in the filename. For example, if pkg_name="test" and filename="testtest", this would result in an empty string. Consider using filename.removeprefix(pkg_name) (Python 3.9+) or filename[len(pkg_name):] if you specifically want to remove a prefix.

Suggested change
filename = filename.replace(pkg_name, "")
filename = filename[len(pkg_name):]

Copilot uses AI. Check for mistakes.
Comment on lines +1094 to +1097
xpath = f"//*[@filebasename='{basename}']"
if rid:
xpath = f"//*[@id='{rid}' | @filebasename='{basename}']"
found = root.xpath(xpath)[0]
Copy link

Copilot AI Nov 22, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

XPath injection vulnerability: The basename variable is inserted directly into the XPath expression without sanitization. If basename contains special characters like single quotes, it could break the XPath query or potentially be exploited. Consider sanitizing basename or using parameterized queries if available. For example, if basename contains a single quote, the XPath query will fail.

Suggested change
xpath = f"//*[@filebasename='{basename}']"
if rid:
xpath = f"//*[@id='{rid}' | @filebasename='{basename}']"
found = root.xpath(xpath)[0]
if rid:
xpath = "//*[@id=$rid or @filebasename=$basename]"
found_nodes = root.xpath(xpath, rid=rid, basename=basename)
else:
xpath = "//*[@filebasename=$basename]"
found_nodes = root.xpath(xpath, basename=basename)
found = found_nodes[0]

Copilot uses AI. Check for mistakes.
Comment on lines +49 to +63
# Expressão Regular: r"^[a-zA-Z]\d+$"
# ^: Início da string
# [a-zA-Z]: Exatamente uma letra
# \d+: Um ou mais dígitos
# $: Fim da string
regex = r"^[a-zA-Z]\d+$"

# re.fullmatch() verifica se a string inteira corresponde ao padrão
match = re.fullmatch(regex, codigo)

if match:
# Se o padrão casar com a string inteira, retorna o valor original
return codigo
else:
# Caso contrário, retorna None
Copy link

Copilot AI Nov 22, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nitpick] Inline comments are in Portuguese (lines 49-53, 56, 60, 63) while the docstring is also in Portuguese. Consider whether this is consistent with the project's documentation standards. If the project uses English for code comments, these should be translated for consistency.

Suggested change
# Expressão Regular: r"^[a-zA-Z]\d+$"
# ^: Início da string
# [a-zA-Z]: Exatamente uma letra
# \d+: Um ou mais dígitos
# $: Fim da string
regex = r"^[a-zA-Z]\d+$"
# re.fullmatch() verifica se a string inteira corresponde ao padrão
match = re.fullmatch(regex, codigo)
if match:
# Se o padrão casar com a string inteira, retorna o valor original
return codigo
else:
# Caso contrário, retorna None
# Regular Expression: r"^[a-zA-Z]\d+$"
# ^: Start of string
# [a-zA-Z]: Exactly one letter
# \d+: One or more digits
# $: End of string
regex = r"^[a-zA-Z]\d+$"
# re.fullmatch() checks if the entire string matches the pattern
match = re.fullmatch(regex, codigo)
if match:
# If the pattern matches the entire string, return the original value
return codigo
else:
# Otherwise, return None

Copilot uses AI. Check for mistakes.
Returns:
String com o rid ou None
"""
if not label_text:
Copy link

Copilot AI Nov 22, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Potential IndexError if label_text is an empty string. Before accessing label_text[0] on line 998, you should verify that label_text has at least one character. The condition on line 995 only checks if it's not None/empty, but an empty string would pass that check and cause an IndexError.

Suggested change
if not label_text:
if not label_text or label_text == "":

Copilot uses AI. Check for mistakes.
Comment on lines +1051 to +1055
if label_text:
label_initial = label_text[0].lower()
element_name = LABEL_INITIAL_TO_ELEMENT.get(label_initial)
elif rid:
element_name = FILENAME_TO_ELEMENT.get(rid[0])
Copy link

Copilot AI Nov 22, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Potential IndexError if label_text or rid is an empty string. Before accessing label_text[0] on line 1052 or rid[0] on line 1055, you should verify these strings have at least one character. Empty strings would cause an IndexError.

Copilot uses AI. Check for mistakes.
child.attrib.pop("is_internal_link_to_asset_html_page")

for child in reversed(children):
# Sort children by rid before inserting
Copy link

Copilot AI Nov 22, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment states "Sort children by rid before inserting" but the code actually sorts by filebasename. Either update the comment to reflect the actual sorting key or verify that sorting by filebasename is the intended behavior.

Suggested change
# Sort children by rid before inserting
# Sort children by filebasename before inserting

Copilot uses AI. Check for mistakes.
if label_number.isdigit():
return f"{element_prefix}{label_number}"

if label_number[:-1].isdigit() and label_number[-1].isalpha():
Copy link

Copilot AI Nov 22, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Double space found between and and label_number. This should be a single space for consistency.

Suggested change
if label_number[:-1].isdigit() and label_number[-1].isalpha():
if label_number[:-1].isdigit() and label_number[-1].isalpha():

Copilot uses AI. Check for mistakes.
…le id/rid attribute conversion

- Add ReplaceIdhrefAndRidhrefByIdPipe to pipeline step 4
- Change XRefSpecialInternalLinkPipe to use rid-href instead of rid
- Add ref-type attribute based on element name mapping
- Use id-href for elements without existing id attributes
- Implement ReplaceIdhrefAndRidhrefByIdPipe with three main methods:
  - replace_rid_href_by_id: Updates xref elements with matching filebasename to use node id
  - create_rid_from_filebasename: Converts filebasename to rid attribute
  - create_id_from_filebasename: Converts filebasename to id attribute
- Process nodes in order: existing ids first, then rid-href, then id-href
- Remove temporary attributes (filebasename, id-href, rid-href) after processing
@robertatakenaka robertatakenaka merged commit 13cf379 into scieloorg:main Nov 22, 2025
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Duplica as imagens

1 participant