-
Notifications
You must be signed in to change notification settings - Fork 3
Corrige inserção duplicada de figuras representadas como links no html original #123
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Corrige inserção duplicada de figuras representadas como links no html original #123
Conversation
…ation
- Add regex-based validation for letter-number patterns in get_letter_and_number()
- Expand element mapping dictionaries (LABEL_INITIAL_TO_ELEMENT, FILENAME_TO_ELEMENT)
- Refactor XRefSpecialInternalLinkPipe with modular methods:
- get_rid_from_xref_label_and_number(): Generate rid from label text and number
- get_rid_from_href_and_pkg_name(): Extract rid from href and package name
- get_label_text_and_number_from_xref_text(): Parse XRef text for label components
- get_element_name(): Determine element type based on label, rid, or file extension
- Improve element matching using filebasename attribute alongside id
- Add support for supplementary materials based on file extensions
- Sort XRef children by filebasename before insertion
- Fix journal acronym matching to be more precise (/{acronym}/)
- Add safety check in InsertGraphicInTableWrapPipe for parent removal
- Store XRef text in label element for better preservation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR addresses a critical bug (#122) where multiple xref references to the same image file generated duplicate <fig> elements in the XML output. The fix introduces filebasename-based deduplication tracking and refactors the XRef processing logic into modular helper methods for improved maintainability.
Key Changes:
- Implements filebasename-based deduplication to prevent multiple
<fig>elements for the same image - Refactors XRef processing into testable helper methods (
get_letter_and_number,get_rid_from_xref_label_and_number, etc.) - Fixes journal acronym detection to require
/{acronym}/pattern instead of substring match
Comments suppressed due to low confidence (1)
scielo_classic_website/spsxml/sps_xml_body_pipes.py:1116
- Always creating a
<graphic>element (lines 1113-1116) may be incorrect for non-image files. According to line 1057, supplementary materials (PDF, DOC, XLS, etc.) should use<supplementary-material>as the element name, but they would still get a<graphic>child element. For supplementary materials, you should create a different child element or use an appropriate link element instead of<graphic>, which is typically used for images.
g = ET.Element("graphic")
g.set("{http://www.w3.org/1999/xlink}href", href)
new_elem.append(g)
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| if label_number.isdigit(): | ||
| return f"{element_prefix}{label_number}" | ||
|
|
||
| if label_number[:-1].isdigit() and label_number[-1].isalpha(): |
Copilot
AI
Nov 22, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Potential IndexError if label_number is an empty string. Before accessing label_number[:-1] and label_number[-1], you should check that label_number has at least one character. An empty string would cause an IndexError when accessing these indices.
| if label_number[:-1].isdigit() and label_number[-1].isalpha(): | |
| if len(label_number) > 1 and label_number[:-1].isdigit() and label_number[-1].isalpha(): |
|
|
||
| parts = xref_text.split() | ||
|
|
||
| # first character of last part |
Copilot
AI
Nov 22, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Potential IndexError when expected_number is an empty string. If for some reason parts[-1] is an empty string, accessing expected_number[0] on line 1042 will raise an IndexError. Consider adding a check: if not expected_number: return parts[0] if parts else None, None before line 1042.
| # first character of last part | |
| # first character of last part | |
| if not parts or not parts[-1]: | |
| return parts[0] if parts else None, None |
| return None, None | ||
|
|
||
| parts = xref_text.split() | ||
|
|
Copilot
AI
Nov 22, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Potential IndexError when xref_text.split() returns an empty list. If xref_text contains only whitespace, split() will return an empty list, and accessing parts[-1] on line 1041 will raise an IndexError. Consider adding a check: if not parts: return None, None after line 1038.
| if not parts: | |
| return None, None |
| def parse_xref_text(self, xref_text, label_text): | ||
| # Tables 1-3 | ||
| parts = xref_text.split(" ") | ||
| filename = filename.replace(pkg_name, "") |
Copilot
AI
Nov 22, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using str.replace() on line 1023 could lead to incorrect results if pkg_name appears multiple times in the filename. For example, if pkg_name="test" and filename="testtest", this would result in an empty string. Consider using filename.removeprefix(pkg_name) (Python 3.9+) or filename[len(pkg_name):] if you specifically want to remove a prefix.
| filename = filename.replace(pkg_name, "") | |
| filename = filename[len(pkg_name):] |
| xpath = f"//*[@filebasename='{basename}']" | ||
| if rid: | ||
| xpath = f"//*[@id='{rid}' | @filebasename='{basename}']" | ||
| found = root.xpath(xpath)[0] |
Copilot
AI
Nov 22, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
XPath injection vulnerability: The basename variable is inserted directly into the XPath expression without sanitization. If basename contains special characters like single quotes, it could break the XPath query or potentially be exploited. Consider sanitizing basename or using parameterized queries if available. For example, if basename contains a single quote, the XPath query will fail.
| xpath = f"//*[@filebasename='{basename}']" | |
| if rid: | |
| xpath = f"//*[@id='{rid}' | @filebasename='{basename}']" | |
| found = root.xpath(xpath)[0] | |
| if rid: | |
| xpath = "//*[@id=$rid or @filebasename=$basename]" | |
| found_nodes = root.xpath(xpath, rid=rid, basename=basename) | |
| else: | |
| xpath = "//*[@filebasename=$basename]" | |
| found_nodes = root.xpath(xpath, basename=basename) | |
| found = found_nodes[0] |
| # Expressão Regular: r"^[a-zA-Z]\d+$" | ||
| # ^: Início da string | ||
| # [a-zA-Z]: Exatamente uma letra | ||
| # \d+: Um ou mais dígitos | ||
| # $: Fim da string | ||
| regex = r"^[a-zA-Z]\d+$" | ||
|
|
||
| # re.fullmatch() verifica se a string inteira corresponde ao padrão | ||
| match = re.fullmatch(regex, codigo) | ||
|
|
||
| if match: | ||
| # Se o padrão casar com a string inteira, retorna o valor original | ||
| return codigo | ||
| else: | ||
| # Caso contrário, retorna None |
Copilot
AI
Nov 22, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[nitpick] Inline comments are in Portuguese (lines 49-53, 56, 60, 63) while the docstring is also in Portuguese. Consider whether this is consistent with the project's documentation standards. If the project uses English for code comments, these should be translated for consistency.
| # Expressão Regular: r"^[a-zA-Z]\d+$" | |
| # ^: Início da string | |
| # [a-zA-Z]: Exatamente uma letra | |
| # \d+: Um ou mais dígitos | |
| # $: Fim da string | |
| regex = r"^[a-zA-Z]\d+$" | |
| # re.fullmatch() verifica se a string inteira corresponde ao padrão | |
| match = re.fullmatch(regex, codigo) | |
| if match: | |
| # Se o padrão casar com a string inteira, retorna o valor original | |
| return codigo | |
| else: | |
| # Caso contrário, retorna None | |
| # Regular Expression: r"^[a-zA-Z]\d+$" | |
| # ^: Start of string | |
| # [a-zA-Z]: Exactly one letter | |
| # \d+: One or more digits | |
| # $: End of string | |
| regex = r"^[a-zA-Z]\d+$" | |
| # re.fullmatch() checks if the entire string matches the pattern | |
| match = re.fullmatch(regex, codigo) | |
| if match: | |
| # If the pattern matches the entire string, return the original value | |
| return codigo | |
| else: | |
| # Otherwise, return None |
| Returns: | ||
| String com o rid ou None | ||
| """ | ||
| if not label_text: |
Copilot
AI
Nov 22, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Potential IndexError if label_text is an empty string. Before accessing label_text[0] on line 998, you should verify that label_text has at least one character. The condition on line 995 only checks if it's not None/empty, but an empty string would pass that check and cause an IndexError.
| if not label_text: | |
| if not label_text or label_text == "": |
| if label_text: | ||
| label_initial = label_text[0].lower() | ||
| element_name = LABEL_INITIAL_TO_ELEMENT.get(label_initial) | ||
| elif rid: | ||
| element_name = FILENAME_TO_ELEMENT.get(rid[0]) |
Copilot
AI
Nov 22, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Potential IndexError if label_text or rid is an empty string. Before accessing label_text[0] on line 1052 or rid[0] on line 1055, you should verify these strings have at least one character. Empty strings would cause an IndexError.
| child.attrib.pop("is_internal_link_to_asset_html_page") | ||
|
|
||
| for child in reversed(children): | ||
| # Sort children by rid before inserting |
Copilot
AI
Nov 22, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The comment states "Sort children by rid before inserting" but the code actually sorts by filebasename. Either update the comment to reflect the actual sorting key or verify that sorting by filebasename is the intended behavior.
| # Sort children by rid before inserting | |
| # Sort children by filebasename before inserting |
| if label_number.isdigit(): | ||
| return f"{element_prefix}{label_number}" | ||
|
|
||
| if label_number[:-1].isdigit() and label_number[-1].isalpha(): |
Copilot
AI
Nov 22, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Double space found between and and label_number. This should be a single space for consistency.
| if label_number[:-1].isdigit() and label_number[-1].isalpha(): | |
| if label_number[:-1].isdigit() and label_number[-1].isalpha(): |
…le id/rid attribute conversion - Add ReplaceIdhrefAndRidhrefByIdPipe to pipeline step 4 - Change XRefSpecialInternalLinkPipe to use rid-href instead of rid - Add ref-type attribute based on element name mapping - Use id-href for elements without existing id attributes - Implement ReplaceIdhrefAndRidhrefByIdPipe with three main methods: - replace_rid_href_by_id: Updates xref elements with matching filebasename to use node id - create_rid_from_filebasename: Converts filebasename to rid attribute - create_id_from_filebasename: Converts filebasename to id attribute - Process nodes in order: existing ids first, then rid-href, then id-href - Remove temporary attributes (filebasename, id-href, rid-href) after processing
O que esse PR faz?
Refatora o processamento de XRef em
sps_xml_body_pipes.pye corrige bug crítico onde múltiplas referências à mesma imagem geravam elementos<fig>duplicados no XML.Porque estamos fazendo isso?
Bug crítico: Quando múltiplas referências apontavam para a mesma imagem (ex: "Fig. 6", "Fig. 6a", "Figs. 6b y c" →
/img/revistas/dreh/v11n2/a02i6.jpg), o código criava elementos duplicados:Como testar?
Tipos de mudança
Checklist
Issues relacionados
Closes #122
Detalhes da implementação
Principais correções:
filebasenameprevine elementos duplicadosa02i6dea02i6.jpg)/{acronym}/em vez de apenas conter stringNovos métodos modulares:
get_letter_and_number(): Valida padrões [letra][números]get_rid_from_href_and_pkg_name(): Extrai rid do basenameget_element_name(): Determina tipo por extensão/contextoImpacto: