Fix duplicate insertion of figures represented as links in the original HTML #123
@@ -4,6 +4,7 @@
 import os
 from copy import deepcopy
 from io import StringIO
+import re

 import plumber
 from lxml import etree as ET
@@ -19,20 +20,50 @@
     "e": "disp-formula",
 }


-ELEM_NAME = {
+LABEL_INITIAL_TO_ELEMENT = {
     "t": "table-wrap",
     "f": "fig",
     "e": "disp-formula",
-    "c": "table-wrap",
+    "c": "table-wrap",  # cuadro
+    "a": "app",  # appendix, anexo
 }

+FILENAME_TO_ELEMENT = {}
+FILENAME_TO_ELEMENT.update(LABEL_INITIAL_TO_ELEMENT)
+FILENAME_TO_ELEMENT["i"] = "fig"


 ELEM_AND_REF_TYPE = {
     "table-wrap": "table",
 }


+def get_letter_and_number(codigo):
+    """
+    Verifica se a string inteira corresponde exatamente ao padrão:
+    [Letra (maiúscula/minúscula)][Um ou mais dígitos].
+    Se corresponder (ex: 'f1', 'A99'), retorna a string original.
+    Se não corresponder (ex: '1f', 'f1a'), retorna None.
+    """
+
+    # Expressão Regular: r"^[a-zA-Z]\d+$"
+    # ^: Início da string
+    # [a-zA-Z]: Exatamente uma letra
+    # \d+: Um ou mais dígitos
+    # $: Fim da string
+    regex = r"^[a-zA-Z]\d+$"
+
+    # re.fullmatch() verifica se a string inteira corresponde ao padrão
+    match = re.fullmatch(regex, codigo)
+
+    if match:
+        # Se o padrão casar com a string inteira, retorna o valor original
+        return codigo
+    else:
+        # Caso contrário, retorna None
+        return None
+
+
 class XMLBodyAnBackConvertException(Exception): ...

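The helper added in this hunk can be tried on its own. The sketch below is a condensed restatement of `get_letter_and_number` as it appears in the diff; its Portuguese docstring says the function returns the input when the whole string is exactly one letter followed by one or more digits, and `None` otherwise. The `element_for_code` lookup is a hypothetical illustration of how such a code could be combined with `FILENAME_TO_ELEMENT`; the diff itself does not show that call.

```python
import re

# Mapping copied from the diff; "i" also maps to "fig" for file names.
LABEL_INITIAL_TO_ELEMENT = {
    "t": "table-wrap",
    "f": "fig",
    "e": "disp-formula",
    "c": "table-wrap",
    "a": "app",
}
FILENAME_TO_ELEMENT = dict(LABEL_INITIAL_TO_ELEMENT, i="fig")


def get_letter_and_number(codigo):
    """Return codigo if it is one letter (a-z, A-Z) followed by digits, else None."""
    # re.fullmatch already anchors both ends, so ^ and $ are optional here.
    return codigo if re.fullmatch(r"[a-zA-Z]\d+", codigo) else None


def element_for_code(codigo):
    """Hypothetical helper: map a code such as 'f1' to an element name."""
    code = get_letter_and_number(codigo)
    if code is None:
        return None
    return FILENAME_TO_ELEMENT.get(code[0].lower())


print(get_letter_and_number("f1"))   # f1
print(get_letter_and_number("f1a"))  # None
print(element_for_code("T2"))        # table-wrap
```

Because `re.fullmatch` only succeeds when the pattern covers the entire string, inputs such as `'1f'` or `'f1a'` are rejected without any extra anchoring.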
@@ -257,6 +288,7 @@ def convert_html_to_xml_step_4(document):
     # logging.info("convert_html_to_xml - step 4")
     ppl = plumber.Pipeline(
         StartPipe(),
+        ReplaceIdhrefAndRidhrefByIdPipe(),
         DivIdToAssetPipe(),
         XRefTypePipe(),
         InsertGraphicInFigPipe(),
@@ -831,7 +863,7 @@ def parser_node(self, node, journal_acron):
         if "img/revistas/" in href or ".." in href:
             return self._create_internal_link_to_asset_html_page(node)

-        if journal_acron and journal_acron in href:
+        if journal_acron and f"/{journal_acron}/" in href.lower():
             return self._create_internal_link_to_asset_html_page(node)

         if ":" in href:
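The change to the `journal_acron` test narrows a plain substring match to a whole path segment compared against the lower-cased href. The sketch below isolates the two conditions from this hunk in standalone functions; the function names, the acronym `rsp`, and both hrefs are hypothetical and chosen only to show where the two checks disagree.

```python
def is_journal_asset_old(href, journal_acron):
    # Previous condition from the diff: plain, case-sensitive substring match.
    return bool(journal_acron and journal_acron in href)


def is_journal_asset_new(href, journal_acron):
    # Updated condition from the diff: the acronym must appear as a whole
    # path segment of the lower-cased href.
    return bool(journal_acron and f"/{journal_acron}/" in href.lower())


# Hypothetical values, for illustration only.
acron = "rsp"
unrelated = "/pdf/corsp/v1n1/a01.pdf"  # "rsp" occurs inside another segment
asset = "/pdf/RSP/v1n1/a01.pdf"        # acronym written in upper case

print(is_journal_asset_old(unrelated, acron), is_journal_asset_new(unrelated, acron))
print(is_journal_asset_old(asset, acron), is_journal_asset_new(asset, acron))
```

Note that the new condition lower-cases only the href, so it presumably relies on `journal_acron` already being lower case.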
@@ -950,7 +982,32 @@ def transform(self, data):
     def _extract_xref_text(self, xref_element):
         return " ".join(xref_element.xpath(".//text()")).strip()

-    def _extract_rid(self, href, pkg_name, label_text, label_number):
+    def get_rid_from_xref_label_and_number(self, label_text, label_number):
+        """
+        Gera o rid a partir do label_text e label_number.
+
+        Args:
+            label_text: Texto do label (e.g., 'Table', 'Figure')
+            label_number: Número do label (e.g., '1', '2')
+
+        Returns:
+            String com o rid ou None
+        """
+        if not label_text:
- if not label_text:
+ if not label_text or label_text == "":
Copilot AI, Nov 22, 2025
Potential IndexError if label_number is an empty string. Before accessing label_number[:-1] and label_number[-1], you should check that label_number has at least one character. An empty string would cause an IndexError when accessing these indices.
- if label_number[:-1].isdigit() and label_number[-1].isalpha():
+ if len(label_number) > 1 and label_number[:-1].isdigit() and label_number[-1].isalpha():
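One detail is worth checking before relying on this guard. In Python, slicing an empty string returns an empty string instead of raising, `"".isdigit()` is `False`, and `and` short-circuits, so the condition as quoted never reaches `label_number[-1]` for an empty input. The sketch below compares the two forms on a few hypothetical inputs; the function names are made up for the comparison.

```python
def has_digits_then_letter(label_number):
    # Condition as quoted in the review thread.
    return label_number[:-1].isdigit() and label_number[-1].isalpha()


def has_digits_then_letter_guarded(label_number):
    # Suggested variant with an explicit length check.
    return (
        len(label_number) > 1
        and label_number[:-1].isdigit()
        and label_number[-1].isalpha()
    )


# Hypothetical inputs, including the empty string and single characters.
for value in ["", "a", "1", "1a", "12b", "a1"]:
    same = has_digits_then_letter(value) == has_digits_then_letter_guarded(value)
    print(repr(value), has_digits_then_letter(value), same)
```

On these inputs the two forms agree, so the explicit `len()` check mostly documents intent. Neither form handles `None`; both raise `TypeError` in that case.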
Copilot AI, Nov 22, 2025
Double space found between 'and' and 'label_number'. This should be a single space for consistency.
- if label_number[:-1].isdigit() and  label_number[-1].isalpha():
+ if label_number[:-1].isdigit() and label_number[-1].isalpha():
Copilot AI, Nov 22, 2025
Using str.replace() on line 1023 could lead to incorrect results if pkg_name appears multiple times in the filename. For example, if pkg_name="test" and filename="testtest", this would result in an empty string. Consider using filename.removeprefix(pkg_name) (Python 3.9+) or filename[len(pkg_name):] if you specifically want to remove a prefix.
- filename = filename.replace(pkg_name, "")
+ filename = filename[len(pkg_name):]
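The difference between the three options mentioned in this comment can be seen with a few strings. This is a minimal sketch: `strip_prefix` is a hypothetical helper (not part of the PR) that behaves like `str.removeprefix` but also runs on Python versions older than 3.9, and the file names are invented.

```python
def strip_prefix(filename, pkg_name):
    """Hypothetical helper: drop pkg_name only when it is a leading prefix."""
    if pkg_name and filename.startswith(pkg_name):
        return filename[len(pkg_name):]
    return filename


pkg_name = "test"

# str.replace removes every occurrence, not just the leading one.
print(repr("testtest".replace(pkg_name, "")))

# Prefix-aware removal keeps the second occurrence.
print(repr(strip_prefix("testtest", pkg_name)))

# Plain slicing removes len(pkg_name) characters without checking the prefix.
print(repr("other-f1"[len(pkg_name):]))
print(repr(strip_prefix("other-f1", pkg_name)))
```

Plain slicing is only equivalent to prefix removal when the filename is already known to start with `pkg_name`; otherwise it silently drops unrelated characters.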
Copilot AI, Nov 22, 2025
Potential IndexError when xref_text.split() returns an empty list. If xref_text contains only whitespace, split() will return an empty list, and accessing parts[-1] on line 1041 will raise an IndexError. Consider adding a check: if not parts: return None, None after line 1038.
+ if not parts:
+     return None, None
Copilot AI, Nov 22, 2025
Potential IndexError when expected_number is an empty string. If for some reason parts[-1] is an empty string, accessing expected_number[0] on line 1042 will raise an IndexError. Consider adding a check: if not expected_number: return parts[0] if parts else None, None before line 1042.
  # first character of last part
+ if not parts or not parts[-1]:
+     return parts[0] if parts else None, None
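The two guards discussed in these comments depend on how `str.split()` treats blank input. The sketch below uses a hypothetical `split_label_and_number` function, since the PR's actual parsing code is not shown in full here, and assumes the label number is the last whitespace-separated token.

```python
def split_label_and_number(xref_text):
    """Hypothetical parser: 'Figure 1' -> ('Figure', '1'); blank -> (None, None)."""
    parts = xref_text.split()
    if not parts:
        # split() with no arguments returns [] for empty or whitespace-only text.
        return None, None
    if len(parts) == 1:
        # Only a label, no number.
        return parts[0], None
    return " ".join(parts[:-1]), parts[-1]


print("   ".split())                        # []
print(split_label_and_number("   "))        # (None, None)
print(split_label_and_number("Figure 1"))   # ('Figure', '1')
print(split_label_and_number("Table"))      # ('Table', None)
```

When called without arguments, `split()` never produces empty strings, so once `parts` is known to be non-empty, `parts[-1]` has at least one character. Also note that `return parts[0] if parts else None, None` returns a tuple whose first item is the conditional expression.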
Copilot AI, Nov 22, 2025
Potential IndexError if label_text or rid is an empty string. Before accessing label_text[0] on line 1052 or rid[0] on line 1055, you should verify these strings have at least one character. Empty strings would cause an IndexError.
Copilot AI, Nov 22, 2025
The XPath union operator | on line 1096 will match elements with either @id='{rid}' OR @filebasename='{basename}'. However, if an element has a different id but the same filebasename, it will still be matched. This could lead to matching the wrong element. Consider using and logic instead: //*[@id='{rid}' and @filebasename='{basename}'] or prioritizing the id match with a fallback query.
- xpath = f"//*[@id='{rid}' | @filebasename='{basename}']"
+ xpath = f"//*[@id='{rid}' and @filebasename='{basename}']"
Copilot AI, Nov 22, 2025
XPath injection vulnerability: The basename variable is inserted directly into the XPath expression without sanitization. If basename contains special characters like single quotes, it could break the XPath query or potentially be exploited. Consider sanitizing basename or using parameterized queries if available. For example, if basename contains a single quote, the XPath query will fail.
- xpath = f"//*[@filebasename='{basename}']"
- if rid:
-     xpath = f"//*[@id='{rid}' | @filebasename='{basename}']"
- found = root.xpath(xpath)[0]
+ if rid:
+     xpath = "//*[@id=$rid or @filebasename=$basename]"
+     found_nodes = root.xpath(xpath, rid=rid, basename=basename)
+ else:
+     xpath = "//*[@filebasename=$basename]"
+     found_nodes = root.xpath(xpath, basename=basename)
+ found = found_nodes[0]
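The suggestion above passes values as XPath variables, which lxml's `xpath()` method accepts as keyword arguments. Where an expression has to be built as a string instead, the value can be quoted explicitly. XPath 1.0 string literals have no escape character, so a common workaround is to pick the quote type the value does not contain and fall back to `concat()` when it contains both. The `xpath_literal` helper below is a hypothetical sketch of that technique and is not part of the PR.

```python
def xpath_literal(value):
    """Return value as an XPath 1.0 string literal, whatever quotes it contains."""
    if "'" not in value:
        return f"'{value}'"
    if '"' not in value:
        return f'"{value}"'
    # Both quote types present: join the single-quote-free pieces with
    # concat(), inserting a double-quoted "'" between them.
    pieces = value.split("'")
    return "concat(" + ", \"'\", ".join(f"'{piece}'" for piece in pieces) + ")"


# Hypothetical basenames, for illustration only.
for basename in ["f1", "o'brien-f1", "a'b\"c"]:
    print(f"//*[@filebasename={xpath_literal(basename)}]")
```

Also note that in XPath 1.0 `|` is the node-set union operator, while a boolean alternative inside a predicate is written `or`, as in the suggested replacement.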
Copilot AI, Nov 22, 2025
The comment states "Sort children by rid before inserting" but the code actually sorts by filebasename. Either update the comment to reflect the actual sorting key or verify that sorting by filebasename is the intended behavior.
- # Sort children by rid before inserting
+ # Sort children by filebasename before inserting
[nitpick] Both the inline comments (lines 49-53, 56, 60, 63) and the docstring are in Portuguese. Consider whether this is consistent with the project's documentation standards. If the project uses English for code comments, these should be translated for consistency.