-
Notifications
You must be signed in to change notification settings - Fork 3
refactor: melhora validação de dígitos e valores nulos na criação de … #140
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
refactor: melhora validação de dígitos e valores nulos na criação de … #140
Conversation
…elementos de data
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR refactors the date validation logic in the XML article metadata processing pipeline to handle malformed dates more robustly. The changes aim to prevent invalid XML generation when encountering dates with zeros or non-numeric characters in legacy databases.
Changes:
- Modified
_create_date_elementfunction to validate date components usingisdigit()and stop processing when encountering zeros or non-numeric values - Added try-except block in
XMLArticleMetaHistoryPipeto catch and log exceptions during date processing without halting the entire pipeline
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| if not value.isdigit(): | ||
| break | ||
| if int(value) == 0: | ||
| break | ||
| e = ET.Element(label) | ||
| e.text = value | ||
| date_element.append(e) |
Copilot
AI
Jan 30, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This PR introduces significant changes to date validation logic but no new tests are added to verify the behavior with malformed dates (e.g., "00000000", "20200200", "2023 12", or dates with non-numeric characters). The PR description specifically mentions these edge cases as the motivation for the changes. Add test cases to cover these scenarios and verify that the validation behaves as expected, especially given the critical logic changes identified in the _create_date_element function.
| try: | ||
| elem = _create_date_element("date", attributes, date_) | ||
| except Exception as e: | ||
| raw.exceptions.append( | ||
| { | ||
| "pipe": "XMLArticleMetaHistoryPipe", | ||
| "error_type": str(type(e)), | ||
| "error_message": str(e), | ||
| "exc_traceback": traceback.format_exc(), | ||
| } | ||
| ) | ||
| continue |
Copilot
AI
Jan 30, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The traceback module is used here but is not imported at the top of the file. This will cause a NameError when an exception occurs. Add "import traceback" to the imports section at the top of the file (similar to how it's done in sps_xml_refs.py:2).
| break | ||
| if int(value) == 0: | ||
| break |
Copilot
AI
Jan 30, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The validation logic now uses "break" instead of the implicit "continue" from the previous code, which fundamentally changes the behavior. When a zero or non-digit is encountered, the break statement stops all further date components from being processed.
For example, with date "20200200" (valid year and month, but day is "00"):
- Old behavior: would skip day but still add month and year elements
- New behavior: breaks at day "00" and adds no elements at all
This means valid year and month values will be lost if the day is zero or invalid. Consider using "continue" instead of "break" to preserve the old behavior of skipping invalid components while still processing valid ones. Alternatively, if the intent is to process in order of significance, the iteration order should be changed to (year, month, day) so that more significant components are processed first.
| break | |
| if int(value) == 0: | |
| break | |
| continue | |
| if int(value) == 0: | |
| continue |
| year, month, day = date_text[:4], date_text[4:6], date_text[6:] | ||
| labels = ("day", "month", "year") | ||
| for label, value in zip(labels, (day, month, year)): | ||
| if int(value) != 0: | ||
| e = ET.Element(label) | ||
| e.text = value | ||
| date_element.append(e) | ||
| if not value.isdigit(): |
Copilot
AI
Jan 30, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The code doesn't validate that date_text has the expected length (8 characters) before slicing. If date_text is shorter than 8 characters, the slicing operations will produce empty strings or truncated values. For example, if date_text is "2020", then year="2020", month="", day="". The empty string will fail the isdigit() check and break, but this could lead to silent failures. Consider adding a length check after the line 58 check, e.g., "if len(date_text) != 8: return" to make the validation more explicit and catch malformed input early.
| } | ||
| ) | ||
| continue | ||
| history.append(elem) |
Copilot
AI
Jan 30, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After the try-except block, the code attempts to append elem to history without checking if elem is None. The _create_date_element function can return None in two cases: when date_text is falsy (line 58-59), or when the date_element has no child elements added due to validation failures. If _create_date_element returns None and no exception is raised, this will attempt to append None to the history element, which could cause issues. Add a check "if elem is not None:" before the append operation on line 541.
| history.append(elem) | |
| if elem is not None: | |
| history.append(elem) |
…elementos de data
O que esse PR faz?
Este PR implementa melhorias na conversão de strings de data para elementos XML e aumenta a resiliência do pipeline de metadados de histórico. As principais alterações são:
_create_date_elementagora valida se os valores são numéricos e interrompe o processamento caso encontre zeros ou caracteres não numéricos, evitando a criação de elementos XML inválidos.try-exceptnoXMLArticleMetaHistoryPipepara garantir que falhas no processamento de uma data específica não interrompam todo o pipeline, registrando o erro detalhadamente no objetoraw.exceptions.Onde a revisão poderia começar?
A revisão deve começar pelo arquivo:
scielo_classic_website/spsxml/sps_xml_article_meta.pyInicie pela função
_create_date_element(linha 64) para entender a nova lógica de validação e depois siga para a classeXMLArticleMetaHistoryPipe(linha 523).Como este poderia ser testado manualmente?
raw.exceptions.Algum cenário de contexto que queira dar?
Em bases de dados legadas, é comum encontrar campos de data com preenchimento inconsistente (ex: "00000000" ou "2023 12"). A lógica anterior tentava converter esses valores, o que podia causar erros de conversão de tipo ou gerar XMLs semanticamente incorretos. Esta alteração torna o conversor mais defensivo contra dados ruidosos.
fixes
scieloorg/scms-upload#811
Screenshots
N/A (Alteração de lógica de backend)
Quais são tickets relevantes?
scieloorg/scms-upload#811
Referências