Skip to content

Conversation

@robertatakenaka
Copy link
Member

Pull Request: Otimização do processamento de arquivos em issue_folder

Descrição

Este PR refatora o módulo issue_folder.py para melhorar performance e separação de responsabilidades no processamento de arquivos de issues do SciELO.

Mudanças Principais

🚀 Otimização de Performance

  • Separação de responsabilidades: fixed_glob() agora apenas localiza arquivos, get_files() faz a leitura
    • Evita leitura desnecessária de arquivos grandes
    • Permite melhor controle de memória

📁 Melhorias no Processamento de Arquivos

  • Busca recursiva aprimorada: Pattern '**/*.*' para arquivos de tradução
    • Encontra traduções em subdiretórios corretamente
    • Maior flexibilidade na organização de arquivos

🔧 Tratamento de Erros Mais Robusto

  • HTMLs vazios: Tratados graciosamente sem lançar exceções
    • Arquivo continua sendo processado mesmo sem conteúdo
    • Melhora tolerância a falhas

🧹 Limpeza de Código

  • Logs simplificados:
    • Decodificação HTML reporta apenas encoding usado (utf-8 vs iso-8859-1)
    • Removidos logs verbosos que não agregavam valor
  • Código mais limpo:
    • Lógica de detecção de traduções em PDFs simplificada
    • Remoção de continue redundante após IndexError

Benefícios

  • Performance: Leitura sob demanda reduz uso de memória
  • Manutenibilidade: Separação clara de responsabilidades
  • Confiabilidade: Melhor tratamento de casos extremos
  • Logs mais úteis: Apenas informações relevantes são logadas

Impacto

Nenhuma mudança na API pública. Código cliente continua funcionando sem alterações.

Exemplo de Melhoria

Antes:

# Lê todos os arquivos imediatamente
for path in glob.glob(pattern):
    with open(path, "rb") as f:
        content = f.read()  # Memória alocada mesmo se não usar

Depois:

# Localiza arquivos primeiro
paths = fixed_glob(patterns)
# Lê apenas quando necessário
for path in paths:
    content = read_when_needed(path)

…nto de erros

- Refatora fixed_glob() para retornar conjunto de paths sem ler conteúdo
- Move leitura de arquivos para get_files() melhorando separação de responsabilidades
- Simplifica logs de decodificação HTML (utf-8 vs iso-8859-1)
- Remove logs verbosos desnecessários em files property
- Melhora busca recursiva de arquivos de tradução com pattern '**/*.*'
- Trata HTMLs vazios sem lançar exceção, apenas não define content
- Simplifica lógica de detecção de traduções em bases_pdf_files
- Remove continue desnecessário após IndexError em bases_pdf_files
Copilot AI review requested due to automatic review settings December 22, 2025 00:13
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR refactors the issue_folder.py module to optimize file processing for SciELO issues. The changes separate file discovery from file reading, update translation file search patterns, and modify error handling and logging approaches.

Key Changes:

  • Refactored fixed_glob() to only return file paths instead of reading file contents immediately
  • Changed translation file pattern from *.ht* to **/*.* for recursive subdirectory search
  • Simplified error handling and reduced logging verbosity

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +76 to +79
logging.info("HTML content decoded as utf-8")
except Exception as e:
content = content.decode("iso-8859-1")
logging.warning(
f"HTML content decoded using iso-8859-1 due to utf-8 decoding error. {e}"
)
logging.info("HTML content decoded as iso-8859-1")
Copy link

Copilot AI Dec 22, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Logging at INFO level for every file's encoding detection will create excessive log noise in production. Consider using DEBUG level for these routine operations, or removing the logs entirely since encoding detection is expected behavior.

Copilot uses AI. Check for mistakes.
self._subdir_acron_issue,
"*.ht*",
"**",
"*.*",
Copy link

Copilot AI Dec 22, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The pattern change from ".ht" to "" + "." enables recursive searching, but will now match all file types in subdirectories, not just HTML files (.htm, .html). This could lead to processing non-HTML files as HTML content, which may cause errors. Consider using a more specific pattern like "/.ht" to maintain the original file type filtering while enabling recursive search.

Suggested change
"*.*",
"*.ht*",

Copilot uses AI. Check for mistakes.
f"Conteúdo vazio no arquivo HTML {item['path']}"
)
item["content"] = content
if content:
Copy link

Copilot AI Dec 22, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When fix_html_content() returns None (for empty content), the original item["content"] remains as raw bytes. This creates inconsistent content types in the output - some items will have processed string content while others have unprocessed bytes. This inconsistency could cause issues for consumers of this data. Consider always setting item["content"] to an appropriate value (e.g., empty string or None) to maintain type consistency.

Suggested change
if content:
if content is None:
# Ensure item["content"] is always a text value, not raw bytes
item["content"] = ""
else:

Copilot uses AI. Check for mistakes.
lang = None
item["lang"] = None
key = item["key"]
if "_" in key and key[2] == "_":
Copy link

Copilot AI Dec 22, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The removal of the IndexError handler without replacing it with proper bounds checking creates a potential bug. The code assumes key[2] exists when checking key[2] == "_" on line 184, but if key has fewer than 3 characters, this will raise an IndexError that is no longer caught. This will cause the entire generator to fail instead of gracefully handling the invalid filename.

Suggested change
if "_" in key and key[2] == "_":
if len(key) >= 3 and key[2] == "_":

Copilot uses AI. Check for mistakes.
Comment on lines 64 to +67
except Exception as e:
yield {"type": file_type, "error": str(e), "error_type": type(e).__name__}
logging.exception(e)
item["error"] = str(e)
item["error_type"] = type(e).__name__
Copy link

Copilot AI Dec 22, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The variable item may not be defined if an exception occurs before line 54. If an exception happens during the modified_date(), os.path.basename(), or _get_classic_website_rel_path() calls (lines 57-59), the variable item will not exist when trying to set error attributes on lines 66-67, causing a NameError.

Copilot uses AI. Check for mistakes.
Comment on lines 42 to +48
def fixed_glob(patterns, file_type, recursive):
paths = set()
for pattern in patterns:
for path in glob.glob(pattern, recursive=recursive):
try:
item = {
"type": file_type,
}
item["path"] = path
with open(path, "rb") as f:
item["content"] = f.read()
item["modified_date"] = modified_date(path)
except Exception as e:
logging.exception(e)
item["error"] = str(e)
item["error_type"] = type(e).__name__
yield item
paths.update(
glob.glob(pattern, recursive=recursive)
)
return paths
Copy link

Copilot AI Dec 22, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The parameter file_type is passed to fixed_glob() but never used within that function. This parameter should either be removed from the function signature or utilized if it serves a purpose.

Copilot uses AI. Check for mistakes.
@robertatakenaka robertatakenaka merged commit 5dca311 into scieloorg:main Dec 22, 2025
7 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant