Create element and attribute detection modules for HTML-to-XML conversion #135

robertatakenaka · 2025-12-22T00:18:24Z

Pull Request: Advanced title-based XML element detection system

Description

This PR introduces a system that detects and classifies XML elements by analyzing their titles, with multilingual support (six languages) for scientific documents.

Main Changes

🎯 New Element Detection System

detector_title_parent.py (NEW)

Identifies the XML element type from the title content
Detects: ack, ref-list, abstract, kwd-group, app, glossary, etc.
Context-aware analysis for disambiguation

detector_config_fn.py (NEW)

Specialized patterns for footnotes (fn)
Maps special symbols (*, †, ‡, §) to footnote types
Detects: corresp, financial-disclosure, equal, con, etc.

🌍 Expanded Multilingual Support

detector_config_sec.py

REFERENCIAS_PATTERNS: detects references in 6 languages
- Portuguese, English, Spanish, French, Italian, German
- Variants: "References", "Bibliografia", "Literature Cited", etc.
AGRADECIMENTOS_PATTERNS: detects acknowledgments in multiple languages
- "Acknowledgments", "Agradecimentos", "Remerciements", "Danksagung"
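A minimal sketch of how pattern lists like these could be applied to a title. The two lists below are hypothetical stand-ins written for illustration, not the actual contents of REFERENCIAS_PATTERNS or AGRADECIMENTOS_PATTERNS, and `classify_back_title` is an assumed helper name.

```python
import re
from typing import Optional

# Hypothetical stand-ins for the real pattern lists (assumption, not the PR's data).
REFERENCE_PATTERNS_SKETCH = [
    r"(?i)^refer[eê]nc(es?|ias?)\b",     # References / Referências / Referencias
    r"(?i)^bibliogra(phy|fia|phie)\b",   # Bibliography / Bibliografia / Bibliographie
    r"(?i)^literature\s+cited\b",
]
ACK_PATTERNS_SKETCH = [
    r"(?i)^acknowledge?ments?\b",
    r"(?i)^agradecimentos?\b",
    r"(?i)^remerciements?\b",
    r"(?i)^danksagung\b",
]


def classify_back_title(title: str) -> Optional[str]:
    """Return 'ref-list', 'ack', or None for a section title."""
    text = title.strip()
    if any(re.search(p, text) for p in REFERENCE_PATTERNS_SKETCH):
        return "ref-list"
    if any(re.search(p, text) for p in ACK_PATTERNS_SKETCH):
        return "ack"
    return None


for sample in ["References", "Referências", "Literature Cited", "Danksagung", "Introduction"]:
    print(sample, "->", classify_back_title(sample))
```

Anchoring each pattern with `^` and `\b` is a design choice of this sketch; it keeps a body heading that merely mentions references from being classified as a reference list.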

🔧 Technical Fixes

detector_config_xref.py

Simplifies complex XPaths that were causing failures:
- author-notes//fn → fix-author-notes-fn
- table-wrap-foot//fn → fix-table-wrap-foot-fn
- Prevents XML processing errors

🚀 Advanced Features

detector.py

Footnote detection:
- detect_fn_type(): classifies the footnote type
- extract_fn_number(): extracts numbering/symbols
- suggest_fn_id(): generates standardized IDs
Contextual disambiguation:
- detect_element_type(): distinguishes sec from fn
- Uses context (back, body) to decide
- Heuristics for ambiguous cases
Batch processing:
- batch_detect_element_types(): analyzes multiple elements in a single call

Use Cases

# Detect the element type from the title
detect_element_type("*Corresponding author")
# → {'element_type': 'fn', 'type_attribute': 'corresp', ...}

detect_element_type("References")
# → {'element_type': 'ref-list', ...}

detect_element_type("Financial disclosure: Grant XYZ")
# → {'element_type': 'fn', 'type_attribute': 'financial-disclosure', ...}

Benefits

✅ Accuracy: accurate detection of structural elements
✅ Multilingual: native support for 6+ languages
✅ Robust: handles ambiguous cases and special symbols
✅ Extensible: modular architecture for new patterns

Impact

Significantly improves the conversion of legacy HTML documents to structured XML, especially for international scientific content.

…mentos
- Adds REFERENCIAS_PATTERNS with patterns in multiple languages (PT, EN, ES, FR, IT, DE)
- Adds AGRADECIMENTOS_PATTERNS with multilingual patterns for acknowledgment sections
- Supports detection of bibliographic references, bibliography, and literature cited
- Allows identification of acknowledgments/remerciements/danksagung sections

- Replaces the complex XPath 'author-notes//fn' with 'fix-author-notes-fn'
- Replaces the XPath 'table-wrap-foot//fn' with 'fix-table-wrap-foot-fn'
- Replaces 'sec[@sec-type='transcript']' with 'fix-sec-transcript'
- Replaces 'abstract[@abstract-type='graphical']' with 'fix-visual-abstract'
- Simplifies mappings that were causing problems in XML processing

…s estruturais
- Implements detect_fn_type() to identify footnote types
- Adds detect_element_type() to distinguish between sections and footnotes
- Creates extract_fn_number() to extract footnote numbering
- Implements suggest_fn_id() to generate standardized footnote IDs
- Adds batch_detect_element_types() for batch processing
- Supports detection of special symbols (*, †, ‡, §) in footnotes
- Improves disambiguation between sec and fn using context and heuristics
- Adds multilingual support for footnote types (corresp, equal, financial-disclosure)
- Implements detection of references and acknowledgments as special elements

Copilot

Pull request overview

This PR introduces an intelligent XML element detection system for converting HTML to JATS/SPS XML format. The system uses title-based analysis with comprehensive multilingual support (Portuguese, English, Spanish, French, Italian, German) to identify and classify document elements such as sections, footnotes, abstracts, references, and acknowledgments.

Key Changes:

New semantic detection modules for classifying elements based on title content patterns
Multilingual footnote type detection with special symbol handling (*, †, ‡, §)
XPath simplification to prevent XML processing failures
Context-aware disambiguation logic for ambiguous cases

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 21 comments.


File	Description
detector_title_parent.py	New module that maps title content to parent XML elements using regex patterns across 6+ languages
detector_config_fn.py	New configuration module defining footnote type patterns, symbol mappings, and numbering patterns for multilingual footnote detection
detector_config_xref.py	Simplifies complex XPath expressions to fix-prefixed placeholders to prevent XML processing errors
detector_config_sec.py	Adds multilingual pattern definitions for acknowledgments and references sections
detector.py	Implements core detection functions including element type detection, footnote type classification, number extraction, and ID generation with disambiguation logic


Copilot · 2025-12-22T00:33:01Z

scielo_classic_website/spsxml/detector_config_sec.py

+    r'(?i)agradecimentos?(\s+especiais?)?',
+
+    # English
+    r'(?i)acknowledgm?ents?',


The pattern "r'(?i)acknowledgm?ents?'" on line 380 will match "acknowledgents" (which is incorrect) but not "acknowledgements" (which is correct). The optional 'm' makes the pattern match both "acknowledgents" and "acknowledgments", when it should match "acknowledgments" and "acknowledgements". Consider changing to "r'(?i)acknowledg(e)?ments?'" or list both variants explicitly.

Suggested change

r'(?i)acknowledgm?ents?',

r'(?i)acknowledg(e)?ments?',
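The difference between the two patterns can be checked directly. The sketch below uses `fullmatch` on single words to isolate the spelling behavior; that is a choice made for this illustration, since the code in the PR calls `re.search` on whole titles.

```python
import re

ORIGINAL = re.compile(r"(?i)acknowledgm?ents?")      # pattern quoted from the diff
SUGGESTED = re.compile(r"(?i)acknowledg(e)?ments?")  # pattern from the suggested change


def spelling_report(words):
    """Map each word to (matches ORIGINAL, matches SUGGESTED) using full-string matching."""
    return {
        word: (bool(ORIGINAL.fullmatch(word)), bool(SUGGESTED.fullmatch(word)))
        for word in words
    }


report = spelling_report(["Acknowledgments", "Acknowledgements", "Acknowledgents"])
for word, (original_ok, suggested_ok) in report.items():
    print(word, "original:", original_ok, "suggested:", suggested_ok)
```

With these inputs the original pattern accepts the misspelling and rejects the British spelling, while the suggested pattern accepts both valid spellings and rejects the misspelling.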

Copilot · 2025-12-22T00:33:01Z

scielo_classic_website/spsxml/detector.py

+    for pattern in REFERENCIAS_PATTERNS:
+        if re.search(pattern, title):
+            result["element_type"] = "ref-list"
+            return result
+
+    for pattern in AGRADECIMENTOS_PATTERNS:
+        if re.search(pattern, title):
+            result["element_type"] = "ack"
+            return result


The function checks REFERENCIAS_PATTERNS and AGRADECIMENTOS_PATTERNS first, returning immediately if matched (lines 570-578). However, it then checks NOT_FN_INDICATORS which includes similar patterns. This creates a precedence issue where certain titles will always be classified as ref-list or ack, even if they might contextually be sections. Consider adding a context parameter check before these early returns, or document why these should always take precedence.
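One possible shape for the context check, as a sketch only. The one-entry pattern lists are hypothetical stand-ins, `detect_special_back_element` is an assumed helper name, and the rule that a missing context still allows the early match is an assumption, not something the PR states.

```python
import re
from typing import Optional

# Hypothetical one-entry stand-ins for the real lists in detector_config_sec.py.
REFERENCIAS_PATTERNS = [r"(?i)^references?\b"]
AGRADECIMENTOS_PATTERNS = [r"(?i)^acknowledge?ments?\b"]


def detect_special_back_element(title: str, context: Optional[str] = None) -> Optional[str]:
    """Return 'ref-list' or 'ack' only when the context allows an early decision."""
    # Assumption: only an unknown context or the back matter may short-circuit.
    if context not in (None, "back"):
        return None
    text = title.strip()
    if any(re.search(p, text) for p in REFERENCIAS_PATTERNS):
        return "ref-list"
    if any(re.search(p, text) for p in AGRADECIMENTOS_PATTERNS):
        return "ack"
    return None


print(detect_special_back_element("References"))                              # no context given
print(detect_special_back_element("References to earlier work", "body"))      # falls through
print(detect_special_back_element("Acknowledgments", "back"))
```

A caller would fall through to the regular sec/fn logic whenever this helper returns None.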

Copilot · 2025-12-22T00:33:01Z

scielo_classic_website/spsxml/detector.py

+            clean_number = re.sub(r"[^\w]", "", number)
+            return f"fn{clean_number}"


Potential security issue with regex injection. The clean_number variable uses re.sub(r"[^\w]", "", number) to sanitize the number (line 724), but if number contains backslashes or other special regex characters from user input, this could potentially cause issues. While this is unlikely in this context, consider using a whitelist approach with str.isalnum() or similar for more robust sanitization.
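For reference, in `re.sub(pattern, repl, string)` only the first argument is compiled as a regular expression, so in the quoted line `number` is just the text being searched. The sketch below shows what the call keeps, next to a hypothetical stricter variant (`clean_ascii_alnum` is an illustrative name, not part of the PR).

```python
import re


def clean_with_regex(number: str) -> str:
    # Same call as in the diff: the pattern is a fixed literal and
    # `number` is only the subject string, never compiled as a regex.
    return re.sub(r"[^\w]", "", number)


def clean_ascii_alnum(number: str) -> str:
    # Hypothetical stricter variant: keep ASCII letters and digits only.
    return "".join(ch for ch in number if ch.isascii() and ch.isalnum())


for sample in ["1", "a)", r"\d+(.*)", "2_b"]:
    print(repr(sample), "->", repr(clean_with_regex(sample)), repr(clean_ascii_alnum(sample)))
```

The practical difference is that `\w` keeps underscores and non-ASCII word characters, which may or may not be wanted in a generated ID.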

Copilot · 2025-12-22T00:33:02Z

scielo_classic_website/spsxml/detector.py

+def detect_element_type(title: str, context: Optional[str] = None) -> Dict[str, Optional[str]]:
+    """
+    Detects whether a title corresponds to a section (sec) or a footnote (fn).
+
+    Args:
+        title: Title or text to analyze
+        context: Optional additional context (e.g. "back", "body", "front")
+
+    Returns:
+        Dictionary with:
+        - element_type: 'sec', 'fn' or None
+        - type_attribute: detected sec-type or fn-type
+        - confidence: 'high', 'medium', 'low'
+        - detected_as: description of what was detected
+        - suggested_id: suggested ID for the element
+
+    Examples:
+        >>> detect_element_type("Introduction")
+        {'element_type': 'sec', 'type_attribute': 'intro', ...}
+
+        >>> detect_element_type("*Corresponding author")
+        {'element_type': 'fn', 'type_attribute': 'corresp', ...}
+
+        >>> detect_element_type("Financial disclosure: This work was supported by...")
+        {'element_type': 'fn', 'type_attribute': 'financial-disclosure', ...}
+    """
+    result = {
+        "element_type": None,
+        "type_attribute": None,
+        "confidence": None,
+        "detected_as": None,
+        "suggested_id": None,
+        "number": None,
+    }
+
+    if not title:
+        return result
+
+    title = title.strip()
+    if not title:
+        return result
+
+    for pattern in REFERENCIAS_PATTERNS:
+        if re.search(pattern, title):
+            result["element_type"] = "ref-list"
+            return result
+
+    for pattern in AGRADECIMENTOS_PATTERNS:
+        if re.search(pattern, title):
+            result["element_type"] = "ack"
+            return result
+
+    # Check whether this is definitely NOT a footnote
+    for pattern in NOT_FN_INDICATORS:
+        if re.match(pattern, title):
+            # It is a section
+            sec_type, number = detect_sec_type_and_number(title)
+            if sec_type:
+                result["element_type"] = "sec"
+                result["type_attribute"] = sec_type
+                result["confidence"] = "high"
+                result["detected_as"] = f"section: {sec_type}"
+                result["suggested_id"] = suggest_sec_id(title)
+                result["number"] = number
+                return result
+
+    # Try to detect as a footnote
+    fn_type = detect_fn_type(title)
+
+    # Try to detect as a section
+    sec_type, section_number = detect_sec_type_and_number(title)
+
+    # Decide based on what was detected
+    if fn_type and not sec_type:
+        # Detected only as a footnote
+        result["element_type"] = "fn"
+        result["type_attribute"] = fn_type
+        result["confidence"] = "high"
+        result["detected_as"] = f"footnote: {fn_type}"
+        result["number"] = extract_fn_number(title)
+        result["suggested_id"] = suggest_fn_id(fn_type, result["number"])
+
+    elif sec_type and not fn_type:
+        # Detected only as a section
+        result["element_type"] = "sec"
+        result["type_attribute"] = sec_type
+        result["confidence"] = "high"
+        result["detected_as"] = f"section: {sec_type}"
+        result["number"] = section_number
+        result["suggested_id"] = suggest_sec_id(title)
+
+    elif fn_type and sec_type:
+        # Detected as both - needs disambiguation
+        # Uses context and heuristics
+
+        # If it is in the back matter, it is probably a footnote
+        if context == "back":
+            result["element_type"] = "fn"
+            result["type_attribute"] = fn_type
+            result["confidence"] = "medium"
+            result["detected_as"] = f"footnote (in back): {fn_type}"
+            result["number"] = extract_fn_number(title)
+            result["suggested_id"] = suggest_fn_id(fn_type, result["number"])
+
+        # If it starts with a special symbol, it is probably a footnote
+        elif any(title.startswith(symbol) for symbol in SYMBOL_TO_FN_TYPE.keys()):
+            result["element_type"] = "fn"
+            result["type_attribute"] = fn_type
+            result["confidence"] = "high"
+            result["detected_as"] = f"footnote (symbol): {fn_type}"
+            result["number"] = extract_fn_number(title)
+            result["suggested_id"] = suggest_fn_id(fn_type, result["number"])
+
+        # If it has section numbering (2.1, etc.), it is probably a section
+        elif section_number and "." in section_number:
+            result["element_type"] = "sec"
+            result["type_attribute"] = sec_type
+            result["confidence"] = "high"
+            result["detected_as"] = f"section (numbered): {sec_type}"
+            result["number"] = section_number
+            result["suggested_id"] = suggest_sec_id(title)
+
+        else:
+            # Ambiguous case - defaults to section
+            result["element_type"] = "sec"
+            result["type_attribute"] = sec_type
+            result["confidence"] = "low"
+            result["detected_as"] = f"ambiguous (defaulting to section): {sec_type}"
+            result["number"] = section_number
+            result["suggested_id"] = suggest_sec_id(title)
+
+    else:
+        # Not detected as either
+        result["confidence"] = "none"
+        result["detected_as"] = "unidentified"
+
+    return result


Poor performance due to repeated pattern matching. The detect_element_type function calls multiple detection functions (detect_fn_type, detect_sec_type_and_number) which each iterate through large pattern lists. For batch processing, this means patterns are matched multiple times for the same text. Consider caching results or restructuring the logic to perform pattern matching once and classify based on results.
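A minimal sketch of the caching idea, assuming a small hypothetical FN_TYPE_PATTERNS table (the real one lives in detector_config_fn.py and is not shown here). `cached_fn_type` is an illustrative name. Note that Python's `re` module already keeps an internal cache of compiled patterns, so the gain from precompiling alone may be modest; memoizing per title is what avoids repeated scans.

```python
import re
from functools import lru_cache
from typing import Optional

# Hypothetical stand-in for the real FN_TYPE_PATTERNS (assumption).
FN_TYPE_PATTERNS = {
    "corresp": [r"(?i)corresponding\s+author", r"(?i)correspond[eê]ncia"],
    "financial-disclosure": [r"(?i)financial\s+disclosure"],
}

# Compile every pattern once at import time instead of on each call.
_COMPILED_FN_TYPE_PATTERNS = tuple(
    (fn_type, tuple(re.compile(p) for p in patterns))
    for fn_type, patterns in FN_TYPE_PATTERNS.items()
)


@lru_cache(maxsize=4096)
def cached_fn_type(text: str) -> Optional[str]:
    """Return the first fn-type whose patterns match; results are memoized per text."""
    for fn_type, patterns in _COMPILED_FN_TYPE_PATTERNS:
        if any(p.search(text) for p in patterns):
            return fn_type
    return None
```

Calling `cached_fn_type.cache_info()` after a batch run shows how many lookups were served from the cache, which gives a quick way to judge whether repeated titles are common enough for this to matter.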

Copilot · 2025-12-22T00:33:02Z

scielo_classic_website/spsxml/detector_title_parent.py

+    r'^(summary|sumário|sommario|sommaire|síntesis)': 'abstract',
+
+    # Trans-abstract (translated abstracts with a language indication)
+    r'^(abstract|resumo|resumen|résumé).*(english|inglês|español|português|français|castellano|espanhol)': 'trans-abstract',


Copilot · 2025-12-22T00:33:04Z

scielo_classic_website/spsxml/detector.py

+def detect_fn_type(title: str) -> Optional[str]:
+    """
+    Detects the footnote type (@fn-type) from the title or text.

-    # Test 1: Analysis from text
-    print("\n1. ANÁLISE A PARTIR DE TEXTO:")
-    print("-" * 40)
+    Args:
+        title: Title or text of the footnote

-    test_texts = [
-        # Multiple languages
-        "Figure 1",
-        "Figura 2",
-        "Abbildung 3",
-        "Figuur 4",
-        "Table 1",
-        "Tableau 2",
-        "Tabelle 3",
-        "Tabel 4",
-        "Equation 1",
-        "Équation 2",
-        "Gleichung 3",
-        "Section 2.1",
-        "Sección 3",
-        "Chapitre 4",
-        # Bibliographic references
-        "Silva et al., 2024",
-        "García y col., 2023",
-        "Schmidt u.a., 2022",
-        "Van Der Merwe en ander, 2021",
-    ]
+    Returns:
+        Identified fn-type, or None if not identified

-    for text in test_texts:
-        ref_type, element_name, prefix, number = detect_from_text(text)
-        rid = f"{prefix}{number}" if prefix and number else None
-        print(
-            f"{text:<30} → ref_type: {ref_type:<15} element: {element_name:<20} rid: {rid}"
-        )
+    Examples:
+        >>> detect_fn_type("Corresponding author")
+        'corresp'
+        >>> detect_fn_type("Financial disclosure")
+        'financial-disclosure'
+        >>> detect_fn_type("†These authors contributed equally")
+        'equal'
+    """
+    if not title:
+        return None

-    # Test 2: Analysis from IDs
-    print("\n2. ANÁLISE A PARTIR DE IDs:")
-    print("-" * 40)
+    title = title.strip()
+
+    # First check for special symbols at the start
+    for symbol, fn_type in SYMBOL_TO_FN_TYPE.items():
+        if title.startswith(symbol):
+            # Even if the symbol suggests a type, still check the text
+            # to confirm or refine the detection
+            title_without_symbol = title[len(symbol):].strip()
+            detected_type = _detect_fn_type_from_patterns(title_without_symbol)
+            if detected_type:
+                return detected_type
+            # If no specific pattern was found, use the symbol's type
+            return fn_type
+
+    # Try detection using the text patterns
+    return _detect_fn_type_from_patterns(title)
+
+
+def _detect_fn_type_from_patterns(text: str) -> Optional[str]:
+    """
+    Helper function to detect fn-type from the text patterns.
+    """
+    if not text:
+        return None

-    test_ids = [
-        "f1",
-        "f2a",
-        "t3",
-        "B42",
-        "sec2_1",
-        "app1",
-        "e5",
-        "TFN1",
-        "suppl3",
-        "S2",
-        "fnast",
-        "fndag",
-    ]
+    # Remove possible leading numbering
+    clean_text = text
+    for pattern in FN_NUMBER_PATTERNS:
+        clean_text = re.sub(pattern, "", text, count=1)
+        if clean_text != text:
+            break

-    for rid in test_ids:
-        ref_type, element_name = detect_from_id(rid)
-        print(f"{rid:<15} → ref_type: {ref_type:<20} element: {element_name}")
+    clean_text = clean_text.strip()
+    if not clean_text:
+        return None

-    # Test 3: Full bidirectional analysis
-    print("\n3. ANÁLISE BIDIRECIONAL COMPLETA:")
-    print("-" * 40)
+    # Check fn-type patterns
+    for fn_type, patterns in FN_TYPE_PATTERNS.items():
+        for pattern in patterns:
+            if re.search(pattern, clean_text):
+                return fn_type

-    test_cases = [
-        {"text": "Figure 1", "rid": "f1"},
-        {"text": "Tabla 2", "rid": "t2"},
-        {"text": "Section 3.1", "rid": "sec3_1"},
-        {"text": "Fig. 4", "rid": "wrong5"},  # Inconsistent
-        {"text": "Equation 2"},  # Text only
-        {"rid": "B15"},  # ID only
-    ]
+    return None

-    for test in test_cases:
-        result = analyze_xref(**test)
-        print(f"\nInput: {test}")
-        print(
-            f"Result: ref_type={result['ref_type']}, element={result['element_name']}, "
-            f"rid={result['rid']}, source={result['source']}, consistent={result['consistent']}"
-        )

-    # Test 4: Batch analysis
-    print("\n4. ANÁLISE EM LOTE:")
-    print("-" * 40)
+def detect_element_type(title: str, context: Optional[str] = None) -> Dict[str, Optional[str]]:
+    """
+    Detects whether a title corresponds to a section (sec) or a footnote (fn).
+
+    Args:
+        title: Title or text to analyze
+        context: Optional additional context (e.g. "back", "body", "front")
+
+    Returns:
+        Dictionary with:
+        - element_type: 'sec', 'fn' or None
+        - type_attribute: detected sec-type or fn-type
+        - confidence: 'high', 'medium', 'low'
+        - detected_as: description of what was detected
+        - suggested_id: suggested ID for the element
+
+    Examples:
+        >>> detect_element_type("Introduction")
+        {'element_type': 'sec', 'type_attribute': 'intro', ...}
+
+        >>> detect_element_type("*Corresponding author")
+        {'element_type': 'fn', 'type_attribute': 'corresp', ...}
+
+        >>> detect_element_type("Financial disclosure: This work was supported by...")
+        {'element_type': 'fn', 'type_attribute': 'financial-disclosure', ...}
+    """
+    result = {
+        "element_type": None,
+        "type_attribute": None,
+        "confidence": None,
+        "detected_as": None,
+        "suggested_id": None,
+        "number": None,
+    }
+
+    if not title:
+        return result
+
+    title = title.strip()
+    if not title:
+        return result
+
+    for pattern in REFERENCIAS_PATTERNS:
+        if re.search(pattern, title):
+            result["element_type"] = "ref-list"
+            return result
+
+    for pattern in AGRADECIMENTOS_PATTERNS:
+        if re.search(pattern, title):
+            result["element_type"] = "ack"
+            return result
+
+    # Check whether this is definitely NOT a footnote
+    for pattern in NOT_FN_INDICATORS:
+        if re.match(pattern, title):
+            # It is a section
+            sec_type, number = detect_sec_type_and_number(title)
+            if sec_type:
+                result["element_type"] = "sec"
+                result["type_attribute"] = sec_type
+                result["confidence"] = "high"
+                result["detected_as"] = f"section: {sec_type}"
+                result["suggested_id"] = suggest_sec_id(title)
+                result["number"] = number
+                return result
+
+    # Try to detect as a footnote
+    fn_type = detect_fn_type(title)
+
+    # Try to detect as a section
+    sec_type, section_number = detect_sec_type_and_number(title)
+
+    # Decide based on what was detected
+    if fn_type and not sec_type:
+        # Detected only as a footnote
+        result["element_type"] = "fn"
+        result["type_attribute"] = fn_type
+        result["confidence"] = "high"
+        result["detected_as"] = f"footnote: {fn_type}"
+        result["number"] = extract_fn_number(title)
+        result["suggested_id"] = suggest_fn_id(fn_type, result["number"])
+
+    elif sec_type and not fn_type:
+        # Detected only as a section
+        result["element_type"] = "sec"
+        result["type_attribute"] = sec_type
+        result["confidence"] = "high"
+        result["detected_as"] = f"section: {sec_type}"
+        result["number"] = section_number
+        result["suggested_id"] = suggest_sec_id(title)
+
+    elif fn_type and sec_type:
+        # Detected as both - needs disambiguation
+        # Uses context and heuristics
+
+        # If it is in the back matter, it is probably a footnote
+        if context == "back":
+            result["element_type"] = "fn"
+            result["type_attribute"] = fn_type
+            result["confidence"] = "medium"
+            result["detected_as"] = f"footnote (in back): {fn_type}"
+            result["number"] = extract_fn_number(title)
+            result["suggested_id"] = suggest_fn_id(fn_type, result["number"])
+
+        # If it starts with a special symbol, it is probably a footnote
+        elif any(title.startswith(symbol) for symbol in SYMBOL_TO_FN_TYPE.keys()):
+            result["element_type"] = "fn"
+            result["type_attribute"] = fn_type
+            result["confidence"] = "high"
+            result["detected_as"] = f"footnote (symbol): {fn_type}"
+            result["number"] = extract_fn_number(title)
+            result["suggested_id"] = suggest_fn_id(fn_type, result["number"])
+
+        # If it has section numbering (2.1, etc.), it is probably a section
+        elif section_number and "." in section_number:
+            result["element_type"] = "sec"
+            result["type_attribute"] = sec_type
+            result["confidence"] = "high"
+            result["detected_as"] = f"section (numbered): {sec_type}"
+            result["number"] = section_number
+            result["suggested_id"] = suggest_sec_id(title)
+
+        else:
+            # Ambiguous case - defaults to section
+            result["element_type"] = "sec"
+            result["type_attribute"] = sec_type
+            result["confidence"] = "low"
+            result["detected_as"] = f"ambiguous (defaulting to section): {sec_type}"
+            result["number"] = section_number
+            result["suggested_id"] = suggest_sec_id(title)
+
+    else:
+        # Not detected as either
+        result["confidence"] = "none"
+        result["detected_as"] = "unidentified"
+
+    return result

-    batch = [
-        {"text": "Figure 1"},
-        {"text": "Tableau 2", "rid": "t2"},
-        {"rid": "B3"},
-        {"text": "Gleichung 4"},
-        {"text": "Material Suplementar S1"},
-    ]

-    results = batch_analyze_xrefs(batch)
-    for i, result in enumerate(results, 1):
-        print(f"\n{i}. {batch[i-1]}")
-        print(f"   → {result['ref_type']}/{result['element_name']}/{result['rid']}")
+def extract_fn_number(text: str) -> Optional[str]:
+    """
+    Extracts the number or symbol of a footnote.
+
+    Args:
+        text: Footnote text
+
+    Returns:
+        Extracted number, letter, or symbol
+    """
+    if not text:
+        return None
+
+    for pattern in FN_NUMBER_PATTERNS:
+        match = re.match(pattern, text)
+        if match:
+            return match.group(1)
+
+    return None
+
+
+def suggest_fn_id(fn_type: Optional[str], number: Optional[str] = None) -> str:
+    """
+    Suggests an ID for a footnote based on its type and number.
+
+    Args:
+        fn_type: Footnote type
+        number: Footnote number or symbol
+
+    Returns:
+        Suggested ID for the footnote
+
+    Examples:
+        >>> suggest_fn_id("corresp", "1")
+        'fn1'
+        >>> suggest_fn_id("equal", "*")
+        'fnast'
+        >>> suggest_fn_id("financial-disclosure")
+        'fn1'
+    """
+    if number:
+        # Handle special symbols
+        symbol_map = {
+            "*": "ast",
+            "†": "dag",
+            "‡": "ddag",
+            "§": "sect",
+            "¶": "para",
+            "#": "hash",
+            "**": "dast",
+            "††": "ddag",
+        }
+
+        if number in symbol_map:
+            return f"fn{symbol_map[number]}"
+        else:
+            # Remove non-alphanumeric characters
+            clean_number = re.sub(r"[^\w]", "", number)
+            return f"fn{clean_number}"
+    else:
+        return "fn1"

-    # Test 5: sec-type detection
-    print("\n" + "=" * 80)
-    print("5. DETECÇÃO DE SEC-TYPE:")
-    print("-" * 40)

+def batch_detect_element_types(titles: List[str], context: Optional[str] = None) -> List[Dict[str, Optional[str]]]:
+    """
+    Detects the types of multiple elements in a batch.
+
+    Args:
+        titles: List of titles to analyze
+        context: Optional context for all elements
+
+    Returns:
+        List of detection results
+    """
+    results = []
+    for title in titles:
+        result = detect_element_type(title, context)
+        result["original_title"] = title
+        results.append(result)
+    return results


The new detector functions (detect_fn_type, detect_element_type, extract_fn_number, suggest_fn_id, batch_detect_element_types) lack test coverage. The repository has existing tests for other spsxml functionality, but these new detection functions are not tested. Given the complexity of the multilingual pattern matching and disambiguation logic, comprehensive test coverage is essential to ensure correctness across different languages and edge cases.
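As a starting point, a sketch of pytest-style tests for `suggest_fn_id`. The function body is copied from the diff above (with comments in English) so the block runs on its own; in the repository the tests would presumably import it from the detector module instead. The test names are illustrative, and the assertions document the behavior of the quoted code rather than a specification.

```python
import re
from typing import Optional


def suggest_fn_id(fn_type: Optional[str], number: Optional[str] = None) -> str:
    """Copy of the function from the diff, kept here so the tests are self-contained."""
    if number:
        symbol_map = {
            "*": "ast", "†": "dag", "‡": "ddag", "§": "sect",
            "¶": "para", "#": "hash", "**": "dast", "††": "ddag",
        }
        if number in symbol_map:
            return f"fn{symbol_map[number]}"
        # Remove non-alphanumeric characters
        clean_number = re.sub(r"[^\w]", "", number)
        return f"fn{clean_number}"
    return "fn1"


def test_numbered_footnote():
    assert suggest_fn_id("corresp", "1") == "fn1"


def test_symbol_footnote():
    assert suggest_fn_id("equal", "*") == "fnast"


def test_missing_number_defaults_to_fn1():
    assert suggest_fn_id("financial-disclosure") == "fn1"


def test_double_dagger_and_two_daggers_share_an_id():
    # In the quoted code both "‡" and "††" map to "ddag", so the IDs collide.
    assert suggest_fn_id("other", "‡") == "fnddag"
    assert suggest_fn_id("other", "††") == "fnddag"
```

The last test and the default case point at two places where distinct footnotes can receive the same ID, which may be worth a decision before the tests are finalized.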

Copilot · 2025-12-22T00:33:04Z

scielo_classic_website/spsxml/detector_title_parent.py

+    r'^(figure|figura|abbildung|fig\.|ilustra|gráfico)': 'caption',
+    r'^(table|tabela|tabla|tableau|cuadro|quadro)': 'caption',
+    r'^(legend|legenda|leyenda|légende|explicación)': 'legend',


Copilot · 2025-12-22T00:33:05Z

scielo_classic_website/spsxml/detector_config_fn.py

+# Mapping of special symbols that may indicate specific footnote types
+SYMBOL_TO_FN_TYPE = {
+    "*": "corresp",  # Asterisk, usually for correspondence
+    "†": "deceased",  # Dagger, for deceased or equal contribution


The mapping uses "†" (dagger symbol) to indicate both "deceased" and "equal" contribution in different contexts. Line 504 maps it to "deceased" while line 217 in the equal patterns includes "^†equal". This creates ambiguity - the same symbol cannot reliably indicate two different fn-types. Consider removing the dagger from SYMBOL_TO_FN_TYPE or using context-dependent logic to determine the correct type.

Suggested change

"†": "deceased", # Dagger, for deceased or equal contribution
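One way to make the symbol fallback explicit, sketched under assumptions: TEXT_PATTERNS is a hypothetical two-entry table, and `resolve_fn_type` is an illustrative name. It follows the same order as the quoted detect_fn_type (text patterns first, symbol second) but also reports which source decided, so a caller could treat symbol-only results as lower confidence.

```python
import re
from typing import Optional, Tuple

SYMBOL_TO_FN_TYPE = {"*": "corresp", "†": "deceased"}  # entries quoted in the diff

# Hypothetical text patterns for illustration only.
TEXT_PATTERNS = {
    "equal": [r"(?i)contributed\s+equally"],
    "deceased": [r"(?i)\bdeceased\b"],
}


def resolve_fn_type(title: str) -> Tuple[Optional[str], Optional[str]]:
    """Return (fn_type, source) where source is 'text', 'symbol', or None."""
    text = title.strip()
    symbol_type = None
    for symbol, fn_type in SYMBOL_TO_FN_TYPE.items():
        if text.startswith(symbol):
            symbol_type = fn_type
            text = text[len(symbol):].strip()
            break
    # The text wins whenever it matches a known pattern.
    for fn_type, patterns in TEXT_PATTERNS.items():
        if any(re.search(p, text) for p in patterns):
            return fn_type, "text"
    # Otherwise fall back to the symbol's default type, if any.
    if symbol_type:
        return symbol_type, "symbol"
    return None, None


print(resolve_fn_type("†These authors contributed equally"))
print(resolve_fn_type("†John Doe"))
```

Under this scheme the dagger stays in the mapping, and the ambiguity is surfaced to the caller instead of being resolved silently.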

Copilot · 2025-12-22T00:33:05Z

scielo_classic_website/spsxml/detector.py



-def get_id_prefix_and_number(text: str, ref_type: Optional[str] = None) -> str:
+def get_id_prefix_and_number(text: str, ref_type: Optional[str] = None) -> Tuple[Optional[str], Optional[str]]:


The return type annotation changed from 'str' to 'Tuple[Optional[str], Optional[str]]', which is correct for the implementation. However, the function name 'get_id_prefix_and_number' suggests it returns an ID, but it now returns a tuple of (prefix, number). Consider renaming the function to 'extract_prefix_and_number' or 'parse_id_components' to better reflect what it returns.

Copilot · 2025-12-22T00:33:05Z

scielo_classic_website/spsxml/detector.py

+    for pattern in FN_NUMBER_PATTERNS:
+        clean_text = re.sub(pattern, "", text, count=1)
+        if clean_text != text:
+            break


Potential infinite loop in pattern matching. The code iterates through FN_NUMBER_PATTERNS and uses 'break' when a match is found (line 513). However, if the first pattern matches but produces the same clean_text as the original text, the loop breaks without actually removing anything. This is unlikely but could cause issues with certain edge cases. Consider checking if clean_text actually changed before breaking.
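The loop's behavior can be checked in isolation. The sketch below wraps the quoted lines in a function (`strip_leading_number` is an illustrative name) and uses two hypothetical entries for FN_NUMBER_PATTERNS. As written, the `for` loop iterates over a finite list, each pass starts again from the original `text`, and the `break` fires only after a substitution has actually changed it.

```python
import re

# Hypothetical stand-ins for the real FN_NUMBER_PATTERNS (assumption).
FN_NUMBER_PATTERNS = [
    r"^(\d+)[\.\)]\s*",   # "1. " or "1) "
    r"^([a-z])\)\s*",     # "a) "
]


def strip_leading_number(text: str) -> str:
    """Same loop as in the diff: stop at the first pattern that removes something."""
    clean_text = text
    for pattern in FN_NUMBER_PATTERNS:
        clean_text = re.sub(pattern, "", text, count=1)
        if clean_text != text:
            break
    return clean_text


for sample in ["1. Corresponding author", "b) Funding", "No number here"]:
    print(repr(sample), "->", repr(strip_leading_number(sample)))
```

When no pattern matches, the function returns the input unchanged after trying every pattern once.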

robertatakenaka added 5 commits December 21, 2025 21:15

feat(detector): add patterns to detect elements depending on…

96d6faf

…the title value (ack, ref-list, abstract, etc.)

feat(detector): add patterns to detect the fn type depending…

b3fd2e3

…on the title value

Copilot AI review requested due to automatic review settings December 22, 2025 00:18

Copilot started reviewing on behalf of robertatakenaka December 22, 2025 00:18

Copilot AI reviewed Dec 22, 2025

robertatakenaka merged commit ae160cb into scieloorg:main Dec 22, 2025
7 checks passed


Suggested change for the trans-abstract pattern in scielo_classic_website/spsxml/detector_title_parent.py

r'^(abstract|resumo|resumen|résumé).*(english|inglês|español|português|français|castellano|espanhol)': 'trans-abstract',

r'^(abstract|resumo|resumen|résumé)\s(\(|:|-)\s(?:in\s+)?(english|inglês|español|português|français|castellano|espanhol)\)?\s*$': 'trans-abstract',


