Add SciELO Books translator for COUNTER R5 metrics with comprehensive access counting tests #56
|
@copilot, I will provide fixtures in the format of SciELO Books logs. You will likely need to wait for these resources, but you can start working with the knowledge you already have. Crawling the SciELO Books site and understanding which links should be classified as Item Request or Item Investigation is a good task. The goal is to count accesses for books and book chapters according to COUNTER R5 (Unique Item Requests, Unique Item Investigations, Total Item Requests, and Total Item Investigations). Each URL in the logs should therefore be mapped to a specific Book or Chapter object and, depending on the page, marked as Request or Investigation. See https://cop5.countermetrics.org/en/5.0.3/03-specifications/03-counter-report-common-attributes-and-elements.html for more information. |
|
@copilot Use English for method names, variable names, and comments. |
Co-authored-by: pitangainnovare <158627036+pitangainnovare@users.noreply.github.com>
…arity Co-authored-by: pitangainnovare <158627036+pitangainnovare@users.noreply.github.com>
Co-authored-by: pitangainnovare <158627036+pitangainnovare@users.noreply.github.com>
Understood. I'll rename all Portuguese terms to English. This includes:
Will keep "SciELO Livros" only in comments where referring to the actual product name.
The implementation follows COUNTER R5 specifications for Item Requests and Item Investigations:
Item Investigations (abstract content):
Item Requests (full-text content):
Each URL is mapped to a book or chapter PID (e.g.,
Ready to integrate official fixtures when available. Commit: 2fd6f0e |
Co-authored-by: pitangainnovare <158627036+pitangainnovare@users.noreply.github.com>
|
@copilot I have sample logs from SciELO Books. They come in two formats: Bunnynet and Apache. Lines from both samples follow:
Lines in Bunnynet format
Lines in Apache format
Please adjust the PR to accommodate this data (a pattern would need to be created for Bunnynet, plus another one that I believe already exists). PR #54 implements Bunnynet format support, but in the context of articles. |
…rom logs Co-authored-by: pitangainnovare <158627036+pitangainnovare@users.noreply.github.com>
Updated the translator to support actual SciELO Books URL patterns from the log samples. The implementation now handles:
Apache log format (already supported - no special parser needed, uses standard Apache NCSA format)
URL patterns from production logs:
Smart chapter extraction from PDF filenames:
Bunnynet format: The log_handler already parses Apache-style logs. For Bunnynet-specific parsing (pipe-delimited format as shown in your samples), PR #54's approach can be adapted once that PR is merged.
All URL patterns validated with actual log samples. Tests updated with real URLs. Commit: 0061d00 |
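As a rough illustration of the URL handling described in this comment, the sketch below maps a SciELO Books path to a book ID and an optional chapter number, including chapter extraction from PDF filenames. It is not the code from the PR: the regexes, the `BookTarget` type, and the functions `translate_books_url` and `to_pid` are hypothetical, modeled only on the URL shapes quoted in this thread. The book-only PID form `BOOK:{id}` is also an assumption.

```python
import re
from typing import NamedTuple, Optional


class BookTarget(NamedTuple):
    book_id: str
    chapter: Optional[str]  # None means the whole book
    kind: str               # "landing", "chapter_page", or "pdf"


# Hypothetical patterns modeled on the URL shapes quoted in this PR.
_PDF = re.compile(r"^/id/(?P<book>[a-z0-9]+)/pdf/(?P<file>[^/]+)\.pdf$")
_CHAPTER = re.compile(r"^/id/(?P<book>[a-z0-9]+)/(?P<chapter>\d+)$")
_BOOK = re.compile(r"^/id/(?P<book>[a-z0-9]+)/?$")
# Assumption: a trailing "-NN" after a 13-digit ISBN marks a chapter PDF.
_FILE_CHAPTER = re.compile(r"-\d{13}-(?P<chapter>\d+)$")


def translate_books_url(path: str) -> Optional[BookTarget]:
    """Map a URL path to a book/chapter target, or None if unrecognized."""
    # PDF paths are checked first so "/pdf/" is never read as a chapter.
    m = _PDF.match(path)
    if m:
        ch = _FILE_CHAPTER.search(m.group("file"))
        chapter = ch.group("chapter") if ch else None
        return BookTarget(m.group("book"), chapter, "pdf")
    m = _CHAPTER.match(path)
    if m:
        return BookTarget(m.group("book"), m.group("chapter"), "chapter_page")
    m = _BOOK.match(path)
    if m:
        return BookTarget(m.group("book"), None, "landing")
    return None


def to_pid(target: BookTarget) -> str:
    """Build a PID string; the book-only form is an assumption."""
    if target.chapter is None:
        return f"BOOK:{target.book_id}"
    return f"BOOK:{target.book_id}/CHAPTER:{target.chapter}"
```

Checking the PDF pattern before the chapter pattern is one simple way to keep the two URL families from colliding; the actual translator may resolve this differently.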
|
@copilot Por favor, crie ou adeque os testes para detectar acessos no SciELO Books. A meta é contar os acessos reais, por livro e por capítulo de livro, com exemplos concretos, nos dois moldes (request ou investigation). |
…examples Co-authored-by: pitangainnovare <158627036+pitangainnovare@users.noreply.github.com>
Created comprehensive tests demonstrating real SciELO Books access counting with concrete examples:
9 new integration tests (
Demonstration script (
Results: All 24 tests passing (13 translator + 2 basic + 9 comprehensive)
Run demo:
Commit: cd98266 |
Extends URL translation pipeline to support SciELO Books platform logs for COUNTER R5 metrics computation. Implementation based on actual production log samples (Apache and Bunnynet formats). All code uses English naming conventions for methods, variables, and comments.
Implementation
New translator class (`translator/books.py`)
- `URLTranslatorBooksSite` handles book and chapter URL patterns from production logs
- PIDs of the form `BOOK:{id}/CHAPTER:{id}` for hierarchical content
- `DEFAULT_SCIELO_ISSN` (0000-0000) is used since books lack ISSNs
- Smart chapter extraction from PDF filenames (`author-ISBN-18.pdf` → chapter 18)
Supported URL patterns (from production logs):
Examples from actual logs:
- `/id/q7gtd` - Book landing page
- `/id/vdywc/03` - Chapter 03 page
- `/id/y742k/pdf/magalhaes-9788578791889-18.pdf` - Chapter 18 PDF
- `/id/82r9t/pdf/sadek-9788579820342.pdf` - Full book PDF (no chapter)
Integration (`url_translator.py`)
- `PATTERNS_BOOKS_SITE` with 3 regex patterns matching production URLs
- `/pdf/` conflicts
Tests
Smart Chapter Detection
- Filename pattern: `author-ISBN-CHAPTER.pdf`
- `magalhaes-9788578791889-18.pdf` → Book: y742k, Chapter: 18
- `sadek-9788579820342.pdf` → Book: 82r9t, Chapter: None (full book)
Comprehensive Access Counting Tests
New test class `TestBooksAccessCounting` with 9 concrete scenarios:
Demonstration Script (`tests/test_books_access_examples.py`):
Run demo: `PYTHONPATH=. python tests/test_books_access_examples.py`
Access Counting Rules
INVESTIGATIONS ONLY (metadata/abstract view):
- `/id/{book_id}` → 0 Requests, 1 Investigation
REQUESTS + INVESTIGATIONS (full-text access):
- `/id/{book_id}/{chapter_number}` → 1 Request, 1 Investigation
- `/id/{book_id}/pdf/filename-{chapter}.pdf` → 1 Request, 1 Investigation
- `/id/{book_id}/pdf/filename.pdf` → 1 Request, 1 Investigation
DEDUPLICATION:
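The counting rules above, combined with per-session deduplication, can be sketched roughly as follows. This is a minimal illustration, not the PR's implementation. It assumes each hit has already been reduced to a `(session_id, pid, is_full_text)` tuple, that "unique" means at most once per session and item, and that session identification and double-click filtering happen elsewhere. The function name `count_accesses` and the metric key names are hypothetical.

```python
from typing import Dict, Iterable, Set, Tuple

# Assumed input shape: (session_id, pid, is_full_text)
Hit = Tuple[str, str, bool]


def count_accesses(hits: Iterable[Hit]) -> Dict[str, Dict[str, int]]:
    """Return the four COUNTER-style item metrics per PID."""
    metrics: Dict[str, Dict[str, int]] = {}
    seen_investigations: Set[Tuple[str, str]] = set()
    seen_requests: Set[Tuple[str, str]] = set()

    for session_id, pid, is_full_text in hits:
        m = metrics.setdefault(pid, {
            "total_item_investigations": 0,
            "unique_item_investigations": 0,
            "total_item_requests": 0,
            "unique_item_requests": 0,
        })
        key = (session_id, pid)

        # Every hit on an item counts as an investigation.
        m["total_item_investigations"] += 1
        if key not in seen_investigations:
            seen_investigations.add(key)
            m["unique_item_investigations"] += 1

        # Full-text hits (chapter pages, PDFs) also count as requests.
        if is_full_text:
            m["total_item_requests"] += 1
            if key not in seen_requests:
                seen_requests.add(key)
                m["unique_item_requests"] += 1

    return metrics
```

For example, two landing-page views of one book in a single session would yield 2 total and 1 unique investigation and no requests, while the same chapter opened three times across two sessions would yield 3 total and 2 unique for both requests and investigations.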
Log Format Support
Apache logs: Standard NCSA format (already supported by log_handler)
Bunnynet logs: Pipe-delimited format can be handled when PR #54 is merged
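To make the Apache side concrete, here is a small sketch of pulling the request path out of an NCSA common or combined log line so it can be handed to the URL translator. It is independent of the project's `log_handler`, whose API is not shown here; the regex, the function name `parse_ncsa_line`, and the sample line are made up for illustration. Bunnynet parsing is left out because the field layout of those logs is not reproduced in this PR.

```python
import re
from typing import Dict, Optional

# NCSA common log format, with the optional referer and user-agent
# fields of the combined format.
NCSA_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>[A-Z]+) (?P<path>\S+)(?: (?P<protocol>[^"]+))?" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
    r'(?: "(?P<referer>[^"]*)" "(?P<user_agent>[^"]*)")?\s*$'
)


def parse_ncsa_line(line: str) -> Optional[Dict[str, Optional[str]]]:
    """Return the named fields of one log line, or None if it does not match."""
    match = NCSA_LINE.match(line.strip())
    return match.groupdict() if match else None


# Invented sample line (documentation-range IP), not taken from real logs.
sample = (
    '203.0.113.7 - - [10/Oct/2024:13:55:36 -0300] '
    '"GET /id/vdywc/03 HTTP/1.1" 200 5120 "-" "Mozilla/5.0"'
)
fields = parse_ncsa_line(sample)
```

Only successful responses would normally be counted, so a caller would likely check `fields["status"]` before classifying `fields["path"]`.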
Example Usage