Collaborator

@Narabzad Narabzad commented Nov 1, 2025

No description provided.

- Add create_sample_dataset.py: Create complete sampled datasets with all related files
- Add sample_papers_by_category.py: Stratified sampling by category distribution
- Add outputs_oct/: 100-paper sample dataset with citations, related works, and LaTeX sources
- Sampling method: Stratified random sampling maintaining category proportions
- Categories: 29 CV, 17 LG, 9 SE, 8 HC, and 6 RO papers, with the remaining 31 papers spread across 13 other CS categories
- Includes: papers.csv, paper_content.csv, citations/, related_works/, latex_source/
- All papers meet quality criteria: related works present, 5+ citations, 200-10K chars
- Random seed 42 for reproducibility
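
For readers who want a concrete picture of the sampling described above, here is a minimal sketch of stratified random sampling with proportional allocation and a fixed seed. It is not the code in `sample_papers_by_category.py`; the field names (`category`, `related_works`, `num_citations`), the helper names, and the reading that the 200-10K character bound applies to the related-works text are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def passes_quality(paper):
    """Quality filter: related works present, 5+ citations, 200-10K chars.

    Assumption: the character bounds apply to the related-works text,
    and the dict keys used here are hypothetical.
    """
    text = paper.get("related_works") or ""
    return paper.get("num_citations", 0) >= 5 and 200 <= len(text) <= 10_000


def stratified_sample(papers, n, seed=42):
    """Draw n papers, keeping each category's share close to its share in `papers`."""
    if n > len(papers):
        raise ValueError("sample size exceeds population size")

    rng = random.Random(seed)  # local RNG so the global random state is untouched
    by_category = defaultdict(list)
    for paper in papers:
        by_category[paper["category"]].append(paper)

    # Largest-remainder allocation: take the floor of each proportional quota,
    # then give leftover slots to the categories with the largest fractions.
    total = len(papers)
    quotas = {c: n * len(ps) / total for c, ps in by_category.items()}
    alloc = {c: int(q) for c, q in quotas.items()}
    leftover = n - sum(alloc.values())
    by_fraction = sorted(quotas, key=lambda c: quotas[c] - alloc[c], reverse=True)
    for c in by_fraction[:leftover]:
        alloc[c] += 1

    sample = []
    for c in sorted(by_category):  # fixed category order keeps the output reproducible
        sample.extend(rng.sample(by_category[c], alloc[c]))
    return sample
```

A typical call under these assumptions would be `stratified_sample([p for p in papers if passes_quality(p)], 100)`, which filters first and then samples, so the category proportions reflect the eligible pool rather than the raw corpus.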
…nce citation counts

- Updated generate_nuggets_from_reports.py to work with data pipeline output structure
- Enhanced get_important_citations.py with flexible column mapping
- Added comprehensive documentation to README.md
- Created PULL_REQUEST.md documenting all new features
- Added nuggets generation for ground truth reports
- Added important citation filtering for ground truth reports
- Integrated reference citation counts using OpenAlex API
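
As an illustration of what "flexible column mapping" could look like, the sketch below resolves a small set of canonical column names against whatever header a CSV actually has. This is an assumption about the approach, not the logic in `get_important_citations.py`; the alias table and the `resolve_columns` helper are hypothetical, and the only name taken from an external source is `cited_by_count`, which is the field OpenAlex uses for a work's citation count.

```python
# Hypothetical alias table; the column names the real script accepts
# are not shown in this pull request.
COLUMN_ALIASES = {
    "paper_id": ["paper_id", "arxiv_id", "id"],
    "citation_count": ["citation_count", "cited_by_count", "num_citations"],
}


def resolve_columns(header, aliases=COLUMN_ALIASES):
    """Map each canonical name to the first matching column in `header`.

    Matching ignores case and surrounding whitespace; the returned values
    are the original header names so they can be used to index rows.
    """
    normalized = {name.strip().lower(): name for name in header}
    mapping = {}
    for canonical, candidates in aliases.items():
        for candidate in candidates:
            if candidate in normalized:
                mapping[canonical] = normalized[candidate]
                break
        else:
            # No alias matched: fail loudly instead of guessing a column.
            raise KeyError(f"no column found for '{canonical}' (tried {candidates})")
    return mapping
```

With a header such as `["ArXiv_ID", "title", "cited_by_count"]`, this returns a mapping from `paper_id` to `ArXiv_ID` and from `citation_count` to `cited_by_count`, so downstream code can read rows without caring which naming scheme the input file used.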
@Narabzad Narabzad merged commit d70a447 into datapipeline_testing Nov 1, 2025
@Narabzad Narabzad deleted the oct_dataset branch November 1, 2025 23:08