Oct dataset #8

Narabzad · 2025-11-01T23:07:54Z

No description provided.

- Add create_sample_dataset.py: Create complete sampled datasets with all related files
- Add sample_papers_by_category.py: Stratified sampling by category distribution
- Add outputs_oct/: 100-paper sample dataset with citations, related works, and LaTeX sources
- Sampling method: Stratified random sampling maintaining category proportions
- Categories: 29 CV, 17 LG, 9 SE, 8 HC, 6 RO, and 13 other CS categories
- Includes: papers.csv, paper_content.csv, citations/, related_works/, latex_source/
- All papers meet quality criteria: related works present, 5+ citations, 200-10K chars
- Random seed 42 for reproducibility
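The sampling method named in this commit (stratified random sampling that keeps category proportions, with a fixed seed) can be illustrated roughly as follows. This is a minimal sketch, not the code in sample_papers_by_category.py: the function name `stratified_sample`, the `category` key, and the use of largest-remainder rounding to split the sample size across categories are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def stratified_sample(papers, n_total, key="category", seed=42):
    """Draw n_total papers so each category keeps roughly its original share.

    `papers` is a list of dicts; `key` names the category field.
    Both names are hypothetical and not taken from the repository.
    """
    if n_total > len(papers):
        raise ValueError("n_total exceeds the number of available papers")

    rng = random.Random(seed)  # fixed seed makes the draw reproducible

    # Group papers by category.
    groups = defaultdict(list)
    for paper in papers:
        groups[paper[key]].append(paper)

    total = len(papers)
    # Largest-remainder allocation: floor of the proportional quota first,
    # then hand the leftover slots to the largest fractional remainders.
    counts = {c: len(g) * n_total // total for c, g in groups.items()}
    remainders = {c: len(g) * n_total % total for c, g in groups.items()}
    leftover = n_total - sum(counts.values())
    ranked = sorted(groups, key=lambda c: (remainders[c], c), reverse=True)
    for c in ranked[:leftover]:
        counts[c] += 1

    # Sample without replacement inside each category, in a stable order.
    sample = []
    for c in sorted(groups):
        sample.extend(rng.sample(groups[c], counts[c]))
    return sample
```

Calling the function twice with the same seed returns the same papers, which is the property the fixed seed is meant to provide.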

…nce citation counts
- Updated generate_nuggets_from_reports.py to work with data pipeline output structure
- Enhanced get_important_citations.py with flexible column mapping
- Added comprehensive documentation to README.md
- Created PULL_REQUEST.md documenting all new features
- Added nuggets generation for ground truth reports
- Added important citation filtering for ground truth reports
- Integrated reference citation counts using OpenAlex API
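The "flexible column mapping" mentioned for get_important_citations.py is not described further in this commit message. One common way to approach it is to resolve each logical field against a list of accepted column names. The sketch below assumes that approach; `resolve_columns` and every column name in it are hypothetical and do not come from the repository.

```python
def resolve_columns(available, candidates):
    """Map each logical field to the first matching column actually present.

    available  -- column names found in the input file
    candidates -- dict of logical field -> accepted column names, in priority order
    Matching ignores case and surrounding whitespace.
    """
    lookup = {name.strip().lower(): name for name in available}
    resolved = {}
    for field, names in candidates.items():
        for name in names:
            match = lookup.get(name.strip().lower())
            if match is not None:
                resolved[field] = match
                break
        else:
            # No accepted name was found for this field.
            raise KeyError(f"no column found for {field!r}; tried {names}")
    return resolved


# Hypothetical header row and accepted aliases, for illustration only.
header = ["Paper_ID", "cited_by_count", "title"]
aliases = {
    "paper_id": ["paper_id", "id"],
    "citations": ["citation_count", "cited_by_count"],
}
mapping = resolve_columns(header, aliases)
```

Failing loudly with a `KeyError` when no alias matches keeps a misnamed input file from silently producing empty citation counts.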


Narabzad added 3 commits October 23, 2025 08:18

Narabzad merged commit d70a447 into datapipeline_testing Nov 1, 2025

Narabzad deleted the oct_dataset branch November 1, 2025 23:08
