csci-544-project

CI/CD linters for Python are enabled, so please format your code and make sure the checks pass before opening a pull request.

First Steps

1. Clone this repo.
2. Create your own branch named name-dev.
3. Create a virtual environment using a venv manager of your choice :) (preferably one that can generate requirements.txt).
   1. You may want to name your requirements.txt requirements-dev.txt to keep it distinct.
4. Open a pull request only after the CI/CD checks have passed.

What do all these files/directories mean??

$ tree -I "env" -L 3
.
├── LICENSE
├── Makefile # contains commands for linting/CICD
├── README.md
├── csci-544-project # actual package folder, you should add scripts under this folder
│   ├── __init__.py
│   ├── data # data folder
│   │   └── test-data.csv
│   └── tests # for testing scripts
│       └── test.py
├── notebooks # for experiments
│   └── notebook-test.ipynb
├── requirements.txt # package requirements
├── scripts # not sure yet. prob for misc.
│   └── script-test.py
└── setup.cfg # linting configurations file

Note that files whose names contain test are dummy files. You can remove them once you add something to that folder.

When you see bugs or when you have something you want to achieve in the foreseeable future...

Add an issue in the Issues tab!

Data

ArXiv
PubMed

The following data and descriptions are sourced from: https://github.com/armancohan/long-summarization.git

Get the datasets

ArXiv dataset: Download (mirror)
PubMed dataset: Download (mirror)

The datasets are rather large. You need about 5 GB of disk space to download them and about 15 GB of additional space to extract the files. Each tar file contains 4 files: train.txt, val.txt, and test.txt correspond to the training, validation, and test sets, respectively, and the vocab file is a plaintext file holding the vocabulary.
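Before extracting, it can help to confirm that there is enough free disk space and to check what an archive actually contains. The following is a minimal sketch using only the Python standard library; the helper names `has_free_space` and `list_members` and the archive file name are hypothetical and are not part of this repo.

```python
import shutil
import tarfile
from typing import List


def has_free_space(directory: str, required_gb: float) -> bool:
    """Return True if `directory` has at least `required_gb` GB free."""
    free_bytes = shutil.disk_usage(directory).free
    return free_bytes >= required_gb * 1024 ** 3


def list_members(archive_path: str) -> List[str]:
    """Return the names of the regular files inside a tar archive."""
    # tarfile.open detects compression (if any) automatically in read mode.
    with tarfile.open(archive_path) as archive:
        return [member.name for member in archive.getmembers() if member.isfile()]


# Example usage (the archive name below is an assumption, not the real file name):
#   if has_free_space(".", 15):
#       print(list_members("arxiv-dataset.tar"))
```

The 15 GB threshold in the usage comment mirrors the extraction estimate given above; adjust it if your copy of the data differs.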

Format of the data

The files are in JSON Lines format: each line is a JSON object corresponding to one scientific paper from ArXiv or PubMed. The abstract, sections, and body are all sentence-tokenized. The JSON objects have the following format:

{ 
  'article_id': str,
  'abstract_text': List[str],
  'article_text': List[str],
  'section_names': List[str],
  'sections': List[List[str]]
}
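Because each line is an independent JSON object, the files can be read one paper at a time without loading the whole dataset into memory. The following is a rough sketch under that assumption; `iter_papers` and `summarize_paper` are hypothetical helper names, and only the field names come from the format described above.

```python
import json
from typing import Dict, Iterator


def iter_papers(path: str) -> Iterator[Dict]:
    """Yield one paper (a dict) per non-empty line of a JSON Lines file."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)


def summarize_paper(paper: Dict) -> Dict:
    """Return simple sentence and section counts for one paper."""
    return {
        "article_id": paper["article_id"],
        "abstract_sentences": len(paper["abstract_text"]),
        "body_sentences": len(paper["article_text"]),
        "sections": len(paper["section_names"]),
    }


# Example usage, assuming train.txt has been extracted into the current directory:
#   for paper in iter_papers("train.txt"):
#       print(summarize_paper(paper))
#       break
```

Streaming line by line keeps memory use small, which matters given the size of the extracted files.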
