Commit

Merge pull request #133 from amir-zeldes/dev
Dev
amir-zeldes authored Feb 2, 2023
2 parents d988aab + 067de51 commit ae96435
Showing 1 changed file with 3 additions and 1 deletion.
4 changes: 3 additions & 1 deletion README.md
@@ -21,10 +21,12 @@ The corpus is created as part of the course LING-367 (Computational Corpus Lingu

## A note about Reddit data

For one of the twelve text types in this corpus, Reddit forum discussions, plain text data is not supplied, and you will find **underscores** in place of word forms in documents from this data (files named `GUM_reddit_*`). To obtain this data, please run `python get_text.py`, which will allow you to reconstruct the text in these files. This, and all data, is provided with absolutely no warranty; users agree to use the data under the license with which it is provided, and Reddit data is subject to Reddit's terms and conditions. See [README_reddit.md](README_reddit.md) for more details.
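
As a rough way to check whether the text in a `GUM_reddit_*` file has been restored, the sketch below counts tokens whose word form consists only of underscores. It assumes a CoNLL-U-style, tab-separated file with the word form in the second column; the function name `count_masked_forms` and the `form_column` parameter are hypothetical and not part of the repository.

```python
def count_masked_forms(lines, form_column=1):
    """Return (masked, total) token counts for tab-separated annotation lines.

    Assumption: one token per line, word form in column `form_column`
    (index 1 in CoNLL-U), comment lines starting with '#'.
    """
    masked = 0
    total = 0
    for line in lines:
        line = line.rstrip("\n")
        # Skip sentence breaks and comment/metadata lines.
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) <= form_column:
            continue
        total += 1
        form = fields[form_column]
        # A form made up solely of underscores is treated as still masked.
        if form and form.strip("_") == "":
            masked += 1
    return masked, total
```

If the masked count is still high after running `python get_text.py`, the reconstruction most likely did not complete for that file.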

Note that the `get_text.py` script only regenerates the files named `GUM_reddit_*` in each folder, and will not create full versions of the data in `PAULA/` and `annis/`. If you require PAULA XML or searchable ANNIS data containing these documents, you will need to recompile the corpus from the source files under `_build/src/`. To do this, run `_build/process_reddit.py`, then run `_build/build_gum.py`.
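
The two-step rebuild described above could be scripted roughly as follows. This is a minimal sketch: only the script names and their order come from the text, while the helper names `build_commands` and `rebuild`, and the choice to run from the repository root, are assumptions.

```python
import subprocess
import sys
from pathlib import Path

# Order taken from the README: process_reddit.py first, then build_gum.py.
BUILD_STEPS = ["process_reddit.py", "build_gum.py"]


def build_commands(python=sys.executable):
    """Return the build commands, in order, as argument lists."""
    return [[python, (Path("_build") / step).as_posix()] for step in BUILD_STEPS]


def rebuild(repo_root="."):
    """Run each build step in turn, stopping at the first failure."""
    for cmd in build_commands():
        # Assumption: the scripts are invoked from the repository root.
        subprocess.run(cmd, cwd=repo_root, check=True)
```

Calling `rebuild()` from a checkout of the repository would run both scripts; `check=True` raises `subprocess.CalledProcessError` if the first step fails, so `build_gum.py` is not run on incomplete source files.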

You can also run searches in the complete version of the corpus using [our ANNIS server](https://gucorpling.org/annis/#_c=R1VN).

## Train / dev / test splits

Two documents from each genre are reserved for testing and development (24 test documents, 24 dev documents). See [splits.md](splits.md) for the official training, development and testing partitions.