Update the review comments dataset README #123

Merged
merged 1 commit into from
Mar 26, 2019
27 changes: 22 additions & 5 deletions ReviewComments/README.md
@@ -1,11 +1,12 @@
# GitHub Pull Request Review Comments ![size 1.6GB](https://img.shields.io/badge/size-1.6GB-green.svg)
The dataset was extracted from [GH Archive](https://www.gharchive.org/) and consists of:

1. [25.3 million pull request review comments](https://drive.google.com/open?id=1rk6OTDrD09xVU0o_w8_dvtsLaeeUgwmP) since January 2015 till December 2018 - 1.6 GB (xz-compressed)
[Download link.](https://drive.google.com/file/d/1bPEmq0qS_jvRlb9M1veLzyUJiAcV_Buz)
Collaborator Author
Uploaded as `review_comments.csv.xz`; please copy it to your Google Drive and update the link.


25.3 million pull request review comments on GitHub, from January 2015 to December 2018.

### Format

CSV, columns:
xz-compressed CSV, with columns:

* `COMMENT_ID` - identifier of the comment in the source dataset, [GH Archive](https://www.gharchive.org/)
* `COMMIT_ID` - commit hash to which the review comment is attached
@@ -14,9 +15,25 @@ CSV, columns:
* `CREATED_AT` - creation date of the comment
* `BODY` - raw content of the comment

### Dataset generation
### Sample code

Python:
```python
# too big for pandas.read_csv
import codecs
import csv
import lzma

with lzma.open("review_comments.csv.xz") as archf:
    reader = csv.DictReader(codecs.getreader("utf-8")(archf))
    for record in reader:
        print(record)
```
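
A variation on the sample above, offered as a minimal sketch: it opens the archive in text mode, so no separate `codecs` wrapper is needed, and streams only the first few records instead of printing everything. The helper name `iter_review_comments` and its `limit` parameter are hypothetical and not part of the dataset tooling.

```python
import csv
import itertools
import lzma


def iter_review_comments(path, limit=None):
    """Yield rows of the xz-compressed CSV as dicts, at most `limit` rows.

    Hypothetical helper for illustration; `limit=None` streams every row.
    """
    # Text mode decodes UTF-8 on the fly; newline="" is what the csv module
    # expects so that line breaks inside quoted fields are preserved.
    with lzma.open(path, "rt", encoding="utf-8", newline="") as archf:
        reader = csv.DictReader(archf)
        yield from itertools.islice(reader, limit)
```

For example, `for record in iter_review_comments("review_comments.csv.xz", limit=5): print(record["BODY"])` would print the bodies of the first five comments without decompressing the whole file.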

### Origin

The dataset was generated in the [following notebook](PR_review_comments_generation.ipynb). The comments which exceeded Python's `csv.field_size_limit` equal to 128kB were discarded (~10 comments).
The dataset was generated from [GH Archive](https://www.gharchive.org/) in the [following notebook](PR_review_comments_generation.ipynb).
Comments exceeding Python's `csv.field_size_limit` (128 KB) were discarded (about 10 comments).
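
A filter of this kind could look roughly like the sketch below. It is an illustration only: the notebook's actual logic is not reproduced here, and the helper name `fits_field_limit` is hypothetical. Note that the `csv` module counts the limit in characters, not bytes.

```python
import csv

# Calling csv.field_size_limit() without arguments returns the current limit;
# CPython's default is 131072 characters (128 * 1024).
FIELD_LIMIT = csv.field_size_limit()


def fits_field_limit(row, limit=FIELD_LIMIT):
    """Return True if every field in `row` is within the csv field size limit.

    Hypothetical helper: rows for which this returns False would be discarded.
    """
    return all(len(str(value)) <= limit for value in row)
```

Rows that fail the check would raise `csv.Error` when read back with the default settings, which is one reason to drop them at generation time rather than ask every reader to raise the limit.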

We gathered some [statistics about the dataset](PR_review_comments_stats.ipynb).
