src-d · vmarkovtsev · Mar 26, 2019
diff --git a/ReviewComments/README.md b/ReviewComments/README.md
@@ -1,11 +1,12 @@
 # GitHub Pull Request Review Comments ![size 1.6GB](https://img.shields.io/badge/size-1.6GB-green.svg)
-The dataset was extracted from [GH Archive](https://www.gharchive.org/) and consists of:
 
-1. [25.3 million pull request review comments](https://drive.google.com/open?id=1rk6OTDrD09xVU0o_w8_dvtsLaeeUgwmP) since January 2015 till December 2018 - 1.6 GB (xz-compressed)
+[Download link.](https://drive.google.com/file/d/1bPEmq0qS_jvRlb9M1veLzyUJiAcV_Buz)
+
+25.3 million pull request review comments posted on GitHub from January 2015 through December 2018.
 
 ### Format
 
-CSV, columns:
+xz-compressed CSV, with columns:
 
 * `COMMENT_ID` - identifier of the comment in the parent dataset, [GH Archive](https://www.gharchive.org/)
 * `COMMIT_ID` - commit hash to which the review comment is attached
@@ -14,9 +15,25 @@ CSV, columns:
 * `CREATED_AT` - creation date of the comment
 * `BODY` - raw content of the comment
 
-### Dataset generation
+### Sample code
+
+Python:
+```python
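+# too big to load at once with pandas.read_csv, so stream the rows instead
+import csv
+import lzma
+
+# text mode decodes UTF-8; newline="" lets csv handle newlines inside BODY
+with lzma.open("review_comments.csv.xz", "rt", encoding="utf-8", newline="") as archf:
+    reader = csv.DictReader(archf)
+    for record in reader:
+        print(record)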
+```
+
+### Origin
 
-The dataset was generated in the [following notebook](PR_review_comments_generation.ipynb). The comments which exceeded Python's `csv.field_size_limit` equal to 128kB were discarded (~10 comments).
+The dataset was generated from [GH Archive](https://www.gharchive.org/) using [this notebook](PR_review_comments_generation.ipynb).
+Comments longer than Python's `csv.field_size_limit` of 128KB were discarded (~10 comments).
 
 We gathered some [statistics about the dataset](PR_review_comments_stats.ipynb).
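
As a rough illustration of how the columns listed under Format might be used while streaming, here is a minimal sketch that counts comments per year without loading the whole file. The helper names `iter_comments` and `count_by_year` are hypothetical, and the code assumes that `CREATED_AT` begins with a four-digit year, which the README does not state. Because oversized comments were discarded during generation, the standard `csv` field size limit should be enough, though that is untested here.

```python
import csv
import lzma
from collections import Counter


def iter_comments(path):
    """Yield one dict per review comment from an xz-compressed CSV file."""
    # Text mode decodes UTF-8 on the fly; newline="" leaves line endings
    # untouched so the csv module can parse multi-line BODY fields.
    with lzma.open(path, "rt", encoding="utf-8", newline="") as archf:
        yield from csv.DictReader(archf)


def count_by_year(records):
    """Count records per year.

    Assumption: CREATED_AT starts with a four-digit year (e.g. "2015-...").
    """
    return Counter(record["CREATED_AT"][:4] for record in records)


if __name__ == "__main__":
    # File name taken from the sample code in the README.
    counts = count_by_year(iter_comments("review_comments.csv.xz"))
    for year, total in sorted(counts.items()):
        print(year, total)
```

Only one record is held in memory at a time, so the same pattern should extend to other per-column aggregations.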