aut-1.0.0
Documentation
Release Notes
Implemented enhancements:
- Remove HTTP headers and HTML from webpages() #538
- Add domain column to webpages() #534
- Replace Java ARC/WARC record processing library #494
- Method to perform finer-grained selection of ARCs and WARCs #247
- Unnecessary buffer copying #18
Fixed bugs:
- Discard date RDD filter only takes a single string, not a list of strings. #532
- Extract gzip data from transfer-encoded WARC #493
- ARC reader string vs int error on record length #492
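
To illustrate the behaviour described in #532, here is a minimal sketch of a discard-date filter that accepts either a single date string or a list of them. This is not aut's implementation or API: the function name `discard_dates`, the `crawl_date` field, and the sample records are all hypothetical, and dates are assumed to be `YYYYMMDD` strings matched by prefix.

```python
from typing import Iterable, List, Union


def discard_dates(records: Iterable[dict], dates: Union[str, List[str]]) -> List[dict]:
    """Drop records whose crawl_date starts with any of the given date prefixes.

    Hypothetical helper for illustration only; not part of aut.
    """
    # Normalise the argument so a single string and a list behave the same way.
    prefixes = (dates,) if isinstance(dates, str) else tuple(dates)
    # str.startswith accepts a tuple and matches if any prefix matches.
    return [r for r in records if not r["crawl_date"].startswith(prefixes)]


# Hypothetical sample data with assumed YYYYMMDD crawl dates.
records = [
    {"url": "http://example.com/a", "crawl_date": "20080430"},
    {"url": "http://example.com/b", "crawl_date": "20091027"},
    {"url": "http://example.com/c", "crawl_date": "20101215"},
]

kept = discard_dates(records, ["20080430", "20091027"])
print([r["url"] for r in kept])
```

Normalising the argument up front keeps the single-string call working while allowing several dates to be discarded in one pass.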
Closed issues:
- java.lang.RuntimeException: Unsupported literal type class scala.collection.immutable.Set$Set1 Set(liberal.ca) #529
- Improve CommandLineApp.scala test coverage #262
- Improve ExtractBoilerpipeText.scala test coverage #261
- Improve ArchiveRecord.scala test coverage #260
- Unit testing for RecordLoader #182
- Improve ArchiveRecordWritable.java test coverage #76
- Improve WarcRecordUtils.java test coverage #74
- Improve ArcRecordUtils.java test coverage #73
- Improve ExtractDate.scala test coverage #64
- Remove org.apache.commons.httpclient #23
Merged pull requests:
- Make webpages() consistent across aut and ARCH. #539 (ruebot)
- Update README #537 (ruebot)
- Fix codecov GitHub action. #536 (ruebot)
- Bump commons-compress from 1.14 to 1.21 #535 (dependabot[bot])
- Remove Java W/ARC processing and replace it with Sparkling. #533 (ruebot)
- Bump xercesImpl from 2.12.0 to 2.12.2 #527 (dependabot[bot])