aut-1.2.0 (2022-11-17)
Closed issues:
- Include last modified date for a resource #546
Merged pull requests:
aut-1.1.1 (2022-10-31)
Fixed bugs:
- DomainGraph should use YYYYMMDD not YYYYMMDDHHMMSS #544
Merged pull requests:
- Use YYYYMMDD for crawl_date for DomainGraphExtractor. #545 (ruebot)
- Bump jsoup from 1.14.2 to 1.15.3 #543 (dependabot[bot])
aut-1.1.0 (2022-06-17)
Fixed bugs:
- org.apache.tika.mime.MimeTypeException: Invalid media type name: application/rss+xml lang=utf-8 #542
Closed issues:
- Add ARCH text files derivatives #540
Merged pull requests:
aut-1.0.0 (2022-06-10)
Implemented enhancements:
- Remove http headers, and html on webpages() #538
- Add domain column to webpages() #534
- Replace Java ARC/WARC record processing library #494
- Method to perform finer-grained selection of ARCs and WARCs #247
- Unnecessary buffer copying #18
Fixed bugs:
- Discard date RDD filter only takes a single string, not a list of strings. #532
- Extract gzip data from transfer-encoded WARC #493
- ARC reader string vs int error on record length #492
Closed issues:
- java.lang.RuntimeException: Unsupported literal type class scala.collection.immutable.Set$Set1 Set(liberal.ca) #529
- Improve CommandLineApp.scala test coverage #262
- Improve ExtractBoilerpipeText.scala test coverage #261
- Improve ArchiveRecord.scala test coverage #260
- Unit testing for RecordLoader #182
- Improve ArchiveRecordWritable.java test coverage #76
- Improve WarcRecordUtils.java test coverage #74
- Improve ArcRecordUtils.java test coverage #73
- Improve ExtractDate.scala test coverage #64
- Remove org.apache.commons.httpclient #23
Merged pull requests:
- Make webpages() consistent across aut and ARCH. #539 (ruebot)
- Update README #537 (ruebot)
- Fix codecov GitHub action. #536 (ruebot)
- Bump commons-compress from 1.14 to 1.21 #535 (dependabot[bot])
- Remove Java w/arc processing, and replace it with Sparkling. #533 (ruebot)
- Bump xercesImpl from 2.12.0 to 2.12.2 #527 (dependabot[bot])
aut-0.91.0 (2022-01-21)
Implemented enhancements:
- Include timestamp in crawl date #525
Merged pull requests:
aut-0.90.4 (2021-11-01)
Implemented enhancements:
- Replace scala-uri library from ExtractDomain and just parse public_suffix_list.dat #521
Fixed bugs:
- Scaladocs haven't been created since 0.90.0 release #522
Merged pull requests:
aut-0.90.3 (2021-10-22)
Fixed bugs:
- ExtractDomains returns non-Apex Domains #519
Merged pull requests:
- Update ExtractDomain to extract apex domains. #520 (ruebot)
- Bump jsoup from 1.13.1 to 1.14.2 #518 (dependabot[bot])
aut-0.90.2 (2021-05-12)
Fixed bugs:
- ARC file name appearing in url list #516
- WARC-Target-URI in Wget warc files is not parsed properly #514
Merged pull requests:
- Filter out filedesc and dns records from arcs. #517 (ruebot)
- Handle wget WARC-Target-URI formatting. #515 (ruebot)
aut-0.90.1 (2021-04-29)
Fixed bugs:
- crawl_date is not included on binary information jobs when documentation says it is #512
Merged pull requests:
- Add missing crawl_date column to binary information jobs. #513 (ruebot)
- Update jsoup to 1.13.1 #511 (ruebot)
aut-0.90.0 (2021-01-27)
Fixed bugs:
- Python implementation of .all() has .keepValidPages() incorrectly applied to it #502
- Extract hyperlinks from wayback machine #501
- Release 0.80.0 JAR produces error; built 0.80.1 fatjar built on repo works #495
Closed issues:
- Migrate CI infrastructure from TravisCI to GitHub Action #506
- Split tf into its own repo #498
- Change master branch to main branch #490
- GitHub action - Run isort and black on Python code #488
- Add scalafmt GitHub action #486
- Add Google Java Formatter as a GitHub action #484
- Packages build is often broken - should we support it? #483
- Implement SaveToDisk in Python #478
- Java 11 support #356
Merged pull requests:
- ars-cloud compatibility with aut and Java 11 #510 (ruebot)
- Update to Spark 3.0.1 #508 (ruebot)
- Replace TravisCI with GitHub Actions. #507 (ruebot)
- Bump junit from 4.12 to 4.13.1 #505 (dependabot[bot])
- Fix relative links extraction #504 (yxzhu16)
- Remove .keepValidPages() on .all() Python implementation. #503 (ruebot)
- Updates read.me to include citation section #500 (SamFritz)
- Remove tf project; resolves #498. #499 (ruebot)
- Add Python formatter GitHub Action. #489 (ruebot)
- Add scalafmt GitHub action and apply it to scala code. #487 (ruebot)
- Add Google Java Formatter as an action, and apply it. #485 (ruebot)
- Add Python implementation of SaveBytes. #482 (ruebot)
- Bump xercesImpl from 2.11.0 to 2.12.0 #481 (dependabot[bot])
- [Skip Travis] Trim README down given aut.docs.archivesunleashed.org #480 (ruebot)
- Spark 3.0.0 + Java 11 support. #375 (ruebot)
aut-0.80.0 (2020-06-03)
Closed issues:
- Broken link in documentation #476
- Improve udfs/package.scala test coverage #473
- Remove tabDelimit #471
- Remove Extract Entities #469
- PEP8 Naming - UDFs, App method names, DataFrame names, and filters. #468
- Python UDFs - class or not? #467
- Remove ExtractImageDetailsDF.scala #464
- github-stite-deploy uses password based authentication which is being deprecated by GitHub #461
- Implement Python versions of Serializable APIs #410
- Implement Python versions of App utilities #409
- Implement Python versions of Matchbox utilities #408
- Improve TupleFormatter.scala test coverage #59
- Create tests for NERCombinedJson.scala #53
- Create tests for NER3Classifier.scala #52
- Create tests for ExtractEntities.scala #48
Merged pull requests:
- Remove RDD suffixes on file, class, and object names. #479 (ruebot)
- PEP8 Python app method names. #477 (ruebot)
- Move Python UDF methods out of their own class. #475 (ruebot)
- Add DataFrame udf tests. #474 (ruebot)
- Remove tabDelimit. #472 (ruebot)
- Remove NER functionality. #470 (ruebot)
- Add ExtractPopularImages, WriteGEXF, and WriteGraphML to Python. #466 (ruebot)
- Remove ExtractImageDetailsDF; resolves #464. #465 (ruebot)
- Implement Scala Matchbox UDFs in Python. #463 (ruebot)
- Import clean-up for df package. #462 (ruebot)
aut-0.70.0 (2020-05-04)
Implemented enhancements:
- Update PlainTextExtractor to just extract text #452
- Migration of all RDD functionality over to DataFrames #223
Fixed bugs:
- DomainFrequencyExtractor should remove WWW prefix #456
Closed issues:
- For extractor (spark-submit) job, set Spark app name to be the extractor job name. #458
- Remove RDD options from app #449
- Add parquet as an app format option #448
- Add datathon derivatives to app (binary info, web pages, web graph) #447
- Update Java 8 instructions for MacOS #445
- Add spark-submit to README #444
Merged pull requests:
- [skip travis] README updates #460 (ruebot)
- Set spark-submit app name to be "aut - extractorName". #459 (ruebot)
- Add RemovePrefixWWWDF to DomainFrequencyExtractor. #457 (ruebot)
- Updating Java install instructions for MacOS, resolves #445 #455 (ianmilligan1)
- Add option to save to Parquet for app. #454 (ruebot)
- Update PlainTextExtractor to output a single column; text. #453 (ruebot)
- Add a number of additional app extractors. #451 (ruebot)
- Remove RDD option in app; DataFrame only now. #450 (ruebot)
- [skip-travis] Add spark-submit option to README; resolves #444. #446 (ruebot)
aut-0.60.0 (2020-04-15)
Implemented enhancements:
- Discussion: Restyle UDFs in the context of DataFrames #425
- Add alt text column to imageGraph (imageLinks) #420
- UDFs that filter on url should also filter on src #418
Fixed bugs:
- CommandLineApp DomainGraphExtractor Uses Different Node IDs than WriteGraph #439
- DomainGraphExtractor produces different output in RDD vs DF #436
- Command line app fails because of missing log4j configuration #433
Closed issues:
- Remove GraphXML and ExtractGraphX #442
- Use Monochromatic Ids instead of hash to produce network identifiers. #440
- Add graphml output to DomainGraphExtractor #435
- Add webgraph, imagegraph, webpages, etc. to command line app #431
- Rename imageLinks to imageGraph #419
Merged pull requests:
- Remove GraphX support; resolves #442. #443 (ruebot)
- Remove WriteGraph; resolves #439. #441 (ruebot)
- Add graphml output to CommandLineApp and DomainGraphExtractor. #438 (ruebot)
- Align RDD and DF output for DomainGraphExtractor. #437 (ruebot)
- Update log4j configuration to resolve #433. #434 (ruebot)
- Add imagegraph, and webgraph to command line app. #432 (ruebot)
- Tweak hasDate to handle Seq. #430 (ruebot)
- Restyle keep/discard filter UDFs in the context of DataFrames #429 (ruebot)
- Update Spark and Hadoop versions. #426 (ruebot)
- update for 'src' column #424 (SinghGursimran)
- [skip travis] Add pre-print link to README. #423 (ruebot)
- Add img alt text to imagegraph(); resolves #420. #422 (ruebot)
- Rename imageLinks to imageGraph; resolves #419 #421 (ruebot)
- Need --repositories flag with --packages. #417 (ruebot)
aut-0.50.0 (2020-02-05)
Implemented enhancements:
- Add crawl_date to binary DataFrames and imageLinks #413
Fixed bugs:
- 0.18.0 with --packages is broken #407
Closed issues:
Merged pull requests:
- Clean up test descriptions, addresses #372. #416 (ruebot)
- Remaining Matchbox implementations for Scala #415 (SinghGursimran)
- Add crawl_date to binary DataFrames and imageLinks. #414 (ruebot)
- Various DataFrame implementation updates for documentation clean-up; Addresses #372. #406 (ruebot)
- Use https for maven repo. #405 (ruebot)
- Test clean-up. #404 (ruebot)
- Add language detection column to webpages. #403 (ruebot)
- DataFrame Implementation - Serializable APIs #401 (SinghGursimran)
- Filter blank src/dest out of webgraph. #400 (ruebot)
- More df implementations #399 (SinghGursimran)
- Scala imports cleanup. #398 (ruebot)
- More Serializable APIs for DataFrames #396 (SinghGursimran)
- Update ExtractDateRDD test #395 (ruebot)
- Add doc comments for webpages and webgraph; resolves #392. #394 (ruebot)
- Add additional filters for textFiles; resolves #362. #393 (ruebot)
- API implementations for DataFrame #391 (SinghGursimran)
- Setup for Serializable APIs on DataFrames #389 (SinghGursimran)
- Add and update tests, resolve textFiles bug. #388 (ruebot)
- Dataframe matchbox Implementations #387 (SinghGursimran)
- Clean-up underscore import, and scalastyle warnings. #386 (ruebot)
- Rename pages() to webpages(). #384 (ruebot)
- More Data Frame Implementations + Code Refactoring #383 (SinghGursimran)
- Extract popular images - Data Frame implementation #382 (SinghGursimran)
- Append UDF with RDD or DF. #381 (ruebot)
- Matchbox utilities to DataFrames #380 (SinghGursimran)
- Rename DF functions to be consistent with Python DF functions. #379 (ruebot)
- Converting output of NER Classifier to WANE Format #378 (SinghGursimran)
- Finding Hyperlinks within Collection on Pages with Certain Keyword #377 (SinghGursimran)
- Update README.md #376 (lintool)
- Fix for Issue-368 #374 (SinghGursimran)
- [skip travis] update description. see https://github.com/archivesunle… #373 (ruebot)
- Various UDF implementation and cleanup for DF #370 (lintool)
- Update commons-compress to 1.19; CVE-2019-12402 #365 (ruebot)
- Add ComputeSHA1 method; resolves #363. #364 (ruebot)
- Align NER output to WANE format #361 (ruebot)
- Update keepValidPages to include a filter on 200 OK. #360 (ruebot)
- Update to Spark 2.4.4 #358 (ruebot)
- [skip travis] Update links #357 (ruebot)
- Improve test coverage. #354 (ruebot)
- Add discardLanguage filter to RecordLoader. #353 (ruebot)
aut-0.18.1 (2020-01-17)
Implemented enhancements:
Fixed bugs:
- textFiles does not filter properly #390
- DataFrame error with text files: java.net.MalformedURLException: unknown protocol: filedesc #362
Closed issues:
- Missing doc comments #392
- Bug in ArcTest? Why run RemoveHTML? #369
- UDF CaMeL cASe consistency issues #368
- ExtractDomain or ExtractBaseDomain? #367
- Align DataFrame boilerplate in Python and Scala #366
- Create a ComputeSHA1 method #363
- Discussion: Should we align our Named Entity Recognition output with WANE format? #297
- DataFrame discussion: open thread #190
aut-0.18.0 (2019-08-21)
Implemented enhancements:
- Add method for unknown extensions in binary extractions #343
- Use Tika's detected MIME type instead of ArchiveRecord getMimeType? #342
- Add filter/keep by http status to RecordLoader class #315
- Audio binary object extraction #307
- Video binary object extraction #306
- Powerpoint binary object extraction #305
- Doc binary object extraction #304
- Spreadsheet binary object extraction #303
- PDF binary object extraction #302
- Test aut with Apache Spark 2.4.0 #295
- Replace hashing of unique ids with .zipWithUniqueId() #243
- Integration of neural network models for image analysis #240
- More complete Twitter Ingestion #194
- Image Search Functionality #165
- feature request: log when loadArchives opens and closes warc files in a dir #156
Fixed bugs:
- DataFrame commands throwing java.lang.NullPointerException on example data #320
- Class issues when using aut-0.17.0-fatjar.jar #313
- Image extraction does not scale with number of WARCs #298
- ExtractDomain mistakenly checks source first then url #277
- Improve ExtractDomain to Better Isolate Domains #269
Security fixes:
- CVE-2017-7525 -- com.fasterxml.jackson.core:jackson-databind #279
Closed issues:
- Inconsistency in ArchiveRecord.getContentBytes #334
- Rationalize computeHash and ComputeMD5 #333
- Test additional Java versions with TravisCI #324
- Remove Twitter/tweet analysis #322
- Trouble testing s3 connectivity #319
- Depfu Error: No dependency files found #309
- Strategy to deal with conflict between application and Spark distribution dependencies #308
- SaveImageTest.scala should delete saved image file #299
- Remove Deprecated ExtractGraph.scala file for next release. #291
- DetectLanguage.scala: class LanguageIdentifier in package language is deprecated #286
- Maven build warning during release #273
- Improve DataFrameLoader.scala test coverage #265
- Improve package.scala test coverage #263
- Discussion: Idiom for loading DataFrames #231
- DataFrame field names: open thread #229
- DataFrame performance comparison: Scala vs. Python #215
- TweetUtilsTest.scala doesn't test Spark, only underlying json4s library #206
- feature request: ArchiveRecord.archiveFile #164
- feature request: possibility to query about the progress #162
- Update to Apache Tika 1.19.1; security vulnerabilities in 1.12 #131
- Create tests for ExtractGraph.scala #49
- Setup Victims #5
Merged pull requests:
- Update LICENSE and license headers. #351 (ruebot)
- Add binary extraction DataFrames to PySpark. #350 (ruebot)
- Add method for determining binary file extension #349 (jrwiebe)
- Add keep and discard by http status. #347 (ruebot)
- Add office document binary extraction. #346 (ruebot)
- Use version of tika-parsers without a classifier #345 (jrwiebe)
- Use Tika's detected MIME type instead of ArchiveRecord getMimeType. #344 (ruebot)
- Add Audio & Video binary extraction #341 (ruebot)
- Extract PDF #340 (jrwiebe)
- More scalastyle work; addresses #196. #339 (ruebot)
- Replace computeHash with ComputeMD5; resolves #333. #338 (ruebot)
- Update Tika to 1.22; address security alerts. #337 (ruebot)
- Tests #336 (ruebot)
- Make ArchiveRecord.getContentBytes consistent, Resolve #334 #335 (ianmilligan1)
- Enable S3 access #332 (jrwiebe)
- Updates to pom following 0e701b271e04e60c6fa89f39299dae7142d700b8 #328 (ruebot)
- Move data frame fields names to snake_case. #327 (ruebot)
- Python formatting, and gitignore additions. #326 (ruebot)
- Test Java 8 & 11, and remove OracleJDK; resolves #324. #325 (ruebot)
- Remove Tweet utils. #323 (ruebot)
- Update to Spark 2.4.3 and update Tika to 1.20. #321 (ruebot)
- add image analysis w/ tensorflow #318 (h324yang)
- Makes ArchiveRecordImpl serializable #316 (jrwiebe)
- Resolve cobertura-maven-plugin class issue; resolves #313. #314 (ruebot)
- Update spark-core_2.11 to 2.3.1. #312 (ruebot)
- Log closing of ARC and WARC files, per #156 #301 (jrwiebe)
- Delete saved image file; resolves #299 #300 (jrwiebe)
- Remove Deprecated ExtractGraph app #293 (greebie)
- Add .getHttpStatus and .getFilename to ArchiveRecordImpl class #198 & #164 #292 (greebie)
- Update license headers for #208. #290 (ruebot)
- Change Id generation for graphs from using hashes for urls to using .zipWithUniqueIds() #289 (greebie)
- CVE-2018-11771 update #288 (ruebot)
- CVE-2017-17485 update; follow-on to #281. #287 (ruebot)
- Update Apache Tika - security vulnerabilities; resolves #131. #285 (ruebot)
- [skip travis] Update README #284 (ruebot)
- Only trigger TravisCI on master. #283 (ruebot)
- Missed something for #208. #282 (ruebot)
- CVE-2018-7489 fix. #281 (ruebot)
- Update jackson-databind version; resolves #279. #280 (ruebot)
- Patch for #277: Fix bug and unit test for ExtractDomain #278 (borislin)
- Patch for #269: Replace backslash with forward slash in URL #276 (borislin)
- Clean-up pom.xml to remove plugin warnings; resolves #273. #274 (ruebot)
aut-0.17.0 (2018-10-04)
Implemented enhancements:
Fixed bugs:
- AUT exits/dies on java.util.zip.ZipException: too many length or distance symbols #271
- AUT exits/dies on java.util.zip.ZipException: invalid distance too far back #246
- Improve ExtractDomain Normalization #239
- Twitter analysis is broken; see also: json4s/json4s#496 #197
- Prevent encoding errors in PySpark #122
Closed issues:
- Cannot skip bad record while reading warc file #267
- Why did Scalastyle not reject null values in TweetUtilTest #255
- Create UDF to combine basic text filtering features #253
- spark-shell --packages "io.archivesunleashed:aut:0.16.0" fails with not_found dependencies #242
- CommandLineAppRunner.scala produces output per WARC instead of combined result. #235
- Extract images out of images DataFrame and store to disk #232
- Before the next release, make sure docker-aut builds on master... or make sure --packages works #227
- DataFrames for image analysis #220
- The attempt to upgrade Spark version to 2.3.0 is not successful #218
- Convert nulls to Option(T) #212
- Bringing Scala DataFrames into PySpark #209
- What is AUT? #208
- Refactor ExtractGraph and assess value of GraphX for producing network graphs #203
- Codify creation of standard derivatives into apps #195
- TweetUtils - support fulltext #192
- Combine UDFs into appropriate objects #187
- Register Scala functions for use in Pyspark #148
- PySpark performance bottlenecks: counting values #130
- Redesign of PySpark DataFrame interface for filtering #120
- Improve RecordLoader.scala test coverage #60
Merged pull requests:
- Patch for #246 & #271: Fix exception error when processing corrupted ARC files #272 (borislin)
- Update Bug report template. #268 (ruebot)
- ExtractBoilerpipeText to remove headers as well. #253 #256 (greebie)
- Add additional tweet fields to TweetUtils; partially address #194. #254 (ruebot)
- Add support for full_text in tweets; resolve #192. #252 (ruebot)
- Get rid of 'filesystem-root relative reference' warning. #251 (ruebot)
- Remove stray characters from example commands. #250 (ruebot)
- Deal with final scalastyle assessments: Issue 212 #249 (greebie)
- Address main scalastyle errors - #196 #248 (greebie)
- Add ExtractGraphX including algorithms for PageRank and Components. Issue 203 #245 (greebie)
- Travis build fixes #244 (ruebot)
- Data frame implementation of extractors. Also added cmd arguments to resolve #235 #236 (TitusAn)
- Save images from dataframe to disk #234 (jwli229)
- Add missing dependencies in; addresses #227. #233 (ruebot)
- Code cleanup: ArchiveRecord + impl moved into same Scala file #230 (lintool)
- Add Extract Image Details API #226 (jwli229)
- Implement DomainFrequency, DomainGraph and PlainText extractor that can be run from command line #225 (TitusAn)
- Remove duplicate call of keepValidPages #224 (jwli229)
- Extract Image Links DF API + Test #221 (jwli229)
- Update Apache Spark to 2.3.0; resolves #218 #219 (ruebot)
- Resolve archivesunleashed/docker-aut#17 #217 (ruebot)
- Create issue templates #216 (ruebot)
- Exposing Scala DataFrames in PySpark #214 (lintool)
- Update project description; resolves #208. #211 (ruebot)
- Initial DataFrames merge #210 (lintool)
- Add more instructions on how to use things to the README. #207 (ruebot)
aut-0.16.0 (2018-04-26)
Implemented enhancements:
- Revisit approach to .keepValidPages() #177
Closed issues:
- keepValidPages incorrectly filters out pages with mime-type text/html followed by charset #199
Merged pull requests:
- Unbork'ing tweet analysis (Fixes Issue 197) - take2 #205 (lintool)
- Update README.md #202 (lintool)
- Code reformatting #201 (lintool)
- fix #199: mime-type was incorrectly parsed from content-type when cha… #200 (dportabella)
aut-0.15.0 (2018-04-11)
Implemented enhancements:
- Clean-up scaladoc comments #184
Closed issues:
- Rename package io.archivesunleashed.io #188
- Major Refactoring: RecordRDD #180
- Major refactoring: matchbox cleanup #179
- Major refactoring: io.archivesunleashed.spark -> io.archivesunleashed #178
Merged pull requests:
- Improve and clean-up Scaladocs; resolves #184 #193 (ruebot)
- Major refactoring of package structure #189 (lintool)
- make ArchiveRecord a trait #186 (helgeho)
aut-0.14.0 (2018-03-20)
Closed issues:
- Incorporate Scala UDFs into Auto-documentation #176
Merged pull requests:
- Resolve #176; setup scaladocs. #183 (ruebot)
- Revert "make ArchiveRecord a trait (#175)" #181 (ruebot)
aut-0.13.0 (2018-03-07)
Merged pull requests:
aut-0.12.2 (2018-02-28)
Implemented enhancements:
- ArchiveRecord.warcFile #171
- Better approach to ids in WriteGraphML & WriteGEXF #168
- Build pre-filtered networks #109
- KeepDate UDF should support date range #108
- Changing keepDate to allow multiple dates, would close #108 #161 (ianmilligan1)
Fixed bugs:
- Broken GEXF Files Due to < and > characters in node id fields #172
- There is insufficient memory for the Java Runtime Environment to continue #159
- AUT Fails on Extracting Text from WARCs #158
Closed issues:
- RecordLoader.loadArchives fails with nested dirs #169
- Unparseable date error #163
- remove angle brackets from ArchiveRecord.getUrl #157
- Benchmarking Scala vs Python #121
- Improve WacArcInputFormat.java test coverage #80
- Improve WacWarcInputFormat.java test coverage #78
- Improve WarcRecordWritable.java test coverage #77
- Improve ArcRecordWritable.java test coverage #75
- Improve ArcRecord.scala test coverage #69
- Improve RemoveHttpHeader.scala test coverage #57
- Investigate Jupyter notebooks on Altiscale #37
Merged pull requests:
- Gexf Fixes & StringUtil Functions #172 #173 (greebie)
- Graphml Improvements #170 (greebie)
- Graphml #167 (greebie)
- Fix bug -- label type should be "string" not "label". #166 (greebie)
- Add link to docker-aut. #160 (ruebot)
- Remove references to Arc and WarcRecord libraries (covered by Archive… #146 (greebie)
aut-0.12.1 (2017-12-15)
Fixed bugs:
- ARC Handling Bug in 0.12.0 when Extracting Links #154
- Changes jsoup version in pom.xml (#154) #155 (ianmilligan1)
aut-0.12.0 (2017-12-11)
Implemented enhancements:
Fixed bugs:
- NullPointerException error during build #124
- Resolves Issue #128: Uses new getOrigins method #136 (ianmilligan1)
Closed issues:
- Create tests for WriteGEXF.scala #138
- ERROR ArcRecordUtils - Read 1224 bytes but expected 1300 bytes #128
- WarcRecordUtils.java uses or overrides a deprecated API #127
- class LanguageIdentifier in package language is deprecated #126
- multiple versions of scala #125
- ExtractLinks running slowly #123
- com.cloudera.cdh:hadoop-ant:pom:0.20.2-cdh3u4 -- errors #118
Merged pull requests:
- Too many JUNITs #152 (ruebot)
- Add more packages and exclusions for #113 #150 (ruebot)
- Tuple Formatter Test Improvement #145 (greebie)
- Check to replace partial coverage for ExtractDate. #144 (greebie)
- Add GraphML UDF #143 (greebie)
- Remove stackTrace output on caught error. #141 (greebie)
- Add deprecation warnings to outmoded Arc and Warc formats. #140 (greebie)
- Tests for WriteGEXF Issue #138 #139 (greebie)
- Include script to write to GEXF. (#103) #137 (greebie)
- Use correct import for WARCConstants; Resolves #127. #133 (ruebot)
- Downgrade Tika to 1.12. Resolves #126. #132 (ruebot)
- Pin everything to Scala 2.11.8; Resolves #125. #129 (ruebot)
- Exclude old version of Hadoop. Resolves #118. #119 (ruebot)
aut-0.11.0 (2017-11-22)
Implemented enhancements:
- GetCrawlYear to accompany GetCrawlMonth #104
- Refactor RecordLoader classes #102
- Adding getCrawlYear in ArchiveRecords, resolves #104 #105 (ianmilligan1)
Closed issues:
- spark-shell --packages "io.archivesunleashed:aut:0.10.0" fails with not_found dependencies #113
- update the version of the dependencies not available on the central maven repository #111
- Bake keepValidPages() into RecordLoader #101
- Create tests for JsonUtil.scala #66
- Improve ExtractDomain.scala test coverage #63
- Improve ExtractImageLinks.scala test coverage #62
- Improve ExtractLinks.scala test coverage #61
- Improve StringUtils.scala test coverage #58
- Improve RemoveHTML.scala test coverage #56
- Create tests for TweetUtils.scala #54
- Create tests for ExtractTextFromPDFs.scala #51
- Create tests for ExtractPopularImages.scala #50
- Create tests for ExtractBoilerpipeText.scala #47
- Create tests for ComputeMD5.scala #46
- Create tests for ComputeImageSize.scala #45
Merged pull requests:
- This needs to hold steady. #117 (ruebot)
- Update all dependencies, and add missing dependencies to resolve #113. #116 (ruebot)
- Updated documentation links; link to project page #115 (ianmilligan1)
- Remove pom.xml cruft; Partially resolves #111. #112 (ruebot)
- Created Code of Conduct file #110 (SamFritz)
- Refactor ArchiveRecord classes; addresses #101 and #102 #107 (MapleOx)
- Improve coverage for issue-67 (RecordRDD.scala) #99 (greebie)
- Minor fix to improve coverage. #55 #98 (greebie)
- Test ExtractTextFromPDFs. #51 #97 (greebie)
- Remove example scripts. Resolves #95, #70, #71, #72. #96 (ruebot)
- Setup cobertura better so we have local html reports. #94 (ruebot)
- Create unit tests for Issue #50 (ExtractPopularImages) #93 (greebie)
- Add ExtractGraphTest; lint fixes on RemoveHttpHeaderTest. #92 (greebie)
- Improve coverage for Issue #80 #91 (greebie)
- Improve coverage for TweetUtils #90 (greebie)
- Increase coverage for ComputeImageSize. #45 #89 (greebie)
- Complete coverage for #66 #88 (greebie)
- Improve Test Coverage for #55, #56, #57, #58, #59, #60, #61, #62, #63, #64 & #66 #87 (greebie)
- Add PR template. #85 (ruebot)
- First round of unit tests #84 (greebie)
- Use Scala 2.11.8; Align further with Altiscale. #83 (ruebot)
aut-0.10.0 (2017-10-02)
Fixed bugs:
- NER breaks for WARC files? #41
Closed issues:
- Do we need pythonconverters/ArcRecordConverter.scala? If so, tests. If not, delete it. #65
- Upgrade to Spark 2 on Altiscale #43
- Investigate our test coverage according to codecov.io #36
- Update Scala version #35
- Update to use Java 8 #32
- Migrate warcbase-resources to aut-resources #30
- mvn site-deploy -DskipTests is still failing #27
- Retarget Hadoop #9
Merged pull requests:
- Update to Apache Spark 2.1.1; resolves #43. #82 (ruebot)
- Remove unused file; resolves #65. #81 (ruebot)
- Removed inaccurate information from README.md #44 (lintool)
- Add WARC support for ExtractEntities; Resolve #41. #42 (ruebot)
- Add OpenJDK8 and remove OracleJDK7 so we can use trusty. #39 (ruebot)
- Link to aut-docs in README #38 (ianmilligan1)
- Resolve #32; Update to Java 8 #34 (ruebot)
- Resolve #9; Update Hadoop and Spark versions. #33 (ruebot)
- Added reference to the releases #31 (ianmilligan1)
- Resolve #27 - Deploy javadocs to gh-pages #29 (ruebot)
- Add Maven Central badge. #28 (ruebot)
aut-0.9.0 (2017-08-24)
Closed issues:
- More work needs to be done on the pom.xml to get us to a release. #25
- Is src/main/java/io/archivesunleashed/demo required? #17
- Visualization Repo (aut-viz) #16
- Remove src/main/python #10
- What do we do with all the documentation at docs.warcbase.org? #8
- Setup to publish javadocs on ghpages #7
- Get a project setup on sonatype #6
- Setup license headers and mycila #4
- Setup checkstyle #3
- Setup codecov.io #1
Merged pull requests:
- Resolve #25 update pom.xml to do a release #26 (ruebot)
- Resolve #7 #24 (ruebot)
- Add Slack integration for TravisCI #21 (ruebot)
- Setup mycila plugin, and normalize all license headers; Resolves #4. #20 (ruebot)
- Add checkstyle plugin, and remove demo; resolves #3 #17. #19 (ruebot)
- Updating README #15 (ianmilligan1)
- Remove dir; resolves #10 #11 (ruebot)
- Setup codecov.io integration; resolves #1 #2 (ruebot)
* This Changelog was automatically generated by github_changelog_generator