Changelog

aut-1.2.0 (2022-11-17)

Full Changelog

Closed issues:

Include last modified date for a resource #546

Merged pull requests:

Add scalafix and remove unused imports. #548 (ruebot)
Last modified headers #547 (ruebot)

aut-1.1.1 (2022-10-31)

Full Changelog

Fixed bugs:

DomainGraph should use YYYYMMDD not YYYYMMDDHHMMSS #544

Merged pull requests:

Use YYYYMMDD for crawl_date for DomainGraphExtractor. #545 (ruebot)
Bump jsoup from 1.14.2 to 1.15.3 #543 (dependabot[bot])

aut-1.1.0 (2022-06-17)

Full Changelog

Fixed bugs:

org.apache.tika.mime.MimeTypeException: Invalid media type name: application/rss+xml lang=utf-8 #542

Closed issues:

Add ARCH text files derivatives #540

Merged pull requests:

Add ARCH text files derivatives. #541 (ruebot)

aut-1.0.0 (2022-06-10)

Full Changelog

Implemented enhancements:

Remove http headers, and html on webpages() #538
Add domain column to webpages() #534
Replace Java ARC/WARC record processing library #494
Method to perform finer-grained selection of ARCs and WARCs #247
Unnecessary buffer copying #18

Fixed bugs:

Discard date RDD filter only takes a single string, not a list of strings. #532
Extract gzip data from transfer-encoded WARC #493
ARC reader string vs int error on record length #492

Closed issues:

java.lang.RuntimeException: Unsupported literal type class scala.collection.immutable.Set$Set1 Set(liberal.ca) #529
Improve CommandLineApp.scala test coverage #262
Improve ExtractBoilerpipeText.scala test coverage #261
Improve ArchiveRecord.scala test coverage #260
Unit testing for RecordLoader #182
Improve ArchiveRecordWritable.java test coverage #76
Improve WarcRecordUtils.java test coverage #74
Improve ArcRecordUtils.java test coverage #73
Improve ExtractDate.scala test coverage #64
Remove org.apache.commons.httpclient #23

Merged pull requests:

Make webpages() consistent across aut and ARCH. #539 (ruebot)
Update README #537 (ruebot)
Fix codecov GitHub action. #536 (ruebot)
Bump commons-compress from 1.14 to 1.21 #535 (dependabot[bot])
Remove Java w/arc processing, and replace it with Sparkling. #533 (ruebot)
Bump xercesImpl from 2.12.0 to 2.12.2 #527 (dependabot[bot])

aut-0.91.0 (2022-01-21)

Full Changelog

Implemented enhancements:

Include timestamp in crawl date #525

Merged pull requests:

Change crawl_date format to YYYYMMDDHHMMSS, update hasDate filter. #526 (ruebot)

aut-0.90.4 (2021-11-01)

Full Changelog

Implemented enhancements:

Replace scala-uri library from ExtractDomain and just parse public_suffix_list.dat #521

Fixed bugs:

Scaladocs haven't been created since 0.90.0 release #522

Merged pull requests:

Replace scala-uri library from ExtractDomain. #524 (ruebot)
Issue 522 #523 (ruebot)

aut-0.90.3 (2021-10-22)

Full Changelog

Fixed bugs:

ExtractDomains returns non-Apex Domains #519

Merged pull requests:

Update ExtractDomain to extract apex domains. #520 (ruebot)
Bump jsoup from 1.13.1 to 1.14.2 #518 (dependabot[bot])

aut-0.90.2 (2021-05-12)

Full Changelog

Fixed bugs:

ARC file name appearing in url list #516
WARC-Target-URI in Wget warc files is not parsed properly #514

Merged pull requests:

Filter out filedesc and dns records from arcs. #517 (ruebot)
Handle wget WARC-Target-URI formatting. #515 (ruebot)

aut-0.90.1 (2021-04-29)

Full Changelog

Fixed bugs:

crawl_date is not included on binary information jobs when documentation says it is #512

Merged pull requests:

Add missing crawl_date column to binary information jobs. #513 (ruebot)
Update jsoup to 1.13.1 #511 (ruebot)

aut-0.90.0 (2021-01-27)

Full Changelog

Fixed bugs:

Python implementation of .all() has .keepValidPages() incorrectly applied to it #502
Extract hyperlinks from wayback machine #501
Release 0.80.0 JAR produces error; built 0.80.1 fatjar built on repo works #495

Closed issues:

Migrate CI infrastructure from TravisCI to GitHub Action #506
Split tf into its own repo #498
Change master branch to main branch #490
GitHub action - Run isort and black on Python code #488
Add scalafmt GitHub action #486
Add Google Java Formatter as a GitHub action #484
Packages build is often broken - should we support it? #483
Implement SaveToDisk in Python #478
Java 11 support #356

Merged pull requests:

ars-cloud compatibility with aut and Java 11 #510 (ruebot)
Update to Spark 3.0.1 #508 (ruebot)
Replace TravisCI with GitHub Actions. #507 (ruebot)
Bump junit from 4.12 to 4.13.1 #505 (dependabot[bot])
Fix relative links extraction #504 (yxzhu16)
Remove .keepValidPages() on .all() Python implementation. #503 (ruebot)
Updates read.me to include citation section #500 (SamFritz)
Remove tf project; resolves #498. #499 (ruebot)
Add Python formatter GitHub Action. #489 (ruebot)
Add scalafmt GitHub action and apply it to scala code. #487 (ruebot)
Add Google Java Formatter as an action, and apply it. #485 (ruebot)
Add Python implementation of SaveBytes. #482 (ruebot)
Bump xercesImpl from 2.11.0 to 2.12.0 #481 (dependabot[bot])
[Skip Travis] Trim README down given aut.docs.archivesunleashed.org #480 (ruebot)
Spark 3.0.0 + Java 11 support. #375 (ruebot)

aut-0.80.0 (2020-06-03)

Full Changelog

Closed issues:

Broken link in documentation #476
Improve udfs/package.scala test coverage #473
Remove tabDelimit #471
Remove Extract Entities #469
PEP8 Naming - UDFs, App method names, DataFrame names, and filters. #468
Python UDFs - class or not? #467
Remove ExtractImageDetailsDF.scala #464
github-site-deploy uses password-based authentication, which is being deprecated by GitHub #461
Implement Python versions of Serializable APIs #410
Implement Python versions of App utilities #409
Implement Python versions of Matchbox utilities #408
Improve TupleFormatter.scala test coverage #59
Create tests for NERCombinedJson.scala #53
Create tests for NER3Classifier.scala #52
Create tests for ExtractEntities.scala #48

Merged pull requests:

Remove RDD suffixes on file, class, and object names. #479 (ruebot)
PEP8 Python app method names. #477 (ruebot)
Move Python UDF methods out of their own class. #475 (ruebot)
Add DataFrame udf tests. #474 (ruebot)
Remove tabDelimit. #472 (ruebot)
Remove NER functionality. #470 (ruebot)
Add ExtractPopularImages, WriteGEXF, and WriteGraphML to Python. #466 (ruebot)
Remove ExtractImageDetailsDF; resolves #464. #465 (ruebot)
Implement Scala Matchbox UDFs in Python. #463 (ruebot)
Import clean-up for df package. #462 (ruebot)

aut-0.70.0 (2020-05-04)

Full Changelog

Implemented enhancements:

Update PlainTextExtractor to just extract text #452
Migration of all RDD functionality over to DataFrames #223

Fixed bugs:

DomainFrequencyExtractor should remove WWW prefix #456

Closed issues:

For extractor (spark-submit) job, set Spark app name to be the extractor job name. #458
Remove RDD options from app #449
Add parquet as an app format option #448
Add datathon derivatives to app (binary info, web pages, web graph) #447
Update Java 8 instructions for MacOS #445
Add spark-submit to README #444

Merged pull requests:

[skip travis] README updates #460 (ruebot)
Set spark-submit app name to be "aut - extractorName". #459 (ruebot)
Add RemovePrefixWWWDF to DomainFrequencyExtractor. #457 (ruebot)
Updating Java install instructions for MacOS, resolves #445 #455 (ianmilligan1)
Add option to save to Parquet for app. #454 (ruebot)
Update PlainTextExtractor to output a single column; text. #453 (ruebot)
Add a number of additional app extractors. #451 (ruebot)
Remove RDD option in app; DataFrame only now. #450 (ruebot)
[skip-travis] Add spark-submit option to README; resolves #444. #446 (ruebot)

aut-0.60.0 (2020-04-15)

Full Changelog

Implemented enhancements:

Discussion: Restyle UDFs in the context of DataFrames #425
Add alt text column to imageGraph (imageLinks) #420
UDFs that filter on url should also filter on src #418

Fixed bugs:

CommandLineApp DomainGraphExtractor Uses Different Node IDs than WriteGraph #439
DomainGraphExtractor produces different output in RDD vs DF #436
Command line app fails because of missing log4j configuration #433

Closed issues:

Remove GraphXML and ExtractGraphX #442
Use Monochromatic Ids instead of hash to produce network identifiers. #440
Add graphml output to DomainGraphExtractor #435
Add webgraph, imagegraph, webpages, etc. to command line app #431
Rename imageLinks to imageGraph #419

Merged pull requests:

Remove GraphX support; resolves #442. #443 (ruebot)
Remove WriteGraph; resolves #439. #441 (ruebot)
Add graphml output to CommandLineApp and DomainGraphExtractor. #438 (ruebot)
Align RDD and DF output for DomainGraphExtractor. #437 (ruebot)
Update log4j configuration to resolve #433. #434 (ruebot)
Add imagegraph, and webgraph to command line app. #432 (ruebot)
Tweak hasDate to handle Seq. #430 (ruebot)
Restyle keep/discard filter UDFs in the context of DataFrames #429 (ruebot)
Update Spark and Hadoop versions. #426 (ruebot)
update for 'src' column #424 (SinghGursimran)
[skip travis] Add pre-print link to README. #423 (ruebot)
Add img alt text to imagegraph(); resolves #420. #422 (ruebot)
Rename imageLinks to imageGraph; resolves #419 #421 (ruebot)
Need --repositories flag with --packages. #417 (ruebot)

aut-0.50.0 (2020-02-05)

Full Changelog

Implemented enhancements:

Add crawl_date to binary DataFrames and imageLinks #413

Fixed bugs:

0.18.0 with --packages is broken #407

Closed issues:

.webpages() additional tokenized columns? #402
Test and documentation inventory #372

Merged pull requests:

Clean up test descriptions, addresses #372. #416 (ruebot)
Remaining Matchbox implementations for Scala #415 (SinghGursimran)
Add crawl_date to binary DataFrames and imageLinks. #414 (ruebot)
Various DataFrame implementation updates for documentation clean-up; Addresses #372. #406 (ruebot)
Use https for maven repo. #405 (ruebot)
Test clean-up. #404 (ruebot)
Add language detection column to webpages. #403 (ruebot)
DataFrame Implementation - Serializable APIs #401 (SinghGursimran)
Filter blank src/dest out of webgraph. #400 (ruebot)
More df implementations #399 (SinghGursimran)
Scala imports cleanup. #398 (ruebot)
More Serializable APIs for DataFrames #396 (SinghGursimran)
Update ExtractDateRDD test #395 (ruebot)
Add doc comments for webpages and webgraph; resolves #392. #394 (ruebot)
Add additional filters for textFiles; resolves #362. #393 (ruebot)
API implementations for DataFrame #391 (SinghGursimran)
Setup for Serializable APIs on DataFrames #389 (SinghGursimran)
Add and update tests, resolve textFiles bug. #388 (ruebot)
Dataframe matchbox Implementations #387 (SinghGursimran)
Clean-up underscore import, and scalastyle warnings. #386 (ruebot)
Rename pages() to webpages(). #384 (ruebot)
More Data Frame Implementations + Code Refactoring #383 (SinghGursimran)
Extract popular images - Data Frame implementation #382 (SinghGursimran)
Append UDF with RDD or DF. #381 (ruebot)
Matchbox utilities to DataFrames #380 (SinghGursimran)
Rename DF functions to be consistent with Python DF functions. #379 (ruebot)
Converting output of NER Classifier to WANE Format #378 (SinghGursimran)
Finding Hyperlinks within Collection on Pages with Certain Keyword #377 (SinghGursimran)
Update README.md #376 (lintool)
Fix for Issue-368 #374 (SinghGursimran)
[skip travis] update description. see https://github.com/archivesunle… #373 (ruebot)
Various UDF implementation and cleanup for DF #370 (lintool)
Update commons-compress to 1.19; CVE-2019-12402 #365 (ruebot)
Add ComputeSHA1 method; resolves #363. #364 (ruebot)
Align NER output to WANE format #361 (ruebot)
Update keepValidPages to include a filter on 200 OK. #360 (ruebot)
Update to Spark 2.4.4 #358 (ruebot)
[skip travis] Update links #357 (ruebot)
Improve test coverage. #354 (ruebot)
Add discardLanguage filter to RecordLoader. #353 (ruebot)

aut-0.18.1 (2020-01-17)

Full Changelog

Implemented enhancements:

Enhance keepValidPages #359
Add discardLanguage filter #352

Fixed bugs:

textFiles does not filter properly #390
DataFrame error with text files: java.net.MalformedURLException: unknown protocol: filedesc #362

Closed issues:

Missing doc comments #392
Bug in ArcTest? Why run RemoveHTML? #369
UDF CaMeL cASe consistency issues #368
ExtractDomain or ExtractBaseDomain? #367
Align DataFrame boilerplate in Python and Scala #366
Create a ComputeSHA1 method #363
Discussion: Should we align our Named Entity Recognition output with WANE format? #297
DataFrame discussion: open thread #190

aut-0.18.0 (2019-08-21)

Full Changelog

Implemented enhancements:

Add method for unknown extensions in binary extractions #343
Use Tika's detected MIME type instead of ArchiveRecord getMimeType? #342
Add filter/keep by http status to RecordLoader class #315
Audio binary object extraction #307
Video binary object extraction #306
Powerpoint binary object extraction #305
Doc binary object extraction #304
Spreadsheet binary object extraction #303
PDF binary object extraction #302
Test aut with Apache Spark 2.4.0 #295
Replace hashing of unique ids with .zipWithUniqueId() #243
Integration of neural network models for image analysis #240
More complete Twitter Ingestion #194
Image Search Functionality #165
feature request: log when loadArchives opens and closes warc files in a dir #156

Fixed bugs:

DataFrame commands throwing java.lang.NullPointerException on example data #320
Class issues when using aut-0.17.0-fatjar.jar #313
Image extraction does not scale with number of WARCs #298
ExtractDomain mistakenly checks source first then url #277
Improve ExtractDomain to Better Isolate Domains #269

Security fixes:

CVE-2017-7525 -- com.fasterxml.jackson.core:jackson-databind #279

Closed issues:

Inconsistency in ArchiveRecord.getContentBytes #334
Rationalize computeHash and ComputeMD5 #333
Test additional Java versions with TravisCI #324
Remove Twitter/tweet analysis #322
Trouble testing s3 connectivity #319
Depfu Error: No dependency files found #309
Strategy to deal with conflict between application and Spark distribution dependencies #308
SaveImageTest.scala should delete saved image file #299
Remove Deprecated ExtractGraph.scala file for next release. #291
DetectLanguage.scala: class LanguageIdentifier in package language is deprecated #286
Maven build warning during release #273
Improve DataFrameLoader.scala test coverage #265
Improve package.scala test coverage #263
Discussion: Idiom for loading DataFrames #231
DataFrame field names: open thread #229
DataFrame performance comparison: Scala vs. Python #215
TweetUtilsTest.scala doesn't test Spark, only underlying json4s library #206
feature request: ArchiveRecord.archiveFile #164
feature request: possibility to query about the progress #162
Update to Apache Tika 1.19.1; security vulnerabilities in 1.12 #131
Create tests for ExtractGraph.scala #49
Setup Victims #5

Merged pull requests:

Update LICENSE and license headers. #351 (ruebot)
Add binary extraction DataFrames to PySpark. #350 (ruebot)
Add method for determining binary file extension #349 (jrwiebe)
Add keep and discard by http status. #347 (ruebot)
Add office document binary extraction. #346 (ruebot)
Use version of tika-parsers without a classifier #345 (jrwiebe)
Use Tika's detected MIME type instead of ArchiveRecord getMimeType. #344 (ruebot)
Add Audio & Video binary extraction #341 (ruebot)
Extract PDF #340 (jrwiebe)
More scalastyle work; addresses #196. #339 (ruebot)
Replace computeHash with ComputeMD5; resolves #333. #338 (ruebot)
Update Tika to 1.22; address security alerts. #337 (ruebot)
Tests #336 (ruebot)
Make ArchiveRecord.getContentBytes consistent, Resolve #334 #335 (ianmilligan1)
Enable S3 access #332 (jrwiebe)
Updates to pom following 0e701b271e04e60c6fa89f39299dae7142d700b8 #328 (ruebot)
Move data frame fields names to snake_case. #327 (ruebot)
Python formatting, and gitignore additions. #326 (ruebot)
Test Java 8 & 11, and remove OracleJDK; resolves #324. #325 (ruebot)
Remove Tweet utils. #323 (ruebot)
Update to Spark 2.4.3 and update Tika to 1.20. #321 (ruebot)
add image analysis w/ tensorflow #318 (h324yang)
Makes ArchiveRecordImpl serializable #316 (jrwiebe)
Resolve cobertura-maven-plugin class issue; resolves #313. #314 (ruebot)
Update spark-core_2.11 to 2.3.1. #312 (ruebot)
Log closing of ARC and WARC files, per #156 #301 (jrwiebe)
Delete saved image file; resolves #299 #300 (jrwiebe)
Remove Deprecated ExtractGraph app #293 (greebie)
Add .getHttpStatus and .getFilename to ArchiveRecordImpl class #198 & #164 #292 (greebie)
Update license headers for #208. #290 (ruebot)
Change Id generation for graphs from using hashes for urls to using .zipWithUniqueIds() #289 (greebie)
CVE-2018-11771 update #288 (ruebot)
CVE-2017-17485 update; follow-on to #281. #287 (ruebot)
Update Apache Tika - security vulnerabilities; resolves #131. #285 (ruebot)
[skip travis] Update README #284 (ruebot)
Only trigger TravisCI on master. #283 (ruebot)
Missed something for #208. #282 (ruebot)
CVE-2018-7489 fix. #281 (ruebot)
Update jackson-databind version; resolves #279. #280 (ruebot)
Patch for #277: Fix bug and unit test for ExtractDomain #278 (borislin)
Patch for #269: Replace backslash with forward slash in URL #276 (borislin)
Clean-up pom.xml to remove plugin warnings; resolves #273. #274 (ruebot)

aut-0.17.0 (2018-10-04)

Full Changelog

Implemented enhancements:

Add EscapeHTML Function for ExtractLinks #266
PySpark support #12

Fixed bugs:

AUT exits/dies on java.util.zip.ZipException: too many length or distance symbols #271
AUT exits/dies on java.util.zip.ZipException: invalid distance too far back #246
Improve ExtractDomain Normalization #239
Twitter analysis is broken; see also: json4s/json4s#496 #197
Prevent encoding errors in PySpark #122

Closed issues:

Cannot skip bad record while reading warc file #267
Why did Scalastyle not reject null values in TweetUtilTest #255
Create UDF to combine basic text filtering features #253
spark-shell --packages "io.archivesunleashed:aut:0.16.0" fails with not_found dependencies #242
CommandLineAppRunner.scala produces output per WARC instead of combined result. #235
Extract images out of images DataFrame and store to disk #232
Before the next release, make sure docker-aut builds on master... or make sure --packages works #227
DataFrames for image analysis #220
The attempt to upgrade Spark version to 2.3.0 is not successful #218
Convert nulls to Option(T) #212
Bringing Scala DataFrames into PySpark #209
What is AUT? #208
Refactor ExtractGraph and assess value of GraphX for producing network graphs #203
Codify creation of standard derivatives into apps #195
TweetUtils - support fulltext #192
Combine UDFs into appropriate objects #187
Register Scala functions for use in Pyspark #148
PySpark performance bottlenecks: counting values #130
Redesign of PySpark DataFrame interface for filtering #120
Improve RecordLoader.scala test coverage #60

Merged pull requests:

Patch for #246 & #271: Fix exception error when processing corrupted ARC files #272 (borislin)
Update Bug report template. #268 (ruebot)
ExtractBoilerpipeText to remove headers as well. #253 #256 (greebie)
Add additional tweet fields to TweetUtils; partially address #194. #254 (ruebot)
Add support for full_text in tweets; resolve #192. #252 (ruebot)
Get rid of 'filesystem-root relative reference' warning. #251 (ruebot)
Remove stray characters from example commands. #250 (ruebot)
Deal with final scalastyle assessments: Issue 212 #249 (greebie)
Address main scalastyle errors - #196 #248 (greebie)
Add ExtractGraphX including algorithms for PageRank and Components. Issue 203 #245 (greebie)
Travis build fixes #244 (ruebot)
Data frame implementation of extractors. Also added cmd arguments to resolve #235 #236 (TitusAn)
Save images from dataframe to disk #234 (jwli229)
Add missing dependencies in; addresses #227. #233 (ruebot)
Code cleanup: ArchiveRecord + impl moved into same Scala file #230 (lintool)
Add Extract Image Details API #226 (jwli229)
Implement DomainFrequency, DomainGraph and PlainText extractor that can be run from command line #225 (TitusAn)
Remove duplicate call of keepValidPages #224 (jwli229)
Extract Image Links DF API + Test #221 (jwli229)
Update Apache Spark to 2.3.0; resolves #218 #219 (ruebot)
Resolve archivesunleashed/docker-aut#17 #217 (ruebot)
Create issue templates #216 (ruebot)
Exposing Scala DataFrames in PySpark #214 (lintool)
Update project description; resolves #208. #211 (ruebot)
Initial DataFrames merge #210 (lintool)
Add more instructions on how to use things to the README. #207 (ruebot)

aut-0.16.0 (2018-04-26)

Full Changelog

Implemented enhancements:

Revisit approach to .keepValidPages() #177

Closed issues:

keepValidPages incorrectly filters out pages with mime-type text/html followed by charset #199

Merged pull requests:

Unbork'ing tweet analysis (Fixes Issue 197) - take2 #205 (lintool)
Update README.md #202 (lintool)
Code reformatting #201 (lintool)
fix #199: mime-type was incorrectly parsed from content-type when cha… #200 (dportabella)

aut-0.15.0 (2018-04-11)

Full Changelog

Implemented enhancements:

Clean-up scaladoc comments #184

Closed issues:

Rename package io.archivesunleashed.io #188
Major Refactoring: RecordRDD #180
Major refactoring: matchbox cleanup #179
Major refactoring: io.archivesunleashed.spark -> io.archivesunleashed #178

Merged pull requests:

Improve and clean-up Scaladocs; resolves #184 #193 (ruebot)
Major refactoring of package structure #189 (lintool)
make ArchiveRecord a trait #186 (helgeho)

aut-0.14.0 (2018-03-20)

Full Changelog

Closed issues:

Incorporate Scala UDFs into Auto-documentation #176

Merged pull requests:

Resolve #176; setup scaladocs. #183 (ruebot)
Revert "make ArchiveRecord a trait (#175)" #181 (ruebot)

aut-0.13.0 (2018-03-07)

Full Changelog

Merged pull requests:

make ArchiveRecord a trait #175 (helgeho)

aut-0.12.2 (2018-02-28)

Full Changelog

Implemented enhancements:

ArchiveRecord.warcFile #171
Better approach to ids in WriteGraphML & WriteGEXF #168
Build pre-filtered networks #109
KeepDate UDF should support date range #108
Changing keepDate to allow multiple dates, would close #108 #161 (ianmilligan1)

Fixed bugs:

Broken GEXF Files Due to < and > characters in node id fields #172
There is insufficient memory for the Java Runtime Environment to continue #159
AUT Fails on Extracting Text from WARCs #158

Closed issues:

RecordLoader.loadArchives fails with nested dirs #169
Unparseable date error #163
remove angle brackets from ArchiveRecord.getUrl #157
Benchmarking Scala vs Python #121
Improve WacArcInputFormat.java test coverage #80
Improve WacWarcInputFormat.java test coverage #78
Improve WarcRecordWritable.java test coverage #77
Improve ArcRecordWritable.java test coverage #75
Improve ArcRecord.scala test coverage #69
Improve RemoveHttpHeader.scala test coverage #57
Investigate Jupyter notebooks on Altiscale #37

Merged pull requests:

Gexf Fixes & StringUtil Functions #172 #173 (greebie)
Graphml Improvements #170 (greebie)
Graphml #167 (greebie)
Fix bug -- label type should be "string" not "label". #166 (greebie)
Add link to docker-aut. #160 (ruebot)
Remove references to Arc and WarcRecord libraries (covered by Archive… #146 (greebie)

aut-0.12.1 (2017-12-15)

Full Changelog

Fixed bugs:

ARC Handling Bug in 0.12.0 when Extracting Links #154
Changes jsoup version in pom.xml (#154) #155 (ianmilligan1)

aut-0.12.0 (2017-12-11)

Full Changelog

Implemented enhancements:

Add GraphML UDF #142
GEXF Output #103
Native notebook support #14
DataFrames support #13

Fixed bugs:

NullPointerException error during build #124
Resolves Issue #128: Uses new getOrigins method #136 (ianmilligan1)

Closed issues:

Create tests for WriteGEXF.scala #138
ERROR ArcRecordUtils - Read 1224 bytes but expected 1300 bytes #128
WarcRecordUtils.java uses or overrides a deprecated API #127
class LanguageIdentifier in package language is deprecated #126
multiple versions of scala #125
ExtractLinks running slowly #123
com.cloudera.cdh:hadoop-ant:pom:0.20.2-cdh3u4 -- errors #118

Merged pull requests:

Too many JUNITs #152 (ruebot)
Add more packages and exclusions for #113 #150 (ruebot)
Add tests for RecordLoader #149 (greebie)
Tuple Formatter Test Improvement #145 (greebie)
Check to replace partial coverage for ExtractDate. #144 (greebie)
Add GraphML UDF #143 (greebie)
Remove stackTrace output on caught error. #141 (greebie)
Add deprecation warnings to outmoded Arc and Warc formats. #140 (greebie)
Tests for WriteGEXF Issue #138 #139 (greebie)
Include script to write to GEXF. (#103) #137 (greebie)
Use correct import for WARCConstants; Resolves #127. #133 (ruebot)
Downgrade Tika to 1.12. Resolves #126. #132 (ruebot)
Pin everything to Scala 2.11.8; Resolves #125. #129 (ruebot)
Exclude old version of Hadoop. Resolves #118. #119 (ruebot)

aut-0.11.0 (2017-11-22)

Full Changelog

Implemented enhancements:

GetCrawlYear to accompany GetCrawlMonth #104
Refactor RecordLoader classes #102
Adding getCrawlYear in ArchiveRecords, resolves #104 #105 (ianmilligan1)

Closed issues:

spark-shell --packages "io.archivesunleashed:aut:0.10.0" fails with not_found dependencies #113
update the version of the dependencies not available on the central maven repository #111
Bake keepValidPages() into RecordLoader #101
Create tests for JsonUtil.scala #66
Improve ExtractDomain.scala test coverage #63
Improve ExtractImageLinks.scala test coverage #62
Improve ExtractLinks.scala test coverage #61
Improve StringUtils.scala test coverage #58
Improve RemoveHTML.scala test coverage #56
Create tests for TweetUtils.scala #54
Create tests for ExtractTextFromPDFs.scala #51
Create tests for ExtractPopularImages.scala #50
Create tests for ExtractBoilerpipeText.scala #47
Create tests for ComputeMD5.scala #46
Create tests for ComputeImageSize.scala #45

Merged pull requests:

This needs to hold steady. #117 (ruebot)
Update all dependencies, and add missing dependencies to resolve #113. #116 (ruebot)
Updated documentation links; link to project page #115 (ianmilligan1)
Remove pom.xml cruft; Partially resolves #111. #112 (ruebot)
Created Code of Conduct file #110 (SamFritz)
Refactor ArchiveRecord classes; addresses #101 and #102 #107 (MapleOx)
Improve coverage for issue-67 (RecordRDD.scala) #99 (greebie)
Minor fix to improve coverage. #55 #98 (greebie)
Test ExtractTextFromPDFs. #51 #97 (greebie)
Remove example scripts. Resolves #95, #70, #71, #72. #96 (ruebot)
Setup cobertura better so we have local html reports. #94 (ruebot)
Create unit tests for Issue #50 (ExtractPopularImages) #93 (greebie)
Add ExtractGraphTest; lint fixes on RemoveHttpHeaderTest. #92 (greebie)
Improve coverage for Issue #80 #91 (greebie)
Improve coverage for TweetUtils #90 (greebie)
Increase coverage for ComputeImageSize. #45 #89 (greebie)
Complete coverage for #66 #88 (greebie)
Improve Test Coverage for #55, #56, #57, #58, #59, #60, #61, #62, #63, #64 & #66 #87 (greebie)
Add PR template. #85 (ruebot)
First round of unit tests #84 (greebie)
Use Scala 2.11.8; Align further with Altiscale. #83 (ruebot)

aut-0.10.0 (2017-10-02)

Full Changelog

Fixed bugs:

NER breaks for WARC files? #41

Closed issues:

Do we need pythonconverters/ArcRecordConverter.scala? If so, tests. If not, delete it. #65
Upgrade to Spark 2 on Altiscale #43
Investigate our test coverage according to codecov.io #36
Update Scala version #35
Update to use Java 8 #32
Migrate warcbase-resources to aut-resources #30
mvn site-deploy -DskipTests is still failing #27
Retarget Hadoop #9

Merged pull requests:

Update to Apache Spark 2.1.1; resolves #43. #82 (ruebot)
Remove unused file; resolves #65. #81 (ruebot)
Removed inaccurate information from README.md #44 (lintool)
Add WARC support for ExtractEntities; Resolve #41. #42 (ruebot)
Add OpenJDK8 and remove OracleJDK7 so we can use trusty. #39 (ruebot)
Link to aut-docs in README #38 (ianmilligan1)
Resolve #32; Update to Java 8 #34 (ruebot)
Resolve #9; Update Hadoop and Spark versions. #33 (ruebot)
Added reference to the releases #31 (ianmilligan1)
Resolve #27 - Deploy javadocs to gh-pages #29 (ruebot)
Add Maven Central badge. #28 (ruebot)

aut-0.9.0 (2017-08-24)

Full Changelog

Closed issues:

More work needs to be done on the pom.xml to get us to a release. #25
Is src/main/java/io/archivesunleashed/demo required? #17
Visualization Repo (aut-viz) #16
Remove src/main/python #10
What do we do with all the documentation at docs.warcbase.org? #8
Setup to publish javadocs on ghpages #7
Get a project setup on sonatype #6
Setup license headers and mycila #4
Setup checkstyle #3
Setup codecov.io #1

Merged pull requests:

Resolve #25 update pom.xml to do a release #26 (ruebot)
Resolve #7 #24 (ruebot)
Add Slack integration for TravisCI #21 (ruebot)
Setup mycila plugin, and normalize all license headers; Resolves #4. #20 (ruebot)
Add checkstyle plugin, and remove demo; resolves #3 #17. #19 (ruebot)
Updating README #15 (ianmilligan1)
Remove dir; resolves #10 #11 (ruebot)
Setup codecov.io integration; resolves #1 #2 (ruebot)

* This Changelog was automatically generated by github_changelog_generator
