Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Changing keepDate to allow multiple dates, would close #108 #161

Merged
merged 2 commits into from
Jan 8, 2018

Conversation

ianmilligan1
Copy link
Member

GitHub issue(s):

If you are responding to an issue, please mention their numbers below.

What does this Pull Request do?

This Pull Request changes the use of keepDate. Right now, it is used like:

  .keepDate("2008", YYYY)

If a user wanted data from 2008 and 2010, they would have to run two separate scripts - one for 2008 and one for 2010.

This new format allows the use of a list in keepDate, so a command like:

  .keepDate(List("2007","2010"), YYYY)

How should this be tested?

I have tested this on a collection of WARCs with files from 2007 and from 2010. Running:

import io.archivesunleashed.spark.matchbox.RecordLoader
import io.archivesunleashed.spark.rdd.RecordRDD._
import io.archivesunleashed.spark.matchbox.{RemoveHTML, RecordLoader}
import io.archivesunleashed.spark.matchbox.ExtractDate.DateComponent._

RecordLoader.loadArchives("/Users/ianmilligan1/dropbox/testdata/*.gz", sc)
  .keepValidPages()
  .keepDate(List("2007","2010"), YYYY)
  .map(r => (r.getCrawlDate, r.getDomain, r.getUrl, RemoveHTML(r.getContentString)))
  .saveAsTextFile("/users/ianmilligan1/desktop/output/2007and2010")

Resulted in all the records from 2007 and 2010. You can also still test only one year, but it still needs to be provided as a list.

Additional Notes:

If approved, this will require revision to all keepDate documentation.

Interested parties

@ruebot @greebie @lintool

@ianmilligan1 ianmilligan1 changed the title Changing keepDate to allow multiple dates, resolved #108 Changing keepDate to allow multiple dates, would close #108 Jan 6, 2018
@ianmilligan1
Copy link
Member Author

Hmm, the code coverage is slightly lower, but so far as I can tell, that line was never covered before. Maybe just because it is longer now?

@codecov
Copy link

codecov bot commented Jan 6, 2018

Codecov Report

Merging #161 into master will not change coverage.
The diff coverage is 50%.

Impacted file tree graph

@@           Coverage Diff           @@
##           master     #161   +/-   ##
=======================================
  Coverage   67.95%   67.95%           
=======================================
  Files          38       38           
  Lines         774      774           
  Branches      143      142    -1     
=======================================
  Hits          526      526           
  Misses        199      199           
  Partials       49       49
Impacted Files Coverage Δ
...ala/io/archivesunleashed/spark/rdd/RecordRDD.scala 70.37% <50%> (ø) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 128b86d...c81fde1. Read the comment docs.

@greebie
Copy link
Contributor

greebie commented Jan 7, 2018

Yes. If you add content to an uncovered line, it will decrease overall coverage.

@@ -45,12 +45,12 @@ class ArcTest extends FunSuite with BeforeAndAfter {

test("filter date") {
val four = RecordLoader.loadArchives(arcPath, sc, keepValidPages = false)
.keepDate("200804", DateComponent.YYYYMM)
.keepDate(List("200804"), DateComponent.YYYYMM)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In your example you use two dates, you should use two dates here as well.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added in latest commit, thanks for the review!

@ruebot ruebot merged commit b36d82b into master Jan 8, 2018
@ruebot ruebot deleted the Issue-108 branch January 8, 2018 14:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants