update for 'src' column #424

Merged: 3 commits, Feb 12, 2020

SinghGursimran
Collaborator

update for 'src' column

#418

For Testing:

// Run in spark-shell, which already provides `sc` and the $"..." column syntax.
import io.archivesunleashed._
import io.archivesunleashed.df._

// Keep only rows whose src matches one of the patterns.
RecordLoader.loadArchives("./src/test/resources/arc/example.arc.gz", sc)
  .webgraph()
  .select($"src")
  .keepUrlPatternsDF(Set(".*index.*".r))
  .show(10, false)

// Drop rows whose src matches one of the patterns.
RecordLoader.loadArchives("./src/test/resources/arc/example.arc.gz", sc)
  .webgraph()
  .select($"src")
  .discardUrlPatternsDF(Set(".*images.*".r))
  .show(10, false)

// Keep only rows whose src is one of the listed URLs.
RecordLoader.loadArchives("./src/test/resources/arc/example.arc.gz", sc)
  .webgraph()
  .select($"src")
  .keepUrlsDF(Set("http://www.archive.org/", "http://www.archive.org/index.php"))
  .show(10, false)

// Drop rows whose src is one of the listed URLs.
RecordLoader.loadArchives("./src/test/resources/arc/example.arc.gz", sc)
  .webgraph()
  .select($"src")
  .discardUrlsDF(Set("http://www.archive.org/", "http://www.archive.org/index.php"))
  .show(10, false)

// Keep only rows whose src belongs to one of the listed domains (imagegraph this time).
RecordLoader.loadArchives("./src/test/resources/arc/example.arc.gz", sc)
  .imagegraph()
  .select($"src")
  .keepDomainsDF(Set("www.archive.org"))
  .show(10, false)

// Drop rows whose src belongs to one of the listed domains.
RecordLoader.loadArchives("./src/test/resources/arc/example.arc.gz", sc)
  .webgraph()
  .select($"src")
  .discardDomainsDF(Set("www.archive.org"))
  .show(10, false)
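
For readers without a Spark shell at hand, the keep/discard pattern filters used above can be approximated on a plain Scala collection. This is only a sketch and not the aut implementation: the object and method names are hypothetical, and it assumes a pattern must match the whole URL, which the `.*index.*` form of the test patterns suggests but the thread does not state.

```scala
import scala.util.matching.Regex

// Hypothetical stand-in for the DataFrame filters; not part of aut.
object PatternFilterSketch {
  // Assumption: a pattern counts as a hit only if it matches the entire URL.
  def matchesAny(url: String, patterns: Set[Regex]): Boolean =
    patterns.exists(_.pattern.matcher(url).matches())

  // Analogue of keepUrlPatternsDF: retain URLs hit by at least one pattern.
  def keep(urls: Seq[String], patterns: Set[Regex]): Seq[String] =
    urls.filter(matchesAny(_, patterns))

  // Analogue of discardUrlPatternsDF: drop URLs hit by at least one pattern.
  def discard(urls: Seq[String], patterns: Set[Regex]): Seq[String] =
    urls.filterNot(matchesAny(_, patterns))
}
```

Calling `PatternFilterSketch.keep(urls, Set(".*index.*".r))` on a list of src values retains only those containing "index", and `discard` is its complement for the same patterns.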

codecov bot commented Feb 11, 2020

Codecov Report

Merging #424 into master will not change coverage.
The diff coverage is 100%.

@@           Coverage Diff           @@
##           master     #424   +/-   ##
=======================================
  Coverage   78.15%   78.15%           
=======================================
  Files          41       41           
  Lines        1584     1584           
  Branches      299      299           
=======================================
  Hits         1238     1238           
  Misses        218      218           
  Partials      128      128

Member

ruebot commented Feb 11, 2020

@SinghGursimran nice! I did a hasColumn function for a similar solution in twut. Can we get a test update too?

@lintool @ianmilligan1 do either of you see a use case for filtering on dest or image_url, or are src and url good enough here? If we add dest or image_url, we'd probably need to change the implementation to take the column name as well.
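
A minimal sketch of the column-selection idea discussed in this comment, written against plain column names so it runs without Spark. It is not the code from aut or twut: `ColumnChoice`, `filterColumn`, and the fallback order (url first, then src) are assumptions made for illustration. With a real DataFrame the check would be `df.columns.contains(name)`.

```scala
// Hypothetical helper; names and fallback order are illustrative assumptions.
object ColumnChoice {
  // True when `name` is one of the available column names.
  def hasColumn(columns: Seq[String], name: String): Boolean =
    columns.contains(name)

  // Choose the column a URL filter should run on.
  // An explicitly requested column wins if it exists; otherwise fall back
  // to "url", then "src". Returns None when no candidate is present.
  def filterColumn(columns: Seq[String],
                   requested: Option[String] = None): Option[String] =
    requested
      .filter(hasColumn(columns, _))
      .orElse(Seq("url", "src").find(hasColumn(columns, _)))
}
```

Passing the column name explicitly, for example `ColumnChoice.filterColumn(cols, Some("dest"))`, is what would let the same filter cover dest or image_url without further special cases.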

Member

ruebot commented Feb 11, 2020

@SinghGursimran based on the new related issue (#425), let's just focus on getting the tests updated here and not worry about dest and image_url for now, since implementing #425 would resolve that.

I'm running this right now on the entire GeoCities dataset for another project, and everything appears to be running smoothly 🙌

ruebot merged commit ebb5298 into archivesunleashed:master on Feb 12, 2020