Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Extract popular images - Data Frame implementation #382

Merged
merged 13 commits into from
Nov 21, 2019

Conversation

SinghGursimran
Copy link
Collaborator

Extract popular images - Data Frame implementation

#380

For Testing:

import io.archivesunleashed._
import io.archivesunleashed.app._

val df = RecordLoader.loadArchives("./ARCHIVEIT-227-QUARTERLY-XUGECV-20091218231727-00039-crawling06.us.archive.org-8091.warc.gz",sc)
					 .images()	  

ExtractPopularImages(df,10,30,30).show()

@SinghGursimran
Copy link
Collaborator Author

Scala doesn't support function overloading with default arguments. For the RDD implementation, minWidth and minHeight arguments were optional. For the current data frame implementation, they are necessary. If it is required to be kept as optional, I can

  1. Shift df implementation to a new object (ExtractPopularImagesDF)
    OR
  2. give a new name to the method. (That would require calling method name along with the object at the time of implementation)

@codecov
Copy link

codecov bot commented Nov 20, 2019

Codecov Report

Merging #382 into master will increase coverage by 0.22%.
The diff coverage is 100%.

@@            Coverage Diff            @@
##           master    #382      +/-   ##
=========================================
+ Coverage   76.47%   76.7%   +0.22%     
=========================================
  Files          40      41       +1     
  Lines        1437    1451      +14     
  Branches      268     268              
=========================================
+ Hits         1099    1113      +14     
  Misses        221     221              
  Partials      117     117

@ruebot
Copy link
Member

ruebot commented Nov 20, 2019

A new method, ExtractPopularImagesDF makes sense to me. @ianmilligan1 @lintool work for you, or y'all prefer another path?

@SinghGursimran tests? 😃

@ianmilligan1
Copy link
Member

A new method, ExtractPopularImagesDF makes sense to me. @ianmilligan1 @lintool work for you, or y'all prefer another path?

That makes sense to me too!

@lintool
Copy link
Member

lintool commented Nov 20, 2019

Let's use a different convention for "end-to-end" functionalities. One option would be to have all UDFs be verb phrases, e.g., ExtractX, and "end-to-end" functionalities be noun phrases. So this would be PopularImagesExtractor. That way it'll be clear on what to use in what context.

@ruebot
Copy link
Member

ruebot commented Nov 20, 2019

@lintool so, should we have @SinghGursimran change the existing RDD method to PopularImagesExtractorRDD, and this new one should be PopularImagesExtractorDF?

@lintool
Copy link
Member

lintool commented Nov 20, 2019

Yes, if you like my suggestion of nouns vs. verbs.

I.e., UDFs are verbs, "do this".
"Assemblies" are thing-doers.

@ruebot
Copy link
Member

ruebot commented Nov 20, 2019

Cool.

That make sense @SinghGursimran?

@SinghGursimran
Copy link
Collaborator Author

@ruebot
Ya! I will change accordingly.
Regarding tests, there's an orderBy() function in implementation. If the value of key is same it orders randomly. If the count of two image URLs is same, the order of Data Frame row might change on running the code again. For the archive in resources, all images appear only once that is they all have count one. Adding a static test won't be feasible in such a case.

@ruebot
Copy link
Member

ruebot commented Nov 21, 2019

@SinghGursimran so, for the test. Can we assert other items in the DataFrame that is returned, that is not dependent on the order it returns in? count maybe? Or, combine distinct and count to hit the same thing each time?

@SinghGursimran
Copy link
Collaborator Author

SinghGursimran commented Nov 21, 2019

@SinghGursimran so, for the test. Can we assert other items in the DataFrame that is returned, that is not dependent on the order it returns in? count maybe? Or, combine distinct and count to hit the same thing each time?

Actually, for the archive available in the resources, the count is 1 for each data entry in the row.
The sequence of the whole row changes for the same count on subsequent running of the code.
But, yes, I can keep count as the factor because it is always 1. Shall I do that?

@ruebot
Copy link
Member

ruebot commented Nov 21, 2019

Yes, let's do that to get something in there, and we can loop back around to it later and see if we can in improve it.

@ruebot
Copy link
Member

ruebot commented Nov 21, 2019

@ruebot
Copy link
Member

ruebot commented Nov 21, 2019

Tested on 10 local GeoCities WARCs:

Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.4.4
      /_/
         
Using Scala version 2.11.12 (OpenJDK 64-Bit Server VM, Java 1.8.0_222)
Type in expressions to have them evaluated.
Type :help for more information.

scala> :paste
// Entering paste mode (ctrl-D to finish)

import io.archivesunleashed._
import io.archivesunleashed.app._

val df = RecordLoader.loadArchives("/home/nruest/Projects/au/sample-data/geocites/1",sc)
                                  .images()        

ExtractPopularImages(df,10,30,30).show()

// Exiting paste mode, now interpreting.

19/11/21 12:09:57 WARN SparkSession$Builder: Using an existing SparkSession; some configuration may not take effect.
[Stage 0:>                (0 + 10) / 10][Stage 1:>                 (0 + 0) / 10]19/11/21 12:10:00 WARN PDFParser: J2KImageReader not loaded. JPEG2000 files will not be processed.
See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io
for optional dependencies.

19/11/21 12:10:00 WARN SQLite3Parser: org.xerial's sqlite-jdbc is not loaded.
Please provide the jar on your classpath to parse sqlite files.
See tika-parsers/pom.xml for the correct version.
+--------------------+-----+                                                    
|                 url|count|
+--------------------+-----+
|http://geocities....|   51|
|http://geocities....|   31|
|http://www.geocit...|   29|
|http://i24.photob...|   28|
|http://geocities....|   26|
|http://geocities....|   24|
|http://geocities....|   22|
|http://www.geocit...|   22|
|http://www.geocit...|   22|
|http://geocities....|   21|
+--------------------+-----+

import io.archivesunleashed._
import io.archivesunleashed.app._
df: org.apache.spark.sql.DataFrame = [url: string, filename: string ... 8 more fields]

I'll squash and merge once we get the test.

@ruebot ruebot merged commit 4042180 into archivesunleashed:master Nov 21, 2019
ianmilligan1 pushed a commit to archivesunleashed/aut-docs that referenced this pull request Nov 22, 2019
#28)

* Add example for Scala DF version of "Extract Most Frequent Images MD5 Hash".

- See archivesunleashed/aut#382

* rename
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants