Extract popular images - Data Frame implementation #382

SinghGursimran · 2019-11-20T16:34:13Z

Extract popular images - Data Frame implementation

For Testing:

import io.archivesunleashed._
import io.archivesunleashed.app._

val df = RecordLoader.loadArchives("./ARCHIVEIT-227-QUARTERLY-XUGECV-20091218231727-00039-crawling06.us.archive.org-8091.warc.gz", sc)
  .images()

ExtractPopularImages(df,10,30,30).show()
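
For readers unfamiliar with the call above, here is a minimal sketch of the kind of logic it presumably performs, written with plain Scala collections instead of Spark so it runs anywhere. The `Image` record, the `popularImages` helper, and the `>=` size comparison are assumptions made for illustration; they are not the code in this PR.

```scala
// Hypothetical record type; the real DataFrame carries url, filename and more fields.
final case class Image(url: String, width: Int, height: Int)

// Keep images at least minWidth x minHeight, count occurrences per URL,
// sort by count (descending) and keep the top `limit` rows.
// The relative order of URLs with equal counts is left unspecified here.
def popularImages(
    images: Seq[Image],
    limit: Int,
    minWidth: Int,
    minHeight: Int): Seq[(String, Int)] =
  images
    .filter(i => i.width >= minWidth && i.height >= minHeight)
    .groupBy(_.url)
    .map { case (url, group) => (url, group.size) }
    .toSeq
    .sortBy { case (_, n) => -n }
    .take(limit)
```

The argument order mirrors the call above: input, limit, minimum width, minimum height.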

SinghGursimran · 2019-11-20T16:42:25Z

Scala doesn't allow more than one overloaded alternative of a method to define default arguments. In the RDD implementation, the minWidth and minHeight arguments were optional; in the current DataFrame implementation, they are required. If they need to stay optional, I can either:

Move the DataFrame implementation to a new object (ExtractPopularImagesDF)
OR
Give the method a new name. (Callers would then have to write the method name along with the object when calling it.)
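
A minimal sketch of the first option, using stand-in types so it compiles without Spark. `FakeRdd`, `FakeDf`, the object names, and the default value of 30 are all hypothetical and chosen only for illustration.

```scala
// Hypothetical stand-ins for the RDD and DataFrame inputs; not the aut types.
final case class FakeRdd(urls: Seq[String])
final case class FakeDf(urls: Seq[String])

// One object per input type, so each apply can carry its own default arguments.
object PopularImagesRdd {
  def apply(in: FakeRdd, limit: Int, minWidth: Int = 30, minHeight: Int = 30): String =
    s"rdd limit=$limit min=${minWidth}x$minHeight"
}

object PopularImagesDf {
  def apply(in: FakeDf, limit: Int, minWidth: Int = 30, minHeight: Int = 30): String =
    s"df limit=$limit min=${minWidth}x$minHeight"
}

// Placing both apply methods in ONE object would not compile in Scala 2:
// the compiler rejects multiple overloaded alternatives that define default arguments.
```

With separate objects, `PopularImagesDf(FakeDf(Seq("a")), 10)` picks up both defaults, while all four arguments can still be passed explicitly.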

codecov · 2019-11-20T16:52:18Z

Codecov Report

Merging #382 into master will increase coverage by 0.22%.
The diff coverage is 100%.

@@            Coverage Diff            @@
##           master    #382      +/-   ##
=========================================
+ Coverage   76.47%   76.7%   +0.22%     
=========================================
  Files          40      41       +1     
  Lines        1437    1451      +14     
  Branches      268     268              
=========================================
+ Hits         1099    1113      +14     
  Misses        221     221              
  Partials      117     117

ruebot · 2019-11-20T20:54:19Z

A new method, ExtractPopularImagesDF, makes sense to me. @ianmilligan1 @lintool work for you, or y'all prefer another path?

@SinghGursimran tests? 😃

ianmilligan1 · 2019-11-20T20:55:10Z

> A new method, ExtractPopularImagesDF, makes sense to me. @ianmilligan1 @lintool work for you, or y'all prefer another path?

That makes sense to me too!

lintool · 2019-11-20T21:15:42Z

Let's use a different convention for "end-to-end" functionalities. One option would be to have all UDFs be verb phrases, e.g., ExtractX, and "end-to-end" functionalities be noun phrases. So this would be PopularImagesExtractor. That way it'll be clear what to use in which context.

ruebot · 2019-11-20T21:23:36Z

@lintool so, should we have @SinghGursimran change the existing RDD method to PopularImagesExtractorRDD, and this new one should be PopularImagesExtractorDF?

lintool · 2019-11-20T22:26:07Z

Yes, if you like my suggestion of nouns vs. verbs.

I.e., UDFs are verbs, "do this".
"Assemblies" are thing-doers.
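
To make the verb/noun split concrete, here is a small sketch in plain Scala. `ExtractHost` and `DistinctHostsExtractor` are hypothetical names invented for this illustration; neither is claimed to exist in aut.

```scala
import java.net.URI
import scala.util.Try

// Verb phrase: a small per-value helper ("do this").
object ExtractHost {
  // Returns the host of a URL, or "" when it cannot be parsed or has no host.
  def apply(url: String): String =
    Try(new URI(url).getHost).toOption.flatMap(Option(_)).getOrElse("")
}

// Noun phrase: an end-to-end "thing-doer" that composes helpers over a whole input.
object DistinctHostsExtractor {
  def apply(urls: Seq[String]): Seq[String] =
    urls.map(ExtractHost(_)).filter(_.nonEmpty).distinct.sorted
}
```

At a call site, the name alone then signals whether something is a building block or a complete job.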

ruebot · 2019-11-20T22:36:36Z

Cool.

That make sense @SinghGursimran?

SinghGursimran · 2019-11-20T23:39:09Z

@ruebot
Ya! I will change accordingly.
Regarding tests, the implementation uses orderBy(). When two rows have the same sort key, their relative order is not deterministic, so if two image URLs have the same count, the order of the DataFrame rows may change when the code is run again. In the archive in resources, every image appears only once, i.e. every count is 1. A static test on the row order isn't feasible in that case.
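
One way to remove this kind of nondeterminism is to add a secondary sort key. The sketch below shows the idea with plain Scala collections; the `(url, count)` pairs are made-up sample data. In Spark, the analogous change would presumably be a second sort column, e.g. `orderBy(desc("count"), asc("url"))`, but that is a suggestion, not what this PR does.

```scala
// Hypothetical (url, count) pairs standing in for the grouped rows.
val counts = Seq(
  ("http://b.example/2.png", 1),
  ("http://a.example/1.png", 1),
  ("http://c.example/3.png", 2)
)

// Sort by count descending, then by URL ascending to break ties.
// Any permutation of the input now yields the same ranking.
def rank(rows: Seq[(String, Int)]): Seq[(String, Int)] =
  rows.sortBy { case (url, n) => (-n, url) }

val ranked = rank(counts)
```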

ruebot · 2019-11-21T16:49:58Z

@SinghGursimran so, for the test: can we assert on other properties of the returned DataFrame that don't depend on the order the rows come back in? count maybe? Or combine distinct and count to hit the same thing each time?

SinghGursimran · 2019-11-21T16:53:27Z

> @SinghGursimran so, for the test: can we assert on other properties of the returned DataFrame that don't depend on the order the rows come back in? count maybe? Or combine distinct and count to hit the same thing each time?

Actually, for the archive available in the resources, the count is 1 for every row.
Because all counts are equal, the order of the rows changes between runs of the code.
But, yes, I can assert on the count, because it is always 1. Shall I do that?
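
A sketch of what such order-independent assertions could look like, using plain Scala values in place of collected DataFrame rows. The sample rows and the `summary` helper are hypothetical.

```scala
// Hypothetical collected rows (url, count); row order may differ between runs.
val runA = Seq(("http://a.example/1.png", 1L), ("http://b.example/2.png", 1L))
val runB = runA.reverse

// Properties that do not depend on row order:
// number of rows, sorted list of counts, and the set of URLs.
def summary(rows: Seq[(String, Long)]): (Int, Seq[Long], Set[String]) =
  (rows.size, rows.map(_._2).sorted, rows.map(_._1).toSet)

// Both runs agree on every order-independent property,
// and every count is 1, as in the resources archive.
assert(summary(runA) == summary(runB))
assert(runA.forall { case (_, n) => n == 1L })
```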

ruebot · 2019-11-21T16:54:52Z

Yes, let's do that to get something in there, and we can loop back around to it later and see if we can improve it.


ruebot · 2019-11-21T17:06:32Z

archivesunleashed/aut-docs#28

ruebot · 2019-11-21T17:19:05Z

Tested on 10 local GeoCities WARCs:

Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.4.4
      /_/
         
Using Scala version 2.11.12 (OpenJDK 64-Bit Server VM, Java 1.8.0_222)
Type in expressions to have them evaluated.
Type :help for more information.

scala> :paste
// Entering paste mode (ctrl-D to finish)

import io.archivesunleashed._
import io.archivesunleashed.app._

val df = RecordLoader.loadArchives("/home/nruest/Projects/au/sample-data/geocites/1",sc)
                                  .images()        

ExtractPopularImages(df,10,30,30).show()

// Exiting paste mode, now interpreting.

19/11/21 12:09:57 WARN SparkSession$Builder: Using an existing SparkSession; some configuration may not take effect.
[Stage 0:>                (0 + 10) / 10][Stage 1:>                 (0 + 0) / 10]19/11/21 12:10:00 WARN PDFParser: J2KImageReader not loaded. JPEG2000 files will not be processed.
See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io
for optional dependencies.

19/11/21 12:10:00 WARN SQLite3Parser: org.xerial's sqlite-jdbc is not loaded.
Please provide the jar on your classpath to parse sqlite files.
See tika-parsers/pom.xml for the correct version.
+--------------------+-----+                                                    
|                 url|count|
+--------------------+-----+
|http://geocities....|   51|
|http://geocities....|   31|
|http://www.geocit...|   29|
|http://i24.photob...|   28|
|http://geocities....|   26|
|http://geocities....|   24|
|http://geocities....|   22|
|http://www.geocit...|   22|
|http://www.geocit...|   22|
|http://geocities....|   21|
+--------------------+-----+

import io.archivesunleashed._
import io.archivesunleashed.app._
df: org.apache.spark.sql.DataFrame = [url: string, filename: string ... 8 more fields]

I'll squash and merge once we get the test.


g285sing added 9 commits November 6, 2019 19:15

Issue-368

68922a2

Issue238

0db0093

Issue238

5817bf5

Merge branch 'master' of https://github.com/SinghGursimran/aut

d65cb41

test

4e5a066

Merge branch 'master' of https://github.com/archivesunleashed/aut

c60104f

Merge branch 'master' of https://github.com/archivesunleashed/aut

f18006e

Merge branch 'master' of https://github.com/archivesunleashed/aut

9141e2f

PopularImagesdf

23d7e7a

comment_updates

7ff8f91

Merge branch 'master' into df-impl

771bb80

ruebot added a commit to archivesunleashed/aut-docs that referenced this pull request Nov 21, 2019

Add example for Scala DF version of "Extract Most Frequent Images MD5…

2361bf5

… Hash". - See archivesunleashed/aut#382

ruebot mentioned this pull request Nov 21, 2019

Add example for Scala DF version of "Extract Most Frequent Images MD5… archivesunleashed/aut-docs#28

Merged

g285sing added 2 commits November 21, 2019 15:37

tests

ce79a09

Merge branch 'df-impl' of https://github.com/SinghGursimran/aut into …

aa451a4

…df-impl

ruebot approved these changes Nov 21, 2019

View reviewed changes

ruebot merged commit 4042180 into archivesunleashed:master Nov 21, 2019
