Add ExtractGraphX including algorithms for PageRank and Components. Issue 203 #245

greebie · 2018-07-26T17:37:59Z

GitHub issue(s): #203

What does this Pull Request do?

Adds ExtractGraphX, together with GraphML output (WriteGraphXML) for the GraphX graph.
Adds PageRank and weakly and strongly connected component calculations.
Provides some lint fixes for other files (mostly removing magic numbers).
Deprecates ExtractGraph as outdated (although that deserves some discussion).

How should this be tested?

import io.archivesunleashed._
import io.archivesunleashed.matchbox._
import io.archivesunleashed.app._
import io.archivesunleashed.util._
import org.apache.spark.graphx._

// Build a domain-to-domain link graph, keeping only edges seen more than 5 times.
val graph = ExtractGraphX.extractGraphX(
  RecordLoader.loadArchives("/Users/USERNAME/WARCFOLDER/", sc)
    .keepValidPages()
    .flatMap(r => ExtractLinks(r.getUrl, r.getContentString))
    .map(r => (ExtractDomain(r._1).removePrefixWWW(), ExtractDomain(r._2).removePrefixWWW()))
    .filter(r => r._1 != "" && r._2 != "")
).subgraph(epred = eTriplet => eTriplet.attr.edgeCount > 5)

// Run PageRank on the graph.
val pRank = ExtractGraphX.runPageRankAlgorithm(graph)

// Write the result as GraphML.
WriteGraphXML(pRank, "graphML-path.graphml/")

Additional Notes:

The main feature is strongly and weakly connected components, which GraphPass can use to reduce the size of a network graph that has more than 50k nodes.
Strongly connected components are also interesting as a potential measure of who is driving the overall conversation, or of whether factions exist in a community.
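
To make the "weak component" idea concrete, here is a minimal plain-Scala sketch that groups the nodes of a directed edge list into weakly connected components (edge direction is ignored) and picks the largest one. It is an illustration only and not the GraphX code in this PR; the names `WeakComponents`, `components` and `largest` are hypothetical, and the union-find approach is an assumption chosen for brevity.

```scala
object WeakComponents {
  // Group nodes into weakly connected components: edge direction is ignored.
  def components(edges: Seq[(String, String)]): Seq[Set[String]] = {
    val parent = scala.collection.mutable.Map.empty[String, String]

    // Find the representative of a node's set, compressing the path as we go.
    def find(x: String): String = {
      val p = parent.getOrElseUpdate(x, x)
      if (p == x) x
      else {
        val root = find(p)
        parent(x) = root
        root
      }
    }

    // Merge the sets that contain a and b.
    def union(a: String, b: String): Unit = {
      val (ra, rb) = (find(a), find(b))
      if (ra != rb) parent(ra) = rb
    }

    edges.foreach { case (src, dst) => union(src, dst) }
    parent.keys.toList.groupBy(find).values.map(_.toSet).toSeq
  }

  // The largest component, or an empty set for an empty graph.
  def largest(edges: Seq[(String, String)]): Set[String] =
    components(edges).sortBy(-_.size).headOption.getOrElse(Set.empty[String])
}
```

For example, with the edges `a -> b`, `c -> b` and `d -> e`, the nodes `a`, `b` and `c` end up in one weak component even though no directed path links `a` and `c`.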

Interested parties

Tag (@ mention) interested parties.

Thanks in advance for your help with the Archives Unleashed Toolkit!

greebie · 2018-07-26T17:40:30Z

ExtractGraph does provide a rudimentary JSON output that is not included in ExtractGraphX. I think it makes sense to create a more generic CreateJSONGraph, or even a more detailed output class that can write whatever format we want. Either way, I'm not sure the JSON creator is used very much.

codecov · 2018-07-26T17:56:12Z

Codecov Report

Merging #245 into master will increase coverage by 1.85%.
The diff coverage is 91.25%.

@@            Coverage Diff             @@
##           master     #245      +/-   ##
==========================================
+ Coverage   68.71%   70.57%   +1.85%     
==========================================
  Files          39       41       +2     
  Lines         911      982      +71     
  Branches      168      179      +11     
==========================================
+ Hits          626      693      +67     
- Misses        231      232       +1     
- Partials       54       57       +3

Impacted Files	Coverage Δ
...chivesunleashed/app/DomainFrequencyExtractor.scala	`100% <ø> (ø)`	⬆️
...c/main/scala/io/archivesunleashed/df/package.scala	`90.47% <ø> (+3.51%)`	⬆️
...o/archivesunleashed/app/DomainGraphExtractor.scala	`100% <ø> (ø)`	⬆️
.../scala/io/archivesunleashed/app/ExtractGraph.scala	`0% <0%> (ø)`	⬆️
src/main/scala/io/archivesunleashed/package.scala	`84.11% <0%> (ø)`	⬆️
...scala/io/archivesunleashed/app/WriteGraphXML.scala	`100% <100%> (ø)`
...ain/scala/io/archivesunleashed/ArchiveRecord.scala	`83.33% <100%> (ø)`	⬆️
.../scala/io/archivesunleashed/app/WriteGraphML.scala	`100% <100%> (ø)`	⬆️
...o/archivesunleashed/app/ExtractPopularImages.scala	`100% <100%> (ø)`	⬆️
...scala/io/archivesunleashed/app/ExtractGraphX.scala	`92.1% <92.1%> (ø)`
... and 2 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 290b6aa...d1d1603.

ianmilligan1 · 2018-07-26T19:19:26Z

Thanks for this, Ryan!

For the test script, what should we expect to see as a result?

greebie · 2018-07-26T19:20:53Z

It should produce a network graph at "graphML-path.graphml/" in GraphML format, with PageRank and other metadata.

ianmilligan1 · 2018-07-26T20:21:29Z

OK great, thanks @greebie - I will test it, probably tomorrow morning!

ruebot · 2018-07-27T12:48:19Z

What is the roadmap for this functionality in AUK after the next release of aut? Is it a drop-in replacement, or will the Spark background job need significant updates? How does this chain with GraphPass?

ianmilligan1 · 2018-07-27T13:38:55Z

Successfully generated the file with:

import io.archivesunleashed._
import io.archivesunleashed.matchbox._
import io.archivesunleashed.app._
import io.archivesunleashed.util._
import org.apache.spark.graphx._

val graph = ExtractGraphX.extractGraphX(RecordLoader.loadArchives("/mnt/vol1/data_sets/cpp/cpp_warcs_accession_01/partner.archive-it.org/cgi-bin/getarcs.pl/*200912*", sc)
.keepValidPages()
.flatMap(r => ExtractLinks(r.getUrl, r.getContentString))
.map(r => (ExtractDomain(r._1).removePrefixWWW(), ExtractDomain(r._2).removePrefixWWW()))
.filter(r => r._1 != "" && r._2 != "")).subgraph(epred = eTriplet => eTriplet.attr.edgeCount>5)

val pRank = ExtractGraphX.runPageRankAlgorithm(graph)

WriteGraphXML(pRank, "/mnt/vol1/derivative_data/test/graphML-test.graphml")

However, when I attempted to open the resulting graphML-test.graphml file in Gephi, I received the following error message:

javax.xml.stream.XMLStreamException: ParseError at [row,col]:[10,51]
Message: Element type "key" must be followed by either attribute specifications, ">" or "/>".
	at com.sun.org.apache.xerces.internal.impl.XMLStreamReaderImpl.next(XMLStreamReaderImpl.java:604)
	at org.gephi.io.importer.plugin.file.ImporterGraphML.execute(ImporterGraphML.java:158)
Caused: java.lang.RuntimeException
	at org.gephi.io.importer.plugin.file.ImporterGraphML.execute(ImporterGraphML.java:181)
	at org.gephi.io.importer.impl.ImportControllerImpl.importFile(ImportControllerImpl.java:199)
	at org.gephi.io.importer.impl.ImportControllerImpl.importFile(ImportControllerImpl.java:169)
	at org.gephi.desktop.importer.DesktopImportControllerUI$4.run(DesktopImportControllerUI.java:341)
Caused: java.lang.RuntimeException
	at org.gephi.desktop.importer.DesktopImportControllerUI$4.run(DesktopImportControllerUI.java:349)
[catch] at org.gephi.utils.longtask.api.LongTaskExecutor$RunningLongTask.run(LongTaskExecutor.java:274)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

ianmilligan1

Probably something weird happening in WriteGraphXML?

greebie · 2018-07-27T14:45:54Z

Re: AUK.

The first thing is that this update offers some good features for network analysis, so it is worthwhile adding, even if it does not work for AUK.

The AUK Spark script would need revision, mostly wrapping the current flatMap script in the GraphX graph object and calling .runPageRankAlgorithm(). An additional advantage is that the raw Gephi graphs could be slightly more attractive, because they would use PageRank instead of degree as the default sizing.
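
To illustrate why PageRank and degree can size nodes differently, here is a small plain-Scala sketch of textbook power-iteration PageRank next to a simple in-degree count. It is not the GraphX implementation (GraphX scales its scores differently), and the names `PageRankSketch`, `pageRank` and `inDegree`, the damping factor of 0.85 and the iteration count are all assumptions made for the example.

```scala
object PageRankSketch {
  // Textbook power-iteration PageRank over a directed edge list.
  // Scores sum to 1; rank held by nodes without out-links is spread evenly.
  def pageRank(
      edges: Seq[(String, String)],
      damping: Double = 0.85,
      iterations: Int = 50): Map[String, Double] = {
    val nodes = edges.flatMap { case (s, d) => Seq(s, d) }.distinct
    val n = nodes.size
    val outLinks = edges.groupBy(_._1).map { case (s, es) => s -> es.map(_._2) }
    var ranks = nodes.map(_ -> 1.0 / n).toMap

    for (_ <- 1 to iterations) {
      // Total rank currently sitting on nodes with no out-links.
      val dangling = nodes.filterNot(outLinks.contains).map(ranks).sum
      // Rank each node receives from the nodes that link to it.
      val received = outLinks.toSeq
        .flatMap { case (s, ds) => ds.map(d => d -> ranks(s) / ds.size) }
        .groupBy(_._1)
        .map { case (d, xs) => d -> xs.map(_._2).sum }
      ranks = nodes.map { node =>
        val incoming = received.getOrElse(node, 0.0) + dangling / n
        node -> ((1 - damping) / n + damping * incoming)
      }.toMap
    }
    ranks
  }

  // Number of incoming edges per node (nodes with none are absent).
  def inDegree(edges: Seq[(String, String)]): Map[String, Int] =
    edges.groupBy(_._2).map { case (d, es) => d -> es.size }
}
```

With the edges `a -> c`, `b -> c` and `c -> d`, node `c` has the higher in-degree, but `d` gets the higher PageRank because it receives all of `c`'s rank. A degree-sized and a PageRank-sized drawing of the same graph would therefore emphasize different nodes.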

I'll definitely do some testing before we move forward on changing the algorithm. The new approach will generally take longer for all graphs, but much less time than GraphPass does for huge ones.

Also, Hardik and I were looking at a paper that examined the differences in PageRank calculations between Spark and igraph.

The "quickrun" feature in GraphPass would include something like this pseudocode:

if NODESIZE(graph) > 50000:
 - check for "Weak" and/or "Strong" attribute
 - create the subgraph from the largest weak component if the attr exists, and check if the new graph < 50000 nodes
 - otherwise create the strong subgraph if the attr exists, and check if the new graph < 50000 nodes
 - otherwise fail graphpass.

- if we have a good new graph, then run the usual quickpass stuff.
- maybe send a _WEAKPASS or _STRONGPASS file to tell AUK we didn't use the whole graph.
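
The fallback order in the pseudocode above could be sketched as a small decision function. This is only one reading of the steps, written in Scala for consistency with the rest of this PR rather than in GraphPass's own language; `QuickRun`, `Decision` and `decide` are hypothetical names, and the component sizes are assumed to be computed elsewhere.

```scala
object QuickRun {
  // Node limit taken from the discussion (50,000 nodes).
  val NodeLimit = 50000

  // Possible outcomes of the size check; all names are hypothetical.
  sealed trait Decision
  case object UseFullGraph extends Decision // graph is already small enough
  case object UseWeak extends Decision      // would be flagged with _WEAKPASS
  case object UseStrong extends Decision    // would be flagged with _STRONGPASS
  case object Fail extends Decision         // no usable subgraph

  // weakSize / strongSize: node count of the largest weak / strong component
  // subgraph, or None when the graph carries no such attribute.
  def decide(nodeCount: Int, weakSize: Option[Int], strongSize: Option[Int]): Decision =
    if (nodeCount <= NodeLimit) UseFullGraph
    else if (weakSize.exists(_ < NodeLimit)) UseWeak
    else if (strongSize.exists(_ < NodeLimit)) UseStrong
    else Fail
}
```

Returning a value instead of acting directly keeps the size check separate from the work of building the subgraph and writing the marker file.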

The downside is longer Spark calls, possibly for all graphs if we do not add some conditional logic for choosing when to run the PageRank algorithm.

The upside is that this would be able to provide visualizations for very large graphs in a way that makes theoretical sense. Since the web archives are not going to get smaller in the long term, it is important we have a solution, even if it's not optimal.

Further, we have a way forward if SNAP is simply beyond my capabilities or cannot do what we need it to.

GraphPass would not need to eat up resources on huge graphs.

greebie · 2018-07-27T14:50:05Z

@ianmilligan1 Yup. That looks like I accidentally added or removed a "<" somewhere. Will revise.
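
One way to catch this kind of markup slip before opening the file in Gephi is to parse the generated GraphML with the JDK's XML parser. The sketch below is an assumption about how such a check might look, not code from WriteGraphXML; `GraphMLCheck`, `isWellFormed`, `nodeKey` and `document` are hypothetical names, and the check covers well-formedness only, not validity against the GraphML schema.

```scala
import java.io.ByteArrayInputStream
import java.nio.charset.StandardCharsets
import javax.xml.parsers.DocumentBuilderFactory
import scala.util.Try

object GraphMLCheck {
  // True if the string parses as well-formed XML.
  def isWellFormed(xml: String): Boolean =
    Try {
      val builder = DocumentBuilderFactory.newInstance().newDocumentBuilder()
      builder.parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)))
    }.isSuccess

  // A GraphML <key> declaration for a node attribute.
  def nodeKey(id: String, name: String, attrType: String): String =
    s"""<key id="$id" for="node" attr.name="$name" attr.type="$attrType"/>"""

  // A minimal GraphML document that holds only key declarations and an empty graph.
  def document(keys: Seq[String]): String =
    s"""<?xml version="1.0" encoding="UTF-8"?>
       |<graphml xmlns="http://graphml.graphdrawing.org/xmlns">
       |${keys.mkString("\n")}
       |<graph id="G" edgedefault="directed"></graph>
       |</graphml>""".stripMargin
}
```

A key built with `nodeKey` passes the check, while a hand-written key with a missing space between attributes, such as `<key id="weak"for="node"/>`, is rejected by the parser. A check like this could run in a unit test on the writer's output.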

greebie · 2018-07-27T16:00:53Z

@ianmilligan1 The latest update should work in Gephi this time. I tested it with a pretty good set of WARCs. The weak component works nicely!

ianmilligan1 · 2018-07-27T17:22:40Z

Works now, thanks @greebie!

greebie · 2018-07-27T23:05:44Z

Hey 70% Codecov! That's a milestone! :)

ianmilligan1 · 2018-07-28T11:38:30Z

src/main/scala/io/archivesunleashed/ArchiveRecord.scala

@@ -98,14 +98,14 @@ class ArchiveRecordImpl(r: SerializableWritable[ArchiveRecordWritable]) extends
    new String(getContentBytes)
  }

-  val getMimeType = {
+  val getMimeType: String = {


Just a quick question - what's this change for?

greebie · 2018-07-28

There are a number of scalastyle lint change requests that need to be addressed.

I decided to pick some of these off to reduce the errors that show up in the build.

Part of that is requiring explicit types for any public method. I'm pretty sure MIME types are always strings.

hardiksahi and others added 28 commits May 4, 2018 02:34

pom.xml change for GraphX

62dfb3f

pom.xml change for GraphX

2778bf5

Changes for GraphXSLS

13e6723

Changes for GraphXSLS

5f5a4b0

Changes for SLS graph

8adb2b3

Changes for SLS graph

54c9133

Change

d64fe13

Changes

3f63b3c

Changes

e64e298

Changes

e22f01e

Changes

37e9aa7

Changes

2b81550

Changes

ff7dd7d

Changes

afba7b6

Changes for GraphX

41f6ef8

Changes

1ddd484

Changes

12c3ded

Changes for GraphX

eeb18c2

Changes

e5c9be7

Changes

6fdadc5

Changes

ae434c3

Changes for converting WARC RDD to GraphX object

7077035

Merge branch 'master' of github.com:hardiksahi/aut into issue-203

ef0ad13

Make the TravisCI build less verbose since we're hitting the 4MB log limit.

a02a74e

Pin site.plugin and project-info-reports.plugin so mvn site builds.

e0c95fd

See:
- https://stackoverflow.com/questions/51091539/maven-site-plugins-3-3-java-lang-classnotfoundexception-org-apache-maven-doxia
- https://travis-ci.org/archivesunleashed/aut/jobs/408259462#L3201-L3202

- Rename extractor to ExtractGraphX

a6aa179

- Various lint fixes (usually Magic Numbers)
- Remove illegal imports from scala style (we use wildcard imports a lot)

Lint fixes.

00879e2

Setup GraphX test file.

e87ab2b

greebie added 2 commits July 26, 2018 14:12

Revise GraphX test for more completeness.

1475faa

Minor fixes to ExtractGraphXTest.

3d32493

ianmilligan1 requested changes Jul 27, 2018

greebie added 2 commits July 27, 2018 11:51

Fix typo errors in WriteGraphXML, and tested it properly in Gephi this time.

d33c06f

Merge branch 'master' into issue-203

0657870

ianmilligan1 approved these changes Jul 27, 2018

greebie added 2 commits July 27, 2018 17:35

Add WriteGraphXMLTest.

9b12932

It was challenging to test the XML output - id & component generation is a little wonky.

Merge branch 'issue-203' of https://github.com/archivesunleashed/aut into issue-203

d1d1603

ianmilligan1 reviewed Jul 28, 2018

ianmilligan1 merged commit afe9254 into master Jul 29, 2018

ianmilligan1 deleted the issue-203 branch July 29, 2018 14:19

ianmilligan1 mentioned this pull request Aug 14, 2018

Refactor ExtractGraph and assess value of GraphX for producing network graphs #203

Closed

ianmilligan1 mentioned this pull request Apr 13, 2020

CommandLineApp DomainGraphExtractor Uses Different Node IDs than WriteGraph #439

Closed
