Add ExtractGraphX including algorithms for PageRank and Components. Issue 203 #245
- Various lint fixes (usually Magic Numbers)
- Remove illegal imports from scala style (we use wildcard imports a lot)
ExtractGraph does provide a rudimentary JSON output that is not included in ExtractGraphX. I think it makes sense to create a more generic CreateJSONGraph, or even a more detailed output class that handles whatever format we want. Either way, I'm not sure the JSON creator is used very much. |
Codecov Report
@@            Coverage Diff             @@
##           master     #245      +/-   ##
==========================================
+ Coverage   68.71%   70.57%   +1.85%
==========================================
  Files          39       41       +2
  Lines         911      982      +71
  Branches      168      179      +11
==========================================
+ Hits          626      693      +67
- Misses        231      232       +1
- Partials       54       57       +3
|
Thanks for this Ryan! For the test script, what should we expect to see as a result? |
It should produce a network graph at "graphML-path.graphml/" in GraphML format with PageRank and other metadata. |
OK great, thanks @greebie - I will test it, probably tomorrow morning! |
What the roadmap is for this functionality in |
Successfully generated the file with:
However, when I attempted to open the ensuing
|
Probably something weird happening in WriteGraphXML?
Re: AUK. The first thing is that this update offers some good features for network analysis, so it is worth adding even if it does not work for AUK. The AUK Spark script would need revision, which mostly means wrapping the current flatMap script in the GraphX graph object and using .runPageRankAlgorithm(). An additional advantage is that the raw Gephi graphs could be slightly more attractive, because they would use PageRank instead of degree as the default sizing.
I'll definitely do some testing before we move forward on changing the algorithm. The new approach will generally take longer for all graphs, but much less time than what GraphPass needs for huge ones. Also, Hardik and I were looking at a paper that examined the differences in PageRank calculations between Spark and igraph. The "quickrun" feature in GraphPass would include something like this pseudocode:
The downside is longer Spark calls, possibly for all graphs if we do not add some conditional reasoning behind choosing to run the PageRank algorithm. The upside is that this would be able to provide visualizations for very large graphs in a way that makes theoretical sense. Since the web archives are not going to get smaller in the long term, it is important that we have a solution, even if it's not optimal. Further, we have a way forward if SNAP is simply beyond my capabilities or cannot do what we need it to. GraphPass would not have to eat up resources for huge graphs. |
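As a rough illustration of the "conditional reasoning" mentioned above, here is a minimal sketch in plain Scala collections rather than Spark. It only runs a PageRank pass when the graph has more nodes than a threshold. Everything named here (`PageRankSketch`, `rankIfLarge`, `nodeThreshold`, the simple power-iteration PageRank) is an assumption for illustration; it is not the ExtractGraphX or GraphPass code, and GraphX's own PageRank normalizes its scores differently.

```scala
object PageRankSketch {
  type Edge = (String, String)

  /** Plain power-iteration PageRank over a directed edge list (illustrative only). */
  def pageRank(edges: Seq[Edge], iterations: Int = 20, damping: Double = 0.85): Map[String, Double] = {
    val nodes = edges.flatMap { case (src, dst) => Seq(src, dst) }.distinct
    val n = nodes.size
    if (n == 0) Map.empty[String, Double]
    else {
      val outLinks: Map[String, Seq[String]] =
        edges.groupBy(_._1).map { case (src, es) => (src, es.map(_._2)) }
      var ranks: Map[String, Double] = nodes.map(node => (node, 1.0 / n)).toMap
      for (_ <- 1 to iterations) {
        // Rank held by nodes with no outgoing links is spread evenly over all nodes.
        val danglingMass = nodes.filterNot(outLinks.contains).map(ranks).sum
        // Each edge passes an equal share of its source's rank to its target.
        val contributions: Map[String, Double] = edges
          .map { case (src, dst) => (dst, ranks(src) / outLinks(src).size) }
          .groupBy(_._1)
          .map { case (dst, cs) => (dst, cs.map(_._2).sum) }
        ranks = nodes.map { node =>
          val incoming = contributions.getOrElse(node, 0.0) + danglingMass / n
          (node, (1 - damping) / n + damping * incoming)
        }.toMap
      }
      ranks
    }
  }

  /** Only pay for PageRank when the graph exceeds the (assumed) node threshold. */
  def rankIfLarge(edges: Seq[Edge], nodeThreshold: Int): Option[Map[String, Double]] = {
    val nodeCount = edges.flatMap { case (src, dst) => Seq(src, dst) }.distinct.size
    if (nodeCount > nodeThreshold) Some(pageRank(edges)) else None
  }
}
```

With a threshold of 50,000 (the node count mentioned in the PR description), `PageRankSketch.rankIfLarge(edges, 50000)` would return `None` for small graphs and skip the extra computation.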
@ianmilligan1 Yup. That looks like I accidentally added or removed a "<" somewhere. Will revise. |
@ianmilligan1 The latest update should work in Gephi this time. I tested it with a pretty good set of WARCs. The weak component works nicely! |
Works now, thanks @greebie! |
It was challenging to test the XML output - id & component generation is a little wonky.
…into issue-203
Hey 70% Codecov! That's a milestone! :) |
@@ -98,14 +98,14 @@ class ArchiveRecordImpl(r: SerializableWritable[ArchiveRecordWritable]) extends
     new String(getContentBytes)
   }

-  val getMimeType = {
+  val getMimeType: String = {
Just a quick question - what's this change for?
There are a number of lint change requests from scalastyle that need to be addressed.
I decided to pick some of these off to reduce the errors that show up in the build.
Part of that is requiring explicit types for any public method. I'm pretty sure MIME types are always strings.
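To make the lint rule concrete, here is a small sketch of the difference between an inferred and an explicit type on a public member. `RecordSketch` and its constructor parameter are hypothetical stand-ins, not the toolkit's `ArchiveRecordImpl`.

```scala
// Hypothetical stand-in for a record class; not the toolkit's ArchiveRecordImpl.
class RecordSketch(rawMimeType: String) {
  // Inferred type: this compiles, but the public signature is left implicit,
  // which is the kind of declaration the lint check complains about.
  val inferredMimeType = rawMimeType.trim.toLowerCase

  // Explicit type: the public contract is stated and checked by the compiler,
  // so an accidental change of the right-hand side's type fails to compile.
  val getMimeType: String = rawMimeType.trim.toLowerCase
}
```

Both members hold the same value here; the explicit annotation only changes what readers and the compiler can rely on.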
GitHub issue(s):
#203
What does this Pull Request do?
Adds the ExtractGraphX algorithm and GraphML output to go with the GraphX output.
Adds a feature to calculate PageRank and weak and strong connected components.
Provides some lint fixes for other files (usually removing Magic Numbers).
Deprecates ExtractGraph as outdated (although there is some discussion that comes with that).
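For readers unfamiliar with the weak component calculation listed above, the following is a minimal sketch in plain Scala using union-find. It labels each node with its weakly connected component and then keeps only the edges in the largest one, which is one possible way to shrink a graph. The names `ComponentsSketch`, `weakComponents` and `largestComponentEdges` are assumptions for illustration; the pull request itself relies on GraphX, which exposes `connectedComponents()` for this purpose, and this sketch does not reproduce that implementation.

```scala
object ComponentsSketch {
  type Edge = (String, String)

  /** Labels every node with a representative of its weakly connected component. */
  def weakComponents(edges: Seq[Edge]): Map[String, String] = {
    val parent = scala.collection.mutable.Map.empty[String, String]

    // Find the root of a node's set, compressing the path as we go.
    def find(node: String): String = {
      val p = parent.getOrElseUpdate(node, node)
      if (p == node) node
      else {
        val root = find(p)
        parent(node) = root
        root
      }
    }

    // Edge direction is ignored: both endpoints end up in the same set.
    edges.foreach { case (src, dst) =>
      val (a, b) = (find(src), find(dst))
      if (a != b) parent(a) = b
    }
    // Copy the keys first so path compression does not mutate the map mid-iteration.
    parent.keys.toList.map(node => (node, find(node))).toMap
  }

  /** Keeps only the edges that fall inside the largest weak component. */
  def largestComponentEdges(edges: Seq[Edge]): Seq[Edge] = {
    val labels = weakComponents(edges)
    if (labels.isEmpty) Seq.empty
    else {
      val sizes = labels.values.groupBy(identity).map { case (label, members) => (label, members.size) }
      val (biggest, _) = sizes.maxBy(_._2)
      edges.filter { case (src, _) => labels(src) == biggest }
    }
  }
}
```

Strongly connected components need a different algorithm (for example Tarjan's), because edge direction matters there.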
How should this be tested?
Additional Notes:
The main feature is strong and weak connected components, which can be used in GraphPass to reduce the size of a network graph if it is > 50k nodes. Strong connected components are also interesting as a potential measure of who is driving the overall conversation, or if factions exist in a community.
Interested parties
Tag (@ mention) interested parties.
Thanks in advance for your help with the Archives Unleashed Toolkit!