Documentation update for https://github.com/archivesunleashed/aut/pul… (#67)

* Documentation update for archivesunleashed/aut#477
archivesunleashed · May 29, 2020 · 96c0f20
1 parent 2a1fcd7
commit 96c0f20

Showing 3 changed files with 45 additions and 23 deletions.
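
The Python renames shown in this diff (`Extract_Popular_Images` to `ExtractPopularImages`, `Write_Graphml` to `WriteGraphML`) would also apply to any user script written against the older names. A minimal sketch of how such scripts might be updated follows; the `RENAMES` table and `migrate_source` helper are hypothetical, are not part of `aut`, and cover only the two renames visible in this commit.

```python
import re

# Old name -> new name, taken from the "-" / "+" lines of this commit.
# This table is an illustrative assumption, not an official migration list.
RENAMES = {
    "Extract_Popular_Images": "ExtractPopularImages",
    "Write_Graphml": "WriteGraphML",
}

# Match whole identifiers only, so a longer name that merely contains
# one of the old names is left untouched.
_PATTERN = re.compile(r"\b(" + "|".join(map(re.escape, RENAMES)) + r")\b")


def migrate_source(text: str) -> str:
    """Return `text` with each old function name replaced by its new name."""
    return _PATTERN.sub(lambda m: RENAMES[m.group(1)], text)


old_line = 'Write_Graphml(graph.collect(), "example.graphml")'
print(migrate_source(old_line))
# prints: WriteGraphML(graph.collect(), "example.graphml")
```

The word-boundary anchors keep the substitution from rewriting unrelated identifiers; review the result before committing it, since a name inside a string or comment is rewritten as well.
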
diff --git a/current/image-analysis.md b/current/image-analysis.md
@@ -297,7 +297,7 @@ from aut import *
 
 images = WebArchive(sc, sqlContext, "/path/to/warcs").images()
 
-popular_images = Extract_Popular_Images(images, 20, 10, 10)
+popular_images = ExtractPopularImages(images, 20, 10, 10)
 
 popular_images.show()
 ```

diff --git a/current/link-analysis.md b/current/link-analysis.md
@@ -379,47 +379,69 @@ WebArchive(sc, sqlContext, "/path/to/warcs")\
 
 ## Export to Gephi
 
-### Scala RDD
-
 You may want to export your data directly to the [Gephi software
 suite](http://gephi.github.io/), an open-source network analysis project. The
 following code writes to the GEXF format:
 
+### Scala RDD
+
+**Will not be implemented.**
+
+### Scala DF
+
 ```scala
 import io.archivesunleashed._
+import io.archivesunleashed.udfs._
 import io.archivesunleashed.app._
-import io.archivesunleashed.matchbox._
 
-val links = RecordLoader.loadArchives("/path/to/warcs", sc)
-  .keepValidPages()
-  .map(r => (r.getCrawlDate, ExtractLinksRDD(r.getUrl, r.getContentString)))
-  .flatMap(r => r._2.map(f => (r._1,
-                               ExtractDomainRDD(f._1).replaceAll("^\\s*www\\.", ""),
-                               ExtractDomainRDD(f._2).replaceAll("^\\s*www\\.", ""))))
-  .filter(r => r._2 != "" && r._3 != "")
-  .countItems()
-  .filter(r => r._2 > 5)
-
-WriteGEXF(links, "links-for-gephi.gexf")
+val graph = webgraph.groupBy(
+                       $"crawl_date",
+                       removePrefixWWW(extractDomain($"src")).as("src_domain"),
+                       removePrefixWWW(extractDomain($"dest")).as("dest_domain"))
+              .count()
+              .filter(!($"dest_domain"===""))
+              .filter(!($"src_domain"===""))
+              .filter($"count" > 5)
+              .orderBy(desc("count"))
+              .collect()
+
+WriteGEXF(graph, "links-for-gephi.gexf")
 ```
 
-This file can then be directly opened by Gephi.
-
 We also support exporting to the
 [GraphML](https://en.wikipedia.org/wiki/GraphML) format. To do so, use
 the `WriteGraphML` method:
 
 ```scala
-WriteGraphml(links, "links-for-gephi.graphml")
+WriteGraphML(graph, "links-for-gephi.graphml")
 ```
 
-### Scala DF
+### Python DF
 
-**To be implemented.**
+```python
+from aut import *
+from pyspark.sql.functions import col, desc
+
+graph = WebArchive(sc, sqlContext, "/path/to/data")\
+          .webgraph()\
+          .groupBy("crawl_date", remove_prefix_www(extract_domain("src")).alias("src_domain"), remove_prefix_www(extract_domain("dest")).alias("dest_domain"))\
+          .count()\
+          .filter((col("dest_domain").isNotNull()) & (col("dest_domain") !=""))\
+          .filter((col("src_domain").isNotNull()) & (col("src_domain") !=""))\
+          .filter(col("count") > 5)\
+          .orderBy(desc("count"))\
+          .collect()
+
+WriteGEXF(graph, "links-for-gephi.gexf")
+```
 
-### Python DF
+We also support exporting to the
+[GraphML](https://en.wikipedia.org/wiki/GraphML) format. To do so, use
+the `WriteGraphML` method:
 
-**To be implemented.**
+```python
+WriteGraphML(graph, "links-for-gephi.graphml")
+```
 
 ## Finding Hyperlinks within Collection on Pages with Certain Keyword
 

diff --git a/current/standard-derivatives.md b/current/standard-derivatives.md
@@ -136,7 +136,7 @@ graph = webgraph.groupBy("crawl_date", remove_prefix_www(extract_domain("src")).
           .orderBy(desc("count"))
 
 # Write the GraphML out to a file.
-Write_Graphml(graph.collect(), "/path/to/derivatives/auk/graph/example.graphml")
+WriteGraphML(graph.collect(), "/path/to/derivatives/auk/graph/example.graphml")
 ```
 
 ## Extract Binary Info