
PySpark support for core AUT functionality. #12, #13. #100

Closed
MapleOx wants to merge 29 commits

Conversation

@MapleOx (Collaborator) commented Oct 20, 2017

GitHub issue(s): #12, #13

What does this Pull Request do?

Adds PySpark support for the core functionality of the Archives Unleashed Toolkit. Added:

  • Ability to load ARCs and WARCs as RDDs or DataFrames in PySpark (see the sketch after this list),
  • Python DataFrame transformation functions like keepValidPages and keepImages that mimic the RDD transformations in AUT Scala,
  • Python versions of several of the AUT matchbox functions, like ExtractLinks and RemoveHTML,
  • Two example Python scripts to demonstrate usage.
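
A rough usage sketch (the function names come from this PR's description; the exact signatures are assumptions, not the final API):

df = RecordLoader.loadArchives("example.arc.gz", sc, spark)  # hypothetical Python wrapper around the Scala helper
images = keepImages(df)                                      # Python DataFrame transformation added in this PR
images.select("url").show(10)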

How should this be tested?

First, run mvn clean package to rebuild AUT. After that, you can either run from the PySpark shell or submit your own script.

To run from the PySpark shell:

  1. cd into the aut/ directory.
  2. Run zip pyaut src/main/python/*.py to make a zip of the Python files (this creates pyaut.zip).
  3. Run the following command to start the PySpark shell:
pyspark --jars target/aut-0.10.1-SNAPSHOT-fatjar.jar --driver-class-path target/aut-0.10.1-SNAPSHOT-fatjar.jar --py-files pyaut.zip
  4. You can now use AUT in the PySpark shell!

To run a script:

  1. cd into the aut/ directory.
  2. Run zip pyaut src/main/python/*.py to make a zip of the Python files (this creates pyaut.zip).
  3. Run the following command to submit your script:
spark-submit --jars target/aut-0.10.1-SNAPSHOT-fatjar.jar --driver-class-path target/aut-0.10.1-SNAPSHOT-fatjar.jar --py-files pyaut.zip path/to/myscript.py

Two example scripts can be found in the src/main/python/scripts directory.
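
For reference, a hypothetical skeleton of such a script (the loader and transformation names come from this PR; the module import and exact calls are assumptions, not the shipped examples):

from pyspark import SparkContext
from pyspark.sql import SparkSession
from RecordLoader import loadArchives  # assumed module name based on the src/main/python layout

if __name__ == "__main__":
    sc = SparkContext(appName="aut-pyspark-example")
    spark = SparkSession.builder.getOrCreate()

    # Load the archive as a DataFrame of records and show a sample of URLs.
    df = loadArchives("/path/to/example.arc.gz", sc, spark)
    df.select("url").show(10)

    sc.stop()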

Additional Notes:

The more complicated AUT Scala matchbox functions, like ExtractGraph and ExtractEntities, are not included in this PR. I plan to add them progressively.

The Python code uses BeautifulSoup4, a library for extracting data from HTML and XML.
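
For illustration, a minimal sketch of what a BeautifulSoup-based RemoveHTML could look like (the actual matchbox implementation in this PR may differ):

from bs4 import BeautifulSoup

def RemoveHTML(content):
    # Strip tags and return only the visible text; empty input passes through.
    if not content:
        return ""
    return BeautifulSoup(content, "html.parser").get_text()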

Interested parties

@lintool @ruebot @ianmilligan1 @greebie

@codecov bot commented Oct 20, 2017

Codecov Report

Merging #100 into master will decrease coverage by 0.71%.
The diff coverage is 0%.


@@            Coverage Diff             @@
##           master     #100      +/-   ##
==========================================
- Coverage   65.66%   64.95%   -0.72%     
==========================================
  Files          36       37       +1     
  Lines         731      739       +8     
  Branches      142      144       +2     
==========================================
  Hits          480      480              
- Misses        201      209       +8     
  Partials       50       50
Impacted Files | Coverage Δ
...spark/pythonhelpers/RecordLoaderPythonHelper.scala | 0% <0%> (ø)

Continue to review the full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 3eb093a...85fade6.

@ianmilligan1 (Member)

I'll leave reviews to @ruebot and @lintool, but I think it'd be good to get this into the repository so we (and others in the community) can explore PySpark using the main repo. It's sitting apart from the main Scala functions, so it wouldn't affect AUT's functionality.

@ruebot (Member) left a comment

Looks like we're recreating some of the Scala scripts. Do we need to do that?

Also, do you have a plan for tests?

.reduceByKey(lambda c1, c2: c1 + c2) \
.sortBy(lambda f: f[1], ascending = False)

# def keepImages(df):
Member

Do we need this commented out code?

Contributor

I am willing and able to look at unit test coverage here once people have tried it and accept that it does what we want it to do. From what I've seen, pytest is the best library for handling a PySpark context: https://stackoverflow.com/questions/33811882/how-do-i-unit-test-pyspark-programs

Will we need python dependency management here?

Collaborator Author

@ruebot By "Scala scripts", do you mean the things under src/main/python/scripts/ like extractLinkScript.py, or the matchbox functions like ExtractDomain?

return df.filter(content_filter_udf(df['contentString']))


# ---- TODO: All discard filtering operations ---- #
Contributor

Is this still "TODO" or have you completed the task? (remove comment)

Collaborator Author

Good catch, that comment should be removed. I probably have some other TODO comments that should be deleted too.


def DetectLanguage(input: str):
    if input == "":
        return ""
Contributor

I wonder if a try ... catch would work better here. It seems like there are a few different reasons why langdetect might burp besides an empty string.

https://github.com/shuyo/language-detection/blob/master/src/com/cybozu/labs/langdetect/LangDetectException.java
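
For example, a minimal sketch of the try/except approach using the Python langdetect port (whether this PR relies on that library or on the JVM detector is an assumption here):

from langdetect import detect
from langdetect.lang_detect_exception import LangDetectException

def DetectLanguage(input):
    # Catch any detection failure (empty string, no features found, etc.)
    # rather than special-casing the empty string alone.
    try:
        return detect(input)
    except LangDetectException:
        return ""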


def loadArchives(path: String, jssc: JavaSparkContext, spark: SparkSession): DataFrame = {
  val sc = jssc.sc
  val rdd = RecordLoader.loadArchives(path, sc).keepValidPages()
Contributor

Bigger discussion, but is there a rationale for .keepValidPages() inside vs. outside of the loadArchives function? In Scala, we require people to use it separately from the load function.

Does it make sense to support accessing invalid pages? If not, maybe alter the Scala script to match this?

Member

Ping @lintool here

Contributor

I think the approach here works okay for this PR, but we should add an Issue to address the difference in approaches as it might be confusing.

Collaborator Author

Yeah, my rationale was that .keepValidPages() was always being called after loading the archive, so I might as well make RecordLoader do it automatically. That being said, I can remove this if it's undesirable and revert to how it's done in Scala.

Contributor

Assuming @ruebot is okay with it, I think it makes sense to move it to the RecordLoader as well. I think that still means you need an invalid-pages parameter, e.g. (path, sc, invalid), instead of including the .keepValidPages() call.

Contributor

Also - I suggest letting me fix the RecordLoader because it's probably going to break a whole bunch of unit tests. I'll get on it Sat, Sun or Mon.

Member

@greebie do you want to make a pull against @MapleOx's pull?

Member

I'd rather not do that... let's get @MapleOx's PR merged as soon as reasonably possible. We can open issues for what needs to be done next, and attach them directly to this PR.

Contributor

Yes. @MapleOx did a great job, and this is more of a code refactor having nothing directly to do with the work.

I'll create an issue and branch off until the PR is merged.


if __name__ == "__main__":
    # replace with your own path to archive file
    path = "/Users/Prince/Projects/pyaut/aut/example.arc.gz"
@greebie (Contributor) commented Oct 20, 2017

Can't remember if there is an equivalent to the Resource Scala library for Python (Resource has a tool to find paths in the package). It would be better for unit testing / coverage. If not, we could also move this script to aut-docs as we did with the Scala scripts.

Collaborator Author

Yeah, the scripts were just to give some examples of how to use AUT in PySpark, and it doesn't really matter where the scripts are located. I just put them under aut for convenience.

Member

Yes, I would push this down to the loader. Add an option to keep only valid pages, which is always true by default. So load('path') would be the default form, which is really load('path', true) - the user can suppress with load('path', false) if they really want all the crap...
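
A rough sketch of what that could look like on the Python side (the parameter name and helper below are hypothetical, not code from this PR):

def loadArchives(path, sc, spark, keep_valid_pages=True):
    # _load_all_records is a hypothetical wrapper around the Scala RecordLoaderPythonHelper
    # that returns every record as a DataFrame row.
    df = _load_all_records(path, sc, spark)
    # Default behaviour filters to valid pages; loadArchives(path, sc, spark, False)
    # hands back everything.
    return keepValidPages(df) if keep_valid_pages else df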

@ruebot (Member) commented Oct 22, 2017

@MapleOx can you clean up the TODOs and commented-out code, and we can move forward with merging. W/r/t some of the TODOs, it might make sense to create issues for them. Please don't hesitate to do that.

@ruebot (Member) commented Oct 22, 2017

I'm also looking for feedback on this still:

Looks like we're recreating some of the Scala scripts. Do we need to do that?

If it is necessary, we should probably have a discussion about keeping things in sync.

@ruebot (Member) commented Oct 23, 2017

@MapleOx let's keep this comment (#100 (comment)) going in the main thread, not on code that has already been removed.

By "Scala scripts", do you mean the things under src/main/python/scripts/ like extractLinkScript.py, or the matchbox functions like ExtractDomain?

I guess there are two parts to it:

The example scripts should probably not be in this repo. I think they'd be more useful over in aut-docs.

As for the matchbox functions, is the intention to redo all of those in Python? If so, then I think we (@lintool, @ianmilligan1, @greebie) need to have a discussion about maintaining both sets.

@MapleOx (Collaborator Author) commented Oct 23, 2017

@ruebot Okay, I'll remove the example scripts from this PR.

For the matchbox functions that are used in RDD transformations, I rewrote them in Python because I was unable to call the Scala versions straightforwardly from PySpark.

More details: I originally tried to just call the Scala implementations from Python, like this:

def ExtractDomain(sc, url, source = ""):
  jvm = sc._jvm
  return jvm.io.archivesunleashed.spark.matchbox.ExtractDomain.apply(url, source)

However, this doesn't work if we call ExtractDomain in a transformation, like

rdd.map(lambda r: ExtractDomain(sc, r.url))

because of this:

pickle.PicklingError: Could not serialize object: Exception: It appears that you are attempting to 
reference SparkContext from a broadcast variable, action, or transformation. SparkContext can only
be used on the driver, not in code that it run on workers. For more information, see SPARK-5063.

I wasn't able to think of a way around this, so I just decided to write Python versions of the RDD transformation matchbox functions -- let me know if you guys have a better solution!

Now, for the matchbox functions that are not used in RDD transformations like ExtractGraph, it might be possible to just call them directly from Scala, but I haven't tried this yet. In any case, the Python matchbox functions we have right now are all ones that might be used inside RDD transformations.
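
For illustration, a self-contained sketch of the pure-Python route described above (not necessarily the exact matchbox code in this PR); because the function never touches the SparkContext, it pickles cleanly and can run inside worker-side transformations:

from urllib.parse import urlparse  # Python 3; on Python 2 use `from urlparse import urlparse`

def ExtractDomain(url, source=""):
    # Return the host of url, falling back to the host of source if url has none.
    host = urlparse(url).netloc
    return host if host else urlparse(source).netloc

# Safe to use inside a transformation, e.g. rdd.map(lambda r: ExtractDomain(r.url))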

@greebie (Contributor) commented Oct 23, 2017

I think the issue is that RDDs do not support nested Spark transformations. So you can create a DataFrame and then run transformations, but you cannot run transformations and create a DataFrame out of it.

Or something like that. I don't understand it fully. The example they give is trying to create two RDDs and use one RDD to transform the other. Leads to too many issues in large datasets, I suppose.

@ianmilligan1 (Member) commented Nov 2, 2017

@greebie will test by taking our raw scripts that work on our old RDDs and moving them to PySpark DataFrames. Then I think we're good to merge. 👍

@ianmilligan1 (Member)

Better format to create the zip; from src/main/python run

zip -r ~/pyspark/aut/pyspark1.zip .

Or some variety of that.

@greebie (Contributor) commented Nov 20, 2017

Slightly better zipping script:

(cd src/main/python && zip -r - .) > pyaut.zip
Saves the trouble and confusion of people having to find src/main/python/.

@ianmilligan1 (Member)

FWIW I have been running it with a vanilla Jupyter notebook by running

PYSPARK_DRIVER_PYTHON=jupyter PYSPARK_DRIVER_PYTHON_OPTS=notebook ~/Dropbox/pyspark/spark-2.2.0-bin-hadoop2.7/bin/pyspark --jars target/aut-0.10.1-SNAPSHOT-fatjar.jar --driver-class-path target/aut-0.10.1-SNAPSHOT-fatjar.jar --py-files /Users/ianmilligan1/dropbox/pyspark/aut/pyspark1.zip

from the aut directory.

@ruebot (Member) commented Nov 23, 2017

This is going to need to be updated with the most recent commits again.

- Have DFTransformations & RowTransformations call DetectLanguage instead of detect to avoid error on null values.
@greebie (Contributor) commented Nov 23, 2017

Latest pull request fixes bugs I found in keepLanguages() while testing.

@ruebot (Member) commented Nov 23, 2017

Needs to be updated again now that #119 was merged.

@ianmilligan1 mentioned this pull request Nov 27, 2017
@ruebot (Member) commented Dec 5, 2017

@ianmilligan1 @greebie or @MapleOx, I'm going to close this PR. I've created a branch here. Can you make a new PR against that? Then I'll merge that. After that, y'all can feel free to work and push to that branch as you please. When we're ready to move it into master, we can do another PR.
