
PySpark support for core AUT functionality. #12, #13. #100

Closed
MapleOx wants to merge 29 commits

Conversation

@MapleOx (Collaborator) commented Oct 20, 2017

GitHub issue(s): #12, #13

What does this Pull Request do?

Adds PySpark support for the core functionality of the Archives Unleashed Toolkit. Added:

  • Ability to load ARCs and WARCs as RDDs or DataFrames in PySpark (see the sketch after this list),
  • Python DataFrame transformation functions like keepValidPages and keepImages that mimic the RDD transformations in AUT Scala,
  • Python versions of several of the AUT matchbox functions, like ExtractLinks and RemoveHTML,
  • Two example Python scripts to demonstrate usage.
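
A rough usage sketch (the function names come from this PR's description; the exact signatures are assumptions, not the final API):

df = RecordLoader.loadArchives("example.arc.gz", sc, spark)  # hypothetical Python wrapper around the Scala helper
images = keepImages(df)                                      # Python DataFrame transformation added in this PR
images.select("url").show(10)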

How should this be tested?

First, run mvn clean package to rebuild AUT. After that, you can either run from the PySpark shell or submit your own script.

To run from the PySpark shell:

  1. cd into the aut/ directory.
  2. Run zip pyaut src/main/python/*.py to make a zip of the Python files (this creates pyaut.zip).
  3. Run the following command to start the PySpark shell:
pyspark --jars target/aut-0.10.1-SNAPSHOT-fatjar.jar --driver-class-path target/aut-0.10.1-SNAPSHOT-fatjar.jar --py-files pyaut.zip
  4. You can now use AUT in the PySpark shell!

To run a script:

  1. cd into the aut/ directory.
  2. Run zip pyaut src/main/python/*.py to make a zip of the Python files (this creates pyaut.zip).
  3. Run the following command to submit your script:
spark-submit --jars target/aut-0.10.1-SNAPSHOT-fatjar.jar --driver-class-path target/aut-0.10.1-SNAPSHOT-fatjar.jar --py-files pyaut.zip path/to/myscript.py

Two example scripts can be found in the src/main/python/scripts directory.
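
For reference, a hypothetical skeleton of such a script (the loader and transformation names come from this PR; the module import and exact calls are assumptions, not the shipped examples):

from pyspark import SparkContext
from pyspark.sql import SparkSession
from RecordLoader import loadArchives  # assumed module name based on the src/main/python layout

if __name__ == "__main__":
    sc = SparkContext(appName="aut-pyspark-example")
    spark = SparkSession.builder.getOrCreate()

    # Load the archive as a DataFrame of records and show a sample of URLs.
    df = loadArchives("/path/to/example.arc.gz", sc, spark)
    df.select("url").show(10)

    sc.stop()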

Additional Notes:

The more complicated AUT Scala matchbox functions, like ExtractGraph and ExtractEntities, are not included in this PR. I plan to add them progressively.

The Python code uses BeautifulSoup4, a library for extracting data from HTML and XML.
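
For illustration, a minimal sketch of what a BeautifulSoup-based RemoveHTML could look like (the actual matchbox implementation in this PR may differ):

from bs4 import BeautifulSoup

def RemoveHTML(content):
    # Strip tags and return only the visible text; empty input passes through.
    if not content:
        return ""
    return BeautifulSoup(content, "html.parser").get_text()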

Interested parties

@lintool @ruebot @ianmilligan1 @greebie

@codecov bot commented Oct 20, 2017

Codecov Report

Merging #100 into master will decrease coverage by 0.71%.
The diff coverage is 0%.


@@            Coverage Diff             @@
##           master     #100      +/-   ##
==========================================
- Coverage   65.66%   64.95%   -0.72%     
==========================================
  Files          36       37       +1     
  Lines         731      739       +8     
  Branches      142      144       +2     
==========================================
  Hits          480      480              
- Misses        201      209       +8     
  Partials       50       50
Impacted Files | Coverage Δ
...spark/pythonhelpers/RecordLoaderPythonHelper.scala | 0% <0%> (ø)

Continue to review the full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 3eb093a...85fade6.

@ianmilligan1 (Member)

I'll leave reviews to @ruebot and @lintool, but I think it'd be good to get this into the repository so we (and others in the community) can explore PySpark using the main repo. It's sitting apart from the main Scala functions, so it wouldn't affect AUT's functionality.

@ruebot (Member) left a comment

Looks like we're recreating some of the Scala scripts. Do we need to do that?

Also, do you have a plan for tests?

.reduceByKey(lambda c1, c2: c1 + c2) \
.sortBy(lambda f: f[1], ascending = False)

# def keepImages(df):
Member

Do we need this commented out code?

Contributor

I am willing and able to look at unit test coverage here once people have tried it and accept that it does what we want it to do. From what I've seen, pytest is the best library for handling a PySpark context: https://stackoverflow.com/questions/33811882/how-do-i-unit-test-pyspark-programs

Will we need python dependency management here?

Collaborator Author

@ruebot By "Scala scripts", do you mean the things under src/main/python/scripts/ like extractLinkScript.py, or the matchbox functions like ExtractDomain?

return df.filter(content_filter_udf(df['contentString']))


# ---- TODO: All discard filtering operations ---- #
Contributor

Is this still "TODO" or have you completed the task? (remove comment)

Collaborator Author

Good catch, that comment should be removed. I probably have some other TODO comments that should be deleted too.


def DetectLanguage(input: str):
    if input == "":
        return ""
Contributor

I wonder if a try ... catch would work better here. It seems like there are a few different reasons why langdetect might burp besides an empty string.

https://github.com/shuyo/language-detection/blob/master/src/com/cybozu/labs/langdetect/LangDetectException.java
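
For example, a minimal sketch of the try/except approach using the Python langdetect port (whether this PR relies on that library or on the JVM detector is an assumption here):

from langdetect import detect
from langdetect.lang_detect_exception import LangDetectException

def DetectLanguage(input):
    # Catch any detection failure (empty string, no features found, etc.)
    # rather than special-casing the empty string alone.
    try:
        return detect(input)
    except LangDetectException:
        return ""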


def loadArchives(path: String, jssc: JavaSparkContext, spark: SparkSession): DataFrame = {
  val sc = jssc.sc
  val rdd = RecordLoader.loadArchives(path, sc).keepValidPages()
Contributor

Bigger discussion, but is there a rationale for .keepValidPages() inside vs. outside of the loadArchives function? In Scala, we require people to use it separately from the load function.

Does it make sense to support accessing invalid pages? If not, maybe alter the Scala script to match this?

Member

Ping @lintool here

Contributor

I think the approach here works okay for this PR, but we should add an Issue to address the difference in approaches as it might be confusing.

Collaborator Author

Yeah, my rationale was that .keepValidPages() was always being called after loading the archive, so I might as well make RecordLoader do it automatically. That being said, I can remove this if it's undesirable and revert to how it's done in Scala.

Contributor

Assuming @ruebot is okay with it, I think it makes sense to move it to the RecordLoader as well. I think that still means you need an invalid-pages parameter, e.g. (path, sc, invalid), instead of including the .keepValidPages() call.

Contributor

Also - I suggest letting me fix the RecordLoader because it's probably going to break a whole bunch of unit tests. I'll get on it Sat, Sun or Mon.

Member

@greebie do you want to make a pull against @MapleOx's pull?

Member

I'd rather not do that... let's get @MapleOx's PR merged as soon as reasonably possible. We can open issues for what needs to be done next, and attach them directly to this PR.

Contributor

Yes. @MapleOx did a great job, and this is more of a code refactor having nothing directly to do with the work.

I'll create an issue and branch off until the PR is merged.


if __name__ == "__main__":
    # replace with your own path to archive file
    path = "/Users/Prince/Projects/pyaut/aut/example.arc.gz"
@greebie (Contributor) commented Oct 20, 2017

Can't remember if there is an equivalent to the Resource Scala library for Python (Resource has a tool to find paths in the package). It would be better for unit testing / coverage. If not, we could also move this script to aut-docs as we did with the Scala scripts.

Collaborator Author

Yeah, the scripts were just to give some examples of how to use AUT in PySpark, and it doesn't really matter where the scripts are located. I just put them under aut for convenience.

Member

Yes, I would push this down to the loader. Add an option to keep only valid pages, which is always true by default. So load('path') would be the default form, which is really load('path', true) - the user can suppress with load('path', false) if they really want all the crap...
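
A rough sketch of what that could look like on the Python side (the parameter name and helper below are hypothetical, not code from this PR):

def loadArchives(path, sc, spark, keep_valid_pages=True):
    # _load_all_records is a hypothetical wrapper around the Scala RecordLoaderPythonHelper
    # that returns every record as a DataFrame row.
    df = _load_all_records(path, sc, spark)
    # Default behaviour filters to valid pages; loadArchives(path, sc, spark, False)
    # hands back everything.
    return keepValidPages(df) if keep_valid_pages else df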

@ruebot (Member) commented Oct 22, 2017

@MapleOx can you clean up the TODOs and commented-out code, and we can move forward with merging. W/r/t some of the TODOs, it might make sense to create issues for them. Please don't hesitate to do that.

@ruebot (Member) commented Oct 22, 2017

I'm also looking for feedback on this still:

Looks like we're recreating some of the Scala scripts. Do we need to do that?

If it is necessary, we should probably have a discussion about keeping things in sync.

@ruebot (Member) commented Oct 23, 2017

@MapleOx let's keep this comment (#100 (comment)) going in the main thread, not on code that has already been removed.

By "Scala scripts", do you mean the things under src/main/python/scripts/ like extractLinkScript.py, or the matchbox functions like ExtractDomain?

I guess there are two parts to it:

The example scripts should probably not be in this repo. I think they'd be more useful over in aut-docs.

As for the matchbox functions, is the intention to redo all of those in Python? If so, then I think we (@lintool, @ianmilligan1, @greebie) need to have a discussion about maintaining both sets.

@MapleOx (Collaborator Author) commented Oct 23, 2017

@ruebot Okay, I'll remove the example scripts from this PR.

For the matchbox functions that are used in RDD transformations, I rewrote them in Python because I was unable to call the Scala versions straightforwardly from PySpark.

More details: I originally tried to just call the Scala implementations from Python, like this:

def ExtractDomain(sc, url, source = ""):
  jvm = sc._jvm
  return jvm.io.archivesunleashed.spark.matchbox.ExtractDomain.apply(url, source)

However, this doesn't work if we call ExtractDomain in a transformation, like

rdd.map(lambda r: ExtractDomain(sc, r.url))

because of this:

pickle.PicklingError: Could not serialize object: Exception: It appears that you are attempting to 
reference SparkContext from a broadcast variable, action, or transformation. SparkContext can only
be used on the driver, not in code that it run on workers. For more information, see SPARK-5063.

I wasn't able to think of a way around this, so I just decided to write Python versions of the RDD transformation matchbox functions -- let me know if you guys have a better solution!

Now, for the matchbox functions that are not used in RDD transformations like ExtractGraph, it might be possible to just call them directly from Scala, but I haven't tried this yet. In any case, the Python matchbox functions we have right now are all ones that might be used inside RDD transformations.
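
For illustration, a self-contained sketch of the pure-Python route described above (not necessarily the exact matchbox code in this PR); because the function never touches the SparkContext, it pickles cleanly and can run inside worker-side transformations:

from urllib.parse import urlparse  # Python 3; on Python 2 use `from urlparse import urlparse`

def ExtractDomain(url, source=""):
    # Return the host of url, falling back to the host of source if url has none.
    host = urlparse(url).netloc
    return host if host else urlparse(source).netloc

# Safe to use inside a transformation, e.g. rdd.map(lambda r: ExtractDomain(r.url))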

@greebie (Contributor) commented Oct 23, 2017

I think the issue is that RDDs do not support nested Spark transformations. So you can create a DataFrame and then run transformations, but you cannot run transformations and create a DataFrame out of it.

Or something like that. I don't understand it fully. The example they give is trying to create two RDDs and use one RDD to transform the other. Leads to too many issues in large datasets, I suppose.

@ianmilligan1 (Member) commented Nov 2, 2017

@greebie will test by taking our raw scripts that work on our old RDDs and moving them to PySpark DataFrames. Then I think we're good to merge. 👍

@ianmilligan1 (Member)

Better format to create the zip; from src/main/python run

zip -r ~/pyspark/aut/pyspark1.zip .

Or some variety of that.

@greebie (Contributor) commented Nov 20, 2017

Slightly better zipping script:

(cd src/main/python && zip -r - .) > pyaut.zip
Saves the trouble and confusion of people having to find src/main/python/.

@ianmilligan1 (Member)

FWIW I have been running it with a vanilla Jupyter notebook by running

PYSPARK_DRIVER_PYTHON=jupyter PYSPARK_DRIVER_PYTHON_OPTS=notebook ~/Dropbox/pyspark/spark-2.2.0-bin-hadoop2.7/bin/pyspark --jars target/aut-0.10.1-SNAPSHOT-fatjar.jar --driver-class-path target/aut-0.10.1-SNAPSHOT-fatjar.jar --py-files /Users/ianmilligan1/dropbox/pyspark/aut/pyspark1.zip

from the aut directory.

@ruebot (Member) commented Nov 23, 2017

This is going to need to be updated with the most recent commits again.

- Have DFTransformations & RowTransformations call DetectLanguage instead of detect to avoid error on null values.
@greebie (Contributor) commented Nov 23, 2017

Latest pull request fixes bugs I found in keepLanguages() while testing.

@ruebot (Member) commented Nov 23, 2017

Needs to be updated again now that #119 was merged.

@ianmilligan1 mentioned this pull request Nov 27, 2017
@ruebot (Member) commented Dec 5, 2017

@ianmilligan1 @greebie or @MapleOx, I'm going to close this PR. I've created a branch here. Can you make a new PR against that? Then I'll merge that. After that, y'all can feel free to work and push to that branch as you please. When we're ready to move it into master, we can do another PR.
