PySpark support for core AUT functionality (#12, #13) #100
Conversation
Codecov Report
@@ Coverage Diff @@
## master #100 +/- ##
==========================================
- Coverage 65.66% 64.95% -0.72%
==========================================
Files 36 37 +1
Lines 731 739 +8
Branches 142 144 +2
==========================================
Hits 480 480
- Misses 201 209 +8
Partials 50 50
Looks like we're recreating some of the Scala scripts. Do we need to do that?
Also, do you have a plan for tests?
src/main/python/DFTransformations.py
    .reduceByKey(lambda c1, c2: c1 + c2) \
    .sortBy(lambda f: f[1], ascending=False)

# def keepImages(df):
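For context, the two excerpt lines above look like the tail of a countItems-style helper. A minimal sketch of the full transformation, assuming an RDD of hashable items (the name countItems is an assumption):

    def countItems(rdd):
        # Tally each distinct item, then sort by count, descending.
        return rdd.map(lambda r: (r, 1)) \
                  .reduceByKey(lambda c1, c2: c1 + c2) \
                  .sortBy(lambda f: f[1], ascending=False)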
Do we need this commented out code?
I am willing and able to look at unit test coverage here once people have tried it and accept that it does what we want it to do. From what I've seen, pytest is the best library for handling a PySpark context: https://stackoverflow.com/questions/33811882/how-do-i-unit-test-pyspark-programs
Will we need Python dependency management here?
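A minimal sketch of that pytest approach, along the lines of the linked answer (fixture and test names are illustrative):

    import pytest
    from pyspark import SparkConf, SparkContext

    @pytest.fixture(scope="session")
    def spark_context(request):
        # One small local context shared across the whole test session.
        conf = SparkConf().setMaster("local[2]").setAppName("aut-tests")
        sc = SparkContext(conf=conf)
        request.addfinalizer(lambda: sc.stop())
        return sc

    def test_count(spark_context):
        assert spark_context.parallelize([1, 2, 3]).count() == 3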
@ruebot By "Scala scripts", do you mean the things under src/main/python/scripts/
like extractLinkScript.py
, or the matchbox functions like ExtractDomain
?
src/main/python/DFTransformations.py
return df.filter(content_filter_udf(df['contentString']))

# ---- TODO: All discard filtering operations ---- #
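A hypothetical sketch of what one discard-side counterpart could look like, simply negating the keep-side UDF from the excerpt above (the function name discardContent is an assumption):

    def discardContent(df, content_filter_udf):
        # Drop rows the content filter matches, instead of keeping them.
        return df.filter(~content_filter_udf(df['contentString']))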
Is this still "TODO" or have you completed the task? (remove comment)
Good catch, that comment should be removed. I probably have some other TODO comments that should be deleted too.
src/main/python/DetectLanguage.py
def DetectLanguage(input: str):
    if input == "":
        return ""
I wonder if a try ... catch would work better here. It seems like there are a few different reasons why langdetect might burp besides an empty string.
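A minimal sketch of the try ... except approach, assuming the langdetect library (which raises LangDetectException for empty or otherwise undetectable input):

    from langdetect import detect
    from langdetect.lang_detect_exception import LangDetectException

    def DetectLanguage(input: str):
        # Return "" for anything langdetect cannot handle, not just
        # the empty-string case.
        try:
            return detect(input)
        except LangDetectException:
            return ""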
def loadArchives(path: String, jssc: JavaSparkContext, spark: SparkSession): DataFrame = {
  val sc = jssc.sc
  val rdd = RecordLoader.loadArchives(path, sc).keepValidPages()
Bigger discussion, but is there a rationale for .keepValidPages() inside vs. outside of the loadArchives function? In Scala, we require people to use it separately from the load function.
Does it make sense to support accessing invalid pages? If not, maybe alter the Scala script to match this?
Ping @lintool here
I think the approach here works okay for this PR, but we should open an issue to address the difference in approaches, as it might be confusing.
Yeah, my rationale was that .keepValidPages() was always being called after loading the archive, so I might as well make RecordLoader do it automatically. That being said, I can remove this if it's undesirable and revert to how it's done in Scala.
Assuming @ruebot is okay with it, I think it makes sense to move it to the RecordLoader as well. I think that still means you need an invalid parameter check, (path, sc, invalid), instead of including the .keepValidPages() call.
Also - I suggest letting me fix the RecordLoader because it's probably going to break a whole bunch of unit tests. I'll get on it Sat, Sun or Mon.
I'd rather not do that... let's get @MapleOx's PR merged as soon as reasonably possible. We can start issues on what needs to be done next, and attach them directly to this PR.
Yes. @MapleOx did a great job, and this is more of a code refactor having nothing directly to do with the work.
I'll create an issue and branch off until the PR is merged.
if __name__ == "__main__":
    # replace with your own path to archive file
    path = "/Users/Prince/Projects/pyaut/aut/example.arc.gz"
Can't remember if there is an equivalent to the Resource Scala library for Python (Resource has a tool that finds paths in the package). It would be better for unit testing / coverage. If not, we could also move this script to aut-docs as we did with the Scala scripts.
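For what it's worth, the Python standard library has a rough equivalent: importlib.resources (Python 3.7+) can resolve filesystem paths to data files bundled inside a package. A sketch, with a hypothetical package and file name:

    from importlib import resources

    # Resolve a path to a test fixture shipped inside the package.
    with resources.path("aut.data", "example.arc.gz") as p:
        path = str(p)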
Yeah, the scripts were just to give some examples of how to use AUT in PySpark, and it doesn't really matter where the scripts are located. I just put them under aut for convenience.
Yes, I would push this down to the loader. Add an option to keep only valid pages, which is always true by default. So load('path') would be the default form, which is really load('path', true); the user can suppress with load('path', false) if they really want all the crap...
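A minimal sketch of that shape on the Python side, assuming this PR's RecordLoader and keepValidPages wrappers (keep_valid is an assumed parameter name):

    def load(path, sc, keep_valid=True):
        # load('path') is really load('path', keep_valid=True); pass
        # keep_valid=False to get everything, invalid pages included.
        rdd = RecordLoader.loadArchives(path, sc)
        return rdd.keepValidPages() if keep_valid else rdd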
@MapleOx can you clean up the TODOs and commented-out code, and we can move forward with merging. w/r/t some of the TODOs, it might make sense to create issues for them. Please don't hesitate to do that.
I'm also looking for feedback on this still:
If it is necessary, we should probably have a discussion about keeping things in sync.
@MapleOx let's keep this comment going in the main thread, not on code that has already been removed.
I guess there are two parts to it:
- The example scripts should probably not be in this repo; I think they'd be more useful over in aut-docs.
- The matchbox functions: is the intention to redo all of those in Python? If we do, then I think we (@lintool, @ianmilligan1, @greebie) need to have a discussion about maintaining both sets.
@ruebot Okay, I'll remove the example scripts from this PR. For the matchbox functions that are used in RDD transformations, I rewrote them in Python because I was unable to call the Scala versions straightforwardly. More details: I originally tried to just call them from Scala, like this:
However, this doesn't work if we call
because of this:
I wasn't able to think of a way around this, so I just decided to write Python versions of the RDD transformation matchbox functions -- let me know if you guys have a better solution! Now, for the matchbox functions that are not used in RDD transformations, like
I think the issue is that RDDs do not support nested Spark transformations. So you can create a DataFrame and then run transformations, but you cannot run transformations and create a DataFrame out of it. Or something like that; I don't understand it fully. The example they give is trying to create two RDDs and use one RDD to transform the other. That leads to too many issues in large datasets, I suppose.
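For the rewrite approach, a minimal sketch of one pure-Python matchbox function (Python 3 standard library only, so it is safe inside RDD transformations because workers never call back into the JVM):

    from urllib.parse import urlparse

    def ExtractDomain(url):
        # Pure-Python stand-in for the Scala matchbox ExtractDomain.
        try:
            return urlparse(url).netloc
        except (ValueError, AttributeError):
            return ""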
@greebie will test by taking our raw scripts that work on our old RDDs and moving them to PySpark DataFrames. Then I think we're good to merge. 👍
Better format to create the zip, from:
Or some variety of that.
Slightly better zipping script:
FWIW, I have been running it with a vanilla Jupyter notebook by running
from the
This is going to need to be updated with the most recent commits again.
- Have DFTransformations & RowTransformations call DetectLanguage instead of detect to avoid errors on null values.
Latest pull request fixes bugs I found in keepLanguages() while testing.
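A sketch of what the null-safe path might look like, assuming the DetectLanguage helper discussed above and the contentString column from the earlier excerpts (the real keepLanguages signature may differ):

    from pyspark.sql.functions import udf
    from pyspark.sql.types import BooleanType

    def keepLanguages(df, languages):
        # Guard against null content before it ever reaches langdetect.
        keep = udf(lambda s: DetectLanguage(s or "") in languages, BooleanType())
        return df.filter(keep(df['contentString']))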
Needs to be updated again now that #119 was merged.
@ianmilligan1, @greebie, or @MapleOx: I'm going to close this PR. I've created a branch here. Can you make a new PR against that? Then I'll merge that. After that, y'all can feel free to work and push to that branch as you please. When we're ready to move it into master, we can do another PR.
GitHub issue(s): #12, #13
What does this Pull Request do?
Adds PySpark support for the core functionality of the Archives Unleashed Toolkit. Added:
- keepValidPages and keepImages, which mimic the RDD transformations in AUT Scala,
- ExtractLinks and RemoveHTML.
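A hypothetical usage sketch of the pieces listed above; the import path and loader signature are assumptions pieced together from the file excerpts in this thread, not confirmed API:

    # Assumes the pyaut zip (built per the steps below) is on the PYTHONPATH.
    from DFTransformations import keepValidPages, keepImages

    df = loadArchives(path, sc, spark)  # loader shown earlier in this thread
    valid_pages = keepValidPages(df)
    images = keepImages(df)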
How should this be tested?
First, run mvn clean package to rebuild AUT. After that, you can either run from the PySpark shell or submit your own script.
To run from the PySpark shell:
1. cd into the aut/ directory.
2. Run zip pyaut src/main/python/*.py to make a zip of the Python files.

To run a script:
1. cd into the aut/ directory.
2. Run zip pyaut src/main/python/*.py to make a zip of the Python files.

Two example scripts can be found in the
src/main/python/scripts directory.

Additional Notes:
The more complicated AUT Scala matchbox functions, like ExtractGraph and ExtractEntities, are not included in this PR. I plan to add them progressively.
The Python code uses BeautifulSoup4, a library for extracting data from HTML and XML.
Interested parties
@lintool @ruebot @ianmilligan1 @greebie