Migration of all RDD functionality over to DataFrames #223
Comments
@jrwiebe I have a note here from our call to "Go through the scala dir and identify all the functions that take in RDD and do not take in DF, then create tickets for JWb." Do you want me to do that granular of a level, or do you want to use this issue to take care of it? |
This is fine. |
@ianmilligan1 @lintool does this look like the basic inventory?
I think we should just leave this as a "catch-all" issue, open. IMO, this should be driven by the documentation update - go through docs, everything that we do with RDDs, we make sure there's a corresponding DF code example. When the docs have everything in both RDD and DF, I think we're done. |
@SinghGursimran this one is tied to #372, and should become a lot clearer as to what needs to be done to close this one. I think we're pretty close here. |
- Extend GetExtensionMime, ExtractImageLinks, ComputeMD5, and ComputeSHA1 to DataFrames - Addresses #223
From @lintool
This spawned from a Slack convo Jimmy and I had in a non-public channel that never made it to the ticket. I have a branch working through this now. |
- Add `all()` DataFrame method - Refactor fixity DataFrame UDFs - Add ComputeImageSize UDF - Add Python implementation of `all()` - Addresses #223
- Add tests for ExtractPopularImagesDF - Rename ExtractPopularImages to ExtractPopularImagesRDD - Addresses #223
- Add DetectLanguageDF - Add ExtractBoilerpipeTextDF - Add ExtractDateDF - Update tests - Rename existing ExtractDate, ExtractBoilerpipeText, DetectLanguage udfs by appending RDD - Partially addresses #223
- Add keepValidPagesDF - Add HTTP status code column to all() - Add test for keepValidPagesDF - Addresses #223
- Partially addresses #223 - Add discardContentDF - Add discardUrlPatternsDF - Add discardLanguagesDF - Add keepImagesDF - Add keepContentDF - Add keepUrlPatternsDF - Add keepLanguagesDF - Update tests
@SinghGursimran we still need to do |
- Addresses #32 - Addresses archivesunleashed/aut#223 - Addresses archivesunleashed/aut#401 - Addresses archivesunleashed/aut#396 - Addresses archivesunleashed/aut#399
@SinghGursimran nvm. I see it now :-) aut/src/main/scala/io/archivesunleashed/package.scala Lines 243 to 250 in bc0d663
- Resolves #32 - Addresses archivesunleashed/aut#223 - Addresses archivesunleashed/aut#401 - Addresses archivesunleashed/aut#396 - Addresses archivesunleashed/aut#399
Created three new issues that should cover the Python implementations of most of the work here. |
Merging #441 fully resolved this. |
🎉 |
We forgot about |
Oh, it's there Sorry for the confusion. 🤦♂️ |
We need to migrate all current RDD functionality over to DataFrames. This means porting all matchbox UDFs over to DF UDFs.
There are two possible ways to do this: we can simply take the matchbox UDFs and wrap them, or rewrite them from scratch. I suggest we revisit them one by one, which will give us an opportunity to refine each UDF into the one we actually want.
For example, the current RDD matchbox `ExtractDomain` is implemented a bit differently than the DF version we've been playing with: the DF version strips the `www` prefix, whereas the RDD impl doesn't. I like the newer implementation better, but I'm open to discussion.
Also, this is an issue we'll come across sooner or later:
https://stackoverflow.com/questions/33664991/spark-udf-initialization
I have a general question we need to look into from the performance perspective: what's the lifecycle of a Spark DF UDF? In particular, if there's initialization such as compiling a regexp, we don't want to do that over and over again for every row. Do we want to have an init stage?
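One common way to get that "init stage" is to keep the expensive state in a Scala `object`, so it is built once per JVM instead of once per row. Below is a minimal sketch in plain Scala, with no Spark dependency. `DomainSketch`, `extractDomain`, and the `stripWww` flag are hypothetical names used for illustration; this is not aut's actual `ExtractDomain` implementation.

```scala
import java.net.URI
import scala.util.Try
import scala.util.matching.Regex

// Hypothetical helper for illustration only; not aut's ExtractDomain.
object DomainSketch {
  // Compiled once, when the object is first accessed in a JVM,
  // rather than on every call.
  private val WwwPrefix: Regex = "^www\\.".r

  // Returns the host of `url`, optionally without a leading "www.".
  // Returns "" when the URL cannot be parsed or has no host.
  def extractDomain(url: String, stripWww: Boolean = true): String = {
    val host = Try(Option(new URI(url).getHost)).toOption.flatten.getOrElse("")
    if (stripWww) WwwPrefix.replaceFirstIn(host, "") else host
  }
}
```

Assuming this carries over to Spark as expected, it could be wrapped with `org.apache.spark.sql.functions.udf`, e.g. `udf((url: String) => DomainSketch.extractDomain(url))`. Because the closure only refers to the object, the regex should be initialized once per executor JVM rather than per record, though that is worth confirming with a benchmark. The `stripWww` flag is only there to make the RDD vs. DF behaviour difference explicit while we decide which one to keep.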
@TitusAn let's start developing in parallel the DF versions of the apps in #222 and try and work this out?