Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUGFIX] fix reusing a sparkSession using the force_reuse_spark_context keyword #3245

Closed
wants to merge 31 commits into from
Closed

[BUGFIX] fix reusing a sparkSession using the force_reuse_spark_context keyword #3245

wants to merge 31 commits into from

Conversation

mbakunze
Copy link
Contributor

@mbakunze mbakunze commented Aug 13, 2021

Changes proposed in this pull request

We are adding the force_reuse_spark_context and spark_config to the DatasourceConfigSchema in order to pass the values for these along. They seem to have previously been dropped when initiating a DataContext with a spark source.

This PR is based on PR #3126 (and would close it)- and there have been some discussions on Slack regarding this bug.

Context

The usage of force_reuse_spark_context is currently broken in a way that when using this keyword either via a .yml or via dynamically generated DataContext the sparkSession still gets stopped by GE. The ketywords were not actually passed to the part of the code that was responisble to get or create the sparkSession. This in turn makes it impossible to reuse a sparkSession and use an in-memory dataframe as input for GE. The ability to reuse the sparkSession is important for use in EMR and spark on kubernetes.

Checklist

  • My code follows the Great Expectations style guide
  • I have performed a self-review of my own code
  • I have added unit tests where applicable and made sure that new and existing tests are passing.

@netlify
Copy link

netlify bot commented Aug 13, 2021

👷 Deploy request for niobium-lead-7998 pending review.
Visit the deploys page to approve it

🔨 Explore the source changes: 1ac14a7

@mbakunze mbakunze marked this pull request as draft August 13, 2021 16:18
@mbakunze
Copy link
Contributor Author

@gipaetusb I think the issue with the test came from the missing=False entry which created default args for many tests that they could not (and should not) handle. I am too tired to wait for the tests to finish rn though :)

@mbakunze
Copy link
Contributor Author

https://dev.azure.com/great-expectations/great_expectations/_build/results?buildId=14175&view=logs&j=0be527af-e1cb-5224-1725-2baa5917c313&t=096d6768-99a4-5ec0-6a65-65122b15571c&l=306

Cannot initialize datasource my_bigquery_datasource, error: File $(gcp_authkey.secureFilePath) was not found. And it looks like this might be caused by not being able to access this within a PR? 🤷‍♂️

@mbakunze
Copy link
Contributor Author

Looks like other PRs have the same issue (e.g. #3245). I think this is due to being a PR as stated 👆. Would be great to get some help to resolve that. I will try to add some unittests to avoid regression on the stuff from this PR.

@talagluck
Copy link
Contributor

Thanks for this, @mbakunze ! I think you are right, and that these failing tests are unrelated to this PR. I've set the tests to re-run, and if they still fail, we'll figure out what's wrong.

How would it make the most sense to merge this with #3126? Would it make sense for you to attempt to merge this into that branch instead?

@mbakunze
Copy link
Contributor Author

Hey @talagluck - I will try to coordinate with @gipaetusb as he did the real work. I just want to add a unittest and will then figure that out.

@mbakunze
Copy link
Contributor Author

@gipaetusb I propose you merge this into your PR and we close this PR here. Maybe we should also re-label this as a bug fix, which imo is a better fit.

@talagluck
Copy link
Contributor

@mbakunze - do you want to try and merge into @gipaetusb's branch instead of develop? Also, just a heads-up, it looks like tests are failing here.

@mbakunze
Copy link
Contributor Author

@talagluck - just tried to create a PR in @gipaetusb 's branch w/o any luck. Looks like I do not have access to do that? 🤷‍♂️

FWIW: I am not fuzzed about how to get the changes in - I can also spend some time tomorrow to bring this PR here into shape.

@mbakunze mbakunze changed the title Maintenance/reuse spark context emr [BUGFIX] fix reusing a sparkSession using the force_reuse_spark_context keyword Aug 16, 2021
@mbakunze mbakunze marked this pull request as ready for review August 17, 2021 05:24
@mbakunze
Copy link
Contributor Author

@talagluck this is ready for review

@mbakunze
Copy link
Contributor Author

Had another quick look at the unittest and modified them to explicitly test the pass through of force_reuse_spark_context and spark_config

@pasmavie
Copy link
Contributor

Everyone, I've merged these changes into #3126 as proposed here

@mbakunze
Copy link
Contributor Author

mbakunze commented Aug 20, 2021

Closing this PR then!

@mbakunze mbakunze closed this Aug 20, 2021
@talagluck
Copy link
Contributor

Thank you, @mbakunze and @gipaetusb ! Moving conversation to #3126

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants