[MAINTENANCE] Add force_reuse_spark_context to DatasourceConfigSchema #2968

Closed
Changes from 1 commit
Commits (44)
- 7444e4d: Adds spark_config and force_reuse_spark_context fields to DataSourceC… (Jun 28, 2021)
- 0b7f812: Formats with black (Jun 29, 2021)
- eb07e45: Merge branch 'develop' into reuse-spark-context (talagluck, Jul 8, 2021)
- 6b0f065: Merge branch 'develop' into reuse-spark-context (talagluck, Jul 12, 2021)
- 1049199: Adds spark_config and force_reuse_spark_context fields to DataSourceC… (Jun 28, 2021)
- d2f27f5: Formats with black (Jun 29, 2021)
- 9b1e41f: [DOCS] How to connect to data on a filesystem using Spark guide (#2956) (Aylr, Jun 28, 2021)
- a0f5829: [MAINTENANCE]: Refactor ExpectationSuite to include profiler_config i… (cdkini, Jun 28, 2021)
- 7edc9a9: [BUGFIX]: Update mssql image version for Azure (#2969) (cdkini, Jun 29, 2021)
- ed1cade: [FEATURE]: Add citations to Profiler.profile() (#2966) (cdkini, Jun 29, 2021)
- f50d404: [DOCS] GDOC-102/GDOC-127 Port in References and Tutorials (#2963) (petermoyer, Jun 29, 2021)
- 2da61df: [DOCS] How to connect to a MySQL database (#2970) (Aylr, Jun 29, 2021)
- 61070f2: [DOCS] improved clarity in how to write guide templates and docs (#2971) (Aylr, Jun 30, 2021)
- 1c91d00: [DOCS] Add documentation for Rule Based Profilers (#2933) (talagluck, Jun 30, 2021)
- a832707: [FEATURE] Bootstrapped Range Parameter Builder (#2912) (alexsherstinsky, Jun 30, 2021)
- c58d5a4: 2021-06-30 release candidate (v0.13.21) (#2974) (Jun 30, 2021)
- 73367cd: [DOCS] add image zoom plugin (#2979) (Aylr, Jul 1, 2021)
- 1a62388: [MAINTENANCE] Attempt to fix Numpy and Scipy Version Requirements wit… (alexsherstinsky, Jul 2, 2021)
- 5636e0b: [BUGFIX] Modify read_excel() to handle new optional-dependency openpy… (Jul 2, 2021)
- d54c13c: [DOCS] Update rule-based profiler docs (#2987) (talagluck, Jul 2, 2021)
- 022271c: [BUGFIX] Fix bug in getting non-existent parameter (#2986) (alexsherstinsky, Jul 3, 2021)
- b8fded6: [MAINTENANCE] make citation cleaner in expectation suite (#2990) (alexsherstinsky, Jul 3, 2021)
- 2db3e0b: [MAINTENANCE] rephrase expectation suite meta profile comment (#2991) (alexsherstinsky, Jul 3, 2021)
- a7b8b22: [MAINTENANCE] Update v-0.12 CLI test to reflect Pandas upgrade to ve… (alexsherstinsky, Jul 6, 2021)
- 34ed61b: [MAINTENANCE] Remove "mostly" from "bobster" test config (#2996) (alexsherstinsky, Jul 6, 2021)
- 633666f: [DOCS]/GDOC-108/GDOC-143/Add in Contributing fields and updates (#2972) (petermoyer, Jul 6, 2021)
- e86e0e1: Adding a missing import to a documentation page (#2983) (rishabh-bhargava, Jul 6, 2021)
- e274f29: [MAINTENANCE] Instrument test_yaml_config() (#2981) (anthonyburdi, Jul 6, 2021)
- bfaef25: [BUGFIX] Improve support for dates for expect_column_distinct_values_… (xaniasd, Jul 6, 2021)
- 65de414: [BUGFIX] Fix issue where compression key was added to reader_method f… (jcampbell, Jul 6, 2021)
- ef36a46: [BUGFIX] [batch.py] fix check for null value (#2994) (MHAbido, Jul 6, 2021)
- 1ee0c7d: [Maintenance] update header to match GE.io (#2811) (kyleaton, Jul 7, 2021)
- 8c46126: [FEATURE] bootstrap estimator for NumericMetricRangeMultiBatchParame… (alexsherstinsky, Jul 7, 2021)
- f7a58f3: [FIX] Update naming of confidence_level in integration test fixture (… (cdkini, Jul 7, 2021)
- 8e1fb69: Adding in url links and style (#2999) (petermoyer, Jul 7, 2021)
- 9b18b80: [DOCS] Getting Started - Clean Up and Integration Tests (#2985) (Jul 7, 2021)
- 7fb01c0: [MAINTENANCE] fix lint issues for docusaurus (#3004) (alexsherstinsky, Jul 8, 2021)
- e29931e: [FEATURE] Port over guide for Slack notifications for validation acti… (cdkini, Jul 8, 2021)
- 8418aef: release-prep-2021-07-09 for release 0.13.22 (#3008) (allensallinger, Jul 9, 2021)
- cd6f267: [FEATURE] how to validate without checkpoint (#3013) (alexsherstinsky, Jul 9, 2021)
- 15c7d7d: [DOCS]: Port over "How to configure validation result store in Azure"… (cdkini, Jul 9, 2021)
- 76553bf: [DOCS]: Port over "How to instantiate a Data Context w/o YML" from R… (cdkini, Jul 9, 2021)
- a99b49b: Merge branch 'reuse-spark-context' of https://github.com/gipaetusb/gr… (Jul 13, 2021)
- 227511e: Merge branch 'develop' into reuse-spark-context (talagluck, Jul 13, 2021)
[DOCS]: Port over "How to instantiate a Data Context w/o YML" from RTD to Docusaurus (#3011)

* docs: Port over guide from RTD to Docusaurus
cdkini authored and gipaetusb committed Jul 13, 2021
commit 76553bf880a2644892257b2ad7da4f2a2d903166
@@ -1,5 +1,206 @@
---
title: How to instantiate a Data Context without a yml file
---
import Prerequisites from '../../connecting_to_your_data/components/prerequisites.jsx'

This guide will help you instantiate a Data Context without a yml file, i.e., configure a Data Context entirely in code. If you are working in an environment without easy access to a local filesystem (e.g. an AWS EMR Spark cluster, Databricks, etc.), you may wish to configure your Data Context in code, within your notebook or workflow tool (e.g. an Airflow DAG node).

<Prerequisites>

- [Followed the Getting Started tutorial and have a basic familiarity with the Great Expectations configuration](../../../tutorials/getting-started/intro.md)

</Prerequisites>

:::note
- See also our companion video for this guide: [Data Contexts In Code](https://youtu.be/4VMOYpjHNhM).
:::


Steps
-----

1. **Create a DataContextConfig**

The DataContextConfig holds all of the configuration parameters needed to build a DataContext. Defaults are provided to minimize configuration in typical cases, but every parameter is configurable and every default can be overridden. Note that DatasourceConfig has its own defaults, which can likewise be overridden.

Here we show a few common configurations using the ``store_backend_defaults`` parameter. You can continue with the existing API, without defaults, by omitting this parameter, and you can override all of the parameters explicitly, as shown in the last example. If both are used, a parameter set in ``DataContextConfig`` overrides the corresponding parameter set in ``store_backend_defaults``.
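As a minimal sketch of that precedence (assuming the override behavior described above; the bucket name is only illustrative), an explicitly passed ``data_docs_sites`` wins over the site that ``S3StoreBackendDefaults`` would otherwise create:

```python
from great_expectations.data_context.types.base import (
    DataContextConfig,
    S3StoreBackendDefaults,
)

# store_backend_defaults would normally configure an S3-backed Data Docs site,
# but the explicit (empty) data_docs_sites below takes precedence, so no
# Data Docs site is configured in the resulting DataContextConfig.
data_context_config = DataContextConfig(
    store_backend_defaults=S3StoreBackendDefaults(default_bucket_name="my_default_bucket"),
    data_docs_sites={},  # explicit value wins over the default from store_backend_defaults
)
```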

**TODO(cdkini): These are links to our API Reference, which has not yet been implemented. What should we do here?**
The following ``store_backend_defaults`` are currently available:
- :py:class:`~great_expectations.data_context.types.base.S3StoreBackendDefaults`
- :py:class:`~great_expectations.data_context.types.base.GCSStoreBackendDefaults`
- :py:class:`~great_expectations.data_context.types.base.DatabaseStoreBackendDefaults`
- :py:class:`~great_expectations.data_context.types.base.FilesystemStoreBackendDefaults`

The following example shows a Data Context configuration with a SQLAlchemy datasource and an AWS S3 bucket for all metadata stores, using default prefixes. Note that you can still substitute environment variables, as in the YAML-based configuration, to keep sensitive credentials out of your code.

```python
from great_expectations.data_context.types.base import (
    DataContextConfig,
    DatasourceConfig,
    S3StoreBackendDefaults,
)
from great_expectations.data_context import BaseDataContext

data_context_config = DataContextConfig(
    datasources={
        "my_sqlalchemy_datasource": DatasourceConfig(
            class_name="SqlAlchemyDatasource",
            credentials={
                "drivername": "custom_drivername",
                "host": "custom_host",
                "port": "custom_port",
                # Environment variable substitution keeps credentials out of code
                "username": "${USERNAME_FROM_ENVIRONMENT_VARIABLE}",
                "password": "${PASSWORD_FROM_ENVIRONMENT_VARIABLE}",
                "database": "custom_database",
            },
        )
    },
    # All metadata stores default to this S3 bucket, with default prefixes
    store_backend_defaults=S3StoreBackendDefaults(default_bucket_name="my_default_bucket"),
)
```

The following example shows a Data Context configuration with a Pandas datasource and local filesystem defaults for metadata stores. Imports are omitted in this and the remaining examples. You may add an optional ``root_directory`` parameter to set the base location for the Store Backends.

```python
data_context_config = DataContextConfig(
    datasources={
        "my_pandas_datasource": DatasourceConfig(
            class_name="PandasDatasource",
            batch_kwargs_generators={
                "subdir_reader": {
                    "class_name": "SubdirReaderBatchKwargsGenerator",
                    "base_directory": "/path/to/data",
                }
            },
        )
    },
    store_backend_defaults=FilesystemStoreBackendDefaults(root_directory="optional/absolute/path/for/stores"),
)
```

The following example shows a Data Context configuration with a SQLAlchemy datasource and two GCS buckets for metadata stores, using some custom and some default prefixes. Note that you can still substitute environment variables, as in the YAML-based configuration, to keep sensitive credentials out of your code. ``default_bucket_name`` and ``default_project_name`` set the default values for all stores that are not specified individually.

The resulting DataContextConfig creates an Expectations store and Data Docs using ``my_default_bucket`` and ``my_default_project``, since their bucket and project are not specified explicitly. The Validations store is created using the explicitly specified ``my_validations_bucket`` and ``my_validations_project``. Further, custom prefixes are set for the Expectations and Validations stores, while Data Docs uses the default ``data_docs`` prefix.

```python
data_context_config = DataContextConfig(
    datasources={
        "my_sqlalchemy_datasource": DatasourceConfig(
            class_name="SqlAlchemyDatasource",
            credentials={
                "drivername": "custom_drivername",
                "host": "custom_host",
                "port": "custom_port",
                "username": "${USERNAME_FROM_ENVIRONMENT_VARIABLE}",
                "password": "${PASSWORD_FROM_ENVIRONMENT_VARIABLE}",
                "database": "custom_database",
            },
        )
    },
    store_backend_defaults=GCSStoreBackendDefaults(
        default_bucket_name="my_default_bucket",
        default_project_name="my_default_project",
        validations_store_bucket_name="my_validations_bucket",
        validations_store_project_name="my_validations_project",
        validations_store_prefix="my_validations_store_prefix",
        expectations_store_prefix="my_expectations_store_prefix",
    ),
)
```

The following example sets overrides for many of the parameters available to you when creating a DataContextConfig and a Datasource.

```python
project_config = DataContextConfig(
    config_version=2,
    plugins_directory=None,
    config_variables_file_path=None,
    datasources={
        "my_spark_datasource": {
            "data_asset_type": {
                "class_name": "SparkDFDataset",
                "module_name": "great_expectations.dataset",
            },
            "class_name": "SparkDFDatasource",
            "module_name": "great_expectations.datasource",
            "batch_kwargs_generators": {},
        }
    },
    stores={
        "expectations_S3_store": {
            "class_name": "ExpectationsStore",
            "store_backend": {
                "class_name": "TupleS3StoreBackend",
                "bucket": "my_expectations_store_bucket",
                "prefix": "my_expectations_store_prefix",
            },
        },
        "validations_S3_store": {
            "class_name": "ValidationsStore",
            "store_backend": {
                "class_name": "TupleS3StoreBackend",
                "bucket": "my_validations_store_bucket",
                "prefix": "my_validations_store_prefix",
            },
        },
        "evaluation_parameter_store": {"class_name": "EvaluationParameterStore"},
    },
    expectations_store_name="expectations_S3_store",
    validations_store_name="validations_S3_store",
    evaluation_parameter_store_name="evaluation_parameter_store",
    data_docs_sites={
        "s3_site": {
            "class_name": "SiteBuilder",
            "store_backend": {
                "class_name": "TupleS3StoreBackend",
                "bucket": "my_data_docs_bucket",
                "prefix": "my_optional_data_docs_prefix",
            },
            "site_index_builder": {
                "class_name": "DefaultSiteIndexBuilder",
                "show_cta_footer": True,
            },
        }
    },
    validation_operators={
        "action_list_operator": {
            "class_name": "ActionListValidationOperator",
            "action_list": [
                {
                    "name": "store_validation_result",
                    "action": {"class_name": "StoreValidationResultAction"},
                },
                {
                    "name": "store_evaluation_params",
                    "action": {"class_name": "StoreEvaluationParametersAction"},
                },
                {
                    "name": "update_data_docs",
                    "action": {"class_name": "UpdateDataDocsAction"},
                },
            ],
        }
    },
    anonymous_usage_statistics={"enabled": True},
)
```
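Since this pull request adds ``spark_config`` and ``force_reuse_spark_context`` to ``DatasourceConfigSchema``, a Spark datasource can also be expressed with ``DatasourceConfig`` directly. The snippet below is a minimal sketch of that shape (the ``spark.master`` value and the stores root directory are only illustrative):

```python
from great_expectations.data_context.types.base import (
    DataContextConfig,
    DatasourceConfig,
    FilesystemStoreBackendDefaults,
)

data_context_config = DataContextConfig(
    datasources={
        "my_spark_datasource": DatasourceConfig(
            class_name="SparkDFDatasource",
            spark_config={"spark.master": "local[*]"},  # illustrative Spark settings
            force_reuse_spark_context=True,  # reuse an already-running Spark session (e.g. on EMR or Databricks)
        )
    },
    store_backend_defaults=FilesystemStoreBackendDefaults(
        root_directory="/absolute/path/for/stores"
    ),
)
```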


2. **Pass this DataContextConfig as a project_config to BaseDataContext**

```python
context = BaseDataContext(project_config=data_context_config)
```

3. **Use this BaseDataContext instance as your DataContext**

If you are using Airflow, you may wish to pass this Data Context to your GreatExpectationsOperator as a parameter. See the following guide for more details:

- [Deploying Great Expectations with Airflow](../../../../docs/intro.md) - **TODO(cdkini): Where do we link this?**
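Outside of Airflow, the in-code context behaves like any other Data Context. A minimal sketch (method names from the 0.13-era API; the suite name is illustrative):

```python
# Confirm the datasources from the in-code configuration are registered
print(context.list_datasources())

# Create a suite; it is persisted to the Expectations store configured above
context.create_expectation_suite("my_suite", overwrite_existing=True)

# Build Data Docs into the configured site(s)
context.build_data_docs()
```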


Additional resources
--------------------

- [How to instantiate a Data Context on an EMR Spark cluster](../../../deployment_patterns/how-to-instantiate-a-data-context-on-an-emr-spark-cluster.md)
- [How to instantiate a Data Context on Databricks Spark cluster](../../../deployment_patterns/how-to-instantiate-a-data-context-on-databricks-spark-cluster.md)

This article is a stub.