Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[low code connectors] perform schema validation of the input config against the declarative language schema #15543

Merged
merged 16 commits into from
Aug 18, 2022

Conversation

brianjlai
Copy link
Contributor

@brianjlai brianjlai commented Aug 11, 2022

What

This PR adds schema validation of the declarative language schema against a connector's YAML file implementation. At a high level, we are perform the schema validation at each level of component creation. So the creation of the schema for a given class and validation integrated into the creation flow which is then controlled with a boolean instantiate. For validation we pass False and for the actual existing creation flow, we pass True.

How

Because we're leveraging the existing create() flow, we do not have to perform a complex traversal of object relationships. Instead we traverse the object like we already do during creation, however, we have an additional flow in factory.py where instantiate is false, to perform the validation. We do this for each component of the YAML as well as at the top level which doesn't use the factory.

The general flow is made up of expanding a components interface into concrete classes, generating the json schema, performing the validation against the component definition arguments.

The more interesting bit is how we perform this expansion of interfaces which is critical to validating correctness

  1. For a given dataclass, get all of its fields
  2. Unpack the field which expands all interfaces. This can be complex because we need to retain the same object structure of any parent generic structures.
  • If a class is not contained in a generic and is declarative, we attempt to get it's subclasses
  • If it has no subclasses or is a non-declarative type we simply return
  • For classes contained within a generic (or multiple ones) we get the underlying type and recursive unpack
  • Depending on the generic type we either package it back up into it's respective type
  1. Replace the annotations of the current dataclass with the expanded interface

note: This accounts for Optional (which is a union of type and none), lists, and Unions. So far in our schema i didn't see any cases for mappings that contained internal interfaces that needed expansion

Recommended reading order

  1. factory.py
  2. yaml_declarative_source.py
  3. declarative_authenticator.py
  4. test_factory.py
  5. test_yaml_declarative_source.py
  6. I also changed various parts of the language that were missed on the initial dataclass refactor now that we are running validations

Example Error Validations:

Wrong type than expected:

extractor:
  type: DpathExtractor
selector:
  class_name: airbyte_cdk.sources.declarative.extractors.record_selector.RecordSelector
  record_filter:
    class_name: airbyte_cdk.sources.declarative.extractors.record_filter.RecordFilter
    condition: "{{ record['id'] > stream_state['id'] }}"
  extractor:
    $ref: "*ref(extractor)"
    field_pointer: 408

Error:

instance = {'config': {'apikey': 'verysecrettoken', 'repos': ['airbyte', 'airbyte-cloud']}, 'limit_option': {'$options': {}, 'con...ions': {}, 'config': {'apikey': 'verysecrettoken', 'repos': ['airbyte', 'airbyte-cloud']}, 'inject_into': 'path'}, ...}
schema = {'$schema': 'http://json-schema.org/draft-06/schema#', 'definitions': {'CursorPaginationStrategy': {'description': '\n...$ref': '#/definitions/JsonDecoder', 'default': {}}, 'limit_option': {'$ref': '#/definitions/RequestOption'}, ...}, ...}
cls = <class 'jsonschema.validators.create.<locals>.Validator'>, args = ()
kwargs = {}
validator = <jsonschema.validators.create.<locals>.Validator object at 0x1049ef880>
error = <ValidationError: "408 is not of type 'array'">

Missing a required field:

stream_slicer:
  type: DatetimeStreamSlicer
  $options:
    datetime_format: "%Y-%m-%dT%H:%M:%S.%f%z"
  start_datetime:
    type: MinMaxDatetime
    datetime: "{{ config['start_time'] }}"
    min_datetime: "{{ config['start_time'] + day_delta(2) }}"
  end_datetime: "{{ config['end_time'] }}"
  cursor_field: "created"
  lookback_window: "5d"
  start_time_option:
    inject_into: request_parameter
    field_name: created[gte]

Error:

instance = {'config': {'apikey': 'verysecrettoken', 'repos': ['airbyte', 'airbyte-cloud']}, 'limit_option': {'$options': {}, 'con...ions': {}, 'config': {'apikey': 'verysecrettoken', 'repos': ['airbyte', 'airbyte-cloud']}, 'inject_into': 'path'}, ...}
schema = {'$schema': 'http://json-schema.org/draft-06/schema#', 'definitions': {'CursorPaginationStrategy': {'description': '\n...$ref': '#/definitions/JsonDecoder', 'default': {}}, 'limit_option': {'$ref': '#/definitions/RequestOption'}, ...}, ...}
cls = <class 'jsonschema.validators.create.<locals>.Validator'>, args = ()
kwargs = {}
validator = <jsonschema.validators.create.<locals>.Validator object at 0x107ca6280>
error = <ValidationError: "'step' is a required property">

Wrong type passed for a reference type (this one is interesting because validator doesn't know how to compare literal types, but it identifies that the passed in type is missing a field (cursor_value) that is expected on classes that implement pagination_strategy.

paginator:
  type: "LimitPaginator"
  page_size: 10
  url_base: "https://airbyte.io"
  limit_option:
    inject_into: request_parameter
    field_name: page_size
  page_token_option:
    inject_into: path
  pagination_strategy:
    type: "MinMaxDatetime"
    datetime: "{{ response._metadata.next }}"

Error:

instance = {'config': {'apikey': 'verysecrettoken', 'repos': ['airbyte', 'airbyte-cloud']}, 'limit_option': {'$options': {}, 'con...ions': {}, 'config': {'apikey': 'verysecrettoken', 'repos': ['airbyte', 'airbyte-cloud']}, 'inject_into': 'path'}, ...}
schema = {'$schema': 'http://json-schema.org/draft-06/schema#', 'definitions': {'CursorPaginationStrategy': {'description': '\n...$ref': '#/definitions/JsonDecoder', 'default': {}}, 'limit_option': {'$ref': '#/definitions/RequestOption'}, ...}, ...}
cls = <class 'jsonschema.validators.create.<locals>.Validator'>, args = ()
kwargs = {}
validator = <jsonschema.validators.create.<locals>.Validator object at 0x106768c40>
error = <ValidationError: "'cursor_value' is a required property">

The errors are bit more verbose

@github-actions github-actions bot added the CDK Connector Development Kit label Aug 11, 2022
@brianjlai brianjlai changed the title draft: first pass at generating the complete schema and performing schema validation in factory [low code connectors] perform schema validation of the input config against the declarative language schema Aug 14, 2022

# Leaving this test here to document a limitation of the validator. Decoder has no meaningful fields to validate on so it accepts
# the MinMaxDatetime despite being the wrong type
def test_validation_wrong_object_type():
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a limitation of the validator itself


# This test should fail because the extractor doesn't match the Array of resolved classes. However, despite the schema being correct
# validation passes. Leaving this here to document it and revisit at another time. This is another validator limitation.
def test_validate_types_nested_in_list():
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've actually tested this by explicitly replacing the Interface in our code with a union of relevant types, but interestingly the validator still couldn't pick up on it.



@dataclass
class DeclarativeAuthenticator(ABC):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This abstraction is needed so that we only attempt to expand certain authenticators during validation. It is a little bit of a concern that schema generation is coupled very closely to our exact object model which could get a little unwieldly in the future.

else:
# Because the component's data fields definitions use interfaces, we need to resolve the underlying types into the
# concrete classes that implement the interface before generating the schema
DeclarativeComponentFactory._transform_interface_to_union(class_)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interestingly, even though we make a copy of the class and replace those annotations, it does seem to modifying the incoming class argument. I tried returning the relevant class and the way it is currently shown, to the same effect of expansion. However, if I comment this line out, the expansion will not go through (which is expected)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is almost certainly because the fields don't get copied.

Looking at dataclasses.fields, it seems like they are in the "dataclass_fields" field of the class' __dict__.

They don't get copied because we don't do a deepcopy of the dict here.

That being said, I don't know if there's an easy way to copy them because mappingproxys can't be serialized.

I doubt this matters much...

copy_cls.__annotations__[field.name] = unpacked_field_types

@staticmethod
def unpack(field_type: type):
Copy link
Contributor Author

@brianjlai brianjlai Aug 15, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is where it gets particularly fun. This whole recursive function is necessary to retain the existing structure of any declarative components that must be expanded, without losing the shape of the container they are in. We basically retrieve the underlying type and use that as the next recursive call. And once we have unpacked the underlying classes, we package it back into its original container. I added a handful of tests to verify this was where the subclass replacement became less than trivial.

Happy to discuss this part more since this was something I realized we needed midway through.

@brianjlai brianjlai marked this pull request as ready for review August 15, 2022 22:19
@brianjlai brianjlai requested review from a team and girarda August 15, 2022 22:19
@sherifnada
Copy link
Contributor

@brianjlai could you show some examples of the error message associated with an invalid YAML?

@brianjlai
Copy link
Contributor Author

yup i'll add some sample test cases and the validation message that comes back!

if self.request_options_provider is None:
self._request_options_provider = InterpolatedRequestOptionsProvider(config=self.config, options=options)
elif isinstance(self.request_options_provider, dict):
self._request_options_provider = InterpolatedRequestOptionsProvider(config=self.config, **self.request_options_provider)
else:
self._request_options_provider = self.request_options_provider
self.authenticator = self.authenticator or NoAuth()
self.authenticator = self.authenticator or BasicHttpAuthenticator("", config=self.config, options={})
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is BasicHttp the default?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah that's a good point. I think it makes sene to just have define a declarative NoAuth that implements the methods and returns empty strings. Will fix!


def _read_and_parse_yaml_file(self, path_to_yaml_file):
with open(path_to_yaml_file, "r") as f:
config_content = f.read()
return YamlParser().parse(config_content)

def _validate_source(self):
full_config = {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor: can we also check the "version" field we added last week?

Confirming its presence is enough for now. no need for additional validation

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah yes will add!

else:
# Because the component's data fields definitions use interfaces, we need to resolve the underlying types into the
# concrete classes that implement the interface before generating the schema
DeclarativeComponentFactory._transform_interface_to_union(class_)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is almost certainly because the fields don't get copied.

Looking at dataclasses.fields, it seems like they are in the "dataclass_fields" field of the class' __dict__.

They don't get copied because we don't do a deepcopy of the dict here.

That being said, I don't know if there's an easy way to copy them because mappingproxys can't be serialized.

I doubt this matters much...

@brianjlai
Copy link
Contributor Author

brianjlai commented Aug 18, 2022

/publish-cdk dry-run=true

🕑 https://github.com/airbytehq/airbyte/actions/runs/2884546299
https://github.com/airbytehq/airbyte/actions/runs/2884546299

@brianjlai
Copy link
Contributor Author

brianjlai commented Aug 18, 2022

/publish-cdk dry-run=false

🕑 https://github.com/airbytehq/airbyte/actions/runs/2884600096
https://github.com/airbytehq/airbyte/actions/runs/2884600096

@brianjlai
Copy link
Contributor Author

brianjlai commented Aug 18, 2022

/publish-cdk dry-run=true

🕑 https://github.com/airbytehq/airbyte/actions/runs/2884697466
https://github.com/airbytehq/airbyte/actions/runs/2884697466

@brianjlai
Copy link
Contributor Author

brianjlai commented Aug 18, 2022

/publish-cdk dry-run=false

🕑 https://github.com/airbytehq/airbyte/actions/runs/2884785622
https://github.com/airbytehq/airbyte/actions/runs/2884785622

@brianjlai brianjlai merged commit ca80d37 into master Aug 18, 2022
@brianjlai brianjlai deleted the brian/low_code_validate_schema branch August 18, 2022 19:29
rodireich pushed a commit that referenced this pull request Aug 25, 2022
…gainst the declarative language schema (#15543)

* draft: first pass at complete schema language generation and factory validator

* actually a working validator and fixes to the schema that went uncaught

* remove extra spike file

* fix formatting file

* pr feedback and a little bit of refactoring

* fix some types that were erroneously marked as invalid schema

* some comments

* add jsonschemamixin to interfaces

* update changelog

* bump version
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CDK Connector Development Kit
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants