feat: simple filter transformer nc #36

Merged: 1 commit, May 24, 2023

LauJohansson commented May 23, 2023

Transformer that applies a simple filter on one column for one value.
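
The PR's code is not shown in this conversation. As a rough sketch of the idea only, the core logic might look like the following. The names `SimpleFilterSketch`, `col_name` and `col_value` are hypothetical, and the spetlr `TransformerNC` base class is left out so the snippet runs without Spark installed. It relies only on the PySpark behaviour that `df[name]` returns a column and `==` builds a comparison expression:

```python
from typing import Any


class SimpleFilterSketch:
    """Keep only rows where one column equals one value.

    Hypothetical stand-in: in spetlr this would presumably subclass
    TransformerNC; the class and parameter names are assumptions.
    """

    def __init__(self, col_name: str, col_value: Any):
        self.col_name = col_name
        self.col_value = col_value

    def process(self, df):
        # With a PySpark DataFrame, df[name] is a Column and
        # `Column == value` builds a boolean Column expression.
        condition = df[self.col_name] == self.col_value
        return df.filter(condition)
```

With a real DataFrame, something like `SimpleFilterSketch("country", "DK").process(df)` would keep only the rows whose `country` column equals `"DK"` (column name and value are made up for illustration).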


MartinBoge commented May 23, 2023

We could also have a transformer that accepts a custom filter statement to make it more flexible.

LauJohansson (author) commented

> We could also have a transformer that accepts a custom filter statement to make it more flexible.

Good point. I think it is a balance of how much customization a transformer should accept as input.

Should we have a simple filter transformer like this, plus a transformer that takes a filter string as input?

Or should they be merged?

LauJohansson temporarily deployed to azure with GitHub Actions (inactive), May 24, 2023 06:51
LauJohansson requested a review from MartinBoge, May 24, 2023 07:14

MartinBoge commented May 24, 2023

> > We could also have a transformer that accepts a custom filter statement to make it more flexible.
>
> Good point. I think it is a balance of how much customization a transformer should accept as input.
>
> Should we have a simple filter transformer like this, plus a transformer that takes a filter string as input?
>
> Or should they be merged?

My 2c is that we should have a single simple transformer that accepts an argument "query: str", because it gives more flexibility. It would probably be less confusing if there is only one transformer.
For example:

from typing import List, Optional, Union

from pyspark.sql import DataFrame
from spetlr.etl import TransformerNC


class QueryFilterTransformer(TransformerNC):
    """Filter the input DataFrame with a SQL expression string."""

    def __init__(
        self,
        query: str,
        dataset_input_keys: Optional[Union[str, List[str]]] = None,
        dataset_output_key: Optional[str] = None,
    ):
        super().__init__(
            dataset_input_keys=dataset_input_keys,
            dataset_output_key=dataset_output_key,
        )
        # SQL filter expression that is passed on to DataFrame.filter
        self.query = query

    def process(self, df: DataFrame) -> DataFrame:
        print("Filter transformer on sql query")
        # DataFrame.filter accepts a SQL expression string as its condition
        return df.filter(self.query)

On the other hand, this might be harder to test.
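
One way to soften the testing concern: the transformer only forwards the query string to `DataFrame.filter`, so a unit test can check that hand-off with a mock and leave the SQL semantics to Spark. A minimal sketch, assuming a hypothetical stand-in class `QueryFilterSketch` that mirrors the `process` method above but omits the spetlr base class, so it runs without Spark:

```python
from unittest.mock import MagicMock


class QueryFilterSketch:
    """Stand-in mirroring QueryFilterTransformer.process.

    Assumption: the spetlr base class is omitted so that this
    example runs without Spark or spetlr installed.
    """

    def __init__(self, query: str):
        self.query = query

    def process(self, df):
        return df.filter(self.query)


def test_process_forwards_query() -> None:
    df = MagicMock()  # plays the role of a pyspark DataFrame
    result = QueryFilterSketch("amount > 100").process(df)

    # The query string is passed to filter() unchanged, exactly once.
    df.filter.assert_called_once_with("amount > 100")
    # Whatever filter() returns is what process() returns.
    assert result is df.filter.return_value


test_process_forwards_query()
```

This does not verify that a given query selects the right rows; that would still need a Spark session and a small test DataFrame.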

spetlr-org deleted a comment from Okami1, May 24, 2023
LauJohansson merged commit 72dcd50 into main, May 24, 2023
LauJohansson deleted the feature/dataframefilternc branch, May 24, 2023 08:39