Skip to content

The allowed_deserialization_classes regex is broken #34093

@barrysteyn

Description

@barrysteyn

Apache Airflow version

Other Airflow 2 version (please specify below)

What happened

In #28829, the allowed_deserialization_classes regex was made more intuitive. It's aim (I guess) was to ensure that a period . be treated as a period and not as the wildcard regex character. The justification (given in #28829) was the following:

Typically someone would like to allow any classes below 'mymodule' to match. For example, 'mymodule.dataclasses' by setting allowed_deserialization_classes to 'mymodule.*'. However this matches everything starting with mymodule, so also mymodulemalicious.'

To correct the above, the following substitution was provided to each space separated string provided by allowed_deserialization_classes:

re.sub(r"(\w)\.", r"\1\..", p)

Now, assume I have a class that I want to be serialized called this.is.an.example. Here is the results of the following inputs to allowed_deserialization_classes

  • this.is.an.example --> This would not match (note it is the identical class name). The regular expression would change it to this\\..is\\..an\\..example. Which means that there is an extra wildcard . that would need be matched.
  • this.is.an* --> This would not match. The regular expression would change it to this\\..is\\..an* which suffer from the problem in the first example.
  • this.* --> This would match (note that this example is given in the unit tests for Make allowed_deserialization_classes more intuitive #28829). The regular expression would would change it to this\\..*. It will match because it will look for the word this followed by . and then the wildcard character with the wildstar character (which will match zero or all of anything). To note that this is THE only way to make things match without altering the input string.

In order to make a match for this.is.an.example, I need to put as input in allowed_deserialization_classes this[.]is[.]an[.]example:

  • The above input would be left alone by the regex expression substitution.
  • It also does what the author wants (which is to match the period .)
  • To note that one can also just escape the . by doing this\.is\.an\.example which would have the identical result.

In fact, any correct regex would match (even this[.]is[.]an*) as long as re.sub(r"(\w)\.", r"\1\..", p) does NOT alter things.

What I think should happen

It is dubious to alter a regex in the first place. Airflow is aimed at technical people, and as such, proper documentation would suffice. There I suggest the following:

  1. Remove the re.sub(r"(\w)\.", r"\1\..", p)
  2. Correct the documentation and warn people to escape the wildcard character.
  3. If the intention is to make it easier by altering the regex, then we need to have a serious chat about what the regex should be. Since we will be trying to guess what the user means. For example, hello.* obviously means match anything after hello, and hello.world.* means match the first period as a period, but the second one as a wildcard... You see how tough this is. Because regex can have multiple reserved symbols. A better way to do it would be look forward. Maybe something like this (although regex experts, please weigh in): re.sub(r"\.(\w)", r"\\.\1", p) (this will only escape a period if a word started directly afterwards).

How to reproduce

I reproduced by copying the logic and putting it in a test task. I did the following:

@dag(
    dag_id="test-dag-used-to-test",
    start_date=datetime(2021, 1, 1),  # start sometime in the past
    catchup=False,  # important, since we started in the past
)
def test_dag():
    @task
    def test_match():
        input = 'this.is.an.example'  # Note that is the exact class name
        p = re.sub(r"(\w)\.", r"\1\..", input)
        result = any(p.match(input))  # Returns false

    test_match()


test_dag()

Operating System

MacOS

Versions of Apache Airflow Providers

The change was introduced in #28829 which was from AirFlow 2.6.0

Deployment

Amazon (AWS) MWAA

Deployment details

No response

Anything else

No response

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Code of Conduct

Metadata

Metadata

Assignees

No one assigned

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions