Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reference Hub not updated during incremental load in Snowflake #128

Open
OGrohmann opened this issue Sep 15, 2023 · 10 comments
Open

Reference Hub not updated during incremental load in Snowflake #128

OGrohmann opened this issue Sep 15, 2023 · 10 comments
Labels
bug Something isn't working stale

Comments

@OGrohmann
Copy link

We are using the ref_hub macro in an incremental model in our Snowflake environment to build our reference tables. We have observed, that the initial load for the reference hub works fine, however, during the incremental loads new ref hub keys are not being added. The SQL syntax looks ok, however it seems not to work with Snowflake.
This issue is only observed in the reference hub model. The ref satellite is being correctly updated.
However, as a consequence new entries are missing in the final reference table, as the new hub entries are missing.
Steps to reproduce:

  1. Create new reference hub, example:

{{ config(materialized='incremental') }}

{%- set yaml_metadata -%}
source_models: staging_model
ref_keys:
- KEY1
- KEY2
{%- endset -%}

{% set metadata_dict = fromyaml(yaml_metadata) %}

{{ datavault4dbt.ref_hub(source_models=metadata_dict['source_models'],
ref_keys=metadata_dict['ref_keys']) }}

  1. Build the model
  2. Add new entries in the source with new key combination
  3. Re-run model to trigger incremental update
  4. Observe that new entries have not been added.
@bschlottfeldt
Copy link
Contributor

Hi @OGrohmann , thank you for reporting the issue, gonna test here on my side and let you know what I find.
Kind regards,
Barbara

@bschlottfeldt bschlottfeldt self-assigned this Sep 15, 2023
@bschlottfeldt bschlottfeldt added the bug Something isn't working label Sep 15, 2023
@bschlottfeldt
Copy link
Contributor

bschlottfeldt commented Sep 20, 2023

Hi @OGrohmann just tested both the initial load and incremental load here for the reference hub and it works fine here (also using Snowflake adapter).

My guess is the Load Datetimestamp for your records in your stage that you want to add to your ref hub are either earlier or equal to the Load Datetimestamp already in your Reference Hub.
Please make sure that the Load Datetimestamp in your stage for the new records is greater than the ones already in your Reference Hub, otherwise the incremental logic wont add your records.
Let me know if this is the case and if it solves your issue.

Kind regards,
Barbara

@OGrohmann
Copy link
Author

OGrohmann commented Oct 16, 2023

Hi @bschlottfeldt ,
sorry for the late feedback. We tried to provide an example to reproduce the issue and were quite surprised that we couldn't reproduce it in our artificial example. Finally we found it out how to reproduce. In your stage you have to assign the data type "TEXT" to one of your reference key columns. This is a synonymous for STRING or VARCHAR in Snowflake. However, the ghost record value for TEXT is NULL, whereas for VARCHAR it will be (unknown)/(error). Seems like this is causing the issue for incremental loads. However, it can be easily fixed by a workaround, i.e. replacing the TEXT data type with VARCHAR.
Can you please check and confirm the issue?
Thanks, Oliver.

@bschlottfeldt
Copy link
Contributor

@OGrohmann hi Oliver, i will test it next week, sorry about the late response. Thanks for pointing out the issue.
Kind regards,
Barbara

@bschlottfeldt bschlottfeldt removed their assignment Jan 22, 2024
@bschlottfeldt
Copy link
Contributor

@OGrohmann i can confirm the problem happens as you have described, this part of the compiled code in the incremental run of ref_hub macro wont return any results if one of the keys in the reference hub was inserted as null. Since the ghost record default datatype for TEXT in Snowflake was defined by our macro as NULL, it inserts null as the reference key for the ghost record. The presence of a null in the query of distinct_target_ref_keys the filter for NOT IN wont work, it will produce no results, therefore no records will be inserted in the incremental run, even though there are new records. FYI @tkirschke @thoffmann-sf this should be prevented in the ref_hub macro by adding a where refkey is not null clause in the distinct_target_ref_keys. and also should be fixed by defining the default value for ghost record of datatype TEXT as '(unknown)' for the zero key and '(error)' for the error key.

records_to_insert AS (
    SELECT
        refkey,
        ldts,
        rsrc
    FROM earliest_ref_key_over_all_sources
    WHERE CONCAT_WS('||', refkey) NOT IN (SELECT * FROM distinct_target_ref_keys)
)

Kind regards,
Barbara

@tkirschke
Copy link
Member

Hi @OGrohmann ,

I have added a Branch to this Issue where I tried to fix the problem, by excluding NULL refkeys, and defining the default value for TEXT columns, as @bschlottfeldt suggested.

Can you check if the fix works for you?

Best regards,
Tim

@tkiehn tkiehn linked a pull request Jun 21, 2024 that will close this issue
Copy link

github-actions bot commented Jul 8, 2024

This issue is stale because it has been open for 90 days with no activity. If you would like the issue to remain open, please remove the stale label or comment on the issue. Otherwise it will be closed in 14 days

@github-actions github-actions bot added the stale label Jul 8, 2024
@OGrohmann
Copy link
Author

Hello @tkirschke , can you please help me how to specifiy the correct branch in my packages? Thanks!

@tkiehn
Copy link
Collaborator

tkiehn commented Jul 16, 2024

Hi @OGrohmann,

to specify the branch you have to set the following in your packages.yml:

packages:
  - git: "https://github.com/ScalefreeCOM/datavault4dbt.git"
    revision: 'refkey-not-null-patch'

Best Regards,
Theo

@haswaniismail
Copy link

haswaniismail commented Aug 6, 2024

Hi @tkiehn / @tkirschke,

I have tested the fix from the branch but seems like the issue is still persists. From the logs, I can see its excluding records from stage layer where both refkeys are NULL [ WHERE NOT ( KEY1 IS NULL AND KEY2 IS NULL ) ], while for our case only KEY2 is declared as "TEXT" in stage model (tested the solution works if both our refkeys are of type "TEXT").

WITH

    distinct_target_ref_keys AS (

        SELECT
            CONCAT(KEY1,'||',KEY2
)
        FROM REF_HUB_MODEL

    ),
src_new_1 AS (

        SELECT
            KEY1,
            KEY2,
            ldts,
            rsrc
        FROM STAGE_MODEL src
        WHERE NOT (
            KEY1 IS NULL AND KEY2 IS NULL)

    ),

earliest_ref_key_over_all_sources AS (
    SELECT
        lcte.*
    FROM src_new_1 AS lcte

    QUALIFY ROW_NUMBER() OVER (PARTITION BY KEY1,  KEY2 ORDER BY ldts) = 1),

records_to_insert AS (
    SELECT
        
        KEY1,
        KEY2,
        ldts,
        rsrc
    FROM earliest_ref_key_over_all_sources
    WHERE CONCAT(KEY1,'||',KEY2
) NOT IN (SELECT * FROM distinct_target_ref_keys)
    )

SELECT * FROM records_to_insert

Please advise on this. Thank you!

Best Regards,
-Haswani

@tkiehn tkiehn removed a link to a pull request Aug 23, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working stale
Projects
None yet
Development

No branches or pull requests

5 participants