Skip to content

Conversation

@sh-rp
Copy link
Collaborator

@sh-rp sh-rp commented Apr 28, 2025

Description

We have some cryptic code in our pipeline load test utils. We needed this to abstract getting row counts and table contents across different destination types, but now we have the nice dataset which also supports filessystem, delta and iceberg, so we can probably just use that, this PR will attempt to remove a bunch of cold code.

Benefits:

  • Less code to maintain
  • dataset gets used more often and is tested more this way
  • dataset actually has tests ensuring that table counting works, the test utils did not.

This PR also fixes a few small issues.

@sh-rp sh-rp added the tech-debt Leftovers from previous work. Should be fixed over time label Apr 28, 2025
@netlify
Copy link

netlify bot commented Apr 28, 2025

Deploy Preview for dlt-hub-docs ready!

Name Link
🔨 Latest commit 9d6722d
🔍 Latest deploy log https://app.netlify.com/projects/dlt-hub-docs/deploys/683963247759b800088fc331
😎 Deploy Preview https://deploy-preview-2566--dlt-hub-docs.netlify.app
📱 Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify project configuration.

@sh-rp sh-rp force-pushed the tests/simplify-load-utils branch 4 times, most recently from 2e29f74 to d15b1fb Compare April 29, 2025 10:00
extra_info=bucket,
supports_merge=False,
file_format="parquet",
file_format="jsonl", # keep jsonl as default, test utils are setup for this
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note for review: I changed the default format to jsonl so we save more core in the test utils for the sftp special case, I didn't see any downside to this.

@abstractmethod
def get_open_table_location(self, table_format: TTableFormat, table_name: str) -> str:
"""Computes location in which table metadata is stored. Does not verify if table exists."""
def get_open_table_location(
Copy link
Collaborator Author

@sh-rp sh-rp Apr 29, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't really like this tuple return, but I explain it in the docstring and it is the only way to properly support all filepaths in FilesystemSqlClient, since the wildcards in duckdb only seem to work for full folders.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe we can add path separator at the end if this is a folder? I thought filesystem does that?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we already implement this interface for pyiceberg, there it is irrelevant

@sh-rp sh-rp marked this pull request as ready for review April 29, 2025 15:03
@sh-rp sh-rp requested review from djudjuu and rudolfix and removed request for rudolfix April 29, 2025 15:03
from os import environ
import io

import os
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is indeed a file that deserve the name utils.
It also twists my brain to look at it. I wouldn't count on me detecting any bugs in your changes here

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK :)

assert dict_set1 == dict_set2, "Lists contain different dictionaries"


# TODO: merge with assert_table in main pipeline utils...
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

question: leaving this for another PR?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, correct

# simulate missing table
if len(files) == 0:
raise DestinationUndefinedEntity(f"Table {table_name} not found")
for file in files:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

praise: this looks like a great change, but I can't actually think through the left side here?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is basically still the same code as before, just simplified.

djudjuu
djudjuu previously approved these changes Apr 30, 2025
Copy link
Contributor

@djudjuu djudjuu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

very good improvement: with the helper functions a lot of places become much more readable!
all the methods in tests/pipeline/utils.py seem very handy and it's nice to know they exist. I couldnt go over all of them to confirm you didnt mess it up, but if they produce the same results as before (in the tests) then that's a good indicator I guess.

i left a few comments in places that looked like maybe you forgot some code there or commented something out, just to make sure you intend to merge this.

otherwise, its approved from me. If we want to be extra sure, someone else needs to also look at tests/pipeline/utils.py

@anuunchin anuunchin self-requested a review April 30, 2025 14:19
rows = list(cur.fetchall())
return {r[0]: r[1] for r in rows}
def load_table_counts(p: dlt.Pipeline, *table_names: str) -> DictStrAny:
"""Returns row counts for `table_names` as dict, if no table names are given, all data tables are counted"""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

learning question (feel free to ignore): does this have to be DictStrAny?
why not dict str int?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, good point, I'm just using it because this type has always been here ;)

@sh-rp sh-rp force-pushed the tests/simplify-load-utils branch 2 times, most recently from ea38dd1 to 8efc4a1 Compare May 3, 2025 10:13
@sh-rp sh-rp self-assigned this May 5, 2025
) -> None:
"""Asserts that `rows` conform to `table_schema`. Fields and their order must conform to columns. Null values and
python data types are checked.
"""Validate that *rows* conform to *table_schema*.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

praise: I love the doc-strings.

djudjuu
djudjuu previously approved these changes May 6, 2025
Copy link
Contributor

@djudjuu djudjuu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good stuff. Love the doc-strings too, I imagine this would make it much more likely that someone would use these helpers themselves



@dlt.resource(table_name="users")
def users_materialize_table_schema():
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wondering if specifying return types a must here (and for airtable_emojis ) 👀

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the time being we do not have super strict typing on the tests folder, so many functions to not have return types and the linter will pass..

structure types.
Raises:
AssertionError: If any of the schema constraints are violated.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if the json.loads() fails for ckeck_nested it will raise a different error - or is it an irrelevant detail 👀

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It will, but I would say that is not really relevant, theoretically many types of exceptions could arise from calling this function, so I would document the ones that are important in the context. If json.loads fails, it will be quite clear what happened I think.

@abstractmethod
def get_open_table_location(self, table_format: TTableFormat, table_name: str) -> str:
"""Computes location in which table metadata is stored. Does not verify if table exists."""
def get_open_table_location(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe we can add path separator at the end if this is a folder? I thought filesystem does that?

@abstractmethod
def get_open_table_location(self, table_format: TTableFormat, table_name: str) -> str:
"""Computes location in which table metadata is stored. Does not verify if table exists."""
def get_open_table_location(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we already implement this interface for pyiceberg, there it is irrelevant

# pyiceberg cannot deal with windows absolute urls
location = location.replace("file:///", "file://")
return location
return location, (folder == prefix or (folder + os.path.sep) == prefix)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe make_remote_url removes trailing separator from folder?

but overall filesystem does not enforce the convention that folder should always end with separator. maybe we should unit test a few functions there to make sure they follow this convention

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am ensuring now that this function returns a url with a trailing slash if it is a folder (it has to be removed in some places downstream again for things to work)

# list_table_files returns a list of absolute paths but without scheme
files = self.remote_client.list_table_files(table_name)
resolved_files_string = ",".join(map(lambda f: f"'{protocol}://{f}'", files))
if protocol == "file":
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

was there something wrong with leaving the protocol for "file"? maybe on windows? I'm asking because I recently spent a lot of time fixing this place. but you seem to be right here!

btw. where we really lack docstiring is FsClient. does list_table_files return absolute path? with scheme or not?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't remember why I changed this ..., but the tests seem to be passing


# build files string
supports_wildcard_notation = not self.is_abfss
supports_wildcard_notation = not self.is_abfss and is_folder
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does this duckdb limitation apply to all protocols? did you check it? could be useful

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It does, and that is also reflected in the statement. supports_wildcard_notation always if false if we do not have a folder, regardless of the protocol.

@sh-rp sh-rp force-pushed the tests/simplify-load-utils branch 2 times, most recently from eecafbb to 8001fae Compare May 26, 2025 21:08
@sh-rp sh-rp force-pushed the tests/simplify-load-utils branch from 8001fae to 33e58aa Compare May 27, 2025 05:49
if is_folder:
assert client.get_table_prefix("letters").endswith("_data/letters/")
else:
assert client.get_table_prefix("letters").endswith("_data/letters.")
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why does this end with a dot? I forgot...

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

because . is a prefix separator and we include it in prefix. / is also a separator. this is how this function works. you wrote it ;p

@rudolfix rudolfix force-pushed the tests/simplify-load-utils branch from d1ec18a to 0a66403 Compare May 29, 2025 21:47
@rudolfix rudolfix force-pushed the tests/simplify-load-utils branch from 0a66403 to 9d6722d Compare May 30, 2025 07:49
@sh-rp sh-rp merged commit a3534b3 into devel May 30, 2025
89 of 90 checks passed
@sh-rp sh-rp deleted the tests/simplify-load-utils branch May 30, 2025 11:18
lucargir pushed a commit to lucargir/dlt that referenced this pull request Jun 6, 2025
* remove some duplicate test utils

* use dataset to get table counts

* add exception for sftp but use dataset otherwise for loading table counts and contents

* update checking of empty tables in filesystem tests

* support filesystemsqlclient for tables that have prefixes rather than folders

* fix table location resolution for internal tables

* make sftp check raise same errors as filesystemsqlclient

* more cleanup

* fix replace disposition tests

* simplify table count code in many places

* small cleanup

* fix tables to dicts function

* disable databricks and synapse ibis backend tests

(cherry picked from commit aba8de4)

* simplify table assertions

* add tests for tests :)

* fix two tests

* fix dbt tests

* makes open table locations to work in windows fs

* review comments

* adds docstrings plus linting to pipeline utils

* fix docstring linting on utils class

* bump adlfs in lockfile

* test loading abfss first

* test getting tables one by one for azure

* fix resolving of sql_client

* change folder detection

* add comment for abfss fix
fix iceberg

* move abfss fallback into utils method

* normalizes trailing separator in paths in filesystem

* fixes two tests

* fix glob resolution for tables that have nested folders

* removes globs from duckdb filesystem sql client, adds tests for edge cases

* disables globbing for iceberg, adds optional autorefresh flag for view, fixes tests and docs

---------

Co-authored-by: Marcin Rudolf <rudolfix@rudolfix.org>
dat-a-man pushed a commit that referenced this pull request Jun 24, 2025
* remove some duplicate test utils

* use dataset to get table counts

* add exception for sftp but use dataset otherwise for loading table counts and contents

* update checking of empty tables in filesystem tests

* support filesystemsqlclient for tables that have prefixes rather than folders

* fix table location resolution for internal tables

* make sftp check raise same errors as filesystemsqlclient

* more cleanup

* fix replace disposition tests

* simplify table count code in many places

* small cleanup

* fix tables to dicts function

* disable databricks and synapse ibis backend tests

(cherry picked from commit aba8de4)

* simplify table assertions

* add tests for tests :)

* fix two tests

* fix dbt tests

* makes open table locations to work in windows fs

* review comments

* adds docstrings plus linting to pipeline utils

* fix docstring linting on utils class

* bump adlfs in lockfile

* test loading abfss first

* test getting tables one by one for azure

* fix resolving of sql_client

* change folder detection

* add comment for abfss fix
fix iceberg

* move abfss fallback into utils method

* normalizes trailing separator in paths in filesystem

* fixes two tests

* fix glob resolution for tables that have nested folders

* removes globs from duckdb filesystem sql client, adds tests for edge cases

* disables globbing for iceberg, adds optional autorefresh flag for view, fixes tests and docs

---------

Co-authored-by: Marcin Rudolf <rudolfix@rudolfix.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

tech-debt Leftovers from previous work. Should be fixed over time

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants