-
Couldn't load subscription status.
- Fork 355
Simplify pipeline test utils #2566
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
✅ Deploy Preview for dlt-hub-docs ready!
To edit notification comments on pull requests, go to your Netlify project configuration. |
2e29f74 to
d15b1fb
Compare
| extra_info=bucket, | ||
| supports_merge=False, | ||
| file_format="parquet", | ||
| file_format="jsonl", # keep jsonl as default, test utils are setup for this |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note for review: I changed the default format to jsonl so we save more core in the test utils for the sftp special case, I didn't see any downside to this.
dlt/common/destination/client.py
Outdated
| @abstractmethod | ||
| def get_open_table_location(self, table_format: TTableFormat, table_name: str) -> str: | ||
| """Computes location in which table metadata is stored. Does not verify if table exists.""" | ||
| def get_open_table_location( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't really like this tuple return, but I explain it in the docstring and it is the only way to properly support all filepaths in FilesystemSqlClient, since the wildcards in duckdb only seem to work for full folders.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe we can add path separator at the end if this is a folder? I thought filesystem does that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we already implement this interface for pyiceberg, there it is irrelevant
| from os import environ | ||
| import io | ||
|
|
||
| import os |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is indeed a file that deserve the name utils.
It also twists my brain to look at it. I wouldn't count on me detecting any bugs in your changes here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK :)
| assert dict_set1 == dict_set2, "Lists contain different dictionaries" | ||
|
|
||
|
|
||
| # TODO: merge with assert_table in main pipeline utils... |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
question: leaving this for another PR?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, correct
| # simulate missing table | ||
| if len(files) == 0: | ||
| raise DestinationUndefinedEntity(f"Table {table_name} not found") | ||
| for file in files: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
praise: this looks like a great change, but I can't actually think through the left side here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is basically still the same code as before, just simplified.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
very good improvement: with the helper functions a lot of places become much more readable!
all the methods in tests/pipeline/utils.py seem very handy and it's nice to know they exist. I couldnt go over all of them to confirm you didnt mess it up, but if they produce the same results as before (in the tests) then that's a good indicator I guess.
i left a few comments in places that looked like maybe you forgot some code there or commented something out, just to make sure you intend to merge this.
otherwise, its approved from me. If we want to be extra sure, someone else needs to also look at tests/pipeline/utils.py
tests/pipeline/utils.py
Outdated
| rows = list(cur.fetchall()) | ||
| return {r[0]: r[1] for r in rows} | ||
| def load_table_counts(p: dlt.Pipeline, *table_names: str) -> DictStrAny: | ||
| """Returns row counts for `table_names` as dict, if no table names are given, all data tables are counted""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
learning question (feel free to ignore): does this have to be DictStrAny?
why not dict str int?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, good point, I'm just using it because this type has always been here ;)
ea38dd1 to
8efc4a1
Compare
| ) -> None: | ||
| """Asserts that `rows` conform to `table_schema`. Fields and their order must conform to columns. Null values and | ||
| python data types are checked. | ||
| """Validate that *rows* conform to *table_schema*. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
praise: I love the doc-strings.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good stuff. Love the doc-strings too, I imagine this would make it much more likely that someone would use these helpers themselves
|
|
||
|
|
||
| @dlt.resource(table_name="users") | ||
| def users_materialize_table_schema(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wondering if specifying return types a must here (and for airtable_emojis ) 👀
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the time being we do not have super strict typing on the tests folder, so many functions to not have return types and the linter will pass..
| structure types. | ||
| Raises: | ||
| AssertionError: If any of the schema constraints are violated. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if the json.loads() fails for ckeck_nested it will raise a different error - or is it an irrelevant detail 👀
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It will, but I would say that is not really relevant, theoretically many types of exceptions could arise from calling this function, so I would document the ones that are important in the context. If json.loads fails, it will be quite clear what happened I think.
dlt/common/destination/client.py
Outdated
| @abstractmethod | ||
| def get_open_table_location(self, table_format: TTableFormat, table_name: str) -> str: | ||
| """Computes location in which table metadata is stored. Does not verify if table exists.""" | ||
| def get_open_table_location( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe we can add path separator at the end if this is a folder? I thought filesystem does that?
dlt/common/destination/client.py
Outdated
| @abstractmethod | ||
| def get_open_table_location(self, table_format: TTableFormat, table_name: str) -> str: | ||
| """Computes location in which table metadata is stored. Does not verify if table exists.""" | ||
| def get_open_table_location( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we already implement this interface for pyiceberg, there it is irrelevant
| # pyiceberg cannot deal with windows absolute urls | ||
| location = location.replace("file:///", "file://") | ||
| return location | ||
| return location, (folder == prefix or (folder + os.path.sep) == prefix) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe make_remote_url removes trailing separator from folder?
but overall filesystem does not enforce the convention that folder should always end with separator. maybe we should unit test a few functions there to make sure they follow this convention
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am ensuring now that this function returns a url with a trailing slash if it is a folder (it has to be removed in some places downstream again for things to work)
| # list_table_files returns a list of absolute paths but without scheme | ||
| files = self.remote_client.list_table_files(table_name) | ||
| resolved_files_string = ",".join(map(lambda f: f"'{protocol}://{f}'", files)) | ||
| if protocol == "file": |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
was there something wrong with leaving the protocol for "file"? maybe on windows? I'm asking because I recently spent a lot of time fixing this place. but you seem to be right here!
btw. where we really lack docstiring is FsClient. does list_table_files return absolute path? with scheme or not?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't remember why I changed this ..., but the tests seem to be passing
|
|
||
| # build files string | ||
| supports_wildcard_notation = not self.is_abfss | ||
| supports_wildcard_notation = not self.is_abfss and is_folder |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does this duckdb limitation apply to all protocols? did you check it? could be useful
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It does, and that is also reflected in the statement. supports_wildcard_notation always if false if we do not have a folder, regardless of the protocol.
# Conflicts: # dlt/helpers/ibis.py
eecafbb to
8001fae
Compare
fix iceberg
8001fae to
33e58aa
Compare
| if is_folder: | ||
| assert client.get_table_prefix("letters").endswith("_data/letters/") | ||
| else: | ||
| assert client.get_table_prefix("letters").endswith("_data/letters.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why does this end with a dot? I forgot...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
because . is a prefix separator and we include it in prefix. / is also a separator. this is how this function works. you wrote it ;p
d1ec18a to
0a66403
Compare
…w, fixes tests and docs
0a66403 to
9d6722d
Compare
* remove some duplicate test utils * use dataset to get table counts * add exception for sftp but use dataset otherwise for loading table counts and contents * update checking of empty tables in filesystem tests * support filesystemsqlclient for tables that have prefixes rather than folders * fix table location resolution for internal tables * make sftp check raise same errors as filesystemsqlclient * more cleanup * fix replace disposition tests * simplify table count code in many places * small cleanup * fix tables to dicts function * disable databricks and synapse ibis backend tests (cherry picked from commit aba8de4) * simplify table assertions * add tests for tests :) * fix two tests * fix dbt tests * makes open table locations to work in windows fs * review comments * adds docstrings plus linting to pipeline utils * fix docstring linting on utils class * bump adlfs in lockfile * test loading abfss first * test getting tables one by one for azure * fix resolving of sql_client * change folder detection * add comment for abfss fix fix iceberg * move abfss fallback into utils method * normalizes trailing separator in paths in filesystem * fixes two tests * fix glob resolution for tables that have nested folders * removes globs from duckdb filesystem sql client, adds tests for edge cases * disables globbing for iceberg, adds optional autorefresh flag for view, fixes tests and docs --------- Co-authored-by: Marcin Rudolf <rudolfix@rudolfix.org>
* remove some duplicate test utils * use dataset to get table counts * add exception for sftp but use dataset otherwise for loading table counts and contents * update checking of empty tables in filesystem tests * support filesystemsqlclient for tables that have prefixes rather than folders * fix table location resolution for internal tables * make sftp check raise same errors as filesystemsqlclient * more cleanup * fix replace disposition tests * simplify table count code in many places * small cleanup * fix tables to dicts function * disable databricks and synapse ibis backend tests (cherry picked from commit aba8de4) * simplify table assertions * add tests for tests :) * fix two tests * fix dbt tests * makes open table locations to work in windows fs * review comments * adds docstrings plus linting to pipeline utils * fix docstring linting on utils class * bump adlfs in lockfile * test loading abfss first * test getting tables one by one for azure * fix resolving of sql_client * change folder detection * add comment for abfss fix fix iceberg * move abfss fallback into utils method * normalizes trailing separator in paths in filesystem * fixes two tests * fix glob resolution for tables that have nested folders * removes globs from duckdb filesystem sql client, adds tests for edge cases * disables globbing for iceberg, adds optional autorefresh flag for view, fixes tests and docs --------- Co-authored-by: Marcin Rudolf <rudolfix@rudolfix.org>
Description
We have some cryptic code in our pipeline load test utils. We needed this to abstract getting row counts and table contents across different destination types, but now we have the nice dataset which also supports filessystem, delta and iceberg, so we can probably just use that, this PR will attempt to remove a bunch of cold code.
Benefits:
This PR also fixes a few small issues.