Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Xbrl test speedups #2229

Merged
merged 8 commits into from
Jan 26, 2023
Merged

Xbrl test speedups #2229

merged 8 commits into from
Jan 26, 2023

Conversation

zschira
Copy link
Member

@zschira zschira commented Jan 23, 2023

Overview

This PR speeds up/improves XBRL extraction testing. To do so, it uses a subset of the XBRL filings during integration testing. This dramatically reduces the time required to extract XBRL filings, and seems to reduce total runtime for the CI by ~10 minutes. This PR also adds several unit tests focused on the extraction and XBRL datastore object.

PR Checklist

  • Merge the most recent version of the branch you are merging into (probably dev).
  • All CI checks are passing. Run tests locally to debug failures
  • Make sure you've included good docstrings.
  • Include unit tests for new functions and classes.
  • Do your own explanatory review of the PR to help the reviewer understand what's going on and identify issues preemptively.

@zschira zschira requested a review from jdangerx January 23, 2023 20:28
@@ -57,12 +57,14 @@ def get_filings(self, year: int, form: XbrlFormNumber):
published = datetime.fromisoformat(info["published_parsed"])

if published > latest:
latest_filing = f"{filing_id}.xbrl"
latest = published
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caught a little bug here while adding tests!

test/conftest.py Outdated
form_settings = ferc_to_sqlite_settings.get_xbrl_dataset_settings(form)

# Extract every fifth filing
filings_subset = datastore.get_filings(year, form)[::5]
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using every fifth filing. This could be changed, but it seemed like a reasonable balance to me

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could be kind of nice to pull this 5 into something like step_size - you use it in a couple places (line 216, 221) and it might save a few minutes of heartache down the line.

@codecov
Copy link

codecov bot commented Jan 23, 2023

Codecov Report

Base: 85.7% // Head: 85.7% // Increases project coverage by +0.0% 🎉

Coverage data is based on head (89d274e) compared to base (afe8d64).
Patch coverage: 100.0% of modified lines in pull request are covered.

Additional details and impacted files
@@          Coverage Diff          @@
##             dev   #2229   +/-   ##
=====================================
  Coverage   85.7%   85.7%           
=====================================
  Files         73      73           
  Lines       8973    8974    +1     
=====================================
+ Hits        7690    7692    +2     
+ Misses      1283    1282    -1     
Impacted Files Coverage Δ
src/pudl/extract/xbrl.py 96.1% <100.0%> (+2.0%) ⬆️
src/pudl/settings.py 96.0% <100.0%> (ø)

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

☔ View full report at Codecov.
📢 Do you have feedback about the report comment? Let us know in this issue.

Copy link
Member

@jdangerx jdangerx left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly questions for my own clarification - everything here looks good and faster tests is great! Approved, pending the test coverage issue. I'm happy to help dig into that with you as well!


If we are using the test database, we initialize it from scratch first. If we're
using the live database, then we just yield a conneciton to it.
Extracts a subset of filings for each form for the year 2021.
"""
if not live_dbs:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens if live_dbs is truthy? This whole fixture just returns None?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This fixture just ensures that the XBRL extraction process happens, which means the DB and all metadata files should be in the right place. If live_dbs is true it's basically on the user to make sure that those files exist in the right places. Perhaps it would make sense to assert that all the files exist where expected so the tests fail early if they don't?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that makes sense! Might even put that check in the live_dbs fixture

test/conftest.py Outdated
form_settings = ferc_to_sqlite_settings.get_xbrl_dataset_settings(form)

# Extract every fifth filing
filings_subset = datastore.get_filings(year, form)[::5]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could be kind of nice to pull this 5 into something like step_size - you use it in a couple places (line 216, 221) and it might save a few minutes of heartache down the line.

@@ -12,7 +12,7 @@
logger = logging.getLogger(__name__)


def test_datasette_metadata_to_yml(pudl_settings_fixture, ferc1_engine_xbrl):
def test_datasette_metadata_to_yml(pudl_settings_fixture, ferc_xbrl):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test needs some FERC data in the DB but doesn't need to actually have a connection to that DB, basically?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It actually just needs the metadata files that are created alongside the DB

convert_form_mock.assert_not_called()

for form in forms:
if form != XbrlFormNumber.FORM714:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is because 714 is the 5th one, so we expect the others to just do nothing in test?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ooops this was something I put in for debugging because assert_any_call was giving me bad error messages that were easier to parse when only testing the 714 call for reasons that are hard to explain via text and probably not that important haha. Removed now.



@pytest.mark.parametrize(
"file_map,selected_filings",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If there's only one version of this test it might make sense to not parametrize it until we have multiples.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And it sure would be nice if we could send less stuff through parametrize - though that's a thought for when we have multiple cases we'd like to hit.

@zschira zschira merged commit b3ccaea into dev Jan 26, 2023
@zschira zschira deleted the xbrl_test_speedups branch January 26, 2023 19:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants