-
-
Notifications
You must be signed in to change notification settings - Fork 110
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Xbrl test speedups #2229
Xbrl test speedups #2229
Conversation
@@ -57,12 +57,14 @@ def get_filings(self, year: int, form: XbrlFormNumber): | |||
published = datetime.fromisoformat(info["published_parsed"]) | |||
|
|||
if published > latest: | |||
latest_filing = f"{filing_id}.xbrl" | |||
latest = published |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Caught a little bug here while adding tests!
test/conftest.py
Outdated
form_settings = ferc_to_sqlite_settings.get_xbrl_dataset_settings(form) | ||
|
||
# Extract every fifth filing | ||
filings_subset = datastore.get_filings(year, form)[::5] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using every fifth filing. This could be changed, but it seemed like a reasonable balance to me
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could be kind of nice to pull this 5
into something like step_size
- you use it in a couple places (line 216, 221) and it might save a few minutes of heartache down the line.
Codecov ReportBase: 85.7% // Head: 85.7% // Increases project coverage by
Additional details and impacted files@@ Coverage Diff @@
## dev #2229 +/- ##
=====================================
Coverage 85.7% 85.7%
=====================================
Files 73 73
Lines 8973 8974 +1
=====================================
+ Hits 7690 7692 +2
+ Misses 1283 1282 -1
Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here. ☔ View full report at Codecov. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mostly questions for my own clarification - everything here looks good and faster tests is great! Approved, pending the test coverage issue. I'm happy to help dig into that with you as well!
|
||
If we are using the test database, we initialize it from scratch first. If we're | ||
using the live database, then we just yield a conneciton to it. | ||
Extracts a subset of filings for each form for the year 2021. | ||
""" | ||
if not live_dbs: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What happens if live_dbs
is truthy? This whole fixture just returns None
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This fixture just ensures that the XBRL extraction process happens, which means the DB and all metadata files should be in the right place. If live_dbs
is true it's basically on the user to make sure that those files exist in the right places. Perhaps it would make sense to assert
that all the files exist where expected so the tests fail early if they don't?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that makes sense! Might even put that check in the live_dbs
fixture
test/conftest.py
Outdated
form_settings = ferc_to_sqlite_settings.get_xbrl_dataset_settings(form) | ||
|
||
# Extract every fifth filing | ||
filings_subset = datastore.get_filings(year, form)[::5] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could be kind of nice to pull this 5
into something like step_size
- you use it in a couple places (line 216, 221) and it might save a few minutes of heartache down the line.
@@ -12,7 +12,7 @@ | |||
logger = logging.getLogger(__name__) | |||
|
|||
|
|||
def test_datasette_metadata_to_yml(pudl_settings_fixture, ferc1_engine_xbrl): | |||
def test_datasette_metadata_to_yml(pudl_settings_fixture, ferc_xbrl): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test needs some FERC data in the DB but doesn't need to actually have a connection to that DB, basically?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It actually just needs the metadata files that are created alongside the DB
test/unit/extract/xbrl_test.py
Outdated
convert_form_mock.assert_not_called() | ||
|
||
for form in forms: | ||
if form != XbrlFormNumber.FORM714: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is because 714 is the 5th one, so we expect the others to just do nothing in test?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ooops this was something I put in for debugging because assert_any_call
was giving me bad error messages that were easier to parse when only testing the 714 call for reasons that are hard to explain via text and probably not that important haha. Removed now.
|
||
|
||
@pytest.mark.parametrize( | ||
"file_map,selected_filings", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If there's only one version of this test it might make sense to not parametrize it until we have multiples.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And it sure would be nice if we could send less stuff through parametrize
- though that's a thought for when we have multiple cases we'd like to hit.
Overview
This PR speeds up/improves XBRL extraction testing. To do so, it uses a subset of the XBRL filings during integration testing. This dramatically reduces the time required to extract XBRL filings, and seems to reduce total runtime for the CI by ~10 minutes. This PR also adds several unit tests focused on the extraction and XBRL
datastore
object.PR Checklist
dev
).