Xbrl test speedups #2229

zschira · 2023-01-23T20:28:03Z

Overview

This PR speeds up and improves XBRL extraction testing by using only a subset of the XBRL filings during integration testing. This dramatically reduces the time required to extract XBRL filings and seems to reduce total CI runtime by ~10 minutes. The PR also adds several unit tests focused on the extraction and the XBRL datastore object.

PR Checklist

Merge the most recent version of the branch you are merging into (probably dev).
All CI checks are passing. Run tests locally to debug failures.
Make sure you've included good docstrings.
Include unit tests for new functions and classes.
Do your own explanatory review of the PR to help the reviewer understand what's going on and identify issues preemptively.

zschira · 2023-01-23T20:29:23Z

src/pudl/extract/xbrl.py

@@ -57,12 +57,14 @@ def get_filings(self, year: int, form: XbrlFormNumber):
                    published = datetime.fromisoformat(info["published_parsed"])

                    if published > latest:
-                        latest_filing = f"{filing_id}.xbrl"
+                        latest = published


Caught a little bug here while adding tests!
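
For readers skimming the hunk, here is a minimal sketch of the pattern the fix appears to restore. It assumes the problem was that the running maximum `latest` was never updated; the function name `pick_latest` and the sample entries are hypothetical and not taken from `get_filings`. When scanning for the most recent filing, the comparison value has to advance together with the selection, otherwise any entry newer than the initial value overwrites the choice regardless of order.

```python
from datetime import datetime


def pick_latest(entries):
    """Return the ID of the most recently published entry, or None if empty.

    `entries` is a hypothetical list of (filing_id, ISO timestamp) pairs.
    """
    latest = datetime.min
    latest_id = None
    for filing_id, published_str in entries:
        published = datetime.fromisoformat(published_str)
        if published > latest:
            # Update both the selection and the running maximum.
            latest_id = filing_id
            latest = published
    return latest_id


# Sample data where the newest entry is not the last one in the list.
entries = [
    ("a", "2022-03-01T00:00:00"),
    ("b", "2022-06-15T00:00:00"),
    ("c", "2022-04-10T00:00:00"),
]
newest = pick_latest(entries)
```

With the `latest = published` line left out, this sketch would select `"c"` (the last entry newer than `datetime.min`) instead of `"b"`.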

zschira · 2023-01-23T20:30:18Z

test/conftest.py

+            form_settings = ferc_to_sqlite_settings.get_xbrl_dataset_settings(form)
+
+            # Extract every fifth filing
+            filings_subset = datastore.get_filings(year, form)[::5]


Using every fifth filing. This could be changed, but it seemed like a reasonable balance to me.


codecov · 2023-01-23T20:37:59Z

Codecov Report

Base: 85.7% // Head: 85.7% // Increases project coverage by +0.0% 🎉

Coverage data is based on head (89d274e) compared to base (afe8d64).
Patch coverage: 100.0% of modified lines in pull request are covered.

Additional details and impacted files

@@          Coverage Diff          @@
##             dev   #2229   +/-   ##
=====================================
  Coverage   85.7%   85.7%           
=====================================
  Files         73      73           
  Lines       8973    8974    +1     
=====================================
+ Hits        7690    7692    +2     
+ Misses      1283    1282    -1

Impacted Files	Coverage Δ
src/pudl/extract/xbrl.py	`96.1% <100.0%> (+2.0%)`	⬆️
src/pudl/settings.py	`96.0% <100.0%> (ø)`


jdangerx

Mostly questions for my own clarification. Everything here looks good, and faster tests are great! Approved, pending the test coverage issue. I'm happy to help dig into that with you as well!

jdangerx · 2023-01-25T00:47:49Z

test/conftest.py


-    If we are using the test database, we initialize it from scratch first. If we're
-    using the live database, then we just yield a conneciton to it.
+    Extracts a subset of filings for each form for the year 2021.
    """
    if not live_dbs:


What happens if live_dbs is truthy? This whole fixture just returns None?

This fixture just ensures that the XBRL extraction process happens, which means the DB and all metadata files should be in the right place. If live_dbs is true it's basically on the user to make sure that those files exist in the right places. Perhaps it would make sense to assert that all the files exist where expected so the tests fail early if they don't?

I think that makes sense! Might even put that check in the live_dbs fixture
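
The early-failure check discussed above could look roughly like the following sketch. The helper name `assert_files_exist` and the file names are hypothetical; the real paths would come from the PUDL settings, which are not shown in this thread.

```python
import tempfile
from pathlib import Path


def assert_files_exist(paths):
    """Raise FileNotFoundError listing every expected path that is missing."""
    missing = [str(p) for p in paths if not Path(p).exists()]
    if missing:
        raise FileNotFoundError("Missing expected files: " + ", ".join(missing))


# Demonstration: a temporary directory stands in for the output location.
with tempfile.TemporaryDirectory() as tmp:
    db_path = Path(tmp) / "example_xbrl.sqlite"  # hypothetical file name
    metadata_path = Path(tmp) / "example_metadata.json"  # hypothetical file name
    db_path.touch()  # only the database exists; the metadata file does not

    try:
        assert_files_exist([db_path, metadata_path])
    except FileNotFoundError as err:
        error_message = str(err)
```

Calling a check like this at the start of a fixture when `live_dbs` is set would make a missing database or metadata file surface as one clear error instead of as scattered downstream test failures.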

jdangerx · 2023-01-25T00:48:47Z

test/conftest.py

+            form_settings = ferc_to_sqlite_settings.get_xbrl_dataset_settings(form)
+
+            # Extract every fifth filing
+            filings_subset = datastore.get_filings(year, form)[::5]


Could be kind of nice to pull this 5 into something like step_size - you use it in a couple places (line 216, 221) and it might save a few minutes of heartache down the line.
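
A minimal sketch of that suggestion, assuming a module-level constant (the name `XBRL_TEST_STEP_SIZE`, the helper `subset_filings`, and the placeholder filing names are all hypothetical):

```python
# Hypothetical constant: one place to change how sparsely filings are sampled.
XBRL_TEST_STEP_SIZE = 5


def subset_filings(filings, step_size=XBRL_TEST_STEP_SIZE):
    """Return every `step_size`-th filing, starting with the first."""
    return filings[::step_size]


# Placeholder filings standing in for the result of datastore.get_filings().
filings = [f"filing_{i}" for i in range(12)]
subset = subset_filings(filings)
```

Any other place that depends on the sampling rate can then reference the same constant rather than repeating the literal 5.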

jdangerx · 2023-01-25T00:54:04Z

test/integration/datasette_metadata_test.py

@@ -12,7 +12,7 @@
 logger = logging.getLogger(__name__)


-def test_datasette_metadata_to_yml(pudl_settings_fixture, ferc1_engine_xbrl):
+def test_datasette_metadata_to_yml(pudl_settings_fixture, ferc_xbrl):


This test needs some FERC data in the DB but doesn't need to actually have a connection to that DB, basically?

It actually just needs the metadata files that are created alongside the DB

jdangerx · 2023-01-25T00:54:55Z

test/unit/extract/xbrl_test.py

+        convert_form_mock.assert_not_called()
+
+    for form in forms:
+        if form != XbrlFormNumber.FORM714:


This is because 714 is the 5th one, so we expect the others to just do nothing in test?

Oops, this was something I put in for debugging. assert_any_call was giving me bad error messages, and they were easier to parse when only testing the 714 call, for reasons that are hard to explain via text and probably not that important haha. Removed now.
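
For context, a sketch of what checking every form with `assert_any_call` can look like once the debugging filter is gone. The form names, the `run_conversion` function, and the `year` keyword are placeholders, not the actual test or the `convert_form` signature.

```python
from unittest.mock import Mock, call

# Placeholder form identifiers standing in for the XbrlFormNumber members.
forms = ["form1", "form2", "form714"]


def run_conversion(forms, convert_form):
    """Hypothetical stand-in for the code under test: convert each form once."""
    for form in forms:
        convert_form(form, year=2021)


convert_form_mock = Mock()
run_conversion(forms, convert_form_mock)

# Check each form individually, with no form skipped.
for form in forms:
    convert_form_mock.assert_any_call(form, year=2021)

# Alternatively, check all expected calls in a single assertion.
expected_calls = [call(form, year=2021) for form in forms]
convert_form_mock.assert_has_calls(expected_calls, any_order=True)
```

`assert_has_calls` reports the full list of expected and actual calls on failure, which may be easier to read than a series of `assert_any_call` errors.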

jdangerx · 2023-01-25T00:58:01Z

test/unit/extract/xbrl_test.py

+
+
+@pytest.mark.parametrize(
+    "file_map,selected_filings",


If there's only one version of this test it might make sense to not parametrize it until we have multiples.

And it sure would be nice if we could send less stuff through parametrize - though that's a thought for when we have multiple cases we'd like to hit.

zschira added 5 commits January 20, 2023 12:38

Limit number of XBRL filings to extract during testing

0c4ccb3

Add FercXbrlDatastore unit tests

8e94840

Test xbrl2sqlite

d9a7573

Merge branch 'dev' into xbrl_test_speedups

58ab4b0

Add return type annotation to XBRL datastore

6c37e7f

zschira requested a review from jdangerx January 23, 2023 20:28


jdangerx approved these changes Jan 25, 2023


zschira added 3 commits January 25, 2023 12:22

Remove degub check from xbrl datastore unit test

24a9d76

Use variable for subset step_size in xbrl extraction test

17a1cbf

Add convert_form unit test

89d274e

zschira merged commit b3ccaea into dev Jan 26, 2023

zschira deleted the xbrl_test_speedups branch January 26, 2023 19:52
