Update PUDL to SQLAlchemy 2.0 #2267

zaneselvans · 2023-02-03T15:31:36Z

PR Overview

Although our own dependency specification allowed SQLAlchemy 2.0, one or more of our dependencies had been prohibiting it. That no longer seems to be the case as of this morning, so I'm getting some local failures in a fresh pudl-dev environment. This PR will address whatever issues come up from actually using SQLAlchemy 2.0.

However, for the moment it looks like pandas doesn't work with SQLAlchemy 2.0, so the first thing to do is probably to re-pin our dependency.

Blocked by #2320

PR Checklist

Merge the most recent version of the branch you are merging into (probably dev).
All CI checks are passing. Run tests locally to debug failures.
Make sure you've included good docstrings.
For major data coverage & analysis changes, run data validation tests.
Include unit tests for new functions and classes.
Defensive data quality/sanity checks in analyses & data processing functions.
Update the release notes and reference the PR and related issues.
Do your own explanatory review of the PR to help the reviewer understand what's going on and identify issues preemptively.

codecov · 2023-02-03T16:24:58Z

Codecov Report

Attention: 2 lines in your changes are missing coverage. Please review.

Comparison is base (fdd444d) 88.7% compared to head (831d569) 88.7%.
Report is 5 commits behind head on dev.

Files	Patch %	Lines
src/pudl/io_managers.py	81.8%	2 Missing ⚠️

Additional details and impacted files

@@          Coverage Diff          @@
##             dev   #2267   +/-   ##
=====================================
  Coverage   88.7%   88.7%           
=====================================
  Files         90      90           
  Lines      10994   10995    +1     
=====================================
+ Hits        9758    9759    +1     
  Misses      1236    1236


bendnorman · 2023-11-17T07:02:44Z

The unit tests in test/unit/io_manager_test.py were passing, but data wasn't being written to the database during the ETL. The unit tests only exercise the parent class method SQLiteIOManager._handle_pandas_output(), which uses engine.begin(), whereas PUDLSQLiteIOManager._handle_pandas_output() uses engine.connect(). (I'm not sure why they use different methods to create Connection objects, and I haven't looked into when they changed.) The behavior of engine.connect() changed in 2.0 to "commit as you go": to commit the transaction you need to call con.commit(), so none of our df.to_sql() statements were being committed to the database. engine.begin(), by contrast, automatically commits the transaction when its block exits without error.

I still don't fully understand the distinction and when it's better to use connect() vs begin(). You can read more about connection here.

zaneselvans · 2023-11-17T18:19:47Z

Fast ETL and integration tests are passing locally, so I guess this is ready. I will run a full ETL and validation tests locally to see how that goes.

With the centralization of all of our interactions with the database in the IOManager classes, we basically only had to change the context manager that was being used to write to and read from the DB in those classes, which is amazing.

I ran a full make nuke locally and fixed a couple of deprecation warnings. Ready to merge as far as I can tell.

* Replace engine.connect() with engine.begin()
* Update unit tests to work with SQLAlchemy 2.0
* Require SQLAlchemy 2.0. Again. Oops.
* Remove deprecated use_nullable_dtypes arg from read_parquet calls.
* Use string 'sum' rather than callable sum() in groupby transforms.
* Use explicit observed=True in timezone groupby
* Update conda-lock.yml and rendered conda environment files.

zaneselvans · 2023-11-17T21:24:07Z

src/pudl/analysis/allocate_gen_fuel.py

@@ -1012,7 +1012,7 @@ def _allocate_unassociated_pm_records(
    eia_generators_connected = gen_assoc.loc[connected_mask].assign(
        capacity_mw_minus_one=lambda x: x.groupby(idx_minus_one)[
            "capacity_mw"
-        ].transform(sum),
+        ].transform("sum"),


Addressing a deprecation warning. Future pandas will use the callable here directly, which is not what we want.
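A small illustration of the string form, using a made-up frame rather than the real gen_assoc table (the `plant` column and its values are hypothetical):

```python
import pandas as pd

# Hypothetical toy data standing in for the generator association table.
df = pd.DataFrame(
    {
        "plant": ["a", "a", "b"],
        "capacity_mw": [10.0, 5.0, 7.0],
    }
)

# Passing the string "sum" asks pandas for its own grouped sum, and the
# per-group total is broadcast back onto every row of that group.
df["capacity_total"] = df.groupby("plant")["capacity_mw"].transform("sum")
```

Each row ends up with the total capacity of its group, which is the behavior the original `transform(sum)` call relied on.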

zaneselvans · 2023-11-17T21:25:54Z

src/pudl/analysis/state_demand.py

@@ -147,7 +147,7 @@ def local_to_utc(local: pd.Series, tz: Iterable, **kwargs: Any) -> pd.Series:
        1   2020-01-01 06:00:00
        dtype: datetime64[ns]
    """
-    return local.groupby(tz).transform(
+    return local.groupby(tz, observed=True).transform(


Addressing a deprecation warning. There are 15 million records with these timestamps and if we're ever operating on a subset (e.g. a single state that only contains one timezone) this behavior will be much more space efficient. Also it'll become the default pandas behavior in the future.
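To make the difference concrete, here is a rough sketch, assuming the timezone labels are stored as a categorical with more categories than actually occur in the subset. The series below are invented for illustration and are not taken from the real data.

```python
import pandas as pd

# Hypothetical subset: every record is in one timezone, but the categorical
# dtype still knows about a second timezone that never appears.
local = pd.Series(pd.to_datetime(["2020-01-01 00:00", "2020-01-01 01:00"]))
tz = pd.Series(
    pd.Categorical(
        ["America/Chicago", "America/Chicago"],
        categories=["America/Chicago", "America/Denver"],
    )
)

# observed=False creates a group for every category, including empty ones.
groups_all = local.groupby(tz, observed=False).size()

# observed=True only creates groups for categories present in the data.
groups_observed = local.groupby(tz, observed=True).size()
```

With `observed=False` the result carries an empty America/Denver group, while `observed=True` only returns the America/Chicago group.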

zaneselvans · 2023-11-17T21:26:31Z

src/pudl/output/epacems.py

@@ -137,7 +137,6 @@ def epacems(

    epacems = dd.read_parquet(
        epacems_path,
-        use_nullable_dtypes=True,


This argument is deprecated since pandas 2.0 and is now the default behavior.

zaneselvans · 2023-11-17T21:27:27Z

src/pudl/io_managers.py

@@ -193,7 +193,7 @@ def _get_fk_list(self, table: str) -> pd.DataFrame:
        method collapses foreign keys with multiple fields into one record for
        readability.
        """
-        with self.engine.connect() as con:
+        with self.engine.begin() as con:


Incredibly, now that all of our interactions with the database are centralized in the IOManager classes, this is the only place we need to change... anything to work with SQLAlchemy 2.0!

bendnorman

LGTM! Thanks for fixing these deprecation warnings.

zaneselvans mentioned this pull request Feb 3, 2023

Pin SQLAlchemy<2.0 and allow pip 23 #2268

Merged

8 tasks

zaneselvans mentioned this pull request Feb 6, 2023

Update sqlalchemy requirement from <2,>=1.4 to >=1.4,<3 #2277

Closed

jdangerx added the inframundo label Feb 7, 2023

zaneselvans added the dependencies Pull requests that update a dependency file label Feb 27, 2023

zaneselvans modified the milestone: PUDL 2023Q2 Release Mar 11, 2023

zaneselvans mentioned this pull request Mar 11, 2023

Update PUDL to new major versions of key dependencies #2384

Closed

zaneselvans changed the title ~~Upgrade to SQLAlchemy 2.0~~ Upgrade PUDL to SQLAlchemy 2.0 Mar 11, 2023

zaneselvans changed the title ~~Upgrade PUDL to SQLAlchemy 2.0~~ Update PUDL to SQLAlchemy 2.0 Mar 11, 2023

zaneselvans marked this pull request as ready for review March 13, 2023 16:04

zaneselvans marked this pull request as draft March 13, 2023 16:16

zaneselvans linked an issue Mar 13, 2023 that may be closed by this pull request

Update PUDL to be compatible with SQLAlchemy 2.0 #2395

Closed

zaneselvans changed the base branch from dev to pandas-2.0 March 21, 2023 08:28

Base automatically changed from pandas-2.0 to dev August 30, 2023 18:05

bendnorman marked this pull request as ready for review September 25, 2023 13:54

e-belfer assigned bendnorman Oct 2, 2023

zaneselvans marked this pull request as draft November 16, 2023 14:09

zaneselvans force-pushed the sqlalchemy-2.0 branch 2 times, most recently from 0ba3a5b to 4530d8e Compare November 17, 2023 16:45

zaneselvans marked this pull request as ready for review November 17, 2023 18:18

zaneselvans requested a review from bendnorman November 17, 2023 18:52

zaneselvans force-pushed the sqlalchemy-2.0 branch from 354c218 to 6f603e9 Compare November 17, 2023 21:05

Update conda-lock.yml and rendered conda environment files.

831d569

zaneselvans commented Nov 17, 2023


bendnorman approved these changes Nov 17, 2023


zaneselvans merged commit 6337426 into dev Nov 18, 2023
10 of 11 checks passed

zaneselvans deleted the sqlalchemy-2.0 branch November 18, 2023 00:05

zaneselvans mentioned this pull request Nov 21, 2023

Merge dev into main for 2023-11-21 #3070

Merged
