Update PUDL to SQLAlchemy 2.0 #2267

Merged
zaneselvans merged 2 commits into dev from sqlalchemy-2.0
Nov 18, 2023

Conversation

zaneselvans
Member

@zaneselvans zaneselvans commented Feb 3, 2023

PR Overview

Although we allowed SQLAlchemy 2.0 in our own dependency specification, one or more of our dependencies prohibited it. That no longer seems to be the case as of this morning, so I'm getting some local failures in a fresh pudl-dev environment. This PR will address whatever issues come up from actually using SQLAlchemy 2.0.

However, for the moment it looks like pandas doesn't work with SQLAlchemy 2.0, so the first thing to do is probably to re-pin our dependency.

Blocked by #2320

PR Checklist

  • Merge the most recent version of the branch you are merging into (probably dev).
  • All CI checks are passing. Run tests locally to debug failures.
  • Make sure you've included good docstrings.
  • For major data coverage & analysis changes, run data validation tests
  • Include unit tests for new functions and classes.
  • Defensive data quality/sanity checks in analyses & data processing functions.
  • Update the release notes and reference the PR and related issues.
  • Do your own explanatory review of the PR to help the reviewer understand what's going on and identify issues preemptively.

@codecov

codecov bot commented Feb 3, 2023

Codecov Report

Attention: 2 lines in your changes are missing coverage. Please review.

Comparison is base (fdd444d) 88.7% compared to head (831d569) 88.7%.
Report is 5 commits behind head on dev.

Files Patch % Lines
src/pudl/io_managers.py 81.8% 2 Missing ⚠️
Additional details and impacted files
@@          Coverage Diff          @@
##             dev   #2267   +/-   ##
=====================================
  Coverage   88.7%   88.7%           
=====================================
  Files         90      90           
  Lines      10994   10995    +1     
=====================================
+ Hits        9758    9759    +1     
  Misses      1236    1236           


@zaneselvans zaneselvans added the dependencies Pull requests that update a dependency file label Feb 27, 2023
@zaneselvans zaneselvans modified the milestone: PUDL 2023Q2 Release Mar 11, 2023
@zaneselvans zaneselvans changed the title Upgrade to SQLAlchemy 2.0 Upgrade PUDL to SQLAlchemy 2.0 Mar 11, 2023
@zaneselvans zaneselvans changed the title Upgrade PUDL to SQLAlchemy 2.0 Update PUDL to SQLAlchemy 2.0 Mar 11, 2023
@zaneselvans zaneselvans marked this pull request as ready for review March 13, 2023 16:04
@zaneselvans zaneselvans marked this pull request as draft March 13, 2023 16:16
@zaneselvans zaneselvans linked an issue Mar 13, 2023 that may be closed by this pull request
@zaneselvans zaneselvans changed the base branch from dev to pandas-2.0 March 21, 2023 08:28
Base automatically changed from pandas-2.0 to dev August 30, 2023 18:05
@bendnorman bendnorman marked this pull request as ready for review September 25, 2023 13:54
@zaneselvans zaneselvans marked this pull request as draft November 16, 2023 14:09
@bendnorman
Member

The unit tests in test/unit/io_manager_test.py were passing, but data wasn't being written to the database during the ETL. The unit tests only exercise the parent class method SQLiteIOManager._handle_pandas_output(), which uses engine.begin(), whereas PUDLSQLiteIOManager._handle_pandas_output() uses engine.connect(). (I'm not sure why they use different methods to create Connection objects, and I haven't looked into when they diverged.) In 2.0 the behavior of engine.connect() changed to "commit as you go": to commit the transaction you need to call con.commit() explicitly, so none of our df.to_sql() statements were being committed to the database. engine.begin() commits the transaction automatically when the block exits without an error.

I still don't fully understand the distinction and when it's better to use connect() vs. begin(). The SQLAlchemy docs on connections have more detail.

@zaneselvans zaneselvans force-pushed the sqlalchemy-2.0 branch 2 times, most recently from 0ba3a5b to 4530d8e Compare November 17, 2023 16:45
@zaneselvans zaneselvans marked this pull request as ready for review November 17, 2023 18:18
@zaneselvans
Member Author

zaneselvans commented Nov 17, 2023

Fast ETL and integration tests are passing locally, so I guess this is ready. I will run a full ETL and validation tests locally to see how that goes.

With the centralization of all of our interactions with the database in the IOManager classes, we basically only had to change the context manager that was being used to write to and read from the DB in those classes, which is amazing.

I ran a full make nuke locally and fixed a couple of deprecation warnings. Ready to merge as far as I can tell.

* Replace engine.connect() with engine.begin()
* Update unit tests to work with SQLAlchemy 2.0
* Require SQLAlchemy 2.0. Again. Oops.
* Remove deprecated use_nullable_dtypes arg from read_parquet calls.
* Use string 'sum' rather than callable sum() in groupby transforms.
* Use explicit observed=True in timezone groupby
* Update conda-lock.yml and rendered conda environment files.
@@ -1012,7 +1012,7 @@ def _allocate_unassociated_pm_records(
     eia_generators_connected = gen_assoc.loc[connected_mask].assign(
         capacity_mw_minus_one=lambda x: x.groupby(idx_minus_one)[
             "capacity_mw"
-        ].transform(sum),
+        ].transform("sum"),
Member Author

Addressing a deprecation warning. Future pandas will use the callable here directly, which is not what we want.
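
As a small illustration of the string form, here is a sketch on made-up data. The `plant_id` column and the values are hypothetical; only `capacity_mw` echoes the diff above.

```python
import pandas as pd

# Hypothetical generator records; the values are invented for illustration.
gens = pd.DataFrame(
    {
        "plant_id": [1, 1, 2],
        "capacity_mw": [10.0, 20.0, 5.0],
    }
)

# The string alias "sum" dispatches to pandas' own groupby sum and
# broadcasts each group's total back onto every row of that group.
gens["plant_capacity_mw"] = gens.groupby("plant_id")["capacity_mw"].transform("sum")

print(gens)
```

Passing the string keeps the pandas aggregation regardless of how a future version treats the builtin `sum` callable.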

@@ -147,7 +147,7 @@ def local_to_utc(local: pd.Series, tz: Iterable, **kwargs: Any) -> pd.Series:
     1   2020-01-01 06:00:00
     dtype: datetime64[ns]
     """
-    return local.groupby(tz).transform(
+    return local.groupby(tz, observed=True).transform(
Member Author

Addressing a deprecation warning. There are 15 million records with these timestamps and if we're ever operating on a subset (e.g. a single state that only contains one timezone) this behavior will be much more space efficient. Also it'll become the default pandas behavior in the future.
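
A minimal sketch of what `observed=True` changes, assuming the timezone grouper is a categorical (the only case where the argument has an effect). The data and category list are hypothetical.

```python
import pandas as pd

# Two made-up local timestamps, both in a single timezone.
local = pd.Series(pd.to_datetime(["2020-01-01 00:00", "2020-01-01 01:00"]))

# A categorical grouper that declares more timezones than actually occur,
# as might happen when working on a single-state subset.
tz = pd.Series(
    pd.Categorical(
        ["America/Denver", "America/Denver"],
        categories=["America/Denver", "America/Chicago", "America/New_York"],
    )
)

# observed=True: only categories that appear in the data form groups.
observed_groups = local.groupby(tz, observed=True).size()

# observed=False: every declared category forms a group, even if empty.
all_groups = local.groupby(tz, observed=False).size()

print(observed_groups)
print(all_groups)
```

With `observed=False` the unused timezones still show up as empty groups, which is the overhead the change avoids.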

@@ -137,7 +137,6 @@ def epacems(

     epacems = dd.read_parquet(
         epacems_path,
-        use_nullable_dtypes=True,
Member Author

This argument is deprecated since pandas 2.0 and is now the default behavior.

@@ -193,7 +193,7 @@ def _get_fk_list(self, table: str) -> pd.DataFrame:
         method collapses foreign keys with multiple fields into one record for
         readability.
         """
-        with self.engine.connect() as con:
+        with self.engine.begin() as con:
Member Author

Incredibly, now that all of our interactions with the database are centralized in the IOManager classes, this is the only place we need to change... anything to work with SQLAlchemy 2.0!

Member

@bendnorman bendnorman left a comment

LGTM! Thanks for fixing these deprecation warnings.

@zaneselvans zaneselvans merged commit 6337426 into dev Nov 18, 2023
10 of 11 checks passed
@zaneselvans zaneselvans deleted the sqlalchemy-2.0 branch November 18, 2023 00:05
Labels
dependencies (Pull requests that update a dependency file), inframundo
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

Update PUDL to be compatible with SQLAlchemy 2.0
3 participants