Speed up FERC 714 hourly demand transform #873

Merged
merged 7 commits into from
Jan 2, 2021

@ezwelty ezwelty commented Jan 1, 2021

pudl.transform.ferc714.demand_hourly_pa took about 10 minutes to complete on 14 years of data. As of now, I have it down to 15 s.

Below are some of the steps I took to speed things up. Timings are for a single year (2006).

  • Since report_date format is consistent, use an explicit format= in pd.to_datetime.
# 14.6 s
pd.to_datetime(df.report_date)
# 0.5 s
pd.to_datetime(df.report_date, format="%m/%d/%Y %H:%M:%S", exact=True)
  • Format all columns used as id_vars in pd.DataFrame.melt before they are replicated (25 times in this case) in the melt.
# after melt: 18.7 s
# before melt: 0.9 s
df.pipe(_standardize_offset_codes, offset_fixes=OFFSET_CODE_FIXES)
  • Similarly, drop any unused columns (i.e. the 25th hour) before the melt.
  • Compute hour timedeltas once for each hour, rather than nyears * nrespondents times!
# 12.7 s
df.report_date + pd.to_timedelta(df.hour - 1, unit="h")
# 0.2 s
mapping = {i + 1: pd.to_timedelta(i, unit="h") for i in range(24)}
df.report_date + df.hour.map(mapping)
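  • If replacing all values in a column with new values, use pd.Series.map, not pd.Series.replace, which is much slower. (Note that map returns NaN for any value missing from the mapping, so this only applies when every value is covered.)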
# 11 s
df.utc_offset_code.replace(offset_codes)
# 0.2 s
df.utc_offset_code.map(offset_codes)
  • Make changes in place when possible. As originally written, the dataframe was copied on nearly every change, for example by using pd.DataFrame.assign(column=*) instead of df['column'] = *. Chains are tidy but come at a performance cost that adds up with large dataframes. Helper functions that accept a dataframe and return a dataframe only complicate matters. Functions that do not alter the number of rows in the dataframe should probably just return the result (so the caller can choose what to do with it) rather than copy the dataframe, modify it, and return it. (I would also discourage functions like f(df) in favor of e.g. f(report_date, demand_mwh), so that it is clear from the function signature which variables the calculation needs.)
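
The helper-function suggestion in the last bullet might look roughly like the sketch below. The function name `local_to_utc` and the `utc_offset` column are hypothetical, made up for illustration; they are not taken from the PUDL code.

```python
import pandas as pd


def local_to_utc(report_date: pd.Series, utc_offset: pd.Series) -> pd.Series:
    """Hypothetical helper: convert local timestamps to UTC.

    It takes only the columns it needs and returns a new Series of the
    same length, so the caller decides where the result is stored.
    """
    return report_date - utc_offset


# Toy data (assumed column names and values).
df = pd.DataFrame({
    "report_date": pd.to_datetime(["2006-01-01 00:00:00", "2006-01-01 01:00:00"]),
    "utc_offset": pd.to_timedelta([-5, -5], unit="h"),
    "demand_mwh": [100.0, 110.0],
})

# The caller assigns the result to a single column instead of
# receiving a modified copy of the whole dataframe.
df["utc_datetime"] = local_to_utc(df["report_date"], df["utc_offset"])
```

The signature makes the inputs explicit, and the same function can be reused on any pair of Series without requiring a dataframe with particular column names.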

A contrived example to demonstrate:

import time
import numpy as np
import pandas as pd

df = pd.DataFrame(
    np.random.random((10000000, 10)),
    columns=[f"x{i}" for i in range(10)]
)

# Inplace: 1.7 s
start = time.time()
for _ in range(10):
    df["x0"] += 1
print(time.time() - start)

# Copy: 6.0 s
start = time.time()
for _ in range(10):
    df = df.assign(x0=df["x0"] + 1)
print(time.time() - start)
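
As a rough illustration of how the earlier tips fit together (explicit datetime format, cleaning id columns and dropping unused columns before the melt, a per-hour timedelta lookup, and map for full-column replacement), here is a minimal sketch on toy data. The column layout, the offset codes, and the `UTC_OFFSETS` / `HOUR_OFFSETS` tables are assumptions for the example, not the actual FERC 714 schema or the code in this PR.

```python
import pandas as pd

# Toy wide-format data: one row per respondent-day, one column per hour.
# Column names and offset codes are illustrative assumptions.
wide = pd.DataFrame({
    "respondent_id": [1, 2],
    "report_date": ["01/01/2006 00:00:00", "01/02/2006 00:00:00"],
    "utc_offset_code": ["est", "CST "],
    **{f"hour{h:02d}": [float(h), float(h) * 2] for h in range(1, 26)},
})

# Hypothetical lookup tables, each built once.
UTC_OFFSETS = {
    "EST": pd.to_timedelta(-5, unit="h"),
    "CST": pd.to_timedelta(-6, unit="h"),
}
HOUR_OFFSETS = {h: pd.to_timedelta(h - 1, unit="h") for h in range(1, 25)}

# 1. Clean the id columns while the frame is still small (2 rows, not 48).
wide["report_date"] = pd.to_datetime(
    wide["report_date"], format="%m/%d/%Y %H:%M:%S", exact=True
)
wide["utc_offset_code"] = wide["utc_offset_code"].str.strip().str.upper()

# 2. Drop the unused 25th hour before it gets melted into extra rows.
wide = wide.drop(columns="hour25")

# 3. Melt to one row per respondent-hour.
tall = wide.melt(
    id_vars=["respondent_id", "report_date", "utc_offset_code"],
    var_name="hour",
    value_name="demand_mwh",
)
tall["hour"] = tall["hour"].str.replace("hour", "", regex=False).astype(int)

# 4. Look up precomputed timedeltas with map instead of computing one per row.
tall["local_datetime"] = tall["report_date"] + tall["hour"].map(HOUR_OFFSETS)
tall["utc_datetime"] = tall["local_datetime"] - tall["utc_offset_code"].map(UTC_OFFSETS)
```

Because map yields NaN/NaT for keys missing from the lookup, a check such as `tall["utc_datetime"].notna().all()` is a cheap way to confirm that every offset code was covered.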

@ezwelty ezwelty added the ferc714 Anything having to do with FERC Form 714 label Jan 1, 2021

codecov bot commented Jan 1, 2021

Codecov Report

Merging #873 (b9f359c) into sprint29 (8fc12bc) will increase coverage by 0.15%.
The diff coverage is 0.00%.

@@             Coverage Diff              @@
##           sprint29     #873      +/-   ##
============================================
+ Coverage     65.57%   65.71%   +0.15%     
============================================
  Files            44       44              
  Lines          5573     5556      -17     
============================================
- Hits           3654     3651       -3     
+ Misses         1919     1905      -14     
Impacted Files                   Coverage Δ
src/pudl/transform/ferc714.py    32.61% <0.00%> (+2.33%) ⬆️

@zaneselvans zaneselvans marked this pull request as ready for review January 2, 2021 00:37
@zaneselvans zaneselvans merged commit 8b3144c into sprint29 Jan 2, 2021