Make CEMS extraction handle new listed year_quarter partitions (#3187)
@@ -100,7 +100,7 @@
 API_DTYPE_DICT = {
     "State": pd.CategoricalDtype(),
     "Facility Name": pd.StringDtype(),  # Not reading from CSV
-    "Facility ID": pd.Int16Dtype(),  # unique facility id for internal EPA database management (ORIS code)
+    "Facility ID": pd.Int32Dtype(),  # unique facility id for internal EPA database management (ORIS code)
     "Unit ID": pd.StringDtype(),
     "Associated Stacks": pd.StringDtype(),
     # These op_date, op_hour, and op_time variables get converted to
@@ -109,21 +109,21 @@
     "Date": pd.StringDtype(),
     "Hour": pd.Int16Dtype(),
     "Operating Time": pd.Float32Dtype(),
-    "Gross Load (MW)": pd.Float64Dtype(),
-    "Steam Load (1000 lb/hr)": pd.Float64Dtype(),
-    "SO2 Mass (lbs)": pd.Float64Dtype(),
+    "Gross Load (MW)": pd.Float32Dtype(),
+    "Steam Load (1000 lb/hr)": pd.Float32Dtype(),
+    "SO2 Mass (lbs)": pd.Float32Dtype(),

Review comment on lines +112 to +114: Not sure if switching these to 32-bit floats was necessary.

     "SO2 Mass Measure Indicator": pd.CategoricalDtype(),
-    "SO2 Rate (lbs/mmBtu)": pd.Float64Dtype(),  # Not reading from CSV
+    "SO2 Rate (lbs/mmBtu)": pd.Float32Dtype(),  # Not reading from CSV
     "SO2 Rate Measure Indicator": pd.CategoricalDtype(),  # Not reading from CSV
-    "NOx Rate (lbs/mmBtu)": pd.Float64Dtype(),  # Not reading from CSV
+    "NOx Rate (lbs/mmBtu)": pd.Float32Dtype(),  # Not reading from CSV
     "NOx Rate Measure Indicator": pd.CategoricalDtype(),  # Not reading from CSV
-    "NOx Mass (lbs)": pd.Float64Dtype(),
+    "NOx Mass (lbs)": pd.Float32Dtype(),
     "NOx Mass Measure Indicator": pd.CategoricalDtype(),
-    "CO2 Mass (short tons)": pd.Float64Dtype(),
+    "CO2 Mass (short tons)": pd.Float32Dtype(),
     "CO2 Mass Measure Indicator": pd.CategoricalDtype(),
-    "CO2 Rate (short tons/mmBtu)": pd.Float64Dtype(),  # Not reading from CSV
+    "CO2 Rate (short tons/mmBtu)": pd.Float32Dtype(),  # Not reading from CSV
     "CO2 Rate Measure Indicator": pd.CategoricalDtype(),  # Not reading from CSV
-    "Heat Input (mmBtu)": pd.Float64Dtype(),
+    "Heat Input (mmBtu)": pd.Float32Dtype(),
     "Heat Input Measure Indicator": pd.CategoricalDtype(),
     "Primary Fuel Type": pd.CategoricalDtype(),
     "Secondary Fuel Type": pd.CategoricalDtype(),
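To illustrate the tradeoff the reviewer flags above, here is a minimal sketch (with made-up values, not CEMS data) of what moving from pandas' nullable Float64 to Float32 does: per-value storage is halved, but only about 7 significant decimal digits survive.

```python
import pandas as pd

# Made-up measurements; 123456.789 needs more than 7 significant digits,
# 0.125 is exactly representable in binary at any width.
s64 = pd.Series([123456.789, 0.125], dtype="Float64")
s32 = s64.astype("Float32")

# Float32 stores 4 bytes per value instead of 8 (plus a 1-byte
# validity mask per value in both cases).
print(s64.memory_usage(index=False), s32.memory_usage(index=False))

# The first value is rounded to the nearest representable float32;
# the second round-trips exactly.
print(float(s32[0]), float(s32[1]))
```

Whether that rounding matters depends on the column: hourly load and mass values with a handful of significant digits fit comfortably in float32, which is presumably why the question is about necessity rather than correctness.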
@@ -191,26 +191,32 @@ def get_data_frame(self, partition: EpaCemsPartition) -> pd.DataFrame:

     def _csv_to_dataframe(
         self,
-        csv_file: Path,
+        csv_path: Path,
         ignore_cols: dict[str, str],
         rename_dict: dict[str, str],
         dtype_dict: dict[str, type],
+        chunksize: int = 100_000,
     ) -> pd.DataFrame:
         """Convert a CEMS csv file into a :class:`pandas.DataFrame`.

         Args:
-            csv (file-like object): data to be read
+            csv_path: Path to CSV file containing data to read.

         Returns:
-            A DataFrame containing the contents of the CSV file.
+            A DataFrame containing the filtered and dtyped contents of the CSV file.
         """
-        return pd.read_csv(
-            csv_file,
+        chunk_iter = pd.read_csv(
+            csv_path,
             index_col=False,
             usecols=lambda col: col not in ignore_cols,
-            dtype=dtype_dict,
-            low_memory=False,
-        ).rename(columns=rename_dict)
+            chunksize=chunksize,
+            low_memory=True,
+            parse_dates=["Date"],
+        )
+        df = pd.concat(chunk_iter)
+        dtypes = {k: v for k, v in dtype_dict.items() if k in df.columns}
+        return df.astype(dtypes).rename(columns=rename_dict)

 def extract(year_quarter: str, ds: Datastore) -> pd.DataFrame:

Review comment (on `low_memory=True`): I don't know what it does on the inside, but

Review comment (quoting the pandas `read_csv` documentation for `low_memory`): "Internally process the file in chunks, resulting in lower memory use while parsing, but possibly mixed type inference. To ensure no mixed types either set False, or specify the type with the dtype parameter. Note that the entire file is read into a single DataFrame regardless, use the chunksize or iterator parameter to return the data in chunks. (Only valid with C parser)."

Review comment: If all of the

Review comment: That would make sense, but to be consistent with how we're handling other datasets I'd probably want to map all the column dtypes in

Review comment on lines +213 to +215: This apparently worked, but I'm not sure why it worked. I didn't expect it to for several reasons:
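The read-in-chunks-then-concatenate-then-cast pattern introduced in the diff can be sketched in isolation like this; the CSV text, column names, and dtype mapping below are made-up stand-ins, not the real CEMS schema.

```python
import io
import pandas as pd

# Stand-in CSV and dtype map (hypothetical columns, not the real CEMS ones).
csv_text = "Facility ID,Gross Load (MW)\n3,100.5\n55479,250.25\n"
dtype_dict = {
    "Facility ID": pd.Int32Dtype(),
    "Gross Load (MW)": pd.Float32Dtype(),
}

# Passing chunksize makes read_csv return an iterator of DataFrames
# instead of one large frame, bounding peak parse memory.
chunk_iter = pd.read_csv(io.StringIO(csv_text), chunksize=1)
df = pd.concat(chunk_iter)

# Cast only the columns that actually appear in this file, then rename.
dtypes = {k: v for k, v in dtype_dict.items() if k in df.columns}
df = df.astype(dtypes)
```

Note that `pd.concat` still materializes the whole result, so this bounds memory during parsing rather than overall; the post-hoc `astype` is what makes dropping `dtype=` from `read_csv` safe.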
Review comment (on the `Facility ID` dtype change): 16-bit integers aren't actually large enough to hold these IDs, so some of them were wrapping around and becoming negative numbers.
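The wraparound described in that comment is easy to reproduce with plain numpy casts, which do not range-check; the ID values below are hypothetical, not real ORIS codes.

```python
import numpy as np

# A facility ID above 32767 (the int16 maximum) silently wraps to a
# negative value when cast down; int32 holds it without trouble.
ids = np.array([3, 7, 55479], dtype=np.int64)
as_int16 = ids.astype(np.int16)  # 55479 - 65536 == -10057
as_int32 = ids.astype(np.int32)  # all values preserved
print(as_int16.tolist())  # [3, 7, -10057]
print(as_int32.tolist())  # [3, 7, 55479]
```

Int32's range (about ±2.1 billion) comfortably covers any plausible facility ID, which is the rationale for the `Int16Dtype` to `Int32Dtype` change in the first hunk.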