
Rework decorators #287

Merged: 4 commits merged into EnergieID:master on Jan 5, 2024
Conversation

@pee-po (Contributor) commented Dec 26, 2023

This PR introduces some improvements to decorators:

  1. @year_limited deduplication is changed from dropping index duplicates to truncating. This gets rid of unintended duplicates but leaves the duplicates served by the API (fixing issue #234: documents_limited decorator removes data after concatenating by removing "duplicates"; see also #235).
  2. PEP8 adherence and import sorting;
  3. Fixing inconsistent logging (logging vs. printing to stderr);
  4. Fixing type checking to adhere to the Python 3 docs.

TODO: same as 1. for @documents_limited; I need to find a test case and make sure the direction is right.

A few discussions have been had about proper duplicate removal in the limited decorators. It's not clear to me whether this is the best way to tackle the problem: after this PR is merged, the user will be tasked with removing duplicated indices in rare cases. That is better than the user having no say in the matter (as is currently the case, where duplicated indices are removed, seemingly, at random). Nevertheless, after this PR the user still doesn't have much to go on when choosing which duplicates to keep.

In database design the solution would be to add column(s) that state under what circumstances a record is valid, e.g. tech dates (valid_from, valid_to). I don't know enough about the library's parsing to know whether this is even possible, and even if it were, it would be a major change - e.g. appending an index level would be breaking.
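Purely to illustrate that idea (a hypothetical shape; entsoe-py does not return these columns, and the names valid_from/valid_to are my own):

import pandas as pd

# Two versions of the same delivery hour; 'valid_from'/'valid_to' tech
# dates would let the user decide which version applies.
idx = pd.to_datetime(['2024-01-01 00:00'] * 2)
df = pd.DataFrame({
    'price': [50.0, 52.0],
    'valid_from': pd.to_datetime(['2024-01-01 12:00', '2024-01-02 09:00']),
    'valid_to': pd.to_datetime(['2024-01-02 09:00', None]),  # NaT = still valid
}, index=idx)

# E.g. keep only the versions that are currently valid:
current = df[df['valid_to'].isna()]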

Another solution would be to simply order the resulting records and let the user pick the last index occurrence to get the most recent data (df.loc[~df.index.duplicated(keep='last')]). Again, I'm not sure this is even possible, and if so it would require some work in the parsers.
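For illustration, that keep-last pattern on a duplicated DatetimeIndex (a minimal sketch with made-up data):

import pandas as pd

# Two records share the 00:00 timestamp, e.g. an original value and a
# later correction served by the API.
idx = pd.to_datetime(['2024-01-01 00:00', '2024-01-01 00:00', '2024-01-01 01:00'])
df = pd.DataFrame({'price': [50.0, 52.0, 48.0]}, index=idx)

# If records are ordered oldest-to-newest, keeping the last occurrence
# per timestamp retains the most recent value.
deduped = df.loc[~df.index.duplicated(keep='last')]
print(deduped)  # 52.0 at 00:00, 48.0 at 01:00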

@fboerman and everyone interested, let me know what you think: whether I should finish @documents_limited and merge as is, or we should keep digging for a 'bestest' solution. In any case, keep in mind that this PR may already be breaking for some users.

Please excuse me for bundling code quality improvements in this PR, but the linter was killing me.

The year_limited decorator (which appropriately splits API calls that should
be limited to a year per call) used deduplication based on the index
timestamp. This was probably due to API partial matching (request one
day but receive a full month) and the potential overlap of year blocks at
the ends of the interval.

This commit introduces truncation of each year block to its nominal
start and end. Index deduplication is removed. This should remove index
duplicates stemming from partial matching and interval overlap but keep
the duplicates served by the API (e.g. due to corrections).
Comparisons of type() against literal types were changed to isinstance,
as recommended by the docs of the type built-in.
PEP8 compliance, logging fixes and import sorting.
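For context, the pattern in question (the docs of the type built-in recommend isinstance for type tests; the example below is illustrative):

import datetime as dt

class TradingDay(dt.date):
    """Hypothetical subclass, purely for illustration."""

d = TradingDay(2024, 1, 5)

# Before: comparing type() to a literal type rejects subclasses.
print(type(d) == dt.date)      # False

# After: isinstance() accepts subclasses, as the type() docs recommend.
print(isinstance(d, dt.date))  # True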
@fboerman (Collaborator) commented:

Dear @pee-po, many thanks for making this! This is a long-lingering, complicated issue. I think truncating removes most of the duplication issues. Due to the API design of ENTSO-E there is no better solution, because duplicated entries are sometimes a valid result. I think it's best to keep as true to the actual API as possible.

I think it would be great if you can finish it for documents_limited as well, and then I'll merge it. It's possible that it breaks the workflow for some users, but then they can wrap some more code themselves to get the desired behaviour, as you described.

Many thanks for picking this up!

The documents_limited decorator appropriately splits queries that exceed
the limit of 100 documents (as per the "Transparency Platform RESTful API -
user guide"). These splits occur before the data are tabulated, and their
later alignment is not straightforward.

This commit replaces index-based duplicate removal with picking the last
valid value for each column within groups that share an index. This is not
an ideal solution but it seems to work for the issues at hand. Firstly, if
any duplicated indices are returned by the API, they are still dropped
invisibly to the user. Secondly, the splitting and concatenation should
arguably happen before tabulation; that would make more sense and be more
efficient.
@pee-po (Author) commented Jan 2, 2024

The documents_limited one is a bit harder. Since the splitting occurs before XML parsing, the concatenation should, arguably, happen there as well. I didn't want to flip this whole thing upside down, so my proposition is a workaround: within records that share the same index, the NaNs are forward-filled and then the last record is picked (so the duplicates are still dropped). This isn't elegant, but it gets the job done. It's not efficient either.
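A rough sketch of that workaround (illustrative, not the exact merged code):

import pandas as pd

# Rows sharing an index value may carry NaNs in different columns, because
# the >100-document splits are tabulated separately and aligned afterwards.
df = pd.DataFrame(
    {'a': [1.0, None], 'b': [None, 2.0]},
    index=pd.to_datetime(['2024-01-01', '2024-01-01']),
)

# Within each group of identical index values, forward-fill NaNs and keep
# the last row, so every column ends up with its last valid value.
merged = df.groupby(level=0).apply(lambda g: g.ffill().iloc[-1])
print(merged)  # a=1.0, b=2.0 at 2024-01-01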

I added a test based on #234; unfortunately it runs long and makes 5 requests. Also, this test covers a situation where there are no "intended" duplicates. I'll be happy to add more tests if someone proposes them.

After looking at it for some time, my idea for doing it better would be to handle this directly in RawClient's methods: either concatenate the responses into one XML document and use the current parser, or return a list of responses with different offsets, modify the parser to accept a list, and join the responses while parsing but before tabulating. That seems like a lot of work though, so maybe when someone requests it.

@pee-po marked this pull request as ready for review on January 2, 2024, 23:50
@fboerman (Collaborator) commented Jan 5, 2024

Hi @pee-po, many thanks! I have tested a happy flow that was already working (day-ahead prices) and a flow that wasn't working (#234), and they both work, so I think it's good to merge! Let's see if any unexpected issues pop up, haha.
Many thanks for this work!

@fboerman merged commit dcf6f2d into EnergieID:master on Jan 5, 2024
Comment on lines +131 to +135
interval_mask = (
((frame.index > _start) & (frame.index <= _end))
| (frame.index == start)
)
frame = frame.loc[interval_mask]

A bit late to the party, but I believe this change breaks queries returning yearly data, like client.query_installed_generation_capacity(country_code, start, end, psr_type=None).

Example to recreate the issue:

from entsoe import EntsoePandasClient
import pandas as pd
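API_KEY = '...'  # placeholder: substitute your own ENTSO-E token (not defined in the original snippet)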


client = EntsoePandasClient(api_key=API_KEY)

start = pd.Timestamp('20171201', tz='Europe/Brussels')
end = pd.Timestamp('20180101', tz='Europe/Brussels')
country_code = "BE"

client.query_installed_generation_capacity(
    country_code=country_code, start=start, end=end, psr_type="B01"
)

Expected result:

  Biomass
2017-01-01 00:00:00+01:00    598.0

Returned results:

Empty DataFrame
Columns: [Biomass]
Index: []
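
The likely mechanism (my reading of the truncation mask quoted above, assuming _start/_end equal the requested window for a sub-year query): the yearly capacity record is indexed at the start of the year, 2017-01-01, which falls outside the December 2017 window, so the mask drops it. A minimal sketch:

import pandas as pd

# The yearly installed-capacity record is indexed at the start of 2017.
frame = pd.DataFrame(
    {'Biomass': [598.0]},
    index=pd.DatetimeIndex(['2017-01-01 00:00'], tz='Europe/Brussels'),
)

start = pd.Timestamp('20171201', tz='Europe/Brussels')
end = pd.Timestamp('20180101', tz='Europe/Brussels')

# Same shape as the mask above: the yearly timestamp precedes the window.
mask = ((frame.index > start) & (frame.index <= end)) | (frame.index == start)
print(frame.loc[mask])  # Empty DataFrame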
