
Rework decorators #287

Merged: 4 commits merged into EnergieID:master on Jan 5, 2024
Conversation

@pee-po (Contributor) commented Dec 26, 2023

This PR introduces some improvements to decorators:

  1. @year_limited deduplication is changed from dropping index duplicates to truncating. This gets rid of unintended duplicates but leaves the duplicates served by the API (fixing issue #234: documents_limited decorator removes data after concatenating by removing "duplicates"; see also #235).
  2. PEP8 adherence and import sorting;
  3. Fixing inconsistent logging (logging vs. printing to stderr);
  4. Fixing type checking to adhere to the Python 3 docs.

TODO: same as 1. for @documents_limited; I need to find a test case and make sure the direction is right.

A few discussions have been had about proper duplicate removal in the limited decorators. It's not clear to me whether this is the best way to tackle the problem: after this PR is merged, the user will be tasked with removing duplicated indices in rare cases. That is better than the user having no say in the matter (as is currently the case, where duplicated indices are removed, seemingly, at random). Nevertheless, after this PR the user still doesn't have much to go on when choosing which duplicates to keep.

In database design the solution would be to add column(s) that state under what circumstances a record is valid, e.g. tech dates (valid_from, valid_to). I don't know enough about the library's parsing to know whether this is even possible, and even if it were, it would be a major change - e.g. appending an index level would be breaking.
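Purely to illustrate that idea (a hypothetical shape; entsoe-py does not return these columns, and the names valid_from/valid_to are my own):

import pandas as pd

# Two versions of the same delivery hour; 'valid_from'/'valid_to' tech
# dates would let the user decide which version applies.
idx = pd.to_datetime(['2024-01-01 00:00'] * 2)
df = pd.DataFrame({
    'price': [50.0, 52.0],
    'valid_from': pd.to_datetime(['2024-01-01 12:00', '2024-01-02 09:00']),
    'valid_to': pd.to_datetime(['2024-01-02 09:00', None]),  # NaT = still valid
}, index=idx)

# E.g. keep only the versions that are currently valid:
current = df[df['valid_to'].isna()]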

Another solution would be to simply order the resulting records and let the user pick the last index occurrence to get the most recent data (df.loc[~df.index.duplicated(keep='last')]). Again, I'm not sure this is even possible, and if so it would require some work in the parsers.
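For illustration, that keep-last pattern on a duplicated DatetimeIndex (a minimal sketch with made-up data):

import pandas as pd

# Two records share the 00:00 timestamp, e.g. an original value and a
# later correction served by the API.
idx = pd.to_datetime(['2024-01-01 00:00', '2024-01-01 00:00', '2024-01-01 01:00'])
df = pd.DataFrame({'price': [50.0, 52.0, 48.0]}, index=idx)

# If records are ordered oldest-to-newest, keeping the last occurrence
# per timestamp retains the most recent value.
deduped = df.loc[~df.index.duplicated(keep='last')]
print(deduped)  # 52.0 at 00:00, 48.0 at 01:00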

@fboerman and everyone interested, let me know what you think: whether I should finish @documents_limited and merge as is, or we should keep digging for a 'bestest' solution. In any case, keep in mind that this PR may already be breaking for some users.

Please excuse me for bundling code quality improvements in this PR, but the linter was killing me.

The year_limited decorator (which appropriately splits API calls that should
be limited to a year per call) used deduplication based on the index
timestamp. This was probably due to API partial matching (request one
day but receive a full month) and the potential overlap of year blocks at
the ends of the interval.

This commit introduces truncation of each year block to its nominal
start and end. Index deduplication is removed. This should remove index
duplicates stemming from partial matching and interval overlap but keep
the duplicates served by the API (e.g. due to corrections).
Comparisons of type() against literal types were changed to isinstance,
as recommended by the docs of the type built-in.
PEP8 compliance, logging fixes and import sorting.
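For context, the pattern in question (the docs of the type built-in recommend isinstance for type tests; the example below is illustrative):

import datetime as dt

class TradingDay(dt.date):
    """Hypothetical subclass, purely for illustration."""

d = TradingDay(2024, 1, 5)

# Before: comparing type() to a literal type rejects subclasses.
print(type(d) == dt.date)      # False

# After: isinstance() accepts subclasses, as the type() docs recommend.
print(isinstance(d, dt.date))  # True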
@fboerman (Collaborator) commented:

Dear @pee-po, many thanks for making this! This is a long-lingering, complicated issue. I think truncating removes most of the duplication issues. Due to the API design of ENTSO-E there is no better solution, because duplicated entries are sometimes a valid result. I think it's best to keep as true to the actual API as possible.

I think it would be great if you can finish it for documents_limited as well, and then I'll merge it. It's possible that it breaks the workflow for some users, but then they can wrap some more code themselves to get the desired behaviour, as you described.

Many thanks for picking this up!

The documents_limited decorator appropriately splits queries that exceed
the limit of 100 documents (as per the "Transparency Platform RESTful API -
user guide"). These splits occur before the data are tabulated, and their
later alignment is not straightforward.

This commit replaces index-based duplicate removal with picking the last
valid value for each column within groups that share an index. This is not
an ideal solution but it seems to work for the issues at hand. Firstly, if
any duplicated indices are returned by the API, they are still dropped
invisibly to the user. Secondly, the splitting and concatenation should
arguably happen before tabulation; that would make more sense and be more
efficient.
@pee-po (Author) commented Jan 2, 2024

The documents_limited one is a bit harder. Since the splitting occurs before XML parsing, the concatenation should, arguably, happen there as well. I didn't want to flip this whole thing upside down, so my proposition is a workaround: within records that share the same index, the NaNs are forward-filled and then the last record is picked (so the duplicates are still dropped). This isn't elegant, but it gets the job done. It's not efficient either.
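A rough sketch of that workaround (illustrative, not the exact merged code):

import pandas as pd

# Rows sharing an index value may carry NaNs in different columns, because
# the >100-document splits are tabulated separately and aligned afterwards.
df = pd.DataFrame(
    {'a': [1.0, None], 'b': [None, 2.0]},
    index=pd.to_datetime(['2024-01-01', '2024-01-01']),
)

# Within each group of identical index values, forward-fill NaNs and keep
# the last row, so every column ends up with its last valid value.
merged = df.groupby(level=0).apply(lambda g: g.ffill().iloc[-1])
print(merged)  # a=1.0, b=2.0 at 2024-01-01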

I added a test based on #234; unfortunately it runs long and makes 5 requests. Also, this test covers a situation where there are no "intended" duplicates. I'll be happy to add more tests if someone proposes them.

After looking at it for some time, my idea for doing it better would be to handle this directly in RawClient's methods: either concatenate the responses into one XML document and use the current parser, or return a list of responses with different offsets, modify the parser to accept a list, and join the responses while parsing but before tabulating. That seems like a lot of work though, so maybe when someone requests it.

@pee-po marked this pull request as ready for review on January 2, 2024, 23:50
@fboerman (Collaborator) commented Jan 5, 2024

Hi @pee-po, many thanks! I have tested a happy flow that was already working (day-ahead prices) and a flow that wasn't working (#234), and they both work, so I think it's good to merge! Let's see if any unexpected issues pop up, haha.
Many thanks for this work!

@fboerman merged commit dcf6f2d into EnergieID:master on Jan 5, 2024
Comment on lines +131 to +135
interval_mask = (
((frame.index > _start) & (frame.index <= _end))
| (frame.index == start)
)
frame = frame.loc[interval_mask]

A bit late to the party, but I believe this change breaks queries returning yearly data, like client.query_installed_generation_capacity(country_code, start, end, psr_type=None).

Example to recreate the issue:

from entsoe import EntsoePandasClient
import pandas as pd
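API_KEY = '...'  # placeholder: substitute your own ENTSO-E token (not defined in the original snippet)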


client = EntsoePandasClient(api_key=API_KEY)

start = pd.Timestamp('20171201', tz='Europe/Brussels')
end = pd.Timestamp('20180101', tz='Europe/Brussels')
country_code = "BE"

client.query_installed_generation_capacity(
    country_code=country_code, start=start, end=end, psr_type="B01"
)

Expected result:

  Biomass
2017-01-01 00:00:00+01:00    598.0

Returned results:

Empty DataFrame
Columns: [Biomass]
Index: []
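
The likely mechanism (my reading of the truncation mask quoted above, assuming _start/_end equal the requested window for a sub-year query): the yearly capacity record is indexed at the start of the year, 2017-01-01, which falls outside the December 2017 window, so the mask drops it. A minimal sketch:

import pandas as pd

# The yearly installed-capacity record is indexed at the start of 2017.
frame = pd.DataFrame(
    {'Biomass': [598.0]},
    index=pd.DatetimeIndex(['2017-01-01 00:00'], tz='Europe/Brussels'),
)

start = pd.Timestamp('20171201', tz='Europe/Brussels')
end = pd.Timestamp('20180101', tz='Europe/Brussels')

# Same shape as the mask above: the yearly timestamp precedes the window.
mask = ((frame.index > start) & (frame.index <= end)) | (frame.index == start)
print(frame.loc[mask])  # Empty DataFrame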
