Rework decorators #287
Conversation
The `year_limited` decorator (which splits API calls that should be limited to one year per call) used deduplication based on the index timestamp. This was probably due to API partial matching (request one day but receive a full month) and potential overlap of year blocks at the ends of the interval. This commit introduces truncation of each year block to its nominal start and end; index deduplication is removed. This should remove index duplicates stemming from partial matching and interval overlap, while keeping the duplicates served by the API (e.g. due to corrections).
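A minimal sketch of the split-then-truncate idea (the helper names and the exact boundary convention here are illustrative assumptions, not the library's actual code; the PR's real mask appears in the review snippet further down):

```python
import pandas as pd

def year_blocks(start: pd.Timestamp, end: pd.Timestamp):
    # Split [start, end] into blocks of at most one year each,
    # mirroring what a year-limited decorator must do before issuing calls.
    block_start = start
    while block_start < end:
        block_end = min(block_start + pd.DateOffset(years=1), end)
        yield block_start, block_end
        block_start = block_end

def truncate_to_block(frame: pd.DataFrame, block_start: pd.Timestamp,
                      block_end: pd.Timestamp) -> pd.DataFrame:
    # Drop rows the API returned outside the block's nominal interval
    # (partial matching, overlap at block ends) instead of deduplicating,
    # so duplicates genuinely served by the API survive.
    mask = (frame.index > block_start) & (frame.index <= block_end)
    return frame.loc[mask]
```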
Comparisons of `type()` results against literal types were changed to `isinstance`, as recommended by the documentation of the `type` built-in.
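For illustration, the difference (a generic example, not a snippet from this PR):

```python
import pandas as pd

def is_timestamp(value) -> bool:
    # type(value) == pd.Timestamp is True only for the exact class;
    # isinstance also accepts subclasses, which is usually what is meant.
    return isinstance(value, pd.Timestamp)
```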
PEP 8 compliance, logging, and import sorting.
dear @pee-po, many thanks for making this! This is a long-lingering, complicated issue. I think truncating removes most of the duplication issues. Due to the API design of ENTSO-E there is no better solution, because duplicate entries are sometimes a valid result. I think it's best to stay as true to the actual API as possible. It would be great if you can finish it for documents_limited as well, and then I'll merge it. It's possible that it breaks the workflow for some users, but then they can wrap some more code themselves to get the desired behaviour, as you described. Many thanks for picking this up!
The `documents_limited` decorator splits queries that exceed the limit of 100 documents (as per the "Transparency Platform RESTful API - user guide"). These splits occur before the data are tabulated, and their later alignment is not straightforward. This commit changes duplicate removal from dropping index duplicates to picking the last valid value for each column within groups of identical index labels. This is not an ideal solution but seems to work for the issues at hand. Firstly, if any duplicated indices are returned by the API, they are dropped invisibly to the user. Secondly, arguably the splitting and concatenation should happen before tabulation; that would make more sense and be more efficient.
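A small sketch of that merge behaviour, assuming a pandas groupby-based implementation (the data are made up):

```python
import pandas as pd

# Two partial rows sharing one timestamp, as can happen when a
# >100-document query is split and the pieces are concatenated.
idx = pd.to_datetime(['2021-01-01', '2021-01-01', '2021-01-02'])
df = pd.DataFrame({'a': [1.0, None, 3.0], 'b': [None, 2.0, 4.0]}, index=idx)

# groupby(level=0).last() keeps, per column, the last non-NaN value
# within each group of identical index labels.
merged = df.groupby(level=0).last()
print(merged)
#               a    b
# 2021-01-01  1.0  2.0
# 2021-01-02  3.0  4.0
```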
I added a test based on #234 - unfortunately it runs long and makes 5 requests. Also, this test provides a situation where there are no "intended" duplicates. I'll be happy to make more tests if someone proposes them. After looking at it for some time, my idea of doing it better would be to do it directly in …
```python
interval_mask = (
    ((frame.index > _start) & (frame.index <= _end))
    | (frame.index == start)
)
frame = frame.loc[interval_mask]
```
A bit late to the party, but I believe this change breaks queries returning yearly data, like `client.query_installed_generation_capacity(country_code, start, end, psr_type=None)`.
Example to recreate the issue:
```python
from entsoe import EntsoePandasClient
import pandas as pd

client = EntsoePandasClient(api_key=API_KEY)

start = pd.Timestamp('20171201', tz='Europe/Brussels')
end = pd.Timestamp('20180101', tz='Europe/Brussels')
country_code = "BE"

client.query_installed_generation_capacity(
    country_code=country_code, start=start, end=end, psr_type="B01"
)
```
Expected result:

```
                           Biomass
2017-01-01 00:00:00+01:00    598.0
```

Returned result:

```
Empty DataFrame
Columns: [Biomass]
Index: []
```
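A hypothetical reconstruction of why the mask above can empty a yearly frame (assuming the query falls into a single year block whose bounds equal the query bounds; variable names mirror the snippet above): installed-capacity documents are indexed at the start of the year, which lies before a December block start.

```python
import pandas as pd

# Made-up yearly frame: the document is indexed at the start of the
# year (2017-01-01), matching the expected result above.
frame = pd.DataFrame(
    {'Biomass': [598.0]},
    index=pd.to_datetime(['2017-01-01']).tz_localize('Europe/Brussels'),
)

start = pd.Timestamp('20171201', tz='Europe/Brussels')
_start, _end = start, pd.Timestamp('20180101', tz='Europe/Brussels')

# Same mask as in the snippet above: 2017-01-01 is not > _start and
# differs from start, so every row is dropped.
interval_mask = ((frame.index > _start) & (frame.index <= _end)) | (frame.index == start)
print(frame.loc[interval_mask])  # Empty DataFrame
```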
Rework decorators
This PR introduces some improvements to decorators:

1. `@year_limited` deduplication is changed from dropping index duplicates to truncating. This gets rid of unintended duplicates but leaves the duplicates served by the API (fixing issue #234, "documents_limited decorator removes data after concatenating by removing 'duplicates'"; see also #235).
2. `logging` vs. printing to `stderr`;
3. TODO: Same as 1. for `@documents_limited` - I need to find a test case and make sure that the direction is right.

A few discussions have been had about proper duplicate removal in the `limited` decorators. It's not clear to me whether this is the best way to tackle the problem: after this PR is merged, the user will be tasked with removing duplicated indices in rare cases. This is better than the user not having any say in the matter (as is currently the case, when the duplicated indices are removed, seemingly, at random). Nevertheless, after this PR the user still doesn't have much to go on when choosing which duplicates to keep.

In database design the solution would be to add column(s) that state in what circumstances a record is valid, e.g. technical dates (`valid_from`, `valid_to`). I don't know enough about the library's parsing to know whether this is even possible, and even if so, this would be a major change - e.g. appending an index level would be breaking.

Another solution would be to simply order the resulting records and allow the user to pick the last index occurrence to get the most recent data (`df.loc[~df.index.duplicated(keep='last')]` - see the sketch below). Again, I'm not even sure this is possible, and if so it would require some work in the parsers.

@fboerman and everyone interested, let me know what you think - whether I should finish `@documents_limited` and merge as is, or we should keep digging for a 'bestest' solution. In any case, keep in mind that this PR may already be breaking for some users.

Please excuse me for bundling code quality improvements into this PR, but the linter was killing me.
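A minimal sketch of the `keep='last'` workaround mentioned above, on made-up data:

```python
import pandas as pd

# Made-up frame with a duplicated timestamp, e.g. a corrected value
# served by the API after the original publication.
idx = pd.to_datetime(['2021-01-01', '2021-01-01', '2021-01-02'])
df = pd.DataFrame({'price': [50.0, 51.5, 49.0]}, index=idx)

# Keep only the last occurrence of each index label.
deduped = df.loc[~df.index.duplicated(keep='last')]
print(deduped)
#             price
# 2021-01-01   51.5
# 2021-01-02   49.0
```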