More Checks on input data #149
Codecov Report
@@           Coverage Diff            @@
##           master    #149    +/-   ##
========================================
- Coverage   98.73%     98%   -0.73%
========================================
  Files          33      33
  Lines        1183    1202     +19
========================================
+ Hits         1168    1178     +10
- Misses         15      24      +9
Continue to review full report at Codecov.
@MaxBlesch: How much of this is still relevant, and have you already included some of this work somewhere else?
I think we wanted to infer dtypes of internal variables from type hints. This is the most beautiful and fruitful solution. I do not know about the changes to the test data. Are they useful?
The change to the test data was just the introduction of a bool variable indicating gender. I think this is not necessary!
But I think we cannot completely abstain from variable-specific tests, e.g., testing whether mietstufe is in 1 to 7 or in 1 to 6. But this may also depend on the year, and I think it is still far down the road!
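A variable-specific check of this kind could look roughly like the sketch below. The function name `rows_outside_range` is hypothetical, and the bounds are passed in by the caller because the permitted range may depend on the year.

```python
import pandas as pd


def rows_outside_range(series, lower, upper):
    """Return the index labels of values outside [lower, upper].

    Hypothetical helper, not part of gettsim. Missing values are also
    reported, because Series.between evaluates to False for NaN.
    """
    return series[~series.between(lower, upper)].index.tolist()


# Illustrative data only; the bounds 1 and 7 are supplied by the caller.
mietstufe = pd.Series([1, 4, 8], name="mietstufe")
bad_rows = rows_outside_range(mietstufe, 1, 7)
```

Returning index labels instead of raising makes it possible to report all offending rows in a single error message.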
Boolean is not future-proof and not even present-proof. Not kidding, it should be a category or convertible into a category: m / f until 2019 (?), m / f / d since then.
Can we have type hints for categorical-style variables? Thinking of things like
I think we can easily check for the general dtype like float, int, categorical, bool using the type hints from this API: pandas-dev/pandas#26766. I think we can also create custom type hints and provide more information, like all possible values of a categorical. Here is an example: pandas-dev/pandas#14468 (comment). I am not sure what needs to be done so that it plays nicely with mypy, but this is not our real use case anyway. Mainly, we want to check the dtype of input variables, and we can get that information by examining the signatures of functions. During the test session, we can also check whether our internal functions provide the expected dtype.
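To make the idea concrete, here is a minimal sketch of reading annotations from a function signature and comparing them with the dtypes of the input columns. It assumes plain Python types (`float`, `int`, `bool`) as annotations rather than the pandas typing API referenced above; `DTYPE_CHECKS`, `check_input_dtypes` and `wohngeld_example` are hypothetical names and not part of gettsim.

```python
import inspect

import pandas as pd
from pandas.api import types as ptypes

# Assumed mapping from annotation to a pandas dtype predicate.
DTYPE_CHECKS = {
    float: ptypes.is_float_dtype,
    int: ptypes.is_integer_dtype,
    bool: ptypes.is_bool_dtype,
}


def check_input_dtypes(func, data):
    """Collect a message for every column whose dtype contradicts the hint."""
    errors = []
    for name, param in inspect.signature(func).parameters.items():
        check = DTYPE_CHECKS.get(param.annotation)
        # Skip arguments without a known hint or without a matching column.
        if check is None or name not in data:
            continue
        if not check(data[name]):
            errors.append(
                f"{name}: expected {param.annotation.__name__}, "
                f"got {data[name].dtype}"
            )
    return errors


def wohngeld_example(miete: float, mietstufe: int, weiblich: bool):
    """Hypothetical internal function; only its signature matters here."""


data = pd.DataFrame({"miete": [500.0], "mietstufe": [3.0], "weiblich": [True]})
errors = check_input_dtypes(wohngeld_example, data)
```

The same helper could run in the test session on the outputs of internal functions by looking at the return annotation instead of the parameters.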
Hmm, I do not fully see through the latter case, I am afraid. One issue would be that the possible types of a categorical would be dynamic -- e.g., they would need to correspond to the params object in a particular year. But this will be for someone else to worry about and it should not be a problem to have a two-step process (first check whether something is a categorical; then check possible values).
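The two-step process might be sketched as follows. The function `check_categorical` and the column name `geschlecht` are hypothetical; the allowed values are passed in by the caller, so they could be taken from the params object of a particular year.

```python
import pandas as pd


def check_categorical(series, allowed):
    """Return an error message, or None if both checks pass.

    Step 1: the column must have a categorical dtype.
    Step 2: its categories must be a subset of the allowed values.
    """
    if not isinstance(series.dtype, pd.CategoricalDtype):
        return f"{series.name}: expected a categorical, got {series.dtype}"
    unexpected = set(series.cat.categories) - set(allowed)
    if unexpected:
        return f"{series.name}: unexpected categories {sorted(unexpected)}"
    return None


# Illustrative data; which values are valid in which year is up to the caller.
sex = pd.Series(["m", "f", "d"], dtype="category", name="geschlecht")
message = check_categorical(sex, allowed=["m", "f"])
```

Keeping the allowed values out of the type hint itself sidesteps the problem that they are dynamic.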
I think we can close this PR. With #230, type hints will be in place, and with that infrastructure it will be easy to implement all other checks. For cases like
What problem do you want to solve?
When trying to run 'self-made' households through gettsim, a couple of errors occurred due to incorrect variable specifications and a wrong data structure. I therefore added a couple of checks to be done on the input data.
Todo
wohnfl, mietstufe, miete.
nan values
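As an illustration of the nan item, a check for missing values in selected input columns could look like the sketch below. The helper name `columns_with_nan` is hypothetical and the data is made up.

```python
import pandas as pd


def columns_with_nan(data, columns):
    """Return the names of the given columns that contain missing values."""
    return [col for col in columns if data[col].isna().any()]


# Made-up household data with one missing rent value.
households = pd.DataFrame(
    {
        "wohnfl": [60.0, 85.0],
        "mietstufe": [3, 5],
        "miete": [500.0, float("nan")],
    }
)
with_nan = columns_with_nan(households, ["wohnfl", "mietstufe", "miete"])
```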