-
Notifications
You must be signed in to change notification settings - Fork 162
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Added static typing to *_data classes in data_readers #677
Added static typing to *_data classes in data_readers #677
Conversation
@Sanketh7 make sure you do you new branches off of |
looks like tests are failing because
|
@Sanketh7 tests failing 🤔 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
couple material comments and some notes around code additions. Can't approve since I merged main
continue | ||
if ( | ||
column is not self._source_node | ||
or column is not self._destination_node | ||
): | ||
) and self._column_names is not None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note: code change here
self._column_names = self.csv_column_names( | ||
self.input_file_path, self._header, self._delimiter, self.file_encoding | ||
) | ||
if self._source_node is None: | ||
if self._source_node is None and self._column_names is not None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note: code change here
self._source_node = self._find_target_string_in_column( | ||
self._column_names, self._source_keywords | ||
) | ||
if self._destination_node is None: | ||
if self._destination_node is None and self._column_names is not None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note: code change here
@@ -18,9 +20,14 @@ | |||
class CSVData(SpreadSheetDataMixin, BaseData): | |||
"""SpreadsheetData class to save and load spreadsheet data.""" | |||
|
|||
data_type = "csv" | |||
data_type: Optional[str] = "csv" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this might be an issue with the original code. The value shouldn't be optional, it is only `None for the base class b/c it needs to be set by the derived. Hence, setting it to optional seems like it isn't true.
Maybe in the base we could make it an abstract property?
import abc
class Base(abc.ABC):
@property
@abc.abstractmethod
def data_type(self) -> str:
...<FILL>
@data_type.setter
@abc.abstractmethod
def data_type(self, value: str) -> None:
...<FILL>
class CSVData(Base):
data_type: str = "csv
Would something like this work?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 yeah I'd prefer too that the child classes aren't Optional[str]
type for data_type
attribute CC @Sanketh7
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Better yet, not the above...
class Base():
name: str
def print_name(self):
print(self.name) # will raise an Attribute error at runtime if `name` isn't defined in subclass
class Derived(Base):
name = "derived one"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This last solution works well and I've committed that change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oops. Looks like it makes a test fail.
=================================== FAILURES ===================================
_______________ TestBaseDataClass.test_can_apply_data_functions ________________
self = <dataprofiler.tests.data_readers.test_base_data.TestBaseDataClass testMethod=test_can_apply_data_functions>
def test_can_apply_data_functions(self):
class FakeDataClass:
# matches the `data_type` value in BaseData for validating priority
data_type = "FakeData"
def func1(self):
return "success"
# initialize the data class
data = BaseData(input_file_path="", data=FakeDataClass(), options={})
# if the function exists in BaseData fail the test because the results
# may become inaccurate.
self.assertFalse(hasattr(BaseData, "func1"))
with self.assertRaisesRegex(
AttributeError,
"Neither 'BaseData' nor 'FakeDataClass' " "objects have attribute 'test'",
):
data.test
# validate it will take BaseData attribute over the data attribute
> self.assertIsNone(data.data_type)
E AssertionError: 'FakeData' is not None
dataprofiler/tests/data_readers/test_base_data.py:96: AssertionError
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I assume we could change the test to detect an attribute error instead of checking if the field is None?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interesting, basically bc data_type doesn't exist in BaseData, this test is no longer validating priority.
Attribute error still wouldn't be correct bc FakeData as it.
We would need to test a property they both have. I think instead we can update the test to use input_file_path
class FakeDataClass:
# matches the `data_type` value in BaseData for validating priority
options = {"not_empty": "data"}
def func1(self):
return "success"
In the test we can assert the options
is empty.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Made the change here: 2e85001
I didn't remove the data_type field from FakeDataClass because I wasn't sure if that would cause unintended side effects.
@@ -58,8 +65,9 @@ def __init__(self, input_file_path=None, data=None, options=None): | |||
:return: None | |||
""" | |||
options = self._check_and_return_options(options) | |||
options = cast(Dict, options) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shouldn't the line above already indicate that it is a dict?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're right. I probably forgot to delete this from a previous implementation.
BaseData.__init__(self, input_file_path, data, options) | ||
SpreadSheetDataMixin.__init__(self, input_file_path, data, options) | ||
SpreadSheetDataMixin.__init__(self, input_file_path, data, cast(Dict, options)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to cast here and above, if at all?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as above.
""" | ||
Ensure options are valid inputs to the data reader. | ||
|
||
:param options: dictionary of options for the csv reader to validate | ||
:type options: dict | ||
:return: None | ||
""" | ||
options = super()._check_and_return_options(options) | ||
options = super(CSVData, CSVData)._check_and_return_options(options) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's going on here? this seems abnormal.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had to change _check_and_return_options
to be a static method because BaseData
implements it as a static method. However, super()
with no arguments only really works with instance and class methods. I ended up following https://stackoverflow.com/questions/26788214/super-and-staticmethod-interaction#:~:text=When%20you%20call%20a%20class,or%20class%20it%20was%20called. to figure out how to make it work with static methods.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@JGSweets thoughts on this? Any additional thoughts around this implementation?
Found this example which helps provide an example for how @Sanketh7 implement this: https://stackoverflow.com/questions/26788214/super-and-staticmethod-interaction#:~:text=class%20Second(First)%3A%0A%20%20%40staticmethod%0A%20%20def%20getlist()%3A%0A%20%20%20%20l%20%3D%20super(Second%2C%20Second).getlist()%20%20%23%20note%20the%202nd%20argument%0A%20%20%20%20l.append(%27second%27)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm fine with it. We can refactor in the future if necessary.
@@ -186,7 +198,7 @@ def _guess_delimiter_and_quotechar( | |||
vocab = Counter(data_as_str) | |||
if "\n" in vocab: | |||
vocab.pop("\n") | |||
for char in omitted + [quotechar]: | |||
for char in omitted + ([quotechar] if quotechar is not None else []): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this line gets a little confusing if it's all in one. If this if is necessary, can we do it before the for
?
omitted_list: list[str] = ommitted
if quotechar is not None:
omitted_list: list[str] = omitted + [quotechar]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
then we can use that in the for
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Sanketh7 in case you missed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
@@ -534,13 +546,13 @@ def _load_data_from_str(self, data_as_str): | |||
) | |||
return data_utils.read_csv_df( | |||
data_buffered, | |||
self.delimiter, | |||
self.header, | |||
cast(str, self.delimiter), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do these need to be casted?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like with one of my recent commits to data_utils, self.delimiter
doesn't need to be casted. However, self.header
does because read_csv_df
only supports Optional[int]
but self.header
could also be a string the "auto"
case. As far as I can tell, this case doesn't exist at this point at runtime but it doesn't seem like mypy can detect that.
""" | ||
Ensure options are valid inputs to the data reader. | ||
|
||
:param options: dictionary of options for the csv reader to validate | ||
:type options: dict | ||
:return: None | ||
""" | ||
options = super()._check_and_return_options(options) | ||
options = super(CSVData, CSVData)._check_and_return_options(options) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@JGSweets thoughts on this? Any additional thoughts around this implementation?
Found this example which helps provide an example for how @Sanketh7 implement this: https://stackoverflow.com/questions/26788214/super-and-staticmethod-interaction#:~:text=class%20Second(First)%3A%0A%20%20%40staticmethod%0A%20%20def%20getlist()%3A%0A%20%20%20%20l%20%3D%20super(Second%2C%20Second).getlist()%20%20%23%20note%20the%202nd%20argument%0A%20%20%20%20l.append(%27second%27)
def reload( | ||
self, | ||
input_file_path: Optional[str] = None, | ||
data: Optional[pd.DataFrame] = None, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
:type data: multiple types
is the docstring for data
... but static typing is only saying None
or pd.DataFrame
. I think at least one or the other should be update to make sure the code and docstring match
@@ -737,4 +759,4 @@ def reload(self, input_file_path=None, data=None, options=None): | |||
header=self.header, delimiter=self.delimiter, quotechar=self.quotechar | |||
) | |||
super(CSVData, self).reload(input_file_path, data, options) | |||
self.__init__(self.input_file_path, data, options) | |||
self.__init__(self.input_file_path, data, options) # type: ignore |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we get rid of this #type: ignore
by resolving an issue upstream?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The issue is that mypy doesn't like using self.__init__
because it doesn't know which constructor will end up being called (and therefore doesn't know what types are needed). However, the current code assumes that behavior so I just told mypy to ignore that line.
def __init__( | ||
self, | ||
input_file_path: Optional[str] = None, | ||
data: Optional[nx.Graph] = None, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same comment from above in csv_data
--> :type data: multiple types
is type on data
but we only allow nx.graph
or None
in this __init__
. Should update so they match
file_path = cast( | ||
Union[StringIO, BytesIO], file_path | ||
) # guaranteed by is_stream_buffer |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's see if we can get rid of these cast
statements
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
file_path = cast( | ||
Union[StringIO, BytesIO], file_path | ||
) # guaranteed by is_stream_buffer |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's see if we can get rid of these cast
statements
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
@@ -148,4 +178,4 @@ def reload(self, input_file_path=None, data=None, options=None): | |||
:return: None | |||
""" | |||
super(ParquetData, self).reload(input_file_path, data, options) | |||
self.__init__(self.input_file_path, data, options) | |||
self.__init__(self.input_file_path, data, options) # type: ignore |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
any change we can get rid of this #type: ignore
@@ -93,7 +94,7 @@ def func1(self): | |||
data.test | |||
|
|||
# validate it will take BaseData attribute over the data attribute | |||
self.assertIsNone(data.data_type) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why is this getting removed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In order to make data_type
have type str
instead of Optional[str]
, I had to make it so it doesn't get set to anything in BaseData
(and just has a type definition line for mypy). This makes it so you get an exception if you try getting the data_type
in BaseData
so @JGSweets recommended looking at another field to test the same behavior. In this case, I looked to make sure options
is taken from BaseData
instead of FakeData
which would mean that options
is an empty dict.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
correct
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it's replaced below with a different assert
#610