Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added static typing to *_data classes in data_readers #677

Merged
merged 20 commits into from
Oct 18, 2022

Conversation

Sanketh7
Copy link
Contributor

@Sanketh7 Sanketh7 commented Oct 5, 2022

@taylorfturner
Copy link
Contributor

@Sanketh7 make sure you do you new branches off of capitalone/main for DP so your branch isn't out of date

@taylorfturner taylorfturner added the static_typing mypy static typing issues label Oct 6, 2022
@taylorfturner
Copy link
Contributor

looks like tests are failing because options is None

>       self.SAMPLES_PER_LINE_DEFAULT: int = options.get("record_samples_per_line", 1)
E       AttributeError: 'NoneType' object has no attribute 'get'

@taylorfturner
Copy link
Contributor

@Sanketh7 tests failing 🤔

Copy link
Contributor

@taylorfturner taylorfturner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

couple material comments and some notes around code additions. Can't approve since I merged main

dataprofiler/data_readers/graph_data.py Outdated Show resolved Hide resolved
continue
if (
column is not self._source_node
or column is not self._destination_node
):
) and self._column_names is not None:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: code change here

self._column_names = self.csv_column_names(
self.input_file_path, self._header, self._delimiter, self.file_encoding
)
if self._source_node is None:
if self._source_node is None and self._column_names is not None:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: code change here

self._source_node = self._find_target_string_in_column(
self._column_names, self._source_keywords
)
if self._destination_node is None:
if self._destination_node is None and self._column_names is not None:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: code change here

dataprofiler/data_readers/text_data.py Outdated Show resolved Hide resolved
ksneab7
ksneab7 previously approved these changes Oct 12, 2022
@@ -18,9 +20,14 @@
class CSVData(SpreadSheetDataMixin, BaseData):
"""SpreadsheetData class to save and load spreadsheet data."""

data_type = "csv"
data_type: Optional[str] = "csv"
Copy link
Contributor

@JGSweets JGSweets Oct 13, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this might be an issue with the original code. The value shouldn't be optional, it is only `None for the base class b/c it needs to be set by the derived. Hence, setting it to optional seems like it isn't true.

Maybe in the base we could make it an abstract property?

import abc

class Base(abc.ABC):
    @property
    @abc.abstractmethod
    def data_type(self) -> str:
        ...<FILL>

    @data_type.setter
    @abc.abstractmethod
    def data_type(self, value: str) -> None:
        ...<FILL>

class CSVData(Base):
    data_type: str = "csv

Would something like this work?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 yeah I'd prefer too that the child classes aren't Optional[str] type for data_type attribute CC @Sanketh7

Copy link
Contributor

@JGSweets JGSweets Oct 13, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Better yet, not the above...

class Base():
    name: str
    
    def print_name(self):
        print(self.name)  # will raise an Attribute error at runtime if `name` isn't defined in subclass

class Derived(Base):
    name = "derived one"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This last solution works well and I've committed that change.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops. Looks like it makes a test fail.

=================================== FAILURES ===================================
_______________ TestBaseDataClass.test_can_apply_data_functions ________________
self = <dataprofiler.tests.data_readers.test_base_data.TestBaseDataClass testMethod=test_can_apply_data_functions>

    def test_can_apply_data_functions(self):
        class FakeDataClass:
            # matches the `data_type` value in BaseData for validating priority
            data_type = "FakeData"
    
            def func1(self):
                return "success"
    
        # initialize the data class
        data = BaseData(input_file_path="", data=FakeDataClass(), options={})
    
        # if the function exists in BaseData fail the test because the results
        # may become inaccurate.
        self.assertFalse(hasattr(BaseData, "func1"))
    
        with self.assertRaisesRegex(
            AttributeError,
            "Neither 'BaseData' nor 'FakeDataClass' " "objects have attribute 'test'",
        ):
            data.test
    
        # validate it will take BaseData attribute over the data attribute
>       self.assertIsNone(data.data_type)
E       AssertionError: 'FakeData' is not None

dataprofiler/tests/data_readers/test_base_data.py:96: AssertionError

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume we could change the test to detect an attribute error instead of checking if the field is None?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting, basically bc data_type doesn't exist in BaseData, this test is no longer validating priority.
Attribute error still wouldn't be correct bc FakeData as it.
We would need to test a property they both have. I think instead we can update the test to use input_file_path

        class FakeDataClass:
            # matches the `data_type` value in BaseData for validating priority
            options = {"not_empty": "data"}
    
            def func1(self):
                return "success"

In the test we can assert the options is empty.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Made the change here: 2e85001

I didn't remove the data_type field from FakeDataClass because I wasn't sure if that would cause unintended side effects.

@@ -58,8 +65,9 @@ def __init__(self, input_file_path=None, data=None, options=None):
:return: None
"""
options = self._check_and_return_options(options)
options = cast(Dict, options)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldn't the line above already indicate that it is a dict?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right. I probably forgot to delete this from a previous implementation.

BaseData.__init__(self, input_file_path, data, options)
SpreadSheetDataMixin.__init__(self, input_file_path, data, options)
SpreadSheetDataMixin.__init__(self, input_file_path, data, cast(Dict, options))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to cast here and above, if at all?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above.

"""
Ensure options are valid inputs to the data reader.

:param options: dictionary of options for the csv reader to validate
:type options: dict
:return: None
"""
options = super()._check_and_return_options(options)
options = super(CSVData, CSVData)._check_and_return_options(options)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's going on here? this seems abnormal.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had to change _check_and_return_options to be a static method because BaseData implements it as a static method. However, super() with no arguments only really works with instance and class methods. I ended up following https://stackoverflow.com/questions/26788214/super-and-staticmethod-interaction#:~:text=When%20you%20call%20a%20class,or%20class%20it%20was%20called. to figure out how to make it work with static methods.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm fine with it. We can refactor in the future if necessary.

@@ -186,7 +198,7 @@ def _guess_delimiter_and_quotechar(
vocab = Counter(data_as_str)
if "\n" in vocab:
vocab.pop("\n")
for char in omitted + [quotechar]:
for char in omitted + ([quotechar] if quotechar is not None else []):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this line gets a little confusing if it's all in one. If this if is necessary, can we do it before the for?

omitted_list: list[str] = ommitted
if quotechar is not None:
    omitted_list: list[str] = omitted + [quotechar]

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

then we can use that in the for

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Sanketh7 in case you missed

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@@ -534,13 +546,13 @@ def _load_data_from_str(self, data_as_str):
)
return data_utils.read_csv_df(
data_buffered,
self.delimiter,
self.header,
cast(str, self.delimiter),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do these need to be casted?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like with one of my recent commits to data_utils, self.delimiter doesn't need to be casted. However, self.header does because read_csv_df only supports Optional[int] but self.header could also be a string the "auto" case. As far as I can tell, this case doesn't exist at this point at runtime but it doesn't seem like mypy can detect that.

@taylorfturner taylorfturner added the High Priority Dramatic improvement, inaccurate calculation(s) or bug / feature making the library unusable label Oct 13, 2022
"""
Ensure options are valid inputs to the data reader.

:param options: dictionary of options for the csv reader to validate
:type options: dict
:return: None
"""
options = super()._check_and_return_options(options)
options = super(CSVData, CSVData)._check_and_return_options(options)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

def reload(
self,
input_file_path: Optional[str] = None,
data: Optional[pd.DataFrame] = None,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:type data: multiple types is the docstring for data... but static typing is only saying None or pd.DataFrame. I think at least one or the other should be update to make sure the code and docstring match

@@ -737,4 +759,4 @@ def reload(self, input_file_path=None, data=None, options=None):
header=self.header, delimiter=self.delimiter, quotechar=self.quotechar
)
super(CSVData, self).reload(input_file_path, data, options)
self.__init__(self.input_file_path, data, options)
self.__init__(self.input_file_path, data, options) # type: ignore
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we get rid of this #type: ignore by resolving an issue upstream?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The issue is that mypy doesn't like using self.__init__ because it doesn't know which constructor will end up being called (and therefore doesn't know what types are needed). However, the current code assumes that behavior so I just told mypy to ignore that line.

def __init__(
self,
input_file_path: Optional[str] = None,
data: Optional[nx.Graph] = None,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same comment from above in csv_data --> :type data: multiple types is type on data but we only allow nx.graph or None in this __init__. Should update so they match

Comment on lines 140 to 142
file_path = cast(
Union[StringIO, BytesIO], file_path
) # guaranteed by is_stream_buffer
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's see if we can get rid of these cast statements

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Comment on lines 153 to 155
file_path = cast(
Union[StringIO, BytesIO], file_path
) # guaranteed by is_stream_buffer
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's see if we can get rid of these cast statements

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@@ -148,4 +178,4 @@ def reload(self, input_file_path=None, data=None, options=None):
:return: None
"""
super(ParquetData, self).reload(input_file_path, data, options)
self.__init__(self.input_file_path, data, options)
self.__init__(self.input_file_path, data, options) # type: ignore
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

any change we can get rid of this #type: ignore

dataprofiler/data_readers/text_data.py Outdated Show resolved Hide resolved
@@ -93,7 +94,7 @@ def func1(self):
data.test

# validate it will take BaseData attribute over the data attribute
self.assertIsNone(data.data_type)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is this getting removed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In order to make data_type have type str instead of Optional[str], I had to make it so it doesn't get set to anything in BaseData (and just has a type definition line for mypy). This makes it so you get an exception if you try getting the data_type in BaseData so @JGSweets recommended looking at another field to test the same behavior. In this case, I looked to make sure options is taken from BaseData instead of FakeData which would mean that options is an empty dict.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

correct

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's replaced below with a different assert

@taylorfturner taylorfturner enabled auto-merge (squash) October 18, 2022 13:52
@taylorfturner taylorfturner merged commit 44a3256 into capitalone:main Oct 18, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
High Priority Dramatic improvement, inaccurate calculation(s) or bug / feature making the library unusable static_typing mypy static typing issues
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants