RFC: Dense arrays with missing data #54
Comments
I like this idea. Making Also, if you use bitstypes capable of supporting NA's (issue #45), you have arrays with NA support now. As to float arrays with missing data, I like the idea. Harlan and Stefan are resistant as you can see from the commentary on issue #45. For floating point values, NaN's don't completely have correct semantics for missing data. The one area of difference is comparisons. Comparisons involving NaN always return |
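For reference, a minimal sketch of the NaN comparison behaviour discussed in the comment above. It uses only standard Base functions and modern Julia syntax, which differs from the 2012 syntax used elsewhere in this thread. The variable name `x` is just for illustration.

```julia
# IEEE 754 comparison behaviour of NaN, as exposed by Julia's Base operators.
x = NaN

eq_self   = (x == x)        # false: NaN is not equal to itself
less_than = (x < 1.0)       # false: ordered comparisons with NaN are false
more_than = (x > 1.0)       # false
not_equal = (x != x)        # true: the one operator that does not return false

# Explicit tests are the reliable way to detect NaN.
detected  = isnan(x)        # true
same      = isequal(x, x)   # true: isequal treats NaN as equal to NaN
```

This is why code that treats NaN as a missing-data marker has to test with `isnan` rather than `==`.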
In general, I'm definitely in support of a data type for floating-point matrices (and/or higher-dimensional Float arrays) with NA implemented by NaN payload, presumably by a bits type with appropriate conversions, as Tom suggests. I don't see any reason why that type can't inherit from AbstractArray/Vector. Thanks for your efforts here! But DataFrames are semantically different from a DataMatrix/DataArray, and I strongly feel that there should be a single globally-useful implementation of NAs for the DataFrame type, and trying to push the round NaN peg into that square hole is not going to end up being easy for users (or package developers) to work with. For now, let's keep the code in this issue separate from the existing DataVec/DataFrame types. Definitely re-use Indexes and other ideas as you can, but let's treat this as a separate "JuliaData" type for working with separate types of data. I sure wish I had more time to work on JuliaData! Maybe in a week or two... |
Thanks for the comments. Making As for float arrays I don't think the NaN comparison issue is really an issue as long as it's documented that you should only compare non-NaN elements. I have implemented a few functions for float arrays that allow skipping NaNs via a Bool argument. I have run into problems implementing Thanks. |
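The "skip NaNs via a Bool argument" approach mentioned above might look roughly like the following. This is a sketch under assumptions, not the code from the fork: the function name `sum_skipnan` and the argument name `skipnan` are hypothetical, and the syntax is modern Julia.

```julia
# Hypothetical helper (name and signature are assumptions, not from the fork):
# sum a float array, optionally treating NaN entries as missing and skipping them.
function sum_skipnan(A::AbstractArray{<:AbstractFloat}, skipnan::Bool=false)
    total = zero(eltype(A))
    for x in A
        if skipnan && isnan(x)
            continue          # treat NaN as missing and skip it
        end
        total += x
    end
    return total
end

skipped = sum_skipnan([1.0, NaN, 2.0], true)   # NaN entry ignored
default = sum_skipnan([1.0, NaN, 2.0])         # NaN propagates into the result
```

With `skipnan` left at its default the NaN propagates, so the caller has to opt in to dropping missing values.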
At a quick glance, it looks like you can have multidimensional BitArrays. As far as NA's or NaN's and functions that work on them, I don't really like nanmean, nanvar, etc. I'd rather see something like:
In this case, https://github.com/tshort/JuliaData/blob/floatNA/src/alternate_NA.jl Some of that is commented out, but at least one of the functions worked at one time. |
Multidimensional BitArrays should make I'm not a fan of the
but this is ugly. Also, it seems like implementing We could use the options module to pass in a skipna option to allow syntax like I will push some code on my fork of JuliaData later today so you can see the current state of things. |
I don't see why your concern about |
Here is the code for sum that uses regular arrays from the link I provided above.

    function sum(A::NAFilter)
        A = A.x
        v = 0.0
        for x in A
            if !isna(x)
                v += x
            end
        end
        v
    end

For DataVecs, you could have a DataVec-specific method that could handle both the replace and the filter flags. |
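To run the `sum` method above on its own, it needs a wrapper type and an `isna` predicate. The following is a self-contained sketch in modern Julia. The struct definition, the `nafilter` constructor function, and the rule that NaN counts as NA for floats are all assumptions made for illustration; the actual definitions are in the linked alternate_NA.jl.

```julia
# Hypothetical stand-ins for the definitions in alternate_NA.jl.
struct NAFilter{T}
    x::T                      # the wrapped collection
end

nafilter(x) = NAFilter(x)     # wrap a collection so that NAs are skipped

isna(x) = false                       # by default nothing is missing
isna(x::AbstractFloat) = isnan(x)     # assumption: NaN marks a missing float

# Same logic as the sum shown in the thread, written as a Base.sum method.
function Base.sum(A::NAFilter)
    v = 0.0
    for x in A.x
        if !isna(x)
            v += x
        end
    end
    v
end

total = sum(nafilter([1.0, NaN, 2.5]))   # sums only the non-NA entries
```

The appeal of this design is that the skipping behaviour is chosen by wrapping the argument, so no `nansum`-style function names or extra Bool arguments are needed.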
Regarding @johnmyleswhite's comment, you're right, this is not a problem for @tshort, I'm going to play around with your idea of Thanks. |
I agree with Tom here. Although the naFilter/naReplace operations need On Mon, Aug 27, 2012 at 10:51 AM, Tom Short notifications@github.com wrote:
|
I have pushed my current code implementing some functions to handle NaNs as missing data for Float arrays to the float-nan branch of my fork of JuliaData. Specifically, Feedback is appreciated. Thanks. |
Good stuff, nfoti, I don't have time for much of a review, and I'll be out for the next week, but here are some quick comments:
|
Thanks for taking a look, there's no need for a thorough review yet. I agree that of the options that are available now the You're right, the functions in nanarray.jl can probably be implemented with Good point with Thanks again. Nick |
I've pushed some new code (float-nan branch) that only implements the |
Closed by b95ee3f |
One aspect of missing data that JuliaData does not support is dense arrays with missing data. Extending DataVec and PooledDataVec to a DataArray type that can handle missing data for an array of arbitrary dimension seems like a useful addition to the package. The semantics of a DataArray should be the same as normal Arrays with the addition that functions that operate on them should have the option of excluding missing data. Additionally, slicing a DataArray would return a DataArray with the proper number of dimensions. In principle a 1d DataArray would be a DataVec, however, there may be compelling reasons to keep the DataVec implementation separate. The proposed implementation of DataArray will be such that a 1d DataArray will behave exactly as a current DataVec. Having a special type DataMatrix for 2d data also would be useful. The nafilter/naFilter and nareplace/naReplace would return flattened versions of the objects. I'm sure there are behaviors I have not specified or have not been clear about, so any thoughts on the design and implementation of DataArray are appreciated.

One other special case that deserves attention is float arrays with missing data. In this case I think it is worth implementing something similar to the approach in issue #22 for arbitrary arrays of floats. That is, using NaN to indicate missing data in arrays of floats. In this special case NaN has the correct semantics for missing data and does not require a separate mask. It is also straightforward to implement this behavior. Again, any thoughts are appreciated. This type could then be sub-typed to allow named rows and columns via the Index type already in JuliaData. However, just the added functionality for Float arrays would be very useful for machine learning and statistics algorithms that operate on float arrays.
I am planning on implementing these ideas as time permits, but help is welcome if anyone wants to run with the ideas.
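As a rough illustration of the mask-based proposal above, here is one way an N-dimensional array with a missing-data mask could be laid out. This is only a sketch in modern Julia and not the JuliaData implementation: the names `DataArraySketch`, `slice_sketch`, and `sum_present` are hypothetical, and storing the mask in a BitArray is an assumption borrowed from the discussion of multidimensional BitArrays.

```julia
# Hypothetical sketch of a dense N-dimensional array with a missing-data mask.
struct DataArraySketch{T,N}
    data::Array{T,N}     # stored values (contents are ignored where masked)
    na::BitArray{N}      # true marks a missing entry
end

# Convenience constructor: wrap an array with nothing marked missing.
DataArraySketch(data::Array{T,N}) where {T,N} =
    DataArraySketch{T,N}(data, falses(size(data)))

# Slicing with ranges keeps one dimension per range, so the result is again
# a DataArraySketch with the proper number of dimensions.
function slice_sketch(A::DataArraySketch, I::UnitRange{Int}...)
    DataArraySketch(A.data[I...], A.na[I...])
end

# A reduction that excludes missing entries.
function sum_present(A::DataArraySketch{T}) where {T}
    total = zero(T)
    for i in eachindex(A.data)
        if !A.na[i]
            total += A.data[i]
        end
    end
    total
end

A = DataArraySketch([1.0 2.0; 3.0 4.0])
A.na[2, 1] = true                 # mark the entry holding 3.0 as missing
B = slice_sketch(A, 1:2, 1:1)     # first column, still two-dimensional
s = sum_present(A)                # adds 1.0, 2.0 and 4.0
```

A separate mask works for any element type, whereas the NaN approach described for float arrays avoids the extra storage at the cost of only applying to floats.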