redesign the package #73
Comments
Abstract datasets

Here, I was thinking something slightly different. If you look at FastAI.jl's containers, you see that MNIST …

MNIST datasets

These datasets would now just wrap the low-level containers. You could envision something like the images are stored in directories and the labels are in a CSV. MNIST would then wrap a …

Transforms

I think these can go in MLDataPattern.jl unless they are very specifically disk IO related.

Deprecations

This is definitely going to be tricky. My preference right now is to support as many paths as possible but allow a few hard breakages. Or if we want to be really conservative, then we can do …
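As a rough illustration of the "wrap the low-level containers" idea above, here is a minimal sketch assuming images stored in a directory and labels in a CSV file; the type name, the `label` column, and the plain `getobs`/`nobs` functions are all illustrative, not the package's API:

```julia
# Sketch only: images live in a directory, labels come from a CSV, and the
# observation is assembled lazily on access.
using FileIO, CSV, DataFrames

struct ImageCSVDataset
    paths::Vector{String}   # one image file per observation
    labels::Vector          # labels read from the CSV
end

function ImageCSVDataset(imgdir::AbstractString, labelcsv::AbstractString)
    df = CSV.read(labelcsv, DataFrame)
    ImageCSVDataset(readdir(imgdir; join = true), df.label)  # assumes a `label` column
end

# in practice these would extend the interface functions of the data-access package
nobs(d::ImageCSVDataset) = length(d.paths)
getobs(d::ImageCSVDataset, i::Integer) = (load(d.paths[i]), d.labels[i])
```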
I see the goal here is to 1) reduce code duplication, and most importantly 2) introduce a Dataset concept and maybe some predefined types (e.g., ImageFolder) like how PyTorch does. Here are some of my very brief thoughts:
I'm working on a draft version for more detailed discussion; in the meantime, I just created a …
Coming here from #98 since there are some design discussions to do.

On Master

```julia
X = Iris.features() # a matrix
y = Iris.labels()   # a vector
```

Column names are missing but they are easy to obtain.

On PR#98

Now

```julia
data = Iris()
X, y = data.features, data.targets # a matrix and a vector
X, y = getobs(data)                # alternative syntax
X, y = data[]                      # alternative syntax
X, y = data[1:10]                  # a matrix and a vector
X, y = getobs(data, 1:10)          # alternative syntax
```

Suggestion#1

Per @darsnack suggestion in #98 (review) we should compose the datasets with the recently added containers from #96, so …

A few comments on this approach:

…

Also, in this approach we obtain dataframe rows when indexing:

```julia
rows = data[1:10] # a SubDataFrame
```

Maybe we can tweak it into giving

```julia
Xrows, yrows = data[1:10] # 2 SubDataFrames
```

Is this what we want? I think we can serve two audiences, those doing the data wrangling using dataframes, and those that just want to work with arrays. Maybe the first can just grab the …

Suggestion#2

Regarding having too much composition, can we find something in between? Like making …
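To make Suggestion#1 a bit more concrete, here is a minimal sketch of composing Iris with a wrapped table; the wrapper layout, the placeholder path, and the forwarding methods are assumptions for illustration, not code from #96 or #98:

```julia
# Sketch: Iris as a thin wrapper around a table read from iris.csv
using CSV, DataFrames

struct Iris
    table::DataFrame   # the PR could use a TableDataset-style container here instead
end

Iris(path::AbstractString = "iris.csv") = Iris(CSV.read(path, DataFrame))  # placeholder path

# forward the data-access interface to the wrapped table
Base.getindex(d::Iris, i) = d.table[i, :]   # a row (Int) or a slice (range)
nobs(d::Iris) = nrow(d.table)               # plain function here; would extend the interface in practice
```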
Looking over this, I agree that we don't want … My reason for wanting to use … I added …

I see two things that need to happen then for the Iris case (assuming we use this approach):

…

Then we have …
That seems a bit opaque; I think newcomers would be more confused than helped. Also, reading a csv requires …

I looked into sklearn datasets for inspiration, e.g. see here.

```python
from sklearn import datasets

# datasets are dictionary-like objects of type `Bunch` where keys can be accessed as fields
# as_frame=True is the default for most datasets
d = datasets.fetch_openml("iris", as_frame=True)
df_X = d.data    # a DataFrame with input features
df_y = d.target  # a DataFrame with targets (actually a pd.Series when there is a single target)
df_Xy = d.frame  # a DataFrame with both features and targets

# directly return features and targets dataframes
df_X, df_y = datasets.fetch_openml("iris", as_frame=True, return_X_y=True)

# store arrays instead of dataframes
d = datasets.fetch_openml("iris", as_frame=False)
X = d.data    # numpy array
y = d.target  # numpy array

# directly return features and targets arrays
X, y = datasets.fetch_openml("iris", as_frame=False, return_X_y=True)
```

I think we can have something very similar:

```julia
d = Iris(as_frame = true)
df_X = d.features   # dataframe
df_y = d.targets    # dataframe
df = d.df           # dataframe
df_X, df_y = d[]    # return the underlying dataframes, i.e. Iris(as_frame = true)[]
df_X, df_y = d[i]   # two rows or two sub-dataframes

d = Iris(as_frame = false)
X = d.features      # array
y = d.targets       # array
X, y = d[]          # return the underlying arrays, i.e. Iris(as_frame = false)[]
X, y = d[i]
```
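A rough sketch of how the `as_frame` switch could be implemented, assuming an `iris.csv` file with a `:class` target column (none of this is code from the thread):

```julia
# Sketch: store either dataframes or plain arrays depending on as_frame
using CSV, DataFrames

struct Iris{F,T}
    features::F
    targets::T
end

function Iris(path::AbstractString = "iris.csv"; as_frame::Bool = true)
    df = CSV.read(path, DataFrame)
    if as_frame
        Iris(df[:, Not(:class)], df[:, [:class]])               # two DataFrames
    else
        Iris(Matrix(df[:, Not(:class)]), Vector(df[:, :class]))  # matrix and vector
    end
end

# d[] returns the underlying data; d[i] would need per-representation methods
Base.getindex(d::Iris) = (d.features, d.targets)
```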
Sure, if that's the goal in #98, then I prefer we go with what we already have in that PR: the datasets load directly to arrays. We can consider the discussion here about allowing dataframes/tables/etc. as a feature for later. The rest here is just a continuation of the discussion, not concerns that block #98.

I think we have two issues here. First is that a CSV file can be interpreted in two ways: as a table dataset or as an array dataset. If we consider CSV.jl, it doesn't allow an … Second is being able to reuse code for in-memory and lazy data loading. I agree that the …

Maybe the least confusing thing is to avoid guessing or death by keywords and to use a more declarative syntax.
It still seems complicated though. With JuliaML/MLUtils.jl#61, this might only be required for tables that need to be lazy (i.e. too large for memory).
The redesign has been largely done.
I'd like to open a discussion on how we should move forward with implementing a getobs and nobs compliant API, while possibly also simplifying the interface and the maintenance burden.
See FluxML/FastAI.jl#22
I think we should move away from the module-based approach and adopt a type-based one. It could also be convenient to have a lean type hierarchy.
Below is an initial proposal.
AbstractDatasets
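For instance, a lean hierarchy could be as small as the following (the abstract type names are assumptions for discussion, not a settled design):

```julia
# Sketch of a possible type hierarchy for the package
abstract type AbstractDataset end
abstract type SupervisedDataset <: AbstractDataset end     # datasets with features and targets
abstract type UnsupervisedDataset <: AbstractDataset end   # datasets with features only
```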
MNIST Dataset
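A type-based MNIST could then look roughly like this (sketch only; `SupervisedDataset` is from the hierarchy sketched above, and the element types, field layout, and constructor are assumptions):

```julia
# Sketch: MNIST as a struct holding a loaded split
struct MNIST <: SupervisedDataset
    split::Symbol                # :train or :test
    features::Array{Float32,3}   # 28 x 28 x N images
    targets::Vector{Int}         # N labels in 0:9
end

# hypothetical convenience constructor that would download/load the requested split
# MNIST(split::Symbol = :train) = ...

# the getobs/nobs-compliant interface this issue asks for
nobs(d::MNIST) = length(d.targets)
getobs(d::MNIST, i) = (d.features[:, :, i], d.targets[i])
```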
Usage
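Usage could then look something like this (a sketch; the `MNIST(:train)` call relies on the hypothetical convenience constructor mentioned above):

```julia
train = MNIST(:train)           # hypothetical convenience constructor
x, y = getobs(train, 1)         # first image and its label
xs, ys = getobs(train, 1:128)   # a batch of observations
n = nobs(train)                 # number of observations
```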
Transforms
Do we need transformations as part of the datasets?
This is a possible interface that assumes the transform operates on whatever is returned by getobs.
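As a hedged sketch of one way such an interface could look, written here as a generic wrapper rather than as a field on each dataset type (the names are illustrative):

```julia
# Sketch: apply an optional transform to whatever getobs returns
struct TransformedDataset{D,F}
    data::D
    transform::F
end

nobs(d::TransformedDataset) = nobs(d.data)
getobs(d::TransformedDataset, i) = d.transform(getobs(d.data, i))

# usage sketch: rescale MNIST images on access
# normalized = TransformedDataset(MNIST(:train), ((x, y),) -> (x ./ 255, y))
```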
Deprecation Path 1
We can create a deprecation path for the code … by implementing …
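One hypothetical shape for such a path (illustrative only; the old call and the replacement shown here are stand-ins, not necessarily the ones meant above) is to keep the old function-style entry point as a thin deprecated shim over the new type:

```julia
# Sketch: a deprecated function-style entry point forwarding to the new struct.
# `traindata` stands in for whichever old call is being deprecated.
function traindata(args...; kwargs...)
    Base.depwarn("`traindata` is deprecated, construct an `MNIST` dataset instead.", :traindata)
    d = MNIST(:train)
    return d.features, d.targets
end
```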
Deprecation Path 2
The pattern … instead is more problematic, because it assumes a module MNIST exists, but this (deprecated) module would collide with the struct MNIST. A workaround is to call the new struct MNISTDataset, although I'm not super happy with this long name.

cc @johnnychen94 @darsnack @lorenzoh