-
Notifications
You must be signed in to change notification settings - Fork 17
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
proposal to refactor with traits and trees #162
Comments
Hi @ExpandingMan it seems to me that what you're working toward here and in #159 is similar to part of my work in JuliaComputing/DataSets.jl#45: I want a tree-like abstraction for data which can be used from Julia code, but which is independent of the underlying storage. Over there I've got file and tree types which are a bare git-like tree-of-blobs abstraction data model, intended to abstract the storage layer. (ie, storage could be on the local disk, on S3, in a git repo, in a tar file, etc etc) Unlike FileTrees.jl, DataSets.jl doesn't have an opinion on processing methodology: it just wants to be good at storing and retrieving data. I'm using AbstractTrees.jl, but also working on functions for mutation which are more applicable to data storage systems. |
Thanks, I really ought to take a look at it for DataSets.jl. It may well be the case that my biggest motivating use case (Parquet2.jl) is better off with DataSets.jl than FilePathsBase.jl. However, I still think it's important to address the problem of packages like AWSS3.jl. What is currently happening is that FilePathsBase makes a lot of assumptions that are only likely to be true on local file systems, and in so doing defers a lot of the tree logic to the file system itself. This is great for true file systems, but doesn't work very well for weird things like S3. Perhaps it's a mistake to conflate them: certainly I'd never even attempt it were it not for the fact that so much existing software already attempts to treat S3 like a file system. One approach that we might consider is that maybe FilePathsBase is perfectly great as is, but you really only should use it for true file systems where latency is not a major issue and other systems like S3 need a completely different package, maybe DataSets.jl. Btw, I just authored a huge overhaul of AbstractTrees.jl. We are going to be checking how bad the breaking changes are in the ecosystem before tagging (it will be 0.4). We'll be in touch if DataSets.jl requires more than very minor changes. |
I think this may be true. In working on DataSets.jl I've come to the conclusion that absolute paths are massively overrated as a general abstraction. Relative paths, on the other hand, are great and I provide a system-independent relative path literal in DataSets.jl. I wish I could rely on FilePathsBase for this, but it appears that the I think one of the main problems with path types is that they don't have an API which goes with them, unless you assume they're just filesystem paths. You can't |
By the way I love this package's design and I'm hoping to build a TabularDataSets.jl which plugs into DataSets.jl and relies on Parquet2.jl for serialization. |
I started #159 a while ago and since could mostly not decide what it should look like.
I've recently had some much better thoughts about all this stuff, and here's my proposal.
AbstractPath
is a tree, alwaysThe central problem we've had with S3 is that S3 does not understand that it is supposed to be a tree. In retrospect, this seems like a poor design choice on the part of S3. That's not to say there's something wrong with pure key-value stores, there isn't, but people are trying to copy S3 to and from trees all the time. This means that everything that interacts with S3 has to re-implement a tree on top of it.
Obviously S3 is not going to change, but I think this discussion supports the idea that there is no getting away from treating
S3Path
like a tree.Therefore,
FilePathsBase
should NOT try to represent objects that are not trees, meaning that function such asisdir
which distinguish leaves from non-leaf nodes on trees are mandatory and must return logically consistent results regardless of theAbstractPath
type (though we should be able to mitigate their cost in the case of key-value stores).The difference between
DirectoriesExplicit
andDirectoriesImplicit
In my proposed trait system paths either have
directoriestype(::PathType)
ofDirectoriesExplicit()
orDirectoriesImplicit()
. From the above discussion we see thatDirectoriesExplicit
:children
(getting immediate child nodes) is cheap.DirectoriesImplicit
:leaves
(getting descendant leaf nodes) is cheap.Therefore
FilePathsBase
should not attempt to callchildren
(readdir
) directly onDirectoriesImplicit
paths.Implementation
The
DirectoriesExplicit
code can mostly stay as is, however, it needs to be re-optimized. Tree functions such asparent
need to be made more efficient (I have done this already, but I may not have made a commit as of writing).For
DirectoriesImplicit
, the missing piece is a function which constructs an explicit tree fromleaves
by parsing the path names of theleaves
. Functions such aswalkpath
would then have to traverse the explicitly constructed tree.I strongly suggest this package adopt AbstractTrees.jl so that it can take advantage of the generalizations and optimizations therein. This should eliminate the vast majority of non-parsing logic from FilePathsBase itself (i.e. the most fiddly functions in
src/paths.jl
).I expect that doing this properly should also solve problems such as #145 .
Next steps
I'm willing (and eager, considering the current rather broken state of the ecosystem and the frustrating importance of AWSS3.jl) to carry this through to completion, I'm posting here because I'd like some feedback as I am not a maintainer. I would like to see this PR by Keno to
AbstractTrees
merged, tagged and fully documented first, so my next step is to help out with that.Afterwards, I can turn my attention to FilePathsBase. I'm hopeful that everyone will be pretty enthusiastic about the changes I wind up making because at this point I'm pretty confident they will lead to drastic simplifications in both FilePathsBase.jl and AWSS3.jl.
cc @ericphanson
The text was updated successfully, but these errors were encountered: