Reading from raw bytes? #145
Comments
Having the processing functions work on raw bytes should be possible. But for files being loaded from blob stores like S3, will a plain byte-level interface be efficient enough?
@tanmaykm I don't have first-hand knowledge on this, but it seems like you could easily incur overhead from reading over the network for each fetch of bytes. So some form of buffering or caching would probably be needed.
Yes, the reads need to be buffered by the abstraction of course. And most of the data access in this package is actually for reasonably large chunks of data, with byte-level access done from internal buffers, which I thought would suit this approach.
I see. Looks like AWSS3.jl supports reading byte ranges from files in S3. But if this was behind an implementation of something like an AbstractFile abstraction, would that work?
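For example, a ranged read with AWSS3.jl could look roughly like this (bucket and key are placeholders, and this assumes the installed AWSS3.jl supports the `byte_range` and `raw` keywords of `s3_get`):

```julia
# Sketch of a ranged S3 read with AWSS3.jl; bucket/key names are made up.
using AWS, AWSS3

aws = global_aws_config(; region="us-east-1")

# Fetch only bytes 1:100 of the object instead of downloading the whole file.
# `raw=true` returns the body as raw bytes (Vector{UInt8}).
chunk = s3_get(aws, "my-bucket", "data/part-0.parquet";
               byte_range=1:100, raw=true)
```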
The filepath is not used apart from the initial opening of the file, and for filtering partitioned datasets. Those may work too with minor changes if we use URLs instead. I have not come across AbstractFile. Maybe we should have one; we would probably only need methods for filesize, seek and reading a range of bytes implemented for S3 access.
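Roughly, that abstraction could look like the sketch below; all names here are hypothetical, nothing like this exists in the package today:

```julia
# Hypothetical sketch of an AbstractFile abstraction; not an existing API.
abstract type AbstractFile end

# A local-file implementation of the operations a reader would need:
# total size, and reading a range of bytes (seeking is implicit).
struct LocalFile <: AbstractFile
    path::String
end

Base.filesize(f::LocalFile) = filesize(f.path)

# Read bytes r (1-based, inclusive) from the file.
readrange(f::LocalFile, r::UnitRange{Int}) = open(f.path) do io
    seek(io, first(r) - 1)
    read(io, length(r))
end

# An S3-backed implementation could provide the same methods using ranged
# GET requests, buffering/caching chunks internally to limit round trips.
```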
Got it
I feel like the Julia I/O ecosystem is really great thanks to hard work by you and others, but there really needs to be a better unifying abstraction for reading datasets from files. I'm working on something like Dask for Julia and strongly feel the need for something similar to fsspec in Julia. FilePathsBase.jl and FileIO.jl are great but not sufficient for multi-file datasets.
Yes, I think S3FS via FUSE may work well in this case.
@tanmaykm Okay, my only concern is: do you know if S3FS will download files to disk if it isn't using the cache? I would hope that it would just download ranges of bytes to memory...
It does seem that way from the s3fs documentation, and I was not able to see files being written when I tried it. But it claims that using the cache may make it faster, and it has an option to limit the cache size.
I am having the same issue: reading a Parquet file on S3 and hoping to benefit from reading only a specific column. I would think this is a very common use case.
This gets me `ERROR: ArgumentError: no arrow ipc messages found in provided input`. I figured I could read it straight into Arrow and then create a DataFrame (if needed). Perhaps @quinnj can correct me here?
That definitely won't work - Arrow.jl reads the arrow IPC format, not parquet.
Right! But I do not see a way to construct a Parquet.File with an S3Path. If I want to read Parquet files on AWS S3, it seems I will need to use Python for now. Now the challenge is how to avoid copying all this data multiple times while getting it into a Julia DataFrame.
Looking at the Python equivalent, which enables reading Parquet files on S3 directly.
I also need to read from raw bytes for a library I'm writing. What would it take to implement this?
I'm downloading a Parquet file over the network using AWSS3.jl. Can I parse this into a DataFrame using Parquet.jl?
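A possible workaround, sketched below with made-up bucket/key names (and assuming AWSS3.jl's `s3_get_file` and Parquet.jl's `read_parquet` are available in the installed versions), is to stage the object in a local temporary file and read that:

```julia
# Workaround sketch until raw-byte/IO input is supported: download the
# object to a local temp file, then read it with Parquet.jl.
using AWS, AWSS3, Parquet, DataFrames

aws = global_aws_config()
local_path = joinpath(mktempdir(), "data.parquet")
s3_get_file(aws, "my-bucket", "path/to/data.parquet", local_path)

df = DataFrame(read_parquet(local_path))
```

Note this still downloads the whole object, so it doesn't give the column-level reads discussed above.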