ARROW-6041: [Website] Blog post announcing R library availability on CRAN #4948
Conversation
It would be cool to use a real/interesting example Parquet file here--anyone know of any? I found a few online but they're all multi-file partitioned things, which we don't have good support for in R yet.
I always like to use the New York Taxi trip dataset for Parquet file usage as a month of data has a decent size but loads very quickly, sadly there is no official source for a Parquet file for it.
@wesm you may especially want to review this section for historical accuracy and current policy stance.
Will review. We'll have to be careful about what we call a "release" on this blog, since that has a very specific meaning in Apache-land. When in doubt, say "Available on CRAN" rather than "Released on CRAN".
Link to CRAN (for people who don't know what that is)?
The "list of PPAs" is a bit too specific. Say "See ... to find pre-compiled binary packages for some common Linux distributions such as Debian, Ubuntu, CentOS, and Fedora. Other Linux distributions must install the libraries from source."
Maybe say "Apache Parquet support" here
I think you need to qualify that this is "preliminary" read and write support that is in its early stages of development. Otherwise you're setting the wrong expectations. It would be accurate (and helpful) to state that the Python Arrow library has much richer support for Parquet files, including multi-file datasets, and we hope to achieve feature equivalency in the next 12 months.
It's accurate to say "includes a much faster implementation of the Feather file format".
When you say "initial products coming out of the Arrow project" -- it didn't actually. Perhaps say "was one of the initial applications of Apache Arrow for Python and R".
Maybe you want to say that we will look at adapting the "feather" package to be based on "arrow" (though this could upset some users).
I think if you say "Parquet supports various compression formats" it might bring up some canards with the R community. It's simpler to say that "Parquet is optimized to create small files and as a result can be more expensive to read locally, but it performs very well with remote storage like HDFS or Amazon S3. Feather is designed for fast local reads, particularly with solid state drives, and is not intended for use with remote storage systems. Feather files can be memory-mapped and read in Arrow format without any deserialization while Parquet files always must be decompressed and decoded."
This is the first time you reference "Spark" in the article -- you need to use "Apache Spark"
To avoid peanuts being hurled from the gallery, you may want to state here that the functions like read_csv_arrow are being developed to optimize for the memory layout of the Arrow columnar format, and are not intended as a replacement for "native" functions that return R data.frame, for example.
nit: change filename to remove "release"