PARQUET-2489: Guidance on feature releases #258
Conversation
Co-authored-by: Rok Mihevc <rok@mihevc.org>
etseidl
left a comment
Thanks for taking the lead on this @emkornfield. It will be very helpful to have this codified.
Co-authored-by: Ed Seidl <etseidl@users.noreply.github.com>
@etseidl thank you for the proofreading
pitrou
left a comment
I believe this is overly prescriptive with respect to the timing of introducing and enabling new features. We should let every implementation choose its own support policy, based on its user demographics and targeted use cases.
cc @alamb
CONTRIBUTING.md (Outdated)
> In some cases in addition to library level implementations it is
> expected the changes will be justified via integration into a
> processing engine to show their viability.
What is a "processing engine" in this context? Perhaps this is out of scope?
IIUC, the intent here is to demonstrate some real world benefit to justify the cost. For instance, for the recent size statistics addition we probably should have required some benchmark numbers showing a clear benefit to justify the cost of creating and storing them. My GPU workflow was made much faster due to the inclusion of the unencoded byte array size information and could have provided hard numbers demonstrating that. Similarly, it would have been nice to have a pushdown example that was made possible because of the histograms. Simply adding the feature and showing that two implementations produced them in a compatible way was probably insufficient.
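For reference, the unencoded byte-array size statistic mentioned here is conceptually just the sum of the raw value lengths before encoding and compression. A back-of-the-envelope sketch of the idea (not the actual parquet SizeStatistics API):

```python
def unencoded_byte_array_size(values) -> int:
    """Sum of raw BYTE_ARRAY value lengths, excluding nulls -- the kind of
    number a reader can use to pre-allocate output buffers (e.g. on a GPU)
    without decoding the page first."""
    return sum(len(v) for v in values if v is not None)

# 7 + 4 + 5 bytes of string data; the null contributes nothing.
total = unencoded_byte_array_size([b"parquet", b"size", None, b"stats"])
```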
I tried to rephrase and in a new pull request moved this to step 1.
CONTRIBUTING.md (Outdated)
> library.
>
> 2. Forwards compatible. A file written by a new version of the library can be read by an older version
> of the library.
Strictly speaking, this is true of almost no format enhancements, since the additional semantics couldn't be understood under an older version of the format.
Perhaps this should be relaxed a bit? For example:
> Forwards compatible. A file written under a newer version of the format can be read under an older version of the format, but some information might be missing or performance might be suboptimal.
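This mode of forward compatibility falls out of how Parquet footers are serialized: a Thrift reader skips optional fields it does not know about, losing only the new information. A minimal sketch of the effect, using a plain dict and hypothetical field names rather than the real Thrift-generated classes:

```python
# Fields a hypothetical "old reader" understands; names are illustrative,
# not the actual parquet-format Thrift schema.
KNOWN_FIELDS = {"num_rows", "row_groups", "created_by"}

def read_footer_old_reader(footer: dict) -> dict:
    # Keep only the fields this (older) reader understands; newer additions
    # are silently dropped, which is why "some information might be missing"
    # for forward-compatible features.
    return {k: v for k, v in footer.items() if k in KNOWN_FIELDS}

newer_footer = {
    "num_rows": 100,
    "row_groups": [],
    "created_by": "writer 2.x",
    "size_statistics": {"unencoded_byte_array_data_bytes": 4096},  # new field
}
old_view = read_footer_old_reader(newer_footer)
```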
Changed to use your wording (will be present in the next commit).
Would it address most of your concerns if I replaced the "SHOULD" language? I'll revise to do so and we can hopefully iterate from there.
alamb
left a comment
Thank you very much for writing this up @emkornfield -- it is very much needed and helpful
> ### Releases
>
> The Parquet community aims to do releases of the format package only as needed
> when new features are introduced. If multiple new features are being proposed
> simultaneously some features might be consolidated into the same release.
> Guidance is provided below on when implementations should enable features added
> to the specification. Due to confusion in the past over parquet versioning it
> is not expected that there will be a 3.0 release of the specification in the
> foreseeable future.
Given there seems to be confusion about what even constitutes a parquet "release" maybe we simply should say that the parquet-format has no overall release version identifier and instead is a collection of individual features?
Or maybe we should give it a version date ("parquet-format as of 2024-06-07" etc) 🤔
I think at least for parquet-java we need a version identifier compatible with Maven, because I think that is how it is consumed. We can maybe discuss separately just inlining the generated code there as well, as I think that is typically what other distributions do.
CC @wgtmac thoughts on this?
Thinking more about this, I think releases are also potentially helpful for any third-party java implementations beyond parquet java. I don't have strong feelings here.
Although the version identifier of parquet-format is compatible with Maven, its versioning semantics are much weaker than those of usual Java libraries. IMO, a new version of parquet-format usually includes a set of new feature additions and format clarifications/corrections, meaning that a new version should be a proper superset of previous versions. Therefore we should always stick to the latest version of parquet-format wherever possible. An exception is the deprecation of features in the format, where a major version bump is required.
CONTRIBUTING.md (Outdated)
> the feature enabled cannot be read under an older version of the format
> (e.g. adding a new compression algorithm).
>
> The Parquet community hopes that new features are widely beneficial to users of
I think trying to create a process to roll out a feature across all implementations is an admirable goal; however, I am not sure there is sufficient consensus to include it in this document. If we include this (aspirational) process, I fear there is a non-trivial chance it would not be followed in practice, and thus its inclusion would be confusing.
Thus I suggest we consider not including the rollout in docs
( I think this is similar to what @pitrou is saying here #258 (review))
I tried to remove almost all prescriptions that actually require effort from the suggestions. I feel that if we don't provide some guidance the ecosystem will never really move forward. If possible it would be nice to focus on concrete concerns. Are there any parts of what is documented here that you see as problematic as a maintainer of arrow-rs? Also @pitrou, is there anything you see as specifically problematic from a parquet-cpp perspective?
Thank you for the update. However, it is my opinion that the updated text still adds more potential for confusion than value. Specifically, trying to prescribe the rollout cadence across loosely (at best) coordinated implementations in a document is unlikely to change behaviors and is likely to create confusion.
> Are there any parts of what is documented here that you see as problematic as a maintainer of arrow-rs?

One specific example is the mention of feature flags. While I am sure we could overcome the technical details, as a maintainer I would not want to be in the position of trying to dictate when a feature should be used by our users -- instead I would imagine that as new features were supported in parquet-rs there would be an appropriate writer configuration that users would decide to enable/disable as they deemed appropriate.
Another potential improvement (orthogonal to this discussion) would be to label new format additions with the date on which they were added to the spec, and perhaps track the date when support was added to other implementations.
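The writer-configuration pattern described here can be sketched as explicit opt-in properties, with newly specified format features off by default and the choice left to users. The names below are illustrative only, not the actual parquet-rs or parquet-java API:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class WriterProperties:
    # Hypothetical flags: each newly specified format feature gets its own
    # opt-in switch, defaulting to the maximally compatible behavior.
    enable_data_page_v2: bool = False
    enable_byte_stream_split: bool = False
    write_size_statistics: bool = False

# Users opt in per feature as they deem appropriate for their readers.
conservative = WriterProperties()
adventurous = WriterProperties(enable_byte_stream_split=True)
```

The design point is that the library ships support for a feature without deciding on behalf of users when the feature is safe to emit.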
> Specifically trying to prescribe the rollout cadence across loosely (at best) coordinated implementations in a document is unlikely to change behaviors and is likely to create confusion

Is there specific language that you find problematic? With the latest revision, my intent was to make the recommendation only about preserving compatibility, i.e. forward incompatible changes must wait at least 2 years before an implementation decides to turn them on by default. I wonder if changing the phrasing from MAY to MUST NOT USE BEFORE would help clarify?
> instead I would imagine that as new features were supported in parquet-rs there would be an appropriate writer configuration that users would decide to enable/disable as they deemed appropriate.

I agree. Maybe this is a terminology issue. I view a feature flag as a type of configuration, and I think I removed all recommendations for when, if ever, an implementation should turn something on by default (please let me know if something is still ambiguous or maybe I missed something in rephrasing).
> Another potential improvement (orthogonal to this discussion) would be to label new format additions with the date on which they were added to the spec, and perhaps track the date when support was added to other implementations

Agreed, I'll add this to the process. I thought the feature matrix that was already under review would be a good place for this? I would like to avoid putting too many dates directly in the spec to avoid confusion.
I took another read -- I think it was easy to miss the sentence that says the rollout dates are guidelines.
> Therefore, the Parquet PMC gives the following guidance for changing a feature to be "on" by default:

So I guess I got confused: the text says it is providing guidance and then uses MAY, which suggests to me the process as a whole is required but the individual steps have options. I think we should just make it clear these are recommended best practices.
What about toning it down a bit? Instead of
> Forward compatible features/changes MAY be used by default in implementations
> once the parquet-format containing those changes has been formally released.
> For features that may pose a significant performance regression to prior
> format readers, libraries SHOULD consider delaying until 1 year after the
> release of the parquet-java implementation that contains the feature
> implementation. Implementations MAY choose to do a major version bump when
> turning on a feature in this category.
something like
> Forward compatible features/changes may be used by default in implementations
> once the parquet-format containing those changes has been formally released.
> For features that may pose a significant performance regression to prior
> format readers, libraries should wait 1 year after the
> release of the first parquet-java implementation that contains the feature
> implementation.
I also think the major version bump suggestion is so library specific it is not worth putting in the guidance. The libraries have their own cadence (e.g. parquet-rs and parquet-cpp have major releases based on calendar cycles rather than feature content)
I'll try to rephrase to make the issues above clearer.
> I also think the major version bump suggestion is so library specific it is not worth putting in the guidance. The libraries have their own cadence (e.g. parquet-rs and parquet-cpp have major releases based on calendar cycles rather than feature content)

My intent here was not to specify a release cadence, but to specify that if a forward incompatible change is turned on by default, the next release must be a major version release; I will try to clarify. I think parquet-rs and parquet-cpp trivially satisfy this given their current release policies.
@alamb thanks for the feedback. I've updated to clarify and removed the upper-case SHOULD/MAY language; please let me know if the current iteration still raises concerns.
I think it could be rephrased as an example of how to deal with feature additions, rather than a recommendation.
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
My goal here is to provide something at least a little bit stronger than just an example; I think we do want to try to align implementations for compatibility purposes and move the ecosystem forward. Maybe we can continue the conversation on @alamb's comment.
- Soften language on recommendations and add clarifications - Fix some grammar issues.
CONTRIBUTING.md (Outdated)
> discussion concrete. This step is complete when there is lazy consensus. Part
> of the consensus is whether it is sufficient to provide 2 working
> implementations as outlined in step 2 or if demonstration of the feature with
> a down-stream query engine is necessary to justify the feature (e.g.
This is overly specific IMHO. Not all features are developed in relation with a "down-stream query engine". For example, the BYTE_STREAM_SPLIT additions were motivated by size measurements with different encodings and compression algorithms.
Right, I think a lot of features will not need this, per @etseidl's point on a previous revision; this helps avoid extraneous features that sound good in theory but don't provide much value in practice. Other areas where this is more likely to apply could be more complicated indexing schemes.
etseidl
left a comment
A few more suggestions.
As to the most controversial section, I think the current wording strikes a good balance. Implementers should consider the ramifications of defaulting to breaking changes, and I like the idea of providing a deadline for downstream apps to adapt. Otherwise Parquet will always be trapped in amber, unable to steer users towards beneficial new features because some downstream project refuses to update for a decade. And these are now recommendations (unlike earlier revisions), so an implementation could still decide that a new feature is so beneficial for its user base that it's worth risking compatibility problems (so long as it clearly documents how to revert to non-breaking behavior). Most implementations already seem to be pretty conservative with their choice of defaults, so I don't think this guidance will be too burdensome.
Overall I'm +1 for the release process, and +0.5 for the last section.
Co-authored-by: Ed Seidl <etseidl@users.noreply.github.com>
Co-authored-by: Ed Seidl <etseidl@users.noreply.github.com>
Co-authored-by: Ed Seidl <etseidl@users.noreply.github.com>
> feature implementation.
>
> 3. Forwards incompatible features/changes should not be turned on by default
> until 2 years after the parquet-java implementation containing the feature is
IMO, the 2-year guidance is better than other options. We don't want individual votes or discussions for each new feature. In addition, developers (of downstream projects or other Parquet implementations not governed by the Parquet PMC) usually do not actively keep an eye on Parquet activities. The general 2-year guidance may give people enough time to react to changes on the Parquet side.
> by default until the most conservative timelines outlined above have been
> exceeded.
>
> For features released prior to October 2024, target dates for each of these
I don't quite understand this. Do we expect a parquet-java 2.0 release around October 2024? Shouldn't any new feature follow this guidance once published?
My intent was to aim for parquet-java 2.0 in October. There are a lot of old features that, from the spec perspective, would be ok to use based on this guidance (e.g. lz4-raw, byte stream split, data page v2, v2 encodings), and we need to decide on recommended dates for them (as a sample suggestion, I would aim for 1 year for features we think we actually want to turn on).
Is it good timing to do a parquet-java 2.0 release? As per the discussion from https://lists.apache.org/thread/kttwbl5l7opz6nwb5bck2gghc2y3td0o, we intend to break API compatibility by removing deprecated APIs (and removing Java 8 support) in the 2.0 release. If the intention is to set a target date for existing features without breaking anything, perhaps we should release 1.15.0 instead?
IMO, API compatibility is a separate concern, and it is fine to set future dates and change API compatibility in the same release. I can rephrase this section to mention some other criteria for initial target dates if there is an alternative proposal.
I would be up for a Parquet 2.0 release in October and am eager to help! Do we also want a 1.15?
I'm trying to downstream 1.14.1 into various projects (apache/spark#46447, apache/iceberg#10209), but the new Jackson version is giving us some trouble. It feels to me that the different projects need some time to update. My argument is that releasing 1.15 right now would not give many new features to users, and I think it makes sense to jump to 2.0.0.
Since there are some delays mentioned earlier in this document:
- The API that I suggested removing was marked as deprecated almost six years ago: PARQUET-1452: Deprecate old logical types API parquet-java#535
- We're still compiling against Hadoop 2.7.3 which was released in August 2016.
- I think the Java 8 deprecation might be a bit harder and might require more discussion, but there it feels like all the projects are looking at each other :) Spark/Avro are Java 17+ right now.
For my understanding: the things you mentioned (e.g. lz4-raw, byte stream split, data page v2, v2 encodings) already shipped with the last release, so the discussion is about when to make them the default?
Am I correct in understanding that we intend to:
- increment the major release of parquet-format when it has the definition of the new (forward incompatible) feature in it.
- implement the above new feature in parquet-java under a minor release in the previous major but turning it off by default
- increment the major release of parquet-java when we turn on the new encodings by default?
I think this is not unsound but it might be confusing. We should then expect to have several major releases coming: one for the new footer and one for each new encoding.
@julienledem no, I think there is a misunderstanding.
> increment the major release of parquet-format when it has the definition of the new (forward incompatible) feature in it.

Per above, I don't think we expect a major version change for parquet-format in the near future.
> implement the above new feature in parquet-java under a minor release in the previous major but turning it off by default

Yes, this is correct. This provides read support.
> increment the major release of parquet-java when we turn on the new encodings by default?

Yes.
> I think this is not unsound but it might be confusing. We should then expect to have several major releases coming: one for the new footer and one for each new encoding.

I think these should be grouped. Per apache/parquet-site#61, the proposal for parquet-java is to have at most one major release per year. Early adopters can use features once a minor release containing them is out, as long as they feel comfortable with understanding the scope of impact.
Sorry, I'm not following which part of the conversation "per above" refers to here.
Could you point to it and explain what you mean?
I was assuming for example that once we define a new footer in a new format we would increment the format spec. Would that be wrong?
In the releases section:
> The Parquet PMC aims to do releases of the format package only as needed when
> new features are introduced. If multiple new features are being proposed
> simultaneously some features might be consolidated into the same release.
> Guidance is provided below on when implementations should enable features added
> to the specification. Due to confusion in the past over Parquet versioning it
> is not expected that there will be a 3.x release of the specification in the
> foreseeable future.
OK, that sounds good to me.
Co-authored-by: Gang Wu <ustcwg@gmail.com>
wgtmac
left a comment
I'm fine with most of the sections. +1 from my side.
pitrou
left a comment
I'm generally ok with this. Thanks a lot @emkornfield for pushing this forward!
Co-authored-by: Antoine Pitrou <pitrou@free.fr>
julienledem
left a comment
Thank you very much @emkornfield, I think this is a great addition.
I do believe that we have a few elements to finish building consensus on before committing.
I've left comments inline.
CONTRIBUTING.md (Outdated)
> step. The implementations must be made available publicly, and they should be
> fit for inclusion (for example, they were submitted as a pull request against
> the target repository and committers gave positive reviews).
I agree with this section. Open source implementations are the first requirement. I think we should add that validation by proprietary engines is not required but is welcome and useful. Third-party vendors can publish reports of their analysis of the gains they observe in their proprietary, non-open code bases.
As Parquet is an open source standard that helps open source and non-open-source solutions work together in the wider industry, I think it is important to cover both (but open source is still the requirement, as you describe above).
CONTRIBUTING.md (Outdated)
> of the consensus is whether it is sufficient to provide two working
> implementations as outlined in step 2, or if demonstration of the feature with
> a downstream query engine is necessary to justify the feature (e.g.
> demonstrate performance improvements in the Apache Arrow C++ Dataset library,
> the Apache DataFusion query engine, or any other open source engine).
There is always a trade-off between adding complexity to the format and the improvement in terms of performance and/or compression.
In that phase we should have a discussion about this. Often, a prototype helps confirm some of the assumptions.
I tried to address this with additional language, please let me know if I covered your intent.
CONTRIBUTING.md (Outdated)
> 2. Forward compatible. A file written under a newer version of the format with
> the feature enabled can be read under an older version of the format, but
> some information might be missing or performance might be suboptimal.
Semantically, we mean that we can read the data back without loss.
If the data should be readable back "without loss", does it imply that new logical types are not considered forward compatible?
I think they should be and made this explicit.
Agreed. A new logical type is not fully forward compatible even if there is a degraded mode (you can ignore it and read the "raw" type). Then we would recommend new logical types not be turned on by default.
> feature implementation.
>
> 3. Forwards incompatible features/changes should not be turned on by default
> until 2 years after the parquet-java implementation containing the feature is
What @emkornfield is saying above makes sense to me.
We should always have new forward incompatible features off by default at first. Otherwise, this will cause breakages every time we add a new encoding (or similarly incompatible feature). How long is debatable. 2 years seems long to me.
I don't think we should give a duration guidance to other implementers/users. We should document why we are doing this and explain the constraints. How long is very dependent on their circumstances. In an environment where there is a single system reading and writing Parquet files, they can enable it right away without problems. In a "data lakehouse" environment where people have multiple systems reading and writing they want to make sure they have upgraded other consumers before they turn it on. The more time passes, the more it becomes acceptable to have such new features on by default and ask the user to turn them off because they have some legacy system that can only read old files.
We could have a setting to adjust this in environments where there is a single system reading/writing parquet: setForwardIncompatibleFeaturesOnByDefault(boolean)
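The proposed escape hatch could be a single master switch layered over per-feature flags; a sketch of the idea only, not an existing parquet-java setting (names here are hypothetical):

```python
class WriterConfig:
    # Hypothetical set of features that old readers cannot handle.
    FORWARD_INCOMPATIBLE = ("data_page_v2", "new_encoding")

    def __init__(self):
        self.enabled = set()

    def set_forward_incompatible_features_on_by_default(self, on: bool):
        # In a closed environment (one system both reads and writes the
        # files), it is safe to flip all forward-incompatible features at
        # once; in a lakehouse with many readers, leave this off.
        if on:
            self.enabled.update(self.FORWARD_INCOMPATIBLE)
        else:
            self.enabled.difference_update(self.FORWARD_INCOMPATIBLE)

cfg = WriterConfig()
cfg.set_forward_incompatible_features_on_by_default(True)
```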
CONTRIBUTING.md (Outdated)
> 1. To the greatest extent possible changes should have an option for forward
> compatibility (old readers can still read files). The 'compatibility and
> feature enablement' section below provides more details on expectations for
I think you can do [compatibility and feature enablement](#compatibility-and-feature-enablement)