PARQUET-2489: Guidance on feature releases #258
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -17,7 +17,7 @@ | |
| - under the License. | ||
| --> | ||
|
|
||
| Recommendations and requirements for how to best contribute to Parquet. We strive to obey these as best as possible. As always, thanks for contributing--we hope these guidelines make it easier and shed some light on our approach and processes. | ||
| Recommendations and requirements for how to best contribute to Parquet. We strive to obey these as best as possible. As always, thanks for contributing--we hope these guidelines make it easier and shed some light on our approach and processes. If you believe there should be a change or exception to these rules, please bring it up for discussion on the developer mailing list (dev@parquet.apache.org). | ||
|
|
||
| ### Key branches | ||
| - `master` has the latest stable changes | ||
|
|
@@ -29,3 +29,174 @@ Recommendations and requirements for how to best contribute to Parquet. We striv | |
| ### License | ||
| By contributing your code, you agree to license your contribution under the terms of the APLv2: | ||
| https://github.com/apache/parquet-format/blob/master/LICENSE | ||
|
|
||
| ### Additions/Changes to the Format | ||
|
|
||
| Note: This section applies to actual functional changes to the specification. | ||
| Fixing typos, grammar, and clarifying concepts that would not change the | ||
| semantics of the specification can be done as long as a committer feels comfortable | ||
| merging them. When in doubt, starting a discussion on the dev mailing list is | ||
| encouraged. | ||
|
|
||
| The general steps for adding features to the format are as follows: | ||
|
|
||
| 1. Design/scoping: The goal of this phase is to identify design goals of a | ||
| feature and provide some demonstration that the feature meets those goals. | ||
| This phase starts with a discussion of changes on the developer mailing list | ||
| (dev@parquet.apache.org). Depending on the scope and goals of the feature, | ||
| it can be useful to provide additional artifacts as part of a discussion. The | ||
| artifacts can include a design document, a draft pull request to make the | ||
| discussion concrete and/or a prototype implementation to demonstrate the | ||
| viability of implementation. This step is complete when there is lazy | ||
| consensus. Part of the consensus is whether it is sufficient to provide two | ||
| working implementations as outlined in step 2, or if demonstration of the | ||
| feature with a downstream query engine is necessary to justify the feature | ||
| (e.g. demonstrate performance improvements in the Apache Arrow C++ Dataset | ||
| library, the Apache DataFusion query engine, or any other open source | ||
| engine). | ||
|
|
||
| 2. Completeness: The goal of this phase is to ensure the feature is viable and | ||
| that there is no ambiguity in its specification, by demonstrating compatibility | ||
| between implementations. Once a change has lazy consensus, two | ||
| implementations of the feature demonstrating interoperability must also be | ||
| provided. One implementation MUST be | ||
| [`parquet-java`](http://github.com/apache/parquet-java). It is preferred | ||
| that the second implementation be | ||
| [`parquet-cpp`](https://github.com/apache/arrow) or | ||
| [`parquet-rs`](https://github.com/apache/arrow-rs), however at the discretion | ||
| of the PMC any open source Parquet implementation may be acceptable. | ||
| Implementations whose contributors actively participate in the community | ||
| (e.g. keep their feature matrix up-to-date on the Parquet website) are more | ||
| likely to be considered. If discussed as a requirement in step 1 above, | ||
| demonstration of integration with a query engine is also required for this | ||
| step. The implementations must be made available publicly, and they should be | ||
| fit for inclusion (for example, they were submitted as a pull request against | ||
| the target repository and committers gave positive reviews). Reports on the | ||
| benefits from closed source implementations are welcome and can help lend | ||
| weight to a feature's desirability but are not sufficient for acceptance of a | ||
| new feature. | ||
|
|
||
| Unless otherwise discussed, it is expected the implementations will be developed | ||
| from their respective main branch (i.e. backporting is not required), to | ||
| demonstrate that the feature is mergeable to its implementation. | ||
|
|
||
| 3. Ratification: After the first two steps are complete a formal vote is held on | ||
| dev@parquet.apache.org to officially ratify the feature. After the vote | ||
| passes the format change is merged into the `parquet-format` repository and | ||
| it is expected the changes from step 2 will also be merged soon after | ||
| (implementations should not be merged until the addition has been merged to | ||
| `parquet-format`). | ||
|
|
||
| #### General guidelines/preferences on additions. | ||
|
|
||
| 1. To the greatest extent possible changes should have an option for forward | ||
| compatibility (old readers can still read files). The [compatibility and | ||
| feature enablement](#compatibility-and-feature-enablement) section below | ||
| provides more details on expectations for changes that break compatibility. | ||
|
|
||
| 2. New encodings should be fully specified in this repository and not | ||
| rely on external dependencies for implementation (i.e. `parquet-format` is | ||
| the source of truth for the encoding). If it does require an | ||
| external dependency, then the external dependency must have its | ||
| own specification separate from implementation. | ||
|
|
||
| 3. New compression mechanisms should have a pure Java implementation that can be | ||
| used as a dependency in `parquet-java`. Exceptions may be | ||
| discussed on the mailing list to see if an implementation that is not | ||
| pure Java is acceptable. | ||
|
|
||
| ### Releases | ||
|
|
||
| The Parquet PMC aims to do releases of the format package only as needed when | ||
| new features are introduced. If multiple new features are being proposed | ||
| simultaneously some features might be consolidated into the same release. | ||
| Guidance is provided below on when implementations should enable features added | ||
| to the specification. Due to confusion in the past over Parquet versioning it | ||
| is not expected that there will be a 3.x release of the specification in the | ||
| foreseeable future. | ||
|
|
||
| ### Compatibility and Feature Enablement | ||
|
|
||
| For the purposes of this discussion we classify features into the following buckets: | ||
|
|
||
| 1. Backward compatible. A file written under an older version of the format | ||
| should be readable under a newer version of the format. | ||
|
|
||
| 2. Forward compatible. A file written under a newer version of the format with | ||
| the feature enabled can be read under an older version of the format, but | ||
| some metadata might be missing or performance might be suboptimal. Simply | ||
| phrased, forward compatible means all data can be read back in an older | ||
| version of the format. New logical types are considered forward | ||
| compatible despite the loss of semantic meaning. | ||
|
|
||
| 3. Forward incompatible. A file written under a newer version of the format with | ||
| the feature enabled cannot be read under an older version of the format (e.g. | ||
| adding and using a new compression algorithm). It is expected any feature in | ||
| this category will provide a signal to older readers, so they can | ||
| unambiguously determine that they cannot properly read the file (e.g. via | ||
| adding a new value to an existing enum). | ||
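The signal to older readers described in item 3 can be sketched in Python as follows. This is an illustrative sketch only, not code from any Parquet implementation: `KnownCodec`, `UnsupportedFeatureError`, and `resolve_codec` are hypothetical names, and the enum lists just an example subset of codec ids that an older reader might understand. The idea is that an unrecognized enum value lets the reader fail unambiguously rather than misread the data.

```python
from enum import IntEnum


class KnownCodec(IntEnum):
    # Hypothetical subset of compression codec ids known to an older reader.
    UNCOMPRESSED = 0
    SNAPPY = 1
    GZIP = 2


class UnsupportedFeatureError(Exception):
    """Raised when a file uses a feature this reader does not know about."""


def resolve_codec(raw_value: int) -> KnownCodec:
    """Map the raw enum value from file metadata to a known codec.

    An unknown value means the file was written with a newer, forward
    incompatible feature, so the reader refuses to read the column.
    """
    try:
        return KnownCodec(raw_value)
    except ValueError:
        raise UnsupportedFeatureError(
            f"unknown compression codec id {raw_value}; "
            "the file was likely written with a newer format feature"
        ) from None
```

A reader built this way rejects the file with a clear error as soon as it sees the new enum value, which is the unambiguous failure the guidance asks for.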
|
|
||
| New features are intended to be widely beneficial to users of Parquet, and | ||
| therefore it is hoped third-party implementations will adopt them quickly after | ||
| they are introduced. It is assumed that writing new parts of the format, and | ||
| especially forward incompatible features, will be configured with a feature flag | ||
| defaulted to "off", and at some future point the feature is turned on by default | ||
| (reading of the new feature will typically be enabled without configuration or | ||
| defaulted to on). Some amount of lead time is desirable to ensure a critical | ||
| mass of Parquet implementations support a feature to avoid compatibility issues | ||
| across the ecosystem. Therefore, the Parquet PMC gives the following | ||
| recommendations for managing features: | ||
|
|
||
| 1. Backward compatibility is the concern of implementations but given the | ||
| ubiquity of Parquet and the length of time it has been used, libraries should | ||
| support reading older versions of the format to the greatest extent possible. | ||
|
|
||
| 2. Forward compatible features/changes may be enabled and used by default in | ||
| implementations once the parquet-format containing those changes has been | ||
| formally released. For features that may pose a significant performance | ||
| regression to older format readers, libraries should consider delaying default | ||
| enablement until 1 year after the release of the parquet-java implementation | ||
| that contains the feature implementation. | ||
|
|
||
| 3. Forward incompatible features/changes should not be turned on by default | ||
| until 2 years after the parquet-java implementation containing the feature is | ||
|
Member
Still skeptical about a delay.
Contributor
Author
Per above, I can understand the skepticism, I'd like to get more opinions here and we can update accordingly.
Member
We could add a shortcut like "or a PMC vote was in favour of it". If something gets really fast adoption, then we could enable it sooner. How did you come up with "2 years"? Was this a guess or a deeper reasoning behind it?
Contributor
My suggestion is to avoid this sentence and just start with "It is recommended that changing the default value for a forward incompatible feature flag should be clearly advertised to consumers..." Otherwise, this could be read as "any change made to parquet will not even plausibly be used for at least two years". I think it would be a shame if people used this 2 years as a reason to slow down their adoption of new parquet features. I think parquet would be better off if we let market demand drive the proliferation of features rather than trying to control a rollout. The thinking is that if new files begin appearing that are not readable by some system that is lagging, it would be beneficial to add pressure on that system to upgrade its support.
Contributor
Author
I think we should either do away with the guidance or release it. PMC can always modify the language necessary via vote if necessary, but I'm not sure there is a good scenario where we want to actively break guidance?
Mostly guesstimate and it might actually be too short. This assumes at least some downstream dependencies only do major releases on a yearly basis as well. So it leaves time for a downstream system to pick up a release in the following year and then ~1 year for rollout. Taking Spark as an example, it appears to average about 3.5 years between major releases. Minor releases are aimed at every six months, but not everyone is going to use a release immediately. If you look at EMR ignoring the LTE release of 2.4.8, they are currently supporting 3.5.0 and 3.4.1. 3.4.1 was released 9 months ago and ~1 year ago respectively. Dataproc explicitly supports images for 24 months, but the actual Spark versions are older than AWS (i.e. 3.3). If we do a parquet 2.0 release with a breaking change, we will probably learn how long it has taken people to actually adopt some of the newer breaking features.
I think this ignores how long it takes changes to roll through the ecosystem, I detailed more in my reply on why 2 years above. I'd welcome people to start experimenting features early but I think not recommending at least some delay will cause a lot of unnecessary pain.
Member
What @emkornfield is saying above makes sense to me.
Contributor
Author
@julienledem ultimately whatever time period we pick is going to be arbitrary, but do you have a concrete suggestion?
This seems like an implementation-specific thing but I'm not sure a global setting is the right thing to do here. This is probably a discussion to have specifically around what we want to do in different parquet implementations.
Member
I would suggest the following guidance: In the Parquet reference implementations mentioned above (java, cpp, rust, ...)
Third party implementations are advised to:
My goal here is to not artificially slow down adoption of new features when it is not needed. We need a transition period, hence the always off by default. We also want the shortest possible transition period.
Contributor
Author
Based off of offline discussion:
Contributor
Author
Updated to include recommendation for 1. For full deprecation I made a note we will update the document for further timeline. |
||
| released. It is recommended that changing the default value for a forward | ||
| incompatible feature flag should be clearly advertised to consumers (e.g. via | ||
| a major version release if using Semantic Versioning, or highlighted in | ||
| release notes). | ||
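The timelines in recommendations 2 and 3 amount to simple date arithmetic. The following is a minimal sketch in Python under stated assumptions: `Category`, `add_years`, and `earliest_default_on` are hypothetical names introduced for illustration, and the one-year and two-year offsets are taken from the draft text above rather than from any tool the project ships.

```python
from datetime import date
from enum import Enum


class Category(Enum):
    FORWARD_COMPATIBLE = "forward compatible"
    FORWARD_COMPATIBLE_PERF_RISK = "forward compatible, performance risk"
    FORWARD_INCOMPATIBLE = "forward incompatible"


def add_years(d: date, years: int) -> date:
    """Shift a date by whole years, mapping Feb 29 to Feb 28 when needed."""
    try:
        return d.replace(year=d.year + years)
    except ValueError:
        # Only Feb 29 can fail: the target year is not a leap year.
        return d.replace(year=d.year + years, day=28)


def earliest_default_on(
    category: Category, format_release: date, java_release: date
) -> date:
    """Earliest date a feature could be enabled by default under the guidance."""
    if category is Category.FORWARD_COMPATIBLE:
        # May be enabled once the parquet-format release is out.
        return format_release
    if category is Category.FORWARD_COMPATIBLE_PERF_RISK:
        # Consider waiting 1 year after the parquet-java release.
        return add_years(java_release, 1)
    # Forward incompatible: 2 years after the parquet-java release.
    return add_years(java_release, 2)
```

For instance, a forward incompatible feature first shipped in a `parquet-java` release dated 2024-05-01 (a made-up date) would yield 2026-05-01.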
|
||
|
|
||
| For forward compatible changes which have a high chance of performance | ||
|
||
| regression for older readers and forward incompatible changes, implementations | ||
| should clearly document the compatibility issues. Additionally, while it is up | ||
| to maintainers of individual open-source implementations to make the best decision to serve | ||
| their ecosystem, they are encouraged to start enabling features by default along | ||
| the same timelines as `parquet-java`. Parquet-java will wait to enable features | ||
| by default until the most conservative timelines outlined above have been | ||
| exceeded. This timeline is an attempt to balance ensuring | ||
| new features make their way into the ecosystem and avoiding | ||
| breaking compatibility for readers that are slower to adopt new standards. We | ||
| encourage earlier adoption of new features when an organization using Parquet | ||
| can guarantee that all readers of the parquet files they produce can read a new | ||
| feature. | ||
|
|
||
| After turning a feature on by default implementations | ||
| are encouraged to keep a configuration to turn off the feature. | ||
| A recommendation for full deprecation will be made in a future | ||
| iteration of this document. | ||
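Keeping an opt-out after a default is flipped can be sketched as follows. This is a hypothetical illustration in Python, not the configuration API of `parquet-java` or any other implementation: `NEW_FEATURE_DEFAULT`, `WriterOptions`, `use_new_feature`, and `feature_enabled` are all assumed names.

```python
from dataclasses import dataclass
from typing import Optional

# Library-wide default for writing the new feature. A release that enables
# the feature by default would change this constant to True.
NEW_FEATURE_DEFAULT = False


@dataclass(frozen=True)
class WriterOptions:
    # None means "follow the library default"; True or False is an explicit
    # user choice that survives any later change of the default.
    use_new_feature: Optional[bool] = None


def feature_enabled(
    options: WriterOptions, library_default: bool = NEW_FEATURE_DEFAULT
) -> bool:
    """Resolve whether the writer should emit the new feature."""
    if options.use_new_feature is None:
        return library_default
    return options.use_new_feature
```

Separating "unset" from an explicit `False` means users who opted out keep their behavior when the library default later changes to on.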
|
|
||
| For features released prior to October 2024, target dates for each of these | ||
|
Member
I don't quite understand this. Do we expect a parquet-java 2.0 release around October 2024? Shouldn't any new feature follow this guidance once published?
Contributor
Author
My intent was to try for Parquet-java 2.0 in October. There are a lot of old features from the spec perspective that based on guidance would be ok to use (e.g. lz4-raw, byte stream split, data page v2, v2 encodings) that we need to decide on recommended dates (as a sample suggestion I would aim for 1 year for features we think we actually want to turn on)
Member
Is it good timing to do a Parquet-java 2.0 release? As per the discussion from https://lists.apache.org/thread/kttwbl5l7opz6nwb5bck2gghc2y3td0o, we intend to break the API compatibility by removing deprecated APIs (and remove Java 8 support) in the 2.0 release. If the intention is to set a target date for existing features without breaking anything, perhaps we should release 1.15.0 instead?
Contributor
Author
IMO I think API compatibility is a separate concern and it is fine to set future dates and change API compatibility. I can rephrase this section to mention some other criteria for initial target dates if there is an alternative proposal?
Contributor
I would be up for a Parquet 2.0 release in October and am eager to help! Do we also want a 1.15? I'm trying to downstream 1.14.1 into various projects (apache/spark#46447, apache/iceberg#10209), but the new Jackson version is giving us some troubles. It feels to me that the different projects need some time to update. My argument is that releasing 1.15 right now would not give too many new features to the users, and I think it makes sense to jump to 2.0.0. Since there are some delays mentioned earlier in this document:
For my understanding, the things you mentioned (e.g. lz4-raw, byte stream split, data page v2, v2 encodings), they are already shipped with the last release, so the discussion is when to make them the default?
Member
Am I correct in understanding that we intend to:
I think this is not unsound but it might be confusing. We should then expect to have several major releases coming: one for the new footer and one for each new encoding.
Contributor
Author
@julienledem no I think there is a misunderstanding.
Per above, I don't think we expect a major version change for parquet-format in the near future.
yes, this is correct. this provides read support.
Yes.
I think these should be grouped. Per apache/parquet-site#61 the proposal for parquet-java is to have at most one major release per year. Early adopters can use features once a minor release containing them is out as long as they feel comfortable with understanding the scope of impact.
Member
Sorry, I'm not following which part of the conversation "per above" refers to here. I was assuming for example that once we define a new footer in a new format we would increment the format spec. Would that be wrong?
Contributor
Author
In the releases section:
Member
OK, that sounds good to me. |
||
| categories will be updated as part of the `parquet-java 2.0` release process | ||
| based on a collected feature compatibility matrix. | ||
|
|
||
| For each release of `parquet-java` or `parquet-format` that influences this | ||
| guidance it is expected exact dates will be added to parquet-format to provide | ||
| clarity to implementors (e.g. when `parquet-java` 2.X.X is released, any new | ||
| format features it uses will be updated with concrete dates). As part of | ||
| `parquet-format` releases the compatibility matrix will be updated to contain | ||
| the release date in the format. Implementations are also encouraged to provide | ||
| implementation date/release version information when updating the feature | ||
| matrix. | ||
|
|
||
| End users of software are generally encouraged to consult the feature matrix | ||
| and vendor documentation before enabling features that are not yet widely | ||
| adopted. | ||