Merged
29 commits
3994164
DRAFT: Strawman guidance on feature releases
emkornfield Jun 5, 2024
527b8f4
fix some typos
emkornfield Jun 5, 2024
ec56133
Update CONTRIBUTING.md
emkornfield Jun 5, 2024
e2ef8d2
update based on comments
emkornfield Jun 6, 2024
660c325
fix typo
emkornfield Jun 6, 2024
11c6b76
Apply suggestions from code review
emkornfield Jun 6, 2024
82c0fa1
add paragraph covering minor changes
emkornfield Jun 6, 2024
5eba8d6
reflow text
Jun 7, 2024
4fe4859
rephrase to be less proscriptive
Jun 7, 2024
e6ce62e
Update CONTRIBUTING.md
emkornfield Jun 7, 2024
f58c5d2
Update CONTRIBUTING.md
emkornfield Jun 7, 2024
8421b24
clarify forward incompatible features
emkornfield Jun 7, 2024
7e18452
Respond to more feedback.
Jun 10, 2024
f8fc149
Apply suggestions from code review
emkornfield Jun 10, 2024
5117b03
address feedback.
emkornfield Jun 18, 2024
7bd9c1d
reflow
Jun 18, 2024
fcb2eb1
Update CONTRIBUTING.md
emkornfield Jun 26, 2024
27ba2f5
Update CONTRIBUTING.md
emkornfield Jun 26, 2024
890fc2d
Update CONTRIBUTING.md
emkornfield Jun 26, 2024
2a8875a
Update CONTRIBUTING.md
emkornfield Jul 1, 2024
4a13c2a
Apply suggestions from code review
emkornfield Jul 9, 2024
12a79ab
wip, address comments.
emkornfield Jul 12, 2024
c62b3f3
finish addressing comments
emkornfield Jul 12, 2024
3493353
reflow
Jul 12, 2024
0841c94
clarify new logical types
emkornfield Jul 12, 2024
4d6a947
Address some comments
emkornfield Jul 13, 2024
1f8178e
add exceptions to top and reflow the rest of the content.
Jul 13, 2024
f05a256
fix some typos, and sentence around keeping feature flags for compati…
emkornfield Jul 13, 2024
e706280
add link
emkornfield Jul 25, 2024
173 changes: 172 additions & 1 deletion CONTRIBUTING.md
@@ -17,7 +17,7 @@
- under the License.
-->

Recommendations and requirements for how to best contribute to Parquet. We strive to obey these as best as possible. As always, thanks for contributing--we hope these guidelines make it easier and shed some light on our approach and processes.
Recommendations and requirements for how to best contribute to Parquet. We strive to obey these as best as possible. As always, thanks for contributing--we hope these guidelines make it easier and shed some light on our approach and processes. If you believe there should be a change or exception to these rules please bring it up for discussion on the developer mailing list (dev@parquet.apache.org).

### Key branches
- `master` has the latest stable changes
@@ -29,3 +29,174 @@
### License
By contributing your code, you agree to license your contribution under the terms of the APLv2:
https://github.com/apache/parquet-format/blob/master/LICENSE

### Additions/Changes to the Format

Note: This section applies to actual functional changes to the specification.
Typo fixes, grammar fixes, and clarifications that do not change the
semantics of the specification can be merged as long as a committer feels
comfortable doing so. When in doubt, starting a discussion on the dev mailing
list is encouraged.

The general steps for adding features to the format are as follows:

1. Design/scoping: The goal of this phase is to identify design goals of a
feature and provide some demonstration that the feature meets those goals.
This phase starts with a discussion of changes on the developer mailing list
(dev@parquet.apache.org). Depending on the scope and goals of the feature, it
can be useful to provide additional artifacts as part of the discussion. These
artifacts can include a design document, a draft pull request to make the
discussion concrete, and/or a prototype implementation to demonstrate the
viability of implementation. This step is complete when there is lazy
consensus. Part of the consensus is whether it is sufficient to provide two
working implementations as outlined in step 2, or if demonstration of the
feature with a downstream query engine is necessary to justify the feature
(e.g. demonstrate performance improvements in the Apache Arrow C++ Dataset
library, the Apache DataFusion query engine, or any other open source
engine).

2. Completeness: The goal of this phase is to ensure the feature is viable and
   that there is no ambiguity in its specification, by demonstrating
   compatibility between implementations. Once a change has lazy consensus, two
   implementations of the feature demonstrating interoperability must also be
provided. One implementation MUST be
[`parquet-java`](http://github.com/apache/parquet-java). It is preferred
that the second implementation be
[`parquet-cpp`](https://github.com/apache/arrow) or
[`parquet-rs`](https://github.com/apache/arrow-rs), however at the discretion
of the PMC any open source Parquet implementation may be acceptable.
Implementations whose contributors actively participate in the community
(e.g. keep their feature matrix up-to-date on the Parquet website) are more
likely to be considered. If discussed as a requirement in step 1 above,
demonstration of integration with a query engine is also required for this
step. The implementations must be made available publicly, and they should be
fit for inclusion (for example, they were submitted as a pull request against
the target repository and committers gave positive reviews). Reports on the
benefits from closed source implementations are welcome and can help lend
weight to features desirability but are not sufficient for acceptance of a
new feature.

Unless otherwise discussed, it is expected that the implementations will be
developed against their respective main branches (i.e. backporting is not
required), to demonstrate that the feature is mergeable into each
implementation.

3. Ratification: After the first two steps are complete a formal vote is held on
dev@parquet.apache.org to officially ratify the feature. After the vote
passes the format change is merged into the `parquet-format` repository and
it is expected the changes from step 2 will also be merged soon after
(implementations should not be merged until the addition has been merged to
`parquet-format`).

#### General guidelines/preferences on additions.

1. To the greatest extent possible changes should have an option for forward
compatibility (old readers can still read files). The [compatibility and
feature enablement](#compatibility-and-feature-enablement) section below
provides more details on expectations for changes that break compatibility.

2. New encodings should be fully specified in this repository and not
rely on external dependencies for implementation (i.e. `parquet-format` is
the source of truth for the encoding). If it does require an
external dependency, then the external dependency must have its
own specification separate from implementation.

3. New compression mechanisms should have a pure Java implementation that can
   be used as a dependency in `parquet-java`; exceptions may be discussed on
   the mailing list to determine whether an implementation that is not pure
   Java is acceptable.

### Releases

The Parquet PMC aims to do releases of the format package only as needed when
new features are introduced. If multiple new features are being proposed
simultaneously some features might be consolidated into the same release.
Guidance is provided below on when implementations should enable features added
to the specification. Due to confusion in the past over Parquet versioning it
is not expected that there will be a 3.x release of the specification in the
foreseeable future.

### Compatibility and Feature Enablement

For the purposes of this discussion we classify features into the following buckets:

1. Backward compatible. A file written under an older version of the format
should be readable under a newer version of the format.

2. Forward compatible. A file written under a newer version of the format with
the feature enabled can be read under an older version of the format, but
some metadata might be missing or performance might be suboptimal. Simply
phrased, forward compatible means all data can be read back in an older
version of the format. New logical types are considered forward
compatible despite the loss of semantic meaning.

3. Forward incompatible. A file written under a newer version of the format with
the feature enabled cannot be read under an older version of the format (e.g.
adding and using a new compression algorithm). It is expected any feature in
this category will provide a signal to older readers, so they can
unambiguously determine that they cannot properly read the file (e.g. via
adding a new value to an existing enum).
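
The "unambiguous signal" requirement for forward-incompatible features can be sketched as a toy model (illustrative Python only, not real parquet-java or pyarrow code; the codec ids follow parquet.thrift's `CompressionCodec` enum):

```python
# Toy sketch of the "unambiguous signal" requirement: an older reader that
# sees an enum value it does not know must fail loudly instead of silently
# misreading the file. Codec ids mirror parquet.thrift's CompressionCodec.
KNOWN_CODECS = {
    0: "UNCOMPRESSED", 1: "SNAPPY", 2: "GZIP",
    3: "LZO", 4: "BROTLI", 5: "LZ4", 6: "ZSTD",
}

def resolve_codec(codec_id: int) -> str:
    """Map a footer enum value to a codec name, or fail unambiguously."""
    try:
        return KNOWN_CODECS[codec_id]
    except KeyError:
        raise ValueError(
            f"unknown compression codec id {codec_id}: this file uses a "
            "forward-incompatible feature from a newer format version"
        ) from None
```

An older reader built from this table would reject a file using a codec added later (e.g. id 7) with a clear error rather than producing garbage.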

New features are intended to be widely beneficial to users of Parquet, and
therefore it is hoped third-party implementations will adopt them quickly after
they are introduced. It is assumed that writing new parts of the format, and
especially forward incompatible features, will be configured with a feature flag
defaulted to "off", and at some future point the feature is turned on by default
(reading of the new feature will typically be enabled without configuration or
defaulted to on). Some amount of lead time is desirable to ensure a critical
mass of Parquet implementations support a feature to avoid compatibility issues
across the ecosystem. Therefore, the Parquet PMC gives the following
recommendations for managing features:

1. Backward compatibility is the concern of implementations, but given the
ubiquity of Parquet and the length of time it has been used, libraries should
support reading older versions of the format to the greatest extent possible.

2. Forward compatible features/changes may be enabled and used by default in
implementations once the parquet-format containing those changes has been
formally released. For features that may pose a significant performance
regression to older format readers, libraries should consider delaying default
enablement until 1 year after the release of the parquet-java implementation
that contains the feature implementation.

3. Forward incompatible features/changes should not be turned on by default
until 2 years after the parquet-java implementation containing the feature is
Member

Still skeptical about a delay.

Contributor Author

Per above, I can understand the skepticism, I'd like to get more opinions here and we can update accordingly.

Member

We could add a shortcut like "or a PMC vote was in favour of it". If something gets really fast adoption, then we could enable it sooner.

How did you come up with "2 years"? Was this a guess or a deeper reasoning behind it?

Contributor

My suggestion is to avoid this sentence and just start with "It is recommended that changing the default value for a forward incompatible feature flag should be clearly advertised to consumers..."

Otherwise, this could be read as "any change made to parquet will not even plausibly be used for at least two years". I think it would be a shame if people used this 2 years as a reason to slow down their adoption of new parquet features

I think parquet would be better off if we let market demand drive the proliferation of features rather than trying to control a rollout. The thinking is that if new files begin appearing that are not readable by some system that is lagging, it would be beneficial to add pressure on that system upgrading their support

Contributor Author

We could add a shortcut like "or a PMC vote was in favour of it". If something gets really fast adoption, then we could enable it sooner.

I think we should either do away with the guidance or release it. The PMC can always modify the language via vote if necessary, but I'm not sure there is a good scenario where we want to actively break guidance?

How did you come up with "2 years"? Was this a guess or a deeper reasoning behind it?

Mostly guesstimate and it might actually be too short. This assumes at least some downstream dependencies only do major releases on a yearly basis as well. So it leaves time for a downstream system to pick up a release in the following year and then ~1 year for rollout. Taking Spark as an example, it appears to average about 3.5 years between major releases. Minor releases are aimed at every six months, but not everyone is going to use a release immediately. If you look at EMR, ignoring the LTE release of 2.4.8, they are currently supporting 3.5.0 and 3.4.1, which were released ~9 months and ~1 year ago respectively. Dataproc explicitly supports images for 24 months, but the actual Spark versions are older than AWS's (i.e. 3.3).

If we do a parquet 2.0 release with a breaking change, we will probably learn how long it has taken people to actually adopt some of the newer breaking features.

I think parquet would be better off if we let market demand drive the proliferation of features rather than trying to control a rollout. The thinking is that if new files begin appearing that are not readable by some system that is lagging, it would be beneficial to add pressure on that system upgrading their support

I think this ignores how long it takes changes to roll through the ecosystem; I detailed more in my reply on why 2 years above. I'd welcome people starting to experiment with features early, but I think not recommending at least some delay will cause a lot of unnecessary pain.

Member

What @emkornfield is saying above makes sense to me.
We should always have new forward incompatible features off by default at first. Otherwise, this will cause breakages every time we add a new encoding (or similarly incompatible feature). How long is debatable. 2 years seems long to me.
I don't think we should give a duration guidance to other implementers/users. We should document why we are doing this and explain the constraints. How long is very dependent on their circumstances. In an environment where there is a single system reading and writing Parquet files, they can enable it right away without problems. In a "data lakehouse" environment where people have multiple systems reading and writing they want to make sure they have upgraded other consumers before they turn it on. The more time passes, the more it becomes acceptable to have such new features on by default and ask the user to turn them off because they have some legacy system that can only read old files.
We could have a setting to adjust this in environments where there is a single system reading/writing parquet: setForwardIncompatibleFeaturesOnByDefault(boolean)

Contributor Author

@julienledem ultimately whatever time period we pick is going to be arbitrary, but do you have a concrete suggestion.

We could have a setting to adjust this in environments where there is a single system reading/writing parquet: setForwardIncompatibleFeaturesOnByDefault(boolean)

This seems like an implementation-specific thing, but I'm not sure a global setting is the right thing to do here. This is probably a discussion to have specifically around what we want to do in different parquet implementations.

Member

I would suggest the following guidance:

In the Parquet reference implementations mentioned above (java, cpp, rust, ...)

  • forward incompatible features (ex: new encoding) are always off by default.
  • there is a setting to turn them on: enableEncodingFoo(true)
  • there is a setting to change the default: setForwardIncompatibleFeaturesOnByDefault(true)
  • We will change the default to on for a given feature when a minimum amount of time has elapsed (say 6 months, 2 releases) and there is enough adoption (we should make a list here? ex: latest Flink, Spark, Trino releases updated to a version of parquet that supports it)

Third party implementations are advised to:

  • have forward incompatible features off by default.
  • have a setting to turn them on: config.encoding_foo=enabled
  • They can decide to enable them immediately if no other system is consuming their files or if they know that all the readers are compatible. That's what "setForwardIncompatibleFeaturesOnByDefault(true)" is for if they use reference libraries.

My goal here is to not artificially slow down adoption of new features when it is not needed. We need a transition period, hence the always off by default. We also want the shortest possible transition period.

Contributor Author (@emkornfield, Jul 12, 2024)

Based off of offline discussion:

  1. We should be explicit that, after a feature is turned on by default, libraries are encouraged to keep an option to turn it off.
  2. I think we should wait on specific policies for when parquet-cpp and parquet-java will definitively turn these things on.

Contributor Author

Updated to include recommendation for 1.

For full deprecation I made a note we will update the document for further timeline.

released. It is recommended that changing the default value for a forward
incompatible feature flag should be clearly advertised to consumers (e.g. via
a major version release if using Semantic Versioning, or highlighted in
release notes).
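
The flag lifecycle described above might look like the following sketch (every name here is hypothetical, not an actual parquet-java or parquet-cpp API):

```python
# Illustrative sketch of the feature-flag policy described above; all names
# are hypothetical and do not belong to any real parquet implementation.
from dataclasses import dataclass

@dataclass
class WriterFeatureFlags:
    # Forward compatible: may default to on once the format release is out.
    write_extended_statistics: bool = True
    # Forward incompatible: stays off by default until the recommended
    # adoption window has passed; users opt in explicitly.
    write_new_encoding: bool = False

    def enable_all_forward_incompatible(self) -> None:
        # Escape hatch for closed ecosystems where every reader is already
        # known to support the new features.
        self.write_new_encoding = True
```

Even after a flag eventually defaults to on, keeping the attribute writable preserves the recommended ability to turn the feature back off.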

For forward compatible changes which have a high chance of performance
regression for older readers and forward incompatible changes, implementations
should clearly document the compatibility issues. Additionally, while it is up
to maintainers of individual open-source implementations to make the best decision to serve
their ecosystem, they are encouraged to start enabling features by default along
the same timelines as `parquet-java`. Parquet-java will wait to enable features
by default until the most conservative timelines outlined above have been
exceeded. This timeline is an attempt to balance ensuring
new features make their way into the ecosystem and avoiding
breaking compatibility for readers that are slower to adopt new standards. We
encourage earlier adoption of new features when an organization using Parquet
can guarantee that all readers of the parquet files they produce can read a new
feature.

After turning a feature on by default implementations
are encouraged to keep a configuration to turn off the feature.
A recommendation for full deprecation will be made in a future
iteration of this document.

For features released prior to October 2024, target dates for each of these
Member

I don't quite understand this. Do we expect a parquet-java 2.0 release around October 2024? Shouldn't any new feature follow this guidance once it is published?

Contributor Author

My intent was to try for parquet-java 2.0 in October. There are a lot of features that are old from the spec perspective and, based on this guidance, would be OK to use (e.g. lz4-raw, byte stream split, data page v2, v2 encodings), for which we need to decide on recommended dates (as a sample suggestion, I would aim for 1 year for features we think we actually want to turn on).

Member

Is it a good timing to do Parquet-java 2.0 release? As per the discussion from https://lists.apache.org/thread/kttwbl5l7opz6nwb5bck2gghc2y3td0o, we intend to break the API compatibility by removing deprecated APIs (and remove Java 8 support) in the 2.0 release. If the intention is to set a target date for existing features without breaking anything, perhaps we should release 1.15.0 instead?

cc @gszadovszky @Fokko @julienledem

Contributor Author

IMO I think API compatibility is a separate concern and it is fine to set future dates and change API compatibility. I can rephrase this section to mention some other criteria for initial target dates if there is an alternative proposal?

Contributor

I would be up for a Parquet 2.0 release in October and am eager to help! Do we also want a 1.15?

I'm trying to downstream 1.14.1 into various projects (apache/spark#46447, apache/iceberg#10209), but the new Jackson version is giving us some trouble. It feels to me that the different projects need some time to update. My argument is that releasing 1.15 right now would not give too many new features to the users, and I think it makes sense to jump to 2.0.0.

Since there are some delays mentioned earlier in this document:

  • The API that I suggested removing was marked as deprecated almost six years ago: PARQUET-1452: Deprecate old logical types API parquet-java#535
  • We're still compiling against Hadoop 2.7.3 which was released in August 2016.
  • I think the Java8 deprecation might be a bit harder and might require more discussion, but there it feels like all the projects are looking at each other :) Spark/Avro are Java 17+ right now.

For my understanding, the things you mentioned (e.g. lz4-raw, byte stream split, data page v2, v2 encodings) are already shipped with the last release, so the discussion is when to make them the default?

Member

Am I correct in understanding that we intend to:

  • increment the major release of parquet-format when it has the definition of the new (forward incompatible) feature in it.
  • implement the above new feature in parquet-java under a minor release in the previous major but turning it off by default
  • increment the major release of parquet-java when we turn on the new encodings by default?

I think this is not unsound but it might be confusing. We should then expect to have several major releases coming: one for the new footer and one for each new encoding.

Contributor Author

@julienledem no I think there is a misunderstanding.

increment the major release of parquet-format when it has the definition of the new (forward incompatible) feature in it.

Per above, I don't think we expect a major version change for parquet-format in the near future.

implement the above new feature in parquet-java under a minor release in the previous major but turning it off by default

yes, this is correct. this provides read support.

increment the major release of parquet-java when we turn on the new encodings by default?

Yes.

I think this is not unsound but it might be confusing. We should then expect to have several major releases coming: one for the new footer and one for each new encoding.

I think these should be grouped. Per apache/parquet-site#61 the proposal for parquet-java is to have at most one major release per year. Early adopters can use features once a minor release containing them is out as long as they feel comfortable with understanding the scope of impact.

Member

Sorry, I'm not following which part of the conversation "per above" refers to here.
Could you point to it and explain what you mean?

I was assuming for example that once we define a new footer in a new format we would increment the format spec. Would that be wrong?

Contributor Author

In the releases section:

The Parquet PMC aims to do releases of the format package only as needed when
new features are introduced. If multiple new features are being proposed
simultaneously some features might be consolidated into the same release.
Guidance is provided below on when implementations should enable features added
to the specification.  Due to confusion in the past over Parquet versioning it
is not expected that there will be a 3.x release of the specification in the
foreseeable future.

Member

OK, that sounds good to me.

categories will be updated as part of the `parquet-java 2.0` release process
based on a collected feature compatibility matrix.

For each release of `parquet-java` or `parquet-format` that influences this
guidance it is expected exact dates will be added to parquet-format to provide
clarity to implementors (e.g. when `parquet-java` 2.X.X is released, any new
format features it uses will be updated with concrete dates). As part of
`parquet-format` releases the compatibility matrix will be updated to contain
the release date in the format. Implementations are also encouraged to provide
implementation date/release version information when updating the feature
matrix.

End users of software are generally encouraged to consult the feature matrix
and vendor documentation before enabling features that are not yet widely
adopted.
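
As a sketch of that kind of caution, an application might gate an opt-in feature on the library version it actually has installed (a minimal illustration; version strings are assumed to be dotted integers and the cut-off version is hypothetical):

```python
# Minimal sketch: only enable an opt-in feature when the installed library
# version is at least the (hypothetical) first version that supports it.
def parse_version(version: str) -> tuple:
    """Parse a dotted numeric version like '15.0.2' into a comparable tuple."""
    return tuple(int(part) for part in version.split("."))

def should_enable_feature(installed: str, first_supported: str) -> bool:
    # Tuple comparison gives the usual version ordering for same-length
    # dotted numeric versions.
    return parse_version(installed) >= parse_version(first_supported)
```

In practice the cut-off would come from the feature compatibility matrix rather than being hard-coded.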